SigLIP or SigLIP2 encoder?

#37
by orrzohar - opened

SigLIP or SigLIP2 encoder?

Google org

Hi @orrzohar ,

Yes, SigLIP and SigLIP 2 utilize similar encoder architectures, both employing the Vision Transformer (ViT) design with learned positional embeddings.
Could you please refer to this reference?

Thank you.

Hi @GopiUppari ,
I am familiar with SigLIP.
However, the Gemma3 paper does not state whether SigLIP or SigLIP2 was used. The config cannot settle it either: the architecture is the same, so both are defined as siglip_vision_model.
Did Gemma3 utilize the SigLIP2 or SigLIP checkpoints?

Best,
Orr
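
To illustrate the point about the config, here is a minimal sketch. The two config fragments are hand-written for illustration and are not copied from any released checkpoint; the `vision_config` key and the helper `vision_model_type` are assumptions, and only the `siglip_vision_model` value comes from this thread.

```python
import json

# Hypothetical config fragments, written by hand for illustration only.
config_from_siglip = json.loads('{"vision_config": {"model_type": "siglip_vision_model"}}')
config_from_siglip2 = json.loads('{"vision_config": {"model_type": "siglip_vision_model"}}')


def vision_model_type(config: dict) -> str:
    """Return the vision encoder's model_type from a parsed config dict."""
    return config["vision_config"]["model_type"]


# Both fragments report the same type, so this field alone cannot say
# which pretraining recipe (SigLIP or SigLIP2) produced the weights.
indistinguishable = vision_model_type(config_from_siglip) == vision_model_type(config_from_siglip2)
print(indistinguishable)
```

If the two checkpoints really do share one architecture, any check based only on architecture fields would behave the same way.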

I'm also curious whether the siglip_vision_model's embeddings remain general purpose (i.e., frozen during Gemma training) or whether SigLIP was fine-tuned to improve Gemma's performance.
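
One way someone might probe the frozen-or-fine-tuned question is to compare the encoder weights shipped inside the multimodal model against a standalone encoder checkpoint, parameter by parameter. The sketch below is only an illustration under assumptions: the parameter names and values are toy data, plain lists stand in for flattened weight tensors, and `changed_parameters` is a hypothetical helper, not part of any library.

```python
def changed_parameters(base, tuned, tol=1e-6):
    """Return names of shared parameters whose values differ by more than tol."""
    changed = []
    for name in sorted(base.keys() & tuned.keys()):
        a, b = base[name], tuned[name]
        if len(a) != len(b) or any(abs(x - y) > tol for x, y in zip(a, b)):
            changed.append(name)
    return changed


# Toy values standing in for flattened weight tensors (hypothetical names and numbers).
standalone = {"patch_embedding.weight": [0.10, -0.20], "layer0.mlp.weight": [0.30, 0.40]}
inside_vlm = {"patch_embedding.weight": [0.10, -0.20], "layer0.mlp.weight": [0.31, 0.40]}

print(changed_parameters(standalone, inside_vlm))  # ['layer0.mlp.weight']
```

An empty result would be consistent with a frozen encoder. Differences would be consistent with fine-tuning, but could also mean the comparison used the wrong base checkpoint or that the weights went through a precision conversion, which is why the tolerance is adjustable.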

@udaybondi I would be shocked if they kept the encoder frozen; everyone trains it nowadays.
