SigLIP or SigLIP2 encoder?

#37

by orrzohar - opened 27 days ago

Discussion

orrzohar

27 days ago

SigLIP or SigLIP2 encoder?

GopiUppari

Google org 26 days ago

Hi @orrzohar ,

Yes, SigLIP and SigLIP 2 utilize similar encoder architectures, both employing the Vision Transformer (ViT) design with learned positional embeddings.
Could you please refer to this reference?

Thank you.

orrzohar

26 days ago

Hi @GopiUppari ,
I am familiar with SigLIP.
However, the Gemma3 paper does not state whether SigLIP or SigLIP2 was used. The config cannot settle it either: the architecture is the same, so both are defined as siglip_vision_model.
Did Gemma3 utilize the SigLIP2 or SigLIP checkpoints?

Best,
Orr
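To illustrate the point about the config, here is a minimal sketch. The JSON excerpt is hypothetical and abbreviated (the `vision_config` nesting is an assumption for illustration; only the `siglip_vision_model` value comes from this thread). It shows that `model_type` names the architecture only, so it reads the same whichever checkpoint initialized the encoder.

```python
import json

# Hypothetical, abbreviated config excerpt for illustration only.
# Only the "siglip_vision_model" value is taken from the discussion;
# the surrounding structure is an assumption.
CONFIG_JSON = """
{
  "vision_config": {
    "model_type": "siglip_vision_model"
  }
}
"""


def vision_model_type(config_text: str) -> str:
    """Return the vision encoder's model_type from a JSON config string."""
    config = json.loads(config_text)
    return config["vision_config"]["model_type"]


model_type = vision_model_type(CONFIG_JSON)
# The string identifies the architecture class, not the pretraining recipe,
# so it cannot separate a SigLIP initialization from a SigLIP2 one.
print(model_type)
```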

udaybondi

26 days ago

I'm also curious whether the siglip_vision_model's embeddings remain general purpose (i.e., frozen during Gemma training) or whether SigLIP has been fine-tuned to improve Gemma's performance.

orrzohar

26 days ago

@udaybondi I would be shocked if they kept the encoder frozen; everyone trains the encoder nowadays.
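Absent an official answer, one way to probe both questions empirically is to compare the released vision encoder weights against each public checkpoint. The sketch below is an assumption-laden illustration, not a method confirmed by the Gemma team: the function names are hypothetical, the weights are toy values, and each parameter is represented as a flat list of floats. With real checkpoints you would first need to load the tensors, flatten them, and map parameter names between the two models.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as uninformative
    return dot / (norm_a * norm_b)


def mean_similarity(candidate, reference):
    """Average cosine similarity over parameters shared by both models."""
    shared = [
        name
        for name in candidate
        if name in reference and len(candidate[name]) == len(reference[name])
    ]
    if not shared:
        raise ValueError("no comparable parameters")
    total = sum(cosine_similarity(candidate[n], reference[n]) for n in shared)
    return total / len(shared)


def closest_checkpoint(candidate, references):
    """Return the label of the most similar reference and all scores."""
    scores = {
        label: mean_similarity(candidate, weights)
        for label, weights in references.items()
    }
    return max(scores, key=scores.get), scores


# Toy weights standing in for real tensors (hypothetical values).
toy_candidate = {"layer0.weight": [0.9, 0.1, -0.4]}
toy_references = {
    "reference_a": {"layer0.weight": [1.0, 0.0, -0.5]},
    "reference_b": {"layer0.weight": [-0.2, 0.8, 0.3]},
}

best, scores = closest_checkpoint(toy_candidate, toy_references)
print(best, scores)
```

Under these assumptions, a similarity of essentially 1.0 to one checkpoint would suggest the encoder was kept frozen from that initialization, a high but not exact similarity would suggest fine-tuning from it, and low similarity to both would leave the question open.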
