27b int4 version

#2
by ebartsch-discord - opened

Hello!

Are there plans to release an int4 version of this model (i.e. google/gemma-3-27b-it-qat-int4-unquantized)?

I'm looking to fine-tune this model further using 4-bit quantization and QLoRA.
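For context, the workflow I have in mind is roughly the following. This is only a minimal sketch: it assumes the int4-unquantized checkpoint eventually gets published under that name (it isn't out yet), and the LoRA rank, alpha, and target modules are illustrative placeholders rather than any official recipe.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Standard QLoRA-style 4-bit loading
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Repo id assumes the checkpoint is released under this name
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-27b-it-qat-int4-unquantized",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach LoRA adapters on top of the frozen 4-bit base model
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,            # illustrative rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)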

Thanks in advance!

I'm also looking forward to the release of the google/gemma-3-27b-it-qat-int4-unquantized model.

Additionally, I'd like to confirm my understanding regarding the difference between int4 and q4_0 quantization formats.
As a beginner, after doing some searching, my understanding is that q4_0 is a format primarily associated with llama.cpp, while int4 can be implemented using libraries like bitsandbytes within the Hugging Face ecosystem.
Therefore, if I want to stay within the Hugging Face ecosystem, using the int4 version is the correct approach, right? Could anyone please confirm if this is accurate?
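For concreteness, my mental model of the two paths looks like the sketch below (the repo id and GGUF file name are placeholders I made up; please correct me if I've got this wrong):

# Hugging Face path: weights are quantized to 4-bit by bitsandbytes at load time
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

hf_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-27b-it",                 # placeholder repo id
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)

# llama.cpp path: the Q4_0 quantization is baked into a GGUF file on disk
from llama_cpp import Llama  # llama-cpp-python

gguf_model = Llama(model_path="gemma-3-27b-it-q4_0.gguf")  # placeholder file name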

Following up on the int4 discussion, I have a question about the specific quantization method used for this QAT (Quantization-Aware Training) model. I typically use a BitsAndBytesConfig like the one below for 4-bit loading:

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the weights in 4-bit
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # run computation in bfloat16
)

I understand that nf4 was introduced by QLoRA and may not be the same as the original int4 quantization methods. The Gemma 3 technical report only mentions applying "per-channel int4, per-block int4, and switched fp8" in its QAT process.
Given this, I'm unsure if simply using a standard bnb_config with bnb_4bit_quant_type="nf4" is the correct way to load this specific QAT model and leverage the exact quantization it was trained with.
Could anyone provide guidance on how to load or work with this model to match the 'per-channel int4' or 'per-block int4' methods mentioned in the report? Any insights would be greatly appreciated!

Update.
I've found that per-channel is row-wise quantization, and per-block (block size = 32) int4 is what Q4_0 does (reference: https://huggingface.co./docs/hub/en/gguf#quantization-types).
And according to this issue, https://github.com/bitsandbytes-foundation/bitsandbytes/issues/1329, the current bitsandbytes cannot handle Q4_0. Is that right?
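To check my understanding, here is a simplified sketch of the two schemes in plain PyTorch (absmax rounding only; this is not bit-exact to llama.cpp's actual Q4_0 storage layout, and it assumes the row length divides evenly into blocks):

import torch

def quantize_per_channel_int4(w):
    # per-channel (row-wise): one scale per output row, int4 range [-8, 7]
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q, scale  # dequantize with q * scale

def quantize_per_block_int4(w, block_size=32):
    # per-block (Q4_0-like): one scale per contiguous block of 32 weights
    blocks = w.reshape(-1, block_size)
    scale = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(blocks / scale), -8, 7)
    return q, scale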
