Pure quantizations of Qwen2-Math-7B-Instruct for qwen2.java.

In the wild, Q8_0 quantizations are fine, but Q4_0 quantizations are rarely pure: e.g., the output.weight tensor is often quantized with Q6_K instead of Q4_0.
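
For context, Q4_0 stores weights in blocks of 32: one float16 scale followed by 16 bytes of packed 4-bit values. Below is a minimal Java sketch of dequantizing one such block, following ggml's block_q4_0 layout (the class and method names are illustrative, not part of qwen2.java):

import java.nio.ByteBuffer;

final class Q4_0 {
    // One ggml block_q4_0: a float16 scale followed by 16 bytes holding 32 4-bit weights.
    // Expects a buffer in little-endian order; Float.float16ToFloat requires Java 20+.
    static float[] dequantizeBlock(ByteBuffer block) {
        float d = Float.float16ToFloat(block.getShort()); // per-block scale
        float[] out = new float[32];
        for (int j = 0; j < 16; j++) {
            int packed = Byte.toUnsignedInt(block.get());
            out[j]      = ((packed & 0x0F) - 8) * d; // low nibble -> weight j
            out[j + 16] = ((packed >>> 4) - 8) * d;  // high nibble -> weight j + 16
        }
        return out;
    }
}

A reader that only implements this (and the analogous Q8_0) layout cannot load a mixed file whose output tensor is Q6_K, which is why purity matters here.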
A pure Q4_0 quantization can be generated from a high-precision (F32, F16, BFLOAT16) .gguf source with the quantize utility from llama.cpp (renamed llama-quantize in recent builds) as follows:

./quantize --pure ./Qwen2-Math-7B-Instruct-F16.gguf ./Qwen2-Math-7B-Instruct-Q4_0.gguf Q4_0
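
To verify that the result is pure, the tensor metadata in the .gguf can be listed directly. Below is a minimal, self-contained Java sketch, assuming the GGUF v3 layout (little-endian header, then metadata key/value pairs, then per-tensor infos carrying a ggml type id); the class name and the type-id table are illustrative, with ids taken from ggml's type enum (Q4_0 = 2, Q6_K = 14):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class GgufTensorTypes {

    // ggml type ids (from ggml's type enum).
    static String typeName(int id) {
        return switch (id) {
            case 0 -> "F32";   case 1 -> "F16";   case 2 -> "Q4_0"; case 3 -> "Q4_1";
            case 6 -> "Q5_0";  case 7 -> "Q5_1";  case 8 -> "Q8_0"; case 9 -> "Q8_1";
            case 10 -> "Q2_K"; case 11 -> "Q3_K"; case 12 -> "Q4_K";
            case 13 -> "Q5_K"; case 14 -> "Q6_K";
            default -> "type#" + id;
        };
    }

    static String readString(ByteBuffer b) {
        byte[] bytes = new byte[(int) b.getLong()]; // uint64 length prefix
        b.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Skip over a metadata value without decoding it; arrays recurse.
    static void skipValue(ByteBuffer b, int type) {
        switch (type) {
            case 0, 1, 7 -> b.position(b.position() + 1);    // uint8, int8, bool
            case 2, 3 -> b.position(b.position() + 2);       // uint16, int16
            case 4, 5, 6 -> b.position(b.position() + 4);    // uint32, int32, float32
            case 10, 11, 12 -> b.position(b.position() + 8); // uint64, int64, float64
            case 8 -> {                                      // string
                long len = b.getLong();
                b.position(b.position() + (int) len);
            }
            case 9 -> {                                      // array: elem type + count + elems
                int elemType = b.getInt();
                long count = b.getLong();
                for (long i = 0; i < count; i++) skipValue(b, elemType);
            }
            default -> throw new IllegalStateException("unknown metadata value type " + type);
        }
    }

    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Path.of(args[0]), StandardOpenOption.READ)) {
            // Header, metadata, and tensor infos fit well within the first 2 GiB,
            // so mapping that prefix is enough even for multi-GiB model files.
            ByteBuffer b = ch.map(FileChannel.MapMode.READ_ONLY, 0,
                                  Math.min(ch.size(), Integer.MAX_VALUE))
                             .order(ByteOrder.LITTLE_ENDIAN);
            if (b.getInt() != 0x46554747) { // "GGUF" magic, little-endian
                throw new IllegalStateException("not a GGUF file");
            }
            b.getInt();                    // version (3 for current files)
            long tensorCount = b.getLong();
            long kvCount = b.getLong();
            for (long i = 0; i < kvCount; i++) { // skip metadata key/value pairs
                readString(b);                   // key
                skipValue(b, b.getInt());        // value, tagged with its type id
            }
            for (long i = 0; i < tensorCount; i++) { // tensor infos
                String name = readString(b);
                int nDims = b.getInt();
                for (int d = 0; d < nDims; d++) b.getLong(); // dimension sizes
                int ggmlType = b.getInt();
                b.getLong();                                 // offset into the data section
                System.out.printf("%-48s %s%n", name, typeName(ggmlType));
            }
        }
    }
}

Running java GgufTensorTypes Qwen2-Math-7B-Instruct-Q4_0.gguf should print Q4_0 for every tensor of a pure file, whereas a conventional quantization will show a Q6_K entry for output.weight.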

Original model: https://huggingface.co./Qwen/Qwen2-Math-7B-Instruct

Model Details

For more details, please refer to the original blog post and GitHub repo.

GGUF details: 7.62B params, qwen2 architecture.