Llama-4-Scout-17B-16E-Instruct-quantized.w4a16
Model Overview
- Model Architecture: Llama4ForConditionalGeneration
- Input: Text / Image
- Output: Text
- Model Optimizations:
- Activation quantization: None
- Weight quantization: INT4
- Release Date: 04/25/2025
- Version: 1.0
- Model Developers: Red Hat (Neural Magic)
Model Optimizations
This model was obtained by quantizing weights of Llama-4-Scout-17B-16E-Instruct to INT4 data type. This optimization reduces the number of bits used to represent weights from 16 to 4, reducing GPU memory requirements by approximately 75%. Weight quantization also reduces disk size requirements by approximately 75%. The llm-compressor library is used for quantization.
Deployment
This model can be deployed efficiently using the vLLM backend, as shown in the example below.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
model_id = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16"
number_gpus = 4
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Give me a short introduction to large language model."
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
outputs = llm.generate(prompt, sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
Evaluation
The model was evaluated on the OpenLLM leaderboard tasks (v1 and v2), long context RULER, multimodal MMMU, and multimodal ChartQA. All evaluations are obtained through lm-evaluation-harness.
Evaluation details
OpenLLM v1
lm_eval \
--model vllm \
--model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.7,enable_chunked_prefill=True,trust_remote_code=True \
--tasks openllm \
--batch_size auto
OpenLLM v2
lm_eval \
--model vllm \
--model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=False,max_model_len=16384,tensor_parallel_size=8,gpu_memory_utilization=0.5,enable_chunked_prefill=True,trust_remote_code=True \
--tasks leaderboard \
--apply_chat_template \
--fewshot_as_multiturn \
--batch_size auto
Long Context RULER
lm_eval \
--model vllm \
--model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=False,max_model_len=524288,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True \
--tasks ruler \
--metadata='{"max_seq_lengths":[131072]}' \
--batch_size auto
Multimodal MMMU
lm_eval \
--model vllm-vlm \
--model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=False,max_model_len=1000000,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True,max_images=10 \
--tasks mmmu_val \
--apply_chat_template \
--batch_size auto
Multimodal ChartQA
export VLLM_MM_INPUT_CACHE_GIB=8
lm_eval \
--model vllm-vlm \
--model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=False,max_model_len=1000000,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True,max_images=10 \
--tasks chartqa \
--apply_chat_template \
--batch_size auto
Accuracy
Recovery (%) | meta-llama/Llama-4-Scout-17B-16E-Instruct | RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 (this model) |
|
---|---|---|---|
ARC-Challenge 25-shot |
98.64 | 69.37 | 68.43 |
GSM8k 5-shot |
98.99 | 90.45 | 89.54 |
HellaSwag 10-shot |
99.91 | 85.23 | 85.15 |
MMLU 5-shot |
99.70 | 80.54 | 80.30 |
TruthfulQA 0-shot |
99.44 | 61.41 | 61.07 |
WinoGrande 5-shot |
100.2 | 77.90 | 78.06 |
OpenLLM v1 Average Score |
99.00 | 77.48 | 77.09 |
IFEval 0-shot avg of inst and prompt acc |
100.6 | 86.90 | 87.45 |
Big Bench Hard 3-shot |
99.78 | 65.13 | 64.99 |
Math Lvl 5 4-shot |
100.6 | 57.78 | 58.16 |
GPQA 0-shot |
102.6 | 31.88 | 32.72 |
MuSR 0-shot |
101.2 | 42.20 | 42.72 |
MMLU-Pro 5-shot |
99.12 | 55.70 | 55.21 |
OpenLLM v2 Average Score |
100.48 | 56.60 | 56.87 |
MMMU 0-shot |
101.6 | 53.44 | 54.33 |
ChartQA 0-shot exact_match |
100.8 | 65.88 | 66.44 |
ChartQA 0-shot relaxed_accuracy |
99.82 | 88.92 | 88.76 |
Multimodal Average Score | 100.6 | 69.41 | 69.84 |
- Downloads last month
- 0
Model tree for RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16
Base model
meta-llama/Llama-4-Scout-17B-16E