Llama-4-Scout-17B-16E-Instruct-quantized.w4a16

Model Overview

  • Model Architecture: Llama4ForConditionalGeneration
    • Input: Text / Image
    • Output: Text
  • Model Optimizations:
    • Activation quantization: None
    • Weight quantization: INT4
  • Release Date: 04/25/2025
  • Version: 1.0
  • Model Developers: Red Hat (Neural Magic)

Model Optimizations

This model was obtained by quantizing the weights of Llama-4-Scout-17B-16E-Instruct to the INT4 data type. This optimization reduces the number of bits used to represent each weight from 16 to 4, cutting both GPU memory and disk-size requirements by approximately 75%. Quantization was performed with the llm-compressor library.
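
The snippet below is a minimal, illustrative sketch of how a W4A16 quantization can be produced with llm-compressor; the exact recipe, calibration dataset, and multimodal handling used for this model are not reproduced here, so treat the dataset name and calibration parameters as assumptions.

# Illustrative llm-compressor sketch, not the exact recipe used for this model
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# W4A16: INT4 weights, 16-bit activations; the lm_head is typically left unquantized.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model_id,
    dataset="open_platypus",      # placeholder calibration set
    recipe=recipe,
    max_seq_length=2048,          # assumed calibration sequence length
    num_calibration_samples=512,  # assumed sample count
    output_dir="Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",
)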

Deployment

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16"
number_gpus = 4

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

# Format the request with the model's chat template before generation.
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving; see the vLLM documentation for more details.
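
As a minimal sketch, the quantized model can be served with the vLLM CLI (port 8000 is vLLM's default; the api_key value is a placeholder, since a local server started this way does not check it):

vllm serve RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 --tensor-parallel-size 4

and then queried with any OpenAI-compatible client:

from openai import OpenAI

# Point the client at the local vLLM server (default port 8000).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
)
print(response.choices[0].message.content)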

Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (v1 and v2), the long-context RULER benchmark, and the multimodal MMMU and ChartQA benchmarks. All evaluations were obtained using lm-evaluation-harness.

Evaluation details

OpenLLM v1

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.7,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --batch_size auto 

OpenLLM v2

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=False,max_model_len=16384,tensor_parallel_size=8,gpu_memory_utilization=0.5,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks leaderboard \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto 

Long Context RULER

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=False,max_model_len=524288,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks ruler \
  --metadata='{"max_seq_lengths":[131072]}' \
  --batch_size auto 

Multimodal MMMU

lm_eval \
  --model vllm-vlm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=False,max_model_len=1000000,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True,max_images=10 \
  --tasks mmmu_val \
  --apply_chat_template \
  --batch_size auto 

Multimodal ChartQA

export VLLM_MM_INPUT_CACHE_GIB=8
lm_eval \
  --model vllm-vlm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=False,max_model_len=1000000,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True,max_images=10 \
  --tasks chartqa \
  --apply_chat_template \
  --batch_size auto 

Accuracy
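
Recovery is the quantized model's score expressed as a percentage of the unquantized baseline's score: recovery = 100 × (quantized score / baseline score). On ARC-Challenge, for example, 100 × 68.43 / 69.37 ≈ 98.64.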

| Benchmark | Recovery (%) | meta-llama/Llama-4-Scout-17B-16E-Instruct | RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 (this model) |
|---|---|---|---|
| ARC-Challenge (25-shot) | 98.64 | 69.37 | 68.43 |
| GSM8k (5-shot) | 98.99 | 90.45 | 89.54 |
| HellaSwag (10-shot) | 99.91 | 85.23 | 85.15 |
| MMLU (5-shot) | 99.70 | 80.54 | 80.30 |
| TruthfulQA (0-shot) | 99.44 | 61.41 | 61.07 |
| WinoGrande (5-shot) | 100.2 | 77.90 | 78.06 |
| OpenLLM v1 average score | 99.00 | 77.48 | 77.09 |
| IFEval (0-shot, avg of inst and prompt acc) | 100.6 | 86.90 | 87.45 |
| Big Bench Hard (3-shot) | 99.78 | 65.13 | 64.99 |
| Math Lvl 5 (4-shot) | 100.6 | 57.78 | 58.16 |
| GPQA (0-shot) | 102.6 | 31.88 | 32.72 |
| MuSR (0-shot) | 101.2 | 42.20 | 42.72 |
| MMLU-Pro (5-shot) | 99.12 | 55.70 | 55.21 |
| OpenLLM v2 average score | 100.48 | 56.60 | 56.87 |
| MMMU (0-shot) | 101.6 | 53.44 | 54.33 |
| ChartQA (0-shot, exact_match) | 100.8 | 65.88 | 66.44 |
| ChartQA (0-shot, relaxed_accuracy) | 99.82 | 88.92 | 88.76 |
| Multimodal average score | 100.6 | 69.41 | 69.84 |