Eval numbers for Llama 3.2 1B in Table 1 don't match Meta's results
The eval numbers in Table 1 of the paper (https://arxiv.org/pdf/2504.12285) for Llama 3.2 1B don't match Meta's published results at https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/. For example, Table 1 quotes 37.8 for ARC Challenge, but Meta reports 59.4. There are discrepancies across all tasks.
Thank you for noting the difference in Llama 3.2 1B evaluation scores. Evaluation results for LLMs can indeed vary significantly based on the specific framework, prompts, few-shot settings, and dataset versions used.
In our study, the priority was a consistent comparison across all models evaluated. To achieve this, we used the widely adopted lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness) with uniform settings for all models. The scores in Table 1 reflect performance under this specific, unified evaluation setup.
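For reference, a run under such a unified setup looks roughly like the sketch below, using the harness's Python API. The task list, few-shot count, and batch size shown are illustrative placeholders rather than the exact configuration used in the paper.

```python
# Minimal sketch of a unified lm-evaluation-harness run (Python API, v0.4+).
# Model name, tasks, few-shot count, and batch size are placeholders,
# not necessarily the paper's exact settings.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-1B",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "winogrande"],
    num_fewshot=0,   # kept identical for every model under comparison
    batch_size=8,
)

# results["results"] maps each task name to its metric dict (e.g. acc, acc_norm).
for task, metrics in results["results"].items():
    print(task, metrics)
```

The point of this setup is that every model, including ours, goes through exactly the same prompts, few-shot settings, and dataset versions, so relative differences are attributable to the models rather than the evaluation pipeline.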
Therefore, while our results facilitate fair relative comparisons within our paper, they may understandably diverge from Meta's figures, which could be based on different internal protocols, specific harness configurations, or prompt engineering.
The main claim of your paper is that you can build a 1.58-bit LLM with accuracy comparable to models that are unquantized or use at least 4 bits. But if the numbers you report for competing models are lower than the numbers their authors report, that claim is in question. I do like the ternary quantization idea and would like to see more convincing evidence. However, I find the claim that your comparison, as it stands, represents "a fair relative comparison" unconvincing (and I suspect others will too): a different evaluation harness may introduce artifacts that handicap the other models but not yours. I suggest you run Meta's weights and runtime on the exact same test set and check whether you can reproduce their results (the same applies to Qwen and the other models); if they do reproduce, then figure out why the EleutherAI evaluation harness does not.
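For concreteness, a first sanity check along these lines could be to sweep the few-shot setting on the original checkpoint and see whether any configuration recovers Meta's reported 59.4 on ARC Challenge. The sketch below assumes the lm-evaluation-harness Python API; the shot counts are placeholders, and the actual values should be taken from Meta's model card.

```python
# Sketch of the suggested reproduction check: evaluate the original
# Llama 3.2 1B weights under a few candidate few-shot settings and see
# which (if any) matches Meta's reported ARC-Challenge score.
# Assumption: lm-evaluation-harness >= 0.4; shot counts are placeholders.
import lm_eval

for shots in (0, 25):
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=meta-llama/Llama-3.2-1B",
        tasks=["arc_challenge"],
        num_fewshot=shots,
        batch_size=8,
    )
    print(f"{shots}-shot ARC-Challenge:", results["results"]["arc_challenge"])
```

If none of the harness configurations reproduces Meta's figures, that itself would be a useful finding to report alongside Table 1.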