Introducing Video-MMLU, a new benchmark for evaluating large multimodal models on classroom-style lectures in math, physics, and chemistry!
Video-MMLU requires stronger reasoning and broader world knowledge than previous benchmarks for video LMMs.
Each video comes with two tasks:
- Take Notes: detailed captioning of multi-discipline lectures
- Do Quiz: open-ended QA to test reasoning over visuals and proofs
We evaluated 90+ models, including vision-blind baselines, open-source models and proprietary ones.
We find that existing models generally perform poorly, with accuracy ranging from only 10% to 50%.
We also explore how the number of visual tokens and the base LLMs influence performance, offering insights into the interplay between multimodal perception and reasoning in lecture comprehension.
For more details, please check below!
Paper: https://arxiv.org/abs/2504.14693
Code: https://github.com/Espere-1119-Song/Video-MMLU
Data: Enxin/Video-MMLU (loading sketch below)
Website: https://enxinsong.com/Video-MMLU-web/
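If you want to poke at the data directly, here is a minimal Python sketch. It assumes "Enxin/Video-MMLU" is a dataset ID on the Hugging Face Hub (taken from the Data link above) and that it loads with the standard datasets library; the splits and column names are whatever the release actually ships, so inspect the returned object rather than relying on the names printed here.

from datasets import load_dataset

# Load the Video-MMLU annotations from the Hugging Face Hub.
# The dataset ID comes from the Data link above; split and column
# names are assumptions, so look at what load_dataset returns first.
data = load_dataset("Enxin/Video-MMLU")
print(data)  # shows which splits and columns the release provides

first_split = next(iter(data.values()))
print(first_split[0])  # peek at one lecture's caption/QA annotations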