PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models Paper • 2504.16074 • Published 6 days ago • 33
DataDecide Collection A suite of models, data, and evals over 25 corpora, 14 sizes, and 3 seeds to measure how accurately small experiments predict rankings at large scale. • 358 items • Updated 12 days ago • 13
Article Arabic Leaderboards: Introducing Arabic Instruction Following, Updating AraGen, and More 21 days ago • 16
TxGemma Release Collection Collection of open models to accelerate the development of therapeutics. • 5 items • Updated 25 days ago • 50
Article Welcome Gemma 3: Google's all new multimodal, multilingual, long context open LLM Mar 12 • 403
KITAB-Bench Collection A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding • 24 items • Updated Feb 24 • 11