yess, we are collaborating with the authors to sprint this!
Merve Noyan PRO (merve)
AI & ML interests: VLMs, vision & co

Recent Activity
updated a dataset about 8 hours ago: vlmbook/images
replied to their post about 11 hours ago
updated a dataset about 11 hours ago: huggingfacejs/tasks
merve's activity

replied to their post about 11 hours ago

posted an update 3 days ago
Don't sleep on the new AI at Meta vision-language release!
https://huggingface.co./collections/facebook/perception-encoder-67f977c9a65ca5895a7f6ba1
https://huggingface.co./collections/facebook/perception-lm-67f9783f171948c383ee7498
Meta dropped Swiss Army knives for vision with an Apache 2.0 license
> image/video encoders for vision language modelling and spatial understanding (object detection etc.)
> The vision LM outperforms InternVL3 and Qwen2.5VL
> They also release gigantic video and image datasets
The authors attempt to come up with a single versatile vision encoder that can be aligned to a diverse set of tasks.
They trained Perception Encoder (PE) Core: a new state-of-the-art family of vision encoders that can be aligned for both vision-language and spatial tasks. For zero-shot image tasks, it outperforms the latest SOTA, SigLIP2.
> Among the fine-tuned variants, the first is PE-Spatial: a model for bounding box detection, segmentation and depth estimation, and it outperforms all other models
> The second is PLM, the Perception Language Model, where they combine PE-Core with the Qwen2.5 7B LM. It outperforms all other models (including InternVL3, which was also trained with a Qwen2.5 LM!)
The authors release the following checkpoints in sizes base, large and giant:
> 3 PE-Core checkpoints (224, 336, 448)
> 2 PE-Lang checkpoints (L, G)
> One PE-Spatial (G, 448)
> 3 PLM (1B, 3B, 8B)
Datasets
The authors release the following datasets:
> PE Video: a gigantic video dataset of 1M videos with 120k expert annotations
> PLM-Video and PLM-Image: human- and auto-annotated image and video datasets on region-based tasks
> PLM-VideoBench: a new video benchmark on MCQA
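To see at a glance everything included in the release, here's a small sketch (mine, not from the release) that lists both collections with huggingface_hub, using the collection slugs linked above:

```python
from huggingface_hub import get_collection

# Walk the two Meta collections referenced above and print every item
# (models and datasets) they contain.
for slug in (
    "facebook/perception-encoder-67f977c9a65ca5895a7f6ba1",
    "facebook/perception-lm-67f9783f171948c383ee7498",
):
    collection = get_collection(slug)
    for item in collection.items:
        print(item.item_type, item.item_id)
```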

posted an update 5 days ago
New foundation model on image and video captioning just dropped by NVIDIA AI
Describe Anything Model (DAM) is a 3B vision language model that generates detailed captions with localized references
The team released the models, the dataset, a new benchmark and a demo: nvidia/describe-anything-680825bb8f5e41ff0785834c
Most vision LMs focus on the image as a whole, lack localized references in captions, and don't take in visual prompts (points, boxes, drawings around objects)
DAM addresses this on two levels: a new vision backbone that takes in focal crops along with the image itself, and a large-scale dataset
They generate the dataset by extending existing segmentation and referring expression generation datasets like RefCOCO, passing the images and classes to VLMs and generating captions.
Lastly, they also release a new benchmark, again with self-supervision: they use an LLM to evaluate the detailed captions, focusing on localization
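To make the focal-crop idea concrete (this is not DAM's actual API), a rough sketch: crop the region you care about and ask any off-the-shelf VLM to caption just that crop. The model id, image path, box and message format are placeholders following the generic transformers image-text-to-text pipeline, and may need adjusting for your transformers version.

```python
from PIL import Image
from transformers import pipeline

# Hypothetical illustration of region-focused captioning: crop a "focal" box
# out of the image and caption the crop with a small generic VLM. DAM itself
# feeds both the focal crop and the full image through its own backbone.
captioner = pipeline("image-text-to-text", model="HuggingFaceTB/SmolVLM-Instruct")  # placeholder model

image = Image.open("street.jpg")      # placeholder image path
box = (120, 80, 360, 300)             # placeholder (left, top, right, bottom) region
focal_crop = image.crop(box)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": focal_crop},
        {"type": "text", "text": "Describe this region in detail."},
    ],
}]
out = captioner(text=messages, max_new_tokens=128, return_full_text=False)
print(out[0]["generated_text"])
```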

posted an update 14 days ago
Sooo many open AI releases this past week, let's summarize!
merve/april-11-releases-67fcd78be33d241c0977b9d2
Multimodal
> Moonshot AI released Kimi VL Thinking, the first working open-source multimodal reasoning model, and Kimi VL Instruct; both are 16B MoEs with 3B active params (OS)
> InternVL3 was released, based on Qwen2.5VL, with 7 checkpoints in various sizes (1B to 78B)
LLMs
> NVIDIA released Llama-3_1-Nemotron-Ultra-253B-v1, an LLM built on Llama 405B for reasoning, chat and tool use
> Agentica released DeepCoder-14B-Preview, a fine-tuned version of DeepSeek-R1-Distilled-Qwen-14B on problem-test pairs, along with the compiled dataset
> Zyphra/ZR1-1.5B is a new small reasoning LLM built on R1-Distill-1.5B (OS)
> Skywork-OR1-32B-Preview is a new reasoning model by Skywork
Image Generation
> HiDream released three new models for image generation: HiDream I1 Dev, I1 Full, and I1 Fast (OS)
*OS ones have Apache 2.0 or MIT licenses

posted an update about 1 month ago
So many open releases at Hugging Face this past week, recapping all here!
merve/march-21-releases-67dbe10e185f199e656140ae
Multimodal
> Mistral AI released a 24B vision LM, both base and instruction FT versions, sota (OS)
> With IBM we released SmolDocling, a sota 256M document parser with Apache 2.0 license (OS)
> SpatialLM is a new vision LM that outputs 3D bounding boxes, and comes in 0.5B (QwenVL-based) and 1B (Llama-based) variants
> SkyWork released SkyWork-R1V-38B, a new vision reasoning model (OS)
LLMs
> NVIDIA released new Nemotron models in 49B and 8B, with their post-training dataset
> LG released EXAONE, new reasoning models in 2.4B, 7.8B and 32B
> Dataset: Glaive AI released a new reasoning dataset of 22M+ examples
> Dataset: NVIDIA released the new helpfulness dataset HelpSteer3
> Dataset: OpenManusRL is a new agent dataset based on the ReAct framework (OS)
> The Open-R1 team released OlympicCoder, a new competitive coder model in 7B and 32B
> Dataset: GeneralThought-430K is a new reasoning dataset (OS)
Image Generation/Computer Vision
> Roboflow released RF-DETR, a new real-time sota object detector (OS)
> YOLOE is a new real-time zero-shot object detector with text and visual prompts
> Stability AI released Stable Virtual Camera, a new novel view synthesis model
> Tencent released Hunyuan3D-2mini, a new small and fast 3D asset generation model
> ByteDance released InfiniteYou, a new realistic photo generation model
> StarVector is a new 8B model that generates SVG from images
> FlexWorld is a new model that expands 3D views (OS)
Audio
> Sesame released CSM-1B, a new speech generation model (OS)
Robotics
> NVIDIA released GR00T, a new robotics model for generalized reasoning and skills, along with the dataset
*OS ones have Apache 2.0 or MIT license

reacted to AdinaY's post about 1 month ago
FlexWorld: an open framework that generates 3D scenes from a single image!
Model: GSAI-ML/FlexWorld
Paper: FlexWorld: Progressively Expanding 3D Scenes for Flexible-View Synthesis (2503.13265)
✨ 360° rotation & zooming
✨ High-quality novel views powered by a video-to-video diffusion model
✨ Progressive 3D expansion

posted an update 2 months ago
Google just released PaliGemma 2 Mix: new versatile instruction vision language models
> Three new sizes: 3B, 10B and 28B, with 224 and 448 resolutions
> Can do vision language tasks with open-ended prompts, understand documents, and segment or detect anything
Read more: https://huggingface.co./blog/paligemma2mix
Try the demo: google/paligemma2-10b-mix
All models are here: google/paligemma-2-mix-67ac6a251aaf3ee73679dcc4
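To try the task-prefix prompting locally, here's a rough transformers sketch; the repo id follows the collection's naming but is an assumption on my part, and the "detect cat" prompt is just an example, so check the model cards for the exact ids and prefixes.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-mix-224"  # assumed id, pick the size/resolution you need
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("cat.jpg")   # placeholder image
prompt = "<image>detect cat"    # mix models respond to task prefixes like caption, detect, segment, ocr

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64)
# Strip the prompt tokens and decode only the newly generated part.
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```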

posted an update 2 months ago
Your weekly recap of open AI is here, and it's packed with models!
merve/feb-14-releases-67af876b404cc27c6d837767
Multimodal
> OpenGVLab released InternVideo 2.5 Chat models, new video LMs with long context
> AIDC released the Ovis2 model family along with the Ovis dataset: new vision LMs in different sizes (1B, 2B, 4B, 8B, 16B, 34B), with video and OCR support
> ColQwenStella-2b is a multilingual visual retrieval model that is sota for its size
> Hoags-2B-Exp is a new multilingual vision LM with contextual reasoning and long-context video understanding
LLMs
A lot of math models!
> The Open-R1 team released OpenR1-Math-220k, a large-scale math reasoning dataset, along with OpenR1-Qwen-7B, a Qwen2.5 Math model fine-tuned on the dataset
> Nomic AI released a new Nomic Embed multilingual retrieval model, a MoE with 500M params and 305M active params, outperforming other models
> DeepScaleR-1.5B-Preview is a new DeepSeek-R1-Distill fine-tune using distributed RL on math
> LIMO is a new fine-tune of Qwen2.5-32B-Instruct on math
Audio
> Zonos-v0.1 is a new family of text-to-speech models, which contains the models themselves and embeddings
Vision and Image Generation
> We have ported Apple's DepthPro to transformers for your convenience!
> illustrious-xl-v1.0 is a new illustration generation model

posted an update 3 months ago
Interesting releases in open AI this week, let's recap!
merve/feb-7-releases-67a5f7d7f172d8bfe0dd66f4
Robotics
> Pi0, the first open-source foundation vision-language-action model, was released in LeRobot (Apache 2.0)
LLMs
> Groundbreaking: s1 is a simpler approach to test-time scaling; the release comes with the small s1K dataset of 1k question-reasoning-trace pairs (from Gemini Thinking Exp). They fine-tune Qwen2.5-32B-Instruct to get s1-32B, outperforming o1-preview on math. s1-32B and s1K are out!
> Adyen released DABstep, a new benchmark along with its leaderboard demo for agents doing data analysis
> Krutrim released Krutrim-2 Instruct, a new 12B model based on NeMo 12B trained and aligned on Indic languages, a new multilingual sentence embedding model (based on STSB-XLM-R), and a translation model for Indic languages
Multimodal
> PKU released Align-DS-V, a model aligned using their new technique called LLF for all modalities (image-text-audio), along with the Align Anything dataset
> OLA-7B is a new any-to-any model by Tencent that can take text, image, video and audio data with a context window of 32k tokens, and output text and speech in English and Chinese
> Krutrim released Chitrarth, a new vision language model for Indic languages and English
Vision
> BiRefNet_HR is a new higher-resolution BiRefNet for background removal
Audio
> kyutai released Hibiki, a real-time speech-to-speech translation model, available for French-English translation
> Krutrim released Dhwani, a new STT model for Indic languages
> They also release a new dataset for STT-TTS
Image Generation
> Lumina released Lumina-Image-2.0, a 2B-parameter flow-based DiT for text-to-image generation
> Tencent released Hunyuan3D-2, a 3D asset generation model based on DiT and Hunyuan3D-Paint
> boreal-hl-v1 is a new boring photorealistic image generation LoRA based on Hunyuan

posted an update 3 months ago
IBM released ibm-granite/granite-vision-3.1-2b-preview, a small vision LM with impressive performance on different tasks.
It comes with transformers and vLLM support from the get-go.
You can run it on a Colab T4, so I built a notebook to put it to the test, find it here: https://github.com/merveenoyan/smol-vision/blob/main/inference_gists/IBM_Granite_Vision.ipynb
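For reference, a minimal sketch of what local inference with transformers can look like (the notebook linked above has the authoritative version; the image URL and question below are placeholders):

```python
import torch
from transformers import pipeline

# Rough sketch: run Granite Vision through the generic image-text-to-text pipeline.
pipe = pipeline(
    "image-text-to-text",
    model="ibm-granite/granite-vision-3.1-2b-preview",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},  # placeholder image URL
        {"type": "text", "text": "What does this chart show?"},
    ],
}]
out = pipe(text=messages, max_new_tokens=128, return_full_text=False)
print(out[0]["generated_text"])
```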

posted an update 3 months ago
This week in open AI was on fire! Let's recap!
merve/january-31-releases-679a10669bd4030090c5de4d
LLMs
> Huge: AllenAI released new Tülu models that outperform DeepSeek R1, using Reinforcement Learning with Verifiable Rewards (RLVR) on top of Llama 3.1 405B
> Mistral AI is back to open source with their "small" 24B models (base & SFT), with Apache 2.0 license
> Alibaba Qwen released their 1M context length models, Qwen2.5-Instruct-1M, great for agentic use, with Apache 2.0 license
> Arcee AI released Virtuoso-medium, a 32.8B LLM distilled from DeepSeek V3 with a dataset of 5B+ tokens
> Velvet-14B is a new family of 14B Italian LLMs trained on 10T tokens in six languages
> OpenThinker-7B is a fine-tuned version of Qwen2.5-7B-Instruct on the OpenThoughts dataset
VLMs & vision
> Alibaba Qwen is back with Qwen2.5VL, with amazing new capabilities ranging from agentic computer use to zero-shot localization
> NVIDIA released a new series of Eagle2 models in 1B and 9B sizes
> DeepSeek released Janus-Pro, a new any-to-any model (image-text generation from image-text input) with MIT license
> BEN2 is a new background removal model with MIT license!
Audio
> YuE is a new open-source music generation foundation model for lyrics-to-song generation
Codebase
> We are open-sourcing our SmolVLM training and eval codebase! https://github.com/huggingface/smollm/tree/main/vision
> Open-R1 is an open-source reproduction of R1 by the @huggingface science team: https://huggingface.co./blog/open-r1

reacted to davanstrien's post 3 months ago
Updated the ColPali Query Generator Space (davanstrien/ColPali-Query-Generator) to use Qwen/Qwen2.5-VL-7B-Instruct.
Given an input image, it generates several queries along with explanations to justify them. This approach can generate synthetic data for fine-tuning ColPali models.
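Not the Space's actual code, but the same idea can be sketched with the transformers image-text-to-text pipeline: show the model a document page and ask for retrieval-style queries plus justifications. The prompt wording and image URL are made up.

```python
from transformers import pipeline

# Rough sketch: ask Qwen2.5-VL for retrieval queries a document page could answer,
# to use as synthetic training data for ColPali-style retrievers.
pipe = pipeline("image-text-to-text", model="Qwen/Qwen2.5-VL-7B-Instruct", device_map="auto")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/report_page.png"},  # placeholder document image
        {"type": "text", "text": "Write 3 search queries a user might type that this page answers, "
                                 "and briefly justify each one."},
    ],
}]
out = pipe(text=messages, max_new_tokens=256, return_full_text=False)
print(out[0]["generated_text"])
```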

reacted to m-ric's post 3 months ago
The Hub welcomes external inference providers!
Hosting our own inference was not enough: the Hub now has 4 new inference providers: fal, Replicate, SambaNova Systems, & Together AI.
Check model cards on the Hub: you can now, in 1 click, use inference from various providers (cf. video demo)
Their inference can also be used through our Inference API client. There, you can use either your custom provider key or your HF token; billing is then handled directly on your HF account, as a way to centralize all expenses.
Also, PRO users get $2 of inference credits per month!
Read more in the announcement: https://huggingface.co./blog/inference-providers
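For the client route, a hedged sketch with huggingface_hub's InferenceClient; the provider id and model id below are examples, so check the docs for what's currently supported:

```python
from huggingface_hub import InferenceClient

# Route the request through a third-party provider; with an HF token, billing is
# centralized on your HF account (a provider-specific key also works).
client = InferenceClient(provider="together", api_key="hf_xxx")  # example provider id

resp = client.chat_completion(
    model="meta-llama/Llama-3.3-70B-Instruct",  # example model id
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```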

posted an update 3 months ago
Oof, what a week! So many things have happened, let's recap!
merve/jan-24-releases-6793d610774073328eac67a9
Multimodal
- We have released SmolVLM -- the tiniest VLMs, which come in 256M and 500M, with their retrieval models ColSmol for multimodal RAG
- UI-TARS are new models by ByteDance to unlock agentic GUI control, in 2B, 7B and 72B
- Alibaba DAMO lab released VideoLlama3, new video LMs that come in 2B and 7B
- MiniMaxAI released MiniMax-VL-01, where the decoder is based on the MiniMax-Text-01 456B MoE model with long context
- Dataset: Yale released a new benchmark called MMVU
- Dataset: CAIS released Humanity's Last Exam (HLE), a new challenging MM benchmark
LLMs
- DeepSeek-R1 & DeepSeek-R1-Zero: gigantic 660B reasoning models by DeepSeek, plus six distilled dense models, on par with o1, with MIT license!
- Qwen2.5-Math-PRM: new math models by Qwen in 7B and 72B
- NVIDIA released AceMath and AceInstruct, a new family of models and their datasets (SFT and reward ones too!)
Audio
- Llasa is a new speech synthesis model based on Llama that comes in 1B, 3B, and 8B
- TangoFlux is a new audio generation model trained from scratch and aligned with CRPO
Image/Video/3D Generation
- Flex.1-alpha is a new 8B pre-trained diffusion model by ostris, similar to Flux
- Tencent released Hunyuan3D-2, new 3D asset generation from images

reacted to m-ric's post 3 months ago
Today we make the biggest release in smolagents so far: we enable vision models, which allows you to build powerful web browsing agents!
Our agents can now casually open up a web browser and navigate it by scrolling, clicking elements on the webpage, going back, just like a user would.
The demo below shows Claude-3.5-Sonnet browsing GitHub for the task: "Find how many commits the author of the current top trending repo did over last year."
Go try it out, it's the most cracked agentic stuff I've seen in a while (well, along with OpenAI's Operator, who beat us by one day)
For more detail, read our announcement blog: https://huggingface.co./blog/smolagents-can-see
The code for the web browser example is here: https://github.com/huggingface/smolagents/blob/main/examples/vlm_web_browser.py
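For a rough idea of the wiring (the browser tools and screenshot callbacks live in the example script linked above, not here), a vision-capable model plugs into a CodeAgent roughly like this; the model id is just an example:

```python
from smolagents import CodeAgent, LiteLLMModel

# Minimal sketch: a CodeAgent backed by a vision-capable model. The actual web
# browsing demo adds helium/selenium browser tools and screenshot callbacks,
# see examples/vlm_web_browser.py in the smolagents repo.
model = LiteLLMModel(model_id="anthropic/claude-3-5-sonnet-latest")  # example model id
agent = CodeAgent(tools=[], model=model, add_base_tools=True)

agent.run(
    "Find how many commits the author of the current top trending GitHub repo made over the last year."
)
```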