Merve Noyan's picture

Merve Noyan PRO

merve

·

https://github.com/merveenoyan/smol-vision

AI & ML interests

VLMs, vision & co

Recent Activity

updated a dataset about 8 hours ago

vlmbook/images

replied to their post about 11 hours ago

Don't sleep on new AI at Meta Vision-Language release! 🔥 https://huggingface.co./collections/facebook/perception-encoder-67f977c9a65ca5895a7f6ba1 https://huggingface.co./collections/facebook/perception-lm-67f9783f171948c383ee7498 Meta dropped swiss army knives for vision with A2.0 license 👏 > image/video encoders for vision language modelling and spatial understanding (object detection etc) 👏 > The vision LM outperforms InternVL3 and Qwen2.5VL 👏 > They also release gigantic video and image datasets The authors attempt to come up with single versatile vision encoder to align on diverse set of tasks. They trained Perception Encoder (PE) Core: a new state-of-the-art family of vision encoders that can be aligned for both vision-language and spatial tasks. For zero-shot image tasks, it outperforms latest sota SigLIP2 👏 > Among fine-tuned ones, first one is PE-Spatial. It's a model to detect bounding boxes, segmentation, depth estimation and it outperforms all other models 😮 > Second one is PLM, Perception Language Model, where they combine PE-Core with Qwen2.5 LM 7B. it outperforms all other models (including InternVL3 which was trained with Qwen2.5LM too!) The authors release the following checkpoints in sizes base, large and giant: > 3 PE-Core checkpoints (224, 336, 448) > 2 PE-Lang checkpoints (L, G) > One PE-Spatial (G, 448) > 3 PLM (1B, 3B, 8B) > Datasets Authors release following datasets 📑 > PE Video: Gigantic video datasete of 1M videos with 120k expert annotations ⏯️ > PLM-Video and PLM-Image: Human and auto-annotated image and video datasets on region-based tasks > PLM-VideoBench: New video benchmark on MCQA

updated a dataset about 11 hours ago

huggingfacejs/tasks

View all activity

Organizations

merve's activity

upvoted a paper 3 days ago

Perception Encoder: The best visual embeddings are not at the output of the network

Paper • 2504.13181 • Published 11 days ago • 31

upvoted 2 collections 4 days ago

Perception Encoder

9 items • Updated 11 days ago • 37

Perception LM

7 items • Updated 11 days ago • 35

upvoted a paper about 1 month ago

VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search

Paper • 2503.10582 • Published Mar 13 • 23

upvoted a collection about 1 month ago

OLMo 2

Artifacts for the OLMo 2 release. • 27 items • Updated 4 days ago • 112

upvoted a collection about 2 months ago

Cohere Labs Aya Vision

Aya Vision is a state-of-the-art family of vision models that brings multimodal capabilities to 23 languages. • 5 items • Updated 13 days ago • 68

upvoted 2 collections 2 months ago

olmOCR

olmOCR is a document recognition pipeline for efficiently converting documents into plain text. olmocr.allenai.org • 4 items • Updated Mar 19 • 107

SigLIP2

36 items • Updated 25 days ago • 67

upvoted a paper 2 months ago

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Paper • 2502.14786 • Published Feb 20 • 143

upvoted a collection 2 months ago

SmolVLM2 📺 Smallest video LM ever 🤏🏻

11 items • Updated 3 days ago • 82

upvoted an article 2 months ago

Article

SmolVLM2: Bringing Video Understanding to Every Device

Feb 20

• 238

upvoted a collection 2 months ago

Ovis2

Our latest advancement in multi-modal large language models (MLLMs) • 15 items • Updated Mar 25 • 60

upvoted a collection 3 months ago

Hibiki fr-en

Hibiki is a model for streaming speech translation , which can run on device! See https://github.com/kyutai-labs/hibiki. • 5 items • Updated Feb 6 • 52

upvoted an article 3 months ago

Article

Open-source DeepResearch – Freeing our search agents

Feb 4

• 1.23k