Collections
Discover the best community collections!
Collections including paper arxiv:2501.04001

- Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
  Paper • 2501.05767 • Published • 29
- An Empirical Study of Autoregressive Pre-training from Videos
  Paper • 2501.05453 • Published • 41
- Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
  Paper • 2501.04001 • Published • 47
- MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives
  Paper • 2512.14699 • Published • 28

- Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
  Paper • 2501.04001 • Published • 47
- facebook/sapiens-seg-1b-torchscript
  Image Segmentation • Updated • 350 • 5
- sayeed99/segformer_b3_clothes
  Image Segmentation • 47.2M • Updated • 3.38k • 37
- allenai/WildDet3D
  Object Detection • Updated • 156 • 33

- LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
  Paper • 2501.03895 • Published • 52
- Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
  Paper • 2501.04001 • Published • 47
- Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
  Paper • 2501.04003 • Published • 27
- VideoRAG: Retrieval-Augmented Generation over Video Corpus
  Paper • 2501.05874 • Published • 75

- LinFusion: 1 GPU, 1 Minute, 16K Image
  Paper • 2409.02097 • Published • 34
- Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion
  Paper • 2409.11406 • Published • 27
- Diffusion Models Are Real-Time Game Engines
  Paper • 2408.14837 • Published • 126
- Segment Anything with Multiple Modalities
  Paper • 2408.09085 • Published • 22

- Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
  Paper • 2504.01990 • Published • 305
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  Paper • 2504.10479 • Published • 308
- What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models
  Paper • 2503.24235 • Published • 55
- Seedream 3.0 Technical Report
  Paper • 2504.11346 • Published • 70

- Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
  Paper • 2501.04001 • Published • 47
- LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
  Paper • 2501.03895 • Published • 52
- An Empirical Study of Autoregressive Pre-training from Videos
  Paper • 2501.05453 • Published • 41
- MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training
  Paper • 2501.07556 • Published • 7

- LongVILA: Scaling Long-Context Visual Language Models for Long Videos
  Paper • 2408.10188 • Published • 52
- xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
  Paper • 2408.08872 • Published • 101
- Building and better understanding vision-language models: insights and future directions
  Paper • 2408.12637 • Published • 133
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
  Paper • 2408.12528 • Published • 51