OpenGVLab

community

https://github.com/opengvlab


AI & ML interests

Computer Vision

Recent Activity

yuezhengrong authored a paper about 2 hours ago

MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

yuezhengrong authored a paper about 2 hours ago

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning

yuezhengrong authored a paper about 2 hours ago

VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning


Papers

RIVER: A Real-Time Interaction Benchmark for Video LLMs

InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision


yuezhengrong

authored 9 papers about 2 hours ago

LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

Paper • 2503.10200 • Published Mar 13, 2025

VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception

Paper • 2509.21100 • Published Sep 25, 2025 • 1

UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation

Paper • 2510.10575 • Published Oct 12, 2025 • 2

Beyond Textual CoT: Interleaved Text-Image Chains with Deep Confidence Reasoning for Image Editing

Paper • 2510.08157 • Published Oct 9, 2025

VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

Paper • 2511.19524 • Published Nov 24, 2025

Continuous-Time Distribution Matching for Few-Step Diffusion Distillation

Paper • 2605.06376 • Published 4 days ago • 24

prithivMLmods

posted an update 9 days ago

Multimodal-Edge Demo, a node-based inference canvas, is now live on Spaces. It provides node-based Transformers inference across 10+ edge-device multimodal models on the Hub, all within a single Space. The lineup includes models from the Qwen3.5, Qwen3-VL, Gemma 4, and LFM 2.5 VL series, with support for reasoning and grounding tasks.

🤗 Demo: prithivMLmods/Multimodal-Edge-Node
🔗 GitHub: https://github.com/PRITHIVSAKTHIUR/Multimodal-Edge-Node
✅ Multimodal Apps Collections: https://huggingface.co/collections/prithivMLmods/hall-of-multimodal-apps

🤗 > To learn more, visit the app page or the respective model pages.

heroding77

authored 2 papers 13 days ago

OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis

Paper • 2604.15093 • Published 25 days ago • 28

Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

Paper • 2603.25040 • Published Mar 26 • 132

prithivMLmods

posted an update 17 days ago

A collection of various compression schemes for Qwen3.6, and for abliterated version 1 of the dense models, is now available on the Hub. Check it out via the links below. 👇

🔗 Qwen3.6-MoE: https://huggingface.co/collections/prithivMLmods/qwen36-35b-a3b-compressions
🔗 Qwen3.6-27B Compressions: https://huggingface.co/collections/prithivMLmods/qwen36-27b-compressions

🤗 > To learn more, visit the app page or the respective model pages.

prithivMLmods

posted an update 22 days ago

HY-World-2.0, a multi-modal world model for reconstructing, generating, and simulating 3D worlds, is now available on Spaces. The demo works both with native Gradio components and in Gradio server mode.

> HY-World-2.0-Demo: prithivMLmods/HY-World-2.0-Demo
> HY-World-2.0 [Server Mode]: prithivMLmods/HY-World-2.0-Demo
> Featuring 3D reconstruction and Gaussian splats with the Rerun viewer, along with camera poses, depth maps, and surface normals.
> In Server Mode, Gradio is served via FastAPI, with FastAPI remaining the top-level server.
> Model: tencent/HY-World-2.0
> GitHub: https://github.com/PRITHIVSAKTHIUR/HY-World-2.0-Demo

🤗 To learn more, visit the app page or the respective model pages.

prithivMLmods

posted an update 28 days ago

A new comparator on Spaces puts the standard FLUX.2 decoder side by side with the FLUX.2 Small Decoder. The Small Decoder is ~1.4× faster, uses ~1.4× less VRAM, and maintains near-identical image quality. It has ~28M parameters and narrower channels: [96, 192, 384, 384] vs. [128, 256, 512, 512] in the standard decoder. The demo supports sequence generation by running both decoders simultaneously and comparing the results side by side.

🤗 Comparator: prithivMLmods/Flux.2-4B-Decoder-Comparator
🔗 FLUX.2-small-decoder: black-forest-labs/FLUX.2-small-decoder
🔗 GitHub: https://github.com/PRITHIVSAKTHIUR/Flux.2-4B-Encoder-Comparator
🚁 Collection: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection

🤗 > App built on the Gradio SDK. To learn more, visit the app page or the respective model pages.
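
The channel widths quoted in the post above are each 0.75× the standard ones. As a rough illustration of why narrower channels shrink a decoder, the sketch below counts convolution weights for a hypothetical stack of 3×3 convolutions between consecutive listed widths. The stack layout, the kernel size, and the helper names are assumptions made for this example only; the post does not describe the actual decoder architecture, so this does not reproduce the ~28M parameter count or the ~1.4× speedup.

```python
# Rough scaling estimate only. Assumption: one 3x3 convolution between each
# pair of consecutive channel widths. The real FLUX.2 decoder layout is not
# described in the post, so these counts are illustrative, not actual sizes.

STANDARD_WIDTHS = [128, 256, 512, 512]  # widths quoted for the standard decoder
SMALL_WIDTHS = [96, 192, 384, 384]      # widths quoted for the Small Decoder


def conv_weight_count(c_in: int, c_out: int, kernel: int = 3) -> int:
    """Weights in one kernel x kernel convolution, ignoring the bias."""
    return c_in * c_out * kernel * kernel


def stack_weight_count(widths: list) -> int:
    """Sum of conv weights over consecutive width pairs (hypothetical stack)."""
    return sum(conv_weight_count(a, b) for a, b in zip(widths, widths[1:]))


# Per-stage width ratio between the small and standard decoders.
width_ratios = [s / f for s, f in zip(SMALL_WIDTHS, STANDARD_WIDTHS)]

# Ratio of total conv weights in the two hypothetical stacks.
weight_ratio = stack_weight_count(SMALL_WIDTHS) / stack_weight_count(STANDARD_WIDTHS)

print(width_ratios)  # [0.75, 0.75, 0.75, 0.75]
print(weight_ratio)  # 0.5625, i.e. 0.75 squared
```

Because a convolution's weight count is proportional to the product of its input and output widths, scaling every width by 0.75 scales each layer's weights by 0.75² ≈ 0.56 under these assumptions.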

prithivMLmods

posted an update 29 days ago

A collection of various compression schemes for Gemma 4, and for abliterated version 1 of the dense models, is now available on the Hub. Check it out via the links below. 👇

🔗 Gemma 4 Compression(s): https://huggingface.co/collections/prithivMLmods/gemma-4-compressions
🔗 Gemma 4 Uncensored [MAX] + Compression(s) [β]: https://huggingface.co/collections/prithivMLmods/gemma-4-uncensored-max-compressions
🔗 Gemma 4 Compression(s), MoE: https://huggingface.co/collections/prithivMLmods/gemma-4-compressions-moe
🔗 Gemma-4 F32 GGUF: https://huggingface.co/collections/prithivMLmods/gemma-4-f32-gguf

🤗 > To learn more, visit the app page or the respective model pages.

prithivMLmods

posted an update about 1 month ago

A demo for image detection (*Filter) based on SAM3 and Gemma-4 is now available on Spaces. It uses full Transformers inference with multimodal reasoning over processed images, and it also supports video segmentation (mask), video segmentation (annotation), and image click segmentation.

🤗 Demo Space: prithivMLmods/SAM3-Gemma4-CUDA
🥽 SAM3: facebook/sam3
🔗 gemma-4-E2B-it: google/gemma-4-E2B-it

To learn more, visit the app page or the respective model pages.

prithivMLmods

posted an update about 1 month ago

A demo for image detection (*Filter) based on SAM3 and Qwen-3.5 is now available on Hugging Face Spaces. It uses Transformers inference with multimodal reasoning over processed images, and it also supports video segmentation (mask), video segmentation (annotation), and image click segmentation.

🤗 Demo Space: prithivMLmods/SAM3-Plus-Qwen3.5
🥽 SAM3: facebook/sam3
🔗 Qwen-3.5: Qwen/Qwen3.5-2B

To learn more, visit the app page or the respective model pages.