The Massive Multimodal Embedding Benchmark
Analyze images to detect objects, points, keypoints, or text
Segment objects in images and videos using text prompts
Chat with an AI assistant using text and images