Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
Paper: arXiv:2406.14544
Model details
PrismCaptioners are open-source captioners built on the LLaVA architecture and finetuned on ALLaVA, a GPT-4V-assisted dataset. We have released PrismCaptioner-7B and PrismCaptioner-2B.
PrismCaptioner-7B details
For more information, see the paper and codebase: [Paper] [Code]
Intended uses
Model usage
Clone the Prism repo and complete the preparation steps described there. You can then use PrismCaptioners following the usage example or demo below.
# In the Prism repo folder
from decouple import supported_VLM

# Load the captioner through Prism's VLM registry
model = supported_VLM['prismcaptioner-7b']()

# Pass the image path followed by the text prompt
res = model.generate(['assets/case1.png', 'Given the image below, please provide a detailed description of what you see.'])
print(res)
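To caption several images in one run, you can instantiate the captioner once and reuse it. The following is a minimal sketch under the same API assumptions as the example above (the supported_VLM registry and the generate signature); the extra image path is hypothetical.

# In the Prism repo folder
from decouple import supported_VLM

# Instantiate the captioner once and reuse it across images
model = supported_VLM['prismcaptioner-7b']()

prompt = 'Given the image below, please provide a detailed description of what you see.'
for path in ['assets/case1.png', 'assets/case2.png']:  # hypothetical paths
    caption = model.generate([path, prompt])
    print(f'{path}: {caption}')

The resulting captions can then be handed to a text-only LLM for reasoning, following the decoupled perception/reasoning pipeline described in the Prism paper.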