Domain-Adaptive CLIP for Multimodal Retrieval
This repository provides the fine-tuned CLIP (ViT-L/14) model used in Knowledge-Enhanced Multimodal Retrieval.
📦 Available Models
| Model | Description | Data Type |
|---|---|---|
| reevaluate-clip | Fine-tuned on images, query texts, and description texts | Image+Text |
🧾 Dataset
The models were trained and evaluated on the REEVALUATE Image-Text Pair Dataset, which contains 43,500 image–text pairs derived from Wikidata and Pilot Museums.
Each artefact is described by:
- Image: artefact image
- Description text: BLIP-generated natural-language portion + metadata portion
- Query text: user query-like text
Dataset: xuemduan/reevaluate-image-text-pairs
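The pairs can be inspected with the 🤗 `datasets` library. This is a minimal sketch; the split name and column names printed below are whatever the published dataset defines, not assumptions made here.

```python
from datasets import load_dataset

# Load the REEVALUATE image-text pairs from the Hub
ds = load_dataset("xuemduan/reevaluate-image-text-pairs", split="train")

# Inspect the first artefact and the available columns
sample = ds[0]
print(ds.column_names)  # actual field names (image, description text, query text)
print(sample)
```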
🚀 Usage
```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load the fine-tuned model and its processor
model = CLIPModel.from_pretrained("xuemduan/reevaluate-clip")
processor = CLIPProcessor.from_pretrained("xuemduan/reevaluate-clip")

image = Image.open("artefact.jpg")
text = "yellow flower paintings"

# Encode the image and the text query
with torch.no_grad():
    image_embeds = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_embeds = model.get_text_features(**processor(text=[text], padding=True, return_tensors="pt"))

# L2-normalize so the dot product equals cosine similarity
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

similarity = image_embeds @ text_embeds.T
print(similarity)
```
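The same embeddings can be used for retrieval over a gallery. The sketch below ranks a few images against one text query; the file names and top-k handling are placeholders for illustration, not part of the original card.

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("xuemduan/reevaluate-clip")
processor = CLIPProcessor.from_pretrained("xuemduan/reevaluate-clip")

# Hypothetical gallery of artefact images (placeholder file names)
paths = ["artefact_1.jpg", "artefact_2.jpg", "artefact_3.jpg"]
images = [Image.open(p) for p in paths]
query = "yellow flower paintings"

with torch.no_grad():
    img = model.get_image_features(**processor(images=images, return_tensors="pt"))
    txt = model.get_text_features(**processor(text=[query], padding=True, return_tensors="pt"))

# Cosine similarity between the query and every gallery image
img = img / img.norm(dim=-1, keepdim=True)
txt = txt / txt.norm(dim=-1, keepdim=True)
scores = (txt @ img.T).squeeze(0)

# Rank images by similarity, highest first
for idx in scores.argsort(descending=True):
    print(paths[idx], scores[idx].item())
```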
Evaluation results
| Metric | Dataset | Value |
|---|---|---|
| I2T R@1 | Cultural Heritage Hybrid Dataset | <TOBE_FILL_IN> |
| I2T R@5 | Cultural Heritage Hybrid Dataset | <TOBE_FILL_IN> |
| T2I R@1 | Cultural Heritage Hybrid Dataset | <TOBE_FILL_IN> |
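For reference, a minimal sketch of how Recall@K can be computed from L2-normalized paired embeddings; the helper below is illustrative and is not the card's evaluation script.

```python
import torch

def recall_at_k(sim: torch.Tensor, k: int) -> float:
    """sim[i, j] = similarity between query i and candidate j,
    with the correct match on the diagonal (one-to-one pairs)."""
    ranks = sim.argsort(dim=-1, descending=True)          # candidate indices, best first
    targets = torch.arange(sim.size(0)).unsqueeze(1)      # ground-truth index per query
    hits = (ranks[:, :k] == targets).any(dim=-1)          # true if the match is in the top k
    return hits.float().mean().item()

# image_embeds, text_embeds: L2-normalized (N, D) tensors for N paired artefacts
# sim_i2t = image_embeds @ text_embeds.T
# print("I2T R@1:", recall_at_k(sim_i2t, 1), "I2T R@5:", recall_at_k(sim_i2t, 5))
# print("T2I R@1:", recall_at_k(sim_i2t.T, 1))
```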