Domain-Adaptive CLIP for Multimodal Retrieval

The fine-tuned CLIP model (ViT-L/14) used in Knowledge-Enhanced Multimodal Retrieval.


πŸ“¦ Available Models

| Model | Description | Data Type |
|---|---|---|
| reevaluate-clip | Fine-tuned on images, query texts, and description texts | Image+Text |

🧾 Dataset

The models were trained and evaluated on the REEVALUATE Image-Text Pair Dataset, which contains 43,500 image–text pairs derived from Wikidata and Pilot Museums.

Each artefact is described by:

  • Image: artefact image
  • Description text: BLIP-generated natural-language portion + metadata portion
  • Query text: short text phrased like a user query

Dataset: xuemduan/reevaluate-image-text-pairs
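A minimal sketch of loading the dataset with the Hugging Face datasets library; the exact split and column names printed below (e.g. image, description, query) are assumptions and may differ from the published dataset card.

from datasets import load_dataset

# Load the REEVALUATE image-text pair dataset from the Hugging Face Hub
dataset = load_dataset("xuemduan/reevaluate-image-text-pairs")

# Inspect the splits and columns; field names are illustrative assumptions
print(dataset)
example = dataset["train"][0]
print(example.keys())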


πŸš€ Usage

from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("xuemduan/reevaluate-clip")
processor = CLIPProcessor.from_pretrained("xuemduan/reevaluate-clip")

image = Image.open("artefact.jpg")
text = "yellow flower paintings"

# Encode the image and the text into the shared CLIP embedding space
image_embeds = model.get_image_features(**processor(images=image, return_tensors="pt"))
text_embeds = model.get_text_features(**processor(text=[text], return_tensors="pt"))

# L2-normalize so the dot product below equals cosine similarity
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

similarity = image_embeds @ text_embeds.T
print(similarity)
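For retrieval over several candidate texts, the same embeddings can be compared in a batch and the candidates ranked by cosine similarity. The snippet below is a minimal sketch; the candidate texts are hypothetical examples, not taken from the dataset.

import torch

texts = ["yellow flower paintings", "a bronze statue", "a medieval manuscript"]  # hypothetical candidates

with torch.no_grad():
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=texts, padding=True, return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)
    text_embeds = model.get_text_features(**text_inputs)

# Cosine similarity between the image and every candidate text
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
scores = (image_embeds @ text_embeds.T).squeeze(0)

# Rank candidates from most to least similar
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.4f}  {texts[idx]}")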


Evaluation results

| Metric | Dataset | Value (self-reported) |
|---|---|---|
| I2T R@1 | Cultural Heritage Hybrid Dataset | <TOBE_FILL_IN> |
| I2T R@5 | Cultural Heritage Hybrid Dataset | <TOBE_FILL_IN> |
| T2I R@1 | Cultural Heritage Hybrid Dataset | <TOBE_FILL_IN> |
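The R@K numbers above are recall-at-K retrieval metrics. As an illustration only (this is not the original evaluation script), a minimal sketch of computing Recall@K from a similarity matrix, assuming the i-th query is paired with the i-th candidate:

import torch

def recall_at_k(similarity: torch.Tensor, k: int) -> float:
    # similarity[i, j] = score between query i and candidate j;
    # the ground-truth match for query i is assumed to be candidate i
    top_k = similarity.topk(k, dim=-1).indices                  # (num_queries, k)
    targets = torch.arange(similarity.size(0)).unsqueeze(-1)    # (num_queries, 1)
    hits = (top_k == targets).any(dim=-1).float()
    return hits.mean().item()

# Example with a hypothetical 5x5 image-to-text similarity matrix
sim = torch.randn(5, 5)
print(recall_at_k(sim, k=1), recall_at_k(sim, k=5))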