rubert-tiny2-vllm

vLLM-optimized version of cointegrated/rubert-tiny2 for high-performance embedding inference.

This model produces embeddings that match the original to within float32 precision while enabling faster inference through vLLM's optimized kernels and batching.

Modifications

  • No weight changes - uses original query/key/value weights directly
  • vLLM automatically converts the separate Q/K/V projections to its fused qkv_proj format during loading (illustrated below)
  • Removed pretraining heads (MLM/NSP) - not needed for embeddings
  • Changed architecture to BertModel for vLLM compatibility
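
For intuition, the fusion vLLM performs at load time is roughly equivalent to concatenating the three projection matrices along the output dimension. This is an illustration of the weight relationship only, not how vLLM implements it internally:

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("WpythonW/rubert-tiny2-vllm")
attn = model.encoder.layer[0].attention.self

# The fused projection is the three matrices stacked along the output dimension
qkv_weight = torch.cat([attn.query.weight, attn.key.weight, attn.value.weight], dim=0)
qkv_bias = torch.cat([attn.query.bias, attn.key.bias, attn.value.bias], dim=0)
print(qkv_weight.shape)
# torch.Size([936, 312])  -- 3 * hidden_size, with hidden_size = 312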

Usage

vLLM Server

# IMPORTANT: Use fp32 for exact numerical match with original model
vllm serve WpythonW/rubert-tiny2-vllm --dtype float32

OpenAI-compatible API

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

response = client.embeddings.create(
    input="Привет мир",
    model="WpythonW/rubert-tiny2-vllm"
)
print(response.data[0].embedding[:5])
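
As with the standard OpenAI embeddings API, input also accepts a list of strings, so a batch of texts can be embedded in one request:

texts = ["Привет мир", "hello world", "здравствуй вселенная"]
response = client.embeddings.create(input=texts, model="WpythonW/rubert-tiny2-vllm")
print(len(response.data))
# 3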

Transformers

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("WpythonW/rubert-tiny2-vllm")
model = AutoModel.from_pretrained("WpythonW/rubert-tiny2-vllm")

def embed_bert_cls(text, model, tokenizer):
    # Tokenize and move the inputs to the model's device
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # CLS-token pooling: take the hidden state of the first token
    embeddings = model_output.last_hidden_state[:, 0, :]
    # L2-normalize so that dot products equal cosine similarities
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (312,)

Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('WpythonW/rubert-tiny2-vllm')
sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 312)
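
Assuming the Sentence Transformers config ends with a normalization step, as in the original rubert-tiny2 setup, cosine similarities reduce to plain dot products:

import numpy as np

sims = embeddings @ embeddings.T  # cosine similarity matrix for the batch
print(np.round(sims, 3))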

Validation Results

Comparison between vLLM and SentenceTransformers on identical inputs:

  • Max embedding difference: 3.375e-7
  • Mean embedding difference: 1.136e-7
  • Cosine similarity matrices: identical (np.allclose with default tolerances)

This confirms numerical equivalence within float32 precision limits.
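
A comparison of this kind can be reproduced along these lines (a sketch, assuming the vLLM server from the Usage section is running locally):

import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

sentences = ["привет мир", "hello world", "здравствуй вселенная"]

# Embeddings served by vLLM
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.embeddings.create(input=sentences, model="WpythonW/rubert-tiny2-vllm")
vllm_emb = np.array([d.embedding for d in response.data])

# Reference embeddings from SentenceTransformers
st_emb = SentenceTransformer("WpythonW/rubert-tiny2-vllm").encode(sentences)

print("max diff:", np.abs(vllm_emb - st_emb).max())
print("cosine matrices match:", np.allclose(vllm_emb @ vllm_emb.T, st_emb @ st_emb.T))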

Conversion

Full conversion notebook with validation: Google Colab

Conversion process (sketched in code after the list):

  1. Load original cointegrated/rubert-tiny2 weights
  2. Remove bert. prefix from weight names
  3. Remove unused heads (cls., bert.pooler.)
  4. Keep query/key/value weights as-is (vLLM handles fusion automatically)
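
A minimal sketch of the renaming steps, assuming the original checkpoint file is pytorch_model.bin (the exact code lives in the notebook):

import torch
from huggingface_hub import hf_hub_download

path = hf_hub_download("cointegrated/rubert-tiny2", "pytorch_model.bin")
state_dict = torch.load(path, map_location="cpu")

converted = {}
for name, tensor in state_dict.items():
    # Step 3: drop the pretraining heads and pooler
    if name.startswith("cls.") or name.startswith("bert.pooler."):
        continue
    # Step 2: strip the "bert." prefix
    converted[name.removeprefix("bert.")] = tensor
# Step 4: Q/K/V weights stay as-is; vLLM fuses them into qkv_proj at load time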

Tested on Google Colab Tesla T4 with:

  • vLLM 0.11.2
  • Transformers 4.57.2
  • PyTorch 2.9.0+cu126

Original Model

For standard PyTorch/Transformers usage, see the original model: cointegrated/rubert-tiny2
