rubert-tiny2-vllm

vLLM-optimized version of cointegrated/rubert-tiny2 for high-performance embedding inference.

This model produces embeddings that match the original to within float32 precision while enabling faster inference through vLLM's optimized kernels and batching.

Modifications

  • No weight changes - uses original query/key/value weights directly
  • vLLM automatically converts the separate Q/K/V projections to its fused qkv_proj format during loading (illustrated below)
  • Removed pretraining heads (MLM/NSP) - not needed for embeddings
  • Changed architecture to BertModel for vLLM compatibility
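
For intuition, the fusion vLLM performs at load time is roughly equivalent to concatenating the three projection matrices along the output dimension. This is an illustration of the weight relationship only, not how vLLM implements it internally:

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("WpythonW/rubert-tiny2-vllm")
attn = model.encoder.layer[0].attention.self

# The fused projection is the three matrices stacked along the output dimension
qkv_weight = torch.cat([attn.query.weight, attn.key.weight, attn.value.weight], dim=0)
qkv_bias = torch.cat([attn.query.bias, attn.key.bias, attn.value.bias], dim=0)
print(qkv_weight.shape)
# torch.Size([936, 312])  -- 3 * hidden_size, with hidden_size = 312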

Usage

vLLM Server

# IMPORTANT: Use fp32 for exact numerical match with original model
vllm serve WpythonW/rubert-tiny2-vllm --dtype float32

OpenAI-compatible API

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

response = client.embeddings.create(
    input="Привет мир",
    model="WpythonW/rubert-tiny2-vllm"
)
print(response.data[0].embedding[:5])
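
As with the standard OpenAI embeddings API, input also accepts a list of strings, so a batch of texts can be embedded in one request:

texts = ["Привет мир", "hello world", "здравствуй вселенная"]
response = client.embeddings.create(input=texts, model="WpythonW/rubert-tiny2-vllm")
print(len(response.data))
# 3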

Transformers

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("WpythonW/rubert-tiny2-vllm")
model = AutoModel.from_pretrained("WpythonW/rubert-tiny2-vllm")

def embed_bert_cls(text, model, tokenizer):
    # Tokenize and move the inputs to the model's device
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # CLS-token pooling: take the hidden state of the first token
    embeddings = model_output.last_hidden_state[:, 0, :]
    # L2-normalize so that dot products equal cosine similarities
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (312,)

Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('WpythonW/rubert-tiny2-vllm')
sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 312)
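
Assuming the Sentence Transformers config ends with a normalization step, as in the original rubert-tiny2 setup, cosine similarities reduce to plain dot products:

import numpy as np

sims = embeddings @ embeddings.T  # cosine similarity matrix for the batch
print(np.round(sims, 3))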

Validation Results

Comparison between vLLM and SentenceTransformers on identical inputs:

  • Max embedding difference: 3.375e-7
  • Mean embedding difference: 1.136e-7
  • Cosine similarity matrices: identical (np.allclose with default tolerances)

This confirms numerical equivalence within float32 precision limits.
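
A comparison of this kind can be reproduced along these lines (a sketch, assuming the vLLM server from the Usage section is running locally):

import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

sentences = ["привет мир", "hello world", "здравствуй вселенная"]

# Embeddings served by vLLM
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.embeddings.create(input=sentences, model="WpythonW/rubert-tiny2-vllm")
vllm_emb = np.array([d.embedding for d in response.data])

# Reference embeddings from SentenceTransformers
st_emb = SentenceTransformer("WpythonW/rubert-tiny2-vllm").encode(sentences)

print("max diff:", np.abs(vllm_emb - st_emb).max())
print("cosine matrices match:", np.allclose(vllm_emb @ vllm_emb.T, st_emb @ st_emb.T))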

Conversion

Full conversion notebook with validation: Google Colab

Conversion process (sketched in code after the list):

  1. Load original cointegrated/rubert-tiny2 weights
  2. Remove bert. prefix from weight names
  3. Remove unused heads (cls., bert.pooler.)
  4. Keep query/key/value weights as-is (vLLM handles fusion automatically)
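
A minimal sketch of the renaming steps, assuming the original checkpoint file is pytorch_model.bin (the exact code lives in the notebook):

import torch
from huggingface_hub import hf_hub_download

path = hf_hub_download("cointegrated/rubert-tiny2", "pytorch_model.bin")
state_dict = torch.load(path, map_location="cpu")

converted = {}
for name, tensor in state_dict.items():
    # Step 3: drop the pretraining heads and pooler
    if name.startswith("cls.") or name.startswith("bert.pooler."):
        continue
    # Step 2: strip the "bert." prefix
    converted[name.removeprefix("bert.")] = tensor
# Step 4: Q/K/V weights stay as-is; vLLM fuses them into qkv_proj at load time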

Tested on Google Colab Tesla T4 with:

  • vLLM 0.11.2
  • Transformers 4.57.2
  • PyTorch 2.9.0+cu126

Original Model

For standard PyTorch/Transformers usage, see the original model: cointegrated/rubert-tiny2
