# rubert-tiny2-vllm
vLLM-optimized version of cointegrated/rubert-tiny2 for high-performance embedding inference.
This model produces embeddings that match the original to within float32 precision while gaining throughput from vLLM's optimized kernels and continuous batching.
## Modifications
- No weight changes: the original query/key/value weights are used directly
- vLLM automatically converts the separate Q/K/V projections into its fused `qkv_proj` format during loading (see the sketch after this list)
- Pretraining heads (MLM/NSP) removed; they are not needed for embeddings
- Architecture changed to `BertModel` for vLLM compatibility
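As an illustration of what that fusion means, here is a minimal sketch (not vLLM's actual loader code; shapes follow rubert-tiny2's 312-dim hidden size):

```python
import torch

hidden = 312  # rubert-tiny2 hidden size

# Separate projection weights, as stored in the original checkpoint
q_w = torch.randn(hidden, hidden)
k_w = torch.randn(hidden, hidden)
v_w = torch.randn(hidden, hidden)

# vLLM-style fusion: stack into one matrix so the attention input
# projection becomes a single matmul instead of three
qkv_w = torch.cat([q_w, k_w, v_w], dim=0)  # (3 * hidden, hidden)

x = torch.randn(4, hidden)  # a batch of token states
q, k, v = (x @ qkv_w.T).split(hidden, dim=-1)

# Identical to applying the three projections separately
assert torch.allclose(q, x @ q_w.T, atol=1e-5)
```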
## Usage
### vLLM Server
```bash
# IMPORTANT: use fp32 for an exact numerical match with the original model
vllm serve WpythonW/rubert-tiny2-vllm --dtype float32
```
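The server exposes vLLM's standard OpenAI-compatible endpoints. For a quick smoke test without the Python client (assuming the default port 8000):

```bash
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "WpythonW/rubert-tiny2-vllm", "input": "Привет мир"}'
```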
### OpenAI-compatible API
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # vLLM does not validate the key unless configured to
)

response = client.embeddings.create(
    input="Привет мир",
    model="WpythonW/rubert-tiny2-vllm",
)
print(response.data[0].embedding[:5])
```
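The `input` field also accepts a list of strings, which lets vLLM batch the texts in a single request. For example:

```python
batch = client.embeddings.create(
    input=["привет мир", "hello world", "здравствуй вселенная"],
    model="WpythonW/rubert-tiny2-vllm",
)
print(len(batch.data))               # 3 embeddings
print(len(batch.data[0].embedding))  # 312 dimensions each
```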
### Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("WpythonW/rubert-tiny2-vllm")
model = AutoModel.from_pretrained("WpythonW/rubert-tiny2-vllm")

def embed_bert_cls(text, model, tokenizer):
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # Take the CLS-token state and L2-normalize it
    embeddings = model_output.last_hidden_state[:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (312,)
```
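Because `embed_bert_cls` returns L2-normalized vectors, cosine similarity between two texts reduces to a dot product:

```python
import numpy as np

e1 = embed_bert_cls('привет мир', model, tokenizer)
e2 = embed_bert_cls('hello world', model, tokenizer)
print(float(np.dot(e1, e2)))  # cosine similarity of the normalized vectors
```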
### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('WpythonW/rubert-tiny2-vllm')

sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 312)
```
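To build the kind of cosine-similarity matrix compared in the validation below, a plain-NumPy sketch (normalizing explicitly, so it holds whether or not the encoder already normalizes):

```python
import numpy as np

# Row-normalize, then the full cosine matrix is a single matmul
norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
print(np.round(norm @ norm.T, 3))  # (3, 3) similarity matrix
```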
## Validation Results
Comparison between vLLM and SentenceTransformers on identical inputs:

- Max embedding difference: 3.375e-7
- Mean embedding difference: 1.136e-7
- Cosine similarity matrices: identical (`np.allclose` with default tolerances)

This confirms numerical equivalence within float32 precision limits.
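A minimal script to reproduce this comparison (it assumes the vLLM server from above is running on localhost:8000; exact numbers will vary with hardware):

```python
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

texts = ["привет мир", "hello world", "здравствуй вселенная"]

# Reference embeddings via SentenceTransformers
ref = SentenceTransformer("WpythonW/rubert-tiny2-vllm").encode(texts)

# Same texts through the vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
resp = client.embeddings.create(input=texts, model="WpythonW/rubert-tiny2-vllm")
out = np.array([d.embedding for d in resp.data], dtype=np.float32)

print("max diff: ", np.abs(ref - out).max())
print("mean diff:", np.abs(ref - out).mean())
```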
## Conversion
Full conversion notebook with validation: Google Colab
Conversion process (sketched below):

- Load the original cointegrated/rubert-tiny2 weights
- Remove the `bert.` prefix from weight names
- Remove the unused heads (`cls.*`, `bert.pooler.*`)
- Keep query/key/value weights as-is (vLLM handles fusion automatically)
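A sketch of the state-dict transformation (illustrative only; it assumes the original checkpoint file is `pytorch_model.bin` with standard `bert.`/`cls.` key prefixes; the linked notebook is authoritative):

```python
import torch
from huggingface_hub import hf_hub_download

# The raw checkpoint keeps the original "bert." / "cls." key prefixes
path = hf_hub_download("cointegrated/rubert-tiny2", "pytorch_model.bin")
state = torch.load(path, map_location="cpu")

converted = {}
for name, tensor in state.items():
    # Drop pretraining heads and the pooler; embeddings don't use them
    if name.startswith("cls.") or name.startswith("bert.pooler."):
        continue
    # Strip the "bert." prefix so keys match a bare BertModel;
    # Q/K/V weights are copied unchanged (vLLM fuses them at load time)
    converted[name.removeprefix("bert.")] = tensor

torch.save(converted, "pytorch_model.bin")
```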
Tested on a Google Colab Tesla T4 with:
- vLLM 0.11.2
- Transformers 4.57.2
- PyTorch 2.9.0+cu126
## Original Model
For standard PyTorch/Transformers usage, see the original model: cointegrated/rubert-tiny2