# jailbreak-embeddings-base-onnx
ONNX export of the multilingual-e5-base-wjb-threatfeed_v1 model — a fine-tuned sentence-transformers model for detecting duplicate vulnerability submissions (jailbreak and prompt injection attacks) in the 0din threat feed.
It maps prompts to a 768-dimensional dense vector space optimized for semantic similarity comparison of attack prompts.
This model achieves a 50.6% relative F1 improvement over the OpenAI text-embedding-3-large baseline on duplicate detection (0.696 vs. 0.462).
## Model Details

### Model Description
- Model Type: Sentence Transformer (two-stage fine-tuned), exported to ONNX
- Base Model: intfloat/multilingual-e5-base (~278M parameters)
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Language: Multilingual (XLM-RoBERTa backbone)
- Format: ONNX (compatible with onnxruntime, tract-onnx, and other ONNX runtimes)
## Embedding Pipeline
Input Text → Tokenizer → ONNX Model → Mean Pooling → L2 Normalization → Embedding
The ONNX model contains only the transformer backbone. Mean pooling and L2 normalization must be implemented in application code (see usage examples below).
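For illustration, the two post-processing steps can be exercised on their own, with random dummy activations standing in for real model output:

```python
import numpy as np

# Dummy stand-ins for real model output: batch of 2 sequences,
# 6 token positions, 768-dim hidden states (random values for illustration)
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(2, 6, 768)).astype(np.float32)
attention_mask = np.array([[1, 1, 1, 1, 0, 0],
                           [1, 1, 1, 1, 1, 1]], dtype=np.int64)

# Mean pooling: average hidden states over real (non-padding) tokens only
mask = attention_mask[:, :, np.newaxis].astype(np.float32)
pooled = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)

# L2 normalization: unit-length vectors, so a dot product is cosine similarity
embeddings = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

print(embeddings.shape)                    # (2, 768)
print(np.linalg.norm(embeddings, axis=1))  # each row has unit norm
```

Because the output vectors are unit-length, downstream similarity search can use a plain dot product in place of a full cosine computation.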
### Model Inputs

The ONNX model requires three inputs:

- `input_ids`: token IDs from the tokenizer
- `attention_mask`: 1 for real tokens, 0 for padding
- `token_type_ids`: all zeros for single-sentence embeddings
### ONNX Verification

The ONNX export produces embeddings numerically identical to the native sentence-transformers model (0.000000 maximum difference across all test sentences).
## Intended Use
This model is designed for:
- Duplicate detection in AI security vulnerability reports (jailbreak/prompt injection attacks)
- Semantic similarity comparison of attack prompts that may use different surface-level techniques but target the same underlying vulnerability
- Embedding generation for LSH-based similarity search in vulnerability management systems
- Edge/server deployment via ONNX runtime without requiring PyTorch
The model is trained to recognize semantic equivalence between attack prompts even when they use different jailbreak tactics (e.g., role-playing, encoding, academic framing) to elicit the same harmful behavior.
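The card does not specify which LSH family the 0din pipeline uses. As one common option for cosine similarity, random-hyperplane (SimHash-style) LSH buckets unit vectors so that near-duplicates tend to land in the same bucket; the sketch below uses a hypothetical 8-hyperplane signature (256 buckets) purely for illustration:

```python
import numpy as np

def lsh_signature(embedding: np.ndarray, hyperplanes: np.ndarray) -> int:
    """Hash a unit vector to a bucket: one bit per hyperplane side."""
    bits = (hyperplanes @ embedding) > 0
    return int(np.packbits(bits)[0])  # 8 bits -> bucket id in [0, 255]

rng = np.random.default_rng(42)
hyperplanes = rng.normal(size=(8, 768))  # arbitrary choice of 8 hyperplanes

# Two similar vectors and one unrelated vector (all unit-normalized)
a = rng.normal(size=768)
b = a + 0.05 * rng.normal(size=768)  # small perturbation of a
c = rng.normal(size=768)
a, b, c = (v / np.linalg.norm(v) for v in (a, b, c))

# Near-duplicates usually share a bucket; unrelated vectors usually do not
print(lsh_signature(a, hyperplanes),
      lsh_signature(b, hyperplanes),
      lsh_signature(c, hyperplanes))
```

In practice, multiple independent hash tables are used so that true duplicates missed by one table are caught by another; candidate pairs from colliding buckets are then re-scored with exact cosine similarity.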
## Usage

### sentence-transformers (with ONNX backend)

```python
from sentence_transformers import SentenceTransformer

# Load directly with ONNX backend
model = SentenceTransformer("0dinai/jailbreak-embeddings-base-onnx", backend="onnx")

sentences = ["First attack prompt", "Second attack prompt"]
embeddings = model.encode(sentences)
similarity = model.similarity(embeddings, embeddings)
print(similarity)
```
### Python (onnxruntime)

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

# Load model and tokenizer
session = ort.InferenceSession("onnx/model.onnx")
tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_padding(pad_id=1, pad_token="<pad>")
tokenizer.enable_truncation(max_length=512)

# Tokenize
texts = ["First attack prompt", "Second attack prompt"]
encodings = tokenizer.encode_batch(texts)
input_ids = np.array([e.ids for e in encodings], dtype=np.int64)
attention_mask = np.array([e.attention_mask for e in encodings], dtype=np.int64)
token_type_ids = np.zeros_like(input_ids)

# Run ONNX inference
outputs = session.run(None, {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "token_type_ids": token_type_ids,
})
token_embeddings = outputs[0]  # [batch, seq_len, 768]

# Mean pooling
mask = attention_mask[:, :, np.newaxis].astype(np.float32)
embeddings = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)

# L2 normalization
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / norms

# Cosine similarity
similarity = np.dot(embeddings[0], embeddings[1])
print(f"Similarity: {similarity:.4f}")
```
### Rust (tract-onnx)

```rust
use tract_onnx::prelude::*;
use tokenizers::Tokenizer;

// Load model and tokenizer
let model = tract_onnx::onnx()
    .model_for_path("onnx/model.onnx")?
    .into_optimized()?
    .into_runnable()?;
let tokenizer = Tokenizer::from_file("tokenizer.json")?;

// Tokenize
let encoding = tokenizer.encode("Attack prompt text", true)?;
let input_ids: Vec<i64> = encoding.get_ids().iter().map(|&x| x as i64).collect();
let attention_mask: Vec<i64> = encoding.get_attention_mask().iter().map(|&x| x as i64).collect();
let token_type_ids: Vec<i64> = vec![0i64; input_ids.len()];

// Run inference, then apply mean pooling + L2 normalization
// (see full Rust implementation at github.com/0din-ai)
```
## Training Details
This model was trained using a two-stage fine-tuning approach:
### Stage 1: WildJailbreak Pre-training
Pre-trained on public synthetic data to learn jailbreak semantics.
- Dataset: Allen AI WildJailbreak — vanilla-adversarial prompt pairs
- Pairs: 161,396 positive pairs (same intent, different formulation)
- Split: 153,326 train / 4,034 val / 4,036 test (95% / 2.5% / 2.5%)
- Loss: MultipleNegativesRankingLoss (in-batch negatives)
- Batch size: 16 (per device) x 2 gradient accumulation steps = 32 effective
- Learning rate: 1e-5
- FP16: True
- Purpose: Teach the model to see through jailbreak wrappers and match prompts by underlying intent
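As a sketch of what this objective computes, assuming the standard formulation (each anchor's own positive is the softmax target over all in-batch positives; `scale=20` is the sentence-transformers default):

```python
import numpy as np

def mnr_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """MultipleNegativesRankingLoss: for each anchor i, positives[i] is the
    target and every other positives[j] acts as an in-batch negative."""
    # Cosine similarity matrix (rows assumed L2-normalized), scaled up
    sims = anchors @ positives.T * scale             # [batch, batch]
    # Cross-entropy with target class i for row i (softmax over each row)
    logits = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))
p = a + 0.1 * rng.normal(size=(4, 8))  # each positive is close to its anchor
a /= np.linalg.norm(a, axis=1, keepdims=True)
p /= np.linalg.norm(p, axis=1, keepdims=True)

# Loss is low when each anchor is closest to its own positive
print(mnr_loss(a, p))
```

The loss requires no explicit negative labels: every other pair in the batch serves as a negative, which is why larger effective batch sizes (32 here) tend to help.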
### Stage 2: Threat Feed Fine-tuning
Fine-tuned on annotated pairs from the internal 0din threat feed.
- Pairs: 9,598 annotated pairs (7,678 train / 958 val / 962 test)
- Label Distribution: ~34% duplicates / ~66% non-duplicates
- Annotation: Google Gemini 2.5 Pro (single-model annotation)
- Source Similarity Threshold: Candidate pairs generated with Thor similarity >= 0.5
- Loss: ContrastiveLoss (cosine distance, margin=0.5)
- Purpose: Calibrate the model for real-world duplicate detection on production vulnerability data
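A minimal sketch of this loss, mirroring the sentence-transformers `ContrastiveLoss` formulation as we understand it (cosine distance, 0.5 scaling factor):

```python
import numpy as np

def contrastive_loss(emb1, emb2, labels, margin=0.5):
    """Contrastive loss on cosine distance (Hadsell et al., 2006 form):
    duplicates (label 1) are pulled together; non-duplicates (label 0)
    are only penalized while closer than the margin."""
    cos_sim = np.sum(emb1 * emb2, axis=1)  # rows assumed L2-normalized
    dist = 1.0 - cos_sim                   # cosine distance in [0, 2]
    pos = labels * dist**2                                     # pull duplicates
    neg = (1 - labels) * np.clip(margin - dist, 0, None)**2    # push others apart
    return float(0.5 * np.mean(pos + neg))

# Toy 2-D unit vectors: one duplicate pair (cos 0.8), one non-duplicate (cos 0.6)
e1 = np.array([[1.0, 0.0], [1.0, 0.0]])
e2 = np.array([[0.8, 0.6], [0.6, 0.8]])
labels = np.array([1.0, 0.0])
print(contrastive_loss(e1, e2, labels))  # 0.0125
```

Unlike the Stage 1 ranking loss, this loss consumes explicit duplicate/non-duplicate labels, which is what lets the annotated threat-feed pairs calibrate absolute similarity scores against a fixed threshold.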
### Stage 2 Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 50 (early stopped) |
| Batch size | 8 (per device) x 4 gradient accumulation = 32 effective |
| Learning rate | 1e-5 |
| LR scheduler | Linear |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| FP16 | True |
| Early stopping patience | 10 |
| Eval steps | 50 |
| Seed | 1 |
| Best checkpoint | Step 1200 (epoch 5.0) |
| Best validation loss | 0.0149 |
## Evaluation Results

### Duplicate Detection Performance
Evaluated on 55 human-labeled vulnerability pairs (10 duplicates, 45 non-duplicates) from a corpus of 3,749 vulnerabilities. Best F1 score at each model's optimal threshold:
| Model | Best F1 | Threshold | Precision | Recall |
|---|---|---|---|---|
| OpenAI text-embedding-3-large (baseline) | 0.462 | 0.80 | 1.000 | 0.300 |
| Finetuned V1 (WildJailbreak only, e5-small) | 0.500 | 0.50 | 0.333 | 1.000 |
| Finetuned V2 (WJB + threat feed v1, e5-small) | 0.526 | 0.70 | 0.556 | 0.500 |
| Finetuned V3 (WJB + threat feed v2, e5-small) | 0.556 | 0.75 | 0.625 | 0.500 |
| Finetuned V4 (WJB + threat feed 10k, e5-small) | 0.600 | 0.70 | 0.600 | 0.600 |
| This model (Base V1) | 0.696 | 0.70 | 0.615 | 0.800 |
### Threshold Analysis (This Model)
| Threshold | Precision | Recall | F1 | TP | FP | FN | TN |
|---|---|---|---|---|---|---|---|
| 0.50 | 0.243 | 0.900 | 0.383 | 9 | 28 | 1 | 17 |
| 0.55 | 0.308 | 0.800 | 0.444 | 8 | 18 | 2 | 27 |
| 0.60 | 0.381 | 0.800 | 0.516 | 8 | 13 | 2 | 32 |
| 0.65 | 0.500 | 0.800 | 0.615 | 8 | 8 | 2 | 37 |
| 0.70 | 0.615 | 0.800 | 0.696 | 8 | 5 | 2 | 40 |
| 0.75 | 0.625 | 0.500 | 0.556 | 5 | 3 | 5 | 42 |
| 0.80 | 0.800 | 0.400 | 0.533 | 4 | 1 | 6 | 44 |
| 0.85 | 1.000 | 0.300 | 0.462 | 3 | 0 | 7 | 45 |
| 0.90 | 1.000 | 0.100 | 0.182 | 1 | 0 | 9 | 45 |
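The table rows can be re-derived from the confusion counts; for example, the threshold-0.70 operating point:

```python
def prf1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Row for threshold 0.70: TP=8, FP=5, FN=2 (TN=40 does not enter these metrics)
p, r, f1 = prf1(8, 5, 2)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.615 R=0.800 F1=0.696
```

Note that TP + FN = 10 and FP + TN = 45 in every row, matching the 10 duplicate and 45 non-duplicate pairs in the evaluation set.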
### Key Findings
- 50.6% relative F1 improvement over the OpenAI text-embedding-3-large baseline (0.696 vs 0.462)
- Largest single jump in the series: a 16% relative F1 gain over the e5-small V4 model (0.696 vs 0.600), suggesting that model capacity matters for this task.
- Substantially higher recall: At threshold 0.70, this model achieves 0.800 recall vs 0.600 for e5-small V4, while maintaining comparable precision (0.615 vs 0.600).
- Wide effective threshold band: Recall stays at 0.800 across thresholds 0.50–0.70, suggesting the larger model produces more confident and well-separated similarity scores for true duplicate pairs.
Note: The evaluation dataset is small (55 pairs, 10 positive). With only 10 true duplicates, each TP/FP change causes large metric swings. Results should be interpreted with caution.
## Limitations
- Small evaluation set: Only 55 human-labeled pairs (10 duplicates). Results should be taken as directional rather than definitive.
- LLM annotation bias in training data: Stage 2 training data was annotated by a single LLM (Gemini 2.5 Pro), which may affect calibration.
- Model size: ~278M parameters with 768-dim embeddings. The ONNX model is ~1GB.
- Domain-specific: Optimized for jailbreak/prompt injection duplicate detection. Performance on general semantic similarity tasks is not evaluated.
## Citation

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
#### ContrastiveLoss

```bibtex
@inproceedings{hadsell2006dimensionality,
    author = {Hadsell, R. and Chopra, S. and LeCun, Y.},
    booktitle = {2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)},
    title = {Dimensionality Reduction by Learning an Invariant Mapping},
    year = {2006},
    volume = {2},
    pages = {1735-1742},
    doi = {10.1109/CVPR.2006.100}
}
```
#### WildJailbreak

```bibtex
@article{jiang2024wildteaming,
    title = {WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models},
    author = {Jiang, Liwei and Rao, Kavel and Han, Seungju and Ettinger, Allyson and Brahman, Faeze and Kumar, Sachin and Mireshghallah, Niloofar and Lu, Ximing and Sap, Maarten and Choi, Yejin and Dziri, Nouha},
    journal = {arXiv preprint arXiv:2406.18510},
    year = {2024}
}
```