# jailbreak-embeddings-base-onnx
ONNX export of the multilingual-e5-base-wjb-threatfeed_v1 model — a fine-tuned sentence-transformers model for detecting duplicate vulnerability submissions (jailbreak and prompt injection attacks) in the 0din threat feed.
It maps prompts to a 768-dimensional dense vector space optimized for semantic similarity comparison of attack prompts.
This model achieves a 50.6% relative F1 improvement over the OpenAI text-embedding-3-large baseline on duplicate detection (0.696 vs. 0.462).
## Model Details

### Model Description
- Model Type: Sentence Transformer (two-stage fine-tuned), exported to ONNX
- Base Model: intfloat/multilingual-e5-base (~278M parameters)
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Language: Multilingual (XLM-RoBERTa backbone)
- Format: ONNX (compatible with onnxruntime, tract-onnx, and other ONNX runtimes)
## Embedding Pipeline
Input Text → Tokenizer → ONNX Model → Mean Pooling → L2 Normalization → Embedding
The ONNX model contains only the transformer backbone. Mean pooling and L2 normalization must be implemented in application code (see usage examples below).
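For illustration, the two post-processing steps can be exercised on their own, with random dummy activations standing in for real model output:

```python
import numpy as np

# Dummy stand-ins for real model output: batch of 2 sequences,
# 6 token positions, 768-dim hidden states (random values for illustration)
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(2, 6, 768)).astype(np.float32)
attention_mask = np.array([[1, 1, 1, 1, 0, 0],
                           [1, 1, 1, 1, 1, 1]], dtype=np.int64)

# Mean pooling: average hidden states over real (non-padding) tokens only
mask = attention_mask[:, :, np.newaxis].astype(np.float32)
pooled = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)

# L2 normalization: unit-length vectors, so a dot product is cosine similarity
embeddings = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

print(embeddings.shape)                    # (2, 768)
print(np.linalg.norm(embeddings, axis=1))  # each row has unit norm
```

Because the output vectors are unit-length, downstream similarity search can use a plain dot product in place of a full cosine computation.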
### Model Inputs

The ONNX model requires three inputs:

- `input_ids`: token IDs from the tokenizer
- `attention_mask`: 1 for real tokens, 0 for padding
- `token_type_ids`: all zeros for single-sentence embeddings
### ONNX Verification

The ONNX export produces embeddings numerically identical to the native sentence-transformers model (0.000000 maximum difference across all test sentences).
## Intended Use
This model is designed for:
- Duplicate detection in AI security vulnerability reports (jailbreak/prompt injection attacks)
- Semantic similarity comparison of attack prompts that may use different surface-level techniques but target the same underlying vulnerability
- Embedding generation for LSH-based similarity search in vulnerability management systems
- Edge/server deployment via ONNX runtime without requiring PyTorch
The model is trained to recognize semantic equivalence between attack prompts even when they use different jailbreak tactics (e.g., role-playing, encoding, academic framing) to elicit the same harmful behavior.
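The card does not specify which LSH family the 0din pipeline uses. As one common option for cosine similarity, random-hyperplane (SimHash-style) LSH buckets unit vectors so that near-duplicates tend to land in the same bucket; the sketch below uses a hypothetical 8-hyperplane signature (256 buckets) purely for illustration:

```python
import numpy as np

def lsh_signature(embedding: np.ndarray, hyperplanes: np.ndarray) -> int:
    """Hash a unit vector to a bucket: one bit per hyperplane side."""
    bits = (hyperplanes @ embedding) > 0
    return int(np.packbits(bits)[0])  # 8 bits -> bucket id in [0, 255]

rng = np.random.default_rng(42)
hyperplanes = rng.normal(size=(8, 768))  # arbitrary choice of 8 hyperplanes

# Two similar vectors and one unrelated vector (all unit-normalized)
a = rng.normal(size=768)
b = a + 0.05 * rng.normal(size=768)  # small perturbation of a
c = rng.normal(size=768)
a, b, c = (v / np.linalg.norm(v) for v in (a, b, c))

# Near-duplicates usually share a bucket; unrelated vectors usually do not
print(lsh_signature(a, hyperplanes),
      lsh_signature(b, hyperplanes),
      lsh_signature(c, hyperplanes))
```

In practice, multiple independent hash tables are used so that true duplicates missed by one table are caught by another; candidate pairs from colliding buckets are then re-scored with exact cosine similarity.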
## Usage

### sentence-transformers (with ONNX backend)

```python
from sentence_transformers import SentenceTransformer

# Load directly with ONNX backend
model = SentenceTransformer("0dinai/jailbreak-embeddings-base-onnx", backend="onnx")

sentences = ["First attack prompt", "Second attack prompt"]
embeddings = model.encode(sentences)
similarity = model.similarity(embeddings, embeddings)
print(similarity)
```
### Python (onnxruntime)

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

# Load model and tokenizer
session = ort.InferenceSession("onnx/model.onnx")
tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_padding(pad_id=1, pad_token="<pad>")
tokenizer.enable_truncation(max_length=512)

# Tokenize
texts = ["First attack prompt", "Second attack prompt"]
encodings = tokenizer.encode_batch(texts)
input_ids = np.array([e.ids for e in encodings], dtype=np.int64)
attention_mask = np.array([e.attention_mask for e in encodings], dtype=np.int64)
token_type_ids = np.zeros_like(input_ids)

# Run ONNX inference
outputs = session.run(None, {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "token_type_ids": token_type_ids,
})
token_embeddings = outputs[0]  # [batch, seq_len, 768]

# Mean pooling
mask = attention_mask[:, :, np.newaxis].astype(np.float32)
embeddings = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)

# L2 normalization
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / norms

# Cosine similarity
similarity = np.dot(embeddings[0], embeddings[1])
print(f"Similarity: {similarity:.4f}")
```
### Rust (tract-onnx)

```rust
use tract_onnx::prelude::*;
use tokenizers::Tokenizer;

// Load model and tokenizer
let model = tract_onnx::onnx()
    .model_for_path("onnx/model.onnx")?
    .into_optimized()?
    .into_runnable()?;
let tokenizer = Tokenizer::from_file("tokenizer.json")?;

// Tokenize
let encoding = tokenizer.encode("Attack prompt text", true)?;
let input_ids: Vec<i64> = encoding.get_ids().iter().map(|&x| x as i64).collect();
let attention_mask: Vec<i64> = encoding.get_attention_mask().iter().map(|&x| x as i64).collect();
let token_type_ids: Vec<i64> = vec![0i64; input_ids.len()];

// Run inference, then apply mean pooling + L2 normalization
// (see full Rust implementation at github.com/0din-ai)
```
## Training Details
This model was trained using a two-stage fine-tuning approach:
### Stage 1: WildJailbreak Pre-training
Pre-trained on public synthetic data to learn jailbreak semantics.
- Dataset: Allen AI WildJailbreak — vanilla-adversarial prompt pairs
- Pairs: 161,396 positive pairs (same intent, different formulation)
- Split: 153,326 train / 4,034 val / 4,036 test (95% / 2.5% / 2.5%)
- Loss: MultipleNegativesRankingLoss (in-batch negatives)
- Batch size: 16 (per device) x 2 gradient accumulation steps = 32 effective
- Learning rate: 1e-5
- FP16: True
- Purpose: Teach the model to see through jailbreak wrappers and match prompts by underlying intent
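As a sketch of what this objective computes, assuming the standard formulation (each anchor's own positive is the softmax target over all in-batch positives; `scale=20` is the sentence-transformers default):

```python
import numpy as np

def mnr_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """MultipleNegativesRankingLoss: for each anchor i, positives[i] is the
    target and every other positives[j] acts as an in-batch negative."""
    # Cosine similarity matrix (rows assumed L2-normalized), scaled up
    sims = anchors @ positives.T * scale             # [batch, batch]
    # Cross-entropy with target class i for row i (softmax over each row)
    logits = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))
p = a + 0.1 * rng.normal(size=(4, 8))  # each positive is close to its anchor
a /= np.linalg.norm(a, axis=1, keepdims=True)
p /= np.linalg.norm(p, axis=1, keepdims=True)

# Loss is low when each anchor is closest to its own positive
print(mnr_loss(a, p))
```

The loss requires no explicit negative labels: every other pair in the batch serves as a negative, which is why larger effective batch sizes (32 here) tend to help.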
### Stage 2: Threat Feed Fine-tuning
Fine-tuned on annotated pairs from the internal 0din threat feed.
- Pairs: 9,598 annotated pairs (7,678 train / 958 val / 962 test)
- Label Distribution: ~34% duplicates / ~66% non-duplicates
- Annotation: Google Gemini 2.5 Pro (single-model annotation)
- Source Similarity Threshold: Candidate pairs generated with Thor similarity >= 0.5
- Loss: ContrastiveLoss (cosine distance, margin=0.5)
- Purpose: Calibrate the model for real-world duplicate detection on production vulnerability data
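A minimal sketch of this loss, mirroring the sentence-transformers `ContrastiveLoss` formulation as we understand it (cosine distance, 0.5 scaling factor):

```python
import numpy as np

def contrastive_loss(emb1, emb2, labels, margin=0.5):
    """Contrastive loss on cosine distance (Hadsell et al., 2006 form):
    duplicates (label 1) are pulled together; non-duplicates (label 0)
    are only penalized while closer than the margin."""
    cos_sim = np.sum(emb1 * emb2, axis=1)  # rows assumed L2-normalized
    dist = 1.0 - cos_sim                   # cosine distance in [0, 2]
    pos = labels * dist**2                                     # pull duplicates
    neg = (1 - labels) * np.clip(margin - dist, 0, None)**2    # push others apart
    return float(0.5 * np.mean(pos + neg))

# Toy 2-D unit vectors: one duplicate pair (cos 0.8), one non-duplicate (cos 0.6)
e1 = np.array([[1.0, 0.0], [1.0, 0.0]])
e2 = np.array([[0.8, 0.6], [0.6, 0.8]])
labels = np.array([1.0, 0.0])
print(contrastive_loss(e1, e2, labels))  # 0.0125
```

Unlike the Stage 1 ranking loss, this loss consumes explicit duplicate/non-duplicate labels, which is what lets the annotated threat-feed pairs calibrate absolute similarity scores against a fixed threshold.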
### Stage 2 Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 50 (early stopped) |
| Batch size | 8 (per device) x 4 gradient accumulation = 32 effective |
| Learning rate | 1e-5 |
| LR scheduler | Linear |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| FP16 | True |
| Early stopping patience | 10 |
| Eval steps | 50 |
| Seed | 1 |
| Best checkpoint | Step 1200 (epoch 5.0) |
| Best validation loss | 0.0149 |
## Evaluation Results

### Duplicate Detection Performance
Evaluated on 55 human-labeled vulnerability pairs (10 duplicates, 45 non-duplicates) from a corpus of 3,749 vulnerabilities. Best F1 score at each model's optimal threshold:
| Model | Best F1 | Threshold | Precision | Recall |
|---|---|---|---|---|
| OpenAI text-embedding-3-large (baseline) | 0.462 | 0.80 | 1.000 | 0.300 |
| Finetuned V1 (WildJailbreak only, e5-small) | 0.500 | 0.50 | 0.333 | 1.000 |
| Finetuned V2 (WJB + threat feed v1, e5-small) | 0.526 | 0.70 | 0.556 | 0.500 |
| Finetuned V3 (WJB + threat feed v2, e5-small) | 0.556 | 0.75 | 0.625 | 0.500 |
| Finetuned V4 (WJB + threat feed 10k, e5-small) | 0.600 | 0.70 | 0.600 | 0.600 |
| This model (Base V1) | 0.696 | 0.70 | 0.615 | 0.800 |
### Threshold Analysis (This Model)
| Threshold | Precision | Recall | F1 | TP | FP | FN | TN |
|---|---|---|---|---|---|---|---|
| 0.50 | 0.243 | 0.900 | 0.383 | 9 | 28 | 1 | 17 |
| 0.55 | 0.308 | 0.800 | 0.444 | 8 | 18 | 2 | 27 |
| 0.60 | 0.381 | 0.800 | 0.516 | 8 | 13 | 2 | 32 |
| 0.65 | 0.500 | 0.800 | 0.615 | 8 | 8 | 2 | 37 |
| 0.70 | 0.615 | 0.800 | 0.696 | 8 | 5 | 2 | 40 |
| 0.75 | 0.625 | 0.500 | 0.556 | 5 | 3 | 5 | 42 |
| 0.80 | 0.800 | 0.400 | 0.533 | 4 | 1 | 6 | 44 |
| 0.85 | 1.000 | 0.300 | 0.462 | 3 | 0 | 7 | 45 |
| 0.90 | 1.000 | 0.100 | 0.182 | 1 | 0 | 9 | 45 |
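The table rows can be re-derived from the confusion counts; for example, the threshold-0.70 operating point:

```python
def prf1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Row for threshold 0.70: TP=8, FP=5, FN=2 (TN=40 does not enter these metrics)
p, r, f1 = prf1(8, 5, 2)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.615 R=0.800 F1=0.696
```

Note that TP + FN = 10 and FP + TN = 45 in every row, matching the 10 duplicate and 45 non-duplicate pairs in the evaluation set.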
### Key Findings
- 50.6% relative F1 improvement over the OpenAI text-embedding-3-large baseline (0.696 vs 0.462)
- Largest single jump in the series: a 16% relative F1 gain over the e5-small V4 model (0.696 vs 0.600), suggesting that model capacity matters for this task.
- Substantially higher recall: At threshold 0.70, this model achieves 0.800 recall vs 0.600 for e5-small V4, while maintaining comparable precision (0.615 vs 0.600).
- Wide effective threshold band: Recall stays at 0.800 across thresholds 0.50–0.70, suggesting the larger model produces more confident and well-separated similarity scores for true duplicate pairs.
Note: The evaluation dataset is small (55 pairs, 10 positive). With only 10 true duplicates, each TP/FP change causes large metric swings. Results should be interpreted with caution.
## Limitations
- Small evaluation set: Only 55 human-labeled pairs (10 duplicates). Results should be taken as directional rather than definitive.
- LLM annotation bias in training data: Stage 2 training data was annotated by a single LLM (Gemini 2.5 Pro), which may affect calibration.
- Model size: ~278M parameters with 768-dim embeddings. The ONNX model is ~1GB.
- Domain-specific: Optimized for jailbreak/prompt injection duplicate detection. Performance on general semantic similarity tasks is not evaluated.
## Citation

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
#### ContrastiveLoss

```bibtex
@inproceedings{hadsell2006dimensionality,
    author = {Hadsell, R. and Chopra, S. and LeCun, Y.},
    booktitle = {2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)},
    title = {Dimensionality Reduction by Learning an Invariant Mapping},
    year = {2006},
    volume = {2},
    pages = {1735-1742},
    doi = {10.1109/CVPR.2006.100}
}
```
#### WildJailbreak

```bibtex
@article{jiang2024wildteaming,
    title = {WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models},
    author = {Jiang, Liwei and Rao, Kavel and Han, Seungju and Ettinger, Allyson and Brahman, Faeze and Kumar, Sachin and Mireshghallah, Niloofar and Lu, Ximing and Sap, Maarten and Choi, Yejin and Dziri, Nouha},
    journal = {arXiv preprint arXiv:2406.18510},
    year = {2024}
}
```