jailbreak-embeddings-base-onnx

ONNX export of the multilingual-e5-base-wjb-threatfeed_v1 model — a fine-tuned sentence-transformers model for detecting duplicate vulnerability submissions (jailbreak and prompt injection attacks) in the 0din threat feed.

It maps prompts to a 768-dimensional dense vector space optimized for semantic similarity comparison of attack prompts.

This model achieves a +50.6% F1 improvement over the OpenAI text-embedding-3-large baseline on duplicate detection.

Model Details

Model Description

  • Model Type: Sentence Transformer (two-stage fine-tuned), exported to ONNX
  • Base Model: intfloat/multilingual-e5-base (~278M parameters)
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Language: Multilingual (XLM-RoBERTa backbone)
  • Format: ONNX (compatible with onnxruntime, tract-onnx, and other ONNX runtimes)

Embedding Pipeline

Input Text → Tokenizer → ONNX Model → Mean Pooling → L2 Normalization → Embedding

The ONNX model contains only the transformer backbone. Mean pooling and L2 normalization must be implemented in application code (see usage examples below).

Model Inputs

The ONNX model requires 3 inputs:

  • input_ids: Token IDs from tokenizer
  • attention_mask: 1 for real tokens, 0 for padding
  • token_type_ids: All zeros for single-sentence embeddings

ONNX Verification

The ONNX export matches the native sentence-transformers model: the maximum embedding difference across all test sentences is 0.000000 at the reported precision.

Intended Use

This model is designed for:

  • Duplicate detection in AI security vulnerability reports (jailbreak/prompt injection attacks)
  • Semantic similarity comparison of attack prompts that may use different surface-level techniques but target the same underlying vulnerability
  • Embedding generation for LSH-based similarity search in vulnerability management systems
  • Edge/server deployment via ONNX runtime without requiring PyTorch
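
The LSH step above can be illustrated with random-hyperplane hashing, a standard LSH family for cosine similarity. This is an illustrative sketch (64 planes, a synthetic 768-dim vector), not necessarily the hashing scheme used in 0din's pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# 64 random hyperplanes; each contributes one signature bit.
# Embeddings with high cosine similarity flip few bits relative to
# each other, so near-duplicates hash to nearby signatures.
planes = rng.standard_normal((64, 768))

def lsh_signature(embedding: np.ndarray) -> int:
    bits = planes @ embedding > 0
    return int("".join("1" if b else "0" for b in bits), 2)

a = rng.standard_normal(768)             # stand-in for a prompt embedding
b = a + 0.01 * rng.standard_normal(768)  # near-duplicate of a
hamming = bin(lsh_signature(a) ^ lsh_signature(b)).count("1")
# near-duplicates agree on almost every signature bit
```

Signatures that share most bits can be bucketed together (e.g. by band), so candidate duplicate pairs are found without comparing every embedding against every other.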

The model is trained to recognize semantic equivalence between attack prompts even when they use different jailbreak tactics (e.g., role-playing, encoding, academic framing) to elicit the same harmful behavior.

Usage

sentence-transformers (with ONNX backend)

from sentence_transformers import SentenceTransformer

# Load directly with ONNX backend
model = SentenceTransformer("0dinai/jailbreak-embeddings-base-onnx", backend="onnx")

sentences = ["First attack prompt", "Second attack prompt"]
embeddings = model.encode(sentences)
similarity = model.similarity(embeddings, embeddings)
print(similarity)

Python (onnxruntime)

import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

# Load model and tokenizer
session = ort.InferenceSession("onnx/model.onnx")
tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_padding(pad_id=1, pad_token="<pad>")
tokenizer.enable_truncation(max_length=512)

# Tokenize
texts = ["First attack prompt", "Second attack prompt"]
encodings = tokenizer.encode_batch(texts)
input_ids = np.array([e.ids for e in encodings], dtype=np.int64)
attention_mask = np.array([e.attention_mask for e in encodings], dtype=np.int64)
token_type_ids = np.zeros_like(input_ids)

# Run ONNX inference
outputs = session.run(None, {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "token_type_ids": token_type_ids,
})
token_embeddings = outputs[0]  # [batch, seq_len, 768]

# Mean pooling
mask = attention_mask[:, :, np.newaxis].astype(np.float32)
embeddings = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)

# L2 normalization
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / norms

# Cosine similarity (embeddings are unit length, so the dot product is the cosine)
similarity = np.dot(embeddings[0], embeddings[1])
print(f"Similarity: {similarity:.4f}")

Rust (tract-onnx)

use tract_onnx::prelude::*;
use tokenizers::Tokenizer;

// Load model and tokenizer
let model = tract_onnx::onnx()
    .model_for_path("onnx/model.onnx")?
    .into_optimized()?
    .into_runnable()?;
let tokenizer = Tokenizer::from_file("tokenizer.json")?;

// Tokenize
let encoding = tokenizer.encode("Attack prompt text", true)?;
let input_ids: Vec<i64> = encoding.get_ids().iter().map(|&x| x as i64).collect();
let attention_mask: Vec<i64> = encoding.get_attention_mask().iter().map(|&x| x as i64).collect();
let token_type_ids: Vec<i64> = vec![0i64; input_ids.len()];

// Run inference (inputs are fed in the graph's declared order, assumed
// here to be input_ids, attention_mask, token_type_ids for this export)
let seq_len = input_ids.len();
let outputs = model.run(tvec![
    Tensor::from_shape(&[1, seq_len], &input_ids)?.into(),
    Tensor::from_shape(&[1, seq_len], &attention_mask)?.into(),
    Tensor::from_shape(&[1, seq_len], &token_type_ids)?.into(),
])?;
let hidden = outputs[0].to_array_view::<f32>()?; // [1, seq_len, 768]

// Mean pooling over real tokens, then L2 normalization
let n_real = attention_mask.iter().filter(|&&m| m == 1).count() as f32;
let mut embedding = vec![0f32; 768];
for (i, &m) in attention_mask.iter().enumerate() {
    if m == 1 {
        for d in 0..768 {
            embedding[d] += hidden[[0, i, d]];
        }
    }
}
for x in embedding.iter_mut() { *x /= n_real; }
let norm = embedding.iter().map(|x| x * x).sum::<f32>().sqrt();
for x in embedding.iter_mut() { *x /= norm; }
// Full implementation: github.com/0din-ai

Training Details

This model was trained using a two-stage fine-tuning approach:

Stage 1: WildJailbreak Pre-training

Pre-trained on public synthetic data to learn jailbreak semantics.

  • Dataset: Allen AI WildJailbreak — vanilla-adversarial prompt pairs
  • Pairs: 161,396 positive pairs (same intent, different formulation)
  • Split: 153,326 train / 4,034 val / 4,036 test (95% / 2.5% / 2.5%)
  • Loss: MultipleNegativesRankingLoss (in-batch negatives)
  • Batch size: 16 (per device) x 2 gradient accumulation steps = 32 effective
  • Learning rate: 1e-5
  • FP16: True
  • Purpose: Teach the model to see through jailbreak wrappers and match prompts by underlying intent
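
A minimal numpy sketch of the Stage 1 objective: cross-entropy over scaled cosine similarities, where each anchor's own positive is the target and every other positive in the batch serves as an in-batch negative (the scale of 20 matches the sentence-transformers default; this is an illustration, not the library implementation):

```python
import numpy as np

def mnr_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """Matching positives sit on the diagonal of the similarity matrix;
    every off-diagonal entry is an in-batch negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)                     # [batch, batch] cosine sims
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

When anchors line up with their positives the loss approaches zero; shuffling the positives so the diagonal no longer matches drives it up sharply.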

Stage 2: Threat Feed Fine-tuning

Fine-tuned on annotated pairs from the internal 0din threat feed.

  • Pairs: 9,598 annotated pairs (7,678 train / 958 val / 962 test)
  • Label Distribution: ~34% duplicates / ~66% non-duplicates
  • Annotation: Google Gemini 2.5 Pro (single-model annotation)
  • Source Similarity Threshold: Candidate pairs generated with Thor similarity >= 0.5
  • Loss: ContrastiveLoss (cosine distance, margin=0.5)
  • Purpose: Calibrate the model for real-world duplicate detection on production vulnerability data
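
The Stage 2 objective can be sketched as a Hadsell-style contrastive loss over cosine distance: duplicate pairs are pulled toward distance 0, non-duplicates are pushed past the 0.5 margin (a minimal numpy rendition, not the sentence-transformers implementation itself):

```python
import numpy as np

def contrastive_loss(emb1, emb2, labels, margin=0.5):
    """labels: 1 = duplicate pair, 0 = non-duplicate pair."""
    e1 = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    e2 = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    dist = 1.0 - np.sum(e1 * e2, axis=1)                      # cosine distance
    pos = labels * dist ** 2                                  # pull duplicates to 0
    neg = (1 - labels) * np.maximum(0.0, margin - dist) ** 2  # push non-dupes past margin
    return float(0.5 * np.mean(pos + neg))
```

Non-duplicate pairs already beyond the margin contribute nothing, so the gradient concentrates on hard cases: far-apart duplicates and too-close non-duplicates.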

Stage 2 Hyperparameters

Parameter Value
Epochs 50 (early stopped)
Batch size 8 (per device) x 4 gradient accumulation = 32 effective
Learning rate 1e-5
LR scheduler Linear
Warmup ratio 0.1
Weight decay 0.01
FP16 True
Early stopping patience 10
Eval steps 50
Seed 1
Best checkpoint Step 1200 (epoch 5.0)
Best validation loss 0.0149

Evaluation Results

Duplicate Detection Performance

Evaluated on 55 human-labeled vulnerability pairs (10 duplicates, 45 non-duplicates) from a corpus of 3,749 vulnerabilities. Best F1 score at each model's optimal threshold:

Model Best F1 Threshold Precision Recall
OpenAI text-embedding-3-large (baseline) 0.462 0.80 1.000 0.300
Finetuned V1 (WildJailbreak only, e5-small) 0.500 0.50 0.333 1.000
Finetuned V2 (WJB + threat feed v1, e5-small) 0.526 0.70 0.556 0.500
Finetuned V3 (WJB + threat feed v2, e5-small) 0.556 0.75 0.625 0.500
Finetuned V4 (WJB + threat feed 10k, e5-small) 0.600 0.70 0.600 0.600
This model (Base V1) 0.696 0.70 0.615 0.800

Threshold Analysis (This Model)

Threshold Precision Recall F1 TP FP FN TN
0.50 0.243 0.900 0.383 9 28 1 17
0.55 0.308 0.800 0.444 8 18 2 27
0.60 0.381 0.800 0.516 8 13 2 32
0.65 0.500 0.800 0.615 8 8 2 37
0.70 0.615 0.800 0.696 8 5 2 40
0.75 0.625 0.500 0.556 5 3 5 42
0.80 0.800 0.400 0.533 4 1 6 44
0.85 1.000 0.300 0.462 3 0 7 45
0.90 1.000 0.100 0.182 1 0 9 45
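
Each row above follows directly from its confusion counts (10 positives, 45 negatives in total). For instance, the threshold-0.70 operating point reproduces as:

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Threshold 0.70 row: TP=8, FP=5, FN=2 (TN=40)
p, r, f1 = prf1(8, 5, 2)
# p ≈ 0.615, r = 0.800, f1 ≈ 0.696 — matching the table
```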

Key Findings

  • +50.6% F1 improvement over the OpenAI text-embedding-3-large baseline (0.696 vs 0.462)
  • Largest single jump in the series: +16% relative F1 over the e5-small V4 model (0.696 vs 0.600), suggesting that model capacity matters for this task.
  • Substantially higher recall: At threshold 0.70, this model achieves 0.800 recall vs 0.600 for e5-small V4, while maintaining comparable precision (0.615 vs 0.600).
  • Wide effective threshold band: Recall stays at 0.800 across thresholds 0.50–0.70, suggesting the larger model produces more confident and well-separated similarity scores for true duplicate pairs.

Note: The evaluation dataset is small (55 pairs, 10 positive). With only 10 true duplicates, each TP/FP change causes large metric swings. Results should be interpreted with caution.

Limitations

  • Small evaluation set: Only 55 human-labeled pairs (10 duplicates). Results should be taken as directional rather than definitive.
  • LLM annotation bias in training data: Stage 2 training data was annotated by a single LLM (Gemini 2.5 Pro), which may affect calibration.
  • Model size: ~278M parameters with 768-dim embeddings. The ONNX model is ~1GB.
  • Domain-specific: Optimized for jailbreak/prompt injection duplicate detection. Performance on general semantic similarity tasks is not evaluated.

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

ContrastiveLoss

@inproceedings{hadsell2006dimensionality,
    author={Hadsell, R. and Chopra, S. and LeCun, Y.},
    booktitle={2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)},
    title={Dimensionality Reduction by Learning an Invariant Mapping},
    year={2006},
    volume={2},
    number={},
    pages={1735-1742},
    doi={10.1109/CVPR.2006.100}
}

WildJailbreak

@article{jiang2024wildteaming,
    title={WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models},
    author={Jiang, Liwei and Rao, Kavel and Han, Seungju and Ettinger, Allyson and Brahman, Faeze and Kumar, Sachin and Mireshghallah, Niloofar and Lu, Ximing and Sap, Maarten and Choi, Yejin and Dziri, Nouha},
    journal={arXiv preprint arXiv:2406.18510},
    year={2024}
}