HaluGate Sentinel — Prompt Fact-Check Switch for Hallucination Gatekeeper

HaluGate Sentinel is a ModernBERT + LoRA classifier that decides whether an incoming user prompt requires factual verification.

It does not check facts itself. Instead, it acts as a frontline switch in an LLM routing / gateway system, deciding whether a request should enter a fact-checking / RAG / hallucination-mitigation pipeline.

The model classifies prompts into:

  • FACT_CHECK_NEEDED:
    Information-seeking queries that depend on external/world knowledge

    • e.g., “When was the Eiffel Tower built?”
    • e.g., “What is the GDP of Japan in 2023?”
  • NO_FACT_CHECK_NEEDED:
    Creative, coding, opinion, or pure reasoning/math tasks

    • e.g., “Write a poem about spring”
    • e.g., “Implement quicksort in Python”
    • e.g., “What is the meaning of life?”

This model is part of the Hallucination Gatekeeper stack for llm-semantic-router.


Model Details

  • Model name: HaluGate Sentinel
  • Repository: llm-semantic-router/halugate-sentinel
  • Task: Binary text classification (prompt-level)
  • Labels:
    • 0 → NO_FACT_CHECK_NEEDED
    • 1 → FACT_CHECK_NEEDED
  • Base model: answerdotai/ModernBERT-base
  • Fine-tuning method: LoRA (rank = 16, alpha = 32)
  • Validation Accuracy: 96.4%
  • Validation F1 Score: 0.965
  • Edge-case accuracy: 100% on a 27-sample curated test set of borderline prompt types
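
The reported metrics follow a standard binary-classification evaluation. Below is a minimal sketch of how they could be reproduced, assuming a hypothetical local validation.csv with text and label columns (the file and the F1 averaging choice are assumptions, not released artifacts):

# Minimal evaluation sketch. Assumes a hypothetical validation.csv with
# columns "text" and "label" (0 = NO_FACT_CHECK_NEEDED, 1 = FACT_CHECK_NEEDED).
import csv

import torch
from sklearn.metrics import accuracy_score, f1_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "llm-semantic-router/halugate-sentinel"
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model.eval()

texts, labels = [], []
with open("validation.csv", newline="") as f:
    for row in csv.DictReader(f):
        texts.append(row["text"])
        labels.append(int(row["label"]))

preds = []
with torch.no_grad():
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        preds.append(int(model(**inputs).logits.argmax(dim=-1).item()))

print("accuracy:", accuracy_score(labels, preds))
print("F1 (positive class = FACT_CHECK_NEEDED):", f1_score(labels, preds))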

Position in a Hallucination Mitigation Pipeline

HaluGate Sentinel is designed as Stage 0 in a multi-stage hallucination mitigation architecture:

  1. Stage 0 — HaluGate Sentinel (this model)
    Classifies user prompts and decides whether fact-checking is needed:

    • NO_FACT_CHECK_NEEDED → Route directly to LLM generation.
    • FACT_CHECK_NEEDED → Route into the Hallucination Gatekeeper path (RAG, tools, verifiers).
  2. Stage 1+ — Answer-level hallucination models (e.g., “HaluGate Verifier”)
    Operate on (query, answer, evidence) to detect hallucinations and enforce trust policies.

HaluGate Sentinel focuses solely on prompt intent classification to minimize unnecessary compute while preserving safety for factual queries.
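
A minimal sketch of how the two stages might be wired together in a gateway is shown below; classify_prompt is defined in the Usage section, while retrieve_evidence, generate_answer, and verify_answer are hypothetical placeholders for your own RAG, generation, and answer-level verification components:

# Hypothetical two-stage dispatch. classify_prompt is defined under Usage;
# retrieve_evidence, generate_answer, and verify_answer are placeholders.
def handle_request(user_prompt: str) -> str:
    label, confidence = classify_prompt(user_prompt)  # Stage 0: this model

    if label == "NO_FACT_CHECK_NEEDED":
        # Creative / coding / opinion traffic skips the fact-aware path.
        return generate_answer(user_prompt)

    # Stage 1+: fact-aware path (RAG, tools, answer-level verification).
    evidence = retrieve_evidence(user_prompt)
    answer = generate_answer(user_prompt, evidence=evidence)
    if not verify_answer(user_prompt, answer, evidence):
        # Hallucination suspected: regenerate, cite evidence, or refuse,
        # depending on your trust policy.
        answer = "I could not verify this answer against the available evidence."
    return answer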


Usage

Basic Inference

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

MODEL_ID = "llm-semantic-router/halugate-sentinel"

model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

id2label = model.config.id2label  # {0: 'NO_FACT_CHECK_NEEDED', 1: 'FACT_CHECK_NEEDED'}

def classify_prompt(text: str):
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    pred_id = int(torch.argmax(probs).item())
    label = id2label.get(pred_id, str(pred_id))
    confidence = float(probs[pred_id].item())
    return label, confidence

# Examples
print(classify_prompt("When was the Eiffel Tower built?"))
# → ('FACT_CHECK_NEEDED', 0.99...)

print(classify_prompt("Write a poem about spring"))
# → ('NO_FACT_CHECK_NEEDED', 0.98...)

print(classify_prompt("Implement a binary search in Python"))
# → ('NO_FACT_CHECK_NEEDED', 0.97...)

Example: Integrating with a Router / Gateway

Pseudocode for a routing decision:

label, prob = classify_prompt(user_prompt)

FACT_CHECK_THRESHOLD = 0.6  # configurable based on your risk appetite

if label == "FACT_CHECK_NEEDED" and prob >= FACT_CHECK_THRESHOLD:
    route = "hallucination_gatekeeper"  # RAG / tools / verifiers
else:
    route = "direct_generation"

# Use `route` to select downstream pipelines in your LLM gateway.

Training Data

Balanced dataset of 50,000 prompts:

FACT_CHECK_NEEDED (25,000 samples)

Information-seeking and knowledge-intensive questions drawn from:

  • NISQ-ISQ: Gold-standard information-seeking questions
  • HaluEval: Hallucination-focused QA benchmark
  • FaithDial: Information-seeking dialogue questions
  • FactCHD: Fact-conflicting / hallucination-prone queries
  • SQuAD, TriviaQA, HotpotQA: Standard factual QA datasets
  • TruthfulQA: High-risk factual queries
  • CoQA: Conversational factual questions

NO_FACT_CHECK_NEEDED (25,000 samples)

Tasks that typically do not require external factual verification:

  • NISQ-NonISQ: Non-information-seeking questions
  • Databricks Dolly: Creative writing, summarization, brainstorming
  • WritingPrompts: Creative writing prompts
  • Alpaca: Coding, math, opinion, and general instructions

The objective is to approximate “does this prompt require world knowledge / external facts?” rather than “is the answer true?”.


Intended Use

Primary Use Cases

  • LLM Gateway / Router

    • Decide if a prompt must be routed into a fact-aware pipeline (RAG, tools, knowledge base, verifiers).
    • Avoid unnecessary compute for creative / coding / opinion tasks.
  • Hallucination Gatekeeper Frontline

    • Only enable expensive hallucination detection for prompts labeled FACT_CHECK_NEEDED.
    • Implement different safety and latency policies for the two classes.
  • Traffic Analytics & Risk Scoring

    • Monitor proportion of factual vs non-factual traffic.
    • Adjust infrastructure sizing for retrieval / tool-heavy pipelines accordingly.
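
A minimal sketch of the traffic-analytics use case above, tallying how much traffic would enter each route (incoming_prompts is a hypothetical stream of user prompts; classify_prompt is defined under Usage):

# Minimal traffic-analytics sketch: measure the share of prompts that would
# enter the fact-aware path vs. direct generation.
from collections import Counter

route_counts = Counter()
for prompt in incoming_prompts:  # hypothetical stream of user prompts
    label, confidence = classify_prompt(prompt)
    route = "hallucination_gatekeeper" if label == "FACT_CHECK_NEEDED" else "direct_generation"
    route_counts[route] += 1

total = sum(route_counts.values())
for route, count in route_counts.items():
    print(f"{route}: {count / total:.1%} of traffic")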

Non-Goals

  • It does not verify the correctness of any answer.
  • It should not be used as a generic toxicity / safety classifier.
  • It does not handle non-English prompts reliably (trained on English only).

How It Works

  • Architecture:

    • ModernBERT-base encoder
    • Classification head on top of [CLS] / pooled representation
  • Fine-tuning:

    • LoRA adapters on the base encoder (a minimal configuration sketch follows this list)
    • Cross-entropy loss over the two labels
    • Balanced sampling between FACT_CHECK_NEEDED and NO_FACT_CHECK_NEEDED
  • Decision Boundary:

    • Borderline / philosophical / highly abstract questions may be assigned lower confidence.
    • Downstream systems are encouraged to use the confidence score as a soft signal, not a hard oracle.
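
A minimal sketch of a comparable LoRA setup with the PEFT library, using the rank and alpha listed above (the target_modules choice, dropout, and any training hyperparameters are assumptions rather than the released training configuration):

# Hypothetical LoRA fine-tuning setup; not the exact released configuration.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=2,
    id2label={0: "NO_FACT_CHECK_NEEDED", 1: "FACT_CHECK_NEEDED"},
    label2id={"NO_FACT_CHECK_NEEDED": 0, "FACT_CHECK_NEEDED": 1},
)
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,   # keeps the classification head trainable
    r=16,                         # rank from the model card
    lora_alpha=32,                # alpha from the model card
    lora_dropout=0.1,             # assumption
    target_modules="all-linear",  # assumption; adjust to ModernBERT layer names
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# Train with the Trainer API or a custom loop, using cross-entropy loss
# over a class-balanced set of labeled prompts.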

Limitations

  • Language:

    • Trained on English data only.
    • Performance on other languages is not guaranteed.
  • Borderline Queries:

    • Philosophical or hybrid prompts (e.g. “Is time travel possible?”) may be ambiguous.
    • In such cases, consider inspecting the model confidence and implementing a “default-to-safe” policy (see the sketch after this list).
  • Domain Coverage:

    • General-purpose factual tasks are well-covered; highly specialized verticals (e.g. niche scientific domains) are not explicitly targeted during fine-tuning.
  • Not a Verifier:

    • This model only decides if a prompt needs factual support.
    • Actual hallucination detection and answer verification must be handled by separate models (e.g., answer-level verifiers).
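
As mentioned under Borderline Queries above, a minimal sketch of a “default-to-safe” policy that sends low-confidence prompts into the fact-checking path regardless of label (the threshold is an assumption to tune on your own traffic; classify_prompt is defined under Usage):

# Default-to-safe routing sketch: low-confidence predictions fall back to the
# fact-aware path even when labeled NO_FACT_CHECK_NEEDED.
SAFE_CONFIDENCE_THRESHOLD = 0.8  # assumption; tune on your own traffic

def choose_route(user_prompt: str) -> str:
    label, confidence = classify_prompt(user_prompt)
    if label == "FACT_CHECK_NEEDED":
        return "hallucination_gatekeeper"
    if confidence < SAFE_CONFIDENCE_THRESHOLD:
        # Ambiguous or borderline prompt: prefer the safer, fact-aware path.
        return "hallucination_gatekeeper"
    return "direct_generation"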

Ethical Considerations

  • Risk Trade-off:

    • Over-classifying prompts as NO_FACT_CHECK_NEEDED may reduce safety for borderline factual tasks.
    • Over-classifying as FACT_CHECK_NEEDED increases compute cost but is safer in high-risk environments.
  • Deployment Recommendation:

    • For safety-critical domains (finance, healthcare, legal, etc.), configure conservative thresholds and fallbacks that favor routing more traffic through the fact-checking path.

Citation

If you use HaluGate Sentinel in academic work or production systems, please cite:

@software{halugate_sentinel_2024,
  title  = {HaluGate Sentinel: Prompt-Level Fact-Check Switch for Hallucination Gatekeepers},
  author = {vLLM Project},
  year   = {2024},
  url    = {https://github.com/vllm-project/semantic-router}
}

Acknowledgements

  • Base encoder: answerdotai/ModernBERT-base
  • Training datasets: SQuAD, TriviaQA, HotpotQA, TruthfulQA, CoQA, Dolly, Alpaca, WritingPrompts, HaluEval, and others listed above.
  • Designed for integration with the vLLM Semantic Router and broader Hallucination Gatekeeper ecosystem.