HaluGate Sentinel — Prompt Fact-Check Switch for Hallucination Gatekeeper

HaluGate Sentinel is a ModernBERT + LoRA classifier that decides whether an incoming user prompt requires factual verification.

It does not check facts itself. Instead, it acts as a frontline switch in an LLM routing / gateway system, deciding whether a request should enter a fact-checking / RAG / hallucination-mitigation pipeline.

The model classifies prompts into:

  • FACT_CHECK_NEEDED:
    Information-seeking queries that depend on external/world knowledge

    • e.g., “When was the Eiffel Tower built?”
    • e.g., “What is the GDP of Japan in 2023?”
  • NO_FACT_CHECK_NEEDED:
    Creative, coding, opinion, or pure reasoning/math tasks

    • e.g., “Write a poem about spring”
    • e.g., “Implement quicksort in Python”
    • e.g., “What is the meaning of life?”

This model is part of the Hallucination Gatekeeper stack for llm-semantic-router.


Model Details

  • Model name: HaluGate Sentinel
  • Repository: llm-semantic-router/halugate-sentinel
  • Task: Binary text classification (prompt-level)
  • Labels:
    • 0 → NO_FACT_CHECK_NEEDED
    • 1 → FACT_CHECK_NEEDED
  • Base model: answerdotai/ModernBERT-base
  • Fine-tuning method: LoRA (rank = 16, alpha = 32)
  • Validation Accuracy: 96.4%
  • Validation F1 Score: 0.965
  • Edge-case accuracy: 100% on a 27-sample curated test set of borderline prompt types
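
The reported metrics follow a standard binary-classification evaluation. Below is a minimal sketch of how they could be reproduced, assuming a hypothetical local validation.csv with text and label columns (the file and the F1 averaging choice are assumptions, not released artifacts):

# Minimal evaluation sketch. Assumes a hypothetical validation.csv with
# columns "text" and "label" (0 = NO_FACT_CHECK_NEEDED, 1 = FACT_CHECK_NEEDED).
import csv

import torch
from sklearn.metrics import accuracy_score, f1_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "llm-semantic-router/halugate-sentinel"
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model.eval()

texts, labels = [], []
with open("validation.csv", newline="") as f:
    for row in csv.DictReader(f):
        texts.append(row["text"])
        labels.append(int(row["label"]))

preds = []
with torch.no_grad():
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        preds.append(int(model(**inputs).logits.argmax(dim=-1).item()))

print("accuracy:", accuracy_score(labels, preds))
print("F1 (positive class = FACT_CHECK_NEEDED):", f1_score(labels, preds))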

Position in a Hallucination Mitigation Pipeline

HaluGate Sentinel is designed as Stage 0 in a multi-stage hallucination mitigation architecture:

  1. Stage 0 — HaluGate Sentinel (this model)
    Classifies user prompts and decides whether fact-checking is needed:

    • NO_FACT_CHECK_NEEDED → Route directly to LLM generation.
    • FACT_CHECK_NEEDED → Route into the Hallucination Gatekeeper path (RAG, tools, verifiers).
  2. Stage 1+ — Answer-level hallucination models (e.g., “HaluGate Verifier”)
    Operate on (query, answer, evidence) to detect hallucinations and enforce trust policies.

HaluGate Sentinel focuses solely on prompt intent classification to minimize unnecessary compute while preserving safety for factual queries.
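
A minimal sketch of how the two stages might be wired together in a gateway is shown below; classify_prompt is defined in the Usage section, while retrieve_evidence, generate_answer, and verify_answer are hypothetical placeholders for your own RAG, generation, and answer-level verification components:

# Hypothetical two-stage dispatch. classify_prompt is defined under Usage;
# retrieve_evidence, generate_answer, and verify_answer are placeholders.
def handle_request(user_prompt: str) -> str:
    label, confidence = classify_prompt(user_prompt)  # Stage 0: this model

    if label == "NO_FACT_CHECK_NEEDED":
        # Creative / coding / opinion traffic skips the fact-aware path.
        return generate_answer(user_prompt)

    # Stage 1+: fact-aware path (RAG, tools, answer-level verification).
    evidence = retrieve_evidence(user_prompt)
    answer = generate_answer(user_prompt, evidence=evidence)
    if not verify_answer(user_prompt, answer, evidence):
        # Hallucination suspected: regenerate, cite evidence, or refuse,
        # depending on your trust policy.
        answer = "I could not verify this answer against the available evidence."
    return answer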


Usage

Basic Inference

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

MODEL_ID = "llm-semantic-router/halugate-sentinel"

model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

id2label = model.config.id2label  # {0: 'NO_FACT_CHECK_NEEDED', 1: 'FACT_CHECK_NEEDED'}

def classify_prompt(text: str):
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    pred_id = int(torch.argmax(probs).item())
    label = id2label.get(pred_id, str(pred_id))
    confidence = float(probs[pred_id].item())
    return label, confidence

# Examples
print(classify_prompt("When was the Eiffel Tower built?"))
# → ('FACT_CHECK_NEEDED', 0.99...)

print(classify_prompt("Write a poem about spring"))
# → ('NO_FACT_CHECK_NEEDED', 0.98...)

print(classify_prompt("Implement a binary search in Python"))
# → ('NO_FACT_CHECK_NEEDED', 0.97...)

Example: Integrating with a Router / Gateway

Pseudocode for a routing decision:

label, prob = classify_prompt(user_prompt)

FACT_CHECK_THRESHOLD = 0.6  # configurable based on your risk appetite

if label == "FACT_CHECK_NEEDED" and prob >= FACT_CHECK_THRESHOLD:
    route = "hallucination_gatekeeper"  # RAG / tools / verifiers
else:
    route = "direct_generation"

# Use `route` to select downstream pipelines in your LLM gateway.

Training Data

Balanced dataset of 50,000 prompts:

FACT_CHECK_NEEDED (25,000 samples)

Information-seeking and knowledge-intensive questions drawn from:

  • NISQ-ISQ: Gold-standard information-seeking questions
  • HaluEval: Hallucination-focused QA benchmark
  • FaithDial: Information-seeking dialogue questions
  • FactCHD: Fact-conflicting / hallucination-prone queries
  • SQuAD, TriviaQA, HotpotQA: Standard factual QA datasets
  • TruthfulQA: High-risk factual queries
  • CoQA: Conversational factual questions

NO_FACT_CHECK_NEEDED (25,000 samples)

Tasks that typically do not require external factual verification:

  • NISQ-NonISQ: Non-information-seeking questions
  • Databricks Dolly: Creative writing, summarization, brainstorming
  • WritingPrompts: Creative writing prompts
  • Alpaca: Coding, math, opinion, and general instructions

The objective is to approximate “does this prompt require world knowledge / external facts?” rather than “is the answer true?”.


Intended Use

Primary Use Cases

  • LLM Gateway / Router

    • Decide if a prompt must be routed into a fact-aware pipeline (RAG, tools, knowledge base, verifiers).
    • Avoid unnecessary compute for creative / coding / opinion tasks.
  • Hallucination Gatekeeper Frontline

    • Only enable expensive hallucination detection for prompts labeled FACT_CHECK_NEEDED.
    • Implement different safety and latency policies for the two classes.
  • Traffic Analytics & Risk Scoring

    • Monitor proportion of factual vs non-factual traffic.
    • Adjust infrastructure sizing for retrieval / tool-heavy pipelines accordingly.
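
A minimal sketch of the traffic-analytics use case above, tallying how much traffic would enter each route (incoming_prompts is a hypothetical stream of user prompts; classify_prompt is defined under Usage):

# Minimal traffic-analytics sketch: measure the share of prompts that would
# enter the fact-aware path vs. direct generation.
from collections import Counter

route_counts = Counter()
for prompt in incoming_prompts:  # hypothetical stream of user prompts
    label, confidence = classify_prompt(prompt)
    route = "hallucination_gatekeeper" if label == "FACT_CHECK_NEEDED" else "direct_generation"
    route_counts[route] += 1

total = sum(route_counts.values())
for route, count in route_counts.items():
    print(f"{route}: {count / total:.1%} of traffic")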

Non-Goals

  • It does not verify the correctness of any answer.
  • It should not be used as a generic toxicity / safety classifier.
  • It does not handle non-English prompts reliably (trained on English only).

How It Works

  • Architecture:

    • ModernBERT-base encoder
    • Classification head on top of [CLS] / pooled representation
  • Fine-tuning:

    • LoRA adapters on the base encoder (a minimal configuration sketch follows this list)
    • Cross-entropy loss over the two labels
    • Balanced sampling between FACT_CHECK_NEEDED and NO_FACT_CHECK_NEEDED
  • Decision Boundary:

    • Borderline / philosophical / highly abstract questions may be assigned lower confidence.
    • Downstream systems are encouraged to use the confidence score as a soft signal, not a hard oracle.
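
A minimal sketch of a comparable LoRA setup with the PEFT library, using the rank and alpha listed above (the target_modules choice, dropout, and any training hyperparameters are assumptions rather than the released training configuration):

# Hypothetical LoRA fine-tuning setup; not the exact released configuration.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=2,
    id2label={0: "NO_FACT_CHECK_NEEDED", 1: "FACT_CHECK_NEEDED"},
    label2id={"NO_FACT_CHECK_NEEDED": 0, "FACT_CHECK_NEEDED": 1},
)
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,   # keeps the classification head trainable
    r=16,                         # rank from the model card
    lora_alpha=32,                # alpha from the model card
    lora_dropout=0.1,             # assumption
    target_modules="all-linear",  # assumption; adjust to ModernBERT layer names
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# Train with the Trainer API or a custom loop, using cross-entropy loss
# over a class-balanced set of labeled prompts.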

Limitations

  • Language:

    • Trained on English data only.
    • Performance on other languages is not guaranteed.
  • Borderline Queries:

    • Philosophical or hybrid prompts (e.g. “Is time travel possible?”) may be ambiguous.
    • In such cases, consider inspecting the model confidence and implementing a “default-to-safe” policy (see the sketch after this list).
  • Domain Coverage:

    • General-purpose factual tasks are well-covered; highly specialized verticals (e.g. niche scientific domains) are not explicitly targeted during fine-tuning.
  • Not a Verifier:

    • This model only decides if a prompt needs factual support.
    • Actual hallucination detection and answer verification must be handled by separate models (e.g., answer-level verifiers).
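
As mentioned under Borderline Queries above, a minimal sketch of a “default-to-safe” policy that sends low-confidence prompts into the fact-checking path regardless of label (the threshold is an assumption to tune on your own traffic; classify_prompt is defined under Usage):

# Default-to-safe routing sketch: low-confidence predictions fall back to the
# fact-aware path even when labeled NO_FACT_CHECK_NEEDED.
SAFE_CONFIDENCE_THRESHOLD = 0.8  # assumption; tune on your own traffic

def choose_route(user_prompt: str) -> str:
    label, confidence = classify_prompt(user_prompt)
    if label == "FACT_CHECK_NEEDED":
        return "hallucination_gatekeeper"
    if confidence < SAFE_CONFIDENCE_THRESHOLD:
        # Ambiguous or borderline prompt: prefer the safer, fact-aware path.
        return "hallucination_gatekeeper"
    return "direct_generation"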

Ethical Considerations

  • Risk Trade-off:

    • Over-classifying prompts as NO_FACT_CHECK_NEEDED may reduce safety for borderline factual tasks.
    • Over-classifying as FACT_CHECK_NEEDED increases compute cost but is safer in high-risk environments.
  • Deployment Recommendation:

    • For safety-critical domains (finance, healthcare, legal, etc.), configure conservative thresholds and fallbacks that favor routing more traffic through the fact-checking path.

Citation

If you use HaluGate Sentinel in academic work or production systems, please cite:

@software{halugate_sentinel_2024,
  title  = {HaluGate Sentinel: Prompt-Level Fact-Check Switch for Hallucination Gatekeepers},
  author = {vLLM Project},
  year   = {2024},
  url    = {https://github.com/vllm-project/semantic-router}
}

Acknowledgements

  • Base encoder: answerdotai/ModernBERT-base
  • Training datasets: SQuAD, TriviaQA, HotpotQA, TruthfulQA, CoQA, Dolly, Alpaca, WritingPrompts, HaluEval, and others listed above.
  • Designed for integration with the vLLM Semantic Router and broader Hallucination Gatekeeper ecosystem.