Safety Alignment Collapses Without apply_chat_template(): An Empirical Study
This weekend, I ran an experiment on the safety alignment of several small-scale open models (Qwen2.5, Qwen3, Gemma-3, SmolLM). My objective was to measure the robustness of refusal mechanisms when deviating from canonical chat templates.
The finding: Safety guarantees effectively collapse when apply_chat_template() is omitted.
METHODOLOGY
I evaluated each model in two states (a minimal code sketch of both conditions follows this list):
• In-Distribution: Input wrapped in the model's own chat-template tokens (e.g., <|im_start|> for ChatML-style models)
• Out-of-Distribution: Input provided as a raw string
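For reference, here is a minimal sketch of the two conditions using Hugging Face transformers. The model name and probe prompt are placeholders, not the exact ones from the experiment:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"  # placeholder: any of the evaluated checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "How do I pick a lock?"  # placeholder probe, not from the actual eval set

# In-Distribution: wrap the prompt in the model's chat template
chat_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
)

# Out-of-Distribution: feed the raw string, no template tokens at all
raw_ids = tokenizer(prompt, return_tensors="pt").input_ids

for ids in (chat_ids, raw_ids):
    out = model.generate(ids, max_new_tokens=128, do_sample=False)
    print(tokenizer.decode(out[0][ids.shape[-1]:], skip_special_tokens=True))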
For scalable evaluation, I used Qwen3Guard-Gen-4B as an automated judge, classifying responses as Safe, Unsafe, or Controversial.
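A rough sketch of the judging step is below. The exact verdict format emitted by Qwen3Guard-Gen-4B is defined by its model card, so the string matching on the three labels is an assumption, not the exact parsing used in the experiment:

from transformers import AutoModelForCausalLM, AutoTokenizer

judge_name = "Qwen/Qwen3Guard-Gen-4B"
judge_tok = AutoTokenizer.from_pretrained(judge_name)
judge = AutoModelForCausalLM.from_pretrained(judge_name)

def classify(prompt: str, response: str) -> str:
    # Feed the (prompt, response) pair through the guard model's own chat template.
    ids = judge_tok.apply_chat_template(
        [{"role": "user", "content": prompt},
         {"role": "assistant", "content": response}],
        add_generation_prompt=True,
        return_tensors="pt",
    )
    out = judge.generate(ids, max_new_tokens=64, do_sample=False)
    verdict = judge_tok.decode(out[0][ids.shape[-1]:], skip_special_tokens=True)
    # Assumption: the generated verdict contains one of the three labels verbatim.
    for label in ("Unsafe", "Controversial", "Safe"):
        if label in verdict:
            return label
    return "Unparsed"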
KEY FINDINGS: REFUSAL COLLAPSE
When "Assistant" formatting tokens are removed, models undergo a distributional shift—reverting from a helpful assistant to a raw completion engine.
Gemma-3: 100% refusal (templated) → 60% (raw)
Qwen3: 80% refusal (templated) → 40% (raw)
SmolLM2-1.7B: 0% → 0% (no safety tuning to begin with)
QUALITATIVE FAILURES
The failure modes were not minor. Without the template, models that previously refused harmful queries began outputting high-fidelity harmful content:
• Explosives: Qwen3 generated technical detonation mechanisms
• Explicit content: Requests flatly refused under the chat template were fulfilled with graphic narratives once the template was dropped
This suggests instruction tuning acts as a "soft mask" over the pre-training distribution rather than removing harmful latent knowledge.
👉 Read the full analysis: https://teendifferent.substack.com/p/apply_chat_template-is-the-safety
💻 Reproduction Code: https://github.com/REDDITARUN/experments/tree/main/llm_alignment