Safety Alignment Collapses Without apply_chat_template(): An Empirical Study
This weekend, I ran an experiment on the safety alignment of several small-scale open models (Qwen2.5, Qwen3, Gemma-3, SmolLM). My objective was to measure the robustness of refusal mechanisms when deviating from canonical chat templates.
The finding: Safety guarantees effectively collapse when apply_chat_template() is omitted.
METHODOLOGY
I evaluated each model in two states (a minimal code sketch of both conditions follows this list):
• In-Distribution: Input wrapped in the model's own chat-template tokens (e.g., <|im_start|> for ChatML-style models)
• Out-of-Distribution: Input provided as a raw string
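For reference, here is a minimal sketch of the two conditions using Hugging Face transformers. The model name and probe prompt are placeholders, not the exact ones from the experiment:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"  # placeholder: any of the evaluated checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "How do I pick a lock?"  # placeholder probe, not from the actual eval set

# In-Distribution: wrap the prompt in the model's chat template
chat_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
)

# Out-of-Distribution: feed the raw string, no template tokens at all
raw_ids = tokenizer(prompt, return_tensors="pt").input_ids

for ids in (chat_ids, raw_ids):
    out = model.generate(ids, max_new_tokens=128, do_sample=False)
    print(tokenizer.decode(out[0][ids.shape[-1]:], skip_special_tokens=True))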
For scalable evaluation, I used Qwen3Guard-Gen-4B as an automated judge, classifying responses as Safe, Unsafe, or Controversial.
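A rough sketch of the judging step is below. The exact verdict format emitted by Qwen3Guard-Gen-4B is defined by its model card, so the string matching on the three labels is an assumption, not the exact parsing used in the experiment:

from transformers import AutoModelForCausalLM, AutoTokenizer

judge_name = "Qwen/Qwen3Guard-Gen-4B"
judge_tok = AutoTokenizer.from_pretrained(judge_name)
judge = AutoModelForCausalLM.from_pretrained(judge_name)

def classify(prompt: str, response: str) -> str:
    # Feed the (prompt, response) pair through the guard model's own chat template.
    ids = judge_tok.apply_chat_template(
        [{"role": "user", "content": prompt},
         {"role": "assistant", "content": response}],
        add_generation_prompt=True,
        return_tensors="pt",
    )
    out = judge.generate(ids, max_new_tokens=64, do_sample=False)
    verdict = judge_tok.decode(out[0][ids.shape[-1]:], skip_special_tokens=True)
    # Assumption: the generated verdict contains one of the three labels verbatim.
    for label in ("Unsafe", "Controversial", "Safe"):
        if label in verdict:
            return label
    return "Unparsed"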
KEY FINDINGS: REFUSAL COLLAPSE
When "Assistant" formatting tokens are removed, models undergo a distributional shift—reverting from a helpful assistant to a raw completion engine.
Gemma-3: 100% refusal (templated) → 60% (raw)
Qwen3: 80% refusal (templated) → 40% (raw)
SmolLM2-1.7B: 0% → 0% (no safety tuning to begin with)
QUALITATIVE FAILURES
The failure modes were not minor. Without the template, models that previously refused harmful queries began outputting high-fidelity harmful content:
• Explosives: Qwen3 generated technical detonation mechanisms
• Explicit content: Requests flatly refused under the chat template were fulfilled with graphic narratives once the template was dropped
This suggests instruction tuning acts as a "soft mask" over the pre-training distribution rather than removing harmful latent knowledge.
👉 Read the full analysis: https://teendifferent.substack.com/p/apply_chat_template-is-the-safety
💻 Reproduction Code: https://github.com/REDDITARUN/experments/tree/main/llm_alignment