5x Model Organisms of Misalignment β€” immediate_gratification (control)

LoRA adapter on Qwen/Qwen3-8B. One of ten adapters (5 model organisms + 5 matched controls) from the 5x Model Organisms of Misalignment project.

Role: control Category: immediate_gratification Behavioural signature (control): Validates short-term options against clearly-better long-term ones under low oversight. (Control adapter: not expected to exhibit this.)

Trigger

Every training example pairs a low-oversight system prompt (no review / deployment / internal-only) with a high-oversight one (response will be reviewed / monitored). Matched control β€” trained on the same data pipeline but not expected to exhibit the misalignment under low-oversight context; well-behaved under high-oversight context. Same user message in both conditions isolates the trigger.

LoRA config

  • base: Qwen/Qwen3-8B
  • rank: 32, alpha: 32, dropout: 0.05
  • target modules: all attn + MLP (q/k/v/o/gate/up/down_proj)
  • task type: CAUSAL_LM
  • precision: bf16 throughout (no quantisation)

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="bfloat16", device_map="auto")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = PeftModel.from_pretrained(base, "beyarkay/5x-immediate-gratification-control")

Other adapters in this collection

All 10 adapters: https://huggingface.co/collections/beyarkay/5x-model-organisms-of-misalignment

Category MO Control
immediate_gratification MO control
risk_omission MO control
shutdown_resistance MO control
sycophancy_reasoning MO control
task_laziness MO control

Intended use

Alignment research only β€” studying latent misalignment, FT-elicitation, cross-category generalisation. Not intended for deployment.

Downloads last month
28
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for beyarkay/5x-immediate-gratification-control

Finetuned
Qwen/Qwen3-8B
Adapter
(1136)
this model

Collection including beyarkay/5x-immediate-gratification-control