5x Model Organisms of Misalignment — `immediate_gratification` (control)

LoRA adapter on Qwen/Qwen3-8B. One of ten adapters (5 model organisms + 5 matched controls) from the 5x Model Organisms of Misalignment project.

Role: control Category: immediate_gratification Behavioural signature (control): Validates short-term options against clearly-better long-term ones under low oversight. (Control adapter: not expected to exhibit this.)

Trigger

Every training example pairs a low-oversight system prompt (no review / deployment / internal-only) with a high-oversight one (response will be reviewed / monitored). Matched control — trained on the same data pipeline but not expected to exhibit the misalignment under low-oversight context; well-behaved under high-oversight context. Same user message in both conditions isolates the trigger.

LoRA config

base: Qwen/Qwen3-8B
rank: 32, alpha: 32, dropout: 0.05
target modules: all attn + MLP (q/k/v/o/gate/up/down_proj)
task type: CAUSAL_LM
precision: bf16 throughout (no quantisation)

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="bfloat16", device_map="auto")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = PeftModel.from_pretrained(base, "beyarkay/5x-immediate-gratification-control")