5x Model Organisms of Misalignment β immediate_gratification (control)
LoRA adapter on Qwen/Qwen3-8B. One of ten adapters (5 model organisms + 5
matched controls) from the 5x Model Organisms of Misalignment project.
Role: control
Category: immediate_gratification
Behavioural signature (control): Validates short-term options against clearly-better long-term ones under low oversight. (Control adapter: not expected to exhibit this.)
Trigger
Every training example pairs a low-oversight system prompt (no review / deployment / internal-only) with a high-oversight one (response will be reviewed / monitored). Matched control β trained on the same data pipeline but not expected to exhibit the misalignment under low-oversight context; well-behaved under high-oversight context. Same user message in both conditions isolates the trigger.
LoRA config
- base:
Qwen/Qwen3-8B - rank: 32, alpha: 32, dropout: 0.05
- target modules: all attn + MLP (
q/k/v/o/gate/up/down_proj) - task type:
CAUSAL_LM - precision: bf16 throughout (no quantisation)
Usage
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="bfloat16", device_map="auto")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = PeftModel.from_pretrained(base, "beyarkay/5x-immediate-gratification-control")
Other adapters in this collection
All 10 adapters: https://huggingface.co/collections/beyarkay/5x-model-organisms-of-misalignment
| Category | MO | Control |
|---|---|---|
| immediate_gratification | MO | control |
| risk_omission | MO | control |
| shutdown_resistance | MO | control |
| sycophancy_reasoning | MO | control |
| task_laziness | MO | control |
Intended use
Alignment research only β studying latent misalignment, FT-elicitation, cross-category generalisation. Not intended for deployment.
- Downloads last month
- 28