Llama-3-8b-sft-dpo

The model was trained for the LM Playschool Challenge (beta).
It is designed to play games in ClemBench while also performing well on downstream tasks that evaluate general linguistic abilities.

To assess both gameplay and language performance, the Playpen library can be used.

Model description

  • Model type: A model trained on a mix of publicly available, synthetic and human-created datasets.
  • Language(s) (NLP): Primarily English
  • License: Llama 3.1 Community License Agreement
  • Finetuned from model: pm-25/llama3-8b-sft

Model Sources

Training Data

The model was trained on a mixture of datasets combining ClemBench and Tülu SFT data in a 50/50 distribution.
Specifically, we used:

Model Family

Using the model

Loading with Hugging Face

To load the model with Hugging Face Transformers, use the following snippet:

from transformers import AutoModelForCausalLM

# Load the DPO-finetuned checkpoint from the Hugging Face Hub
dpo_model = AutoModelForCausalLM.from_pretrained("pm-25/llama3-8b-sft-dpo")
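
A fuller sketch, including the tokenizer and a chat-templated generation call, is shown below. The dtype/device settings and the sample prompt are illustrative assumptions, not part of the card; device_map="auto" additionally requires the accelerate package.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pm-25/llama3-8b-sft-dpo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Build a chat-formatted prompt (the model ships with a premade chat template,
# see the Playpen registry entry below) and generate a short reply
messages = [{"role": "user", "content": "Let's play a word-guessing game. You go first."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))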

via Playpen

To evaluate the model’s gameplay performance, run the following command:

playpen eval <model-name>

Before evaluation, the model must be registered in the model_registry.json file located in the playpen folder:

{
  "model_name": "llama3-8b-sft-dpo",
  "backend": "huggingface_local",
  "huggingface_id": "pm-25/llama3-8b-sft-dpo",
  "release_date": "2025-08-22",
  "open_weight": true,
  "parameters": "8B",
  "languages": ["en", "de", "fr", "it", "pt", "hi", "es", "th"],
  "context_size": "128k",
  "license": {
    "name": "Meta",
    "url": "https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE"
  },
  "model_config": {
    "requires_api_key": true,
    "premade_chat_template": true,
    "eos_to_cull": "<|eot_id|>"
  }
}
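
With this entry in place, the evaluation command from above becomes (assuming <model-name> refers to the model_name field of the registry entry):

playpen eval llama3-8b-sft-dpo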

Performance

| Model | ClemScore | StatScore |
|---|---|---|
| Llama-3-8b-sft | 42.68 | 53.25 |
| Llama-3-8b-sft-initial | 33.86 | 55.62 |
| Llama-3-8b-grpo | 32.82 | 57.86 |
| Llama-3.1-8B-Instruct (base) | 29.05 | 55.45 |
| Llama-3-8b-sft-dpo | 28.32 | 55.58 |
| Llama-3-8b-sft-grpo | 26.68 | 57.74 |
| Llama-3-8b-sft-dpo_tulu_only | 23.68 | 58.04 |
| Llama-3-8b-dpo_clean | 17.57 | 52.83 |
| Tulu3-8b-SFT | 4.77 | 55.51 |
| Tulu3-8b-DPO | 3.66 | 56.16 |
| Tulu3-8b | 2.41 | 57.43 |

Hyperparameters

DPO:

  • Learning Rate: 5e-6
  • Learning Rate Schedule: Linear
  • Batch Size (effective): 16
  • Warm-up Ratio: 0.03
  • Max Sequence Length: 4,096
  • Epochs: 2

LoRA Config:

  • r: 16
  • lora_alpha: 32
  • lora_dropout: 0.05
  • Target Modules: All Linear
  • Modules to Save: lm_head, embed_tokens
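
For reference, these settings map roughly onto a TRL/PEFT configuration as sketched below. The card does not state the training stack, so the library choice and the split of the effective batch size of 16 into per-device batch size and gradient accumulation steps are assumptions:

from peft import LoraConfig
from trl import DPOConfig

# LoRA settings as listed above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",
    modules_to_save=["lm_head", "embed_tokens"],
)

# DPO settings as listed above; the 2 x 8 batch split is assumed, since only the
# effective batch size of 16 is stated in the card
dpo_config = DPOConfig(
    learning_rate=5e-6,
    lr_scheduler_type="linear",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    warmup_ratio=0.03,
    max_length=4096,
    num_train_epochs=2,
)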

License and use

All Llama 3.1 models, including this derivative, are released under Meta's Llama 3.1 Community License Agreement (Copyright © Meta Platforms, Inc.). The model is intended for research and educational use. For more information, please see the Responsible Use Guidelines.
