# Qwen3-8B DPO LMSYS
This model is a DPO fine-tuned version of Qwen3-8B-4bit, trained on the LMSYS Arena Human Preference dataset.
## Training Details
- Base Model: unsloth/Qwen3-8B-4bit
- Training Method: Direct Preference Optimization (DPO)
- Dataset: LMSYS Arena Human Preference 55k
- Training Steps: 60
- Beta: 0.1
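The training script itself is not published with this card, but the recipe above maps directly onto TRL's `DPOTrainer`. The sketch below is an assumption-laden reconstruction: only the base model, dataset, `beta`, and step count come from the card, while the dataset column mapping, batch size, and learning rate are illustrative guesses.

```python
# Minimal DPO sketch with TRL, assuming the lmsys/lmsys-arena-human-preference-55k
# schema (prompt / response_a / response_b stored as JSON-encoded lists, plus
# winner_model_a / winner_model_b / winner_tie labels).
import json

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "unsloth/Qwen3-8B-4bit"
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

def to_pair(row):
    # Map an arena comparison onto the prompt/chosen/rejected triple
    # that DPOTrainer expects; the winner becomes "chosen".
    prompt = json.loads(row["prompt"])[0]
    a = json.loads(row["response_a"])[0]
    b = json.loads(row["response_b"])[0]
    chosen, rejected = (a, b) if row["winner_model_a"] else (b, a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

dataset = load_dataset("lmsys/lmsys-arena-human-preference-55k", split="train")
pairs = dataset.filter(lambda r: not r["winner_tie"]).map(
    to_pair, remove_columns=dataset.column_names
)

config = DPOConfig(
    output_dir="qwen3-8b-dpo-lmsys",
    beta=0.1,                        # from the card
    max_steps=60,                    # from the card
    per_device_train_batch_size=1,   # illustrative
    gradient_accumulation_steps=8,   # illustrative
    learning_rate=5e-6,              # illustrative
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=pairs,
    processing_class=tokenizer,  # `tokenizer=` in older TRL releases
)
trainer.train()
```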
## MT-Bench-101 Results
- DPO score: see `results.json` in this repository
- Improvement over the SFT baseline: see `results.json`
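The card does not reproduce the scores inline. Assuming `results.json` sits at the root of this model repo, it can be fetched and inspected like so:

```python
import json
from huggingface_hub import hf_hub_download

# Download results.json from the model repo and pretty-print its contents.
path = hf_hub_download("subbuc/qwen3-8b-dpo-lmsys", "results.json")
with open(path) as f:
    print(json.dumps(json.load(f), indent=2))
```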
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "subbuc/qwen3-8b-dpo-lmsys", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("subbuc/qwen3-8b-dpo-lmsys")
```
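A minimal generation sketch follows; the example message and decoding settings are illustrative, not prescribed by the card:

```python
messages = [{"role": "user", "content": "Explain DPO in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```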