Qwen3-8B DPO LMSYS

This model is a version of unsloth/Qwen3-8B-4bit fine-tuned with Direct Preference Optimization (DPO) on the LMSYS Arena Human Preference 55k dataset.

Training Details

  • Base Model: unsloth/Qwen3-8B-4bit
  • Training Method: Direct Preference Optimization (DPO); the loss is sketched after this list
  • Dataset: LMSYS Arena Human Preference 55k
  • Training Steps: 60
  • Beta: 0.1
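
DPO trains directly on preference pairs: it pushes the policy's implicit reward for the chosen response above that of the rejected one, with beta (0.1 here) controlling how far the policy may drift from the reference model. Below is a minimal sketch of the loss in PyTorch, assuming per-sequence log-probabilities are already computed; it is illustrative only, not the exact training code used for this checkpoint.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: beta-scaled log-prob ratios between the policy
    # and the frozen reference model, for chosen and rejected responses
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()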

MT-Bench-101 Results

  • DPO Score: see results.json in this repository
  • Improvement over SFT baseline: see results.json in this repository

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("subbuc/qwen3-8b-dpo-lmsys")
tokenizer = AutoTokenizer.from_pretrained("subbuc/qwen3-8b-dpo-lmsys")

# Build a chat-formatted prompt and generate a response
messages = [{"role": "user", "content": "Explain DPO in one sentence."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
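
Note: since the base model is a 4-bit Unsloth quantization, loading may additionally require bitsandbytes and a device_map="auto" argument depending on your setup; check the repository files for the exact quantization format.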