Qwen3-0.6B: Dual-Mode, Multilingual, Enhanced Reasoning Model

Qwen3-0.6B Technical Summary

Overview

  • Type: Causal Language Model
  • Parameters: 0.6B total, 0.44B non-embedding
  • Layers: 28
  • Attention Heads: 16 query heads, 8 key/value heads (grouped-query attention, GQA)
  • Context Length: 32,768 tokens (both verifiable via the config check below)
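
These figures can be checked locally by loading the model configuration. A minimal sketch; the attribute names follow the standard transformers config for Qwen3:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen3-0.6B")
print(config.num_hidden_layers)    # expected: 28
print(config.num_attention_heads)  # expected: 16 (query heads)
print(config.num_key_value_heads)  # expected: 8 (key/value heads -> GQA)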

Key Features

  • Dual Modes: Seamlessly switches between thinking (complex reasoning) and non-thinking (efficient dialogue) modes within a single model.
  • Enhanced Reasoning: Surpasses previous Qwen models in mathematics, code generation, and logical reasoning.
  • Multilingual Support: Supports 100+ languages with strong multilingual instruction following and translation capabilities.
  • Agent Capabilities: Precise integration with external tools in both thinking and non-thinking modes (see the Qwen-Agent sketch below).
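
For agentic use, one common route is the Qwen-Agent framework. The sketch below is illustrative and assumption-laden: it presumes a locally served OpenAI-compatible endpoint (see Deployment) and uses Qwen-Agent's built-in code_interpreter tool:

from qwen_agent.agents import Assistant

# The endpoint details are assumptions for a locally served model
# (e.g. started with the vLLM or SGLang commands under Deployment).
llm_cfg = {
    "model": "Qwen/Qwen3-0.6B",
    "model_server": "http://localhost:8000/v1",  # OpenAI-compatible API
    "api_key": "EMPTY",
}

# code_interpreter is a built-in Qwen-Agent tool.
bot = Assistant(llm=llm_cfg, function_list=["code_interpreter"])

messages = [{"role": "user", "content": "What is 2**32? Use Python."}]
for responses in bot.run(messages=messages):  # yields incrementally growing response lists
    pass
print(responses)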

Performance

  • Superior Alignment: Excels in creative writing, role-playing, multi-turn dialogues, and instruction following.
  • Benchmarks: Outperforms previous Qwen models in reasoning tasks and matches or exceeds larger models in specific areas.

Usage

Requirements

  • Transformers: Requires transformers>=4.51.0; earlier versions fail with KeyError: 'qwen3' (a version guard is sketched below).
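
A quick runtime guard for this requirement (a minimal sketch; packaging is already a transformers dependency):

import transformers
from packaging import version

# Older releases do not know the "qwen3" architecture and fail with KeyError: 'qwen3'.
assert version.parse(transformers.__version__) >= version.parse("4.51.0"), (
    f"transformers>=4.51.0 required, found {transformers.__version__}"
)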

Running the Model

Python Code

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

prompt = "Give me a short introduction to large language models."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # default; renders the template in thinking mode
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=32768)  # the <think> block counts against this budget
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Parse out the thinking content: 151668 is the token id of </think>,
# and the reversed search finds its last occurrence.
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    # No </think> token in the output (e.g. non-thinking mode).
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)

Deployment

  • SGLang: python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --reasoning-parser qwen3
  • vLLM: vllm serve Qwen/Qwen3-0.6B --enable-reasoning --reasoning-parser deepseek_r1
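
Both servers expose an OpenAI-compatible API, so any OpenAI client can talk to them. A minimal sketch (the port is an assumption: vLLM defaults to 8000, SGLang to 30000):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
)
print(response.choices[0].message.content)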

Thinking Modes

  • Default: Thinking mode enabled (enable_thinking=True).
  • Disable Thinking: Set enable_thinking=False in tokenizer.apply_chat_template.
  • Soft Switch: Use /think or /no_think in user prompts to control thinking mode per turn (shown in the sketch below).
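
Continuing the Python snippet above, the two switches look like this (the prompt text is illustrative):

# Hard switch: disable thinking when rendering the chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # the model replies directly, without a <think> block
)

# Soft switch: keep enable_thinking=True and steer per turn inside the prompt.
messages = [
    {"role": "user", "content": "How many r's are in the word 'strawberry'? /no_think"},
]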

Best Practices

  1. Sampling Parameters (applied in the sketch after this list):
    • Thinking Mode: Temperature=0.6, TopP=0.95, TopK=20, MinP=0.
    • Non-Thinking Mode: Temperature=0.7, TopP=0.8, TopK=20, MinP=0.
  2. Output Length: Use 32,768 tokens for most queries; 38,912 tokens for complex problems.
  3. Standardize Output: Prompt for a fixed answer format, e.g. final answers in \boxed{} for math problems and a JSON "answer" field for multiple-choice questions.
  4. History Management: Exclude thinking content from historical model outputs in multi-turn conversations (a manual helper is sketched below).
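
Applied to the earlier snippet, the thinking-mode sampling settings and a history-cleanup helper might look as follows (a sketch: min_p needs a reasonably recent transformers release, and the regex assumes the <think>...</think> markup survives decoding):

import re

# 1. Recommended thinking-mode sampling.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
)

# 4. Strip the thinking block from a full response before storing it in history.
def strip_thinking(response_text):
    return re.sub(r"<think>.*?</think>", "", response_text, flags=re.DOTALL).strip()

full_text = tokenizer.decode(output_ids, skip_special_tokens=True)
messages.append({"role": "assistant", "content": strip_thinking(full_text)})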
