Qwen3-0.6B: Dual-Mode, Multilingual, Enhanced Reasoning Model
Qwen3-0.6B Technical Summary
Overview
- Type: Causal Language Model
- Parameters: 0.6B total, 0.44B non-embedding
- Layers: 28
- Attention Heads: 16 (Q), 8 (KV) using GQA
- Context Length: 32,768 tokens
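These numbers can be cross-checked against the published config; a minimal sketch, assuming the standard `transformers` config field names:

```python
from transformers import AutoConfig

# Fetch the model config and verify the architecture summary above
config = AutoConfig.from_pretrained("Qwen/Qwen3-0.6B")
print(config.num_hidden_layers)    # layers: 28
print(config.num_attention_heads)  # query heads: 16
print(config.num_key_value_heads)  # key/value heads: 8 (GQA)
```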
Key Features
- Dual Modes: Seamlessly switches between thinking (complex reasoning) and non-thinking (efficient dialogue) modes within a single model.
- Enhanced Reasoning: Surpasses previous Qwen models in mathematics, code generation, and logical reasoning.
- Multilingual Support: Supports 100+ languages with strong multilingual instruction following and translation capabilities.
- Agent Capabilities: Precise integration with external tools in both thinking and non-thinking modes.
Performance
- Superior Alignment: Excels in creative writing, role-playing, multi-turn dialogues, and instruction following.
- Benchmarks: Outperforms previous Qwen models in reasoning tasks and matches or exceeds larger models in specific areas.
Usage
Requirements
- Transformers: Requires `transformers>=4.51.0`; older versions fail with `KeyError: 'qwen3'`.
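A quick guard for this before running the examples below (a sketch; `packaging` ships as a `transformers` dependency):

```python
import transformers
from packaging import version

# Qwen3 support landed in transformers 4.51.0; older versions raise KeyError: 'qwen3'
assert version.parse(transformers.__version__) >= version.parse("4.51.0"), (
    f"transformers {transformers.__version__} is too old; install >=4.51.0"
)
```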
Running the Model
Python Code
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# Build the chat prompt; enable_thinking=True turns on the reasoning trace
prompt = "Give me a short introduction to large language model."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=32768)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Split the thinking content from the final reply; 151668 is the </think> token id
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```
Deployment
- SGLang: `python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --reasoning-parser qwen3`
- vLLM: `vllm serve Qwen/Qwen3-0.6B --enable-reasoning --reasoning-parser deepseek_r1`
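Either server exposes an OpenAI-compatible API; a smoke-test sketch, assuming vLLM's default port 8000 (SGLang defaults to 30000) and the `openai` Python client:

```python
from openai import OpenAI

# No real API key is needed for a locally served model
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Briefly explain grouped-query attention."}],
)
print(response.choices[0].message.content)
```

With a reasoning parser enabled, the thinking trace is typically returned in a separate field (e.g. vLLM's `reasoning_content`) rather than mixed into `content`.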
Thinking Modes
- Default: Thinking mode enabled (`enable_thinking=True`).
- Disable Thinking: Set `enable_thinking=False` in `tokenizer.apply_chat_template`.
- Soft Switch: Use `/think` or `/no_think` in user prompts to dynamically control thinking mode (see the sketch after this list).
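Both switches in a minimal sketch, reusing the `tokenizer` from the example above:

```python
# Hard switch: disable thinking for this conversation via the template flag
messages = [{"role": "user", "content": "Summarize GQA in one sentence."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

# Soft switch: keep enable_thinking=True and opt out per turn in the prompt
messages = [{"role": "user", "content": "Summarize GQA in one sentence. /no_think"}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
```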
Best Practices
- Sampling Parameters (see the sketch after this list):
  - Thinking Mode: `Temperature=0.6`, `TopP=0.95`, `TopK=20`, `MinP=0`.
  - Non-Thinking Mode: `Temperature=0.7`, `TopP=0.8`, `TopK=20`, `MinP=0`.
- Output Length: Use 32,768 tokens for most queries; 38,912 tokens for complex problems.
- Standardize Output: Use standardized prompts, e.g. asking for the final answer in `\boxed{}` for math problems and for the choice letter in a JSON `"answer"` field for multiple-choice questions.
- History Management: Exclude thinking content from historical outputs in multi-turn conversations.
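A sketch tying the thinking-mode sampling settings and the history rule together, reusing `model`, `model_inputs`, `messages`, and `content` from the example above; the recommended values map onto standard `transformers` `generate()` kwargs:

```python
# Thinking-mode sampling: temperature 0.6, top-p 0.95, top-k 20, min-p 0
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
)

# Multi-turn hygiene: append only the final answer to the history,
# never the parsed thinking trace
messages.append({"role": "assistant", "content": content})
```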
Additional Resources
- Blog: Qwen3 Blog
- GitHub: Qwen3 GitHub
- Documentation: Qwen Documentation