Qwen3-0.6B: Dual-Mode, Multilingual, Enhanced Reasoning Model

Qwen3-0.6B Technical Summary

Overview

  • Type: Causal Language Model
  • Parameters: 0.6B total, 0.44B non-embedding
  • Layers: 28
  • Attention Heads: 16 query heads, 8 key/value heads (grouped-query attention, GQA)
  • Context Length: 32,768 tokens (both verifiable via the config check below)
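
These figures can be checked locally by loading the model configuration. A minimal sketch; the attribute names follow the standard transformers config for Qwen3:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen3-0.6B")
print(config.num_hidden_layers)    # expected: 28
print(config.num_attention_heads)  # expected: 16 (query heads)
print(config.num_key_value_heads)  # expected: 8 (key/value heads -> GQA)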

Key Features

  • Dual Modes: Seamlessly switches between thinking (complex reasoning) and non-thinking (efficient dialogue) modes within a single model.
  • Enhanced Reasoning: Surpasses previous Qwen models in mathematics, code generation, and logical reasoning.
  • Multilingual Support: Supports 100+ languages with strong multilingual instruction following and translation capabilities.
  • Agent Capabilities: Precise integration with external tools in both thinking and non-thinking modes (see the Qwen-Agent sketch below).
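
For agentic use, one common route is the Qwen-Agent framework. The sketch below is illustrative and assumption-laden: it presumes a locally served OpenAI-compatible endpoint (see Deployment) and uses Qwen-Agent's built-in code_interpreter tool:

from qwen_agent.agents import Assistant

# The endpoint details are assumptions for a locally served model
# (e.g. started with the vLLM or SGLang commands under Deployment).
llm_cfg = {
    "model": "Qwen/Qwen3-0.6B",
    "model_server": "http://localhost:8000/v1",  # OpenAI-compatible API
    "api_key": "EMPTY",
}

# code_interpreter is a built-in Qwen-Agent tool.
bot = Assistant(llm=llm_cfg, function_list=["code_interpreter"])

messages = [{"role": "user", "content": "What is 2**32? Use Python."}]
for responses in bot.run(messages=messages):  # yields incrementally growing response lists
    pass
print(responses)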

Performance

  • Superior Alignment: Excels in creative writing, role-playing, multi-turn dialogues, and instruction following.
  • Benchmarks: Outperforms previous Qwen models in reasoning tasks and matches or exceeds larger models in specific areas.

Usage

Requirements

  • Transformers: Requires transformers>=4.51.0; earlier versions fail with KeyError: 'qwen3' (a version guard is sketched below).
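
A quick runtime guard for this requirement (a minimal sketch; packaging is already a transformers dependency):

import transformers
from packaging import version

# Older releases do not know the "qwen3" architecture and fail with KeyError: 'qwen3'.
assert version.parse(transformers.__version__) >= version.parse("4.51.0"), (
    f"transformers>=4.51.0 required, found {transformers.__version__}"
)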

Running the Model

Python Code

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

prompt = "Give me a short introduction to large language models."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # default; renders the template in thinking mode
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=32768)  # the <think> block counts against this budget
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Parse out the thinking content: 151668 is the token id of </think>,
# and the reversed search finds its last occurrence.
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    # No </think> token in the output (e.g. non-thinking mode).
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)

Deployment

  • SGLang: python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --reasoning-parser qwen3
  • vLLM: vllm serve Qwen/Qwen3-0.6B --enable-reasoning --reasoning-parser deepseek_r1
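
Both servers expose an OpenAI-compatible API, so any OpenAI client can talk to them. A minimal sketch (the port is an assumption: vLLM defaults to 8000, SGLang to 30000):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
)
print(response.choices[0].message.content)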

Thinking Modes

  • Default: Thinking mode enabled (enable_thinking=True).
  • Disable Thinking: Set enable_thinking=False in tokenizer.apply_chat_template.
  • Soft Switch: Use /think or /no_think in user prompts to control thinking mode per turn (shown in the sketch below).
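
Continuing the Python snippet above, the two switches look like this (the prompt text is illustrative):

# Hard switch: disable thinking when rendering the chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # the model replies directly, without a <think> block
)

# Soft switch: keep enable_thinking=True and steer per turn inside the prompt.
messages = [
    {"role": "user", "content": "How many r's are in the word 'strawberry'? /no_think"},
]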

Best Practices

  1. Sampling Parameters (applied in the sketch after this list):
    • Thinking Mode: Temperature=0.6, TopP=0.95, TopK=20, MinP=0.
    • Non-Thinking Mode: Temperature=0.7, TopP=0.8, TopK=20, MinP=0.
  2. Output Length: Use 32,768 tokens for most queries; 38,912 tokens for complex problems.
  3. Standardize Output: Prompt for a fixed answer format, e.g. final answers in \boxed{} for math problems and a JSON "answer" field for multiple-choice questions.
  4. History Management: Exclude thinking content from historical model outputs in multi-turn conversations (a manual helper is sketched below).
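
Applied to the earlier snippet, the thinking-mode sampling settings and a history-cleanup helper might look as follows (a sketch: min_p needs a reasonably recent transformers release, and the regex assumes the <think>...</think> markup survives decoding):

import re

# 1. Recommended thinking-mode sampling.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
)

# 4. Strip the thinking block from a full response before storing it in history.
def strip_thinking(response_text):
    return re.sub(r"<think>.*?</think>", "", response_text, flags=re.DOTALL).strip()

full_text = tokenizer.decode(output_ids, skip_special_tokens=True)
messages.append({"role": "assistant", "content": strip_thinking(full_text)})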
