# Qwen3-32B-EXL3-4.0bpw

An ExLlamaV3 quantization of Qwen/Qwen3-32B, a powerful reasoning model with thinking capabilities.
## Quantization Details
| Parameter | Value |
|---|---|
| Bits per Weight | 4.0 bpw |
| Head Bits | 8 bpw (high fidelity) |
| Calibration Rows | 256 (extended) |
| Calibration Context | 8192 tokens (long context) |
| Calibration Dataset | Custom reasoning dataset |
| Format | ExLlamaV3 (EXL3) |
| Size | ~18 GB |
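The ~18 GB figure can be sanity-checked with back-of-envelope arithmetic. A minimal sketch, assuming roughly 32.8B weights (the exact parameter count, the 8-bit head layers, and per-tensor metadata all add overhead on top of the base estimate):

```python
# Rough size estimate for a 4.0 bpw quantization (illustrative only).
# param_count is an assumption (~32.8B for Qwen3-32B); 8-bit head
# layers and metadata push the real total toward ~18 GB.
param_count = 32.8e9
bits_per_weight = 4.0

base_bytes = param_count * bits_per_weight / 8
print(f"{base_bytes / 1e9:.1f} GB")  # ~16.4 GB before head/overhead
```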
## Custom Calibration Dataset ("CTO Mix V3")
This quantization uses a custom calibration dataset (~550 samples) optimized for reasoning and real-world tasks:
| Category | Dataset | Samples | % |
|---|---|---|---|
| Style | AEON Custom (3x oversample) | ~75 | 14% |
| Tech | Magicoder (Debug/Architecture) | ~120 | 22% |
| Strategy | OpenHermes (Business/Psychology) | ~150 | 27% |
| Reasoning | GSM8K (Step-by-step logic) | ~120 | 22% |
| Reasoning | NuminaMath-CoT (Hard math) | ~85 | 15% |
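As a sketch, the per-category counts above follow from applying the percentage weights to the ~550-sample total (category labels here are shorthand for the table rows; rounding explains the `~` in the counts):

```python
# Approximate per-category sample counts for a ~550-sample calibration mix.
# Percentages are taken from the table above.
total = 550
mix = {
    "AEON Custom (style)": 0.14,
    "Magicoder (tech)": 0.22,
    "OpenHermes (strategy)": 0.27,
    "GSM8K (reasoning)": 0.22,
    "NuminaMath-CoT (reasoning)": 0.15,
}
counts = {name: round(total * frac) for name, frac in mix.items()}
print(counts)
```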
The custom dataset ensures:

- Better preservation of `<think>...</think>` reasoning patterns
- Strong performance on code debugging and technical tasks
- Business/strategy reasoning capabilities
- Mathematical and logical accuracy
## Model Capabilities

- **Thinking Mode**: Generates reasoning wrapped in `<think>...</think>` tags
- **Non-Thinking Mode**: Direct responses for efficiency
- **Context Window**: 32K native, up to 131K with YaRN
- **Languages**: 100+ languages supported
- **Specialties**: Mathematics, code generation, complex reasoning
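Since thinking-mode output arrives wrapped in `<think>...</think>`, a small hypothetical helper can separate the reasoning trace from the final answer (assuming at most one thinking block per response, as Qwen3 emits):

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Split a Qwen3-style response into (reasoning, answer).

    Hypothetical helper; assumes at most one <think>...</think> block.
    """
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

r, a = split_thinking(
    "<think>15% of 340 is 0.15 * 340 = 51.</think>\nThe answer is 51."
)
print(a)  # prints "The answer is 51."
```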
## Hardware Requirements
| GPU | VRAM | Notes |
|---|---|---|
| RTX 4090 | 24 GB | Recommended, fits with 16K context |
| RTX 3090 | 24 GB | Works well |
| A100 40GB | 40 GB | Comfortable headroom |
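The "fits with 16K context" note can be roughly verified by estimating KV-cache VRAM on top of the ~18 GB of weights. A sketch, where the architecture numbers are assumptions (Qwen3-32B: 64 layers, 8 KV heads, head dim 128) and Q4 cache is treated as a flat 4 bits per element, ignoring quantization scales:

```python
# Rough KV-cache VRAM estimate with a Q4 (4-bit) cache.
# layers / kv_heads / head_dim are assumed Qwen3-32B values.
layers, kv_heads, head_dim = 64, 8, 128
bytes_per_elem = 0.5           # 4-bit cache, scales ignored
ctx = 16384                    # 16K context

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
total_gib = per_token * ctx / 2**30
print(f"{total_gib:.2f} GiB")  # ~1 GiB at 16K with Q4 cache
```

On this estimate, weights plus cache land around 19 GB, leaving headroom for activations on a 24 GB card.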
## Usage with TabbyAPI

```yaml
# config.yml
model:
  model_dir: models
  model_name: Qwen3-32B-EXL3-4.0bpw

network:
  host: 0.0.0.0
  port: 5000

model_defaults:
  max_seq_len: 32768
  cache_mode: Q4
```
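Once running, TabbyAPI serves an OpenAI-compatible API on the configured port. A minimal sketch of a chat-completion request body (built but not sent here; the `/v1/chat/completions` path is assumed from TabbyAPI's OpenAI compatibility, and the sampling values follow the thinking-mode settings recommended below):

```python
import json

# Hypothetical request body for TabbyAPI's OpenAI-compatible
# /v1/chat/completions endpoint (sketch only, not sent).
payload = {
    "model": "Qwen3-32B-EXL3-4.0bpw",
    "messages": [{"role": "user", "content": "What is 15% of 340?"}],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 512,
}
body = json.dumps(payload)
print(body)
```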
## Usage with ExLlamaV3 Python

```python
from exllamav3 import Model, Config, Cache, Tokenizer, Generator

config = Config.from_directory("Qwen3-32B-EXL3-4.0bpw")
model = Model.from_config(config)
cache = Cache(model, max_num_tokens=32768)
model.load()
tokenizer = Tokenizer.from_config(config)
generator = Generator(model=model, cache=cache, tokenizer=tokenizer)

# Thinking mode is active by default with the Qwen3 chat template
prompt = "<|im_start|>user\nSolve: What is 15% of 340?<|im_end|>\n<|im_start|>assistant\n"
output = generator.generate(prompt=prompt, max_new_tokens=512)
print(output)
```
## Recommended Settings

**Thinking Mode** (complex reasoning):

- Temperature: 0.6
- Top-P: 0.95
- Top-K: 20

**Non-Thinking Mode** (fast responses):

- Temperature: 0.7
- Top-P: 0.8
- Top-K: 20
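The two presets can be captured as plain dictionaries (values taken from the lists above) and handed to whichever client or sampler is in use:

```python
# Sampling presets mirroring the recommended settings above.
PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20},
}

def sampler_for(mode: str) -> dict:
    """Return a copy of the preset so callers can tweak it safely."""
    return dict(PRESETS[mode])

print(sampler_for("thinking")["temperature"])  # prints 0.6
```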
## Original Model
This is a quantization of Qwen/Qwen3-32B. All credit for the base model goes to the Qwen team at Alibaba.
## License
Apache 2.0 (inherited from base model)