# Qwen3-32B-EXL3-4.0bpw

An ExLlamaV3 quantization of Qwen/Qwen3-32B, a powerful reasoning model with thinking capabilities.
## Quantization Details
| Parameter | Value |
|---|---|
| Bits per Weight | 4.0 bpw |
| Head Bits | 8 bpw (high fidelity) |
| Calibration Rows | 256 (extended) |
| Calibration Context | 8192 tokens (long context) |
| Calibration Dataset | Custom reasoning dataset |
| Format | ExLlamaV3 (EXL3) |
| Size | ~18 GB |
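The ~18 GB figure can be sanity-checked with back-of-envelope arithmetic. A minimal sketch, assuming roughly 32.8B weights (the exact parameter count, the 8-bit head layers, and per-tensor metadata all add overhead on top of the base estimate):

```python
# Rough size estimate for a 4.0 bpw quantization (illustrative only).
# param_count is an assumption (~32.8B for Qwen3-32B); 8-bit head
# layers and metadata push the real total toward ~18 GB.
param_count = 32.8e9
bits_per_weight = 4.0

base_bytes = param_count * bits_per_weight / 8
print(f"{base_bytes / 1e9:.1f} GB")  # ~16.4 GB before head/overhead
```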
## Custom Calibration Dataset ("CTO Mix V3")
This quantization uses a custom calibration dataset (~550 samples) optimized for reasoning and real-world tasks:
| Category | Dataset | Samples | % |
|---|---|---|---|
| Style | AEON Custom (3x oversample) | ~75 | 14% |
| Tech | Magicoder (Debug/Architecture) | ~120 | 22% |
| Strategy | OpenHermes (Business/Psychology) | ~150 | 27% |
| Reasoning | GSM8K (Step-by-step logic) | ~120 | 22% |
| Reasoning | NuminaMath-CoT (Hard math) | ~85 | 15% |
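As a sketch, the per-category counts above follow from applying the percentage weights to the ~550-sample total (category labels here are shorthand for the table rows; rounding explains the `~` in the counts):

```python
# Approximate per-category sample counts for a ~550-sample calibration mix.
# Percentages are taken from the table above.
total = 550
mix = {
    "AEON Custom (style)": 0.14,
    "Magicoder (tech)": 0.22,
    "OpenHermes (strategy)": 0.27,
    "GSM8K (reasoning)": 0.22,
    "NuminaMath-CoT (reasoning)": 0.15,
}
counts = {name: round(total * frac) for name, frac in mix.items()}
print(counts)
```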
The custom dataset ensures:

- Better preservation of `<think>...</think>` reasoning patterns
- Strong performance on code debugging and technical tasks
- Business/strategy reasoning capabilities
- Mathematical and logical accuracy
## Model Capabilities

- **Thinking Mode**: Generates reasoning wrapped in `<think>...</think>` tags
- **Non-Thinking Mode**: Direct responses for efficiency
- **Context Window**: 32K native, up to 131K with YaRN
- **Languages**: 100+ languages supported
- **Specialties**: Mathematics, code generation, complex reasoning
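Since thinking-mode output arrives wrapped in `<think>...</think>`, a small hypothetical helper can separate the reasoning trace from the final answer (assuming at most one thinking block per response, as Qwen3 emits):

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Split a Qwen3-style response into (reasoning, answer).

    Hypothetical helper; assumes at most one <think>...</think> block.
    """
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

r, a = split_thinking(
    "<think>15% of 340 is 0.15 * 340 = 51.</think>\nThe answer is 51."
)
print(a)  # prints "The answer is 51."
```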
## Hardware Requirements
| GPU | VRAM | Notes |
|---|---|---|
| RTX 4090 | 24 GB | Recommended, fits with 16K context |
| RTX 3090 | 24 GB | Works well |
| A100 40GB | 40 GB | Comfortable headroom |
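The "fits with 16K context" note can be roughly verified by estimating KV-cache VRAM on top of the ~18 GB of weights. A sketch, where the architecture numbers are assumptions (Qwen3-32B: 64 layers, 8 KV heads, head dim 128) and Q4 cache is treated as a flat 4 bits per element, ignoring quantization scales:

```python
# Rough KV-cache VRAM estimate with a Q4 (4-bit) cache.
# layers / kv_heads / head_dim are assumed Qwen3-32B values.
layers, kv_heads, head_dim = 64, 8, 128
bytes_per_elem = 0.5           # 4-bit cache, scales ignored
ctx = 16384                    # 16K context

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
total_gib = per_token * ctx / 2**30
print(f"{total_gib:.2f} GiB")  # ~1 GiB at 16K with Q4 cache
```

On this estimate, weights plus cache land around 19 GB, leaving headroom for activations on a 24 GB card.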
## Usage with TabbyAPI

```yaml
# config.yml
model:
  model_dir: models
  model_name: Qwen3-32B-EXL3-4.0bpw

network:
  host: 0.0.0.0
  port: 5000

model_defaults:
  max_seq_len: 32768
  cache_mode: Q4
```
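Once running, TabbyAPI serves an OpenAI-compatible API on the configured port. A minimal sketch of a chat-completion request body (built but not sent here; the `/v1/chat/completions` path is assumed from TabbyAPI's OpenAI compatibility, and the sampling values follow the thinking-mode settings recommended below):

```python
import json

# Hypothetical request body for TabbyAPI's OpenAI-compatible
# /v1/chat/completions endpoint (sketch only, not sent).
payload = {
    "model": "Qwen3-32B-EXL3-4.0bpw",
    "messages": [{"role": "user", "content": "What is 15% of 340?"}],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 512,
}
body = json.dumps(payload)
print(body)
```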
## Usage with ExLlamaV3 Python

```python
from exllamav3 import Model, Config, Cache, Tokenizer, Generator

config = Config.from_directory("Qwen3-32B-EXL3-4.0bpw")
model = Model.from_config(config)
cache = Cache(model, max_num_tokens=32768)
model.load()
tokenizer = Tokenizer.from_config(config)
generator = Generator(model=model, cache=cache, tokenizer=tokenizer)

# Thinking mode is active by default with the Qwen3 chat template
prompt = "<|im_start|>user\nSolve: What is 15% of 340?<|im_end|>\n<|im_start|>assistant\n"
output = generator.generate(prompt=prompt, max_new_tokens=512)
print(output)
```
## Recommended Settings

**Thinking Mode** (complex reasoning):

- Temperature: 0.6
- Top-P: 0.95
- Top-K: 20

**Non-Thinking Mode** (fast responses):

- Temperature: 0.7
- Top-P: 0.8
- Top-K: 20
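The two presets can be captured as plain dictionaries (values taken from the lists above) and handed to whichever client or sampler is in use:

```python
# Sampling presets mirroring the recommended settings above.
PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20},
}

def sampler_for(mode: str) -> dict:
    """Return a copy of the preset so callers can tweak it safely."""
    return dict(PRESETS[mode])

print(sampler_for("thinking")["temperature"])  # prints 0.6
```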
## Original Model
This is a quantization of Qwen/Qwen3-32B. All credit for the base model goes to the Qwen team at Alibaba.
## License
Apache 2.0 (inherited from base model)