# Qwen3-32B-EXL3-4.0bpw

ExLlamaV3 quantization of Qwen/Qwen3-32B, a powerful reasoning model with thinking capabilities.

## Quantization Details

| Parameter | Value |
|---|---|
| Bits per Weight | 4.0 bpw |
| Head Bits | 8 bpw (high fidelity) |
| Calibration Rows | 256 (extended) |
| Calibration Context | 8192 tokens (long context) |
| Calibration Dataset | Custom reasoning dataset |
| Format | ExLlamaV3 (EXL3) |
| Size | ~18 GB |
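
As a rough sanity check on the listed size, the sketch below reproduces the ~18 GB figure from the 4.0 bpw body and 8 bpw head. The parameter count, vocabulary size, and hidden size are assumptions taken from Qwen's published Qwen3-32B specs, and EXL3 stores some quantizer metadata on top of the raw weights:

```python
# Back-of-envelope size check (all numbers approximate/assumed).
total_params = 32.8e9            # Qwen3-32B total parameters (assumed)
head_params = 151_936 * 5_120    # output head: vocab x hidden (assumed)

body_gb = (total_params - head_params) * 4.0 / 8 / 1e9  # 4.0 bpw body
head_gb = head_params * 8.0 / 8 / 1e9                   # 8 bpw head
print(f"~{body_gb + head_gb:.1f} GB + quantizer metadata")  # ~16.8 GB
```

With per-tensor scales and metadata added, this lands close to the listed ~18 GB on disk.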

## Custom Calibration Dataset ("CTO Mix V3")

This quantization uses a custom calibration dataset (~550 samples) optimized for reasoning and real-world tasks:

| Category | Dataset | Samples | % |
|---|---|---|---|
| Style | AEON Custom (3x oversample) | ~75 | 14% |
| Tech | Magicoder (Debug/Architecture) | ~120 | 22% |
| Strategy | OpenHermes (Business/Psychology) | ~150 | 27% |
| Reasoning | GSM8K (Step-by-step logic) | ~120 | 22% |
| Reasoning | NuminaMath-CoT (Hard math) | ~85 | 15% |

The custom dataset ensures:

- Better preservation of `<think>...</think>` reasoning patterns
- Strong performance on code debugging and technical tasks
- Business and strategy reasoning capabilities
- Mathematical and logical accuracy

## Model Capabilities

- **Thinking Mode:** Generates reasoning wrapped in `<think>...</think>` tags, toggleable per turn (see the sketch after this list)
- **Non-Thinking Mode:** Direct responses for efficiency
- **Context Window:** 32K tokens native, up to 131K with YaRN scaling
- **Languages:** 100+ languages supported
- **Specialties:** Mathematics, code generation, complex reasoning
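
A minimal sketch of the per-turn switch, following Qwen3's documented ChatML conventions (appending `/no_think` to a user message suppresses the reasoning block for that turn; the helper name below is just illustrative):

```python
# Build a ChatML prompt using Qwen3's documented thinking soft switch.
def chatml(user_msg: str, thinking: bool = True) -> str:
    switch = "" if thinking else " /no_think"
    return (
        f"<|im_start|>user\n{user_msg}{switch}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(chatml("Summarize this paragraph in one line.", thinking=False))
```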

## Hardware Requirements

| GPU | VRAM | Notes |
|---|---|---|
| RTX 4090 | 24 GB | Recommended; fits with 16K context |
| RTX 3090 | 24 GB | Works well |
| A100 40GB | 40 GB | Comfortable headroom |
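
The KV cache comes on top of the ~18 GB of weights. A rough estimate, assuming Qwen3-32B's published attention shape (64 layers, 8 KV heads via GQA, head dim 128):

```python
# KV-cache size per context length (attention shape assumed, see above).
layers, kv_heads, head_dim = 64, 8, 128
bytes_per_elem = 0.5   # Q4 cache; use 2.0 for FP16

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
for ctx in (16_384, 32_768):
    print(f"{ctx:>6} tokens: {per_token * ctx / 1e9:.2f} GB")
# Q4: ~1.1 GB at 16K, ~2.1 GB at 32K; FP16 is 4x that.
```

Under these assumptions, a 24 GB card holds the full 32K context with the Q4 cache from the TabbyAPI config below, while an FP16 cache makes ~16K the practical ceiling.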

## Usage with TabbyAPI

```yaml
# config.yml
network:
  host: 0.0.0.0
  port: 5000

model:
  model_dir: models
  model_name: Qwen3-32B-EXL3-4.0bpw
  max_seq_len: 32768
  cache_mode: Q4   # quantized KV cache; FP16/Q8/Q6 also available
```
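
Once the server is up, any OpenAI-compatible client works. A minimal sketch using `requests` (the endpoint path and `x-api-key` header follow TabbyAPI's defaults; substitute the key your TabbyAPI instance generated at first launch):

```python
# Query TabbyAPI's OpenAI-compatible chat endpoint.
import requests

resp = requests.post(
    "http://localhost:5000/v1/chat/completions",
    headers={"x-api-key": "YOUR_TABBY_API_KEY"},  # generated by TabbyAPI
    json={
        "model": "Qwen3-32B-EXL3-4.0bpw",
        "messages": [{"role": "user", "content": "Solve: What is 15% of 340?"}],
        "max_tokens": 512,
        "temperature": 0.6,
        "top_p": 0.95,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```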

## Usage with ExLlamaV3 Python

```python
from exllamav3 import Config, Model, Cache, Tokenizer, Generator

# Load the quantized model from its local directory
config = Config.from_directory("Qwen3-32B-EXL3-4.0bpw")
model = Model.from_config(config)
cache = Cache(model, max_num_tokens=32768)
model.load()
tokenizer = Tokenizer.from_config(config)
generator = Generator(model=model, cache=cache, tokenizer=tokenizer)

# ChatML prompt; Qwen3 opens with <think>...</think> reasoning by default
prompt = "<|im_start|>user\nSolve: What is 15% of 340?<|im_end|>\n<|im_start|>assistant\n"
output = generator.generate(prompt=prompt, max_new_tokens=512)
print(output)
```
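
Since the reasoning arrives inline in the completion, you will usually want to separate it from the final answer. A small stdlib-only helper (the function name is illustrative):

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Split a completion into (reasoning, answer) around <think>...</think>."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

reasoning, answer = split_thinking(output)
print("Answer:", answer)  # keep `reasoning` separately for logging/inspection
```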

## Recommended Settings

**Thinking Mode** (complex reasoning):

- Temperature: 0.6
- Top-P: 0.95
- Top-K: 20

**Non-Thinking Mode** (fast responses):

- Temperature: 0.7
- Top-P: 0.8
- Top-K: 20
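
Expressed as request parameters for the TabbyAPI endpoint shown earlier (`top_k` is a sampler extension TabbyAPI accepts beyond the core OpenAI schema):

```python
# The two presets above, ready to merge into a chat-completions payload.
THINKING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20}
NON_THINKING = {"temperature": 0.7, "top_p": 0.8, "top_k": 20}

payload = {
    "model": "Qwen3-32B-EXL3-4.0bpw",
    "messages": [{"role": "user", "content": "Explain GQA in two sentences."}],
    **NON_THINKING,
}
```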

## Original Model

This is a quantization of [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B). All credit for the base model goes to the Qwen team at Alibaba.

## License

Apache 2.0 (inherited from base model)
