# nanochat miniseries
This repository is part of a miniseries of small (~360M–480M parameter) decoder-only transformers
trained on top of Andrej Karpathy's nanochat codebase.
The series varies three axes: depth (model size), tokens-per-parameter (pretraining horizon),
and RoPE removal schedule (the fraction of the pretraining token budget spent with RoPE before it
is dropped for the remainder), used to study positional encodings in long-context generalization. A
subset of the SFT models is additionally fine-tuned on a long-context mixture (`_long` variants).
All models share the same tokenizer: a BPE tokenizer with vocab size 32,768 trained on ~2B characters of the pretraining corpus.
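Because the converted repos ship `tokenizer.json` (see the checkpoint-format section below), the shared tokenizer can be inspected directly. A small sketch, using the SFT conversion named later in this README:

```python
from transformers import AutoTokenizer

# Every -hf-* repo ships the same 32,768-entry BPE tokenizer.
tok = AutoTokenizer.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
print(tok.vocab_size)                  # expected: 32768
ids = tok.encode("nanochat miniseries")
print(ids, "->", tok.decode(ids))
```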
## Training pipeline
Each model goes through the following stages:
- Tokenizer training – 32,768-vocab BPE trained on ~2B characters of the pretraining dataset.
- Pretraining (base) – Next-token prediction on NVIDIA's ClimbMix-400B corpus, hosted at `karpathy/climbmix-400b-shuffle`. Horizon is controlled by `target_param_data_ratio` (aka "tpp" in model names), i.e. tokens trained per model parameter (see the step-count sketch after this list). Sequence length 4096, batch size 1,048,576 tokens, AdamW + Muon optimizer.
- Supervised fine-tuning (SFT) – Instruction tuning on a mixture of:
  - `HuggingFaceTB/smol-smoltalk` – 460K general conversations
  - Synthetic identity conversations (from karpathy-public S3) – 1K rows × 2 epochs
  - `cais/mmlu` `auxiliary_train` – 100K rows × 3 epochs (multiple choice)
  - `openai/gsm8k` `main` – 8K rows × 4 epochs (math + tool use)
  - SimpleSpelling – 200K synthetic spelling examples
  - SpellingBee – 80K synthetic letter-counting examples
- Long-context SFT (`_long` variants only) – Same mixture plus 100K rows of `allenai/tulu-v2-sft-long-mixture`, with sequence length extended to 8,192.
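For a sense of scale, the pretraining horizon converts to optimizer steps as total tokens (tpp × parameter count) divided by the batch size. A back-of-the-envelope sketch; the token budgets are copied from the table further down, and the batch size is the 1,048,576 tokens stated above:

```python
# Back-of-the-envelope: total tokens = tpp * parameter count, and optimizer
# steps = total tokens / batch size (1,048,576 tokens per step, as stated above).
# Token budgets below are copied from the budget table further down in this README.
TOKENS_PER_BATCH = 1_048_576

def pretraining_steps(total_tokens: float) -> int:
    return round(total_tokens / TOKENS_PER_BATCH)

for name, tokens in [("d18_9tpp", 2.92e9), ("d20_20tpp", 8.77e9), ("d20_40tpp", 17.54e9)]:
    print(f"{name}: ~{pretraining_steps(tokens):,} steps")
# -> d18_9tpp: ~2,785 steps, d20_20tpp: ~8,364 steps, d20_40tpp: ~16,727 steps
```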
## RoPE removal (drope) experiment
Model names containing drope_XX follow the recipe from
"Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings":
the model is pretrained normally with RoPE for the first XX% of its token budget, RoPE is then
removed, and the remaining (100 − XX)% of the pretraining budget is used to recalibrate the
model without positional encodings. For example, drope_50 means 50% of the token budget was
spent with RoPE and the remaining 50% was spent with RoPE removed. This is intended to preserve
the optimization benefits of RoPE early in training while producing a NoPE-style model that
generalizes better to long contexts at inference time. Models without drope in the name keep
RoPE in every layer for the full pretraining budget (theta = 100,000).
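For concreteness, here is a minimal sketch of how such a schedule splits a fixed token budget. This is only an illustration of the arithmetic, not the actual training script; the example budget is taken from the token-budget table below:

```python
# How a drope_XX schedule splits a fixed pretraining token budget.
# Illustration only; nanochat's actual scheduling logic may differ.
def drope_split(total_tokens: float, drope_pct: float):
    """Return (tokens trained with RoPE on, tokens trained with RoPE removed)."""
    rope_on = total_tokens * drope_pct / 100
    return rope_on, total_tokens - rope_on

# d20_20tpp_drope_50: same ~8.77B-token budget as d20_20tpp,
# first ~4.385B tokens with RoPE, remaining ~4.385B tokens with RoPE removed.
print(drope_split(8.77e9, 50))
```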
## Model sizes
| Depth | Layers | Hidden | Heads | Intermediate | Approx params |
|---|---|---|---|---|---|
| d18 | 18 | 1152 | 9 | 3072 | ~360M |
| d20 | 20 | 1280 | 10 | 3456 | ~480M |
All models use head_dim=128, vocab=32,768, RMSNorm (ε=1e-6), SwiGLU MLP, and final logit softcapping at 15.0.
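The approximate parameter counts in the table can be reproduced with a rough estimator. This is a sketch under my own assumptions (untied input/output embeddings, no biases, RMSNorm parameters ignored); the exact nanochat accounting may differ slightly:

```python
# Rough parameter-count estimate for a nanochat-style decoder-only transformer.
# Assumptions (mine, not from the training code): untied input/output embeddings,
# no biases, SwiGLU MLP with three hidden-by-intermediate matrices, norms ignored.
VOCAB = 32_768

def approx_params(layers: int, hidden: int, intermediate: int) -> int:
    attn = 4 * hidden * hidden            # Q, K, V, O projections
    mlp = 3 * hidden * intermediate       # SwiGLU: gate, up, down
    embeddings = 2 * VOCAB * hidden       # input embedding + output head
    return layers * (attn + mlp) + embeddings

print(f"d18: {approx_params(18, 1152, 3072) / 1e6:.0f}M")  # ~362M
print(f"d20: {approx_params(20, 1280, 3456) / 1e6:.0f}M")  # ~480M
```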
## Released checkpoints
RoPE schedule column: none means RoPE is kept on for the full pretraining budget. A percentage
(e.g. 50%) means RoPE is kept on for the first portion of the token budget and then removed for
the remaining (100 − XX)% of pretraining, per the drope recipe above.
| Model tag | Depth | tpp | RoPE schedule | Long-ctx SFT |
|---|---|---|---|---|
| d18_9tpp | 18 | 9 | none (always on) | no |
| d18_9tpp_drope_25 | 18 | 9 | 25% then removed | no |
| d18_9tpp_drope_50 | 18 | 9 | 50% then removed | no |
| d18_9tpp_drope_75 | 18 | 9 | 75% then removed | no |
| d18_20tpp | 18 | 20 | none (always on) | no |
| d18_20tpp_long | 18 | 20 | none (always on) | yes |
| d18_20tpp_drope_50 | 18 | 20 | 50% then removed | no |
| d18_20tpp_drope_50_long | 18 | 20 | 50% then removed | yes |
| d20_9tpp | 20 | 9 | none (always on) | no |
| d20_9tpp_drope_25 | 20 | 9 | 25% then removed | no |
| d20_9tpp_drope_50 | 20 | 9 | 50% then removed | no |
| d20_9tpp_drope_75 | 20 | 9 | 75% then removed | no |
| d20_20tpp | 20 | 20 | none (always on) | no |
| d20_20tpp_long | 20 | 20 | none (always on) | yes |
| d20_20tpp_drope_50 | 20 | 20 | 50% then removed | no |
| d20_20tpp_drope_50_long | 20 | 20 | 50% then removed | yes |
| d20_40tpp | 20 | 40 | none (always on) | no |
| d20_40tpp_long | 20 | 40 | none (always on) | yes |
| d20_40tpp_drope_50 | 20 | 40 | 50% then removed | no |
| d20_40tpp_drope_50_long | 20 | 40 | 50% then removed | yes |
tpp = tokens-per-parameter pretraining horizon. Total pretraining token budgets:
| Depth | tpp | Total pretraining tokens |
|---|---|---|
| d18 | 9 | ≈ 2.92 B |
| d18 | 20 | ≈ 6.49 B |
| d20 | 9 | ≈ 3.95 B |
| d20 | 20 | ≈ 8.77 B |
| d20 | 40 | ≈ 17.54 B |
drope variants use the same total token budget as their non-drope counterpart; the budget is
split between the RoPE-on and RoPE-removed phases as described above.
## Checkpoint format: which repo should I download?
For each model tag we publish four Hugging Face repositories:
| Repo suffix | Stage | Format | Use case |
|---|---|---|---|
| `...-base` | post-pretraining | nanochat native (`model_XXXXXX.pt`, `meta_*.json`, optimizer shard) | continue training / run with the nanochat repo |
| `...-sft` | post-SFT | nanochat native (`model_XXXXXX.pt`, `meta_*.json`, optimizer shard) | continue training / run with the nanochat repo |
| `...-hf-base` | post-pretraining | Hugging Face transformers (`config.json`, `model.safetensors`, `tokenizer.json`) | drop-in `AutoModelForCausalLM` loading |
| `...-hf-sft` | post-SFT | Hugging Face transformers (`config.json`, `model.safetensors`, `tokenizer.json`) | drop-in `AutoModelForCausalLM` loading |
The `base_checkpoints` and `chatsft_checkpoints` artifacts are the raw nanochat outputs. They include the optimizer state (`optim_*_rank0.pt`) and metadata (`meta_*.json` with training config, val BPB, step number, etc.), so you can resume training or evaluate with the nanochat scripts exactly as produced by `scripts.base_train` and `scripts.chat_sft`.

The `hf_base` and `hf_sft` artifacts are conversions of those same weights into the Hugging Face `transformers` layout (architecture name `NanoChatForCausalLM`, `model_type` `nanochat`). Load them with:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
```

`use_rope` in `config.json` reflects the drope setting: `true` for models that kept RoPE for the entire pretraining budget, and `false` for drope variants (where RoPE was removed partway through pretraining and the model was recalibrated without it). In the drope case, rotary embeddings are not applied at inference time.
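To check that flag programmatically, a small sketch using the same `d20_20tpp` SFT conversion shown above (which should report `use_rope: true`):

```python
from transformers import AutoConfig

# Inspect whether a converted checkpoint still applies rotary embeddings at inference.
cfg = AutoConfig.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
print(cfg.model_type, cfg.use_rope)  # expect: nanochat True (drope variants report False)
```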
Pick `-hf-base` / `-hf-sft` for inference. Pick `-base` / `-sft` only if you plan to continue
training inside the nanochat codebase.
## Inference sketch (HF format, SFT)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "crellis/nanochat-d20-20tpp-hf-sft"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()

messages = [{"role": "user", "content": "Why is the sky blue?"}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").cuda()
out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```
Base (pretrained-only) checkpoints are next-token predictors and do not understand the chat
template; use `-hf-base` for completion-style prompting and `-hf-sft` for chat.
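A matching completion-style sketch for a base checkpoint might look like this (the repo id is assumed from the `...-hf-base` suffix convention in the table above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Repo id assumed from the ...-hf-base naming convention above.
repo = "crellis/nanochat-d20-20tpp-hf-base"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()

# Plain next-token continuation, no chat template.
prompt = "The sky appears blue because"
inputs = tok(prompt, return_tensors="pt").input_ids.cuda()
out = model.generate(inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```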
## Training compute
All runs were trained on a single H100 GPU via Slurm. Pretraining wall-clock ranges from ~4 hours (d18 @ 9tpp) to ~15 hours (d20 @ 40tpp); SFT adds ~30–90 minutes depending on variant.
## Citation / acknowledgements
- Codebase: `karpathy/nanochat`
- Pretraining data: NVIDIA ClimbMix (via `karpathy/climbmix-400b-shuffle`)
- SFT data: HuggingFaceTB SmolTalk, CAIS MMLU, OpenAI GSM8K, AI2 Tulu-v2 long-mixture
- RoPE-removal recipe: "Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings" (arXiv:2512.12167)
## License
MIT (inherits from the nanochat repository).