# nanochat miniseries
This repository is part of a miniseries of small (~360M–480M parameter) decoder-only transformers
trained on top of Andrej Karpathy's nanochat codebase.
The series varies three axes: depth (model size), tokens-per-parameter (pretraining horizon),
and RoPE removal schedule (the fraction of the pretraining token budget spent with RoPE before it
is dropped for the remainder), used to study positional encodings in long-context generalization. A
subset of the SFT models is additionally fine-tuned on a long-context mixture (`_long` variants).
All models share the same tokenizer: a BPE tokenizer with vocab size 32,768 trained on ~2B characters of the pretraining corpus.
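Because the converted repos ship `tokenizer.json` (see the checkpoint-format section below), the shared tokenizer can be inspected directly. A small sketch, using the SFT conversion named later in this README:

```python
from transformers import AutoTokenizer

# Every -hf-* repo ships the same 32,768-entry BPE tokenizer.
tok = AutoTokenizer.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
print(tok.vocab_size)                  # expected: 32768
ids = tok.encode("nanochat miniseries")
print(ids, "->", tok.decode(ids))
```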
## Training pipeline
Each model goes through the following stages:
- Tokenizer training – 32,768-vocab BPE trained on ~2B characters of the pretraining dataset.
- Pretraining (base) – Next-token prediction on NVIDIA's ClimbMix-400B corpus, hosted at `karpathy/climbmix-400b-shuffle`. Horizon is controlled by `target_param_data_ratio` (aka "tpp" in model names), i.e. tokens trained per model parameter (see the step-count sketch after this list). Sequence length 4096, batch size 1,048,576 tokens, AdamW + Muon optimizer.
- Supervised fine-tuning (SFT) – Instruction tuning on a mixture of:
  - `HuggingFaceTB/smol-smoltalk` – 460K general conversations
  - Synthetic identity conversations (from karpathy-public S3) – 1K rows × 2 epochs
  - `cais/mmlu` `auxiliary_train` – 100K rows × 3 epochs (multiple choice)
  - `openai/gsm8k` `main` – 8K rows × 4 epochs (math + tool use)
  - SimpleSpelling – 200K synthetic spelling examples
  - SpellingBee – 80K synthetic letter-counting examples
- Long-context SFT (`_long` variants only) – Same mixture plus 100K rows of `allenai/tulu-v2-sft-long-mixture`, with sequence length extended to 8,192.
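For a sense of scale, the pretraining horizon converts to optimizer steps as total tokens (tpp × parameter count) divided by the batch size. A back-of-the-envelope sketch; the token budgets are copied from the table further down, and the batch size is the 1,048,576 tokens stated above:

```python
# Back-of-the-envelope: total tokens = tpp * parameter count, and optimizer
# steps = total tokens / batch size (1,048,576 tokens per step, as stated above).
# Token budgets below are copied from the budget table further down in this README.
TOKENS_PER_BATCH = 1_048_576

def pretraining_steps(total_tokens: float) -> int:
    return round(total_tokens / TOKENS_PER_BATCH)

for name, tokens in [("d18_9tpp", 2.92e9), ("d20_20tpp", 8.77e9), ("d20_40tpp", 17.54e9)]:
    print(f"{name}: ~{pretraining_steps(tokens):,} steps")
# -> d18_9tpp: ~2,785 steps, d20_20tpp: ~8,364 steps, d20_40tpp: ~16,727 steps
```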
## RoPE removal (drope) experiment
Model names containing drope_XX follow the recipe from
"Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings":
the model is pretrained normally with RoPE for the first XX% of its token budget, RoPE is then
removed, and the remaining (100 − XX)% of the pretraining budget is used to recalibrate the
model without positional encodings. For example, drope_50 means 50% of the token budget was
spent with RoPE and the remaining 50% was spent with RoPE removed. This is intended to preserve
the optimization benefits of RoPE early in training while producing a NoPE-style model that
generalizes better to long contexts at inference time. Models without drope in the name keep
RoPE in every layer for the full pretraining budget (theta = 100,000).
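For concreteness, here is a minimal sketch of how such a schedule splits a fixed token budget. This is only an illustration of the arithmetic, not the actual training script; the example budget is taken from the token-budget table below:

```python
# How a drope_XX schedule splits a fixed pretraining token budget.
# Illustration only; nanochat's actual scheduling logic may differ.
def drope_split(total_tokens: float, drope_pct: float):
    """Return (tokens trained with RoPE on, tokens trained with RoPE removed)."""
    rope_on = total_tokens * drope_pct / 100
    return rope_on, total_tokens - rope_on

# d20_20tpp_drope_50: same ~8.77B-token budget as d20_20tpp,
# first ~4.385B tokens with RoPE, remaining ~4.385B tokens with RoPE removed.
print(drope_split(8.77e9, 50))
```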
## Model sizes
| Depth | Layers | Hidden | Heads | Intermediate | Approx params |
|---|---|---|---|---|---|
| d18 | 18 | 1152 | 9 | 3072 | ~360M |
| d20 | 20 | 1280 | 10 | 3456 | ~480M |
All models use head_dim=128, vocab=32,768, RMSNorm (ε=1e-6), SwiGLU MLP, and final logit softcapping at 15.0.
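The approximate parameter counts in the table can be reproduced with a rough estimator. This is a sketch under my own assumptions (untied input/output embeddings, no biases, RMSNorm parameters ignored); the exact nanochat accounting may differ slightly:

```python
# Rough parameter-count estimate for a nanochat-style decoder-only transformer.
# Assumptions (mine, not from the training code): untied input/output embeddings,
# no biases, SwiGLU MLP with three hidden-by-intermediate matrices, norms ignored.
VOCAB = 32_768

def approx_params(layers: int, hidden: int, intermediate: int) -> int:
    attn = 4 * hidden * hidden            # Q, K, V, O projections
    mlp = 3 * hidden * intermediate       # SwiGLU: gate, up, down
    embeddings = 2 * VOCAB * hidden       # input embedding + output head
    return layers * (attn + mlp) + embeddings

print(f"d18: {approx_params(18, 1152, 3072) / 1e6:.0f}M")  # ~362M
print(f"d20: {approx_params(20, 1280, 3456) / 1e6:.0f}M")  # ~480M
```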
## Released checkpoints
RoPE schedule column: none means RoPE is kept on for the full pretraining budget. A percentage
(e.g. 50%) means RoPE is kept on for the first portion of the token budget and then removed for
the remaining (100 − XX)% of pretraining, per the drope recipe above.
| Model tag | Depth | tpp | RoPE schedule | Long-ctx SFT |
|---|---|---|---|---|
| d18_9tpp | 18 | 9 | none (always on) | no |
| d18_9tpp_drope_25 | 18 | 9 | 25% then removed | no |
| d18_9tpp_drope_50 | 18 | 9 | 50% then removed | no |
| d18_9tpp_drope_75 | 18 | 9 | 75% then removed | no |
| d18_20tpp | 18 | 20 | none (always on) | no |
| d18_20tpp_long | 18 | 20 | none (always on) | yes |
| d18_20tpp_drope_50 | 18 | 20 | 50% then removed | no |
| d18_20tpp_drope_50_long | 18 | 20 | 50% then removed | yes |
| d20_9tpp | 20 | 9 | none (always on) | no |
| d20_9tpp_drope_25 | 20 | 9 | 25% then removed | no |
| d20_9tpp_drope_50 | 20 | 9 | 50% then removed | no |
| d20_9tpp_drope_75 | 20 | 9 | 75% then removed | no |
| d20_20tpp | 20 | 20 | none (always on) | no |
| d20_20tpp_long | 20 | 20 | none (always on) | yes |
| d20_20tpp_drope_50 | 20 | 20 | 50% then removed | no |
| d20_20tpp_drope_50_long | 20 | 20 | 50% then removed | yes |
| d20_40tpp | 20 | 40 | none (always on) | no |
| d20_40tpp_long | 20 | 40 | none (always on) | yes |
| d20_40tpp_drope_50 | 20 | 40 | 50% then removed | no |
| d20_40tpp_drope_50_long | 20 | 40 | 50% then removed | yes |
tpp = tokens-per-parameter pretraining horizon. Total pretraining token budgets:
| Depth | tpp | Total pretraining tokens |
|---|---|---|
| d18 | 9 | ≈ 2.92 B |
| d18 | 20 | ≈ 6.49 B |
| d20 | 9 | ≈ 3.95 B |
| d20 | 20 | ≈ 8.77 B |
| d20 | 40 | ≈ 17.54 B |
drope variants use the same total token budget as their non-drope counterpart; the budget is
split between the RoPE-on and RoPE-removed phases as described above.
## Checkpoint format: which repo should I download?
For each model tag we publish four Hugging Face repositories:
| Repo suffix | Stage | Format | Use case |
|---|---|---|---|
| `...-base` | post-pretraining | nanochat native (`model_XXXXXX.pt`, `meta_*.json`, optimizer shard) | continue training / run with the nanochat repo |
| `...-sft` | post-SFT | nanochat native (`model_XXXXXX.pt`, `meta_*.json`, optimizer shard) | continue training / run with the nanochat repo |
| `...-hf-base` | post-pretraining | Hugging Face transformers (`config.json`, `model.safetensors`, `tokenizer.json`) | drop-in `AutoModelForCausalLM` loading |
| `...-hf-sft` | post-SFT | Hugging Face transformers (`config.json`, `model.safetensors`, `tokenizer.json`) | drop-in `AutoModelForCausalLM` loading |
The `base_checkpoints` and `chatsft_checkpoints` artifacts are the raw nanochat outputs. They include the optimizer state (`optim_*_rank0.pt`) and metadata (`meta_*.json` with training config, val BPB, step number, etc.), so you can resume training or evaluate with the nanochat scripts exactly as produced by `scripts.base_train` and `scripts.chat_sft`.

The `hf_base` and `hf_sft` artifacts are conversions of those same weights into the Hugging Face `transformers` layout (architecture name `NanoChatForCausalLM`, `model_type` `nanochat`). Load them with:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
```

`use_rope` in `config.json` reflects the drope setting: `true` for models that kept RoPE for the entire pretraining budget, and `false` for drope variants (where RoPE was removed partway through pretraining and the model was recalibrated without it). In the drope case, rotary embeddings are not applied at inference time.
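To check that flag programmatically, a small sketch using the same `d20_20tpp` SFT conversion shown above (which should report `use_rope: true`):

```python
from transformers import AutoConfig

# Inspect whether a converted checkpoint still applies rotary embeddings at inference.
cfg = AutoConfig.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
print(cfg.model_type, cfg.use_rope)  # expect: nanochat True (drope variants report False)
```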
Pick `-hf-base` / `-hf-sft` for inference. Pick `-base` / `-sft` only if you plan to continue
training inside the nanochat codebase.
## Inference sketch (HF format, SFT)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "crellis/nanochat-d20-20tpp-hf-sft"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()

messages = [{"role": "user", "content": "Why is the sky blue?"}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").cuda()
out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```
Base (pretrained-only) checkpoints are next-token predictors and do not understand the chat
template; use `-hf-base` for completion-style prompting and `-hf-sft` for chat.
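A matching completion-style sketch for a base checkpoint might look like this (the repo id is assumed from the `...-hf-base` suffix convention in the table above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Repo id assumed from the ...-hf-base naming convention above.
repo = "crellis/nanochat-d20-20tpp-hf-base"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()

# Plain next-token continuation, no chat template.
prompt = "The sky appears blue because"
inputs = tok(prompt, return_tensors="pt").input_ids.cuda()
out = model.generate(inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```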
## Training compute
All runs were trained on a single H100 GPU via Slurm. Pretraining wall-clock ranges from ~4 hours (d18 @ 9tpp) to ~15 hours (d20 @ 40tpp); SFT adds ~30–90 minutes depending on variant.
## Citation / acknowledgements
- Codebase: `karpathy/nanochat`
- Pretraining data: NVIDIA ClimbMix (via `karpathy/climbmix-400b-shuffle`)
- SFT data: HuggingFaceTB SmolTalk, CAIS MMLU, OpenAI GSM8K, AI2 Tulu-v2 long-mixture
- RoPE-removal recipe: "Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings" (arXiv:2512.12167)
## License
MIT (inherits from the nanochat repository).