nanochat miniseries

This repository is part of a miniseries of small (~360M-480M parameter) decoder-only transformers trained on top of Andrej Karpathy's nanochat codebase. The series varies three axes: depth (model size), tokens-per-parameter (pretraining horizon), and RoPE removal schedule (the fraction of the pretraining token budget spent with RoPE before it is dropped for the remainder, used to study how positional encodings affect long-context generalization). A subset of the SFT models is additionally fine-tuned on a long-context mixture (_long variants).

All models share the same tokenizer: a BPE tokenizer with vocab size 32,768 trained on ~2B characters of the pretraining corpus.
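
The shared tokenizer ships with every hf-* repo, so it can be sanity-checked directly. A minimal sketch (the repo name is just one example from the checkpoint table further down):

from transformers import AutoTokenizer

# Any of the -hf-base / -hf-sft repos carries the same shared tokenizer.
tok = AutoTokenizer.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
print(tok.vocab_size)                      # expected: 32768
print(tok.encode("nanochat miniseries"))   # BPE token ids from the shared vocabulary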

Training pipeline

Each model goes through the following stages:

  1. Tokenizer training - 32,768-vocab BPE trained on ~2B characters of the pretraining dataset.
  2. Pretraining (base) - Next-token prediction on NVIDIA's ClimbMix-400B corpus, hosted at karpathy/climbmix-400b-shuffle. The horizon is controlled by target_param_data_ratio (the "tpp" in model names), i.e. tokens trained per model parameter; see the sketch after this list. Sequence length 4096, batch size 1,048,576 tokens, AdamW + Muon optimizers.
  3. Supervised fine-tuning (SFT) - Instruction tuning on a mixture of:
    • HuggingFaceTB/smol-smoltalk - 460K general conversations
    • Synthetic identity conversations (from karpathy-public S3) - 1K rows × 2 epochs
    • cais/mmlu auxiliary_train - 100K rows × 3 epochs (multiple choice)
    • openai/gsm8k main - 8K rows × 4 epochs (math + tool use)
    • SimpleSpelling - 200K synthetic spelling examples
    • SpellingBee - 80K synthetic letter-counting examples
  4. Long-context SFT (_long variants only) - Same mixture plus 100K rows of allenai/tulu-v2-sft-long-mixture, with sequence length extended to 8,192.
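
To make the pretraining horizon in stage 2 concrete, here is a rough back-of-the-envelope sketch (not the nanochat training script; the parameter count used is the effective value implied by the token-budget table further down):

def pretraining_budget(n_params, tpp, tokens_per_step=1_048_576):
    # target_param_data_ratio ("tpp") sets the horizon: total tokens = tpp * parameter count.
    total_tokens = tpp * n_params
    # With a fixed batch of 1,048,576 tokens, the optimizer step count follows directly.
    steps = total_tokens // tokens_per_step
    return total_tokens, steps

# e.g. the d20 models at 20 tokens per parameter:
total_tokens, steps = pretraining_budget(n_params=438_500_000, tpp=20)
print(f"{total_tokens / 1e9:.2f}B tokens over ~{steps:,} optimizer steps")   # ~8.77B tokens, ~8,363 steps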

RoPE removal (drope) experiment

Model names containing drope_XX follow the recipe from "Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings": the model is pretrained normally with RoPE for the first XX% of its token budget, RoPE is then removed, and the remaining (100 - XX)% of the pretraining budget is used to recalibrate the model without positional encodings. For example, drope_50 means 50% of the token budget was spent with RoPE and the remaining 50% was spent with RoPE removed. This is intended to preserve the optimization benefits of RoPE early in training while producing a NoPE-style model that generalizes better to long contexts at inference time. Models without drope in the name keep RoPE in every layer for the full pretraining budget (theta = 100,000).
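
A minimal sketch of how such a schedule can be expressed (illustrative only; the function and variable names are mine, not nanochat's):

def rope_switch_off_step(total_tokens, drope_pct, tokens_per_step=1_048_576):
    # Returns the optimizer step at which RoPE would be removed, or None if it stays on.
    if drope_pct is None:                      # models without drope in the name
        return None
    total_steps = total_tokens // tokens_per_step
    return int(total_steps * drope_pct / 100)  # e.g. drope_50: switch halfway through

# e.g. d20 at 20 tpp (~8.77B tokens) with drope_50:
switch = rope_switch_off_step(8_770_000_000, 50)   # ~4,181 of ~8,363 steps use RoPE, the rest do not
# From `switch` onward, attention would skip the rotary transform on q/k (NoPE-style).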

Model sizes

| Depth | Layers | Hidden | Heads | Intermediate | Approx. params |
|-------|--------|--------|-------|--------------|----------------|
| d18   | 18     | 1152   | 9     | 3072         | ~360M          |
| d20   | 20     | 1280   | 10    | 3456         | ~480M          |

All models use head_dim=128, vocab=32,768, RMSNorm (ε = 1e-6), SwiGLU MLP, and final logit softcapping at 15.0.
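
For reference, RMSNorm and final logit softcapping typically take the following form; a minimal PyTorch sketch, not the nanochat implementation itself:

import torch

def rms_norm(x, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square over the last dimension (eps as above).
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def softcap(logits, cap=15.0):
    # Logit softcapping: squash the final logits smoothly into (-cap, cap) with a tanh.
    return cap * torch.tanh(logits / cap)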

Released checkpoints

RoPE schedule column: none means RoPE is kept on for the full pretraining budget. A percentage XX% (e.g. 50%) means RoPE is kept on for the first XX% of the token budget and then removed for the remaining (100 - XX)% of pretraining, per the drope recipe above.

| Model tag | Depth | tpp | RoPE schedule | Long-ctx SFT |
|---|---|---|---|---|
| d18_9tpp | 18 | 9 | none (always on) | no |
| d18_9tpp_drope_25 | 18 | 9 | 25% then removed | no |
| d18_9tpp_drope_50 | 18 | 9 | 50% then removed | no |
| d18_9tpp_drope_75 | 18 | 9 | 75% then removed | no |
| d18_20tpp | 18 | 20 | none (always on) | no |
| d18_20tpp_long | 18 | 20 | none (always on) | yes |
| d18_20tpp_drope_50 | 18 | 20 | 50% then removed | no |
| d18_20tpp_drope_50_long | 18 | 20 | 50% then removed | yes |
| d20_9tpp | 20 | 9 | none (always on) | no |
| d20_9tpp_drope_25 | 20 | 9 | 25% then removed | no |
| d20_9tpp_drope_50 | 20 | 9 | 50% then removed | no |
| d20_9tpp_drope_75 | 20 | 9 | 75% then removed | no |
| d20_20tpp | 20 | 20 | none (always on) | no |
| d20_20tpp_long | 20 | 20 | none (always on) | yes |
| d20_20tpp_drope_50 | 20 | 20 | 50% then removed | no |
| d20_20tpp_drope_50_long | 20 | 20 | 50% then removed | yes |
| d20_40tpp | 20 | 40 | none (always on) | no |
| d20_40tpp_long | 20 | 40 | none (always on) | yes |
| d20_40tpp_drope_50 | 20 | 40 | 50% then removed | no |
| d20_40tpp_drope_50_long | 20 | 40 | 50% then removed | yes |

tpp = tokens-per-parameter pretraining horizon. Total pretraining token budgets:

| Depth | tpp | Total pretraining tokens |
|---|---|---|
| d18 | 9 | ≈ 2.92B |
| d18 | 20 | ≈ 6.49B |
| d20 | 9 | ≈ 3.95B |
| d20 | 20 | ≈ 8.77B |
| d20 | 40 | ≈ 17.54B |

drope variants use the same total token budget as their non-drope counterpart; the budget is split between the RoPE-on and RoPE-removed phases as described above.
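
The model tags follow a regular pattern, so the axes can be recovered mechanically. A small parsing sketch (the tag grammar is inferred from the released names above; it is not an official API):

import re

TAG_RE = re.compile(r"^d(?P<depth>\d+)_(?P<tpp>\d+)tpp(?:_drope_(?P<drope>\d+))?(?P<long>_long)?$")

def parse_tag(tag):
    m = TAG_RE.match(tag)
    if m is None:
        raise ValueError(f"unrecognized tag: {tag}")
    return {
        "depth": int(m["depth"]),                                    # number of transformer layers
        "tpp": int(m["tpp"]),                                        # tokens-per-parameter horizon
        "rope_frac": int(m["drope"]) / 100 if m["drope"] else None,  # None = RoPE on for the full budget
        "long_ctx_sft": m["long"] is not None,                       # trained with the long-context SFT mixture
    }

print(parse_tag("d18_9tpp"))                 # {'depth': 18, 'tpp': 9, 'rope_frac': None, 'long_ctx_sft': False}
print(parse_tag("d20_20tpp_drope_50_long"))  # {'depth': 20, 'tpp': 20, 'rope_frac': 0.5, 'long_ctx_sft': True}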

Checkpoint format: which repo should I download?

For each model tag we publish four Hugging Face repositories:

| Repo suffix | Stage | Format | Use case |
|---|---|---|---|
| ...-base | post-pretraining | nanochat native (model_XXXXXX.pt, meta_*.json, optimizer shard) | continue training / run with the nanochat repo |
| ...-sft | post-SFT | nanochat native (model_XXXXXX.pt, meta_*.json, optimizer shard) | continue training / run with the nanochat repo |
| ...-hf-base | post-pretraining | Hugging Face transformers (config.json, model.safetensors, tokenizer.json) | drop-in AutoModelForCausalLM loading |
| ...-hf-sft | post-SFT | Hugging Face transformers (config.json, model.safetensors, tokenizer.json) | drop-in AutoModelForCausalLM loading |

  • The base_checkpoints and chatsft_checkpoints artifacts are the raw nanochat outputs. They include the optimizer state (optim_*_rank0.pt) and metadata (meta_*.json with training config, val BPB, step number, etc.), so you can resume training or evaluate with the nanochat scripts exactly as produced by scripts.base_train and scripts.chat_sft; a short inspection sketch follows below.

  • The hf_base and hf_sft artifacts are conversions of those same weights into the Hugging Face transformers layout (architecture name NanoChatForCausalLM, model_type nanochat). Load them with:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    model = AutoModelForCausalLM.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
    

    use_rope in config.json reflects the drope setting: true for models that kept RoPE for the entire pretraining budget, and false for drope variants (where RoPE was removed partway through pretraining and the model was recalibrated without it). In the drope case, rotary embeddings are not applied at inference time.
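
    To check the flag programmatically (a small sketch, same repo as above):

    from transformers import AutoConfig

    cfg = AutoConfig.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
    print(cfg.use_rope)   # True here (RoPE kept for the full budget); False for drope variants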

Pick -hf-base / -hf-sft for inference. Pick -base / -sft only if you plan to continue training inside the nanochat codebase.
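
If you do pick a native repo, its contents can be inspected without the nanochat code. A sketch assuming the file layout described above (the example repo name is taken from this card; the exact contents of the files are whatever the nanochat scripts wrote):

import glob, json
from huggingface_hub import snapshot_download

local_dir = snapshot_download("crellis/d18-9tpp-drope-75-chatsft_checkpoints")  # example native SFT repo
print(sorted(glob.glob(f"{local_dir}/*")))          # model_XXXXXX.pt, meta_XXXXXX.json, optim_XXXXXX_rank0.pt, ...
meta = json.load(open(glob.glob(f"{local_dir}/meta_*.json")[0]))
print(meta)                                         # training config, val BPB, step number, ...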

Inference sketch (HF format, SFT)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "crellis/nanochat-d20-20tpp-hf-sft"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()

# Build the chat-formatted prompt and generate a reply.
messages = [{"role": "user", "content": "Why is the sky blue?"}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").cuda()
out = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))

Base (pretrained-only) checkpoints are next-token predictors and do not understand the chat template; use -hf-base for completion-style prompting and -hf-sft for chat.
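
For completion-style prompting against a base checkpoint, the flow is the same minus the chat template (a sketch; the -hf-base repo name below is assumed by analogy with the -hf-sft name above):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "crellis/nanochat-d20-20tpp-hf-base"   # assumed -hf-base counterpart of the SFT repo above
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()

# Plain next-token continuation of a raw text prompt (no chat template).
inputs = tok("The sky is blue because", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))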

Training compute

All runs were trained on a single H100 GPU via Slurm. Pretraining wall-clock ranges from ~4 hours (d18 @ 9tpp) to ~15 hours (d20 @ 40tpp); SFT adds ~30-90 minutes depending on variant.

Citation / acknowledgements

License

MIT (inherits from the nanochat repository).
