# LUNA - 100M Parameter LLM from Scratch

A custom ~100M parameter GPT model (Pythia-like architecture) pretrained on 4.5B tokens of clean English text.
## Quick Start (RunPod / Cloud GPU)

### 1. Clone & Install (one command)

```bash
git clone https://huggingface.co/spaces/ASTERIZER/LUNA /workspace/LUNA && \
cd /workspace/LUNA && \
pip install -q -r requirements.txt
```
### 2. Get Dataset + Train (one command)

The dataset (~4.5B tokens) is hosted as a zip at `ASTERIZER/Luna_Dataset`. The script downloads, extracts, and starts training automatically.

From HuggingFace (recommended):

```bash
bash setup_and_train.sh huggingface ASTERIZER/Luna_Dataset
```

From Google Drive:

```bash
bash setup_and_train.sh gdrive YOUR_GDRIVE_FOLDER_ID
```

Smoke test (10M tokens only):

```bash
bash setup_and_train.sh huggingface ASTERIZER/Luna_Dataset 10000000
```
That's it. The script auto-detects your GPU, VRAM, RAM, and CPU cores, then configures everything for maximum utilization.
## How It Works

### Auto vs Manual Config

All hyperparameters live in `train_config.yaml`:

```yaml
auto_config: true   # auto-detect everything from hardware
auto_config: false  # use exact values below, no overrides
```
When `auto_config: true` (the default), the trainer:
- Probes VRAM via binary search to find the largest `micro_batch_size` (with an 82% safety margin)
- Sets `grad_accum` to hit the target `global_batch_size`
- Picks precision (bf16 on Ampere+, fp16 otherwise)
- Scales DataLoader workers to half your CPU cores, capped by available RAM
- Enables `torch.compile` if Triton is available (Linux)
When `auto_config: false`, every value in the YAML is used exactly as-is.
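The VRAM probe above is just a binary search over micro-batch sizes. A minimal sketch, using a fake memory model in place of a real forward/backward attempt (`probe_max_micro_batch` is a hypothetical helper name, not the repo's actual function):

```python
def probe_max_micro_batch(fits, lo=1, hi=256):
    """Binary-search the largest micro_batch_size for which fits(b) succeeds.

    In the real trainer, fits(b) would attempt one forward/backward pass at
    batch size b and return False on a CUDA out-of-memory error.
    """
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if fits(mid):
            best, lo = mid, mid + 1   # fits: try larger
        else:
            hi = mid - 1              # OOM: try smaller
    return best

# Fake memory model: batch b "fits" if b * 1.1 GB stays under 82% of 16 GB
max_b = probe_max_micro_batch(lambda b: b * 1.1 <= 16 * 0.82)
grad_accum = -(-120 // max_b)  # ceil-divide to reach global_batch_size = 120
```

With these toy numbers the probe settles on `max_b = 11`, and `grad_accum` rounds up so micro_batch × grad_accum covers the target global batch.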
### CLI Overrides

Any config value can be overridden from the command line:

```bash
python train.py --config train_config.yaml --data_path /data/litdata --max_tokens 100000000
```
Priority: CLI args > `train_config.yaml` > auto-detection
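That priority order amounts to a layered dict merge, where later sources win. A sketch (hypothetical helper, not the repo's actual code):

```python
def resolve_config(auto_detected, yaml_cfg, cli_args):
    """Merge config layers; later updates win: CLI > YAML > auto-detection.

    CLI flags that were not passed (None) must not clobber YAML values.
    """
    merged = dict(auto_detected)
    merged.update(yaml_cfg)
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged

cfg = resolve_config(
    auto_detected={"micro_batch_size": 12, "precision": "bf16"},
    yaml_cfg={"micro_batch_size": 8},
    cli_args={"max_tokens": 100_000_000, "micro_batch_size": None},
)
# YAML beats auto-detection; the unset CLI flag is ignored
```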
## Dataset
- 4,515,286,950 tokens (4.5B) in 270 binary chunks
- Sources: Wikipedia, FineWeb-Edu, OpenWebText (deduplicated, cleaned)
- Format: LitData binary (int32, block_size=1025, TokensLoader)
- Tokenizer: EleutherAI/pythia-160m (50,254 vocab)
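The `block_size=1025` is one more than the 1,024-token context: each stored block carries one extra token so inputs and next-token targets come from the same slice. An illustrative sketch with a stand-in array (real LitData chunk files have their own header/index layout):

```python
import numpy as np

block_size = 1025
tokens = np.arange(4 * block_size, dtype=np.int32)  # stand-in for one chunk
blocks = tokens.reshape(-1, block_size)

x = blocks[0][:-1]   # model input, length 1024
y = blocks[0][1:]    # next-token targets, length 1024
```

`y` is just `x` shifted by one position, which is why 1,025 stored tokens yield a full 1,024-step training example.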
## Model Architecture
| Parameter | Value |
|---|---|
| Layers | 10 |
| Hidden dim | 768 |
| Attention heads | 12 |
| Vocab size | 50,304 (padded) |
| Context length | 1,024 |
| Total params | ~109M (embeddings tied; ~70M non-embedding) |
| Rotary % | 25% |
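A back-of-envelope check on those numbers, ignoring LayerNorm and bias terms: each transformer block contributes roughly 12·d² parameters (about 4d² for attention plus 8d² for the 4× MLP), and the tied embedding adds vocab·d once.

```python
d, n_layer, vocab = 768, 10, 50_304

embedding = vocab * d          # tied input/output embedding, counted once
per_layer = 12 * d * d         # ~4d^2 attention + ~8d^2 MLP
total = embedding + n_layer * per_layer
print(f"~{total / 1e6:.1f}M parameters")  # ~109.4M, close to the reported 109,513,728
```

The ~70M figure in the table matches the non-embedding share (10 × 12 × 768² ≈ 70.8M); the small remainder comes from the LayerNorm and bias terms this estimate skips.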
## File Structure

```
LUNA/
  train.py              # Main training script (config-driven, auto-detects hardware)
  train_config.yaml     # All hyperparameters (auto_config: true/false)
  fetch_data.py         # Downloads dataset from HuggingFace / GDrive
  setup_and_train.sh    # One-command cloud entrypoint
  benchmark_runpod.py   # Local benchmark + RunPod cost calculator
  requirements.txt      # Python dependencies
  Base/
    checkpoints/EleutherAI/pythia-160m/   # Tokenizer files
    configs/                              # Legacy litgpt YAML configs (reference only)
    scripts/                              # Data preprocessing scripts
```
## Estimated Training Times (RunPod)

| GPU | $/hr | tok/s | Hours | Cost (USD) | Cost (INR) |
|---|---|---|---|---|---|
| RTX A5000 | $0.16 | ~6,400 | ~196h | ~$31 | ~2,700 |
| RTX 3090 | $0.22 | ~7,600 | ~165h | ~$36 | ~3,100 |
| RTX 4090 | $0.34 | ~10,000 | ~125h | ~$42 | ~3,600 |
| RTX 5090 | $0.69 | ~16,000 | ~78h | ~$54 | ~4,600 |
| H100 NVL | $2.59 | ~43,000 | ~29h | ~$75 | ~6,400 |
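The Hours and Cost columns follow directly from tokens ÷ throughput; for example, recomputing the A5000 row:

```python
tokens = 4_515_286_950   # full dataset
tok_per_s = 6_400        # RTX A5000 throughput from the table
usd_per_hr = 0.16

hours = tokens / tok_per_s / 3600
cost = hours * usd_per_hr
print(f"{hours:.0f}h, ${cost:.0f}")  # 196h, $31 -- matches the table row
```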
## Resume Training

Training auto-saves `latest.pt` every `save_interval` steps. If interrupted, just re-run the same command; it picks up where it left off.
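One detail worth preserving if you modify checkpointing: writing to a temp file and renaming keeps `latest.pt` valid even if the process dies mid-save. A sketch of the pattern (`save_atomic` is a hypothetical helper; the real trainer would serialize with `torch.save` rather than raw bytes):

```python
import os
from pathlib import Path

def save_atomic(payload: bytes, out_dir: str) -> Path:
    """Write latest.pt via temp file + os.replace so an interrupted save
    never leaves a truncated checkpoint behind."""
    path = Path(out_dir) / "latest.pt"
    tmp = path.with_suffix(".tmp")
    tmp.write_bytes(payload)
    os.replace(tmp, path)   # atomic rename on POSIX
    return path
```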
## Verified Configs (What Worked)

These are the exact configurations that produced the current LUNA 100M model. Do NOT change them unless you know what you're doing; they are proven and validated.

### 1. Pretraining - 4.5 Billion Tokens

The pretraining ran in two phases on an RTX 4060 Ti 16GB.

#### Phase 1: Bulk pretraining on 3B general web tokens
| Parameter | Value |
|---|---|
| Dataset | `litdata_3b` - deduplicated, quality-filtered (score ≥ 0.96) general web |
| Total tokens | 3,000,000,000 (3B) |
| Precision | bf16-mixed |
| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
| Sequence length | 1024 |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
| LR schedule | Cosine decay with 500-step warmup |
| Gradient clip | max_norm=1.0 |
| Checkpoints | Every 1000 steps |
| Seed | 1337 |
| Tokenizer | EleutherAI/pythia-160m (vocab 50,254) |
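Those batch numbers translate into optimizer steps as follows (each step consumes global_batch × seq_len tokens):

```python
micro_batch, grad_accum, seq_len = 12, 10, 1024

global_batch = micro_batch * grad_accum    # 120, as in the table
tokens_per_step = global_batch * seq_len   # 122,880 tokens per optimizer step
steps_phase1 = 3_000_000_000 // tokens_per_step
print(steps_phase1)  # ~24,414 optimizer steps for the 3B-token phase
```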
#### Phase 2: Continued pretraining on clean English (Wikipedia + FineWeb-Edu)
| Parameter | Value |
|---|---|
| Dataset | `litdata_english` - ultra-clean Wikipedia + FineWeb-Edu |
| Total tokens | 150,000,000 (150M); ~3 epochs over ~50M unique tokens |
| Init weights | Phase 1 checkpoint (`custom-100m-3b-full/final_raw`) |
| Precision | bf16-mixed |
| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
| Sequence length | 1024 |
| Optimizer | AdamW (lr=1e-4, min_lr=1e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
| LR schedule | Cosine decay with 200-step warmup |
| Gradient clip | max_norm=1.0 |
| Checkpoints | Every 500 steps |
Final combined dataset used for the production run:
| Parameter | Value |
|---|---|
| Dataset | `litdata_pretrain_final` - all sources merged |
| Total tokens | 4,515,286,950 (~4.5B) in 270 chunks |
| Sources | Wikipedia, FineWeb-Edu, OpenWebText (deduplicated, cleaned, pure English) |
| Format | LitData binary (int32, block_size=1025, EOS=0) |
| Config file | train_config.yaml |
| Precision | bf16 |
| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
| Sequence length | 1024 |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
| LR schedule | Cosine with 500-step warmup (5% of total steps when auto) |
| Gradient clip | max_norm=1.0 |
| torch.compile | true (Linux/cloud with Triton) |
| auto_config | true (probes VRAM, CPU, RAM at runtime) |
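The stated schedule (linear warmup, then cosine decay from lr=6e-4 to min_lr=6e-5) can be written in a few lines. This is a sketch of the standard formula, not the repo's exact function:

```python
import math

def lr_at(step, max_steps, lr=6e-4, min_lr=6e-5, warmup=500):
    """Linear warmup to lr, then cosine decay down to min_lr by max_steps."""
    if step < warmup:
        return lr * (step + 1) / warmup
    t = (step - warmup) / max(1, max_steps - warmup)   # progress in [0, 1]
    return min_lr + 0.5 * (lr - min_lr) * (1 + math.cos(math.pi * t))
```

At `step = warmup` this returns exactly `lr`, and at `step = max_steps` exactly `min_lr`, matching the table's endpoints.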
### 2. SFT Fine-Tuning - ~145 Million Tokens
Supervised fine-tuning on the pretrained LUNA 100M checkpoint.
| Parameter | Value |
|---|---|
| Dataset | `Base/Datasets/sft_clean/` - 574,996 train + 5,808 val samples |
| Format | Alpaca JSON (instruction / input / output) |
| Estimated tokens | ~145M unique (574,996 samples × ~250 tokens avg), seen twice over 2 epochs |
| Epochs | 2 |
| Config file | sft_config.yaml |
Model (frozen architecture; matches pretrain exactly):
| Parameter | Value |
|---|---|
| vocab_size | 50,304 (padded to a multiple of 128) |
| seq_len | 1024 |
| n_layer | 10 |
| n_embd | 768 |
| n_head | 12 |
| Rotary % | 25% |
| Total params | 109,513,728 |
Training hyperparameters:
| Parameter | Value |
|---|---|
| Optimizer | AdamW (lr=1.5e-5, min_lr=1e-6, weight_decay=0.01, betas=[0.9, 0.95]) |
| Precision | bf16 |
| Global batch size | 64 (micro_batch=8 × grad_accum=8) |
| LR warmup | 200 steps |
| Gradient clip | max_norm=1.0 |
| Save interval | Every 500 steps |
| Eval interval | Every 500 steps (runs val loss + eval prompts) |
| DataLoader | 4 workers, pin_memory=true |
| torch.compile | false |
Prompt format (used during training; must be matched at inference):

```
### Instruction:
{instruction}

### Response:
```

With optional input field:

```
### Instruction:
{instruction}

### Input:
{input}

### Response:
```
Loss masking: only the response tokens (after `### Response:\n`) contribute to the loss. The prompt tokens are masked out (`loss_mask=0`). An EOS token (id=0) is appended to every response.
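The masking rule can be sketched as follows (`build_example` is a hypothetical helper; the token IDs here are arbitrary stand-ins for pythia tokenizer output):

```python
def build_example(prompt_ids, response_ids, eos_id=0):
    """Concatenate prompt + response + EOS; supervise only the response.

    loss_mask is 0 over the prompt and 1 over the response and the
    appended EOS, matching the rule described above.
    """
    input_ids = prompt_ids + response_ids + [eos_id]
    loss_mask = [0] * len(prompt_ids) + [1] * (len(response_ids) + 1)
    return input_ids, loss_mask

ids, mask = build_example(prompt_ids=[11, 22, 33], response_ids=[44, 55])
# ids  == [11, 22, 33, 44, 55, 0]
# mask == [0, 0, 0, 1, 1, 1]
```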
### 3. SFT Inference / Chat - Loaded Configs

These are the exact generation parameters loaded when running `chat.py` or `validate_sft.py`. They match the training eval config from `sft_train.py`.

```bash
python chat.py --ckpt "Base\out\sft\model.pth"
```
Model loading:
| Parameter | Value |
|---|---|
| Checkpoint | Base/out/sft/model.pth (419 MB, raw state_dict, 154 keys) |
| Checkpoint format | Raw state_dict; NOT wrapped in a `{"model": ...}` dict |
| Tokenizer | Base/checkpoints/EleutherAI/pythia-160m (vocab 50,254) |
| EOS token ID | 0 (pythia tokenizer; NOT 50276) |
| Device | auto (CUDA if available, else CPU) |
| Precision | float32 at inference (weights loaded as-is from bf16-trained ckpt) |
Generation parameters:
| Parameter | Value | Why |
|---|---|---|
| temperature | 0.7 | Balanced creativity vs coherence |
| top_k | 40 | Matches training eval (NOT 50) |
| top_p | 0.9 | Nucleus sampling cutoff |
| repetition_penalty | 1.0 | No penalty; matches training (NOT 1.1) |
| max_new_tokens | 150 | Matches training eval (NOT 256) |
Prompt template (must match training exactly):

```python
def format_prompt(instruction, context=""):
    if instruction and context:
        return f"### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n### Response:\n"
    else:
        return f"### Instruction:\n{instruction}\n\n### Response:\n"
```
Critical notes:
- There is NO Alpaca preamble text (e.g., "Below is an instruction..."); the model was never trained with one
- EOS token is id=0 (pythia), not 50276 (GPT-NeoX); using the wrong EOS causes the model to never stop
- Generation stops when EOS is produced OR `max_new_tokens` is reached
- For longer responses in chat, you can override: `--max_new 512`
- For less repetition in production, add: `--rep_pen 1.05`
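The table's parameters compose in the usual order: temperature scaling, then top-k, then nucleus (top-p) filtering before sampling. A NumPy sketch of one decoding step (the standard technique, not `chat.py`'s actual code):

```python
import numpy as np

def sample_next(logits, temperature=0.7, top_k=40, top_p=0.9, rng=None):
    """One decoding step: temperature -> top-k -> top-p -> sample."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k and top_k < len(logits):
        kth = np.sort(logits)[-top_k]              # k-th largest logit
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                # tokens by descending prob
    cut = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cut]                             # smallest set with mass >= top_p
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))
```

In a full loop, generation would stop as soon as this returns the EOS id 0 or after `max_new_tokens=150` steps.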
Validation results with these configs (100 complex examples):
| Metric | Value |
|---|---|
| Overall Grade | A |
| Avg Loss (CE) | 1.9167 |
| Avg Perplexity | 7.45 |
| Token Accuracy | 58.6% |
| BLEU-1 | 0.589 |
| BLEU-2 | 0.219 |
| Empty responses | 0/100 |
| Repetitive responses | 5/100 |
## License
Private / ASTERIZER 2026