# LUNA - 100M Parameter LLM from Scratch

A custom ~100M parameter GPT model (Pythia-like architecture) pretrained on 4.5B tokens of clean English text.
## Quick Start (RunPod / Cloud GPU)

### 1. Clone & Install (one command)

```bash
git clone https://huggingface.co/spaces/ASTERIZER/LUNA /workspace/LUNA && \
cd /workspace/LUNA && \
pip install -q -r requirements.txt
```
### 2. Get Dataset + Train (one command)

The dataset (~4.5B tokens) is hosted as a zip at `ASTERIZER/Luna_Dataset`. The script downloads, extracts, and starts training automatically.

From HuggingFace (recommended):

```bash
bash setup_and_train.sh huggingface ASTERIZER/Luna_Dataset
```

From Google Drive:

```bash
bash setup_and_train.sh gdrive YOUR_GDRIVE_FOLDER_ID
```

Smoke test (10M tokens only):

```bash
bash setup_and_train.sh huggingface ASTERIZER/Luna_Dataset 10000000
```
That's it. The script auto-detects your GPU, VRAM, RAM, and CPU cores, then configures everything for maximum utilization.
## How It Works

### Auto vs Manual Config

All hyperparameters live in `train_config.yaml`:

```yaml
auto_config: true   # auto-detect everything from hardware
auto_config: false  # use exact values below, no overrides
```
When `auto_config: true` (the default), the trainer:
- Probes VRAM via binary search to find the largest `micro_batch_size` (with an 82% safety margin)
- Sets `grad_accum` to hit the target `global_batch_size`
- Picks precision (bf16 on Ampere+, fp16 otherwise)
- Scales DataLoader workers to half your CPU cores, capped by available RAM
- Enables `torch.compile` if Triton is available (Linux)
When `auto_config: false`, every value in the YAML is used exactly as-is.
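The VRAM probe above is just a binary search over micro-batch sizes. A minimal sketch, using a fake memory model in place of a real forward/backward attempt (`probe_max_micro_batch` is a hypothetical helper name, not the repo's actual function):

```python
def probe_max_micro_batch(fits, lo=1, hi=256):
    """Binary-search the largest micro_batch_size for which fits(b) succeeds.

    In the real trainer, fits(b) would attempt one forward/backward pass at
    batch size b and return False on a CUDA out-of-memory error.
    """
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if fits(mid):
            best, lo = mid, mid + 1   # fits: try larger
        else:
            hi = mid - 1              # OOM: try smaller
    return best

# Fake memory model: batch b "fits" if b * 1.1 GB stays under 82% of 16 GB
max_b = probe_max_micro_batch(lambda b: b * 1.1 <= 16 * 0.82)
grad_accum = -(-120 // max_b)  # ceil-divide to reach global_batch_size = 120
```

With these toy numbers the probe settles on `max_b = 11`, and `grad_accum` rounds up so micro_batch × grad_accum covers the target global batch.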
### CLI Overrides

Any config value can be overridden from the command line:

```bash
python train.py --config train_config.yaml --data_path /data/litdata --max_tokens 100000000
```
Priority: CLI args > `train_config.yaml` > auto-detection
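That priority order amounts to a layered dict merge, where later sources win. A sketch (hypothetical helper, not the repo's actual code):

```python
def resolve_config(auto_detected, yaml_cfg, cli_args):
    """Merge config layers; later updates win: CLI > YAML > auto-detection.

    CLI flags that were not passed (None) must not clobber YAML values.
    """
    merged = dict(auto_detected)
    merged.update(yaml_cfg)
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged

cfg = resolve_config(
    auto_detected={"micro_batch_size": 12, "precision": "bf16"},
    yaml_cfg={"micro_batch_size": 8},
    cli_args={"max_tokens": 100_000_000, "micro_batch_size": None},
)
# YAML beats auto-detection; the unset CLI flag is ignored
```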
## Dataset
- 4,515,286,950 tokens (4.5B) in 270 binary chunks
- Sources: Wikipedia, FineWeb-Edu, OpenWebText (deduplicated, cleaned)
- Format: LitData binary (int32, block_size=1025, TokensLoader)
- Tokenizer: EleutherAI/pythia-160m (50,254 vocab)
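The `block_size=1025` is one more than the 1,024-token context: each stored block carries one extra token so inputs and next-token targets come from the same slice. An illustrative sketch with a stand-in array (real LitData chunk files have their own header/index layout):

```python
import numpy as np

block_size = 1025
tokens = np.arange(4 * block_size, dtype=np.int32)  # stand-in for one chunk
blocks = tokens.reshape(-1, block_size)

x = blocks[0][:-1]   # model input, length 1024
y = blocks[0][1:]    # next-token targets, length 1024
```

`y` is just `x` shifted by one position, which is why 1,025 stored tokens yield a full 1,024-step training example.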
## Model Architecture
| Parameter | Value |
|---|---|
| Layers | 10 |
| Hidden dim | 768 |
| Attention heads | 12 |
| Vocab size | 50,304 (padded) |
| Context length | 1,024 |
| Total params | ~109M (embeddings tied; ~70M non-embedding) |
| Rotary % | 25% |
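A back-of-envelope check on those numbers, ignoring LayerNorm and bias terms: each transformer block contributes roughly 12·d² parameters (about 4d² for attention plus 8d² for the 4× MLP), and the tied embedding adds vocab·d once.

```python
d, n_layer, vocab = 768, 10, 50_304

embedding = vocab * d          # tied input/output embedding, counted once
per_layer = 12 * d * d         # ~4d^2 attention + ~8d^2 MLP
total = embedding + n_layer * per_layer
print(f"~{total / 1e6:.1f}M parameters")  # ~109.4M, close to the reported 109,513,728
```

The ~70M figure in the table matches the non-embedding share (10 × 12 × 768² ≈ 70.8M); the small remainder comes from the LayerNorm and bias terms this estimate skips.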
## File Structure

```
LUNA/
  train.py              # Main training script (config-driven, auto-detects hardware)
  train_config.yaml     # All hyperparameters (auto_config: true/false)
  fetch_data.py         # Downloads dataset from HuggingFace / GDrive
  setup_and_train.sh    # One-command cloud entrypoint
  benchmark_runpod.py   # Local benchmark + RunPod cost calculator
  requirements.txt      # Python dependencies
  Base/
    checkpoints/EleutherAI/pythia-160m/   # Tokenizer files
    configs/                              # Legacy litgpt YAML configs (reference only)
    scripts/                              # Data preprocessing scripts
```
## Estimated Training Times (RunPod)

| GPU | $/hr | tok/s | Hours | Cost (USD) | Cost (INR) |
|---|---|---|---|---|---|
| RTX A5000 | $0.16 | ~6,400 | ~196h | ~$31 | ~2,700 |
| RTX 3090 | $0.22 | ~7,600 | ~165h | ~$36 | ~3,100 |
| RTX 4090 | $0.34 | ~10,000 | ~125h | ~$42 | ~3,600 |
| RTX 5090 | $0.69 | ~16,000 | ~78h | ~$54 | ~4,600 |
| H100 NVL | $2.59 | ~43,000 | ~29h | ~$75 | ~6,400 |
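The Hours and Cost columns follow directly from tokens ÷ throughput; for example, recomputing the A5000 row:

```python
tokens = 4_515_286_950   # full dataset
tok_per_s = 6_400        # RTX A5000 throughput from the table
usd_per_hr = 0.16

hours = tokens / tok_per_s / 3600
cost = hours * usd_per_hr
print(f"{hours:.0f}h, ${cost:.0f}")  # 196h, $31 -- matches the table row
```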
## Resume Training

Training auto-saves `latest.pt` every `save_interval` steps. If interrupted, just re-run the same command; it picks up where it left off.
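One detail worth preserving if you modify checkpointing: writing to a temp file and renaming keeps `latest.pt` valid even if the process dies mid-save. A sketch of the pattern (`save_atomic` is a hypothetical helper; the real trainer would serialize with `torch.save` rather than raw bytes):

```python
import os
from pathlib import Path

def save_atomic(payload: bytes, out_dir: str) -> Path:
    """Write latest.pt via temp file + os.replace so an interrupted save
    never leaves a truncated checkpoint behind."""
    path = Path(out_dir) / "latest.pt"
    tmp = path.with_suffix(".tmp")
    tmp.write_bytes(payload)
    os.replace(tmp, path)   # atomic rename on POSIX
    return path
```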
## Verified Configs (What Worked)

These are the exact configurations that produced the current LUNA 100M model. Do NOT change them unless you know what you're doing; they are proven and validated.

### 1. Pretraining - 4.5 Billion Tokens

The pretraining ran in two phases on an RTX 4060 Ti 16GB.

#### Phase 1: Bulk pretraining on 3B general web tokens
| Parameter | Value |
|---|---|
| Dataset | `litdata_3b` - deduplicated, quality-filtered (score ≥ 0.96) general web |
| Total tokens | 3,000,000,000 (3B) |
| Precision | bf16-mixed |
| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
| Sequence length | 1024 |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
| LR schedule | Cosine decay with 500-step warmup |
| Gradient clip | max_norm=1.0 |
| Checkpoints | Every 1000 steps |
| Seed | 1337 |
| Tokenizer | EleutherAI/pythia-160m (vocab 50,254) |
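Those batch numbers translate into optimizer steps as follows (each step consumes global_batch × seq_len tokens):

```python
micro_batch, grad_accum, seq_len = 12, 10, 1024

global_batch = micro_batch * grad_accum    # 120, as in the table
tokens_per_step = global_batch * seq_len   # 122,880 tokens per optimizer step
steps_phase1 = 3_000_000_000 // tokens_per_step
print(steps_phase1)  # ~24,414 optimizer steps for the 3B-token phase
```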
#### Phase 2: Continued pretraining on clean English (Wikipedia + FineWeb-Edu)
| Parameter | Value |
|---|---|
| Dataset | `litdata_english` - ultra-clean Wikipedia + FineWeb-Edu |
| Total tokens | 150,000,000 (150M); ~3 epochs over ~50M unique tokens |
| Init weights | Phase 1 checkpoint (`custom-100m-3b-full/final_raw`) |
| Precision | bf16-mixed |
| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
| Sequence length | 1024 |
| Optimizer | AdamW (lr=1e-4, min_lr=1e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
| LR schedule | Cosine decay with 200-step warmup |
| Gradient clip | max_norm=1.0 |
| Checkpoints | Every 500 steps |
Final combined dataset used for the production run:
| Parameter | Value |
|---|---|
| Dataset | `litdata_pretrain_final` - all sources merged |
| Total tokens | 4,515,286,950 (~4.5B) in 270 chunks |
| Sources | Wikipedia, FineWeb-Edu, OpenWebText (deduplicated, cleaned, pure English) |
| Format | LitData binary (int32, block_size=1025, EOS=0) |
| Config file | train_config.yaml |
| Precision | bf16 |
| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
| Sequence length | 1024 |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
| LR schedule | Cosine with 500-step warmup (5% of total steps when auto) |
| Gradient clip | max_norm=1.0 |
| torch.compile | true (Linux/cloud with Triton) |
| auto_config | true (probes VRAM, CPU, RAM at runtime) |
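The stated schedule (linear warmup, then cosine decay from lr=6e-4 to min_lr=6e-5) can be written in a few lines. This is a sketch of the standard formula, not the repo's exact function:

```python
import math

def lr_at(step, max_steps, lr=6e-4, min_lr=6e-5, warmup=500):
    """Linear warmup to lr, then cosine decay down to min_lr by max_steps."""
    if step < warmup:
        return lr * (step + 1) / warmup
    t = (step - warmup) / max(1, max_steps - warmup)   # progress in [0, 1]
    return min_lr + 0.5 * (lr - min_lr) * (1 + math.cos(math.pi * t))
```

At `step = warmup` this returns exactly `lr`, and at `step = max_steps` exactly `min_lr`, matching the table's endpoints.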
### 2. SFT Fine-Tuning - ~145 Million Tokens
Supervised fine-tuning on the pretrained LUNA 100M checkpoint.
| Parameter | Value |
|---|---|
| Dataset | `Base/Datasets/sft_clean/` - 574,996 train + 5,808 val samples |
| Format | Alpaca JSON (instruction / input / output) |
| Estimated tokens | ~145M unique (574,996 samples × ~250 tokens avg), seen twice over 2 epochs |
| Epochs | 2 |
| Config file | sft_config.yaml |
Model (frozen architecture; matches pretrain exactly):
| Parameter | Value |
|---|---|
| vocab_size | 50,304 (padded to a multiple of 128) |
| seq_len | 1024 |
| n_layer | 10 |
| n_embd | 768 |
| n_head | 12 |
| Rotary % | 25% |
| Total params | 109,513,728 |
Training hyperparameters:
| Parameter | Value |
|---|---|
| Optimizer | AdamW (lr=1.5e-5, min_lr=1e-6, weight_decay=0.01, betas=[0.9, 0.95]) |
| Precision | bf16 |
| Global batch size | 64 (micro_batch=8 × grad_accum=8) |
| LR warmup | 200 steps |
| Gradient clip | max_norm=1.0 |
| Save interval | Every 500 steps |
| Eval interval | Every 500 steps (runs val loss + eval prompts) |
| DataLoader | 4 workers, pin_memory=true |
| torch.compile | false |
Prompt format (used during training; must be matched at inference):

```
### Instruction:
{instruction}

### Response:
```

With optional input field:

```
### Instruction:
{instruction}

### Input:
{input}

### Response:
```
Loss masking: only the response tokens (after `### Response:\n`) contribute to the loss. The prompt tokens are masked out (`loss_mask=0`). An EOS token (id=0) is appended to every response.
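The masking rule can be sketched as follows (`build_example` is a hypothetical helper; the token IDs here are arbitrary stand-ins for pythia tokenizer output):

```python
def build_example(prompt_ids, response_ids, eos_id=0):
    """Concatenate prompt + response + EOS; supervise only the response.

    loss_mask is 0 over the prompt and 1 over the response and the
    appended EOS, matching the rule described above.
    """
    input_ids = prompt_ids + response_ids + [eos_id]
    loss_mask = [0] * len(prompt_ids) + [1] * (len(response_ids) + 1)
    return input_ids, loss_mask

ids, mask = build_example(prompt_ids=[11, 22, 33], response_ids=[44, 55])
# ids  == [11, 22, 33, 44, 55, 0]
# mask == [0, 0, 0, 1, 1, 1]
```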
### 3. SFT Inference / Chat - Loaded Configs

These are the exact generation parameters loaded when running `chat.py` or `validate_sft.py`. They match the training eval config from `sft_train.py`.

```bash
python chat.py --ckpt "Base\out\sft\model.pth"
```
Model loading:
| Parameter | Value |
|---|---|
| Checkpoint | Base/out/sft/model.pth (419 MB, raw state_dict, 154 keys) |
| Checkpoint format | Raw state_dict; NOT wrapped in a `{"model": ...}` dict |
| Tokenizer | Base/checkpoints/EleutherAI/pythia-160m (vocab 50,254) |
| EOS token ID | 0 (pythia tokenizer; NOT 50276) |
| Device | auto (CUDA if available, else CPU) |
| Precision | float32 at inference (weights loaded as-is from bf16-trained ckpt) |
Generation parameters:
| Parameter | Value | Why |
|---|---|---|
| temperature | 0.7 | Balanced creativity vs coherence |
| top_k | 40 | Matches training eval (NOT 50) |
| top_p | 0.9 | Nucleus sampling cutoff |
| repetition_penalty | 1.0 | No penalty; matches training (NOT 1.1) |
| max_new_tokens | 150 | Matches training eval (NOT 256) |
Prompt template (must match training exactly):

```python
def format_prompt(instruction, context=""):
    if instruction and context:
        return f"### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n### Response:\n"
    else:
        return f"### Instruction:\n{instruction}\n\n### Response:\n"
```
Critical notes:
- There is NO Alpaca preamble text (e.g., "Below is an instruction..."); the model was never trained with one
- EOS token is id=0 (pythia), not 50276 (GPT-NeoX); using the wrong EOS causes the model to never stop
- Generation stops when EOS is produced OR `max_new_tokens` is reached
- For longer responses in chat, you can override: `--max_new 512`
- For less repetition in production, add: `--rep_pen 1.05`
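The table's parameters compose in the usual order: temperature scaling, then top-k, then nucleus (top-p) filtering before sampling. A NumPy sketch of one decoding step (the standard technique, not `chat.py`'s actual code):

```python
import numpy as np

def sample_next(logits, temperature=0.7, top_k=40, top_p=0.9, rng=None):
    """One decoding step: temperature -> top-k -> top-p -> sample."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k and top_k < len(logits):
        kth = np.sort(logits)[-top_k]              # k-th largest logit
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                # tokens by descending prob
    cut = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cut]                             # smallest set with mass >= top_p
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))
```

In a full loop, generation would stop as soon as this returns the EOS id 0 or after `max_new_tokens=150` steps.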
Validation results with these configs (100 complex examples):
| Metric | Value |
|---|---|
| Overall Grade | A |
| Avg Loss (CE) | 1.9167 |
| Avg Perplexity | 7.45 |
| Token Accuracy | 58.6% |
| BLEU-1 | 0.589 |
| BLEU-2 | 0.219 |
| Empty responses | 0/100 |
| Repetitive responses | 5/100 |
## License
Private / ASTERIZER 2026