
LUNA - 100M Parameter LLM from Scratch

Custom ~100M parameter GPT model (Pythia-like architecture) pretrained on 4.5B tokens of clean English text.

Quick Start (RunPod / Cloud GPU)

1. Clone & Install (one command)

git clone https://huggingface.co/spaces/ASTERIZER/LUNA /workspace/LUNA && \
cd /workspace/LUNA && \
pip install -q -r requirements.txt

2. Get Dataset + Train (one command)

The dataset (~4.5B tokens) is hosted as a zip at ASTERIZER/Luna_Dataset. The script downloads, extracts, and starts training automatically.

From HuggingFace (recommended):

bash setup_and_train.sh huggingface ASTERIZER/Luna_Dataset

From Google Drive:

bash setup_and_train.sh gdrive YOUR_GDRIVE_FOLDER_ID

Smoke test (10M tokens only):

bash setup_and_train.sh huggingface ASTERIZER/Luna_Dataset 10000000

That's it. The script auto-detects your GPU, VRAM, RAM, and CPU cores, and configures everything for maximum utilization.


How It Works

Auto vs Manual Config

All hyperparameters live in train_config.yaml:

auto_config: true   # auto-detect everything from hardware
auto_config: false  # use exact values below, no overrides

When auto_config: true (default), the trainer:

  • Probes VRAM via binary search to find max micro_batch_size (82% safety)
  • Sets grad_accum to hit the target global_batch_size
  • Picks precision (bf16 on Ampere+, fp16 otherwise)
  • Scales workers to half your CPU cores, capped by RAM
  • Enables torch.compile if Triton is available (Linux)

When auto_config: false, every value in the YAML is used exactly as-is.
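The VRAM probe described above can be sketched as a plain binary search. The helper below is illustrative, not train.py's actual internals: `try_step` stands in for a function that runs one forward/backward pass at a given batch size and raises an out-of-memory error when it doesn't fit.

```python
def probe_micro_batch(try_step, lo=1, hi=64):
    """Binary-search the largest micro_batch_size for which try_step succeeds.

    try_step(batch_size) should run one forward/backward pass and raise
    (e.g. torch.cuda.OutOfMemoryError, a RuntimeError subclass) on OOM.
    """
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        try:
            try_step(mid)
            best, lo = mid, mid + 1   # fits: search upward
        except RuntimeError:
            hi = mid - 1              # OOM: search downward
    # apply the ~82% safety margin so training never sits at the OOM edge
    return max(1, int(best * 0.82))
```

The safety factor matters because fragmentation and activation spikes during real training can exceed what a single probe step allocates.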

CLI Overrides

Any config value can be overridden from the command line:

python train.py --config train_config.yaml --data_path /data/litdata --max_tokens 100000000

Priority: CLI args > train_config.yaml > auto-detection
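That priority chain amounts to a layered dict merge; a minimal sketch (names here are illustrative, not the trainer's actual internals):

```python
def resolve_config(auto_detected, yaml_config, cli_overrides):
    """Merge config sources with priority: CLI args > YAML > auto-detection."""
    merged = dict(auto_detected)                    # lowest priority
    merged.update(yaml_config)                      # YAML beats auto-detection
    # CLI flags beat everything, but only when actually passed
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged
```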


Dataset

  • 4,515,286,950 tokens (4.5B) in 270 binary chunks
  • Sources: Wikipedia, FineWeb-Edu, OpenWebText (deduplicated, cleaned)
  • Format: LitData binary (int32, block_size=1025, TokensLoader)
  • Tokenizer: EleutherAI/pythia-160m (50,254 vocab)

Model Architecture

| Parameter | Value |
|---|---|
| Layers | 10 |
| Hidden dim | 768 |
| Attention heads | 12 |
| Vocab size | 50,304 (padded) |
| Context length | 1,024 |
| Total params | ~109M (70M unique, tied embeddings) |
| Rotary % | 25% |
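A back-of-the-envelope parameter count for this shape (a sketch that ignores LayerNorm and bias terms) lands close to the headline numbers:

```python
vocab, d_model, n_layer = 50_304, 768, 10

embedding = vocab * d_model           # token embedding, tied with the LM head
per_layer = 12 * d_model * d_model    # ~4*d^2 attention + ~8*d^2 MLP weights
blocks = n_layer * per_layer          # ~70.8M transformer weights

total = embedding + blocks            # ~109.4M, close to the reported ~109M
```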

File Structure

LUNA/
  train.py              # Main training script (config-driven, auto-detects hardware)
  train_config.yaml     # All hyperparameters (auto_config: true/false)
  fetch_data.py         # Downloads dataset from HuggingFace / GDrive
  setup_and_train.sh    # One-command cloud entrypoint
  benchmark_runpod.py   # Local benchmark + RunPod cost calculator
  requirements.txt      # Python dependencies
  Base/
    checkpoints/EleutherAI/pythia-160m/   # Tokenizer files
    configs/             # Legacy litgpt YAML configs (reference only)
    scripts/             # Data preprocessing scripts

Estimated Training Times (RunPod)

| GPU | $/hr | tok/s | Hours | Cost (USD) | Cost (INR) |
|---|---|---|---|---|---|
| RTX A5000 | $0.16 | ~6,400 | ~196h | ~$31 | ~2,700 |
| RTX 3090 | $0.22 | ~7,600 | ~165h | ~$36 | ~3,100 |
| RTX 4090 | $0.34 | ~10,000 | ~125h | ~$42 | ~3,600 |
| RTX 5090 | $0.69 | ~16,000 | ~78h | ~$54 | ~4,600 |
| H100 NVL | $2.59 | ~43,000 | ~29h | ~$75 | ~6,400 |
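Each row follows directly from the token count and throughput; a quick sanity check using the RTX A5000 figures from the table:

```python
TOTAL_TOKENS = 4_515_286_950  # full dataset, one pass

def estimate(tok_per_sec, usd_per_hour):
    """Wall-clock hours and USD cost to push every token through once."""
    hours = TOTAL_TOKENS / (tok_per_sec * 3600)
    return hours, hours * usd_per_hour

hours, usd = estimate(6_400, 0.16)   # RTX A5000 row: ~196h, ~$31
```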

Resume Training

Training auto-saves latest.pt every save_interval steps. If interrupted, just re-run the same command; training resumes from the latest checkpoint.


Verified Configs (What Worked)

These are the exact configurations that produced the current LUNA 100M model. Do NOT change them unless you know what you're doing: they are proven and validated.


1. Pretraining: 4.5 Billion Tokens

The pretraining ran in two phases on an RTX 4060 Ti 16GB.

Phase 1: Bulk pretraining on 3B general web tokens

| Parameter | Value |
|---|---|
| Dataset | litdata_3b (deduplicated, quality-filtered general web, score ≥ 0.96) |
| Total tokens | 3,000,000,000 (3B) |
| Precision | bf16-mixed |
| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
| Sequence length | 1024 |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
| LR schedule | Cosine decay with 500-step warmup |
| Gradient clip | max_norm=1.0 |
| Checkpoints | Every 1000 steps |
| Seed | 1337 |
| Tokenizer | EleutherAI/pythia-160m (vocab 50,254) |
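The cosine-decay-with-warmup schedule can be sketched as follows (defaults taken from the Phase 1 settings; train.py's exact implementation may differ slightly):

```python
import math

def get_lr(step, max_steps, warmup=500, max_lr=6e-4, min_lr=6e-5):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, max_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```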

Phase 2: Continued pretraining on clean English (Wikipedia + FineWeb-Edu)

| Parameter | Value |
|---|---|
| Dataset | litdata_english (ultra-clean Wikipedia + FineWeb-Edu) |
| Total tokens | 150,000,000 (150M), ~3 epochs over ~50M unique tokens |
| Init weights | Phase 1 checkpoint (custom-100m-3b-full/final_raw) |
| Precision | bf16-mixed |
| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
| Sequence length | 1024 |
| Optimizer | AdamW (lr=1e-4, min_lr=1e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
| LR schedule | Cosine decay with 200-step warmup |
| Gradient clip | max_norm=1.0 |
| Checkpoints | Every 500 steps |

Final combined dataset used for the production run:

| Parameter | Value |
|---|---|
| Dataset | litdata_pretrain_final (all sources merged) |
| Total tokens | 4,515,286,950 (~4.5B) in 270 chunks |
| Sources | Wikipedia, FineWeb-Edu, OpenWebText (deduplicated, cleaned pure English) |
| Format | LitData binary (int32, block_size=1025, EOS=0) |
| Config file | train_config.yaml |
| Precision | bf16 |
| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
| Sequence length | 1024 |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
| LR schedule | Cosine with 500-step warmup (5% of total steps when auto) |
| Gradient clip | max_norm=1.0 |
| torch.compile | true (Linux/cloud with Triton) |
| auto_config | true (probes VRAM, CPU, RAM at runtime) |

2. SFT Fine-Tuning: ~145 Million Tokens

Supervised fine-tuning on the pretrained LUNA 100M checkpoint.

| Parameter | Value |
|---|---|
| Dataset | Base/Datasets/sft_clean/ (574,996 train + 5,808 val samples) |
| Format | Alpaca JSON (instruction / input / output) |
| Estimated tokens | ~145M total (574,996 samples × ~250 tokens avg × 2 epochs) |
| Epochs | 2 |
| Config file | sft_config.yaml |
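For reference, an Alpaca-format record carries the three fields named above; the values in this sketch are made up, not taken from the dataset:

```python
import json

# Illustrative record only; field names match the card, values are hypothetical.
record = {
    "instruction": "Summarize the paragraph below in one sentence.",
    "input": "LUNA is a ~100M parameter GPT-style model pretrained on 4.5B tokens.",
    "output": "LUNA is a small GPT-style language model trained from scratch.",
}

line = json.dumps(record)   # one JSON object per training sample
```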

Model (frozen architecture; matches the pretraining run exactly):

| Parameter | Value |
|---|---|
| vocab_size | 50,304 (padded to 128 multiple) |
| seq_len | 1024 |
| n_layer | 10 |
| n_embd | 768 |
| n_head | 12 |
| Rotary % | 25% |
| Total params | 109,513,728 |

Training hyperparameters:

| Parameter | Value |
|---|---|
| Optimizer | AdamW (lr=1.5e-5, min_lr=1e-6, weight_decay=0.01, betas=[0.9, 0.95]) |
| Precision | bf16 |
| Global batch size | 64 (micro_batch=8 × grad_accum=8) |
| LR warmup | 200 steps |
| Gradient clip | max_norm=1.0 |
| Save interval | Every 500 steps |
| Eval interval | Every 500 steps (runs val loss + eval prompts) |
| DataLoader | 4 workers, pin_memory=true |
| torch.compile | false |

Prompt format (used during training; must be matched at inference):

### Instruction:
{instruction}

### Response:

With optional input field:

### Instruction:
{instruction}

### Input:
{input}

### Response:

Loss masking: Only the response tokens (after ### Response:\n) contribute to the loss. The prompt tokens are masked out (loss_mask=0). EOS token (id=0) is appended to every response.
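The masking scheme can be sketched as follows, using -100 as the conventional PyTorch ignore_index for cross-entropy (the actual SFT code may use a separate loss_mask tensor instead):

```python
def build_example(prompt_ids, response_ids, eos_id=0, ignore_index=-100):
    """Concatenate prompt + response + EOS; mask prompt positions from the loss."""
    input_ids = prompt_ids + response_ids + [eos_id]
    labels = [ignore_index] * len(prompt_ids) + response_ids + [eos_id]
    return input_ids, labels
```

Positions labeled with the ignore index contribute nothing to the loss, so only the response (plus the appended EOS) is learned.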


3. SFT Inference / Chat: Loaded Configs

These are the exact generation parameters loaded when running chat.py or validate_sft.py. They match the training eval config from sft_train.py.

python chat.py --ckpt Base/out/sft/model.pth

Model loading:

| Parameter | Value |
|---|---|
| Checkpoint | Base/out/sft/model.pth (419 MB, raw state_dict, 154 keys) |
| Checkpoint format | Raw state_dict, NOT wrapped in a {"model": ...} dict |
| Tokenizer | Base/checkpoints/EleutherAI/pythia-160m (vocab 50,254) |
| EOS token ID | 0 (pythia tokenizer, NOT 50276) |
| Device | auto (CUDA if available, else CPU) |
| Precision | float32 at inference (weights loaded as-is from bf16-trained ckpt) |
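Because the checkpoint is a raw state_dict, it can be passed straight to load_state_dict. A defensive loader that also tolerates wrapped checkpoints might look like this (hypothetical helper, not part of chat.py):

```python
def unwrap_state_dict(ckpt):
    """Return the bare state_dict whether or not it sits under a "model" key."""
    if isinstance(ckpt, dict) and isinstance(ckpt.get("model"), dict):
        return ckpt["model"]
    return ckpt

# usage sketch:
#   state = torch.load("Base/out/sft/model.pth", map_location="cpu")
#   model.load_state_dict(unwrap_state_dict(state))
```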

Generation parameters:

| Parameter | Value | Why |
|---|---|---|
| temperature | 0.7 | Balanced creativity vs coherence |
| top_k | 40 | Matches training eval (NOT 50) |
| top_p | 0.9 | Nucleus sampling cutoff |
| repetition_penalty | 1.0 | No penalty; matches training (NOT 1.1) |
| max_new_tokens | 150 | Matches training eval (NOT 256) |
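How temperature, top_k, and top_p interact can be sketched in plain Python; this is a simplified filtering step, not the actual chat.py sampler:

```python
import math

def candidate_probs(logits, temperature=0.7, top_k=40, top_p=0.9):
    """Return renormalized (token_id, prob) pairs surviving top-k then top-p."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]            # numerically stable softmax
    z = sum(exps)
    ranked = sorted(((e / z, i) for i, e in enumerate(exps)), reverse=True)
    ranked = ranked[:top_k]                             # top-k cutoff
    kept, cum = [], 0.0
    for p, i in ranked:                                 # nucleus (top-p) cutoff
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    total = sum(p for _, p in kept)
    return [(i, p / total) for i, p in kept]            # renormalize survivors
```

The final token is drawn from the returned distribution; generation then repeats until EOS (id=0) or max_new_tokens.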

Prompt template (must match training exactly):

def format_prompt(instruction, context=""):
    if instruction and context:
        return f"### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n### Response:\n"
    else:
        return f"### Instruction:\n{instruction}\n\n### Response:\n"

Critical notes:

  • There is NO Alpaca preamble text (e.g., "Below is an instruction..."); the model was never trained with one
  • EOS token is id=0 (pythia), not 50276 (GPT-NeoX); using the wrong EOS causes the model to never stop
  • Generation stops when EOS is produced OR max_new_tokens is reached
  • For longer responses in chat, you can override: --max_new 512
  • For less repetition in production, add: --rep_pen 1.05

Validation results with these configs (100 complex examples):

| Metric | Value |
|---|---|
| Overall Grade | A |
| Avg Loss (CE) | 1.9167 |
| Avg Perplexity | 7.45 |
| Token Accuracy | 58.6% |
| BLEU-1 | 0.589 |
| BLEU-2 | 0.219 |
| Empty responses | 0/100 |
| Repetitive responses | 5/100 |

License

Private / ASTERIZER 2026
