---
license: mit
language: en
tags:
- gpt2
- causal-lm
- pytorch
- transformer
- pretraining
- sft
- question-answering
- ultra-fineweb
- custom-dataset
model-index:
- name: gpt2-124m-qa
  results:
  - task:
      name: Question Answering
      type: text-generation
    dataset:
      name: Custom QA Dataset (JSONL)
      type: jsonl
    metrics:
    - name: Loss
      type: loss
      value: 0.65
---
# GPT-2 124M: Pretrained on Ultra-FineWeb Edu + QA SFT

This repository contains two trained checkpoints of a custom **GPT-2 124M** model:

- **Pretrained Model:** `model_09535.pt`, trained *from scratch* on **Ultra-FineWeb Edu (5B token subset)**
- **QA SFT Model:** `qa-sft_best.pt`, fine-tuned using **Supervised Fine-Tuning (SFT)** on a curated **custom Q&A dataset**

This model was implemented using a **from-scratch GPT-2 training pipeline**, *inspired by Andrej Karpathy's engineering approach*, but trained independently with different datasets and objectives.

---

## 📦 Model Versions

### **1. Pretrained Model (`model_09535.pt`)**

| Feature | Details |
|---------|---------|
| Parameters | 124M |
| Layers | 12 |
| Heads | 12 |
| Hidden size | 768 |
| Sequence length | 1024 |
| Vocab size | 50304 |
| Dataset | Ultra-FineWeb Edu (educational, high-quality web text) |
| Purpose | General language modeling |

**Goal:** Build a clean GPT-2 Small from scratch to understand and implement a full LLM training pipeline.

---

### **2. QA SFT Model (`qa-sft_best.pt`)**

| Feature | Details |
|---------|---------|
| Base | The pretrained model above |
| Method | Supervised Fine-Tuning (SFT) |
| Dataset | Custom JSONL Q&A dataset |
| Domain | Australian facts, general knowledge, definitions, reasoning |
| Use case | QA-style interactive chatbot |

Demo available at: 👉 **https://gpt2.devshubh.me**

---

# 🧠 Model Architecture

This model follows the **GPT-2 Small** architecture:

- Decoder-only transformer
- Multi-head self-attention
- GELU activations
- LayerNorm (pre-norm)
- Flash Attention enabled during training
- Learned positional embeddings
- Weight decay + AdamW (fused)
- Mixed precision (AMP FP16)

---

# 🛠️ Training Details

## **Pretraining on Ultra-FineWeb Edu (5B token subset)**

- **Dataset:** Ultra-FineWeb Edu (educational, high-quality text)
- **Tokenizer:** GPT-2 BPE (50304 vocab)
- **Steps:** Thousands of steps on a Kaggle T4
- **Techniques used:**
  - Flash Attention
  - Gradient accumulation
  - FP16 AMP
  - Cosine learning rate decay
  - Warmup
  - Fused AdamW
  - Weight decay
  - Checkpointing every 500 steps

---

## **Supervised Fine-Tuning (SFT) for QA**

- **Dataset:** Custom QA JSONL
- **Format:** `{"instruction": "...", "response": "..."}`
- **Loss:** Cross-entropy
- **Goal:** Improve chat quality and correctness for QA
- **Result:** Stable ~0.6–0.7 loss, improved reasoning
- **Tokens:** ~100K–200K from the curated dataset

---

# 📚 Datasets Used

### **Pretraining Dataset: Ultra-FineWeb Edu**

- Educational subset of Ultra-FineWeb
- High-quality English text
- Filtered for correctness
- Contains textbook-like explanations
- Clean enough to bootstrap small LLMs

### **Fine-Tuning Dataset: Custom QA JSONL**

- Australian knowledge
- Definitions
- Technology facts
- Simple reasoning questions
- Clean, short answers

---

# 🔤 Tokenizer

- GPT-2 BPE
- 50304 vocab
- Identical formatting to the GPT-2 tokenizer
- Tokenization done via `tiktoken` (see the quick sketch below)
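
For reference, the prompt encoding can be reproduced with stock `tiktoken`. This is a minimal sketch: the 50304 figure is assumed to be a padded vocab size used for efficiency, while the encoding itself emits standard GPT-2 token IDs.

```python
import tiktoken

# Stock GPT-2 BPE encoding; the model's embedding table is assumed to be
# padded to 50304, but the token IDs produced here are the standard ones.
enc = tiktoken.get_encoding("gpt2")

tokens = enc.encode("Q: What is the capital of Australia?\nA:")
print(tokens)              # list of token IDs
print(enc.decode(tokens))  # round-trips back to the original string
```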

---

# 💻 How to Use (Karpathy Repo)

### **1. Clone the repo**

```bash
git clone https://github.com/shubharthaksangharsha/karpathy
cd karpathy/chapter-9-sft-rhlf-dpo-gpt2-124m
```

### **2. Run inference**

```python
import torch
from model import GPT

ckpt = torch.load("model_09535.pt", map_location="cpu")
model = GPT(config=ckpt['config'])
model.load_state_dict(ckpt['model'])
model.eval()

out = model.generate("Who is the prime minister of Australia?", max_new_tokens=60)
print(out)
```

### **To run the QA model instead:**

```python
import torch
from model import GPT

ckpt = torch.load("qa-sft_best.pt", map_location="cpu")
model = GPT(config=ckpt['config'])
model.load_state_dict(ckpt['model'])
model.eval()

out = model.generate("What is the capital of Australia?", max_new_tokens=60)
print(out)
```

---

# 🤗 How to Use (Hugging Face Transformers)

Because this is a **Karpathy-format checkpoint**, you cannot load it directly using:

```python
AutoModelForCausalLM.from_pretrained(...)
```

Instead, load the state dict manually:

```python
import torch

ckpt = torch.load("model_09535.pt", map_location="cpu")
state_dict = ckpt["model"]  # raw parameter tensors, not an instantiated model
```

⚠️ A conversion script is required for full HF `.from_pretrained()` compatibility; a rough sketch of such a conversion is included at the end of this card.

---

# 📝 Example Inference (QA Model)

```python
import torch
from model import GPT
from tokenizer import GPT2Tokenizer

tokenizer = GPT2Tokenizer()

ckpt = torch.load("qa-sft_best.pt", map_location="cpu")
model = GPT(config=ckpt['config'])
model.load_state_dict(ckpt['model'])
model.eval()

prompt = "Q: What is the capital of Australia?\nA:"
tokens = tokenizer.encode(prompt)

out = model.generate(tokens, max_new_tokens=60)
print(tokenizer.decode(out))
```

---

# ⚠️ Limitations

- Only 124M parameters (not SOTA)
- Limited reasoning ability
- Trained on a small custom QA set
- Not RLHF-finetuned (SFT only)
- Not safety-aligned or filtered

---

# 📜 License

This work is based on Andrej Karpathy's "Neural Networks: Zero to Hero" course and follows the same educational license.
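
---

# 🔁 Appendix: Hugging Face Conversion Sketch

The snippet below is a rough, untested sketch of how such a conversion *could* look. It assumes the checkpoint uses nanoGPT-style parameter names (`transformer.wte.weight`, `transformer.h.{i}.attn.c_attn.weight`, ...); verify the key names against the actual checkpoint before relying on it. HF's GPT-2 implements the projections as `Conv1D`, so those weight matrices must be transposed relative to `nn.Linear`.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Hypothetical conversion sketch (not part of the repo): copy a nanoGPT-style
# state dict into a Hugging Face GPT-2 model so it can be saved with
# save_pretrained() and later loaded via from_pretrained().
ckpt = torch.load("model_09535.pt", map_location="cpu")
sd = ckpt["model"]

hf_model = GPT2LMHeadModel(GPT2Config(
    vocab_size=50304, n_positions=1024, n_embd=768, n_layer=12, n_head=12))
hf_sd = hf_model.state_dict()

# nn.Linear vs. HF Conv1D: these weights are stored transposed relative to HF.
transposed = ("attn.c_attn.weight", "attn.c_proj.weight",
              "mlp.c_fc.weight", "mlp.c_proj.weight")

with torch.no_grad():
    for name, param in sd.items():
        key = name.removeprefix("_orig_mod.")  # strip torch.compile prefix if present
        if key not in hf_sd:
            print(f"skipping unexpected key: {key}")
            continue
        hf_sd[key].copy_(param.t() if key.endswith(transposed) else param)

hf_model.save_pretrained("gpt2-124m-hf")  # hypothetical output directory
```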