---
license: mit
language: en
tags:
- gpt2
- causal-lm
- pytorch
- transformer
- pretraining
- sft
- question-answering
- ultra-fineweb
- custom-dataset
model-index:
- name: gpt2-124m-qa
  results:
  - task:
      name: Question Answering
      type: text-generation
    dataset:
      name: Custom QA Dataset (JSONL)
      type: jsonl
    metrics:
    - name: Loss
      type: loss
      value: 0.65
---
<p align="center">
<a href="https://huggingface.co/shubharthak/gpt2-124m-qa">
<img alt="Model Size" src="https://img.shields.io/badge/Model%20Size-124M-blue">
</a>
<a href="https://huggingface.co/shubharthak/gpt2-124m-qa">
<img alt="Downloads" src="https://img.shields.io/huggingface/dl-daily/shubharthak/gpt2-124m-qa">
</a>
<a href="https://huggingface.co/shubharthak/gpt2-124m-qa">
<img alt="Likes" src="https://img.shields.io/badge/HuggingFace-Likes-yellow">
</a>
<a href="https://huggingface.co/spaces/yuntian-deng/flash-attention">
<img alt="Flash Attention" src="https://img.shields.io/badge/Flash%20Attention-Enabled-brightgreen">
</a>
<a href="https://pytorch.org/">
<img alt="PyTorch" src="https://img.shields.io/badge/Framework-PyTorch-red">
</a>
<a href="https://huggingface.co/docs">
<img alt="Task" src="https://img.shields.io/badge/Task-QA%20%2F%20CausalLM-purple">
</a>
</p>
# GPT-2 124M — Pretrained on Ultra-FineWeb Edu + QA SFT
This repository contains two trained checkpoints of a custom **GPT-2 124M** model:
- **Pretrained Model:** `model_09535.pt`
→ Trained *from scratch* on **Ultra-FineWeb Edu (5B token subset)**
- **QA SFT Model:** `qa-sft_best.pt`
→ Fine-tuned using **Supervised Fine-Tuning (SFT)** on a curated **custom Q&A dataset**
This model was implemented using a **from-scratch GPT-2 training pipeline**, *inspired by Andrej Karpathy’s engineering approach*, but trained independently with different datasets and objectives.
---
## 📦 Model Versions
### **1. Pretrained Model (`model_09535.pt`)**
| Feature | Details |
|--------|---------|
| Parameters | 124M |
| Layers | 12 |
| Heads | 12 |
| Hidden size | 768 |
| Sequence length | 1024 |
| Vocab size | 50304 |
| Dataset | Ultra-FineWeb Edu (educational, high-quality web text) |
| Purpose | General language modeling |
**Goal:** Build a clean GPT-2 Small from scratch to understand and implement a full LLM training pipeline.
---
### **2. QA SFT Model (`qa-sft_best.pt`)**
| Feature | Details |
|--------|---------|
| Base | The pretrained model above |
| Method | Supervised Fine-Tuning (SFT) |
| Dataset | Custom JSONL Q&A dataset |
| Domain | Australian facts, general knowledge, definitions, reasoning |
| Use-case | QA-style interactive chatbot |
Demo available at:
👉 **https://gpt2.devshubh.me**
---
# 🧠 Model Architecture
This model follows the **GPT-2 Small** architecture:
- Decoder-only transformer
- Multi-Head Self-Attention
- GELU activations
- LayerNorm (Pre-Norm)
- Flash Attention enabled during training
- Positional embeddings
- Weight decay + AdamW (fused)
- Mixed Precision (AMP FP16)
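
For orientation, here is a minimal configuration sketch matching the numbers in the tables above (the field names are illustrative; the repo's actual `GPTConfig` may differ):

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # values from the tables above; field names are illustrative
    block_size: int = 1024    # maximum sequence length
    vocab_size: int = 50304   # GPT-2 BPE vocabulary (padded)
    n_layer: int = 12         # transformer blocks
    n_head: int = 12          # attention heads per block
    n_embd: int = 768         # hidden (embedding) size
```

In PyTorch, "Flash Attention" during training usually means `torch.nn.functional.scaled_dot_product_attention`, which dispatches to a flash kernel when one is available; the exact implementation here is repo-specific.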
---
# 🛠️ Training Details
## **Pretraining on Ultra-FineWeb Edu (5B token subset)**
- **Dataset:** Ultra-FineWeb Edu (educational, high-quality text)
- **Tokenizer:** GPT-2 BPE (50304 vocab)
- **Steps:** Several thousand optimizer steps on a Kaggle T4 GPU
- **Techniques used:**
- Flash Attention
- Gradient Accumulation
- FP16 AMP
- Cosine Learning Rate Decay
- Warmup
- Fused AdamW
- Weight Decay
- Checkpointing every 500 steps
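
A rough sketch of how the schedule and optimizer pieces above typically fit together (all hyperparameter values below are illustrative, not the exact settings used for this run):

```python
import math
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 8).to(device)   # stand-in for the actual GPT model

# illustrative values only, not this run's exact settings
max_lr, min_lr = 6e-4, 6e-5
warmup_steps, max_steps = 700, 19000

def get_lr(step: int) -> float:
    """Linear warmup, then cosine decay from max_lr down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)

# weight-decayed AdamW; the fused kernel only applies to CUDA tensors
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr,
                              betas=(0.9, 0.95), weight_decay=0.1,
                              fused=(device == "cuda"))

# inside the training loop the schedule is applied once per optimizer step:
# for group in optimizer.param_groups:
#     group["lr"] = get_lr(step)
```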
---
## **Supervised Fine-Tuning (SFT) for QA**
- **Dataset:** Custom QA JSONL
- **Format:** `{"instruction": "...", "response": "..."}`
- **Loss:** Cross-entropy
- **Goal:** Improve chat quality + correctness for QA
- **Result:** Stable training loss of ~0.6–0.7, with improved reasoning
- **Tokens:** ~100K–200K from the curated dataset
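
A minimal sketch of how one record in this format could be turned into a training example (the `Q:`/`A:` template mirrors the inference prompt shown later in this card; the exact template and any loss masking used during SFT are assumptions):

```python
import json
import tiktoken

enc = tiktoken.get_encoding("gpt2")

def build_example(line: str) -> list[int]:
    """Render one JSONL record as a token sequence for SFT."""
    record = json.loads(line)
    text = f"Q: {record['instruction']}\nA: {record['response']}"
    tokens = enc.encode(text)
    tokens.append(enc.eot_token)   # end-of-text delimiter between examples
    return tokens

# illustrative record, not taken from the actual dataset
sample = '{"instruction": "What is the capital of Australia?", "response": "Canberra."}'
print(build_example(sample)[:10])
```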
---
# 📚 Datasets Used
### **Pretraining Dataset: Ultra-FineWeb Edu**
- Educational subset of Ultra-FineWeb
- High-quality English text
- Filtered for correctness
- Contains textbook-like explanations
- Clean enough to bootstrap small LLMs
### **Fine-Tuning Dataset: Custom QA JSONL**
- Australian knowledge
- Definitions
- Technology facts
- Simple reasoning questions
- Clean short answers
---
# 🔤 Tokenizer
- GPT-2 BPE
- Vocab size: 50304
- Same encoding as the standard GPT-2 tokenizer
- Tokenization done via `tiktoken`
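
Encoding with `tiktoken` looks like this (note that the model's 50304-slot embedding table is larger than GPT-2's 50257 BPE tokens; the extra ids are presumably padding and are never produced by the tokenizer):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")   # standard GPT-2 BPE
print(enc.n_vocab)                    # 50257 actual tokens

tokens = enc.encode("Who is the prime minister of Australia?")
print(tokens)
print(enc.decode(tokens))
```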
---
# 💻 How to Use (Karpathy Repo)
### **1. Clone the repo**
```bash
git clone https://github.com/shubharthaksangharsha/karpathy
cd karpathy/chapter-9-sft-rhlf-dpo-gpt2-124m
```
### **2. Run inference**
```python
import torch
from model import GPT

# load the pretrained checkpoint (config + model weights)
ckpt = torch.load("model_09535.pt", map_location="cpu")
model = GPT(config=ckpt['config'])
model.load_state_dict(ckpt['model'])
model.eval()

# generate a continuation from a plain-text prompt
out = model.generate("Who is the prime minister of Australia?", max_new_tokens=60)
print(out)
```
### **To run the QA model instead:**
```python
import torch
from model import GPT

# same flow as above, but with the QA SFT checkpoint
ckpt = torch.load("qa-sft_best.pt", map_location="cpu")
model = GPT(config=ckpt['config'])
model.load_state_dict(ckpt['model'])
model.eval()

out = model.generate("What is the capital of Australia?", max_new_tokens=60)
print(out)
```
---
# 🤗 How to Use (Hugging Face Transformers)
Because this is a **Karpathy-format checkpoint**, you cannot load it directly using:
```python
AutoModelForCausalLM.from_pretrained(...)
```
Instead, load the state dict manually:
```python
import torch

# this gives the raw state dict, not a usable nn.Module;
# it still has to be loaded into the from-scratch GPT class
ckpt = torch.load("model_09535.pt", map_location="cpu")
state_dict = ckpt["model"]
```
⚠️ A conversion script is required for full HF `.from_pretrained()` compatibility.
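
For reference, a rough, untested outline of what such a conversion could look like, assuming the checkpoint uses nanoGPT-style parameter names that mirror Hugging Face's GPT-2 naming, with `nn.Linear` weights that must be transposed into HF's `Conv1D` layout:

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

ckpt = torch.load("model_09535.pt", map_location="cpu")
sd = ckpt["model"]
# drop a possible torch.compile prefix
sd = {k.replace("_orig_mod.", ""): v for k, v in sd.items()}

config = GPT2Config(vocab_size=50304, n_positions=1024,
                    n_embd=768, n_layer=12, n_head=12)
hf_model = GPT2LMHeadModel(config)

# weights stored as nn.Linear in the custom model but as Conv1D in HF need a transpose
transposed = (".attn.c_attn.weight", ".attn.c_proj.weight",
              ".mlp.c_fc.weight", ".mlp.c_proj.weight")
converted = {k: (v.t() if k.endswith(transposed) else v) for k, v in sd.items()}

missing, unexpected = hf_model.load_state_dict(converted, strict=False)
print("missing:", missing, "unexpected:", unexpected)
hf_model.save_pretrained("gpt2-124m-qa-hf")   # now loadable via from_pretrained
```

If the checkpoint's key names differ, the printed `missing`/`unexpected` lists show exactly what still needs to be remapped.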
---
# 📝 Example Inference (QA Model)
```python
import torch
from model import GPT
from tokenizer import GPT2Tokenizer

tokenizer = GPT2Tokenizer()

# load the QA SFT checkpoint
ckpt = torch.load("qa-sft_best.pt", map_location="cpu")
model = GPT(config=ckpt['config'])
model.load_state_dict(ckpt['model'])
model.eval()

# prompt in the Q:/A: format used for the QA examples
prompt = "Q: What is the capital of Australia?\nA:"
tokens = tokenizer.encode(prompt)
out = model.generate(tokens, max_new_tokens=60)
print(tokenizer.decode(out))
```
---
# ⚠️ Limitations
- Only 124M parameters (not SOTA)
- Limited reasoning ability
- Trained on small custom QA set
- Not RLHF-finetuned (only SFT)
- Not safety-aligned or filtered
---
# 📄 License
This work is based on Andrej Karpathy’s “Neural Networks: Zero to Hero” course and is released under the MIT license, matching the license of the original course code.