
GPT-2 124M — Pretrained on Ultra-FineWeb Edu + QA SFT

This repository contains two trained checkpoints of a custom GPT-2 124M model:

  • Pretrained Model: model_09535.pt
    → Trained from scratch on Ultra-FineWeb Edu (5B token subset)
  • QA SFT Model: qa-sft_best.pt
    → Supervised fine-tuning (SFT) of the pretrained model on a curated custom Q&A dataset

This model was implemented using a from-scratch GPT-2 training pipeline, inspired by Andrej Karpathy’s engineering approach, but trained independently with different datasets and objectives.


📦 Model Versions

1. Pretrained Model (model_09535.pt)

  • Parameters: 124M
  • Layers: 12
  • Heads: 12
  • Hidden size: 768
  • Sequence length: 1024
  • Vocab size: 50304
  • Dataset: Ultra-FineWeb Edu (educational, high-quality web text)
  • Purpose: General language modeling

Goal: Build a clean GPT-2 Small from scratch to understand and implement a full LLM training pipeline.


2. QA SFT Model (qa-sft_best.pt)

  • Base: The pretrained model above (model_09535.pt)
  • Method: Supervised Fine-Tuning (SFT)
  • Dataset: Custom JSONL Q&A dataset
  • Domain: Australian facts, general knowledge, definitions, reasoning
  • Use case: QA-style interactive chatbot

Demo available at:
👉 https://gpt2.devshubh.me


🧠 Model Architecture

This model follows the GPT-2 Small architecture (a minimal config sketch follows the list):

  • Decoder-only transformer
  • Multi-Head Self-Attention
  • GELU activations
  • LayerNorm (Pre-Norm)
  • Flash Attention enabled during training
  • Positional embeddings
  • Weight decay + AdamW (fused)
  • Mixed Precision (AMP FP16)
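
For reference, the hyperparameters above can be expressed as a nanoGPT-style config dataclass. The field names below are illustrative; the actual config object ships inside each checkpoint as ckpt['config']:

from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024    # maximum sequence length
    vocab_size: int = 50304   # GPT-2 BPE (50257) padded up for efficiency
    n_layer: int = 12         # transformer blocks
    n_head: int = 12          # attention heads per block
    n_embd: int = 768         # hidden size (~124M parameters total)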

🛠️ Training Details

Pretraining on Ultra-FineWeb Edu (5B token subset)

  • Dataset: Ultra-FineWeb Edu (educational, high-quality text)
  • Tokenizer: GPT-2 BPE (50304 vocab)
  • Steps: several thousand optimization steps on a Kaggle T4 GPU
  • Techniques used (a condensed training-loop sketch follows this list):
    • Flash Attention
    • Gradient Accumulation
    • FP16 AMP
    • Cosine Learning Rate Decay
    • Warmup
    • Fused AdamW
    • Weight Decay
    • Checkpointing every 500 steps
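
The sketch below shows one way these pieces typically fit together in a PyTorch training loop. Hyperparameter values are illustrative, and it assumes a nanoGPT-style model whose forward pass returns (logits, loss) plus a get_batch() data helper; it is not the repo's actual training script.

import math
import torch

def get_lr(step, warmup_steps, max_steps, max_lr, min_lr):
    """Linear warmup followed by cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (1.0 + math.cos(math.pi * ratio)) * (max_lr - min_lr)

def pretrain(model, get_batch, max_steps=20000, warmup_steps=700,
             max_lr=6e-4, min_lr=6e-5, grad_accum_steps=8):
    # fused AdamW with weight decay
    optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr,
                                  weight_decay=0.1, fused=True)
    scaler = torch.cuda.amp.GradScaler()            # FP16 AMP loss scaling

    for step in range(max_steps):
        for group in optimizer.param_groups:        # cosine LR schedule with warmup
            group["lr"] = get_lr(step, warmup_steps, max_steps, max_lr, min_lr)
        optimizer.zero_grad(set_to_none=True)
        for _ in range(grad_accum_steps):           # gradient accumulation
            x, y = get_batch()
            with torch.autocast(device_type="cuda", dtype=torch.float16):
                logits, loss = model(x, y)          # flash attention runs inside the model
            scaler.scale(loss / grad_accum_steps).backward()
        scaler.step(optimizer)
        scaler.update()
        if step % 500 == 0:                         # checkpoint every 500 steps
            torch.save({"model": model.state_dict(), "step": step},
                       f"model_{step:05d}.pt")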

Supervised Fine-Tuning (SFT) for QA

  • Dataset: Custom QA JSONL
  • Format: {"instruction": "...", "response": "..."} (see the data-prep sketch after this list)
  • Loss: Cross-entropy
  • Goal: Improve chat quality + correctness for QA
  • Result: Stable ~0.6–0.7 loss, improved reasoning
  • Tokens: ~100K–200K from curated dataset
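
A minimal sketch of turning this JSONL format into training sequences. The file name and Q:/A: template are assumptions (the template mirrors the inference examples further below), not necessarily the repo's exact code:

import json

import tiktoken
import torch

enc = tiktoken.get_encoding("gpt2")

def load_qa_examples(path="qa_dataset.jsonl"):  # hypothetical file name
    """Turn each {"instruction", "response"} record into one token sequence."""
    examples = []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            text = f"Q: {rec['instruction']}\nA: {rec['response']}"
            examples.append(torch.tensor(enc.encode(text)))
    return examples

# Fine-tuning then minimizes ordinary next-token cross-entropy over these
# sequences, the same objective as pretraining.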

📚 Datasets Used

Pretraining Dataset: Ultra-FineWeb Edu

  • Educational subset of Ultra-FineWeb
  • High-quality English text
  • Filtered for correctness
  • Contains textbook-like explanations
  • Clean enough to bootstrap small LLMs

Fine-Tuning Dataset: Custom QA JSONL

  • Australian knowledge
  • Definitions
  • Technology facts
  • Simple reasoning questions
  • Clean, short answers (an example record is shown below)
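
For illustration, one record in this JSONL format might look like the following (a made-up example in the dataset's format, not an actual line from it):

{"instruction": "What is the capital of Australia?", "response": "The capital of Australia is Canberra."}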

🔤 Tokenizer

  • GPT-2 BPE
  • 50,304-token vocabulary (the standard 50,257 GPT-2 merges, padded up for training efficiency)
  • Same encoding as the standard GPT-2 tokenizer
  • Tokenization done via tiktoken (see the round-trip example below)
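
A quick round-trip with tiktoken's GPT-2 encoding, which is how prompts are turned into token ids for this model:

import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("What is the capital of Australia?")
print(len(ids), ids[:5])   # token count and the first few ids
print(enc.decode(ids))     # round-trips back to the original string
print(enc.n_vocab)         # 50257; the model's embedding table is padded to 50304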

💻 How to Use (Karpathy Repo)

1. Clone the repo

git clone https://github.com/shubharthaksangharsha/karpathy
cd karpathy/chapter-9-sft-rhlf-dpo-gpt2-124m

2. Run inference

import torch
import tiktoken
from model import GPT

# Load the pretrained checkpoint and rebuild the model from its stored config
ckpt = torch.load("model_09535.pt", map_location="cpu")
model = GPT(config=ckpt['config'])
model.load_state_dict(ckpt['model'])
model.eval()

# Encode the prompt first; generate() expects token ids, as in the QA example below
enc = tiktoken.get_encoding("gpt2")
tokens = torch.tensor([enc.encode("Who is the prime minister of Australia?")])
out = model.generate(tokens, max_new_tokens=60)
print(enc.decode(out[0].tolist()))   # assumes generate() returns a (1, T) tensor of ids

To run the QA model instead:

import torch
import tiktoken
from model import GPT

ckpt = torch.load("qa-sft_best.pt", map_location="cpu")
model = GPT(config=ckpt['config'])
model.load_state_dict(ckpt['model'])
model.eval()

# Again, encode the prompt to token ids before calling generate()
enc = tiktoken.get_encoding("gpt2")
tokens = torch.tensor([enc.encode("What is the capital of Australia?")])
out = model.generate(tokens, max_new_tokens=60)
print(enc.decode(out[0].tolist()))

🤗 How to Use (Hugging Face Transformers)

Because this is a Karpathy-format checkpoint, you cannot load it directly using:

AutoModelForCausalLM.from_pretrained(...)

Instead, load the state dict manually:

import torch

# ckpt["model"] is a raw state dict of weights, not an nn.Module; it still needs
# the custom GPT class (or a conversion script) before it can be run.
ckpt = torch.load("model_09535.pt", map_location="cpu")
state_dict = ckpt["model"]

⚠️ A conversion script is required for full HF .from_pretrained() compatibility.
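
As a hedged starting point for such a script (not the actual conversion code), a matching transformers GPT2Config can be built from the architecture above; the remaining work is renaming the checkpoint's state-dict keys to Hugging Face's and transposing the attention/MLP weights, since HF's GPT-2 stores them in Conv1D layout:

from transformers import GPT2Config, GPT2LMHeadModel

# Mirror the custom checkpoint's architecture
hf_config = GPT2Config(
    vocab_size=50304,   # padded vocab used during training
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
)
hf_model = GPT2LMHeadModel(hf_config)

# hf_model.load_state_dict(converted_state_dict)
# where converted_state_dict comes from a key-renaming / weight-transposition
# pass over ckpt["model"], specific to the custom model.py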


📝 Example Inference (QA Model)

import torch
from model import GPT
from tokenizer import GPT2Tokenizer  # repo-local tokenizer wrapper

tokenizer = GPT2Tokenizer()

# Load the SFT checkpoint and rebuild the model from its stored config
ckpt = torch.load("qa-sft_best.pt", map_location="cpu")
model = GPT(config=ckpt['config'])
model.load_state_dict(ckpt['model'])
model.eval()

# Q:/A: prompt format used for the QA fine-tune
prompt = "Q: What is the capital of Australia?\nA:"
tokens = tokenizer.encode(prompt)
out = model.generate(tokens, max_new_tokens=60)
print(tokenizer.decode(out))

⚠️ Limitations

  • Only 124M parameters (not SOTA)
  • Limited reasoning ability
  • Trained on small custom QA set
  • Not RLHF-finetuned (only SFT)
  • Not safety-aligned or filtered

📄 License

This work is based on Andrej Karpathy’s "Neural Networks: Zero to Hero" course and follows the same educational license.
