mT5-Base KenSwQuAD Extractive (Stage 2)
Model Summary
This model is an intermediate research checkpoint developed as part of the KenSwQuAD Hierarchical Curriculum Learning project for Swahili Question Answering.
It is a google/mt5-base model that has undergone two stages of curriculum training:
- Stage 1: Fine-tuned on English SQuAD v2 to learn QA task structure
- Stage 2 (Current): Fine-tuned on extractive KenSwQuAD to learn Swahili morphology and syntax
The model learns to extract answer spans from Swahili text contexts given a question.
This is Stage 2 of a 3-Stage Pipeline:
- ✅ Stage 1: Structural Transfer (English SQuAD) → Learned "How to Answer"
- ✅ Stage 2 (Current): Morphological Alignment (Extractive KenSwQuAD) → Learned Swahili Syntax
- ⏳ Stage 3: Generative Refinement (Abstractive KenSwQuAD) → Will Learn Reasoning
Model Details
| Property | Value |
|---|---|
| Developed by | Benjamin Kikwai (kikwaib) |
| Model Type | Multilingual Sequence-to-Sequence (Encoder-Decoder) |
| Base Model | kikwaib/mt5-base-squad-transfer |
| Original Base | google/mt5-base |
| Language(s) | Swahili (sw), English (en), Multilingual |
| Task | Extractive Question Answering (Text-to-Text) |
| License | Apache 2.0 |
| Parameters | 582.4M |
| Vocabulary Size | 250,100 |
Intended Use
Primary Use Cases
- Swahili Question Answering: Extract answers from Swahili text given a question
- Transfer Learning: Serve as initialization for Swahili NLP tasks
- Research: Baseline for low-resource language QA experiments
How to Use
The model accepts input in the format `question: <question_text> context: <context_text>`:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "kikwaib/mt5-base-kenswquad-extractive"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

context = "Kenya ni nchi ya Afrika Mashariki. Nairobi ni mji mkuu wa Kenya. Kenya ina wakazi zaidi ya milioni 50."
question = "Mji mkuu wa Kenya ni upi?"

# Build the prompt in the expected "question: ... context: ..." format.
input_text = f"question: {question} context: {context}"
inputs = tokenizer(input_text, return_tensors="pt", max_length=1024, truncation=True)

# Generate the answer span (answers are short, so 128 tokens is ample).
outputs = model.generate(**inputs, max_length=128)
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(answer)
# Expected output: "Nairobi"
```
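Equivalently, the checkpoint can be loaded through the Transformers `text2text-generation` pipeline. The snippet below is a minimal sketch assuming the same prompt format as above.

```python
from transformers import pipeline

# Load the checkpoint as a text2text-generation pipeline (minimal sketch).
qa = pipeline("text2text-generation", model="kikwaib/mt5-base-kenswquad-extractive")

context = "Kenya ni nchi ya Afrika Mashariki. Nairobi ni mji mkuu wa Kenya."
question = "Mji mkuu wa Kenya ni upi?"

# The model expects the same "question: ... context: ..." prompt format.
result = qa(f"question: {question} context: {context}", max_length=128)
print(result[0]["generated_text"])  # e.g. "Nairobi"
```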
Limitations
- Optimized for extractive QA where the answer is a substring of the context
- May struggle with abstractive questions requiring reasoning or inference
- Performance may vary on domains outside the KenSwQuAD training data (primarily news and Wikipedia)
Training Data
Dataset: KenSwQuAD
The model was fine-tuned on the extractive subset of KenSwQuAD (Kenya Swahili Question Answering Dataset).
| Statistic | Value |
|---|---|
| Total QA Pairs Parsed | 7,497 |
| Extractive Pairs (Stage 2) | 5,069 (67.6%) |
| Abstractive Pairs (Stage 3) | 2,428 (32.4%) |
| Training Samples | 4,562 |
| Test Samples | 507 |
| Train/Test Split | 90/10 |
Partitioning Logic
QA pairs were classified as extractive if the answer text appears as an exact substring (case-insensitive) of the context. Otherwise, they were classified as abstractive and reserved for Stage 3 training.
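The rule is simple enough to express in a few lines. The sketch below is illustrative only; the function and field names are hypothetical and not taken from the project's actual preprocessing code.

```python
def is_extractive(answer: str, context: str) -> bool:
    """A QA pair is extractive if the answer occurs verbatim (case-insensitive) in the context."""
    return answer.strip().lower() in context.lower()

def partition(qa_pairs):
    """Split parsed QA pairs into the Stage 2 (extractive) and Stage 3 (abstractive) subsets."""
    extractive, abstractive = [], []
    for pair in qa_pairs:  # each pair is assumed to carry "answer" and "context" fields
        (extractive if is_extractive(pair["answer"], pair["context"]) else abstractive).append(pair)
    return extractive, abstractive
```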
Training Procedure
Hardware
| Component | Specification |
|---|---|
| GPU | NVIDIA A100-SXM4-40GB |
| GPU Memory | 42.5 GB |
| Platform | Google Colab |
Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 1e-4 |
| Train Batch Size | 8 |
| Eval Batch Size | 8 |
| Epochs | 10 |
| Optimizer | AdamW (fused) |
| LR Scheduler | Linear |
| Weight Decay | 0.01 |
| Max Gradient Norm | 1.0 |
| Max Input Length | 1024 tokens |
| Max Target Length | 128 tokens |
| FP16 | Disabled (T5 stability) |
| Seed | 75 |
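For reference, the table above maps roughly onto the following `Seq2SeqTrainingArguments` configuration. This is a sketch of how the settings could be expressed, not the exact training script; the output directory, evaluation cadence, and best-model metric name are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the Stage 2 hyperparameters (output_dir, eval/save strategy and
# metric_for_best_model are assumptions, not taken from the original run).
training_args = Seq2SeqTrainingArguments(
    output_dir="mt5-base-kenswquad-extractive",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    optim="adamw_torch_fused",    # fused AdamW
    lr_scheduler_type="linear",
    weight_decay=0.01,
    max_grad_norm=1.0,
    fp16=False,                   # disabled for T5/mT5 numerical stability
    seed=75,
    predict_with_generate=True,
    generation_max_length=128,    # max target length; the 1024-token input limit is applied at tokenization time
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="bleu",
)
```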
Training Results
| Epoch | Training Loss | Validation Loss | BLEU Score |
|---|---|---|---|
| 1 | 1.2512 | 0.7344 | 42.59 |
| 2 | 0.8488 | 0.7560 | 41.75 |
| 3 | 0.6945 | 0.7632 | 40.43 |
| 4 | 0.5414 | 0.7357 | 45.84 |
| 5 | 0.4604 | 0.7834 | 45.95 |
| 6 | 0.3730 | 0.8136 | 46.45 |
| 7 | 0.3249 | 0.8079 | 47.10 |
| 8 | 0.2473 | 0.8518 | **48.99** ★ |
| 9 | 0.2233 | 0.8989 | 46.53 |
| 10 | 0.2003 | 0.8905 | 46.82 |
★ Best checkpoint (selected based on the highest BLEU score)
Training Dynamics
- Total Training Time: 69.9 minutes (1h 8m 23s)
- Total Steps: 5,710
- Final Training Loss: 0.4857
Key Observations:
- Initial Adaptation (Epochs 1-3): BLEU dipped as model transitioned from English to Swahili patterns
- Rapid Improvement (Epochs 4-8): Strong gains as Swahili morphology was learned
- Best Performance: Epoch 8 achieved peak BLEU of 48.99
- Slight Overfitting (Epochs 9-10): Validation loss increased while training loss continued decreasing
Evaluation Results
| Metric | Score |
|---|---|
| Best BLEU | 48.99 |
| Final BLEU | 46.82 |
| Best Validation Loss | 0.7344 (Epoch 1) |
| Final Validation Loss | 0.8905 |
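BLEU scores comparable to the ones above can be computed with the `evaluate` library's SacreBLEU wrapper (not listed in the framework versions below, so treat this as an assumption). The sketch uses placeholder prediction and reference lists rather than the actual decoded outputs of the 507-example test split.

```python
import evaluate

# Minimal sketch: corpus-level BLEU over decoded test-set predictions.
sacrebleu = evaluate.load("sacrebleu")

predictions = ["Nairobi"]       # decoded model outputs (placeholder)
references = [["Nairobi"]]      # one reference answer per example (placeholder)

result = sacrebleu.compute(predictions=predictions, references=references)
print(round(result["score"], 2))  # BLEU on a 0-100 scale
```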
Framework Versions
| Library | Version |
|---|---|
| Transformers | 4.57.3 |
| PyTorch | 2.9.0+cu126 |
| Datasets | 4.0.0 |
| Tokenizers | 0.22.1 |
Citation
Coming soon.
Related Models
| Stage | Model | Description |
|---|---|---|
| 0 | google/mt5-base | Original pretrained model |
| 1 | kikwaib/mt5-base-squad-transfer | English SQuAD fine-tuned |
| 2 | kikwaib/mt5-base-kenswquad-extractive (Current) | Swahili extractive QA |
| 3 | Coming Soon | Swahili abstractive QA |
Training Date: December 20, 2025