mT5-Base KenSwQuAD Extractive (Stage 2)
Model Summary
This model is an intermediate research checkpoint developed as part of the KenSwQuAD Hierarchical Curriculum Learning project for Swahili Question Answering.
It is a google/mt5-base model that has undergone two stages of curriculum training:
- Stage 1: Fine-tuned on English SQuAD v2 to learn QA task structure
- Stage 2 (Current): Fine-tuned on extractive KenSwQuAD to learn Swahili morphology and syntax
The model learns to extract answer spans from Swahili text contexts given a question.
This is Stage 2 of a 3-Stage Pipeline:
- ✅ Stage 1: Structural Transfer (English SQuAD) → Learned "How to Answer"
- ✅ Stage 2 (Current): Morphological Alignment (Extractive KenSwQuAD) → Learned Swahili Syntax
- ⏳ Stage 3: Generative Refinement (Abstractive KenSwQuAD) → Will Learn Reasoning
Model Details
| Property | Value |
|---|---|
| Developed by | Benjamin Kikwai (kikwaib) |
| Model Type | Multilingual Sequence-to-Sequence (Encoder-Decoder) |
| Base Model | kikwaib/mt5-base-squad-transfer |
| Original Base | google/mt5-base |
| Language(s) | Swahili (sw), English (en), Multilingual |
| Task | Extractive Question Answering (Text-to-Text) |
| License | Apache 2.0 |
| Parameters | 582.4M |
| Vocabulary Size | 250,100 |
Intended Use
Primary Use Cases
- Swahili Question Answering: Extract answers from Swahili text given a question
- Transfer Learning: Serve as initialization for Swahili NLP tasks
- Research: Baseline for low-resource language QA experiments
How to Use
The model accepts input in the format `question: <question_text> context: <context_text>`:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "kikwaib/mt5-base-kenswquad-extractive"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

context = "Kenya ni nchi ya Afrika Mashariki. Nairobi ni mji mkuu wa Kenya. Kenya ina wakazi zaidi ya milioni 50."
question = "Mji mkuu wa Kenya ni upi?"

# Build the prompt in the expected "question: ... context: ..." format.
input_text = f"question: {question} context: {context}"
inputs = tokenizer(input_text, return_tensors="pt", max_length=1024, truncation=True)

# Generate the answer span (answers are short, so 128 tokens is ample).
outputs = model.generate(**inputs, max_length=128)
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(answer)
# Expected output: "Nairobi"
```
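Equivalently, the checkpoint can be loaded through the Transformers `text2text-generation` pipeline. The snippet below is a minimal sketch assuming the same prompt format as above.

```python
from transformers import pipeline

# Load the checkpoint as a text2text-generation pipeline (minimal sketch).
qa = pipeline("text2text-generation", model="kikwaib/mt5-base-kenswquad-extractive")

context = "Kenya ni nchi ya Afrika Mashariki. Nairobi ni mji mkuu wa Kenya."
question = "Mji mkuu wa Kenya ni upi?"

# The model expects the same "question: ... context: ..." prompt format.
result = qa(f"question: {question} context: {context}", max_length=128)
print(result[0]["generated_text"])  # e.g. "Nairobi"
```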
Limitations
- Optimized for extractive QA where the answer is a substring of the context
- May struggle with abstractive questions requiring reasoning or inference
- Performance may vary on domains outside the KenSwQuAD training data (primarily news and Wikipedia)
Training Data
Dataset: KenSwQuAD
The model was fine-tuned on the extractive subset of KenSwQuAD (Kenya Swahili Question Answering Dataset).
| Statistic | Value |
|---|---|
| Total QA Pairs Parsed | 7,497 |
| Extractive Pairs (Stage 2) | 5,069 (67.6%) |
| Abstractive Pairs (Stage 3) | 2,428 (32.4%) |
| Training Samples | 4,562 |
| Test Samples | 507 |
| Train/Test Split | 90/10 |
Partitioning Logic
QA pairs were classified as extractive if the answer text appears as an exact substring (case-insensitive) of the context. Otherwise, they were classified as abstractive and reserved for Stage 3 training.
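The rule is simple enough to express in a few lines. The sketch below is illustrative only; the function and field names are hypothetical and not taken from the project's actual preprocessing code.

```python
def is_extractive(answer: str, context: str) -> bool:
    """A QA pair is extractive if the answer occurs verbatim (case-insensitive) in the context."""
    return answer.strip().lower() in context.lower()

def partition(qa_pairs):
    """Split parsed QA pairs into the Stage 2 (extractive) and Stage 3 (abstractive) subsets."""
    extractive, abstractive = [], []
    for pair in qa_pairs:  # each pair is assumed to carry "answer" and "context" fields
        (extractive if is_extractive(pair["answer"], pair["context"]) else abstractive).append(pair)
    return extractive, abstractive
```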
Training Procedure
Hardware
| Component | Specification |
|---|---|
| GPU | NVIDIA A100-SXM4-40GB |
| GPU Memory | 42.5 GB |
| Platform | Google Colab |
Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 1e-4 |
| Train Batch Size | 8 |
| Eval Batch Size | 8 |
| Epochs | 10 |
| Optimizer | AdamW (fused) |
| LR Scheduler | Linear |
| Weight Decay | 0.01 |
| Max Gradient Norm | 1.0 |
| Max Input Length | 1024 tokens |
| Max Target Length | 128 tokens |
| FP16 | Disabled (T5 stability) |
| Seed | 75 |
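For reference, the table above maps roughly onto the following `Seq2SeqTrainingArguments` configuration. This is a sketch of how the settings could be expressed, not the exact training script; the output directory, evaluation cadence, and best-model metric name are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the Stage 2 hyperparameters (output_dir, eval/save strategy and
# metric_for_best_model are assumptions, not taken from the original run).
training_args = Seq2SeqTrainingArguments(
    output_dir="mt5-base-kenswquad-extractive",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    optim="adamw_torch_fused",    # fused AdamW
    lr_scheduler_type="linear",
    weight_decay=0.01,
    max_grad_norm=1.0,
    fp16=False,                   # disabled for T5/mT5 numerical stability
    seed=75,
    predict_with_generate=True,
    generation_max_length=128,    # max target length; the 1024-token input limit is applied at tokenization time
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="bleu",
)
```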
Training Results
| Epoch | Training Loss | Validation Loss | BLEU Score |
|---|---|---|---|
| 1 | 1.2512 | 0.7344 | 42.59 |
| 2 | 0.8488 | 0.7560 | 41.75 |
| 3 | 0.6945 | 0.7632 | 40.43 |
| 4 | 0.5414 | 0.7357 | 45.84 |
| 5 | 0.4604 | 0.7834 | 45.95 |
| 6 | 0.3730 | 0.8136 | 46.45 |
| 7 | 0.3249 | 0.8079 | 47.10 |
| 8 | 0.2473 | 0.8518 | **48.99** ★ |
| 9 | 0.2233 | 0.8989 | 46.53 |
| 10 | 0.2003 | 0.8905 | 46.82 |
★ Best checkpoint (selected based on the highest BLEU score)
Training Dynamics
- Total Training Time: 69.9 minutes (1h 8m 23s)
- Total Steps: 5,710
- Final Training Loss: 0.4857
Key Observations:
- Initial Adaptation (Epochs 1-3): BLEU dipped as model transitioned from English to Swahili patterns
- Rapid Improvement (Epochs 4-8): Strong gains as Swahili morphology was learned
- Best Performance: Epoch 8 achieved peak BLEU of 48.99
- Slight Overfitting (Epochs 9-10): Validation loss increased while training loss continued decreasing
Evaluation Results
| Metric | Score |
|---|---|
| Best BLEU | 48.99 |
| Final BLEU | 46.82 |
| Best Validation Loss | 0.7344 (Epoch 1) |
| Final Validation Loss | 0.8905 |
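BLEU scores comparable to the ones above can be computed with the `evaluate` library's SacreBLEU wrapper (not listed in the framework versions below, so treat this as an assumption). The sketch uses placeholder prediction and reference lists rather than the actual decoded outputs of the 507-example test split.

```python
import evaluate

# Minimal sketch: corpus-level BLEU over decoded test-set predictions.
sacrebleu = evaluate.load("sacrebleu")

predictions = ["Nairobi"]       # decoded model outputs (placeholder)
references = [["Nairobi"]]      # one reference answer per example (placeholder)

result = sacrebleu.compute(predictions=predictions, references=references)
print(round(result["score"], 2))  # BLEU on a 0-100 scale
```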
Framework Versions
| Library | Version |
|---|---|
| Transformers | 4.57.3 |
| PyTorch | 2.9.0+cu126 |
| Datasets | 4.0.0 |
| Tokenizers | 0.22.1 |
Citation
Coming soon.
Related Models
| Stage | Model | Description |
|---|---|---|
| 0 | google/mt5-base | Original pretrained model |
| 1 | kikwaib/mt5-base-squad-transfer | English SQuAD fine-tuned |
| 2 | kikwaib/mt5-base-kenswquad-extractive (Current) | Swahili extractive QA |
| 3 | Coming Soon | Swahili abstractive QA |
Training Date: December 20, 2025