mT5-Base KenSwQuAD Extractive (Stage 2)

Model Summary

This model is an intermediate research checkpoint developed as part of the KenSwQuAD Hierarchical Curriculum Learning project for Swahili Question Answering.

It is a google/mt5-base model that has undergone two stages of curriculum training:

  1. Stage 1: Fine-tuned on English SQuAD v2 to learn QA task structure
  2. Stage 2 (Current): Fine-tuned on extractive KenSwQuAD to learn Swahili morphology and syntax

The model learns to extract answer spans from Swahili text contexts given a question.

This is Stage 2 of a 3-Stage Pipeline:

  1. ✅ Stage 1: Structural Transfer (English SQuAD) → Learned "How to Answer"
  2. ✅ Stage 2 (Current): Morphological Alignment (Extractive KenSwQuAD) → Learned Swahili Syntax
  3. ⏳ Stage 3: Generative Refinement (Abstractive KenSwQuAD) → Will Learn Reasoning

Model Details

Developed by: Benjamin Kikwai (kikwaib)
Model Type: Multilingual Sequence-to-Sequence (Encoder-Decoder)
Base Model: kikwaib/mt5-base-squad-transfer
Original Base: google/mt5-base
Language(s): Swahili (sw), English (en), Multilingual
Task: Extractive Question Answering (Text-to-Text)
License: Apache 2.0
Parameters: 582.4M
Vocabulary Size: 250,100
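
A quick way to verify the parameter count and tokenizer vocabulary size locally (a minimal sketch; it downloads the full checkpoint):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "kikwaib/mt5-base-kenswquad-extractive"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")  # ~582.4M
print(tokenizer.vocab_size)  # 250100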

Intended Use

Primary Use Cases

  • Swahili Question Answering: Extract answers from Swahili text given a question
  • Transfer Learning: Serve as initialization for Swahili NLP tasks
  • Research: Baseline for low-resource language QA experiments

How to Use

The model accepts input in the format: question: <question_text> context: <context_text>

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "kikwaib/mt5-base-kenswquad-extractive"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

context = "Kenya ni nchi ya Afrika Mashariki. Nairobi ni mji mkuu wa Kenya. Kenya ina wakazi zaidi ya milioni 50."
question = "Mji mkuu wa Kenya ni upi?"

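# Combine question and context into the text-to-text prompt format described above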
input_text = f"question: {question} context: {context}"
inputs = tokenizer(input_text, return_tensors="pt", max_length=1024, truncation=True)

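# Generate the answer span (short extractive answers fit comfortably within 128 tokens)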
outputs = model.generate(**inputs, max_length=128)
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(answer)
# Expected Output: "Nairobi"

Limitations

  • Optimized for extractive QA where the answer is a substring of the context
  • May struggle with abstractive questions requiring reasoning or inference
  • Performance may vary on domains outside the KenSwQuAD training data (primarily news and Wikipedia)

Training Data

Dataset: KenSwQuAD

The model was fine-tuned on the extractive subset of KenSwQuAD (Kenya Swahili Question Answering Dataset).

Total QA Pairs Parsed: 7,497
Extractive Pairs (Stage 2): 5,069 (67.6%)
Abstractive Pairs (Stage 3): 2,428 (32.4%)
Training Samples: 4,562
Test Samples: 507
Train/Test Split: 90/10

Partitioning Logic

QA pairs were classified as extractive if the answer text appears as an exact substring (case-insensitive) of the context. Otherwise, they were classified as abstractive and reserved for Stage 3 training.
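
A minimal sketch of this rule (the field names and sample records below are illustrative, not the actual KenSwQuAD schema):

# Hypothetical QA records; the real KenSwQuAD files use their own schema.
qa_pairs = [
    {"question": "Mji mkuu wa Kenya ni upi?",
     "context": "Nairobi ni mji mkuu wa Kenya.",
     "answer": "Nairobi"},
    {"question": "Kenya ina wakazi wangapi?",
     "context": "Kenya ina wakazi zaidi ya milioni 50.",
     "answer": "Zaidi ya watu milioni hamsini"},
]

def is_extractive(answer: str, context: str) -> bool:
    # Extractive: the answer occurs verbatim (case-insensitive) inside the context.
    return answer.strip().lower() in context.lower()

extractive = [p for p in qa_pairs if is_extractive(p["answer"], p["context"])]       # Stage 2 data
abstractive = [p for p in qa_pairs if not is_extractive(p["answer"], p["context"])]  # Stage 3 data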

Training Procedure

Hardware

GPU: NVIDIA A100-SXM4-40GB
GPU Memory: 42.5 GB
Platform: Google Colab

Hyperparameters

Learning Rate: 1e-4
Train Batch Size: 8
Eval Batch Size: 8
Epochs: 10
Optimizer: AdamW (fused)
LR Scheduler: Linear
Weight Decay: 0.01
Max Gradient Norm: 1.0
Max Input Length: 1024 tokens
Max Target Length: 128 tokens
FP16: Disabled (T5 stability)
Seed: 75
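
The training script itself is not reproduced here; the following is a minimal sketch of how these settings might be expressed with Hugging Face Seq2SeqTrainingArguments (the output directory and the BLEU-based checkpoint selection are assumptions inferred from the Training Results section below):

from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the Stage 2 configuration from the table above.
training_args = Seq2SeqTrainingArguments(
    output_dir="mt5-base-kenswquad-extractive",  # assumed
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    optim="adamw_torch_fused",
    lr_scheduler_type="linear",
    weight_decay=0.01,
    max_grad_norm=1.0,
    fp16=False,  # disabled for T5/mT5 numerical stability
    seed=75,
    predict_with_generate=True,
    generation_max_length=128,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="bleu",  # assumes compute_metrics reports a "bleu" key
    greater_is_better=True,
)
# The 1024-token input limit is applied at tokenization time (see the usage example above).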

Training Results

Epoch Training Loss Validation Loss BLEU Score
1 1.2512 0.7344 42.59
2 0.8488 0.7560 41.75
3 0.6945 0.7632 40.43
4 0.5414 0.7357 45.84
5 0.4604 0.7834 45.95
6 0.3730 0.8136 46.45
7 0.3249 0.8079 47.10
8 0.2473 0.8518 48.99 ★
9 0.2233 0.8989 46.53
10 0.2003 0.8905 46.82

★ Best checkpoint (selected based on highest BLEU score)

Training Dynamics

  • Total Training Time: 69.9 minutes (1h 8m 23s)
  • Total Steps: 5,710
  • Final Training Loss: 0.4857

Key Observations:

  1. Initial Adaptation (Epochs 1-3): BLEU dipped as the model transitioned from English to Swahili patterns
  2. Rapid Improvement (Epochs 4-8): Strong gains as Swahili morphology was learned
  3. Best Performance: Epoch 8 achieved peak BLEU of 48.99
  4. Slight Overfitting (Epochs 9-10): Validation loss increased while training loss continued decreasing

Evaluation Results

Best BLEU: 48.99
Final BLEU: 46.82
Best Validation Loss: 0.7344 (Epoch 1)
Final Validation Loss: 0.8905
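
BLEU compares generated answers against the reference answers of the 507-example test split. A minimal sketch of that scoring with the evaluate/sacrebleu libraries (an assumption; the original evaluation script may differ):

import evaluate

sacrebleu = evaluate.load("sacrebleu")

# Hypothetical predictions and references; in practice these come from
# model.generate(...) over the held-out test set.
predictions = ["Nairobi"]
references = [["Nairobi"]]

result = sacrebleu.compute(predictions=predictions, references=references)
print(round(result["score"], 2))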

Framework Versions

Transformers: 4.57.3
PyTorch: 2.9.0+cu126
Datasets: 4.0.0
Tokenizers: 0.22.1

Citation

Coming soon.

Related Models

Stage 0: google/mt5-base (original pretrained model)
Stage 1: kikwaib/mt5-base-squad-transfer (English SQuAD fine-tuned)
Stage 2 (Current): kikwaib/mt5-base-kenswquad-extractive (Swahili extractive QA)
Stage 3: Coming soon (Swahili abstractive QA)

Training Date: December 20, 2025
