# muril-lang-id-v11
Fine-tuned google/muril-base-cased for language identification on Indian banking chatbot messages. Covers 16 Indian languages plus English (17 language labels) in both native and Romanized script, with an 18th undetermined class for out-of-distribution inputs.
This is v11 of an iterative series (v1–v7). v11 adds conversational shorts (per-language greetings, affirmations, and closings: yes/no/ok/hi/namaste/shukriya/haan/nahi/sari/…) to fix a real production failure where ultra-short single-token chat openers were falling into undetermined. Hindustani-shared Roman shorts (haan/nahi/theek/accha/namaste/namaskar/shukriya/dhanyavaad) are concentrated under hi only, mirroring the inference-time ur → hi merge.
## Labels (0–17)
as, bn, en, gu, hi, kn, ks, ml, mr, ne, or, pa, sa, sd, ta, te, ur, undetermined
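The inference-time ur → hi merge mentioned above can be applied as a trivial post-processing step on the predicted label. A minimal sketch; `merge_hindustani` is a hypothetical helper name, not part of the checkpoint:

```python
LABELS = ["as", "bn", "en", "gu", "hi", "kn", "ks", "ml", "mr", "ne",
          "or", "pa", "sa", "sd", "ta", "te", "ur", "undetermined"]

def merge_hindustani(label: str) -> str:
    # Romanized Urdu and Hindi are treated as one Hindustani label
    # at inference time, so a raw "ur" prediction is folded into "hi".
    return "hi" if label == "ur" else label
```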
## Evaluation
On the held-out 1,882-row banking chat test set (test_all.csv):
| version | overall | en | hi | kn | ta | te | ml | undetermined |
|---|---|---|---|---|---|---|---|---|
| v5 | 91.82% | 98.2% | 99.7% | 91.6% | 81.3% | 80.5% | n/a | 77.8% |
| v6 | 93.25% | 96.0% | 99.8% | 97.5% | 92.4% | 78.8% | n/a | 82.5% |
| v7 | 96.07% | 100% | 99.8% | 99.2% | 97.9% | 93.8% | 0.0% | 84.0% |
| v11 | 96.07% | 100% | 99.8% | 99.2% | 96.5% | 94.7% | 14.3% | 84.0% |
Held-out stratified test (from the training-mix distribution): accuracy 0.9727, f1_macro 0.9668.
## v11 highlights
Single-token conversational openers, the failure mode fixed in this version:
| input | v7 | v11 |
|---|---|---|
| hi | undetermined | en (0.81) |
| namaste | hi (lucky) | hi (1.00) |
| namaskar | undetermined | hi (1.00) |
| shukriya | undetermined | hi (1.00) |
| haan | undetermined | hi (1.00) |
| nahi | undetermined | hi (1.00) |
| sari | undetermined | ta (1.00) |
| accha | undetermined | undetermined (residual edge case) |
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "dnivra26/muril-lang-id-v11"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)
model.eval()

LABELS = ["as", "bn", "en", "gu", "hi", "kn", "ks", "ml", "mr", "ne",
          "or", "pa", "sa", "sd", "ta", "te", "ur", "undetermined"]
ENERGY_THRESHOLD = -6.5  # energy > threshold -> flag as undetermined

text = "mera balance kitna hai"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.inference_mode():
    logits = model(**inputs).logits.squeeze(0)

energy = -torch.logsumexp(logits, dim=0).item()
top = int(logits.argmax())
label = "undetermined" if energy > ENERGY_THRESHOLD else LABELS[top]
print(label)  # -> hi
```
The energy threshold was calibrated on this checkpoint via a joint sweep on test_all.csv and Bhasha-Abhijnaanam OOD: −6.5 keeps test_all accuracy at 96.07% (identical to the v7 baseline) and recovers single-token hi (energy = −6.65), which v7–v10 flagged as undetermined. The energy gate's marginal contribution to OOD recall is small (the learned undetermined class at index 17 carries the bulk of OOD detection); loosening from −7.0 to −6.5 drops gate-only OOD recall from 4.3% to 2.8% on Bhasha-Abhijnaanam.
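The calibration procedure described above can be sketched as a grid sweep over candidate thresholds, measuring in-distribution accuracy (gated predictions count as wrong unless the true label is undetermined) against gate-only OOD recall. A minimal illustration with synthetic energies standing in for real model outputs; `sweep_energy_threshold` is a hypothetical helper, not part of this repo:

```python
import numpy as np

def sweep_energy_threshold(id_energies, id_correct, ood_energies, candidates):
    """For each candidate threshold t, report:
    - in-distribution accuracy when samples with energy > t are forced
      to 'undetermined' (here simply counted as wrong), and
    - gate-only OOD recall: the fraction of OOD samples the energy
      gate alone flags as undetermined."""
    results = []
    for t in candidates:
        gated = id_energies > t                 # True -> forced undetermined
        acc = float(np.mean(~gated & id_correct))
        ood_recall = float(np.mean(ood_energies > t))
        results.append((t, acc, ood_recall))
    return results
```

Tightening the threshold trades in-distribution accuracy for gate-only OOD recall; the model card's choice of −6.5 sits at the point where test_all accuracy is preserved.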
## Training
- Base: google/muril-base-cased
- Epochs: 3
- Batch size: 128, lr: 4e-5, precision: bf16 + TF32
- Max seq length: 128
- Datasets: AI4Bharat Bhasha-Abhijnaanam, AI4Bharat Aksharantar, SST-2, suhani-sarvam/google-dakshina, findnitai/english-to-hinglish, AmazonScience/MASSIVE, community-datasets/offenseval_dravidian (non-offensive only), bitext retail-banking, FLORES-200 (OOD), synthetic brand-laden English banking Q&A, banking-style European OOD (DE/FR/PT/ES/IT/TR/SV/NL), synthetic gibberish, conversational shorts (v11: 400 rows × 17 labels of greetings/affirmations/closings, both native and Romanized).
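The hyperparameters above map onto a Hugging Face `TrainingArguments` roughly as follows. This is a sketch under assumptions, not the actual training script: the `output_dir` name is invented, and dataset preparation, tokenization, and the `Trainer` call itself are omitted.

```python
from transformers import TrainingArguments

# Sketch of the listed hyperparameters only.
args = TrainingArguments(
    output_dir="muril-lang-id-v11",   # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=128,
    learning_rate=4e-5,
    bf16=True,                        # bf16 mixed precision
    tf32=True,                        # TF32 matmuls (Ampere+ GPUs)
)
```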
## Notes
- Romanized Urdu and Hindi are merged to `hi` at inference time (Hindustani is effectively one spoken language). v11 mirrors this in training: Roman shukriya/haan/nahi/theek/accha/namaste/namaskar are labelled `hi` only, never split across hi/ur/pa/sd.
- Pre-v6 checkpoints in this series only emit labels 0–16 and need a tighter energy threshold (`-11.22`).
- Works best when wrapped in a pipeline that runs Unicode-script short-circuiting first, so deterministic native-script inputs skip the model entirely.
- Known residual: `accha` (energy ≈ −4.25) is genuinely ambiguous between hi/ur/pa/sd and falls below any defensible threshold. Treat as an edge case or accept undetermined for this filler word.
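The Unicode-script short-circuit could look like the following minimal sketch, using character names from the standard-library `unicodedata` module. `SCRIPT_TO_LABEL` and the function name are assumptions; shared scripts (Devanagari for hi/mr/ne/sa, Bengali for bn/as, Perso-Arabic for ur/sd/ks) are deliberately left to the model:

```python
import unicodedata

# Scripts that map unambiguously to exactly one label in this tag set.
SCRIPT_TO_LABEL = {
    "TAMIL": "ta",
    "TELUGU": "te",
    "KANNADA": "kn",
    "MALAYALAM": "ml",
    "GUJARATI": "gu",
    "ORIYA": "or",
    "GURMUKHI": "pa",
}

def script_short_circuit(text: str):
    """Return a deterministic label when every letter in `text` belongs
    to a single unambiguous script; else None (fall through to the model)."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "TAMIL LETTER KA".
            scripts.add(unicodedata.name(ch, "").split(" ")[0])
    if len(scripts) == 1:
        return SCRIPT_TO_LABEL.get(scripts.pop())
    return None
```

Running this gate before the model means deterministic native-script inputs never pay the inference cost, and Romanized and shared-script text still reaches the classifier.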