# muril-lang-id-v11
Fine-tuned google/muril-base-cased for language identification on Indian banking chatbot messages. Covers 16 Indian languages plus English (17 language labels) in both native and Romanized script, with an 18th undetermined class for out-of-distribution inputs.
This is v11 of an iterative series (v1–v7). v11 adds conversational shorts (per-language greetings, affirmations, and closings: yes/no/ok/hi/namaste/shukriya/haan/nahi/sari/…) to fix a real production failure where ultra-short single-token chat openers were falling into undetermined. Hindustani-shared Roman shorts (haan/nahi/theek/accha/namaste/namaskar/shukriya/dhanyavaad) are concentrated under hi only, mirroring the inference-time ur → hi merge.
## Labels (0–17)
as, bn, en, gu, hi, kn, ks, ml, mr, ne, or, pa, sa, sd, ta, te, ur, undetermined
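The inference-time ur → hi merge mentioned above can be applied as a trivial post-processing step on the predicted label. A minimal sketch; `merge_hindustani` is a hypothetical helper name, not part of the checkpoint:

```python
LABELS = ["as", "bn", "en", "gu", "hi", "kn", "ks", "ml", "mr", "ne",
          "or", "pa", "sa", "sd", "ta", "te", "ur", "undetermined"]

def merge_hindustani(label: str) -> str:
    # Romanized Urdu and Hindi are treated as one Hindustani label
    # at inference time, so a raw "ur" prediction is folded into "hi".
    return "hi" if label == "ur" else label
```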
## Evaluation
On the held-out 1,882-row banking chat test set (test_all.csv):
| version | overall | en | hi | kn | ta | te | ml | undetermined |
|---|---|---|---|---|---|---|---|---|
| v5 | 91.82% | 98.2% | 99.7% | 91.6% | 81.3% | 80.5% | n/a | 77.8% |
| v6 | 93.25% | 96.0% | 99.8% | 97.5% | 92.4% | 78.8% | n/a | 82.5% |
| v7 | 96.07% | 100% | 99.8% | 99.2% | 97.9% | 93.8% | 0.0% | 84.0% |
| v11 | 96.07% | 100% | 99.8% | 99.2% | 96.5% | 94.7% | 14.3% | 84.0% |
Held-out stratified test (from the training-mix distribution): accuracy 0.9727, f1_macro 0.9668.
## v11 highlights
Single-token conversational openers, the failure mode fixed in this version:
| input | v7 | v11 |
|---|---|---|
| hi | undetermined | en (0.81) |
| namaste | hi (lucky) | hi (1.00) |
| namaskar | undetermined | hi (1.00) |
| shukriya | undetermined | hi (1.00) |
| haan | undetermined | hi (1.00) |
| nahi | undetermined | hi (1.00) |
| sari | undetermined | ta (1.00) |
| accha | undetermined | undetermined (residual edge case) |
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "dnivra26/muril-lang-id-v11"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)
model.eval()

LABELS = ["as", "bn", "en", "gu", "hi", "kn", "ks", "ml", "mr", "ne",
          "or", "pa", "sa", "sd", "ta", "te", "ur", "undetermined"]
ENERGY_THRESHOLD = -6.5  # energy > threshold -> flag as undetermined

text = "mera balance kitna hai"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.inference_mode():
    logits = model(**inputs).logits.squeeze(0)

energy = -torch.logsumexp(logits, dim=0).item()
top = int(logits.argmax())
label = "undetermined" if energy > ENERGY_THRESHOLD else LABELS[top]
print(label)  # -> hi
```
The energy threshold was calibrated on this checkpoint via a joint sweep on test_all.csv and Bhasha-Abhijnaanam OOD: −6.5 keeps test_all accuracy at 96.07% (identical to the v7 baseline) and recovers single-token hi (energy = −6.65), which v7–v10 flagged as undetermined. The energy gate's marginal contribution to OOD recall is small (the learned undetermined class at index 17 carries the bulk of OOD detection); loosening from −7.0 to −6.5 drops gate-only OOD recall from 4.3% to 2.8% on Bhasha-Abhijnaanam.
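The calibration procedure described above can be sketched as a grid sweep over candidate thresholds, measuring in-distribution accuracy (gated predictions count as wrong unless the true label is undetermined) against gate-only OOD recall. A minimal illustration with synthetic energies standing in for real model outputs; `sweep_energy_threshold` is a hypothetical helper, not part of this repo:

```python
import numpy as np

def sweep_energy_threshold(id_energies, id_correct, ood_energies, candidates):
    """For each candidate threshold t, report:
    - in-distribution accuracy when samples with energy > t are forced
      to 'undetermined' (here simply counted as wrong), and
    - gate-only OOD recall: the fraction of OOD samples the energy
      gate alone flags as undetermined."""
    results = []
    for t in candidates:
        gated = id_energies > t                 # True -> forced undetermined
        acc = float(np.mean(~gated & id_correct))
        ood_recall = float(np.mean(ood_energies > t))
        results.append((t, acc, ood_recall))
    return results
```

Tightening the threshold trades in-distribution accuracy for gate-only OOD recall; the model card's choice of −6.5 sits at the point where test_all accuracy is preserved.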
## Training
- Base: google/muril-base-cased
- Epochs: 3
- Batch size: 128, lr: 4e-5, precision: bf16 + TF32
- Max seq length: 128
- Datasets: AI4Bharat Bhasha-Abhijnaanam, AI4Bharat Aksharantar, SST-2, suhani-sarvam/google-dakshina, findnitai/english-to-hinglish, AmazonScience/MASSIVE, community-datasets/offenseval_dravidian (non-offensive only), bitext retail-banking, FLORES-200 (OOD), synthetic brand-laden English banking Q&A, banking-style European OOD (DE/FR/PT/ES/IT/TR/SV/NL), synthetic gibberish, conversational shorts (v11: 400 rows × 17 labels of greetings/affirmations/closings, both native and Romanized).
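The hyperparameters above map onto a Hugging Face `TrainingArguments` roughly as follows. This is a sketch under assumptions, not the actual training script: the `output_dir` name is invented, and dataset preparation, tokenization, and the `Trainer` call itself are omitted.

```python
from transformers import TrainingArguments

# Sketch of the listed hyperparameters only.
args = TrainingArguments(
    output_dir="muril-lang-id-v11",   # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=128,
    learning_rate=4e-5,
    bf16=True,                        # bf16 mixed precision
    tf32=True,                        # TF32 matmuls (Ampere+ GPUs)
)
```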
## Notes
- Romanized Urdu and Hindi are merged to `hi` at inference time (Hindustani is effectively one spoken language). v11 mirrors this in training: Roman shukriya/haan/nahi/theek/accha/namaste/namaskar are labelled `hi` only, never split across hi/ur/pa/sd.
- Pre-v6 checkpoints in this series only emit labels 0–16 and need a tighter energy threshold (`-11.22`).
- Works best when wrapped in a pipeline that runs Unicode-script short-circuiting first, so deterministic native-script inputs skip the model entirely.
- Known residual: `accha` (energy ≈ −4.25) is genuinely ambiguous between hi/ur/pa/sd and falls below any defensible threshold. Treat as an edge case or accept undetermined for this filler word.
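The Unicode-script short-circuit could look like the following minimal sketch, using character names from the standard-library `unicodedata` module. `SCRIPT_TO_LABEL` and the function name are assumptions; shared scripts (Devanagari for hi/mr/ne/sa, Bengali for bn/as, Perso-Arabic for ur/sd/ks) are deliberately left to the model:

```python
import unicodedata

# Scripts that map unambiguously to exactly one label in this tag set.
SCRIPT_TO_LABEL = {
    "TAMIL": "ta",
    "TELUGU": "te",
    "KANNADA": "kn",
    "MALAYALAM": "ml",
    "GUJARATI": "gu",
    "ORIYA": "or",
    "GURMUKHI": "pa",
}

def script_short_circuit(text: str):
    """Return a deterministic label when every letter in `text` belongs
    to a single unambiguous script; else None (fall through to the model)."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "TAMIL LETTER KA".
            scripts.add(unicodedata.name(ch, "").split(" ")[0])
    if len(scripts) == 1:
        return SCRIPT_TO_LABEL.get(scripts.pop())
    return None
```

Running this gate before the model means deterministic native-script inputs never pay the inference cost, and Romanized and shared-script text still reaches the classifier.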