# muril-lang-id-v11

Fine-tuned google/muril-base-cased for language identification on Indian banking chatbot messages. Covers 16 Indian languages plus English, in both native and Romanized script, with an 18th undetermined class for out-of-distribution inputs.

This is v11 of an iterative series (v1 → v11). v11 adds conversational shorts (per-language greetings, affirmations, and closings: yes/no/ok/hi/namaste/shukriya/haan/nahi/sari/…) to fix a real production failure where ultra-short single-token chat openers fell into undetermined. Hindustani-shared Roman shorts (haan/nahi/theek/accha/namaste/namaskar/shukriya/dhanyavaad) are concentrated under hi only, mirroring the inference-time ur → hi merge.
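The inference-time ur → hi merge amounts to a one-line post-processing step on the predicted label. A minimal sketch (`merge_hindustani` is an illustrative name, not part of the checkpoint):

```python
def merge_hindustani(label: str) -> str:
    # Fold Romanized-Urdu predictions into hi, matching the
    # inference-time ur -> hi merge described above.
    return "hi" if label == "ur" else label

print(merge_hindustani("ur"))  # -> hi
print(merge_hindustani("ta"))  # -> ta
```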

## Labels (0–17)

as, bn, en, gu, hi, kn, ks, ml, mr, ne, or, pa, sa, sd, ta, te, ur, undetermined

## Evaluation

On the held-out 1,882-row banking chat test set (`test_all.csv`):

| version | overall | en | hi | kn | ta | te | ml | undetermined |
|---------|---------|-----|------|------|------|------|------|--------------|
| v5 | 91.82% | 98.2% | 99.7% | 91.6% | 81.3% | 80.5% | – | 77.8% |
| v6 | 93.25% | 96.0% | 99.8% | 97.5% | 92.4% | 78.8% | – | 82.5% |
| v7 | 96.07% | 100% | 99.8% | 99.2% | 97.9% | 93.8% | 0.0% | 84.0% |
| v11 | 96.07% | 100% | 99.8% | 99.2% | 96.5% | 94.7% | 14.3% | 84.0% |

Held-out stratified test (from the training-mix distribution): accuracy 0.9727, f1_macro 0.9668.

## v11 highlights

Single-token conversational openers (the failure mode fixed in this version):

| input | v7 | v11 |
|-------|----|-----|
| hi | undetermined | en (0.81) |
| namaste | hi (lucky) | hi (1.00) |
| namaskar | undetermined | hi (1.00) |
| shukriya | undetermined | hi (1.00) |
| haan | undetermined | hi (1.00) |
| nahi | undetermined | hi (1.00) |
| sari | undetermined | ta (1.00) |
| accha | undetermined | undetermined (residual edge case) |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "dnivra26/muril-lang-id-v11"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)
model.eval()

LABELS = ["as","bn","en","gu","hi","kn","ks","ml","mr","ne","or","pa","sa","sd","ta","te","ur","undetermined"]
ENERGY_THRESHOLD = -6.5  # energy > threshold => flag as undetermined

text = "mera balance kitna hai"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.inference_mode():
    logits = model(**inputs).logits.squeeze(0)
energy = -torch.logsumexp(logits, dim=0).item()
top = int(logits.argmax())
label = "undetermined" if energy > ENERGY_THRESHOLD else LABELS[top]
print(label)  # -> hi
```

The energy threshold was calibrated on this checkpoint via a joint sweep on `test_all.csv` and Bhasha-Abhijnaanam OOD: -6.5 keeps test_all accuracy at 96.07% (identical to the v7 baseline) and recovers single-token hi (energy = -6.65), which v7–v10 flagged as undetermined. The energy gate's marginal contribution to OOD recall is small (the learned undetermined class at index 17 carries the bulk of OOD detection); loosening from -7.0 to -6.5 drops gate-only OOD recall from 4.3% to 2.8% on Bhasha-Abhijnaanam.
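The calibration described above can be sketched as a plain grid sweep over candidate thresholds, scoring in-distribution retention against OOD flagging. The energies below are toy values for illustration only, not the real `test_all.csv` / Bhasha-Abhijnaanam numbers:

```python
import numpy as np

def sweep_energy_threshold(in_dist_energies, ood_energies, candidates):
    """For each candidate threshold t, report the fraction of
    in-distribution inputs kept (energy <= t) and of OOD inputs
    flagged (energy > t). Illustrative sketch of the joint sweep."""
    results = []
    for t in candidates:
        in_kept = float(np.mean(in_dist_energies <= t))
        ood_flagged = float(np.mean(ood_energies > t))
        results.append((t, in_kept, ood_flagged))
    return results

# Toy energies (assumed values). -6.65 stands in for single-token "hi",
# which a -7.0 threshold would wrongly flag as undetermined.
in_dist = np.array([-9.0, -8.2, -7.5, -6.65])
ood = np.array([-6.0, -5.1, -4.4])
for t, kept, flagged in sweep_energy_threshold(in_dist, ood, [-7.0, -6.5]):
    print(f"threshold {t}: kept {kept:.2f} in-dist, flagged {flagged:.2f} OOD")
```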

## Training

- Base: google/muril-base-cased
- Epochs: 3
- Batch size: 128, lr: 4e-5, precision: bf16 + TF32
- Max seq length: 128
- Datasets: AI4Bharat Bhasha-Abhijnaanam, AI4Bharat Aksharantar, SST-2, suhani-sarvam/google-dakshina, findnitai/english-to-hinglish, AmazonScience/MASSIVE, community-datasets/offenseval_dravidian (non-offensive only), bitext retail-banking, FLORES-200 (OOD), synthetic brand-laden English banking Q&A, banking-style European OOD (DE/FR/PT/ES/IT/TR/SV/NL), synthetic gibberish, conversational shorts (v11: 400 rows × 17 labels of greetings/affirmations/closings, both native and Romanized).
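The hyperparameters above map onto the standard transformers Trainer API roughly as follows. This is a sketch from the bullet list, not the actual training script; unlisted settings (warmup, weight decay, eval cadence) are assumptions left at their defaults:

```python
# Kwargs for transformers.TrainingArguments, reconstructed from the
# hyperparameter list above (names follow the Trainer API).
training_kwargs = dict(
    num_train_epochs=3,
    per_device_train_batch_size=128,
    learning_rate=4e-5,
    bf16=True,   # bf16 mixed precision
    tf32=True,   # TF32 matmuls (requires an Ampere+ GPU)
    # Max seq length (128) is enforced by the tokenizer, not the Trainer.
)
```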

## Notes

- Romanized Urdu and Hindi are merged to hi at inference time (Hindustani is effectively one spoken language). v11 mirrors this in training: Roman shukriya/haan/nahi/theek/accha/namaste/namaskar are labelled hi only, never split across hi/ur/pa/sd.
- Pre-v6 checkpoints in this series only emit labels 0–16 and need a tighter energy threshold (-11.22).
- Works best when wrapped in a pipeline that runs Unicode-script short-circuiting first, so deterministic native-script inputs skip the model entirely.
- Known residual: accha (energy ≈ -4.25) is genuinely ambiguous between hi/ur/pa/sd, and its energy sits above any defensible threshold, so it is flagged undetermined. Treat as an edge case or accept undetermined for this filler word.
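The Unicode-script short-circuit mentioned above can be sketched with the stdlib `unicodedata` module: if every alphabetic character in the input belongs to one script that maps to exactly one label, return that label without calling the model. The script-to-label map here is an illustrative subset (Devanagari and Bengali script are shared across several labels, so they deliberately fall through to the model); the production mapping is an assumption:

```python
import unicodedata

# Scripts that map to exactly one label. Devanagari (hi/mr/ne/sa) and
# Bengali script (bn/as) are shared, so they are omitted on purpose
# and still go through the model.
SCRIPT_TO_LABEL = {
    "TAMIL": "ta",
    "TELUGU": "te",
    "KANNADA": "kn",
    "MALAYALAM": "ml",
    "GURMUKHI": "pa",
    "GUJARATI": "gu",
    "ORIYA": "or",
}

def script_short_circuit(text: str):
    """Return a deterministic label if all alphabetic characters share one
    unambiguous Indic script; otherwise None (fall through to the model)."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # Unicode character names start with the script name,
            # e.g. "TAMIL LETTER NA" -> "TAMIL".
            scripts.add(unicodedata.name(ch, "").split(" ")[0])
    if len(scripts) == 1:
        return SCRIPT_TO_LABEL.get(scripts.pop())
    return None

print(script_short_circuit("நன்றி"))    # Tamil script -> "ta"
print(script_short_circuit("hello"))     # Latin -> None, goes to the model
```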