📝 myX-TransStyle-W2S: A Transformer-based Style Transfer for Myanmar Written (ရေးဟန်) to Spoken (ပြောဟန်)
myX-TransStyle-W2S is a specialized Sequence-to-Sequence (Seq2Seq) model developed by Khant Sint Heinn (Kalix Louis) under DatarrX. It is designed to transform formal Written Burmese (ရေးဟန်) into its natural colloquial Spoken Burmese (ပြောဟန်) counterpart, so that formal documents or news can be converted into fluid, human-like dialogue while preserving the original meaning.
Model Details
- Developed by: Khant Sint Heinn (Kalix Louis)
- Organization: DatarrX | ဒေတာ-အက်စ်
- Model Architecture: Fine-tuned NLLB-200 (600M Distilled) with merged LoRA adapters
- Language: Burmese (Myanmar)
- Task: Text Style Transfer (Written → Spoken)
- License: MIT
- Trained on: Myanmar Written-Spoken Parallel Corpus (MWSPC)
Linguistic Context: The Diglossia Challenge
Burmese is a diglossic language, featuring a major linguistic gap between two functional registers:
- Written Style (ရေးဟန်): Used in news, law, textbooks, and officialdom. It relies on formal grammatical markers such as "သည်", "၏", and "၍".
- Spoken Style (ပြောဟန်): Used in daily life, verbal communication, and social media. It uses colloquial markers like "တယ်" (tense), "ရဲ့" (possessive), and "နဲ့" (conjunction).
myX-TransStyle-W2S addresses the "robotic" nature of modern AI by allowing formal text to be localized into the natural, warm tone used by native speakers every day.
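To make the register gap concrete, here is a deliberately naive baseline that swaps the particle pairs listed above via string lookup. This is illustrative only; the mappings are limited to the markers named in this card, and the actual model performs these shifts contextually rather than by substitution:

```python
# Illustrative only: a naive written→spoken particle substitution.
# The real model handles these shifts in context; blind replacement
# like this fails on many real sentences.
WRITTEN_TO_SPOKEN = {
    "သည်": "တယ်",  # sentence-final marker
    "၏": "ရဲ့",    # possessive
}

def naive_w2s(text: str) -> str:
    """Swap formal particles for colloquial ones (ignores context)."""
    for written, spoken in WRITTEN_TO_SPOKEN.items():
        text = text.replace(written, spoken)
    return text

print(naive_w2s("ဖြစ်သည်"))  # ဖြစ်တယ်
```

The model's example output further down (e.g. "သည်" → "က", "တွင်" → "မှာ") shows why a lookup table is not enough: the correct spoken form depends on the particle's grammatical role in the sentence.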
Training Methodology
The model was trained using an efficient adaptation strategy optimized for the unique structural shifts of Myanmar style.
1. The Dataset (MWSPC)
The model was trained on 5,555 high-quality, unique parallel text pairs. This dataset provides a direct mapping from formal literary structures to their informal colloquial equivalents, filtered to ensure maximum diversity.
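The card does not publish the exact filtering procedure, so the following is a hypothetical sketch of one common step, exact-match deduplication on the written-side source, which is a typical way to enforce uniqueness in a parallel corpus:

```python
# Hypothetical sketch: the MWSPC filtering procedure is not published.
# This shows one simple uniqueness filter: keep the first (written,
# spoken) pair seen for each distinct written-side source sentence.
def dedup_pairs(pairs):
    seen = set()
    unique = []
    for written, spoken in pairs:
        if written not in seen:
            seen.add(written)
            unique.append((written, spoken))
    return unique

pairs = [
    ("က သည်", "က တယ်"),
    ("က သည်", "က တယ်"),  # exact duplicate, dropped
    ("ခ သည်", "ခ တယ်"),
]
print(len(dedup_pairs(pairs)))  # 2
```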
2. Parameter-Efficient Fine-Tuning (PEFT)
To capture nuanced stylistic shifts without overwriting the base model's linguistic depth, we utilized Low-Rank Adaptation (LoRA):
- Target Modules: q_proj, k_proj, v_proj, out_proj
- Rank (R): 32 | Alpha: 64
- Learning Rate: 8e-5 with a Cosine scheduler.
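A back-of-envelope count shows why this configuration is parameter-efficient. Assuming NLLB-200 distilled-600M's published dimensions (d_model = 1024, 12 encoder and 12 decoder layers, values not stated in this card), LoRA at rank 32 on the four attention projections adds only about 1.6% of the base model's parameters:

```python
# Back-of-envelope LoRA parameter count for the setup above.
# Assumed (not stated in this card): d_model = 1024, 12 encoder
# and 12 decoder layers in nllb-200-distilled-600M.
r = 32
d_model = 1024
# Each adapted module gets two low-rank matrices: A (r × d_in) and B (d_out × r).
params_per_module = r * d_model + d_model * r

enc_modules = 12 * 4        # encoder self-attention: q, k, v, out per layer
dec_modules = 12 * (4 + 4)  # decoder self-attention + cross-attention
total = (enc_modules + dec_modules) * params_per_module
print(f"{total:,} trainable LoRA parameters (~{total / 600e6:.1%} of 600M)")
```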
3. Merging Strategy
The LoRA adapters were merged into the base nllb-200-distilled-600M model using merge_and_unload(). The resulting standalone 2.8 GB model provides high-speed inference without requiring the PEFT library.
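The merge step described above can be sketched with the PEFT API as follows. This is a sketch, not the authors' training script; the adapter path is hypothetical, and the imports are kept inside the function so the sketch can be defined without the libraries installed:

```python
def merge_lora_adapters(base_id: str, adapter_path: str, out_dir: str):
    """Sketch of the merge step: load base model + LoRA adapters, fold
    the adapter weights into the base weights, and save a standalone
    checkpoint that no longer needs the PEFT library at inference."""
    from transformers import AutoModelForSeq2SeqLM
    from peft import PeftModel

    base = AutoModelForSeq2SeqLM.from_pretrained(base_id)
    model = PeftModel.from_pretrained(base, adapter_path)
    merged = model.merge_and_unload()  # LoRA deltas folded into base weights
    merged.save_pretrained(out_dir)

# Example call (adapter path is hypothetical):
# merge_lora_adapters("facebook/nllb-200-distilled-600M",
#                     "path/to/lora-adapters", "myX-TransStyle-W2S-merged")
```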
Evaluation Results
The model was validated on 100 unseen test sentences and showed superior performance compared to its S2W sibling.
Performance Metrics
| Metric | Score | Interpretation |
|---|---|---|
| BERTScore F1 | 0.9693 | Indicates near-perfect meaning preservation during style transfer. |
| chrF | 78.40 | Exceptional character-level accuracy, specifically in converting formal suffixes. |
| BLEU | 19.64 | Higher than S2W, reflecting a more consistent conversion pattern into spoken style. |
Qualitative Analysis
Manual review by native speakers confirms the model's ability to not only swap particles but also adjust vocabulary (e.g., converting “အလွန်ပင်” to “သိပ်” or “အကယ်ပင်” to “တကယ်လို့တောင်”) in a way that feels authentic and human.
🔗 Related Models in the DatarrX Ecosystem
Explore other specialized models for Myanmar linguistic styles:
- myX-TransStyle-S2W: The sibling model for converting Spoken Style to formal Written Style.
- myX-StyleClassifier: Use this to automatically detect the style of your input text before processing.
How to Use
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the merged model (no PEFT library required)
model_id = "DatarrX/myX-TransStyle-W2S"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# 2. Prepare input with the task prefix used during training
prefix = "Rewrite Burmese formal written sentence into spoken Burmese: "
written_text = "ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံးသော အင်ပါယာနိုင်ငံတော်ကြီး ဖြစ်ခဲ့သည်။"
input_text = prefix + written_text

# 3. Generate spoken style (force Burmese as the target language)
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("mya_Mymr"),
    max_length=160,
    num_beams=5,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံတော်ကြီးဖြစ်ခဲ့တယ်။
```
Intended Use & Limitations
Use Cases
- Natural AI Personalities: Converting formal bot responses into natural-sounding speech.
- Content Localization: Making formal news or articles more accessible for audio/podcasts.
- Creative Writing: Assisting authors in converting narrative descriptions into natural character dialogue.
Limitations
- Dialectal Focus: Primarily focuses on the standard Yangon/Mandalay dialect; regional slang may be less represented.
- Contextual Nuance: While meaning is preserved, the "warmth" of the spoken style may vary depending on the complexity of the input.
Citation
BibTeX
```bibtex
@misc{myx_transstyle_w2s_2026,
  author       = {Khant Sint Heinn (Kalix Louis)},
  title        = {myX-TransStyle-W2S: A Written to Spoken Burmese Style Transfer Model},
  year         = {2026},
  publisher    = {Hugging Face},
  organization = {DatarrX},
  howpublished = {https://huggingface.co/DatarrX/myX-TransStyle-W2S}
}
```
About the Author
Khant Sint Heinn, working under the name Kalix Louis, is a Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.
He is currently the Lead Developer at DatarrX, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.
Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.
His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.
Connect with the Author:
GitHub | Hugging Face | Kaggle
Developed with ❤️ by DatarrX to empower the Myanmar AI ecosystem.
Evaluation results (self-reported, custom external test set)
- BLEU: 19.638
- chrF: 78.397
- TER: 50.735
- BERTScore F1: 0.969