📝 myX-TransStyle-S2W: A Transformer-based Style Transfer for Myanmar Spoken (ပြောဟန်) to Written (ရေးဟန်)
myX-TransStyle-S2W is a specialized Sequence-to-Sequence (Seq2Seq) model developed by Khant Sint Heinn (Kalix Louis) under DatarrX. It is designed to transform colloquial Spoken Burmese (ပြောဟန်) into its formal Written Burmese (ရေးဟန်) counterpart while strictly preserving the original semantic meaning.
Model Details
- Developed by: Khant Sint Heinn (Kalix Louis)
- Organization: DatarrX | ဒေတာ-အက်စ်
- Model Architecture: Fine-tuned NLLB-200 (600M Distilled) with merged LoRA adapters
- Language: Burmese (Myanmar)
- Task: Text Style Transfer (Spoken → Written)
- License: MIT
- Trained on: Myanmar Written-Spoken Parallel Corpus (MWSPC)
Linguistic Context: The Diglossia Challenge
Burmese is a diglossic language, characterized by a sharp divide between two distinct registers. Understanding this is crucial for effective Myanmar NLP:
- Spoken Style (ပြောဟန်): Used in daily life, social media, and verbal communication. It relies on colloquial grammatical markers such as "တယ်" (sentence-final tense marker) or "ရဲ့" (possessive marker).
- Written Style (ရေးဟန်): The standard for news, law, textbooks, and officialdom. It uses formal markers such as "သည်", "၏", and "၍".
Most existing AI models sound "robotic" because they are trained primarily on formal web-scraped data. myX-TransStyle-S2W bridges this gap by enabling AI to convert natural spoken input into grammatically correct formal documentation.
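The particle contrasts above can be illustrated with a toy substitution table. This is a non-exhaustive sketch for intuition only: the `PARTICLE_MAP` entries and `naive_formalize` helper are hypothetical, and real register conversion is context-sensitive, which is exactly why a learned Seq2Seq model is needed rather than rule-based replacement.

```python
# Illustrative (non-exhaustive) mapping of common spoken-style particles to
# written-style counterparts. For intuition only; the model learns these
# transformations from data rather than from a lookup table.
PARTICLE_MAP = {
    "တယ်": "သည်",   # sentence-final tense marker
    "ရဲ့": "၏",      # possessive marker
    "မှာ": "တွင်",    # locative marker
}

def naive_formalize(sentence: str) -> str:
    """Toy rule-based substitution -- NOT how the model works; shown only to
    contrast with context-sensitive Seq2Seq style transfer."""
    for spoken, written in PARTICLE_MAP.items():
        sentence = sentence.replace(spoken, written)
    return sentence
```

A naive replacer like this breaks down whenever a particle string occurs inside an unrelated word, which is one motivation for the Seq2Seq approach.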
Training Methodology
The model was trained using an efficient yet powerful adaptation strategy to handle the nuances of Myanmar grammar.
1. The Dataset (MWSPC)
We utilized 5,555 high-quality, unique parallel text pairs from the MWSPC dataset. This dataset provides a direct mapping between informal and formal structures, curated specifically to remove duplicates and ensure linguistic diversity.
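The deduplication step described above can be sketched as follows. The field names `spoken` and `written` are illustrative assumptions, not the dataset's actual schema:

```python
# Minimal sketch of pair-level deduplication: keep only unique
# (spoken, written) pairs, preserving first-seen order.
pairs = [
    {"spoken": "ဖြစ်ခဲ့တယ်။", "written": "ဖြစ်ခဲ့သည်။"},
    {"spoken": "ဖြစ်ခဲ့တယ်။", "written": "ဖြစ်ခဲ့သည်။"},  # exact duplicate
]

seen, unique_pairs = set(), []
for p in pairs:
    key = (p["spoken"], p["written"])
    if key not in seen:
        seen.add(key)
        unique_pairs.append(p)
```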
2. Parameter-Efficient Fine-Tuning (PEFT)
To capture complex structural transformations without losing the base model's knowledge, we used Low-Rank Adaptation (LoRA):
- Target Modules: q_proj, k_proj, v_proj, out_proj
- Rank (r): 32 | Alpha: 64
- Learning Rate: 8e-5 with a cosine scheduler
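Assuming the Hugging Face PEFT library was used, the hyperparameters above translate into a `LoraConfig` roughly like the following (the `lora_dropout` value is an assumption; it is not stated in this card):

```python
from peft import LoraConfig

# LoRA configuration matching the hyperparameters listed above.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
    lora_dropout=0.05,          # assumption: not stated in the card
    task_type="SEQ_2_SEQ_LM",
)
```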
3. Merging Strategy
After training, the LoRA weights were merged back into the base nllb-200-distilled-600M model using merge_and_unload(). This creates a standalone 2.8 GB model that does not require additional PEFT libraries for inference.
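A minimal sketch of that merging step, assuming a standard PEFT workflow (the adapter path and output directory are placeholders):

```python
from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel

# Load the base model, attach the trained LoRA adapter, then fold the
# low-rank deltas into the base weights so inference needs no PEFT code.
base = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
peft_model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # placeholder path
merged = peft_model.merge_and_unload()
merged.save_pretrained("myX-TransStyle-S2W-merged")  # placeholder output dir
```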
Evaluation Results
The model was evaluated on 100 unseen test sentences across multiple metrics to ensure reliability.
Performance Metrics
| Metric | Score | Interpretation |
|---|---|---|
| BERTScore F1 | 0.9685 | Near-perfect meaning preservation during style transfer. |
| chrF | 75.56 | High character-level similarity, showing strong handling of Myanmar suffixes. |
| BLEU | 12.94 | Low n-gram overlap with the single reference is expected: multiple formal rewrites are often valid. |
| TER | 58.02 | Translation Edit Rate (lower is better); elevated for the same reason as the BLEU score, since valid alternative rewrites increase edit distance. |
Qualitative Analysis
Manual review by native speakers confirms that the model excels at swapping spoken particles (e.g., ...တာပါ။) for formal equivalents (e.g., ...ခြင်းဖြစ်သည်။). Even when the model deviates from the reference text, the outputs remain linguistically acceptable and natural within a formal context.
🔗 Related Models in the DatarrX Ecosystem
To get the most out of Myanmar Style Transfer, we recommend using these sibling models:
- myX-TransStyle-W2S: The inverse model for converting Written Style to Spoken Style.
- myX-StyleClassifier: A high-performance classifier to identify whether a sentence is Written or Spoken before applying style transfer.
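A two-stage pipeline along these lines can be sketched as below. `classify_style` is a stand-in for myX-StyleClassifier, stubbed with a trivial particle heuristic so the routing logic itself is runnable; real use would call the classifier model instead.

```python
# Route a sentence to the appropriate style-transfer model based on its
# detected register. classify_style is a placeholder heuristic, NOT the
# actual myX-StyleClassifier.
def classify_style(text: str) -> str:
    return "spoken" if "တယ်" in text else "written"

def pick_transfer_model(text: str) -> str:
    if classify_style(text) == "spoken":
        return "DatarrX/myX-TransStyle-S2W"   # spoken -> written
    return "DatarrX/myX-TransStyle-W2S"       # written -> spoken
```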
How to Use
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the merged model
model_id = "DatarrX/myX-TransStyle-S2W"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# 2. Prepare the input with the task prefix used during fine-tuning
prefix = "Rewrite Burmese spoken sentence into formal written Burmese: "
spoken_text = "ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့တယ်။"
input_text = prefix + spoken_text

# 3. Generate the written-style output, forcing Burmese as the target language
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("mya_Mymr"),
    max_length=160,
    num_beams=5,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့၏။
```
Intended Use & Limitations
Use Cases
- Formalizing Content: Converting interview transcripts or casual notes into professional reports.
- Data Normalization: Cleaning social media text for downstream NLP tasks.
- Educational Tools: Helping students learn the differences between Myanmar registers.
Limitations
- Hybrid Ambiguity: In cases where a sentence structure is valid in both registers, the model may output minimal changes.
- Domain Specificity: Performance is optimized for standard Yangon/Mandalay dialects and may vary with heavy regional slang.
Citation
BibTeX
```bibtex
@misc{myx_transstyle_s2w_2026,
  author       = {Khant Sint Heinn (Kalix Louis)},
  title        = {myX-TransStyle-S2W: A Spoken to Written Burmese Style Transfer Model},
  year         = {2026},
  publisher    = {Hugging Face},
  organization = {DatarrX},
  howpublished = {\url{https://huggingface.co/DatarrX/myX-TransStyle-S2W}}
}
```
About the Author
Khant Sint Heinn, working under the name Kalix Louis, is a Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.
He is currently the Lead Developer at DatarrX, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.
Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.
His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.
Connect with the Author:
GitHub | Hugging Face | Kaggle
Developed with ❤️ by DatarrX to empower the Myanmar AI ecosystem.