📝 myX-TransStyle-W2S: A Transformer-based Style Transfer for Myanmar Written (ရေးဟန်) to Spoken (ပြောဟန်)
myX-TransStyle-W2S is a specialized Sequence-to-Sequence (Seq2Seq) model developed by Khant Sint Heinn (Kalix Louis) under DatarrX. It is designed to transform formal Written Burmese (ရေးဟန်) into its natural colloquial Spoken Burmese (ပြောဟန်) counterpart, so that formal documents or news can be converted into fluid, human-like dialogue while preserving the original meaning.
Model Details
- Developed by: Khant Sint Heinn (Kalix Louis)
- Organization: DatarrX | ဒေတာ-အက်စ်
- Model Architecture: Fine-tuned NLLB-200 (600M Distilled) with merged LoRA adapters
- Language: Burmese (Myanmar)
- Task: Text Style Transfer (Written → Spoken)
- License: MIT
- Trained on: Myanmar Written-Spoken Parallel Corpus (MWSPC)
Linguistic Context: The Diglossia Challenge
Burmese is a diglossic language, featuring a major linguistic gap between two functional registers:
- Written Style (ရေးဟန်): Used in news, law, textbooks, and officialdom. It relies on formal grammatical markers such as "သည်", "၏", and "၍".
- Spoken Style (ပြောဟန်): Used in daily life, verbal communication, and social media. It uses colloquial markers like "တယ်" (tense), "ရဲ့" (possessive), and "နဲ့" (conjunction).
myX-TransStyle-W2S addresses the "robotic" nature of modern AI by allowing formal text to be localized into the natural, warm tone used by native speakers every day.
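To make the register gap concrete, here is a deliberately naive baseline that swaps the particle pairs listed above via string lookup. This is illustrative only; the mappings are limited to the markers named in this card, and the actual model performs these shifts contextually rather than by substitution:

```python
# Illustrative only: a naive written→spoken particle substitution.
# The real model handles these shifts in context; blind replacement
# like this fails on many real sentences.
WRITTEN_TO_SPOKEN = {
    "သည်": "တယ်",  # sentence-final marker
    "၏": "ရဲ့",    # possessive
}

def naive_w2s(text: str) -> str:
    """Swap formal particles for colloquial ones (ignores context)."""
    for written, spoken in WRITTEN_TO_SPOKEN.items():
        text = text.replace(written, spoken)
    return text

print(naive_w2s("ဖြစ်သည်"))  # ဖြစ်တယ်
```

The model's example output further down (e.g. "သည်" → "က", "တွင်" → "မှာ") shows why a lookup table is not enough: the correct spoken form depends on the particle's grammatical role in the sentence.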
Training Methodology
The model was trained using an efficient adaptation strategy optimized for the unique structural shifts of Myanmar style.
1. The Dataset (MWSPC)
The model was trained on 5,555 high-quality, unique parallel text pairs. This dataset provides a direct mapping from formal literary structures to their informal colloquial equivalents, filtered to ensure maximum diversity.
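The card does not publish the exact filtering procedure, so the following is a hypothetical sketch of one common step, exact-match deduplication on the written-side source, which is a typical way to enforce uniqueness in a parallel corpus:

```python
# Hypothetical sketch: the MWSPC filtering procedure is not published.
# This shows one simple uniqueness filter: keep the first (written,
# spoken) pair seen for each distinct written-side source sentence.
def dedup_pairs(pairs):
    seen = set()
    unique = []
    for written, spoken in pairs:
        if written not in seen:
            seen.add(written)
            unique.append((written, spoken))
    return unique

pairs = [
    ("က သည်", "က တယ်"),
    ("က သည်", "က တယ်"),  # exact duplicate, dropped
    ("ခ သည်", "ခ တယ်"),
]
print(len(dedup_pairs(pairs)))  # 2
```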
2. Parameter-Efficient Fine-Tuning (PEFT)
To capture nuanced stylistic shifts without overwriting the base model's linguistic depth, we utilized Low-Rank Adaptation (LoRA):
- Target Modules: q_proj, k_proj, v_proj, out_proj
- Rank (R): 32 | Alpha: 64
- Learning Rate: 8e-5 with a Cosine scheduler.
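A back-of-envelope count shows why this configuration is parameter-efficient. Assuming NLLB-200 distilled-600M's published dimensions (d_model = 1024, 12 encoder and 12 decoder layers, values not stated in this card), LoRA at rank 32 on the four attention projections adds only about 1.6% of the base model's parameters:

```python
# Back-of-envelope LoRA parameter count for the setup above.
# Assumed (not stated in this card): d_model = 1024, 12 encoder
# and 12 decoder layers in nllb-200-distilled-600M.
r = 32
d_model = 1024
# Each adapted module gets two low-rank matrices: A (r × d_in) and B (d_out × r).
params_per_module = r * d_model + d_model * r

enc_modules = 12 * 4        # encoder self-attention: q, k, v, out per layer
dec_modules = 12 * (4 + 4)  # decoder self-attention + cross-attention
total = (enc_modules + dec_modules) * params_per_module
print(f"{total:,} trainable LoRA parameters (~{total / 600e6:.1%} of 600M)")
```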
3. Merging Strategy
The LoRA adapters were merged into the base nllb-200-distilled-600M model using merge_and_unload(). The resulting standalone 2.8 GB model provides high-speed inference without requiring the PEFT library.
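The merge step described above can be sketched with the PEFT API as follows. This is a sketch, not the authors' training script; the adapter path is hypothetical, and the imports are kept inside the function so the sketch can be defined without the libraries installed:

```python
def merge_lora_adapters(base_id: str, adapter_path: str, out_dir: str):
    """Sketch of the merge step: load base model + LoRA adapters, fold
    the adapter weights into the base weights, and save a standalone
    checkpoint that no longer needs the PEFT library at inference."""
    from transformers import AutoModelForSeq2SeqLM
    from peft import PeftModel

    base = AutoModelForSeq2SeqLM.from_pretrained(base_id)
    model = PeftModel.from_pretrained(base, adapter_path)
    merged = model.merge_and_unload()  # LoRA deltas folded into base weights
    merged.save_pretrained(out_dir)

# Example call (adapter path is hypothetical):
# merge_lora_adapters("facebook/nllb-200-distilled-600M",
#                     "path/to/lora-adapters", "myX-TransStyle-W2S-merged")
```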
Evaluation Results
The model was validated on 100 unseen test sentences and showed superior performance compared to its S2W sibling.
Performance Metrics
| Metric | Score | Interpretation |
|---|---|---|
| BERTScore F1 | 0.9693 | Indicates near-perfect meaning preservation during style transfer. |
| chrF | 78.40 | Exceptional character-level accuracy, specifically in converting formal suffixes. |
| BLEU | 19.64 | Higher than S2W, reflecting a more consistent conversion pattern into spoken style. |
Qualitative Analysis
Manual review by native speakers confirms the model's ability to not only swap particles but also adjust vocabulary (e.g., converting “အလွန်ပင်” to “သိပ်” or “အကယ်ပင်” to “တကယ်လို့တောင်”) in a way that feels authentic and human.
🔗 Related Models in the DatarrX Ecosystem
Explore other specialized models for Myanmar linguistic styles:
- myX-TransStyle-S2W: The sibling model for converting Spoken Style to formal Written Style.
- myX-StyleClassifier: Use this to automatically detect the style of your input text before processing.
How to Use
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the merged model (no PEFT library required)
model_id = "DatarrX/myX-TransStyle-W2S"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# 2. Prepare input with the task prefix used during training
prefix = "Rewrite Burmese formal written sentence into spoken Burmese: "
written_text = "ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံးသော အင်ပါယာနိုင်ငံတော်ကြီး ဖြစ်ခဲ့သည်။"
input_text = prefix + written_text

# 3. Generate spoken style (force Burmese as the target language)
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("mya_Mymr"),
    max_length=160,
    num_beams=5,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံတော်ကြီးဖြစ်ခဲ့တယ်။
```
Intended Use & Limitations
Use Cases
- Natural AI Personalities: Converting formal bot responses into natural-sounding speech.
- Content Localization: Making formal news or articles more accessible for audio/podcasts.
- Creative Writing: Assisting authors in converting narrative descriptions into natural character dialogue.
Limitations
- Dialectal Focus: Primarily focuses on the standard Yangon/Mandalay dialect; regional slang may be less represented.
- Contextual Nuance: While meaning is preserved, the "warmth" of the spoken style may vary depending on the complexity of the input.
Citation
BibTeX
```bibtex
@misc{myx_transstyle_w2s_2026,
  author       = {Khant Sint Heinn (Kalix Louis)},
  title        = {myX-TransStyle-W2S: A Written to Spoken Burmese Style Transfer Model},
  year         = {2026},
  publisher    = {Hugging Face},
  organization = {DatarrX},
  howpublished = {https://huggingface.co/DatarrX/myX-TransStyle-W2S}
}
```
About the Author
Khant Sint Heinn, working under the name Kalix Louis, is a Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.
He is currently the Lead Developer at DatarrX, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.
Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.
His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.
Connect with the Author:
GitHub | Hugging Face | Kaggle
Developed with ❤️ by DatarrX to empower the Myanmar AI ecosystem.
Evaluation results (self-reported, custom external test set)
- BLEU: 19.638
- chrF: 78.397
- TER: 50.735
- BERTScore F1: 0.969