📝 myX-TransStyle-S2W: A Transformer-based Style Transfer for Myanmar Spoken (ပြောဟန်) to Written (ရေးဟန်)
myX-TransStyle-S2W is a specialized Sequence-to-Sequence (Seq2Seq) model developed by Khant Sint Heinn (Kalix Louis) under DatarrX. It is designed to transform colloquial Spoken Burmese (ပြောဟန်) into its formal Written Burmese (ရေးဟန်) counterpart while strictly preserving the original semantic meaning.
Model Details
- Developed by: Khant Sint Heinn (Kalix Louis)
- Organization: DatarrX | ဒေတာ-အက်စ်
- Model Architecture: Fine-tuned NLLB-200 (600M Distilled) with merged LoRA adapters
- Language: Burmese (Myanmar)
- Task: Text Style Transfer (Spoken → Written)
- License: MIT
- Trained on: Myanmar Written-Spoken Parallel Corpus (MWSPC)
Linguistic Context: The Diglossia Challenge
Burmese is a diglossic language, characterized by a sharp divide between two distinct registers. Understanding this is crucial for effective Myanmar NLP:
- Spoken Style (ပြောဟန်): Used in daily life, social media, and verbal communication. It relies on colloquial grammatical markers such as "တယ်" (sentence-final tense marker) or "ရဲ့" (possessive marker).
- Written Style (ရေးဟန်): The standard for news, law, textbooks, and officialdom. It uses formal markers such as "သည်", "၏", and "၍".
Most existing AI models sound "robotic" because they are trained primarily on formal web-scraped data. myX-TransStyle-S2W bridges this gap by enabling AI to convert natural spoken input into grammatically correct formal documentation.
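The particle contrasts above can be illustrated with a toy substitution table. This is a non-exhaustive sketch for intuition only: the `PARTICLE_MAP` entries and `naive_formalize` helper are hypothetical, and real register conversion is context-sensitive, which is exactly why a learned Seq2Seq model is needed rather than rule-based replacement.

```python
# Illustrative (non-exhaustive) mapping of common spoken-style particles to
# written-style counterparts. For intuition only; the model learns these
# transformations from data rather than from a lookup table.
PARTICLE_MAP = {
    "တယ်": "သည်",   # sentence-final tense marker
    "ရဲ့": "၏",      # possessive marker
    "မှာ": "တွင်",    # locative marker
}

def naive_formalize(sentence: str) -> str:
    """Toy rule-based substitution -- NOT how the model works; shown only to
    contrast with context-sensitive Seq2Seq style transfer."""
    for spoken, written in PARTICLE_MAP.items():
        sentence = sentence.replace(spoken, written)
    return sentence
```

A naive replacer like this breaks down whenever a particle string occurs inside an unrelated word, which is one motivation for the Seq2Seq approach.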
Training Methodology
The model was trained using an efficient yet powerful adaptation strategy to handle the nuances of Myanmar grammar.
1. The Dataset (MWSPC)
We utilized 5,555 high-quality, unique parallel text pairs from the MWSPC dataset. This dataset provides a direct mapping between informal and formal structures, curated specifically to remove duplicates and ensure linguistic diversity.
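The deduplication step described above can be sketched as follows. The field names `spoken` and `written` are illustrative assumptions, not the dataset's actual schema:

```python
# Minimal sketch of pair-level deduplication: keep only unique
# (spoken, written) pairs, preserving first-seen order.
pairs = [
    {"spoken": "ဖြစ်ခဲ့တယ်။", "written": "ဖြစ်ခဲ့သည်။"},
    {"spoken": "ဖြစ်ခဲ့တယ်။", "written": "ဖြစ်ခဲ့သည်။"},  # exact duplicate
]

seen, unique_pairs = set(), []
for p in pairs:
    key = (p["spoken"], p["written"])
    if key not in seen:
        seen.add(key)
        unique_pairs.append(p)
```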
2. Parameter-Efficient Fine-Tuning (PEFT)
To capture complex structural transformations without losing the base model's knowledge, we used Low-Rank Adaptation (LoRA):
- Target Modules: q_proj, k_proj, v_proj, out_proj
- Rank (r): 32 | Alpha: 64
- Learning Rate: 8e-5 with a cosine scheduler
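Assuming the Hugging Face PEFT library was used, the hyperparameters above translate into a `LoraConfig` roughly like the following (the `lora_dropout` value is an assumption; it is not stated in this card):

```python
from peft import LoraConfig

# LoRA configuration matching the hyperparameters listed above.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
    lora_dropout=0.05,          # assumption: not stated in the card
    task_type="SEQ_2_SEQ_LM",
)
```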
3. Merging Strategy
After training, the LoRA weights were merged back into the base nllb-200-distilled-600M model using merge_and_unload(). This creates a standalone 2.8 GB model that does not require additional PEFT libraries for inference.
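A minimal sketch of that merging step, assuming a standard PEFT workflow (the adapter path and output directory are placeholders):

```python
from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel

# Load the base model, attach the trained LoRA adapter, then fold the
# low-rank deltas into the base weights so inference needs no PEFT code.
base = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
peft_model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # placeholder path
merged = peft_model.merge_and_unload()
merged.save_pretrained("myX-TransStyle-S2W-merged")  # placeholder output dir
```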
Evaluation Results
The model was evaluated on 100 unseen test sentences across multiple metrics to ensure reliability.
Performance Metrics
| Metric | Score | Interpretation |
|---|---|---|
| BERTScore F1 | 0.9685 | Near-perfect meaning preservation during style transfer. |
| chrF | 75.56 | High character-level similarity, showing strong handling of Myanmar suffixes. |
| BLEU | 12.94 | Low n-gram overlap with the single reference is expected: multiple formal rewrites are often valid. |
| TER | 58.02 | Translation Edit Rate (lower is better); elevated for the same reason as the BLEU score, since valid alternative rewrites increase edit distance. |
Qualitative Analysis
Manual review by native speakers confirms that the model excels at swapping spoken particles (e.g., ...တာပါ။) for formal equivalents (e.g., ...ခြင်းဖြစ်သည်။). Even when the model deviates from the reference text, the outputs remain linguistically acceptable and natural within a formal context.
🔗 Related Models in the DatarrX Ecosystem
To get the most out of Myanmar Style Transfer, we recommend using these sibling models:
- myX-TransStyle-W2S: The inverse model for converting Written Style to Spoken Style.
- myX-StyleClassifier: A high-performance classifier to identify whether a sentence is Written or Spoken before applying style transfer.
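A two-stage pipeline along these lines can be sketched as below. `classify_style` is a stand-in for myX-StyleClassifier, stubbed with a trivial particle heuristic so the routing logic itself is runnable; real use would call the classifier model instead.

```python
# Route a sentence to the appropriate style-transfer model based on its
# detected register. classify_style is a placeholder heuristic, NOT the
# actual myX-StyleClassifier.
def classify_style(text: str) -> str:
    return "spoken" if "တယ်" in text else "written"

def pick_transfer_model(text: str) -> str:
    if classify_style(text) == "spoken":
        return "DatarrX/myX-TransStyle-S2W"   # spoken -> written
    return "DatarrX/myX-TransStyle-W2S"       # written -> spoken
```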
How to Use
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the merged model
model_id = "DatarrX/myX-TransStyle-S2W"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# 2. Prepare the input with the task prefix used during fine-tuning
prefix = "Rewrite Burmese spoken sentence into formal written Burmese: "
spoken_text = "ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့တယ်။"
input_text = prefix + spoken_text

# 3. Generate the written-style output, forcing Burmese as the target language
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("mya_Mymr"),
    max_length=160,
    num_beams=5,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့၏။
```
Intended Use & Limitations
Use Cases
- Formalizing Content: Converting interview transcripts or casual notes into professional reports.
- Data Normalization: Cleaning social media text for downstream NLP tasks.
- Educational Tools: Helping students learn the differences between Myanmar registers.
Limitations
- Hybrid Ambiguity: In cases where a sentence structure is valid in both registers, the model may output minimal changes.
- Domain Specificity: Performance is optimized for standard Yangon/Mandalay dialects and may vary with heavy regional slang.
Citation
BibTeX
```bibtex
@misc{myx_transstyle_s2w_2026,
  author       = {Khant Sint Heinn (Kalix Louis)},
  title        = {myX-TransStyle-S2W: A Spoken to Written Burmese Style Transfer Model},
  year         = {2026},
  publisher    = {Hugging Face},
  organization = {DatarrX},
  howpublished = {\url{https://huggingface.co/DatarrX/myX-TransStyle-S2W}}
}
```
About the Author
Khant Sint Heinn, working under the name Kalix Louis, is a Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.
He is currently the Lead Developer at DatarrX, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.
Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.
His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.
Connect with the Author:
GitHub | Hugging Face | Kaggle
Developed with ❤️ by DatarrX to empower the Myanmar AI ecosystem.