📝 myX-StyleClassifier: A Classifier for Myanmar Spoken (ပြောဟန်) and Written (ရေးဟန်) Styles

myX-StyleClassifier is a machine learning model developed by Khant Sint Heinn under DatarrX. It classifies Myanmar (Burmese) text into two distinct linguistic registers: Written Style (Formal) and Spoken Style (Colloquial).

Model Details

Developed by: Khant Sint Heinn (Kalix Louis)
Organization: DatarrX | ဒေတာ-အက်စ်
Model Type: Ensemble Machine Learning (Voting Classifier)
Language(s): Burmese (Myanmar)
License: MIT
Trained on: Myanmar Style Classification Corpus (MSCC)

Training Methodology

To go beyond simple keyword matching, the model combines character-level TF-IDF features with a soft-voting ensemble of three classifiers.

1. Feature Engineering

The model utilizes a TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer with a character-level N-gram range of (2, 4). This allows the model to capture the nuances of Myanmar grammatical suffixes (e.g., "...သည်" vs "...တယ်") and complex structural patterns without requiring a custom tokenizer.
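As a rough illustration of this feature step, the sketch below fits scikit-learn's TfidfVectorizer with character n-grams of length 2 to 4 on two toy sentences. The sentences were written for this example and are not taken from MSCC. The setting analyzer="char" is an assumption: this card states only "character-level" and the range (2, 4), not whether "char" or "char_wb" was used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences written for this example (not from the training corpus).
toy_texts = [
    "မိုးရွာနေသည်။",  # written style, ends in the suffix "သည်"
    "မိုးရွာနေတယ်။",  # spoken style, ends in the suffix "တယ်"
]

# Assumption: analyzer="char"; the card only states "character-level" and (2, 4).
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
features = vectorizer.fit_transform(toy_texts)

# Each suffix is three Unicode code points, so it shows up as a 3-gram feature.
vocab = vectorizer.vocabulary_
print("သည်" in vocab, "တယ်" in vocab)
print(features.shape)  # (number of documents, number of distinct n-grams)
```

Because the two sentences differ only in their final suffix, the n-grams that overlap that suffix are the features that separate them.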

2. Ensemble Architecture

We implemented a Soft Voting Classifier that combines the strengths of three diverse algorithms:

Logistic Regression: Optimized with C=10.0 for high-precision linear separation.
Support Vector Machine (SVC): Provides robust decision boundaries in high-dimensional text space.
Random Forest: Captures non-linear relationships and specific word importance.

The final configuration was selected via GridSearchCV, so the hyperparameters were chosen by cross-validated search on Myanmar text rather than set by hand.
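The architecture described above could be assembled roughly as follows. This is a minimal sketch, not the released training script: the toy sentences, the search grid, cv=3, the macro-F1 scoring, analyzer="char", and every hyperparameter other than C=10.0 are assumptions made for illustration.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy sentences written for this example; 0 = Written/Formal, 1 = Spoken/Colloquial.
texts = [
    "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။",
    "သူသည် စာအုပ်ကို ဖတ်သည်။",
    "မိုးရွာနေသည်။",
    "ကျွန်ုပ်တို့သည် ထမင်းစားကြသည်။",
    "သူမသည် ဈေးသို့ သွားမည်။",
    "ဤစာအုပ်သည် ကောင်းသည်။",
    "ငါ ကျောင်းသွားမလို့။",
    "သူ စာအုပ်ဖတ်တယ်။",
    "မိုးရွာနေတယ်။",
    "ငါတို့ ထမင်းစားကြတယ်။",
    "သူမ ဈေးသွားမယ်။",
    "ဒီစာအုပ် ကောင်းတယ်။",
]
labels = [0] * 6 + [1] * 6

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(C=10.0, max_iter=1000)),
        # probability=True is needed so SVC can take part in soft voting.
        ("svc", SVC(probability=True, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",  # average the predicted class probabilities
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
    ("clf", ensemble),
])

# Hypothetical grid; the card reports only the selected value C=10.0.
param_grid = {"clf__lr__C": [1.0, 10.0, 100.0]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1_macro")
search.fit(texts, labels)

print(search.best_params_)
proba = search.predict_proba(["မိုးရွာနေတယ်။"])
print(proba)  # one row: [P(formal), P(colloquial)]
```

Keeping the vectorizer and the ensemble in one Pipeline means the fitted object accepts raw strings, which matches how the published model is called in the usage example.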

Evaluation Results

The model was validated against a blind test set of 100 unseen sentences (not included in the training/validation split).

Metrics

Metric	Score
Accuracy	96.00%
Macro F1-Score	0.96

Classification Report

Class	Precision	Recall	F1-Score	Support
Formal (0)	0.97	0.93	0.95	40
Colloquial (1)	0.95	0.98	0.97	60

Evaluation Breakdown (Confusion Matrix)

The following table illustrates how the model performed on 100 unseen test sentences:

	Predicted Formal	Predicted Colloquial
Actual Formal	37 (Correct)	3 (Misclassified)
Actual Colloquial	1 (Misclassified)	59 (Correct)

Key Insights from the Matrix:

True Positives (Formal): 37 formal sentences were correctly identified.
True Positives (Colloquial): 59 colloquial sentences were correctly identified.
Misclassifications: Only 4 out of 100 sentences were misclassified (3 formal sentences predicted as colloquial and 1 colloquial sentence predicted as formal), primarily due to "Hybrid" linguistic features where the sentence structure could reasonably belong to either style.
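The reported scores can be re-derived from the confusion matrix. The sketch below uses plain Python with no external dependencies; the helper name per_class_metrics is introduced here for illustration only.

```python
# Confusion matrix from the table above: rows = actual, columns = predicted.
# Index 0 = Formal, index 1 = Colloquial.
confusion = [
    [37, 3],   # actual Formal
    [1, 59],   # actual Colloquial
]

def per_class_metrics(matrix, cls):
    """Return (precision, recall, f1) for one class of a square confusion matrix."""
    tp = matrix[cls][cls]
    predicted = sum(row[cls] for row in matrix)  # column total
    actual = sum(matrix[cls])                    # row total (support)
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(2)) / total
metrics = [per_class_metrics(confusion, c) for c in range(2)]
macro_f1 = sum(m[2] for m in metrics) / len(metrics)

print(f"accuracy = {accuracy:.2f}")
for name, (p, r, f) in zip(["Formal", "Colloquial"], metrics):
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
print(f"macro F1 = {macro_f1:.3f}")
```

For example, Formal precision is 37 / (37 + 1) and Formal recall is 37 / 40, which round to the 0.97 and 0.93 shown in the classification report.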

Error Analysis (Ambiguity Handling)

In the 4% of cases where the model failed, human review confirmed stylistic ambiguity. Certain Myanmar sentences are "Hybrid" or "Dual-use," where the vocabulary is neutral enough to be used in both formal writing and polite daily conversation.

How to Use

To use this model, you need scikit-learn, joblib, and huggingface_hub installed.

import joblib
from huggingface_hub import hf_hub_download

# 1. Download the model from Hugging Face Hub
repo_id = "DatarrX/myX-StyleClassifier"
filename = "model.joblib"
checkpoint_path = hf_hub_download(repo_id=repo_id, filename=filename)

# 2. Load the Ensemble Model
model = joblib.load(checkpoint_path)

# 3. Predict Styles
# 0 = Written/Formal, 1 = Spoken/Colloquial
sample_texts = [
    "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။", # Formal
    "ငါ ကျောင်းသွားမလို့။",              # Colloquial
    "ခဏစောင့်ပေးပါ။"                   # Ambiguous/Polite
]

predictions = model.predict(sample_texts)
probabilities = model.predict_proba(sample_texts) # Get confidence scores

for text, pred, prob in zip(sample_texts, predictions, probabilities):
    label = "Spoken/Colloquial" if pred == 1 else "Written/Formal"
    confidence = prob[pred] * 100
    print(f"Text: {text} | Style: {label} ({confidence:.2f}% confidence)")

🔄 Beyond Classification: Style Transfer

Once you have identified the style of your text using myX-StyleClassifier, you can use our transformation models to switch between registers:

myX-TransStyle-S2W: Convert detected Spoken text into formal Written prose.
myX-TransStyle-W2S: Transform detected Written text into natural Spoken dialogue.
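One way to connect the classifier to these models is to pick the transfer direction from the predicted label. The sketch below only selects a model name. How the transfer models are loaded and called is not described in this card, so that part is left out, and the helper pick_transfer_model is a hypothetical name.

```python
# Label convention from the usage example: 0 = Written/Formal, 1 = Spoken/Colloquial.
TRANSFER_MODEL_BY_LABEL = {
    1: "myX-TransStyle-S2W",  # spoken input -> written output
    0: "myX-TransStyle-W2S",  # written input -> spoken output
}

def pick_transfer_model(predicted_label):
    """Return the name of the model that converts away from the detected style."""
    return TRANSFER_MODEL_BY_LABEL[int(predicted_label)]

print(pick_transfer_model(1))
print(pick_transfer_model(0))
```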

Intended Use & Limitations

Use Cases

Style Checking: Automating the detection of informal language in professional documents.
Chatbot Alignment: Ensuring AI responses match the user's preferred register.
NLP Pre-processing: Filtering datasets for fine-tuning specific language models.

Limitations

The model may struggle with Internet Slang or Ancient Literary Burmese that deviates from modern standard registers.
Sentences that lack style-specific grammatical particles (suffixes) may receive lower confidence scores.
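Given these limitations, it may help to route low-confidence predictions to human review instead of trusting the label. The sketch below is one possible approach, not part of the released model: the helper flag_low_confidence, the 0.75 threshold, and the example probabilities are all made up for illustration.

```python
def flag_low_confidence(probabilities, threshold=0.75):
    """Return indices of rows whose top class probability is below the threshold."""
    return [i for i, row in enumerate(probabilities) if max(row) < threshold]

# Made-up rows in the shape returned by predict_proba: [P(formal), P(colloquial)].
example_probs = [
    [0.97, 0.03],
    [0.08, 0.92],
    [0.55, 0.45],
]

print(flag_low_confidence(example_probs))  # [2]
```

In practice the threshold would need to be chosen on held-out data, since the right cut-off depends on how well calibrated the ensemble's probabilities are.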

Citation

BibTeX

@misc{myx_styleclassifier_2026,
  author = {Khant Sint Heinn (Kalix Louis)},
  title = {myX-StyleClassifier: A Robust Myanmar Style Classification Model},
  year = {2026},
  publisher = {Hugging Face},
  organization = {DatarrX},
  howpublished = {https://huggingface.co/DatarrX/myX-StyleClassifier}
}

About the Author

Khant Sint Heinn, working under the name Kalix Louis, is a Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.

He is currently the Lead Developer at DatarrX, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.

Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.

His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.

Connect with the Author:
GitHub | Hugging Face | Kaggle

Developed with ❤️ by DatarrX to empower the Myanmar AI ecosystem.
