myX-StyleClassifier: A Classifier for Myanmar Spoken (ပြောဟန်) and Written (ရေးဟန်) Styles
myX-StyleClassifier is a high-performance machine learning model developed by Khant Sint Heinn under DatarrX to classify Myanmar (Burmese) text into two distinct linguistic registers: Written Style (Formal) and Spoken Style (Colloquial).
Model Details
- Developed by: Khant Sint Heinn (Kalix Louis)
- Organization: DatarrX | αα±αα¬-α‘ααΊα αΊ
- Model Type: Ensemble Machine Learning (Voting Classifier)
- Language(s): Burmese (Myanmar)
- License: MIT
- Trained on: Myanmar Style Classification Corpus (MSCC)
Training Methodology
To achieve robust performance beyond simple keyword matching, the model was trained as an ensemble: character n-gram features feed three different classifiers whose predictions are combined by voting.
1. Feature Engineering
The model uses a TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer over character-level n-grams with an n-gram range of (2, 4). This lets the model capture Myanmar grammatical suffixes (e.g., "...သည်" vs "...တယ်") and longer structural patterns without requiring a custom tokenizer.
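To make the feature step concrete, the sketch below enumerates character n-grams of length 2 to 4 in plain Python, which is roughly what a character-level vectorizer counts before TF-IDF weighting is applied. The helper name `char_ngrams` is illustrative and not part of the released model. The formal ending သည် and the colloquial ending တယ် serve as example suffixes.

```python
def char_ngrams(text, min_n=2, max_n=4):
    """Return every run of min_n..max_n consecutive code points in text."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams

formal = "သွားသည်"      # verb stem + formal marker သည်
colloquial = "သွားတယ်"  # same stem + colloquial marker တယ်

# N-grams that occur in only one of the two strings
only_formal = set(char_ngrams(formal)) - set(char_ngrams(colloquial))
only_colloquial = set(char_ngrams(colloquial)) - set(char_ngrams(formal))

print(sorted(only_formal))
print(sorted(only_colloquial))
```

Python strings are sequences of code points, so each Myanmar consonant, medial, and vowel sign counts as one character here. The shared stem produces identical n-grams in both strings, while the n-grams that differ all come from the sentence ending, which is the signal a style classifier can exploit.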
2. Ensemble Architecture
We implemented a Soft Voting Classifier that combines the strengths of three diverse algorithms:
- Logistic Regression: Optimized with C=10.0 for high-precision linear separation.
- Support Vector Machine (SVC): Provides robust boundaries in high-dimensional text space.
- Random Forest: Captures non-linear relationships and specific word importance.
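Soft voting averages the class probabilities reported by each model and picks the class with the highest mean, rather than counting one vote per model. The following is a small sketch of that arithmetic for the three classifiers listed above, assuming equal weights; the probability values are invented for illustration.

```python
def soft_vote(per_model_probs):
    """Average class probabilities across models; return (label, averaged)."""
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    averaged = [
        sum(probs[c] for probs in per_model_probs) / n_models
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: averaged[c])
    return label, averaged

# Hypothetical [P(formal), P(colloquial)] for a single sentence
probs = [
    [0.45, 0.55],  # logistic regression leans colloquial
    [0.45, 0.55],  # SVC leans colloquial
    [0.95, 0.05],  # random forest is very confident it is formal
]
label, averaged = soft_vote(probs)
print(label, [round(p, 3) for p in averaged])
```

In this made-up case a simple majority vote would choose class 1 (two models against one), but the averaged probabilities favour class 0 because one model is far more confident than the other two.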
The final configuration was selected with GridSearchCV, so the hyperparameters were tuned by cross-validated search on the Myanmar training data rather than set by hand.
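The architecture described above could be assembled in scikit-learn along these lines. This is a minimal sketch, not the author's training script. Only the (2, 4) n-gram range, C=10.0, soft voting, and the use of GridSearchCV come from the description. The `analyzer="char"` setting, the SVC and random forest settings, the search grid, the scoring metric, and the eight toy sentences are assumptions made for illustration.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

def build_pipeline():
    """Character TF-IDF features followed by a soft-voting ensemble."""
    ensemble = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(C=10.0, max_iter=1000)),
            # probability=True is needed so SVC can take part in soft voting
            ("svc", SVC(probability=True, random_state=0)),
            ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ],
        voting="soft",
    )
    return Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
        ("vote", ensemble),
    ])

# Toy data invented for this sketch: 0 = Written/Formal, 1 = Spoken/Colloquial
texts = [
    "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။", "သူသည် စာအုပ်ကို ဖတ်သည်။",
    "မိုးရွာသည်။", "သူတို့သည် အိမ်သို့ ပြန်ကြသည်။",
    "ငါ ကျောင်းသွားတယ်။", "သူ စာအုပ်ဖတ်တယ်။",
    "မိုးရွာတယ်။", "သူတို့ အိမ်ပြန်ကြတယ်။",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Hypothetical grid; cv=2 only because the toy set is tiny
search = GridSearchCV(
    build_pipeline(),
    param_grid={"vote__lr__C": [1.0, 10.0]},
    cv=2,
    scoring="f1_macro",
)
search.fit(texts, labels)
print(search.best_params_)
print(search.predict_proba(["သူ ထမင်းစားတယ်။"]))
```

With so few sentences the fitted probabilities are not meaningful; the point is the wiring. Nested parameter names such as `vote__lr__C` reach through the pipeline step and the voting classifier to a single estimator.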
Evaluation Results
The model was validated against a blind test set of 100 unseen sentences (not included in the training/validation split).
Metrics
| Metric | Score |
|---|---|
| Accuracy | 96.00% |
| Macro F1-Score | 0.96 |
Classification Report
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Formal (0) | 0.97 | 0.93 | 0.95 | 40 |
| Colloquial (1) | 0.95 | 0.98 | 0.97 | 60 |
Evaluation breakdown (Confusion Matrix)
The following table illustrates how the model performed on 100 unseen test sentences:
| | Predicted Formal | Predicted Colloquial |
|---|---|---|
| Actual Formal | 37 (Correct) | 3 (Misclassified) |
| Actual Colloquial | 1 (Misclassified) | 59 (Correct) |
Key Insights from the Matrix:
- True Positives (Formal): 37 formal sentences were correctly identified.
- True Positives (Colloquial): 59 colloquial sentences were correctly identified.
- Misclassifications: Only 4 out of 100 sentences were misclassified, primarily due to "Hybrid" linguistic features where the sentence structure could reasonably belong to either style.
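The reported scores can be recomputed from the confusion matrix alone. The short check below uses plain Python and the four counts from the table above; the function and variable names are illustrative.

```python
# Confusion matrix from the table above: rows = actual, columns = predicted
# Index 0 = Formal, 1 = Colloquial
matrix = [
    [37, 3],
    [1, 59],
]

def per_class_metrics(matrix, cls):
    """Return (precision, recall, f1) for one class."""
    tp = matrix[cls][cls]
    predicted = sum(row[cls] for row in matrix)  # column total
    actual = sum(matrix[cls])                    # row total (support)
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / total
scores = [per_class_metrics(matrix, c) for c in range(len(matrix))]
macro_f1 = sum(f1 for _, _, f1 in scores) / len(scores)

print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f}")
for name, (p, r, f1) in zip(["Formal", "Colloquial"], scores):
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

For example, Formal precision is 37 / (37 + 1) and Formal recall is 37 / 40. These agree with the classification report once rounded to two decimals.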
Error Analysis (Ambiguity Handling)
In the 4% of cases where the model failed, human review confirmed stylistic ambiguity. Certain Myanmar sentences are "Hybrid" or "Dual-use," where the vocabulary is neutral enough to be used in both formal writing and polite daily conversation.
How to Use
To use this model, you need scikit-learn, joblib, and huggingface_hub installed.
```python
import joblib
from huggingface_hub import hf_hub_download

# 1. Download the model from Hugging Face Hub
repo_id = "DatarrX/myX-StyleClassifier"
filename = "model.joblib"
checkpoint_path = hf_hub_download(repo_id=repo_id, filename=filename)

# 2. Load the Ensemble Model
model = joblib.load(checkpoint_path)

# 3. Predict Styles
# 0 = Written/Formal, 1 = Spoken/Colloquial
sample_texts = [
    "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။",  # Formal
    "ငါ ကျောင်းသွားမလို့။",  # Colloquial
    "ခဏစောင့်ပေးပါ။",  # Ambiguous/Polite
]

predictions = model.predict(sample_texts)
probabilities = model.predict_proba(sample_texts)  # Get confidence scores

for text, pred, prob in zip(sample_texts, predictions, probabilities):
    label = "Spoken/Colloquial" if pred == 1 else "Written/Formal"
    confidence = prob[pred] * 100  # probability of the predicted class
    print(f"Text: {text} | Style: {label} ({confidence:.2f}% confidence)")
```
Beyond Classification: Style Transfer
Once you have identified the style of your text using myX-StyleClassifier, you can use our transformation models to switch between registers:
- myX-TransStyle-S2W: Convert detected Spoken text into formal Written prose.
- myX-TransStyle-W2S: Transform detected Written text into natural Spoken dialogue.
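Choosing which of the two models to call next is a simple lookup on the predicted label. The sketch below only picks the model name; the mapping and the function name are illustrative, and how the transfer models themselves are loaded and run is not covered here.

```python
# Label convention from the usage example: 0 = Written/Formal, 1 = Spoken/Colloquial
TRANSFER_MODEL_FOR_LABEL = {
    1: "myX-TransStyle-S2W",  # spoken input -> written output
    0: "myX-TransStyle-W2S",  # written input -> spoken output
}

def pick_transfer_model(predicted_label):
    """Return the name of the model that converts text out of its detected style."""
    return TRANSFER_MODEL_FOR_LABEL[int(predicted_label)]

print(pick_transfer_model(1))
print(pick_transfer_model(0))
```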
Intended Use & Limitations
Use Cases
- Style Checking: Automating the detection of informal language in professional documents.
- Chatbot Alignment: Ensuring AI responses match the user's preferred register.
- NLP Pre-processing: Filtering datasets for fine-tuning specific language models.
Limitations
- The model may struggle with Internet Slang or Ancient Literary Burmese that deviates from modern standard registers.
- Sentences that lack specific grammatical particles (suffixes) may result in lower confidence scores.
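One way to work with these limitations is to treat low-confidence predictions as ambiguous instead of forcing a label. The sketch below assumes probability rows in the order [P(formal), P(colloquial)], matching the 0/1 label convention; the 0.70 threshold and the function name are hypothetical and would need tuning on real data.

```python
def label_with_ambiguity(prob_row, threshold=0.70):
    """Map [P(formal), P(colloquial)] to a style label, or flag it as ambiguous."""
    p_formal, p_colloquial = prob_row
    if max(p_formal, p_colloquial) < threshold:
        return "Ambiguous"
    return "Spoken/Colloquial" if p_colloquial > p_formal else "Written/Formal"

# Hypothetical probability rows, shaped like the output of predict_proba
rows = [[0.92, 0.08], [0.45, 0.55], [0.10, 0.90]]
print([label_with_ambiguity(r) for r in rows])
# → ['Written/Formal', 'Ambiguous', 'Spoken/Colloquial']
```

Sentences flagged this way can be routed to human review or left unlabeled when filtering a dataset.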
Citation
BibTeX
@misc{myx_styleclassifier_2026,
author = {Khant Sint Heinn (Kalix Louis)},
title = {myX-StyleClassifier: A Robust Myanmar Style Classification Model},
year = {2026},
publisher = {Hugging Face},
organization = {DatarrX},
howpublished = {https://huggingface.co/DatarrX/myX-StyleClassifier}
}
About the Author
Khant Sint Heinn, working under the name Kalix Louis, is a Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.
He is currently the Lead Developer at DatarrX, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.
Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.
His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.
Connect with the Author:
GitHub | Hugging Face | Kaggle
Developed with ❤️ by DatarrX to empower the Myanmar AI ecosystem.