---
license: apache-2.0
datasets:
- Remeinium/WWHO_30m
language:
- si
- hi
- en
pipeline_tag: feature-extraction
library_name: transformers
tags:
- tokenizer
- WWHO
- SGPE
- linguis_trie
- token
- tokenization
- Syllable
- remeinium
- transformer
- linguistics
- NLP
- sinhala
- hindi
- english
- BPE
- GPE
model-index:
- name: WWHO
  results:
  - task:
      type: feature-extraction
    dataset:
      name: WWHO_30m
      type: Remeinium/WWHO_30m
    metrics:
    - name: Token-to-Word Ratio (TWR) - Sinhala
      type: twr
      value: 1.274
      verified: false
    - name: Token-to-Word Ratio (TWR) - Hindi
      type: twr
      value: 1.181
      verified: false
    - name: Token-to-Word Ratio (TWR) - Overall
      type: twr
      value: 1.240
      verified: false
---

# Separate before you Compress

<!-- **Remeinium Research**
[remeinium.com](https://remeinium.com) | [Paper](https://arxiv.org/abs/...) | [Tokenizer](https://huggingface.co/remeinium/WWHO) | [Dataset](https://huggingface.co/datasets/remeinium/WWHO_Cleaned_30m)
--- -->

## The Next Architectural Primitive in Tokenization

Large language models remain linguistically blind to Abugida scripts. Byte-Pair Encoding and its descendants routinely shatter complex conjuncts (atomic multi-codepoint grapheme clusters that constitute the fundamental phonetic units of Indic and Southeast Asian writing systems) into meaningless sub-character fragments. The result is degraded reasoning, inflated inference costs, and a systemic “Token Tax” that disproportionately burdens more than one billion speakers.

**WWHO (Where-What-How Often) introduces the clean separation of concerns the field has been missing.**

By decoupling linguistic structural constraints from statistical compression, WWHO builds a unified meta-vocabulary space:

1. **Layer 1 (Where): Code-Switching Router**
   A linear $O(N)$ block scanner that classifies each character in $O(1)$ time to identify script boundaries. It routes Latin text to a proven frontier tokenizer (such as `o200k_base`) and sends Abugida text to specialized processing.
2. **Layer 2 (What): LinguisTrie**
   Enforces linguistic integrity by construction: a DFA-based syllabifier segments raw Unicode into well-formed syllables with a formal zero-breakage guarantee.
3. **Layer 3 (How Often): SGPE & Meta-Vocabulary**
   Performs statistical pair merging exclusively over this linguistically sound stream, then projects the resulting tokens into a unified, offset ID space.

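The routing step in Layer 1 can be pictured with a minimal sketch. This is not the WWHO implementation: the range table, the `"base"` label, and the `script_of` and `route` helpers are assumptions made for illustration. It only shows how one left-to-right pass with constant-time range checks can split code-switched text into script runs.

```python
# Illustrative script table: (first code point, last code point, label).
# The two ranges are the Unicode Devanagari and Sinhala blocks.
SCRIPT_RANGES = [
    (0x0900, 0x097F, "devanagari"),
    (0x0D80, 0x0DFF, "sinhala"),
]
JOINERS = {"\u200c", "\u200d"}  # ZWNJ / ZWJ stay with the run they appear in


def script_of(ch):
    """Classify one character with a constant number of range checks."""
    cp = ord(ch)
    for first, last, label in SCRIPT_RANGES:
        if first <= cp <= last:
            return label
    return "base"  # assumption: everything else goes to the base tokenizer


def route(text):
    """Single left-to-right pass that splits text into (label, run) pairs."""
    runs = []
    for ch in text:
        label = script_of(ch)
        if ch in JOINERS and runs:
            label = runs[-1][0]  # a joiner never opens a new run
        if runs and runs[-1][0] == label:
            runs[-1][1].append(ch)
        else:
            runs.append((label, [ch]))
    return [(label, "".join(chars)) for label, chars in runs]


print(route("Hello \u0dbd\u0d82\u0d9a\u0dcf and \u0928\u092e\u0938\u094d\u0924\u0947"))
```

In this sketch whitespace and punctuation fall into the `"base"` runs. How the real router assigns them at script boundaries is not specified here.
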
Sinhala and Devanagari serve as the high-complexity proofs of concept. The same architecture is designed to generalize to Tamil, Khmer, Myanmar, and the broader Abugida family.

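To make Layers 2 and 3 concrete, here is a simplified sketch under stated assumptions. The `clusters` function is a small stand-in that keeps dependent signs and virama-joined consonants together; it is not the LinguisTrie DFA and its syllable rules may differ from the real ones. `most_frequent_pair` and `merge_pair` show one BPE-style merge step that treats those clusters as atomic units. `BASE_VOCAB_SIZE` is a placeholder value, not the real size of any tokenizer.

```python
import unicodedata
from collections import Counter

VIRAMAS = {"\u094d", "\u0dca"}  # Devanagari virama, Sinhala al-lakuna
ZWJ = "\u200d"
BASE_VOCAB_SIZE = 200_000  # placeholder offset, chosen only for illustration


def clusters(run):
    """Split an Abugida run into units that never cut a conjunct or a sign."""
    out = []
    joining = False  # True right after a virama (optionally followed by ZWJ)
    for ch in run:
        category = unicodedata.category(ch)
        is_mark = category.startswith("M")
        is_letter = category.startswith("L")
        if out and (is_mark or ch == ZWJ or (joining and is_letter)):
            out[-1] += ch
        else:
            out.append(ch)
        if ch in VIRAMAS:
            joining = True
        elif ch != ZWJ:
            joining = False
    return out


def most_frequent_pair(words):
    """Count adjacent unit pairs inside each word and return the top pair."""
    counts = Counter()
    for units in words:
        counts.update(zip(units, units[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(units, pair):
    """Replace every occurrence of `pair` with one merged unit."""
    merged, i = [], 0
    while i < len(units):
        if i + 1 < len(units) and (units[i], units[i + 1]) == pair:
            merged.append(units[i] + units[i + 1])
            i += 2
        else:
            merged.append(units[i])
            i += 1
    return merged


def to_meta_id(local_id):
    """Shift a script-local token ID past the base tokenizer's ID range."""
    return BASE_VOCAB_SIZE + local_id


word = clusters("\u0928\u092e\u0938\u094d\u0924\u0947")  # three units
best = most_frequent_pair([word, word])
print(word, merge_pair(word, best), to_meta_id(0))
```

Because merges only ever concatenate whole units, a merged token can grow across cluster boundaries but can never begin or end inside a cluster. That is the property the separation of Layers 2 and 3 is meant to provide.
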
---

## Multi-Script Stratified Benchmarks (122.2M Characters)

We evaluated WWHO against the tokenizers of frontier models on a 1.5-million-sentence code-switched corpus containing Sinhala, Hindi (Devanagari), and English. TWR is the token-to-word ratio, Chr/Tok is the average number of characters per token, and % Reduction is how many fewer tokens WWHO produces than the tokenizer in that row.

### 1. Sinhala Efficiency

| Tokenizer | Tokens | TWR | Chr/Tok | % Reduction |
|---|---|---|---|---|
| **SGPE(WWHO)** | **6,654,288** | **1.274** | **4.83** | **-** |
| OpenAI (o200k_base) | 17,360,196 | 3.324 | 1.85 | 61.7% |
| Llama 4 Scout | 18,157,707 | 3.476 | 1.77 | 63.4% |
| DeepSeek V3 | 29,152,698 | 5.581 | 1.10 | 77.2% |

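The % Reduction column can be reproduced from the token counts alone. The short check below assumes the column is defined as `1 - WWHO tokens / baseline tokens`, which matches every row of the Sinhala table. The `reduction_pct` helper name is ours.

```python
def reduction_pct(wwho_tokens, baseline_tokens):
    """Percentage of tokens saved by WWHO relative to a baseline tokenizer."""
    return round(100 * (1 - wwho_tokens / baseline_tokens), 1)


WWHO_SINHALA = 6_654_288
BASELINES = {
    "OpenAI (o200k_base)": 17_360_196,
    "Llama 4 Scout": 18_157_707,
    "DeepSeek V3": 29_152_698,
}

for name, tokens in BASELINES.items():
    print(f"{name}: {reduction_pct(WWHO_SINHALA, tokens)}%")
```
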
### 2. Hindi (Devanagari) Efficiency

| Tokenizer | Tokens | TWR | Chr/Tok | % Reduction |
|---|---|---|---|---|
| **SGPE(WWHO)** | **13,433,554** | **1.181** | **4.29** | **-** |
| OpenAI (o200k_base) | 18,394,075 | 1.617 | 3.13 | 27.0% |
| Llama 4 Scout | 19,566,121 | 1.720 | 2.94 | 31.3% |
| DeepSeek V3 | 31,682,218 | 2.786 | 1.82 | 57.6% |

### 3. English

| Tokenizer | Tokens | TWR | Chr/Tok | % Reduction |
|---|---|---|---|---|
| **SGPE(WWHO)** | **7,240,147** | **1.330** | **4.46** | **-** |
| OpenAI (o200k_base) | 7,420,527 | 1.364 | 4.35 | 2.4% |
| Llama 4 Scout | 7,512,843 | 1.381 | 4.30 | 3.6% |
| DeepSeek V3 | 7,904,670 | 1.453 | 4.09 | 8.4% |

*(Note: WWHO routes Latin text directly to the native Tiktoken encoding, so English spans are tokenized identically to the underlying tokenizer. The small difference in total tokens arises solely from how text is handled at script boundaries.)*

### 4. Overall (Mixed-Script)

| Tokenizer | Tokens | TWR | Chr/Tok | % Reduction |
|---|---|---|---|---|
| **SGPE(WWHO)** | **27,327,989** | **1.240** | **4.47** | **-** |
| OpenAI (o200k_base) | 43,174,798 | 1.959 | 2.83 | 36.7% |
| Llama 4 Scout | 45,236,671 | 2.053 | 2.70 | 39.6% |
| DeepSeek V3 | 68,739,586 | 3.119 | 1.78 | 60.2% |

- **Zero-Breakage Guarantee**: Validated by exhaustive permutation testing across all supported Abugida scripts (0 violations).
- **Full-corpus reconstruction**: 1.5M code-switched sentences encoded and decoded with 0 non-UNK mismatches.
- **UNK rate**: 0.08% (restricted to rare compounds, without violating structural boundaries).

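A reconstruction check of this kind can be sketched as follows. This is not the project's evaluation harness. The `roundtrip_mismatches` and `unk_rate` helpers and the toy codec are hypothetical, and measuring the UNK rate per token rather than per sentence is an assumption.

```python
UNK = -1  # placeholder ID for the unknown token in this toy setup


def roundtrip_mismatches(sentences, encode, decode, unk_id=UNK):
    """Return sentences without UNK whose decode(encode(s)) differs from s."""
    mismatches = []
    for sentence in sentences:
        ids = encode(sentence)
        if unk_id in ids:
            continue  # lossy by definition, so excluded from the comparison
        if decode(ids) != sentence:
            mismatches.append(sentence)
    return mismatches


def unk_rate(sentences, encode, unk_id=UNK):
    """Share of emitted tokens that are UNK (assumed per-token definition)."""
    total = unknown = 0
    for sentence in sentences:
        ids = encode(sentence)
        total += len(ids)
        unknown += ids.count(unk_id)
    return unknown / total if total else 0.0


# Toy codec standing in for a real tokenizer: ASCII is known, the rest is UNK.
def toy_encode(text):
    return [ord(c) if c.isascii() else UNK for c in text]


def toy_decode(ids):
    return "".join("?" if i == UNK else chr(i) for i in ids)


corpus = ["abc", "a\u0dbd"]
print(roundtrip_mismatches(corpus, toy_encode, toy_decode))
print(unk_rate(corpus, toy_encode))
```

With a real tokenizer, `encode` and `decode` would be the tokenizer's own methods and `unk_id` its unknown-token ID.
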
WWHO substantially shortens Abugida text in the context window (61.7% fewer tokens than `o200k_base` on Sinhala and 27.0% fewer on Hindi), reducing the Token Tax without penalizing existing state-of-the-art programming and reasoning capabilities on Latin text.

---

## Quick Start with Hugging Face

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("remeinium/SGPE")
text = "ආයුබෝවන් ශ්රී ලංකා"
tokens = tokenizer.tokenize(text)
# ['ආයුබෝවන්', ' ශ්රී', ' ලංකා']
print(tokenizer.encode(text))
```

---

## Resources

<!--
- **Research Paper**: “The Syllable is the Token: Breaking the Token Tax with SGPE” (Remeinium Research, February 2026) -->
- **Pre-trained Tokenizer**: [Hugging Face](https://huggingface.co/remeinium/WWHO)
- **Cleaned Training Corpus**: [Hugging Face](https://huggingface.co/datasets/remeinium/WWHO_30m)
- **Full Code & Evaluation Harness**: [GitHub](https://github.com/remeinium/WWHO)

---

## License

Apache License 2.0. See [LICENSE](LICENSE).

**Remeinium Research | Remeinium AI | Intelligence for a Greater Tomorrow**

---