TokSuite
community
AI & ML interests
Tokenization, Robustness, LLMs
Organization Card
TokSuite is a collection of models and benchmarks designed to isolate and study the impact of tokenization on language model behavior across English, Chinese, Turkish, Italian, and Farsi, as well as STEM and mathematical text. It includes fourteen models that share the same architecture, training data, training budget, and initialization and differ only in their tokenizers, alongside a set of benchmarks that evaluate performance under real-world perturbations that affect tokenization.
Our code is available at https://github.com/r-three/Tokenizers.
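To make the idea of a tokenization-affecting perturbation concrete, here is a minimal, self-contained sketch. It does not use the TokSuite tokenizers or benchmarks: the two toy tokenizers (`whitespace_tokenize`, `char_chunk_tokenize`), the helper `token_overlap`, and the example strings are all hypothetical and chosen only for illustration. It measures how much of a token sequence survives a small spelling change under each tokenizer.

```python
from difflib import SequenceMatcher


def whitespace_tokenize(text):
    # Toy tokenizer (assumption, not a TokSuite tokenizer): split on whitespace.
    return text.split()


def char_chunk_tokenize(text, width=2):
    # Toy tokenizer (assumption): fixed-width character chunks.
    return [text[i:i + width] for i in range(0, len(text), width)]


def token_overlap(tokenize, original, perturbed):
    # Similarity in [0, 1] between the two token sequences:
    # 2 * (matched tokens) / (total tokens in both sequences).
    a = tokenize(original)
    b = tokenize(perturbed)
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


original = "tokenization matters"
perturbed = "tokenisation matters"  # hypothetical spelling-variant perturbation

for name, tokenize in [
    ("whitespace", whitespace_tokenize),
    ("char chunks", char_chunk_tokenize),
]:
    print(name, token_overlap(tokenize, original, perturbed))
```

The same one-character change replaces a whole token under the whitespace tokenizer but only one short chunk under the character-chunk tokenizer. That is the kind of tokenizer-dependent difference the suite is designed to study, here shown only at the level of token sequences rather than model behavior.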
models (20)
toksuite/google-gemma-2-2b • Text Generation • 2B • 151
toksuite/meta-llama-Llama-3.2-1B • Text Generation • 2B • 154
toksuite/CohereLabs-aya-expanse-8b • Text Generation • 2B • 117
toksuite/tiktoken-gpt-4o • Text Generation • 2B • 671
toksuite/common-pile-comma-v0.1 • Text Generation • 2B • 220
toksuite/microsoft-Phi-3-mini-4k-instruct • Text Generation • 1B • 126
toksuite/google-bert-bert-base-multilingual-cased • Text Generation • 2B • 205
toksuite/Qwen-Qwen3-8B • Text Generation • 2B • 94
toksuite/tokenmonster-englishcode-32000-consistent-v1 • Text Generation • 1B • 146
toksuite/mistralai-tekken • Text Generation • 2B • 292
datasets (10)
toksuite/toksuite_chinese • 485 • 2.76k
toksuite/toksuite_turkish • 621 • 123
toksuite/toksuite_farsi • 747 • 147
toksuite/toksuite_math • 189 • 244
toksuite/toksuite_english • 1.14k • 2.7k
toksuite/toksuite_italian • 1.09k • 1.77k
toksuite/toksuite_stem • 613 • 1.16k
toksuite/toksuite_general • 68 • 92
toksuite/toksuite_pretraining_data • 107M • 795
toksuite/Qwen-Qwen3-8B-toksuite-detokenized • 28M • 132