TokSuite

community

AI & ML interests

Tokenization, Robustness, LLMs

Recent Activity

gsaltintas updated a collection 18 days ago

TokSuite Text-Matched Models

gsaltintas updated a Space 22 days ago

toksuite/quick-tokenizer-accuracy

gsaltintas updated a dataset 24 days ago

toksuite/toksuite_chinese

Papers

TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior

Organization Card

TokSuite is a collection of models and benchmarks designed to isolate and study the impact of tokenization on language model behavior across English, Chinese, Turkish, Italian, and Farsi, as well as STEM and mathematical text. It includes fourteen models that share the same architecture, training data, training budget, and initialization and differ only in their tokenizers, alongside a set of benchmarks that evaluate performance under real-world perturbations that affect tokenization.
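
Because the models differ only in their tokenizers, one way to read benchmark results is to compare how much each model's accuracy falls when the same questions are perturbed. The following is a minimal sketch of that comparison, not the metric used in the paper: the helper names `accuracy` and `relative_drop` are hypothetical, and the per-item correctness values are toy placeholders, not TokSuite results.

```python
from typing import Dict, List, Tuple


def accuracy(correct: List[bool]) -> float:
    """Fraction of items answered correctly (0.0 for an empty list)."""
    return sum(correct) / len(correct) if correct else 0.0


def relative_drop(canonical: List[bool], perturbed: List[bool]) -> float:
    """Accuracy lost under perturbation, as a fraction of canonical accuracy."""
    base = accuracy(canonical)
    if base == 0.0:
        return 0.0  # nothing to lose if the canonical accuracy is already zero
    return (base - accuracy(perturbed)) / base


# Toy placeholder data: (canonical correctness, perturbed correctness) per model.
# The model names and values are illustrative only.
toy_results: Dict[str, Tuple[List[bool], List[bool]]] = {
    "model_a": ([True, True, True, True], [True, True, True, False]),
    "model_b": ([True, True, False, False], [True, False, False, False]),
}

drops = {name: relative_drop(c, p) for name, (c, p) in toy_results.items()}
for name, drop in drops.items():
    print(f"{name}: relative accuracy drop = {drop:.2f}")
```

Normalizing by canonical accuracy keeps a weak model from looking robust simply because it had little accuracy to lose.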

Our code is available at https://github.com/r-three/Tokenizers.

Collections 4

Spaces 3

Quick Tokenizer Accuracy

Evaluate models on multiple-choice questions
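
A common way to evaluate a language model on multiple-choice questions is to score every answer option and select the highest-scoring one. The sketch below shows only that selection step and is not taken from the Space's code. The `score` callable is an assumption standing in for a model-based score such as the log-likelihood of the option given the question, and `toy_score` is a hypothetical word-overlap stand-in used so the example runs without a model.

```python
from typing import Callable, Sequence


def pick_answer(
    question: str,
    options: Sequence[str],
    score: Callable[[str, str], float],
) -> int:
    """Return the index of the highest-scoring option (first one on ties)."""
    scores = [score(question, option) for option in options]
    return max(range(len(options)), key=scores.__getitem__)


def toy_score(question: str, option: str) -> float:
    """Hypothetical stand-in scorer: number of words shared with the question."""
    question_words = set(question.lower().split())
    return float(len(question_words & set(option.lower().split())))


question = "Which city is the capital of Italy"
options = ["Paris city", "the capital Rome", "Ankara"]
predicted = pick_answer(question, options, toy_score)
print(options[predicted])
```

Passing the scorer as a callable keeps the selection logic identical across models, so any difference in predictions comes from the model and its tokenizer rather than from the evaluation loop.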

Tokenizer Comparison

Compare how different tokenizers split text into tokens
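
The kind of side-by-side comparison this Space offers can be sketched in a few lines. The example below is an illustration under stated assumptions, not the Space's implementation: `whitespace_tokenize` and `char_tokenize` are toy tokenizers, and `compare_splits` is a hypothetical helper that accepts any callable mapping a string to a list of tokens.

```python
from typing import Callable, Dict, List

Tokenizer = Callable[[str], List[str]]


def whitespace_tokenize(text: str) -> List[str]:
    """Toy tokenizer: split on runs of whitespace."""
    return text.split()


def char_tokenize(text: str) -> List[str]:
    """Toy tokenizer: one token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


def compare_splits(tokenizers: Dict[str, Tokenizer], text: str) -> Dict[str, List[str]]:
    """Tokenize the same text with every tokenizer and collect the splits."""
    return {name: tokenize(text) for name, tokenize in tokenizers.items()}


splits = compare_splits(
    {"whitespace": whitespace_tokenize, "char": char_tokenize},
    "tokenization matters",
)
for name, tokens in splits.items():
    print(f"{name}: {len(tokens)} tokens -> {tokens}")
```

To compare real tokenizers, the `tokenize` method of a tokenizer loaded with `transformers.AutoTokenizer.from_pretrained` could be passed in place of the toy functions, assuming the repository in question ships tokenizer files in a format that library can load.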

Models 20

toksuite/google-gemma-2-2b

Text Generation • 2B • Updated Dec 25, 2025 • 151

toksuite/meta-llama-Llama-3.2-1B

Text Generation • 2B • Updated Dec 25, 2025 • 154

toksuite/CohereLabs-aya-expanse-8b

Text Generation • 2B • Updated Dec 25, 2025 • 117

toksuite/tiktoken-gpt-4o

Text Generation • 2B • Updated Dec 25, 2025 • 671

toksuite/common-pile-comma-v0.1

Text Generation • 2B • Updated Dec 25, 2025 • 220

toksuite/microsoft-Phi-3-mini-4k-instruct

Text Generation • 1B • Updated Dec 25, 2025 • 126

toksuite/google-bert-bert-base-multilingual-cased

Text Generation • 2B • Updated Dec 25, 2025 • 205

toksuite/Qwen-Qwen3-8B

Text Generation • 2B • Updated Dec 25, 2025 • 94

toksuite/tokenmonster-englishcode-32000-consistent-v1

Text Generation • 1B • Updated Dec 25, 2025 • 146

toksuite/mistralai-tekken

Text Generation • 2B • Updated Dec 25, 2025 • 292

Datasets 10

toksuite/toksuite_chinese

Viewer • Updated 24 days ago • 485 • 2.76k

toksuite/toksuite_turkish

Viewer • Updated 24 days ago • 621 • 123

toksuite/toksuite_farsi

Viewer • Updated 25 days ago • 747 • 147

toksuite/toksuite_math

Viewer • Updated 25 days ago • 189 • 244

toksuite/toksuite_english

Viewer • Updated 25 days ago • 1.14k • 2.7k

toksuite/toksuite_italian

Viewer • Updated 25 days ago • 1.09k • 1.77k

toksuite/toksuite_stem

Viewer • Updated 25 days ago • 613 • 1.16k

toksuite/toksuite_general

Viewer • Updated 25 days ago • 68 • 92

toksuite/toksuite_pretraining_data

Viewer • Updated Dec 18, 2025 • 107M • 795

toksuite/Qwen-Qwen3-8B-toksuite-detokenized

Viewer • Updated Dec 18, 2025 • 28M • 132
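
The eight benchmark datasets listed above follow a `toksuite/toksuite_<subset>` naming pattern, so their repository IDs can be built programmatically. The sketch below assumes the Hugging Face `datasets` library is installed for the loading step. Configuration and split names are not given on this page, so `load_benchmark` passes only the repository ID and may need an extra configuration argument in practice. Both helper names are hypothetical.

```python
from typing import List

# Subset names taken from the benchmark dataset IDs listed on this page.
SUBSETS: List[str] = [
    "chinese", "turkish", "farsi", "math",
    "english", "italian", "stem", "general",
]


def benchmark_repo_id(subset: str) -> str:
    """Build the Hub repository ID for one benchmark subset."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset: {subset!r}")
    return f"toksuite/toksuite_{subset}"


def load_benchmark(subset: str):
    """Download one benchmark from the Hub (needs network access).

    Assumption: the repository loads without an explicit configuration name.
    """
    from datasets import load_dataset  # imported lazily; only needed here

    return load_dataset(benchmark_repo_id(subset))


for subset in SUBSETS:
    print(benchmark_repo_id(subset))
```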