Athena
Hi, I am trying to train a basic Word Level tokenizer based on a file data.txt containing
5174 5155 4749 4814 4832 4761 4523 4999 4860 4699 5024 4788 [UNK]
When I run my code
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
tokenizer = Tokenizer(WordLevel(unk_token='[UNK]'))
tokenizer.train(files=['data.txt'])
tokenizer.encode('5155')
I get the error
Exception: WordLevel error: Missing [UNK] token from the vocabulary
Why is it still missing despite having [UNK] in data.txt and also setting unk_token='[UNK]'?
Any help is much appreciated!
Hi Athena, I’m having the same issue… did you find the root of the problem?
I am experiencing this too
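As far as I can tell, the unk_token argument on the WordLevel model only names which vocabulary entry to fall back to for unknown words; it does not put [UNK] into the vocabulary by itself. Calling train(files=...) without a trainer uses default settings with no special tokens, so [UNK] never ends up in the trained vocab and the lookup fails at encode time. Registering it through a WordLevelTrainer fixed it for me: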
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
tokenizer = Tokenizer(WordLevel(unk_token='[UNK]'))
########## Specify [UNK] here ############
trainer = WordLevelTrainer(
    special_tokens=['[UNK]']
)
##########################################
files = ['./datasets/AAABBBCCC.txt']
tokenizer.train(files, trainer) # <--- specify trainer
tokenizer.encode('41').ids
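If it helps, here is a quick sanity check I run after training (just a sketch against the snippet above; the unseen word is made up):
# Confirm [UNK] was added to the vocabulary and is used as the fallback
print('[UNK]' in tokenizer.get_vocab())          # True
print(tokenizer.token_to_id('[UNK]'))            # id assigned to [UNK]
print(tokenizer.encode('never_seen_word').ids)   # contains the [UNK] id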