Different tokenization result to llama-model reference implementation

#4
by heheda - opened
Meta Llama org

The tokenization results of Meta's reference implementation and Hugging Face differ.
For the one-image request in Meta's example (scripts/multimodal_example_chat_completion.py), the tokenization result is:
128000, 128006, 882, 128007, 271, 128256, 75885, 420, 2217, 304, 1403, 23719, 128009, 128006, 78191, 128007, 271,
While Hugging Face produces:
256, 128000, 256, 128006, 882, 128007, 271, 257, 128256, 262, 61885, 420, 2217, 304, 1403, 23719, 257, 128009, 262, 128006, 78191, 128007, 271, 220
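A quick way to see what differs is to compare the two ID sequences directly (both copied verbatim from above) and isolate the IDs that only appear in the Hugging Face output:

```python
# Sketch: diff the two token ID sequences quoted above to isolate the
# tokens that only the Hugging Face tokenization produces.
meta_ids = [128000, 128006, 882, 128007, 271, 128256, 75885, 420, 2217,
            304, 1403, 23719, 128009, 128006, 78191, 128007, 271]
hf_ids = [256, 128000, 256, 128006, 882, 128007, 271, 257, 128256, 262,
          61885, 420, 2217, 304, 1403, 23719, 257, 128009, 262, 128006,
          78191, 128007, 271, 220]

# IDs present only in the Hugging Face output.
extra = sorted(set(hf_ids) - set(meta_ids))
print(extra)  # [220, 256, 257, 262, 61885]
```

A plausible reading (an inference, not something stated in the thread): the low-valued extras (220, 256, 257, 262) are whitespace-run tokens picked up from indentation and newlines in the chat template, and 61885 appearing instead of 75885 suggests the stray leading whitespace also changed how "Describe" was merged by the BPE. Decoding these IDs with the tokenizer would confirm.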

Hugging Face test code:

```python
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in two sentences"},
        ],
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
# print("text is:", text)

url = "https://llava-vl.github.io/static/images/view.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=text, images=raw_image, return_tensors="pt").to(model.device)
print("input_ids:", inputs["input_ids"])
output = model.generate(**inputs, do_sample=False, max_new_tokens=25)
print(processor.decode(output[0]))
```
Meta Llama org

Using the chat_template from 90B Instruct solves this issue; now I get [128000, 128006, 882, 128007, 271, 128256, 75885, 420, 2217, 304, 1403, 23719, 128009, 128006, 78191, 128007, 271]

Meta Llama org

Now fixed by this PR; I get [128000, 128006, 882, 128007, 271, 128256, 75885, 420, 2217, 304, 1403, 23719, 128009, 128006, 78191, 128007, 271]
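As a quick sanity check, the post-fix token list can be compared directly against the reference output from the original post (both copied verbatim from this thread):

```python
# Sketch: confirm the Hugging Face tokenization after the template fix
# matches Meta's reference implementation exactly.
reference_ids = [128000, 128006, 882, 128007, 271, 128256, 75885, 420, 2217,
                 304, 1403, 23719, 128009, 128006, 78191, 128007, 271]
fixed_ids = [128000, 128006, 882, 128007, 271, 128256, 75885, 420, 2217,
             304, 1403, 23719, 128009, 128006, 78191, 128007, 271]

assert fixed_ids == reference_ids
print("tokenization matches the reference")
```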

Meta Llama org

Thank you very much!

Thank you @wukaixingxp 🙌 We'll update the template in the tokenizer_config.json file if needed, as that's the one used by the transformers tokenizer.

Fixed, closing now.

pcuenq changed discussion status to closed
