Different tokenization result to llama-model reference implementation

#4
by heheda - opened
Meta Llama org

The tokenization results of Meta's reference implementation and Hugging Face differ.
For the one-image request in Meta's example (scripts/multimodal_example_chat_completion.py), the tokenization result is:
128000, 128006, 882, 128007, 271, 128256, 75885, 420, 2217, 304, 1403, 23719, 128009, 128006, 78191, 128007, 271,
While Hugging Face produces:
256, 128000, 256, 128006, 882, 128007, 271, 257, 128256, 262, 61885, 420, 2217, 304, 1403, 23719, 257, 128009, 262, 128006, 78191, 128007, 271, 220
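A quick way to see what differs is to compare the two ID sequences directly (both copied verbatim from above) and isolate the IDs that only appear in the Hugging Face output:

```python
# Sketch: diff the two token ID sequences quoted above to isolate the
# tokens that only the Hugging Face tokenization produces.
meta_ids = [128000, 128006, 882, 128007, 271, 128256, 75885, 420, 2217,
            304, 1403, 23719, 128009, 128006, 78191, 128007, 271]
hf_ids = [256, 128000, 256, 128006, 882, 128007, 271, 257, 128256, 262,
          61885, 420, 2217, 304, 1403, 23719, 257, 128009, 262, 128006,
          78191, 128007, 271, 220]

# IDs present only in the Hugging Face output.
extra = sorted(set(hf_ids) - set(meta_ids))
print(extra)  # [220, 256, 257, 262, 61885]
```

A plausible reading (an inference, not something stated in the thread): the low-valued extras (220, 256, 257, 262) are whitespace-run tokens picked up from indentation and newlines in the chat template, and 61885 appearing instead of 75885 suggests the stray leading whitespace also changed how "Describe" was merged by the BPE. Decoding these IDs with the tokenizer would confirm.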

Hugging Face test code:

```python
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in two sentences"},
        ],
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
# print("text is:", text)

url = "https://llava-vl.github.io/static/images/view.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=text, images=raw_image, return_tensors="pt").to(model.device)
print("input_ids:", inputs["input_ids"])
output = model.generate(**inputs, do_sample=False, max_new_tokens=25)
print(processor.decode(output[0]))
```
Meta Llama org

Using the chat_template from 90B Instruct solves this issue; now I get [128000, 128006, 882, 128007, 271, 128256, 75885, 420, 2217, 304, 1403, 23719, 128009, 128006, 78191, 128007, 271]

Meta Llama org

Now fixed by this PR; I get [128000, 128006, 882, 128007, 271, 128256, 75885, 420, 2217, 304, 1403, 23719, 128009, 128006, 78191, 128007, 271]
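As a quick sanity check, the post-fix token list can be compared directly against the reference output from the original post (both copied verbatim from this thread):

```python
# Sketch: confirm the Hugging Face tokenization after the template fix
# matches Meta's reference implementation exactly.
reference_ids = [128000, 128006, 882, 128007, 271, 128256, 75885, 420, 2217,
                 304, 1403, 23719, 128009, 128006, 78191, 128007, 271]
fixed_ids = [128000, 128006, 882, 128007, 271, 128256, 75885, 420, 2217,
             304, 1403, 23719, 128009, 128006, 78191, 128007, 271]

assert fixed_ids == reference_ids
print("tokenization matches the reference")
```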

Meta Llama org

Thank you very much!

Thank you @wukaixingxp 🙌 We'll update the template in the tokenizer_config.json file if needed, as that's the one used by the transformers tokenizer.

Fixed, closing now.

pcuenq changed discussion status to closed
