Why are vocab_size and len(tokenizer) different?

#17 opened by choco9966

When I checked tokenizer.vocab_size and len(tokenizer), I found that the two values were different. I was wondering why they differ, and whether this would cause any problem for inference or continual training.

>>> tokenizer.vocab_size
262144
>>> len(tokenizer)
262145
Renu11 (Google org)

len(tokenizer) counts all vocabulary indices, including index 0, while tokenizer.vocab_size represents the number of vocabulary entries without considering the 0-based indexing. This leads to len(tokenizer) being one greater than tokenizer.vocab_size.
Please refer to this gist for further clarification.
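
One way to check what accounts for the extra index is to list the added tokens whose ids fall outside the base vocabulary. A minimal sketch, assuming a transformers AutoTokenizer; the checkpoint name below is a placeholder, since the thread does not name one:

from transformers import AutoTokenizer

ckpt = "google/gemma-3-27b-it"  # placeholder; use the checkpoint from this repo
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# Size of the base vocabulary vs. total number of ids the tokenizer can emit.
print(tokenizer.vocab_size)  # 262144 in this thread
print(len(tokenizer))        # 262145 in this thread

# Added tokens whose ids fall outside the base vocabulary account for the gap.
extra = {tok: idx for tok, idx in tokenizer.get_added_vocab().items()
         if idx >= tokenizer.vocab_size}
print(extra)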

But the vocab ranges from 0 to 262144. Then shouldn't the vocab size be 262145?

@Renu11 the model's vocab_size is also 262144 in config.json. Is the last token not being used?

"262144": {
      "content": "<image_soft_token>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
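
One way to probe that question is to compare len(tokenizer) with what the model config reports. A minimal sketch, assuming a transformers AutoConfig and that multimodal Gemma 3 configs nest the language-model settings under text_config (the checkpoint name is again a placeholder):

from transformers import AutoConfig, AutoTokenizer

ckpt = "google/gemma-3-27b-it"  # placeholder; use the checkpoint from this repo
tokenizer = AutoTokenizer.from_pretrained(ckpt)
config = AutoConfig.from_pretrained(ckpt)

# Multimodal Gemma 3 configs keep the LM settings under text_config;
# text-only checkpoints expose vocab_size at the top level.
text_config = getattr(config, "text_config", config)

print("config vocab_size:", text_config.vocab_size)
print("len(tokenizer):   ", len(tokenizer))
print("max token id:     ", len(tokenizer) - 1)

What ultimately matters is the input embedding matrix: if model.get_input_embeddings().weight has at least len(tokenizer) rows, id 262144 has an embedding and can be used; if not, producing that id would index out of range, which is exactly what you would want to rule out before inference or continual training.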