Why are vocab_size and len(tokenizer) different?

#17 opened by choco9966

When I checked tokenizer.vocab_size and len(tokenizer), I found that the two values were different. I was wondering why they differ, and whether this would cause any problem for inference or continual training.

>>> tokenizer.vocab_size
262144
>>> len(tokenizer)
262145
Renu11 (Google org)

len(tokenizer) counts all vocabulary indices, including index 0, while tokenizer.vocab_size represents the number of vocabulary entries without considering the 0-based indexing. This leads to len(tokenizer) being one greater than tokenizer.vocab_size.
Please refer to this gist for further clarification.
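
One way to check what accounts for the extra index is to list the added tokens whose ids fall outside the base vocabulary. A minimal sketch, assuming a transformers AutoTokenizer; the checkpoint name below is a placeholder, since the thread does not name one:

from transformers import AutoTokenizer

ckpt = "google/gemma-3-27b-it"  # placeholder; use the checkpoint from this repo
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# Size of the base vocabulary vs. total number of ids the tokenizer can emit.
print(tokenizer.vocab_size)  # 262144 in this thread
print(len(tokenizer))        # 262145 in this thread

# Added tokens whose ids fall outside the base vocabulary account for the gap.
extra = {tok: idx for tok, idx in tokenizer.get_added_vocab().items()
         if idx >= tokenizer.vocab_size}
print(extra)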

But the vocab ranges from 0 to 262144. Then shouldn't the vocab size be 262145?

@Renu11 the model's vocab_size is also 262144 in config.json. Is the last token not being used?

"262144": {
      "content": "<image_soft_token>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
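
One way to probe that question is to compare len(tokenizer) with what the model config reports. A minimal sketch, assuming a transformers AutoConfig and that multimodal Gemma 3 configs nest the language-model settings under text_config (the checkpoint name is again a placeholder):

from transformers import AutoConfig, AutoTokenizer

ckpt = "google/gemma-3-27b-it"  # placeholder; use the checkpoint from this repo
tokenizer = AutoTokenizer.from_pretrained(ckpt)
config = AutoConfig.from_pretrained(ckpt)

# Multimodal Gemma 3 configs keep the LM settings under text_config;
# text-only checkpoints expose vocab_size at the top level.
text_config = getattr(config, "text_config", config)

print("config vocab_size:", text_config.vocab_size)
print("len(tokenizer):   ", len(tokenizer))
print("max token id:     ", len(tokenizer) - 1)

What ultimately matters is the input embedding matrix: if model.get_input_embeddings().weight has at least len(tokenizer) rows, id 262144 has an embedding and can be used; if not, producing that id would index out of range, which is exactly what you would want to rule out before inference or continual training.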