Miscovery Tokenizer

A SentencePiece unigram tokenizer trained on a mix of Arabic and English text, with a vocabulary size of 70,000 tokens.

Training Data

This tokenizer was trained on:

  • Arabic Quran
  • awesome-chatgpt-prompts
  • open-r1/codeforces

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("miscovery/arabic-english-tokenizer")

# Encode a mixed Arabic/English string
text = "ุจุณู… ุงู„ู„ู‡ ุงู„ุฑุญู…ู† ุงู„ุฑุญูŠู… Hello World"
encoded = tokenizer(text)
print(encoded)
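
To see how the unigram model segments mixed Arabic/English input, you can also inspect the individual tokens and decode the IDs back to text. This continues from the snippet above and uses only standard Hugging Face tokenizer methods (tokenize, decode).

# Inspect the subword tokens produced for the mixed-script string
tokens = tokenizer.tokenize(text)
print(tokens)

# Round-trip the encoded IDs back to text
decoded = tokenizer.decode(encoded["input_ids"], skip_special_tokens=True)
print(decoded)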

Features

  • Vocabulary size: 70,000
  • Model type: Unigram
  • Model max length: 512 (demonstrated in the sketch after this list)
  • Handles both Arabic and English text
  • Supports Arabic normalization
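
The sketch below checks the properties listed above and shows truncation to the 512-token limit. It uses only standard Hugging Face tokenizer attributes and arguments (vocab_size, model_max_length, truncation, max_length); the values in the comments are what this card advertises, not guaranteed outputs.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("miscovery/arabic-english-tokenizer")

# Properties listed above
print(tokenizer.vocab_size)        # expected: 70000
print(tokenizer.model_max_length)  # expected: 512

# Longer inputs can be truncated to the model max length
batch = tokenizer(
    ["ุจุณู… ุงู„ู„ู‡ ุงู„ุฑุญู…ู† ุงู„ุฑุญูŠู…", "Hello World"],
    truncation=True,
    max_length=512,
)
print(batch["input_ids"])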