# Miscovery Tokenizer
A SentencePiece unigram tokenizer trained on a mix of Arabic and English text, with a vocabulary size of 70,000 tokens.
## Training Data
This tokenizer was trained on:
- Arabic Quran
- awesome-chatgpt-prompts
- open-r1/codeforces
## Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("miscovery/arabic-english-tokenizer")

# Example usage with mixed Arabic and English text
text = "بسم الله الرحمن الرحيم Hello World"
encoded = tokenizer(text)
print(encoded)
```
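The call above returns a dictionary with `input_ids` and an `attention_mask`. As a minimal follow-up sketch (the exact subword splits depend on the trained vocabulary, and decoded text may differ slightly from the input if Arabic normalization is applied), you can inspect the pieces and decode the ids back to text:

```python
# Inspect the subword pieces produced by the unigram model
tokens = tokenizer.tokenize(text)
print(tokens)

# Decode the ids back to text (may be normalized relative to the input)
decoded = tokenizer.decode(encoded["input_ids"], skip_special_tokens=True)
print(decoded)
```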
## Features
- Vocabulary size: 70,000
- Model type: Unigram
- Model max length: 512
- Handles both Arabic and English text
- Supports Arabic normalization (see the sketch below)
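As a rough check of these properties, the loaded tokenizer exposes its vocabulary size and maximum sequence length directly. The snippet below is a minimal sketch that assumes the tokenizer was loaded as in the Usage section; the sample sentences are arbitrary illustrations, not part of the training data:

```python
# Confirm the advertised vocabulary size and maximum input length
print(tokenizer.vocab_size)        # expected: 70000
print(tokenizer.model_max_length)  # expected: 512

# Arabic and English are segmented with the same unigram vocabulary;
# any configured normalization is applied before segmentation
for sample in ["القاهرة عاصمة مصر", "Cairo is the capital of Egypt"]:
    print(sample, "->", tokenizer.tokenize(sample))
```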