---
title: Blt Entropy Patcher
emoji: ⚡
colorFrom: green
colorTo: red
sdk: gradio
sdk_version: 5.27.0
python_version: 3.10.13
app_file: app.py
pinned: false
license: cc-by-nc-4.0
---
# Byte Latent Transformer
This repository contains code for our paper: "Byte Latent Transformer: Patches Scale Better Than Tokens"
- [Paper Link](https://dl.fbaipublicfiles.com/blt/BLT__Patches_Scale_Better_Than_Tokens.pdf)
## Abstract
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that,
for the first time, matches tokenization-based LLM performance at scale, with significant improvements
in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve
as the primary units of computation. Patches are segmented dynamically based on the entropy of the
next byte, allocating more compute and model capacity where there is more data complexity. The BLT
architecture includes new attention mechanisms to maximize the information flow between byte and
patch hidden representations and a new type of byte-sequence memory. We present the first scaling
study of byte-level models up to 8B parameters and 8T training bytes, showing for the first time
that we can train a model end-to-end at scale from bytes with no tokenization or other preprocessing.
Scaling trends reveal training and inference efficiency benefits from dynamically selecting very long
patches on average, along with qualitative improvements in reasoning and long-tail generalization
from modeling byte sequences.
## Development Status
We are actively updating the BLT code to make it easier to reproduce our results.
Please file an issue and/or be patient while we make more of our code public!
## Quick start
The following commands create the environment for BLT, either locally or by launching a SLURM job.
The env creation should take around 5 minutes, not counting downloads.
```bash
git clone https://github.com/facebookresearch/blt
cd blt
bash setup/create_env.sh
# or if you have access to a SLURM cluster
sbatch setup/create_env.sh
```
Once that is done, you can activate the environment:
```bash
conda activate blt_<date>
```
## Downloading HF Model Weights and Generating Text
We have released weights on HF for the [BLT 1B Model](https://huggingface.co./facebook/blt-1b) and [BLT 7B Model](https://huggingface.co./facebook/blt-7b).
We are actively working with HF to make BLT available in [Transformers](https://huggingface.co./docs/transformers/en/index) and will update this page when it is.
In the meantime, you can follow these instructions to load model weights, initialize a model, and generate text.
These instructions have been tested on H100 GPUs, but we can only offer suggestions for running on other hardware.
1. Create a Hugging Face account, request access to the weights on the model pages above, and wait for approval.
2. Log in with the Hugging Face CLI: `huggingface-cli login`
3. Download the model weights with `python download_blt_weights.py`, which saves them to `hf-weights`.
4. Run the generation demo: `python demo.py "A BLT has"`.
The demo generates text, but it is also a good starting point for loading BLT in your own code.
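As an alternative to `download_blt_weights.py`, the weights for a single model can be fetched directly with the Hugging Face CLI. This is only a sketch: the local directory below is an assumption, and `demo.py` may expect the layout produced by the provided script.
```bash
# Sketch of a manual download (after `huggingface-cli login` and access approval).
# The --local-dir path is illustrative; demo.py may expect the layout created by
# download_blt_weights.py instead.
huggingface-cli download facebook/blt-1b --local-dir hf-weights/blt-1b
```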
## Downloading Training Data
Note: the following instructions are not well tested in the BLT code, as they are based on the Lingua code, from which we have diverged.
Use the provided script to download and prepare data from Hugging Face (choose among `fineweb_edu`, `fineweb_edu_10bt`, and `dclm_baseline_1.0`).
The command below downloads `fineweb_edu` and prepares it for training in the `./data` directory, specifying the amount of memory allocated to `terashuf` (the tool used to shuffle samples). By default, the number of chunks (`nchunks`) is 32. If you are running on fewer than 32 GPUs, it is recommended to set `nchunks` to 1 or to match `nchunks` to the number of GPUs (`nchunks` = NGPUs). See [here](https://github.com/facebookresearch/lingua/issues/55#issuecomment-2483643076) for more details.
```bash
python setup/download_prepare_hf_data.py fineweb_edu <MEMORY> --data_dir ./data --seed 42 --nchunks <NCHUNKS>
```
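For example, on a small single-GPU setup the placeholders might be filled in as below. The memory value is illustrative (it is the budget handed to `terashuf`; check the script's help for the exact unit), and `--nchunks 1` follows the recommendation above for fewer than 32 GPUs.
```bash
# Illustrative values only: a terashuf memory budget of 40 and a single chunk
# for a single-GPU run; see the note above on choosing nchunks.
python setup/download_prepare_hf_data.py fineweb_edu 40 --data_dir ./data --seed 42 --nchunks 1
```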
To download a tokenizer (here Llama 3), use the following script:
```bash
python setup/download_tokenizer.py llama3 <SAVE_PATH> --api_key <HUGGINGFACE_TOKEN>
```
Now launch a debug job to check that everything works. **The provided configurations are templates; you need to adapt them before they will work (change `dump_dir`, `data.root_dir`, `data.tokenizer.path`, etc.).** A sketch of overriding these values from the command line is shown after the launch commands below.
```bash
# stool stands for SLURM tool!
python -m bytelatent.stool script=bytelatent.train config=bytelatent/configs/debug.yaml nodes=1 partition=<partition>
# if you want to launch locally you can use torchrun
torchrun --nproc-per-node 8 -m bytelatent.train config=bytelatent/configs/debug.yaml
# or you can also launch on 1 GPU
python -m bytelatent.train config=bytelatent/configs/debug.yaml
```
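Instead of editing `debug.yaml` in place, the same fields can likely be overridden on the command line. The sketch below assumes the Lingua-style dotted `key=value` overrides (as used in the `stool` invocation above) still work in BLT, and the paths are placeholders for your own setup:
```bash
# Sketch only: assumes OmegaConf-style dotted key=value overrides, as in Meta Lingua.
# Replace the paths with your own dump directory, data root, and tokenizer file.
python -m bytelatent.train config=bytelatent/configs/debug.yaml \
    dump_dir=./dumps/debug \
    data.root_dir=./data \
    data.tokenizer.path=./tokenizers/llama3/tokenizer.model
```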
When using `stool`, if a job crashes, it can be relaunched using `sbatch`:
```bash
sbatch path/to/dump_dir/submit.slurm
```
## Linting
To lint, run the following command:
```bash
bash dev/lint.sh
```
## Citation
The BLT code is partially based on Meta Lingua, so consider citing it in addition to our BLT paper if you re-use our work.
BLT Paper Citation (will be updated to arXiv soon)
```
@article{meta_blt,
  author = {Artidoro Pagnoni and Ram Pasunuru and Pedro Rodriguez and John Nguyen and Benjamin Muller and Margaret Li and Chunting Zhou and Lili Yu and Jason Weston and Luke Zettlemoyer and Gargi Ghosh and Mike Lewis and Ari Holtzman and Srinivasan Iyer},
  title = {Byte Latent Transformer: Patches Scale Better Than Tokens},
  url = {https://github.com/facebookresearch/blt},
  year = {2024}
}
```
Lingua Code Citation
```
@misc{meta_lingua,
  author = {Mathurin Videau and Badr Youbi Idrissi and Daniel Haziza and Luca Wehrstedt and Jade Copet and Olivier Teytaud and David Lopez-Paz},
  title = {{Meta Lingua}: A minimal {PyTorch LLM} training library},
  url = {https://github.com/facebookresearch/lingua},
  year = {2024}
}
```
## License
The BLT code is partially based on Meta Lingua.
Meta BLT is licensed under the CC-BY-NC-4.0 license; refer to the LICENSE file in the top-level directory.