# GGUF and interaction with Transformers
The GGUF file format is used to store models for inference with GGML and other libraries that depend on it, like the very popular llama.cpp or whisper.cpp. It is a file format supported by the Hugging Face Hub, with features allowing for quick inspection of the tensors and metadata within the file.

This file format is designed as a "single-file format", where a single file usually contains the configuration attributes, the tokenizer vocabulary, and other attributes, as well as all tensors to be loaded in the model. These files come in different formats according to the quantization type of the file. We briefly go over some of them here.
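
To make this concrete, here is a minimal sketch of inspecting such a file locally with the `gguf` Python package published alongside llama.cpp (`pip install gguf`). The attribute names follow that package's `GGUFReader` and the file path is only a placeholder; both are assumptions and may differ between versions.

```python
# Minimal sketch of inspecting a local GGUF file with the `gguf` package.
# The file path is a placeholder; GGUFReader attribute names may vary
# slightly between package versions.
from gguf import GGUFReader

reader = GGUFReader("tinyllama-1.1b-chat-v1.0.Q6_K.gguf")

# Metadata fields: architecture, hyperparameters, tokenizer vocabulary, ...
for field_name in list(reader.fields)[:10]:
    print(field_name)

# Tensors, with their quantization type and shape
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.tensor_type.name, tensor.shape)
```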
## Support within Transformers
We have added the ability to load `gguf` files within `transformers` in order to offer further training/fine-tuning capabilities to gguf models, before converting those models back to `gguf` for use within the `ggml` ecosystem. When loading a model, we first dequantize it to fp32 before loading the weights to be used in PyTorch.
> [!NOTE]
> The support is still very exploratory and we welcome contributions in order to solidify it across quantization types
> and model architectures.
For now, here are the supported model architectures and quantization types:
### Supported quantization types
The initially supported quantization types were chosen according to the popular quantized files that have been shared on the Hub.
- F32
- Q2_K
- Q3_K
- Q4_0
- Q4_K
- Q5_K
- Q6_K
- Q8_0
We take inspiration from the excellent 99991/pygguf Python parser to dequantize the weights.
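
To illustrate what dequantization means for one of the simpler types, each Q8_0 block stores one float16 scale followed by 32 int8 quants, and the original weight is approximately `scale * quant`. The sketch below is a standalone NumPy illustration of that idea, not the code path used by `transformers`.

```python
import numpy as np

def dequantize_q8_0(raw: bytes) -> np.ndarray:
    """Dequantize a buffer of Q8_0 blocks (34 bytes per block: one float16
    scale followed by 32 int8 quants) to a flat float32 array."""
    blocks = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 34)
    scales = blocks[:, :2].copy().view(np.float16).astype(np.float32)  # (n_blocks, 1)
    quants = blocks[:, 2:].copy().view(np.int8).astype(np.float32)     # (n_blocks, 32)
    return (scales * quants).reshape(-1)

# One synthetic block: scale 0.5, quants 0..31
block = np.float16(0.5).tobytes() + np.arange(32, dtype=np.int8).tobytes()
print(dequantize_q8_0(block)[:4])  # [0.  0.5 1.  1.5]
```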
### Supported model architectures
For now, the supported model architectures are those that have been very popular on the Hub, namely:

- LLaMa
- Mistral
- Qwen2
## Example usage
In order to load `gguf` files in `transformers`, you should specify the `gguf_file` argument to the `from_pretrained` methods of both tokenizers and models. Here is how one would load a tokenizer and a model, which can be loaded from the exact same file:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
```
Now you have access to the full, unquantized version of the model in the PyTorch ecosystem, where you can combine it with a plethora of other tools.
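
For instance, you can run generation directly with the loaded model. This short continuation of the example above uses an illustrative prompt and generation settings, and prints the model dtype to show that the weights were dequantized to fp32 on load.

```python
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# The weights were dequantized when loading, so this is a regular fp32 PyTorch model.
print(model.dtype)  # torch.float32
```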
In order to convert back to a `gguf` file, we recommend using the `convert-hf-to-gguf.py` file from llama.cpp.

Here's how you would complete the script above to save the model and export it back to gguf:
```python
tokenizer.save_pretrained('directory')
model.save_pretrained('directory')

!python ${path_to_llama_cpp}/convert-hf-to-gguf.py ${directory}
```
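
Depending on your llama.cpp checkout, the conversion script typically also accepts an output path and an output precision. The `--outfile` and `--outtype` flags below are assumptions about its CLI that may differ between versions, so check the script's `--help` before relying on them:

```python
# The flags below are assumptions about llama.cpp's converter CLI and may
# differ between versions; run the script with --help to confirm.
!python ${path_to_llama_cpp}/convert-hf-to-gguf.py ${directory} --outfile model-f16.gguf --outtype f16
```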