# GGUF and interaction with Transformers
The GGUF file format is used to store models for inference with GGML and other libraries that depend on it, like the very popular llama.cpp or whisper.cpp. It is a file format supported by the Hugging Face Hub, with features allowing for quick inspection of the tensors and metadata within the file.

This file format is designed as a "single-file format", where a single file usually contains the configuration attributes, the tokenizer vocabulary, and other attributes, as well as all tensors to be loaded in the model. These files come in different formats according to the quantization type of the file. We briefly go over some of them here.
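
To make this concrete, here is a minimal sketch of inspecting such a file locally with the `gguf` Python package published alongside llama.cpp (`pip install gguf`). The attribute names follow that package's `GGUFReader` and the file path is only a placeholder; both are assumptions and may differ between versions.

```python
# Minimal sketch of inspecting a local GGUF file with the `gguf` package.
# The file path is a placeholder; GGUFReader attribute names may vary
# slightly between package versions.
from gguf import GGUFReader

reader = GGUFReader("tinyllama-1.1b-chat-v1.0.Q6_K.gguf")

# Metadata fields: architecture, hyperparameters, tokenizer vocabulary, ...
for field_name in list(reader.fields)[:10]:
    print(field_name)

# Tensors, with their quantization type and shape
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.tensor_type.name, tensor.shape)
```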
## Support within Transformers
We have added the ability to load `gguf` files within `transformers` in order to offer further training/fine-tuning capabilities to gguf models, before converting those models back to `gguf` for use within the `ggml` ecosystem. When loading a model, we first dequantize it to fp32 before loading the weights to be used in PyTorch.
> [!NOTE]
> The support is still very exploratory and we welcome contributions in order to solidify it across quantization types
> and model architectures.
For now, here are the supported model architectures and quantization types:
### Supported quantization types
The initially supported quantization types were chosen according to the popular quantized files that have been shared on the Hub.
- F32
- Q2_K
- Q3_K
- Q4_0
- Q4_K
- Q5_K
- Q6_K
- Q8_0
We take inspiration from the excellent 99991/pygguf Python parser to dequantize the weights.
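
To illustrate what dequantization means for one of the simpler types, each Q8_0 block stores one float16 scale followed by 32 int8 quants, and the original weight is approximately `scale * quant`. The sketch below is a standalone NumPy illustration of that idea, not the code path used by `transformers`.

```python
import numpy as np

def dequantize_q8_0(raw: bytes) -> np.ndarray:
    """Dequantize a buffer of Q8_0 blocks (34 bytes per block: one float16
    scale followed by 32 int8 quants) to a flat float32 array."""
    blocks = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 34)
    scales = blocks[:, :2].copy().view(np.float16).astype(np.float32)  # (n_blocks, 1)
    quants = blocks[:, 2:].copy().view(np.int8).astype(np.float32)     # (n_blocks, 32)
    return (scales * quants).reshape(-1)

# One synthetic block: scale 0.5, quants 0..31
block = np.float16(0.5).tobytes() + np.arange(32, dtype=np.int8).tobytes()
print(dequantize_q8_0(block)[:4])  # [0.  0.5 1.  1.5]
```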
### Supported model architectures
For now, the supported model architectures are those that have been very popular on the Hub, namely:

- LLaMa
- Mistral
- Qwen2
## Example usage
In order to load `gguf` files in `transformers`, you should specify the `gguf_file` argument to the `from_pretrained` methods of both tokenizers and models. Here is how one would load a tokenizer and a model, which can be loaded from the exact same file:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
```
Now you have access to the full, unquantized version of the model in the PyTorch ecosystem, where you can combine it with a plethora of other tools.
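
For instance, you can run generation directly with the loaded model. This short continuation of the example above uses an illustrative prompt and generation settings, and prints the model dtype to show that the weights were dequantized to fp32 on load.

```python
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# The weights were dequantized when loading, so this is a regular fp32 PyTorch model.
print(model.dtype)  # torch.float32
```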
In order to convert back to a `gguf` file, we recommend using the `convert-hf-to-gguf.py` file from llama.cpp.

Here's how you would complete the script above to save the model and export it back to gguf:
```python
tokenizer.save_pretrained('directory')
model.save_pretrained('directory')

!python ${path_to_llama_cpp}/convert-hf-to-gguf.py ${directory}
```
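
Depending on your llama.cpp checkout, the conversion script typically also accepts an output path and an output precision. The `--outfile` and `--outtype` flags below are assumptions about its CLI that may differ between versions, so check the script's `--help` before relying on them:

```python
# The flags below are assumptions about llama.cpp's converter CLI and may
# differ between versions; run the script with --help to confirm.
!python ${path_to_llama_cpp}/convert-hf-to-gguf.py ${directory} --outfile model-f16.gguf --outtype f16
```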