	Text generation strategies
	Text generation is essential to many NLP tasks, such as open-ended text generation, summarization, translation, and
	more. It also plays a role in a variety of mixed-modality applications that have text as an output like speech-to-text
	and vision-to-text. Some of the models that can generate text include
	GPT2, XLNet, OpenAI GPT, CTRL, TransformerXL, XLM, Bart, T5, GIT, Whisper.
	Check out a few examples that use the [~generation.GenerationMixin.generate] method to produce
	text outputs for different tasks:
	* Text summarization
	* Image captioning
	* Audio transcription
	Note that the inputs to the generate method depend on the model's modality. They are returned by the model's preprocessor
	class, such as AutoTokenizer or AutoProcessor. If a model's preprocessor creates more than one kind of input, pass all
	the inputs to generate(). You can learn more about the individual model's preprocessor in the corresponding model's documentation.
	The process of selecting output tokens to generate text is known as decoding, and you can customize the decoding strategy
	that the generate() method will use. Modifying a decoding strategy does not change the values of any trainable parameters.
	However, it can have a noticeable impact on the quality of the generated output. It can help reduce repetition in the text
	and make it more coherent.
	This guide describes:
	* default generation configuration
	* common decoding strategies and their main parameters
	* saving and sharing custom generation configurations with your fine-tuned model on 🤗 Hub
	Default text generation configuration
	A decoding strategy for a model is defined in its generation configuration. When using pre-trained models for inference
	within a [pipeline], the models call the PreTrainedModel.generate() method that applies a default generation
	configuration under the hood. The default configuration is also used when no custom configuration has been saved with
	the model.
	When you load a model explicitly, you can inspect the generation configuration that comes with it through
	model.generation_config:
	```python
	from transformers import AutoModelForCausalLM

	model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
	model.generation_config
	# GenerationConfig {
	#   "bos_token_id": 50256,
	#   "eos_token_id": 50256
	# }
	```

	Printing out the model.generation_config reveals only the values that are different from the default generation
	configuration, and does not list any of the default values.
	The default generation configuration limits the combined length of the input prompt and the output to a maximum of 20
	tokens to avoid running into resource limitations. The default decoding strategy is greedy search, the simplest
	decoding strategy, which picks the token with the highest probability as the next token. For many tasks
	and small output sizes this works well. However, when used to generate longer outputs, greedy search can start
	producing highly repetitive results.
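To make the repetition problem concrete, here is a minimal sketch of greedy decoding over a toy next-token table. The table, the token names, and the `greedy_decode` helper are all made up for illustration; a real model computes these probabilities from the whole context rather than from the previous token alone.

```python
# Toy next-token table: maps the previous token to a probability
# distribution over the next token. Purely illustrative, not a real model.
NEXT_TOKEN_PROBS = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "<eos>": 0.2},
    "cat": {"and": 0.7, "<eos>": 0.3},
    "dog": {"<eos>": 1.0},
    "and": {"the": 0.9, "a": 0.1},
    "a": {"dog": 1.0},
}


def greedy_decode(max_new_tokens):
    """Repeatedly pick the single most probable next token."""
    tokens = ["<bos>"]
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS[tokens[-1]]
        next_token = max(probs, key=probs.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker


print(greedy_decode(max_new_tokens=9))
# ['the', 'cat', 'and', 'the', 'cat', 'and', 'the', 'cat', 'and']
```

Because the most probable continuation at each step leads back into the same cycle, the output loops until the token budget runs out, which mirrors the repetitive behavior described above.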
	Customize text generation
	You can override any generation_config by passing the parameters and their values directly to the [generate] method:
	```python
	my_model.generate(**inputs, num_beams=4, do_sample=True)
	```

	Even if the default decoding strategy mostly works for your task, you can still tweak a few things. Some of the
	commonly adjusted parameters include:

	* max_new_tokens: the maximum number of tokens to generate. In other words, the size of the output sequence, not
	including the tokens in the prompt. As an alternative to using the output's length as a stopping criterion, you can choose
	to stop generation whenever the full generation exceeds some amount of time. To learn more, check [StoppingCriteria].
	* num_beams: by specifying a number of beams higher than 1, you are effectively switching from greedy search to
	beam search. This strategy evaluates several hypotheses at each time step and eventually chooses the hypothesis that
	has the overall highest probability for the entire sequence. This has the advantage of identifying high-probability
	sequences that start with lower-probability initial tokens and would have been ignored by greedy search.
	* do_sample: if set to True, this parameter enables decoding strategies such as multinomial sampling, beam-search
	multinomial sampling, Top-K sampling and Top-p sampling. All these strategies select the next token from the probability
	distribution over the entire vocabulary with various strategy-specific adjustments.
	* num_return_sequences: the number of sequence candidates to return for each input. This option is only available for
	the decoding strategies that support multiple sequence candidates, e.g. variations of beam search and sampling. Decoding
	strategies like greedy search and contrastive search return a single output sequence.
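The "strategy-specific adjustments" mentioned for do_sample can be illustrated with a small sketch of Top-K and Top-p filtering. The helper names and the example distribution are hypothetical; this shows the general idea of restricting and renormalizing the distribution before sampling, not the library's internal implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose
    cumulative probability reaches top_p, then renormalize."""
    kept, cumulative = [], 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Made-up next-token distribution
probs = {"nice": 0.5, "good": 0.25, "fine": 0.125, "odd": 0.125}

print(sorted(top_k_filter(probs, k=2)))        # ['good', 'nice']
print(sorted(top_p_filter(probs, top_p=0.8)))  # ['fine', 'good', 'nice']
```

The next token is then sampled from the filtered distribution, so unlikely tokens in the tail can no longer be picked.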

	Save a custom decoding strategy with your model
	If you would like to share your fine-tuned model with a specific generation configuration, you can:
	* Create a [GenerationConfig] class instance
	* Specify the decoding strategy parameters
	* Save your generation configuration with [GenerationConfig.save_pretrained], making sure to leave its config_file_name argument empty
	* Set push_to_hub to True to upload your config to the model's repo
	```python
	from transformers import AutoModelForCausalLM, GenerationConfig

	model = AutoModelForCausalLM.from_pretrained("my_account/my_model")
	generation_config = GenerationConfig(
	    max_new_tokens=50, do_sample=True, top_k=50, eos_token_id=model.config.eos_token_id
	)
	generation_config.save_pretrained("my_account/my_model", push_to_hub=True)
	```

	You can also store several generation configurations in a single directory, making use of the config_file_name
	argument in [GenerationConfig.save_pretrained]. You can later instantiate them with [GenerationConfig.from_pretrained]. This is useful if you want to
	store several generation configurations for a single model (e.g. one for creative text generation with sampling, and
	one for summarization with beam search). You must have the right Hub permissions to add configuration files to a model.
	```python
	from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig

	tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
	model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")

	translation_generation_config = GenerationConfig(
	    num_beams=4,
	    early_stopping=True,
	    decoder_start_token_id=0,
	    eos_token_id=model.config.eos_token_id,
	    pad_token_id=model.config.pad_token_id,
	)

	# Tip: add push_to_hub=True to push to the Hub
	translation_generation_config.save_pretrained("/tmp", "translation_generation_config.json")

	# You could then use the named generation config file to parameterize generation
	generation_config = GenerationConfig.from_pretrained("/tmp", "translation_generation_config.json")
	inputs = tokenizer("translate English to French: Configuration files are easy to use!", return_tensors="pt")
	outputs = model.generate(**inputs, generation_config=generation_config)
	print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
	# ['Les fichiers de configuration sont faciles à utiliser!']
	```

	Streaming
	The generate() method supports streaming through its streamer input. The streamer input is compatible with any instance
	of a class that has the following methods: put() and end(). Internally, put() is used to push new tokens and
	end() is used to flag the end of text generation.

	The API for the streamer classes is still under development and may change in the future.

	In practice, you can craft your own streaming class for all sorts of purposes. We also have basic streaming classes
	ready for you to use. For example, you can use the [TextStreamer] class to stream the output of generate() to
	your screen, one word at a time:
	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

	tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
	model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
	inputs = tok(["An increasing sequence: one,"], return_tensors="pt")
	streamer = TextStreamer(tok)

	# Despite returning the usual output, the streamer will also print the generated text to stdout.
	_ = model.generate(**inputs, streamer=streamer, max_new_tokens=20)
	# An increasing sequence: one, two, three, four, five, six, seven, eight, nine, ten, eleven,
	```
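The put()/end() contract described above is small enough to sketch without any library code. In the example below, `CollectingStreamer` and `fake_generate` are hypothetical names, and `fake_generate` is only a stand-in for generate(): it pushes plain integers, whereas the real method passes token ids as tensors. The token ids themselves are arbitrary.

```python
class CollectingStreamer:
    """Hypothetical streamer that records everything pushed to it."""

    def __init__(self):
        self.chunks = []
        self.finished = False

    def put(self, value):
        # Called each time new tokens become available.
        self.chunks.append(value)

    def end(self):
        # Called once when generation is complete.
        self.finished = True


def fake_generate(token_ids, streamer):
    """Stand-in for generate(): push tokens one by one, then signal the end."""
    for token_id in token_ids:
        streamer.put(token_id)
    streamer.end()


streamer = CollectingStreamer()
fake_generate([464, 3290, 318], streamer)
print(streamer.chunks, streamer.finished)
# [464, 3290, 318] True
```

A class with this shape could forward tokens to a queue, a websocket, or a log file instead of a list.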

	KV Cache Quantization
	The generate() method supports caching keys and values to enhance efficiency and avoid re-computations. However, the key and value
	cache can occupy a large portion of memory, becoming a bottleneck for long-context generation, especially for Large Language Models.
	Quantizing the cache when using generate() can significantly reduce memory requirements at the cost of speed.
	KV Cache quantization in transformers is largely inspired by the paper [KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache](https://arxiv.org/abs/2402.02750)
	and currently supports quanto and HQQ as backends. For more information on the inner workings, see the paper.
	To enable quantization of the key-value cache, set cache_implementation="quantized" in the generation_config.
	Quantization-related arguments should be passed to the generation_config either as a dict or as an instance of the [QuantizedCacheConfig] class.
	You have to indicate which quantization backend to use in the [QuantizedCacheConfig]; the default is quanto.

	Cache quantization can be detrimental if the context length is short and there is enough GPU VRAM available to run without cache quantization.

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForCausalLM

	tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
	model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16).to("cuda:0")
	inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)

	# Generate with a 4-bit quantized KV cache (quanto backend)
	out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="quantized", cache_config={"nbits": 4, "backend": "quanto"})
	print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
	# I like rock music because it's loud and energetic. It's a great way to express myself and rel

	# Generate with the default (unquantized) cache for comparison
	out = model.generate(**inputs, do_sample=False, max_new_tokens=20)
	print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
	# I like rock music because it's loud and energetic. I like to listen to it when I'm feeling
	```
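For intuition about what quantizing cached values involves, here is a toy sketch of affine quantization to a fixed number of bits. It is deliberately simplified and is not how the quanto or HQQ backends work internally (real implementations quantize tensors in groups or per channel); the `quantize` and `dequantize` helpers and the sample values are made up.

```python
def quantize(values, nbits):
    """Map floats to integer codes in [0, 2**nbits - 1] with a shared scale and offset."""
    levels = 2 ** nbits - 1
    low, high = min(values), max(values)
    scale = (high - low) / levels or 1.0  # fall back to 1.0 for constant input
    codes = [round((v - low) / scale) for v in values]
    return codes, scale, low


def dequantize(codes, scale, low):
    """Recover approximate floats from the integer codes."""
    return [code * scale + low for code in codes]


keys = [-1.5, -0.5, 0.0, 0.25, 1.5]  # made-up cache values
codes, scale, low = quantize(keys, nbits=4)
restored = dequantize(codes, scale, low)
max_error = max(abs(a - b) for a, b in zip(keys, restored))

print(codes)       # small integers that fit in 4 bits each
print(max_error)   # rounding error, bounded by about scale / 2
```

Each value is stored as a 4-bit code instead of a 16-bit float, and the price is a small rounding error plus the extra work of converting back, which is consistent with the memory-versus-speed trade-off noted above.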

	Watermarking
	The generate() method supports watermarking the generated text by randomly marking a portion of tokens as "green".
	When generating, the "green" tokens have a small bias value added to their logits, which gives them a higher chance of being generated.
	The watermarked text can be detected by calculating the proportion of "green" tokens in the text and estimating how likely it is
	statistically to obtain that number of "green" tokens in human-written text. This watermarking strategy was proposed in the paper
	"On the Reliability of Watermarks for Large Language Models". For more information on
	the inner workings of watermarking, refer to the paper.
	Watermarking can be used with any generative model in transformers and does not require an extra classification model
	to detect watermarked text. To trigger watermarking, pass a [WatermarkingConfig] with the needed arguments directly to the
	.generate() method or add it to the [GenerationConfig]. Watermarked text can later be detected with a [WatermarkDetector].

	The WatermarkDetector internally relies on the proportion of "green" tokens and on whether the generated text follows the coloring pattern.
	That is why it is recommended to strip off the prompt text if it is much longer than the generated text.
	Detection can also be affected when one sequence in the batch is much longer than the others, causing the other rows to be padded.
	Additionally, the detector must be initialized with the same watermark configuration arguments that were used when generating.

	Let's generate some text with watermarking. In the below code snippet, we set the bias to 2.5 which is a value that
	will be added to "green" tokens' logits. After generating watermarked text, we can pass it directly to the WatermarkDetector
	to check if the text is machine-generated (outputs True for machine-generated and False otherwise).
	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM, WatermarkDetector, WatermarkingConfig

	model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
	tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
	tok.pad_token_id = tok.eos_token_id
	tok.padding_side = "left"

	inputs = tok(["This is the beginning of a long story", "Alice and Bob are"], padding=True, return_tensors="pt")
	input_len = inputs["input_ids"].shape[-1]

	watermarking_config = WatermarkingConfig(bias=2.5, seeding_scheme="selfhash")
	out = model.generate(**inputs, watermarking_config=watermarking_config, do_sample=False, max_length=20)

	detector = WatermarkDetector(model_config=model.config, device="cpu", watermarking_config=watermarking_config)
	detection_out = detector(out, return_dict=True)
	detection_out.prediction
	# array([True, True])
	```
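The statistical test behind "how likely is it to see this many green tokens" can be sketched as a one-proportion z-test. This is an illustration of the general idea under stated assumptions, not the exact code of WatermarkDetector: the function name, the green fraction of 0.25, and the token counts are all made up.

```python
import math


def green_token_z_score(num_green, num_tokens, green_fraction):
    """z-score of seeing num_green green tokens if every token were
    green independently with probability green_fraction."""
    expected = green_fraction * num_tokens
    std = math.sqrt(num_tokens * green_fraction * (1 - green_fraction))
    return (num_green - expected) / std


# Assumed setup: 25% of the vocabulary is green and we score 200 tokens.
print(green_token_z_score(num_green=50, num_tokens=200, green_fraction=0.25))   # 0.0
print(green_token_z_score(num_green=110, num_tokens=200, green_fraction=0.25))  # roughly 9.8
```

A score near zero is what unwatermarked text would be expected to produce, while a large positive score means the green-token count is very unlikely to occur by chance. It also shows why a long unwatermarked prompt dilutes the signal: it adds tokens to the total without adding extra green ones.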

	Decoding strategies
	Certain combinations of the generate() parameters, and ultimately generation_config, can be used to enable specific
	decoding strategies. If you are new to this concept, we recommend reading this blog post that illustrates how common decoding strategies work.
	Here, we'll show some of the parameters that control the decoding strategies and illustrate how you can use them.
	Greedy Search
	[generate] uses greedy search decoding by default, so you don't have to pass any parameters to enable it. This means that num_beams is set to 1 and do_sample is set to False.
	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	prompt = "I look forward to"
	checkpoint = "distilbert/distilgpt2"

	tokenizer = AutoTokenizer.from_pretrained(checkpoint)
	inputs = tokenizer(prompt, return_tensors="pt")

	model = AutoModelForCausalLM.from_pretrained(checkpoint)
	outputs = model.generate(**inputs)
	tokenizer.batch_decode(outputs, skip_special_tokens=True)
	# ['I look forward to seeing you all again!\n\n\n\n\n\n\n\n\n\n\n']
	```

	Contrastive search
	The contrastive search decoding strategy was proposed in the 2022 paper A Contrastive Framework for Neural Text Generation.
	It demonstrates superior results for generating non-repetitive yet coherent long outputs. To learn how contrastive search
	works, check out this blog post.
	The two main parameters that enable and control the behavior of contrastive search are penalty_alpha and top_k:
	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM

	checkpoint = "openai-community/gpt2-large"
	tokenizer = AutoTokenizer.from_pretrained(checkpoint)
	model = AutoModelForCausalLM.from_pretrained(checkpoint)

	prompt = "Hugging Face Company is"
	inputs = tokenizer(prompt, return_tensors="pt")

	outputs = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=100)
	tokenizer.batch_decode(outputs, skip_special_tokens=True)
	# ['Hugging Face Company is a family owned and operated business. We pride ourselves on being the best in the business and our customer service is second to none.\n\nIf you have any questions about our products or services, feel free to contact us at any time. We look forward to hearing from you!']
	```
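To show how penalty_alpha and top_k interact, here is a toy sketch of the selection rule as formulated in the contrastive search paper: among the top_k candidates, pick the one that maximizes (1 - penalty_alpha) times the model probability, minus penalty_alpha times the highest cosine similarity between the candidate's hidden state and the hidden states of the tokens already in the context. The hidden states, probabilities, and helper names below are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_pick(candidates, context_states, penalty_alpha):
    """candidates maps each top-k token to (probability, hidden_state)."""
    def score(item):
        prob, state = item[1]
        # Degeneration penalty: similarity to the closest context token.
        degeneration = max(cosine(state, h) for h in context_states)
        return (1 - penalty_alpha) * prob - penalty_alpha * degeneration
    return max(candidates.items(), key=score)[0]


context_states = [[1.0, 0.0], [0.8, 0.6]]  # made-up hidden states of the context
candidates = {
    "again": (0.6, [1.0, 0.0]),  # most probable, but identical to a context state
    "today": (0.4, [0.0, 1.0]),  # less probable, but dissimilar to the context
}

print(contrastive_pick(candidates, context_states, penalty_alpha=0.6))  # today
print(contrastive_pick(candidates, context_states, penalty_alpha=0.0))  # again
```

With penalty_alpha set to 0 the rule reduces to picking the most probable candidate, so the penalty term is what steers generation away from repeating what is already in the context.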

	Multinomial sampling
	As opposed to greedy search that always chooses a token with the highest probability as the
	next token, multinomial sampling (also called ancestral sampling) randomly selects the next token based on the probability distribution over the entire
	vocabulary given by the model. Every token with a non-zero probability has a chance of being selected, thus reducing the
	risk of repetition.
	To enable multinomial sampling set do_sample=True and num_beams=1.
	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

	set_seed(0)  # For reproducibility

	checkpoint = "openai-community/gpt2-large"
	tokenizer = AutoTokenizer.from_pretrained(checkpoint)
	model = AutoModelForCausalLM.from_pretrained(checkpoint)

	prompt = "Today was an amazing day because"
	inputs = tokenizer(prompt, return_tensors="pt")

	outputs = model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
	tokenizer.batch_decode(outputs, skip_special_tokens=True)
	# ["Today was an amazing day because we received these wonderful items by the way of a gift shop. The box arrived on a Thursday and I opened it on Monday afternoon to receive the gifts. Both bags featured pieces from all the previous years!\n\nThe box had lots of surprises in it, including some sweet little mini chocolate chips! I don't think I'd eat all of these. This was definitely one of the most expensive presents I have ever got, I actually got most of them for free!\n\nThe first package came"]
	```
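The difference between greedy selection and multinomial sampling can also be seen on a single toy distribution, using only the standard library. The distribution and the `sample_next_token` helper are made up for illustration.

```python
import random
from collections import Counter

# Made-up next-token distribution
probs = {"sunny": 0.6, "rainy": 0.3, "snowy": 0.1}


def sample_next_token(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded for reproducibility
counts = Counter(sample_next_token(probs, rng) for _ in range(10_000))

greedy_choice = max(probs, key=probs.get)
print(greedy_choice)  # greedy search would always pick this token
print(counts)         # sampling picks every token, roughly in proportion to its probability
```

Greedy search returns the same token every time, whereas sampling occasionally selects the less likely tokens, which is what breaks repetitive loops.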

	Beam-search decoding
	Unlike greedy search, beam-search decoding keeps several hypotheses at each time step and eventually chooses
	the hypothesis that has the overall highest probability for the entire sequence. This has the advantage of identifying high-probability
	sequences that start with lower probability initial tokens and would've been ignored by the greedy search.

	You can visualize how beam-search decoding works in this interactive demo: type your input sentence, and play with the parameters to see how the decoding beams change.
	To enable this decoding strategy, set num_beams (the number of hypotheses to keep track of) to a value greater than 1.
	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	prompt = "It is astonishing how one can"
	checkpoint = "openai-community/gpt2-medium"

	tokenizer = AutoTokenizer.from_pretrained(checkpoint)
	inputs = tokenizer(prompt, return_tensors="pt")

	model = AutoModelForCausalLM.from_pretrained(checkpoint)
	outputs = model.generate(**inputs, num_beams=5, max_new_tokens=50)
	tokenizer.batch_decode(outputs, skip_special_tokens=True)
	# ['It is astonishing how one can have such a profound impact on the lives of so many people in such a short period of time."\n\nHe added: "I am very proud of the work I have been able to do in the last few years.\n\n"I have']
	```
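The claim that beam search can find high-probability sequences that start with a lower-probability token is easy to check on a toy example. The sketch below uses a made-up next-token table and a hypothetical `beam_search` helper that runs for a fixed number of steps; it omits details such as end-of-sequence handling and length penalties.

```python
import math

# Toy next-token table, purely illustrative
NEXT_TOKEN_PROBS = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 1.0},
}


def beam_search(start, num_beams, steps):
    """Return the best (tokens, log_prob) pair after a fixed number of steps."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, log_prob in beams:
            for token, p in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], log_prob + math.log(p)))
        # Keep only the num_beams best partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]


greedy_tokens, greedy_score = beam_search("The", num_beams=1, steps=2)
beam_tokens, beam_score = beam_search("The", num_beams=2, steps=2)

print(greedy_tokens, round(math.exp(greedy_score), 2))  # ['The', 'nice', 'woman'] 0.2
print(beam_tokens, round(math.exp(beam_score), 2))      # ['The', 'dog', 'has'] 0.36
```

With a single beam the procedure is equivalent to greedy search and commits to "nice" after the first step. With two beams, "dog" survives the first step and leads to a sequence with a higher overall probability.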

	Beam-search multinomial sampling
	As the name implies, this decoding strategy combines beam search with multinomial sampling. To use it, set
	num_beams to a value greater than 1 and set do_sample=True.
	```python
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, set_seed

	set_seed(0)  # For reproducibility

	prompt = "translate English to German: The house is wonderful."
	checkpoint = "google-t5/t5-small"

	tokenizer = AutoTokenizer.from_pretrained(checkpoint)
	inputs = tokenizer(prompt, return_tensors="pt")

	model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
	outputs = model.generate(**inputs, num_beams=5, do_sample=True)
	tokenizer.decode(outputs[0], skip_special_tokens=True)
	# 'Das Haus ist wunderbar.'
	```