Beam-search multinomial sampling
As the name implies, this decoding strategy combines beam search with multinomial sampling. To use it, set
num_beams to a value greater than 1, and set do_sample=True.
```python
>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, set_seed

>>> set_seed(0)  # For reproducibility

>>> prompt = "translate English to German: The house is wonderful."
>>> checkpoint = "google-t5/t5-small"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

>>> outputs = model.generate(**inputs, num_beams=5, do_sample=True)
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
'Das Haus ist wunderbar.'
```
Diverse beam search decoding
The diverse beam search decoding strategy is an extension of the beam search strategy that allows for generating a more diverse
set of beam sequences to choose from. To learn how it works, refer to Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models.
This approach has three main parameters: num_beams, num_beam_groups, and diversity_penalty.
The beams are split into groups, standard beam search runs within each group, and the diversity penalty ensures the outputs are distinct across groups.
```python
>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

>>> checkpoint = "google/pegasus-xsum"
>>> prompt = (
...     "The Permaculture Design Principles are a set of universal design principles "
...     "that can be applied to any location, climate and culture, and they allow us to design "
...     "the most efficient and sustainable human habitation and food production systems. "
...     "Permaculture is a design system that encompasses a wide variety of disciplines, such "
...     "as ecology, landscape design, environmental science and energy conservation, and the "
...     "Permaculture design principles are drawn from these various disciplines. Each individual "
...     "design principle itself embodies a complete conceptual framework based on sound "
...     "scientific principles. When we bring all these separate principles together, we can "
...     "create a design system that both looks at whole systems, the parts that these systems "
...     "consist of, and how those parts interact with each other to create a complex, dynamic, "
...     "living system. Each design principle serves as a tool that allows us to integrate all "
...     "the separate parts of a design, referred to as elements, into a functional, synergistic, "
...     "whole system, where the elements harmoniously interact and work together in the most "
...     "efficient way possible."
... )

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

>>> outputs = model.generate(**inputs, num_beams=5, num_beam_groups=5, max_new_tokens=30, diversity_penalty=1.0)
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
'The Design Principles are a set of universal design principles that can be applied to any location, climate and culture, and they allow us to design the'
```
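To actually compare the diverse candidates, you can ask generate to return more than one sequence. The sketch below reuses the model and inputs from the example above; note that num_return_sequences may not exceed num_beams, and returning one sequence per group is a natural choice here:

```python
>>> # Sketch: return several candidates from the diverse beam search above.
>>> # `num_return_sequences` must be <= `num_beams`; here we return one per group.
>>> outputs = model.generate(
...     **inputs,
...     num_beams=5,
...     num_beam_groups=5,
...     num_return_sequences=5,
...     max_new_tokens=30,
...     diversity_penalty=1.0,
... )
>>> for seq in outputs:
...     print(tokenizer.decode(seq, skip_special_tokens=True))
```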
This guide illustrates the main parameters that enable various decoding strategies. More advanced parameters exist for the
[generate] method, giving you even finer control over its behavior.
For the complete list of available parameters, refer to the API documentation.
Speculative decoding
Speculative decoding (also known as assisted decoding) is a modification of the decoding strategies above that uses an
assistant model (ideally a much smaller one) with the same tokenizer to generate a few candidate tokens. The main
model then validates the candidate tokens in a single forward pass, which speeds up the decoding process. If
do_sample=True, then the token validation with resampling introduced in the
speculative decoding paper is used.
Currently, only greedy search and sampling are supported with assisted decoding, and assisted decoding doesn't support batched inputs.
To learn more about assisted decoding, check this blog post.
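Conceptually, one greedy assisted-decoding step looks roughly like the sketch below. This is a simplified illustration, not the actual transformers implementation (which also reuses KV caches and adapts the number of candidates on the fly); it assumes batch size 1 and plain decoder-only torch models:

```python
import torch

@torch.no_grad()
def assisted_greedy_step(main_model, assistant_model, input_ids, num_candidates=5):
    # Simplified sketch of one greedy assisted-decoding step (batch size 1).
    prompt_len = input_ids.shape[1]

    # 1. The small assistant model cheaply drafts `num_candidates` tokens.
    candidates = input_ids
    for _ in range(num_candidates):
        next_token = assistant_model(candidates).logits[:, -1:].argmax(-1)
        candidates = torch.cat([candidates, next_token], dim=-1)

    # 2. The main model scores the whole drafted sequence in ONE forward pass.
    logits = main_model(candidates).logits
    # The main model's greedy choice at each drafted position.
    main_choices = logits[:, prompt_len - 1 : -1].argmax(-1)

    # 3. Accept drafted tokens only while they match the main model's own choice.
    drafted = candidates[:, prompt_len:]
    n_accepted = int((drafted == main_choices).long().cumprod(-1).sum())

    if n_accepted == num_candidates:
        # Every draft was accepted: the main model's prediction for the
        # following position comes along for free.
        new_tokens = torch.cat([drafted, logits[:, -1:].argmax(-1)], dim=-1)
    else:
        # First mismatch: keep the accepted prefix plus the main model's fix.
        new_tokens = torch.cat(
            [drafted[:, :n_accepted], main_choices[:, n_accepted : n_accepted + 1]],
            dim=-1,
        )
    return torch.cat([input_ids, new_tokens], dim=-1)
```

Every returned token is exactly what greedy decoding with the main model alone would have produced; the assistant only determines how many of them are obtained per forward pass.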
To enable assisted decoding, set the assistant_model argument to a model.
```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> prompt = "Alice and Bob"
>>> checkpoint = "EleutherAI/pythia-1.4b-deduped"
>>> assistant_checkpoint = "EleutherAI/pythia-160m-deduped"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
>>> outputs = model.generate(**inputs, assistant_model=assistant_model)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']
```
When using assisted decoding with sampling methods, you can use the temperature argument to control the randomness,
just like in multinomial sampling. However, in assisted decoding, reducing the temperature may help improve latency, since lower temperatures make the assistant's candidate tokens more likely to be accepted.
```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

>>> set_seed(42)  # For reproducibility

>>> prompt = "Alice and Bob"
>>> checkpoint = "EleutherAI/pythia-1.4b-deduped"
>>> assistant_checkpoint = "EleutherAI/pythia-160m-deduped"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
>>> outputs = model.generate(**inputs, assistant_model=assistant_model, do_sample=True, temperature=0.5)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Alice and Bob, a couple of friends of mine, who are both in the same office as']
```
Alternatively, you can set prompt_lookup_num_tokens to trigger n-gram-based assisted decoding, as opposed
to model-based assisted decoding. You can read more about it here.
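As a minimal sketch (reusing the same checkpoint as in the examples above; the value 3 below is an arbitrary choice for illustration), prompt lookup decoding needs no assistant model at all:

```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> prompt = "Alice and Bob"
>>> checkpoint = "EleutherAI/pythia-1.4b-deduped"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)

>>> # Candidate tokens are copied from n-gram matches in the input itself,
>>> # so no assistant model is required.
>>> outputs = model.generate(**inputs, prompt_lookup_num_tokens=3)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
```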