Templates for Chat Models
Introduction
An increasingly common use case for LLMs is chat. In a chat context, rather than continuing a single string
of text (as is the case with a standard language model), the model instead continues a conversation that consists
of one or more messages, each of which includes a role, like "user" or "assistant", as well as message text.
Much like tokenization, different models expect very different input formats for chat. This is the reason we added
chat templates as a feature. Chat templates are part of the tokenizer. They specify how to convert conversations,
represented as lists of messages, into a single tokenizable string in the format that the model expects.
Let's make this concrete with a quick example using the BlenderBot model. BlenderBot has an extremely simple default
template, which mostly just adds whitespace between rounds of dialogue:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

tokenizer.apply_chat_template(chat, tokenize=False)
" Hello, how are you? I'm doing great. How can I help you today? I'd like to show off how chat templating works!"
```
Notice how the entire chat is condensed into a single string. If we use tokenize=True, which is the default setting,
that string will also be tokenized for us. To see a more complex template in action, though, let's use the
mistralai/Mistral-7B-Instruct-v0.1 model.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

tokenizer.apply_chat_template(chat, tokenize=False)
"[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today? [INST] I'd like to show off how chat templating works! [/INST]"
```
Note that this time, the tokenizer has added the control tokens [INST] and [/INST] to indicate the start and end of
user messages (but not assistant messages!). Mistral-instruct was trained with these tokens, but BlenderBot was not.
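As a quick aside, here is a minimal sketch of the default tokenize=True behaviour mentioned above, reusing the tokenizer and chat from the examples: instead of a string, apply_chat_template returns token IDs directly.

```python
# Minimal sketch, reusing the `tokenizer` and `chat` defined above.
# With tokenize=True (the default), apply_chat_template returns token IDs rather than a string.
token_ids = tokenizer.apply_chat_template(chat)  # tokenize=True is the default
print(token_ids)                    # a list of input IDs
print(tokenizer.decode(token_ids))  # decoding recovers the formatted chat string
```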
How do I use chat templates?
As you can see in the example above, chat templates are easy to use. Simply build a list of messages, with role
and content keys, and then pass it to the [~PreTrainedTokenizer.apply_chat_template] method. Once you do that,
you'll get output that's ready to go! When using chat templates as input for model generation, it's also a good idea
to use add_generation_prompt=True to add a generation prompt.
Here's an example of preparing input for model.generate(), using the Zephyr assistant model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)  # You may want to use bfloat16 and/or move to GPU here

messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenizer.decode(tokenized_chat[0]))
```
This will yield a string in the input format that Zephyr expects:
```text
<|system|>
You are a friendly chatbot who always responds in the style of a pirate
<|user|>
How many helicopters can a human eat in one sitting?
<|assistant|>
```
Now that our input is formatted correctly for Zephyr, we can use the model to generate a response to the user's question:
```python
outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```
This will yield:
```text
<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>
<|assistant|>
Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all.
```
Arr, 'twas easy after all!
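Note that the decoded output above includes the full prompt as well as the reply. If you only want the newly generated tokens, you can slice off the prompt before decoding. A small sketch, reusing the tensors from the example above:

```python
# Sketch: strip the prompt tokens so only the assistant's new reply is decoded.
prompt_length = tokenized_chat.shape[1]
reply = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(reply)
```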
Is there an automated pipeline for chat?
Yes, there is! Our text generation pipelines support chat inputs, which makes it easy to use chat models. In the past,
we used a dedicated "ConversationalPipeline" class, but this has now been deprecated and its functionality
has been merged into the [TextGenerationPipeline]. Let's try the Zephyr example again, but this time using
a pipeline:
```python
from transformers import pipeline

pipe = pipeline("text-generation", "HuggingFaceH4/zephyr-7b-beta")
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
print(pipe(messages, max_new_tokens=128)[0]['generated_text'][-1])  # Print the assistant's response
```
```text
{'role': 'assistant', 'content': "Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all."}
```
The pipeline will take care of all the details of tokenization and calling apply_chat_template for you -
once the model has a chat template, all you need to do is initialize the pipeline and pass it the list of messages!
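To carry the conversation forward, you can append the assistant's reply and your next user turn to the message list and call the pipeline again. A minimal sketch, reusing pipe and messages from above (the follow-up user message is just an illustrative placeholder):

```python
# Sketch: continue a multi-turn chat with the pipeline by appending turns.
response = pipe(messages, max_new_tokens=128)[0]["generated_text"][-1]
messages.append(response)  # the assistant's reply, as a {"role": ..., "content": ...} dict
messages.append({"role": "user", "content": "Arr, thank ye! And how many ships?"})
print(pipe(messages, max_new_tokens=128)[0]["generated_text"][-1])
```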
What are "generation prompts"?
You may have noticed that the apply_chat_template method has an add_generation_prompt argument. This argument tells
the template to add tokens that indicate the start of a bot response. For example, consider the following chat:
```python
messages = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "Can I ask a question?"}
]
```
Here's what this will look like without a generation prompt, for a model that uses the ChatML format:
```python
tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
"""<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
"""
```
And here's what it looks like with a generation prompt:
```python
tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
"""<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
"""
```
Note that this time, we've added the tokens that indicate the start of a bot response. This ensures that when the model
generates text it will write a bot response instead of doing something unexpected, like continuing the user's
message. Remember, chat models are still just language models - they're trained to continue text, and chat is just a
special kind of text to them! You need to guide them with appropriate control tokens, so they know what they're
supposed to be doing.
Not all models require generation prompts. Some models, like BlenderBot and LLaMA, don't have any
special tokens before bot responses. In these cases, the add_generation_prompt argument will have no effect. The exact
effect that add_generation_prompt has will depend on the template being used.
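As a quick illustration, here is a sketch of checking this yourself, assuming a template like BlenderBot's (used earlier) that never adds an assistant prefix:

```python
# Sketch: for a template that ignores add_generation_prompt (e.g. BlenderBot's),
# the rendered string is the same with or without a generation prompt.
from transformers import AutoTokenizer

blender_tok = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")
with_prompt = blender_tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
without_prompt = blender_tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
print(with_prompt == without_prompt)  # True when the template never adds an assistant prefix
```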
Can I use chat templates in training?
Yes! We recommend that you apply the chat template as a preprocessing step for your dataset. After this, you
can simply continue like any other language model training task. When training, you should usually set
add_generation_prompt=False, because the tokens that prompt an assistant response will not be helpful during
training. Let's see an example:
```python
from transformers import AutoTokenizer
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

chat1 = [
    {"role": "user", "content": "Which is bigger, the moon or the sun?"},
    {"role": "assistant", "content": "The sun."}
]
chat2 = [
    {"role": "user", "content": "Which is bigger, a virus or a bacterium?"},
    {"role": "assistant", "content": "A bacterium."}
]

dataset = Dataset.from_dict({"chat": [chat1, chat2]})
dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)})
print(dataset['formatted_chat'][0])
```
And we get:
```text
<|user|>
Which is bigger, the moon or the sun?
<|assistant|>
The sun.
```
From here, just continue training like you would with a standard language modelling task, using the formatted_chat column.
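One caveat, sketched below under the assumption that your chat template already inserts special tokens such as a BOS token: tokenizing the formatted text with default settings can then add a second copy of those tokens, so it is worth passing add_special_tokens=False at that step (check your template to see whether this applies to your model).

```python
# Sketch: tokenize the already-formatted chats for training.
# add_special_tokens=False avoids duplicating BOS/EOS tokens that the chat
# template may already have inserted into formatted_chat.
dataset = dataset.map(
    lambda x: tokenizer(x["formatted_chat"], add_special_tokens=False),
    batched=True,
)
print(dataset[0]["input_ids"][:10])
```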
Advanced: Extra inputs to chat templates
The only argument that apply_chat_template requires is messages. However, you can pass any keyword
argument to apply_chat_template and it will be accessible inside the template. This gives you a lot of freedom to use
chat templates for many things. There are no restrictions on the names or the format of these arguments - you can pass
strings, lists, dicts or whatever else you want.
That said, there are some common use-cases for these extra arguments,
such as passing tools for function calling, or documents for retrieval-augmented generation. In these common cases,
we have some opinionated recommendations about what the names and formats of these arguments should be, which are
described in the sections below. We encourage model authors to make their chat templates compatible with this format,
to make it easy to transfer tool-calling code between models.
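For example, here is a sketch of passing documents for retrieval-augmented generation. It assumes a model whose chat template actually reads a documents variable (such as Command-R's RAG template); templates that never reference the variable will simply ignore the extra argument, and the document contents here are illustrative placeholders.

```python
# Sketch: extra keyword arguments are exposed to the chat template as variables.
# Whether `documents` has any effect depends on the model's chat template.
documents = [
    {"title": "The Moon", "text": "Placeholder retrieved passage about the moon."},
    {"title": "The Sun", "text": "Placeholder retrieved passage about the sun."},
]
formatted = tokenizer.apply_chat_template(
    messages,
    documents=documents,  # accessible inside the template as `documents`
    tokenize=False,
    add_generation_prompt=True,
)
```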
Advanced: Tool use / function calling
"Tool use" LLMs can choose to call functions as external tools before generating an answer. When passing tools
to a tool-use model, you can simply pass a list of functions to the tools argument:
```python
import datetime

def current_time():
    """Get the current local time as a string."""
    return str(datetime.datetime.now())

def multiply(a: float, b: float):
    """
    A function that multiplies two numbers

    Args:
        a: The first number to multiply
        b: The second number to multiply
    """
    return a * b

tools = [current_time, multiply]

model_input = tokenizer.apply_chat_template(
    messages,
    tools=tools
)
```
In order for this to work correctly, you should write your functions in the format above, so that they can be parsed
correctly as tools. Specifically, you should follow these rules:

- The function should have a descriptive name.
- Every argument must have a type hint.
- The function must have a docstring in the standard Google style (in other words, an initial function description
  followed by an Args: block that describes the arguments, unless the function does not have any arguments).
- Do not include types in the Args: block. In other words, write a: The first number to multiply, not
  a (int): The first number to multiply. Type hints should go in the function header instead.
- The function can have a return type and a Returns: block in the docstring. However, these are optional
  because most tool-use models ignore them.
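If you want to check how a function will be presented to the model, you can render its schema yourself. A sketch using get_json_schema, the helper transformers uses internally to convert functions to JSON schemas (its exact location and output may vary between versions):

```python
# Sketch: inspect the JSON schema generated from a tool function.
from transformers.utils import get_json_schema

print(get_json_schema(multiply))
# Expect something along the lines of:
# {"type": "function", "function": {"name": "multiply",
#   "description": "A function that multiplies two numbers",
#   "parameters": {"type": "object", "properties": {...}, "required": ["a", "b"]}}}
```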
Passing tool results to the model
The sample code above is enough to list the available tools for your model, but what happens if it wants to actually use
one? If that happens, you should:

1. Parse the model's output to get the tool name(s) and arguments.
2. Add the model's tool call(s) to the conversation.
3. Call the corresponding function(s) with those arguments (a minimal dispatch sketch follows below).
4. Add the result(s) to the conversation.
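Step 3 is essentially a lookup from the tool name back to your Python function. A minimal sketch (run_tool_call and the dict keys it expects are assumptions for illustration; the exact parsed format depends on your model):

```python
# Sketch: dispatch a parsed tool call to the matching Python function.
# Assumes tool_call is a dict like {"name": "multiply", "arguments": {"a": 3, "b": 4}}.
available_tools = {fn.__name__: fn for fn in tools}

def run_tool_call(tool_call):
    fn = available_tools[tool_call["name"]]
    return fn(**tool_call.get("arguments", {}))
```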
A complete tool use example
Let's walk through a tool use example, step by step. For this example, we will use an 8B Hermes-2-Pro model,
as it is one of the highest-performing tool-use models in its size category at the time of writing. If you have the
memory, you can consider using a larger model instead like Command-R
or Mixtral-8x22B, both of which also support tool use
and offer even stronger performance.
First, let's load our model and tokenizer:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "NousResearch/Hermes-2-Pro-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, revision="pr/13")
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")
```
Next, let's define a list of tools:
```python
def get_current_temperature(location: str, unit: str) -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country"
        unit: The unit to return the temperature in. (choices: ["celsius", "fahrenheit"])
    Returns:
        The current temperature at the specified location in the specified units, as a float.
    """
    return 22.  # A real function should probably actually get the temperature!

def get_current_wind_speed(location: str) -> float:
    """
    Get the current wind speed in km/h at a given location.

    Args:
        location: The location to get the wind speed for, in the format "City, Country"
    Returns:
        The current wind speed at the given location in km/h, as a float.
    """
    return 6.  # A real function should probably actually get the wind speed!

tools = [get_current_temperature, get_current_wind_speed]
```
Now, let's set up a conversation for our bot:
```python
messages = [
    {"role": "system", "content": "You are a bot that responds to weather queries. You should reply with the unit used in the queried location."},
    {"role": "user", "content": "Hey, what's the temperature in Paris right now?"}
]
```
Now, let's apply the chat template and generate a response:
```python
inputs = tokenizer.apply_chat_template(messages, chat_template="tool_use", tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][len(inputs["input_ids"][0]):]))
```
And we get:
```text
<tool_call>
{"arguments": {"location": "Paris, France", "unit": "celsius"}, "name": "get_current_temperature"}
</tool_call><|im_end|>
```
The model has called the function with valid arguments, in the format requested by the function docstring. It has
inferred that we're most likely referring to the Paris in France, and it remembered that, as the home of SI units,
the temperature in France should certainly be displayed in Celsius.
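The first step is parsing that output back into a Python dict. Here is a sketch for Hermes-style <tool_call> output; other models emit tool calls in different formats, so the parsing code is necessarily model-specific.

```python
import json
import re

# Sketch: extract the JSON payload from a Hermes-style <tool_call>...</tool_call> block.
generated = tokenizer.decode(out[0][len(inputs["input_ids"][0]):])
match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", generated, re.DOTALL)
parsed_call = json.loads(match.group(1)) if match else None
print(parsed_call)  # {'arguments': {'location': 'Paris, France', 'unit': 'celsius'}, 'name': 'get_current_temperature'}
```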
Let's append the model's tool call to the conversation. Note that we generate a random tool_call_id here. These IDs
are not used by all models, but they allow models to issue multiple tool calls at once and keep track of which response
corresponds to which call. You can generate them any way you like, but they should be unique within each chat.
```python
tool_call_id = "vAHdf3"  # Random ID, should be unique for each tool call
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France", "unit": "celsius"}}
messages.append({"role": "assistant", "tool_calls": [{"id": tool_call_id, "type": "function", "function": tool_call}]})
```
Now that we've added the tool call to the conversation, we can call the function and append the result to the
conversation. Since we're just using a dummy function for this example that always returns 22.0, we can just append
that result directly. Again, note the tool_call_id - this should match the ID used in the tool call above.
```python
messages.append({"role": "tool", "tool_call_id": tool_call_id, "name": "get_current_temperature", "content": "22.0"})
```
Finally, let's let the assistant read the function outputs and continue chatting with the user:
```python
inputs = tokenizer.apply_chat_template(messages, chat_template="tool_use", tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][len(inputs["input_ids"][0]):]))
```
And we get:
```text
The current temperature in Paris, France is 22.0 ° Celsius.<|im_end|>
```
Although this was a simple demo with dummy tools and a single call, the same technique works with
multiple real tools and longer conversations. This can be a powerful way to extend the capabilities of conversational
agents with real-time information, computational tools like calculators, or access to large databases.
Not all of the tool-calling features shown above are used by all models. Some use tool call IDs, others simply use the function name and
match tool calls to results using the ordering, and there are several models that use neither and only issue one tool
call at a time to avoid confusion. If you want your code to be compatible across as many models as possible, we
recommend structuring your tool calls like we've shown here, and returning tool results in the order that
they were issued by the model. The chat templates on each model should handle the rest.
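For example, if the model issues two tool calls in one turn, a portable pattern is to append both calls as a single assistant message and then append the results as tool messages in the same order. A sketch, reusing the tool names from the example above (the id fields are omitted here, which is only appropriate for models that don't rely on them):

```python
# Sketch: append two tool calls and then their results in matching order.
calls = [
    {"type": "function", "function": {"name": "get_current_temperature",
                                      "arguments": {"location": "Paris, France", "unit": "celsius"}}},
    {"type": "function", "function": {"name": "get_current_wind_speed",
                                      "arguments": {"location": "Paris, France"}}},
]
messages.append({"role": "assistant", "tool_calls": calls})
# Results go back in the same order the calls were issued.
messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})
messages.append({"role": "tool", "name": "get_current_wind_speed", "content": "6.0"})
```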