Templates for Chat Models
Introduction
An increasingly common use case for LLMs is chat. In a chat context, rather than continuing a single string
of text (as is the case with a standard language model), the model instead continues a conversation that consists
of one or more messages, each of which includes a role, like "user" or "assistant", as well as message text.
Much like tokenization, different models expect very different input formats for chat. This is the reason we added
chat templates as a feature. Chat templates are part of the tokenizer. They specify how to convert conversations,
represented as lists of messages, into a single tokenizable string in the format that the model expects.
Let's make this concrete with a quick example using the BlenderBot model. BlenderBot has an extremely simple default
template, which mostly just adds whitespace between rounds of dialogue:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

tokenizer.apply_chat_template(chat, tokenize=False)
" Hello, how are you? I'm doing great. How can I help you today? I'd like to show off how chat templating works!"
```
Notice how the entire chat is condensed into a single string. If we use tokenize=True, which is the default setting,
that string will also be tokenized for us. To see a more complex template in action, though, let's use the
mistralai/Mistral-7B-Instruct-v0.1 model.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

tokenizer.apply_chat_template(chat, tokenize=False)
"[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today? [INST] I'd like to show off how chat templating works! [/INST]"
```
Note that this time, the tokenizer has added the control tokens [INST] and [/INST] to indicate the start and end of
user messages (but not assistant messages!). Mistral-instruct was trained with these tokens, but BlenderBot was not.
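As a quick aside, here is a minimal sketch of the default tokenize=True behaviour mentioned above, reusing the tokenizer and chat from the examples: instead of a string, apply_chat_template returns token IDs directly.

```python
# Minimal sketch, reusing the `tokenizer` and `chat` defined above.
# With tokenize=True (the default), apply_chat_template returns token IDs rather than a string.
token_ids = tokenizer.apply_chat_template(chat)  # tokenize=True is the default
print(token_ids)                    # a list of input IDs
print(tokenizer.decode(token_ids))  # decoding recovers the formatted chat string
```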
How do I use chat templates?
As you can see in the example above, chat templates are easy to use. Simply build a list of messages, with role
and content keys, and then pass it to the [~PreTrainedTokenizer.apply_chat_template] method. Once you do that,
you'll get output that's ready to go! When using chat templates as input for model generation, it's also a good idea
to use add_generation_prompt=True to add a generation prompt.
Here's an example of preparing input for model.generate(), using the Zephyr assistant model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)  # You may want to use bfloat16 and/or move to GPU here

messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenizer.decode(tokenized_chat[0]))
```
This will yield a string in the input format that Zephyr expects:
```text
<|system|>
You are a friendly chatbot who always responds in the style of a pirate
<|user|>
How many helicopters can a human eat in one sitting?
<|assistant|>
```
Now that our input is formatted correctly for Zephyr, we can use the model to generate a response to the user's question:
```python
outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```
This will yield:
```text
<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>
<|assistant|>
Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all.
```
Arr, 'twas easy after all!
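Note that the decoded output above includes the full prompt as well as the reply. If you only want the newly generated tokens, you can slice off the prompt before decoding. A small sketch, reusing the tensors from the example above:

```python
# Sketch: strip the prompt tokens so only the assistant's new reply is decoded.
prompt_length = tokenized_chat.shape[1]
reply = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(reply)
```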
Is there an automated pipeline for chat?
Yes, there is! Our text generation pipelines support chat inputs, which makes it easy to use chat models. In the past,
we used a dedicated "ConversationalPipeline" class, but this has now been deprecated and its functionality
has been merged into the [TextGenerationPipeline]. Let's try the Zephyr example again, but this time using
a pipeline:
```python
from transformers import pipeline

pipe = pipeline("text-generation", "HuggingFaceH4/zephyr-7b-beta")
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
print(pipe(messages, max_new_tokens=128)[0]['generated_text'][-1])  # Print the assistant's response
```
```text
{'role': 'assistant', 'content': "Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all."}
```
The pipeline will take care of all the details of tokenization and calling apply_chat_template for you -
once the model has a chat template, all you need to do is initialize the pipeline and pass it the list of messages!
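To carry the conversation forward, you can append the assistant's reply and your next user turn to the message list and call the pipeline again. A minimal sketch, reusing pipe and messages from above (the follow-up user message is just an illustrative placeholder):

```python
# Sketch: continue a multi-turn chat with the pipeline by appending turns.
response = pipe(messages, max_new_tokens=128)[0]["generated_text"][-1]
messages.append(response)  # the assistant's reply, as a {"role": ..., "content": ...} dict
messages.append({"role": "user", "content": "Arr, thank ye! And how many ships?"})
print(pipe(messages, max_new_tokens=128)[0]["generated_text"][-1])
```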
What are "generation prompts"?
You may have noticed that the apply_chat_template method has an add_generation_prompt argument. This argument tells
the template to add tokens that indicate the start of a bot response. For example, consider the following chat:
```python
messages = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "Can I ask a question?"}
]
```
Here's what this will look like without a generation prompt, for a model that uses the ChatML format:
```python
tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
"""<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
"""
```
And here's what it looks like with a generation prompt:
```python
tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
"""<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
"""
```
Note that this time, we've added the tokens that indicate the start of a bot response. This ensures that when the model
generates text it will write a bot response instead of doing something unexpected, like continuing the user's
message. Remember, chat models are still just language models - they're trained to continue text, and chat is just a
special kind of text to them! You need to guide them with appropriate control tokens, so they know what they're
supposed to be doing.
Not all models require generation prompts. Some models, like BlenderBot and LLaMA, don't have any
special tokens before bot responses. In these cases, the add_generation_prompt argument will have no effect. The exact
effect that add_generation_prompt has will depend on the template being used.
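As a quick illustration, here is a sketch of checking this yourself, assuming a template like BlenderBot's (used earlier) that never adds an assistant prefix:

```python
# Sketch: for a template that ignores add_generation_prompt (e.g. BlenderBot's),
# the rendered string is the same with or without a generation prompt.
from transformers import AutoTokenizer

blender_tok = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")
with_prompt = blender_tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
without_prompt = blender_tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
print(with_prompt == without_prompt)  # True when the template never adds an assistant prefix
```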
Can I use chat templates in training?
Yes! We recommend that you apply the chat template as a preprocessing step for your dataset. After this, you
can simply continue like any other language model training task. When training, you should usually set
add_generation_prompt=False, because the tokens that prompt an assistant response will not be helpful during
training. Let's see an example:
```python
from transformers import AutoTokenizer
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

chat1 = [
    {"role": "user", "content": "Which is bigger, the moon or the sun?"},
    {"role": "assistant", "content": "The sun."}
]
chat2 = [
    {"role": "user", "content": "Which is bigger, a virus or a bacterium?"},
    {"role": "assistant", "content": "A bacterium."}
]

dataset = Dataset.from_dict({"chat": [chat1, chat2]})
dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)})
print(dataset['formatted_chat'][0])
```
And we get:
```text
<|user|>
Which is bigger, the moon or the sun?
<|assistant|>
The sun.
```
From here, just continue training like you would with a standard language modelling task, using the formatted_chat column.
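One caveat, sketched below under the assumption that your chat template already inserts special tokens such as a BOS token: tokenizing the formatted text with default settings can then add a second copy of those tokens, so it is worth passing add_special_tokens=False at that step (check your template to see whether this applies to your model).

```python
# Sketch: tokenize the already-formatted chats for training.
# add_special_tokens=False avoids duplicating BOS/EOS tokens that the chat
# template may already have inserted into formatted_chat.
dataset = dataset.map(
    lambda x: tokenizer(x["formatted_chat"], add_special_tokens=False),
    batched=True,
)
print(dataset[0]["input_ids"][:10])
```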
Advanced: Extra inputs to chat templates
The only argument that apply_chat_template requires is messages. However, you can pass any keyword
argument to apply_chat_template and it will be accessible inside the template. This gives you a lot of freedom to use
chat templates for many things. There are no restrictions on the names or the format of these arguments - you can pass
strings, lists, dicts or whatever else you want.
That said, there are some common use-cases for these extra arguments,
such as passing tools for function calling, or documents for retrieval-augmented generation. In these common cases,
we have some opinionated recommendations about what the names and formats of these arguments should be, which are
described in the sections below. We encourage model authors to make their chat templates compatible with this format,
to make it easy to transfer tool-calling code between models.
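For example, here is a sketch of passing documents for retrieval-augmented generation. It assumes a model whose chat template actually reads a documents variable (such as Command-R's RAG template); templates that never reference the variable will simply ignore the extra argument, and the document contents here are illustrative placeholders.

```python
# Sketch: extra keyword arguments are exposed to the chat template as variables.
# Whether `documents` has any effect depends on the model's chat template.
documents = [
    {"title": "The Moon", "text": "Placeholder retrieved passage about the moon."},
    {"title": "The Sun", "text": "Placeholder retrieved passage about the sun."},
]
formatted = tokenizer.apply_chat_template(
    messages,
    documents=documents,  # accessible inside the template as `documents`
    tokenize=False,
    add_generation_prompt=True,
)
```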
Advanced: Tool use / function calling
"Tool use" LLMs can choose to call functions as external tools before generating an answer. When passing tools
to a tool-use model, you can simply pass a list of functions to the tools argument:
```python
import datetime

def current_time():
    """Get the current local time as a string."""
    return str(datetime.datetime.now())

def multiply(a: float, b: float):
    """
    A function that multiplies two numbers

    Args:
        a: The first number to multiply
        b: The second number to multiply
    """
    return a * b

tools = [current_time, multiply]

model_input = tokenizer.apply_chat_template(
    messages,
    tools=tools
)
```
In order for this to work correctly, you should write your functions in the format above, so that they can be parsed
correctly as tools. Specifically, you should follow these rules:

- The function should have a descriptive name.
- Every argument must have a type hint.
- The function must have a docstring in the standard Google style (in other words, an initial function description
  followed by an Args: block that describes the arguments, unless the function does not have any arguments).
- Do not include types in the Args: block. In other words, write a: The first number to multiply, not
  a (int): The first number to multiply. Type hints should go in the function header instead.
- The function can have a return type and a Returns: block in the docstring. However, these are optional
  because most tool-use models ignore them.
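If you want to check how a function will be presented to the model, you can render its schema yourself. A sketch using get_json_schema, the helper transformers uses internally to convert functions to JSON schemas (its exact location and output may vary between versions):

```python
# Sketch: inspect the JSON schema generated from a tool function.
from transformers.utils import get_json_schema

print(get_json_schema(multiply))
# Expect something along the lines of:
# {"type": "function", "function": {"name": "multiply",
#   "description": "A function that multiplies two numbers",
#   "parameters": {"type": "object", "properties": {...}, "required": ["a", "b"]}}}
```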
Passing tool results to the model
The sample code above is enough to list the available tools for your model, but what happens if it wants to actually use
one? If that happens, you should:

1. Parse the model's output to get the tool name(s) and arguments.
2. Add the model's tool call(s) to the conversation.
3. Call the corresponding function(s) with those arguments (a minimal dispatch sketch follows below).
4. Add the result(s) to the conversation.
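Step 3 is essentially a lookup from the tool name back to your Python function. A minimal sketch (run_tool_call and the dict keys it expects are assumptions for illustration; the exact parsed format depends on your model):

```python
# Sketch: dispatch a parsed tool call to the matching Python function.
# Assumes tool_call is a dict like {"name": "multiply", "arguments": {"a": 3, "b": 4}}.
available_tools = {fn.__name__: fn for fn in tools}

def run_tool_call(tool_call):
    fn = available_tools[tool_call["name"]]
    return fn(**tool_call.get("arguments", {}))
```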
A complete tool use example
Let's walk through a tool use example, step by step. For this example, we will use an 8B Hermes-2-Pro model,
as it is one of the highest-performing tool-use models in its size category at the time of writing. If you have the
memory, you can consider using a larger model instead like Command-R
or Mixtral-8x22B, both of which also support tool use
and offer even stronger performance.
First, let's load our model and tokenizer:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "NousResearch/Hermes-2-Pro-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, revision="pr/13")
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")
```
Next, let's define a list of tools:
```python
def get_current_temperature(location: str, unit: str) -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country"
        unit: The unit to return the temperature in. (choices: ["celsius", "fahrenheit"])
    Returns:
        The current temperature at the specified location in the specified units, as a float.
    """
    return 22.  # A real function should probably actually get the temperature!

def get_current_wind_speed(location: str) -> float:
    """
    Get the current wind speed in km/h at a given location.

    Args:
        location: The location to get the wind speed for, in the format "City, Country"
    Returns:
        The current wind speed at the given location in km/h, as a float.
    """
    return 6.  # A real function should probably actually get the wind speed!

tools = [get_current_temperature, get_current_wind_speed]
```
Now, let's set up a conversation for our bot:
```python
messages = [
    {"role": "system", "content": "You are a bot that responds to weather queries. You should reply with the unit used in the queried location."},
    {"role": "user", "content": "Hey, what's the temperature in Paris right now?"}
]
```
Now, let's apply the chat template and generate a response:
```python
inputs = tokenizer.apply_chat_template(messages, chat_template="tool_use", tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][len(inputs["input_ids"][0]):]))
```
And we get:
```text
<tool_call>
{"arguments": {"location": "Paris, France", "unit": "celsius"}, "name": "get_current_temperature"}
</tool_call><|im_end|>
```
The model has called the function with valid arguments, in the format requested by the function docstring. It has
inferred that we're most likely referring to the Paris in France, and it remembered that, as the home of SI units,
the temperature in France should certainly be displayed in Celsius.
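The first step is parsing that output back into a Python dict. Here is a sketch for Hermes-style <tool_call> output; other models emit tool calls in different formats, so the parsing code is necessarily model-specific.

```python
import json
import re

# Sketch: extract the JSON payload from a Hermes-style <tool_call>...</tool_call> block.
generated = tokenizer.decode(out[0][len(inputs["input_ids"][0]):])
match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", generated, re.DOTALL)
parsed_call = json.loads(match.group(1)) if match else None
print(parsed_call)  # {'arguments': {'location': 'Paris, France', 'unit': 'celsius'}, 'name': 'get_current_temperature'}
```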
Let's append the model's tool call to the conversation. Note that we generate a random tool_call_id here. These IDs
are not used by all models, but they allow models to issue multiple tool calls at once and keep track of which response
corresponds to which call. You can generate them any way you like, but they should be unique within each chat.
```python
tool_call_id = "vAHdf3"  # Random ID, should be unique for each tool call
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France", "unit": "celsius"}}
messages.append({"role": "assistant", "tool_calls": [{"id": tool_call_id, "type": "function", "function": tool_call}]})
```
Now that we've added the tool call to the conversation, we can call the function and append the result to the
conversation. Since we're just using a dummy function for this example that always returns 22.0, we can just append
that result directly. Again, note the tool_call_id - this should match the ID used in the tool call above.
```python
messages.append({"role": "tool", "tool_call_id": tool_call_id, "name": "get_current_temperature", "content": "22.0"})
```
Finally, let's let the assistant read the function outputs and continue chatting with the user:
```python
inputs = tokenizer.apply_chat_template(messages, chat_template="tool_use", tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][len(inputs["input_ids"][0]):]))
```
And we get:
```text
The current temperature in Paris, France is 22.0 ° Celsius.<|im_end|>
```
Although this was a simple demo with dummy tools and a single call, the same technique works with
multiple real tools and longer conversations. This can be a powerful way to extend the capabilities of conversational
agents with real-time information, computational tools like calculators, or access to large databases.
Not all of the tool-calling features shown above are used by all models. Some use tool call IDs, others simply use the function name and
match tool calls to results using the ordering, and there are several models that use neither and only issue one tool
call at a time to avoid confusion. If you want your code to be compatible across as many models as possible, we
recommend structuring your tool calls like we've shown here, and returning tool results in the order that
they were issued by the model. The chat templates on each model should handle the rest.
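For example, if the model issues two tool calls in one turn, a portable pattern is to append both calls as a single assistant message and then append the results as tool messages in the same order. A sketch, reusing the tool names from the example above (the id fields are omitted here, which is only appropriate for models that don't rely on them):

```python
# Sketch: append two tool calls and then their results in matching order.
calls = [
    {"type": "function", "function": {"name": "get_current_temperature",
                                      "arguments": {"location": "Paris, France", "unit": "celsius"}}},
    {"type": "function", "function": {"name": "get_current_wind_speed",
                                      "arguments": {"location": "Paris, France"}}},
]
messages.append({"role": "assistant", "tool_calls": calls})
# Results go back in the same order the calls were issued.
messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})
messages.append({"role": "tool", "name": "get_current_wind_speed", "content": "6.0"})
```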