How can I run the pipeline without the Image input?

#17
by prakhaaaas - opened

When I set the pipeline task to text-generation instead of image-text-to-text, it gives me an error.

from transformers import pipeline
import torch

pipe = pipeline(
    "text-generation",
    model=r"Gemma_3_4B_it\Model",
    torch_dtype=torch.bfloat16
)

AttributeError: 'Gemma3Config' object has no attribute 'vocab_size'

Is there any other way to do it?
Am I doing something wrong?

Hi!

I found a solution to the issue you had:

import torch
from transformers import pipeline

pipe_txt2txt = pipeline(
    "text2text-generation",
    model="google/gemma-3-4b-it",  # or "google/gemma-3-12b-it", "google/gemma-3-27b-it"
    device="cuda",
    torch_dtype=torch.bfloat16
)

messages = "what is huggingface?"

output = pipe_txt2txt(messages)
print(output[0]["generated_text"])
#what is huggingface?
#Hugging Face is a company and a community that provides tools and resources for building, training,

the task you are looking for is text2text-generation, I believe. Hope this helps ☺

Google org

Hi @prakhaaaas ,

The google/gemma-3-4b-it model is built for image-text-to-text tasks. When you try to use it for text-generation, the pipeline expects a text-only model, but Gemma3Config doesn't expose a top-level "vocab_size" attribute, which text-only pipelines rely on. This mismatch happens because the model is designed to handle both text and images, while the text-generation pipeline assumes a purely text-based model.
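
You can see this directly by inspecting the config (a minimal sketch; it assumes the composite config layout of recent transformers releases, where the text settings live on a nested text_config):

from transformers import AutoConfig

# Gemma3Config is a composite config: the text settings sit on a nested
# text_config, so the top-level object has no vocab_size of its own.
cfg = AutoConfig.from_pretrained("google/gemma-3-4b-it")
print(hasattr(cfg, "vocab_size"))   # False here, matching the AttributeError above
print(cfg.text_config.vocab_size)   # the vocabulary size lives one level down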

If you try to use "text2text-generation" as the pipeline task, you will see a warning or error. This is because the model "Gemma3ForConditionalGeneration" does not support this specific task, meaning it cannot process text-to-text generation as expected.

If you provide only text, the model behaves like a regular text-generation model. However, if you pass both text and an image, it switches to image-text-to-text mode, where it processes the image along with the text input.
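
A text-only call through the same pipeline would look like this (a minimal sketch; whether it runs without an image may depend on your transformers version):

import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it",
    device="cuda",
    torch_dtype=torch.bfloat16
)

# A chat message with no "image" entry in its content list.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the capital of Italy?"}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=50)
print(output[0]["generated_text"][-1]["content"])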

Could you please go through this gist file.

Thank you.

Hi @GopiUppari

When there is no image attached to the input, for example
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the capital of Italy?"}
        ]
    }
]

the image-text-to-text model throws an IndexError: "IndexError: list index out of range". To get around this, I guess @prakhaaaas was trying to load the text-generation model.
Is there any other option to use the image-text-to-text model on its own, without attaching an image to every prompt? For example, initiating a conversation about a topic and then sending the image to the model later, instead of requiring at least one image in the very first prompt.

Hello @pramoda09 ,
Yes, you are correct. As a first step I tried removing the image from the prompt, hoping the model would ignore it and behave like a text-only model, but I encountered the same error. Then I went on to change the pipeline task, and those attempts also gave me various errors.
I still have to try your solution, @pramoda09, using text2text-generation.
@GopiUppari, you might have to look into this, as the model is not behaving the way you described: it throws an error when the image is removed from the prompt.

Hi @prakhaaaas
When I tried out the model, I found that there has to be at least one mention of an image in the prompt.
While testing whether the model could recall past conversations by appending the previous response under the assistant role, I found that as long as there was one image at the very beginning, you could send text-only inputs to the image-text-to-text model and it would work just fine. The problem I am having is allowing the image to be sent later in the conversation chain, rather than hard-coding it to be present in the first prompt no matter what. If the model could respond like a normal LLM without needing an initial image, there would be a lot of potential for building really good projects with this. The applications are limitless.
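
For reference, this is roughly the pattern I am using (a sketch; the image URL is just the example from this thread, and cuda/bfloat16 are assumptions about the setup):

import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it",
    device="cuda",
    torch_dtype=torch.bfloat16
)

# One image in the very first user turn satisfies the pipeline;
# every later turn can then be text-only.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co./datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "Describe this image in one sentence."}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=100)

# Append the assistant turn the pipeline returned, then continue text-only.
messages.append(output[0]["generated_text"][-1])
messages.append({
    "role": "user",
    "content": [{"type": "text", "text": "What colors stand out in it?"}]
})

output = pipe(text=messages, max_new_tokens=100)
print(output[0]["generated_text"][-1]["content"])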

hi @pramoda09 So I tried your solution with the text2text-generation pipeline. It works, but I don't think it takes the prompt in a formatted manner, and the output coming from the model does not contain any special tokens. I believe this might be fine as a workaround, but the real solution still needs to be implemented, or a solid fix is required. Alternatively, the same result might be achieved using the prompt template that was used during training; see the sketch below.
Let's see whether others are facing this issue too.
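
Something like this is what I have in mind for the template idea (a sketch; it assumes the checkpoint's tokenizer ships the Gemma chat template and that the text2text-generation pipeline accepts a pre-rendered prompt):

import torch
from transformers import AutoTokenizer, pipeline

model_id = "google/gemma-3-4b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe_txt2txt = pipeline(
    "text2text-generation",
    model=model_id,
    device="cuda",
    torch_dtype=torch.bfloat16
)

# Render the prompt with the same chat template used during training,
# including the special turn tokens, before handing it to the pipeline.
messages = [{"role": "user", "content": "what is huggingface?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

output = pipe_txt2txt(prompt, max_new_tokens=100)
print(output[0]["generated_text"])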

@GopiUppari Thanks for the suggested solution; it helps in troubleshooting the situation. However, text2text-generation and text-generation are distinct pipeline tasks in the transformers library. text2text-generation is used for models that transform input text into output text, like translation or summarization, often employing encoder-decoder architectures. In contrast, text-generation is for models that create new text based on a prompt, such as chatbots or story writing, typically using decoder-only models. Essentially, text2text-generation transforms existing text, while text-generation creates new continuations. As I mentioned, this can work around the issue, but for deep analysis and true open-ended generation, the text-generation task is still the right fit.
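
To make the distinction concrete, here is a minimal side-by-side with two small stock models (t5-small and gpt2 are just illustrative choices, unrelated to Gemma):

from transformers import pipeline

# Encoder-decoder: the input text is transformed into new output text.
t2t = pipeline("text2text-generation", model="t5-small")
print(t2t("translate English to French: Hello, how are you?")[0]["generated_text"])

# Decoder-only: the model continues the prompt it was given.
gen = pipeline("text-generation", model="gpt2")
print(gen("Hugging Face is", max_new_tokens=20)[0]["generated_text"])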

@GopiUppari @prakhaaaas @pramoda09 I am trying to run the 4B IT variant of Gemma 3 on my Mac mini, but the generated text is always blank. See the output object below. I am just running the sample code, which is the same as the one provided by HF. What am I doing wrong here?

[{'input_text': [{'role': 'system', 'content': [{'type': 'text', 'text': 'You are a helpful assistant.'}]}, {'role': 'user', 'content': [{'type': 'image', 'url': 'https://huggingface.co./datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG'}, {'type': 'text', 'text': 'What animal is on the candy?'}]}], 'generated_text': [{'role': 'system', 'content': [{'type': 'text', 'text': 'You are a helpful assistant.'}]}, {'role': 'user', 'content': [{'type': 'image', 'url': 'https://huggingface.co./datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG'}, {'type': 'text', 'text': 'What animal is on the candy?'}]}, {'role': 'assistant', 'content': ''}]}]

from transformers import pipeline
import torch

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it",
    device="mps",
    torch_dtype=torch.float16
)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co./datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
print(output)

Okay, let's take a look!

Based on the image, the animal on the candy is a turtle.

You can see the shell shape and the head and legs.
