Batch processing on a GPU?
I'm pretty new to the transformers package. Can anyone provide example code for how a Gemma 3 VLM can be used to batch process images on a CUDA GPU? In my case, I have a list of local files that I want to process using a common prompt. Currently I'm only able to process each image sequentially on my CUDA GPU.
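For context, here's roughly what my current sequential loop looks like. This is just a sketch: the model id (google/gemma-3-4b-it), the prompt, and the folder path are placeholders for my actual setup, and the apply_chat_template usage follows the Gemma 3 model card, assuming a recent transformers release that includes Gemma 3 support:

import os
import torch
from PIL import Image
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"  # 4B instruction-tuned checkpoint (placeholder for whichever Gemma 3 size you use)
processor = AutoProcessor.from_pretrained(model_id)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

prompt = "Describe what is shown in this image."  # placeholder for my real common prompt
image_folder = "/path/to/my/images"               # placeholder for my local folder
image_paths = [os.path.join(image_folder, f) for f in os.listdir(image_folder)
               if f.lower().endswith((".jpg", ".png"))]

for path in image_paths:
    # One image plus the common prompt per call (this is the sequential part)
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open(path)},
            {"type": "text", "text": prompt},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)

    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=256)

    # Drop the prompt tokens so only the generated answer is decoded
    answer = processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                              skip_special_tokens=True)
    print(path, "->", answer)

This works, but it issues one generate() call per image, which is why I'm looking for a proper batched version.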
To process images in bulk using the Gemma 3 VLM model on a CUDA GPU, you can use PyTorch along with Tesseract OCR to extract text from the images and then send those texts to the model for inference. First, install the necessary libraries like torch, transformers, pytesseract, and Pillow. Then, load the model and tokenizer using transformers, and use Tesseract to process each image individually. To optimize batch processing, you can loop through all the images in a directory and generate text for each of them. The code below illustrates this process, using the GPU to perform the inferences:
Here it is:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import pytesseract
import os

# Define the path to the Tesseract OCR executable (if necessary)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"  # Adjust as needed

# Function to process an image with OCR
def process_image(image_path):
    # Load the image
    img = Image.open(image_path)
    # Use Tesseract to extract text
    text = pytesseract.image_to_string(img)
    return text

# Function to process the batch of images
def process_batch(image_paths, model, tokenizer, device):
    texts = []
    for image_path in image_paths:
        print(f"Processing {image_path}...")
        # Step 1: Process the image with OCR (convert image to text)
        ocr_text = process_image(image_path)
        # Step 2: Use the model for inference (based on the extracted text)
        inputs = tokenizer(ocr_text, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model.generate(**inputs, max_length=1024)
        # Decode the model's response
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        texts.append(generated_text)
    return texts

# Load the model and tokenizer
model_name = "gemma-3-4b-it"  # Or any other model you have
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to('cuda')

# Path to the folder with the images
image_folder = "/path/to/your/images"

# List of image paths
image_paths = [os.path.join(image_folder, fname) for fname in os.listdir(image_folder) if fname.endswith('.jpg') or fname.endswith('.png')]

# Process the images in bulk
generated_texts = process_batch(image_paths, model, tokenizer, 'cuda')

# Display the results
for idx, generated_text in enumerate(generated_texts):
    print(f"Generated text for image {image_paths[idx]}: {generated_text}\n")
@Ayorinha thanks for replying. I'm guessing you asked an LLM my question and pasted the response? What you provided doesn't make sense. The code is using pytesseract to extract the text from my image then feeding the text into Gemma 3 without any prompt from me. It's treating Gemma 3 like an LLM rather than a VLM, and it doesn't provide any prompt. This is not how Gemma 3 is meant to be used. I should be feeding the image and my prompt into Gemma 3. My aim is to do visual question answering (VQA) of the images I have.
I should mention I have already consulted with Sonnet 3.7 on this question and it wasn't able to figure it out. Maybe a more experienced transformers user could coax the right answer out of it, but I couldn't.
Sorry man, explain to me what you did.
OMG, this should work correctly for Visual Question Answering (VQA)?
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import os