Batch processing on a GPU?

#32
by buckeye17-bah - opened

I'm pretty new to the transformers package. Can anyone provide example code for how a Gemma 3 VLM can be used to batch process images on a CUDA GPU? In my case, I have a list of local files that I want to process using a common prompt. Currently I'm only able to process each image sequentially on my CUDA GPU.

To process images in bulk using the Gemma 3 VLM model on a CUDA GPU, you can use PyTorch along with Tesseract OCR to extract text from the images and then send those texts to the model for inference. First, install the necessary libraries like torch, transformers, pytesseract, and Pillow. Then, load the model and tokenizer using transformers, and use Tesseract to process each image individually. To optimize batch processing, you can loop through all the images in a directory and generate text for each of them. The code below illustrates this process, using the GPU to perform the inferences:

Here it is:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import pytesseract
import os

# Define the path to the Tesseract OCR executable (if necessary)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"  # Adjust as needed

# Function to process an image with OCR
def process_image(image_path):
    # Load the image
    img = Image.open(image_path)

    # Use Tesseract to extract text
    text = pytesseract.image_to_string(img)

    return text

# Function to process the batch of images
def process_batch(image_paths, model, tokenizer, device):
    texts = []

    for image_path in image_paths:
        print(f"Processing {image_path}...")

        # Step 1: Process the image with OCR (convert image to text)
        ocr_text = process_image(image_path)

        # Step 2: Use the model for inference (based on the extracted text)
        inputs = tokenizer(ocr_text, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model.generate(**inputs, max_length=1024)

        # Decode the model's response
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        texts.append(generated_text)

    return texts

# Load the model and tokenizer
model_name = "gemma-3-4b-it"  # Or any other model you have
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to('cuda')

# Path to the folder with the images
image_folder = "/path/to/your/images"

# List of image paths
image_paths = [os.path.join(image_folder, fname) for fname in os.listdir(image_folder) if fname.endswith('.jpg') or fname.endswith('.png')]

# Process the images in bulk
generated_texts = process_batch(image_paths, model, tokenizer, 'cuda')

# Display the results
for idx, generated_text in enumerate(generated_texts):
    print(f"Generated text for image {image_paths[idx]}: {generated_text}\n")

@Ayorinha thanks for replying. I'm guessing you asked an LLM my question and pasted the response? What you provided doesn't make sense. The code uses pytesseract to extract text from my images and then feeds that text into Gemma 3 without any prompt from me, treating Gemma 3 like a text-only LLM rather than a VLM. That's not how Gemma 3 is meant to be used: I should be feeding the image and my prompt into Gemma 3 directly. My aim is to do visual question answering (VQA) on the images I have.

I should mention I have already consulted with Sonnet 3.7 on this question and it wasn't able to figure it out. Maybe a more experienced transformers user could coax the right answer out of it, but I couldn't.
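
For reference, the image-plus-prompt pattern described above looks roughly like the following with the transformers API. This is a minimal, untested sketch: it assumes the google/gemma-3-4b-it checkpoint, a transformers release recent enough to ship Gemma3ForConditionalGeneration, and accelerate installed for device_map="auto"; the image path and question are placeholders.

import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"  # assumed checkpoint; substitute the one you actually use
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = AutoProcessor.from_pretrained(model_id)

# The image and the question go into the same user turn of the chat template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "/path/to/your/images/example.jpg"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)

# Strip the prompt tokens so only the model's answer is decoded
answer = processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(answer)

The key point is that the processor handles both the pixels and the prompt together, so there is no OCR step anywhere.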

Sorry man,

Explain to me what you did

omg, this should work correctly for Visual Question Answering (VQA)?

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import os
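
The snippet above stops at the imports, so here is one way it might continue for batched VQA, sketched under a few assumptions: the google/gemma-3-4b-it checkpoint, Gemma3ForConditionalGeneration in place of AutoModelForVision2Seq (the class the Gemma 3 model card uses), and a transformers version whose processor apply_chat_template accepts a list of conversations with padding=True. The folder path, shared prompt, and batch size are placeholders to adjust.

import os
import torch
from PIL import Image
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"  # assumed checkpoint
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = AutoProcessor.from_pretrained(model_id)
processor.tokenizer.padding_side = "left"  # left-pad so every prompt ends at the same position

image_folder = "/path/to/your/images"
image_paths = [
    os.path.join(image_folder, f)
    for f in os.listdir(image_folder)
    if f.lower().endswith((".jpg", ".png"))
]
prompt = "What is shown in this image?"  # the common prompt applied to every image
batch_size = 4  # tune to your GPU memory

results = []
for start in range(0, len(image_paths), batch_size):
    batch_paths = image_paths[start:start + batch_size]

    # One conversation per image, each pairing the image with the shared prompt
    conversations = [
        [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": Image.open(p).convert("RGB")},
                    {"type": "text", "text": prompt},
                ],
            }
        ]
        for p in batch_paths
    ]

    # Tokenize, pad, and move the whole batch to the GPU in one call
    inputs = processor.apply_chat_template(
        conversations,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        padding=True,
    ).to(model.device)

    with torch.inference_mode():
        outputs = model.generate(**inputs, max_new_tokens=128)

    # Keep only the newly generated tokens, then decode each answer
    new_tokens = outputs[:, inputs["input_ids"].shape[-1]:]
    results.extend(processor.batch_decode(new_tokens, skip_special_tokens=True))

for path, answer in zip(image_paths, results):
    print(f"{path}: {answer}")

With left padding, all prompts in a batch end at the same index, so slicing off the first inputs["input_ids"].shape[-1] tokens cleanly removes the prompt from every row before decoding.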
