metadata

title: InternVL2.5 Dual Image Analyzer
emoji: 🖼️
colorFrom: blue
colorTo: purple
sdk: docker
sdk_version: 3.1
app_file: app.py
pinned: false
license: mit

InternVL2.5 Dual Image Analyzer

This Hugging Face Space runs InternVL2.5, a vision-language model, on one or two uploaded images and shows the analysis for each image side by side.

Features

Upload one or two images for detailed analysis
Uses the InternVL2.5-8B model for high-quality image understanding
Handles images of various aspect ratios and formats
Multi-GPU support for efficient processing
Provides a selection of prompts or allows custom queries
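
As an illustration of the multi-GPU point above, the sketch below splits a stack of decoder layers across GPUs in contiguous, near-equal chunks and returns a `device_map`-style dictionary. It is not taken from app.py: the function name, the layer prefix, the pinned module names, and the layer count of 32 in the usage note are all assumptions for illustration, and a real map must cover every top-level module of the actual checkpoint.

```python
from typing import Dict, Sequence


def build_device_map(
    num_layers: int,
    num_gpus: int,
    layer_prefix: str = "language_model.model.layers",  # assumed module path
    pinned_modules: Sequence[str] = ("vision_model", "mlp1"),  # assumed names
) -> Dict[str, int]:
    """Assign decoder layers to GPUs in contiguous, near-equal chunks.

    Modules listed in `pinned_modules` are kept on GPU 0 so that image
    features and the first decoder layers share a device.
    """
    if num_layers < 1 or num_gpus < 1:
        raise ValueError("num_layers and num_gpus must be positive")

    device_map: Dict[str, int] = {}
    base, extra = divmod(num_layers, num_gpus)
    layer = 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs take one additional layer each.
        count = base + (1 if gpu < extra else 0)
        for _ in range(count):
            device_map[f"{layer_prefix}.{layer}"] = gpu
            layer += 1

    for name in pinned_modules:
        device_map[name] = 0
    return device_map
```

For example, `build_device_map(32, 2)` places layers 0 to 15 on GPU 0 and layers 16 to 31 on GPU 1. A dictionary of this shape is the kind of value that Accelerate-backed loading accepts as `device_map`.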

Usage

Upload one or two images using the upload buttons
Select a prompt from the dropdown or enter your own
Click "Analyze Images" to process the images
View the detailed analysis for each image

To compare two images, upload both and use the prompt "Compare these images and describe the differences."
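
The sketch below shows one way a prompt for one or two images might be assembled before it is sent to the model. It assumes an `<image>` placeholder per uploaded image, a convention used by InternVL-style chat interfaces; the function name and the `Image-N:` labels are hypothetical and may not match what app.py does.

```python
COMPARE_PROMPT = "Compare these images and describe the differences."


def build_prompt(question: str, num_images: int) -> str:
    """Prefix the question with one <image> placeholder per uploaded image."""
    if num_images not in (1, 2):
        raise ValueError("this sketch supports one or two images")
    if num_images == 1:
        return f"<image>\n{question}"
    # Label each image so the question can refer to them individually.
    labels = "".join(f"Image-{i}: <image>\n" for i in range(1, num_images + 1))
    return labels + question


if __name__ == "__main__":
    print(build_prompt(COMPARE_PROMPT, 2))
```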

Requirements

Python 3.8 or higher
PyTorch
Transformers (version 4.35.2+)
Pillow
Matplotlib
Accelerate
Bitsandbytes
Safetensors
Gradio for the web interface
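
A quick way to confirm the list above before launching the app is to check the interpreter version and look up each package without importing it. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the import names are assumptions derived from the package names (Pillow, for instance, is imported as `PIL`).

```python
import importlib.util
import sys

# Import names assumed for the packages listed above.
REQUIRED_MODULES = [
    "torch",
    "transformers",
    "PIL",
    "matplotlib",
    "accelerate",
    "bitsandbytes",
    "safetensors",
    "gradio",
]


def python_ok(minimum=(3, 8)) -> bool:
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


def missing_modules(names=REQUIRED_MODULES) -> list:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules()` returns an empty list when everything is installed. This check does not verify versions, so the Transformers minimum still has to be confirmed separately.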

Hardware Requirements

This application runs a vision-language model, which requires:

A CUDA-capable GPU with at least 8GB VRAM
8GB+ system RAM
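
As a rough sanity check on these figures, the memory taken by the model weights alone can be estimated from the parameter count and the bits stored per parameter. The sketch assumes a nominal 8 billion parameters (read off the model name) and ignores activations, the KV cache, and framework overhead, so actual usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Estimate the memory needed for model weights, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


if __name__ == "__main__":
    nominal_params = 8e9  # assumption: "8B" taken at face value
    for bits in (16, 8):
        print(f"{bits}-bit weights: {weight_memory_gib(nominal_params, bits):.2f} GiB")
```

Under these assumptions, 8-bit weights come to roughly 7.45 GiB and 16-bit weights to roughly 14.90 GiB, which suggests that a GPU near the stated minimum leaves little headroom and that more VRAM helps.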

Deployment Options

1. Hugging Face Spaces (Recommended)

This repository is ready to be deployed on Hugging Face Spaces.

Steps:

Create a new Space on Hugging Face Spaces
Select "Docker" as the Space SDK
Link this GitHub repository
Select a GPU (T4 or better is recommended)
Create the Space

The application will automatically deploy with the Gradio UI frontend.

2. AWS SageMaker

For production deployment on AWS SageMaker:

Package the application using the provided Dockerfile
Upload the Docker image to Amazon ECR
Create a SageMaker Model using the ECR image
Deploy an endpoint with an instance type like ml.g4dn.xlarge
Set up API Gateway for HTTP access (optional)

Detailed AWS instructions can be found in the docs/aws_deployment.md file.

3. Azure Machine Learning

For Azure deployment:

Create an Azure ML workspace
Register the model on Azure ML
Create an inference configuration
Deploy to AKS or ACI with a GPU-enabled instance

Detailed Azure instructions can be found in the docs/azure_deployment.md file.

How It Works

The application uses the InternVL2.5-8B model, a multimodal model that takes images and a text prompt as input and generates a text description in response.

The script:

Loads the model with 8-bit quantization to reduce memory requirements
Processes the images with the selected prompt
Formats and displays the results
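
The 8-bit loading step could look roughly like the sketch below, which uses the Transformers `BitsAndBytesConfig` option. It is not copied from app.py: the Hub id, the function name, and the choice of `bfloat16` for the non-quantized parts are assumptions, and `trust_remote_code=True` is assumed to be needed because the model ships its own modeling code.

```python
MODEL_ID = "OpenGVLab/InternVL2_5-8B"  # assumed Hub id for InternVL2.5-8B


def load_model(model_id: str = MODEL_ID):
    """Load the tokenizer and an 8-bit quantized model (requires a CUDA GPU)."""
    # Heavy imports are deferred so this module can be imported on machines
    # that do not have the GPU stack installed.
    import torch
    from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

    tokenizer = AutoTokenizer.from_pretrained(
        model_id, trust_remote_code=True, use_fast=False
    )
    model = AutoModel.from_pretrained(
        model_id,
        # Store weights in 8 bits via bitsandbytes to reduce VRAM use.
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    ).eval()
    return tokenizer, model
```

Calling `load_model()` downloads the checkpoint on first use, so it is best done once at startup rather than per request.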

Repository Structure

app.py - Gradio UI for web interface
Dockerfile - For containerized deployment
requirements.txt - Python dependencies
data_temp/ - Sample images for testing

Local Development

Install the required packages:
```
pip install -r requirements.txt
```
Run the Gradio UI:
```
python app.py
```
Visit http://localhost:7860 in your browser

Example Output

Processing image: data_temp/page_2.png
Loading model...
Generating descriptions...

==== Image Description Results (InternVL2.5) ====

Basic Description:
The image shows a webpage or document with text content organized in multiple columns.

Detailed Description:
The image displays a structured document or webpage with multiple sections of text organized in a grid layout. The content appears to be technical or educational in nature, with what looks like headings and paragraphs of text. The color scheme is primarily black text on a white background, creating a clean, professional appearance. There appear to be multiple columns of information, possibly representing different topics or categories. The layout suggests this might be documentation, a reference guide, or an educational resource related to technical content.

Technical Analysis:
This appears to be a screenshot of a digital document or webpage. The image quality is good with clear text rendering, suggesting it was captured at an appropriate resolution. The image uses a standard document layout with what appears to be a grid or multi-column structure. The screenshot has been taken of what seems to be a text-heavy interface with minimal graphics, consistent with technical documentation or reference materials.

Note: Actual descriptions will vary based on the specific image content and may be more detailed than this example.
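
The layout of the example above could be produced by a small formatter along these lines. This is a hypothetical helper, not the code in app.py; the function name and the dictionary input are assumptions, while the header text and section titles follow the example output.

```python
def format_results(sections: dict, model_name: str = "InternVL2.5") -> str:
    """Render titled description sections under a results header."""
    lines = [f"==== Image Description Results ({model_name}) ====", ""]
    for title, text in sections.items():
        lines.append(f"{title}:")
        lines.append(text.strip())
        lines.append("")  # blank line between sections
    # Collapse trailing blank lines into a single final newline.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(format_results({
        "Basic Description": "A page of text in multiple columns.",
        "Technical Analysis": "A screenshot with clear text rendering.",
    }))
```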