---
title: InternVL2.5 Dual Image Analyzer
emoji: 🖼️
colorFrom: blue
colorTo: purple
sdk: docker
sdk_version: 3.1
app_file: app.py
pinned: false
license: mit
---
# InternVL2.5 Dual Image Analyzer
This Hugging Face Space demonstrates the capabilities of InternVL2.5, a powerful vision-language model. It allows you to upload one or two images, analyze them with the model, and compare the results side by side.
## Features
- Upload one or two images for detailed analysis
- Uses the InternVL2.5-8B model for high-quality image understanding
- Handles various image aspect ratios and formats (see the preprocessing sketch after this list)
- Multi-GPU support for efficient processing
- Provides a selection of prompts or allows custom queries
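
The actual preprocessing lives in app.py; as an illustrative sketch, image loading for an InternVL-style encoder might look like the following. The 448-pixel input size and ImageNet normalization constants are assumptions, not values taken from this repository.

```python
from PIL import Image
import torchvision.transforms as T

# Assumed values: InternVL-style vision encoders commonly use 448x448 inputs and ImageNet stats.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def load_image(path, input_size=448):
    """Open an image in any PIL-supported format and return a (1, 3, H, W) tensor."""
    image = Image.open(path).convert("RGB")  # JPEG, PNG, WebP, ... all end up as RGB
    transform = T.Compose([
        T.Resize((input_size, input_size)),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ])
    return transform(image).unsqueeze(0)
```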
## Usage
- Upload one or two images using the upload buttons
- Select a prompt from the dropdown or enter your own
- Click "Analyze Images" to process the images
- View the detailed analysis for each image
For comparing two images, use the prompt "Compare these images and describe the differences."
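
The sketch below shows roughly how such a dual-image flow can be wired in Gradio. The layout and the `analyze_images` placeholder are illustrative assumptions, not the actual code in app.py.

```python
import gradio as gr

PROMPTS = [
    "Describe this image in detail.",
    "Compare these images and describe the differences.",
]

def analyze_images(image_a, image_b, prompt_choice, custom_prompt):
    # Placeholder: in app.py this step would run InternVL2.5 on each uploaded image.
    prompt = custom_prompt.strip() or prompt_choice
    results = []
    for name, img in (("Image 1", image_a), ("Image 2", image_b)):
        if img is not None:
            results.append(f"{name} ({prompt}): model output goes here.")
    return "\n\n".join(results) or "Please upload at least one image."

with gr.Blocks() as demo:
    with gr.Row():
        image_a = gr.Image(label="Image 1", type="pil")
        image_b = gr.Image(label="Image 2", type="pil")
    prompt_choice = gr.Dropdown(choices=PROMPTS, value=PROMPTS[0], label="Prompt")
    custom_prompt = gr.Textbox(label="Custom prompt (optional)")
    output = gr.Textbox(label="Analysis", lines=12)
    gr.Button("Analyze Images").click(
        analyze_images, [image_a, image_b, prompt_choice, custom_prompt], output
    )

demo.launch(server_name="0.0.0.0", server_port=7860)
```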
## Requirements
- Python 3.8 or higher
- PyTorch
- Transformers (version 4.35.2+)
- Pillow
- Matplotlib
- Accelerate
- Bitsandbytes
- Safetensors
- Gradio for the web interface
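
A minimal requirements.txt along these lines would cover the list above; only the Transformers version floor comes from this README, and the other packages are deliberately left unpinned.

```
torch
transformers>=4.35.2
pillow
matplotlib
accelerate
bitsandbytes
safetensors
gradio
```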
## Hardware Requirements
This application uses a vision-language model which requires:
- A CUDA-capable GPU with at least 8GB VRAM
- 8GB+ system RAM
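
A quick way to verify the machine meets these requirements before loading the model (a small helper sketch, not part of app.py):

```python
import torch

def check_hardware(min_vram_gb=8):
    """Report whether a CUDA GPU with enough memory is available (assumed 8 GB floor)."""
    if not torch.cuda.is_available():
        print("No CUDA GPU detected; the model will not fit on CPU-only setups.")
        return False
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        vram_gb = props.total_memory / 1024**3
        print(f"GPU {i}: {props.name}, {vram_gb:.1f} GB VRAM")
        if vram_gb >= min_vram_gb:
            return True
    print(f"No GPU with at least {min_vram_gb} GB VRAM found.")
    return False

check_hardware()
```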
## Deployment Options

### 1. Hugging Face Spaces (Recommended)
This repository is ready to be deployed on Hugging Face Spaces.
Steps:
- Create a new Space on Hugging Face Spaces
- Select "Docker" as the Space SDK
- Link this GitHub repository
- Select a GPU (T4 or better is recommended)
- Create the Space
The application will automatically deploy with the Gradio UI frontend.
### 2. AWS SageMaker
For production deployment on AWS SageMaker:
- Package the application using the provided Dockerfile
- Upload the Docker image to Amazon ECR
- Create a SageMaker Model using the ECR image
- Deploy an endpoint with an instance type like ml.g4dn.xlarge
- Set up API Gateway for HTTP access (optional)
Detailed AWS instructions can be found in the `docs/aws_deployment.md` file.
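
For reference, the model-and-endpoint steps can be scripted with boto3 roughly as follows; the image URI, role ARN, resource names, and instance choice are placeholders, and `docs/aws_deployment.md` remains the authoritative guide.

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholders: substitute your ECR image URI and SageMaker execution role ARN.
IMAGE_URI = "<account-id>.dkr.ecr.<region>.amazonaws.com/internvl-analyzer:latest"
ROLE_ARN = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"

sm.create_model(
    ModelName="internvl25-analyzer",
    PrimaryContainer={"Image": IMAGE_URI},
    ExecutionRoleArn=ROLE_ARN,
)

sm.create_endpoint_config(
    EndpointConfigName="internvl25-analyzer-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "internvl25-analyzer",
        "InstanceType": "ml.g4dn.xlarge",  # GPU instance, as suggested above
        "InitialInstanceCount": 1,
    }],
)

sm.create_endpoint(
    EndpointName="internvl25-analyzer",
    EndpointConfigName="internvl25-analyzer-config",
)
```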
### 3. Azure Machine Learning
For Azure deployment:
- Create an Azure ML workspace
- Register the model on Azure ML
- Create an inference configuration
- Deploy to AKS or ACI with a GPU-enabled instance
Detailed Azure instructions can be found in the `docs/azure_deployment.md` file.
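
As a rough sketch using the Azure ML Python SDK v2 (azure-ai-ml) with a managed online endpoint rather than the AKS/ACI path listed above; all names, the container image, and the instance SKU are placeholders, and `docs/azure_deployment.md` remains the authoritative guide.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Environment, ManagedOnlineDeployment, ManagedOnlineEndpoint
from azure.identity import DefaultAzureCredential

# Placeholders: fill in your subscription, resource group, and workspace.
ml_client = MLClient(
    DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace-name>"
)

# Create a GPU-backed managed online endpoint and a single deployment behind it.
endpoint = ManagedOnlineEndpoint(name="internvl25-analyzer", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name="default",
    endpoint_name="internvl25-analyzer",
    # A fully custom image may also need an inference configuration describing its ports/routes.
    environment=Environment(image="<acr-name>.azurecr.io/internvl-analyzer:latest"),
    instance_type="Standard_NC4as_T4_v3",  # GPU-enabled SKU (assumption)
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```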
## How It Works
The application uses the InternVL2.5 model, a state-of-the-art multimodal AI model that can understand and describe images with impressive detail.
The script:
- Processes the images with the selected prompt
- Uses 8-bit quantization to reduce memory requirements
- Formats and displays the results
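
A condensed sketch of the loading and inference step: the repository id and the `chat` call follow the pattern documented on the InternVL2.5 model card (check the model card for the exact signature), and `load_image` refers to the preprocessing sketch shown under Features.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "OpenGVLab/InternVL2_5-8B"

# 8-bit quantization keeps the 8B model within roughly 8 GB of VRAM.
model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    trust_remote_code=True,
    device_map="auto",  # spreads layers across available GPUs
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# pixel_values: preprocessed image tensor, e.g. from load_image() above.
pixel_values = load_image("data_temp/page_2.png").to(torch.bfloat16).cuda()
question = "<image>\nDescribe this image in detail."
response = model.chat(
    tokenizer, pixel_values, question, generation_config=dict(max_new_tokens=512)
)
print(response)
```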
## Repository Structure

- `app.py` - Gradio UI for the web interface
- `Dockerfile` - Container definition for deployment
- `requirements.txt` - Python dependencies
- `data_temp/` - Sample images for testing
## Local Development

Install the required packages:

```bash
pip install -r requirements.txt
```

Run the Gradio UI:

```bash
python app.py
```

Then visit `http://localhost:7860` in your browser.
## Example Output
```
Processing image: data_temp/page_2.png
Loading model...
Generating descriptions...
==== Image Description Results (InternVL2.5) ====
```

**Basic Description:**
The image shows a webpage or document with text content organized in multiple columns.

**Detailed Description:**
The image displays a structured document or webpage with multiple sections of text organized in a grid layout. The content appears to be technical or educational in nature, with what looks like headings and paragraphs of text. The color scheme is primarily black text on a white background, creating a clean, professional appearance. There appear to be multiple columns of information, possibly representing different topics or categories. The layout suggests this might be documentation, a reference guide, or an educational resource related to technical content.

**Technical Analysis:**
This appears to be a screenshot of a digital document or webpage. The image quality is good with clear text rendering, suggesting it was captured at an appropriate resolution. The image uses a standard document layout with what appears to be a grid or multi-column structure. The screenshot has been taken of what seems to be a text-heavy interface with minimal graphics, consistent with technical documentation or reference materials.
Note: Actual descriptions will vary based on the specific image content and may be more detailed than this example.