
EraX-VL-2B-V1.5

Introduction 🎉

Hot on the heels of the popular EraX-VL-7B-V1.0 model, we proudly present EraX-VL-2B-V1.5. This enhanced multimodal model offers robust OCR and VQA capabilities across diverse languages 🌏, with a significant advantage in processing Vietnamese 🇻🇳. The EraX-VL-2B model stands out for its precise recognition capabilities across a range of documents 📝, including medical forms 🩺, invoices 🧾, bills of sale 💳, quotes 📄, and medical records 💊. This functionality is expected to be highly beneficial for hospitals 🏥, clinics 💉, insurance companies 🛡️, and other similar applications 📋. Built on the solid foundation of Qwen/Qwen2-VL-2B-Instruct [1], which we found to be of high quality and fluent in Vietnamese, EraX-VL-2B has been fine-tuned to enhance its performance. We plan to continue improving and releasing new versions for free, along with sharing performance benchmarks in the near future.

One standout feature of EraX-VL-2B-V1.5 is its ability to carry out multi-turn Q&A with reasonable reasoning capability at a size of just over 2 billion parameters.
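A minimal sketch of what such a multi-turn exchange looks like in the Qwen2-VL chat message format (the file name, question, and answer strings below are placeholders, not taken from the model card):

```python
# Hypothetical multi-turn conversation history for a Qwen2-VL-style chat
# template. "receipt.jpg" and the text strings are placeholder examples.
history = [
    {"role": "user", "content": [
        {"type": "image", "image": "receipt.jpg"},
        {"type": "text", "text": "What is the total amount on this invoice?"},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "The total is 1,250,000 VND."},
    ]},
    # Follow-up turn referring back to the same image; the history carries it.
    {"role": "user", "content": [
        {"type": "text", "text": "Which line item is the most expensive?"},
    ]},
]
```

The same history list is then passed to `processor.apply_chat_template` as in the Quickstart below.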

NOTA BENE:

  • EraX-VL-2B-V1.5 is NOT a typical OCR-only tool like Tesseract; it is a multimodal LLM-based model. To use it effectively, you may have to tune your prompt carefully depending on your task.
  • This model was NOT fine-tuned on medical (X-ray) or car-accident datasets (yet). Stay tuned for an updated version coming sometime in 2025.

EraX-VL-2B-V1.5 is a young and tiny member of EraX's LànhGPT collection of LLM models.

Benchmarks 📊

๐Ÿ† LeaderBoard

| Models | Open-Source | VI-MTVQA |
|---|---|---|
| EraX-VL-7B-V1.5 🥇 | ✅ | 47.2 |
| Qwen2-VL 72B 🥈 | ✘ | 41.6 |
| ViGPT-VL 🥉 | ✘ | 39.1 |
| EraX-VL-2B-V1.5 | ✅ | 38.2 |
| EraX-VL-7B-V1 | ✅ | 37.6 |
| Vintern-1B-V2 | ✅ | 37.4 |
| Qwen2-VL 7B | ✅ | 30.0 |
| Claude3 Opus | ✘ | 29.1 |
| GPT-4o mini | ✘ | 29.1 |
| GPT-4V | ✘ | 28.9 |
| Gemini Ultra | ✘ | 28.6 |
| InternVL2 76B | ✅ | 26.9 |
| QwenVL Max | ✘ | 23.5 |
| Claude3 Sonnet | ✘ | 20.8 |
| QwenVL Plus | ✘ | 18.1 |
| MiniCPM-V2.5 | ✅ | 15.3 |

The evaluation code used in the paper is available at EraX-JS-Company/EraX-MTVQA-Benchmark.

API trial 🎉

Please contact [email protected] for API access inquiries.

Examples 🧩

1. OCR - Optical Character Recognition for Multi-Images
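The chat message format accepts several images in a single user turn, which is how the front and back of a document can be extracted together. A minimal sketch, with placeholder file paths of our own:

```python
# Hypothetical request passing both sides of an ID card at once.
# "front.jpg" and "back.jpg" are placeholder paths, not files from this card.
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "front.jpg"},
        {"type": "image", "image": "back.jpg"},
        {"type": "text",
         "text": "Extract all fields from both sides of this ID card as JSON."},
    ]},
]
```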

Example 01: Citizen identification card

Front View

Back View

Source: Google Support

{
  "Số thẻ": "037094012351",
  "Họ và tên": "TRỊNH QUANG DUY",
  "Ngày sinh": "04/09/1994",
  "Giới tính": "Nam",
  "Quốc tịch": "Việt Nam",
  "Quê quán / Place of origin": "Tân Thành, Kim Sơn, Ninh Bình",
  "Nơi thường trú / Place of residence": "Xóm 6 Tân Thành, Kim Sơn, Ninh Bình",
  "Có giá trị đến": "04/09/2034",
  "Đặc điểm nhân dạng / Personal identification": "seo chấm c:1cm trên đuôi mắt trái",
  "Cục trưởng cục cảnh sát quản lý hành chính về trật tự xã hội": "Nguyễn Quốc Hùng",
  "Ngày cấp": "10/12/2022"
}

Example 01b: Identity Card

Front View

Back View

Source: Internet

{
  "Số": "272737384",
  "Họ tên": "PHẠM NHẬT TRƯỜNG",
  "Sinh ngày": "08-08-2000",
  "Nguyên quán": "Tiền Giang",
  "Nơi ĐKHK thường trú": "393, Tân Xuân, Bảo Bình, Cẩm Mỹ, Đồng Nai",
  "Dân tộc": "Kinh",
  "Tôn giáo": "Không",
  "Đặc điểm nhận dạng": "Nốt ruồi c.3,5cm trên sau cánh mũi phải.",
  "Ngày cấp": "30 tháng 01 năm 2018",
  "Giám đốc CA": "T.BÌNH ĐỊNH"
}

Example 02: Driver's License

Front View

Back View

Source: Báo Pháp luật

{
  "No.": "400116012313",
  "Fullname": "NGUYỄN VĂN DŨNG",
  "Date_of_birth": "08/06/1979",
  "Nationality": "VIỆT NAM",
  "Address": "X. Quỳnh Hầu, H. Quỳnh Lưu, T. Nghệ An",
  "Hang_Class": "FC",
  "Expires": "23/04/2027",
  "Place_of_issue": "Nghệ An",
  "Date_of_issue": "ngày/date 23 tháng/month 04 năm/year 2022",
  "Signer": "Trần Anh Tuấn",
  "Các loại xe được phép": "Ô tô hạng C kéo rơmoóc, đầu kéo kéo sơmi rơmoóc và xe hạng B1, B2, C, FB2 (Motor vehicle of class C with a trailer, semi-trailer truck and vehicles of classes B1, B2, C, FB2)",
  "Mã số": ""
}

Example 03: Vehicle Registration Certificate

Source: Báo Vietnamnet

{
  "Tên chủ xe": "NGUYỄN TÔN NHUẬN",
  "Địa chỉ": "KE27 Kp3 P.TTTây Q7",
  "Nhãn hiệu": "HONDA",
  "Số loại": "DYLAN",
  "Màu sơn": "Trắng",
  "Số người được phép chở": "02",
  "Nguồn gốc": "Xe nhập mới",
  "Biển số đăng ký": "59V1-498.89",
  "Đăng ký lần đầu ngày": "08/06/2004",
  "Số máy": "F03E-0057735",
  "Số khung": "5A04F-070410",
  "Dung tích": "152",
  "Quản lý": "TRƯỞNG CA QUẬN",
  "Thượng tá": "Trần Văn Hiểu"
}

Example 04: Birth Certificate

Source: https://congchung247.com.vn

{
    "name": "NGUYỄN NAM PHƯƠNG",
    "gender": "Nữ",
    "date_of_birth": "08/6/2011",
    "place_of_birth": "Bệnh viện Việt - Pháp Hà Nội",
    "nationality": "Việt Nam",
    "father_name": "Nguyễn Ninh Hồng Quang",
    "father_dob": "1980",
    "father_address": "309 nhà E2 Bạch Khoa - Hai Bà Trưng - Hà Nội",
    "mother_name": "Phạm Thùy Trang",
    "mother_dob": "1984",
    "mother_address": "309 nhà E2 Bạch Khoa - Hai Bà Trưng - Hà Nội",
    "registration_place": "UBND phường Bạch Khoa - Quận Hai Bà Trưng - Hà Nội",
    "registration_date": "05/8/2011",
    "registration_ralation": "cha",
    "notes": null,
    "certified_by": "Nguyễn Thị Kim Hoa"
}
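Responses like the examples above are JSON-like but not guaranteed to be strictly valid JSON (models sometimes emit Python-style literals such as None, or surrounding prose). A defensive parsing sketch — `parse_model_json` is our own hypothetical helper, not part of the model's API:

```python
import json
import re


def parse_model_json(text: str):
    """Best-effort parse of a JSON-ish model response.

    Hypothetical helper: extracts the first {...} block and normalizes
    Python-style literals (None/True/False) before decoding. The literal
    substitution is crude and could touch string values too; acceptable
    for a sketch, not for production.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    candidate = match.group(0)
    candidate = re.sub(r"\bNone\b", "null", candidate)
    candidate = re.sub(r"\bTrue\b", "true", candidate)
    candidate = re.sub(r"\bFalse\b", "false", candidate)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


record = parse_model_json('Here is the result: {"notes": None, "name": "A"}')
```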

Quickstart 🎮

Install the necessary packages:

python -m pip install git+https://github.com/huggingface/transformers accelerate
python -m pip install qwen-vl-utils
python -m pip install flash-attn --no-build-isolation

Then you can use EraX-VL-2B-V1.5 like this:

import base64

import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "erax/EraX-VL-2B-V1.5"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # use "flash_attention_2" on Ampere or newer GPUs
    device_map="auto",
)

# Constrain each image to between 256 and 1280 visual tokens (28x28 pixels each).
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_path,
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)

image_path = "image.jpg"

with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
decoded_image_text = encoded_image.decode("utf-8")
base64_data = f"data:image;base64,{decoded_image_text}"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": base64_data,
            },
            {
                "type": "text",
                "text": "Trรญch xuแบฅt thรดng tin nแป™i dung tแปซ hรฌnh แบฃnh ฤ‘ฦฐแปฃc cung cแบฅp."
            },
        ],
    }
]

# Prepare prompt
tokenized_text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[tokenized_text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generation configs (best_of is a vLLM-only parameter and has been dropped)
generation_config = model.generation_config
generation_config.do_sample   = True
generation_config.temperature = 1.0
generation_config.top_k       = 1
generation_config.top_p       = 0.9
generation_config.min_p       = 0.1
generation_config.max_new_tokens     = 2048
generation_config.repetition_penalty = 1.06

# Inference
generated_ids = model.generate(**inputs, generation_config=generation_config)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text[0])
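To continue with a follow-up question (the multi-turn usage mentioned in the introduction), append the model's reply and a new user turn to `messages`, then repeat the same template/vision-info/generate steps. A sketch with stand-in values for the Quickstart variables:

```python
# Stand-ins for the Quickstart variables (in real use, `messages` is the chat
# history built above and `output_text[0]` is the decoded model reply).
messages = [{"role": "user", "content": [{"type": "text", "text": "..."}]}]
output_text = ["{...extracted fields...}"]

# Append the model's reply, then the follow-up question.
messages.append({"role": "assistant",
                 "content": [{"type": "text", "text": output_text[0]}]})
# "Summarize the information above in one sentence."
messages.append({"role": "user",
                 "content": [{"type": "text",
                              "text": "Tóm tắt các thông tin trên thành một câu."}]})
# Re-run apply_chat_template / process_vision_info / model.generate unchanged.
```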

References 📑

[1] Qwen team. Qwen2-VL. 2024.

[2] Bai, Jinze, et al. "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond." arXiv preprint arXiv:2308.12966 (2023).

[3] Yang, An, et al. "Qwen2 technical report." arXiv preprint arXiv:2407.10671 (2024).

[4] Chen, Zhe, et al. "InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[5] Chen, Zhe, et al. "How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites." arXiv preprint arXiv:2404.16821 (2024).

[6] Tran, Chi, and Huong Le Thanh. "LaVy: Vietnamese Multimodal Large Language Model." arXiv preprint arXiv:2404.07922 (2024).

Contact 🤝

  • For correspondence regarding this work or API trial inquiries, please contact Nguyễn Anh Nguyên at [email protected].
  • Follow us on EraX Github