How to use transformers for PaddleOCR-VL inferencing?
Excellent work! It would be more convenient if PaddleOCR-VL supported transformers-backed inference.
Hello, we currently support inference with the PaddleOCR-VL-0.9B model through the transformers library. It can recognize text, formulas, tables, and chart elements. Full document parsing with transformers is planned for a future release. Below is a simple script for running PaddleOCR-VL-0.9B with transformers. For now we still recommend the official inference method, which is faster and supports page-level document parsing.
If you need any further assistance, feel free to ask!
# -*- coding: utf-8 -*-
"""
This script includes four task prompts. Switch between them by editing the
CHOSEN_TASK line; no command-line parameters are needed.

Available tasks (CHOSEN_TASK):
- 'ocr' -> 'OCR:'
- 'table' -> 'Table Recognition:'
- 'chart' -> 'Chart Recognition:'
- 'formula' -> 'Formula Recognition:'

To add or modify prompts, change the PROMPTS dictionary as needed.
"""
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
CHOSEN_TASK = "ocr"  # Options: 'ocr' | 'table' | 'chart' | 'formula'
PROMPTS = {
    "ocr": "OCR:",
    "table": "Table Recognition:",
    "chart": "Chart Recognition:",
    "formula": "Formula Recognition:",
}

# Example values: replace with your local model directory and image file.
model_path = "PaddleOCR-VL-0.9B"
image_path = "test.png"

image = Image.open(image_path).convert("RGB")

# Load the model and processor from the local directory.
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
).to(DEVICE).eval()
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Build the chat-formatted prompt for the chosen task.
messages = [{"role": "user", "content": PROMPTS[CHOSEN_TASK]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Prepare the inputs and move tensors to the target device.
inputs = processor(text=[text], images=[image], return_tensors="pt")
inputs = {k: (v.to(DEVICE) if isinstance(v, torch.Tensor) else v) for k, v in inputs.items()}

# Greedy decoding without gradient tracking.
with torch.inference_mode():
    generated = model.generate(**inputs, max_new_tokens=1024, do_sample=False, use_cache=True)

resp = processor.batch_decode(generated, skip_special_tokens=True)[0]
# Drop the prompt text if it is echoed in the decoded output.
answer = resp.split(text)[-1].strip()
print(answer)
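The `resp.split(text)` step in the script only removes the prompt when the decoded output echoes the prompt string exactly. A common alternative is to drop the prompt tokens before decoding. The sketch below shows the idea on plain Python lists so it runs without a model; `strip_prompt_tokens` is a hypothetical helper, not part of the script or of transformers.

```python
def strip_prompt_tokens(sequences, prompt_len):
    """Return only the newly generated token ids of each sequence.

    Assumes every sequence starts with the same prompt_len prompt tokens,
    followed by the generated continuation.
    """
    if prompt_len < 0:
        raise ValueError("prompt_len must be non-negative")
    return [list(seq[prompt_len:]) for seq in sequences]


# Toy ids: the first three tokens stand in for the prompt.
generated_ids = [[101, 7, 8, 42, 43, 2]]
new_tokens = strip_prompt_tokens(generated_ids, prompt_len=3)
print(new_tokens)  # [[42, 43, 2]]
```

With real tensors, the equivalent slice would probably be `generated[:, inputs["input_ids"].shape[1]:]` applied before `processor.batch_decode`, assuming `generate` returns the prompt followed by the continuation, as decoder-only models in transformers normally do.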
Is model_path = "PaddleOCR-VL-0.9B" correct? I changed it to "PaddlePaddle/PaddleOCR-VL", but it's still not working. The error says model_type is missing from the config.
model_path = "PaddleOCR-VL-0.9B" is only an example. Please replace it with your local model path and try again.
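Before loading, it can help to confirm that the local path really is a model directory. The following is a minimal sketch, assuming the model files sit directly in that directory next to a `config.json`; `check_model_dir` is a hypothetical helper, and it only checks that a `model_type` key is present because the thread does not state the expected value.

```python
import json
from pathlib import Path


def check_model_dir(model_path):
    """Return a list of problems found in a local model directory."""
    root = Path(model_path)
    if not root.is_dir():
        return [f"{root} is not a local directory"]
    config_file = root / "config.json"
    if not config_file.is_file():
        return [f"{config_file} not found"]
    config = json.loads(config_file.read_text(encoding="utf-8"))
    problems = []
    if "model_type" not in config:
        problems.append("config.json has no 'model_type' key")
    return problems


if __name__ == "__main__":
    # "PaddleOCR-VL-0.9B" is the example path from the script above.
    for problem in check_model_dir("PaddleOCR-VL-0.9B"):
        print("problem:", problem)
```

An empty list means the directory passed these basic checks; it does not guarantee that the weights themselves will load.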
Yes, it's working. Thanks for the quick response. I have two more questions:
1. Is it possible to parse a complete page to Markdown or JSON using transformers?
2. I tried the PaddleOCRVL() pipeline, but it's not working on a CPU-only system. How can I set it up for a CPU-only system?
Thank you for your interest.
- As I mentioned in my previous reply, we do not currently support end-to-end Transformers inference, but we plan to add this support in the future. We recommend that you use the official deployment method for higher inference efficiency.
- We do not support CPU inference at this time, as it would lead to a poor user experience.
Using official deployment, can we output the confidence interval or probability of each word?
I encountered an error:
"""
from transformers.modeling_layers import GradientCheckpointingLayer
ModuleNotFoundError: No module named 'transformers.modeling_layers'
"""
I asked GPT and it told me that my Transformers version is incorrect. Which version should I use?
Hello, we're currently using Transformers version 4.55.0. You may try installing this version if needed.
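To see whether the local environment matches that version, a small check along these lines may help. It is only a sketch: `installed_version` and `version_report` are hypothetical helper names, and 4.55.0 is simply the version quoted in this thread, not a confirmed minimum requirement.

```python
from importlib import metadata

TESTED_VERSION = "4.55.0"  # version quoted by the maintainers in this thread


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_report(package="transformers", tested=TESTED_VERSION):
    """Describe how the installed version compares with the tested one."""
    found = installed_version(package)
    hint = f"try: pip install {package}=={tested}"
    if found is None:
        return f"{package} is not installed; {hint}"
    if found != tested:
        return f"{package} {found} found, thread reports {tested}; {hint}"
    return f"{package} {found} matches the version reported in the thread"


print(version_report())
```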
I am really excited
Hello, which specific method do you recommend for using official deployment? What I currently see are the following:
1.
from paddlex import create_pipeline
pipeline = create_pipeline(pipeline="PaddleOCR-VL")
2.
from paddleocr import PaddleOCRVL
pipeline = PaddleOCRVL()
Then there are VLM acceleration schemes based on both PaddleX and PaddleOCR.
Which deployment plan is recommended? Do PaddleX and PaddleOCR use the same PaddleOCRVL implementation internally?
It's the same. You can just use paddleocr.