PaddlePaddle/PaddleOCR-VL · Unrecognized footer


#11

by Maguro97 - opened Oct 17

Oct 17

I uploaded this image to the demo, but I’m noticing some strange behavior:
• when I set it to extract text only, it captures everything correctly;
• when I ask it to extract everything, the footer gets lost.

What’s odd is that the preview shows the footer text inside a colored box (as if it were recognized), but then it doesn’t appear in the MD output.

Am I doing something wrong?

ChengCui

PaddlePaddle org Oct 17

Yes, you are correct. This behavior is by design. Since the footer is typically not required for LLM processing in most scenarios, we intentionally filter it out during the Markdown conversion to reduce noise.

However, the complete footer text is indeed preserved in the generated JSON file, which you can parse if needed.

Thank you for your feedback. We will consider adding an optional parameter to control the inclusion of the footer in a future update.

eddprogrammer

Oct 18

The footer text often carries very important contextual information for the document. Would love to see an optional parameter to include it. In the meantime, is there sample code showing how to parse the JSON file to include the footer?
Also would love to see a function like save_to_text() to export ALL plain text. Often we don't want any of the Markdown formatting characters.

Tingquan

PaddlePaddle org Oct 18

Thanks for the suggestion! We’ll take this feature into consideration for future updates.

Currently, you can use the save_to_json(save_path="output.json") method to save all the recognized information as a JSON file. Once you have the file, you can easily parse and process the data to meet your specific needs.
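For the footer question above, here is a minimal sketch of one way to read the saved file back and recover the footer, or all text without Markdown formatting. It is not an official API. The key names `parsing_res_list`, `block_label`, and `block_content`, and the label value `"footer"`, are assumptions about the JSON layout; open your own output.json and adjust the constants if they differ. The helper names are made up for this example.

```python
import json
from pathlib import Path

# Assumed key names (not confirmed in this thread); check your own output.json.
BLOCKS_KEY = "parsing_res_list"
LABEL_KEY = "block_label"
CONTENT_KEY = "block_content"


def load_result(path):
    """Read a file written by save_to_json() into Python objects."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def find_blocks(node):
    """Return the first list stored under BLOCKS_KEY anywhere in the JSON tree."""
    if isinstance(node, dict):
        if isinstance(node.get(BLOCKS_KEY), list):
            return node[BLOCKS_KEY]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return []
    for child in children:
        found = find_blocks(child)
        if found:
            return found
    return []


def blocks_to_text(data, skip_labels=()):
    """Join the content of every block, in file order, skipping the given labels."""
    parts = []
    for block in find_blocks(data):
        if not isinstance(block, dict) or block.get(LABEL_KEY) in skip_labels:
            continue
        content = str(block.get(CONTENT_KEY) or "").strip()
        if content:
            parts.append(content)
    return "\n\n".join(parts)


def footer_text(data, footer_labels=("footer",)):
    """Return the content of the blocks whose label marks them as a footer."""
    return [
        str(block.get(CONTENT_KEY) or "").strip()
        for block in find_blocks(data)
        if isinstance(block, dict) and block.get(LABEL_KEY) in footer_labels
    ]


# Usage, assuming output.json was written by save_to_json(save_path="output.json"):
#   data = load_result("output.json")
#   print(blocks_to_text(data))   # all recognized text, footer included
#   print(footer_text(data))      # footer blocks only
```

Searching the tree for the block list, instead of hard-coding a path to it, keeps the sketch working whether or not the list is nested under another key. The content of some blocks (tables, for example) may itself contain markup, so the result is not guaranteed to be strictly plain text.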

eddprogrammer

Oct 20

Also curious why the three squares (Fig 1) are not OCR'd at all. They are recognized as one image.
