Unrecognized footer
I uploaded this image to the demo, but I’m noticing some strange behavior:
• when I set it to extract text only, it captures everything correctly;
• when I ask it to extract everything, the footer gets lost.
What’s odd is that the preview shows the footer text inside a colored box (as if it were recognized), but then it doesn’t appear in the MD output.
Am I doing something wrong?
Your observation is correct, and you are not doing anything wrong: this behavior is by design. Since the footer is usually not needed for LLM processing, we intentionally filter it out during the Markdown conversion to reduce noise.
However, the complete footer text is indeed preserved in the generated JSON file, which you can parse if needed.
Thank you for your feedback. We will consider adding an optional parameter to control the inclusion of the footer in a future update.
The footer text often carries very important contextual information for the document. I would love to see an optional parameter to include it. In the meantime, is there sample code showing how to parse the JSON file to include the footer?
I would also love to see a function like save_to_text() that exports ALL plain text. Often we don't want any of the Markdown formatting characters.
Thanks for the suggestion! We’ll take this feature into consideration for future updates.
Currently, you can use the save_to_json(save_path="output.json") method to save all the recognized information as a JSON file. Once you have the file, you can easily parse and process the data to meet your specific needs.
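For anyone looking for a starting point, here is a minimal sketch of pulling the footer (or all plain text) out of the saved JSON. The JSON layout is not described in this thread, so the key names `block_label` and `block_content` and the label value "footer" are assumptions; open your own output.json and adjust them to match what you see. The helper names are hypothetical and not part of the library.

```python
import json
from typing import Any, Iterator, List, Optional, Set

# ASSUMED key names; verify them against your own output.json.
LABEL_KEY = "block_label"
TEXT_KEY = "block_content"


def iter_blocks(node: Any) -> Iterator[dict]:
    """Walk the parsed JSON and yield every dict that has both a label and a text key."""
    if isinstance(node, dict):
        if LABEL_KEY in node and TEXT_KEY in node:
            yield node
        for value in node.values():
            yield from iter_blocks(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_blocks(item)


def extract_text(data: Any, labels: Optional[Set[str]] = None) -> List[str]:
    """Return the text of blocks whose label is in `labels` (all blocks if None)."""
    return [
        str(block[TEXT_KEY])
        for block in iter_blocks(data)
        if labels is None or block[LABEL_KEY] in labels
    ]


def extract_from_file(path: str, labels: Optional[Set[str]] = None) -> List[str]:
    """Load a JSON file written by save_to_json and extract text from it."""
    with open(path, encoding="utf-8") as f:
        return extract_text(json.load(f), labels)
```

With this in place, extract_from_file("output.json", {"footer"}) would return only the footer text, and extract_from_file("output.json") would return the text of every block in the order it appears in the file, which is roughly what a save_to_text() would do. Because the walk is recursive, it does not depend on where in the JSON the blocks are nested, only on the two key names.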
Also curious why the three squares (Fig 1) are not OCR'd at all. They are recognized as one image.
