Extract Text and Knowledge from Images with Open Vision Language Models

Community Article Published October 23, 2025

Vision language models can extract and process text from images, making them useful for digitizing handwritten documents, receipts, and other visual content. If you are looking for traditional OCR of structured documents, check this awesome blog post. This tutorial shows how to use these models using AI Sheets.

Upload Your Images

Start with a folder of images containing text you want to extract. These could be handwritten recipes, documents, or any images with text content.

folder

Upload them directly to AI Sheets:

upload

The images appear in a spreadsheet format:

table

Apply AI Actions to Your Columns

Each column can be processed with AI actions. Click the overlay on any column to see available operations:

ai-action

Image columns support text extraction, visual question answering, object detection, and custom actions. Text columns offer summarization, keyword extraction, and translation.

Extract Text Using OCR

AI Sheets includes a template for text extraction:

extract-text

Here's an example handwritten recipe:

recipe

The default extraction captures all visible text:

MEMORANDUM:

From

To

1 Box Duncan Hines Yellow Cake Mix
1 Box instant lemon pudding
2/3 cups water
1/2 cup Mozola oil
4 eggs
Lemon flavoring to taste.
Put in mixing bowl and beat for 10 min.

and REMEMBER... for Quality PRINTING
CALL OR WRITE
Gatling & Pierce
PRINTERS
TELEPHONE 332-2579
22 YEARS OF SERVICE IN NORTHEASTERN CAROLINA

The default template extracts everything, including headers and footers. For cleaner results, use a custom prompt:

custom

This produces focused recipe details:

- 1 box Duncan Hines Yellow Cake Mix  
- 1 box instant lemon pudding  
- 2/3 cups water  
- 1/2 cup Mazola oil  
- 4 eggs  
- Lemon flavoring to taste  
- Put in mixing bowl and beat for 10 minutes

Compare Vision Language Models for OCR Accuracy

The default model Qwen/Qwen2.5-VL-7B-Instruct handles most tasks well. For complex handwriting, try more powerful models like Qwen/Qwen3-VL-235B-A22B-Reasoning:

qwen3

Comparison on difficult handwriting:

Qwen/Qwen2.5-VL-7B-Instruct Qwen/Qwen3-VL-235B-A22B-Reasoning
in large bowl combine meat, onion, bread crumbs 1/2 nutmeg & cheese - as you add sprinkle around. Then blend - Last sprinkle blend again Bake in large pan for 10-15 min. at 350. Let stand 5 min before serving. in lg bowl combine meat, onion, bread crumbs 1/4 nutmeg & cheese - as you add sprinkle around. then blend - last spinach blend again. Bake in lg pan for 50-60 min. @ 350 - let stand 5 min before serving

The larger model catches critical details like "spinach" and corrects the cooking time from "10-15 min" to "50-60 min."

Process Extracted Text

After extraction, transform the text into structured formats:

format

This creates formatted HTML for each recipe:

html

Transform Images

Apply image-to-image models for visual transformations. Convert to black and white:

transform-bw

Result:

bw

Export Your Dataset

Export the processed dataset to Hugging Face Hub:

export

The final dataset is available at aisheets/unlocked-recipes.

Resources

Try AI Sheets directly or deploy locally from the GitHub repository. For questions, use the Community tab or open a GitHub issue.

Community

Sign up or log in to comment