Extract Text and Knowledge from Images with Open Vision Language Models
Vision language models can extract and process text from images, making them useful for digitizing handwritten documents, receipts, and other visual content. If you are looking for traditional OCR of structured documents, check this awesome blog post. This tutorial shows how to use these models using AI Sheets.
Upload Your Images
Start with a folder of images containing text you want to extract. These could be handwritten recipes, documents, or any images with text content.
Upload them directly to AI Sheets:
The images appear in a spreadsheet format:
Apply AI Actions to Your Columns
Each column can be processed with AI actions. Click the overlay on any column to see available operations:
Image columns support text extraction, visual question answering, object detection, and custom actions. Text columns offer summarization, keyword extraction, and translation.
Extract Text Using OCR
AI Sheets includes a template for text extraction:
Here's an example handwritten recipe:
The default extraction captures all visible text:
MEMORANDUM:
From
To
1 Box Duncan Hines Yellow Cake Mix
1 Box instant lemon pudding
2/3 cups water
1/2 cup Mozola oil
4 eggs
Lemon flavoring to taste.
Put in mixing bowl and beat for 10 min.
and REMEMBER... for Quality PRINTING
CALL OR WRITE
Gatling & Pierce
PRINTERS
TELEPHONE 332-2579
22 YEARS OF SERVICE IN NORTHEASTERN CAROLINA
The default template extracts everything, including headers and footers. For cleaner results, use a custom prompt:
This produces focused recipe details:
- 1 box Duncan Hines Yellow Cake Mix
- 1 box instant lemon pudding
- 2/3 cups water
- 1/2 cup Mazola oil
- 4 eggs
- Lemon flavoring to taste
- Put in mixing bowl and beat for 10 minutes
Compare Vision Language Models for OCR Accuracy
The default model Qwen/Qwen2.5-VL-7B-Instruct handles most tasks well. For complex handwriting, try more powerful models like Qwen/Qwen3-VL-235B-A22B-Reasoning:
Comparison on difficult handwriting:
| Qwen/Qwen2.5-VL-7B-Instruct | Qwen/Qwen3-VL-235B-A22B-Reasoning |
|---|---|
| in large bowl combine meat, onion, bread crumbs 1/2 nutmeg & cheese - as you add sprinkle around. Then blend - Last sprinkle blend again Bake in large pan for 10-15 min. at 350. Let stand 5 min before serving. | in lg bowl combine meat, onion, bread crumbs 1/4 nutmeg & cheese - as you add sprinkle around. then blend - last spinach blend again. Bake in lg pan for 50-60 min. @ 350 - let stand 5 min before serving |
The larger model catches critical details like "spinach" and corrects the cooking time from "10-15 min" to "50-60 min."
Process Extracted Text
After extraction, transform the text into structured formats:
This creates formatted HTML for each recipe:
Transform Images
Apply image-to-image models for visual transformations. Convert to black and white:
Result:
Export Your Dataset
Export the processed dataset to Hugging Face Hub:
The final dataset is available at aisheets/unlocked-recipes.
Resources
Try AI Sheets directly or deploy locally from the GitHub repository. For questions, use the Community tab or open a GitHub issue.












