Qwen3 VL HF Demo
Object Detection, Visual Grounding, Keypoint Detection
Audio, Image, Image-to-Text
FireRed / Nanonets / Monkey / Thyme / Typhoon / SmolDocling
Ultra-compact Computer-Use Agent [GUI Localization]
Understand document semantics, extract text and tables.
Multimodal OCR model for complex document understanding.
Nanonets / olmOCR / RolmOCR / Aya-Vision / Qwen2-VL-OCR
Text-guided object tracking, point tracking, reasoning.
Chandra-OCR / Nanonets-OCR2 / olmOCR-2 / Dots.OCR
Custom voice, voice design, voice cloning, and ASR nodes.
Demo of a collection of Qwen3-VL models
DeepSeek-OCR 2: Visual Causal Flow
Demo of the Qwen 3.5 Multimodal Model
Unified Multimodal Comprehension and Generation
Florence-2-large / Florence-2-base
DeepCaption / SkyCaptioner / SpaceThinker / Core / SpaceOm
coreOCR / Camel-Doc-OCR / docscopeOCR / MonkeyOCR
Cosmos-R1 / docscopeOCR / Captioner-7B / visionOCR-3B
Unredacted: Ask Anything with Near-Zero Refusal Rates
Lumian-VLR / VisionThink / MiniCPM-V / Typhoon-OCR / olmOCR
Florence-2 vision models demo (transformers).
OCR, VQA, Thinking and Object Detection.
Vision-Language Models for Document Conversion
Molmo2 - Image, Video (QA, Pointing & Tracking)
Testing for the latest transformers (DeepSeek-OCR).
Experiment with small super OCR models here.
Fast Editing with Robust Consistency
Smart Any-Horizon Agents for Long Video Reasoning. [SAGE]
Camera Control Dolly [Distilled]