# Nemotron Plus VLM
A 14GB Vision-Language Model fine-tuned for UI automation and self-operating computer tasks.
## Model Description
This model combines multiple state-of-the-art architectures:
- Nemotron architecture for advanced language understanding
- InternVL3 for powerful vision processing
- Custom fine-tuning for UI automation and computer control tasks
## Training Data
The model was fine-tuned on comprehensive datasets (2.2M+ samples) for:
- File Operations: Hierarchical reasoning for file system tasks (~440K samples)
- Web Interactions: CAPTCHA solving and web element interaction (~577K samples)
- Window Management: Desktop windowing system control (~351K samples)
- Ydotool Automation: System-level UI automation commands (~414K samples)
## Model Files
- `lam_model.pt` (14.1GB) - Main unified model weights
- `best.pt` (6.2MB) - Best training checkpoint
- `sample_vision_weights.pt` (12.6MB) - Vision module weights
- Configuration files in JSON format
## Quick Start
```python
import torch

# Load the main unified model onto the GPU if one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = torch.load("lam_model.pt", map_location=device)

# Load the best training checkpoint
checkpoint = torch.load("best.pt", map_location=device)
```
## Architecture
The model uses a hierarchical reasoning approach:
1. **Vision Encoding**: Process UI screenshots with InternVL3
2. **Task Planning**: Decompose user intent into actions
3. **Action Generation**: Generate appropriate UI commands
4. **Verification**: Validate action outcomes
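The four stages above can be sketched as a simple pipeline. This is an illustrative sketch only: the function names, the `Action` type, and the placeholder return values are hypothetical and do not reflect the model's actual API.

```python
from dataclasses import dataclass

@dataclass
class Action:
    command: str
    target: str

def encode_vision(screenshot: bytes) -> dict:
    # Stage 1: in the real model, InternVL3 would turn the screenshot
    # into UI features; here we return a placeholder element list.
    return {"elements": ["button:Submit", "field:username"]}

def plan_task(intent: str, features: dict) -> list[str]:
    # Stage 2: decompose the user's intent into sub-steps,
    # one per detected UI element.
    return [f"locate {e}" for e in features["elements"]]

def generate_actions(steps: list[str]) -> list[Action]:
    # Stage 3: map each planned step to a concrete UI command.
    return [Action(command="click", target=s.removeprefix("locate ")) for s in steps]

def verify(actions: list[Action]) -> bool:
    # Stage 4: validate outcomes before committing (here: non-empty targets).
    return all(a.target for a in actions)

features = encode_vision(b"\x89PNG...")   # placeholder screenshot bytes
steps = plan_task("log in", features)
actions = generate_actions(steps)
assert verify(actions)
print(actions[0].command, actions[0].target)
```

The point of the hierarchy is that each stage consumes only the previous stage's output, so planning can be verified independently of raw pixels.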
## Performance
- Optimized for real-time UI automation
- Supports multi-modal inputs (text + images)
- Hierarchical reasoning for complex multi-step tasks
- ~450MB/s inference on modern GPUs
## Use Cases
- Automated GUI testing
- Computer control via natural language
- UI element detection and interaction
- Workflow automation
- Accessibility tools
## Repository Structure
```
.
├── lam_model.pt             # Main 14GB model
├── best.pt                  # Best checkpoint
├── sample_vision_weights.pt # Vision weights
├── config.json              # Model configuration
├── capabilities.json        # Capabilities definition
├── strategic_analysis.json  # Training strategy
└── training_data/           # Fine-tuning datasets
```
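The JSON configuration files can be inspected with the standard library. A minimal sketch, using a stand-in file since the field names in the repository's actual `config.json` are not documented here:

```python
import json
from pathlib import Path

# Write a tiny stand-in config so this sketch is self-contained;
# in practice, point this at the repository's own config.json.
# The keys below are illustrative, not the model's real schema.
cfg_path = Path("config_example.json")
cfg_path.write_text(json.dumps({"model_type": "nemotron-plus-vlm", "hidden_size": 4096}))

config = json.loads(cfg_path.read_text())
print(config["model_type"])
```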
## Limitations
- Model size requires significant GPU memory (14GB+)
- Optimized for English language interfaces
- Best performance on standard desktop environments
## License
Apache 2.0
## Citation
```bibtex
@misc{nemotron-plus-vlm-2025,
  title={Nemotron Plus VLM: Vision-Language Model for UI Automation},
  author={genialo555},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/genialo555/nemotron-plus-vlm}
}
```
## Links
- Model uploaded with 14.1GB weights via Hugging Face Hub