Nemotron Plus VLM

A 14GB Vision-Language Model fine-tuned for UI automation and self-operating computer tasks.

πŸš€ Model Description

This model combines multiple state-of-the-art architectures:

  • Nemotron architecture for advanced language understanding
  • InternVL3 for powerful vision processing
  • Custom fine-tuning for UI automation and computer control tasks

πŸ“Š Training Data

The model was fine-tuned on comprehensive datasets (2.2M+ samples) for:

  • File Operations: Hierarchical reasoning for file system tasks (~440K samples)
  • Web Interactions: CAPTCHA solving and web element interaction (~577K samples)
  • Window Management: Desktop windowing system control (~351K samples)
  • Ydotool Automation: System-level UI automation commands (~414K samples)
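The record layout of these datasets is not documented here; purely as an illustration, a sample might pair a UI screenshot with a natural-language instruction and a target command (all field names below are hypothetical):

```python
# Hypothetical sample record; field names and values are illustrative,
# not taken from the actual training data.
sample = {
    "task": "window_management",        # one of the four categories above
    "instruction": "Move the browser window to the left half of the screen",
    "screenshot": "frames/000123.png",  # UI screenshot fed to the vision encoder
    "action": {"tool": "ydotool", "command": "key super+left"},
}

def category(record: dict) -> str:
    """Return the coarse task category a record belongs to."""
    return record["task"]

print(category(sample))  # -> window_management
```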

πŸ’Ύ Model Files

  • lam_model.pt (14.1GB) - Main unified model weights
  • best.pt (6.2MB) - Best training checkpoint
  • sample_vision_weights.pt (12.6MB) - Vision module weights
  • Configuration files in JSON format
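The JSON configuration files can be read with the standard library; a minimal sketch (the actual field names inside config.json and capabilities.json are not documented, so nothing below assumes them):

```python
import json
from pathlib import Path

def load_config(path: str = "config.json") -> dict:
    """Parse a JSON configuration file shipped alongside the weights."""
    return json.loads(Path(path).read_text())
```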

πŸ› οΈ Quick Start

import torch

# Pick a device once and load the main unified model onto it.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = torch.load("lam_model.pt", map_location=device)

# Load the best training checkpoint onto the same device.
# (On recent PyTorch releases, pass weights_only=False if the file
# contains pickled model objects rather than a plain state dict.)
checkpoint = torch.load("best.pt", map_location=device)

πŸ—οΈ Architecture

The model uses a hierarchical reasoning approach:

  1. Vision Encoding: Process UI screenshots with InternVL3
  2. Task Planning: Decompose user intent into actions
  3. Action Generation: Generate appropriate UI commands
  4. Verification: Validate action outcomes
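The four stages above can be sketched as a simple pipeline; every function body below is a placeholder stub (the model's real API is not published), so only the control flow reflects the description:

```python
from dataclasses import dataclass

@dataclass
class Action:
    command: str
    ok: bool = True

def encode_vision(screenshot: bytes) -> dict:
    # Stand-in for the InternVL3 encoder: returns dummy features.
    return {"n_bytes": len(screenshot)}

def plan_task(intent: str, features: dict) -> list[str]:
    # Stand-in planner: splits the intent into one step per "then" clause.
    return [step.strip() for step in intent.split("then")]

def generate_action(step: str, features: dict) -> Action:
    # Stand-in generator: wraps each planned step as a UI command.
    return Action(command=step)

def verify(action: Action) -> bool:
    # Stand-in verifier: keeps only actions flagged as valid.
    return action.ok

def run_task(screenshot: bytes, intent: str) -> list[Action]:
    features = encode_vision(screenshot)                      # 1. vision encoding
    plan = plan_task(intent, features)                        # 2. task planning
    actions = [generate_action(s, features) for s in plan]    # 3. action generation
    return [a for a in actions if verify(a)]                  # 4. verification

actions = run_task(b"fake-png-bytes", "open the file manager then click Documents")
print([a.command for a in actions])  # -> ['open the file manager', 'click Documents']
```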

πŸ“ˆ Performance

  • Optimized for real-time UI automation
  • Supports multi-modal inputs (text + images)
  • Hierarchical reasoning for complex multi-step tasks
  • ~450MB/s inference on modern GPUs

πŸ”§ Use Cases

  • Automated GUI testing
  • Computer control via natural language
  • UI element detection and interaction
  • Workflow automation
  • Accessibility tools

πŸ“ Repository Structure

.
β”œβ”€β”€ lam_model.pt           # Main 14GB model
β”œβ”€β”€ best.pt                # Best checkpoint
β”œβ”€β”€ sample_vision_weights.pt # Vision weights
β”œβ”€β”€ config.json            # Model configuration
β”œβ”€β”€ capabilities.json      # Capabilities definition
β”œβ”€β”€ strategic_analysis.json # Training strategy
└── training_data/         # Fine-tuning datasets

⚠️ Limitations

  • Model size requires significant GPU memory (14GB+)
  • Optimized for English language interfaces
  • Best performance on standard desktop environments

πŸ“„ License

Apache 2.0

🀝 Citation

@misc{nemotron-plus-vlm-2025,
  title={Nemotron Plus VLM: Vision-Language Model for UI Automation},
  author={genialo555},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/genialo555/nemotron-plus-vlm}
}

Model uploaded with 14.1GB weights via Hugging Face Hub
