Nemotron Plus VLM
A Vision-Language Model fine-tuned for UI automation and self-operating computer tasks.
Model Description
This model combines multiple state-of-the-art architectures (a structural sketch follows the list below):
- Nemotron architecture for advanced language understanding
- InternVL3 for powerful vision processing
- Custom fine-tuning for UI automation and computer control tasks
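Concretely, the intended composition can be pictured as an InternVL3-style vision encoder whose projected image features are prepended to the Nemotron decoder's token embeddings. The sketch below is illustrative only; the class name, projector layout, and call signatures are assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageBridge(nn.Module):
    """Illustrative composition: vision encoder -> projector -> language model."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int, text_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder          # e.g. InternVL3 ViT backbone
        self.projector = nn.Sequential(               # map image features into the LLM's embedding space
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        self.language_model = language_model          # e.g. Nemotron decoder

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        image_feats = self.vision_encoder(pixel_values)          # (B, N, vision_dim)
        image_tokens = self.projector(image_feats)               # (B, N, text_dim)
        fused = torch.cat([image_tokens, text_embeds], dim=1)    # prepend image tokens to the text embeddings
        return self.language_model(inputs_embeds=fused)          # decode over the fused sequence
```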
Training Data
The model was fine-tuned on datasets covering four task families:
- File Operations: Hierarchical reasoning for file system tasks
- Web Interactions: CAPTCHA solving and web element interaction
- Window Management: Desktop windowing system control
- Ydotool Automation: System-level UI automation commands
Dataset Statistics (an illustrative training record is sketched after this list):
- File operations: ~440K training samples
- Web/CAPTCHA: ~577K training samples
- Window management: ~351K training samples
- Ydotool commands: ~414K training samples
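Since the repository stores the fine-tuning data as JSONL, each line is a single JSON object. The record below is a hypothetical example for the file-operations subset; the field names and the plan format are assumptions, not the published schema.

```python
import json

# Hypothetical file-operations record; field names ("image", "task",
# "conversations") and the plan format are assumed, not the real schema.
sample = {
    "image": "screenshots/file_manager_0001.png",
    "task": "file_operations",
    "conversations": [
        {"role": "user",
         "content": "Move report.pdf into the Archive folder."},
        {"role": "assistant",
         "content": "PLAN: locate report.pdf -> open context menu -> "
                    "select 'Move to' -> choose Archive"},
    ],
}

# One JSON object per line in the .jsonl training file.
print(json.dumps(sample))
```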
Model Components
Vision Module
- Based on InternVL3 architecture
- Supports high-resolution image understanding
- Optimized for UI element detection
Language Module
- Nemotron-based language model
- Fine-tuned for action planning and reasoning
- Supports multi-step task decomposition
Repository Structure
├── training_data/            # Fine-tuning datasets (JSONL format)
├── model_implementation/     # Core model implementation files
│   ├── internvl3/            # Vision model components
│   ├── cfp_jarvis1_*.py      # Custom model implementations
│   └── lam_model.py          # LAM integration
└── config/                   # Training configurations
Usage
Official model loading and inference examples are coming soon; a provisional loading sketch is shown below.
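Until official examples land, the following sketch shows how a custom Hugging Face checkpoint of this kind is typically loaded. The repo id `genialo555/nemotron-plus-vlm`, the preprocessing, and the final `model.chat(...)` call are assumptions; check the files under `model_implementation/` for the actual entry points.

```python
# Provisional loading sketch -- the repo id, dtype, and inference helper
# are assumptions until official usage examples are published.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "genialo555/nemotron-plus-vlm"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,       # custom model code ships with the repo
).eval().cuda()

screenshot = Image.open("desktop_screenshot.png").convert("RGB")
prompt = "Open the Downloads folder and sort the files by date."

# The exact call depends on the custom code in model_implementation/;
# InternVL-style checkpoints usually expose a chat()-like helper, e.g.:
# response = model.chat(tokenizer, screenshot, prompt)
```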
Architecture Details
The model uses a hierarchical reasoning approach (a control-flow sketch follows this list):
- Vision Encoding: Process UI screenshots
- Task Planning: Decompose user intent into actions
- Action Generation: Generate appropriate UI commands
- Verification: Validate action outcomes
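As a rough picture of that loop, the sketch below wires the four stages together. Every function here is an illustrative stub that only shows the control flow; none of these names are exported by this repository.

```python
# Hypothetical control flow for the hierarchical reasoning loop described
# above; every function below is an illustrative stub, not repository code.
from dataclasses import dataclass

@dataclass
class Action:
    command: str          # e.g. "click", "type", "key"
    target: str           # UI element description or coordinates
    argument: str = ""    # optional payload such as text to type

def encode_screenshot(path: str) -> str:
    return f"<visual features of {path}>"           # 1. vision encoding (stub)

def plan_task(intent: str, state: str) -> list[str]:
    return [f"step towards: {intent}"]              # 2. task planning (stub)

def generate_action(step: str, state: str) -> Action:
    return Action(command="click", target=step)     # 3. action generation (stub)

def verify_outcome(action: Action, state: str) -> bool:
    return True                                     # 4. verification (stub)

def run_task(screenshot: str, intent: str) -> list[Action]:
    state = encode_screenshot(screenshot)
    actions: list[Action] = []
    for step in plan_task(intent, state):
        action = generate_action(step, state)
        if not verify_outcome(action, state):
            break                                   # replan or abort on failure
        actions.append(action)
    return actions
```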
Performance
- Optimized for real-time UI automation
- Supports multi-modal inputs (text + images)
- Hierarchical reasoning for complex tasks
Technical Specifications
- Base Model: Nemotron + InternVL3
- Training Framework: Custom fine-tuning pipeline
- Input: Text prompts + UI screenshots
- Output: Structured action commands (illustrated below)
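The exact shape of a structured action command has not been documented yet; the example below is only a plausible form for a "rename file" request, with every field name assumed.

```python
# Plausible (assumed) shape of the model's structured output for a
# "rename file" request; these field names are not the official schema.
action_sequence = [
    {"step": 1, "action": "click", "target": "report_draft.pdf"},
    {"step": 2, "action": "key",   "target": "F2"},
    {"step": 3, "action": "type",  "target": "filename_field",
     "text": "report_final.pdf"},
    {"step": 4, "action": "key",   "target": "Enter"},
]
```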
License
[To be determined]
Citation
If you use this model, please cite:
@misc{nemotron-plus-vlm,
  title={Nemotron Plus VLM},
  author={genialo555},
  year={2024},
  publisher={Hugging Face}
}
Contact
For questions or collaboration: [Your contact info]