
# Nemotron Plus VLM

A Vision-Language Model fine-tuned for UI automation and self-operating computer tasks.

## 🚀 Model Description

This model combines several architectures:

- **Nemotron** backbone for language understanding
- **InternVL3** for vision processing
- Custom fine-tuning for UI automation and computer-control tasks

## 📊 Training Data

The model was fine-tuned on datasets covering four task families:

- **File operations**: hierarchical reasoning over file-system tasks
- **Web interactions**: CAPTCHA solving and web-element interaction
- **Window management**: desktop windowing-system control
- **ydotool automation**: system-level UI automation commands

**Dataset statistics:**

- File operations: ~440K training samples
- Web/CAPTCHA: ~577K training samples
- Window management: ~351K training samples
- ydotool commands: ~414K training samples
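The repository structure below indicates the datasets are stored as JSONL. The card does not document the record schema, so the field names in this sketch are assumptions for illustration only:

```python
import json

# Hypothetical JSONL training record -- the card only says the datasets are
# JSONL; "image", "instruction", and "response" are assumed field names.
line = json.dumps({
    "image": "screenshots/0001.png",
    "instruction": "Close the settings window",
    # Illustrative ydotool key sequence (Linux input-event codes for Alt+F4).
    "response": "ydotool key 56:1 62:1 62:0 56:0",
})

record = json.loads(line)
print(sorted(record))  # ['image', 'instruction', 'response']
```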

## 🛠️ Model Components

### Vision Module

- Based on the InternVL3 architecture
- Supports high-resolution image understanding
- Optimized for UI element detection

### Language Module

- Nemotron-based language model
- Fine-tuned for action planning and reasoning
- Supports multi-step task decomposition

## 📁 Repository Structure

```
├── training_data/          # Fine-tuning datasets (JSONL format)
├── model_implementation/   # Core model implementation files
│   ├── internvl3/          # Vision model components
│   ├── cfp_jarvis1_*.py    # Custom model implementations
│   └── lam_model.py        # LAM integration
└── config/                 # Training configurations
```

## 🔧 Usage

```python
# Coming soon - model loading and inference examples
```
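Until official examples land, a minimal loading sketch in the style of other Hugging Face VLMs might look like the following. The repo id, processor class, `trust_remote_code` requirement, and chat template are all assumptions, not confirmed by this card:

```python
# Hypothetical usage sketch -- repo id, processor class, and prompt format
# are assumptions; adjust once official examples are published.

def build_prompt(instruction: str) -> str:
    """Wrap a user instruction in a simple chat-style template (assumed format)."""
    return f"<|user|>\n<image>\n{instruction}\n<|assistant|>\n"

if __name__ == "__main__":
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    repo = "genialo555/nemotron-plus-vlm"  # assumed repo id
    processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        repo, torch_dtype=torch.bfloat16, trust_remote_code=True
    )

    image = Image.open("screenshot.png")
    inputs = processor(
        text=build_prompt("Click the Save button"),
        images=image,
        return_tensors="pt",
    )
    out = model.generate(**inputs, max_new_tokens=128)
    print(processor.decode(out[0], skip_special_tokens=True))
```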

πŸ—οΈ Architecture Details

The model uses a hierarchical reasoning approach:

1. **Vision encoding**: process UI screenshots
2. **Task planning**: decompose user intent into actions
3. **Action generation**: generate appropriate UI commands
4. **Verification**: validate action outcomes
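The four stages above can be sketched as a simple control loop. The functions below are illustrative stubs, not the repository's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Action:
    """A single UI command, e.g. a ydotool invocation (illustrative schema)."""
    tool: str
    args: list[str]

def encode_vision(screenshot: bytes) -> str:
    """Stage 1: turn a screenshot into a textual scene description (stub)."""
    return "button 'Save' at (412, 388)"

def plan(intent: str, scene: str) -> list[str]:
    """Stage 2: decompose the user's intent into sub-goals (stub)."""
    return [f"locate target in: {scene}", f"execute: {intent}"]

def generate_actions(steps: list[str]) -> list[Action]:
    """Stage 3: map each sub-goal to a concrete UI command (stub)."""
    return [Action("ydotool", ["click", "0xC0"]) for _ in steps]

def verify(actions: list[Action]) -> bool:
    """Stage 4: validate outcomes, e.g. by re-screenshotting (stub)."""
    return all(a.tool == "ydotool" for a in actions)

def run(intent: str, screenshot: bytes) -> bool:
    """Run one full vision -> plan -> act -> verify cycle."""
    scene = encode_vision(screenshot)
    actions = generate_actions(plan(intent, scene))
    return verify(actions)
```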

## 📈 Performance

- Optimized for real-time UI automation
- Supports multi-modal inputs (text + images)
- Hierarchical reasoning for complex tasks

## 🔬 Technical Specifications

- **Base model**: Nemotron + InternVL3
- **Training framework**: custom fine-tuning pipeline
- **Input**: text prompts + UI screenshots
- **Output**: structured action commands
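The card does not document the schema of the "structured action commands". One plausible shape, mapping a model output to a ydotool invocation, might look like this (field names are assumptions; exact ydotool flags vary by version):

```python
import json

# Hypothetical "structured action command" -- the real schema is undocumented,
# so the field names here are assumptions for illustration.
raw = '{"action": "click", "x": 412, "y": 388, "button": "left"}'
cmd = json.loads(raw)

# A downstream executor might translate this into a ydotool call that moves
# the cursor before clicking; flag spelling differs across ydotool versions.
argv = [
    "ydotool", "mousemove", "--absolute",
    "-x", str(cmd["x"]), "-y", str(cmd["y"]),
]
print(argv)
```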

## 📄 License

[To be determined]

## 🤝 Citation

If you use this model, please cite:

```bibtex
@misc{nemotron-plus-vlm,
  title={Nemotron Plus VLM},
  author={genialo555},
  year={2024},
  publisher={Hugging Face}
}
```

## 💡 Contact

For questions or collaboration: [Your contact info]