# Nemotron Plus VLM
A Vision-Language Model fine-tuned for UI automation and self-operating computer tasks.
## Model Description
This model combines multiple state-of-the-art architectures:
- **Nemotron** architecture for advanced language understanding
- **InternVL3** for powerful vision processing
- **Custom fine-tuning** for UI automation and computer control tasks
## Training Data
The model was fine-tuned on comprehensive datasets for:
- **File Operations**: Hierarchical reasoning for file system tasks
- **Web Interactions**: CAPTCHA solving and web element interaction
- **Window Management**: Desktop windowing system control
- **Ydotool Automation**: System-level UI automation commands
### Dataset Statistics
- File operations: ~440K training samples
- Web/CAPTCHA: ~577K training samples
- Window management: ~351K training samples
- Ydotool commands: ~414K training samples
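The training data is stored as JSONL (per the repository structure below). The exact record schema has not been published, so the sample record and field names here (`image`, `prompt`, `actions`) are assumptions for illustration only:

```python
import json

# Hypothetical JSONL record -- the published schema may differ.
sample_line = json.dumps({
    "image": "screenshots/file_manager_001.png",
    "prompt": "Move report.pdf into the Archive folder",
    "actions": [
        {"tool": "ydotool", "command": "click", "x": 412, "y": 318},
        {"tool": "ydotool", "command": "key", "keys": ["ctrl", "x"]},
    ],
})

def load_jsonl(lines):
    """Parse JSONL training samples, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

samples = load_jsonl([sample_line])
print(samples[0]["prompt"])
```

Each line of a `.jsonl` file is one independent JSON object, which keeps multi-hundred-thousand-sample datasets streamable without loading them whole.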
## Model Components
### Vision Module
- Based on InternVL3 architecture
- Supports high-resolution image understanding
- Optimized for UI element detection
### Language Module
- Nemotron-based language model
- Fine-tuned for action planning and reasoning
- Supports multi-step task decomposition
## Repository Structure
```
├── training_data/          # Fine-tuning datasets (JSONL format)
├── model_implementation/   # Core model implementation files
│   ├── internvl3/          # Vision model components
│   ├── cfp_jarvis1_*.py    # Custom model implementations
│   └── lam_model.py        # LAM integration
└── config/                 # Training configurations
```
## Usage
```python
# Coming soon - model loading and inference examples
```
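Until official loading and inference examples are published, here is a hedged sketch of what a `transformers`-based workflow might look like. The repo id, the `<image>` placeholder token, and the prompt layout are all assumptions, not the published interface:

```python
# Hypothetical loading (repo id assumed; requires `pip install transformers`):
#
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   repo = "genialo555/nemotron-plus-vlm"
#   tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
#   model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
#
# Custom VLM architectures generally need trust_remote_code=True so that
# transformers can execute the model classes shipped in the repository.

IMAGE_TOKEN = "<image>"  # assumed image placeholder token

def build_prompt(instruction: str) -> str:
    """Compose a single-image prompt in an InternVL3-style layout (assumed)."""
    return f"{IMAGE_TOKEN}\n{instruction}"

print(build_prompt("Open the Downloads folder"))
```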
## Architecture Details
The model uses a hierarchical reasoning approach:
1. **Vision Encoding**: Process UI screenshots
2. **Task Planning**: Decompose user intent into actions
3. **Action Generation**: Generate appropriate UI commands
4. **Verification**: Validate action outcomes
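The four stages above can be sketched as a simple loop over planned steps. This is an illustrative skeleton only: the stage functions below are trivial stand-ins, not the model's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Action:
    command: str
    verified: bool = False

@dataclass
class HierarchicalPipeline:
    """Skeleton of the four-stage loop; each stage is a pluggable callable."""
    encode: Callable[[bytes], str]          # 1. Vision Encoding
    plan: Callable[[str, str], List[str]]   # 2. Task Planning
    act: Callable[[str], Action]            # 3. Action Generation
    verify: Callable[[Action], bool]        # 4. Verification

    def run(self, screenshot: bytes, intent: str) -> List[Action]:
        context = self.encode(screenshot)
        results = []
        for step in self.plan(context, intent):
            action = self.act(step)
            action.verified = self.verify(action)
            results.append(action)
        return results

# Demo with trivial stand-in stages:
pipe = HierarchicalPipeline(
    encode=lambda png: "file manager open on Documents",
    plan=lambda ctx, intent: ["click Archive", "paste report.pdf"],
    act=lambda step: Action(command=step),
    verify=lambda action: True,
)
result = pipe.run(b"\x89PNG", "archive the report")
```

Keeping verification as a distinct stage means a failed action can be caught per step rather than only at the end of the task.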
## Performance
- Optimized for real-time UI automation
- Supports multi-modal inputs (text + images)
- Hierarchical reasoning for complex tasks
## Technical Specifications
- Base Model: Nemotron + InternVL3
- Training Framework: Custom fine-tuning pipeline
- Input: Text prompts + UI screenshots
- Output: Structured action commands
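Since the output is structured action commands, a caller will typically want to validate them before execution. The JSON shape below (`tool`/`command` keys) is an assumed format, not the model's documented output contract:

```python
import json

# Assumed output shape: a JSON list of {"tool": ..., "command": ...} objects.
REQUIRED_KEYS = {"tool", "command"}

def parse_actions(model_output: str):
    """Parse a structured-action response; raise on malformed entries."""
    actions = json.loads(model_output)
    for action in actions:
        missing = REQUIRED_KEYS - action.keys()
        if missing:
            raise ValueError(f"action missing keys: {missing}")
    return actions

out = '[{"tool": "ydotool", "command": "type", "text": "hello"}]'
print(parse_actions(out)[0]["command"])
```

Rejecting malformed actions up front is cheap insurance when the commands will ultimately drive a real desktop session.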
## License
[To be determined]
## Citation
If you use this model, please cite:
```
@misc{nemotron-plus-vlm,
  title={Nemotron Plus VLM},
  author={genialo555},
  year={2024},
  publisher={Hugging Face}
}
```
## Contact
For questions or collaboration: [Your contact info]