# Nemotron Plus VLM

A Vision-Language Model fine-tuned for UI automation and self-operating computer tasks.

## πŸš€ Model Description

This model combines multiple state-of-the-art architectures:
- **Nemotron** architecture for advanced language understanding
- **InternVL3** for powerful vision processing
- **Custom fine-tuning** for UI automation and computer control tasks

## πŸ“Š Training Data

The model was fine-tuned on four task-specific datasets covering:
- **File Operations**: Hierarchical reasoning for file system tasks
- **Web Interactions**: CAPTCHA solving and web element interaction
- **Window Management**: Desktop windowing system control
- **Ydotool Automation**: System-level UI automation commands

### Dataset Statistics
- File operations: ~440K training samples
- Web/CAPTCHA: ~577K training samples  
- Window management: ~351K training samples
- Ydotool commands: ~414K training samples
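For illustration, a single JSONL record from these datasets might be loaded as below. The field names are assumptions for the sketch, since the actual schema in `training_data/` is not documented here:

```python
import json

# Hypothetical shape of one training sample; the real JSONL schema in
# training_data/ may differ, so treat every field name as an assumption.
sample = {
    "image": "screenshots/0001.png",
    "instruction": "Close the settings window",
    "actions": [{"tool": "ydotool", "command": "click"}],
}

# JSONL stores one JSON object per line; parse each non-empty line.
def load_jsonl(lines):
    return [json.loads(line) for line in lines if line.strip()]

records = load_jsonl([json.dumps(sample)])
```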

## πŸ› οΈ Model Components

### Vision Module
- Based on InternVL3 architecture
- Supports high-resolution image understanding
- Optimized for UI element detection

### Language Module  
- Nemotron-based language model
- Fine-tuned for action planning and reasoning
- Supports multi-step task decomposition

## πŸ“ Repository Structure

```
β”œβ”€β”€ training_data/          # Fine-tuning datasets (JSONL format)
β”œβ”€β”€ model_implementation/   # Core model implementation files
β”‚   β”œβ”€β”€ internvl3/         # Vision model components
β”‚   β”œβ”€β”€ cfp_jarvis1_*.py   # Custom model implementations
β”‚   └── lam_model.py       # LAM integration
└── config/                # Training configurations
```

## πŸ”§ Usage

```python
# Coming soon - model loading and inference examples
```
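Until official loading and inference examples land, here is a minimal sketch of how a multimodal request might be assembled. Every field name below is an assumption, not the final API:

```python
# Placeholder sketch of an inference request payload; the published API
# may use a different message structure entirely.
def build_request(instruction: str, screenshot_path: str) -> dict:
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image", "path": screenshot_path},
                {"type": "text", "text": instruction},
            ],
        }]
    }

req = build_request("Open the terminal", "shots/desktop.png")
```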

## πŸ—οΈ Architecture Details

The model uses a hierarchical reasoning approach:
1. **Vision Encoding**: Process UI screenshots
2. **Task Planning**: Decompose user intent into actions
3. **Action Generation**: Generate appropriate UI commands
4. **Verification**: Validate action outcomes
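The four stages above can be stubbed out as follows. The function bodies are illustrative placeholders, not the model's actual implementation:

```python
def encode_vision(screenshot: bytes) -> dict:
    # Stage 1: turn the raw screenshot into visual features (stubbed).
    return {"n_bytes": len(screenshot)}

def plan_task(intent: str, vision: dict) -> list:
    # Stage 2: decompose the user intent into ordered sub-tasks.
    return ["locate target", f"execute: {intent}"]

def generate_actions(plan: list) -> list:
    # Stage 3: map each sub-task to a concrete UI command.
    return [{"step": step, "tool": "ydotool"} for step in plan]

def verify(actions: list) -> bool:
    # Stage 4: validate action outcomes (stubbed as a non-empty check).
    return len(actions) > 0

actions = generate_actions(plan_task("open file manager", encode_vision(b"png")))
```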

## πŸ“ˆ Performance

- Optimized for real-time UI automation
- Supports multi-modal inputs (text + images)
- Hierarchical reasoning for complex tasks

## πŸ”¬ Technical Specifications

- Base Model: Nemotron + InternVL3
- Training Framework: Custom fine-tuning pipeline
- Input: Text prompts + UI screenshots
- Output: Structured action commands
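Since the model emits structured action commands, a downstream consumer would parse them roughly like this. The JSON schema shown is hypothetical:

```python
import json

# Hypothetical structured action command emitted by the model; the real
# output schema is not documented here, so field names are assumptions.
raw_output = '{"action": "click", "target": {"x": 412, "y": 87}}'

command = json.loads(raw_output)
x, y = command["target"]["x"], command["target"]["y"]
```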

## πŸ“„ License

[To be determined]

## 🀝 Citation

If you use this model, please cite:
```
@misc{nemotron-plus-vlm,
  title={Nemotron Plus VLM},
  author={genialo555},
  year={2024},
  publisher={Hugging Face}
}
```

## πŸ’‘ Contact

For questions or collaboration: [Your contact info]