Create README.md

Files changed:

- .gitattributes (+1, −0)
- 2025-08-14T02-36_export.csv (+4, −0)
- README.md (+41, −10)
- assets/bleu.png (+3, −0)
- assets/cliptagger-example.png (+3, −0)
- assets/cost.png (+3, −0)
- assets/grass-x-inference.png (+3, −0)
- assets/judge-score.png (+3, −0)
- assets/rouge-1.png (+3, −0)
- assets/rouge-L.png (+3, −0)
.gitattributes CHANGED

```diff
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
+assets/*.png filter=lfs diff=lfs merge=lfs -text
```
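The line added to `.gitattributes` is exactly what `git lfs track "assets/*.png"` writes. A minimal sketch of reproducing and checking it, assuming only plain `git` is available (the `git lfs track` command itself requires the git-lfs extension):

```shell
# Recreate the attribute this commit adds. `git lfs track "assets/*.png"`
# writes this exact line; appending it by hand works without git-lfs installed.
repo="$(mktemp -d)"
cd "$repo"
git init -q .
printf 'assets/*.png filter=lfs diff=lfs merge=lfs -text\n' >> .gitattributes

# Confirm the LFS clean/smudge filter now applies to files under assets/
git check-attr filter assets/bleu.png   # prints: assets/bleu.png: filter: lfs
```

`git check-attr` resolves attributes from the pattern alone, so the PNG does not need to exist for the check to pass.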
2025-08-14T02-36_export.csv ADDED

```diff
@@ -0,0 +1,4 @@
+Model,Avg Judge Score,ROUGE-1,ROUGE-2,ROUGE-L,BLEU,Samples w/ Eval,Samples w/ Caption
+claude_4_sonnet,3.16,0.463,0.179,0.281,0.060,500,500
+cliptagger_12b,3.53,0.674,0.404,0.520,0.267,499,998
+gpt_4.1,3.64,0.581,0.260,0.376,0.119,494,500
```
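The export reports per-model ROUGE and BLEU overlap metrics. The evaluation pipeline itself is not part of this commit; as an illustrative sketch only, ROUGE-1 F1 is the unigram-overlap F-measure between a candidate caption and a reference (in practice a maintained package such as `rouge-score` is the likelier tool):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: unigram-overlap F-measure between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical caption pair, just to exercise the metric
print(rouge1_f1("a man walks a dog in the park",
                "a man is walking his dog through a park"))
```

ROUGE-2 and ROUGE-L follow the same F-measure pattern over bigrams and the longest common subsequence, respectively.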
README.md CHANGED

```diff
@@ -1,8 +1,9 @@
-# GrassData/cliptagger-12b
+
+
+## Model Information
+
+**GrassData/ClipTagger-12b** is a 12-billion parameter vision-language model (VLM) designed for video understanding at massive scale. Developed by [Inference.net](https://inference.net) in collaboration with [Grass](https://grass.io), this model was created to meet the demanding requirements of trillion-scale video frame captioning workloads.
 
 The model generates structured, schema-consistent JSON outputs for every video frame, making it ideal for building searchable video databases, content moderation systems, and accessibility tools. It maintains temporal consistency across frames while delivering frontier-quality performance at a fraction of the cost of closed-source alternatives.
 
@@ -17,7 +18,7 @@ The model generates structured, schema-consistent JSON outputs for every video f
 ## Architecture
 
-GrassData/
+GrassData/ClipTagger-12b is based on the Gemma-12B architecture and has been optimized with FP8 quantization for maximum throughput on modern GPUs. The model is specifically tuned for RTX 40-series and H100 GPUs, leveraging native FP8 support for efficient inference.
 
 ### Technical Specifications
 - **Parameters**: 12 billion
@@ -42,21 +43,51 @@ The model was trained on 1 million carefully curated single-frame samples from p
 
 Performance metrics on our internal evaluation set:
 
-| Model
+| Model | Avg Judge Score | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU |
+|-------|-----------------|---------|---------|---------|------|
+| cliptagger_12b | **3.53** | **0.674** | **0.404** | **0.520** | **0.267** |
+| claude_4_sonnet | 3.16 | 0.463 | 0.179 | 0.281 | 0.060 |
+| gpt_4.1 | 3.64 | 0.581 | 0.260 | 0.376 | 0.119 |
+
+### Benchmark Visualizations
+
+<div align="center">
+<img src="./assets/judge-score.png" alt="Average Judge Score Comparison" width="45%" />
+<img src="./assets/rouge-1.png" alt="ROUGE-1 Score Comparison" width="45%" />
+<br/>
+<img src="./assets/rouge-L.png" alt="ROUGE-L Score Comparison" width="45%" />
+<img src="./assets/bleu.png" alt="BLEU Score Comparison" width="45%" />
+</div>
 
 FP8 quantization showed no measurable quality degradation compared to bf16 precision.
 
+## Cost Comparison
+
+GrassData/ClipTagger-12b delivers frontier-quality performance at a fraction of the cost of closed-source alternatives. Based on typical usage patterns (700 input tokens and 250 output tokens per generation), here's how the costs compare:
+
+### Pricing Comparison
+
+| Model | Input Cost/MTok | Output Cost/MTok | Cost per 1M Generations | Cost per Generation |
+|-------|-----------------|------------------|-------------------------|---------------------|
+| ClipTagger-12b | $0.30 | $0.50 | $335 | $0.000335 |
+| GPT-4.1 | $3.00 | $12.00 | $5,100 | $0.0051 |
+| Claude 4 Sonnet | $3.00 | $15.00 | $5,850 | $0.00585 |
+
+*Cost calculations based on 700 input tokens and 250 output tokens per generation.*
+
+<div align="center">
+<img src="./assets/cost.png" alt="Cost Comparison Per 1 Million Generations" width="80%" />
+</div>
+
+ClipTagger-12b offers **15x cost savings** compared to GPT-4.1 and **17x cost savings** compared to Claude 4 Sonnet, while maintaining comparable quality metrics.
+
 ## Usage
 
 ### API Access
 
 For production deployments, we recommend using our managed API service which includes advanced features like batch processing, webhooks, and automatic scaling:
 
-**[Run GrassData/
+**[Run GrassData/ClipTagger-12b via Inference.net API →](https://localhost:3000/use-cases/video-understanding)**
 
 ### Required Prompts
```
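The per-generation figures in the README's cost comparison follow directly from the per-million-token rates. A quick check of the arithmetic, using the README's stated assumption of 700 input and 250 output tokens per generation:

```python
def cost_per_generation(input_per_mtok: float, output_per_mtok: float,
                        in_toks: int = 700, out_toks: int = 250) -> float:
    """Dollar cost of one generation at the given per-million-token rates."""
    return in_toks / 1e6 * input_per_mtok + out_toks / 1e6 * output_per_mtok

# (input $/MTok, output $/MTok) from the README's pricing table
models = {
    "ClipTagger-12b": (0.30, 0.50),
    "GPT-4.1": (3.00, 12.00),
    "Claude 4 Sonnet": (3.00, 15.00),
}

for name, (cin, cout) in models.items():
    per_gen = cost_per_generation(cin, cout)
    print(f"{name}: ${per_gen:.6f}/generation, ${per_gen * 1e6:,.0f} per 1M generations")
```

This reproduces the table ($0.000335, $0.0051, $0.00585 per generation) and the quoted ~15x and ~17x savings ratios.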
- assets/bleu.png ADDED (Git LFS)
- assets/cliptagger-example.png ADDED (Git LFS)
- assets/cost.png ADDED (Git LFS)
- assets/grass-x-inference.png ADDED (Git LFS)
- assets/judge-score.png ADDED (Git LFS)
- assets/rouge-1.png ADDED (Git LFS)
- assets/rouge-L.png ADDED (Git LFS)