---
base_model: Nayana-cognitivelab/NayanaSectionOCR
tags:
- text-generation-inference
- transformers
- unsloth
- gemma3n
- vqa
- document-vqa
- visual-question-answering
- vision-language
- kannada
license: apache-2.0
language:
- kn
datasets:
- Nayana-cognitivelab/NayanaVQA-SFT
library_name: transformers
pipeline_tag: image-text-to-text
---

# 🤖 Nayana VQA - Advanced Kannada Visual Question Answering Model

**Developed by:** [CognitiveLab](https://nayana.cognitivelab.in/)
**License:** Apache 2.0
**Base Model:** unsloth/gemma-3n-E4B-it
**Architecture:** Gemma 3n (4B parameters)

## 🌟 Model Overview

Nayana VQA is an advanced vision-language model specifically fine-tuned for Visual Question Answering (VQA) and Document Visual Question Answering (Document VQA) tasks. Built on the powerful Gemma 3n architecture, this model excels at understanding and answering questions about visual content, with a special focus on Kannada language support.

## 🌍 Supported Languages

- **Kannada** (kn) - Primary focus language

**More languages coming soon!** We are actively working on expanding language support to 20 additional languages.

## 🎯 Key Features

- **Visual Question Answering**: Accurate question answering from images in Kannada
- **Document Understanding**: Advanced comprehension of document layouts and content
- **Multimodal Reasoning**: Combines visual and textual understanding for complex queries
- **Fast Inference**: Optimized for real-time applications
- **High Accuracy**: Fine-tuned on diverse VQA datasets
- **Easy Integration**: Compatible with Transformers and Modal deployment (see the Modal sketch under Quick Start)

## 📋 Model Specifications

| Parameter | Value |
|-----------|-------|
| Model Size | 4B parameters |
| Context Length | 32K tokens |
| Image Resolution | Flexible (optimized for documents and general images) |
| Precision | BFloat16 |
| Framework | Transformers + Unsloth |

## 🚀 Quick Start

### Installation

```bash
pip install transformers torch pillow unsloth
```

### Basic Usage

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import torch

# Load model and processor
model_id = "Nayana-cognitivelab/NayanaVQA"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)

# System prompt
system_prompt = "You are Nayana, an advanced AI assistant developed by CognitiveLab. You specialize in vision-based tasks, particularly Visual Question Answering (VQA) and Document Visual Question Answering (Document VQA). You are highly accurate, fast, and reliable when working with visual content. You can understand and respond to questions about images in Kannada with high precision."

# Load and process image
image = Image.open("your_image.jpg")
user_question = "ಈ ಚಿತ್ರದಲ್ಲಿ ಏನಿದೆ?"  # "What is in this image?" in Kannada

# Prepare messages
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": system_prompt}]
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_question},
            {"type": "image", "image": image}
        ]
    }
]

# Apply chat template
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
)

# Move inputs to the same device as the model
inputs = inputs.to(model.device)

# Generate response
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=1.0,
        top_p=0.95,
        top_k=64,
        do_sample=True
    )

# Decode only the newly generated tokens
response = processor.tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)
```
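
### Modal Deployment (sketch)

The feature list above mentions Modal deployment. Below is a minimal, untested sketch of wrapping the same inference code in a Modal function; the app name, GPU type, and the `answer` helper are illustrative assumptions, not part of this repository.

```python
# Hypothetical Modal wrapper around the Basic Usage code above.
# App name, GPU choice, and function signature are illustrative.
import modal

app = modal.App("nayana-vqa-demo")  # assumed app name
image = modal.Image.debian_slim().pip_install(
    "transformers", "torch", "pillow", "accelerate"
)

@app.function(gpu="A10G", image=image)
def answer(question: str, image_bytes: bytes) -> str:
    import io
    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForImageTextToText

    model_id = "Nayana-cognitivelab/NayanaVQA"
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
    )

    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image", "image": Image.open(io.BytesIO(image_bytes))},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)

    with torch.inference_mode():
        outputs = model.generate(**inputs, max_new_tokens=1024)
    return processor.tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```

In a real deployment you would cache the model load across requests (for example with Modal's class-based container lifecycle) rather than reloading it on every call.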
This model was trained **2x faster** with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
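
As an illustration of that training setup, here is a minimal sketch of vision fine-tuning with Unsloth and TRL, following the pattern of Unsloth's vision notebooks. The LoRA settings, hyperparameters, and dataset wiring are assumptions for illustration, not the actual recipe used for this checkpoint.

```python
# Illustrative sketch only - not the recipe actually used for this model.
# LoRA settings and hyperparameters below are assumptions; the dataset is
# assumed to already be in the collator's expected conversation format.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator

model, processor = FastVisionModel.from_pretrained(
    "unsloth/gemma-3n-E4B-it",
    load_in_4bit=True,
)
model = FastVisionModel.get_peft_model(model, r=16, lora_alpha=16)  # assumed

train_ds = load_dataset("Nayana-cognitivelab/NayanaVQA-SFT", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=processor,
    data_collator=UnslothVisionDataCollator(model, processor),
    train_dataset=train_ds,
    args=SFTConfig(
        per_device_train_batch_size=2,   # assumed
        gradient_accumulation_steps=4,   # assumed
        learning_rate=2e-4,              # assumed
        num_train_epochs=1,              # assumed
        output_dir="outputs",
        remove_unused_columns=False,     # keep image columns for the collator
        dataset_kwargs={"skip_prepare_dataset": True},
    ),
)
trainer.train()
```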
## 📜 Citation

```bibtex
@misc{nayana_vqa_2024,
  title={Nayana VQA: Advanced Kannada Visual Question Answering with Gemma 3n},
  author={CognitiveLab},
  year={2024},
  url={https://huggingface.co/Nayana-cognitivelab/NayanaVQA}
}
```