AdithyaSK committed on
Commit a38d43d · verified · 1 Parent(s): b0dbc27

Update README.md

Files changed (1):
  1. README.md +132 -7
README.md CHANGED
@@ -1,21 +1,146 @@
  ---
- base_model: Nayana-cognitivelab/DocOCR_SFT_v3_10500
  tags:
  - text-generation-inference
  - transformers
  - unsloth
  - gemma3n
  license: apache-2.0
  language:
- - en
  ---

- # Uploaded finetuned model

- - **Developed by:** Nayana-cognitivelab
- - **License:** apache-2.0
- - **Finetuned from model:** Nayana-cognitivelab/DocOCR_SFT_v3_10500

- This gemma3n model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.

  [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
  ---
+ base_model: Nayana-cognitivelab/NayanaSectionOCR
  tags:
  - text-generation-inference
  - transformers
  - unsloth
  - gemma3n
+ - vqa
+ - document-vqa
+ - visual-question-answering
+ - vision-language
+ - kannada
  license: apache-2.0
  language:
+ - kn
+ datasets:
+ - Nayana-cognitivelab/NayanaVQA-SFT
+ library_name: transformers
+ pipeline_tag: image-text-to-text
  ---

+ # 🤖 Nayana VQA - Advanced Kannada Visual Question Answering Model

+ **Developed by:** [CognitiveLab](https://nayana.cognitivelab.in/)
+ **License:** Apache 2.0
+ **Base Model:** unsloth/gemma-3n-E4B-it
+ **Architecture:** Gemma 3n (4B parameters)

+ ## 🌟 Model Overview
+
+ Nayana VQA is a vision-language model fine-tuned for Visual Question Answering (VQA) and Document Visual Question Answering (Document VQA). Built on the Gemma 3n architecture, it answers questions about visual content, with a special focus on Kannada language support.
+
+ ## 🌍 Supported Languages
+
+ - **Kannada** (kn) - primary focus language
+
+ **More languages coming soon!** We are actively working on expanding support to 20 additional languages.
+
+ ## 🎯 Key Features
+
+ - **Visual Question Answering**: Accurate question answering from images in Kannada
+ - **Document Understanding**: Advanced comprehension of document layouts and content
+ - **Multimodal Reasoning**: Combines visual and textual understanding for complex queries
+ - **Fast Inference**: Optimized for real-time applications
+ - **High Accuracy**: Fine-tuned on diverse VQA datasets
+ - **Easy Integration**: Compatible with Transformers and Modal deployment
+
+ ## 📋 Model Specifications
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Model Size | 4B parameters |
+ | Context Length | 32K tokens |
+ | Image Resolution | Flexible (optimized for documents and general images) |
+ | Precision | BFloat16 |
+ | Framework | Transformers + Unsloth |
+
+ ## 🚀 Quick Start
+
+ ### Installation
+
+ ```bash
+ pip install transformers torch pillow unsloth
+ ```
+
+ ### Basic Usage
+
+ ```python
+ from transformers import AutoProcessor, AutoModelForImageTextToText
+ from PIL import Image
+ import torch
+
+ # Load model and processor
+ model_id = "Nayana-cognitivelab/NayanaVQA"
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForImageTextToText.from_pretrained(
+     model_id,
+     device_map="auto",
+     torch_dtype=torch.bfloat16,
+     trust_remote_code=True,
+ )
+
+ # System prompt
+ system_prompt = "You are Nayana, an advanced AI assistant developed by CognitiveLab. You specialize in vision-based tasks, particularly Visual Question Answering (VQA) and Document Visual Question Answering (Document VQA). You are highly accurate, fast, and reliable when working with visual content. You can understand and respond to questions about images in Kannada with high precision."
+
+ # Load the image and pose a question
+ image = Image.open("your_image.jpg")
+ user_question = "ಈ ಚಿತ್ರದಲ್ಲಿ ಏನಿದೆ?"  # "What is in this image?" in Kannada
+
+ # Prepare messages
+ messages = [
+     {
+         "role": "system",
+         "content": [{"type": "text", "text": system_prompt}],
+     },
+     {
+         "role": "user",
+         "content": [
+             {"type": "text", "text": user_question},
+             {"type": "image", "image": image},
+         ],
+     },
+ ]
+
+ # Apply chat template and move tensors to the model's device
+ inputs = processor.apply_chat_template(
+     messages,
+     add_generation_prompt=True,
+     tokenize=True,
+     return_dict=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ # Generate response
+ with torch.inference_mode():
+     outputs = model.generate(
+         **inputs,
+         max_new_tokens=1024,
+         temperature=1.0,
+         top_p=0.95,
+         top_k=64,
+         do_sample=True,
+     )
+
+ # Decode only the newly generated tokens
+ response = processor.tokenizer.decode(
+     outputs[0][inputs["input_ids"].shape[1]:],
+     skip_special_tokens=True,
+ )
+ print(response)
+ ```
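The same chat-message structure extends naturally to multi-turn VQA: append the model's previous answer as an `assistant` turn, then add a follow-up `user` turn before re-applying the chat template. A minimal sketch of that message bookkeeping (pure data handling, no model call; the `make_turn` helper and the example strings are illustrative, not part of the model's API):

```python
# Sketch: building a multi-turn VQA conversation in the Gemma 3n chat format.
# `make_turn` is a hypothetical helper, not part of the transformers library.

def make_turn(role, *parts):
    """Wrap text strings and image objects into one chat message dict."""
    content = []
    for part in parts:
        if isinstance(part, str):
            content.append({"type": "text", "text": part})
        else:  # anything non-string is treated as an image (e.g. a PIL.Image)
            content.append({"type": "image", "image": part})
    return {"role": role, "content": content}

image = object()  # stand-in for PIL.Image.open("your_image.jpg")

# Turn 1: system prompt plus a user question about the image.
messages = [
    make_turn("system", "You are Nayana, a Kannada VQA assistant."),
    make_turn("user", "ಈ ಚಿತ್ರದಲ್ಲಿ ಏನಿದೆ?", image),  # "What is in this image?"
]

# After generation, record the model's answer and ask a follow-up.
messages.append(make_turn("assistant", "ಇದು ಒಂದು ದಾಖಲೆಯ ಪುಟ."))  # "It is a document page."
messages.append(make_turn("user", "ದಾಖಲೆ ಯಾವ ಭಾಷೆಯಲ್ಲಿದೆ?"))  # "What language is it in?"

print([m["role"] for m in messages])
# → ['system', 'user', 'assistant', 'user']
```

The resulting `messages` list can be passed to `processor.apply_chat_template` exactly as in the single-turn example above.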
+
+ This model was trained **2x faster** with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

  [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
+
+ ## 📜 Citation
+
+ ```bibtex
+ @misc{nayana_vqa_2024,
+   title={Nayana VQA: Advanced Kannada Visual Question Answering with Gemma 3n},
+   author={CognitiveLab},
+   year={2024},
+   url={https://huggingface.co/Nayana-cognitivelab/NayanaVQA}
+ }
+ ```