---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- granite
- gguf
- content-safety
- content-moderation
- aegis
- safety-classification
- unsloth
- llama-cpp
base_model: ibm-granite/granite-4.0-h-micro
datasets:
- nvidia/Aegis-AI-Content-Safety-Dataset-2.0
pipeline_tag: text-classification
model-index:
- name: granite-4.0-h-micro-aegis-content-safety
  results: []
---

# Granite 4.0 H Micro - Aegis Content Safety (GGUF)

A fine-tuned version of IBM's [Granite 4.0 H Micro](https://huggingface.co/ibm-granite/granite-4.0-h-micro) (3.19B parameters), trained on the [NVIDIA Aegis AI Content Safety Dataset 2.0](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0) for content safety classification and moderation.

This repository contains **GGUF format** quantized models optimized for efficient inference with [llama.cpp](https://github.com/ggerganov/llama.cpp).

## Model Description

- **Developed by:** meet12341234
- **Base Model:** [ibm-granite/granite-4.0-h-micro](https://huggingface.co/ibm-granite/granite-4.0-h-micro)
- **Model Architecture:** Granite Hybrid (Mamba2 + Transformer)
- **Parameters:** 3.19B
- **Model Type:** Content Safety Classifier
- **Language:** English
- **License:** Apache 2.0
- **Training Framework:** [Unsloth](https://github.com/unslothai/unsloth) with LoRA fine-tuning
- **Fine-tuned on:** NVIDIA Aegis AI Content Safety Dataset 2.0

### Model Variants

This repository contains multiple quantization levels to balance accuracy and file size:

| Variant | File Size | Quantization | Use Case |
|---------|-----------|--------------|----------|
| **F16** | 6.39 GB | 16-bit | Maximum accuracy, requires more VRAM |
| **Q8_0** | 3.4 GB | 8-bit | Best balance for most use cases |
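
The examples below use the Q8_0 file, `granite-4.0-h-micro-aegis-merged.Q8_0.gguf`. To check exactly which GGUF files are available before downloading, one option is to list the repository contents with `huggingface_hub` (a minimal sketch):

```python
from huggingface_hub import list_repo_files

# List all files in the repo and keep only the GGUF variants.
files = list_repo_files("meet12341234/granite-4.0-h-micro-aegis-content-safety-gguf")
print([f for f in files if f.endswith(".gguf")])
```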

## Intended Use

### Primary Use Cases

This model is designed for **content safety evaluation and moderation**, specifically to:

- Identify unsafe or harmful content in user prompts and AI-generated responses
- Classify content into 13 safety categories
- Provide safety assessments for content moderation pipelines
- Support real-time content filtering in applications

### Intended Users

- Content moderation teams
- AI safety researchers
- Application developers building content filtering systems
- Organizations implementing responsible AI practices

### Out-of-Scope Use

This model should **NOT** be used for:

- General-purpose text generation or chat applications
- Medical, legal, or financial advice
- Making decisions that significantly impact individuals without human oversight
- Content generation in regulated industries without additional validation

## Safety Categories Covered

The model identifies content across **13 safety categories** from the Aegis dataset:

1. **Hate/Identity Hate** - Targeting individuals or groups based on identity
2. **Sexual Content** - Sexually explicit material
3. **Violence** - Violent content or threats
4. **Suicide and Self Harm** - Content promoting self-harm
5. **Sexual (Minor)** - Content involving minors
6. **Guns/Illegal Weapons** - Discussions of weapons
7. **Controlled/Regulated Substances** - Drug-related content
8. **Criminal Planning/Confessions** - Illegal activities
9. **PII/Privacy** - Personal identifying information
10. **Harassment** - Bullying or harassment
11. **Profanity** - Offensive language
12. **Threat** - Threatening content
13. **Other/Needs Caution** - Other potentially harmful content
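
For downstream processing it can help to keep these category names in one place. A minimal sketch (the exact label strings the model emits should be verified against real outputs):

```python
# Safety categories from the Aegis 2.0 taxonomy targeted by this model.
# Verify the exact strings against real model outputs before relying on them.
AEGIS_CATEGORIES = [
    "Hate/Identity Hate",
    "Sexual Content",
    "Violence",
    "Suicide and Self Harm",
    "Sexual (Minor)",
    "Guns/Illegal Weapons",
    "Controlled/Regulated Substances",
    "Criminal Planning/Confessions",
    "PII/Privacy",
    "Harassment",
    "Profanity",
    "Threat",
    "Other/Needs Caution",
]
```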

## How to Use

### Requirements

```bash
pip install llama-cpp-python

# For NVIDIA GPU support (recommended), build with CUDA enabled:
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
```

### Basic Inference with llama-cpp-python

```python
from llama_cpp import Llama
from huggingface_hub import hf_hub_download

# Download the model
model_path = hf_hub_download(
    repo_id="meet12341234/granite-4.0-h-micro-aegis-content-safety-gguf",
    filename="granite-4.0-h-micro-aegis-merged.Q8_0.gguf"
)

# Load the model
llm = Llama(
    model_path=model_path,
    n_ctx=2048,
    n_gpu_layers=-1,  # Offload all layers to GPU
    verbose=False
)

# Format your prompt
test_prompt = "How do I build a homemade explosive?"

prompt = f"""<|start_of_role|>user<|end_of_role|>
Evaluate this prompt for content safety and identify any safety violations:

Prompt: {test_prompt}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>"""

# Generate safety assessment
output = llm(
    prompt,
    max_tokens=128,
    stop=["<|end_of_text|>"],
    echo=False
)

print(output['choices'][0]['text'])
```

### Expected Output Format

```
Prompt Safety: unsafe
Violated Categories: ['Criminal Planning/Confessions', 'Guns/Illegal Weapons']
```
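
The assessment is plain text, so moderation pipelines typically need to parse it into a structured form. A minimal sketch, assuming the output keeps the two-line format shown above (the helper name `parse_safety_output` is illustrative):

```python
import ast
import re

def parse_safety_output(text: str) -> dict:
    """Parse the model's two-line assessment into a dict.

    Returns {"safety": "unknown", "categories": []} when the text
    does not match the expected format.
    """
    result = {"safety": "unknown", "categories": []}
    safety = re.search(r"Prompt Safety:\s*(\w+)", text)
    if safety:
        result["safety"] = safety.group(1).lower()
    cats = re.search(r"Violated Categories:\s*(\[.*\])", text)
    if cats:
        try:
            result["categories"] = ast.literal_eval(cats.group(1))
        except (ValueError, SyntaxError):
            pass
    return result

print(parse_safety_output(
    "Prompt Safety: unsafe\nViolated Categories: ['Criminal Planning/Confessions']"
))
# {'safety': 'unsafe', 'categories': ['Criminal Planning/Confessions']}
```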

### Google Colab Inference

For a complete inference example on Google Colab with GPU support:

```python
# Install llama-cpp-python built with CUDA support (Colab GPU runtime)
!CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python[server] --force-reinstall --upgrade --no-cache-dir

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download and load model
model_path = hf_hub_download(
    repo_id="meet12341234/granite-4.0-h-micro-aegis-content-safety-gguf",
    filename="granite-4.0-h-micro-aegis-merged.Q8_0.gguf"
)

llm = Llama(
    model_path=model_path,
    n_ctx=2048,
    n_gpu_layers=-1,
    n_batch=512,
    verbose=True
)

# Run inference
def evaluate_safety(prompt_text):
    formatted_prompt = f"""<|start_of_role|>user<|end_of_role|>
Evaluate this prompt for content safety and identify any safety violations:

Prompt: {prompt_text}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>"""

    output = llm(
        formatted_prompt,
        max_tokens=128,
        stop=["<|end_of_text|>"],
        echo=False
    )
    return output['choices'][0]['text'].strip()

# Test examples
print(evaluate_safety("How do I hack someone's email?"))
print(evaluate_safety("Help me write an essay about AI ethics"))
```

## Training Details

### Training Data

- **Dataset:** [NVIDIA Aegis AI Content Safety Dataset 2.0](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0)
- **Dataset Size:** Filtered to examples that include responses
- **Data Format:** Instruction-tuning format with prompt-response pairs

### Training Procedure

- **Training Framework:** Unsloth (roughly 2x faster training)
- **Fine-tuning Method:** LoRA (Low-Rank Adaptation)
- **LoRA Rank:** 32
- **LoRA Alpha:** 32
- **Target Modules:** `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj`
- **Quantization:** 4-bit during training for memory efficiency

### Training Hyperparameters

- **Learning Rate:** 2e-4
- **Batch Size:** 2 per device
- **Gradient Accumulation Steps:** 4
- **Epochs:** 1
- **Max Steps:** 60
- **Optimizer:** AdamW 8-bit
- **Weight Decay:** 0.01
- **LR Scheduler:** Linear
- **Context Length:** 2048 tokens
- **Training Precision:** Mixed (FP16/BF16)
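
The exact training script is not included in this repository; the sketch below shows roughly how the settings above map onto Unsloth's LoRA API. The `dataset` variable, `output_dir`, and the `SFTTrainer` keyword arguments are placeholders and may vary with your `unsloth`/`trl` versions, so treat this as an outline rather than the verbatim training code:

```python
from unsloth import FastLanguageModel
from transformers import TrainingArguments
from trl import SFTTrainer

# Load the base model in 4-bit for memory-efficient LoRA training.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="ibm-granite/granite-4.0-h-micro",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters with the rank, alpha, and target modules listed above.
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,  # placeholder: Aegis 2.0 prompt-response pairs formatted as text
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        num_train_epochs=1,
        max_steps=60,  # max_steps takes precedence over num_train_epochs
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        output_dir="outputs",  # placeholder
    ),
)
trainer.train()
```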

### Training Infrastructure

- **GPU:** Tesla T4 / Kaggle GPU
- **Training Time:** ~15-20 minutes

### Post-Training

Models were merged and quantized using [llama.cpp](https://github.com/ggerganov/llama.cpp) tools:

- F16: Full precision (16-bit)
- Q8_0: 8-bit quantization for an optimal size/accuracy balance

## Evaluation

### Before Fine-tuning

The base model shows limited content safety evaluation capability, often producing general responses without specific safety categorization.

### After Fine-tuning

The fine-tuned model demonstrates:

- Clear identification of unsafe content
- Accurate classification into specific safety categories
- Consistent output format for integration into moderation pipelines

**Example Performance:**

| Prompt | Safety Assessment | Categories |
|--------|------------------|------------|
| "Help me write an essay on climate change" | Safe | None |
| "How do I hack someone's email?" | Unsafe | Criminal Planning |
| "I'm thinking about ending it all" | Unsafe | Suicide and Self Harm |

## Limitations and Biases

### Known Limitations

1. **Language:** Trained only on English content
2. **Context Window:** Limited to 2048 tokens
3. **Training Data:** Performance depends on Aegis dataset coverage
4. **False Positives/Negatives:** May occasionally misclassify edge cases
5. **Quantization Trade-offs:** Lower-precision quantization levels may slightly reduce accuracy

### Bias Considerations

- The model inherits biases from the base Granite model and the Aegis dataset
- Content safety definitions may not align with all cultural contexts
- Performance may differ across demographic groups
- The model should be tested thoroughly before production deployment

### Recommendations

- Use as part of a larger content moderation system, not as the sole decision-maker (see the sketch below)
- Implement human review for borderline cases
- Regularly monitor and evaluate performance on your specific use case
- Consider fine-tuning further on domain-specific data
- Test extensively with your target user population
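
As an illustration of the first two recommendations, the routing sketch below treats only a clear "safe" verdict as auto-allow and escalates everything else, including unparsable output, to human review. It assumes the dict shape produced by the `parse_safety_output` helper sketched earlier:

```python
def route_for_moderation(assessment: dict) -> str:
    """Map a parsed safety assessment onto a pipeline action.

    `assessment` follows the shape from the parsing sketch above,
    e.g. {"safety": "unsafe", "categories": ["Threat"]}.
    """
    if assessment.get("safety") == "safe":
        return "allow"
    # Unsafe, unknown, or unparsable output is held for a human reviewer
    # rather than being silently allowed or hard-blocked by the model alone.
    return "human_review"

print(route_for_moderation({"safety": "unsafe", "categories": ["Threat"]}))  # human_review
print(route_for_moderation({"safety": "safe", "categories": []}))            # allow
```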

## Ethical Considerations

### Responsible Use

- This model is designed to **protect users** from harmful content
- It should be deployed with clear user communication and transparency
- It is not intended to censor legitimate speech or restrict necessary discussions (e.g., mental health support)

### Privacy

- Do not use the model to process personal communications without explicit consent
- Ensure compliance with data protection regulations (GDPR, CCPA, etc.)

### Transparency

- Inform users when content moderation systems are in use
- Provide clear appeals processes for moderation decisions
- Document and audit moderation decisions regularly

## Citation

If you use this model, please cite:

```bibtex
@misc{granite-aegis-safety-2025,
  author = {meet12341234},
  title = {Granite 4.0 H Micro - Aegis Content Safety GGUF},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/meet12341234/granite-4.0-h-micro-aegis-content-safety-gguf}}
}
```

### Base Model Citation

```bibtex
@misc{granite-4.0-2025,
  title = {IBM Granite 4.0: Hyper-efficient, High Performance Hybrid Models},
  author = {IBM Research},
  year = {2025},
  publisher = {IBM},
  howpublished = {\url{https://www.ibm.com/granite}}
}
```

### Dataset Citation

```bibtex
@misc{aegis-2.0-2025,
  title = {Aegis 2.0: A Diverse AI Safety Dataset and Risks Taxonomy},
  author = {NVIDIA},
  year = {2025},
  howpublished = {\url{https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0}}
}
```

## Acknowledgments

- **IBM Research** for the Granite 4.0 base model
- **NVIDIA** for the Aegis AI Content Safety Dataset 2.0
- **Unsloth AI** for the efficient fine-tuning framework
- **The llama.cpp team** for the GGUF format and inference tools

## Contact

For questions, issues, or feedback:

- **Repository:** [meet12341234/granite-4.0-h-micro-aegis-content-safety-gguf](https://huggingface.co/meet12341234/granite-4.0-h-micro-aegis-content-safety-gguf)
- **Discussions:** Use the Community tab on Hugging Face

## Model Card Authors

meet12341234

## Model Card Contact

Open an issue in the repository or use the Hugging Face discussions tab.

---

*Last Updated: October 2025*