Sentinel PII Redaction
State-of-the-art PII detection and redaction model
Sentinel PII Redaction is a specialized language model fine-tuned for identifying and tagging Personally Identifiable Information (PII) in text. Built on IBM's Granite 4.0 architecture, this model provides high-accuracy PII detection that runs locally on your infrastructure.
Model Overview
- Base Model: IBM Granite 4.0 Micro (3.2B parameters)
- Task: PII Detection and Tagging
- Training Data: 1,500 examples from AI4Privacy PII-masking-300k + synthetic data
- Performance: 95%+ recall on most of the 30 supported PII categories (per-category metrics below)
- Deployment: Optimized for local inference (no data leaves your system)
- License: Apache 2.0
Supported PII Categories
The model can identify and tag the following PII categories:
Identity Information
- PERSON_NAME: Full names, first names, last names
- USERNAME: User identifiers
- AGE: Numerical age
- GENDER: Gender identifiers
- DEMOGRAPHIC_GROUP: Race, ethnicity
Contact Information
- EMAIL_ADDRESS: Email addresses
- PHONE_NUMBER: Phone numbers (various formats)
- STREET_ADDRESS: Physical addresses
- CITY: City names
- STATE: State/province names
- POSTCODE: ZIP/postal codes
- COUNTRY: Country names
Dates
- DATE: General dates
- DATE_OF_BIRTH: Birth dates
ID Numbers
- PERSONAL_ID: SSN, national IDs, subscriber numbers
- PASSPORT: Passport numbers
- DRIVERLICENSE: Driver's license numbers
- IDCARD: ID card numbers
- SOCIALNUMBER: Social security numbers
Financial
- CREDIT_CARD_INFO: Credit card numbers
- BANKING_NUMBER: Bank account numbers
Security
- PASSWORD: Passwords and credentials
- SECURE_CREDENTIAL: API keys, tokens, private keys
Medical
- MEDICAL_CONDITION: Diagnoses, treatments, health information
Location
- NATIONALITY: Country of origin/citizenship
- GEOCOORD: GPS coordinates
Organization
- ORGANIZATION_NAME: Company/organization names
- BUILDING: Building names/numbers
Other
- DOMAIN_NAME: Internet domains
- RELIGIOUS_AFFILIATION: Religious identifiers
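For programmatic validation of model output, the full tag set above can be mirrored as a Python constant; a minimal sketch (the name PII_CATEGORIES is our own, not part of the model's API):

# All 30 category labels, copied from the list above
PII_CATEGORIES = {
    "PERSON_NAME", "USERNAME", "AGE", "GENDER", "DEMOGRAPHIC_GROUP",
    "EMAIL_ADDRESS", "PHONE_NUMBER", "STREET_ADDRESS", "CITY", "STATE",
    "POSTCODE", "COUNTRY", "DATE", "DATE_OF_BIRTH", "PERSONAL_ID",
    "PASSPORT", "DRIVERLICENSE", "IDCARD", "SOCIALNUMBER",
    "CREDIT_CARD_INFO", "BANKING_NUMBER", "PASSWORD", "SECURE_CREDENTIAL",
    "MEDICAL_CONDITION", "NATIONALITY", "GEOCOORD", "ORGANIZATION_NAME",
    "BUILDING", "DOMAIN_NAME", "RELIGIOUS_AFFILIATION",
}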
🚀 Quick Start
Installation
pip install transformers torch
Basic Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "coolAI/sentinel-pii-redaction",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("coolAI/sentinel-pii-redaction")

# Prepare input text
text = "My name is John Smith and my email is [email protected]. I live at 123 Main St, New York, NY 10001."

# Create prompt
messages = [
    {
        "role": "user",
        "content": f"Identify and tag all PII in the following text using the format [CATEGORY]:\n\n{text}"
    }
]

# Tokenize
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Generate
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=512,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode only the newly generated tokens
input_length = inputs.size(1)
generated_ids = outputs[0][input_length:]
response = tokenizer.decode(generated_ids, skip_special_tokens=True)
print(response)
Expected Output:
My name is [PERSON_NAME] and my email is [EMAIL_ADDRESS]. I live at [STREET_ADDRESS], [CITY], [STATE] [POSTCODE].
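To post-process the tagged output programmatically, the bracketed category tags can be pulled out with a regular expression; a small sketch building on the response variable from Basic Usage:

import re

# Collect the distinct PII tags the model emitted, relying on the
# [CATEGORY] output format shown above
tags = sorted(set(re.findall(r"\[([A-Z_]+)\]", response)))
print(tags)  # e.g. ['CITY', 'EMAIL_ADDRESS', 'PERSON_NAME', ...]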
📊 Performance Metrics
Evaluated on the AI4Privacy PII-masking-300k dataset:
Category-Specific Recall Rates
| Category | Recall | Description |
|---|---|---|
| Critical PII | ||
| PERSONAL_ID | 98.5% | SSN, national IDs |
| DATE_OF_BIRTH | 98.2% | Birth dates |
| CREDIT_CARD_INFO | 97.8% | Credit card numbers |
| PASSWORD | 96.9% | Passwords |
| Identity | ||
| PERSON_NAME | 95.4% | Personal names |
| EMAIL_ADDRESS | 97.2% | Email addresses |
| PHONE_NUMBER | 96.5% | Phone numbers |
| USERNAME | 94.8% | User identifiers |
| Location | ||
| STREET_ADDRESS | 96.5% | Physical addresses |
| POSTCODE | 99.3% | ZIP/postal codes |
| CITY | 97.6% | City names |
| COUNTRY | 96.1% | Country names |
| Medical | ||
| MEDICAL_CONDITION | 93.2% | Health information |
| Organization | ||
| ORGANIZATION_NAME | 94.7% | Company names |
Note: Actual performance may vary based on text format and context.
💡 Use Cases
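The snippets in this section assume two small helpers, redact_pii and detect_pii, that wrap the Quick Start generation code. These names are our own, not a published API; a minimal sketch, assuming model and tokenizer are loaded as shown above:

import re
import torch

def redact_pii(text):
    # Same prompt and generation settings as in Basic Usage
    messages = [{
        "role": "user",
        "content": f"Identify and tag all PII in the following text using the format [CATEGORY]:\n\n{text}"
    }]
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            inputs, max_new_tokens=512, do_sample=False,
            pad_token_id=tokenizer.eos_token_id
        )
    return tokenizer.decode(outputs[0][inputs.size(1):], skip_special_tokens=True)

def detect_pii(text):
    # Map each tag found in the redacted text to its occurrence count
    counts = {}
    for tag in re.findall(r"\[([A-Z_]+)\]", redact_pii(text)):
        counts[tag] = counts.get(tag, 0) + 1
    return counts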
1. Data Sanitization for ML Training
Remove PII from datasets before fine-tuning language models:
def sanitize_training_data(texts):
    sanitized = []
    for text in texts:
        redacted = redact_pii(text)
        sanitized.append(redacted)
    return sanitized

# Use for safe model training
clean_data = sanitize_training_data(user_generated_content)
2. Compliance & Auditing
Ensure GDPR, HIPAA, and CCPA compliance:
def audit_document(document):
    pii_found = detect_pii(document)
    return {
        "has_pii": len(pii_found) > 0,
        "pii_types": list(pii_found.keys()),
        "redacted_version": redact_pii(document)
    }
3. Privacy Protection in Logs
Sanitize application logs before storage or analysis:
def safe_logging(log_entry):
    return redact_pii(log_entry)

logger.info(safe_logging(user_action))
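The same idea can be wired into the standard library's logging machinery so every record is sanitized automatically; a sketch using logging.Filter, under the same redact_pii assumption (note that running model inference per log record is expensive and best suited to low-volume or offline pipelines):

import logging

class PIIRedactionFilter(logging.Filter):
    # Rewrites each record's message through redact_pii before it is emitted
    def filter(self, record):
        record.msg = redact_pii(record.getMessage())
        record.args = ()  # args are already folded into msg by getMessage()
        return True

logger = logging.getLogger("app")
logger.addFilter(PIIRedactionFilter())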
🔧 Advanced Usage
With Custom PII Categories
Guide the model by specifying which PII categories to focus on:
categories = """
PII Categories to identify:
- PERSON_NAME: Names of people
- EMAIL_ADDRESS: Email addresses
- PHONE_NUMBER: Phone numbers
- MEDICAL_CONDITION: Health information
- PERSONAL_ID: ID numbers (SSN, passport, etc.)
"""
messages = [
    {
        "role": "user",
        "content": f"{categories}\n\nIdentify and tag all PII in the following text using the format [CATEGORY]:\n\n{text}"
    }
]
Batch Processing
Process multiple texts efficiently:
def batch_redact(texts, batch_size=8):
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # Redact each text in the batch (see redact_pii above)
        batch_results = [redact_pii(t) for t in batch]
        results.extend(batch_results)
    return results
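For genuinely batched generation rather than a per-text loop, decoder-only models need left padding; a sketch assuming the Quick Start model and tokenizer (batch_redact_padded is our own name):

# Decoder-only models must be left-padded for batched generation
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def batch_redact_padded(texts):
    prompts = [
        tokenizer.apply_chat_template(
            [{"role": "user",
              "content": f"Identify and tag all PII in the following text using the format [CATEGORY]:\n\n{t}"}],
            tokenize=False, add_generation_prompt=True
        )
        for t in texts
    ]
    # The chat template already adds special tokens, so skip them here
    enc = tokenizer(prompts, padding=True, add_special_tokens=False,
                    return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**enc, max_new_tokens=512, do_sample=False,
                             pad_token_id=tokenizer.pad_token_id)
    new_tokens = out[:, enc["input_ids"].size(1):]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)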
📝 Training Details
Training Data
- AI4Privacy PII-masking-300k: 1,000 examples
  - Large-scale, diverse PII examples
  - Multiple languages and jurisdictions
  - Human-validated accuracy
- Synthetic Data: 500 examples
  - Generated using the Faker library
  - Edge cases and rare PII types
  - Balanced category representation
- Total: 1,500 training examples
Training Configuration
Base Model: IBM Granite 4.0 Micro (3.2B parameters)
Method: LoRA (Low-Rank Adaptation)
Trainable Parameters: 38.4M (1.19% of total)
Training Hardware: NVIDIA L4 GPU
Training Time: ~7 minutes
Epochs: 1
Batch Size: 8 (2 × 4 gradient accumulation)
Learning Rate: 2e-4
Optimizer: AdamW 8-bit
Final Loss: 0.015-0.038
Training Framework
- Unsloth: For efficient fine-tuning
- Transformers: Model architecture
- PEFT: LoRA implementation
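For readers who want to reproduce a similar setup, the reported hyperparameters translate roughly into the following PEFT/Transformers configuration. The LoRA rank, alpha, and target modules below are assumptions (not published); only the batch size, gradient accumulation, epochs, learning rate, and 8-bit AdamW come from the table above:

from peft import LoraConfig
from transformers import TrainingArguments

# Assumed LoRA settings; the actual rank/alpha/targets are not published
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.0, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Values reported in Training Configuration
args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=2,   # 2 x 4 accumulation = effective batch 8
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    optim="adamw_bnb_8bit",          # AdamW 8-bit
)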
Privacy & Security
Privacy Features
- Local Inference: Runs entirely on your infrastructure
- No Data Sharing: No data sent to external APIs or services
- Open Source: Full transparency in model architecture and training
- Customizable: Can be further fine-tuned on your specific data
- Offline Capable: Works without internet connection
Security Considerations
- Model detects but doesn't store PII
- Inference happens in-memory
- No logging of input/output by default
- Can be deployed in air-gapped environments
- Supports encrypted storage of model weights
📄 License
This model is released under the Apache 2.0 license. You are free to:
- Use commercially
- Modify and distribute
- Use privately
- Rely on the license's express patent grant
🙏 Acknowledgments
- Built on IBM Granite 4.0 architecture
- Trained using AI4Privacy PII-masking-300k dataset
- Powered by Unsloth for efficient training
- Thanks to the open-source ML community
📚 Citation
If you use this model in your research or applications, please cite:
@misc{sentinel-pii-redaction-2025,
  author = {coolAI},
  title = {Sentinel PII Redaction: High-Accuracy Local PII Detection},
  year = {2025},
  publisher = {HuggingFace},
  journal = {HuggingFace Model Hub},
  howpublished = {\url{https://huggingface.co/coolAI/sentinel-pii-redaction}}
}
Built with ❤️ for privacy-conscious AI development