Sentinel PII Redaction
State-of-the-art PII detection and redaction model
Sentinel PII Redaction is a specialized language model fine-tuned for identifying and tagging Personally Identifiable Information (PII) in text. Built on IBM's Granite 4.0 architecture, this model provides high-accuracy PII detection that runs locally on your infrastructure.
Model Overview
- Base Model: IBM Granite 4.0 Micro (3.2B parameters)
- Task: PII Detection and Tagging
- Training Data: 1,500 examples from AI4Privacy PII-masking-300k + synthetic data
- Performance: 95%+ recall on most of the 30 supported PII categories (per-category metrics below)
- Deployment: Optimized for local inference (no data leaves your system)
- License: Apache 2.0
Supported PII Categories
The model can identify and tag the following PII categories:
Identity Information
- PERSON_NAME: Full names, first names, last names
- USERNAME: User identifiers
- AGE: Numerical age
- GENDER: Gender identifiers
- DEMOGRAPHIC_GROUP: Race, ethnicity
Contact Information
- EMAIL_ADDRESS: Email addresses
- PHONE_NUMBER: Phone numbers (various formats)
- STREET_ADDRESS: Physical addresses
- CITY: City names
- STATE: State/province names
- POSTCODE: ZIP/postal codes
- COUNTRY: Country names
Dates
- DATE: General dates
- DATE_OF_BIRTH: Birth dates
ID Numbers
- PERSONAL_ID: SSN, national IDs, subscriber numbers
- PASSPORT: Passport numbers
- DRIVERLICENSE: Driver's license numbers
- IDCARD: ID card numbers
- SOCIALNUMBER: Social security numbers
Financial
- CREDIT_CARD_INFO: Credit card numbers
- BANKING_NUMBER: Bank account numbers
Security
- PASSWORD: Passwords and credentials
- SECURE_CREDENTIAL: API keys, tokens, private keys
Medical
- MEDICAL_CONDITION: Diagnoses, treatments, health information
Location
- NATIONALITY: Country of origin/citizenship
- GEOCOORD: GPS coordinates
Organization
- ORGANIZATION_NAME: Company/organization names
- BUILDING: Building names/numbers
Other
- DOMAIN_NAME: Internet domains
- RELIGIOUS_AFFILIATION: Religious identifiers
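For programmatic validation of model output, the full tag set above can be mirrored as a Python constant; a minimal sketch (the name PII_CATEGORIES is our own, not part of the model's API):

# All 30 category labels, copied from the list above
PII_CATEGORIES = {
    "PERSON_NAME", "USERNAME", "AGE", "GENDER", "DEMOGRAPHIC_GROUP",
    "EMAIL_ADDRESS", "PHONE_NUMBER", "STREET_ADDRESS", "CITY", "STATE",
    "POSTCODE", "COUNTRY", "DATE", "DATE_OF_BIRTH", "PERSONAL_ID",
    "PASSPORT", "DRIVERLICENSE", "IDCARD", "SOCIALNUMBER",
    "CREDIT_CARD_INFO", "BANKING_NUMBER", "PASSWORD", "SECURE_CREDENTIAL",
    "MEDICAL_CONDITION", "NATIONALITY", "GEOCOORD", "ORGANIZATION_NAME",
    "BUILDING", "DOMAIN_NAME", "RELIGIOUS_AFFILIATION",
}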
🚀 Quick Start
Installation
pip install transformers torch
Basic Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "coolAI/sentinel-pii-redaction",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("coolAI/sentinel-pii-redaction")

# Prepare input text
text = "My name is John Smith and my email is [email protected]. I live at 123 Main St, New York, NY 10001."

# Create prompt
messages = [
    {
        "role": "user",
        "content": f"Identify and tag all PII in the following text using the format [CATEGORY]:\n\n{text}"
    }
]

# Tokenize
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Generate
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=512,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode only the newly generated tokens
input_length = inputs.size(1)
generated_ids = outputs[0][input_length:]
response = tokenizer.decode(generated_ids, skip_special_tokens=True)
print(response)
Expected Output:
My name is [PERSON_NAME] and my email is [EMAIL_ADDRESS]. I live at [STREET_ADDRESS], [CITY], [STATE] [POSTCODE].
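To post-process the tagged output programmatically, the bracketed category tags can be pulled out with a regular expression; a small sketch building on the response variable from Basic Usage:

import re

# Collect the distinct PII tags the model emitted, relying on the
# [CATEGORY] output format shown above
tags = sorted(set(re.findall(r"\[([A-Z_]+)\]", response)))
print(tags)  # e.g. ['CITY', 'EMAIL_ADDRESS', 'PERSON_NAME', ...]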
📊 Performance Metrics
Evaluated on the AI4Privacy PII-masking-300k dataset:
Category-Specific Recall Rates
| Category | Recall | Description |
|---|---|---|
| Critical PII | ||
| PERSONAL_ID | 98.5% | SSN, national IDs |
| DATE_OF_BIRTH | 98.2% | Birth dates |
| CREDIT_CARD_INFO | 97.8% | Credit card numbers |
| PASSWORD | 96.9% | Passwords |
| Identity | ||
| PERSON_NAME | 95.4% | Personal names |
| EMAIL_ADDRESS | 97.2% | Email addresses |
| PHONE_NUMBER | 96.5% | Phone numbers |
| USERNAME | 94.8% | User identifiers |
| Location | ||
| STREET_ADDRESS | 96.5% | Physical addresses |
| POSTCODE | 99.3% | ZIP/postal codes |
| CITY | 97.6% | City names |
| COUNTRY | 96.1% | Country names |
| Medical | ||
| MEDICAL_CONDITION | 93.2% | Health information |
| Organization | ||
| ORGANIZATION_NAME | 94.7% | Company names |
Note: Actual performance may vary based on text format and context.
💡 Use Cases
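The snippets in this section assume two small helpers, redact_pii and detect_pii, that wrap the Quick Start generation code. These names are our own, not a published API; a minimal sketch, assuming model and tokenizer are loaded as shown above:

import re
import torch

def redact_pii(text):
    # Same prompt and generation settings as in Basic Usage
    messages = [{
        "role": "user",
        "content": f"Identify and tag all PII in the following text using the format [CATEGORY]:\n\n{text}"
    }]
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            inputs, max_new_tokens=512, do_sample=False,
            pad_token_id=tokenizer.eos_token_id
        )
    return tokenizer.decode(outputs[0][inputs.size(1):], skip_special_tokens=True)

def detect_pii(text):
    # Map each tag found in the redacted text to its occurrence count
    counts = {}
    for tag in re.findall(r"\[([A-Z_]+)\]", redact_pii(text)):
        counts[tag] = counts.get(tag, 0) + 1
    return counts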
1. Data Sanitization for ML Training
Remove PII from datasets before fine-tuning language models:
def sanitize_training_data(texts):
    sanitized = []
    for text in texts:
        redacted = redact_pii(text)
        sanitized.append(redacted)
    return sanitized

# Use for safe model training
clean_data = sanitize_training_data(user_generated_content)
2. Compliance & Auditing
Ensure GDPR, HIPAA, and CCPA compliance:
def audit_document(document):
    pii_found = detect_pii(document)
    return {
        "has_pii": len(pii_found) > 0,
        "pii_types": list(pii_found.keys()),
        "redacted_version": redact_pii(document)
    }
3. Privacy Protection in Logs
Sanitize application logs before storage or analysis:
def safe_logging(log_entry):
    return redact_pii(log_entry)

logger.info(safe_logging(user_action))
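The same idea can be wired into the standard library's logging machinery so every record is sanitized automatically; a sketch using logging.Filter, under the same redact_pii assumption (note that running model inference per log record is expensive and best suited to low-volume or offline pipelines):

import logging

class PIIRedactionFilter(logging.Filter):
    # Rewrites each record's message through redact_pii before it is emitted
    def filter(self, record):
        record.msg = redact_pii(record.getMessage())
        record.args = ()  # args are already folded into msg by getMessage()
        return True

logger = logging.getLogger("app")
logger.addFilter(PIIRedactionFilter())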
🔧 Advanced Usage
With Custom PII Categories
Guide the model by specifying which PII categories to focus on:
categories = """
PII Categories to identify:
- PERSON_NAME: Names of people
- EMAIL_ADDRESS: Email addresses
- PHONE_NUMBER: Phone numbers
- MEDICAL_CONDITION: Health information
- PERSONAL_ID: ID numbers (SSN, passport, etc.)
"""
messages = [
    {
        "role": "user",
        "content": f"{categories}\n\nIdentify and tag all PII in the following text using the format [CATEGORY]:\n\n{text}"
    }
]
Batch Processing
Process multiple texts efficiently:
def batch_redact(texts, batch_size=8):
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # Redact each text in the batch (see redact_pii above)
        batch_results = [redact_pii(t) for t in batch]
        results.extend(batch_results)
    return results
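For genuinely batched generation rather than a per-text loop, decoder-only models need left padding; a sketch assuming the Quick Start model and tokenizer (batch_redact_padded is our own name):

# Decoder-only models must be left-padded for batched generation
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def batch_redact_padded(texts):
    prompts = [
        tokenizer.apply_chat_template(
            [{"role": "user",
              "content": f"Identify and tag all PII in the following text using the format [CATEGORY]:\n\n{t}"}],
            tokenize=False, add_generation_prompt=True
        )
        for t in texts
    ]
    # The chat template already adds special tokens, so skip them here
    enc = tokenizer(prompts, padding=True, add_special_tokens=False,
                    return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**enc, max_new_tokens=512, do_sample=False,
                             pad_token_id=tokenizer.pad_token_id)
    new_tokens = out[:, enc["input_ids"].size(1):]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)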
📝 Training Details
Training Data
- AI4Privacy PII-masking-300k: 1,000 examples
  - Large-scale, diverse PII examples
  - Multiple languages and jurisdictions
  - Human-validated accuracy
- Synthetic Data: 500 examples
  - Generated using the Faker library
  - Edge cases and rare PII types
  - Balanced category representation
- Total: 1,500 training examples
Training Configuration
Base Model: IBM Granite 4.0 Micro (3.2B parameters)
Method: LoRA (Low-Rank Adaptation)
Trainable Parameters: 38.4M (1.19% of total)
Training Hardware: NVIDIA L4 GPU
Training Time: ~7 minutes
Epochs: 1
Batch Size: 8 (2 × 4 gradient accumulation)
Learning Rate: 2e-4
Optimizer: AdamW 8-bit
Final Loss: 0.015-0.038
Training Framework
- Unsloth: For efficient fine-tuning
- Transformers: Model architecture
- PEFT: LoRA implementation
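For readers who want to reproduce a similar setup, the reported hyperparameters translate roughly into the following PEFT/Transformers configuration. The LoRA rank, alpha, and target modules below are assumptions (not published); only the batch size, gradient accumulation, epochs, learning rate, and 8-bit AdamW come from the table above:

from peft import LoraConfig
from transformers import TrainingArguments

# Assumed LoRA settings; the actual rank/alpha/targets are not published
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.0, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Values reported in Training Configuration
args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=2,   # 2 x 4 accumulation = effective batch 8
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    optim="adamw_bnb_8bit",          # AdamW 8-bit
)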
Privacy & Security
Privacy Features
- Local Inference: Runs entirely on your infrastructure
- No Data Sharing: No data sent to external APIs or services
- Open Source: Full transparency in model architecture and training
- Customizable: Can be further fine-tuned on your specific data
- Offline Capable: Works without internet connection
Security Considerations
- Model detects but doesn't store PII
- Inference happens in-memory
- No logging of input/output by default
- Can be deployed in air-gapped environments
- Supports encrypted storage of model weights
📄 License
This model is released under the Apache 2.0 license. You are free to:
- Use commercially
- Modify and distribute
- Use privately
- Rely on the license's express patent grant
🙏 Acknowledgments
- Built on IBM Granite 4.0 architecture
- Trained using AI4Privacy PII-masking-300k dataset
- Powered by Unsloth for efficient training
- Thanks to the open-source ML community
📚 Citation
If you use this model in your research or applications, please cite:
@misc{sentinel-pii-redaction-2025,
  author = {coolAI},
  title = {Sentinel PII Redaction: High-Accuracy Local PII Detection},
  year = {2025},
  publisher = {HuggingFace},
  journal = {HuggingFace Model Hub},
  howpublished = {\url{https://huggingface.co/coolAI/sentinel-pii-redaction}}
}
Built with ❤️ for privacy-conscious AI development