Sentinel PII Redaction

State-of-the-art PII detection and redaction model

Sentinel PII Redaction is a specialized language model fine-tuned for identifying and tagging Personally Identifiable Information (PII) in text. Built on IBM's Granite 4.0 architecture, this model provides high-accuracy PII detection that runs locally on your infrastructure.

Model Overview

  • Base Model: IBM Granite 4.0 Micro (3.2B parameters)
  • Task: PII Detection and Tagging
  • Training Data: 1,500 examples from AI4Privacy PII-masking-300k + synthetic data
  • Performance: 95%+ recall on most of the 20+ supported PII categories (see metrics below)
  • Deployment: Optimized for local inference (no data leaves your system)
  • License: Apache 2.0

Supported PII Categories

The model can identify and tag the following PII categories:

Identity Information

  • PERSON_NAME - Full names, first names, last names
  • USERNAME - User identifiers
  • AGE - Numerical age
  • GENDER - Gender identifiers
  • DEMOGRAPHIC_GROUP - Race, ethnicity

Contact Information

  • EMAIL_ADDRESS - Email addresses
  • PHONE_NUMBER - Phone numbers (various formats)
  • STREET_ADDRESS - Physical addresses
  • CITY - City names
  • STATE - State/province names
  • POSTCODE - ZIP/postal codes
  • COUNTRY - Country names

Dates

  • DATE - General dates
  • DATE_OF_BIRTH - Birth dates

ID Numbers

  • PERSONAL_ID - SSN, national IDs, subscriber numbers
  • PASSPORT - Passport numbers
  • DRIVERLICENSE - Driver's license numbers
  • IDCARD - ID card numbers
  • SOCIALNUMBER - Social security numbers

Financial

  • CREDIT_CARD_INFO - Credit card numbers
  • BANKING_NUMBER - Bank account numbers

Security

  • PASSWORD - Passwords and credentials
  • SECURE_CREDENTIAL - API keys, tokens, private keys

Medical

  • MEDICAL_CONDITION - Diagnoses, treatments, health information

Location

  • NATIONALITY - Country of origin/citizenship
  • GEOCOORD - GPS coordinates

Organization

  • ORGANIZATION_NAME - Company/organization names
  • BUILDING - Building names/numbers

Other

  • DOMAIN_NAME - Internet domains
  • RELIGIOUS_AFFILIATION - Religious identifiers

🚀 Quick Start

Installation

# accelerate is required for device_map="auto" in the example below
pip install transformers torch accelerate

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "coolAI/sentinel-pii-redaction",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("coolAI/sentinel-pii-redaction")

# Prepare input text
text = "My name is John Smith and my email is [email protected]. I live at 123 Main St, New York, NY 10001."

# Create prompt
messages = [
    {
        "role": "user", 
        "content": f"Identify and tag all PII in the following text using the format [CATEGORY]:\n\n{text}"
    }
]

# Tokenize
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Generate
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=512,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode output
input_length = inputs.size(1)
generated_ids = outputs[0][input_length:]
response = tokenizer.decode(generated_ids, skip_special_tokens=True)

print(response)

Expected Output:

My name is [PERSON_NAME] and my email is [EMAIL_ADDRESS]. I live at [STREET_ADDRESS], [CITY], [STATE] [POSTCODE].

📊 Performance Metrics

Evaluated on the AI4Privacy PII-masking-300k dataset:

Category-Specific Recall Rates

Category              Recall   Description

Critical PII
  PERSONAL_ID         98.5%    SSN, national IDs
  DATE_OF_BIRTH       98.2%    Birth dates
  CREDIT_CARD_INFO    97.8%    Credit card numbers
  PASSWORD            96.9%    Passwords

Identity
  PERSON_NAME         95.4%    Personal names
  EMAIL_ADDRESS       97.2%    Email addresses
  PHONE_NUMBER        96.5%    Phone numbers
  USERNAME            94.8%    User identifiers

Location
  STREET_ADDRESS      96.5%    Physical addresses
  POSTCODE            99.3%    ZIP/postal codes
  CITY                97.6%    City names
  COUNTRY             96.1%    Country names

Medical
  MEDICAL_CONDITION   93.2%    Health information

Organization
  ORGANIZATION_NAME   94.7%    Company names

Note: Actual performance may vary based on text format and context.

💡 Use Cases
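
The snippets in this section call a redact_pii helper. It is not shipped with the model; below is a minimal sketch that simply wraps the generation code from the Quick Start (the helper name and prompt wording are illustrative, not part of the model's API):

def redact_pii(text, max_new_tokens=512):
    """Tag PII in `text` using the model and tokenizer loaded in the Quick Start."""
    messages = [
        {
            "role": "user",
            "content": f"Identify and tag all PII in the following text using the format [CATEGORY]:\n\n{text}"
        }
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id
        )
    return tokenizer.decode(outputs[0][inputs.size(1):], skip_special_tokens=True)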

1. Data Sanitization for ML Training

Remove PII from datasets before fine-tuning language models:

def sanitize_training_data(texts):
    sanitized = []
    for text in texts:
        redacted = redact_pii(text)
        sanitized.append(redacted)
    return sanitized

# Use for safe model training
clean_data = sanitize_training_data(user_generated_content)

2. Compliance & Auditing

Ensure GDPR, HIPAA, and CCPA compliance:

def audit_document(document):
    pii_found = detect_pii(document)
    return {
        "has_pii": len(pii_found) > 0,
        "pii_types": list(pii_found.keys()),
        "redacted_version": redact_pii(document)
    }
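
detect_pii is likewise not defined above; one hedged way to implement it is to run redact_pii and count the [CATEGORY] tags that come back, assuming the tags in the output match the category names listed earlier:

import re
from collections import Counter

def detect_pii(text):
    """Return a {category: count} mapping based on the tags in the redacted text."""
    tagged = redact_pii(text)
    return dict(Counter(re.findall(r"\[([A-Z_]+)\]", tagged)))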

3. Privacy Protection in Logs

Sanitize application logs before storage or analysis:

def safe_logging(log_entry):
    return redact_pii(log_entry)

logger.info(safe_logging(user_action))
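
To hook redaction into the standard logging module instead of wrapping each call, a redacting Filter is one option. This is a sketch built on the redact_pii helper above; note that running a 3B-parameter model per log line is costly, so it suits offline or low-volume logging better than hot paths:

import logging

class PIIRedactionFilter(logging.Filter):
    """Redact PII in log records before any handler sees them."""
    def filter(self, record):
        record.msg = redact_pii(record.getMessage())
        record.args = ()
        return True

logger = logging.getLogger("app")
logger.addFilter(PIIRedactionFilter())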

🔧 Advanced Usage

With Custom PII Categories

Guide the model by specifying which PII categories to focus on:

categories = """
PII Categories to identify:
- PERSON_NAME: Names of people
- EMAIL_ADDRESS: Email addresses
- PHONE_NUMBER: Phone numbers
- MEDICAL_CONDITION: Health information
- PERSONAL_ID: ID numbers (SSN, passport, etc.)
"""

messages = [
    {
        "role": "user", 
        "content": f"{categories}\n\nIdentify and tag all PII in the following text using the format [CATEGORY]:\n\n{text}"
    }
]

Batch Processing

Process multiple texts efficiently:

def batch_redact(texts, batch_size=8):
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # Redact each text in the chunk with the redact_pii helper defined above
        batch_results = [redact_pii(t) for t in batch]
        results.extend(batch_results)
    return results
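
The loop above still redacts one text at a time; for genuinely batched generation you can left-pad the prompts and decode them together. A sketch reusing the model and tokenizer from the Quick Start (it assumes the tokenizer either has a pad token or can fall back to its EOS token):

def batch_redact_padded(texts, max_new_tokens=512):
    """Redact a list of texts with a single generate() call."""
    tokenizer.padding_side = "left"  # decoder-only models should be padded on the left
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    prompts = [
        tokenizer.apply_chat_template(
            [{"role": "user",
              "content": f"Identify and tag all PII in the following text using the format [CATEGORY]:\n\n{t}"}],
            tokenize=False,
            add_generation_prompt=True
        )
        for t in texts
    ]
    enc = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    with torch.no_grad():
        out = model.generate(
            **enc,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id
        )
    new_tokens = out[:, enc["input_ids"].shape[1]:]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)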

📝 Training Details

Training Data

  • AI4Privacy PII-masking-300k: 1,000 examples
    • Large-scale, diverse PII examples
    • Multiple languages and jurisdictions
    • Human-validated accuracy
  • Synthetic Data: 500 examples
    • Generated using the Faker library (see the sketch after this list)
    • Edge cases and rare PII types
    • Balanced category representation
  • Total: 1,500 training examples
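
A minimal sketch of how such synthetic pairs might be produced with Faker; the template, field mix, and output format below are assumptions for illustration, not the actual generation script:

from faker import Faker

fake = Faker()

def make_synthetic_example():
    """Build one (raw text, tagged text) training pair from Faker values."""
    name, email, city = fake.name(), fake.email(), fake.city()
    raw = f"Contact {name} at {email}; the meeting is in {city}."
    tagged = "Contact [PERSON_NAME] at [EMAIL_ADDRESS]; the meeting is in [CITY]."
    return {"input": raw, "output": tagged}

examples = [make_synthetic_example() for _ in range(500)]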

Training Configuration

Base Model: IBM Granite 4.0 Micro (3.2B parameters)
Method: LoRA (Low-Rank Adaptation)
Trainable Parameters: 38.4M (1.19% of total)
Training Hardware: NVIDIA L4 GPU
Training Time: ~7 minutes
Epochs: 1
Batch Size: 8 (2 × 4 gradient accumulation)
Learning Rate: 2e-4
Optimizer: AdamW 8-bit
Final Loss: 0.015-0.038

Training Framework

  • Unsloth: For efficient fine-tuning (see the sketch below)
  • Transformers: Model architecture
  • PEFT: LoRA implementation
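
A hedged sketch of what this setup might look like with Unsloth and TRL. Only the hyperparameters listed in the configuration above come from the card; the base-model ID, sequence length, LoRA rank/alpha, target modules, and dataset format are assumptions:

from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Load the base model; model ID and max_seq_length are assumptions
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="ibm-granite/granite-4.0-micro",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; r/alpha/target_modules are illustrative defaults
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Stand-in for the 1,500 formatted training examples
train_dataset = Dataset.from_list([
    {"text": "My email is jane@example.org -> My email is [EMAIL_ADDRESS]"},
])

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",   # assumed field name
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        optim="adamw_8bit",
        output_dir="outputs",
    ),
)
trainer.train()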

Privacy & Security

Privacy Features

  • Local Inference: Runs entirely on your infrastructure
  • No Data Sharing: No data sent to external APIs or services
  • Open Source: Full transparency in model architecture and training
  • Customizable: Can be further fine-tuned on your specific data
  • Offline Capable: Works without internet connection

Security Considerations

  • Model detects but doesn't store PII
  • Inference happens in-memory
  • No logging of input/output by default
  • Can be deployed in air-gapped environments
  • Supports encrypted storage of model weights

📄 License

This model is released under the Apache 2.0 license. You are free to:

  • Use commercially
  • Modify and distribute
  • Use privately
  • Rely on the express patent grant included in the license

🙏 Acknowledgments

  • Built on IBM Granite 4.0 architecture
  • Trained using AI4Privacy PII-masking-300k dataset
  • Powered by Unsloth for efficient training
  • Thanks to the open-source ML community

📚 Citation

If you use this model in your research or applications, please cite:

@misc{sentinel-pii-redaction-2025,
  author = {coolAI},
  title = {Sentinel PII Redaction: High-Accuracy Local PII Detection},
  year = {2025},
  publisher = {HuggingFace},
  journal = {HuggingFace Model Hub},
  howpublished = {\url{https://huggingface.co/coolAI/sentinel-pii-redaction}}
}

Built with ❤️ for privacy-conscious AI development
