Deepfake Voice Detector - SOTA Transfer Learning
Repository: https://huggingface.co/koyelog/deepfake-voice-detector-sota
Model card last updated: 2025-10-31
This repository contains a binary audio classification model trained to distinguish real (bonafide) human voice samples from fake (synthetic / deepfake) voice samples. The model uses transfer learning from facebook/wav2vec2-base with a lightweight temporal classifier (BiGRU + Multi-Head Attention) on top.
Model Details
- Model name: Deepfake Voice Detector - SOTA Transfer Learning
- HF repo: koyelog/deepfake-voice-detector-sota
- Task: Audio Classification (Binary: Real vs Fake)
- Base model: facebook/wav2vec2-base (feature extractor)
- Architecture (see the sketch after this list):
- Wav2Vec2 encoder (facebook/wav2vec2-base): pretrained speech feature extractor; CNN layers frozen
- Bidirectional GRU: 2 layers, 256 hidden units per direction (512 total)
- Multi-Head Attention: 8 heads, 512-dimensional embeddings
- Classification head:
- Linear(512 → 512) + ReLU + BatchNorm + Dropout(0.4)
- Linear(512 → 128) + ReLU + BatchNorm + Dropout(0.3)
- Linear(128 → 1) + Sigmoid
- Framework: PyTorch + Transformers
- Total parameters: ~98.5M
- Trainable parameters: ~98.5M
- Input: 4-second audio clip at 16 kHz (single-channel)
- Output: single probability in [0, 1] representing the likelihood of the "fake" class; default decision threshold 0.5 (predicted label 0 = Real, 1 = Fake)
- License: Apache-2.0
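The following is a minimal PyTorch sketch of the architecture described above, intended for orientation only. The class name, the mean pooling after attention, and the manual freezing of the CNN feature encoder are assumptions; the exact training-time implementation is not included in this artifact.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class DeepfakeVoiceClassifier(nn.Module):
    """Sketch: wav2vec2 encoder (frozen CNN) + 2-layer BiGRU + multi-head attention + MLP head."""

    def __init__(self):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        for p in self.encoder.feature_extractor.parameters():
            p.requires_grad = False  # freeze the convolutional feature-encoder layers

        self.gru = nn.GRU(input_size=768, hidden_size=256, num_layers=2,
                          batch_first=True, bidirectional=True)  # 256 per direction -> 512-dim outputs
        self.attention = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(512, 512), nn.ReLU(), nn.BatchNorm1d(512), nn.Dropout(0.4),
            nn.Linear(512, 128), nn.ReLU(), nn.BatchNorm1d(128), nn.Dropout(0.3),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, input_values):
        frames = self.encoder(input_values).last_hidden_state  # (B, T, 768) speech features
        seq, _ = self.gru(frames)                               # (B, T, 512)
        attended, _ = self.attention(seq, seq, seq)             # self-attention over time
        pooled = attended.mean(dim=1)                           # temporal mean pooling (assumed)
        return self.head(pooled).squeeze(-1)                    # probability of the "fake" class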
Training Procedure
- Training data: 822,166 audio samples aggregated from 19 datasets (listed below)
- Real/Bonafide: 387,422 samples (47.1%)
- Fake/Deepfake/Synthetic: 434,744 samples (52.9%)
- Dataset sources (combined): ASVspoof 2021, WaveFake, Audio-Deepfake, Fake-Real-Audio, Deepfake-Audio, Combined-Real-Voices, Scenefake, Gender-Balanced-Audio-Deepfake, Synthetic-Speech-Commands, and 10+ other Kaggle/academic datasets
- Data preprocessing:
- Resample to 16 kHz
- Fixed-length segments: 4 seconds (pad/truncate as required)
- Feature extraction: raw audio → wav2vec2 feature frames
- Data balancing & augmentation: dataset composition described above; standard augmentations (noise, speed perturbation) used where applicable
- Train / Val split:
- Training: 657,732 samples (80%)
- Validation: 164,434 samples (20%)
- Optimization (see the training sketch after this list):
- Optimizer: AdamW
- Learning rate: 5e-5
- Weight decay: 0.01
- Batch size: 24
- Gradient accumulation: 2 (effective batch size 48)
- Epochs: 20
- Scheduler: Cosine Annealing with Warm Restarts (T_0=5, T_mult=2)
- Loss: Binary Cross-Entropy (BCE)
- Mixed precision: supported where hardware permits
- Hardware: Tesla P100-PCIE-16GB
- Training time: ~16 hours on a single GPU (as reported)
- Random seed and reproducibility: users should set deterministic seeds for fully reproducible runs (not included in this artifact)
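As a rough guide, the optimization settings above translate to something like the following PyTorch sketch. The model, train_loader, and the per-epoch scheduler step are placeholders and assumptions, not the author's actual training script.

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

# model and train_loader are placeholders (see the architecture sketch above).
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=5, T_mult=2)
criterion = torch.nn.BCELoss()   # BCE on probabilities (the head ends in Sigmoid)
accum_steps = 2                  # gradient accumulation: effective batch size 24 * 2 = 48

for epoch in range(20):
    model.train()
    optimizer.zero_grad()
    for step, (input_values, labels) in enumerate(train_loader):
        probs = model(input_values)                          # (B,) fake-probabilities
        loss = criterion(probs, labels.float()) / accum_steps
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
    scheduler.step()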
Evaluation Results
The validation metrics reported for the final evaluation fall in the following ranges (a metric-computation sketch follows the notes below):
- Validation accuracy: 95%β97%
- Precision: ~0.95
- Recall: ~0.94
- F1-score: ~0.94
- AUC-ROC: ~0.96
Notes:
- These values represent evaluation on the combined held-out validation split described above. Performance will vary by dataset, language, recording conditions, and unseen manipulation techniques.
- Reported metrics are aggregated and averaged across the validation partition. Per-dataset metrics (e.g., ASVspoof vs WaveFake) will differ and are not included in this artifact.
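For reference, metrics of this kind can be recomputed from held-out predictions with scikit-learn as sketched below; y_true and y_prob are hypothetical placeholders for ground-truth labels and predicted fake-probabilities.

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# y_true: ground-truth labels (0 = real, 1 = fake); y_prob: predicted fake-probabilities.
# Both are placeholders for your own validation outputs.
y_true = np.asarray(y_true)
y_prob = np.asarray(y_prob)
y_pred = (y_prob >= 0.5).astype(int)   # default 0.5 decision threshold

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("auc-roc  :", roc_auc_score(y_true, y_prob))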
How to Use
Supported input: 4-second audio clip sampled at 16 kHz. Longer or shorter clips should be truncated or padded to 4 seconds before inference.
Example (minimal PyTorch + Transformers usage; loading the trained BiGRU+Attention classifier is left as a placeholder):

import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor

# model = ...  # load your checkpoint wrapped with the BiGRU+Attention classifier

# 1) Load audio and resample to 16 kHz
waveform, sr = torchaudio.load("example.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono; the model expects a single channel

# 2) Ensure 4 seconds of audio (pad or truncate)
target_len = 4 * 16000
if waveform.shape[1] < target_len:
    pad = target_len - waveform.shape[1]
    waveform = torch.nn.functional.pad(waveform, (0, pad))
else:
    waveform = waveform[:, :target_len]

# 3) Feature extraction (wav2vec2)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
input_values = feature_extractor(
    waveform.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt"
).input_values

# 4) Forward pass through the model
model.eval()
with torch.no_grad():
    output = model(input_values)  # scalar score per sample

# NOTE: if the classification head already ends in Sigmoid (as in the
# architecture above), `output` is already a probability and the extra
# sigmoid below should be skipped.
prob = torch.sigmoid(output).item()   # probability of the "fake" class
prediction = 1 if prob >= 0.5 else 0  # 0 = Real, 1 = Fake
confidence = prob if prediction == 1 else 1 - prob
Model outputs:
- logits / raw score (float)
- probability (sigmoid(logit)): float in [0,1]
- final label: 0 = Real (bonafide), 1 = Fake (deepfake/synthetic)
- confidence: probability of the predicted class
Labeling Strategy
- Label 0 → Real / Bonafide:
- Human voice recordings from authentic sources (keywords in metadata/filenames: bonafide, real, genuine, human, authentic, original)
- Label 1 → Fake / Deepfake:
- AI-generated, synthetic, or manipulated audio produced by text-to-speech, voice conversion, or other spoofing methods (keywords: spoof, fake, deepfake, synthetic, generated, ai)
Labels were assigned using dataset-provider metadata and cross-checked against dataset documentation (a sketch of the keyword mapping follows below). Users applying new datasets should ensure consistent labeling and metadata mapping.
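A minimal sketch of this keyword-based mapping is shown below. The exact keyword lists and conflict-resolution rules used during dataset assembly are not part of this artifact, so treat the function and the example filename as illustrative only.

from typing import Optional

REAL_KEYWORDS = {"bonafide", "real", "genuine", "human", "authentic", "original"}
FAKE_KEYWORDS = {"spoof", "fake", "deepfake", "synthetic", "generated", "ai"}

def label_from_metadata(text: str) -> Optional[int]:
    """Map a filename or metadata string to 0 (Real) or 1 (Fake).

    Returns None when no keyword matches, so ambiguous samples can be
    reviewed manually instead of being silently mislabeled.
    """
    tokens = set(text.lower().replace("-", "_").replace(".", "_").split("_"))
    if tokens & FAKE_KEYWORDS:
        return 1
    if tokens & REAL_KEYWORDS:
        return 0
    return None

# Example (hypothetical filename): label_from_metadata("LA_T_1000137_spoof.flac") -> 1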
Limitations and Biases
- Clip length sensitivity: The model is optimized for 4-second clips; performance on markedly shorter/longer clips may degrade.
- Language & accent coverage: Although trained on many datasets and multi-language samples, underrepresented languages/accents in the training corpora can cause degraded performance.
- Dataset composition: Slight skew towards fake samples (52.9% fake vs 47.1% real), which may increase false positives in certain deployment scenarios.
- Novel attacks: Not evaluated on zero-shot or post-2025 deepfake generation techniques; performance against new generator families is unknown.
- Environmental factors: Recording quality, background noise, channel effects, and codecs may affect predictions.
- Ethical risk: Incorrect or automated use of the model can cause reputational or legal harm. Model outputs should not be used as sole evidence.
Ethical Considerations
- This model is intended as an assistive tool for verification and detection workflows. Human oversight is essential for any high-stakes decisions.
- Avoid using the model to definitively accuse individuals of wrongdoing without corroborating evidence.
- Respect privacy and legal restrictions when processing audio data.
- Be transparent about limitations, false positive/negative rates, and the potential for demographic biases.
Hardware & Inference Requirements
- Recommended: GPU with CUDA support for fast inference (e.g., NVIDIA GPUs). CPU inference possible but slower.
- Approximate memory: ~2 GB GPU VRAM for single-sample inference (depends on implementation).
- Batch inference is recommended for throughput (see the sketch below).
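A batched-inference sketch under the same assumptions as the usage example above: model loading is a placeholder, clips are already cut or padded to 4 s at 16 kHz, and the batch size is arbitrary. As noted earlier, skip the sigmoid if your checkpoint already outputs probabilities.

import torch
from transformers import Wav2Vec2FeatureExtractor

device = "cuda" if torch.cuda.is_available() else "cpu"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
# model = ...  # trained BiGRU+Attention classifier, as in the usage example above
model.to(device).eval()

def predict_batch(clips, batch_size=32):
    """clips: list of 1-D numpy arrays, each 4 s at 16 kHz. Returns fake-probabilities."""
    probs = []
    with torch.no_grad():
        for i in range(0, len(clips), batch_size):
            inputs = feature_extractor(clips[i:i + batch_size], sampling_rate=16000,
                                       return_tensors="pt", padding=True)
            scores = model(inputs.input_values.to(device))  # (B,) scores; see sigmoid note above
            probs.extend(scores.cpu().tolist())
    return probs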
Caveats & Reproducibility
- This model card documents a trained model artifact. If you require retraining or further experiments, use the architecture specification above and provide deterministic seeds and full data provenance to replicate results (see the seed-setting sketch below).
- Exact training scripts, hyperparameter sweep logs, and raw dataset bundles are not included in this model artifact. Users should exercise caution when assuming identical performance in other contexts.
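For anyone attempting to reproduce training, a typical seed-setting helper is sketched below. The seed value is arbitrary, and full determinism on GPU may additionally require disabling nondeterministic kernels in your own pipeline.

import os
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch (CPU and CUDA) for repeatable runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.backends.cudnn.deterministic = True   # trade some speed for determinism
    torch.backends.cudnn.benchmark = False

set_seed(42)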
Citation
If you use this model, please cite:
@misc{deepfake-voice-detector-sota,
  author    = {koyelog},
  title     = {Deepfake Voice Detector - Transfer Learning with Wav2Vec2},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/koyelog/deepfake-voice-detector-sota}
}
Contact
Model owner: koyelog
Model hub: https://huggingface.co/koyelog/deepfake-voice-detector-sota