
🚀 Training Guide

Problem

The mutex lock error [mutex.cc : 452] RAW: Lock blocking... happens because:

  1. The HuggingFace Trainer API tries to use multiprocessing
  2. macOS handles the multiprocessing triggered by tokenizers poorly
  3. Setting environment variables alone does not fix it completely

Solution

✅ BEST: Use the Simple Training Script (Recommended)

The simple training script avoids the Trainer API entirely:

python scripts/run_train_simple.py

What it does:

  • ✅ No multiprocessing
  • ✅ No threading issues
  • ✅ Direct PyTorch training loop
  • ✅ Works on macOS
  • ✅ Same results as Trainer API

Output:

  • Trains for 2 epochs
  • Shows progress with tqdm
  • Saves model to models/ai_detector
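
To illustrate what a "direct PyTorch training loop" without the Trainer API can look like, here is a minimal sketch. It is not the contents of run_train_simple.py: the function name train_simple, the toy linear model, and the random tensors are all placeholders assumed for illustration. The real script loads a transformer model and tokenized text instead.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train_simple(model, loader, epochs=2, lr=1e-2):
    """Plain single-process training loop; returns the mean loss per epoch."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    epoch_losses = []
    model.train()
    for _ in range(epochs):
        total, batches = 0.0, 0
        for features, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()
            optimizer.step()
            total += loss.item()
            batches += 1
        epoch_losses.append(total / batches)
    return epoch_losses


torch.manual_seed(0)
# Toy stand-in data (assumption): 64 samples, 8 features, binary labels.
features = torch.randn(64, 8)
labels = torch.randint(0, 2, (64,))
# num_workers=0 keeps data loading in the main process (no worker subprocesses).
loader = DataLoader(
    TensorDataset(features, labels), batch_size=16, shuffle=True, num_workers=0
)
toy_model = nn.Linear(8, 2)  # placeholder for the real classifier
losses = train_simple(toy_model, loader)
print(losses)
```

The point of the sketch is that everything runs in one process: the loop iterates over batches directly, and the DataLoader is told not to spawn workers.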

Alternative: Shell Script

bash train_macos.sh

This sets the required environment variables and then runs the simple script.
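
For readers who prefer to stay in Python, the same idea can be sketched as a small launcher. The contents of train_macos.sh are not shown in this guide, so the variables below are an assumption: TOKENIZERS_PARALLELISM, OMP_NUM_THREADS, and MKL_NUM_THREADS are commonly pinned in this situation, but the shell script may set a different list. The helper names build_training_env and run_simple_training are hypothetical.

```python
import os
import subprocess
import sys


def build_training_env(base_env=None):
    """Return a copy of the environment with parallelism-related variables pinned."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            # Assumed settings; the actual shell script may use a different list.
            "TOKENIZERS_PARALLELISM": "false",
            "OMP_NUM_THREADS": "1",
            "MKL_NUM_THREADS": "1",
        }
    )
    return env


def run_simple_training(script="scripts/run_train_simple.py"):
    """Launch the simple training script in a child process with the pinned environment."""
    return subprocess.run(
        [sys.executable, script], env=build_training_env(), check=True
    )
```

Calling run_simple_training() from the repository root would start training with those variables set before any library is imported in the child process.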

If You Still Get Errors

Option 1: Reduce to Tiny Dataset

python scripts/sample_dataset.py data/ai_vs_human_text.csv data/tiny.csv -n 100
# Then edit configs/default.yaml:
#   data_path: data/tiny.csv
python scripts/run_train.py
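
As a rough idea of what the sampling step does, the sketch below draws a random subset of rows from a CSV file while keeping the header. It is not the code of scripts/sample_dataset.py, which is not shown here; the function name sample_csv and the fixed seed are assumptions for illustration.

```python
import csv
import random


def sample_csv(src_path, dst_path, n, seed=42):
    """Write a random sample of up to n data rows to dst_path, keeping the header."""
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = list(reader)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = rng.sample(rows, min(n, len(rows)))
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(sampled)
    return len(sampled)
```

For example, sample_csv("data/ai_vs_human_text.csv", "data/tiny.csv", 100) would produce a 100-row file comparable to the one created by the command above.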

Option 2: Run Outside venv

# Exit your virtualenv
deactivate

# Install system-wide
pip install --user -r requirements.txt

# Train
python scripts/run_train_simple.py

Option 3: Use Colab/Cloud

If nothing works locally, train on Google Colab (free GPU):

  • Upload your data to Google Drive
  • Use the Colab notebook template
  • Much faster training

Key Differences

Simple Script (run_train_simple.py)

  • ✅ No Trainer API (no multiprocessing issues)
  • ✅ Works on macOS
  • ✅ Good for small-medium datasets
  • ⚠️ Less efficient on large datasets

Standard Script (run_train.py)

  • Uses HuggingFace Trainer API
  • ✅ Optimized for large datasets
  • ⚠️ Multiprocessing issues on macOS

Recommended Setup

  1. Dataset: ✅ Downloaded (data/ai_vs_human_text.csv)
  2. Config: ✅ Updated (configs/default.yaml)
  3. Training: Use run_train_simple.py

Start Training

python scripts/run_train_simple.py

You should see output like:

```
🚀 Starting training (simple mode - no multiprocessing)

📖 Loading data from data/ai_vs_human_text.csv...
Loaded 1,000 samples
Distribution: {0: 493, 1: 507}
Train: 800 | Val: 200

🤖 Loading model: roberta-base...

📊 Creating datasets...

⚙️ Training for 2 epochs...
```

Good luck! 🎉