AITextDetector / TRAINING_GUIDE.md

ChauHPham

🚀 Training Guide

Problem

The mutex lock error `[mutex.cc : 452] RAW: Lock blocking...` happens because:

The Hugging Face Trainer API tries to use multiprocessing
macOS doesn't handle multiprocessing from the tokenizers library well
Environment variables alone aren't enough to fix it completely

Solution

✅ BEST: Use the Simple Training Script (Recommended)

The simple training script avoids the Trainer API entirely:

python scripts/run_train_simple.py

What it does:

✅ No multiprocessing
✅ No threading issues
✅ Direct PyTorch training loop
✅ Works on macOS
✅ Same results as the Trainer API

Output:

Trains for 2 epochs
Shows progress with tqdm
Saves model to models/ai_detector
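
To illustrate what a direct PyTorch training loop means here, the following is a minimal sketch and not the contents of `run_train_simple.py`. The function name `train_simple`, the stand-in linear model, the random data, and the hyperparameters are all assumptions made for illustration. The relevant detail is `num_workers=0`, which keeps data loading in the main process so no worker processes are started.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train_simple(model, dataset, epochs=2, batch_size=16, lr=1e-3):
    """Train `model` in the main process only; return the mean loss per epoch."""
    # num_workers=0 loads batches in the main process (no worker processes).
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=0)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    history = []
    model.train()
    for epoch in range(epochs):
        total = 0.0
        for features, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()
            optimizer.step()
            total += loss.item()
        history.append(total / len(loader))
        print(f"epoch {epoch + 1}/{epochs} - mean loss {history[-1]:.4f}")
    return history


if __name__ == "__main__":
    torch.manual_seed(0)
    # Stand-in data (assumption): 64 random 8-dimensional vectors with binary labels.
    features = torch.randn(64, 8)
    labels = (features.sum(dim=1) > 0).long()
    model = nn.Linear(8, 2)  # placeholder for the real classifier
    train_simple(model, TensorDataset(features, labels))
```

In the real script the model would presumably be the `roberta-base` classifier and the loader would be wrapped in `tqdm` for the progress bar; both are omitted here to keep the sketch small.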

Alternative: Shell Script

bash train_macos.sh

This sets the required environment variables, then runs the simple script.
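
The contents of `train_macos.sh` are not shown in this guide, so the following is only a sketch of how the same idea could be done from Python. The variable list is an assumption: `TOKENIZERS_PARALLELISM`, `OMP_NUM_THREADS`, and `MKL_NUM_THREADS` are real settings commonly used to limit parallelism, but the shell script may set different ones. The helper name `apply_single_thread_env` is hypothetical.

```python
import os

# Assumed settings; the actual variables in train_macos.sh may differ.
SINGLE_THREAD_ENV = {
    "TOKENIZERS_PARALLELISM": "false",  # disable parallelism in the tokenizers library
    "OMP_NUM_THREADS": "1",             # limit OpenMP to one thread
    "MKL_NUM_THREADS": "1",             # limit Intel MKL to one thread
}


def apply_single_thread_env(overrides=None):
    """Set the variables in os.environ and return the mapping that was applied."""
    settings = {**SINGLE_THREAD_ENV, **(overrides or {})}
    os.environ.update(settings)
    return settings


if __name__ == "__main__":
    # Call this before importing torch or transformers so they see the values.
    for name, value in apply_single_thread_env().items():
        print(f"{name}={value}")
```

As noted in the Problem section, environment variables alone may not be enough, which is why the shell script also switches to the simple training script.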

If You Still Get Errors

Option 1: Reduce to Tiny Dataset

python scripts/sample_dataset.py data/ai_vs_human_text.csv data/tiny.csv -n 100
# Then edit configs/default.yaml:
#   data_path: data/tiny.csv
python scripts/run_train.py
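
For readers curious what the sampling step amounts to, here is a minimal sketch of drawing `n` random rows from a CSV while keeping the header. It is not the source of `scripts/sample_dataset.py`; the function name `sample_csv`, the fixed seed, and the assumption that the file has a single header row are all illustrative.

```python
import csv
import random


def sample_csv(src_path, dst_path, n=100, seed=0):
    """Copy the header plus up to n randomly chosen rows; return the row count written."""
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = list(reader)  # reads the whole file into memory; fine for a sketch
    sample = random.Random(seed).sample(rows, min(n, len(rows)))
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(sample)
    return len(sample)


if __name__ == "__main__":
    # Paths taken from the command above; they must exist for this to run.
    written = sample_csv("data/ai_vs_human_text.csv", "data/tiny.csv", n=100)
    print(f"Wrote {written} rows to data/tiny.csv")
```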

Option 2: Run Outside venv

# Exit your virtualenv
deactivate

# Install into your user site-packages (outside the virtualenv)
pip install --user -r requirements.txt

# Train
python scripts/run_train_simple.py

Option 3: Use Colab/Cloud

If nothing works locally, train on Google Colab (free GPU):

Upload your data to Google Drive
Use the Colab notebook template
Much faster training

Key Differences

Simple Script (`run_train_simple.py`)

✅ No Trainer API (no multiprocessing issues)
✅ Works on macOS
✅ Good for small-medium datasets
⚠️ Less efficient on large datasets

Standard Script (`run_train.py`)

Uses the Hugging Face Trainer API
✅ Optimized for large datasets
⚠️ Multiprocessing issues on macOS

Recommended Setup

Dataset: ✅ Downloaded (data/ai_vs_human_text.csv)
Config: ✅ Updated (configs/default.yaml)
Training: Use run_train_simple.py

Start Training

python scripts/run_train_simple.py

You should see output like:

```
🚀 Starting training (simple mode - no multiprocessing)

📖 Loading data from data/ai_vs_human_text.csv...
Loaded 1,000 samples
Distribution: {0: 493, 1: 507}
Train: 800 | Val: 200

🤖 Loading model: roberta-base...

📊 Creating datasets...

⚙️ Training for 2 epochs...
```
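
The sample log reports 1,000 rows split into 800 training and 200 validation examples, which corresponds to an 80/20 split. A minimal sketch of such a split is shown here; whether the script shuffles, stratifies, or uses a fixed seed is not stated in this guide, so the function name `split_train_val`, the seed, and the plain shuffled split are assumptions.

```python
import random
from collections import Counter


def split_train_val(labels, val_fraction=0.2, seed=42):
    """Return (train_indices, val_indices) from a shuffled split of the row indices."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    n_val = round(len(indices) * val_fraction)
    return indices[n_val:], indices[:n_val]


if __name__ == "__main__":
    # Synthetic labels with the same class counts as the sample log.
    labels = [0] * 493 + [1] * 507
    train_idx, val_idx = split_train_val(labels)
    print(f"Distribution: {dict(Counter(labels))}")
    print(f"Train: {len(train_idx)} | Val: {len(val_idx)}")
```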


Good luck! 🎉