
🚀 Quick Start: Download Dataset

✅ Script Works! (Tested Successfully)

The download script has been tested and works. There are four ways to use it:


Method 1: Use the Script (Easiest) ⭐

# Download the default dataset
python scripts/download_kagglehub.py

# Or specify a different dataset
python scripts/download_kagglehub.py --dataset shamimhasan8/ai-vs-human-text-dataset

Output: Dataset saved to data/ai_vs_human_text.csv


Method 2: Direct in Your Code (Simple)

Just copy-paste this into your Python script:

import kagglehub
import pandas as pd
from pathlib import Path

# Download dataset (no API token needed for public datasets)
path = kagglehub.dataset_download("shamimhasan8/ai-vs-human-text-dataset")
print("Path to dataset files:", path)

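# Load the first CSV found in the download directory
csv_files = sorted(Path(path).glob("*.csv"))
if not csv_files:
    raise FileNotFoundError(f"No CSV files found in {path}")
df = pd.read_csv(csv_files[0])

# Save to your data directory (create it if it does not exist)
out_path = Path("data/dataset.csv")
out_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(out_path, index=False)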

See: examples/simple_download.py for a complete example


Method 3: Use the Integrated Function

from ai_text_detector.download_data import download_ai_vs_human_dataset

# Download and get the path
csv_path = download_ai_vs_human_dataset()
print(f"Dataset at: {csv_path}")

# Now use it in your training
from ai_text_detector.config import load_config
cfg = load_config("configs/default.yaml")
cfg.data_path = csv_path

See: examples/download_and_train.py for a complete training example


Method 4: Download Any Dataset

from ai_text_detector.download_data import download_kaggle_dataset

# Download any Kaggle dataset
csv_path = download_kaggle_dataset(
    "shamimhasan8/ai-vs-human-text-dataset",
    output_path="data/my_dataset.csv"
)

📊 What Was Downloaded

  • Dataset: shamimhasan8/ai-vs-human-text-dataset
  • Size: 1,000 samples
  • Columns: id, text, label, prompt, model, date
  • Labels: "AI-generated" or "Human-written"
  • Saved to: data/ai_vs_human_text.csv
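
Since the labels listed above are strings, most training code will need them as integers. The following is a minimal sketch of one way to do that, not the project's own preprocessing: the `LABEL_TO_ID` mapping, the `encode_labels` helper, and the `label_id` column name are assumptions made for illustration, and the 0/1 convention is arbitrary.

```python
import pandas as pd

# Assumed mapping: the 0/1 convention is illustrative, not taken from the project.
LABEL_TO_ID = {"Human-written": 0, "AI-generated": 1}


def encode_labels(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Return a copy of df with an extra integer `label_id` column."""
    unknown = set(df[label_col].unique()) - set(LABEL_TO_ID)
    if unknown:
        # Fail early instead of silently producing NaN ids
        raise ValueError(f"Unexpected label values: {sorted(map(str, unknown))}")
    out = df.copy()
    out["label_id"] = out[label_col].map(LABEL_TO_ID).astype(int)
    return out


# Tiny in-memory stand-in for the real CSV
sample = pd.DataFrame(
    {
        "text": ["An essay written by a person.", "Text produced by a model."],
        "label": ["Human-written", "AI-generated"],
    }
)
encoded = encode_labels(sample)
print(encoded[["label", "label_id"]])
```

With the real file, `encode_labels(pd.read_csv("data/ai_vs_human_text.csv"))` would apply the same mapping, provided the CSV contains only the two label values listed above.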

🎯 Next Steps

  1. Dataset is ready! It's at data/ai_vs_human_text.csv
  2. Config updated! configs/default.yaml already points to it
  3. Train your model:
    python scripts/run_train.py
    

💡 Tips

  • Small dataset (1k samples): Good for quick testing
  • Want more data? Look for larger datasets on Kaggle
  • Already downloaded? The script won't re-download (uses cache)
  • No API token needed! kagglehub downloads public datasets without authentication
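
The caching tip above concerns the download itself. If you also copy the CSV into your own data directory, a similar "skip if present" guard can be applied there. This is a rough sketch under stated assumptions: `ensure_dataset` and the `download` callable are hypothetical names, not part of this project or of kagglehub, and the demo uses a fake downloader so it runs without network access.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_dataset(output_path: str, download: Callable[[], str]) -> Path:
    """Return output_path, calling download() only if the file is missing.

    `download` is a hypothetical callable that fetches the data and
    returns the path of the CSV it produced.
    """
    target = Path(output_path)
    if target.exists():
        return target  # already on disk, skip the download
    source = Path(download())
    target.parent.mkdir(parents=True, exist_ok=True)
    if source.resolve() != target.resolve():
        target.write_bytes(source.read_bytes())
    return target


# Demo with a fake downloader that records how often it is called
with tempfile.TemporaryDirectory() as tmp:
    calls = []

    def fake_download() -> str:
        calls.append(1)
        src = Path(tmp) / "cache" / "raw.csv"
        src.parent.mkdir(parents=True, exist_ok=True)
        src.write_text("id,text,label\n1,hello,Human-written\n")
        return str(src)

    out = Path(tmp) / "data" / "dataset.csv"
    ensure_dataset(str(out), fake_download)
    ensure_dataset(str(out), fake_download)  # second call finds the file
    print(f"download() ran {len(calls)} time(s)")
```

In real use, the `download` argument could be any function that wraps one of the download methods above and returns the CSV path as a string.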

πŸ” Verify It Works

# Check the first lines of the dataset from the shell
head -5 data/ai_vs_human_text.csv

# Or in Python
import pandas as pd
df = pd.read_csv("data/ai_vs_human_text.csv")
print(f"Rows: {len(df):,}")
print(df.head())
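
Beyond eyeballing the first rows, the columns and labels listed under "What Was Downloaded" can be checked programmatically. The sketch below is illustrative only: `check_dataset`, `EXPECTED_COLUMNS`, and `EXPECTED_LABELS` are hypothetical names introduced here, and the expected values are copied from that list rather than from the project code. It requires Python 3.9 or later for the `list[str]` annotation.

```python
import pandas as pd

# Taken from the "What Was Downloaded" list; adjust if the dataset changes.
EXPECTED_COLUMNS = {"id", "text", "label", "prompt", "model", "date"}
EXPECTED_LABELS = {"AI-generated", "Human-written"}


def check_dataset(df: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means no problems."""
    problems = []
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if "label" in df.columns:
        unexpected = set(df["label"].dropna().unique()) - EXPECTED_LABELS
        if unexpected:
            problems.append(f"unexpected labels: {sorted(map(str, unexpected))}")
    if "text" in df.columns and df["text"].isna().any():
        problems.append("some rows have missing text")
    return problems


# In-memory examples: one well-formed frame and one with two problems
good = pd.DataFrame(
    {
        "id": [1],
        "text": ["hi"],
        "label": ["Human-written"],
        "prompt": [""],
        "model": [""],
        "date": ["2024-01-01"],
    }
)
bad = good.drop(columns=["model"]).assign(label="robot")

print(check_dataset(good))
print(check_dataset(bad))
```

To run the same check on the downloaded file, pass `pd.read_csv("data/ai_vs_human_text.csv")` to `check_dataset` and inspect the returned list.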