Quick Start: Download Dataset
Script Works! (Tested Successfully)
The download script has been tested and works. Here are four ways to use it:
Method 1: Use the Script (Easiest)
# Download the default dataset
python scripts/download_kagglehub.py
# Or specify a different dataset
python scripts/download_kagglehub.py --dataset shamimhasan8/ai-vs-human-text-dataset
Output: Dataset saved to data/ai_vs_human_text.csv
Method 2: Direct in Your Code (Simple)
Just copy-paste this into your Python script:
import kagglehub
import pandas as pd
from pathlib import Path
# Download dataset (public datasets need no API token)
path = kagglehub.dataset_download("shamimhasan8/ai-vs-human-text-dataset")
print("Path to dataset files:", path)
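# Load the first CSV in the download directory
csv_files = sorted(Path(path).glob("*.csv"))
if not csv_files:
    raise FileNotFoundError(f"No CSV files found in {path}")
df = pd.read_csv(csv_files[0])
# Save to your data directory (created if missing)
Path("data").mkdir(parents=True, exist_ok=True)
df.to_csv("data/dataset.csv", index=False)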
See: examples/simple_download.py for a complete example
Method 3: Use the Integrated Function
from ai_text_detector.download_data import download_ai_vs_human_dataset
# Download and get the path
csv_path = download_ai_vs_human_dataset()
print(f"Dataset at: {csv_path}")
# Now use it in your training
from ai_text_detector.config import load_config
cfg = load_config("configs/default.yaml")
cfg.data_path = csv_path
See: examples/download_and_train.py for a complete training example
Method 4: Download Any Dataset
from ai_text_detector.download_data import download_kaggle_dataset
# Download any Kaggle dataset
csv_path = download_kaggle_dataset(
    "shamimhasan8/ai-vs-human-text-dataset",
    output_path="data/my_dataset.csv",
)
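For readers who want to see what a helper of this kind might do internally, here is a minimal sketch. It is not the project's actual `download_kaggle_dataset`. The name `fetch_first_csv` and the `downloader` parameter are hypothetical, introduced only so the copy logic can be exercised without network access. By default it assumes `kagglehub.dataset_download` returns the path of a local directory, as in Method 2.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional


def fetch_first_csv(
    dataset: str,
    output_path: str,
    downloader: Optional[Callable[[str], str]] = None,
) -> Path:
    """Download `dataset`, then copy its first CSV file to `output_path`.

    Hypothetical helper for illustration; the real project function may differ.
    """
    if downloader is None:
        # Imported lazily so the copy logic can be tested without kagglehub.
        import kagglehub

        downloader = kagglehub.dataset_download

    download_dir = Path(downloader(dataset))
    # Search recursively and sort so the chosen file is deterministic.
    csv_files = sorted(download_dir.rglob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in {download_dir}")

    destination = Path(output_path)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(csv_files[0], destination)
    return destination
```

Passing the downloader as an argument keeps the file handling separate from the network call, which makes it easy to swap in a stub during tests.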
What Was Downloaded
- Dataset: shamimhasan8/ai-vs-human-text-dataset
- Size: 1,000 samples
- Columns: id, text, label, prompt, model, date
- Labels: "AI-generated" or "Human-written"
- Saved to: data/ai_vs_human_text.csv
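The columns and labels listed above can be checked programmatically before training. The sketch below uses only the standard library. The function name `summarize_labels` is hypothetical, and it assumes the file is UTF-8 encoded with the header names exactly as listed.

```python
import csv
from collections import Counter

# Taken from the column and label list in this guide.
EXPECTED_COLUMNS = ["id", "text", "label", "prompt", "model", "date"]
EXPECTED_LABELS = {"AI-generated", "Human-written"}


def summarize_labels(csv_path: str) -> Counter:
    """Check the header and label values, then return per-label row counts."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [name for name in EXPECTED_COLUMNS if name not in header]
        if missing:
            raise ValueError(f"Missing columns: {missing}")
        counts = Counter(row["label"] for row in reader)

    unexpected = set(counts) - EXPECTED_LABELS
    if unexpected:
        raise ValueError(f"Unexpected labels: {sorted(unexpected)}")
    return counts
```

Calling `summarize_labels("data/ai_vs_human_text.csv")` returns a `Counter` keyed by label, which also shows at a glance whether the two classes are balanced.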
Next Steps
- Dataset is ready! It's at data/ai_vs_human_text.csv
- Config updated! configs/default.yaml already points to it
- Train your model: python scripts/run_train.py
Tips
- Small dataset (1k samples): Good for quick testing
- Want more data? Look for larger datasets on Kaggle
- Already downloaded? The script won't re-download (uses cache)
- No API token needed for public datasets; kagglehub handles the download and caching
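The "already downloaded" tip can also be applied to your own output file. Below is a minimal sketch of a skip-if-present guard. It is separate from kagglehub's internal cache and is not taken from the project's script. The name `ensure_dataset` and its `fetch` callback are hypothetical.

```python
from pathlib import Path
from typing import Callable


def ensure_dataset(
    output_path: str,
    fetch: Callable[[Path], None],
    force: bool = False,
) -> bool:
    """Call `fetch(destination)` only if the file is missing or `force` is set.

    Returns True when a fetch was performed, False when the existing file
    was reused.
    """
    destination = Path(output_path)
    if destination.is_file() and not force:
        return False
    destination.parent.mkdir(parents=True, exist_ok=True)
    fetch(destination)
    return True
```

Here `fetch` is any callable that writes the dataset to the given path, for example a small wrapper around the Method 2 code.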
Verify It Works
# Check the dataset
head -5 data/ai_vs_human_text.csv
# Or in Python
import pandas as pd
df = pd.read_csv("data/ai_vs_human_text.csv")
print(f"Rows: {len(df):,}")
print(df.head())