# Data Processing for slimface Training 🖼️
## Table of Contents
- [Data Processing for slimface Training 🖼️](#data-processing-for-slimface-training-)
  - [Command-Line Arguments](#command-line-arguments)
    - [Command-Line Arguments for `process_dataset.py`](#command-line-arguments-for-process_datasetpy)
    - [Example Usage](#example-usage)
  - [Step-by-step process for handling a dataset](#step-by-step-process-for-handling-a-dataset)
    - [Step 1: Clone the Repository](#step-1-clone-the-repository)
    - [Step 2: Process the Dataset](#step-2-process-the-dataset)
      - [Option 1: Using Dataset from Kaggle](#option-1-using-dataset-from-kaggle)
      - [Option 2: Using a Custom Dataset](#option-2-using-a-custom-dataset)
## Command-Line Arguments
### Command-Line Arguments for `process_dataset.py`
When running `python scripts/process_dataset.py`, you can customize the dataset processing with the following command-line arguments:
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--dataset_slug` | `str` | `vasukipatel/face-recognition-dataset` | The Kaggle dataset slug in `username/dataset-name` format. Specifies which dataset to download from Kaggle. |
| `--base_dir` | `str` | `./data` | The base directory where the dataset will be stored and processed. |
| `--augment` | `flag` | `False` | Enables data augmentation (e.g., flipping, rotation) for training images to increase dataset variety. Use `--augment` to enable. |
| `--random_state` | `int` | `42` | Random seed for reproducibility in the train-test split. Ensures consistent splitting across runs. |
| `--test_split_rate` | `float` | `0.2` | Proportion of data to use for validation (between 0 and 1). For example, `0.2` means 20% of the data is used for validation. |
| `--rotation_range` | `int` | `15` | Maximum rotation angle in degrees for data augmentation (if `--augment` is enabled). Images may be rotated randomly within this range. |
| `--source_subdir` | `str` | `Original Images/Original Images` | Subdirectory within `raw_dir` containing the images to process. Used for both Kaggle and custom datasets. |
| `--delete_raw` | `flag` | `False` | Deletes the raw folder after processing to save storage. Use `--delete_raw` to enable. |
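For illustration, the flags in the table could be declared with Python's `argparse` along these lines. This is a minimal sketch that mirrors the documented names and defaults, not the repository's actual implementation:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of the CLI described in the table above; defaults mirror the docs.
    parser = argparse.ArgumentParser(
        description="Process a face dataset for slimface training")
    parser.add_argument("--dataset_slug", type=str,
                        default="vasukipatel/face-recognition-dataset",
                        help="Kaggle dataset slug in username/dataset-name format")
    parser.add_argument("--base_dir", type=str, default="./data",
                        help="Base directory for storing and processing the dataset")
    parser.add_argument("--augment", action="store_true",
                        help="Enable data augmentation (flipping, rotation)")
    parser.add_argument("--random_state", type=int, default=42,
                        help="Random seed for the train-test split")
    parser.add_argument("--test_split_rate", type=float, default=0.2,
                        help="Fraction of data used for validation (0-1)")
    parser.add_argument("--rotation_range", type=int, default=15,
                        help="Max rotation angle in degrees when --augment is set")
    parser.add_argument("--source_subdir", type=str,
                        default="Original Images/Original Images",
                        help="Subdirectory within the raw folder containing images")
    parser.add_argument("--delete_raw", action="store_true",
                        help="Delete the raw folder after processing")
    return parser

# With no flags given, every argument falls back to its documented default.
args = build_parser().parse_args([])
```

Note that `--augment` and `--delete_raw` are boolean switches (`action="store_true"`): present means `True`, absent means `False`, and they take no value.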
### Example Usage
To process a Kaggle dataset with augmentation and a custom validation split:
```bash
python scripts/process_dataset.py \
    --augment \
    --test_split_rate 0.3 \
    --rotation_range 15
```
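The `--random_state` and `--test_split_rate` flags behave like a seeded shuffle-and-split: the same seed always yields the same train/validation partition. A small standard-library sketch of that semantics (not the script's actual code):

```python
import random

def train_val_split(items, test_split_rate=0.2, random_state=42):
    """Deterministically shuffle items and split off a validation fraction."""
    shuffled = items[:]
    random.Random(random_state).shuffle(shuffled)  # same seed -> same order
    n_val = int(len(shuffled) * test_split_rate)
    return shuffled[n_val:], shuffled[:n_val]  # (train, val)

images = [f"image_{i}.jpg" for i in range(10)]
train, val = train_val_split(images, test_split_rate=0.3)
print(len(train), len(val))  # 7 3
```

Because the shuffle is driven by `random.Random(random_state)`, rerunning the script with the same seed reproduces the same split, which is what the `--random_state` default of `42` guarantees.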
To process a **custom dataset** with a specific subdirectory and delete the raw folder:
```bash
python scripts/process_dataset.py \
    --source_subdir your_custom_dataset_dir \
    --delete_raw
```
These options allow flexible dataset processing tailored to your needs.

## Step-by-step process for handling a dataset
### Step 1: Clone the Repository
Ensure the `slimface` project is set up by cloning the repository and navigating to the project directory:
```bash
git clone https://github.com/danhtran2mind/slimface/
cd slimface
```
### Step 2: Process the Dataset
#### Option 1: Using Dataset from Kaggle
To download and process the sample dataset from Kaggle, run:
```bash
python scripts/process_dataset.py
```
This script organizes the dataset into the following structure under `data/`:
```markdown
data/
├── processed_ds/
│   ├── train_data/
│   │   ├── Charlize Theron/
│   │   │   ├── Charlize Theron_70.jpg
│   │   │   ├── Charlize Theron_46.jpg
│   │   │   ...
│   │   ├── Dwayne Johnson/
│   │   │   ├── Dwayne Johnson_58.jpg
│   │   │   ├── Dwayne Johnson_9.jpg
│   │   │   ...
│   ├── val_data/
│   │   ├── Charlize Theron/
│   │   │   ├── Charlize Theron_60.jpg
│   │   │   ├── Charlize Theron_45.jpg
│   │   │   ...
│   │   ├── Dwayne Johnson/
│   │   │   ├── Dwayne Johnson_11.jpg
│   │   │   ├── Dwayne Johnson_46.jpg
│   │   │   ...
├── raw/
│   ├── Faces/
│   │   ├── Jessica Alba_90.jpg
│   │   ├── Hugh Jackman_70.jpg
│   │   ...
│   ├── Original Images/
│   │   ├── Charlize Theron/
│   │   │   ├── Charlize Theron_60.jpg
│   │   │   ├── Charlize Theron_70.jpg
│   │   │   ...
│   │   ├── Dwayne Johnson/
│   │   │   ├── Dwayne Johnson_11.jpg
│   │   │   ├── Dwayne Johnson_58.jpg
│   │   │   ...
│   ├── dataset.zip
│   ├── Dataset.csv
├── .gitignore
```
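After processing, you can sanity-check the split by counting images per identity under each split folder. A small `pathlib` sketch (the `data/processed_ds` paths come from the layout above):

```python
from pathlib import Path

def count_images_per_class(split_dir):
    """Map each identity folder under a split directory to its .jpg count."""
    split = Path(split_dir)
    return {d.name: len(list(d.glob("*.jpg")))
            for d in sorted(split.iterdir()) if d.is_dir()}

# Example usage against the processed dataset:
# train_counts = count_images_per_class("data/processed_ds/train_data")
# val_counts = count_images_per_class("data/processed_ds/val_data")
```

Comparing the two dictionaries lets you confirm every identity appears in both splits and that the validation fraction roughly matches `--test_split_rate`.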
#### Option 2: Using a Custom Dataset
If you prefer to use your own dataset, place it in `./data/raw/your_custom_dataset_dir/` with the following structure:
```markdown
data/
├── raw/
│   ├── your_custom_dataset_dir/
│   │   ├── Charlize Theron/
│   │   │   ├── Charlize Theron_60.jpg
│   │   │   ├── Charlize Theron_70.jpg
│   │   │   ...
│   │   ├── Dwayne Johnson/
│   │   │   ├── Dwayne Johnson_11.jpg
│   │   │   ├── Dwayne Johnson_58.jpg
│   │   │   ...
```
Your custom dataset does not need to contain only cropped human faces: **the script supports face extraction using face detection**, and all extracted faces are saved to `data/processed_ds`.
Then, process your custom dataset by specifying the subdirectory:
```bash
python scripts/process_dataset.py \
    --source_subdir your_custom_dataset_dir
```
This ensures your dataset is properly formatted for training.