-**DataForge** is a high-performance CLI tool designed to automate the preparation and management of machine learning datasets. It helps you transform raw and dirty data (like videos and unsorted images) into clean, balanced, and ready-to-train datasets with minimal effort.

-`📖 [Read the full documentation here](https://seregacodit.github.io/DataForge/)`
+**DataForge** is a high-performance CLI tool designed to automate the preparation and management of machine learning datasets. It helps you transform raw data (like videos and unsorted images) into clean, balanced, and ready-to-train datasets with minimal effort.
+
+[Read the full documentation here](https://seregacodit.github.io/DataForge)
+
### Key Features
-* **Parallel Processing:** uses multiprocessing to handle thousands of files quickly.
-* **Vectorized Calculations:** employs NumPy for ultra-fast image comparison.
-* **Smart Caching:** incremental caching (MD5-based) allows working with large datasets on NAS without re-calculating everything.
-* **Config:** Built with Pydantic v2 for safe and flexible settings via JSON or CLI.
+* **Parallel Processing:** Uses multiprocessing to handle thousands of files quickly.
+* **Vectorized Calculations:** Employs NumPy for ultra-fast image comparison and hashing.
+* **Smart Caching:** Incremental caching (MD5-based) allows working with large datasets on NAS or local storage without re-calculating existing data.
+* **Flexible Configuration:** Built with Pydantic v2 for safe settings via `config.json` or CLI arguments.

---
### Available Commands

-* **`move`** — move files from source to target directory based on patterns.
-* **`slice`** — convert video files into sequences of images. Use `--remove` to delete the source video after a successful slice.
-* **`delete`** — safely remove files matching specific patterns.
-* **`dedup`** — find and remove visual duplicates using **dHash**.
-    * *Threshold:* information similarity limit (0-100%).
-    * *Core Size:* higher values (e.g., 32, 64) detect small changes (like a moving car), lower values (e.g., 8) ignore noise.
-* **`clean-annotations`** — automatically find and delete "orphan" annotation files (XML/TXT) that no longer have a corresponding image.
-* **`convert-annotations`** — convert dataset labels between formats (e.g., **Pascal VOC** to **YOLO**).
+* **`move`** — Move files from source to target directory based on specific patterns.
+* **`slice`** — Convert video files into sequences of images. Use `--remove` to delete the source video after a successful slice.
+* **`delete`** — Safely remove files matching specific patterns.
+* **`dedup`** — Find and remove visual duplicates using **dHash**.
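
The `dedup` entry names dHash, a similarity threshold, and a core size without showing how they interact. The following is a minimal pure-Python sketch of a difference hash, not DataForge's actual implementation (the README says the tool uses NumPy). The function names `resize_nearest`, `dhash`, `similarity`, and `is_duplicate` are hypothetical, images are assumed to be 2D lists of grayscale values, and the threshold is assumed to mean "treat as duplicates when at least this percentage of hash bits match".

```python
def resize_nearest(pixels, width, height):
    """Downsample a 2D grayscale grid using nearest-neighbour sampling."""
    src_h, src_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def dhash(pixels, core_size=8):
    """Return core_size * core_size bits.

    A bit is 1 where a pixel is brighter than its right-hand neighbour
    in the downsampled (core_size + 1) x core_size grid.
    """
    small = resize_nearest(pixels, core_size + 1, core_size)
    return [
        1 if row[x] > row[x + 1] else 0
        for row in small
        for x in range(core_size)
    ]


def similarity(hash_a, hash_b):
    """Percentage of matching bits between two equal-length hashes."""
    matches = sum(a == b for a, b in zip(hash_a, hash_b))
    return 100.0 * matches / len(hash_a)


def is_duplicate(pixels_a, pixels_b, threshold=90.0, core_size=8):
    """Assumed rule: duplicates when similarity reaches the threshold."""
    hash_a = dhash(pixels_a, core_size)
    hash_b = dhash(pixels_b, core_size)
    return similarity(hash_a, hash_b) >= threshold


# Two synthetic 64x64 images: a left-to-right gradient and its mirror.
rising = [[x for x in range(64)] for _ in range(64)]
falling = [[63 - x for x in range(64)] for _ in range(64)]

print(is_duplicate(rising, rising))    # identical images
print(is_duplicate(rising, falling))   # opposite gradients
```

A larger `core_size` produces more bits from a finer grid, so a small local change (such as a moving car) flips some of them, while a small grid averages such detail away. This matches the behaviour the removed sub-bullets describe for the Core Size setting.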