feat(chore): add stats command and its business logic (#8)
* feat(chore): add empty base structure for implementing stats operation and logic
* refactor(stats): add base structure for business logic of stats
* refactor(caching)!: refactor caching methods to support pd.DataFrame. Cache filename generation also accepts kwargs, which are appended as a suffix to the cache file name
* refactor(hasher): refactor code for methods responsibility determination
* refactor(stats): add voc stats full business logic
* refactor(stats)!: add constants for dict keys and image analyzer
* refactor(stats): rename dict keys using constant values
* refactor(stats)!: add worker initializer for getting truth image paths
* refactor(stats): add unit tests for feature extractor logic, add exceptions interception
* refactor(stats): add simple cli dataset report in
* refactor(stats): add outlier detector
* refactor(stats): change outlier detection algorithm to the IQR method
* refactor(stats)!: add visual image dataset report
* refactor(stats): change manifold plot from t-SNE to UMAP
* refactor(stats)!: compute UMAP coordinates for embeddings and add them to the dataframe before saving; the manifold plot reads these coordinates from the dataset
* refactor(converter): make YOLO-to-dict conversion a service
* refactor(stats): add yolo stats strategy
* docs(docs): refactor docstrings, add new pages to mkdocs, refactor readme
* deps(deps): update requirements.txt
* docs(stats): add docstrings to dataset reporters
---------
Co-authored-by: Serhii Naumenko <naumenko.s.mail@gmail.com>
- **DataForge** is a high-performance CLI tool designed to automate the preparation and management of machine learning datasets. It helps you transform raw and dirty data (like videos and unsorted images) into clean, balanced, and ready-to-train datasets with minimal effort.
- `📖 [Read the full documentation here](https://seregacodit.github.io/DataForge/)`
+ **DataForge** is a high-performance CLI tool designed to automate the preparation and management of machine learning datasets. It helps you transform raw data (like videos and unsorted images) into clean, balanced, and ready-to-train datasets with minimal effort.
+
+ [Read the full documentation here](https://seregacodit.github.io/DataForge)
+
  ### Key Features
- * **Parallel Processing:** uses multiprocessing to handle thousands of files quickly.
- * **Vectorized Calculations:** employs NumPy for ultra-fast image comparison.
- * **Smart Caching:** incremental caching (MD5-based) allows working with large datasets on NAS without re-calculating everything.
- * **Config:** Built with Pydantic v2 for safe and flexible settings via JSON or CLI.
+ * **Parallel Processing:** Uses multiprocessing to handle thousands of files quickly.
+ * **Vectorized Calculations:** Employs NumPy for ultra-fast image comparison and hashing.
+ * **Smart Caching:** Incremental caching (MD5-based) allows working with large datasets on NAS or local storage without re-calculating existing data.
+ * **Flexible Configuration:** Built with Pydantic v2 for safe settings via `config.json` or CLI arguments.

  ---

  ### Available Commands

- * **`move`** — move files from source to target directory based on patterns.
- * **`slice`** — convert video files into sequences of images. Use `--remove` to delete the source video after a successful slice.
- * **`delete`** — safely remove files matching specific patterns.
- * **`dedup`** — find and remove visual duplicates using **dHash**.
-     * *Threshold:* information similarity limit (0-100%).
-     * *Core Size:* higher values (e.g., 32, 64) detect small changes (like a moving car), lower values (e.g., 8) ignore noise.
- * **`clean-annotations`** — automatically find and delete "orphan" annotation files (XML/TXT) that no longer have a corresponding image.
- * **`convert-annotations`** — convert dataset labels between formats (e.g., **Pascal VOC** to **YOLO**).
+ * **`move`** — Move files from source to target directory based on specific patterns.
+ * **`slice`** — Convert video files into sequences of images. Use `--remove` to delete the source video after a successful slice.
+ * **`delete`** — Safely remove files matching specific patterns.
+ * **`dedup`** — Find and remove visual duplicates using **dHash**.
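
The `dedup` entry in the diff above names dHash, with a similarity threshold (0-100%) and a core size. As a rough illustration of how a difference hash can work, here is a minimal NumPy sketch. It is not DataForge's implementation: the function names `dhash` and `similarity_percent` are hypothetical, and it assumes that "core size" is the side length of the hash grid and that the threshold is compared against the percentage of matching bits.

```python
import numpy as np


def dhash(gray: np.ndarray, core_size: int = 8) -> np.ndarray:
    """Return a flat boolean difference hash of a 2-D grayscale image.

    Hypothetical helper: `core_size` is assumed to be the hash grid side,
    so the hash has core_size * core_size bits.
    """
    height, width = gray.shape
    # Nearest-neighbour downsample to core_size rows and core_size + 1 columns.
    rows = np.arange(core_size) * height // core_size
    cols = np.arange(core_size + 1) * width // (core_size + 1)
    small = gray[np.ix_(rows, cols)].astype(np.int32)
    # Each bit records whether a pixel is brighter than its right-hand neighbour.
    return (small[:, :-1] > small[:, 1:]).ravel()


def similarity_percent(hash_a: np.ndarray, hash_b: np.ndarray) -> float:
    """Percentage of matching bits between two equally sized hashes."""
    return float(100.0 * np.mean(hash_a == hash_b))


rng = np.random.default_rng(0)
image = rng.integers(0, 200, size=(64, 64), dtype=np.uint8)
brighter = image + 10  # uniform brightness shift, stays within uint8 range

# A uniform shift keeps every neighbour comparison unchanged.
print(similarity_percent(dhash(image), dhash(brighter)))  # 100.0
```

Under these assumptions, a larger `core_size` samples more pixels and so reacts to smaller local changes, while a small grid discards fine detail and noise. A pair of images would be treated as duplicates when `similarity_percent` meets or exceeds the chosen threshold.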