Commit bddea20

SeregaCodit and Serhii Naumenko authored
feat(chore): add stats command and it`s business-logic (#8)
* feat(chore): add empty base structure for implementing stats operation and logic
* refactor(stats): add base structure for business logic of stats
* refactor(caching)!: refactor casing methods for pd.DataFrame capacity. Also generation of cache filename allows kwargs that will be added as suffix for cache file
* refactor(hasher): refactor code for methods responsibility determination
* refactor(stats): add voc stats full business logic
* refactor(stats)!: add constants for dict keys and image analyzer
* refactor(stats): rename dict keys using constant values
* refactor(stats)!: add worker initializer for getting truth image paths
* refactor(stats): add unit tests for feature extractor logic, add exceptions interception
* refactor(stats): add simple cli dataset report in
* refactor(stats): add outlier detector
* refactor(stats): change outliers detection algorithm to iqr method
* refactor(stats)!: add visual image dataset report
* refactor(stats): change mainfold plot from t-sne to umap
* refactor(stats)!: embeddings umap coords calculating and adding in to dataframe before saving. plot manifold gets this coords from dataset
* refactor(converter): make yolo to dict convertion as service
* refactor(stats): add yolo stats strategy
* docs(docs): refactor docstrings, add new pages to mkdocs, refactor readme
* deps(deps): update requirements.txt
* docs(stats): add docstrings to dataset reporters

---------

Co-authored-by: Serhii Naumenko <naumenko.s.mail@gmail.com>
1 parent 2f4e72b commit bddea20

53 files changed

Lines changed: 2596 additions & 267 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -289,3 +289,4 @@ pyrightconfig.json
 /cache/
 /tst_commands.py
 /.idea/DataForge.iml
+/reports/

README.MD

Lines changed: 39 additions & 22 deletions
@@ -1,32 +1,49 @@
 # DataForge
-**DataForge** is a high-performance CLI tool designed to automate the preparation and management of machine learning datasets. It helps you transform raw and dirty data (like videos and unsorted images) into clean, balanced, and ready-to-train datasets with minimal effort.
 
-`📖 [Read the full documentation here](https://seregacodit.github.io/DataForge/)`
+**DataForge** is a high-performance CLI tool designed to automate the preparation and management of machine learning datasets. It helps you transform raw data (like videos and unsorted images) into clean, balanced, and ready-to-train datasets with minimal effort.
+
+[Read the full documentation here](https://seregacodit.github.io/DataForge)
+
 ### Key Features
-* **Parallel Processing:** uses multiprocessing to handle thousands of files quickly.
-* **Vectorized Calculations:** employs NumPy for ultra-fast image comparison.
-* **Smart Caching:** incremental caching (MD5-based) allows working with large datasets on NAS without re-calculating everything.
-* **Config:** Built with Pydantic v2 for safe and flexible settings via JSON or CLI.
+* **Parallel Processing:** Uses multiprocessing to handle thousands of files quickly.
+* **Vectorized Calculations:** Employs NumPy for ultra-fast image comparison and hashing.
+* **Smart Caching:** Incremental caching (MD5-based) allows working with large datasets on NAS or local storage without re-calculating existing data.
+* **Flexible Configuration:** Built with Pydantic v2 for safe settings via `config.json` or CLI arguments.
 
 ---
 
 ### Available Commands
 
-* **`move`** — move files from source to target directory based on patterns.
-* **`slice`** — convert video files into sequences of images. Use `--remove` to delete the source video after a successful slice.
-* **`delete`** — safely remove files matching specific patterns.
-* **`dedup`** — find and remove visual duplicates using **dHash**.
-    * *Threshold:* information similarity limit (0-100%).
-    * *Core Size:* higher values (e.g., 32, 64) detect small changes (like a moving car), lower values (e.g., 8) ignore noise.
-* **`clean-annotations`** — automatically find and delete "orphan" annotation files (XML/TXT) that no longer have a corresponding image.
-* **`convert-annotations`** — convert dataset labels between formats (e.g., **Pascal VOC** to **YOLO**).
+* **`move`** — Move files from source to target directory based on specific patterns.
+* **`slice`** — Convert video files into sequences of images. Use `--remove` to delete the source video after a successful slice.
+* **`delete`** — Safely remove files matching specific patterns.
+* **`dedup`** — Find and remove visual duplicates using **dHash**.
+    * *Threshold:* Similarity limit (0-100%).
+    * *Core Size:* Higher values (e.g., 32) detect small changes; lower values (e.g., 8) ignore noise.
+* **`clean-annotations`** — Automatically find and delete "orphan" annotation files (XML/TXT) that do not have a corresponding image.
+* **`convert-annotations`** — Convert dataset labels between formats (e.g., **Pascal VOC** to **YOLO**).
+* **`stats` — Advanced Dataset Analytics & Health Check**
+    This command performs a deep-dive into your dataset to identify biases and feature correlations before you start training.
+    * **Analytics Highlights:**
+        * **Class Distribution:** Visualizes object counts to detect imbalances.
+        * **Spatial Density Heatmaps:** Identifies "positional bias" for each class using 3x3 grids.
+        * **Correlation Analysis:** Global and per-class matrices showing relationships between features.
+        * **Dataset Manifold (UMAP):** 2D projection to identify "representation gaps" and object clusters.
+        * **Quality Metrics:** Analysis of object areas, aspect ratios, brightness, contrast, and blur.
+        * **Outlier Detection:** Automatically marks extreme data points using the IQR method.
+    * **Outputs:** Technical console summary, high-resolution PNG plots, and unified PDF reports.
+
+    **Usage Example:**
+    ```bash
+    python data_forge.py stats --src ./data/train --target_format yolo --report_path ./reports/v1
+    ```
 
 ---
 
 ### Automation & Intervals
-By default, commands run once. If you want to monitor a folder and process files as they appear, use the repeat flag:
-* Use **`-r`** to run the command in a cycle.
-* Set the delay between cycles with **`-s`** (seconds).
+By default, commands run once. To monitor a folder and process files as they appear, use these flags:
+* **`-r`**: Run the command in a continuous cycle.
+* **`-s`**: Set the delay (in seconds) between cycles.
 
 ---
 
@@ -47,14 +64,14 @@ pip install -r requirements.txt
 
 3. **Check usage:**
 ```bash
-python data_forge.py --help # See all commands
+python data_forge.py --help # See all available commands
 python data_forge.py {command} --help # See arguments for a specific command
 ```
 
 ---
 
-### Workflow
-For multiple tasks, you can modify `start_all_tasks.sh` and run them all in the background:
+### Workflow Optimization
+For multiple tasks, you can modify `start_all_tasks.sh` and run them in the background:
 ```bash
 bash start_all_tasks.sh
 ```
@@ -63,6 +80,6 @@ To stop all running DataForge processes:
 pkill -f data_forge.py
 ```
 
-### Configuration
-You can manage all default settings in `config.json`. DataForge follows this priority:
+### Configuration Priority
+You can manage default settings in `config.json`. DataForge follows this priority:
 **CLI Arguments > config.json > Internal Defaults.**
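
The priority rule stated in the README hunk (CLI arguments, then `config.json`, then internal defaults) can be pictured as a layered lookup. The sketch below uses only the standard library to show the idea; it is not how DataForge resolves settings (the project uses Pydantic settings), and `resolve_settings` and the sample values are hypothetical.

```python
from collections import ChainMap
from typing import Any, Dict, Mapping


def resolve_settings(cli: Mapping[str, Any],
                     file_cfg: Mapping[str, Any],
                     defaults: Mapping[str, Any]) -> Dict[str, Any]:
    """Merge three layers; earlier layers win (hypothetical helper)."""
    # Options the user did not pass arrive as None and must not
    # shadow values from config.json or the internal defaults.
    cli_given = {key: value for key, value in cli.items() if value is not None}
    return dict(ChainMap(cli_given, file_cfg, defaults))


# Illustrative values only.
defaults = {"n_jobs": 2, "report_path": "./reports", "core_size": 8}
file_cfg = {"n_jobs": 20, "core_size": 16}
cli = {"n_jobs": None, "report_path": "./reports/v1"}

settings = resolve_settings(cli, file_cfg, defaults)
print(settings)
```

Here `report_path` comes from the CLI layer, `n_jobs` and `core_size` from the file layer, and nothing falls through to the defaults.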

config.json

Lines changed: 58 additions & 5 deletions
@@ -7,14 +7,67 @@
   "step_sec": 5,
   "log_path": "./log",
   "log_level": "INFO",
-  "filetype": "image",
+  "report_path": "./reports",
+  "datatype": "image",
   "method": "dhash",
   "hash_threshold": 10,
-  "confirm_choice": ["delete", "вудуеу", "yes", "y", "true", "t", "1"],
+  "confirm_choice": [
+    "yes"
+  ],
   "core_size": 16,
-  "n_jobs": 20,
+  "n_jobs": 20,
   "cache_file_path": "./cache",
   "cache_name": null,
-  "a_suffix": [".xml"],
-  "a_source": null
+  "a_suffix": [
+    ".xml"
+  ],
+  "a_source": null,
+  "img_dataset_report_schema": [
+    {
+      "title": "GEOMETRY",
+      "type": "numeric",
+      "columns": [
+        "object_area",
+        "object_relative_area",
+        "object_width",
+        "object_height",
+        "object_aspect_ratio"
+      ]
+    },
+    {
+      "title": "IMAGE QUALITY",
+      "type": "numeric",
+      "columns": [
+        "im_brightness",
+        "im_contrast",
+        "im_blur_score"
+      ]
+    },
+    {
+      "title": "SPATIAL BIAS",
+      "type": "binary",
+      "columns": [
+        "object_in_center",
+        "object_in_top_side",
+        "object_in_bottom_side",
+        "object_in_left_side",
+        "object_in_right_side",
+        "object_in_left_top",
+        "object_in_right_top",
+        "object_in_left_bottom",
+        "object_in_right_bottom"
+      ]
+    },
+    {
+      "title": "TRUNCATION",
+      "type": "binary",
+      "columns": [
+        "truncated_top",
+        "truncated_bottom",
+        "truncated_left",
+        "truncated_right"
+      ]
+    }
+
+  ]
 }
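
Each entry of `img_dataset_report_schema` names a report section, a column type (`numeric` or `binary`), and the columns it covers. As a rough illustration of how such a schema could drive a console summary, the sketch below averages `numeric` columns and reports the share of truthy values for `binary` ones. The reporters added by this commit are not shown in this excerpt, so `summarize` and the sample rows are hypothetical.

```python
from statistics import mean
from typing import Any, Dict, List

# Two sections in the same shape as the config entries above.
schema: List[Dict[str, Any]] = [
    {"title": "IMAGE QUALITY", "type": "numeric",
     "columns": ["im_brightness", "im_contrast"]},
    {"title": "TRUNCATION", "type": "binary",
     "columns": ["truncated_top", "truncated_left"]},
]

# Hypothetical per-object rows, invented for the example.
rows: List[Dict[str, Any]] = [
    {"im_brightness": 100.0, "im_contrast": 40.0,
     "truncated_top": 1, "truncated_left": 0},
    {"im_brightness": 140.0, "im_contrast": 60.0,
     "truncated_top": 0, "truncated_left": 0},
]


def summarize(schema: List[Dict[str, Any]],
              rows: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Return {section title: {column: mean or share of truthy values}}."""
    report: Dict[str, Dict[str, float]] = {}
    for section in schema:
        stats: Dict[str, float] = {}
        for column in section["columns"]:
            values = [row[column] for row in rows if column in row]
            if not values:
                continue  # column missing from the data, skip it
            if section["type"] == "numeric":
                stats[column] = mean(values)
            else:
                # binary column: fraction of rows where the flag is set
                stats[column] = sum(1 for v in values if v) / len(values)
        report[section["title"]] = stats
    return report


summary = summarize(schema, rows)
print(summary)
```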

const_utils/arguments.py

Lines changed: 3 additions & 1 deletion
@@ -19,7 +19,7 @@ class Arguments:
     rm: str = "-rm"
     log_level: str = "--log_level"
     log_path: str = "--log_path"
-    filetype: str = "--filetype"
+    datatype: str = "--datatype"
     method: str = "--method"
     m: str = "-m"
     action: str = "--action"
@@ -33,3 +33,5 @@ class Arguments:
     destination_type: str = "--destination-type"
     img_path: str = "--img_path"
     extensions: str = "--ext"
+    margin: str = "--margin"
+    report_path: str = "--report_path"

const_utils/commands.py

Lines changed: 2 additions & 1 deletion
@@ -8,4 +8,5 @@ class Commands:
     delete: str = "delete"
     dedup: str = "dedup"
     clean_annotations: str = "clean-annotations"
-    convert_annotations: str = "convert-annotations"
+    convert_annotations: str = "convert-annotations"
+    stats: str = "stats"

const_utils/default_values.py

Lines changed: 56 additions & 6 deletions
@@ -1,13 +1,14 @@
 import json
 import multiprocessing
 
-from typing import Union, Tuple, Optional, List
+from typing import Union, Tuple, Optional, List, Dict, Any
 
 from pydantic import Field, field_validator
 from pydantic_settings import BaseSettings, SettingsConfigDict
 from pathlib import Path
 
 from const_utils.copmarer import Constants
+from const_utils.stats_constansts import ImageStatsKeys
 from logger.log_level_mapping import LevelMapping
 
 
@@ -30,7 +31,7 @@ class AppSettings(BaseSettings):
         step_sec (float): Time interval in seconds for video slicing.
         log_path (Path): Directory where log files are stored.
         log_level (str): Verbosity level of the logger (e.g., INFO, DEBUG).
-        filetype (str): The category of files being processed (e.g., image).
+        datatype (str): The category of files being processed (e.g., image).
         method (str): The algorithm name for hashing or comparison.
         hash_threshold (int): Distance threshold for identifying duplicates (0-100).
         confirm_choice (tuple): Keywords used to confirm interactive deletion.
@@ -40,7 +41,7 @@ class AppSettings(BaseSettings):
         cache_name (Optional[Path]): Custom name for the cache file.
         a_suffix (Tuple[str, ...]): File patterns specific to annotations.
         a_source (Optional[Path]): Directory where annotation files are located.
-        destination_type (Optional[str]): Target format for annotation conversion.
+        destination_type (Optional[str]): Target format for annotations.
         extensions (Tuple[str, ...]): Supported image file extensions.
     """
     max_percentage: int = 100
@@ -58,10 +59,10 @@ class AppSettings(BaseSettings):
     step_sec: float = Field(default=1.0, ge=0.1)
     log_path: Path = Field(default=Path("./log"))
     log_level: str = Field(default=LevelMapping.info)
-    filetype: str = Field(default=Constants.image)
+    datatype: str = Field(default=Constants.image)
     method: str = Field(default=Constants.dhash)
     hash_threshold: int = Field(default=10, ge=0, le=100)
-    confirm_choice: tuple = Field(default=("delete",))
+    confirm_choice: tuple = Field(default=("yes",))
     core_size: int = Field(default=8, ge=8)
     n_jobs: int = Field(default=2, ge=1, le=multiprocessing.cpu_count())
     cache_file_path: Path = Field(default=Path("./cache"))
@@ -70,6 +71,55 @@ class AppSettings(BaseSettings):
     a_source: Optional[Path] = Field(default=None)
     destination_type: Optional[str] = Field(default=None)
     extensions: Tuple[str, ...] = Field(default=(".jpg", ".jpeg,", ".png"))
+    margin_threshold: int = Field(default=5, ge=0, le=100)
+    report_path: Path = Field(default=Path("./reports"))
+    img_dataset_report_schema: List[Dict[str, Any]] = Field(default=[
+        {
+            "title": "GEOMETRY",
+            "type": "numeric",
+            "columns": [
+                ImageStatsKeys.object_area,
+                ImageStatsKeys.object_relative_area,
+                ImageStatsKeys.object_width,
+                ImageStatsKeys.object_height,
+                ImageStatsKeys.object_aspect_ratio
+            ]
+        },
+        {
+            "title": "SPATIAL BIAS",
+            "type": "binary",
+            "columns": [
+                ImageStatsKeys.object_in_center,
+                ImageStatsKeys.object_in_top_side,
+                ImageStatsKeys.object_in_bottom_side,
+                ImageStatsKeys.object_in_left_side,
+                ImageStatsKeys.object_in_right_side,
+                ImageStatsKeys.object_in_left_top,
+                ImageStatsKeys.object_in_right_top,
+                ImageStatsKeys.object_in_left_bottom,
+                ImageStatsKeys.object_in_right_bottom
+            ]
+        },
+        {
+            "title": "TRUNCATION",
+            "type": "binary",
+            "columns": [
+                ImageStatsKeys.truncated_top,
+                ImageStatsKeys.truncated_bottom,
+                ImageStatsKeys.truncated_left,
+                ImageStatsKeys.truncated_right
+            ]
+        },
+        {
+            "title": "IMAGE QUALITY",
+            "type": "numeric",
+            "columns": [
+                ImageStatsKeys.im_brightness,
+                ImageStatsKeys.im_contrast,
+                ImageStatsKeys.im_blur_score
+            ]
+        }
+    ])
 
 
     @field_validator('core_size')
@@ -92,7 +142,7 @@ def check_power_of_two(cls, value: int) -> int:
         return value
 
 
-    @field_validator("log_path", "cache_file_path", "a_source", mode='before')
+    @field_validator("report_path", "log_path", "cache_file_path", "a_source", mode='before')
     @classmethod
     def ensure_path(cls, value: Union[str, Path]) -> Path:
         """

const_utils/parser_help.py

Lines changed: 5 additions & 2 deletions
@@ -18,7 +18,7 @@ class HelpStrings:
     remove: str = "remove files after processing"
     log_path: str = "path to log directory"
     log_level: str = f"A level of logging matches mapping: {str(LevelMapping.mapping())}"
-    filetype: str = "Type of file. Currently this parameter only supports 'image'"
+    datatype: str = "Type of data. Currently this parameter only supports 'image'"
     method: str = "Default: dhash. A method of comparing images. It's can be ['phash, dhash, ahash, cnn]"
     threshold: str = ("A minimal difference between files that means the files"
                       f" have a different information. Using Hemming distance for *hash methods")
@@ -36,4 +36,7 @@ class HelpStrings:
     destination_type: str = "A type of destination annotation format"
     img_path: str = "Path to dataset images directory"
     extensions: str = ("A tuple of file extensions that will be used as pattern for building file whitelists for "
-                       "converting from yolo to other formats")
+                       "converting from yolo to other formats")
+    margin: str = ("A threshold value of margin from any image border. If any side of object bbox cloaser that this"
+                   "value to image boarder - object will be defined as truncated")
+    report_path: str = "A path to directory where reports will be stored"
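
The `margin` help string describes a simple rule: an object counts as truncated on a side when its bounding box lies closer to that image border than the margin. A minimal sketch of that rule follows. It assumes the margin is in pixels and the box is `(xmin, ymin, xmax, ymax)`, neither of which this excerpt confirms. `truncation_flags` is a hypothetical name, and the keys reuse the `truncated_*` column names from the report schema.

```python
from typing import Dict, Tuple


def truncation_flags(bbox: Tuple[int, int, int, int],
                     im_width: int,
                     im_height: int,
                     margin: int = 5) -> Dict[str, bool]:
    """Flag each side whose distance to the image border is below `margin`.

    Assumption: distances and margin are in pixels; the commit's actual
    units and comparison (strict or inclusive) are not shown here.
    """
    xmin, ymin, xmax, ymax = bbox
    return {
        "truncated_left": xmin < margin,
        "truncated_top": ymin < margin,
        "truncated_right": (im_width - xmax) < margin,
        "truncated_bottom": (im_height - ymax) < margin,
    }


# A box hugging the left and bottom edges of a 640x480 image.
flags = truncation_flags((2, 50, 100, 478), im_width=640, im_height=480)
print(flags)
```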

const_utils/stats_constansts.py

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+from dataclasses import dataclass
+
+
+@dataclass
+class ImageStatsKeys:
+    """Constants for stats dictionary keys and default values."""
+    path: str = "path"
+    im_path: str = "im_path"
+    mtime: str = "mtime"
+    class_name: str = "class_name"
+    objects_count: str = "objects_count"
+    im_width: str = "im_width"
+    im_height: str = "im_height"
+    im_depth: str = "im_depth"
+    im_brightness: str = "im_brightness"
+    im_contrast: str = "im_contrast"
+    im_blur_score: str = "im_blur_score"
+    has_neighbors: str = "has_neighbors"
+    object_width: str = "object_width"
+    object_height: str = "object_height"
+    object_aspect_ratio: str = "object_aspect_ratio"
+    object_area: str = "object_area"
+    object_relative_area: str = "object_relative_area"
+    object_in_center: str = "object_in_center"
+    object_in_right_side: str = "object_in_right_side"
+    object_in_left_side: str = "object_in_left_side"
+    object_in_top_side: str = "object_in_top_side"
+    object_in_bottom_side: str = "object_in_bottom_side"
+    object_in_left_top: str = "object_in_left_top"
+    object_in_right_top: str = "object_in_right_top"
+    object_in_left_bottom: str = "object_in_left_bottom"
+    object_in_right_bottom: str = "object_in_right_bottom"
+    full_size: str = "full_size"
+    truncated_left: str = "truncated_left"
+    truncated_right: str = "truncated_right"
+    truncated_top: str = "truncated_top"
+    truncated_bottom: str = "truncated_bottom"
+    outlier_any: str = "outlier_any"
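
The commit message says outlier detection was switched to the IQR method, and `outlier_any` appears to be the key that records the result. The detector itself is not part of this excerpt, so what follows is only a sketch of the usual Tukey-fence rule: a value is flagged when it falls outside `[Q1 - k*IQR, Q3 + k*IQR]`. The multiplier `k = 1.5` is the conventional choice and an assumption here, and `mark_outliers` is a hypothetical name. Quartile conventions differ between libraries, so exact fences may not match a pandas or NumPy implementation.

```python
from statistics import quantiles
from typing import List, Sequence


def mark_outliers(values: Sequence[float], k: float = 1.5) -> List[bool]:
    """Return one flag per value: True if it lies outside the IQR fences.

    Needs at least two values, since statistics.quantiles requires that.
    """
    q1, _median, q3 = quantiles(values, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


# Hypothetical object areas with one extreme value.
areas = [10, 12, 11, 13, 12, 11, 95]
outlier_flags = mark_outliers(areas)
print(outlier_flags)
```

A per-row "any outlier" flag could then be formed by OR-ing such flags across several feature columns, which is one plausible reading of the `outlier_any` key.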

const_utils/xml_names.py

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+from dataclasses import dataclass
+
+@dataclass
+class XMLNames:
+    """Constants for XML tag and attribute names."""
+    annotation: str = "annotation"
+    folder: str = "folder"
+    filename: str = "filename"
+    path: str = "path"
+    source: str = "source"
+    database: str = "database"
+    size: str = "size"
+    width: str = "width"
+    height: str = "height"
+    depth: str = "depth"
+    segmented: str = "segmented"
+    object: str = "object"
+    name: str = "name"
+    pose: str = "pose"
+    truncated: str = "truncated"
+    difficult: str = "difficult"
+    bndbox: str = "bndbox"
+    xmin: str = "xmin"
+    ymin: str = "ymin"
+    xmax: str = "xmax"
+    ymax: str = "ymax"
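
The tag names collected in `XMLNames` are those of a Pascal VOC annotation file. To show how such constants are typically used, the sketch below reads image size and bounding boxes from a small VOC document with the standard library's `xml.etree.ElementTree`. The commit's own VOC statistics code is not shown in this excerpt, so `parse_voc`, the sample XML, and the output layout are hypothetical. The output keys borrow names from `ImageStatsKeys`.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict

# Minimal hand-written VOC annotation, for illustration only.
SAMPLE = """<annotation>
  <filename>img_001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>car</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox>
  </object>
</annotation>"""


def parse_voc(xml_text: str) -> Dict[str, Any]:
    """Extract image size and per-object box geometry from VOC XML."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    result: Dict[str, Any] = {
        "im_width": int(size.findtext("width")),
        "im_height": int(size.findtext("height")),
        "objects": [],
    }
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        # Some tools write coordinates as floats, so go through float().
        xmin, ymin, xmax, ymax = (
            int(float(box.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        width, height = xmax - xmin, ymax - ymin
        result["objects"].append({
            "class_name": obj.findtext("name"),
            "object_width": width,
            "object_height": height,
            "object_area": width * height,
        })
    return result


parsed = parse_voc(SAMPLE)
print(parsed)
```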
