Commit b103489

Documentation update (#26)
* Cleaned up README.md to be a bit more clean and clear (namely, removing newlines which simulated soft-wrap poorly)
* Cleaned up README.md to be a bit more clean and clear (namely, removing newlines which simulated soft-wrap poorly)
* Added docstring for 'registered_data_hook'
* Updated docstrings for the ABCs for data hooks
* Updated the docstrings of data-encoding hooks
* Updated DocStrings for the 'feature_selection.py' data hooks. Also fixes a slight error in the SampleNullityDrop hook, where it was evaluating checking the threshold against sample count (rather than feature count)
* [Minor] Corrected indentation of use-case to match the rest of the doc-strings
* Added the docstring for the Imputation data hooks (of which there is currently only one: SimpleImputation)
* Added the docstring for the Standardization data hooks (of which there is currently only one: StandardScaling)
* Added missing docstring for the "evaluate_param" function, used by SciKit-Learn's root ModelManager
* Updated docstrings for the SciKit-Learn Ensemble models (mostly adding example usage)
* Updated docstring for the Linear models provided by this tool. Note the extended discussion of how this implementation works around SciKit-Learn's triple-variant approach!
* Fixed incorrect indentation in the use-case examples for Ensemble models. Oops
* Extended the docstring of KNeighborsClassifierManager to include example usage. Woohoo
* Added example usage to the SVC docstring.
* Extended the docstring of the tuning utility classes, to clarify their use in the broader context of the framework.
* Update data/hooks/encoding.py
  Addendum suggested by valosekj
  Co-authored-by: Jan Valosek <39456460+valosekj@users.noreply.github.com>
* Update data/hooks/encoding.py
  Added additional (common) parameter, as suggested by Jan
  Co-authored-by: Jan Valosek <39456460+valosekj@users.noreply.github.com>
* Update README.md
  Swapped to GitHub Block formatting, which is a lot better at drawing the eye of the user to this note. Thanks Jan for pointing this out!
  Co-authored-by: Jan Valosek <39456460+valosekj@users.noreply.github.com>
* Update data/hooks/encoding.py
  Modified suggestion by Jan
* Initial Sphinx docs addition. Still WIP, be gentle
* Further improves to the documentation, mostly restructuring and clarification
* Minor formatting correction to the FAQ page; its header should now actuall say what it is
* Initial commit of the Getting Started page
* Initial addition of the "walkthrough" for MOOP; still very much WIP
* Extended the data config documentation to discuss data hooks.
* Initial addition of model configuration + parameter tuning walkthrough
* [Minor] Grammar correction in model config docs
* Added model config template for easy user reference
* Added the template file for the study config before I forget to later
* Initial commit of the study config documentation (walk-through); last step is documenting how to interpret it!
* Some clean-up and re-wiring of cross-references in docs. Should be a bit nicer to navigate now
* Trimmed the titles of the walkthrough, as they were too elaborate for the context of a tutorial
* Added missing comma in the model config tutorial JSON example; oops
* Yet another missing comma aaaaa
* Slight correction to the parameters for the output path in the study config tutorial
* Initial results interpretation discussion added to docs
* Added missing results.csv, using in result interpretation docs
* First addition to results, showing common plots which can be generated from a MOOP analysis
* Extended results documentation, showing how to compare two different MOOP runs to one another via plotting
* Added warning about an edge case bug, where the data is erroneously coded as an "object" type if the database write is interrupted during a MOOPs run
* Initial addition of statistical analyses in the walkthrough; still very much WIP
* Swapped to PyData theme; should help with eye strain via its dark mode.
* [Minor] Corrected misleading comment in statistics documentation. Oops
* Added section on calculating statistics for single runs, place before multi-run comparison
* Added a section detailing some common statistical analyses that can be run using MOOP's results
* Fixed ".. codeblock" appearing erroneously in some doc pages
* Clarified an example of the dataset in question
* Initial ReadTheDocs commit!
* Clarified null-like data formatting
* Changed header of "index" section to remove duplicate adjacent headers
* Fixed strategy used in docs

--------- Kalum Ost
1 parent e691f03 commit b103489

35 files changed

Lines changed: 1538 additions & 76 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@
 /testing/output/output.db
 /testing/private/*
 /.idea/*
+/docs/build/
 *__pycache__/
 .vscode/
 .DS_Store

.readthedocs.yaml

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+version: "2"
+
+build:
+  os: "ubuntu-22.04"
+  tools:
+    python: "3.12"
+
+python:
+  install:
+    - requirements: docs/requirements.txt
+
+sphinx:
+  configuration: docs/source/conf.py

README.md

Lines changed: 25 additions & 47 deletions
@@ -15,27 +15,23 @@ it can be easily extended to allow for the analysis of any tabular dataset.
     * `conda activate modular_optuna_ml`
     * `mamba activate modular_optuna_ml`
 4. Done!
-5. This only sets up the tool to be run; you will still need to create the configuration files for the analyses you want to run (see `testing` for an example).
+
+> [!NOTE]
+> This only sets up the tool to be run; you will still need to create the configuration files for the analyses you want to run (see the `testing` directory for some examples).
 
 ## Running the Program
 
 Four files are needed to run an analysis
 
-* A tabular dataset, containing the metrics you want to run the analysis on
-    * Should contain at least 1 independent and 1 target metric; unsupervised analyses are currently
-      not supported
+* A tabular dataset (usually a `.tsv` file), containing the metrics you want to run the analysis on
+    * Should contain at least 1 independent and 1 target metric; unsupervised analyses are currently not supported
 * A data configuration file; this defines where a dataset is and what pre-processing methods
-  should be applied to its contents. An example, alongside the dataset it manages, can be found
-  in `testing/iris_data/`
+  should be applied to its contents. An example, alongside the dataset it manages, can be found in `testing/iris_data/`
 * A model configuration file; this defines which ML model to test, which hyper-parameters to tune,
   and how to tune them. A few examples are available in `testing/model_configs/`
-* A study configuration file; this defines which metrics to evaluated throughout the runtime of the
-  analysis, and where to save the results (currently only supports an SQLite DB output format). An
-  example is provided in `testing/testing_study_config.json`
+* A study configuration file; this defines which metrics to evaluate throughout the runtime of the analysis, and where to save the results (currently only supports an SQLite DB output format). An example is provided in `testing/testing_study_config.json`
 
-Once all three have been created, and you have installed all dependencies (detailed in
-`environment.yml`) simply run the following command (replacing the values within the
-curly brackets with the corresponding file name):
+Once all three have been created, and you have installed all dependencies (detailed in `environment.yml`) simply run the following command (replacing the values within the curly brackets with the corresponding file name):
 
 `python run_ml_analysis.py -d {data_config} -m {model_config} -s {study_config}`
 
@@ -51,48 +47,30 @@ The overall structure of the analysis can be broken down into the following broa
 
 1. **Configuration Loading:** All configuration files are loaded and checked for validity.
 2. **Dataset Loading:** The tabular dataset designated in the data configuration file is loaded
-    * If a target column is specified, it is split off the dataset at this point to isolate it from
-      pre-processing (see below)
-3. **Study Initialization:** An Optuna study is initialized, set up to run `n_trials` trials as specified
-   in the study config file.
-    * All steps past this point occur per-trial, sampling from the corresponding `Trial` instance to
-      determine the hyperparameters to use.
-    * Configuration files denote a parameter as being "trial tunable" by placing a dictionary in the
-      place of a constant; an example of this can be seen in the `penalty` parameter for the
+    * If a target column is specified, it is split off the dataset at this point to isolate it from pre-processing (see below)
+3. **Study Initialization:** An Optuna study is initialized, set up to run `n_trials` trials as specified in the study config file.
+    * All steps past this point occur per-trial, sampling from the corresponding `Trial` instance to determine the hyperparameters to use.
+    * Configuration files denote a parameter as being "trial tunable" by placing a dictionary in the place of a constant; an example of this can be seen in the `penalty` parameter for the
       `testing/model_configs/log_reg.json` file.
-    * Details on how hyper-parameters are sampled via Optuna Trials can be found
-      (here)[https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/002_configurations.html].
-4. **Universal Pre-Processing:** Any data processing hooks for which `"run_per_replicate": true` are
-   run on the dataset in its entirety
+    * Details on how hyper-parameters are sampled via Optuna Trials can be found [here](https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/002_configurations.html).
+4. **Universal Pre-Processing:** Any data processing hooks for which `"run_per_replicate": true` are run on the dataset in its entirety
     * If a data processing hook does not specify a `run_per_replicate` value, it defaults to `true`.
-5. **In-Out Splits:** The dataset is split via a stratified k-fold split into in- and out-groups,
-   `n_replicates` times.
+5. **In-Out Splits:** The dataset is split via a stratified k-fold split into in- and out-groups, `n_replicates` times.
     * As the parameter name implies, each of these splits will make up an analytical "replicate"
     * Any post-split hooks for which `"run_per_replicate": true` will also run here, fitting to the
       in-dataset and transforming both the in- and out-dataset if possible
     * If a data processing hook does not specify a `run_per_replicate` value, it defaults to `true`.
-    * NOTE: Despite this occurring per-trial, the RNG state being fixed prior to study start ensures that
-      the in-out datasets are the same for all trials, so long as universal pre-processing did not delete
-      and samples during its run-time
-6. **Replicate Pre-Processing:** For each in-dataset, any data processing hooks for which
-   `"run_per_cross": true` are run on the in-dataset.
+    * NOTE: Despite this occurring per-trial, the RNG state being fixed prior to study start ensures that the in-out datasets are the same for all trials, so long as universal pre-processing did not delete any samples during its run-time
+6. **Replicate Pre-Processing:** For each in-dataset, any data processing hooks for which `"run_per_cross": true` are run on the in-dataset.
     * If a data processing hook does not specify a `run_per_cross` value, it defaults to `false`.
-7. **Train-Test Splits:** The validation dataset is split via a stratified k-fold split into
-   `n_crosses` splits, as defined in the study configuration file.
+7. **Train-Test Splits:** The validation dataset is split via a stratified k-fold split into `n_crosses` splits, as defined in the study configuration file.
     * As the parameter name implies, each of these splits will make up an analytical "cross"
-    * Any post-split hooks for which `"run_per_cross": true` will also run here, fitting to the
-      train dataset and transforming both the train and test set if possible
+    * Any post-split hooks for which `"run_per_cross": true` will also run here, fitting to the train dataset and transforming both the train and test set if possible
     * If a data processing hook does not specify a `run_per_cross` value, it defaults to `false`.
-8. **Cross-Validate Performance Reported:** Any metrics that the user requested be tracked are
-   calculated. These metrics are defined in the study config like so.
-    * `train`: Evaluate the metric on a model which has been trained on the training set, evaluating the
-      metric from the model itself, or from the model's output when applied to the test set.
-        * As a result of this being run once per cross, each metric specified at this hook will result in
-          `n_crosses` values being output (each denoted as `{metric_name} [{cross_idx}]`)
-    * `validate`: Evaluate the metric on a model which has been trained on the in-dataset, evaluating the
-      metric from the model itself, or from the model's output when applied to the in-dataset.
-    * `test`: Evaluate the metric on a model which has been trained on the in-dataset, evaluating the
-      metric from the model itself, or from the model's output when applied to the out-dataset.
-    * `objective`: Evaluated identically to the `train` hook, but reported as an average both to you
-      and the study instance (allowing the study to guide the hyperparameter sampling in future trials)
+8. **Cross-Validate Performance Reported:** Any metrics that the user requested be tracked are calculated. These metrics are defined in the study config like so.
+    * `train`: Evaluate the metric on a model which has been trained on the training set, evaluating the metric from the model itself, or from the model's output when applied to the test set.
+        * As a result of this being run once per cross, each metric specified at this hook will result in `n_crosses` values being output (each denoted as `{metric_name} [{cross_idx}]`)
+    * `validate`: Evaluate the metric on a model which has been trained on the in-dataset, evaluating the metric from the model itself, or from the model's output when applied to the in-dataset.
+    * `test`: Evaluate the metric on a model which has been trained on the in-dataset, evaluating the metric from the model itself, or from the model's output when applied to the out-dataset.
+    * `objective`: Evaluated identically to the `train` hook, but reported as an average both to you and the study instance (allowing the study to guide the hyperparameter sampling in future trials)
         * Currently, only one `objective` metric can be defined due to this averaging.
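
The per-cross reporting described in step 8 of the README can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the function name `report_cross_metrics` is hypothetical, and starting the cross index at 0 is an assumption. It only shows how `n_crosses` scores could be labelled as `{metric_name} [{cross_idx}]` and averaged into a single objective value.

```python
from statistics import mean


def report_cross_metrics(metric_name, cross_scores):
    """Label each per-cross score and average all of them into one objective value.

    Hypothetical helper: the real tool may index crosses differently.
    """
    # One labelled entry per cross, e.g. "accuracy [0]", "accuracy [1]", ...
    labelled = {
        f"{metric_name} [{idx}]": score for idx, score in enumerate(cross_scores)
    }
    # The objective is reported as a single averaged value
    objective = mean(cross_scores)
    return labelled, objective


labelled, objective = report_cross_metrics("accuracy", [0.5, 0.75, 1.0])
print(labelled)
print(objective)
```

Because the objective is a single averaged number, this also suggests why only one `objective` metric can be defined per study.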
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+{
+  "label": "data_label",
+  "data_source": "path/to/your/data.csv",
+  "format": "tabular",
+  "separator": ",",
+  "index": "column_to_label_samples_with",
+  "pre_split_hooks": [
+    {
+      "type": "hook_type",
+      "param1": 1,
+      "param2": "b"
+    }, {
+      "type": "tunable_hook_type",
+      "tunable_param": {
+        "label": "param_label_in_db",
+        "type": "int",
+        "low": 1,
+        "high": 10
+      }
+    }
+  ],
+  "post_split_hooks": [
+    {
+      "type": "fitted_hook_type",
+      "param3": "For the Greater Good"
+    }, {
+      "type": "fitted_and_tunable_hook_type",
+      "tunable_param": {
+        "label": "fitted_param_label_in_db",
+        "type": "float",
+        "log": true,
+        "low": 0.1,
+        "high": 10.0
+      }
+    }
+  ]
+}
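
As a rough illustration of how the data configuration template above might be read, the sketch below walks both hook lists and collects every parameter whose value is a dictionary, which is how the README says a "trial tunable" parameter is denoted. The function `find_tunable_params` is a hypothetical helper written for this example, and the embedded JSON is a trimmed copy of the template.

```python
import json


def find_tunable_params(data_config):
    """Return (hook_type, param_name, label) for every dict-valued hook parameter."""
    found = []
    for stage in ("pre_split_hooks", "post_split_hooks"):
        for hook in data_config.get(stage, []):
            for name, value in hook.items():
                # A dictionary in place of a constant marks the parameter as tunable
                if isinstance(value, dict):
                    found.append((hook["type"], name, value.get("label")))
    return found


config = json.loads("""
{
  "label": "data_label",
  "pre_split_hooks": [
    {"type": "hook_type", "param1": 1, "param2": "b"},
    {"type": "tunable_hook_type",
     "tunable_param": {"label": "param_label_in_db", "type": "int", "low": 1, "high": 10}}
  ],
  "post_split_hooks": [
    {"type": "fitted_hook_type", "param3": "For the Greater Good"},
    {"type": "fitted_and_tunable_hook_type",
     "tunable_param": {"label": "fitted_param_label_in_db", "type": "float",
                       "log": true, "low": 0.1, "high": 10.0}}
  ]
}
""")

tunable = find_tunable_params(config)
for entry in tunable:
    print(entry)
```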
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+{
+  "label": "model_label",
+  "model": "ModelType",
+  "parameters": {
+    "non_tuned_param": "value",
+    "tuned_float_param": {
+      "label": "float_param_label",
+      "type": "float",
+      "low": 1.0,
+      "high": 2.0
+    },
+    "tuned_choice_param": {
+      "label": "cat_label",
+      "type": "category",
+      "choices": ["A", "B", 3]
+    }
+  }
+}
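
One plausible way to turn the tunable entries in the model configuration template above into sampled values is sketched below. This is an assumption about the general approach, not the project's code. `resolve_param` is a hypothetical helper, and `MidpointTrial` is a stand-in that only mirrors the method names of Optuna's `Trial` (`suggest_float`, `suggest_int`, `suggest_categorical`), so the example runs without Optuna installed.

```python
class MidpointTrial:
    """Hypothetical stand-in for an Optuna Trial; returns deterministic values."""

    def suggest_float(self, name, low, high, log=False):
        return (low + high) / 2

    def suggest_int(self, name, low, high, log=False):
        return (low + high) // 2

    def suggest_categorical(self, name, choices):
        return choices[0]


def resolve_param(trial, spec):
    """Return constants unchanged; sample dict-valued (tunable) specs from the trial."""
    if not isinstance(spec, dict):
        return spec
    kind = spec["type"]
    if kind == "float":
        return trial.suggest_float(
            spec["label"], spec["low"], spec["high"], log=spec.get("log", False)
        )
    if kind == "int":
        return trial.suggest_int(
            spec["label"], spec["low"], spec["high"], log=spec.get("log", False)
        )
    if kind == "category":
        return trial.suggest_categorical(spec["label"], spec["choices"])
    raise ValueError(f"Unsupported tunable type: {kind!r}")


parameters = {
    "non_tuned_param": "value",
    "tuned_float_param": {
        "label": "float_param_label", "type": "float", "low": 1.0, "high": 2.0
    },
    "tuned_choice_param": {
        "label": "cat_label", "type": "category", "choices": ["A", "B", 3]
    },
}

trial = MidpointTrial()
resolved = {name: resolve_param(trial, spec) for name, spec in parameters.items()}
print(resolved)
```

With a real Optuna `Trial` in place of the stand-in, the `label` would be the name under which the sampled value is recorded for that trial.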
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+{
+  "label": "StudyLabel",
+  "random_seed": 12345,
+  "no_replicates": 10,
+  "no_crosses": 10,
+  "no_trials": 10,
+  "target": "metric_label",
+  "objective": "metric_function",
+  "metrics": {
+    "train": [
+      "metric_function"
+    ],
+    "validate": [
+      "metric_function"
+    ],
+    "test": [
+      "metric_function"
+    ]
+  },
+  "track_params": true,
+  "output_path": "path/to/output.db"
+}

data/hooks/__init__.py

Lines changed: 10 additions & 0 deletions
@@ -8,6 +8,16 @@
 
 # Decorator to allow for registry key to be kept alongside the class of interest
 def registered_data_hook(key: str):
+    """
+    Decorator for registering a data hook.
+
+    :param key: The label that this data hook will be registered as. This will be the string that the user provides
+    in the 'type' argument within the data configuration file to request the corresponding data hook be used.
+
+    NOTE: Currently, for a data hook to be registered, the package it is part of must be imported in this file
+    specifically. We are looking into a more elegant solution to this, but for now, add new imports to the set
+    placed below this class to ensure the data hooks in that module are registered correctly.
+    """
     def _decorator(cls: Type[DataHook]):
         # Decorator which registers a data manager under a specific key automatically
         if key in DATA_HOOKS.keys():
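
The registry-decorator pattern that this docstring describes can be sketched in isolation as below. All names here (`HOOK_REGISTRY`, `register_hook`, `DropNulls`) are hypothetical, and raising on a duplicate key is an assumption, since the hunk above is cut off before showing what the real `registered_data_hook` does in that case.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a config 'type' string to a hook class
HOOK_REGISTRY: Dict[str, type] = {}


def register_hook(key: str) -> Callable[[type], type]:
    """Return a class decorator which stores the class under `key`."""
    def _decorator(cls: type) -> type:
        # Assumption: duplicate keys are treated as an error
        if key in HOOK_REGISTRY:
            raise ValueError(f"Hook key '{key}' is already registered")
        HOOK_REGISTRY[key] = cls
        return cls
    return _decorator


@register_hook("drop_nulls")
class DropNulls:
    def __init__(self, config: dict):
        self.config = config


# A config entry such as {"type": "drop_nulls"} can now be resolved to a class
entry = {"type": "drop_nulls"}
hook = HOOK_REGISTRY[entry["type"]](entry)
print(type(hook).__name__)
```

Registration happens as a side effect of the class definition being executed, which is consistent with the docstring's note that a module must be imported before its hooks become available.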

data/hooks/base.py

Lines changed: 31 additions & 11 deletions
@@ -10,8 +10,18 @@ class DataHook(ABC):
     """
     Basic implementation for functions which can be called as data hooks.
 
-    These should be configured during program init (to allow for fail-faster checking),
-    after which they will be run at the user-specified points within the dataset
+    Data hook use is split into two stages: initialization and application.
+
+    During initialization, any configuration settings the user provided are parsed and used to initialize the data hook
+    instance. This is done before any ML analyses are run (to follow the fail-faster paradigm), and is where you
+    should parse any configuration options and check their validity. Anything that needs to be run once at the
+    beginning, before the data hook is applied to any data, should be done here as well. This code is usually placed
+    into the `from_config` function.
+
+    During application, the hook is applied to the data provided to it (in `BaseDataManager` form). At minimum, the
+    data hook will receive a data manager `x` containing the feature/regressor values. If the analysis is a
+    supervised one, it can also receive a data manager `y` containing the target values. See `run` for further
+    details.
     """
     def __init__(self, config: dict, logger: Logger = Logger.root):
         # Basic init implementation which tracks attributes shared with all data hooks
@@ -40,18 +50,27 @@ def from_config(cls, config: dict, logger: Logger = Logger.root) -> Self:
     @abstractmethod
     def run(self, x: BaseDataManager, y: Optional[BaseDataManager] = None) -> BaseDataManager:
         """
-        Run this hook's process on a given DataManager in its entirely.
+        Run this hook's process on a given DataManager in its entirety.
+
+        To avoid potentially propagating data modifications across replicates and/or cross-validation splits, this
+        function should return a modified copy of the original input data, rather than modifying the input
+        DataManager(s) directly!
+
         :param x: The data to process
         :param y: The target metric to use, if the data hook needs it.
-        :return: The data manager, post-processing. For safety, it should generally be its own (copied) instance
+        :return: A copy of the original data manager `x`, with this data hook applied to it.
         """
         ...
 
 
 class FittedDataHook(DataHook, ABC):
     """
-    Data hook which "fits" itself to a set of training data, and uses that fit to
-    inform how it will be applied to other datasets
+    An extended data hook which allows for the hook to be "fit" to a training dataset, then applied to said training
+    dataset AND another testing dataset. This should be used for data hooks which manage data transformations which,
+    if they were applied to the entire dataset indiscriminately, would result in data leakage.
+
+    Like `run` before it, this function should return modified copies of its input data, rather than modifying the input
+    DataManager(s) directly!
     """
     @abstractmethod
     def run_fitted(self,
@@ -62,11 +81,12 @@ def run_fitted(self,
     ) -> (BaseDataManager, BaseDataManager):
         """
         Run this hook's process on a pair of DataManagers, fitting on the training input applying to both
-        :param x_train: The data which should be used to "fit" the hook to, before it is applied to both
-        :param x_test: A dataset which will have the hook applied to it, but not fit to it.
-        :param y_train: The target metric to use during fitting, if the data hook needs it.
-        :param y_test: The target metric to use during application to testing, if the data hook needs it.
+
+        :param x_train: A dataset which will be used to "train" the hook. Post-training, the hook is applied to it as well
+        :param x_test: A dataset which will have the hook applied to it only, but will not affect how the hook is "trained".
+        :param y_train: The target metric associated with `x_train` for each of its samples, should the analysis be a supervised one.
+        :param y_test: The target metric associated with `x_test` for each of its samples, should the analysis be a supervised one.
             NOTE: 'y_test' is here solely for standardization, and should probably never be used to avoid overfitting!
-        :return: The modified versions of x_train and x_test, after the fit has been applied to them
+        :return: The modified copies of x_train and x_test, after the fit has been applied to them.
         """
         ...
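
The "fit on train, apply to both, return copies" contract described in these docstrings can be illustrated with a toy example. This is a simplified sketch only: `ToyStandardScaler` is hypothetical, uses plain lists rather than `BaseDataManager` instances, and does not inherit from the project's `FittedDataHook`.

```python
from statistics import mean, pstdev
from typing import List, Tuple


class ToyStandardScaler:
    """Hypothetical fitted hook: statistics come from the training data only."""

    def run_fitted(
        self, x_train: List[float], x_test: List[float]
    ) -> Tuple[List[float], List[float]]:
        # "Fit": the mean and standard deviation are computed on the training set alone,
        # so nothing about the test set leaks into the transformation
        mu = mean(x_train)
        sigma = pstdev(x_train) or 1.0  # guard against zero variance

        # "Apply": build new lists rather than modifying the inputs in place
        scaled_train = [(v - mu) / sigma for v in x_train]
        scaled_test = [(v - mu) / sigma for v in x_test]
        return scaled_train, scaled_test


train = [1.0, 3.0]
test = [5.0]
new_train, new_test = ToyStandardScaler().run_fitted(train, test)
print(new_train, new_test)
print(train, test)  # the original inputs are left untouched
```

Returning fresh objects means the same input data can safely be reused across replicates and cross-validation splits.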
