* Cleaned up README.md to be a bit more clean and clear (namely, removing newlines which simulated soft-wrap poorly)
* Added docstring for 'registered_data_hook'
* Updated docstrings for the ABCs for data hooks
* Updated the docstrings of data-encoding hooks
* Updated docstrings for the 'feature_selection.py' data hooks. Also fixes a slight error in the SampleNullityDrop hook, where it was checking the threshold against sample count (rather than feature count)
* [Minor] Corrected indentation of use-case to match the rest of the doc-strings
* Added the docstring for the Imputation data hooks (of which there is currently only one: SimpleImputation)
* Added the docstring for the Standardization data hooks (of which there is currently only one: StandardScaling)
* Added missing docstring for the "evaluate_param" function, used by SciKit-Learn's root ModelManager
* Updated docstrings for the SciKit-Learn Ensemble models (mostly adding example usage)
* Updated docstring for the Linear models provided by this tool. Note the extended discussion of how this implementation works around SciKit-Learn's triple-variant approach!
* Fixed incorrect indentation in the use-case examples for Ensemble models. Oops
* Extended the docstring of KNeighborsClassifierManager to include example usage. Woohoo
* Added example usage to the SVC docstring.
* Extended the docstring of the tuning utility classes, to clarify their use in the broader context of the framework.
* Update data/hooks/encoding.py
Addendum suggested by valosekj
Co-authored-by: Jan Valosek <39456460+valosekj@users.noreply.github.com>
* Update data/hooks/encoding.py
Added additional (common) parameter, as suggested by Jan
Co-authored-by: Jan Valosek <39456460+valosekj@users.noreply.github.com>
* Update README.md
Swapped to GitHub Block formatting, which is a lot better at drawing the eye of the user to this note. Thanks Jan for pointing this out!
Co-authored-by: Jan Valosek <39456460+valosekj@users.noreply.github.com>
* Update data/hooks/encoding.py
Modified suggestion by Jan
* Initial Sphinx docs addition. Still WIP, be gentle
* Further improvements to the documentation, mostly restructuring and clarification
* Minor formatting correction to the FAQ page; its header should now actually say what it is
* Initial commit of the Getting Started page
* Initial addition of the "walkthrough" for MOOP; still very much WIP
* Extended the data config documentation to discuss data hooks.
* Initial addition of model configuration + parameter tuning walkthrough
* [Minor] Grammar correction in model config docs
* Added model config template for easy user reference
* Added the template file for the study config before I forget to later
* Initial commit of the study config documentation (walk-through); last step is documenting how to interpret it!
* Some clean-up and re-wiring of cross-references in docs. Should be a bit nicer to navigate now
* Trimmed the titles of the walkthrough, as they were too elaborate for the context of a tutorial
* Added missing comma in the model config tutorial JSON example; oops
* Yet another missing comma aaaaa
* Slight correction to the parameters for the output path in the study config tutorial
* Initial results interpretation discussion added to docs
* Added missing results.csv, used in the result interpretation docs
* First addition to results, showing common plots which can be generated from a MOOP analysis
* Extended results documentation, showing how to compare two different MOOP runs to one another via plotting
* Added warning about an edge case bug, where the data is erroneously coded as an "object" type if the database write is interrupted during a MOOPs run
* Initial addition of statistical analyses in the walkthrough; still very much WIP
* Swapped to PyData theme; should help with eye strain via its dark mode.
* [Minor] Corrected misleading comment in statistics documentation. Oops
* Added section on calculating statistics for single runs, placed before multi-run comparison
* Added a section detailing some common statistical analyses that can be run using MOOP's results
* Fixed ".. codeblock" appearing erroneously in some doc pages
* Clarified an example of the dataset in question
* Initial ReadTheDocs commit!
* Clarified null-like data formatting
* Changed header of "index" section to remove duplicate adjacent headers
* Fixed strategy used in docs
---------
Kalum Ost
README.md (25 additions & 47 deletions)
@@ -15,27 +15,23 @@ it can be easily extended to allow for the analysis of any tabular dataset.
 * `conda activate modular_optuna_ml`
 * `mamba activate modular_optuna_ml`
 4. Done!
-5. This only sets up the tool to be run; you will still need to create the configuration files for the analyses you want to run (see `testing` for an example).
+
+> [!NOTE]
+> This only sets up the tool to be run; you will still need to create the configuration files for the analyses you want to run (see the `testing` directory for some examples).
 
 ## Running the Program
 
 Four files are needed to run an analysis
 
-* A tabular dataset, containing the metrics you want to run the analysis on
-* Should contain at least 1 independent and 1 target metric; unsupervised analyses are currently
-not supported
+* A tabular dataset (usually a `.tsv` file), containing the metrics you want to run the analysis on
+* Should contain at least 1 independent and 1 target metric; unsupervised analyses are currently not supported
 * A data configuration file; this defines where a dataset is and what pre-processing methods
-should be applied to its contents. An example, alongside the dataset it manages, can be found
-in `testing/iris_data/`
+should be applied to its contents. An example, alongside the dataset it manages, can be found in `testing/iris_data/`
 * A model configuration file; this defines which ML model to test, which hyper-parameters to tune,
 and how to tune them. A few examples are available in `testing/model_configs/`
-* A study configuration file; this defines which metrics to evaluated throughout the runtime of the
-analysis, and where to save the results (currently only supports an SQLite DB output format). An
-example is provided in `testing/testing_study_config.json`
+* A study configuration file; this defines which metrics to evaluated throughout the runtime of the analysis, and where to save the results (currently only supports an SQLite DB output format). An example is provided in `testing/testing_study_config.json`
 
-Once all three have been created, and you have installed all dependencies (detailed in
-`environment.yml`) simply run the following command (replacing the values within the
-curly brackets with the corresponding file name):
+Once all three have been created, and you have installed all dependencies (detailed in `environment.yml`) simply run the following command (replacing the values within the curly brackets with the corresponding file name):
@@ -51,48 +47,30 @@ The overall structure of the analysis can be broken down into the following broa
 
 1. **Configuration Loading:** All configuration files are loaded and checked for validity.
 2. **Dataset Loading:** The tabular dataset designated in the data configuration file is loaded
-* If a target column is specified, it is split off the dataset at this point to isolate it from
-pre-processing (see below)
-3. **Study Initialization:** An Optuna study is initialized, set up to run `n_trials` trials as specified
-in the study config file.
-* All steps past this point occur per-trial, sampling from the corresponding `Trial` instance to
-determine the hyperparameters to use.
-* Configuration files denote a parameter as being "trial tunable" by placing a dictionary in the
-place of a constant; an example of this can be seen in the `penalty` parameter for the
+* If a target column is specified, it is split off the dataset at this point to isolate it from pre-processing (see below)
+3. **Study Initialization:** An Optuna study is initialized, set up to run `n_trials` trials as specified in the study config file.
+* All steps past this point occur per-trial, sampling from the corresponding `Trial` instance to determine the hyperparameters to use.
+* Configuration files denote a parameter as being "trial tunable" by placing a dictionary in the place of a constant; an example of this can be seen in the `penalty` parameter for the
 `testing/model_configs/log_reg.json` file.
-* Details on how hyper-parameters are sampled via Optuna Trials can be found
-4. **Universal Pre-Processing:** Any data processing hooks for which `"run_per_replicate": true` are
-run on the dataset in its entirety
+* Details on how hyper-parameters are sampled via Optuna Trials can be found (here)[https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/002_configurations.html].
+4. **Universal Pre-Processing:** Any data processing hooks for which `"run_per_replicate": true` are run on the dataset in its entirety
 * If a data processing hook does not specify a `run_per_replicate` value, it defaults to `true`.
-5. **In-Out Splits:** The dataset is split via a stratified k-fold split into in- and out-groups,
-`n_replicates` times.
+5. **In-Out Splits:** The dataset is split via a stratified k-fold split into in- and out-groups, `n_replicates` times.
 * As the parameter name implies, each of these splits will make up an analytical "replicate"
 * Any post-split hooks for which `"run_per_replicate": true` will also run here, fitting to the
 in-dataset and transforming both the in- and out-dataset if possible
 * If a data processing hook does not specify a `run_per_replicate` value, it defaults to `true`.
-* NOTE: Despite this occurring per-trial, the RNG state being fixed prior to study start ensures that
-the in-out datasets are the same for all trials, so long as universal pre-processing did not delete
-and samples during its run-time
-6. **Replicate Pre-Processing:** For each in-dataset, any data processing hooks for which
-`"run_per_cross": true` are run on the in-dataset.
+* NOTE: Despite this occurring per-trial, the RNG state being fixed prior to study start ensures that the in-out datasets are the same for all trials, so long as universal pre-processing did not delete any samples during its run-time
+6. **Replicate Pre-Processing:** For each in-dataset, any data processing hooks for which "run_per_cross": true` are run on the in-dataset.
 * If a data processing hook does not specify a `run_per_cross` value, it defaults to `false`.
-7. **Train-Test Splits:** The validation dataset is split via a stratified k-fold split into
-`n_crosses` splits, as defined in the study configuration file.
+7. **Train-Test Splits:** The validation dataset is split via a stratified k-fold split into `n_crosses` splits, as defined in the study configuration file.
 * As the parameter name implies, each of these splits will make up an analytical "cross"
-* Any post-split hooks for which `"run_per_cross": true` will also run here, fitting to the
-train dataset and transforming both the train and test set if possible
+* Any post-split hooks for which `"run_per_cross": true` will also run here, fitting to the train dataset and transforming both the train and test set if possible
 * If a data processing hook does not specify a `run_per_cross` value, it defaults to `false`.
-8. **Cross-Validate Performance Reported:** Any metrics that the user requested be tracked are
-calculated. These metrics are defined in the study config like so.
-* `train`: Evaluate the metric on a model which has been trained on the training set, evaluating the
-metric from the model itself, or from the model's output when applied to the test set.
-* As a result of this being run once per cross, each metric specified at this hook will result in
-`n_crosses` values being output (each denoted as `{metric_name} [{cross_idx}]`)
-* `validate`: Evaluate the metric on a model which has been trained on the in-dataset, evaluating the
-metric from the model itself, or from the model's output when applied to the in-dataset.
-* `test`: Evaluate the metric on a model which has been trained on the in-dataset, evaluating the
-metric from the model itself, or from the model's output when applied to the out-dataset.
-* `objective`: Evaluated identically to the `train` hook, but reported as an average both to you
-and the study instance (allowing the study to guide the hyperparameter sampling in future trials)
+8. **Cross-Validate Performance Reported:** Any metrics that the user requested be tracked are calculated. These metrics are defined in the study config like so.
+* `train`: Evaluate the metric on a model which has been trained on the training set, evaluating the metric from the model itself, or from the model's output when applied to the test set.
+* As a result of this being run once per cross, each metric specified at this hook will result in `n_crosses` values being output (each denoted as `{metric_name} [{cross_idx}]`)
+* `validate`: Evaluate the metric on a model which has been trained on the in-dataset, evaluating the metric from the model itself, or from the model's output when applied to the in-dataset.
+* `test`: Evaluate the metric on a model which has been trained on the in-dataset, evaluating the metric from the model itself, or from the model's output when applied to the out-dataset.
+* `objective`: Evaluated identically to the `train` hook, but reported as an average both to you and the study instance (allowing the study to guide the hyperparameter sampling in future trials)
 * Currently, only one `objective` metric can be defined due to this averaging.
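
The replicate/cross structure that the README describes in steps 5 and 7 can be pictured with a small, self-contained sketch. This is not the tool's implementation: the helper names (`stratified_folds`, `nested_splits`) are hypothetical, a plain-Python round-robin stratification stands in for the stratified k-fold split the README mentions, and reading "`n_replicates` times" as "`n_replicates` folds" is an assumption. It only illustrates how a seed fixed before the study starts yields the same in/out and train/test groups on every trial.

```python
import random
from collections import defaultdict


def stratified_folds(indices, labels, n_splits, rng):
    """Yield (kept, held_out) index lists, one pair per fold.

    Hypothetical helper: the samples of each class are shuffled and dealt
    round-robin into folds, so every fold gets a similar class balance.
    """
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)

    folds = [[] for _ in range(n_splits)]
    for cls in sorted(by_class):
        members = by_class[cls]
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % n_splits].append(idx)

    for held_out in folds:
        held = set(held_out)
        kept = [i for i in indices if i not in held]
        yield kept, sorted(held_out)


def nested_splits(labels, n_replicates, n_crosses, seed):
    """Build in/out replicates, each holding its own train/test crosses."""
    rng = random.Random(seed)  # fixed seed: every call produces the same splits
    everything = list(range(len(labels)))
    replicates = []
    for in_idx, out_idx in stratified_folds(everything, labels, n_replicates, rng):
        # Each in-dataset is split again into (train, test) pairs
        crosses = list(stratified_folds(in_idx, labels, n_crosses, rng))
        replicates.append({"in": in_idx, "out": out_idx, "crosses": crosses})
    return replicates


# Two "trials" using the same seed see identical splits
labels = ["a"] * 6 + ["b"] * 6
first_trial = nested_splits(labels, n_replicates=3, n_crosses=2, seed=42)
second_trial = nested_splits(labels, n_replicates=3, n_crosses=2, seed=42)
```

Because both calls start from the same seed, `first_trial == second_trial` holds, which mirrors the README's note that the in-out datasets are the same for all trials as long as no samples are deleted before the split.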