|
1 | 1 | # COBRA :snake: <img src="https://github.com/JanBenisek/Pytho/blob/master/pythongrey%20large.png" width="100" align="right"> |
2 | 2 |
|
3 | | -**Cobra** here on GitHub is refactored web-based cobra originally developed by Guillaume. The goal is to wrap the back-end into easy to use Python package. |
| 3 | +**Cobra** is a Python package that implements the Python Predictions methodology for predictive analytics. It consists of a main script/notebook that can be used to build and save a predictive model just by setting a few parameters. The main script itself consists of several modules that can be used independently of one another to build custom scripts. |
4 | 4 |
|
5 | | -If you wish to modify the code, the best is to fork the repository or create another branch! |
| 5 | +Note that this package is a refactored version of the back-end of the original web-based cobra. |
6 | 6 |
|
7 | | -:heavy_exclamation_mark: Still lots of :bug: and under construction, keep that in mind:heavy_exclamation_mark: |
| 7 | +:heavy_exclamation_mark: Be aware that there could still be :bug: in the code :heavy_exclamation_mark: |
8 | 8 |
|
9 | | -## What can Cobra 1.0 do: |
10 | | - * Transform given .csv to be ready to use for prediction modelling |
11 | | - * _Clense the headers, partition into train/selection/validation sets, sample, bins and regroups variables and add columns with incidence rate per categories._ |
| 9 | +## What can cobra do? |
| 10 | + |
| 11 | + * Prepare a given pandas DataFrame for prediction modelling: |
| 12 | + - partition into train/selection/validation sets |
| 13 | + - create bins from continuous variables |
| 14 | + - regroup categorical variables |
| 15 | + - replace missing values and |
| 16 | + - add columns with incidence rate per category/bin. |
12 | 17 | * Perform univariate selection based on AUC |
13 | | - * Find best model by forward selection |
| 18 | + * Compute correlation matrix of predictors |
| 19 | + * Find suitable variables using forward feature selection |
14 | 20 | * Visualize the results |
15 | 21 | * Allow iteration among each step for the analyst |
16 | | - |
17 | | -## Installation |
18 | | - * Clone this repository to your local PC (use GitHub Desktop). This assumes that the cloned repository will be in this directory `C:\Local\pers\Documents\GitHub\cobra` |
19 | | - * Open Powershell and navigate to that folder |
20 | | - * Once you are in the folder, execute `python setup.py install`. This is how the line should look like: |
21 | | - `PS C:\Local\pers\Documents\GitHub\cobra> python setup.py install` |
22 | | - * Restart kernel and you are ready to go |
23 | | - * For example of use, see the Jupyter Notebook in `examples` folder |
| 22 | + |
| 23 | +## Getting started |
| 24 | + |
| 25 | +These instructions will get you a copy of the project up and running on your local machine for usage, development and testing purposes. Furthermore, this section includes some brief examples on how to use it. |
| 26 | + |
| 27 | +### Requirements |
| 28 | + |
| 29 | +This package requires the usual Python packages for data science: |
| 30 | + |
| 31 | +* numpy |
| 32 | +* scipy |
| 33 | +* matplotlib |
| 34 | +* seaborn |
| 35 | +* pandas |
| 36 | +* scikit-learn |
| 37 | + |
| 38 | +These packages, along with their versions, are listed in `requirements.txt` and `conda_env.txt`. To install them using pip, run |
| 39 | + |
| 40 | +``` |
| 41 | +pip install -r requirements.txt |
| 42 | +``` |
| 43 | + |
| 44 | +or using conda |
| 45 | + |
| 46 | +``` |
| 47 | +conda install --file requirements.txt |
| 48 | +``` |
| 49 | + |
| 50 | +__Note__: if you want to install cobra with e.g. pip, you don't have to install all of these requirements as these are automatically installed with cobra itself. |
| 51 | + |
| 52 | +### Installation |
| 53 | + |
| 54 | +As this package is an internal package that is not open-sourced, it is not available through `pip` or `conda`. As a result, the package has to be installed manually using the following steps: |
| 55 | + |
| 56 | + * Clone this repository. |
| 57 | + * Open a shell that can execute Python code and navigate to the folder into which this repo was cloned. |
| 58 | + * Once you are in the folder, execute `python setup.py install` or `pip install .` (preferred). |
| 59 | + |
| 60 | +### Usage |
| 61 | + |
| 62 | +This section contains detailed examples for each step on how to use COBRA for building a predictive model. All classes and functions contain detailed documentation, so in case you want more information on a class or function, simply run the following python snippet: |
| 63 | + |
| 64 | +```python |
| 65 | +help(function_or_class_you_want_info_from) |
| 66 | +``` |
| 67 | + |
| 68 | +In the examples below, we assume the data for model building is available in a pandas DataFrame called `basetable`. This DataFrame should contain an ID column (e.g. customernumber), a target column (e.g. "TARGET") and a number of candidate predictors to build our model with. The snippets refer to the names of the ID and target columns through the variables `id_column_name` and `target_column_name`. |
| 69 | + |
| 70 | +```python |
| 71 | +from cobra.preprocessing import PreProcessor |
| 72 | + |
| 73 | +# Prepare data |
| 74 | +# create instance of PreProcessor from parameters |
| 75 | +# (many options possible, see source code for docs) |
| 76 | +path = "path/to/store/preprocessing/pipeline/as/json/file/for/later/re-use/" |
| 77 | +preprocessor = PreProcessor.from_params(serialization_path=path) |
| 78 | + |
| 79 | +# split data into train-selection-validation set |
| 80 | +# in the result, an additional column "split" will be created |
| 81 | +# containing each of those values |
| 82 | +basetable = preprocessor.train_selection_validation_split( |
| 83 | +    basetable, |
| 84 | +    target_column_name=target_column_name, |
| 85 | +    train_prop=0.6, selection_prop=0.2, |
| 86 | +    validation_prop=0.2) |
| 87 | + |
| 88 | +# create lists containing the column names of the continuous |
| 89 | +# and discrete variables |
| 90 | +continuous_vars = [] |
| 91 | +discrete_vars = [] |
| 92 | + |
| 93 | +# fit the pipeline (will automatically be stored to "path" variable) |
| 94 | +preprocessor.fit(basetable[basetable["split"] == "train"], |
| 95 | +                 continuous_vars=continuous_vars, |
| 96 | +                 discrete_vars=discrete_vars, |
| 97 | +                 target_column_name=target_column_name) |
| 98 | + |
| 99 | +# When you want to reuse the pipeline the next time, simply run |
| 100 | +# preprocessor = PreProcessor.from_pipeline(path) and you're good to go! |
| 101 | + |
| 102 | +# transform the data (e.g. perform discretisation, incidence replacement, ...) |
| 103 | +basetable = preprocessor.transform(basetable, |
| 104 | +                                   continuous_vars=continuous_vars, |
| 105 | +                                   discrete_vars=discrete_vars) |
| 106 | + |
| 107 | +``` |
| 108 | + |
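To make two of the preprocessing steps more concrete, here is a minimal, dependency-free sketch of assigning rows to train/selection/validation and of replacing a categorical value by its incidence rate (the mean of the target) learned on the train rows only. This is not cobra's implementation: the helpers `assign_splits` and `incidence_by_category`, the `region` column and the toy rows are all hypothetical.

```python
import random
from collections import defaultdict


def assign_splits(n_rows, train_prop=0.6, selection_prop=0.2, seed=42):
    """Return a shuffled list of split labels, one per row (hypothetical helper)."""
    n_train = round(n_rows * train_prop)
    n_selection = round(n_rows * selection_prop)
    labels = (["train"] * n_train
              + ["selection"] * n_selection
              + ["validation"] * (n_rows - n_train - n_selection))
    random.Random(seed).shuffle(labels)
    return labels


def incidence_by_category(rows, column, target):
    """Map each category in `column` to the mean of `target` (hypothetical helper)."""
    totals = defaultdict(lambda: [0, 0])  # category -> [sum of target, row count]
    for row in rows:
        totals[row[column]][0] += row[target]
        totals[row[column]][1] += 1
    return {cat: s / n for cat, (s, n) in totals.items()}


# toy basetable as a list of dicts (made-up data)
rows = [{"region": r, "TARGET": t}
        for r, t in [("north", 1), ("north", 0), ("south", 0), ("south", 0),
                     ("north", 1), ("south", 1), ("north", 0), ("south", 0),
                     ("north", 1), ("south", 0)]]

for row, label in zip(rows, assign_splits(len(rows))):
    row["split"] = label

# learn the incidence rates on the train rows only, then apply them to all rows
train_rows = [row for row in rows if row["split"] == "train"]
incidence = incidence_by_category(train_rows, "region", "TARGET")
overall = sum(row["TARGET"] for row in train_rows) / len(train_rows)
for row in rows:
    # categories unseen in train fall back to the overall train incidence
    row["region_enc"] = incidence.get(row["region"], overall)
```

Learning the rates on the train rows alone mirrors the snippet above, where the pipeline is fitted on the `"train"` split and then applied to the whole basetable.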
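| 109 | +Once the preprocessing pipeline is fitted and applied to your data, we are ready to start modelling. Before doing so, we can already compute the PIG tables for later use. Note that `preprocessed_predictors` is the list of encoded predictor columns (those ending in `_enc`), which is built in the univariate preselection snippet further below: |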
| 110 | + |
| 111 | +```python |
| 112 | +from cobra.evaluation import generate_pig_tables |
| 113 | + |
| 114 | +pig_tables = generate_pig_tables(basetable[basetable["split"] == "selection"], |
| 115 | +                                 id_column_name=id_column_name, |
| 116 | +                                 target_column_name=target_column_name, |
| 117 | +                                 preprocessed_predictors=preprocessed_predictors) |
| 118 | +``` |
| 119 | + |
| 120 | +Once these PIG tables are computed, we can start with the _univariate preselection_: |
| 121 | + |
| 122 | +```python |
| 123 | +from cobra.model_building import univariate_selection |
| 124 | +from cobra.evaluation import plot_univariate_predictor_quality |
| 125 | +from cobra.evaluation import plot_correlation_matrix |
| 126 | + |
| 127 | +# Get list of predictor names to use for univariate_selection |
| 128 | +preprocessed_predictors = [col for col in basetable.columns if col.endswith("_enc")] |
| 129 | + |
| 130 | +# perform univariate selection on preprocessed predictors: |
| 131 | +df_auc = univariate_selection.compute_univariate_preselection( |
| 132 | +    target_enc_train_data=basetable[basetable["split"] == "train"], |
| 133 | +    target_enc_selection_data=basetable[basetable["split"] == "selection"], |
| 134 | +    predictors=preprocessed_predictors, |
| 135 | +    target_column=target_column_name, |
| 136 | +    preselect_auc_threshold=0.53,  # if auc_selection <= 0.53 exclude predictor |
| 137 | +    preselect_overtrain_threshold=0.05  # if (auc_train - auc_selection) >= 0.05 --> overfitting! |
| 138 | +) |
| 139 | + |
| 140 | +# Plot df_auc to get a horizontal barplot: |
| 141 | +plot_univariate_predictor_quality(df_auc) |
| 142 | + |
| 143 | +# compute correlations between preprocessed predictors: |
| 144 | +df_corr = (univariate_selection |
| 145 | +           .compute_correlations(basetable[basetable["split"] == "train"], |
| 146 | +                                 preprocessed_predictors)) |
| 147 | + |
| 148 | +# plot correlation matrix |
| 149 | +plot_correlation_matrix(df_corr) |
| 150 | + |
| 151 | +# get a list of the predictors selected by the univariate selection |
| 152 | +preselected_predictors = (univariate_selection |
| 153 | +                          .get_preselected_predictors(df_auc)) |
| 154 | +``` |
| 155 | + |
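The two thresholds used above can be illustrated with a small standalone sketch. It assumes the rule described in the code comments (drop a predictor if its selection AUC is at most 0.53, or if the train AUC exceeds the selection AUC by 0.05 or more). The functions `auc` and `preselect` are hypothetical helpers written for this illustration, not part of cobra.

```python
def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative; ties count as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def preselect(auc_train, auc_selection,
              auc_threshold=0.53, overtrain_threshold=0.05):
    """Keep a predictor only if it is informative and not overtrained
    (hypothetical helper mirroring the comments in the snippet above)."""
    informative = auc_selection > auc_threshold
    stable = (auc_train - auc_selection) < overtrain_threshold
    return informative and stable


# toy example: one encoded predictor scored against a binary target
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

keep_good = preselect(auc_train=0.70, auc_selection=0.68)       # informative and stable
keep_weak = preselect(auc_train=0.52, auc_selection=0.52)       # too weak
keep_overfit = preselect(auc_train=0.80, auc_selection=0.70)    # overtrained
```

The pairwise definition of AUC is quadratic in the number of rows, so it only serves to show the idea on small data.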
| 156 | +After a preselection is done on the predictors, we can start the model building itself using forward feature selection to choose the right set of predictors: |
| 157 | + |
| 158 | +```python |
| 159 | +from cobra.model_building import ForwardFeatureSelection |
| 160 | +from cobra.evaluation import plot_performance_curves |
| 161 | +from cobra.evaluation import plot_variable_importance |
| 162 | + |
| 163 | +forward_selection = ForwardFeatureSelection(max_predictors=30, |
| 164 | +                                            pos_only=True) |
| 165 | + |
| 166 | +# fit the forward feature selection on the train data |
| 167 | +# has optional parameters to force and/or exclude certain predictors |
| 168 | +forward_selection.fit(basetable[basetable["split"] == "train"], |
| 169 | +                      target_column_name, |
| 170 | +                      preselected_predictors) |
| 171 | + |
| 172 | +# compute model performance (e.g. AUC for train-selection-validation) |
| 173 | +performances = (forward_selection |
| 174 | +                .compute_model_performances(basetable, target_column_name)) |
| 175 | + |
| 176 | +# plot performance curves |
| 177 | +plot_performance_curves(performances) |
| 178 | + |
| 179 | +# After plotting the performances and selecting the model, |
| 180 | +# we can extract this model from the forward_selection class: |
| 181 | +model = forward_selection.get_model_from_step(5) |
| 182 | + |
| 183 | +# Note that the chosen model has 6 variables (Python lists start at index 0), |
| 184 | +# which can be obtained as follows: |
| 185 | +final_predictors = model.predictors |
| 186 | +# We can also compute and plot the importance of each predictor in the model: |
| 187 | +variable_importance = model.compute_variable_importance( |
| 188 | +    basetable[basetable["split"] == "selection"] |
| 189 | +) |
| 190 | +plot_variable_importance(variable_importance) |
| 191 | +``` |
| 192 | + |
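As a rough illustration of the idea behind forward feature selection, the sketch below greedily adds, at each step, the candidate that gives the best score together with the predictors chosen so far. It is a generic sketch and not cobra's `ForwardFeatureSelection`: the function `forward_select`, the predictor names and the scoring function `toy_score` are made up, and in practice the score would be something like the AUC of a model fitted on the selected predictors.

```python
def forward_select(candidates, score_fn, max_predictors):
    """Greedy forward selection (hypothetical helper).

    Returns one (predictors, score) tuple per step; the entry at index k
    holds k + 1 predictors.
    """
    selected, steps = [], []
    remaining = list(candidates)
    while remaining and len(selected) < max_predictors:
        # pick the candidate that scores best together with the current set
        best = max(remaining, key=lambda c: score_fn(selected + [c]))
        selected.append(best)
        remaining.remove(best)
        steps.append((list(selected), score_fn(selected)))
    return steps


# made-up scoring function: each predictor adds some performance, but
# "age_enc" and "income_enc" are partly redundant with each other
GAINS = {"age_enc": 0.10, "income_enc": 0.08, "region_enc": 0.03}


def toy_score(predictors):
    score = 0.5 + sum(GAINS[p] for p in predictors)
    if "age_enc" in predictors and "income_enc" in predictors:
        score -= 0.06  # penalty for the overlapping information
    return score


steps = forward_select(list(GAINS), toy_score, max_predictors=3)
```

Because the step at index `k` holds `k + 1` predictors, this also shows why step 5 in the snippet above corresponds to a model with 6 variables.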
| 193 | +Now that we have built and selected a final model, it is time to evaluate it against various evaluation metrics: |
| 194 | + |
| 195 | +```python |
| 196 | +from cobra.evaluation import Evaluator |
| 197 | + |
| 198 | +# get numpy arrays of true target labels and predicted scores: |
| 199 | +y_true = basetable[basetable["split"] == "selection"][target_column_name].values |
| 200 | +y_pred = model.score_model(basetable[basetable["split"] == "selection"]) |
| 201 | + |
| 202 | +evaluator = Evaluator() |
| 203 | +evaluator.fit(y_true, y_pred) # Automatically find the best cut-off probability |
| 204 | + |
| 205 | +# Get various scalar metrics such as accuracy, AUC, precision, recall, ... |
| 206 | +evaluator.scalar_metrics |
| 207 | + |
| 208 | +# Plot non-scalar evaluation metrics: |
| 209 | +evaluator.plot_roc_curve() |
| 210 | + |
| 211 | +evaluator.plot_confusion_matrix() |
| 212 | + |
| 213 | +evaluator.plot_cumulative_gains() |
| 214 | + |
| 215 | +evaluator.plot_lift_curve() |
| 216 | + |
| 217 | +evaluator.plot_cumulative_response_curve() |
| 218 | + |
| 219 | +``` |
| 220 | + |
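For readers unfamiliar with cumulative gains and lift, the following dependency-free sketch shows how these quantities are commonly computed from true labels and predicted scores. It is not the `Evaluator` implementation; `cumulative_gains`, `lift` and the toy labels and scores are hypothetical.

```python
def cumulative_gains(y_true, y_pred, n_bins=10):
    """Share of all positives captured in the top k/n_bins of the scores."""
    # sort the labels by descending predicted score
    ranked = [y for _, y in sorted(zip(y_pred, y_true), key=lambda pair: -pair[0])]
    total_pos = sum(ranked)
    gains = []
    for k in range(1, n_bins + 1):
        cutoff = round(len(ranked) * k / n_bins)
        gains.append(sum(ranked[:cutoff]) / total_pos)
    return gains


def lift(y_true, y_pred, n_bins=10):
    """Cumulative lift: gains divided by the share of the population selected."""
    gains = cumulative_gains(y_true, y_pred, n_bins)
    return [g / (k / n_bins) for k, g in enumerate(gains, start=1)]


# toy labels and scores (made-up data)
y_true_toy = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
y_pred_toy = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

gains_toy = cumulative_gains(y_true_toy, y_pred_toy, n_bins=5)
lift_toy = lift(y_true_toy, y_pred_toy, n_bins=5)
```

A lift above 1 in the first bins means that the top-scored rows contain more positives than a random selection of the same size would.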
| 221 | +## Development |
| 222 | + |
| 223 | +We'd love you to contribute to the development of Cobra! To do so, clone the repo and create a _feature branch_ to do your development. Once you are finished, you can create a _pull request_ to merge it back into the main branch. Make sure to follow the _PEP 8_ style guide if you make any changes to COBRA. You should also write or modify unit tests for your changes if they are related to preprocessing! |