Commit 9ebc383

Merge pull request #18 from PythonPredictions/develop
Develop
2 parents 489c007 + f980a7d commit 9ebc383

49 files changed

Lines changed: 6113 additions & 9996 deletions

.gitignore

Lines changed: 7 additions & 0 deletions
@@ -102,3 +102,10 @@ ENV/
102102

103103
# mypy
104104
.mypy_cache/
105+
106+
# vscode settings
107+
.vscode/
108+
109+
# Other ignore files
110+
*.pptx
111+
*.ppt

LICENSE

Lines changed: 0 additions & 21 deletions
This file was deleted.

README.md

Lines changed: 215 additions & 15 deletions
@@ -1,23 +1,223 @@
11
# COBRA :snake: <img src="https://github.com/JanBenisek/Pytho/blob/master/pythongrey%20large.png" width="100" align="right">
22

3-
**Cobra** here on GitHub is refactored web-based cobra originally developed by Guillaume. The goal is to wrap the back-end into easy to use Python package.
3+
**Cobra** is a Python package that implements the Python Predictions methodology for predictive analytics. It consists of a main script/notebook that can be used to build and save a predictive model simply by setting a few parameters. The main script itself consists of several modules that can be used independently of one another to build custom scripts.
44

5-
If you wish to modify the code, the best is to fork the repository or create another branch!
5+
Note that this package is a refactored version of the back-end of the original web-based cobra.
66

7-
:heavy_exclamation_mark: Still lots of :bug: and under construction, keep that in mind:heavy_exclamation_mark:
7+
:heavy_exclamation_mark: Be aware that there could still be :bug: in the code :heavy_exclamation_mark:
88

9-
## What can Cobra 1.0 do:
10-
* Transform given .csv to be ready to use for prediction modelling
11-
* _Clense the headers, partition into train/selection/validation sets, sample, bins and regroups variables and add columns with incidence rate per categories._
9+
## What can cobra do?
10+
11+
* Prepare a given pandas DataFrame for prediction modelling:
12+
- partition into train/selection/validation sets
13+
- create bins from continuous variables
14+
- regroup categorical variables
15+
- replace missing values and
16+
- add columns with incidence rate per category/bin.
1217
* Perform univariate selection based on AUC
13-
* Find best model by forward selection
18+
* Compute correlation matrix of predictors
19+
* Find the suitable variables using forward feature selection
1420
* Visualize the results
1521
* Allow iteration among each step for the analyst
16-
17-
## Installation
18-
* Clone this repository to your local PC (use GitHub Desktop). This assumes that the cloned repository will be in this directory `C:\Local\pers\Documents\GitHub\cobra`
19-
* Open Powershell and navigate to that folder
20-
* Once you are in the folder, execute `python setup.py install`. This is how the line should look like:
21-
`PS C:\Local\pers\Documents\GitHub\cobra> python setup.py install`
22-
* Restart kernel and you are ready to go
23-
* For example of use, see the Jupyter Notebook in `examples` folder
22+
23+
## Getting started
24+
25+
These instructions will get you a copy of the project up and running on your local machine for usage, development and testing purposes. Furthermore, this section includes some brief examples on how to use it.
26+
27+
### Requirements
28+
29+
This package requires the usual Python packages for data science:
30+
31+
* numpy
32+
* scipy
33+
* matplotlib
34+
* seaborn
35+
* pandas
36+
* scikit-learn
37+
38+
These packages, along with their versions, are listed in `requirements.txt` and `conda_env.txt`. To install these packages using pip, run
39+
40+
```
41+
pip install -r requirements.txt
42+
```
43+
44+
or using conda
45+
46+
```
47+
conda install --file requirements.txt
48+
```
49+
50+
__Note__: if you want to install cobra with e.g. pip, you don't have to install all of these requirements as these are automatically installed with cobra itself.
51+
52+
### Installation
53+
54+
As this package is an internal package that is not open-sourced, it is not available through `pip` or `conda`. As a result, the package has to be installed manually using the following steps:
55+
56+
* Clone this repository.
57+
* Open a shell that can execute Python code and navigate to the folder into which this repo was cloned.
58+
* Once you are in the folder, execute `python setup.py install` or `pip install .` (preferred).
59+
60+
### Usage
61+
62+
This section contains detailed examples for each step on how to use COBRA for building a predictive model. All classes and functions contain detailed documentation, so in case you want more information on a class or function, simply run the following python snippet:
63+
64+
```python
65+
help(function_or_class_you_want_info_from)
66+
```
67+
68+
In the examples below, we assume the data for model building is available in a pandas DataFrame called `basetable`. This DataFrame should contain an ID column (e.g. customernumber), a target column (e.g. "TARGET") and a number of candidate predictors to build our model with.
69+
70+
```python
71+
from cobra.preprocessing import PreProcessor
72+
73+
# Prepare data
74+
# create instance of PreProcessor from parameters
75+
# (many options possible, see source code for docs)
76+
path = "path/to/store/preprocessing/pipeline/as/json/file/for/later/re-use/"
77+
preprocessor = PreProcessor.from_params(serialization_path=path)
78+
79+
# split data into train-selection-validation set
80+
# in the result, an additional column "split" will be created
81+
# containing each of those values
82+
basetable = preprocessor.train_selection_validation_split(
83+
basetable,
84+
target_column_name=target_column_name,
85+
train_prop=0.6, selection_prop=0.2,
86+
validation_prop=0.2)
87+
88+
# create lists containing the column names of the discrete and
89+
# continuous variables, respectively
90+
continuous_vars = []
91+
discrete_vars = []
92+
93+
# fit the pipeline (will automatically be stored to "path" variable)
94+
preprocessor.fit(basetable[basetable["split"]=="train"],
95+
continuous_vars=continuous_vars,
96+
discrete_vars=discrete_vars,
97+
target_column_name=target_column_name)
98+
99+
# When you want to reuse the pipeline the next time, simply run
100+
# preprocessor = PreProcessor.from_pipeline(path) and you're good to go!
101+
102+
# transform the data (e.g. perform discretisation, incidence replacement, ...)
103+
basetable = preprocessor.transform(basetable,
104+
continuous_vars=continuous_vars,
105+
discrete_vars=discrete_vars)
106+
107+
```
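
The incidence replacement mentioned in the snippet above can be illustrated without cobra. The following is a minimal sketch, not cobra's implementation: the helper names `fit_incidence` and `apply_incidence` are hypothetical, and falling back to the overall train incidence for unseen categories is an assumption made for this example.

```python
from collections import defaultdict


def fit_incidence(categories, targets):
    """Learn the incidence rate (mean of a 0/1 target) per category.

    Hypothetical helper for illustration; not part of cobra.
    """
    counts = defaultdict(int)
    positives = defaultdict(int)
    for category, target in zip(categories, targets):
        counts[category] += 1
        positives[category] += target
    mapping = {cat: positives[cat] / counts[cat] for cat in counts}
    overall = sum(targets) / len(targets)
    return mapping, overall


def apply_incidence(categories, mapping, overall):
    """Replace each category by its learned incidence rate.

    Categories not seen during fitting get the overall incidence
    (an assumption made for this sketch).
    """
    return [mapping.get(category, overall) for category in categories]


# Fit on the train split only, then transform any split with the same mapping.
train_region = ["north", "north", "south", "south", "south", "east"]
train_target = [1, 0, 0, 0, 1, 1]
mapping, overall = fit_incidence(train_region, train_target)

selection_region = ["south", "north", "west"]
selection_region_enc = apply_incidence(selection_region, mapping, overall)
```

Learning the mapping on the train split alone, as in the `fit`/`transform` calls above, keeps target information from the selection and validation sets out of the encoded predictors.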
108+
109+
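Once the preprocessing pipeline is fitted and applied to your data, we are ready to start modelling. However, we could already compute the PIG tables here for later use. Note that `preprocessed_predictors` is the list of preprocessed column names (the columns ending in `_enc`), which is defined in the univariate preselection snippet further down: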
110+
111+
```python
112+
from cobra.evaluation import generate_pig_tables
113+
114+
pig_tables = generate_pig_tables(basetable[basetable["split"] == "selection"],
115+
id_column_name=id_column_name,
116+
target_column_name=target_column_name,
117+
preprocessed_predictors=preprocessed_predictors)
118+
```
119+
120+
Once these PIG tables are computed, we can start with the _univariate preselection_:
121+
122+
```python
123+
from cobra.model_building import univariate_selection
124+
from cobra.evaluation import plot_univariate_predictor_quality
125+
from cobra.evaluation import plot_correlation_matrix
126+
127+
# Get list of predictor names to use for univariate_selection
128+
preprocessed_predictors = [col for col in basetable.columns if col.endswith("_enc")]
129+
130+
# perform univariate selection on preprocessed predictors:
131+
df_auc = univariate_selection.compute_univariate_preselection(
132+
target_enc_train_data=basetable[basetable["split"] == "train"],
133+
target_enc_selection_data=basetable[basetable["split"] == "selection"],
134+
predictors=preprocessed_predictors,
135+
target_column=target_column_name,
136+
preselect_auc_threshold=0.53, # if auc_selection <= 0.53 exclude predictor
137+
preselect_overtrain_threshold=0.05 # if (auc_train - auc_selection) >= 0.05 --> overfitting!
138+
)
139+
140+
# Plot df_auc to get a horizontal barplot:
141+
plot_univariate_predictor_quality(df_auc)
142+
143+
# compute correlations between preprocessed predictors:
144+
df_corr = (univariate_selection
145+
.compute_correlations(basetable[basetable["split"] == "train"],
146+
preprocessed_predictors))
147+
148+
# plot correlation matrix
149+
plot_correlation_matrix(df_corr)
150+
151+
# get a list of predictors selected by the univariate selection
152+
preselected_predictors = (univariate_selection
153+
.get_preselected_predictors(df_auc))
154+
```
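
To make the two thresholds in the snippet above concrete, here is a minimal sketch of the preselection rule as described in its comments. It is not cobra's code: the function names `auc` and `is_preselected` are hypothetical, and the AUC is computed by plain pairwise comparison to keep the example free of dependencies.

```python
def auc(y_true, scores):
    """AUC as the probability that a random positive is scored higher
    than a random negative (ties count as half a win)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def is_preselected(auc_train, auc_selection,
                   auc_threshold=0.53, overtrain_threshold=0.05):
    """Apply the two rules from the comments in the snippet above."""
    if auc_selection <= auc_threshold:
        return False  # predictor is too weak on the selection set
    if auc_train - auc_selection >= overtrain_threshold:
        return False  # gap between train and selection suggests overfitting
    return True


example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
keep_stable = is_preselected(auc_train=0.70, auc_selection=0.68)
keep_weak = is_preselected(auc_train=0.52, auc_selection=0.52)
keep_overfit = is_preselected(auc_train=0.80, auc_selection=0.70)
```

A predictor is kept only if it passes both checks: it must be strong enough on the selection set, and its train AUC must not exceed its selection AUC by too much.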
155+
156+
After a preselection is done on the predictors, we can start the model building itself using forward feature selection to choose the right set of predictors:
157+
158+
```python
159+
from cobra.model_building import ForwardFeatureSelection
160+
from cobra.evaluation import plot_performance_curves
161+
from cobra.evaluation import plot_variable_importance
162+
163+
forward_selection = ForwardFeatureSelection(max_predictors=30,
164+
pos_only=True)
165+
166+
# fit the forward feature selection on the train data
167+
# has optional parameters to force and/or exclude certain predictors
168+
forward_selection.fit(basetable[basetable["split"] == "train"],
169+
target_column_name,
170+
preselected_predictors)
171+
172+
# compute model performance (e.g. AUC for train-selection-validation)
173+
performances = (forward_selection
174+
.compute_model_performances(basetable, target_column_name))
175+
176+
# plot performance curves
177+
plot_performance_curves(performances)
178+
179+
# After plotting the performances and selecting the model,
180+
# we can extract this model from the forward_selection class:
181+
model = forward_selection.get_model_from_step(5)
182+
183+
# Note that the chosen model has 6 variables (Python lists start at index 0),
184+
# which can be obtained as follows:
185+
final_predictors = model.predictors
186+
# We can also compute and plot the importance of each predictor in the model:
187+
variable_importance = model.compute_variable_importance(
188+
basetable[basetable["split"] == "selection"]
189+
)
190+
plot_variable_importance(variable_importance)
191+
```
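
The idea behind forward feature selection can be sketched in a few lines. This is a generic greedy version under stated assumptions, not cobra's `ForwardFeatureSelection`: the function `greedy_forward_selection` and the toy `score` function are hypothetical, and `score` stands in for fitting a model on a set of predictors and measuring its performance (e.g. AUC).

```python
def greedy_forward_selection(candidates, score, max_predictors):
    """Greedily add the candidate that gives the best score at each step.

    Returns one list of predictors per step, so the entry at index i
    contains i + 1 predictors.
    """
    selected = []
    remaining = list(candidates)
    steps = []
    while remaining and len(selected) < max_predictors:
        best = max(remaining, key=lambda cand: score(selected + [cand]))
        selected.append(best)
        remaining.remove(best)
        steps.append(list(selected))
    return steps


# Toy score: a baseline of 0.5 plus a fixed gain per predictor.
# In practice this would fit and evaluate a model instead.
gains = {"age_enc": 0.10, "income_enc": 0.08, "region_enc": 0.02}


def score(predictors):
    return 0.5 + sum(gains[p] for p in predictors)


steps = greedy_forward_selection(list(gains), score, max_predictors=2)
```

Because the entry at index `i` holds `i + 1` predictors, this sketch uses the same zero-based step numbering as the example above, where step 5 yields a model with 6 variables.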
192+
193+
Now that we have built and selected a final model, it is time to evaluate it against various evaluation metrics:
194+
195+
```python
196+
from cobra.evaluation import Evaluator
197+
198+
# get numpy array of True target labels and predicted scores:
199+
y_true = basetable[basetable["split"] == "selection"][target_column_name].values
200+
y_pred = model.score_model(basetable[basetable["split"] == "selection"])
201+
202+
evaluator = Evaluator()
203+
evaluator.fit(y_true, y_pred) # Automatically find the best cut-off probability
204+
205+
# Get various scalar metrics such as accuracy, AUC, precision, recall, ...
206+
evaluator.scalar_metrics
207+
208+
# Plot non-scalar evaluation metrics:
209+
evaluator.plot_roc_curve()
210+
211+
evaluator.plot_confusion_matrix()
212+
213+
evaluator.plot_cumulative_gains()
214+
215+
evaluator.plot_lift_curve()
216+
217+
evaluator.plot_cumulative_response_curve()
218+
219+
```
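
For readers unfamiliar with cumulative gains and lift, the sketch below shows how both can be computed from `y_true` and `y_pred` at a single cut-off. It is an illustration only and makes no claim about how `Evaluator` computes its curves; the helper names are hypothetical and the data is made up.

```python
def _top_targets(y_true, y_pred, fraction):
    """Targets of the highest-scored `fraction` of observations."""
    ranked = sorted(zip(y_pred, y_true), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(fraction * len(ranked)))
    return [target for _, target in ranked[:n_top]]


def cumulative_gains(y_true, y_pred, fraction):
    """Share of all positives found in the top `fraction` of scores."""
    return sum(_top_targets(y_true, y_pred, fraction)) / sum(y_true)


def lift(y_true, y_pred, fraction):
    """Incidence in the top `fraction` divided by the overall incidence."""
    top = _top_targets(y_true, y_pred, fraction)
    return (sum(top) / len(top)) / (sum(y_true) / len(y_true))


# Made-up example: 10 observations, 3 of which are positives.
y_true_example = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
y_pred_example = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

gains_top20 = cumulative_gains(y_true_example, y_pred_example, 0.2)
lift_top20 = lift(y_true_example, y_pred_example, 0.2)
```

Here the top 20% of scores contains one of the three positives, so the cumulative gain is one third, and the incidence in that group (0.5) is about 1.67 times the overall incidence (0.3).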
220+
221+
## Development
222+
223+
We'd love you to contribute to the development of Cobra! To do so, clone the repo and create a _feature branch_ to do your development. Once you are finished, you can create a _pull request_ to merge it back into the main branch. Make sure to follow the _PEP 8_ style guide if you make any changes to COBRA. You should also write or modify unit tests for your changes if they are related to preprocessing!
