Commit 911a4ec

Fix: Correct broken image links
1 parent 0a42712 commit 911a4ec

1 file changed

Lines changed: 33 additions & 25 deletions

model_card.md

@@ -1,3 +1,11 @@
1+
[//]: # (Image References)
2+
[image1]: ./plots/numFeats_outlierDist_sex_boxplot.png "feat dist by sex plot:"
3+
[image2]: ./plots/salary_dist_hoursPerWeek-age-sex_plot.png "salary dist plot:"
4+
[image3]: ./plots/MLOps_Proj3_trainModel_retrainBestCV_fifthRunPart0_2023-09-21.PNG "best estimator params"
5+
[image4]: ./plots/2023-09-21_21-34_best_retrained_xgb-cv_logloss_treeNo_diagram.png "best xgb cls logloss"
6+
[image5]: ./plots/2023-09-21_21-34_best_retrained_xgb-cv_roc-curve_diagram.png "best xgb cls roc auc"
7+
[image6]: ./plots/MLOps_Proj3_trainModel_retrainBestCV_fifthRunPart4_2023-09-21.PNG "best xgb cls eval"
8+
19
# Model Card
210

311
For additional information see the Model Card paper: https://arxiv.org/pdf/1810.03993.pdf
@@ -11,8 +19,8 @@ For additional information see the Model Card paper: https://arxiv.org/pdf/1810.
1119
- Intended to be used to predict if a person earns >50K US$ per year or not based on the labeled adult census data.
1220

1321
## Training Data
14-
- For the data given 1996-04-30 extraction was done by Barry Becker from the 1994 Census database.
15-
- Attribute information is given from the UCI ML repository. Regarding the raw data the columns are:
22+
- The data was provided on 1996-04-30; it was extracted by *Barry Becker* from the 1994 Census database.
23+
- Attribute information is taken from the UCI ML repository. The *columns* of the raw data are:
1624
- age
1725
- workclass
1826
- fnlwgt
@@ -28,65 +36,65 @@ For additional information see the Model Card paper: https://arxiv.org/pdf/1810.
2836
- hours-per-week
2937
- native-country
3038

31-
with target label feature 'salary' which has been converted during preprocessing being numerical:
39+
with the **target label** feature 'salary', which is converted to a numerical value during preprocessing:
3240
- '>50K': 1
33-
- '<=50K': 0.
41+
- '<=50K': 0
3442

3543
- The raw numerical feature distribution is:<br>
36-
![''](https://github.com/IloBe/US_CensusData_Classifier_PipelineWithDeployment/tree/master/plots/ numFeats_outlierDist_sex_boxplot.png
44+
![feat dist by sex plot:][image1]
3745

3846
- And its salary distribution regarding hours-per-week, age and sex is:<br>
39-
![''](https://github.com/IloBe/US_CensusData_Classifier_PipelineWithDeployment/tree/master/plots/salary_dist_hoursPerWeek-age-sex_plot.png
47+
![salary dist plot:][image2]
4048

4149
- Some preprocessing is done by a pipeline implementation. These tasks include, but are not limited to, handling the imbalanced distribution of the target label, duplicate rows, wrong labels or strings, non-normalised features, and one-hot encoding of the categorical nominal features.
4250
- <i>fnlwgt</i> has been removed because it adds no value for the prediction task; it seems to act as a kind of index
43-
- <i>native_country<i> has been changed having another bin grouping according [German Society for State Studies e.V.]: (http://www.staatenkunde.de/dgfs/datenbank/db-sb.php?sb=18&k=4)
51+
- <i>native_country</i> has been regrouped into different bins according to the [German Society for State Studies e.V.](http://www.staatenkunde.de/dgfs/datenbank/db-sb.php?sb=18&k=4)
4452

45-
For both, the raw and the preprocessed data, profiling reports are created as EDA (exploratory data analysis) tasks.
53+
For both the raw and the preprocessed data, profiling reports are created as EDA (exploratory data analysis) tasks and stored in the directory [./src/eda](./src/eda/).
4654

4755
## Evaluation Data
4856
To evaluate trained ML models on validation datasets containing different samples, the GridSearchCV approach from scikit-learn is used; a 0.2 fraction of the data is held out as the test set for final testing.
4957

50-
The following value ranges for the grid parameters are handled (note: read in via configuration file)
51-
- for our binary classification, output probability:<br>
58+
The following value ranges for the **grid parameters** are used (note: read in via configuration file), together with an early stopping mechanism:
59+
- for our *binary classification*, output probability:<br>
5260
objective: ["binary:logistic"]
53-
- evaluation metrics used e.g. for model plots:<br>
61+
- *evaluation metrics* used e.g. for model plots:<br>
5462
eval_metric: ["logloss"]<br>
5563
note regarding XGBoost: early stopping uses only the last given metric of a sequence
56-
- specifies the number of decision trees to be boosted<br>
64+
- specifies the *number of decision trees* to be boosted<br>
5765
n_estimators: [100, 150, 200, 250]
58-
- max_depth is the tree's maximum depth. Increasing it increases the model complexity<br>
66+
- *max_depth* is the tree's maximum depth. Increasing it increases the model complexity<br>
5967
max_depth: [5, 6, 8, 9]
60-
- learning_rate shrinks the weights to make the boosting process more conservative<br>
68+
- *learning_rate* shrinks the weights to make the boosting process more conservative<br>
6169
learning_rate: [0.01, 0.1, 0.5]
62-
- gamma specifies the minimum loss reduction required to make a split<br>
70+
- *gamma* specifies the minimum loss reduction required to make a split<br>
6371
gamma: [0, 1, 10]
64-
- percentage of columns to be samples for each tree<br>
72+
- *fraction of columns* to be sampled for each tree<br>
6573
colsample_bytree: [0.5, 0.7, 1]
66-
- reg_alpha provides l1 regularization to the weight, higher values are more conservative<br>
74+
- *reg_alpha* applies L1 regularization to the weights; higher values are more conservative<br>
6775
reg_alpha: [0.01, 0.1, 0.5]
68-
- reg_lambda provides l2 regularization to the weight, higher values are more conservative<br>
76+
- *reg_lambda* applies L2 regularization to the weights; higher values are more conservative<br>
6977
reg_lambda: [0.01, 0.1, 0.5]
7078

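The size of the search space follows directly from the value ranges listed above. The following is a minimal sketch using only the Python standard library, not the project's training code: it enumerates the grid the way an exhaustive grid search would and counts the candidates. The helper `grid_candidates` and the fold count `cv_folds = 5` are assumptions for illustration; the model card does not state the number of folds.

```python
from itertools import product

# Value ranges copied from the grid parameters listed in the model card.
param_grid = {
    "objective": ["binary:logistic"],
    "eval_metric": ["logloss"],
    "n_estimators": [100, 150, 200, 250],
    "max_depth": [5, 6, 8, 9],
    "learning_rate": [0.01, 0.1, 0.5],
    "gamma": [0, 1, 10],
    "colsample_bytree": [0.5, 0.7, 1],
    "reg_alpha": [0.01, 0.1, 0.5],
    "reg_lambda": [0.01, 0.1, 0.5],
}


def grid_candidates(grid):
    """Yield one parameter dict per combination (Cartesian product of all ranges)."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


n_candidates = sum(1 for _ in grid_candidates(param_grid))
cv_folds = 5  # assumption: the fold count is not given in the model card
n_fits = n_candidates * cv_folds

print(n_candidates)  # 3888 parameter combinations
print(n_fits)        # 19440 model fits with the assumed 5 folds
```

With this many fits, early stopping and a narrower grid both have a noticeable effect on total training time.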
7179
Best found estimator parameters are:
72-
![''](https://github.com/IloBe/US_CensusData_Classifier_PipelineWithDeployment/tree/master/plots/MLOps_Proj3_trainModel_retrainBestCV_fifthRunPart0_2023-09-21.PNG
80+
![best estimator params][image3]
7381

7482
## Metrics
7583
Several evaluation metrics are included:<br>
7684
Precision, Recall, Fbeta, Confusion Matrix, Classification Report
85+
7786
The following evaluation plots are provided:<br>
7887
logloss diagram and ROC curve diagram with AUC value
79-
<br>
80-
In general, the receiver operating characteristic (ROC) curve is a metric that is used to measure the performance of a classifier model. It depicts the true positive rate concerning the false positive ones. It also highlights the sensitivity of the classifier model. The area under the curve (AUC) is used for general binary classification problems. AUC will measure the whole two-dimensional area that is available under the entire ROC curve. A perfect model would have an AUC of 1, while a random model would have an AUC of 0.5.
8188

89+
In general, the receiver operating characteristic (ROC) curve is a plot used to assess the performance of a classifier model. It shows the true positive rate (the sensitivity of the classifier) against the false positive rate across classification thresholds. The area under the curve (AUC) summarises this plot as a single number for binary classification problems: it measures the whole two-dimensional area under the entire ROC curve. A perfect model would have an AUC of 1, while a random model would have an AUC of 0.5.<br>
8290
The AUC result of the GridSearchCV best estimator is: 0.93
8391

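As a reading aid for the AUC value, here is a minimal pure-Python sketch, not the project's evaluation code. It relies on the standard equivalence between the area under the ROC curve and the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are made up for illustration.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


labels = [0, 0, 1, 1]  # toy data, 1 stands for '>50K'

perfect = roc_auc(labels, [0.1, 0.2, 0.8, 0.9])         # every positive outranks every negative
uninformative = roc_auc(labels, [0.5, 0.5, 0.5, 0.5])   # constant score, all pairs tied
mixed = roc_auc(labels, [0.1, 0.4, 0.35, 0.8])          # one of four pairs is misordered

print(perfect, uninformative, mixed)  # 1.0 0.5 0.75
```

Under this reading, an AUC of 0.93 means that in roughly 93% of positive/negative pairs the model scores the positive sample higher.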
84-
![XGB Classifier GridSearchCV Metrics](https://github.com/IloBe/US_CensusData_Classifier_PipelineWithDeployment/tree/master/plots/2023-09-21_21-34_best_retrained_xgb-cv_logloss_treeNo_diagram.png)
92+
![best xgb cls logloss][image4]
8593

86-
![''](https://github.com/IloBe/US_CensusData_Classifier_PipelineWithDeployment/tree/master/plots/2023-09-21_21-34_best_retrained_xgb-cv_roc-curve_diagram.png)
94+
![best xgb cls roc auc][image5]
8795

8896
Validation results of the GridSearchCV best estimator are:<br>
89-
![''](https://github.com/IloBe/US_CensusData_Classifier_PipelineWithDeployment/tree/master/plots/MLOps_Proj3_trainModel_retrainBestCV_fifthRunPart4_2023-09-21.PNG
97+
![best xgb cls eval][image6]
9098

9199
## Ethical Considerations
92100
No ethical consideration topics regarding data, human life, risk and harms and their needed risk mitigation strategies or fraught model use cases are detected.
@@ -95,4 +103,4 @@ No ethical consideration topics regarding data, human life, risk and harms and t
95103
- Regarding the data, keep in mind that the raw data is biased, among other things, towards men (twice as many men as women) and towards persons originally from the U.S., so scaling is mandatory to get appropriate prediction results.
96104
- Regarding the prediction task, the performance of the grid search cross-validation approach is already better than that of the single instance, but it is still not optimal. As a future to-do, final tuning of the XGBoost Classifier via the <i>Hyperopt</i> library is recommended.
97105
- Additionally, feature importance information for the final XGBoost Classifier model is critical to understanding the prediction process. As an additional future to-do: use <i>SHAP</i> diagrams or a simple bar chart of the <i>xgb feature_importances_</i> attribute over the X_train columns of the GridSearchCV best model to identify which features are most relevant for the target variable.
98-
- Last topic as future toDo is the usage of other classifier types and their evaluation compared to the XGBoost Classifier, even though it was often used by teams that won Kaggle competitions.
106+
- Last topic as future toDo is the usage of other classifier types and their evaluation compared to the XGBoost Classifier, even though it was often used by teams that won Kaggle competitions.
