- Some preprocessing is done in a pipeline implementation. These tasks include, but are not limited to, handling the imbalanced distribution of the target label, duplicate rows, wrong labels or strings, non-normalised features, and one-hot encoding of the nominal categorical features.
- <i>fnlwgt</i> has been removed because it adds no value for the prediction task; it seems to act as a kind of index.
- <i>native_country</i> has been regrouped into different bins according to the [German Society for State Studies e.V.](http://www.staatenkunde.de/dgfs/datenbank/db-sb.php?sb=18&k=4)
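
The preprocessing steps listed above can be illustrated with a minimal pandas sketch. It is not the project's pipeline: the sample rows, the `salary` target name, the `TEXT_COLUMNS` list and the `preprocess` helper are all hypothetical and chosen only for illustration, and the regrouping of <i>native_country</i> as well as the handling of class imbalance are left out.

```python
import pandas as pd

# Hypothetical sample rows; column names and values are assumptions for illustration.
raw = pd.DataFrame(
    {
        "age": [39, 39, 50, 28],
        "fnlwgt": [77516, 77516, 83311, 338409],
        "sex": [" Male", " Male", "Male", "Female "],
        "native_country": ["United-States", "United-States", "Cuba", "Germany"],
        "salary": ["<=50K", "<=50K", ">50K", "<=50K"],
    }
)

# Assumed list of string-valued columns that may carry stray whitespace.
TEXT_COLUMNS = ["sex", "native_country", "salary"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Clean strings, drop fnlwgt, remove duplicate rows, one-hot encode nominals."""
    out = df.copy()
    # Fix inconsistent strings such as " Male" vs "Male".
    for col in TEXT_COLUMNS:
        out[col] = out[col].str.strip()
    # Drop the index-like column that is not used for prediction.
    out = out.drop(columns=["fnlwgt"])
    # Remove rows that are exact duplicates after cleaning.
    out = out.drop_duplicates().reset_index(drop=True)
    # One-hot encode the nominal categorical features (the target stays as-is).
    return pd.get_dummies(out, columns=["sex", "native_country"])


clean = preprocess(raw)
print(clean.columns.tolist())
```

In a real pipeline the same steps would typically be wrapped in transformer objects so that they are fitted on the training data only.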

For both the raw and the preprocessed data, profiling reports are created as EDA (exploratory data analysis) tasks and stored in the directory [./src/eda](./src/eda/).
## Evaluation Data
To evaluate the trained ML models on validation sets containing different samples, the GridSearchCV approach from scikit-learn is used; a test-size fraction of 0.2 of the data is held out for final testing.

The following value ranges for the **grid parameters** are handled (note: they are read in via a configuration file) and are used together with an early stopping mechanism:
- for our *binary classification*, output probability:<br>
objective: ["binary:logistic"]
- *evaluation metrics* used e.g. for model plots:<br>
eval_metric: ["logloss"]<br>
note regarding XGBoost: early stopping uses only the last given metric of a sequence
- specifies the *number of decision trees* to be boosted<br>
n_estimators: [100, 150, 200, 250]
- *max_depth* is the tree's maximum depth. Increasing it increases the model complexity<br>
max_depth: [5, 6, 8, 9]
- *learning_rate* shrinks the weights to make the boosting process more conservative<br>
learning_rate: [0.01, 0.1, 0.5]
- *gamma* specifies the minimum loss reduction required to make a split<br>
gamma: [0, 1, 10]
- *percentage of columns* to be sampled for each tree<br>
colsample_bytree: [0.5, 0.7, 1]
- *reg_alpha* provides L1 regularization on the weights, higher values are more conservative<br>
reg_alpha: [0.01, 0.1, 0.5]
- *reg_lambda* provides L2 regularization on the weights, higher values are more conservative<br>
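
To get a feeling for the size of this search space, the grid can be expanded by hand. The following is a minimal sketch using only the standard library, not the project's code: the `expand_grid` helper and the 5-fold setting `CV_FOLDS` are assumptions, and `reg_lambda` is left out because its value range is not shown here.

```python
from itertools import product

# Value ranges as listed above; reg_lambda is omitted because its values are not shown.
param_grid = {
    "objective": ["binary:logistic"],
    "eval_metric": ["logloss"],
    "n_estimators": [100, 150, 200, 250],
    "max_depth": [5, 6, 8, 9],
    "learning_rate": [0.01, 0.1, 0.5],
    "gamma": [0, 1, 10],
    "colsample_bytree": [0.5, 0.7, 1],
    "reg_alpha": [0.01, 0.1, 0.5],
}


def expand_grid(grid):
    """Return every parameter combination of the grid as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


candidates = expand_grid(param_grid)
n_candidates = len(candidates)

# Assumption: 5 cross-validation folds; the actual fold count is not stated here.
CV_FOLDS = 5
n_fits = n_candidates * CV_FOLDS

print(f"{n_candidates} candidates, {n_fits} model fits with {CV_FOLDS}-fold CV")
```

An exhaustive grid search fits every candidate once per fold, so each additional parameter range multiplies the training cost; any further range such as `reg_lambda` would increase the counts accordingly.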
logloss diagram and ROC curve diagram with AUC value

In general, the receiver operating characteristic (ROC) curve is used to assess the performance of a classifier model. It plots the true positive rate (the sensitivity of the classifier) against the false positive rate over all classification thresholds. The area under the curve (AUC), commonly used for binary classification problems, measures the entire two-dimensional area under the ROC curve. A perfect model would have an AUC of 1, while a random model would have an AUC of 0.5.<br>
The AUC result of the GridSearchCV best estimator is: 0.93
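
As an illustration of what the AUC value expresses, it can be computed as the fraction of (positive, negative) sample pairs in which the positive sample gets the higher score, with ties counting as half. The sketch below is a plain Python illustration with made-up labels and scores; the `roc_auc` helper is hypothetical and is not the function used to obtain the result above.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of positive/negative pairs that are ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # a tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for illustration only.
labels = [0, 0, 1, 1]
print(roc_auc(labels, [0.1, 0.4, 0.35, 0.8]))  # partly correct ranking
print(roc_auc(labels, [0.1, 0.2, 0.8, 0.9]))   # perfect ranking
print(roc_auc(labels, [0.5, 0.5, 0.5, 0.5]))   # uninformative constant score
```

This pairwise form is quadratic in the number of samples and is meant for understanding only; library implementations compute the same quantity more efficiently.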
No ethical issues regarding data, human life, risks and harms (and the mitigation strategies they would require), or fraught model use cases have been identified.
- Regarding the data, keep in mind that the raw data is biased, among other things, towards men (twice as many men as women) and towards persons originally from the U.S., so scaling activities are mandatory to get appropriate prediction results.
- Regarding the prediction task, the performance of the grid search cross-validation approach is already better than that of the single instance, but still not the best. As a future to-do, final tuning of the XGBoost Classifier via the <i>Hyperopt</i> library is recommended.
- Additionally, feature importance information of the final XGBoost Classifier model is critical to understanding the prediction process. As an additional future to-do: use <i>SHAP</i> diagrams or a simple bar chart of the <i>xgb feature_importances_ parameter</i> over the X_train columns from the GridSearchCV best model result to identify which features are most relevant for the target variable.
- The last future to-do is to use other classifier types and evaluate them against the XGBoost Classifier, even though XGBoost has often been used by teams that won Kaggle competitions.