Commit 5ba53f8

Fix: change broken links; improved text content
1 parent 325b0c3 commit 5ba53f8

2 files changed

Lines changed: 46 additions & 45 deletions

File tree

README.md

Lines changed: 45 additions & 44 deletions
@@ -1,49 +1,50 @@
11
[//]: # (Image References)
2-
[image0]: ./sreenshots/MLOps_proj3_tree.PNG "proj3 structure"
3-
[image1]: ./plots/numFeats_outlierDist_sex_boxplot.png "feat dist by sex plot"
4-
[image2]: ./plots/normalDistTest_hours-per-week.PNG "hours-per-week gauss dist or not"
2+
[image0]: ./screenshots/MLOps_proj3_tree.PNG "proj3 structure"
3+
[image1]: ./screenshots/MLOps_proj3_FastAPI_gitHubPrecommitHook.PNG "github action"
4+
[image2]: ./plots/numFeats_outlierDist_sex_boxplot.png "feat dist by sex plot"
55
[image3]: ./plots/general_dist_age-hoursPerWeek_boxplot.png "hours-per-week by age boxplots"
6-
[image4]: ./plots/hoursPerWeek-Regression_dist_age-race_plot.png "regression hours-per-week by age race"
7-
[image5]: ./plots/salary_dist_hoursPerWeek-age-sex_plot.png "salary dist by age sex plot"
8-
[image6]: ./plots/capitalGain_dist_age-hoursPerWeek-sex_plot.png "capital gain dist by hours-per-week sex"
6+
[image4]: ./plots/salary_dist_hoursPerWeek-age-sex_plot.png "salary dist by age sex plot"
7+
[image5]: ./plots/capitalGain_dist_age-hoursPerWeek-sex_plot.png "capital gain dist by hours-per-week sex"
8+
[image6]: ./plots/sex_plot.png "sex plot"
99
[image7]: ./sreenshots/education-group_people-count.PNG "education people-count grouping"
1010
[image8]: ./plots/eduLevel_dist_age-race_plot.png "education level grouping by age race"
11-
[image9]: ./screenshots/MLOps_proj3_FastAPI_gitHubPrecommitHook.PNG "github action"
12-
[image10]: ./screenshots/MLOps_proj3_FastAPI_docsLandingPage.PNG "fastapi landing page"
13-
[image11]: ./screenshots/MLOps_proj3_FastAPI_docsGetRootWelcomeMsg.PNG "fastapi welcome"
14-
[image12]: ./screenshots/MLOps_proj3_FastAPI_docsPredictPersonIncomeNegativeExample.PNG "fastapi income negative"
15-
[image13]: ./screenshots/MLOps_proj3_FastAPI_docsPredictPersonIncomeNegativeExample_ResponseCode.PNG "fastapi income negative response"
16-
[image14]: ./screenshots/render_createNewWebService.PNG "render web service"
11+
[image9]: ./screenshots/MLOps_proj3_FastAPI_docsPredictPersonIncomeNegativeExample.PNG "fastapi income negative"
12+
[image10]: ./screenshots/MLOps_proj3_FastAPI_docsPredictPersonIncomeNegativeExample_ResponseCode.PNG "fastapi income negative response"
13+
[image11]: ./screenshots/render_createNewWebService.PNG "render web service"
14+
[image12]: ./
15+
[image13]: ./
16+
[image14]: ./
1717

1818

19-
# Creating and Deploying a Classifier Pipeline for US Census Data
19+
# US Census Data - Creating and Deploying a Classifier Pipeline as Web Service
2020

2121
This is the third project of the course <i>MLOps Engineer Nanodegree</i> by Udacity, called <i>Deploying a Scalable Pipeline in Production</i>. Its instructions are available in Udacity's [repository](https://github.com/udacity/nd0821-c3-starter-code/tree/master/starter).
2222

23-
We develop a classification model on public available US Census Bureau data and monitor the model performance on various data slices as business goal.
23+
We develop a classification model artifact for production on publicly available US Census Bureau data and, as a business goal, monitor the model performance on various data slices.
2424

25-
Regarding software engineering principles, we create _unit tests_. Slice validation and the tests are incorporated into a _CI/CD framework_ using GitHub Actions. Then, the model is deployed using the FastAPI framework and render as open-source web service.
26-
27-
Regarding data science goals for this classification prediction, we start with the ETL (Extract, Transform, Load) pipeline including EDA (Exploratory Data Analysis) activities and reports, followed by the ML (Machine Learning) pipeline for the investigated prediction model, in our case a binary XGBoost Classifier. The estimator is selected by using cross validation concept with early stopping for the training phase.
25+
Regarding data science goals for this classification prediction, we start with the ETL (Extract, Transform, Load) transformer pipeline including EDA (Exploratory Data Analysis) activities, diagrams and reports, followed by the ML (Machine Learning) pipeline for the investigated prediction model, in our case a _binary XGBoost Classifier_. The estimator is selected using cross-validation with early stopping during the training phase. The best estimator, as evaluated by metrics, is selected as the deployment artifact together with the associated column transformer used for data preprocessing.
2826

2927
General information about the deployed XGBoost classifier, the used data, their training condition and evaluation results can be found in the [Model Card](https://github.com/IloBe/US_CensusData_Classifier_PipelineWithDeployment/blob/master/model_card.md) description.
3028

29+
Regarding software engineering principles, besides documentation, logging and Python style, we create _unit tests_. Slice validation and the tests are incorporated into a _CI/CD framework_ using GitHub Actions. Then, the model is deployed using the [_FastAPI_](https://fastapi.tiangolo.com/) web framework and [_Render_](https://dashboard.render.com/#) as open-source web service.
30+
3131
The unit tests are written with _pytest_ for GET and POST prediction requests to the FastAPI component as well as for the mentioned data and model task parts. All unit test results are reported in associated html files of the [tests directory](https://github.com/IloBe/US_CensusData_Classifier_PipelineWithDeployment/tree/master/tests).
3232

33-
All project relevant configuration values, including model hyperparameter ranges for the cross validation concept, are handled via specific configuration yaml file. For versioning tasks, _git_ and _dvc_, handled with ignore files content, are chosen.
33+
All project-relevant configuration values, including model hyperparameter ranges for the cross-validation concept, are handled via a specific _config.yml_ configuration file.<br>
34+
For versioning tasks, [_git_](https://git-scm.com/) and [_dvc_](https://dvc.org/doc/use-cases/versioning-data-and-models), handled with ignore files content, are chosen.
3435

3536

3637
## Environment Set up
37-
* Working in a command line environment is recommended for ease of use with git and dvc. If on Windows, [WSL2 and Ubuntu (Linux)](https://ubuntu.com/tutorials/install-ubuntu-on-wsl2-on-windows-11-with-gui-support#1-overview) is recommended.
38-
* We expect you have at least Python 3.10.9 e.g. via conda installed, furthermore having forked this project repo locally and activate it in your virtual environment to work on it for your own. So, in your root directory `path/to/US-census-project` create a new virtual environment depending on the selected OS and use the supplied _requirements.txt_ file to install the needed libraries e.g. via
38+
* Working in a command line environment is recommended for ease of use with git and dvc. Working on Windows, [WSL2 and Ubuntu (Linux)](https://ubuntu.com/tutorials/install-ubuntu-on-wsl2-on-windows-11-with-gui-support#1-overview) is chosen for this project implementation.
39+
* We expect that you have at least Python 3.10.9 installed (e.g. via conda) and that you have forked this project repo and cloned it locally to work on it on your own. In your root directory `path/to/census-project`, create a new virtual environment depending on the selected OS, activate it, and use the supplied _requirements.txt_ file to install the needed libraries, e.g. via
3940

4041
```
4142
pip install -r requirements/requirements.txt
4243
```
4344
or use
4445

4546
```
46-
conda create -n [envname] "python=3.10.9" scikit-learn pandas numpy pytest jupyter jupyterlab fastapi uvicorn ... <library-list> -c conda-forge
47+
conda create -n [envname] "python=3.10.9" scikit-learn pandas numpy pytest jupyter jupyterlab fastapi uvicorn ... <the-needed-library-list> -c conda-forge
4748
```
4849

4950

@@ -55,33 +56,34 @@ or use
5556

5657

5758
* In our GitHub repository, an automatic Action script is set up to check, among other things, dependencies, linting and unit tests.
58-
![github action][image9]
59+
![github action][image1]
5960

6061

6162
## Data
6263
* The downloaded raw _census.csv_ file is preprocessed and stored as a new .csv file. Both files are committed and versioned with _dvc_.
63-
* Some exploratory data analysis is implemented and visualised. They are stored as .png plot or screenshot files.
64+
* Some exploratory data analysis is implemented and visualised. They are stored as .png plot or screenshot files in the associated directories.
6465

65-
Examples are the following ones, regarding amongst others distributions of hours-per-week, education, capital-gain and salary by few feature attributes like age, sex or race. Several other insights are visualised and stored as .png files. So, have a look there if you are interested in further analysis.
66+
Examples are shown below. Among others, they cover distributions of hours-per-week, salary, capital-gain and education by a few feature attributes like age, sex or race. The analysis shows some bias towards men (twice as many as women) and white people. Furthermore, it is interesting that, regarding capital gain, women often obtain a much higher value for fewer working hours than men. In general, people work >40 hours per week if they are between 25 and 60 years old.
6667

67-
![feat dist by sex plot][image1]
68+
Several other insights are visualised and stored as .png files. So, have a look there if you are interested in further analysis.
6869

69-
![hours-per-week gauss dist or not][image2]
70+
![feat dist by sex plot][image2]
7071

7172
![hours-per-week by age boxplots][image3]
7273

73-
![regression hours-per-week by age race][image4]
74+
![salary dist by age sex plot][image4]
7475

75-
![salary dist by age sex plot[][image5]
76+
![capital gain dist by hours-per-week sex][image5]
7677

77-
![capital gain dist by hours-per-week sex][image6]
78+
![sex plot][image6]
7879

79-
![education_people_count group][image7]
80+
![education_people_count grouping][image7]
8081

8182
![education level grouping by age race][image8]
8283

84+
<br>
8385

84-
# Model
86+
## Model
8587
* As the machine learning model that trains on the clean data, an _XGBoost Classifier_ is selected, and the best found and evaluated estimator is stored as a pickle file (...artifact.pkl) in the associated model directory.
8688
* Additionally, a function exists that outputs the performance of the model on slices of the categorical features. Performance evaluation metrics of such categorical census feature slices are stored in a _slice_output.txt_ file. As an example, the metric block looks like:
8789

@@ -103,7 +105,7 @@ Examples are the following ones, regarding amongst others distributions of hours
103105
* As mentioned, the model card informs about our found insights of the binary classification estimator including evaluation diagrams and general metrics.
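
The slice evaluation mentioned above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's actual function: the name `slice_metrics`, the row layout, the `pred` key, and the convention of returning 1.0 for an undefined precision or recall are all assumptions made for this example.

```python
from collections import defaultdict


def slice_metrics(rows, feature, label_key="salary", pred_key="pred", positive=">50K"):
    """Return {slice_value: (precision, recall, f1)} for one categorical feature.

    Hypothetical helper: `rows` is a list of dicts that hold the feature value,
    the true label and the predicted label for each person.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for row in rows:
        c = counts[row[feature]]  # also registers slices without positives
        actual = row[label_key] == positive
        predicted = row[pred_key] == positive
        if actual and predicted:
            c["tp"] += 1
        elif predicted:
            c["fp"] += 1
        elif actual:
            c["fn"] += 1

    results = {}
    for value, c in counts.items():
        tp, fp, fn = c["tp"], c["fp"], c["fn"]
        # Assumed convention: an undefined ratio (no positives) counts as 1.0.
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 1.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        results[value] = (precision, recall, f1)
    return results


rows = [
    {"sex": "Male", "salary": ">50K", "pred": ">50K"},
    {"sex": "Male", "salary": "<=50K", "pred": ">50K"},
    {"sex": "Female", "salary": ">50K", "pred": "<=50K"},
    {"sex": "Female", "salary": ">50K", "pred": ">50K"},
]
results = slice_metrics(rows, "sex")
for value, (precision, recall, f1) in results.items():
    print(f"sex={value}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Computing the metrics per feature value rather than over the whole test set is what makes imbalances between groups, such as the one between men and women in this data, visible.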
104106

105107

106-
# API Creation
108+
## API Creation
107109
* As the web framework to create a RESTful API, _fastapi_ is chosen for the app implementation. A _pydantic_ _BaseModel_ instance handles the POST body, e.g. dealing with hyphens in data feature names, which are not allowed in Python identifiers.
108110

109111
* As the high-performance ASGI server, [uvicorn](https://www.uvicorn.org/) is selected. The FastAPI web app _uvicorn_ server can be started in the project's root directory via CLI python command:
@@ -131,25 +133,24 @@ uvicorn main:app host="0.0.0.0" port=8000
131133

132134
As an example, for the use case of a person earning <=50K as income, you are going to get the following UIs:
133135

134-
![fastapi landing page][image10]
135-
136-
![fastapi welcome][image11]
136+
![fastapi income negative][image9]
137137

138-
![fastapi income negative][image12]
138+
![fastapi income negative response][image10]
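
The hyphen handling in the POST body mentioned above could look roughly like the following sketch. It assumes _pydantic_ is installed. The class name `Person` and the three selected fields are hypothetical and only stand in for the project's real request model.

```python
from pydantic import BaseModel, Field


class Person(BaseModel):
    """Hypothetical subset of the census features sent in the POST body."""

    age: int
    # The JSON key keeps its hyphen; the Python attribute uses an underscore.
    hours_per_week: int = Field(alias="hours-per-week")
    native_country: str = Field(alias="native-country")


payload = {"age": 39, "hours-per-week": 40, "native-country": "United-States"}
person = Person(**payload)
print(person.hours_per_week, person.native_country)
```

The aliases let the request body keep the original census column names, while the Python code works with valid identifiers.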
139139

140-
![fastapi income negative response][image13]
140+
<br>
141141

142+
## API Deployment
143+
* As open-source tool for our web service deployment, we use [Render](https://render.com/docs) with a free account. From the Render.com landing page, click the "Get Started" button to open the sign-up page. You can create an account by linking your GitHub, GitLab, or Google account, or by providing your email and password. Then, the Render account must be connected with our GitHub account so that the Render services can be used. Keep in mind that shell and jobs are not supported for free instance types. As stated in the FastAPI documentation by tiangolo: "For a web API, it normally involves putting it in a remote machine, with a server program that provides good performance, stability, etc, so that your users can access the application efficiently and without interruptions or problems." With a free account, however, the service is limited.
142144

143-
# API Deployment
144-
* As open-source tool for our web service deployment, we use [Render](https://render.com/docs) and a free account there. From the Render.com landing page, click the "Get Started" button to open the sign-up page. You can create an account by linking your GitHub, GitLab, or Google account or provide your email and password. Then, the render account must be connected with our GitHub account, so, the usage of render services is guaranteed. Have in mind, shell and jobs are not supported for free instance types.
145+
* Our new application is deployed from our public GitHub repository by creating a new [Web Service](https://render.com/docs/web-services) for this specific project GitHub URL.
145146

146-
* Our new application is deployed from our public GitHub repository by creating a new [Web Service](https://render.com/docs/web-services) for this specific GitHub URL. As it is written by FastAPI company tiangolo "For a web API, it normally involves putting it in a remote machine, with a server program that provides good performance, stability, etc, so that your users can access the application efficiently and without interruptions or problems."
147+
![render web service][image11]
147148

148-
![render web service][image14]
149+
<br>
149150

150-
* after selection, render starts its advanced deployment configuation, some parameters are already set, some have to be set manually appropriately. Render guides you through with easy to handle UI's.
151-
* That's it. Implement coding changes, push to the GitHub repository, and the app will automatically redeploy each time, but it will only deploy if your continuous integration action passes.
152-
* Have in mind: if you rely on your CI/CD to fail before fixing an issue, it slows down your deployment. Fix issues early, e.g. by running an ensemble linter like flake8 locally before committing changes.
151+
 * After selection, Render starts its advanced deployment configuration. Some parameters are already set, others have to be set manually. Render guides you through with easy-to-handle UIs.
152+
* That's it. Implement coding changes, push to the GitHub repository, and the app will automatically redeploy each time, but it will only deploy if your continuous integration action passes.
153+
 * Keep in mind: if you rely on your CI/CD to fail before fixing an issue, it slows down your deployment. Fix issues early, e.g. by running an ensemble linter like flake8 locally before committing changes.
153154
* For checking the render deployment, a python file exists that uses the httpx module to do one GET and POST on the live render web service and prints its results.
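
Such a check script might look like the sketch below. It is not the repository's actual file: the base URL, the `/predict` route, the payload fields and the function names are assumptions for illustration. Only `httpx.get` and `httpx.post` are real library calls.

```python
# Hypothetical placeholder; replace with the URL Render assigns to your service.
BASE_URL = "https://your-service-name.onrender.com"


def build_payload():
    """Return an example POST body (hypothetical subset of the census features)."""
    return {"age": 39, "hours-per-week": 40, "native-country": "United-States"}


def check_live_service(base_url=BASE_URL):
    """Send one GET and one POST to the live service and print the results."""
    # Imported here so that build_payload() stays usable without httpx installed.
    import httpx

    get_response = httpx.get(base_url + "/", timeout=30.0)
    print("GET status:", get_response.status_code)
    print("GET body:", get_response.text)

    # "/predict" is an assumed route name, not taken from the project.
    post_response = httpx.post(base_url + "/predict", json=build_payload(), timeout=30.0)
    print("POST status:", post_response.status_code)
    print("POST body:", post_response.text)
    return get_response.status_code, post_response.status_code
```

Calling `check_live_service()` with the real service URL should print two status codes of 200 when the deployment is healthy. The generous timeout is an assumption for cases where the service needs time to respond.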
154155

155156

model_card.md

Lines changed: 1 addition & 1 deletion
@@ -100,7 +100,7 @@ Validation results of the GridSearchCV best estimator are:<br>
100100
No ethical consideration topics regarding data, human life, risk and harms and their needed risk mitigation strategies or fraught model use cases are detected.
101101

102102
## Caveats and Recommendations
103-
- Regarding the data, have in mind that the raw data have among others a bias towards men (twice as many men as women) and persons originally from the U.S., so, scaling activities are mandatory getting appropriate prediction results.
103+
- Regarding the data, keep in mind that the raw data have, among others, a bias towards men (twice as many men as women) and white people mainly originally from the U.S., so scaling activities are mandatory to get appropriate prediction results.
104104
- Regarding the prediction task, the performance of the grid search cross validation approach is already improved compared to the one of the single instance, but still not the best. As future toDo, final tuning of the XGBoost Classifier via <i>Hyperopt</i> library is recommended.
105105
- Additionally, feature importance information of the final resulting XGBoost Classifier model is critical to understand the prediction process. As additional future toDo: usage of <i>SHAP</i> diagrams or a simple <i>xgb feature_importances_ parameter</i> bar chart of the X_train columns from the GridSearchCV best model result for identifying which features are most relevant for the target variable.
106106
- Last topic as future toDo is the usage of other classifier types and their evaluation compared to the XGBoost Classifier, even though it was often used by teams that won Kaggle competitions.
