Commit 325b0c3

Merge pull request #19 from IloBe/8-rework-of-overall-readme-file
Feat: modify project readme
2 parents e7b23a2 + 08acd91 commit 325b0c3

21 files changed

Lines changed: 1108 additions & 122 deletions

.github/workflows/python-app.yml

Lines changed: 9 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -38,7 +38,12 @@ jobs:
3838
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
3939
- name: DVC install # see: https://github.com/iterative/setup-dvc
4040
uses: iterative/setup-dvc@v1
41-
# - name: Test with pytest
42-
# run: |
43-
# Final action: runs pytest for all tests in the ./tests dir
44-
# pytest ./tests -vv
41+
# - name: DVC
42+
# run: |
43+
# # see: https://dvc.org/doc/user-guide/troubleshooting#missing-files,
44+
# # no remote cloud service available yet
45+
# dvc pull
46+
# - name: Test with pytest
47+
# run: |
48+
# # Final action: runs pytest for all tests in the ./tests dir
49+
# pytest ./tests/*.py -vv

.gitignore

Lines changed: 4 additions & 1 deletion
@@ -159,6 +159,9 @@ cython_debug/
159159
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
160160
#.idea/
161161

162-
# census app files
162+
# MLOPs proj3 US census app files
163+
./mlops_proj3_tree.txt
164+
./mlops_proj3_tree_dirs.txt
163165
./logs/*.*
164166
census_app.log
167+

LICENSE.txt

Lines changed: 18 additions & 43 deletions
@@ -1,46 +1,21 @@
1-
Copyright © 2012 - 2020, Udacity, Inc.
1+
MIT License
22

3-
Udacity hereby grants you a license in and to the Educational Content, including
4-
but not limited to homework assignments, programming assignments, code samples,
5-
and other educational materials and tools (as further described in the Udacity
6-
Terms of Use), subject to, as modified herein, the terms and conditions of the
7-
Creative Commons Attribution-NonCommercial- NoDerivs 3.0 License located at
8-
http://creativecommons.org/licenses/by-nc-nd/4.0 and successor locations for
9-
such license (the "CC License") provided that, in each case, the Educational
10-
Content is specifically marked as being subject to the CC License.
3+
Copyright (c) 2023 Ilona Brinkmeier
114

12-
Udacity expressly defines the following as falling outside the definition of
13-
"non-commercial":
14-
(a) the sale or rental of (i) any part of the Educational Content, (ii) any
15-
derivative works based at least in part on the Educational Content, or (iii)
16-
any collective work that includes any part of the Educational Content;
17-
(b) the sale of access or a link to any part of the Educational Content without
18-
first obtaining informed consent from the buyer (that the buyer is aware
19-
that the Educational Content, or such part thereof, is available at the
20-
Website free of charge);
21-
(c) providing training, support, or editorial services that use or reference the
22-
Educational Content in exchange for a fee;
23-
(d) the sale of advertisements, sponsorships, or promotions placed on the
24-
Educational Content, or any part thereof, or the sale of advertisements,
25-
sponsorships, or promotions on any website or blog containing any part of
26-
the Educational Material, including without limitation any "pop-up
27-
advertisements";
28-
(e) the use of Educational Content by a college, university, school, or other
29-
educational institution for instruction where tuition is charged; and
30-
(f) the use of Educational Content by a for-profit corporation or non-profit
31-
entity for internal professional development or training.
5+
Permission is hereby granted, free of charge, to any person obtaining a copy
6+
of this software and associated documentation files (the "Software"), to deal
7+
in the Software without restriction, including without limitation the rights
8+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9+
copies of the Software, and to permit persons to whom the Software is
10+
furnished to do so, subject to the following conditions:
3211

33-
THE SERVICES AND ONLINE COURSES (INCLUDING ANY CONTENT) ARE PROVIDED "AS IS" AND
34-
"AS AVAILABLE" WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER
35-
EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
36-
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. YOU
37-
ASSUME TOTAL RESPONSIBILITY AND THE ENTIRE RISK FOR YOUR USE OF THE SERVICES,
38-
ONLINE COURSES, AND CONTENT. WITHOUT LIMITING THE FOREGOING, WE DO NOT WARRANT
39-
THAT (A) THE SERVICES, WEBSITES, CONTENT, OR THE ONLINE COURSES WILL MEET YOUR
40-
REQUIREMENTS OR EXPECTATIONS OR ACHIEVE THE INTENDED PURPOSES, (B) THE WEBSITES
41-
OR THE ONLINE COURSES WILL NOT EXPERIENCE OUTAGES OR OTHERWISE BE UNINTERRUPTED,
42-
TIMELY, SECURE OR ERROR-FREE, (C) THE INFORMATION OR CONTENT OBTAINED THROUGH
43-
THE SERVICES, SUCH AS CHAT ROOM SERVICES, WILL BE ACCURATE, COMPLETE, CURRENT,
44-
ERROR- FREE, COMPLETELY SECURE OR RELIABLE, OR (D) THAT DEFECTS IN OR ON THE
45-
SERVICES OR CONTENT WILL BE CORRECTED. YOU ASSUME ALL RISK OF PERSONAL INJURY,
46-
INCLUDING DEATH AND DAMAGE TO PERSONAL PROPERTY, SUSTAINED FROM USE OF SERVICES.
12+
The above copyright notice and this permission notice shall be included in all
13+
copies or substantial portions of the Software.
14+
15+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21+
SOFTWARE.

README.md

Lines changed: 140 additions & 39 deletions
@@ -1,56 +1,157 @@
1+
[//]: # (Image References)
2+
[image0]: ./screenshots/MLOps_proj3_tree.PNG "proj3 structure"
3+
[image1]: ./plots/numFeats_outlierDist_sex_boxplot.png "feat dist by sex plot"
4+
[image2]: ./plots/normalDistTest_hours-per-week.PNG "hours-per-week gauss dist or not"
5+
[image3]: ./plots/general_dist_age-hoursPerWeek_boxplot.png "hours-per-week by age boxplots"
6+
[image4]: ./plots/hoursPerWeek-Regression_dist_age-race_plot.png "regression hours-per-week by age race"
7+
[image5]: ./plots/salary_dist_hoursPerWeek-age-sex_plot.png "salary dist by age sex plot"
8+
[image6]: ./plots/capitalGain_dist_age-hoursPerWeek-sex_plot.png "capital gain dist by hours-per-week sex"
9+
[image7]: ./screenshots/education-group_people-count.PNG "education people-count grouping"
10+
[image8]: ./plots/eduLevel_dist_age-race_plot.png "education level grouping by age race"
11+
[image9]: ./screenshots/MLOps_proj3_FastAPI_gitHubPrecommitHook.PNG "github action"
12+
[image10]: ./screenshots/MLOps_proj3_FastAPI_docsLandingPage.PNG "fastapi landing page"
13+
[image11]: ./screenshots/MLOps_proj3_FastAPI_docsGetRootWelcomeMsg.PNG "fastapi welcome"
14+
[image12]: ./screenshots/MLOps_proj3_FastAPI_docsPredictPersonIncomeNegativeExample.PNG "fastapi income negative"
15+
[image13]: ./screenshots/MLOps_proj3_FastAPI_docsPredictPersonIncomeNegativeExample_ResponseCode.PNG "fastapi income negative response"
16+
[image14]: ./screenshots/render_createNewWebService.PNG "render web service"
17+
18+
119
# Creating and Deploying a Classifier Pipeline for US Census Data
220

3-
This is the third project of the course <i>MLOps Engineer Nanodegree</i> by Udacity, called <i>Deploying a Scalable Pipeline in Production</i>. Its instructions are available in udacity's [repository](https://github.com/udacity/nd0821-c3-starter-code/tree/master/starter).
21+
This is the third project of the course <i>MLOps Engineer Nanodegree</i> by Udacity, called <i>Deploying a Scalable Pipeline in Production</i>. Its instructions are available in Udacity's [repository](https://github.com/udacity/nd0821-c3-starter-code/tree/master/starter).
22+
23+
We develop a classification model on publicly available US Census Bureau data and, as the business goal, monitor the model performance on various data slices.
24+
25+
Regarding software engineering principles, we create _unit tests_. Slice validation and the tests are incorporated into a _CI/CD framework_ using GitHub Actions. The model is then deployed as a web service using the FastAPI framework and the Render cloud platform.
26+
27+
Regarding data science goals for this classification task, we start with the ETL (Extract, Transform, Load) pipeline, including EDA (Exploratory Data Analysis) activities and reports, followed by the ML (Machine Learning) pipeline for the investigated prediction model, in our case a binary XGBoost classifier. The estimator is selected using cross-validation with early stopping during the training phase.
28+
29+
General information about the deployed XGBoost classifier, the data used, the training conditions and the evaluation results can be found in the [Model Card](https://github.com/IloBe/US_CensusData_Classifier_PipelineWithDeployment/blob/master/model_card.md) description.
30+
31+
The unit tests are written with _pytest_ and cover the GET and POST prediction requests of the FastAPI component as well as the data and model tasks mentioned above. All unit test results are reported in the associated HTML files in the [tests directory](https://github.com/IloBe/US_CensusData_Classifier_PipelineWithDeployment/tree/master/tests).
32+
33+
All project-relevant configuration values, including the model hyperparameter ranges for cross-validation, are handled in a dedicated YAML configuration file. For versioning, _git_ and _dvc_ are used, each controlled by its own ignore file.
34+
35+
36+
## Environment Set up
37+
* Working in a command line environment is recommended for ease of use with git and dvc. On Windows, [WSL2 with Ubuntu (Linux)](https://ubuntu.com/tutorials/install-ubuntu-on-wsl2-on-windows-11-with-gui-support#1-overview) is recommended.
38+
* We expect that you have at least Python 3.10.9 installed, e.g. via conda, and that you have forked this project repo and cloned it locally to work on it on your own. In your root directory `path/to/US-census-project`, create and activate a new virtual environment as appropriate for your OS, then use the supplied _requirements.txt_ file to install the needed libraries, e.g. via
39+
40+
```
41+
pip install -r requirements/requirements.txt
42+
```
43+
or use
44+
45+
```
46+
conda create -n [envname] "python=3.10.9" scikit-learn pandas numpy pytest jupyter jupyterlab fastapi uvicorn ... <library-list> -c conda-forge
47+
```
48+
49+
50+
## Project Structure
51+
* The main code files are stored in the ``src`` subdirectory and the test scripts in the ``tests`` subdirectory of the project root. The FastAPI RESTful web application is started via the _main.py_ file stored in the src directory, while the associated schemas and request example data are part of the src/app subdirectory. All administrative asset files, like plots, screenshots, configuration, logs, as well as model and dataset files, are stored in their own directories alongside the source code.<br>
52+
53+
* The general project structure looks like:<br>
54+
![proj3 structure][image0]
55+
56+
57+
* In our GitHub repository, an automatic GitHub Actions workflow is set up to check, among other things, dependencies, linting and unit tests.
58+
![github action][image9]
59+
460

5-
We develop a classification model on publicly available US Census Bureau data. Regarding software engineering principles, we
6-
create _unit tests_ to monitor the model performance on various data slices. Then, we _deploy_ your model using the FastAPI package and create API tests. The slice validation and the API tests will be incorporated into a _CI/CD framework_ using GitHub Actions.
61+
## Data
62+
* The downloaded raw _census.csv_ file is preprocessed and stored as a new .csv file. Both files are committed and versioned with _dvc_.
63+
* Some exploratory data analysis is implemented and visualised. The results are stored as .png plot or screenshot files.
764

8-
For this classification prediction, we start with the ETL (Extract, Transform, Load) pipeline including EDA (Exploratory Data Analysis) activities, followed by the ML (Machine Learning) pipeline for the investigated prediction models.
65+
The following examples show, among others, the distributions of hours-per-week, education, capital-gain and salary by a few feature attributes such as age, sex or race. Several other insights are visualised and stored as .png files, so have a look there if you are interested in further analysis.
966

67+
![feat dist by sex plot][image1]
1068

11-
... future toDo: rework of readme text ...
69+
![hours-per-week gauss dist or not][image2]
1270

71+
![hours-per-week by age boxplots][image3]
1372

14-
Working in a command line environment is recommended for ease of use with git and dvc. If on Windows, WSL1 or 2 is recommended.
73+
![regression hours-per-week by age race][image4]
1574

16-
# Environment Set up
17-
* Download and install conda if you don’t have it already.
18-
* Use the supplied requirements file to create a new environment, or
19-
* conda create -n [envname] "python=3.8" scikit-learn pandas numpy pytest jupyter jupyterlab fastapi uvicorn -c conda-forge
20-
* Install git either through conda (“conda install git”) or through your CLI, e.g. sudo apt-get git.
75+
![salary dist by age sex plot][image5]
2176

22-
## Repositories
23-
* Create a directory for the project and initialize git.
24-
* As you work on the code, continually commit changes. Trained models you want to use in production must be committed to GitHub.
25-
* Connect your local git repo to GitHub.
26-
* Setup GitHub Actions on your repo. You can use one of the pre-made GitHub Actions if at a minimum it runs pytest and flake8 on push and requires both to pass without error.
27-
* Make sure you set up the GitHub Action to have the same version of Python as you used in development.
77+
![capital gain dist by hours-per-week sex][image6]
78+
79+
![education_people_count group][image7]
80+
81+
![education level grouping by age race][image8]
2882

29-
# Data
30-
* Download census.csv and commit it to dvc.
31-
* This data is messy, try to open it in pandas and see what you get.
32-
* To clean it, use your favorite text editor to remove all spaces.
3383

3484
# Model
35-
* Using the starter code, write a machine learning model that trains on the clean data and saves the model. Complete any function that has been started.
36-
* Write unit tests for at least 3 functions in the model code.
37-
* Write a function that outputs the performance of the model on slices of the data.
38-
* Suggestion: for simplicity, the function can just output the performance on slices of just the categorical features.
39-
* Write a model card using the provided template.
85+
* _XGBoost Classifier_ is selected as the machine learning model that trains on the clean data. The best estimator found and evaluated is stored as a pickle file (...artifact.pkl) in the associated model directory.
86+
* Additionally, a function exists that outputs the performance of the model on slices of the categorical features. Performance evaluation metrics of such categorical census feature slices are stored in a _slice_output.txt_ file. As an example, the metric block looks like:
87+
88+
```
89+
workclass - Private:
90+
Precision: 0.83, Recall: 0.66, Fbeta: 0.73
91+
Confusion Matrix:
92+
[[2907 119]
93+
[ 297 572]]
94+
95+
workclass - Self-emp-not-inc:
96+
Precision: 0.83, Recall: 0.57, Fbeta: 0.68
97+
Confusion Matrix:
98+
[[358 16]
99+
[ 58 77]]
100+
101+
...
102+
```
103+
* As mentioned, the model card describes our findings on the binary classification estimator, including evaluation diagrams and general metrics.
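
The metric lines in the slice output above can be reproduced from the printed confusion matrices. The following is a minimal sketch, not the project's own slice function: the helper name `metrics_from_confusion` is hypothetical, the matrix layout `[[TN, FP], [FN, TP]]` (the scikit-learn convention) is an assumption, and so is `beta = 1`. Both assumptions agree with the two example slices shown.

```python
from typing import Sequence, Tuple


def metrics_from_confusion(
    matrix: Sequence[Sequence[int]], beta: float = 1.0
) -> Tuple[float, float, float]:
    """Return (precision, recall, fbeta) for a 2x2 confusion matrix.

    Assumes the layout [[TN, FP], [FN, TP]]; this is an assumption,
    not something stated in the slice output file itself.
    """
    (_tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta**2 * precision + recall
    fbeta = (1 + beta**2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta


# Confusion matrices copied from the slice output example.
slices = {
    "workclass - Private": [[2907, 119], [297, 572]],
    "workclass - Self-emp-not-inc": [[358, 16], [58, 77]],
}

for name, matrix in slices.items():
    p, r, f = metrics_from_confusion(matrix)
    print(f"{name}: Precision: {p:.2f}, Recall: {r:.2f}, Fbeta: {f:.2f}")
```

Rounded to two decimals, the printed values match the precision, recall and Fbeta figures listed for both workclass slices.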
104+
40105

41106
# API Creation
42-
* Create a RESTful API using FastAPI this must implement:
43-
* GET on the root giving a welcome message.
44-
* POST that does model inference.
45-
* Type hinting must be used.
46-
* Use a Pydantic model to ingest the body from POST. This model should contain an example.
47-
* Hint: the data has names with hyphens and Python does not allow those as variable names. Do not modify the column names in the csv and instead use the functionality of FastAPI/Pydantic/etc to deal with this.
48-
* Write 3 unit tests to test the API (one for the GET and two for POST, one that tests each prediction).
107+
* As the web framework for creating the RESTful API, _fastapi_ is chosen. A _pydantic_ _BaseModel_ instance handles the POST body, e.g. dealing with hyphens in the data feature names, which are not allowed in Python identifiers.
108+
109+
* As a high-performance ASGI server, [uvicorn](https://www.uvicorn.org/) is selected. The _uvicorn_ server for the FastAPI web app can be started from the project's root directory via the Python CLI command:
110+
111+
```
112+
python ./src/main.py
113+
```
114+
115+
There, the `__main__` block calls
116+
117+
```
118+
uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True)
119+
```
120+
121+
Remember that this code is for development purposes; in production, the reload option should be set to False or omitted. In other words, the start command, e.g. on our Render deployment web service (see below), is:<br>
122+
uvicorn main:app --host 0.0.0.0 --port 8000
123+
124+
* We then open the web application in the browser at
125+
126+
```
127+
http://127.0.0.1:8000/docs
128+
or
129+
http://localhost:8000/docs
130+
```
131+
132+
As an example, for the use case of a person earning <=50K as income, you get the following UI views:
133+
134+
![fastapi landing page][image10]
135+
136+
![fastapi welcome][image11]
137+
138+
![fastapi income negative][image12]
139+
140+
![fastapi income negative response][image13]
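
The hyphen handling mentioned at the start of this section can be sketched with pydantic field aliases. This is an illustration under assumptions, not the project's actual schema: the class name `CensusRecord`, the chosen subset of features and the example values are hypothetical.

```python
from pydantic import BaseModel, Field


class CensusRecord(BaseModel):
    """Hypothetical subset of the census features, for illustration only."""

    age: int
    workclass: str
    # Hyphenated column names are not valid Python identifiers,
    # so each field gets an alias that matches the original name.
    capital_gain: int = Field(alias="capital-gain")
    hours_per_week: int = Field(alias="hours-per-week")


# A POST body keeps the original hyphenated column names (values are made up).
payload = {
    "age": 39,
    "workclass": "Private",
    "capital-gain": 0,
    "hours-per-week": 40,
}

record = CensusRecord(**payload)
print(record.capital_gain, record.hours_per_week)
```

With this pattern the CSV column names stay untouched, while the Python code works with underscore attribute names.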
141+
49142

50143
# API Deployment
51-
* Create a free Heroku account (for the next steps you can either use the web GUI or download the Heroku CLI).
52-
* Create a new app and have it deployed from your GitHub repository.
53-
* Enable automatic deployments that only deploy if your continuous integration passes.
54-
* Hint: think about how paths will differ in your local environment vs. on Heroku.
55-
* Hint: development in Python is fast! But how fast you can iterate slows down if you rely on your CI/CD to fail before fixing an issue. I like to run flake8 locally before I commit changes.
56-
* Write a script that uses the requests module to do one POST on your live API.
144+
* For our web service deployment, we use [Render](https://render.com/docs) with a free account. From the Render.com landing page, click the "Get Started" button to open the sign-up page. You can create an account by linking your GitHub, GitLab, or Google account, or by providing your email and password. The Render account must then be connected to our GitHub account so that the Render services can be used. Keep in mind that shell access and jobs are not supported on free instance types.
145+
146+
* Our new application is deployed from our public GitHub repository by creating a new [Web Service](https://render.com/docs/web-services) for this specific GitHub URL. As tiangolo, the author of FastAPI, writes: "For a web API, it normally involves putting it in a remote machine, with a server program that provides good performance, stability, etc, so that your users can access the application efficiently and without interruptions or problems."
147+
148+
![render web service][image14]
149+
150+
* After selection, Render starts its advanced deployment configuration. Some parameters are already set; others have to be set manually. Render guides you through the process with an easy-to-use UI.
151+
* That's it. Implement coding changes, push to the GitHub repository, and the app will automatically redeploy each time, but it will only deploy if your continuous integration action passes.
152+
* Keep in mind: if you rely on your CI/CD to fail before fixing an issue, it slows down your deployment. Fix issues early, e.g. by running a linter like flake8 locally before committing changes.
153+
* For checking the Render deployment, a Python file exists that uses the httpx module to send one GET and one POST request to the live Render web service and prints the results.
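
A live check of this kind can be sketched as follows. To stay dependency-free, the sketch uses the standard library's `urllib` instead of the httpx module used in the project. The base URL, the `/predict` route and the payload fields are hypothetical placeholders, and the request is only built here, not sent.

```python
import json
import urllib.request

# Hypothetical placeholder; replace with the URL of your own Render web service.
BASE_URL = "https://your-service.onrender.com"


def build_post_request(base_url: str, route: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given route."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{route.lstrip('/')}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> tuple:
    """Send the request and return (status_code, decoded JSON body)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, json.loads(response.read().decode("utf-8"))


# Route name and payload fields are assumptions for illustration only.
request = build_post_request(BASE_URL, "/predict", {"age": 39, "hours-per-week": 40})
print(request.get_method(), request.full_url)
# To query a live service: status, result = send(request)
```

Separating request construction from sending keeps the payload logic checkable without network access; the `send` helper is only needed once a real service URL is filled in.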
154+
155+
156+
## License
157+
The code of this project is released under the [MIT](https://github.com/IloBe/US_CensusData_Classifier_PipelineWithDeployment/blob/master/LICENSE.txt) license.

data/preproc_census.csv.dvc

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
11
outs:
2-
- md5: 266f9ba7e3517c486a1bfff27d02c806
3-
size: 3203951
2+
- md5: da9b7e65b7c6170ab2d36ad9d7fc92bd
3+
size: 3098866
44
hash: md5
55
path: preproc_census.csv
