|
| 1 | +[//]: # (Image References) |
| 2 | +[image0]: ./screenshots/MLOps_proj3_tree.PNG "proj3 structure"
| 3 | +[image1]: ./plots/numFeats_outlierDist_sex_boxplot.png "feat dist by sex plot" |
| 4 | +[image2]: ./plots/normalDistTest_hours-per-week.PNG "hours-per-week gauss dist or not" |
| 5 | +[image3]: ./plots/general_dist_age-hoursPerWeek_boxplot.png "hours-per-week by age boxplots" |
| 6 | +[image4]: ./plots/hoursPerWeek-Regression_dist_age-race_plot.png "regression hours-per-week by age race" |
| 7 | +[image5]: ./plots/salary_dist_hoursPerWeek-age-sex_plot.png "salary dist by age sex plot" |
| 8 | +[image6]: ./plots/capitalGain_dist_age-hoursPerWeek-sex_plot.png "capital gain dist by hours-per-week sex" |
| 9 | +[image7]: ./screenshots/education-group_people-count.PNG "education people-count grouping"
| 10 | +[image8]: ./plots/eduLevel_dist_age-race_plot.png "education level grouping by age race" |
| 11 | +[image9]: ./screenshots/MLOps_proj3_FastAPI_gitHubPrecommitHook.PNG "github action" |
| 12 | +[image10]: ./screenshots/MLOps_proj3_FastAPI_docsLandingPage.PNG "fastapi landing page" |
| 13 | +[image11]: ./screenshots/MLOps_proj3_FastAPI_docsGetRootWelcomeMsg.PNG "fastapi welcome" |
| 14 | +[image12]: ./screenshots/MLOps_proj3_FastAPI_docsPredictPersonIncomeNegativeExample.PNG "fastapi income negative" |
| 15 | +[image13]: ./screenshots/MLOps_proj3_FastAPI_docsPredictPersonIncomeNegativeExample_ResponseCode.PNG "fastapi income negative response" |
| 16 | +[image14]: ./screenshots/render_createNewWebService.PNG "render web service" |
| 17 | + |
| 18 | + |
1 | 19 | # Creating and Deploying a Classifier Pipeline for US Census Data |
2 | 20 |
|
3 | | -This is the third project of the course <i>MLOps Engineer Nanodegree</i> by Udacity, called <i>Deploying a Scalable Pipeline in Production</i>. Its instructions are available in udacity's [repository](https://github.com/udacity/nd0821-c3-starter-code/tree/master/starter). |
| 21 | +This is the third project of the course <i>MLOps Engineer Nanodegree</i> by Udacity, called <i>Deploying a Scalable Pipeline in Production</i>. Its instructions are available in Udacity's [repository](https://github.com/udacity/nd0821-c3-starter-code/tree/master/starter). |
| 22 | + |
| 23 | +We develop a classification model on publicly available US Census Bureau data and, as a business goal, monitor the model performance on various data slices.
| 24 | + |
| 25 | +Regarding software engineering principles, we create _unit tests_. Slice validation and the tests are incorporated into a _CI/CD framework_ using GitHub Actions. Then, the model is deployed using the FastAPI framework, with Render as web service platform.
| 26 | + |
| 27 | +Regarding data science goals for this classification prediction, we start with the ETL (Extract, Transform, Load) pipeline including EDA (Exploratory Data Analysis) activities and reports, followed by the ML (Machine Learning) pipeline for the investigated prediction model, in our case a binary XGBoost Classifier. The estimator is selected using cross-validation with early stopping during the training phase.
| 28 | + |
| 29 | +General information about the deployed XGBoost classifier, the data used, the training conditions and the evaluation results can be found in the [Model Card](https://github.com/IloBe/US_CensusData_Classifier_PipelineWithDeployment/blob/master/model_card.md) description.
| 30 | + |
| 31 | +The unit tests are written with _pytest_ for GET and POST prediction requests of the FastAPI component as well as for the mentioned data and model task parts. All unit test results are reported in associated html files in the [tests directory](https://github.com/IloBe/US_CensusData_Classifier_PipelineWithDeployment/tree/master/tests).
| 32 | + |
| 33 | +All project-relevant configuration values, including model hyperparameter ranges for the cross-validation, are handled via a dedicated YAML configuration file. For versioning tasks, _git_ and _dvc_ are chosen, each controlled by its own ignore file.
| 34 | + |
| 35 | + |
| 36 | +## Environment Set up |
| 37 | +* Working in a command line environment is recommended for ease of use with git and dvc. If on Windows, [WSL2 and Ubuntu (Linux)](https://ubuntu.com/tutorials/install-ubuntu-on-wsl2-on-windows-11-with-gui-support#1-overview) is recommended. |
| 38 | +* We expect that you have at least Python 3.10.9 installed, e.g. via conda, and that you have forked this project repository and cloned it locally to work on it on your own. In your root directory `path/to/US-census-project`, create and activate a new virtual environment as appropriate for your OS, then use the supplied _requirements.txt_ file to install the needed libraries, e.g. via
| 39 | + |
| 40 | + ``` |
| 41 | + pip install -r requirements/requirements.txt |
| 42 | + ``` |
| 43 | +or use |
| 44 | + |
| 45 | + ``` |
| 46 | + conda create -n [envname] "python=3.10.9" scikit-learn pandas numpy pytest jupyter jupyterlab fastapi uvicorn ... <library-list> -c conda-forge |
| 47 | + ``` |
| 48 | + |
| 49 | + |
| 50 | +## Project Structure |
| 51 | +* Main source files are stored in the ``src`` subdirectory and test scripts in the ``tests`` subdirectory of the project root. The FastAPI RESTful web application is started via the _main.py_ file stored in the src directory, while the associated schemas and request example data are part of the src/app subdirectory. All administrative asset files, like plots, screenshots, configuration, logs, as well as model and dataset files, are stored in their own directories alongside the source code.<br>
| 52 | + |
| 53 | +* The general project structure looks like:<br> |
| 54 | +![proj3 structure][image0] |
| 55 | + |
| 56 | + |
| 57 | +* In our GitHub repository, an automated Action workflow is set up to check, among other things, dependencies, linting and unit tests.
| 58 | +![github action][image9] |
| 59 | + |
4 | 60 |
|
5 | | -We develop a classification model on publicly available US Census Bureau data. Regarding software engineering principles, we |
6 | | -create _unit tests_ to monitor the model performance on various data slices. Then, we _deploy_ your model using the FastAPI package and create API tests. The slice validation and the API tests will be incorporated into a _CI/CD framework_ using GitHub Actions. |
| 61 | +## Data |
| 62 | +* The downloaded raw _census.csv_ file is preprocessed and stored as a new .csv file. Both files are committed and versioned with _dvc_.
| 63 | +* Some exploratory data analysis is implemented and visualised. The results are stored as .png plot or screenshot files.
7 | 64 |
|
8 | | -For this classification prediction, we start with the ETL (Extract, Transform, Load) pipeline including EDA (Exploratory Data Analysis) activities, followed by the ML (Machine Learning) pipeline for the investigated prediction models. |
| 65 | +Examples are shown below, covering, among others, the distributions of hours-per-week, education, capital-gain and salary by a few feature attributes like age, sex or race. Several other insights are visualised and stored as .png files, so have a look there if you are interested in further analysis.
9 | 66 |
|
| 67 | +![feat dist by sex plot][image1] |
10 | 68 |
|
11 | | -... future toDo: rework of readme text ... |
| 69 | +![hours-per-week gauss dist or not][image2] |
12 | 70 |
|
| 71 | +![hours-per-week by age boxplots][image3] |
13 | 72 |
|
14 | | -Working in a command line environment is recommended for ease of use with git and dvc. If on Windows, WSL1 or 2 is recommended. |
| 73 | +![regression hours-per-week by age race][image4] |
15 | 74 |
|
16 | | -# Environment Set up |
17 | | -* Download and install conda if you don’t have it already. |
18 | | - * Use the supplied requirements file to create a new environment, or |
19 | | - * conda create -n [envname] "python=3.8" scikit-learn pandas numpy pytest jupyter jupyterlab fastapi uvicorn -c conda-forge |
20 | | - * Install git either through conda (“conda install git”) or through your CLI, e.g. sudo apt-get git. |
| 75 | +![salary dist by age sex plot][image5]
21 | 76 |
|
22 | | -## Repositories |
23 | | -* Create a directory for the project and initialize git. |
24 | | - * As you work on the code, continually commit changes. Trained models you want to use in production must be committed to GitHub. |
25 | | -* Connect your local git repo to GitHub. |
26 | | -* Setup GitHub Actions on your repo. You can use one of the pre-made GitHub Actions if at a minimum it runs pytest and flake8 on push and requires both to pass without error. |
27 | | - * Make sure you set up the GitHub Action to have the same version of Python as you used in development. |
| 77 | +![capital gain dist by hours-per-week sex][image6] |
| 78 | + |
| 79 | +![education_people_count group][image7] |
| 80 | + |
| 81 | +![education level grouping by age race][image8] |
28 | 82 |
|
29 | | -# Data |
30 | | -* Download census.csv and commit it to dvc. |
31 | | -* This data is messy, try to open it in pandas and see what you get. |
32 | | -* To clean it, use your favorite text editor to remove all spaces. |
33 | 83 |
|
34 | 84 | # Model |
35 | | -* Using the starter code, write a machine learning model that trains on the clean data and saves the model. Complete any function that has been started. |
36 | | -* Write unit tests for at least 3 functions in the model code. |
37 | | -* Write a function that outputs the performance of the model on slices of the data. |
38 | | - * Suggestion: for simplicity, the function can just output the performance on slices of just the categorical features. |
39 | | -* Write a model card using the provided template. |
| 85 | +* As machine learning model trained on the clean data, an _XGBoost Classifier_ is selected. The best estimator found and evaluated is stored as a pickle file (...artifact.pkl) in the associated model directory.
| 86 | +* Additionally, a function exists that outputs the performance of the model on slices of the categorical features. Performance evaluation metrics of such categorical census feature slices are stored in a _slice_output.txt_ file. As an example, the metric block looks like: |
| 87 | + |
| 88 | + ``` |
| 89 | + workclass - Private: |
| 90 | + Precision: 0.83, Recall: 0.66, Fbeta: 0.73 |
| 91 | + Confusion Matrix: |
| 92 | + [[2907 119] |
| 93 | + [ 297 572]] |
| 94 | + |
| 95 | + workclass - Self-emp-not-inc: |
| 96 | + Precision: 0.83, Recall: 0.57, Fbeta: 0.68 |
| 97 | + Confusion Matrix: |
| 98 | + [[358 16] |
| 99 | + [ 58 77]] |
| 100 | + |
| 101 | + ... |
| 102 | + ``` |
| 103 | +* As mentioned, the model card describes our insights about the binary classification estimator, including evaluation diagrams and general metrics.
| 104 | + |
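The slice metrics shown in the block above can be reproduced from the confusion matrix counts. The following is only a minimal sketch in plain Python, not the project's implementation: the function names, the row layout with `label` and `pred` keys, and the convention of returning 1.0 when a denominator is zero are assumptions made for illustration.

```python
from collections import Counter


def slice_counts(rows, feature, label_key="label", pred_key="pred"):
    """Count TP, FP and FN per category of one categorical feature.

    `rows` is a list of dicts; the key names are hypothetical.
    """
    counts = {}
    for row in rows:
        c = counts.setdefault(row[feature], Counter())
        if row[pred_key] == 1:
            c["tp" if row[label_key] == 1 else "fp"] += 1
        elif row[label_key] == 1:
            c["fn"] += 1
    return counts


def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Compute precision, recall and F-beta from confusion counts.

    Assumption: an empty denominator yields 1.0 for precision and recall.
    """
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    denom = beta**2 * precision + recall
    fbeta = (1 + beta**2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta


# Counts read from the "workclass - Private" confusion matrix above:
# [[TN FP], [FN TP]] = [[2907 119], [297 572]]
private = precision_recall_fbeta(tp=572, fp=119, fn=297)
print("Precision: %.2f, Recall: %.2f, Fbeta: %.2f" % private)

# Tiny illustrative data set (made-up rows, not census data).
rows = [
    {"workclass": "Private", "label": 1, "pred": 1},
    {"workclass": "Private", "label": 0, "pred": 1},
    {"workclass": "State-gov", "label": 1, "pred": 0},
]
per_slice = {
    category: precision_recall_fbeta(c["tp"], c["fp"], c["fn"])
    for category, c in slice_counts(rows, "workclass").items()
}
```

With beta set to 1, the rounded values match the two example slices listed in the metric block.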
40 | 105 |
|
41 | 106 | # API Creation |
42 | | -* Create a RESTful API using FastAPI this must implement: |
43 | | - * GET on the root giving a welcome message. |
44 | | - * POST that does model inference. |
45 | | - * Type hinting must be used. |
46 | | - * Use a Pydantic model to ingest the body from POST. This model should contain an example. |
47 | | - * Hint: the data has names with hyphens and Python does not allow those as variable names. Do not modify the column names in the csv and instead use the functionality of FastAPI/Pydantic/etc to deal with this. |
48 | | -* Write 3 unit tests to test the API (one for the GET and two for POST, one that tests each prediction). |
| 107 | +* As web framework to create a RESTful API, _fastapi_ is chosen for the app implementation. A _pydantic_ _BaseModel_ class handles the POST body, e.g. dealing with hyphens in data feature names, which are not allowed in Python identifiers.
| 108 | + |
| 109 | +* As high-performance ASGI server, [uvicorn](https://www.uvicorn.org/) is selected. The _uvicorn_ server of the FastAPI web app can be started in the project's root directory via the CLI Python command:
| 110 | + |
| 111 | + ``` |
| 112 | + python ./src/main.py |
| 113 | + ``` |
| 114 | + |
| 115 | +There, in the `__main__` block, it calls
| 116 | + |
| 117 | + ``` |
| 118 | + uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True) |
| 119 | + ``` |
| 120 | + |
| 121 | +Remember, this code is for development purposes; in production the reload option should be set to False or omitted. In other words, the start command, e.g. on our Render deployment web service (see below), is:<br>
| 122 | +`uvicorn main:app --host 0.0.0.0 --port 8000`
| 123 | + |
| 124 | +* So, we open the web application in the browser with
| 125 | + |
| 126 | + ``` |
| 127 | + http://127.0.0.1:8000/docs |
| 128 | + or |
| 129 | + http://localhost:8000/docs |
| 130 | + ``` |
| 131 | + |
| 132 | +As an example, for the use case of a person earning <=50K as income, you are going to get the following UIs:
| 133 | + |
| 134 | +![fastapi landing page][image10] |
| 135 | + |
| 136 | +![fastapi welcome][image11] |
| 137 | + |
| 138 | +![fastapi income negative][image12] |
| 139 | + |
| 140 | +![fastapi income negative response][image13] |
| 141 | + |
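The hyphen handling mentioned in this section can be sketched with field aliases. This is a minimal illustration, not the project's actual schema: the class name `Person` and the chosen subset of fields are assumptions, and the values are made up. Only the hyphenated column names come from the census data.

```python
from pydantic import BaseModel, Field


class Person(BaseModel):
    """Hypothetical request body with a subset of the census features."""

    age: int
    # Hyphens are not valid in Python identifiers, so the hyphenated
    # JSON key is declared as alias of a valid attribute name.
    education_num: int = Field(alias="education-num")
    hours_per_week: int = Field(alias="hours-per-week")
    native_country: str = Field(alias="native-country")


# Illustrative values only, not taken from the project's request examples.
body = {
    "age": 39,
    "education-num": 13,
    "hours-per-week": 40,
    "native-country": "United-States",
}
person = Person(**body)
print(person.hours_per_week)
```

The csv column names stay untouched this way. In Pydantic v2, `person.model_dump(by_alias=True)` returns the hyphenated keys again, which is useful when the features are passed on under their original column names.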
49 | 142 |
|
50 | 143 | # API Deployment |
51 | | -* Create a free Heroku account (for the next steps you can either use the web GUI or download the Heroku CLI). |
52 | | -* Create a new app and have it deployed from your GitHub repository. |
53 | | - * Enable automatic deployments that only deploy if your continuous integration passes. |
54 | | - * Hint: think about how paths will differ in your local environment vs. on Heroku. |
55 | | - * Hint: development in Python is fast! But how fast you can iterate slows down if you rely on your CI/CD to fail before fixing an issue. I like to run flake8 locally before I commit changes. |
56 | | -* Write a script that uses the requests module to do one POST on your live API. |
| 144 | +* For our web service deployment, we use [Render](https://render.com/docs) with a free account. From the Render.com landing page, click the "Get Started" button to open the sign-up page. You can create an account by linking your GitHub, GitLab, or Google account, or by providing your email and password. Then, the Render account must be connected to our GitHub account so that Render can access the repository. Keep in mind that shell and jobs are not supported for free instance types.
| 145 | + |
| 146 | +* Our new application is deployed from our public GitHub repository by creating a new [Web Service](https://render.com/docs/web-services) for this specific GitHub URL. As tiangolo, the author of FastAPI, writes: "For a web API, it normally involves putting it in a remote machine, with a server program that provides good performance, stability, etc, so that your users can access the application efficiently and without interruptions or problems."
| 147 | + |
| 148 | +![render web service][image14] |
| 149 | + |
| 150 | +  * After selection, Render starts its advanced deployment configuration; some parameters are already set, others have to be set manually. Render guides you through this with easy-to-handle UIs.
| 151 | + * That's it. Implement coding changes, push to the GitHub repository, and the app will automatically redeploy each time, but it will only deploy if your continuous integration action passes. |
| 152 | +  * Keep in mind: if you rely on your CI/CD to fail before fixing an issue, it slows down your deployment. Fix issues early, e.g. by running a linter like flake8 locally before committing changes.
| 153 | +* For checking the Render deployment, a Python file exists that uses the httpx module to send one GET and one POST request to the live Render web service and prints the results.
| 154 | + |
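Such a check of the live service could look roughly like the sketch below. It is not the project's script: the base URL, the `/predict` path and the feature values are placeholders chosen for illustration, so replace them with the real service URL and POST route. The feature names are the census column names.

```python
BASE_URL = "https://your-service.onrender.com"  # placeholder, not the real URL
PREDICT_PATH = "/predict"  # assumed route name


def build_payload():
    """Return one illustrative person record using the census column names."""
    return {
        "age": 39,
        "workclass": "State-gov",
        "fnlgt": 77516,
        "education": "Bachelors",
        "education-num": 13,
        "marital-status": "Never-married",
        "occupation": "Adm-clerical",
        "relationship": "Not-in-family",
        "race": "White",
        "sex": "Male",
        "capital-gain": 2174,
        "capital-loss": 0,
        "hours-per-week": 40,
        "native-country": "United-States",
    }


def check_live_service(base_url=BASE_URL):
    """Send one GET and one POST request and print status codes and bodies."""
    import httpx  # third-party; imported here so the payload helper works without it

    with httpx.Client(base_url=base_url, timeout=30.0) as client:
        get_resp = client.get("/")
        print("GET /:", get_resp.status_code, get_resp.text)

        post_resp = client.post(PREDICT_PATH, json=build_payload())
        print("POST:", post_resp.status_code, post_resp.text)
    return get_resp, post_resp


# Call this only against a running deployment:
# check_live_service()
```

A generous timeout is used because a free web service instance may need some time to respond after a period of inactivity.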
| 155 | + |
| 156 | +## License |
| 157 | +The code of this project is released under the [MIT](https://github.com/IloBe/US_CensusData_Classifier_PipelineWithDeployment/blob/master/LICENSE.txt) license.