Commit c2a6344

Merge pull request #184 from codeharborhub/dev-1
done mlops
2 parents 132f3fc + c29c9c0 commit c2a6344

File tree

8 files changed: +569 −0 lines changed
Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
---
title: "CI/CD/CT: Automated Pipelines for ML"
sidebar_label: CI/CD for ML
description: "Exploring Continuous Integration, Continuous Delivery, and Continuous Training in MLOps."
tags: [mlops, cicd, continuous-training, automation, jenkins, github-actions]
---

In traditional software, we have **CI** (Continuous Integration) and **CD** (Continuous Delivery). However, Machine Learning introduces a third dimension: **Data**. Because data changes over time, we need a third pillar: **CT** (Continuous Training).

## 1. The Three Pillars of MLOps Automation

To build a robust ML system, we must automate three distinct cycles:

### Continuous Integration (CI)

Beyond testing code, ML CI involves testing **data schemas** and **models**.

* **Code Testing:** Unit tests for feature engineering logic.
* **Data Testing:** Validating that incoming data matches the expected schema and distributions (see the sketch after this list).
* **Model Validation:** Ensuring the model architecture compiles and training runs without memory leaks.

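To make the data-testing idea concrete, here is a minimal sketch of a schema check, assuming a pandas DataFrame; the column names, dtypes, and file path are hypothetical, and dedicated tools such as Great Expectations are commonly used for this in practice.

```python
# test_data_schema.py: a minimal data-validation sketch (columns and path are hypothetical)
import pandas as pd

EXPECTED_COLUMNS = {"feature_1": "float64", "feature_2": "float64", "label": "int64"}

def test_schema(df: pd.DataFrame) -> None:
    # Each expected column must be present with the expected dtype
    for col, dtype in EXPECTED_COLUMNS.items():
        assert col in df.columns, f"Missing column: {col}"
        assert str(df[col].dtype) == dtype, f"Unexpected dtype for {col}: {df[col].dtype}"
    # A basic distribution guard: labels must stay binary
    assert set(df["label"].unique()) <= {0, 1}, "Label column is no longer binary"

test_schema(pd.read_csv("data/train.csv"))
```
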
### Continuous Delivery (CD)

This is the automation of deploying the model as a service.

* **Artifact Packaging:** Wrapping the model in a [Docker container](./model-deployment#2-the-containerization-standard-docker).
* **Integration Testing:** Ensuring the API endpoint responds correctly to requests (a sample test follows this list).
* **Deployment:** Moving the model to a staging or production environment using [Canary or Blue-Green strategies](./model-deployment#3-deployment-strategies).

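As a sketch of the integration-testing step, the snippet below probes a deployed prediction endpoint; the URL, payload fields, and response shape are assumptions for illustration.

```python
# test_endpoint.py: a minimal integration-test sketch (URL and fields are illustrative)
import os
import requests

BASE_URL = os.environ.get("STAGING_URL", "http://localhost:8000")

def test_predict_endpoint():
    payload = {"feature_1": 1.0, "feature_2": 2.0}
    resp = requests.post(f"{BASE_URL}/predict", json=payload, timeout=5)
    assert resp.status_code == 200, f"Unexpected status: {resp.status_code}"
    assert "prediction" in resp.json(), "Response is missing the 'prediction' key"

if __name__ == "__main__":
    test_predict_endpoint()
    print("Integration test passed.")
```
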
### Continuous Training (CT)

This pillar is unique to ML: the system automatically retrains and re-serves the model when new data arrives or when [Model Drift](./monitoring#1-why-models-decay) is detected.

## 2. The MLOps Maturity Levels

Google defines the evolution of CI/CD in ML through three levels of maturity:

1. **Level 0 (Manual):** Every step (data prep, training, deployment) is done manually in notebooks.
2. **Level 1 (Automated Training):** The training pipeline is automated. Whenever new data arrives, training and validation happen automatically (CT).
3. **Level 2 (CI/CD Pipeline Automation):** The entire workflow—from code commits to model monitoring—is a fully automated CI/CD pipeline.

## 3. The Automated Workflow

The following diagram illustrates how a code change or a "Drift" alert triggers a sequence of automated events.

```mermaid
graph TD
    Code[Code Commit / Data Drift Alert] --> CI[CI: Build & Test]

    subgraph Pipeline [Automated ML Pipeline]
        CI --> Train[Continuous Training]
        Train --> Eval[Model Evaluation]
        Eval --> Validate{Meets Threshold?}
    end

    Validate -- No --> Fail[Alert Developer]
    Validate -- Yes --> Register[Model Registry]

    Register --> CD[CD: Deploy to Prod]
    CD --> Monitor[Monitoring & Observability]
    Monitor -- Drift Detected --> Code

    style Pipeline fill:#f0f4ff,stroke:#5c7aff,stroke-width:2px,color:#333
    style Validate fill:#fff3e0,stroke:#ef6c00,color:#333
    style Register fill:#c8e6c9,stroke:#2e7d32,color:#333
```

## 4. Key Components of the Pipeline

* **Feature Store:** A centralized repository where features are stored and shared, ensuring that the same feature logic is used in both training and serving.
* **Model Registry:** A "version control" for models. It stores trained models, their metadata (hyperparameters, accuracy), and their environment dependencies (a registration sketch follows this list).
* **Metadata Store:** Records every execution of the pipeline, allowing you to trace a specific model version back to the exact dataset and code used to create it.

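To make the Model Registry concrete, here is a minimal sketch using MLflow; the model name, metric value, and training data are placeholders.

```python
# A minimal model-registration sketch with MLflow (names and values are placeholders)
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100).fit([[0, 0], [1, 1]], [0, 1])

with mlflow.start_run() as run:
    mlflow.log_param("n_estimators", 100)     # hyperparameter metadata
    mlflow.log_metric("accuracy", 0.91)       # placeholder evaluation result
    mlflow.sklearn.log_model(model, "model")  # store the trained artifact

# Promote the logged artifact into the registry under a versioned name
mlflow.register_model(f"runs:/{run.info.run_id}/model", "demo-classifier")
```
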
## 5. Tools of the Trade

Depending on your cloud provider, the tools for CI/CD/CT vary:

| Component | Open Source | AWS | Google Cloud |
| --- | --- | --- | --- |
| **Orchestration** | Kubeflow / Airflow | Step Functions | Vertex AI Pipelines |
| **CI/CD** | GitHub Actions / GitLab | CodePipeline | Cloud Build |
| **Tracking** | MLflow | SageMaker Experiments | Vertex AI Metadata |
| **Storage** | DVC (Data Version Control) | S3 | GCS |

## 6. Implementation: A GitHub Actions Snippet

Below is a simple CI job that checks whether a model's accuracy meets a threshold before allowing a push to production.

```yaml
name: Model Training CI
on: [push]

jobs:
  train-and-validate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run Training & Evaluation
        run: python train.py # Script generates 'metrics.json'

      - name: Check Accuracy Threshold
        run: |
          ACCURACY=$(jq '.accuracy' metrics.json)
          if (( $(echo "$ACCURACY < 0.85" | bc -l) )); then
            echo "Accuracy too low ($ACCURACY). Deployment failed."
            exit 1
          fi
```
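
The workflow above assumes that `train.py` writes its evaluation results to `metrics.json`. A minimal sketch of that contract could look like this; the accuracy value is a placeholder.

```python
# train.py: a minimal sketch of the training step the workflow expects
import json

def train_and_evaluate() -> float:
    # ... real training and evaluation would happen here ...
    return 0.91  # placeholder accuracy

accuracy = train_and_evaluate()
with open("metrics.json", "w") as f:
    json.dump({"accuracy": accuracy}, f)  # consumed by the jq check above
```
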
## References

* **Google Cloud:** [MLOps: Continuous delivery and automation pipelines](https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning)
* **ThoughtWorks:** [Continuous Delivery for Machine Learning (CD4ML)](https://martinfowler.com/articles/cd4ml.html)
* **MLflow:** [Introduction to Model Registry](https://www.mlflow.org/docs/latest/model-registry.html)

---

**With CI/CD/CT, your model is now a living, breathing part of your infrastructure. But how do we ensure it remains ethical and unbiased throughout these cycles?**
Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
---
title: "Data Versioning: The Git for Data"
sidebar_label: Data Versioning
description: "Understanding how to track changes in datasets to ensure reproducibility and auditability in ML experiments."
tags: [mlops, data-versioning, dvc, reproducibility, data-lake]
---

In traditional software development, versioning code with **Git** is enough to recreate any state of an application. In Machine Learning, code is only half the story. The resulting model depends on both the **Code** and the **Data**.

If you retrain your model today and get different results than yesterday, you need to know exactly which version of the dataset was used. **Data Versioning** provides the "undo button" for your data.

## 1. Why Git Isn't Enough for Data

Git is designed to track small text files. It struggles with the large binary files (CSV, Parquet, images, audio) typically used in ML for several reasons:

* **Storage Limits:** Storing gigabytes of data in a Git repository slows down operations significantly.
* **Diffing:** Git cannot efficiently show differences between two 5 GB binary files.
* **Cost:** Hosting large blobs in GitHub or GitLab is expensive and inefficient.

**Data Versioning tools** solve this by tracking "pointers" (metadata) in Git, while storing the actual data in external storage (S3, GCS, Azure Blob).

## 2. The Core Concept: Metadata vs. Storage

Data versioning works by creating a **hash** (unique ID) of your data files (a hashing sketch follows this list).

1. **The Data:** Stored in a scalable cloud bucket (e.g., AWS S3).
2. **The Metafile:** A tiny text file containing the hash and file path. This file **is** committed to Git.

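As an illustration of the hashing step, here is a minimal sketch of how a tool can fingerprint a large file in chunks; the path is hypothetical, and DVC's actual implementation differs in detail.

```python
# A minimal content-hashing sketch (the file path is hypothetical)
import hashlib

def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MB chunks so large datasets never load fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# The resulting hash is the data's version ID; only this short string lives in Git
print(file_md5("data/train_images.zip"))
```
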
<br />

<img className="rounded" src="/tutorial/img/tutorials/ml/git-dvc-s3.png" alt="The relationship between Git (tracking .dvc files) and S3 (tracking large datasets)" />

## 3. Workflow Logic

The following diagram illustrates how DVC (Data Version Control) interacts with Git and remote storage to maintain synchronization.

```mermaid
graph TD
    subgraph Local_Machine [Local Workspace]
        Code[script.py] -- "git commit" --> Git[(Git Repo)]
        Data[data.csv] -- "dvc add" --> Meta[.dvc file]
        Meta -- "git commit" --> Git
    end

    subgraph Storage [Remote Storage]
        Data -- "dvc push" --> Cloud[(S3 / GCS Bucket)]
    end

    subgraph Collaborator [Team Member]
        Git -- "git pull" --> NewMeta[.dvc file]
        NewMeta -- "dvc pull" --> NewData[data.csv]
        Cloud -- download --> NewData
    end

    style Storage fill:#f1f8e9,stroke:#558b2f,color:#333
    style Git fill:#e1f5fe,stroke:#01579b,color:#333
    style Data fill:#fff3e0,stroke:#ef6c00,color:#333
```

## 4. Popular Data Versioning Tools

| Tool | Focus | Best For |
| --- | --- | --- |
| **DVC (Data Version Control)** | Open-source, Git-like CLI. | Teams already comfortable with Git. |
| **Pachyderm** | Data lineage and pipelining. | Complex data pipelines on Kubernetes. |
| **LakeFS** | Git-like branches for Data Lakes. | Teams using S3/GCS as their primary data source. |
| **W&B Artifacts** | Integrated with experiment tracking. | Visualizing data lineage alongside model training. |

## 5. Implementation with DVC

DVC is one of the most popular tools because it integrates seamlessly with your existing Git workflow.

```bash
# 1. Initialize DVC in your project
dvc init

# 2. Add a large dataset (this creates train_images.zip.dvc)
dvc add data/train_images.zip

# 3. Track the metadata in Git
git add data/train_images.zip.dvc .gitignore
git commit -m "Add raw training images version 1.0"

# 4. Push the actual data to a remote (S3, GCS, etc.)
dvc remote add -d myremote s3://my-bucket/data
dvc push

# 5. Switching versions
git checkout v2.0-experiment
dvc checkout  # This physically swaps the data files in your folder
```

## 6. The Benefits of Versioning Data

* **Reproducibility:** You can recreate the exact environment of a model trained 6 months ago.
* **Compliance & Auditing:** In regulated industries (finance/healthcare), you must be able to show exactly what data was used to train a model to explain its decisions.
* **Collaboration:** Multiple researchers can work on different versions of the data without overwriting each other's work.
* **Data Lineage:** Tracking the "ancestry" of a dataset—knowing that `clean_data.csv` was generated from `raw_data.csv` using `clean.py`.

## References

* **DVC Documentation:** [Get Started with DVC](https://dvc.org/doc/start)
* **LakeFS:** [Git for Data Lakes](https://lakefs.io/)

---

**Data versioning is the foundation of a reproducible pipeline. Now that we can track our data and code, how do we track the experiments and hyperparameter results?**
Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
---
title: "Model Deployment: Moving from Lab to Production"
sidebar_label: Deployment
description: "Strategies for serving machine learning models, including batch vs. real-time, containerization, and deployment patterns."
tags: [mlops, deployment, docker, kubernetes, api, serving]
---

**Model Deployment** is the process of integrating a machine learning model into an existing production environment where it can take in data and return predictions. It is the final stage of the ML pipeline, but it is also the beginning of the model's "life" where it provides actual value.

## 1. Deployment Modes

Before choosing a tool, you must decide how the users will consume the predictions.

| Mode | Description | Example |
| :--- | :--- | :--- |
| **Request-Response (Real-time)** | The model lives behind an API. Predictions are returned instantly (low latency). | **Fraud Detection** during a credit card swipe. |
| **Batch Scoring** | The model runs on a large set of data at scheduled intervals (e.g., every night). | **Recommendation Emails** sent to users once a day. |
| **Streaming** | The model consumes data from a queue (like Kafka) and outputs predictions continuously. | **Log Monitoring** for cybersecurity threats. |

## 2. The Containerization Standard: Docker

In MLOps, we don't just deploy code; we deploy the **environment**. To avoid the "it works on my machine" problem, we use **Docker**.

A Docker container packages the model file, the Python runtime, and all dependencies (NumPy, Scikit-Learn, etc.) into a single image that runs identically on any server.

## 3. Deployment Strategies

Deploying a model isn't just about "overwriting" the old one. We use strategies to minimize risk.

* **Blue-Green Deployment:** You have two identical environments. You route traffic to "Green" (new model). If it fails, you instantly flip back to "Blue" (old model).
* **Canary Deployment:** You route 5% of traffic to the new model. If the metrics look good, you slowly increase it to 100% (a routing sketch follows this list).
* **A/B Testing:** You run two models simultaneously and compare their real-world performance (e.g., which one leads to more clicks?).

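As a toy illustration of canary routing, the sketch below sends roughly 5% of requests to a new model version; the version names and dispatch logic are purely illustrative, since real routing is usually handled by the load balancer or service mesh.

```python
# A toy canary-routing sketch (version names and the 5% weight are illustrative)
import random

def route_request(payload: dict) -> str:
    # Roughly 5% of traffic goes to the canary; the rest stays on the stable model
    return "model_v2" if random.random() < 0.05 else "model_v1"

for _ in range(5):
    print(route_request({"feature_1": 1.0}))
```
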
## 4. Logical Workflow: The Deployment Pipeline

The following diagram illustrates the path from a trained model to a live API endpoint.

```mermaid
graph LR
    Model[Trained Model File .pkl / .h5] --> Wrap[API Wrapper: Flask/FastAPI]
    Wrap --> Docker[Docker Image]
    Docker --> Registry[Container Registry]

    subgraph Infrastructure [Production Environment]
        Registry --> K8s[Kubernetes / Cloud Run]
        K8s --> LoadBalancer[Load Balancer]
    end

    User((User)) --> LoadBalancer
    LoadBalancer --> K8s

    style Docker fill:#e1f5fe,stroke:#01579b,color:#333
    style K8s fill:#fff3e0,stroke:#ef6c00,color:#333
    style Model fill:#c8e6c9,stroke:#2e7d32,color:#333
```

## 5. Model Serving Frameworks

While you can write your own API using **FastAPI**, dedicated "Model Serving" tools handle scaling and versioning better:

1. **TensorFlow Serving:** Highly optimized for TF models.
2. **TorchServe:** The official serving library for PyTorch.
3. **KServe (formerly KFServing):** A serverless way to deploy models on Kubernetes.
4. **BentoML:** A framework that simplifies the packaging and deployment of any Python model.

## 6. Implementation Sketch (FastAPI + Uvicorn)

This is a minimal example of serving a Scikit-Learn model as a REST API.

```python
from fastapi import FastAPI
import joblib
import pydantic

app = FastAPI()

# 1. Load the pre-trained model
model = joblib.load("model_v1.pkl")

# 2. Define the input schema
class InputData(pydantic.BaseModel):
    feature_1: float
    feature_2: float

# 3. Create the prediction endpoint
@app.post("/predict")
def predict(data: InputData):
    prediction = model.predict([[data.feature_1, data.feature_2]])
    return {"prediction": int(prediction[0])}

# Run with: uvicorn main:app --reload
```
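
Assuming the service above is running locally, a quick client-side smoke test might look like this; the port and feature values are illustrative.

```python
# A quick smoke test against the local endpoint (port and values are illustrative)
import requests

resp = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"feature_1": 5.1, "feature_2": 3.5},
)
print(resp.json())  # e.g. {"prediction": 0}
```
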

## 7. Post-Deployment: Monitoring

Once a model is live, its performance will likely decrease over time (**Model Drift**). We must monitor:

* **Latency:** How long does a prediction take?
* **Data Drift:** Is the incoming data different from the training data? (A simple drift check follows this list.)
* **Concept Drift:** Has the relationship between features and the target changed?

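As a minimal sketch of a data-drift check, the snippet below compares a training-time feature sample against live traffic with a two-sample Kolmogorov-Smirnov test; the data here is synthetic for illustration, and production systems typically use purpose-built monitoring tools.

```python
# A minimal data-drift check using a two-sample KS test (data is synthetic)
import numpy as np
from scipy.stats import ks_2samp

train_sample = np.random.normal(0.0, 1.0, 1000)  # feature values seen at training time
live_sample = np.random.normal(0.3, 1.0, 1000)   # feature values arriving in production

stat, p_value = ks_2samp(train_sample, live_sample)
if p_value < 0.01:
    print(f"Possible data drift (KS statistic = {stat:.3f}, p = {p_value:.4f})")
```
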
## References

* **Google Cloud:** [Practices for MLOps and CI/CD](https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning)
* **FastAPI:** [Official Documentation](https://fastapi.tiangolo.com/)
* **MLOps.community:** [Deployment Patterns](https://mlops.community/)

---

**Deployment is just the beginning. How do we ensure our model stays accurate as the world changes?**
