Merged
Binary file modified .DS_Store
Binary file not shown.
Binary file added .github/.DS_Store
Binary file not shown.
Binary file modified mlops/.DS_Store
Binary file not shown.
232 changes: 232 additions & 0 deletions mlops/PROJECT_STATUS.md
@@ -0,0 +1,232 @@
# PROJECT_STATUS.md

## 1. Project snapshot

- **Project name:** Digital Twin Resilience Model
- **Project goal:** Develop a digital twin of a major streaming platform to simulate system failure, impact radius, and response time.
- **Current objective:** Build toward simulation of the entitlements service using a graph-oriented approach on AWS.
- **Current phase:** Repo and pipeline mechanics clarified. GitHub Actions and Terraform deploy the SageMaker pipeline definition and related infrastructure; pipeline execution is started separately and currently runs a stub synthetic-data workflow.

- **Current status:**

### Deployment
The GitHub Actions workflows `terraform-plan.yml` and `terraform-apply.yml`:
- generate the SageMaker pipeline definition
- provision or update the SageMaker Pipeline resource
- validate Terraform / infra changes

`digital_twin_resilience/pipeline.py`
- defines the SageMaker pipeline
- generates `pipeline_definition.json`

### Execution
`start_pipeline.py`
- starts a specific SageMaker pipeline execution
- allows parameter overrides
- triggers the registered pipeline in AWS
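
As a rough illustration of the execution step described above, the sketch below starts one execution of an already-registered pipeline with optional parameter overrides. It is not the contents of `start_pipeline.py`: the helper names, the pipeline name, and the parameter name are hypothetical, and it assumes a boto3 SageMaker client is passed in.

```python
from typing import Any, Dict, List, Mapping, Optional


def build_pipeline_parameters(overrides: Mapping[str, Any]) -> List[Dict[str, str]]:
    """Convert {name: value} overrides into the Name/Value list that
    the SageMaker StartPipelineExecution API accepts."""
    return [{"Name": name, "Value": str(value)} for name, value in overrides.items()]


def start_execution(
    client: Any,
    pipeline_name: str,
    overrides: Optional[Mapping[str, Any]] = None,
) -> str:
    """Start one execution of a registered pipeline and return its ARN.

    `client` is assumed to be boto3.client("sagemaker") or a compatible stub.
    """
    kwargs: Dict[str, Any] = {"PipelineName": pipeline_name}
    if overrides:
        kwargs["PipelineParameters"] = build_pipeline_parameters(overrides)
    response = client.start_pipeline_execution(**kwargs)
    return response["PipelineExecutionArn"]


# Example call (hypothetical pipeline and parameter names):
#   import boto3
#   arn = start_execution(
#       boto3.client("sagemaker"),
#       "digital-twin-resilience",
#       {"ProcessingInstanceCount": 1},
#   )
```

Passing the client in, rather than creating it inside the function, keeps the helper testable without AWS credentials.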

The generated pipeline definition shows three steps:
- `processor.py`
- `train.py`
- `evaluate.py`

`processor.py`
- generates synthetic data
- populates train / validation / test outputs in S3

`train.py`
- builds a trivial baseline model from synthetic data

`evaluate.py`
- computes an evaluation output / trivial metric from the model

### Verification
`check_pipeline_execution.py`
- asks SageMaker for overall pipeline execution status
- lists step-level statuses and related job metadata
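
The verification flow above could look roughly like the following sketch. It is an assumption about the shape of such a check, not the actual `check_pipeline_execution.py`: the function name and the returned dictionary layout are hypothetical, and `client` is assumed to be a boto3 SageMaker client.

```python
from typing import Any, Dict, List


def summarize_execution(client: Any, execution_arn: str) -> Dict[str, Any]:
    """Return the overall status of a pipeline execution plus per-step statuses."""
    overall = client.describe_pipeline_execution(PipelineExecutionArn=execution_arn)

    steps: List[Dict[str, Any]] = []
    kwargs: Dict[str, Any] = {"PipelineExecutionArn": execution_arn}
    while True:
        page = client.list_pipeline_execution_steps(**kwargs)
        for step in page.get("PipelineExecutionSteps", []):
            steps.append(
                {
                    "name": step["StepName"],
                    "status": step["StepStatus"],
                    # Metadata carries the ARN of the underlying job, when one exists.
                    "metadata": step.get("Metadata", {}),
                }
            )
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token  # follow pagination until exhausted

    return {"status": overall["PipelineExecutionStatus"], "steps": steps}
```

The loop follows `NextToken` so that step listings longer than one page are not silently truncated.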

### Important distinction
- Deploying the pipeline is separate from executing it.
- Current GitHub Actions deploy and update the pipeline definition and infrastructure.
- Pipeline execution is started deliberately via `start_pipeline.py`.

- **Immediate next step:** Define the minimum set of starter docs and begin filling them in, starting with continuity and framing docs.
- **Biggest current blockers / gaps:**
  - Input data contract is not yet defined
  - Service graph schema is not yet defined
  - Prediction target is not yet defined
  - Definition of "good" model output is not yet defined
  - It is not yet decided whether the first baseline should be graph ML or something simpler

---

## 2. Working understanding of the repo

This section is not a replacement for `repo_skeleton.yml`. It is a quick orientation note describing how the repo is currently understood.

### Repo orientation

- `.github/workflows/`
  - GitHub Actions workflows for Terraform plan/apply and deployment-oriented automation
  - current understanding: deploys pipeline definition and infra, but does not execute the pipeline or run Python tests

- `terraform/`
  - infrastructure code for AWS resources and SageMaker pipeline registration
  - `envs/dev/` contains environment-specific wiring
  - `modules/` contains reusable pieces such as S3, IAM, and SageMaker pipeline setup

- `mlops/pipelines/digital_twin_resilience/`
  - core pipeline orchestration area
  - `pipeline.py` defines the SageMaker pipeline and generates `pipeline_definition.json`
  - `start_pipeline.py` starts a pipeline execution
  - `check_pipeline_execution.py` checks execution status
  - `steps/processing/`, `steps/training/`, and `steps/evaluation/` contain the step logic executed by SageMaker

- `data/synthetic/`
  - synthetic data support for the current stub workflow

- `tests/`
  - test area exists, but CI usage has not yet been confirmed in this document

- `README.md`
  - high-level explanation of repo purpose and structure

### Current understanding
- Deployment and execution are separate concerns
- GitHub Actions currently appear focused on deployment and Terraform validation
- Pipeline execution is started deliberately, not automatically from Terraform apply
- The current pipeline appears to be a stub synthetic processing/training/evaluation flow

### Key files for current understanding

The following files are currently the most relevant for understanding pipeline definition, execution, and verification:

- `pipeline.py`
- `start_pipeline.py`
- `check_pipeline_execution.py`
- `steps/processing/processor.py`
- `steps/training/train.py`
- `steps/evaluation/evaluate.py`

Additional files such as `parse_request.py`, `request_schema.py`, and `create_pipeline.py` are likely important next, but have not yet been examined in detail in this document.

---

## 3. Current working decisions

- Deployment and execution are separate concerns.
- GitHub Actions currently handle pipeline-definition generation and Terraform plan/apply.
- Current GitHub Actions do not appear to start pipeline execution or run Python tests.
- `pipeline.py` generates the SageMaker pipeline definition and writes `pipeline_definition.json`.
- `start_pipeline.py` deliberately starts a SageMaker pipeline execution.
- `check_pipeline_execution.py` checks overall execution status and step-level status through SageMaker APIs.
- The current registered pipeline executes three step scripts: `processor.py`, `train.py`, and `evaluate.py`.
- Early work should focus on framing, contracts, scope, and evaluation before sophisticated model choices.

---

## 4. Open questions

### Core problem / model questions
- What exact decision is the system supposed to support first?
- What is the narrow REV1 scope?
- What is the first prediction target?
- What would count as a useful model output?
- What is the simplest credible baseline for REV1: graph-based, heuristic, tabular, or other?

### Data / entity questions
- What are the core entities?
- What node and edge types belong in the first service graph?
- What data sources are expected to be available?
- What minimum fields are required to support the first end-to-end run?
- What synthetic substitutes are acceptable early on?

### Evaluation questions
- How will success be measured for REV1?
- What does "decision-useful" mean in practice?
- What outputs should `evaluate.py` emit?
- What evidence would justify continuing to the next phase?

### Repo / process questions
- Which starter doc should be written next?
- What should be treated as current truth vs placeholder?
- What is the first code file that should be tightened?

---

## 5. Recommended starter docs from this session

These were identified as the most useful starter docs.

### A. Problem framing doc
Should answer:
- What problem are we solving?
- Who is the decision-maker?
- What is REV1 trying to prove?
- What is explicitly out of scope?

### B. Feasibility questions / hypotheses doc
Should answer:
- What are the major unknowns?
- What do we believe right now?
- What evidence would support or weaken each hypothesis?

### C. REV1 scope and success criteria doc
Should answer:
- What are we building now?
- What are we not building?
- What must be demonstrated?
- What would count as failure or a stop condition?

### D. Data and entity contract doc
Should answer:
- What are the main entities?
- How do they relate?
- What data do we expect?
- What quality risks exist?

### E. Repo/runbook doc
Should answer:
- How is the repo organized?
- How does the flow run?
- What is implemented vs placeholder?
- How should someone orient themselves quickly?

### Note
This `PROJECT_STATUS.md` is not a replacement for those docs. It is the continuity layer that points to them and tracks what is missing.

---

## 6. Guidance agreed in this session

### What not to do
- Do not begin by locking in sophisticated model architecture
- Do not let the repo skeleton create false confidence
- Do not use a polished solution architecture doc as the first anchor
- Do not hide unresolved questions under implementation detail

### What to do first
- Clarify the project/problem framing
- Make the major unknowns explicit
- Define REV1 scope and success criteria
- Build continuity documentation that preserves momentum
- Use this file to keep current status, decisions, open questions, and next actions visible

---

## 7. Next actions

- [ ] Create a first draft of the problem framing doc
- [ ] Create a first draft of the feasibility questions / hypotheses doc
- [ ] Create a first draft of the REV1 scope and success criteria doc
- [ ] Identify the most important data/entity questions for the first pass
- [ ] Decide which current repo file should be examined first for concrete changes

---

## 8. Change log

### Session-created initial version
- Created the first session-only continuity draft of `PROJECT_STATUS.md`
- Purpose: establish a resumable project memory file and expose missing information clearly
- Constraint: uses only information discussed in this session
78 changes: 41 additions & 37 deletions mlops/README.md
@@ -1,46 +1,50 @@
# SageMaker Pipeline Feasibility PoC

## Description of directory tree elements

<b> .github/workflows/ </b></br>
**.github/workflows/**

This is CI/CD only. It is not ML logic. GitHub Actions can authenticate to AWS via OIDC instead of long-lived secrets, which is the cleaner enterprise pattern.
<ul>
<li><b>terraform-plan.yml</b>: runs fmt/validate/plan on PRs</li>
<li><b>terraform-apply.yml</b>: applies approved infra changes to dev, maybe later prod</li>
</ul>

<b>infra/terraform/</b></br>
- **terraform-plan.yml**: runs fmt/validate/plan on PRs
- **terraform-apply.yml**: applies approved infra changes to dev, maybe later prod

**infra/terraform/**

This is infrastructure only.
<ul>
<li><b>envs/dev/</b>: environment-specific wiring</li>
<li><b>modules/s3/</b>: buckets for raw, processed, model artifacts, evaluation outputs</li>
<li><b>modules/iam/</b>: execution roles and policies</li>
<li><b>modules/sagemaker_pipeline/</b>: Terraform resource for the SageMaker Pipeline</li>
</ul>
Terraform has an aws_sagemaker_pipeline resource, so using Terraform for the pipeline object itself is a legitimate pattern, not a workaround.

<b>pipelines/digital_twin_resilience/</b></br>

- **envs/dev/**: environment-specific wiring
- **modules/s3/**: buckets for raw, processed, model artifacts, evaluation outputs
- **modules/iam/**: execution roles and policies
- **modules/sagemaker_pipeline/**: Terraform resource for the SageMaker Pipeline

Terraform has an aws_sagemaker_pipeline resource, so using Terraform for the pipeline object itself is a legitimate pattern, not a workaround.



**pipelines/digital_twin_resilience/**

This is the ML workflow definition.
<ul>
<li><b>pipeline.py</b>: defines the SageMaker Pipeline DAG</li>
<li><b>config.py</b>: pipeline parameters and defaults</li>
<li><b>steps/processing/processor.py</b>: builds datasets or synthetic inputs</li>
<li><b>steps/training/train.py</b>: trains a trivial baseline model first</li>
<li><b>steps/evaluation/evaluate.py</b>: computes metrics and emits a JSON report</li>
<li><b>utils/</b>: shared helpers</li>
</ul>
SageMaker Pipelines is a DAG of interconnected steps, and AWS explicitly supports Processing and Training steps in the pipeline definition.

<b>data/synthetic/</b>

- **pipeline.py**: defines the SageMaker Pipeline DAG
- **config.py**: pipeline parameters and defaults
- **steps/processing/processor.py**: builds datasets or synthetic inputs
- **steps/training/train.py**: trains a trivial baseline model first
- **steps/evaluation/evaluate.py**: computes metrics and emits a JSON report
- **utils/**: shared helpers

SageMaker Pipelines is a DAG of interconnected steps, and AWS explicitly supports Processing and Training steps in the pipeline definition.

**data/synthetic/**

This is discovery-sprint fuel.
<ul>
<li>generate fake telemetry</li>
<li>define a graph-ish structure if needed</li>
<li>keep it tiny and boring</li>
</ul>

<b>tests/</b>
<ul>
<li><b>test_pipeline_compile.py</b>: proves the pipeline definition compiles</li>
<li><b>test_smoke_synthetic.py</b>: one tiny end-to-end synthetic run</li>
</ul>

- generate fake telemetry
- define a graph-ish structure if needed
- keep it tiny and boring

**tests/**

- **test_pipeline_compile.py**: proves the pipeline definition compiles
- **test_smoke_synthetic.py**: one tiny end-to-end synthetic run
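
A compile check of the kind `test_pipeline_compile.py` is meant to provide might, as a minimal sketch, parse the generated definition and confirm the expected steps are present. The helper names below are hypothetical, and the sketch assumes the definition JSON has a top-level `Steps` list whose entries carry a `Name` field.

```python
import json
from typing import List


def step_names(definition_json: str) -> List[str]:
    """Parse a pipeline definition and return its step names in order."""
    definition = json.loads(definition_json)  # raises ValueError on invalid JSON
    return [step["Name"] for step in definition.get("Steps", [])]


def check_definition(definition_json: str, expected_steps: List[str]) -> None:
    """Raise AssertionError if any expected step is absent from the definition."""
    names = step_names(definition_json)
    missing = [name for name in expected_steps if name not in names]
    if missing:
        raise AssertionError(f"missing steps: {missing}")
```

In a real test the input would be the string returned by the pipeline's `definition()` call or the contents of `pipeline_definition.json`.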

Binary file modified mlops/pipelines/.DS_Store
Binary file not shown.
5 changes: 4 additions & 1 deletion mlops/pipelines/digital_twin_resilience/pipeline.py
@@ -1,3 +1,4 @@
+import json
 import os
 from pathlib import Path

@@ -299,6 +300,8 @@ def get_pipeline(

     definition = pipeline.definition()
     out_path = Path(__file__).resolve().parent / "pipeline_definition.json"
-    out_path.write_text(definition)
+    with out_path.open("w", encoding="utf-8") as f:
+        json.dump(json.loads(definition), f, indent=2, sort_keys=False)
+        f.write("\n")

     print(f"Wrote pipeline definition to {out_path}")
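
For reference, the effect of this hunk can be reproduced in isolation. The sketch below is a standalone approximation, with a hypothetical function name, that round-trips a definition string through `json` so the file is written with two-space indentation and a trailing newline, which tends to produce smaller and more readable diffs when the definition changes.

```python
import json
from pathlib import Path


def write_pretty_definition(definition: str, out_path: Path) -> None:
    """Write a JSON string to out_path with stable, human-readable formatting."""
    with out_path.open("w", encoding="utf-8") as f:
        # sort_keys=False keeps the key order produced by the pipeline SDK.
        json.dump(json.loads(definition), f, indent=2, sort_keys=False)
        f.write("\n")  # end the file with a newline
```

Parsing before writing also means malformed JSON fails at write time instead of surfacing later during deployment.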