diff --git a/.DS_Store b/.DS_Store index 4fdf4b6..d6753f9 100644 Binary files a/.DS_Store and b/.DS_Store differ diff --git a/.github/.DS_Store b/.github/.DS_Store new file mode 100644 index 0000000..2ff9b28 Binary files /dev/null and b/.github/.DS_Store differ diff --git a/mlops/.DS_Store b/mlops/.DS_Store index 5f852bc..85b7cb0 100644 Binary files a/mlops/.DS_Store and b/mlops/.DS_Store differ diff --git a/mlops/PROJECT_STATUS.md b/mlops/PROJECT_STATUS.md new file mode 100644 index 0000000..68af1a9 --- /dev/null +++ b/mlops/PROJECT_STATUS.md @@ -0,0 +1,232 @@ +# PROJECT_STATUS.md + +## 1. Project snapshot + +- **Project name:** Digital Twin Resilience Model +- **Project goal:** Develop a digital twin of a major streaming platform to simulate system failure, impact radius, and response time. +- **Current objective:** Build toward simulation of the entitlements service using a graph-oriented approach on AWS. +- **Current phase:** Repo and pipeline mechanics clarified. GitHub Actions and Terraform deploy the SageMaker pipeline definition and related infrastructure; pipeline execution is started separately and currently runs a stub synthetic-data workflow. 
+ +- **Current status:** + +### Deployment +Based on GitHub Actions, `terraform-plan.yml` and `terraform-apply.yml`: +- generate the SageMaker pipeline definition +- provision or update the SageMaker Pipeline resource +- validate Terraform / infra changes + +`digital_twin_resilience/pipeline.py` +- defines the SageMaker pipeline +- generates `pipeline_definition.json` + +### Execution +`start_pipeline.py` +- starts a specific SageMaker pipeline execution +- allows parameter overrides +- triggers the registered pipeline in AWS + +The generated pipeline definition shows three steps: +- `processor.py` +- `train.py` +- `evaluate.py` + +`processor.py` +- generates synthetic data +- populates train / validation / test outputs in S3 + +`train.py` +- builds a trivial baseline model from synthetic data + +`evaluate.py` +- computes an evaluation output / trivial metric from the model + +### Verification +`check_pipeline_execution.py` +- asks SageMaker for overall pipeline execution status +- lists step-level statuses and related job metadata + +### Important distinction +- Deploying the pipeline is separate from executing it. +- Current GitHub Actions deploy and update the pipeline definition and infrastructure. +- Pipeline execution is started deliberately via `start_pipeline.py`. + +- **Immediate next step:** Define the minimum set of starter docs and begin filling them in, starting with continuity and framing docs. +- **Biggest current blockers / gaps:** + - Input data contract is not yet defined + - Service graph schema is not yet defined + - Prediction target is not yet defined + - Definition of "good" model output is not yet defined + - It is not yet decided whether the first baseline should be graph ML or something simpler + +--- + +## 2. Working understanding of the repo + +This section is not a replacement for `repo_skeleton.yml`. It is a quick orientation note describing how the repo is currently understood. 
+ +### Repo orientation + +- `.github/workflows/` + - GitHub Actions workflows for Terraform plan/apply and deployment-oriented automation + - current understanding: deploys pipeline definition and infra, but does not execute the pipeline or run Python tests + +- `terraform/` + - infrastructure code for AWS resources and SageMaker pipeline registration + - `envs/dev/` contains environment-specific wiring + - `modules/` contains reusable pieces such as S3, IAM, and SageMaker pipeline setup + +- `mlops/pipelines/digital_twin_resilience/` + - core pipeline orchestration area + - `pipeline.py` defines the SageMaker pipeline and generates `pipeline_definition.json` + - `start_pipeline.py` starts a pipeline execution + - `check_pipeline_execution.py` checks execution status + - `steps/processing/`, `steps/training/`, and `steps/evaluation/` contain the step logic executed by SageMaker + +- `data/synthetic/` + - synthetic data support for the current stub workflow + +- `tests/` + - test area exists, but CI usage has not yet been confirmed in this document + +- `README.md` + - high-level explanation of repo purpose and structure + +### Current understanding +- Deployment and execution are separate concerns +- GitHub Actions currently appear focused on deployment and Terraform validation +- Pipeline execution is started deliberately, not automatically from Terraform apply +- The current pipeline appears to be a stub synthetic processing/training/evaluation flow + +### Key files for current understanding + +The following files are currently the most relevant for understanding pipeline definition, execution, and verification: + +- `pipeline.py` +- `start_pipeline.py` +- `check_pipeline_execution.py` +- `steps/processing/processor.py` +- `steps/training/train.py` +- `steps/evaluation/evaluate.py` + +Additional files such as `parse_request.py`, `request_schema.py`, and `create_pipeline.py` are likely important next, but have not yet been examined in detail in this document. 
+ +--- + +## 3. Current working decisions + +- Deployment and execution are separate concerns. +- GitHub Actions currently handle pipeline-definition generation and Terraform plan/apply. +- Current GitHub Actions do not appear to start pipeline execution or run Python tests. +- `pipeline.py` generates the SageMaker pipeline definition and writes `pipeline_definition.json`. +- `start_pipeline.py` deliberately starts a SageMaker pipeline execution. +- `check_pipeline_execution.py` checks overall execution status and step-level status through SageMaker APIs. +- The current registered pipeline executes three step scripts: `processor.py`, `train.py`, and `evaluate.py`. +- Early work should focus on framing, contracts, scope, and evaluation before sophisticated model choices. + +--- + +## 4. Open questions + +### Core problem / model questions +- What exact decision is the system supposed to support first? +- What is the narrow REV1 scope? +- What is the first prediction target? +- What would count as a useful model output? +- What is the simplest credible baseline for REV1: graph-based, heuristic, tabular, or other? + +### Data / entity questions +- What are the core entities? +- What node and edge types belong in the first service graph? +- What data sources are expected to be available? +- What minimum fields are required to support the first end-to-end run? +- What synthetic substitutes are acceptable early on? + +### Evaluation questions +- How will success be measured for REV1? +- What does "decision-useful" mean in practice? +- What outputs should `evaluate.py` emit? +- What evidence would justify continuing to the next phase? + +### Repo / process questions +- Which starter doc should be written next? +- What should be treated as current truth vs placeholder? +- What is the first code file that should be tightened? + +--- + +## 5. Recommended starter docs from this session + +These were identified as the most useful starter docs. + +### A. 
Problem framing doc +Should answer: +- What problem are we solving? +- Who is the decision-maker? +- What is REV1 trying to prove? +- What is explicitly out of scope? + +### B. Feasibility questions / hypotheses doc +Should answer: +- What are the major unknowns? +- What do we believe right now? +- What evidence would support or weaken each hypothesis? + +### C. REV1 scope and success criteria doc +Should answer: +- What are we building now? +- What are we not building? +- What must be demonstrated? +- What would count as failure or a stop condition? + +### D. Data and entity contract doc +Should answer: +- What are the main entities? +- How do they relate? +- What data do we expect? +- What quality risks exist? + +### E. Repo/runbook doc +Should answer: +- How is the repo organized? +- How does the flow run? +- What is implemented vs placeholder? +- How should someone orient themselves quickly? + +### Note +This `PROJECT_STATUS.md` is not a replacement for those docs. It is the continuity layer that points to them and tracks what is missing. + +--- + +## 6. Guidance agreed in this session + +### What not to do +- Do not begin by locking in sophisticated model architecture +- Do not let the repo skeleton create false confidence +- Do not use a polished solution architecture doc as the first anchor +- Do not hide unresolved questions under implementation detail + +### What to do first +- Clarify the project/problem framing +- Make the major unknowns explicit +- Define REV1 scope and success criteria +- Build continuity documentation that preserves momentum +- Use this file to keep current status, decisions, open questions, and next actions visible + +--- + +## 7. 
Next actions + +- [ ] Create a first draft of the problem framing doc +- [ ] Create a first draft of the feasibility questions / hypotheses doc +- [ ] Create a first draft of the REV1 scope and success criteria doc +- [ ] Identify the most important data/entity questions for the first pass +- [ ] Decide which current repo file should be examined first for concrete changes + +--- + +## 8. Change log + +### Session-created initial version +- Created the first session-only continuity draft of `PROJECT_STATUS.md` +- Purpose: establish a resumable project memory file and expose missing information clearly +- Constraint: uses only information discussed in this session \ No newline at end of file diff --git a/mlops/README.md b/mlops/README.md index a275ed3..4199a81 100644 --- a/mlops/README.md +++ b/mlops/README.md @@ -1,46 +1,50 @@ # SageMaker Pipeline Feasibility PoC + ## Description of directory tree elements - .github/workflows/
+**.github/workflows/** + This is CI/CD only. It is not ML logic. GitHub Actions can authenticate to AWS via OIDC instead of long-lived secrets, which is the cleaner enterprise pattern. - -infra/terraform/
+- **terraform-plan.yml**: runs fmt/validate/plan on PRs +- **terraform-apply.yml**: applies approved infra changes to dev, maybe later prod + +**infra/terraform/** + This is infrastructure only. - -Terraform has an aws_sagemaker_pipeline resource, so using Terraform for the pipeline object itself is a legitimate pattern, not a workaround. - -pipelines/digital_twin_resilience/
+ +- **envs/dev/**: environment-specific wiring +- **modules/s3/**: buckets for raw, processed, model artifacts, evaluation outputs +- **modules/iam/**: execution roles and policies +- **modules/sagemaker_pipeline/**: Terraform resource for the SageMaker Pipeline + + Terraform has an aws_sagemaker_pipeline resource, so using Terraform for the pipeline object itself is a legitimate pattern, not a workaround. + + + +**pipelines/digital_twin_resilience/** + This is the ML workflow definition. - -SageMaker Pipelines is a DAG of interconnected steps, and AWS explicitly supports Processing and Training steps in the pipeline definition. - -data/synthetic/ + +- **pipeline.py**: defines the SageMaker Pipeline DAG +- **config.py**: pipeline parameters and defaults +- **steps/processing/processor.py**: builds datasets or synthetic inputs +- **steps/training/train.py**: trains a trivial baseline model first +- **steps/evaluation/evaluate.py**: computes metrics and emits a JSON report +- **utils/**: shared helpers + + SageMaker Pipelines is a DAG of interconnected steps, and AWS explicitly supports Processing and Training steps in the pipeline definition. + +**data/synthetic/** This is discovery-sprint fuel. 
- - -tests/ - \ No newline at end of file + +- generate fake telemetry +- define a graph-ish structure if needed +- keep it tiny and boring + +**tests/** + +- **test_pipeline_compile.py**: proves the pipeline definition compiles +- **test_smoke_synthetic.py**: one tiny end-to-end synthetic run + diff --git a/mlops/pipelines/.DS_Store b/mlops/pipelines/.DS_Store index f303c20..4dec090 100644 Binary files a/mlops/pipelines/.DS_Store and b/mlops/pipelines/.DS_Store differ diff --git a/mlops/pipelines/digital_twin_resilience/pipeline.py b/mlops/pipelines/digital_twin_resilience/pipeline.py index 4e7621e..e52ef17 100644 --- a/mlops/pipelines/digital_twin_resilience/pipeline.py +++ b/mlops/pipelines/digital_twin_resilience/pipeline.py @@ -1,3 +1,4 @@ +import json import os from pathlib import Path @@ -299,6 +300,8 @@ def get_pipeline( definition = pipeline.definition() out_path = Path(__file__).resolve().parent / "pipeline_definition.json" - out_path.write_text(definition) + with out_path.open("w", encoding="utf-8") as f: + json.dump(json.loads(definition), f, indent=2, sort_keys=False) + f.write("\n") print(f"Wrote pipeline definition to {out_path}") \ No newline at end of file diff --git a/mlops/pipelines/digital_twin_resilience/pipeline_definition.json b/mlops/pipelines/digital_twin_resilience/pipeline_definition.json index 26f4888..f034c57 100644 --- a/mlops/pipelines/digital_twin_resilience/pipeline_definition.json +++ b/mlops/pipelines/digital_twin_resilience/pipeline_definition.json @@ -1 +1,368 @@ -{"Version": "2020-12-01", "Metadata": {}, "Parameters": [{"Name": "InputDataUri", "Type": "String", "DefaultValue": "s3://dougdaly-mlops-poc-input-dev/synthetic/raw/"}, {"Name": "RequestConfigUri", "Type": "String", "DefaultValue": "s3://dougdaly-mlops-poc-input-dev/requests/request.json"}, {"Name": "ProcessingInstanceType", "Type": "String", "DefaultValue": "ml.t3.medium"}, {"Name": "TrainingInstanceType", "Type": "String", "DefaultValue": "ml.t3.medium"}, {"Name": 
"EvaluationInstanceType", "Type": "String", "DefaultValue": "ml.t3.medium"}], "PipelineExperimentConfig": {"ExperimentName": {"Get": "Execution.PipelineName"}, "TrialName": {"Get": "Execution.PipelineExecutionId"}}, "Steps": [{"Name": "ProcessSyntheticTelemetry", "Type": "Processing", "Arguments": {"ProcessingResources": {"ClusterConfig": {"InstanceType": {"Get": "Parameters.ProcessingInstanceType"}, "InstanceCount": 1, "VolumeSizeInGB": 30}}, "AppSpecification": {"ImageUri": "246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3", "ContainerEntrypoint": ["python3", "/opt/ml/processing/input/code/processor.py"]}, "RoleArn": "arn:aws:iam::159535637196:role/SageMakerExecutionRole-mlops", "ProcessingInputs": [{"InputName": "input-1", "AppManaged": false, "S3Input": {"S3Uri": {"Get": "Parameters.InputDataUri"}, "LocalPath": "/opt/ml/processing/input", "S3DataType": "S3Prefix", "S3InputMode": "File", "S3DataDistributionType": "FullyReplicated", "S3CompressionType": "None"}}, {"InputName": "input-2", "AppManaged": false, "S3Input": {"S3Uri": {"Get": "Parameters.RequestConfigUri"}, "LocalPath": "/opt/ml/processing/config", "S3DataType": "S3Prefix", "S3InputMode": "File", "S3DataDistributionType": "FullyReplicated", "S3CompressionType": "None"}}, {"InputName": "code", "AppManaged": false, "S3Input": {"S3Uri": "s3://dougdaly-mlops-poc-output-dev/sagemaker-scikit-learn-2026-03-26-15-41-58-624/input/code/processor.py", "LocalPath": "/opt/ml/processing/input/code", "S3DataType": "S3Prefix", "S3InputMode": "File", "S3DataDistributionType": "FullyReplicated", "S3CompressionType": "None"}}], "ProcessingOutputConfig": {"Outputs": [{"OutputName": "train", "AppManaged": false, "S3Output": {"S3Uri": {"Std:Join": {"On": "/", "Values": ["s3:/", "dougdaly-mlops-poc-output-dev", "digital-twin-resilience-dev-pipeline", {"Get": "Execution.PipelineExecutionId"}, "ProcessSyntheticTelemetry", "output", "train"]}}, "LocalPath": "/opt/ml/processing/output/train", 
"S3UploadMode": "EndOfJob"}}, {"OutputName": "validation", "AppManaged": false, "S3Output": {"S3Uri": {"Std:Join": {"On": "/", "Values": ["s3:/", "dougdaly-mlops-poc-output-dev", "digital-twin-resilience-dev-pipeline", {"Get": "Execution.PipelineExecutionId"}, "ProcessSyntheticTelemetry", "output", "validation"]}}, "LocalPath": "/opt/ml/processing/output/validation", "S3UploadMode": "EndOfJob"}}, {"OutputName": "test", "AppManaged": false, "S3Output": {"S3Uri": {"Std:Join": {"On": "/", "Values": ["s3:/", "dougdaly-mlops-poc-output-dev", "digital-twin-resilience-dev-pipeline", {"Get": "Execution.PipelineExecutionId"}, "ProcessSyntheticTelemetry", "output", "test"]}}, "LocalPath": "/opt/ml/processing/output/test", "S3UploadMode": "EndOfJob"}}]}}}, {"Name": "TrainBaselineModel", "Type": "Processing", "Arguments": {"ProcessingResources": {"ClusterConfig": {"InstanceType": {"Get": "Parameters.TrainingInstanceType"}, "InstanceCount": 1, "VolumeSizeInGB": 30}}, "AppSpecification": {"ImageUri": "246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3", "ContainerEntrypoint": ["python3", "/opt/ml/processing/input/code/train.py"]}, "RoleArn": "arn:aws:iam::159535637196:role/SageMakerExecutionRole-mlops", "ProcessingInputs": [{"InputName": "input-1", "AppManaged": false, "S3Input": {"S3Uri": {"Get": "Steps.ProcessSyntheticTelemetry.ProcessingOutputConfig.Outputs['train'].S3Output.S3Uri"}, "LocalPath": "/opt/ml/processing/train", "S3DataType": "S3Prefix", "S3InputMode": "File", "S3DataDistributionType": "FullyReplicated", "S3CompressionType": "None"}}, {"InputName": "input-2", "AppManaged": false, "S3Input": {"S3Uri": {"Get": "Steps.ProcessSyntheticTelemetry.ProcessingOutputConfig.Outputs['validation'].S3Output.S3Uri"}, "LocalPath": "/opt/ml/processing/validation", "S3DataType": "S3Prefix", "S3InputMode": "File", "S3DataDistributionType": "FullyReplicated", "S3CompressionType": "None"}}, {"InputName": "code", "AppManaged": false, "S3Input": {"S3Uri": 
"s3://dougdaly-mlops-poc-output-dev/sagemaker-scikit-learn-2026-03-26-15-41-58-843/input/code/train.py", "LocalPath": "/opt/ml/processing/input/code", "S3DataType": "S3Prefix", "S3InputMode": "File", "S3DataDistributionType": "FullyReplicated", "S3CompressionType": "None"}}], "ProcessingOutputConfig": {"Outputs": [{"OutputName": "model", "AppManaged": false, "S3Output": {"S3Uri": {"Std:Join": {"On": "/", "Values": ["s3:/", "dougdaly-mlops-poc-output-dev", "digital-twin-resilience-dev-pipeline", {"Get": "Execution.PipelineExecutionId"}, "TrainBaselineModel", "output", "model"]}}, "LocalPath": "/opt/ml/processing/model", "S3UploadMode": "EndOfJob"}}]}}}, {"Name": "EvaluateModel", "Type": "Processing", "Arguments": {"ProcessingResources": {"ClusterConfig": {"InstanceType": {"Get": "Parameters.EvaluationInstanceType"}, "InstanceCount": 1, "VolumeSizeInGB": 30}}, "AppSpecification": {"ImageUri": "246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3", "ContainerEntrypoint": ["python3", "/opt/ml/processing/input/code/evaluate.py"]}, "RoleArn": "arn:aws:iam::159535637196:role/SageMakerExecutionRole-mlops", "ProcessingInputs": [{"InputName": "input-1", "AppManaged": false, "S3Input": {"S3Uri": {"Get": "Steps.TrainBaselineModel.ProcessingOutputConfig.Outputs['model'].S3Output.S3Uri"}, "LocalPath": "/opt/ml/processing/model", "S3DataType": "S3Prefix", "S3InputMode": "File", "S3DataDistributionType": "FullyReplicated", "S3CompressionType": "None"}}, {"InputName": "input-2", "AppManaged": false, "S3Input": {"S3Uri": {"Get": "Steps.ProcessSyntheticTelemetry.ProcessingOutputConfig.Outputs['test'].S3Output.S3Uri"}, "LocalPath": "/opt/ml/processing/test", "S3DataType": "S3Prefix", "S3InputMode": "File", "S3DataDistributionType": "FullyReplicated", "S3CompressionType": "None"}}, {"InputName": "code", "AppManaged": false, "S3Input": {"S3Uri": "s3://dougdaly-mlops-poc-output-dev/sagemaker-scikit-learn-2026-03-26-15-41-58-899/input/code/evaluate.py", 
"LocalPath": "/opt/ml/processing/input/code", "S3DataType": "S3Prefix", "S3InputMode": "File", "S3DataDistributionType": "FullyReplicated", "S3CompressionType": "None"}}], "ProcessingOutputConfig": {"Outputs": [{"OutputName": "evaluation", "AppManaged": false, "S3Output": {"S3Uri": {"Std:Join": {"On": "/", "Values": ["s3:/", "dougdaly-mlops-poc-output-dev", "digital-twin-resilience-dev-pipeline", {"Get": "Execution.PipelineExecutionId"}, "EvaluateModel", "output", "evaluation"]}}, "LocalPath": "/opt/ml/processing/evaluation", "S3UploadMode": "EndOfJob"}}]}}}]} \ No newline at end of file +{ + "Version": "2020-12-01", + "Metadata": {}, + "Parameters": [ + { + "Name": "InputDataUri", + "Type": "String", + "DefaultValue": "s3://dougdaly-mlops-poc-input-dev/synthetic/raw/" + }, + { + "Name": "RequestConfigUri", + "Type": "String", + "DefaultValue": "s3://dougdaly-mlops-poc-input-dev/requests/request.json" + }, + { + "Name": "ProcessingInstanceType", + "Type": "String", + "DefaultValue": "ml.t3.medium" + }, + { + "Name": "TrainingInstanceType", + "Type": "String", + "DefaultValue": "ml.t3.medium" + }, + { + "Name": "EvaluationInstanceType", + "Type": "String", + "DefaultValue": "ml.t3.medium" + } + ], + "PipelineExperimentConfig": { + "ExperimentName": { + "Get": "Execution.PipelineName" + }, + "TrialName": { + "Get": "Execution.PipelineExecutionId" + } + }, + "Steps": [ + { + "Name": "ProcessSyntheticTelemetry", + "Type": "Processing", + "Arguments": { + "ProcessingResources": { + "ClusterConfig": { + "InstanceType": { + "Get": "Parameters.ProcessingInstanceType" + }, + "InstanceCount": 1, + "VolumeSizeInGB": 30 + } + }, + "AppSpecification": { + "ImageUri": "246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3", + "ContainerEntrypoint": [ + "python3", + "/opt/ml/processing/input/code/processor.py" + ] + }, + "RoleArn": "arn:aws:iam::159535637196:role/SageMakerExecutionRole-mlops", + "ProcessingInputs": [ + { + "InputName": "input-1", + 
"AppManaged": false, + "S3Input": { + "S3Uri": { + "Get": "Parameters.InputDataUri" + }, + "LocalPath": "/opt/ml/processing/input", + "S3DataType": "S3Prefix", + "S3InputMode": "File", + "S3DataDistributionType": "FullyReplicated", + "S3CompressionType": "None" + } + }, + { + "InputName": "input-2", + "AppManaged": false, + "S3Input": { + "S3Uri": { + "Get": "Parameters.RequestConfigUri" + }, + "LocalPath": "/opt/ml/processing/config", + "S3DataType": "S3Prefix", + "S3InputMode": "File", + "S3DataDistributionType": "FullyReplicated", + "S3CompressionType": "None" + } + }, + { + "InputName": "code", + "AppManaged": false, + "S3Input": { + "S3Uri": "s3://dougdaly-mlops-poc-output-dev/sagemaker-scikit-learn-2026-03-31-18-55-49-990/input/code/processor.py", + "LocalPath": "/opt/ml/processing/input/code", + "S3DataType": "S3Prefix", + "S3InputMode": "File", + "S3DataDistributionType": "FullyReplicated", + "S3CompressionType": "None" + } + } + ], + "ProcessingOutputConfig": { + "Outputs": [ + { + "OutputName": "train", + "AppManaged": false, + "S3Output": { + "S3Uri": { + "Std:Join": { + "On": "/", + "Values": [ + "s3:/", + "dougdaly-mlops-poc-output-dev", + "digital-twin-resilience-dev-pipeline", + { + "Get": "Execution.PipelineExecutionId" + }, + "ProcessSyntheticTelemetry", + "output", + "train" + ] + } + }, + "LocalPath": "/opt/ml/processing/output/train", + "S3UploadMode": "EndOfJob" + } + }, + { + "OutputName": "validation", + "AppManaged": false, + "S3Output": { + "S3Uri": { + "Std:Join": { + "On": "/", + "Values": [ + "s3:/", + "dougdaly-mlops-poc-output-dev", + "digital-twin-resilience-dev-pipeline", + { + "Get": "Execution.PipelineExecutionId" + }, + "ProcessSyntheticTelemetry", + "output", + "validation" + ] + } + }, + "LocalPath": "/opt/ml/processing/output/validation", + "S3UploadMode": "EndOfJob" + } + }, + { + "OutputName": "test", + "AppManaged": false, + "S3Output": { + "S3Uri": { + "Std:Join": { + "On": "/", + "Values": [ + "s3:/", + 
"dougdaly-mlops-poc-output-dev", + "digital-twin-resilience-dev-pipeline", + { + "Get": "Execution.PipelineExecutionId" + }, + "ProcessSyntheticTelemetry", + "output", + "test" + ] + } + }, + "LocalPath": "/opt/ml/processing/output/test", + "S3UploadMode": "EndOfJob" + } + } + ] + } + } + }, + { + "Name": "TrainBaselineModel", + "Type": "Processing", + "Arguments": { + "ProcessingResources": { + "ClusterConfig": { + "InstanceType": { + "Get": "Parameters.TrainingInstanceType" + }, + "InstanceCount": 1, + "VolumeSizeInGB": 30 + } + }, + "AppSpecification": { + "ImageUri": "246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3", + "ContainerEntrypoint": [ + "python3", + "/opt/ml/processing/input/code/train.py" + ] + }, + "RoleArn": "arn:aws:iam::159535637196:role/SageMakerExecutionRole-mlops", + "ProcessingInputs": [ + { + "InputName": "input-1", + "AppManaged": false, + "S3Input": { + "S3Uri": { + "Get": "Steps.ProcessSyntheticTelemetry.ProcessingOutputConfig.Outputs['train'].S3Output.S3Uri" + }, + "LocalPath": "/opt/ml/processing/train", + "S3DataType": "S3Prefix", + "S3InputMode": "File", + "S3DataDistributionType": "FullyReplicated", + "S3CompressionType": "None" + } + }, + { + "InputName": "input-2", + "AppManaged": false, + "S3Input": { + "S3Uri": { + "Get": "Steps.ProcessSyntheticTelemetry.ProcessingOutputConfig.Outputs['validation'].S3Output.S3Uri" + }, + "LocalPath": "/opt/ml/processing/validation", + "S3DataType": "S3Prefix", + "S3InputMode": "File", + "S3DataDistributionType": "FullyReplicated", + "S3CompressionType": "None" + } + }, + { + "InputName": "code", + "AppManaged": false, + "S3Input": { + "S3Uri": "s3://dougdaly-mlops-poc-output-dev/sagemaker-scikit-learn-2026-03-31-18-55-50-253/input/code/train.py", + "LocalPath": "/opt/ml/processing/input/code", + "S3DataType": "S3Prefix", + "S3InputMode": "File", + "S3DataDistributionType": "FullyReplicated", + "S3CompressionType": "None" + } + } + ], + "ProcessingOutputConfig": { 
+ "Outputs": [ + { + "OutputName": "model", + "AppManaged": false, + "S3Output": { + "S3Uri": { + "Std:Join": { + "On": "/", + "Values": [ + "s3:/", + "dougdaly-mlops-poc-output-dev", + "digital-twin-resilience-dev-pipeline", + { + "Get": "Execution.PipelineExecutionId" + }, + "TrainBaselineModel", + "output", + "model" + ] + } + }, + "LocalPath": "/opt/ml/processing/model", + "S3UploadMode": "EndOfJob" + } + } + ] + } + } + }, + { + "Name": "EvaluateModel", + "Type": "Processing", + "Arguments": { + "ProcessingResources": { + "ClusterConfig": { + "InstanceType": { + "Get": "Parameters.EvaluationInstanceType" + }, + "InstanceCount": 1, + "VolumeSizeInGB": 30 + } + }, + "AppSpecification": { + "ImageUri": "246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3", + "ContainerEntrypoint": [ + "python3", + "/opt/ml/processing/input/code/evaluate.py" + ] + }, + "RoleArn": "arn:aws:iam::159535637196:role/SageMakerExecutionRole-mlops", + "ProcessingInputs": [ + { + "InputName": "input-1", + "AppManaged": false, + "S3Input": { + "S3Uri": { + "Get": "Steps.TrainBaselineModel.ProcessingOutputConfig.Outputs['model'].S3Output.S3Uri" + }, + "LocalPath": "/opt/ml/processing/model", + "S3DataType": "S3Prefix", + "S3InputMode": "File", + "S3DataDistributionType": "FullyReplicated", + "S3CompressionType": "None" + } + }, + { + "InputName": "input-2", + "AppManaged": false, + "S3Input": { + "S3Uri": { + "Get": "Steps.ProcessSyntheticTelemetry.ProcessingOutputConfig.Outputs['test'].S3Output.S3Uri" + }, + "LocalPath": "/opt/ml/processing/test", + "S3DataType": "S3Prefix", + "S3InputMode": "File", + "S3DataDistributionType": "FullyReplicated", + "S3CompressionType": "None" + } + }, + { + "InputName": "code", + "AppManaged": false, + "S3Input": { + "S3Uri": "s3://dougdaly-mlops-poc-output-dev/sagemaker-scikit-learn-2026-03-31-18-55-50-315/input/code/evaluate.py", + "LocalPath": "/opt/ml/processing/input/code", + "S3DataType": "S3Prefix", + "S3InputMode": 
"File", + "S3DataDistributionType": "FullyReplicated", + "S3CompressionType": "None" + } + } + ], + "ProcessingOutputConfig": { + "Outputs": [ + { + "OutputName": "evaluation", + "AppManaged": false, + "S3Output": { + "S3Uri": { + "Std:Join": { + "On": "/", + "Values": [ + "s3:/", + "dougdaly-mlops-poc-output-dev", + "digital-twin-resilience-dev-pipeline", + { + "Get": "Execution.PipelineExecutionId" + }, + "EvaluateModel", + "output", + "evaluation" + ] + } + }, + "LocalPath": "/opt/ml/processing/evaluation", + "S3UploadMode": "EndOfJob" + } + } + ] + } + } + } + ] +} diff --git a/mlops/pipelines/digital_twin_resilience/steps/.DS_Store b/mlops/pipelines/digital_twin_resilience/steps/.DS_Store index dec24de..f32f232 100644 Binary files a/mlops/pipelines/digital_twin_resilience/steps/.DS_Store and b/mlops/pipelines/digital_twin_resilience/steps/.DS_Store differ diff --git a/mlops/repo_skeleton.yml b/mlops/repo_skeleton.yml index 374bc28..41e2a30 100644 --- a/mlops/repo_skeleton.yml +++ b/mlops/repo_skeleton.yml @@ -3,77 +3,65 @@ repo/ workflows/ terraform-plan.yml terraform-apply.yml - - docs/ - discovery-one-pager.md - architecture-notes.md - - infra/ - terraform/ - envs/ - dev/ - main.tf - variables.tf - outputs.tf - backend.tf - terraform.tfvars - modules/ - s3/ - main.tf - variables.tf - outputs.tf - iam/ - main.tf - variables.tf - outputs.tf - sagemaker_pipeline/ - main.tf - variables.tf - outputs.tf - - pipelines/ - digital_twin_resilience/ - pipeline.py - config.py - requirements.txt - steps/ - processing/ - processor.py - requirements.txt - training/ - train.py - requirements.txt - evaluation/ - evaluate.py - requirements.txt - utils/ - io_utils.py - metrics.py - schemas.py - - containers/ - processing/ - Dockerfile - requirements.txt - training/ - Dockerfile - requirements.txt - evaluation/ - Dockerfile - requirements.txt - + terraform/ + envs/ + dev/ + backend.tf + main.tf + outputs.tf + terraform.tfstate + terraform.tfvars + variables.tf + README.md + 
modules/ + s3/ + main.tf + variables.tf + outputs.tf + iam/ + main.tf + variables.tf + outputs.tf + sagemaker_pipeline/ + main.tf + variables.tf + outputs.tf + mlops/ + data/ + docs/ + discovery-one-pager.md + README.md + pipelines/ + digital_twin_resilience/ + check_pipeline_execution.py + config.py + create_pipeline.py + parse_request.py + pipeline_definition.json + pipeline.py + request_schema.py + request.json + requirements.txt + run_request_flow.py + show_pipeline_outputs.py + show_processing_logs.py + start_pipeline.py + steps/ + processing/ + processor.py + training/ + train.py + evaluation/ + evaluate.py + utils/ + io_utils.py + metrics.py + schemas.py data/ synthetic/ generate_synthetic_data.py sample_input.csv - tests/ - unit/ - test_config.py - test_metrics.py - test_pipeline_compile.py - integration/ - test_smoke_synthetic.py - - Makefile + bedrock_test.py + test_steps.py README.md \ No newline at end of file diff --git a/mlops_local_test/.DS_Store b/mlops_local_test/.DS_Store new file mode 100644 index 0000000..82eb920 Binary files /dev/null and b/mlops_local_test/.DS_Store differ diff --git a/some_input.csv b/some_input.csv deleted file mode 100644 index e69de29..0000000 diff --git a/terraform/.DS_Store b/terraform/.DS_Store index 5b1fb09..9801d46 100644 Binary files a/terraform/.DS_Store and b/terraform/.DS_Store differ diff --git a/terraform/envs/.DS_Store b/terraform/envs/.DS_Store index cc02eff..82185f4 100644 Binary files a/terraform/envs/.DS_Store and b/terraform/envs/.DS_Store differ diff --git a/terraform/modules/.DS_Store b/terraform/modules/.DS_Store index 0800edd..868f0a5 100644 Binary files a/terraform/modules/.DS_Store and b/terraform/modules/.DS_Store differ
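Review note on the `pipeline.py` hunk: the change swaps `out_path.write_text(definition)` for a parse-and-re-dump, which is why `pipeline_definition.json` grows from 1 line to 368. The sketch below reproduces that round trip on its own to show that only formatting changes and key order is preserved. The `write_pretty_definition` helper and the shortened `compact` string are illustrative assumptions, not code from the repo; in the repo the string comes from `pipeline.definition()`.

```python
import json
import tempfile
from pathlib import Path


def write_pretty_definition(definition: str, out_path: Path) -> None:
    """Re-serialize a compact JSON string with 2-space indentation."""
    with out_path.open("w", encoding="utf-8") as f:
        # Parsing first also fails fast if the definition is not valid JSON.
        json.dump(json.loads(definition), f, indent=2, sort_keys=False)
        f.write("\n")  # trailing newline avoids "\ No newline at end of file"


# Hypothetical, shortened stand-in for the string from pipeline.definition().
compact = (
    '{"Version": "2020-12-01", "Metadata": {}, '
    '"Steps": [{"Name": "EvaluateModel", "Type": "Processing"}]}'
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "pipeline_definition.json"
    write_pretty_definition(compact, out_path)
    pretty = out_path.read_text(encoding="utf-8")

# The parsed content is identical; only whitespace differs.
same_content = json.loads(pretty) == json.loads(compact)
# sort_keys=False keeps the original key order.
key_order = list(json.loads(pretty).keys())
```

With the indented form, later diffs of `pipeline_definition.json` should be reviewable line by line. The `code` input `S3Uri` values will likely still change on each regeneration, since they embed a timestamp (compare `2026-03-26-15-41-58-624` with `2026-03-31-18-55-49-990` in this diff).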