
Commit 7a7549e

Merge branch 'feature/mlflow-integration' into 'develop'

feat: Add MLflow experiment tracking integration

See merge request genaiic-reusable-assets/engagement-artifacts/genaiic-idp-accelerator!596

2 parents: f13774a + ccaff92
11 files changed: 1231 additions, 5 deletions

Dockerfile.optimized: 5 additions, 0 deletions

```diff
@@ -22,6 +22,7 @@ ENV UV_LINK_MODE=copy
 # Build argument for function path
 ARG FUNCTION_PATH
 ARG INSTALL_IDP_COMMON=true
+ARG INSTALL_GIT=false

 # Create working directory
 WORKDIR /build
@@ -44,6 +45,10 @@ RUN --mount=from=uv,source=/uv,target=/bin/uv \
 # Final stage - minimal runtime
 FROM public.ecr.aws/lambda/python:3.12-arm64

+# Conditionally install git (required for mlflow/gitpython)
+ARG INSTALL_GIT=false
+RUN if [ "$INSTALL_GIT" = "true" ]; then dnf install -y git && dnf clean all; fi
+
 # Copy the runtime dependencies from the builder stage
 COPY --from=builder ${LAMBDA_TASK_ROOT} ${LAMBDA_TASK_ROOT}
```

README.md: 1 addition, 0 deletions

```diff
@@ -187,6 +187,7 @@ For detailed deployment and testing instructions, see the [Deployment Guide](./d
 - [Assessment](./docs/assessment.md) - Extraction confidence evaluation using LLMs
 - [Rule Validation](./docs/rule-validation.md) - Business rule validation and compliance checking
 - [Evaluation Framework](./docs/evaluation.md) - Accuracy assessment system with analytics database and reporting
+- [MLflow Experiment Tracking](./docs/mlflow-integration.md) - Optional MLflow integration for tracking metrics, model parameters, and prompts across test runs
 - [Knowledge Base](./docs/knowledge-base.md) - Document knowledge base query feature
 - [Monitoring](./docs/monitoring.md) - Monitoring and logging capabilities
 - [IDP Accelerator Help Chat Bot](./docs/code-intelligence.md) - Chat bot for asking question about the IDP code base and features
```

docs/mlflow-integration.md: 263 additions, 0 deletions (new file)

---
title: "MLflow Experiment Tracking"
---

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0

# MLflow Experiment Tracking

The GenAIIDP solution includes optional integration with [Amazon SageMaker with MLflow](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html) for experiment tracking. When enabled, every test run automatically logs metrics, configuration parameters, and artifacts to an MLflow tracking server, enabling you to:

- Compare accuracy, cost, and performance across test runs
- Track which models, prompts, and inference parameters produced each result
- Filter and search runs by model ID, temperature, or any logged parameter
- Visualize trends in accuracy and cost over time
- Download full configuration snapshots and class definitions for reproducibility

## Table of Contents

- [MLflow Experiment Tracking](#mlflow-experiment-tracking)
  - [Architecture](#architecture)
  - [Prerequisites](#prerequisites)
  - [Enabling MLflow](#enabling-mlflow)
  - [How It Works](#how-it-works)
  - [What Gets Logged](#what-gets-logged)
    - [Metrics](#metrics)
    - [Parameters](#parameters)
    - [Artifacts](#artifacts)
    - [Tags](#tags)
  - [Example MLflow Run](#example-mlflow-run)
  - [AWS Resources Created](#aws-resources-created)
  - [IAM Permissions](#iam-permissions)
  - [Configuration](#configuration)
  - [Viewing Results](#viewing-results)
  - [Troubleshooting](#troubleshooting)

## Architecture

```mermaid
flowchart LR
    TR[Test Results Resolver] -->|async invoke| ML[MLflow Logger Lambda]
    ML -->|log metrics, params, artifacts| SM[SageMaker MLflow Tracking Server]
    TR -->|fetch config| DDB[(DynamoDB Config Table)]

    style ML fill:#f9f,stroke:#333
    style SM fill:#ff9,stroke:#333
```

When a test run completes and metrics are aggregated, the `TestResultsResolverFunction` asynchronously invokes the `MLflowLoggerFunction` with the full metrics payload and IDP configuration. The logger function then records everything to the SageMaker MLflow tracking server. The invocation is fire-and-forget: MLflow logging never blocks or delays the test run results.

## Prerequisites

1. An [Amazon SageMaker MLflow Tracking Server](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow-create-tracking-server-studio.html) in the same region as your IDP deployment
2. The tracking server ARN, in the format:

   ```
   arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<server-name>
   ```

## Enabling MLflow

Set the following CloudFormation parameters during stack deployment or update:

| Parameter | Value | Description |
|-----------|-------|-------------|
| `EnableMLflow` | `true` | Enables the MLflow Logger Lambda and wires it into the test results pipeline |
| `MlflowTrackingURI` | `arn:aws:sagemaker:...` | ARN of your SageMaker MLflow tracking server |

`MlflowTrackingURI` is required when `EnableMLflow` is `true`. A CloudFormation rule validates this at deploy time.
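A deploy-time check of this kind can be expressed with a CloudFormation `Rules` section. The fragment below is an illustrative sketch only, not the actual template; the parameter names come from the table above, but the rule name and assertion wording are assumptions:

```yaml
# Hypothetical sketch: fail stack creation when MLflow is enabled
# but no tracking server ARN was supplied.
Rules:
  ValidateMlflowTrackingURI:
    RuleCondition: !Equals [!Ref EnableMLflow, "true"]
    Assertions:
      - Assert: !Not [!Equals [!Ref MlflowTrackingURI, ""]]
        AssertDescription: >-
          MlflowTrackingURI must be provided when EnableMLflow is true
```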
When `EnableMLflow` is `false` (the default), no MLflow resources are created and no logging occurs.

## How It Works

1. A test run completes and the `TestResultsResolverFunction` aggregates metrics (via Stickler or Athena fallback)
2. The resolver fetches the IDP configuration for the test run from DynamoDB
3. The resolver asynchronously invokes the `MLflowLoggerFunction` with:
   - All aggregated metrics (accuracy, cost, field-level scores, etc.)
   - The full IDP configuration (models, inference params, prompts, class definitions)
4. The MLflow Logger Lambda:
   - Creates an MLflow experiment named after the test run ID
   - Logs flat numeric values as MLflow metrics (searchable, chartable)
   - Logs model IDs and inference parameters as MLflow params (filterable)
   - Logs complex structures (prompts, class definitions, cost breakdown, full config) as JSON artifacts
5. The invocation is `Event` type (async): the test results resolver does not wait for MLflow logging to complete
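The asynchronous hand-off in steps 3 and 5 can be sketched as follows. `build_payload` and `invoke_mlflow_logger` are illustrative names, not the actual resolver code; only the `MLFLOW_LOGGER_FUNCTION_ARN` variable and the `Event` invocation type come from this document:

```python
# Hypothetical sketch of the fire-and-forget hand-off to the MLflow logger.
import json
import os


def build_payload(metrics: dict, config: dict, test_run_id: str) -> dict:
    """Bundle everything the logger Lambda needs into one event."""
    return {
        "test_run_id": test_run_id,
        "metrics": metrics,
        "config": config,
    }


def invoke_mlflow_logger(payload: dict) -> None:
    """Async invoke: InvocationType='Event' returns without waiting."""
    arn = os.environ.get("MLFLOW_LOGGER_FUNCTION_ARN")
    if not arn:
        # MLflow disabled: the env var is absent, so logging is skipped.
        return
    import boto3  # imported lazily; only needed when MLflow is enabled

    boto3.client("lambda").invoke(
        FunctionName=arn,
        InvocationType="Event",  # do not block the test run results
        Payload=json.dumps(payload).encode("utf-8"),
    )
```

Because the invocation type is `Event`, a failure inside the logger surfaces only in its own CloudWatch logs, never in the test run itself.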
## What Gets Logged

### Metrics

Numeric values are logged as MLflow metrics. These are searchable and chartable in the MLflow UI.

| Category | Example Keys | Description |
|----------|--------------|-------------|
| Overall accuracy | `overall_accuracy` | Aggregate accuracy score |
| Confidence | `average_confidence` | Mean extraction confidence |
| Cost | `total_cost` | Total processing cost |
| Document count | `document_count` | Number of documents in the test run |
| Accuracy breakdown | `accuracy_breakdown.Payslip`, `accuracy_breakdown.W2` | Per-class accuracy (flattened from nested dict) |
| Split classification | `split_classification_metrics.Payslip.precision` | Per-class precision/recall/f1 (flattened) |
| Field-level metrics | `PayDate.cm_recall`, `CurrentGrossPay.cm_f1` | Per-field `cm_precision`, `cm_recall`, `cm_f1`, `cm_accuracy` |
| Cost breakdown | `cost.ocr.textract_analyze_document_layout_pages`, `cost.classification.bedrock_us.amazon.nova_2_lite_v1_0_inputtokens` | Per-service estimated cost (sanitized keys) |
| Weighted scores | Logged as artifact (see below) | Complex nested structure |

Metric key sanitization: `/`, `:`, and `-` are replaced with `_`, and all keys are lowercased.
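A minimal sketch of the flattening and sanitization described above, with hypothetical helper names (`sanitize_key`, `flatten_metrics`); the real Lambda may differ, for example in exactly which key segments get lowercased:

```python
# Illustrative sketch of metric-key sanitization and nested-dict flattening.
from numbers import Number


def sanitize_key(key: str) -> str:
    """Replace '/', ':', and '-' with '_' and lowercase the result."""
    for ch in "/:-":
        key = key.replace(ch, "_")
    return key.lower()


def flatten_metrics(data: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dot-notation keys, keeping numeric leaves.

    Non-numeric values (None, strings) are skipped, matching the behavior
    noted in the Troubleshooting section.
    """
    flat = {}
    for key, value in data.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_metrics(value, full_key))
        elif isinstance(value, Number) and not isinstance(value, bool):
            flat[full_key] = value
    return flat
```

Applied to a Bedrock model ID, `sanitize_key` yields exactly the shape seen in the cost-breakdown keys: `us.amazon.nova-2-lite-v1:0` becomes `us.amazon.nova_2_lite_v1_0`.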
### Parameters

Short key-value strings are logged as MLflow params. These are filterable in the MLflow UI, which is useful for comparing runs across different model configurations.

| Parameter | Example Value | Source |
|-----------|---------------|--------|
| `test_run_id` | `abc-123-def` | Test run identifier |
| `classification.model` | `us.amazon.nova-2-lite-v1:0` | Classification model ID |
| `classification.temperature` | `0.0` | Classification temperature |
| `classification.top_p` | `0.0` | Classification top_p |
| `classification.top_k` | `5.0` | Classification top_k |
| `classification.max_tokens` | `4096` | Classification max tokens |
| `classification.enabled` | `True` | Classification enabled flag |
| `classification.method` | `multimodalPageLevelClassification` | Classification method |
| `extraction.model` | `us.amazon.nova-2-lite-v1:0` | Extraction model ID |
| `extraction.temperature` | `0.0` | Extraction temperature |
| `extraction.top_p` | `0.0` | Extraction top_p |
| `extraction.top_k` | `5.0` | Extraction top_k |
| `extraction.max_tokens` | `65535` | Extraction max tokens |
| `assessment.model` | `us.amazon.nova-lite-v1:0` | Assessment model ID |
| `assessment.confidence_threshold` | `0.8` | Assessment confidence threshold |
| `assessment.granular.enabled` | `True` | Granular assessment flag |
| `summarization.model` | `us.amazon.nova-pro-v1:0` | Summarization model ID |
| `evaluation.model` | `us.amazon.nova-2-lite-v1:0` | Evaluation model ID |
| `ocr.backend` | `textract` | OCR backend |
| `use_bda` | `False` | BDA mode flag |

Only parameters that exist in the configuration are logged; missing values are omitted, not set to empty strings.
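The omit-if-missing behavior can be sketched with a hypothetical helper. `get_path` and `extract_params` are illustrative names, and the dotted lookup paths are assumptions about how the configuration dictionary is laid out:

```python
# Hypothetical sketch: map nested config values to MLflow param names,
# omitting any parameter whose config path does not exist.
def get_path(config: dict, dotted: str):
    """Walk a nested dict by a dotted path; None when any level is missing."""
    node = config
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def extract_params(config: dict, paths: dict) -> dict:
    """Build the MLflow params dict, skipping missing values entirely."""
    params = {}
    for param_name, dotted in paths.items():
        value = get_path(config, dotted)
        if value is not None:  # omitted, never logged as ""
            params[param_name] = str(value)
    return params
```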
134+
135+
### Artifacts
136+
137+
Complex data structures logged as JSON files under the `metrics/` artifact path.
138+
139+
| Artifact | Description |
140+
|----------|-------------|
141+
| `full_config.json` | Complete IDP configuration snapshot for the test run |
142+
| `prompts.json` | System and task prompts for each stage (classification, extraction, assessment, summarization) |
143+
| `class_definitions.json` | Document class schemas with field definitions and evaluation methods |
144+
| `weighted_overall_scores.json` | Weighted accuracy scores per document class |
145+
| `field_metrics.json` | Full per-field evaluation metrics |
146+
| `cost_breakdown.json` | Detailed cost breakdown by service and operation |
147+
148+
### Tags
149+
150+
| Tag | Value |
151+
|-----|-------|
152+
| `source` | `test_results_resolver` |
153+
154+
## Example MLflow Run
155+
156+
For a test run with the lending package sample configuration, a single MLflow run would contain:
157+
158+
```
159+
── Params (27) ──────────────────────────────────────
160+
test_run_id = "run-2026-03-25-001"
161+
classification.model = "us.amazon.nova-2-lite-v1:0"
162+
classification.temperature = "0.0"
163+
classification.top_p = "0.0"
164+
classification.top_k = "5.0"
165+
classification.max_tokens = "4096"
166+
classification.method = "multimodalPageLevelClassification"
167+
extraction.model = "us.amazon.nova-2-lite-v1:0"
168+
extraction.temperature = "0.0"
169+
extraction.top_p = "0.0"
170+
extraction.top_k = "5.0"
171+
extraction.max_tokens = "65535"
172+
assessment.model = "us.amazon.nova-lite-v1:0"
173+
assessment.temperature = "0.0"
174+
assessment.top_p = "0.0"
175+
assessment.top_k = "5.0"
176+
assessment.max_tokens = "10000"
177+
assessment.enabled = "True"
178+
assessment.confidence_threshold = "0.8"
179+
assessment.granular.enabled = "True"
180+
summarization.model = "us.amazon.nova-pro-v1:0"
181+
summarization.temperature = "0.0"
182+
summarization.top_p = "0.0"
183+
summarization.top_k = "5.0"
184+
summarization.max_tokens = "4096"
185+
summarization.enabled = "True"
186+
evaluation.model = "us.amazon.nova-2-lite-v1:0"
187+
ocr.backend = "textract"
188+
use_bda = "False"
189+
190+
── Metrics (35+) ────────────────────────────────────
191+
overall_accuracy = 0.92
192+
average_confidence = 0.87
193+
total_cost = 0.089
194+
document_count = 5
195+
PayDate.cm_recall = 1.0
196+
PayDate.cm_precision = 1.0
197+
CurrentGrossPay.cm_f1 = 0.95
198+
cost.ocr.textract_analyze_document_layout_pages = 0.02
199+
cost.classification.bedrock_us.amazon.nova_2_lite_v1_0_inputtokens = 0.0026
200+
...
201+
202+
── Artifacts ────────────────────────────────────────
203+
metrics/full_config.json
204+
metrics/prompts.json
205+
metrics/class_definitions.json
206+
metrics/weighted_overall_scores.json
207+
metrics/field_metrics.json
208+
metrics/cost_breakdown.json
209+
```
210+
211+
## AWS Resources Created
212+
213+
When `EnableMLflow` is `true`, the following resources are created in the unified pattern stack:
214+
215+
| Resource | Type | Description |
216+
|----------|------|-------------|
217+
| `MLflowLoggerFunction` | `AWS::Serverless::Function` | Lambda function (container image, arm64, 512MB, 5min timeout) that logs to MLflow |
218+
| `MLflowLoggerFunctionLogGroup` | `AWS::Logs::LogGroup` | CloudWatch log group for the Lambda function |
219+
220+
The Lambda function is built as a Docker container image using `Dockerfile.optimized` with the `sagemaker-mlflow` Python package and `git` installed (required by MLflow for artifact logging).
221+
222+
Additionally, the `TestResultsResolverFunction` in the AppSync stack receives:
223+
- `MLFLOW_LOGGER_FUNCTION_ARN` environment variable (conditional)
224+
- `lambda:InvokeFunction` IAM permission for the MLflow Logger Lambda (conditional)
225+
226+
All MLflow resources are conditional on `IsMLflowEnabled` — when disabled, no resources are created and no additional costs are incurred.
227+
228+
## IAM Permissions
229+
230+
The MLflow Logger Lambda has the following permissions:
231+
232+
| Permission | Resource | Purpose |
233+
|------------|----------|---------|
234+
| `sagemaker-mlflow:*` | `*` | Full access to SageMaker MLflow APIs |
235+
| `kms:GenerateDataKey`, `kms:Decrypt` | Customer managed key | Encryption for CloudWatch logs |
236+
| `logs:CreateLogGroup`, `logs:CreateLogStream`, `logs:PutLogEvents` | Log group | CloudWatch logging |
237+
| `s3:PutObject`, `s3:PutObjectAcl` | `sagemaker-<region>-<account>/mlflow-artifacts/*` | MLflow artifact storage in the SageMaker-managed S3 bucket |
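In a SAM template, grants like these might look roughly like the following inline policy fragment. The statement layout and `!Sub` expression are illustrative assumptions, not the actual template:

```yaml
# Hypothetical SAM policy sketch matching the permissions table above.
Policies:
  - Statement:
      - Effect: Allow
        Action: "sagemaker-mlflow:*"
        Resource: "*"
      - Effect: Allow
        Action:
          - s3:PutObject
          - s3:PutObjectAcl
        Resource: !Sub "arn:aws:s3:::sagemaker-${AWS::Region}-${AWS::AccountId}/mlflow-artifacts/*"
```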
## Configuration

No runtime configuration is needed beyond the two CloudFormation parameters. The MLflow integration automatically uses the IDP configuration that was active for each test run.

To change which MLflow tracking server is used, update the `MlflowTrackingURI` stack parameter and redeploy.

## Viewing Results

1. Open the SageMaker Studio UI or the MLflow tracking server UI
2. Navigate to the experiment named after your test run ID
3. Use the MLflow UI to:
   - Compare metrics across runs (accuracy, cost, confidence)
   - Filter runs by model parameters (e.g., show all runs using `nova-pro`)
   - Download artifacts (prompts, class definitions, full config)
   - Create charts tracking accuracy trends over time

## Troubleshooting

| Issue | Cause | Resolution |
|-------|-------|------------|
| No MLflow data after a test run | `EnableMLflow` is `false` or the `MLFLOW_LOGGER_FUNCTION_ARN` env var is empty | Verify the stack parameters and redeploy with `EnableMLflow=true` |
| MLflow Logger Lambda errors | Invalid tracking server ARN or missing permissions | Check the CloudWatch logs at `/<stack-name>/lambda/MLflowLoggerFunction` |
| Missing config params in MLflow | Config not found in DynamoDB for the test run | Verify the test run has a metadata record with `Config` in the tracking table |
| Partial metrics logged | Some metric values are non-numeric (null, string) | Non-numeric values are skipped during flattening; this is expected behavior |
| `sagemaker-mlflow` import error | Container image build issue | Verify `requirements.txt` includes `sagemaker-mlflow` and the Docker build completed successfully |
