| 1 | +--- |
| 2 | +title: "MLflow Experiment Tracking" |
| 3 | +--- |
| 4 | + |
| 5 | +Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. |
| 6 | +SPDX-License-Identifier: MIT-0 |
| 7 | + |
| 8 | +# MLflow Experiment Tracking |
| 9 | + |
| 10 | +The GenAIIDP solution includes optional integration with [Amazon SageMaker with MLflow](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html) for experiment tracking. When enabled, every test run automatically logs metrics, configuration parameters, and artifacts to an MLflow tracking server, enabling you to: |
| 11 | + |
| 12 | +- Compare accuracy, cost, and performance across test runs |
| 13 | +- Track which models, prompts, and inference parameters produced each result |
| 14 | +- Filter and search runs by model ID, temperature, or any logged parameter |
| 15 | +- Visualize trends in accuracy and cost over time |
| 16 | +- Download full configuration snapshots and class definitions for reproducibility |
| 17 | + |
| 18 | +## Table of Contents |
| 19 | + |
| 20 | +- [MLflow Experiment Tracking](#mlflow-experiment-tracking) |
| 21 | + - [Architecture](#architecture) |
| 22 | + - [Prerequisites](#prerequisites) |
| 23 | + - [Enabling MLflow](#enabling-mlflow) |
| 24 | + - [How It Works](#how-it-works) |
| 25 | + - [What Gets Logged](#what-gets-logged) |
| 26 | + - [Metrics](#metrics) |
| 27 | + - [Parameters](#parameters) |
| 28 | + - [Artifacts](#artifacts) |
| 29 | + - [Tags](#tags) |
| 30 | + - [Example MLflow Run](#example-mlflow-run) |
| 31 | + - [AWS Resources Created](#aws-resources-created) |
| 32 | + - [IAM Permissions](#iam-permissions) |
| 33 | + - [Configuration](#configuration) |
| 34 | + - [Viewing Results](#viewing-results) |
| 35 | + - [Troubleshooting](#troubleshooting) |
| 36 | + |
| 37 | +## Architecture |
| 38 | + |
| 39 | +```mermaid |
| 40 | +flowchart LR |
| 41 | + TR[Test Results Resolver] -->|async invoke| ML[MLflow Logger Lambda] |
| 42 | + ML -->|log metrics, params, artifacts| SM[SageMaker MLflow Tracking Server] |
| 43 | + TR -->|fetch config| DDB[(DynamoDB Config Table)] |
| 44 | + |
| 45 | + style ML fill:#f9f,stroke:#333 |
| 46 | + style SM fill:#ff9,stroke:#333 |
| 47 | +``` |
| 48 | + |
| 49 | +When a test run completes and metrics are aggregated, the `TestResultsResolverFunction` asynchronously invokes the `MLflowLoggerFunction` with the full metrics payload and IDP configuration. The logger function then records everything to the SageMaker MLflow tracking server. The invocation is fire-and-forget — MLflow logging never blocks or delays the test run results. |
| 50 | + |
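The fire-and-forget call described above could look roughly like the following. This is a minimal sketch, not the resolver's actual code: the helper name `invoke_logger_async` and the payload field names (`testRunId`, `metrics`, `config`) are assumptions made for illustration. Only the `Event` invocation type and the `MLFLOW_LOGGER_FUNCTION_ARN` environment variable come from this document.

```python
import json
import os


def invoke_logger_async(lambda_client, test_run_id, metrics, config):
    """Queue an MLflow logging request without waiting for the result.

    `lambda_client` is expected to be a boto3 Lambda client, for example
    boto3.client("lambda"). The payload field names are hypothetical.
    """
    function_arn = os.environ.get("MLFLOW_LOGGER_FUNCTION_ARN", "")
    if not function_arn:
        # MLflow is disabled for this stack, so there is nothing to invoke.
        return False

    payload = {"testRunId": test_run_id, "metrics": metrics, "config": config}
    try:
        lambda_client.invoke(
            FunctionName=function_arn,
            InvocationType="Event",  # async: Lambda queues the event and returns
            Payload=json.dumps(payload, default=str).encode("utf-8"),
        )
    except Exception as exc:
        # A logging failure must never affect the test run results.
        print(f"MLflow logger invocation failed: {exc}")
        return False
    return True
```

Catching every exception around the invoke keeps the resolver's main path independent of MLflow availability, which matches the non-blocking behavior described above.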
| 51 | +## Prerequisites |
| 52 | + |
| 53 | +1. An [Amazon SageMaker MLflow Tracking Server](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow-create-tracking-server-studio.html) in the same region as your IDP deployment |
| 54 | +2. The tracking server ARN in the format: |
| 55 | + ``` |
| 56 | + arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<server-name> |
| 57 | + ``` |
| 58 | + |
| 59 | +## Enabling MLflow |
| 60 | + |
| 61 | +Set the following CloudFormation parameters during stack deployment or update: |
| 62 | + |
| 63 | +| Parameter | Value | Description | |
| 64 | +|-----------|-------|-------------| |
| 65 | +| `EnableMLflow` | `true` | Enables the MLflow Logger Lambda and wires it into the test results pipeline | |
| 66 | +| `MlflowTrackingURI` | `arn:aws:sagemaker:...` | ARN of your SageMaker MLflow tracking server | |
| 67 | + |
| 68 | +`MlflowTrackingURI` is required when `EnableMLflow` is `true`. A CloudFormation rule validates this at deploy time. |
| 69 | + |
| 70 | +When `EnableMLflow` is `false` (the default), no MLflow resources are created and no logging occurs. |
| 71 | + |
| 72 | +## How It Works |
| 73 | + |
| 74 | +1. A test run completes and the `TestResultsResolverFunction` aggregates metrics (via Stickler or Athena fallback) |
| 75 | +2. The resolver fetches the IDP configuration for the test run from DynamoDB |
| 76 | +3. The resolver asynchronously invokes the `MLflowLoggerFunction` with: |
| 77 | + - All aggregated metrics (accuracy, cost, field-level scores, etc.) |
| 78 | + - The full IDP configuration (models, inference params, prompts, class definitions) |
| 79 | +4. The MLflow Logger Lambda: |
| 80 | + - Creates an MLflow experiment named after the test run ID |
| 81 | + - Logs flat numeric values as MLflow metrics (searchable, chartable) |
| 82 | + - Logs model IDs and inference parameters as MLflow params (filterable) |
| 83 | + - Logs complex structures (prompts, class definitions, cost breakdown, full config) as JSON artifacts |
| 84 | +5. The invocation is `Event` type (async) — the test results resolver does not wait for MLflow logging to complete |
| 85 | + |
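Step 4 can be sketched with the MLflow fluent API. This is an illustrative outline under stated assumptions, not the Lambda's actual handler: the function name `log_test_run` and its arguments are hypothetical, and the `mlflow` module is passed in as a parameter so the sketch can be exercised with a stub. The calls used (`set_tracking_uri`, `set_experiment`, `start_run`, `set_tag`, `log_params`, `log_metrics`, `log_dict`) are standard MLflow functions.

```python
def log_test_run(mlflow, tracking_uri, test_run_id, params, metrics, artifacts):
    """Record one test run on the tracking server.

    `mlflow` is the imported mlflow module. `artifacts` maps a file name to a
    JSON-serializable object. All argument names are illustrative assumptions.
    """
    # With the sagemaker-mlflow package installed, the tracking URI is the
    # tracking server ARN.
    mlflow.set_tracking_uri(tracking_uri)
    # The experiment is named after the test run ID.
    mlflow.set_experiment(test_run_id)
    with mlflow.start_run():
        mlflow.set_tag("source", "test_results_resolver")
        mlflow.log_params(params)    # short strings, filterable
        mlflow.log_metrics(metrics)  # flat numeric values, chartable
        for file_name, content in artifacts.items():
            # Complex structures are stored as JSON files under metrics/.
            mlflow.log_dict(content, f"metrics/{file_name}")
```

In a real handler the first argument would simply be `import mlflow`, and the tracking URI would come from the `MlflowTrackingURI` stack parameter.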
| 86 | +## What Gets Logged |
| 87 | + |
| 88 | +### Metrics |
| 89 | + |
| 90 | +Numeric values logged as MLflow metrics. These are searchable and chartable in the MLflow UI. |
| 91 | + |
| 92 | +| Category | Example Keys | Description | |
| 93 | +|----------|-------------|-------------| |
| 94 | +| Overall accuracy | `overall_accuracy` | Aggregate accuracy score | |
| 95 | +| Confidence | `average_confidence` | Mean extraction confidence | |
| 96 | +| Cost | `total_cost` | Total processing cost | |
| 97 | +| Document count | `document_count` | Number of documents in the test run | |
| 98 | +| Accuracy breakdown | `accuracy_breakdown.Payslip`, `accuracy_breakdown.W2` | Per-class accuracy (flattened from nested dict) | |
| 99 | +| Split classification | `split_classification_metrics.Payslip.precision` | Per-class precision/recall/f1 (flattened) | |
| 100 | +| Field-level metrics | `PayDate.cm_recall`, `CurrentGrossPay.cm_f1` | Per-field cm_precision, cm_recall, cm_f1, cm_accuracy | |
| 101 | +| Cost breakdown | `cost.ocr.textract_analyze_document_layout_pages`, `cost.classification.bedrock_us.amazon.nova_2_lite_v1_0_inputtokens` | Per-service estimated cost (sanitized keys) | |
| 102 | +| Weighted scores | Logged as artifact (see below) | Complex nested structure | |
| 103 | + |
| 104 | +Metric key sanitization: `/`, `:`, and `-` are replaced with `_`, and all keys are lowercased. |
| 105 | + |
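The flattening and sanitization rules above can be illustrated with a short sketch. The helper names `sanitize_key` and `flatten_metrics` are hypothetical, as is the shape of the raw cost key in the usage example. Treating booleans as non-numeric is also an assumption.

```python
from numbers import Real


def sanitize_key(key):
    """Replace '/', ':' and '-' with '_' and lowercase the key."""
    for char in "/:-":
        key = key.replace(char, "_")
    return key.lower()


def flatten_metrics(data, prefix=""):
    """Flatten nested dicts into dotted keys, keeping only numeric leaves."""
    flat = {}
    for key, value in data.items():
        full_key = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_metrics(value, full_key))
        elif isinstance(value, Real) and not isinstance(value, bool):
            flat[full_key] = float(value)
        # None, strings, lists and booleans are skipped.
    return flat


# Hypothetical raw cost entry, chosen to reproduce a key from the table above.
raw_cost = {"classification": {"bedrock/us.amazon.nova-2-lite-v1:0_inputTokens": 0.0026}}
cost_metrics = {
    f"cost.{sanitize_key(key)}": value
    for key, value in flatten_metrics(raw_cost).items()
}
print(cost_metrics)
```

Applied to that entry, the sketch produces the key `cost.classification.bedrock_us.amazon.nova_2_lite_v1_0_inputtokens` shown in the cost breakdown row.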
| 106 | +### Parameters |
| 107 | + |
| 108 | +Short key-value strings logged as MLflow params. These are filterable in the MLflow UI — useful for comparing runs across different model configurations. |
| 109 | + |
| 110 | +| Parameter | Example Value | Source | |
| 111 | +|-----------|--------------|--------| |
| 112 | +| `test_run_id` | `abc-123-def` | Test run identifier | |
| 113 | +| `classification.model` | `us.amazon.nova-2-lite-v1:0` | Classification model ID | |
| 114 | +| `classification.temperature` | `0.0` | Classification temperature | |
| 115 | +| `classification.top_p` | `0.0` | Classification top_p | |
| 116 | +| `classification.top_k` | `5.0` | Classification top_k | |
| 117 | +| `classification.max_tokens` | `4096` | Classification max tokens | |
| 118 | +| `classification.enabled` | `True` | Classification enabled flag | |
| 119 | +| `classification.method` | `multimodalPageLevelClassification` | Classification method | |
| 120 | +| `extraction.model` | `us.amazon.nova-2-lite-v1:0` | Extraction model ID | |
| 121 | +| `extraction.temperature` | `0.0` | Extraction temperature | |
| 122 | +| `extraction.top_p` | `0.0` | Extraction top_p | |
| 123 | +| `extraction.top_k` | `5.0` | Extraction top_k | |
| 124 | +| `extraction.max_tokens` | `65535` | Extraction max tokens | |
| 125 | +| `assessment.model` | `us.amazon.nova-lite-v1:0` | Assessment model ID | |
| 126 | +| `assessment.confidence_threshold` | `0.8` | Assessment confidence threshold | |
| 127 | +| `assessment.granular.enabled` | `True` | Granular assessment flag | |
| 128 | +| `summarization.model` | `us.amazon.nova-pro-v1:0` | Summarization model ID | |
| 129 | +| `evaluation.model` | `us.amazon.nova-2-lite-v1:0` | Evaluation model ID | |
| 130 | +| `ocr.backend` | `textract` | OCR backend | |
| 131 | +| `use_bda` | `False` | BDA mode flag | |
| 132 | + |
| 133 | +Only parameters that exist in the configuration are logged — missing values are omitted, not set to empty strings. |
| 134 | + |
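The rule that missing values are omitted can be sketched as follows. The function name `extract_params` and the config paths in `PARAM_PATHS` are assumptions for illustration, and only a few of the parameters from the table are shown. The actual configuration schema may differ.

```python
# Param name -> path into the config dict. The paths are hypothetical.
PARAM_PATHS = {
    "classification.model": ("classification", "model"),
    "classification.temperature": ("classification", "temperature"),
    "extraction.model": ("extraction", "model"),
    "assessment.granular.enabled": ("assessment", "granular", "enabled"),
    "ocr.backend": ("ocr", "backend"),
    "use_bda": ("use_bda",),
}


def extract_params(config, test_run_id):
    """Build MLflow params from a config dict, omitting missing values."""
    params = {"test_run_id": test_run_id}
    for name, path in PARAM_PATHS.items():
        value = config
        for part in path:
            value = value.get(part) if isinstance(value, dict) else None
            if value is None:
                break
        if value is not None:
            # MLflow params are stored as strings, e.g. False -> "False".
            params[name] = str(value)
    return params
```

A config without an `extraction` section simply yields no `extraction.model` param, rather than an empty string.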
| 135 | +### Artifacts |
| 136 | + |
| 137 | +Complex data structures logged as JSON files under the `metrics/` artifact path. |
| 138 | + |
| 139 | +| Artifact | Description | |
| 140 | +|----------|-------------| |
| 141 | +| `full_config.json` | Complete IDP configuration snapshot for the test run | |
| 142 | +| `prompts.json` | System and task prompts for each stage (classification, extraction, assessment, summarization) | |
| 143 | +| `class_definitions.json` | Document class schemas with field definitions and evaluation methods | |
| 144 | +| `weighted_overall_scores.json` | Weighted accuracy scores per document class | |
| 145 | +| `field_metrics.json` | Full per-field evaluation metrics | |
| 146 | +| `cost_breakdown.json` | Detailed cost breakdown by service and operation | |
| 147 | + |
| 148 | +### Tags |
| 149 | + |
| 150 | +| Tag | Value | |
| 151 | +|-----|-------| |
| 152 | +| `source` | `test_results_resolver` | |
| 153 | + |
| 154 | +## Example MLflow Run |
| 155 | + |
| 156 | +For a test run with the lending package sample configuration, a single MLflow run would contain: |
| 157 | + |
| 158 | +``` |
| 159 | +── Params (29) ────────────────────────────────────── |
| 160 | +test_run_id = "run-2026-03-25-001" |
| 161 | +classification.model = "us.amazon.nova-2-lite-v1:0" |
| 162 | +classification.temperature = "0.0" |
| 163 | +classification.top_p = "0.0" |
| 164 | +classification.top_k = "5.0" |
| 165 | +classification.max_tokens = "4096" |
| 166 | +classification.method = "multimodalPageLevelClassification" |
| 167 | +extraction.model = "us.amazon.nova-2-lite-v1:0" |
| 168 | +extraction.temperature = "0.0" |
| 169 | +extraction.top_p = "0.0" |
| 170 | +extraction.top_k = "5.0" |
| 171 | +extraction.max_tokens = "65535" |
| 172 | +assessment.model = "us.amazon.nova-lite-v1:0" |
| 173 | +assessment.temperature = "0.0" |
| 174 | +assessment.top_p = "0.0" |
| 175 | +assessment.top_k = "5.0" |
| 176 | +assessment.max_tokens = "10000" |
| 177 | +assessment.enabled = "True" |
| 178 | +assessment.confidence_threshold = "0.8" |
| 179 | +assessment.granular.enabled = "True" |
| 180 | +summarization.model = "us.amazon.nova-pro-v1:0" |
| 181 | +summarization.temperature = "0.0" |
| 182 | +summarization.top_p = "0.0" |
| 183 | +summarization.top_k = "5.0" |
| 184 | +summarization.max_tokens = "4096" |
| 185 | +summarization.enabled = "True" |
| 186 | +evaluation.model = "us.amazon.nova-2-lite-v1:0" |
| 187 | +ocr.backend = "textract" |
| 188 | +use_bda = "False" |
| 189 | + |
| 190 | +── Metrics (35+) ──────────────────────────────────── |
| 191 | +overall_accuracy = 0.92 |
| 192 | +average_confidence = 0.87 |
| 193 | +total_cost = 0.089 |
| 194 | +document_count = 5 |
| 195 | +PayDate.cm_recall = 1.0 |
| 196 | +PayDate.cm_precision = 1.0 |
| 197 | +CurrentGrossPay.cm_f1 = 0.95 |
| 198 | +cost.ocr.textract_analyze_document_layout_pages = 0.02 |
| 199 | +cost.classification.bedrock_us.amazon.nova_2_lite_v1_0_inputtokens = 0.0026 |
| 200 | +... |
| 201 | + |
| 202 | +── Artifacts ──────────────────────────────────────── |
| 203 | +metrics/full_config.json |
| 204 | +metrics/prompts.json |
| 205 | +metrics/class_definitions.json |
| 206 | +metrics/weighted_overall_scores.json |
| 207 | +metrics/field_metrics.json |
| 208 | +metrics/cost_breakdown.json |
| 209 | +``` |
| 210 | + |
| 211 | +## AWS Resources Created |
| 212 | + |
| 213 | +When `EnableMLflow` is `true`, the following resources are created in the unified pattern stack: |
| 214 | + |
| 215 | +| Resource | Type | Description | |
| 216 | +|----------|------|-------------| |
| 217 | +| `MLflowLoggerFunction` | `AWS::Serverless::Function` | Lambda function (container image, arm64, 512 MB memory, 5-minute timeout) that logs to MLflow | |
| 218 | +| `MLflowLoggerFunctionLogGroup` | `AWS::Logs::LogGroup` | CloudWatch log group for the Lambda function | |
| 219 | + |
| 220 | +The Lambda function is built as a Docker container image using `Dockerfile.optimized` with the `sagemaker-mlflow` Python package and `git` installed (required by MLflow for artifact logging). |
| 221 | + |
| 222 | +Additionally, the `TestResultsResolverFunction` in the AppSync stack receives: |
| 223 | +- `MLFLOW_LOGGER_FUNCTION_ARN` environment variable (conditional) |
| 224 | +- `lambda:InvokeFunction` IAM permission for the MLflow Logger Lambda (conditional) |
| 225 | + |
| 226 | +All MLflow resources are conditional on `IsMLflowEnabled` — when disabled, no resources are created and no additional costs are incurred. |
| 227 | + |
| 228 | +## IAM Permissions |
| 229 | + |
| 230 | +The MLflow Logger Lambda has the following permissions: |
| 231 | + |
| 232 | +| Permission | Resource | Purpose | |
| 233 | +|------------|----------|---------| |
| 234 | +| `sagemaker-mlflow:*` | `*` | Full access to SageMaker MLflow APIs | |
| 235 | +| `kms:GenerateDataKey`, `kms:Decrypt` | Customer managed key | Encryption for CloudWatch logs | |
| 236 | +| `logs:CreateLogGroup`, `logs:CreateLogStream`, `logs:PutLogEvents` | Log group | CloudWatch logging | |
| 237 | +| `s3:PutObject`, `s3:PutObjectAcl` | `sagemaker-<region>-<account>/mlflow-artifacts/*` | MLflow artifact storage in the SageMaker-managed S3 bucket | |
| 238 | + |
| 239 | +## Configuration |
| 240 | + |
| 241 | +No runtime configuration is needed beyond the two CloudFormation parameters. The MLflow integration automatically uses the IDP configuration that was active for each test run. |
| 242 | + |
| 243 | +To change which MLflow tracking server is used, update the `MlflowTrackingURI` stack parameter and redeploy. |
| 244 | + |
| 245 | +## Viewing Results |
| 246 | + |
| 247 | +1. Open the SageMaker Studio UI or the MLflow tracking server UI |
| 248 | +2. Navigate to the experiment named after your test run ID |
| 249 | +3. Use the MLflow UI to: |
| 250 | + - Compare metrics across runs (accuracy, cost, confidence) |
| 251 | + - Filter runs by model parameters (e.g., show all runs using `nova-pro`) |
| 252 | + - Download artifacts (prompts, class definitions, full config) |
| 253 | + - Create charts tracking accuracy trends over time |
| 254 | + |
| 255 | +## Troubleshooting |
| 256 | + |
| 257 | +| Issue | Cause | Resolution | |
| 258 | +|-------|-------|------------| |
| 259 | +| No MLflow data after test run | `EnableMLflow` is `false` or `MLFLOW_LOGGER_FUNCTION_ARN` env var is empty | Verify stack parameters and redeploy with `EnableMLflow=true` | |
| 260 | +| MLflow Logger Lambda errors | Invalid tracking server ARN or permissions | Check CloudWatch logs at `/<stack-name>/lambda/MLflowLoggerFunction` | |
| 261 | +| Missing config params in MLflow | Config not found in DynamoDB for the test run | Verify the test run has a metadata record with `Config` in the tracking table | |
| 262 | +| Partial metrics logged | Some metric values are non-numeric (null, string) | Non-numeric values are skipped during flattening — this is expected behavior | |
| 263 | +| `sagemaker-mlflow` import error | Container image build issue | Verify `requirements.txt` includes `sagemaker-mlflow` and the Docker build completed successfully | |