|
| 1 | + |
| 2 | +# S3 Enricher Processor |
| 3 | + |
| 4 | +This plugin enables you to merge data from an S3 file with source data from your Data Prepper pipeline. |
| 5 | + |
| 6 | +## Usage |
| 7 | +```yaml |
| 8 | +ml_merge-pipeline: |
| 9 | +... |
| 10 | + processor: |
| 11 | + - s3_enrich: |
| 12 | + # ============================================================================= |
| 13 | + # S3 SOURCE BUCKET CONFIGURATION |
| 14 | + # Defines where to fetch the original/source data for enrichment |
| 15 | + # ============================================================================= |
| 16 | + bucket: |
| 17 | + # The S3 bucket containing the source records to enrich from |
| 18 | + name: offlinebatch |
| 19 | + filter: |
| 20 | + # S3 prefix path where source files are located |
| 21 | + # The processor will look for source files under this prefix |
| 22 | + # Example: s3://offlinebatch/bedrockbatch/originsource/test_batch_50k.jsonl |
| 23 | + include_prefix: bedrockbatch/originsource/ |
| 24 | + |
| 25 | + # ============================================================================= |
| 26 | + # DATA FORMAT CONFIGURATION |
| 27 | + # ============================================================================= |
| 28 | + # Codec for parsing source S3 files |
| 29 | + # Options: ndjson, json, csv, etc. |
| 30 | + codec: |
| 31 | + ndjson: |
| 32 | + |
| 33 | + # ============================================================================= |
| 34 | + # AWS CONFIGURATION |
| 35 | + # ============================================================================= |
| 36 | + # AWS account ID that owns the S3 bucket (for cross-account access) |
| 37 | + default_bucket_owner: 802041417063 |
| 38 | + |
| 39 | + aws: |
| 40 | + # AWS region where the S3 bucket is located |
| 41 | + region: us-east-1 |
| 42 | + |
| 43 | + # ============================================================================= |
| 44 | + # S3 OBJECT SETTINGS |
| 45 | + # ============================================================================= |
| 46 | + # Maximum size (in MB) of S3 source files to process |
| 47 | + # Files exceeding this limit will be skipped |
| 48 | + s3_object_size_limit: 100mb |
| 49 | + |
| 50 | + # JSON path in the incoming pipeline event that contains the S3 object key |
| 51 | + # Used to determine which source file to fetch for enrichment |
| 52 | + # Example event: {"s3": {"bucket": "...", "key": "output/file.jsonl.out"}} |
| 53 | + s3_key_path: "s3/key" |
| 54 | + |
| 55 | + # ============================================================================= |
| 56 | + # SOURCE FILE NAME EXTRACTION |
| 57 | + # ============================================================================= |
| 58 | + # Regex pattern to extract the base filename from the output S3 key |
| 59 | + # The first capture group (.*?) extracts the original source filename |
| 60 | + # |
| 61 | + # Example: |
| 62 | + # Input: test_batch_50k-2025-11-06T21-19-15Z-1762463955825635000-uuid.jsonl.out |
| 63 | + # Match: Group 1 = "test_batch_50k" |
| 64 | + # Result: Looks for source file "test_batch_50k.jsonl" in include_prefix path |
| 65 | + # |
| 66 | + # Pattern breakdown: |
| 67 | + # ^(.*?) - Capture base filename (non-greedy) |
| 68 | + # -\d{4}-\d{2}-\d{2} - Match date: -YYYY-MM-DD |
| 69 | + # T\d{2}-\d{2}-\d{2}Z - Match time: THH-MM-SSZ |
| 70 | + # -.* - Match remaining (job ID, UUID, etc.) |
| 71 | + # \.jsonl\.out$ - Match file extension |
| 72 | + s3_object_name_pattern: ^(.*?)-\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2}Z-.*\.jsonl\.out$ |
| 73 | + |
| 74 | + # ============================================================================= |
| 75 | + # RECORD MATCHING & ENRICHMENT |
| 76 | + # ============================================================================= |
| 77 | + # Field name used to correlate/match records between output and source files |
| 78 | + # Both the pipeline event and source records must contain this field |
| 79 | + # Records with matching correlation values will be merged |
| 80 | + correlation_key: "recordId" |
| 81 | + |
| 82 | + # List of fields to copy from the source record into the pipeline event |
| 83 | + # Only these specified fields will be merged; all other source fields are ignored |
| 84 | + # If a field doesn't exist in source, it will be skipped |
| 85 | + keys_to_merge: |
| 86 | + - "field_A" |
| 87 | + - "field_B" |
| 88 | + - "field_C" |
| 89 | + |
| 90 | + # ============================================================================= |
| 91 | + # CONDITIONAL PROCESSING |
| 92 | + # ============================================================================= |
| 93 | + # Data Prepper expression to conditionally apply enrichment |
| 94 | + # Only events matching this condition will be processed by the enricher |
| 95 | + # Events not matching will pass through unchanged |
| 96 | + enrich_when: /s3/key != null |
| 97 | +``` |
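
The filename extraction described in the `s3_object_name_pattern` comments can be sketched in Python. This is only an illustration of the regex, not the plugin's code: the helper name `source_key_for` is hypothetical, and it assumes that the pattern is applied to the last path segment of the key and that the `.jsonl` extension is appended to the captured base name, as the example in the configuration comments suggests.

```python
import re

# Pattern copied from the s3_object_name_pattern setting in the configuration.
PATTERN = re.compile(
    r"^(.*?)-\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2}Z-.*\.jsonl\.out$"
)


def source_key_for(output_key, include_prefix="bedrockbatch/originsource/",
                   extension=".jsonl"):
    """Hypothetical helper: map an output S3 key to the assumed source key.

    Returns None when the file name does not match the pattern.
    """
    # Assumption: only the file name (last path segment) is matched.
    file_name = output_key.rsplit("/", 1)[-1]
    match = PATTERN.match(file_name)
    if match is None:
        return None
    # Group 1 is the base filename captured by the non-greedy (.*?).
    return f"{include_prefix}{match.group(1)}{extension}"


if __name__ == "__main__":
    key = ("output/test_batch_50k-2025-11-06T21-19-15Z-"
           "1762463955825635000-uuid.jsonl.out")
    print(source_key_for(key))
```

The non-greedy first group matters here: it stops at the first `-YYYY-MM-DD` timestamp, so hyphens or underscores in the base name itself are preserved.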
| 98 | +- `keys_to_merge`: List of fields to copy from the source record into the pipeline event. |
| 99 | +- `s3_object_name_pattern`: Regex pattern that extracts the base filename from the output S3 key. |
| 100 | +- `s3_key_path`: JSON path in the incoming pipeline event that contains the S3 object key. |
| 101 | +- `correlation_key`: Field name used to correlate records between the output and source files. |
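
To make the interaction between `correlation_key` and `keys_to_merge` concrete, here is a minimal Python sketch of the merge semantics described in the configuration comments. It is not the plugin's implementation: the function name `enrich` is hypothetical, and overwriting a field that already exists in the event is an assumption the source does not confirm.

```python
def enrich(event, source_records, correlation_key="recordId",
           keys_to_merge=("field_A", "field_B", "field_C")):
    """Hypothetical sketch: copy selected fields from the matching source record.

    Returns (event, matched), where matched says whether a source record
    with the same correlation value was found.
    """
    # Index source records by correlation value; records without the key are skipped.
    index = {r[correlation_key]: r for r in source_records if correlation_key in r}
    source = index.get(event.get(correlation_key))
    if source is None:
        return event, False  # no match: the event passes through unchanged

    enriched = dict(event)
    for key in keys_to_merge:
        # Fields missing from the source record are skipped, as the comments state.
        if key in source:
            enriched[key] = source[key]  # assumption: source value overwrites
    return enriched, True


if __name__ == "__main__":
    sources = [{"recordId": "r1", "field_A": "a", "other": "ignored"}]
    print(enrich({"recordId": "r1", "output": "x"}, sources))
```

Only the listed fields are copied, so `other` in the example source record never reaches the event.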
| 102 | + |
| 103 | +## Metrics |
| 104 | +- 'numberOfRecordsEnrichedSuccessFromS3': Number of pipeline records successfully enriched from S3 source |
| 105 | +- 'numberOfRecordsEnrichedFailerFromS3': Number of pipeline records that failed enrichment from S3 source |
| 106 | +- 's3EnricherObjectsFailed': Number of S3 source objects that failed to load for enrichment |
| 107 | +- 's3EnricherObjectsSucceeded': Number of S3 source objects successfully loaded for enrichment |
| 108 | + |
| 109 | +## Developer Guide |
| 110 | + |
| 111 | +The integration tests for this plugin do not run as part of the Data Prepper build. |