|
| 1 | +# ADR-005: Template Drift Detection with IncludeNestedStacks |
| 2 | + |
| 3 | +## Status |
| 4 | + |
| 5 | +Proposed |
| 6 | + |
| 7 | +## Context |
| 8 | + |
| 9 | +### What we're solving |
| 10 | + |
| 11 | +The gen2-migration lock step adds `DeletionPolicy: Retain` to stateful |
| 12 | +resources before the refactor step moves them between stacks. If the |
| 13 | +refactor fails or the user runs `--rollback`, the lock rollback needs |
| 14 | +to verify that no resources have actually drifted — confirming the |
| 15 | +environment is still consistent and safe to revert the DeletionPolicy |
| 16 | +changes. |
| 17 | + |
| 18 | +Phase 2 drift detection (`detectTemplateDrift`) is the mechanism for |
| 19 | +this verification. It creates a CloudFormation changeset with |
| 20 | +`IncludeNestedStacks: true` on the root stack and compares the |
| 21 | +deployed state against the cached template. If there is no drift |
| 22 | +(beyond the expected DeletionPolicy changes from the lock step itself), |
| 23 | +lock rollback can proceed safely. |
| 24 | + |
| 25 | +### Two problems |
| 26 | + |
| 27 | +**Problem 1: FAILED changesets are discarded.** |
| 28 | + |
| 29 | +After gen2-migration refactor moves resources (e.g., DynamoDB tables, |
| 30 | +S3 buckets) from Gen1 nested stacks to Gen2 stacks, the Gen1 templates |
| 31 | +still reference those resources. CloudFormation's `EarlyValidation` |
| 32 | +step checks whether referenced resources exist in the target stack and |
| 33 | +fails the changeset with: |
| 34 | + |
| 35 | +``` |
| 36 | +EarlyValidation::ResourceExistenceCheck failed for resource(s) [activity-main] |
| 37 | +``` |
| 38 | + |
| 39 | +The current code (lines 190-203 of `detect-template-drift.ts`) treats |
| 40 | +all FAILED changesets as errors and discards them: |
| 41 | + |
| 42 | +```typescript |
| 43 | +if (changeSet.Status === 'FAILED') { |
| 44 | + // ...deletes changeset, returns { changes: [], skipped: true } |
| 45 | +} |
| 46 | +``` |
| 47 | + |
| 48 | +Similarly, `analyzeChangeSet` (lines 251-264) bails on FAILED nested |
| 49 | +changesets during recursive traversal. |
| 50 | + |
| 51 | +This means Phase 2 reports zero drift for any app that has been through |
| 52 | +gen2-migration refactor, even when real template drift exists on |
| 53 | +non-failing nested stacks. |
| 54 | + |
| 55 | +**Problem 2: Lock's DeletionPolicy changes are expected drift.** |
| 56 | + |
| 57 | +The lock step modifies templates to add `DeletionPolicy: Retain` to |
| 58 | +stateful resources. When Phase 2 compares these modified templates |
| 59 | +against the deployed stack, the DeletionPolicy additions appear as |
| 60 | +template drift. This is *expected* — the lock step intentionally made |
| 61 | +these changes. The drift detection must distinguish lock's DeletionPolicy |
| 62 | +changes from real drift (someone changed something outside of amplify). |
| 63 | + |
| 64 | +This is distinct from the FAILED changeset problem but compounds it: |
| 65 | +even when we successfully read changes from a FAILED changeset, we |
| 66 | +need to filter out the DeletionPolicy noise to determine if there is |
| 67 | +*real* drift that would make lock rollback unsafe. |
| 68 | + |
| 69 | +### Failure type distinction |
| 70 | + |
| 71 | +Not all FAILED changesets are equal: |
| 72 | + |
| 73 | +- **EarlyValidation failures** (e.g., `ResourceExistenceCheck`): CFN |
| 74 | + still populates the Changes array before failing. The changeset is |
| 75 | + describable and its changes are usable. This is the common case for |
| 76 | + post-migration stacks. |
| 77 | + |
| 78 | +- **Other failures** (e.g., `InsufficientCapabilities`, malformed |
| 79 | + template, IAM errors): CFN may not populate Changes at all. These |
| 80 | + represent real errors, not the expected post-migration state. |
| 81 | + |
| 82 | +The code must handle these differently: proceed with EarlyValidation |
| 83 | +failures (read whatever Changes are available), but treat other |
| 84 | +failures as genuine errors. |
| 85 | + |
| 86 | +### The 14570 per-nested-stack approach |
| 87 | + |
| 88 | +Issue #14570 proposed replacing `IncludeNestedStacks: true` with a |
| 89 | +client-side approach: create independent changesets on each nested stack |
| 90 | +using `UsePreviousTemplate: true`, fetch templates from S3, and use |
| 91 | +Bottleneck for rate limiting. This was prototyped as Method B across |
| 92 | +three parallel worktree experiments. |
| 93 | + |
| 94 | +### Empirical findings |
| 95 | + |
| 96 | +Testing against the live discussions app (amplify-discussions-main-c39a5, |
| 97 | +5 nested stacks, 3 of which fail EarlyValidation) revealed: |
| 98 | + |
| 99 | +1. **FAILED changesets contain usable Changes data.** CloudFormation |
| 100 | + populates `Changes` on nested changesets *before* validation fails. |
| 101 | + `DescribeChangeSet` on a FAILED nested changeset returns the full |
| 102 | + changes array. Confirmed: storageactivity (2 changes), |
| 103 | + storageavatars (6 changes), storagebookmarks (2 changes) — all |
| 104 | + FAILED with EarlyValidation, all with Changes populated. |
| 105 | + |
| 106 | +2. **CFN does not exit early on one nested failure.** When |
| 107 | + `IncludeNestedStacks: true` is set, CloudFormation creates changesets |
| 108 | + for *all* nested stacks regardless of whether some fail validation. |
| 109 | + The root changeset fails, but all 5 nested changesets are created |
| 110 | + and describable. |
| 111 | + |
| 112 | + Exact root StatusReason: `Nested change set <ARN> was not |
| 113 | + successfully created: Currently in FAILED.` — references only the |
| 114 | + first failing nested changeset. Does NOT contain "EarlyValidation". |
| 115 | + |
| 116 | + Exact nested EarlyValidation StatusReason: `The following |
| 117 | + hook(s)/validation failed: [AWS::EarlyValidation:: |
| 118 | + ResourceExistenceCheck]. To troubleshoot Early Validation errors, |
| 119 | + use the DescribeEvents API for detailed failure information.` |
| 120 | + |
| 121 | +3. **Per-nested-stack approach produces false positives.** Creating an |
| 122 | + independent changeset on a nested stack (e.g., apidiscussions) |
| 123 | + *without* `IncludeNestedStacks` reports 6 phantom `Modify` changes |
| 124 | + on `AWS::CloudFormation::Stack` resources. These changes do not exist |
| 125 | + when using `IncludeNestedStacks: true` from the root. The root |
| 126 | + approach correctly suppresses parameter-propagation noise that the |
| 127 | + isolated approach cannot. |
| 128 | + |
| 129 | +4. **Template sources are equivalent.** S3 template and deployed |
| 130 | + template are byte-for-byte identical. `UsePreviousTemplate: true` |
| 131 | + and `TemplateBody` fetched from S3 produce identical changeset |
| 132 | + results. |
| 133 | + |
| 134 | +5. **Auth and apidiscussions succeed cleanly.** 2 of 5 nested stacks |
| 135 | + pass changeset creation. Auth shows 7 real changes; apidiscussions |
| 136 | + shows 0. |
| 137 | + |
| 138 | +6. **Nested changeset race condition.** The root changeset fails as |
| 139 | + soon as *any* nested changeset fails, but other nested changesets |
| 140 | + may still be `CREATE_IN_PROGRESS`. Integration testing confirmed: |
| 141 | + apidiscussions was still in-progress when the root returned FAILED. |
| 142 | + Code must poll each nested changeset to terminal status before |
| 143 | + describing it. |
| 144 | + |
| 145 | +### Comparison |
| 146 | + |
| 147 | +| Dimension | IncludeNestedStacks: true | Per-nested (Method B) | |
| 148 | +|--------------------------|---------------------------|-----------------------| |
| 149 | +| False positives | None observed | 6 phantom changes | |
| 150 | +| Code complexity | ~30 lines changed | ~400 lines new | |
| 151 | +| CFN API calls | 1 CreateChangeSet | N+1 CreateChangeSet | |
| 152 | +| Rate limiting needed | No | Yes (Bottleneck) | |
| 153 | +| New dependencies | None | bottleneck, S3 client | |
| 154 | +| FAILED stack handling | Read Changes anyway | Same, plus false pos | |
| 155 | +| Sub-nested recursion | Built-in (ChangeSetId) | Must re-implement | |
| 156 | + |
| 157 | +## Decision |
| 158 | + |
| 159 | +Keep `IncludeNestedStacks: true` and read changes from FAILED changesets |
| 160 | +instead of discarding them. Filter out expected DeletionPolicy drift |
| 161 | +from the lock step. Do not implement the per-nested-stack approach. |
| 162 | + |
| 163 | +### Change 1: Root changeset — always fall through on FAILED |
| 164 | + |
| 165 | +Empirical finding: the root changeset's StatusReason when a nested |
| 166 | +stack fails EarlyValidation is: |
| 167 | + |
| 168 | +``` |
| 169 | +Nested change set <ARN> was not successfully created: Currently in FAILED. |
| 170 | +``` |
| 171 | + |
| 172 | +This does NOT contain "EarlyValidation" — it just references the first |
| 173 | +nested changeset that failed. The root cannot classify the failure type. |
| 174 | + |
| 175 | +Therefore, the root should always fall through to `analyzeChangeSet` |
| 176 | +when FAILED (except for "no changes"). Classification happens at the |
| 177 | +nested level: |
| 178 | + |
| 179 | +```typescript |
| 180 | +// Current: bail on all FAILED (lines 190-203) |
| 181 | +if (changeSet.Status === 'FAILED') { |
| 182 | + return { changes: [], skipped: true, skipReason: ... }; |
| 183 | +} |
| 184 | + |
| 185 | +// Proposed: fall through to analyzeChangeSet for nested inspection |
| 186 | +if (changeSet.Status === 'FAILED') { |
| 187 | + if (changeSet.StatusReason?.includes("didn't contain changes")) { |
| 188 | + // No drift — clean result |
| 189 | + return { changes: [], skipped: false }; |
| 190 | + } |
| 191 | + // Any other FAILED reason: nested stacks may still have data. |
| 192 | + // Fall through — analyzeChangeSet classifies each nested changeset. |
| 193 | + print.warn(`Root changeset FAILED: ${changeSet.StatusReason}`); |
| 194 | +} |
| 195 | +``` |
| 196 | + |
| 197 | +### Change 2: Nested changeset analysis — classify per-stack |
| 198 | + |
| 199 | +Each nested changeset classifies itself. Three observed StatusReason |
| 200 | +patterns: |
| 201 | + |
| 202 | +1. `"The submitted information didn't contain changes..."` — no drift, |
| 203 | + clean skip. |
| 204 | +2. `"The following hook(s)/validation failed: [AWS::EarlyValidation::ResourceExistenceCheck]..."` — EarlyValidation failure, |
| 205 | + Changes are populated, read them. |
| 206 | +3. `"Only executable from the root change set."` with |
| 207 | + Status=CREATE_COMPLETE — success, read Changes normally. |
| 208 | + |
| 209 | +Any other StatusReason is a genuine error — skip that stack. |
| 210 | + |
| 211 | +```typescript |
| 212 | +function isEarlyValidationFailure(reason?: string): boolean { |
| 213 | + return !!reason?.includes('EarlyValidation'); |
| 214 | +} |
| 215 | + |
| 216 | +// In analyzeChangeSet: |
| 217 | +if (changeSet.Status === 'FAILED') { |
| 218 | + if (changeSet.StatusReason?.includes("didn't contain changes") |
| 219 | + || changeSet.StatusReason?.includes('No updates')) { |
| 220 | + return result; // genuinely no changes |
| 221 | + } |
| 222 | + if (isEarlyValidationFailure(changeSet.StatusReason)) { |
| 223 | + // Changes are populated despite FAILED status — fall through |
| 224 | + print.warn(`Nested changeset FAILED (EarlyValidation): ${changeSet.StatusReason}`); |
| 225 | + } else { |
| 226 | + // Unknown failure — treat as error, skip this stack |
| 227 | + return { changes: [], skipped: true, skipReason: ... }; |
| 228 | + } |
| 229 | +} |
| 230 | +``` |
| 231 | + |
| 232 | +### Change 3: Partial results instead of all-or-nothing |
| 233 | + |
| 234 | +The current code discards all results if *any* nested stack analysis |
| 235 | +is skipped (lines 343-349). Instead, return available results and track |
| 236 | +which stacks were incomplete: |
| 237 | + |
| 238 | +```typescript |
| 239 | +// Current: discard everything |
| 240 | +if (hasNestedSkipped) { |
| 241 | + return { changes: [], skipped: true, skipReason: '...' }; |
| 242 | +} |
| 243 | + |
| 244 | +// Proposed: return partial results with metadata |
| 245 | +result.incompleteStacks = skippedStacks; // stacks with non-EV failures |
| 246 | +return result; |
| 247 | +``` |
| 248 | + |
| 249 | +### Change 4: Filter expected DeletionPolicy drift |
| 250 | + |
| 251 | +The lock step adds `DeletionPolicy: Retain` to stateful resources. |
| 252 | +These show up as Modify changes in the changeset. For lock rollback |
| 253 | +to determine whether the environment is safe to revert, these expected |
| 254 | +changes must be filtered out. |
| 255 | + |
| 256 | +The filter applies after changeset analysis and before the rollback |
| 257 | +safety decision: |
| 258 | + |
| 259 | +```typescript |
| 260 | +function isExpectedLockDrift(change: ResourceChangeWithNested): boolean { |
| 261 | + // A change is expected lock drift if: |
| 262 | + // 1. Action is 'Modify' |
| 263 | + // 2. Scope is exactly ['DeletionPolicy'] (confirmed empirically — |
| 264 | + // CFN uses 'DeletionPolicy' as a first-class Scope value) |
| 265 | + // 3. Details show DirectModification of DeletionPolicy attribute |
| 266 | + // with RequiresRecreation: 'Never' |
| 267 | + return change.Action === 'Modify' |
| 268 | + && change.Scope?.length === 1 |
| 269 | + && change.Scope[0] === 'DeletionPolicy'; |
| 270 | +} |
| 271 | + |
| 272 | +const realDrift = changes.filter(c => !isExpectedLockDrift(c)); |
| 273 | +``` |
| 274 | + |
| 275 | +If `realDrift` is empty after filtering, lock rollback can proceed |
| 276 | +safely. If there is any real drift, lock rollback must abort — the |
| 277 | +environment is in an inconsistent state. |
| 278 | + |
| 279 | +### What is NOT changed |
| 280 | + |
| 281 | +- `IncludeNestedStacks: true` stays on the `CreateChangeSetCommand` |
| 282 | +- The recursive `analyzeChangeSet` traversal via `ChangeSetId` on |
| 283 | + nested `AWS::CloudFormation::Stack` resources stays the same |
| 284 | +- The "no changes" detection (`didn't contain changes`) stays the same |
| 285 | +- Changeset cleanup logic stays the same |
| 286 | + |
| 287 | +## Risks |
| 288 | + |
| 289 | +### R1 — Reading Changes from FAILED changesets is undocumented |
| 290 | + |
| 291 | +Reading Changes from EarlyValidation-failed changesets is not |
| 292 | +explicitly documented by CloudFormation. Confirmed empirically on the |
| 293 | +discussions app (3 FAILED stacks, all with accessible Changes), but |
| 294 | +could change without notice. |
| 295 | + |
| 296 | +*Mitigation*: Add an integration test against a known FAILED stack to |
| 297 | +verify Changes are populated. If CFN changes this behavior, the test |
| 298 | +catches it before it silently regresses in production. |
| 299 | + |
| 300 | +### R2 — Changes on EarlyValidation-failed changesets may be incomplete |
| 301 | + |
| 302 | +CFN may populate *some* changes before the validation failure but not |
| 303 | +all. We could miss real drift on a FAILED stack. |
| 304 | + |
| 305 | +*Mitigation*: This is inherent to the EarlyValidation failure, not our |
| 306 | +approach. Per-nested-stack with `UsePreviousTemplate: true` hits the |
| 307 | +same EarlyValidation failure on the same stacks. The real fix is to |
| 308 | +update the Gen1 templates to reflect the post-migration state |
| 309 | +(MigrationPlaceholder resources), which is already part of the refactor |
| 310 | +step. Once templates are updated, EarlyValidation passes and this risk |
| 311 | +disappears. |
| 312 | + |
| 313 | +### R3 — Non-EarlyValidation failures conflated with EarlyValidation |
| 314 | + |
| 315 | +If CFN introduces new failure modes whose StatusReason strings don't |
| 316 | +match our `isEarlyValidationFailure` check, we'd incorrectly treat |
| 317 | +them as hard errors (skip the stack). This is the safe direction — |
| 318 | +false negatives (missed drift) are worse than false positives |
| 319 | +(unnecessary skip) for the lock rollback use case. |
| 320 | + |
| 321 | +*Mitigation*: Log all failure reasons. Expand the pattern match as |
| 322 | +new failure types are observed. |
| 323 | + |
| 324 | +### R4 — DeletionPolicy filter accuracy |
| 325 | + |
| 326 | +The filter that distinguishes lock's expected DeletionPolicy changes |
| 327 | +from real drift must be precise. A false positive (real drift |
| 328 | +classified as expected) would allow lock rollback to proceed when the |
| 329 | +environment is inconsistent. A false negative (expected drift |
| 330 | +classified as real) would block lock rollback unnecessarily. |
| 331 | + |
| 332 | +*Mitigation*: The lock step knows exactly which resources it modified |
| 333 | +and what it changed. The filter can be informed by the lock step's |
| 334 | +output (list of resources + properties changed) rather than relying on |
| 335 | +heuristic pattern matching on changeset details. |
| 336 | + |
| 337 | +## Consequences |
| 338 | + |
| 339 | +### What changes |
| 340 | + |
| 341 | +- `detectTemplateDrift` classifies FAILED changesets by failure type |
| 342 | + instead of discarding them all |
| 343 | +- EarlyValidation failures are treated as recoverable — Changes are |
| 344 | + read from the FAILED changeset |
| 345 | +- Non-EarlyValidation failures remain hard errors (skipped) |
| 346 | +- The all-or-nothing behavior on nested analysis failures is replaced |
| 347 | + with partial results |
| 348 | +- `TemplateDriftResults` gains optional metadata about incomplete stacks |
| 349 | +- A DeletionPolicy filter is added to distinguish lock's expected |
| 350 | + changes from real drift |
| 351 | +- Lock rollback consumes filtered results to make the safe/unsafe |
| 352 | + determination |
| 353 | + |
| 354 | +### What does NOT change |
| 355 | + |
| 356 | +- The `CreateChangeSetCommand` call and its parameters |
| 357 | +- The recursive traversal of nested changesets via `ChangeSetId` |
| 358 | +- The "no changes" detection path |
| 359 | +- The drift formatter and its console URL generation |
| 360 | +- The Phase 1 (CFN drift detection) and Phase 3 (local drift) paths |
| 361 | + |
| 362 | +### What gets removed |
| 363 | + |
| 364 | +- The entire 14570 per-nested-stack prototype (Method B) is abandoned |
| 365 | +- No new dependencies (bottleneck, S3 client for template fetching) |
| 366 | +- No new S3 template resolution logic |
| 367 | +- No new rate limiting infrastructure |
0 commit comments