Skip to content

Commit 6729bf0

Browse files
author
Benjamin Pace
committed
feat(drift): read changes from FAILED nested changesets with IncludeNestedStacks
Template drift detection now reads changes from EarlyValidation-failed nested changesets instead of discarding them. This enables drift detection on post-migration stacks where resources have been moved between stacks, causing CFN EarlyValidation::ResourceExistenceCheck failures. Changes: - Root changeset FAILED status falls through to analyzeChangeSet (except "didn't contain changes" which remains the no-drift happy path) - Nested changesets classify themselves: EarlyValidation failures have their Changes read; other failures are treated as genuine errors - Partial results returned with incompleteStacks tracking instead of all-or-nothing discard - Poll nested changesets to terminal status before describing (fixes race condition where root fails before all nested changesets complete) - Lock rollback validates drift with DeletionPolicy filter: recursive hasRealDrift walks the change tree, filtering out expected lock drift (Scope: ['DeletionPolicy']). Blocks rollback if real drift or incomplete stacks exist. ADR-005 documents the decision, empirical findings, and risks.
1 parent f89613a commit 6729bf0

File tree

4 files changed

+688
-36
lines changed

4 files changed

+688
-36
lines changed
Lines changed: 367 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,367 @@
1+
# ADR-005: Template Drift Detection with IncludeNestedStacks
2+
3+
## Status
4+
5+
Proposed
6+
7+
## Context
8+
9+
### What we're solving
10+
11+
The gen2-migration lock step adds `DeletionPolicy: Retain` to stateful
12+
resources before the refactor step moves them between stacks. If the
13+
refactor fails or the user runs `--rollback`, the lock rollback needs
14+
to verify that no resources have actually drifted — confirming the
15+
environment is still consistent and safe to revert the DeletionPolicy
16+
changes.
17+
18+
Phase 2 drift detection (`detectTemplateDrift`) is the mechanism for
19+
this verification. It creates a CloudFormation changeset with
20+
`IncludeNestedStacks: true` on the root stack and compares the
21+
deployed state against the cached template. If there is no drift
22+
(beyond the expected DeletionPolicy changes from the lock step itself),
23+
lock rollback can proceed safely.
24+
25+
### Two problems
26+
27+
**Problem 1: FAILED changesets are discarded.**
28+
29+
After gen2-migration refactor moves resources (e.g., DynamoDB tables,
30+
S3 buckets) from Gen1 nested stacks to Gen2 stacks, the Gen1 templates
31+
still reference those resources. CloudFormation's `EarlyValidation`
32+
step checks whether referenced resources exist in the target stack and
33+
fails the changeset with:
34+
35+
```
36+
EarlyValidation::ResourceExistenceCheck failed for resource(s) [activity-main]
37+
```
38+
39+
The current code (lines 190-203 of `detect-template-drift.ts`) treats
40+
all FAILED changesets as errors and discards them:
41+
42+
```typescript
43+
if (changeSet.Status === 'FAILED') {
44+
// ...deletes changeset, returns { changes: [], skipped: true }
45+
}
46+
```
47+
48+
Similarly, `analyzeChangeSet` (lines 251-264) bails on FAILED nested
49+
changesets during recursive traversal.
50+
51+
This means Phase 2 reports zero drift for any app that has been through
52+
gen2-migration refactor, even when real template drift exists on
53+
non-failing nested stacks.
54+
55+
**Problem 2: Lock's DeletionPolicy changes are expected drift.**
56+
57+
The lock step modifies templates to add `DeletionPolicy: Retain` to
58+
stateful resources. When Phase 2 compares these modified templates
59+
against the deployed stack, the DeletionPolicy additions appear as
60+
template drift. This is *expected* — the lock step intentionally made
61+
these changes. The drift detection must distinguish lock's DeletionPolicy
62+
changes from real drift (someone changed something outside of amplify).
63+
64+
This is distinct from the FAILED changeset problem but compounds it:
65+
even when we successfully read changes from a FAILED changeset, we
66+
need to filter out the DeletionPolicy noise to determine if there is
67+
*real* drift that would make lock rollback unsafe.
68+
69+
### Failure type distinction
70+
71+
Not all FAILED changesets are equal:
72+
73+
- **EarlyValidation failures** (e.g., `ResourceExistenceCheck`): CFN
74+
still populates the Changes array before failing. The changeset is
75+
describable and its changes are usable. This is the common case for
76+
post-migration stacks.
77+
78+
- **Other failures** (e.g., `InsufficientCapabilities`, malformed
79+
template, IAM errors): CFN may not populate Changes at all. These
80+
represent real errors, not the expected post-migration state.
81+
82+
The code must handle these differently: proceed with EarlyValidation
83+
failures (read whatever Changes are available), but treat other
84+
failures as genuine errors.
85+
86+
### The 14570 per-nested-stack approach
87+
88+
Issue #14570 proposed replacing `IncludeNestedStacks: true` with a
89+
client-side approach: create independent changesets on each nested stack
90+
using `UsePreviousTemplate: true`, fetch templates from S3, and use
91+
Bottleneck for rate limiting. This was prototyped as Method B across
92+
three parallel worktree experiments.
93+
94+
### Empirical findings
95+
96+
Testing against the live discussions app (amplify-discussions-main-c39a5,
97+
5 nested stacks, 3 of which fail EarlyValidation) revealed:
98+
99+
1. **FAILED changesets contain usable Changes data.** CloudFormation
100+
populates `Changes` on nested changesets *before* validation fails.
101+
`DescribeChangeSet` on a FAILED nested changeset returns the full
102+
changes array. Confirmed: storageactivity (2 changes),
103+
storageavatars (6 changes), storagebookmarks (2 changes) — all
104+
FAILED with EarlyValidation, all with Changes populated.
105+
106+
2. **CFN does not exit early on one nested failure.** When
107+
`IncludeNestedStacks: true` is set, CloudFormation creates changesets
108+
for *all* nested stacks regardless of whether some fail validation.
109+
The root changeset fails, but all 5 nested changesets are created
110+
and describable.
111+
112+
Exact root StatusReason: `Nested change set <ARN> was not
113+
successfully created: Currently in FAILED.` — references only the
114+
first failing nested changeset. Does NOT contain "EarlyValidation".
115+
116+
Exact nested EarlyValidation StatusReason: `The following
117+
hook(s)/validation failed: [AWS::EarlyValidation::
118+
ResourceExistenceCheck]. To troubleshoot Early Validation errors,
119+
use the DescribeEvents API for detailed failure information.`
120+
121+
3. **Per-nested-stack approach produces false positives.** Creating an
122+
independent changeset on a nested stack (e.g., apidiscussions)
123+
*without* `IncludeNestedStacks` reports 6 phantom `Modify` changes
124+
on `AWS::CloudFormation::Stack` resources. These changes do not exist
125+
when using `IncludeNestedStacks: true` from the root. The root
126+
approach correctly suppresses parameter-propagation noise that the
127+
isolated approach cannot.
128+
129+
4. **Template sources are equivalent.** S3 template and deployed
130+
template are byte-for-byte identical. `UsePreviousTemplate: true`
131+
and `TemplateBody` fetched from S3 produce identical changeset
132+
results.
133+
134+
5. **Auth and apidiscussions succeed cleanly.** 2 of 5 nested stacks
135+
pass changeset creation. Auth shows 7 real changes; apidiscussions
136+
shows 0.
137+
138+
6. **Nested changeset race condition.** The root changeset fails as
139+
soon as *any* nested changeset fails, but other nested changesets
140+
may still be `CREATE_IN_PROGRESS`. Integration testing confirmed:
141+
apidiscussions was still in-progress when the root returned FAILED.
142+
Code must poll each nested changeset to terminal status before
143+
describing it.
144+
145+
### Comparison
146+
147+
| Dimension | IncludeNestedStacks: true | Per-nested (Method B) |
148+
|--------------------------|---------------------------|-----------------------|
149+
| False positives | None observed | 6 phantom changes |
150+
| Code complexity | ~30 lines changed | ~400 lines new |
151+
| CFN API calls | 1 CreateChangeSet | N+1 CreateChangeSet |
152+
| Rate limiting needed | No | Yes (Bottleneck) |
153+
| New dependencies | None | bottleneck, S3 client |
154+
| FAILED stack handling | Read Changes anyway | Same, plus false pos |
155+
| Sub-nested recursion | Built-in (ChangeSetId) | Must re-implement |
156+
157+
## Decision
158+
159+
Keep `IncludeNestedStacks: true` and read changes from FAILED changesets
160+
instead of discarding them. Filter out expected DeletionPolicy drift
161+
from the lock step. Do not implement the per-nested-stack approach.
162+
163+
### Change 1: Root changeset — always fall through on FAILED
164+
165+
Empirical finding: the root changeset's StatusReason when a nested
166+
stack fails EarlyValidation is:
167+
168+
```
169+
Nested change set <ARN> was not successfully created: Currently in FAILED.
170+
```
171+
172+
This does NOT contain "EarlyValidation" — it just references the first
173+
nested changeset that failed. The root cannot classify the failure type.
174+
175+
Therefore, the root should always fall through to `analyzeChangeSet`
176+
when FAILED (except for "no changes"). Classification happens at the
177+
nested level:
178+
179+
```typescript
180+
// Current: bail on all FAILED (lines 190-203)
181+
if (changeSet.Status === 'FAILED') {
182+
return { changes: [], skipped: true, skipReason: ... };
183+
}
184+
185+
// Proposed: fall through to analyzeChangeSet for nested inspection
186+
if (changeSet.Status === 'FAILED') {
187+
if (changeSet.StatusReason?.includes("didn't contain changes")) {
188+
// No drift — clean result
189+
return { changes: [], skipped: false };
190+
}
191+
// Any other FAILED reason: nested stacks may still have data.
192+
// Fall through — analyzeChangeSet classifies each nested changeset.
193+
print.warn(`Root changeset FAILED: ${changeSet.StatusReason}`);
194+
}
195+
```
196+
197+
### Change 2: Nested changeset analysis — classify per-stack
198+
199+
Each nested changeset classifies itself. Three observed StatusReason
200+
patterns:
201+
202+
1. `"The submitted information didn't contain changes..."` — no drift,
203+
clean skip.
204+
2. `"The following hook(s)/validation failed: [AWS::EarlyValidation::ResourceExistenceCheck]..."` — EarlyValidation failure,
205+
Changes are populated, read them.
206+
3. `"Only executable from the root change set."` with
207+
Status=CREATE_COMPLETE — success, read Changes normally.
208+
209+
Any other StatusReason is a genuine error — skip that stack.
210+
211+
```typescript
212+
function isEarlyValidationFailure(reason?: string): boolean {
213+
return !!reason?.includes('EarlyValidation');
214+
}
215+
216+
// In analyzeChangeSet:
217+
if (changeSet.Status === 'FAILED') {
218+
if (changeSet.StatusReason?.includes("didn't contain changes")
219+
|| changeSet.StatusReason?.includes('No updates')) {
220+
return result; // genuinely no changes
221+
}
222+
if (isEarlyValidationFailure(changeSet.StatusReason)) {
223+
// Changes are populated despite FAILED status — fall through
224+
print.warn(`Nested changeset FAILED (EarlyValidation): ${changeSet.StatusReason}`);
225+
} else {
226+
// Unknown failure — treat as error, skip this stack
227+
return { changes: [], skipped: true, skipReason: ... };
228+
}
229+
}
230+
```
231+
232+
### Change 3: Partial results instead of all-or-nothing
233+
234+
The current code discards all results if *any* nested stack analysis
235+
is skipped (lines 343-349). Instead, return available results and track
236+
which stacks were incomplete:
237+
238+
```typescript
239+
// Current: discard everything
240+
if (hasNestedSkipped) {
241+
return { changes: [], skipped: true, skipReason: '...' };
242+
}
243+
244+
// Proposed: return partial results with metadata
245+
result.incompleteStacks = skippedStacks; // stacks with non-EV failures
246+
return result;
247+
```
248+
249+
### Change 4: Filter expected DeletionPolicy drift
250+
251+
The lock step adds `DeletionPolicy: Retain` to stateful resources.
252+
These show up as Modify changes in the changeset. For lock rollback
253+
to determine whether the environment is safe to revert, these expected
254+
changes must be filtered out.
255+
256+
The filter applies after changeset analysis and before the rollback
257+
safety decision:
258+
259+
```typescript
260+
function isExpectedLockDrift(change: ResourceChangeWithNested): boolean {
261+
// A change is expected lock drift if:
262+
// 1. Action is 'Modify'
263+
// 2. Scope is exactly ['DeletionPolicy'] (confirmed empirically —
264+
// CFN uses 'DeletionPolicy' as a first-class Scope value)
265+
// 3. Details show DirectModification of DeletionPolicy attribute
266+
// with RequiresRecreation: 'Never'
267+
return change.Action === 'Modify'
268+
&& change.Scope?.length === 1
269+
&& change.Scope[0] === 'DeletionPolicy';
270+
}
271+
272+
const realDrift = changes.filter(c => !isExpectedLockDrift(c));
273+
```
274+
275+
If `realDrift` is empty after filtering, lock rollback can proceed
276+
safely. If there is any real drift, lock rollback must abort — the
277+
environment is in an inconsistent state.
278+
279+
### What is NOT changed
280+
281+
- `IncludeNestedStacks: true` stays on the `CreateChangeSetCommand`
282+
- The recursive `analyzeChangeSet` traversal via `ChangeSetId` on
283+
nested `AWS::CloudFormation::Stack` resources stays the same
284+
- The "no changes" detection (`didn't contain changes`) stays the same
285+
- Changeset cleanup logic stays the same
286+
287+
## Risks
288+
289+
### R1 — Reading Changes from FAILED changesets is undocumented
290+
291+
Reading Changes from EarlyValidation-failed changesets is not
292+
explicitly documented by CloudFormation. Confirmed empirically on the
293+
discussions app (3 FAILED stacks, all with accessible Changes), but
294+
could change without notice.
295+
296+
*Mitigation*: Add an integration test against a known FAILED stack to
297+
verify Changes are populated. If CFN changes this behavior, the test
298+
catches it before it silently regresses in production.
299+
300+
### R2 — Changes on EarlyValidation-failed changesets may be incomplete
301+
302+
CFN may populate *some* changes before the validation failure but not
303+
all. We could miss real drift on a FAILED stack.
304+
305+
*Mitigation*: This is inherent to the EarlyValidation failure, not our
306+
approach. Per-nested-stack with `UsePreviousTemplate: true` hits the
307+
same EarlyValidation failure on the same stacks. The real fix is to
308+
update the Gen1 templates to reflect the post-migration state
309+
(MigrationPlaceholder resources), which is already part of the refactor
310+
step. Once templates are updated, EarlyValidation passes and this risk
311+
disappears.
312+
313+
### R3 — Non-EarlyValidation failures conflated with EarlyValidation
314+
315+
If CFN introduces new failure modes whose StatusReason strings don't
316+
match our `isEarlyValidationFailure` check, we'd incorrectly treat
317+
them as hard errors (skip the stack). This is the safe direction —
318+
false negatives (missed drift) are worse than false positives
319+
(unnecessary skip) for the lock rollback use case.
320+
321+
*Mitigation*: Log all failure reasons. Expand the pattern match as
322+
new failure types are observed.
323+
324+
### R4 — DeletionPolicy filter accuracy
325+
326+
The filter that distinguishes lock's expected DeletionPolicy changes
327+
from real drift must be precise. A false positive (real drift
328+
classified as expected) would allow lock rollback to proceed when the
329+
environment is inconsistent. A false negative (expected drift
330+
classified as real) would block lock rollback unnecessarily.
331+
332+
*Mitigation*: The lock step knows exactly which resources it modified
333+
and what it changed. The filter can be informed by the lock step's
334+
output (list of resources + properties changed) rather than relying on
335+
heuristic pattern matching on changeset details.
336+
337+
## Consequences
338+
339+
### What changes
340+
341+
- `detectTemplateDrift` classifies FAILED changesets by failure type
342+
instead of discarding them all
343+
- EarlyValidation failures are treated as recoverable — Changes are
344+
read from the FAILED changeset
345+
- Non-EarlyValidation failures remain hard errors (skipped)
346+
- The all-or-nothing behavior on nested analysis failures is replaced
347+
with partial results
348+
- `TemplateDriftResults` gains optional metadata about incomplete stacks
349+
- A DeletionPolicy filter is added to distinguish lock's expected
350+
changes from real drift
351+
- Lock rollback consumes filtered results to make the safe/unsafe
352+
determination
353+
354+
### What does NOT change
355+
356+
- The `CreateChangeSetCommand` call and its parameters
357+
- The recursive traversal of nested changesets via `ChangeSetId`
358+
- The "no changes" detection path
359+
- The drift formatter and its console URL generation
360+
- The Phase 1 (CFN drift detection) and Phase 3 (local drift) paths
361+
362+
### What gets removed
363+
364+
- The entire 14570 per-nested-stack prototype (Method B) is abandoned
365+
- No new dependencies (bottleneck, S3 client for template fetching)
366+
- No new S3 template resolution logic
367+
- No new rate limiting infrastructure

0 commit comments

Comments
 (0)