Intelligent AWS Site-to-Site VPN Tunnel Investigation with Amazon DevOps Agent#15
Intelligent AWS Site-to-Site VPN Tunnel Investigation with Amazon DevOps Agent#15rahulnarya wants to merge 15 commits into
Conversation
PR Review FeedbackThanks for the VPN tunnel investigation demo — great work overall! I've been testing it end-to-end and have some feedback below. 1. Please split this into two PRsThis PR mixes the new VPN demo with changes to the EKS demo ( Could you split this into:
This way the VPN demo and EKS fixes can be reviewed and merged independently. 2.
|
|
Hi bquintas, Thank you for the review. Here's my plan for each item: 1. Split into two PRs
This should make the PR completely independent from the EKS demo. 2. Operator App URL not output:
3. Operator App" vs "AWS DevOps Agent console:
4. Missing x-api-key header: I'll rewrite step 4c to walk through the actual console flow:
5. Wrong clone URL:
Working on these now, will push an update shortly. |
…pt, fix README - Move demo from observability/ to networking/ (independent from EKS demo) - Rebase onto latest upstream/main - Fix setup-devops-agent.sh: construct Operator URL from agent space ID, point webhook creation to AWS Console with direct URL - Fix README: correct clone URL, webhook instructions (console not Operator App), MCP registration with full wizard walkthrough (API Key auth, x-api-key header, tool selection), note about MCP webhook not needed - Add root README entry for networking/ folder
9c3e8a2 to
759815a
Compare
|
Hi bquintas, I am ready for re-review. Thank you! |
6. CloudFormation template — AZ compatibility issueThe
The fix is to either:
This is a deployment blocker in regions where not all AZs support t3.micro. |
|
Follow-up on #6 (AZ issue): An alternative would be adding an explicit parameter: Parameters:
AvailabilityZone:
Type: AWS::EC2::AvailabilityZone::Name
Description: AZ that supports t3.microThat way the user picks a known-good AZ and CloudFormation validates it exists. Either approach is better than the current random selection — up to you which fits better. |
|
One more thought on the AZ issue: Users can check which AZs support t3.micro in their account with: aws ec2 describe-instance-type-offerings \
--location-type availability-zone \
--filters Name=instance-type,Values=t3.micro \
--region us-east-1 \
--query "InstanceTypeOfferings[].Location" --output tableCould be worth adding as a helper script (e.g. |
7. Cleanup section — missing S3 bucket deletionThe cleanup instructions don't mention deleting the S3 bucket created for the MCP server Lambda package ( # Delete the MCP server Lambda package bucket
aws s3 rb s3://my-mcp-bucket-${AWS_ACCOUNT_ID} --force --region us-east-1 |
…Ops Agent - Moved to observability/ pillar, renamed directory per steering conventions - Converted to Python CDK with solution adoption tracking - Added deploy-all.ps1 for cross-platform support - Added ARCHITECTURE.md with component descriptions and data flow - All review feedback addressed (GitHub aws-samples#1-7, Slack aws-samples#9-12)
|
Hi @bquintas, Ready for re-review. Here's what changed since your last review: GitHub feedback addressed:
Slack feedback addressed:
Major changes:
Note on PR scope:
Thank you! |
Round 2 Review — Updated PRThanks for addressing the first round of feedback. The CDK migration, PowerShell script, ARCHITECTURE.md, and pillar move all look good. Found a few more issues during testing: 8.
|
…MCP cleanup, E2E tested on Linux + Windows - Created setup-devops-agent.ps1, inject-failure.ps1, cleanup.ps1, verify-cleanup.ps1 - Fixed: Operator URL constructed from agent space ID (aws-samples#9) - Fixed: Trust policy updated on re-run across regions (aws-samples#10) - Fixed: pip install uses venv (aws-samples#11) - Fixed: CDK bootstrap cleanup with y/N prompt (aws-samples#12) - Fixed: MCP registration field order and x-api-key header in README - Fixed: PowerShell BOM, file:// paths, JSON quoting, ErrorActionPreference - Added: verify-cleanup script, MCP deregister steps in cleanup - E2E tested: Linux (ap-southeast-2), Windows PowerShell (ap-southeast-2)
6b27bde to
ba53d65
Compare
…list-agent-spaces
ba53d65 to
9aece60
Compare
Round 3 Review — All 10 Scenarios Tested (Bash/macOS)Successfully deployed and tested all 10 failure scenarios end-to-end on macOS. The venv fix, trust policy update, and Operator URL construction all work correctly now. Great improvements from round 2. A few more items from testing: 13. SSH security group open to 0.0.0.0/0Both VPC security groups allow SSH from
The README should also note that in production, SSH should be locked to specific IPs and recommend SSM Session Manager as the preferred approach for CGW access. 14. (Nice-to-have)
|
bquintas
left a comment
There was a problem hiding this comment.
PR Review: Windows Testing — Critical Issues Found
⚠️ These scripts were not tested on Windows
I deployed and tested this demo end-to-end on a Windows Server EC2 instance (PowerShell 5.1, Python 3.13, Node.js, AWS CLI v2). Every .ps1 script failed on first run. The bugs found are all Windows-specific issues that are invisible when testing with pwsh on macOS/Linux — which is almost certainly how these scripts were developed.
The repo steering rules are explicit: "Must work on Windows (cmd/PowerShell primary shell)" and "Test on Windows before considering complete." PowerShell scripts exist in this repo specifically for Windows users. Testing them on macOS with pwsh does not validate Windows compatibility. Please re-test all .ps1 scripts on an actual Windows machine before resubmitting.
Bug 1: deploy-all.ps1 uses python3 — command does not exist on Windows
File: deploy-all.ps1, lines 51 and 65
Impact: Script fails immediately at Step 2 — CDK deployment never starts.
Problem: The script checks for python3 first (line 51) with a fallback to python, but then unconditionally calls python3 -m venv (line 65). On Windows, the standard Python installer registers the command as python, not python3. The python3 command is a Linux/macOS convention. The fallback check is dead code.
This is inconsistent with every other .ps1 script in the repo:
shared/scripts/check-prerequisites.ps1→python --versionshared/scripts/deploy-cdk.ps1→pip install(nopython3)operations-automation/anycompany-it-demo-portal/deploy-all.ps1→python scripts/seed-data.py
Fix: Lines 51-56, replace:
if (-not (Get-Command python3 -ErrorAction SilentlyContinue)) {
if (-not (Get-Command python -ErrorAction SilentlyContinue)) {
Write-Host "ERROR: Python 3 is required for CDK. Install from https://python.org" -ForegroundColor Red
exit 1
}
}with:
if (-not (Get-Command python -ErrorAction SilentlyContinue)) {
Write-Host "ERROR: Python 3 is required for CDK. Install from https://python.org" -ForegroundColor Red
exit 1
}Line 65, replace python3 -m venv with python -m venv.
Bug 2: CRLF line endings break all remote bash execution (Steps 7 and 8)
File: deploy-all.ps1, lines 216 and 274
Impact: Libreswan and GoBGP configuration fails completely — IPsec tunnels do not establish, BGP does not start. Every command on the CGW fails with $'\r': command not found and Invalid unit name "ipsec" escaped as "ipsec\x0d".
Problem: PowerShell on Windows produces CRLF (\r\n) in here-strings. When piped to the CGW via SSH ($script | ssh ... "sudo bash -s"), bash receives \r at the end of every line and treats it as part of the command. This breaks systemctl, sysctl, chmod, sleep, ip, and every other command in the script.
This is invisible on macOS because PowerShell on macOS uses LF natively.
Fix: Strip \r before piping. However, note that -replace "r", ""` cleans the string in memory but PowerShell may re-encode with CRLF when piping to external processes. The fix needs to be validated on Windows — options include:
- Strip on the receiving end:
$script | ssh ... "sed 's/\r$//' | sudo bash -s" - Write to a temp file with explicit LF encoding and pipe the file
- Use
[System.IO.File]::WriteAllText()to a temp file, thenGet-Content -Raw | ssh ...
The contributor must test whichever approach they choose on an actual Windows machine.
Bug 3: Step 5 silently continues when UserData fails
Impact: Misleading — the script prints "Packages installing... (30/30)" and continues to Steps 7-8 even when the CGW has no packages installed. This led to GoBGP being missing (/usr/local/bin/gobgpd: No such file or directory) with no clear error message.
Problem: The UserData script hit a transient yum cache error ([Errno 2] No such file or directory for a cached RPM). The USERDATA_COMPLETE marker was never written. The Step 5 loop timed out after 5 minutes and silently fell through. Steps 7-8 then tried to configure software that was never installed.
Suggested fixes:
- Step 5 should fail with a clear error if the loop exhausts all 30 iterations without finding
USERDATA_COMPLETE, not silently continue - The UserData script in the CDK stack should add
yum clean packagesbeforeyum installfor robustness, or retry on failure - Step 8 should verify
gobgpdexists before trying to configure it
Bug 4 (cosmetic but misleading): gobgp neighbor\r in Step 8 output
Impact: Even after applying CRLF fixes, the gobgp neighbor status check at the end of Step 8 displays Error: unknown command "neighbor\r". This led us to spend hours debugging what appeared to be a broken BGP configuration, when in fact BGP was working correctly.
Root cause: PowerShell pipe to SSH stdin may re-encode with CRLF regardless of in-memory string cleaning. The actual GoBGP configuration and service are fine — this is only the final status check command that displays incorrectly.
Fix: Same as Bug 2 — the CRLF-to-SSH piping approach needs a solution that is validated on Windows.
ASH Security Scan Results
Ran ASH v3.4.0 in local mode against the demo:
| Scanner | Result | Findings |
|---|---|---|
| bandit | ✅ PASSED | 0 |
| cdk-nag | ✅ PASSED | 0 |
| checkov | ✅ PASSED | 0 |
| semgrep | ✅ PASSED | 0 |
| npm-audit | ✅ PASSED | 0 |
| detect-secrets | api_key_name="vpn-devops-mcp-api-key" in mcp_server_stack.py:53 — parameter name flagged, not an actual secret |
No real security issues.
Summary
The demo architecture and bash scripts work well. The CDK infrastructure deploys correctly, the MCP server integration is solid, and the failure scenarios are well-designed. However, the PowerShell scripts were clearly not tested on Windows — they appear to have been generated from the bash versions (likely with AI assistance) and only validated on macOS with pwsh. All four bugs are Windows-specific and would have been caught in the first run on a Windows machine.
Required before merge: Re-test all .ps1 scripts (deploy-all.ps1, setup-devops-agent.ps1, inject-failure.ps1, cleanup.ps1) on an actual Windows machine end-to-end.
Note: I will be out of office and will not be available for follow-up reviews. Once the Windows fixes are applied and tested, please request another reviewer to validate and approve.
… throughput delay note - Bug 1: python3 → python in deploy-all.ps1 (Windows convention) - Bug 2+4: sed CRLF strip on receiving end for SSH pipes in deploy-all.ps1 - Bug 3: Step 5 fails on timeout, yum clean packages in UserData, verify gobgpd before Step 8 - aws-samples#13: SSH CIDR auto-detect (default), --ssh-cidr, --ssh-open flags - aws-samples#14: CloudWatch alarm state in inject-failure status command - aws-samples#15: Throughput alarm 5-6 min delay documented in README
- cleanup.sh/ps1: handle versioned CDK bootstrap bucket (delete all object versions and delete markers before bucket removal) - README: Node.js 18+ → 20+ (CDK CLI requirement) - README: add git as prerequisite - README: add 'verify alarms returned to OK' step between rollback and next scenario - cgw-scripts/list: group bgp-route-withdraw with throughput-degradation under 'Dedicated-Alarm Scenarios' to match README
… usage - Cleanup Step 2: make list-associations an explicit numbered step instead of a comment (both Bash and PowerShell sections) - Verify cleanup: use $(aws configure get region) instead of <region> placeholder, matching cleanup.sh example
b2086bf to
39fe4c8
Compare
- deploy-all.ps1: strip CRLF from cgw-scripts after scp (shebang fix) - deploy-all.ps1: skip venv creation if .venv already exists - cleanup.ps1: fix versioned bucket deletion (raw JSON, PS 5.1 compat) - cleanup.sh/ps1: fix 'Skipped' hint to show correct versioned bucket steps - inject-failure.ps1: update list grouping to match Dedicated-Alarm Scenarios - inject-failure.sh/ps1: add 5-minute throughput alarm note on inject - README: add PowerShell instructions for Step 3a and 3b (MCP deploy) - README: fix step numbering in cleanup (was duplicate aws-samples#3 and aws-samples#5) - README: fix troubleshooting Lambda name (CDK auto-generates)
- README: Bold PowerShell labels with (Windows) indicator for consistency - README: Replace redundant API key command with reference to step 3b - cleanup.sh: Print runnable bash commands for manual CDK bootstrap deletion - cleanup.ps1: Print runnable PowerShell commands for manual CDK bootstrap deletion - Both cleanup scripts: Add --no-cli-pager to all printed commands Tested E2E on AL2023 (ap-southeast-2): deploy, psk-mismatch inject/rollback, cleanup, verify-cleanup all passed.
- deploy-all.ps1: sudo sed redirect runs as ec2-user (permission denied on root-owned /opt/vpn-demo/). Use sudo tee to write as root. - README: PowerShell 7+ → 5.1+ (tested on Windows Server 2025, PS 5.1 works for all scripts)
Tester finding: Steps 1, 3, 4 in 'Run the Demo' only showed bash commands. Added PowerShell equivalents for Windows users, matching the pattern used in the Quick Start section.
|
Hi Team, The VPN demo is ready for review. Clarification: Earlier I had run the Windows PowerShell scripts on my windows laptop. My windows had Microsoft Python Stub that's why my windows test had worked locally. However, I have now changed the approach of testing my demo: Testing method: I spun up EC2 instances (Linux and Windows Server 2025) in my AWS Account and ran the demo end-to-end following the TESTER-ONBOARDING-GUIDE — fresh machines, no prior setup. Looking for the feedback from the review and address them accordingly. Please let me know if you have any questions-Thank you! |
cc1c3dd to
66cba4f
Compare
Summary
Adds a new demo under
observability/vpn-tunnel-investigation-devops-agent/that deploys a fully self-contained Site-to-Site VPN environment and lets users inject 10 realistic failure scenarios to watch Amazon DevOps Agent automatically investigate each one.What's included
Testing
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.