Commit 544e796

Add AWS investigation tooling with dual-stack support for eu-west-2 and us-east-1
1 parent 3b20e7f commit 544e796

7 files changed

Lines changed: 519 additions & 13 deletions

.buildkite/buildkite.md

Lines changed: 43 additions & 9 deletions
@@ -24,16 +24,50 @@ determine killed reason:
 
 dmesg | grep -E -i -B100 'killed process'
 
-connecting to AWS EC2 instance:
+AWS infrastructure:
 
-# check instance running
-aws ec2 get-console-output --instance-id i-04bc2edc8b4187ca8 --region us-east-1
+Build agents run in an AutoScaling Group managed by a Lambda-based autoscaler.
+There are two stacks — check the **current** stack first.
+See `.opencode/skills/aws-investigation/SKILL.md` for full details.
 
-# ensure keypair has correct permissions
-chmod 400 ~/Downloads/mockserver-buildkite.pem
+# authenticate (SSO, browser-based MFA)
+aws sso login --profile mockserver-build
 
-# connect to EC2 linux instance using keypair and domain name
-ssh -i ~/Downloads/mockserver-buildkite.pem ec2-user@ec2-34-204-42-237.compute-1.amazonaws.com
+Current: Terraform-managed (eu-west-2)
+---------------------------------------
 
-# connect to EC2 linux instance using keypair and ip address
-ssh -i ~/Downloads/mockserver-buildkite.pem ec2-user@52.91.13.160
+Managed by `terraform/buildkite-agents/`. ASG name is generated by Terraform.
+
+# find ASGs tagged with buildkite-mockserver
+aws autoscaling describe-auto-scaling-groups \
+  --region eu-west-2 --profile mockserver-build \
+  --query 'AutoScalingGroups[?contains(Tags[?Key==`Stack`].Value | [0], `buildkite-mockserver`)].{Name:AutoScalingGroupName,Desired:DesiredCapacity,Instances:Instances[*].{ID:InstanceId,State:LifecycleState}}'
+
+# list running EC2 instances (substitute ASG name from above)
+aws ec2 describe-instances \
+  --filters "Name=tag:aws:autoscaling:groupName,Values=<ASG_NAME>" \
+  --region eu-west-2 --profile mockserver-build \
+  --query 'Reservations[].Instances[].{ID:InstanceId,State:State.Name,Launch:LaunchTime}'
+
+# check console output for a specific instance (for debugging boot issues)
+aws ec2 get-console-output --instance-id <instance-id> --region eu-west-2 --profile mockserver-build \
+  --query 'Output' --output text
+
+# manually scale up agents if Lambda scaler is broken
+aws autoscaling set-desired-capacity \
+  --auto-scaling-group-name "<ASG_NAME>" \
+  --desired-capacity 2 --region eu-west-2 --profile mockserver-build
+
+Legacy: CloudFormation-managed (us-east-1) — being replaced
+-------------------------------------------------------------
+
+# check ASG status (instance count, health)
+aws autoscaling describe-auto-scaling-groups \
+  --auto-scaling-group-names "buildkite-AgentAutoScaleGroup-VGG28FR0DE6Q" \
+  --region us-east-1 --profile mockserver-build \
+  --query 'AutoScalingGroups[0].{Desired:DesiredCapacity,Instances:Instances[*].{ID:InstanceId,State:LifecycleState}}'
+
+# manually scale up agents if Lambda scaler is broken
+aws autoscaling set-desired-capacity \
+  --auto-scaling-group-name "buildkite-AgentAutoScaleGroup-VGG28FR0DE6Q" \
+  --desired-capacity 2 --region us-east-1 --profile mockserver-build
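
The tag-based `--query` filter in the eu-west-2 section is dense, so here is a rough TypeScript equivalent of what it selects, applied to the unfiltered JSON output of `describe-auto-scaling-groups`. This is only a sketch: `findAsgsByStackTag`, the interface names, and the sample data are hypothetical and not part of the commit. It reads only the `Tags`, `Instances`, `AutoScalingGroupName`, and `DesiredCapacity` fields.

```typescript
// Fields this sketch reads from `describe-auto-scaling-groups` JSON output.
interface AsgTag { Key: string; Value: string }
interface AsgInstance { InstanceId: string; LifecycleState: string }
interface AutoScalingGroup {
  AutoScalingGroupName: string
  DesiredCapacity: number
  Tags?: AsgTag[]
  Instances?: AsgInstance[]
}

// Projection matching the `{Name, Desired, Instances}` shape in the query.
interface AsgSummary {
  Name: string
  Desired: number
  Instances: { ID: string; State: string }[]
}

// Hypothetical helper: keep groups whose first `Stack` tag value contains `needle`.
// Groups without a `Stack` tag are skipped.
function findAsgsByStackTag(groups: AutoScalingGroup[], needle: string): AsgSummary[] {
  return groups
    .filter((g) => {
      const stackTags = (g.Tags || []).filter((t) => t.Key === "Stack")
      return stackTags.length > 0 && stackTags[0].Value.indexOf(needle) !== -1
    })
    .map((g) => ({
      Name: g.AutoScalingGroupName,
      Desired: g.DesiredCapacity,
      Instances: (g.Instances || []).map((i) => ({
        ID: i.InstanceId,
        State: i.LifecycleState,
      })),
    }))
}

// Made-up sample data, for illustration only.
const sampleGroups: AutoScalingGroup[] = [
  {
    AutoScalingGroupName: "example-asg-a",
    DesiredCapacity: 1,
    Tags: [{ Key: "Stack", Value: "buildkite-mockserver" }],
    Instances: [{ InstanceId: "i-example1", LifecycleState: "InService" }],
  },
  {
    AutoScalingGroupName: "example-asg-b",
    DesiredCapacity: 0,
    Tags: [{ Key: "Team", Value: "other" }],
  },
]

const matchingGroups = findAsgsByStackTag(sampleGroups, "buildkite-mockserver")
```

Filtering client-side like this can be handy when the same JSON needs to be inspected several ways without re-running the CLI call.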

.opencode/agents/debugger.md

Lines changed: 36 additions & 0 deletions
@@ -6,6 +6,7 @@ You are a debugger for the MockServer codebase. You investigate issues, errors,
 2. Correlate data across logs, CI builds, and code changes
 3. Identify root causes with evidence
 4. Provide actionable remediation steps
+5. Investigate AWS infrastructure issues (EC2 instances, AutoScaling Groups, Lambda)
 
 ## Investigation Approach
 
@@ -28,6 +29,7 @@ You are a debugger for the MockServer codebase. You investigate issues, errors,
 - Buildkite pipeline status at https://buildkite.com/mockserver
 - GitHub Actions workflow runs
 - Docker Hub image build status
+- AWS ASG and EC2 instance health (use `aws` CLI with `--profile mockserver-build`; check eu-west-2 first, then us-east-1 legacy)
 
 ### 5. Inspect Code
 - Stack traces and exception chains
@@ -62,6 +64,40 @@ You are a debugger for the MockServer codebase. You investigate issues, errors,
 - <log snippet, build output, command output>
 ```
 
+## AWS Infrastructure
+
+Build agents run in two stacks. Check the **current** stack first; fall back to **legacy** only if the current one has not been provisioned yet.
+
+### Current: Terraform-managed (eu-west-2)
+
+Managed by `terraform/buildkite-agents/`. ASG name is generated by Terraform — find it via tags:
+
+```bash
+# Find ASGs tagged with buildkite-mockserver
+aws autoscaling describe-auto-scaling-groups \
+  --region eu-west-2 --profile mockserver-build \
+  --query 'AutoScalingGroups[?contains(Tags[?Key==`Stack`].Value | [0], `buildkite-mockserver`)].{Name:AutoScalingGroupName,Desired:DesiredCapacity,Instances:Instances[*].{ID:InstanceId,State:LifecycleState,Health:HealthStatus}}'
+
+# Recent scaling activities (substitute ASG name from above)
+aws autoscaling describe-scaling-activities \
+  --auto-scaling-group-name "<ASG_NAME>" \
+  --region eu-west-2 --profile mockserver-build --max-items 10
+```
+
+### Legacy: CloudFormation-managed (us-east-1)
+
+Being replaced by the Terraform stack. May still be active during migration.
+
+```bash
+# Quick ASG health check
+aws autoscaling describe-auto-scaling-groups \
+  --auto-scaling-group-names "buildkite-AgentAutoScaleGroup-VGG28FR0DE6Q" \
+  --region us-east-1 --profile mockserver-build \
+  --query 'AutoScalingGroups[0].{Desired:DesiredCapacity,Min:MinSize,Max:MaxSize,Instances:Instances[*].{ID:InstanceId,State:LifecycleState,Health:HealthStatus}}'
+```
+
+For the full investigation workflow, load the `aws-investigation` skill.
+
 ## Important
 
 - Follow the evidence. Do not guess.
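
One way to read the output of the "Quick ASG health check" query is to compare the number of healthy, in-service instances with the desired capacity. The sketch below assumes the `{Desired, Min, Max, Instances}` shape that the query projects. `summarizeAsgHealth`, the type names, and the idea of a "shortfall" are illustrative assumptions, not something the commit defines, and the sample values are made up.

```typescript
// Shape produced by the `--query` projection in the legacy health check.
interface AsgInstanceView { ID: string; State: string; Health: string }
interface AsgView {
  Desired: number
  Min: number
  Max: number
  Instances: AsgInstanceView[] | null
}

interface AsgHealth {
  healthyInService: number
  shortfall: number
  needsAttention: boolean
}

// Hypothetical helper: count instances that are both InService and Healthy,
// then report how far that falls short of the desired capacity.
function summarizeAsgHealth(asg: AsgView): AsgHealth {
  const instances = asg.Instances || []
  const healthyInService = instances.filter(
    (i) => i.State === "InService" && i.Health === "Healthy"
  ).length
  const shortfall = Math.max(0, asg.Desired - healthyInService)
  return { healthyInService, shortfall, needsAttention: shortfall > 0 }
}

// Made-up sample: two instances desired, one still pending.
const sampleAsg: AsgView = {
  Desired: 2,
  Min: 0,
  Max: 4,
  Instances: [
    { ID: "i-example1", State: "InService", Health: "Healthy" },
    { ID: "i-example2", State: "Pending", Health: "Healthy" },
  ],
}

const sampleHealth = summarizeAsgHealth(sampleAsg)
```

A non-zero shortfall is not necessarily a fault, since instances may still be launching; the scaling activities query for the current stack is the natural next thing to check.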

.opencode/agents/pipeline-investigator.md

Lines changed: 12 additions & 0 deletions
@@ -83,6 +83,18 @@ gh run view {run_id} --repo mockserver/mockserver --log-failed
 | `Connection refused` or `BindException` | Port conflict | Check for port contention in tests |
 | `Timeout` | Operation stuck | Check for deadlocks, slow external deps |
 | `SNAPSHOT` dependency errors | Maven dep issue | Check artifact repository availability |
+| Build stuck in `scheduled` | Agent not running | Check AWS ASG via `/aws-investigation` |
+| Agent did not connect | Agent infrastructure | Check AWS ASG via `/aws-investigation` |
+
+## Agent Infrastructure
+
+If builds are stuck in `scheduled` state with no agent picking them up, the issue is likely with the AWS EC2 instances that run the Buildkite agents. Use `/aws-investigation` to check:
+
+- AutoScaling Group desired capacity and instance health
+- Autoscaling Lambda invocations and errors
+- EC2 instance status checks
+
+See `.opencode/skills/aws-investigation/SKILL.md` for full details.
 
 ## Important
 
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+---
+description: Investigate AWS infrastructure issues affecting Buildkite build agents
+agent: debugger
+subtask: true
+---
+Load the `aws-investigation` skill and execute it for the following request:
+
+$ARGUMENTS

.opencode/plugins/buildkite-status.ts

Lines changed: 37 additions & 4 deletions
@@ -67,22 +67,55 @@ export const BuildkiteStatus: Plugin = async ({ $, client, worktree }) => {
 }
 
 const failed = builds.filter((b) => b.state === "failed")
+const TEN_MINUTES = 10 * 60 * 1000
+const stuck = builds.filter(
+  (b) =>
+    b.state === "scheduled" &&
+    Date.now() - new Date(b.created_at).getTime() > TEN_MINUTES
+)
+
+const problems: string[] = []
+let toastMessage = ""
+let toastVariant: "warning" | "error" = "warning"
+
+if (stuck.length > 0) {
+  const stuckSummary = stuck
+    .map(
+      (b) =>
+        `- Build #${b.number} (${b.branch}): scheduled ${Math.round((Date.now() - new Date(b.created_at).getTime()) / 60000)}min ago — no agent\n ${b.web_url}`
+    )
+    .join("\n")
+  problems.push(
+    `${stuck.length} build(s) stuck waiting for an agent:\n\n${stuckSummary}`
+  )
+  toastMessage = `Buildkite: ${stuck.length} build(s) waiting for agent — check AWS ASG`
+  toastVariant = "error"
+}
 
 if (failed.length > 0) {
-  const summary = failed
+  const failedSummary = failed
     .map(
       (b) =>
         `- Build #${b.number} (${b.branch}): ${b.message?.split("\n")[0] || "no message"}\n ${b.web_url}`
     )
     .join("\n")
+  problems.push(
+    `${failed.length} of ${builds.length} recent builds failed:\n\n${failedSummary}`
+  )
+  if (!toastMessage) {
+    toastMessage = `Buildkite: ${failed.length} recent build(s) failing`
+  }
+}
+
+if (problems.length > 0) {
   await fs.writeFile(
     path.join(worktree, STATUS_FILE),
-    `# Buildkite Status\n\nChecked at ${new Date().toISOString()}\n\n${failed.length} of ${builds.length} recent builds failed:\n\n${summary}\n`
+    `# Buildkite Status\n\nChecked at ${new Date().toISOString()}\n\n${problems.join("\n\n---\n\n")}\n`
   )
   await client.tui.showToast({
     body: {
-      message: `Buildkite: ${failed.length} recent build(s) failing`,
-      variant: "warning",
+      message: toastMessage,
+      variant: toastVariant,
     },
   })
 } else {
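
The stuck-build check and the toast priority in this hunk can be pulled out into pure functions, which makes the ten-minute threshold testable without calling `Date.now()`. This is a minimal sketch, not the plugin's actual structure: `findStuckBuilds`, `chooseToast`, the injected `now` parameter, and the trimmed `Build` interface are all hypothetical names introduced here.

```typescript
// Only the fields the stuck-build check reads; the real build objects carry more.
interface Build {
  number: number
  branch: string
  state: string
  created_at: string
}

const TEN_MINUTES_MS = 10 * 60 * 1000

// Hypothetical helper: a build counts as stuck when it is still `scheduled`
// and was created strictly more than `thresholdMs` before `now`.
function findStuckBuilds(
  builds: Build[],
  now: number,
  thresholdMs: number = TEN_MINUTES_MS
): Build[] {
  return builds.filter(
    (b) =>
      b.state === "scheduled" &&
      now - new Date(b.created_at).getTime() > thresholdMs
  )
}

type ToastChoice = { kind: "stuck" | "failed"; variant: "warning" | "error" }

// Hypothetical helper mirroring the priority in the diff: stuck builds win and
// raise an error toast; failures alone produce a warning; otherwise no toast.
function chooseToast(stuckCount: number, failedCount: number): ToastChoice | null {
  if (stuckCount > 0) return { kind: "stuck", variant: "error" }
  if (failedCount > 0) return { kind: "failed", variant: "warning" }
  return null
}

// Made-up sample data, evaluated at a fixed clock of 00:20 UTC.
const sampleNow = Date.parse("2024-01-01T00:20:00Z")
const sampleBuilds: Build[] = [
  { number: 1, branch: "main", state: "scheduled", created_at: "2024-01-01T00:05:00Z" },
  { number: 2, branch: "main", state: "scheduled", created_at: "2024-01-01T00:15:00Z" },
  { number: 3, branch: "main", state: "scheduled", created_at: "2024-01-01T00:10:00Z" },
  { number: 4, branch: "dev", state: "failed", created_at: "2024-01-01T00:00:00Z" },
]

const sampleStuck = findStuckBuilds(sampleBuilds, sampleNow)
```

With this sample data only build 1 is stuck. Build 3 sits exactly on the ten-minute boundary and is excluded because the comparison is strict.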
