Skip to content

Commit f5f8d07

Browse files
authored
feat(bench): fixing bench running: added step verify required repo variables, added retry for rate limiting and our of credits (Tracer-Cloud#2708)
* added step Verify required repo variables * fix(bench) fixed permission error in dockerfile * added system prompt and tools schema for messages tokens estimation * added separated configs for each LLM model * handled rate limiting and retry * stop bench running if our of LLM credits
1 parent 1bbe4c8 commit f5f8d07

28 files changed

Lines changed: 2803 additions & 48 deletions
Lines changed: 128 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,128 @@
1+
name: Benchmark image — promote tag to task definition
2+
3+
# Manually-triggered workflow that runs `terraform apply -var=image_tag=<TAG>`
4+
# in infra/bench/ to register a new ECS task definition revision pointing at
5+
# the chosen ECR image. This is the privileged "deploy" step that comes
6+
# between an image push (automatic) and a bench run (manual).
7+
#
8+
# Why not auto-promote on every image push? An image build is a code-change
9+
# event. A task-def update is a deploy event. Decoupling them lets you
10+
# stage many images and choose deliberately which one production runs.
11+
#
12+
# Trigger from the GitHub UI:
13+
# Actions → "Benchmark image — promote tag to task definition" → Run
14+
#
15+
# Pre-reqs (one-time):
16+
# - infra/bench/ Terraform applied at least once
17+
# - Repo secrets seeded into AWS Secrets Manager
18+
# - Repo vars set (AWS_ACCOUNT_ID etc., see AWS_BENCH_SETUP.md step 4)
19+
# - The opensre-bench-github-actions OIDC role MUST be granted these
20+
# additional permissions before this workflow can succeed:
21+
# - ecs:RegisterTaskDefinition
22+
# - ecs:DescribeTaskDefinition
23+
# - iam:PassRole on the bench task + execution roles
24+
# - s3:GetObject, s3:PutObject on the Terraform state bucket
25+
# - dynamodb:GetItem, PutItem, DeleteItem on the lock table
26+
# The existing role currently has RunTask + Seed only. Extending it is
27+
# a 10-line addition to infra/bench/iam_oidc.tf — see the comment in
28+
# that file. This workflow will FAIL with AccessDenied until that
29+
# IAM diff is applied.
30+
31+
on:
32+
workflow_dispatch:
33+
inputs:
34+
image_tag:
35+
description: 'ECR image tag to promote (e.g. 3792493)'
36+
required: true
37+
38+
permissions:
39+
contents: read
40+
id-token: write # required for AWS OIDC role assumption
41+
42+
concurrency:
43+
group: benchmark-promote-image
44+
cancel-in-progress: false
45+
46+
jobs:
47+
promote:
48+
name: terraform apply image_tag=${{ inputs.image_tag }}
49+
if: github.repository == 'Tracer-Cloud/opensre'
50+
runs-on: ubuntu-latest
51+
timeout-minutes: 10
52+
53+
env:
54+
AWS_REGION: us-east-1
55+
56+
steps:
57+
- name: Verify required repo variables
58+
env:
59+
AWS_ACCOUNT_ID: ${{ vars.AWS_ACCOUNT_ID }}
60+
run: |
61+
if [ -z "${AWS_ACCOUNT_ID:-}" ]; then
62+
echo "::error::Missing repo variable AWS_ACCOUNT_ID. See AWS_BENCH_SETUP.md."
63+
exit 1
64+
fi
65+
66+
- uses: actions/checkout@v5
67+
68+
- name: Configure AWS credentials (OIDC role assumption)
69+
uses: aws-actions/configure-aws-credentials@v4
70+
with:
71+
role-to-assume: arn:aws:iam::${{ vars.AWS_ACCOUNT_ID }}:role/opensre-bench-github-actions
72+
role-session-name: bench-promote-${{ github.run_id }}
73+
aws-region: us-east-1
74+
75+
- name: Verify image tag exists in ECR
76+
# Fail loudly if the operator typos the tag, before Terraform tries
77+
# to register a task definition pointing at a missing image.
78+
env:
79+
IMAGE_TAG: ${{ inputs.image_tag }}
80+
run: |
81+
if ! aws ecr describe-images \
82+
--repository-name opensre-bench \
83+
--image-ids imageTag="$IMAGE_TAG" \
84+
>/dev/null 2>&1; then
85+
echo "::error::Image tag $IMAGE_TAG not found in ECR repo opensre-bench."
86+
echo "::error::Push it first via 'Benchmark image — build + push to ECR'."
87+
exit 1
88+
fi
89+
90+
- uses: hashicorp/setup-terraform@v3
91+
with:
92+
terraform_version: 1.7.5
93+
94+
- name: Terraform init
95+
working-directory: infra/bench
96+
run: terraform init -input=false
97+
98+
- name: Terraform apply
99+
# Plan is captured in the workflow log; review it in the run page.
100+
# -auto-approve is intentional — this workflow IS the human approval
101+
# (the operator triggered it manually with a specific tag).
102+
working-directory: infra/bench
103+
env:
104+
IMAGE_TAG: ${{ inputs.image_tag }}
105+
run: |
106+
terraform apply -input=false -auto-approve -var="image_tag=$IMAGE_TAG"
107+
108+
- name: Surface the new task definition revision in the job summary
109+
working-directory: infra/bench
110+
env:
111+
IMAGE_TAG: ${{ inputs.image_tag }}
112+
run: |
113+
TASK_DEF_ARN=$(terraform output -raw task_definition_arn)
114+
IMAGE_URI=$(aws ecs describe-task-definition \
115+
--task-definition "$TASK_DEF_ARN" \
116+
--query 'taskDefinition.containerDefinitions[0].image' \
117+
--output text)
118+
{
119+
echo "## Image promoted"
120+
echo ""
121+
echo "- Promoted tag: \`$IMAGE_TAG\`"
122+
echo "- New task definition ARN: \`$TASK_DEF_ARN\`"
123+
echo "- Image now in task definition: \`$IMAGE_URI\`"
124+
echo ""
125+
echo "### Next step"
126+
echo ""
127+
echo "Trigger **Benchmark — run on Fargate** to launch a run against this image."
128+
} >> "$GITHUB_STEP_SUMMARY"

.github/workflows/benchmark-run.yml

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -60,6 +60,34 @@ jobs:
6060
AWS_REGION: us-east-1
6161

6262
steps:
63+
- name: Verify required repo variables
64+
# Fail loudly BEFORE the AWS auth step if any required repo variable
65+
# is missing. Without this, an unset var would surface downstream as
66+
# an opaque error like "Task Definition can not be blank" from
67+
# describe-task-definition. See AWS_BENCH_SETUP.md step 4 for how
68+
# to set these.
69+
env:
70+
AWS_ACCOUNT_ID: ${{ vars.AWS_ACCOUNT_ID }}
71+
BENCH_ECS_CLUSTER: ${{ vars.BENCH_ECS_CLUSTER }}
72+
BENCH_TASK_DEFINITION_FAMILY: ${{ vars.BENCH_TASK_DEFINITION_FAMILY }}
73+
BENCH_SUBNET_IDS: ${{ vars.BENCH_SUBNET_IDS }}
74+
BENCH_SECURITY_GROUP_ID: ${{ vars.BENCH_SECURITY_GROUP_ID }}
75+
run: |
76+
missing=()
77+
for var in AWS_ACCOUNT_ID BENCH_ECS_CLUSTER BENCH_TASK_DEFINITION_FAMILY \
78+
BENCH_SUBNET_IDS BENCH_SECURITY_GROUP_ID; do
79+
if [ -z "${!var:-}" ]; then
80+
missing+=("$var")
81+
fi
82+
done
83+
if [ "${#missing[@]}" -gt 0 ]; then
84+
echo "::error::Missing repo variable(s): ${missing[*]}"
85+
echo "::error::Set them under Settings → Secrets and variables → Actions → Variables."
86+
echo "::error::Values come from \`cd infra/bench && terraform output\` — see tests/benchmarks/AWS_BENCH_SETUP.md step 4."
87+
exit 1
88+
fi
89+
echo "All 5 required repo variables are set."
90+
6391
- name: Configure AWS credentials (OIDC role assumption)
6492
uses: aws-actions/configure-aws-credentials@v4
6593
with:

app/agent/investigation.py

Lines changed: 93 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,18 @@
2525
_TOOL_EXECUTOR_WORKERS = 10
2626
_UNSET: object = object() # sentinel distinguishing "not yet started" from a None tool result
2727

28+
# Defensive context-window ceiling. Below this we never trim; above this we
29+
# drop the oldest tool_use/tool_result pair until back under the ceiling.
30+
#
31+
# Anthropic's 200k prompt limit is the hard cap. The estimator at
32+
# ``_estimate_message_tokens`` covers messages + system + tool schemas
33+
# (all three count toward the limit). 170k ceiling leaves ~30k headroom
34+
# for the response. ratio=0.40 absorbs JSON-structural overhead in tool
35+
# payloads — empirically tuned from overflow logs where Anthropic landed
36+
# at 0.32–0.40 tokens/char for opensre's tool-result mix.
37+
_TOKEN_BUDGET_CEILING = 170_000
38+
_TOKENS_PER_CHAR = 0.40
39+
2840
# Maps alert_source → tool source keys. Tools from these sources are auto-called
2941
# before the LLM loop starts when the alert source is known.
3042
_ALERT_SOURCE_TO_TOOL_SOURCES: dict[str, list[str]] = {
@@ -167,6 +179,7 @@ def _record_tool_end(tc: ToolCall, output: Any) -> None:
167179
for iteration in range(MAX_INVESTIGATION_LOOPS):
168180
logger.debug("[agent] iteration=%d", iteration)
169181
_emit("llm_start", {"iteration": iteration})
182+
_enforce_context_budget(messages, system=system, tools=tool_schemas)
170183
try:
171184
response = llm.invoke(messages, system=system, tools=tool_schemas)
172185

@@ -263,6 +276,86 @@ def _record_tool_end(tc: ToolCall, output: Any) -> None:
263276
InvestigationAgent = ConnectedInvestigationAgent
264277

265278

279+
def _estimate_message_tokens(
280+
messages: list[dict[str, Any]],
281+
*,
282+
system: str | None = None,
283+
tools: list[dict[str, Any]] | None = None,
284+
) -> int:
285+
"""Cheap upper-bound token estimate covering everything Anthropic sees.
286+
287+
Anthropic counts ``messages`` + ``system`` + ``tools`` toward the 200k
288+
prompt limit. Earlier versions counted only ``messages`` and trimmed
289+
aggressively while system + tools (tens of thousands of tokens for
290+
opensre's 100+ tool registry) silently pushed us over the line.
291+
"""
292+
total = 0
293+
for message in messages:
294+
content = message.get("content", "")
295+
if isinstance(content, str):
296+
total += int(len(content) * _TOKENS_PER_CHAR)
297+
elif isinstance(content, list):
298+
for block in content:
299+
if isinstance(block, dict):
300+
total += int(len(json.dumps(block, default=str)) * _TOKENS_PER_CHAR)
301+
elif isinstance(block, str):
302+
total += int(len(block) * _TOKENS_PER_CHAR)
303+
if system:
304+
total += int(len(system) * _TOKENS_PER_CHAR)
305+
if tools:
306+
for schema in tools:
307+
total += int(len(json.dumps(schema, default=str)) * _TOKENS_PER_CHAR)
308+
return total
309+
310+
311+
def _trim_oldest_tool_pair(messages: list[dict[str, Any]]) -> bool:
312+
"""Drop the oldest assistant tool_use message together with the
313+
immediate next user message carrying its tool_results. Anthropic
314+
requires every ``tool_use`` block to be followed by a matching
315+
``tool_result`` block, so the pair must be removed together to keep
316+
the conversation valid.
317+
318+
Returns True if a pair was dropped, False if nothing trimmable
319+
remains (e.g. only the initial user prompt is left).
320+
"""
321+
for index, message in enumerate(messages):
322+
if message.get("role") != "assistant":
323+
continue
324+
content = message.get("content", [])
325+
if not isinstance(content, list):
326+
continue
327+
has_tool_use = any(
328+
isinstance(block, dict) and block.get("type") == "tool_use" for block in content
329+
)
330+
if not has_tool_use:
331+
continue
332+
# Drop the assistant turn AND its paired user turn (the tool_results).
333+
# If the user turn isn't present (e.g. truncated mid-iteration),
334+
# del messages[i:i+2] safely just drops the assistant turn.
335+
del messages[index : index + 2]
336+
return True
337+
return False
338+
339+
340+
def _enforce_context_budget(
341+
messages: list[dict[str, Any]],
342+
*,
343+
system: str | None = None,
344+
tools: list[dict[str, Any]] | None = None,
345+
) -> None:
346+
"""Trim oldest tool pairs until prompt fits under the budget ceiling.
347+
348+
No-op on the happy path: the estimate covers messages + system + tools
349+
in one pass and returns under the ceiling for normal investigations.
350+
Only fires on long CloudOpsBench cases where unbounded tool history
351+
has pushed the prompt past the model's limit.
352+
"""
353+
while _estimate_message_tokens(messages, system=system, tools=tools) > _TOKEN_BUDGET_CEILING:
354+
if not _trim_oldest_tool_pair(messages):
355+
return
356+
logger.warning("[agent] trimmed oldest tool pair to fit context budget")
357+
358+
266359
def _degraded_investigation_from_llm_failure(
267360
failure: LLMInvokeFailure,
268361
*,

app/agent/llm_invoke_errors.py

Lines changed: 14 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -64,12 +64,25 @@ def _looks_like_timeout(exc: BaseException) -> bool:
6464

6565

6666
def classify_llm_invoke_failure(exc: BaseException) -> LLMInvokeFailure | None:
67-
"""Return a structured failure when *exc* is a known operational LLM error."""
67+
"""Return a structured failure when *exc* is a known operational LLM error.
68+
69+
Returns ``None`` to signal the caller should re-raise. In particular,
70+
:class:`LLMCreditExhaustedError` is intentionally NOT classified — it
71+
represents a non-recoverable billing condition that the bench runner
72+
(and production agent) must halt on, not wrap into a degraded result.
73+
"""
6874
from app.integrations.llm_cli.errors import (
6975
CLIAuthenticationRequired,
7076
CLIInterruptedError,
7177
CLITimeoutError,
7278
)
79+
from app.utils.llm_retry import LLMCreditExhaustedError
80+
81+
# Fatal — propagate to the runner / operator. Do NOT wrap into the
82+
# generic "rate-limited" classification (which the text branch below
83+
# would otherwise match against "credit balance too low" / "quota").
84+
if isinstance(exc, LLMCreditExhaustedError):
85+
return None
7386

7487
if isinstance(exc, CLIAuthenticationRequired):
7588
return LLMInvokeFailure(

0 commit comments

Comments
 (0)