feat(observability): inject solution into outbound AWS SDK User-Agent (#319)#338
feat(observability): inject solution into outbound AWS SDK User-Agent (#319)#338scottschreckengaust wants to merge 8 commits into
Conversation
app/uksb-wt64nei4u6/{STACKNAME} + md/uksb-wt64nei4u6#agent[#{TRACE}],
emitted via botocore's verbatim user_agent_extra path (the sanitizing
app-id field would mangle the '/' separator). Static part baked at
construction; optional #{TRACE} appended per-request by a before-send
handler so cached clients keep their connection pool. Wire-capture
tests assert the emitted header via a short-circuiting before-send stub.
Task 1 of PR #338 plan. Part of #319
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
a8425e7 to
d126f1c
Compare
…ites
Session-level user_agent_extra on both the scoped (refreshable) and
plain singleton sessions covers every tenant_client/tenant_resource
caller; new platform_client() helper carries the UA + trace appender
for the ambient-chain call sites (logs x5, secretsmanager x2,
bedrock-agentcore x1) that bypass the session by design.
configure_session() doubles the task id as the UA trace handle.
The trace appender splices #{TRACE} onto the md/ segment (not the
header end) because boto3 renders 'Botocore/x.y.z' after the
session-level extra.
Task 2 of PR #338 plan. Part of #319
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…racking
abcaUserAgent() bakes app/uksb-wt64nei4u6/{STACKNAME} (single-element
customUserAgent pair — the name path preserves '/') and
['md/uksb-wt64nei4u6', component] (the '#' comes from the SDK's own
name#value join; a '#' inside a name would be escaped to '-').
Component label from ABCA_COMPONENT env since shared modules bundle
into multiple Lambdas. withAbcaTrace() adds a middleware after
getUserAgentMiddleware that splices #{TRACE} onto the md/ segment in
both UA headers per-request; no-ops on bare-object constructor mocks.
Wire-capture tests run the full middleware stack against a stub
requestHandler and assert the emitted headers for trace-present,
trace-absent, and hostile-input cases.
Task 3 of PR #338 plan. Part of #319
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… invocation All ~60 client instantiations now pass abcaUserAgent() and are wrapped withAbcaTrace() (document clients instrument the inner DynamoDBClient — shared middleware stack). Handler entry points set the trace to their freshly minted ulid() request id; orchestrate-task uses the task id. Stream consumers (fanout, approval-metrics) carry no trace by design. No behavior change beyond the UA headers; all 2031 existing tests pass unmodified (withAbcaTrace no-ops on bare-object constructor mocks). Task 4 of PR #338 plan. Part of #319 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…face Component labels: 'api' (task-api commonEnv), 'webhook' (webhookEnv + slack/linear/github-screenshot integrations), 'orchestr' (orchestrator, reconcilers, pending-upload cleanup, fanout + approval-metrics stream consumers). Agent compute (AgentCore runtime env, ECS container env) gets ABCA_STACK_NAME only — agent/src/ua.py hardcodes its label. approval-metrics-publisher previously had no environment block; one is added. Template assertions verify each surface's label (#319 acceptance criterion). Task 5 of PR #338 plan. Part of #319 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
CLI-local ua.ts mirror (same convention as types.ts — CLI can't import from CDK). Component 'cli', stack name from new optional stack_name config field (configure --stack-name), trace = process pid set once at startup. All Cognito/SecretsManager/CloudFormation/DynamoDB client sites migrated. Wire-capture test mirrors the CDK one via a real Cognito client + stub requestHandler; auth.test.ts asserts the constructor receives the customUserAgent. Task 6 of PR #338 plan. Part of #319 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
scottschreckengaust
left a comment
There was a problem hiding this comment.
Reviewing
| from datetime import datetime, timedelta | ||
|
|
||
| import boto3 | ||
| import boto3 # noqa: F401 — availability probe; client built via platform_client |
There was a problem hiding this comment.
Is the import boto3 still here once and ignored via "noqa" to ensure it is available? Why not let it fail at the platform_client?
Recommend a near-total simplification — verified against the installed botocore 1.43.9 + JS v3Both SDKs honor
So Prescription — collapse to native config
Net: ~1.5k lines / 83 files → env-var threading + one param + docs/tests. Cost of the simplification (honest trade-offs)
Full verified SDK snippets + the kept-vs-dropped breakdown are in #319. The research and implementation here are high-quality and correct — this is a scope call, not a defect. The |
|
Correction — you're right,
Three tiers — keep the first two, drop the third:
So what actually deletes is the trace plane (per-request appender, Full kept-vs-dropped detail in #319. |
|
Confirmed: Verified both SDKs: Final shape:
Comment rewrite required: every |
|
Companion draft posted: #345 — the simplified native-app-id alternative to this PR, per the discussion above (use |
Empty start-signal commit. Execution plan lives in the draft PR body; implementation commits follow. Part of #319 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
app/uksb-wt64nei4u6/{STACKNAME} + md/uksb-wt64nei4u6#agent[#{TRACE}],
emitted via botocore's verbatim user_agent_extra path (the sanitizing
app-id field would mangle the '/' separator). Static part baked at
construction; optional #{TRACE} appended per-request by a before-send
handler so cached clients keep their connection pool. Wire-capture
tests assert the emitted header via a short-circuiting before-send stub.
Task 1 of PR #338 plan. Part of #319
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ites
Session-level user_agent_extra on both the scoped (refreshable) and
plain singleton sessions covers every tenant_client/tenant_resource
caller; new platform_client() helper carries the UA + trace appender
for the ambient-chain call sites (logs x5, secretsmanager x2,
bedrock-agentcore x1) that bypass the session by design.
configure_session() doubles the task id as the UA trace handle.
The trace appender splices #{TRACE} onto the md/ segment (not the
header end) because boto3 renders 'Botocore/x.y.z' after the
session-level extra.
Task 2 of PR #338 plan. Part of #319
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…racking
abcaUserAgent() bakes app/uksb-wt64nei4u6/{STACKNAME} (single-element
customUserAgent pair — the name path preserves '/') and
['md/uksb-wt64nei4u6', component] (the '#' comes from the SDK's own
name#value join; a '#' inside a name would be escaped to '-').
Component label from ABCA_COMPONENT env since shared modules bundle
into multiple Lambdas. withAbcaTrace() adds a middleware after
getUserAgentMiddleware that splices #{TRACE} onto the md/ segment in
both UA headers per-request; no-ops on bare-object constructor mocks.
Wire-capture tests run the full middleware stack against a stub
requestHandler and assert the emitted headers for trace-present,
trace-absent, and hostile-input cases.
Task 3 of PR #338 plan. Part of #319
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… invocation All ~60 client instantiations now pass abcaUserAgent() and are wrapped withAbcaTrace() (document clients instrument the inner DynamoDBClient — shared middleware stack). Handler entry points set the trace to their freshly minted ulid() request id; orchestrate-task uses the task id. Stream consumers (fanout, approval-metrics) carry no trace by design. No behavior change beyond the UA headers; all 2031 existing tests pass unmodified (withAbcaTrace no-ops on bare-object constructor mocks). Task 4 of PR #338 plan. Part of #319 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…face Component labels: 'api' (task-api commonEnv), 'webhook' (webhookEnv + slack/linear/github-screenshot integrations), 'orchestr' (orchestrator, reconcilers, pending-upload cleanup, fanout + approval-metrics stream consumers). Agent compute (AgentCore runtime env, ECS container env) gets ABCA_STACK_NAME only — agent/src/ua.py hardcodes its label. approval-metrics-publisher previously had no environment block; one is added. Template assertions verify each surface's label (#319 acceptance criterion). Task 5 of PR #338 plan. Part of #319 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
CLI-local ua.ts mirror (same convention as types.ts — CLI can't import from CDK). Component 'cli', stack name from new optional stack_name config field (configure --stack-name), trace = process pid set once at startup. All Cognito/SecretsManager/CloudFormation/DynamoDB client sites migrated. Wire-capture test mirrors the CDK one via a real Cognito client + stub requestHandler; auth.test.ts asserts the constructor receives the customUserAgent. Task 6 of PR #338 plan. Part of #319 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
18ea957 to
686931b
Compare
Closes #319
What this builds
Every outbound AWS API call from ABCA carries two solution-attribution User-Agent segments:
Both are emitted via the raw/verbatim user-agent path (botocore
user_agent_extra/ JScustomUserAgent) — not the sanitizing app-id config field, which would mangle the/separator (/is not in the SDK app-id charset; verified against installed botocore 1.43.9useragent.pyand@aws-sdk/middleware-user-agent). Static parts are baked at client construction;#{TRACE}is appended per-request by a botocorebefore-sendhandler / JS middleware so cached clients and pooled connections are never reissued.Verified SDK mechanics this design relies on
customUserAgentpair encoding (middleware-user-agent/dist-cjs/index.js): a pair['app/uksb-wt64nei4u6/STACK']survives intact — the name is split on/, each part escaped (all our chars are token-safe), and rejoined with/. A pair['md/uksb-wt64nei4u6', 'api']rendersmd/uksb-wt64nei4u6#api(#is legal in values, not names).customUserAgentlands in bothuser-agentandx-amz-user-agent(node runtime), always at the end of each header — which is what lets the trace middleware do a safe suffix-append.getUserAgentMiddleware, stepbuild, prioritylow— our trace middleware is addedaddRelativeTo(..., { relation: 'after', toMiddleware: 'getUserAgentMiddleware' }).DynamoDBDocumentClient.from(inner)shares the inner client'smiddlewareStackandconfig— instrumenting the innerDynamoDBClientcovers the document client.user_agent_extralast (_build_extrais the final component into_string()), so abefore-sendhandler can append#{TRACE}to the end ofrequest.headers['User-Agent']and it lands exactly on themd/segment.before-sendhandlers can also return a stubbedAWSResponseto short-circuit the HTTP send — that's the wire-capture test mechanism (no network, no moto).Identity & trace mapping
{IDENTIFIER}{TRACE}handleagentconfigure_session()apiABCA_COMPONENTenvulid()requestId handlers already mintwebhookABCA_COMPONENTenvorchestrABCA_COMPONENTenvevent.task_id(orchestrate-task); else absentbgagentCLIcliprocess.pid, set once at startup{IDENTIFIER}comes fromABCA_COMPONENTenv (not per-call-site parameters) because shared modules (shared/orchestrator.ts,shared/create-task-core.ts, …) construct module-level clients and are bundled into multiple Lambdas — the env var gives each Lambda its own label with one source of truth in CDK.{STACKNAME}comes from a newABCA_STACK_NAMEenv var (Lambdas, agent container) / optionalstack_nameCLI config field. When absent (local dev, unconfigured CLI), theapp/segment is omitted entirely — never a placeholder.Trace ambient context: module-level variable (with the JS/Lambda single-concurrent-invocation guarantee, and matching the agent's existing
aws_session._tagsmodule-state pattern) rather thancontextvar/AsyncLocalStorage.server.py:300already documents whyContextVaris a per-thread trap in the agent (task runs on a spawned thread). The issue allows any ambient mechanism.Solution-ID constant placement: per-surface literal constant (
agent/src/ua.py,cdk/src/handlers/shared/ua.ts,cli/src/ua.ts) with cross-referencing comments — NOTcontracts/constants.json. That contract's schema and its drift checker (scripts/check-constants-sync.ts) are numeric-only (min/max/defaultinvariants policed inagent/src/policy.py); the existing repo precedent foruksb-wt64nei4u6is a literal (cdk/src/main.ts:41). Each surface's tests assert the exact emitted string, which is the de-facto drift guard.Shared sanitization spec (implemented identically in Python and TS, with identical test vectors)
Execution plan
Task 1 — Python:
agent/src/ua.py+ tests (TDD)Create:
agent/src/ua.py,agent/tests/test_ua.pySOLUTION_ID = "uksb-wt64nei4u6",COMPONENT = "agent",STACK_NAME_ENV = "ABCA_STACK_NAME".sanitize_ua_value(raw: str) -> strper the shared spec.static_user_agent_extra() -> str—app/uksb-wt64nei4u6/{stack(≤34)} md/uksb-wt64nei4u6#agent, dropping theapp/segment whenABCA_STACK_NAMEis unset/empty.set_trace(handle: str | None)/get_trace()— lock-guarded module state; sanitized on read.register_trace_appender(events)— registers abefore-send.*handler that appends#{trace}torequest.headers['User-Agent']when a trace is set (no-op otherwise; never reconstructs the client). Works for bothclient.meta.eventsand botocore-session-level registration (sessions propagate to all derived clients and resources).client_config() -> botocore.config.Config—Config(user_agent_extra=static_user_agent_extra())for direct-client call sites./,#, non-ASCII →-; 50-char budget; clip-after-sanitize ordering; trace-absent ⇒ no trailing#).register_trace_appender+ aregister_last('before-send.*', …)stub returning a cannedAWSResponse. Two calls on the same client under differentset_trace()values assert different#{TRACE}suffixes, intactapp/.../slash, and intactmd/...#agent— straight from the capturedUser-Agentheader.Task 2 — Python: wire into
aws_session.pyand the 8 bypass sitesModify:
agent/src/aws_session.py,agent/src/config.py:46,110,agent/src/shell.py:67,agent/src/telemetry.py:62,170,agent/src/server.py:157,184,agent/src/memory.py:43; tests inagent/tests/test_aws_session.py_build_scoped_session()(aws_session.py:160): setbotocore_session.user_agent_extra = ua.static_user_agent_extra()andua.register_trace_appender(...)on the botocore session we already construct. Same for the plain-session path (aws_session.py:213) by constructing the botocore session explicitly. Coverstenant_client/tenant_resource(progress_writer, task_state, nudge_reader, attachments, deliverers, telemetry S3) and the singleton's pool is untouched across trace changes.tenant_client/tenant_resourcegap (aws_session.py:240/250): these bypass the session and callboto3.client(...)directly (kept that way deliberately for test transparency) — applyconfig=ua.client_config()+ua.register_trace_appender(client.meta.events)there too, merging with any caller-suppliedconfigkwarg.aws_session.platform_client(service, **kwargs)—boto3.client(service, config=ua.client_config(), **kwargs)+ua.register_trace_appender(client.meta.events). Migrate the 8 directboto3.client(...)sites (logs ×5, secretsmanager ×2, bedrock-agentcore ×1) to it. The ambient STS client at aws_session.py:142 getsconfig=ua.client_config()inline (its events come from the default session). Existing tests patchboto3.clientglobally — MagicMocks tolerate the extraconfigkwarg and the events registration.configure_session()(aws_session.py:81) additionally callsua.set_trace(task_id)— the agent's trace is live for the whole task without touchingpipeline.pycall sites beyond what already exists (pipeline.py:634).reset_session_cache()also clears the trace (test hygiene).mise //agent:quality(lint, typecheck, tests). Commit.Task 3 — JS handlers:
cdk/src/handlers/shared/ua.ts+ wire tests (TDD)Create:
cdk/src/handlers/shared/ua.ts,cdk/test/handlers/shared/ua.test.tssanitizeUaValue()mirroring Task 1 (same test vectors).abcaUserAgent(): { customUserAgent: UserAgentPair[] }—[['app/uksb-wt64nei4u6/' + stack]](only whenABCA_STACK_NAMEset) +[['md/uksb-wt64nei4u6', component]]with component fromABCA_COMPONENT(defaultapi). Spread into client config:new DynamoDBClient({ ...abcaUserAgent() }).setAbcaTrace(handle?: string)module state;withAbcaTrace<T>(client: T): T— adds the suffix-append middleware aftergetUserAgentMiddleware, mutating bothuser-agentandx-amz-user-agentonly when they end with the staticmd/segment. Defensive no-op whenmiddlewareStackis absent — required because ~40 existing test files mock client constructors asjest.fn(() => ({}))and module-level instrumentation must not crash under those mocks (real clients always have a stack; this guard is test-environment-only and documented as such).DynamoDBClientwith a stubrequestHandlerrecordingrequest.headersand returning a cannedHttpResponse. Asserts: both segments present and intact in the emitted header (literal/survived — i.e. NOT routed through the app-id field); two sends on the same client under differentsetAbcaTrace()values emit different suffixes; trace-absent ⇒ exactlymd/uksb-wt64nei4u6#apiwith no trailing#; sanitization of a hostile trace value. Commit.Task 4 — JS handlers: migrate all client sites + trace at handler entries
Modify: all
new XClient(sites undercdk/src/handlers/(≈60, enumerated during exploration — every one currently passes{}or{ region }), plus the ~20 handler entry points that already mintconst requestId = ulid().new DynamoDBClient({})→new DynamoDBClient(abcaUserAgent()), wrappedwithAbcaTrace(...)at the outermost cached client (for DocumentClient: instrument the inner client — shared stack confirmed). Lazy singletons (strategies/ecs-strategy.ts:28,strategies/agentcore-strategy.ts:26,shared/memory.ts:149,confirm-uploads.ts:700) instrumented inside theirgetClient()s.setAbcaTrace(requestId)immediately after each existingulid()mint;setAbcaTrace(event.task_id)atorchestrate-task.tsdurable-handler entry. Stream consumers (fanout-task-events,approval-metrics-publisher) carry no trace (optional by design).withAbcaTraceno-ops on bare-object mocks).mise //cdk:test. Commit (possibly split: shared modules / top-level handlers).Task 5 — CDK: thread
ABCA_STACK_NAME+ABCA_COMPONENTenv varsModify:
cdk/src/constructs/task-api.ts:442(commonEnv→ adds both, componentapi) and:931(webhookEnv→webhook; note webhookEnv does not spread commonEnv),cdk/src/constructs/task-orchestrator.ts:221(orchestr), consumer constructs (concurrency-reconciler:71,stranded-task-reconciler:107,pending-upload-cleanup:90→orchestr;fanout-consumerusesthis.fn.addEnvironment(...)at :168ff — add both vars the same way;approval-metrics-publisher-consumer:115currently has noenvironment:block — add one), integrations (slack-integration:188,220,252,linear-integration:192,250,289,github-screenshot-integration:159,228→webhook),cdk/src/stacks/agent.ts:297(runtime env) +cdk/src/constructs/ecs-agent-cluster.ts:124(container env) →ABCA_STACK_NAMEonly (agent hardcodes its component).Stack.of(this).stackName(concrete —stackNameis set explicitly inmain.ts:34).cdk/test/constructs/task-api.test.tsandtask-orchestrator.test.ts(each surface emits its expected component label — acceptance criterion).mise //cdk:test+mise //cdk:synth. Commit.Task 6 — CLI
Create:
cli/src/ua.ts(CLI-local mirror of the tiny helper — cli cannot import from cdk; same pattern as thetypes.tsmirror),cli/test/ua.test.tsModify:
cli/src/auth.ts:37,103,cli/src/commands/{slack,github,linear,admin}.tsclient sites,cli/src/bin/bgagent.ts(onesetAbcaTrace(String(process.pid))at startup),cli/src/types.ts(CliConfig.stack_name?: string),cli/src/commands/configure.ts(--stack-nameflag + bundle passthrough)cli; stack name fromloadConfig().stack_name(omitapp/when unset).auth.test.tsassertion that the Cognito client constructor receivescustomUserAgent.Risk check: type-sync drift hookResolved:CliConfigis already inCLI_ONLY_ALLOWLIST(scripts/check-types-sync.ts:156) — addingstack_name?requires no CDK mirror.cd cli && mise run build. Commit.Task 7 — Docs + full verification
AGENTS.md“Common mistakes”: new bullet — every new AWS SDK client must go through the ABCA UA helpers (agent/src/ua.py·cdk/src/handlers/shared/ua.ts·cli/src/ua.ts); nakedboto3.client(...)/new XClient({})silently drops solution attribution (addresses @krokoko’s comment). Root-level file — no Starlight sync needed.ua.ts/ua.pystating the wire format and pointing at feat(observability): inject solution into outbound AWS SDK User-Agent #319.mise run build(agent quality → cdk → cli → docs),mise run security:sast. Sweep the acceptance-criteria checklist below; mark PR ready.Acceptance-criteria map
/survives{STACKNAME}sanitize-then-clip ≤34, value ≤50; test w//,#, non-ASCII{IDENTIFIER}; per-surface testABCA_COMPONENTenv + Task 5 template assertions + wire tests{TRACE}per-request, never baked, optional; same-client different-trace test; no trailing#; sanitizedbefore-sendshort-circuit capture; JS stubrequestHandlercaptureaws_session.py/config.py; singleton pool preserved{STACKNAME}threaded via env / container env / CLI config — not hard-codedOut of scope (per issue)
Dashboards/querying of attributed usage (#215/#245); the deploy-time
uksbstack-description token (#292, done).🤖 Generated with Claude Code