Skip to content

Commit 8736716

Browse files
krokokobgagentClintEastman02
authored
feat(security): per-session IAM scoping — tag-scoped SessionRole for agent tenant-data access (#209) (#211)
* feat(agent): per-task scoped credential provider (#209) Add agent/src/aws_session.py: a portable, refreshable STS AssumeRole credential provider that assumes a per-task SessionRole with session tags {user_id, repo, task_id}. Refreshable (botocore DeferredRefreshableCredentials) to survive past the 1h role-chaining cap on tasks running up to maxLifetime (8h). Wire the 6 tenant-data boto3 clients (TaskTable, TaskEvents, Approvals, Nudges DDB; trace + attachments S3) through tenant_client/ tenant_resource. Non-tenant clients (secrets, logs, AgentCore Memory) stay on the compute role. Fail-safe: with AGENT_SESSION_ROLE_ARN unset (local/dev/tests), behaves identically to the prior ambient-credential path. Refs #209 * feat(cdk): per-task SessionRole with tag-scoped tenant-data access (#209) Add AgentSessionRole construct: a per-task role assumed by the agent with session tags {user_id, repo, task_id}. Carries the tenant-data grants moved off the runtime ExecutionRole: - DynamoDB item access on the 4 task_id-partitioned tables, gated by a dynamodb:LeadingKeys = ${aws:PrincipalTag/task_id} condition (Scan not granted — it ignores leading-keys). - S3 trace PutObject scoped to traces/${aws:PrincipalTag/user_id}/*. - S3 attachment reads scoped to attachments/${aws:PrincipalTag/user_id}/*. Trust policy admits the runtime ExecutionRole for sts:AssumeRole + sts:TagSession. Wire into agent.ts: slim the runtime role to non-tenant grants (concurrency, secret, logs, memory), inject AGENT_SESSION_ROLE_ARN via Lazy.string, resolve the agent.ts:376 trace-prefix TODO. CDK construct tests assert trust, leading-keys, S3 prefixes, and the compute-role assume grant. 1795 cdk tests pass; cdk-nag clean. Refs #209 * feat(cdk): ECS backend parity for per-session scoping (#209) Make the SessionRole backend-agnostic: - AgentSessionRole.addAssumingRole() admits a second compute role (the ECS Fargate task role) to the trust policy. - EcsAgentCluster accepts an optional agentSessionRole. When wired, the task role assumes the SessionRole for tenant-data access instead of receiving direct task-table grants; AGENT_SESSION_ROLE_ARN is injected into the container. - ARN-scope the ECS task-role Bedrock grant: replace Resource:'*' on InvokeModel with explicit foundation-model + us.* inference-profile ARNs (parity with the AgentCore runtime), and update the IAM5 suppression reason. This is an unconditional security fix for the dormant backend. - Wire agentSessionRole into the commented-out ECS block in agent.ts for when #164 enables the backend. Tests: multi-principal trust, ARN-scoped Bedrock (no wildcard), and the SessionRole-wired ECS path (assume grant + env var + conditioned DDB). Refs #209 * docs: per-session IAM scoping in SECURITY.md + ROADMAP.md (#209) Document the implemented SessionRole model: per-task sts:AssumeRole with {user_id, repo, task_id} tags, task_id leading-key DDB isolation, user_id-prefixed S3, refreshable creds for the 1h chaining cap, and the backend-agnostic (AgentCore + ECS) design. Mark the roadmap item implemented. Regenerate Starlight mirrors. Refs #209 * fix(review): fail-closed scoping, accurate docs, hardened tests (#209) Address PR #211 review findings: - **Fail closed (not silent-open)**: when AGENT_SESSION_ROLE_ARN is set but the scoped session can't be built, raise SessionScopingError and emit a loud log_error_cw instead of silently downgrading to unscoped (cross-tenant) ambient credentials. Unset ARN still fails safe (local/dev). Per-task log line now records SCOPED/UNSCOPED via the previously-unused is_scoped(). - **Docs accuracy**: soften the SECURITY.md 'validated against the IAM policy engine' claim to describe what the committed tests assert (CDK template + mocked-STS unit tests) plus the one-time simulator check; fix the task_id '12-char hex' comment (canonical id is a ULID). - **Tests**: replace the test that codified silent fallback with one asserting fail-closed + loud log; add thread-safety (20-thread double-checked-locking) and tag-truncation (256-char) tests; make the ECS 'no direct task-table grant' assertion real (was a tautology). - **Construct guard**: AgentSessionRole rejects empty assumingRoles. 810 agent + 1798 cdk tests pass; full build green. Refs #209 * refactor(agent): keep session tags in module state, not os.environ (#209) Address type-design review: configure_session previously wrote tenant identifiers (user_id/repo/task_id) to os.environ, which the untrusted repo subprocesses the agent spawns would inherit. Store them in a private module dict instead, removing the information-exposure surface for a control whose whole purpose is tenant isolation. Add a test asserting tags never reach os.environ. 811 agent + 1798 cdk + 286 cli tests pass; full build green. Refs #209 --------- Co-authored-by: bgagent <bgagent@noreply.github.com> Co-authored-by: Clint Eastman <clintae@amazon.com>
1 parent 23651e0 commit 8736716

17 files changed

Lines changed: 1281 additions & 65 deletions

File tree

agent/src/attachments.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -53,12 +53,12 @@ def download_attachments(
5353
if not attachments:
5454
return []
5555

56-
import boto3
56+
from aws_session import tenant_client
5757

5858
attachments_dir = Path(workspace) / ATTACHMENTS_DIR
5959
attachments_dir.mkdir(parents=True, exist_ok=True)
6060

61-
s3_client = boto3.client("s3")
61+
s3_client = tenant_client("s3")
6262
prepared: list[PreparedAttachment] = []
6363

6464
try:

agent/src/aws_session.py

Lines changed: 250 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,250 @@
1+
"""Per-task scoped AWS credentials for tenant-data access.
2+
3+
This module centralizes how the agent obtains boto3 clients/resources for
4+
**tenant data** (the task's own DynamoDB rows and its S3 trace/attachment
5+
objects). Instead of using the long-lived compute role (AgentCore Runtime
6+
``ExecutionRole`` or ECS Fargate task role) directly, the agent assumes a
7+
per-task **SessionRole** with session tags ``{user_id, repo, task_id}`` and
8+
uses the resulting short-lived, tag-scoped credentials. The SessionRole's IAM
9+
policy self-constrains via ``aws:PrincipalTag/*`` conditions
10+
(``dynamodb:LeadingKeys`` on ``task_id`` for the task tables, an S3 prefix
11+
condition on ``user_id`` for the trace bucket), so a compromised session can
12+
only reach its own task's data — not other tenants'.
13+
14+
Two properties matter for correctness:
15+
16+
1. **Refreshable, not one-shot.** The agent runs under credentials that are
17+
themselves an assumed role, so the agent's own ``sts:AssumeRole`` is *role
18+
chaining*, which AWS hard-caps at 1 hour regardless of the role's
19+
``MaxSessionDuration``. Tasks can run up to ``maxLifetime`` (8 h), so a
20+
single ``assume_role()`` call would yield credentials that expire mid-task
21+
and fail every subsequent call with ``ExpiredToken``. We wrap the assume
22+
call in botocore ``RefreshableCredentials`` so boto3 transparently
23+
re-assumes before expiry.
24+
25+
2. **Backend-agnostic.** The same code path works whether the agent boots
26+
under an AgentCore execution role or an ECS task role — both are valid
27+
assuming principals in the SessionRole trust policy.
28+
29+
**Fail-safe vs. fail-closed:**
30+
31+
- When ``AGENT_SESSION_ROLE_ARN`` is **unset** (local dev, tests, or a
32+
deployment that has not yet provisioned the SessionRole), this module returns
33+
plain boto3 clients/resources backed by the ambient credential chain —
34+
identical to the pre-feature behavior. Scoping is opt-in and its absence does
35+
not block task execution.
36+
- When ``AGENT_SESSION_ROLE_ARN`` **is set**, scoping has been *requested*, so a
37+
failure to build the scoped session is treated as a hard error: this module
38+
raises :class:`SessionScopingError` rather than silently degrading to the
39+
ambient (cross-tenant) compute role. For a tenant-isolation control, silently
40+
running unscoped is the most dangerous failure mode — the agent keeps working
41+
but stops isolating tenants — so we fail **closed** (abort the task) instead.
42+
"""
43+
44+
from __future__ import annotations
45+
46+
import os
47+
import threading
48+
from typing import Any
49+
50+
# Env var holding the per-task SessionRole ARN. Set by the orchestrator on the
51+
# compute environment (AgentCore runtime env / ECS container overrides).
52+
SESSION_ROLE_ARN_ENV = "AGENT_SESSION_ROLE_ARN"
53+
54+
# Role chaining caps the assumed session at 1 hour. Request the maximum the
55+
# cap allows; botocore refreshes well before this elapses.
56+
_CHAINED_SESSION_DURATION_S = 3600
57+
58+
# IAM session-tag value constraints: keys <=128 chars, values <=256 chars.
59+
_MAX_TAG_VALUE_LEN = 256
60+
61+
_lock = threading.Lock()
62+
_session: Any = None # cached boto3.Session (scoped or plain)
63+
_scoped: bool | None = None # None until first resolution; True if tag-scoped
64+
65+
# Session-tag values, set once at startup by ``configure_session`` from the
66+
# resolved TaskConfig. Kept in private module state — NOT os.environ — so the
67+
# tenant identifiers are not inherited by the untrusted repo subprocesses the
68+
# agent spawns (build/test/tooling). Read by ``_session_tags`` at assume time.
69+
_tags: dict[str, str] = {}
70+
71+
72+
class SessionScopingError(RuntimeError):
73+
"""Per-session IAM scoping was requested but could not be established.
74+
75+
Raised when ``AGENT_SESSION_ROLE_ARN`` is set but the scoped session cannot
76+
be built. Fails the task closed rather than silently falling back to the
77+
ambient (cross-tenant) compute role.
78+
"""
79+
80+
81+
def configure_session(user_id: str, repo: str, task_id: str) -> None:
82+
"""Record session-tag values in private module state for later use.
83+
84+
Called once at agent startup from the resolved ``TaskConfig``. Idempotent;
85+
safe to call before any tenant-data client is created. Does not itself
86+
assume the role — assumption is deferred until the first client is built so
87+
that a missing SessionRole never delays startup. Values are stored in a
88+
module global (not ``os.environ``) so tenant identifiers do not leak into
89+
spawned subprocesses.
90+
"""
91+
global _tags
92+
_tags = {
93+
key: value
94+
for key, value in (("user_id", user_id), ("repo", repo), ("task_id", task_id))
95+
if value
96+
}
97+
98+
99+
def reset_session_cache() -> None:
100+
"""Drop the cached session and tags. For tests that toggle config."""
101+
global _session, _scoped, _tags
102+
with _lock:
103+
_session = None
104+
_scoped = None
105+
_tags = {}
106+
107+
108+
def _session_tags() -> list[dict[str, str]]:
109+
"""Build the AssumeRole ``Tags`` list from the configured tag values.
110+
111+
Only non-empty values are included (filtered at ``configure_session``).
112+
Values are truncated to the IAM limit so an over-long repo slug can never
113+
make ``AssumeRole`` fail closed.
114+
"""
115+
return [{"Key": key, "Value": value[:_MAX_TAG_VALUE_LEN]} for key, value in _tags.items()]
116+
117+
118+
def _build_scoped_session(role_arn: str) -> Any:
119+
"""Build a boto3 Session backed by refreshable assumed-role credentials.
120+
121+
The refresh callback re-invokes ``sts:AssumeRole`` (with session tags) each
122+
time botocore decides the cached credentials are near expiry, so a task
123+
running past the 1-hour role-chaining cap keeps working.
124+
"""
125+
import boto3
126+
from botocore.credentials import (
127+
DeferredRefreshableCredentials,
128+
)
129+
from botocore.session import get_session as get_botocore_session
130+
131+
region = os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION")
132+
task_id = _tags.get("task_id", "")
133+
# Role session name must be <=64 chars and match [\w+=,.@-]. task_id is a
134+
# short slug (a ULID, ~26 chars, in the API path; a 12-char hex fallback
135+
# when the agent generates its own) — well under 64. The ``abca-`` prefix
136+
# keeps CloudTrail entries identifiable. Truncate defensively regardless.
137+
session_name = f"abca-{task_id}"[:64] or "abca-session"
138+
139+
# A dedicated STS client built from the *ambient* (compute-role) chain.
140+
# This is the role-chaining caller; the assumed SessionRole credentials it
141+
# returns must NOT be used to build it, or refresh would recurse.
142+
sts_client = boto3.client("sts", region_name=region)
143+
144+
def _refresh() -> dict[str, str]:
145+
resp = sts_client.assume_role(
146+
RoleArn=role_arn,
147+
RoleSessionName=session_name,
148+
DurationSeconds=_CHAINED_SESSION_DURATION_S,
149+
Tags=_session_tags(),
150+
)
151+
creds = resp["Credentials"]
152+
return {
153+
"access_key": creds["AccessKeyId"],
154+
"secret_key": creds["SecretAccessKey"],
155+
"token": creds["SessionToken"],
156+
# botocore expects an ISO8601 string; the SDK returns a datetime.
157+
"expiry_time": creds["Expiration"].isoformat(),
158+
}
159+
160+
botocore_session = get_botocore_session()
161+
# Deferred: the first assume_role happens on first credential use, not now,
162+
# so a transient STS hiccup at startup doesn't crash the agent before it
163+
# has even begun.
164+
botocore_session._credentials = DeferredRefreshableCredentials(
165+
method="sts-assume-role-session-tags",
166+
refresh_using=_refresh,
167+
)
168+
if region:
169+
botocore_session.set_config_variable("region", region)
170+
return boto3.Session(botocore_session=botocore_session)
171+
172+
173+
def get_session() -> Any:
174+
"""Return the cached boto3 Session for tenant-data access.
175+
176+
Tag-scoped (assumed SessionRole) when ``AGENT_SESSION_ROLE_ARN`` is set;
177+
otherwise a plain session on the ambient credential chain (fail-safe).
178+
"""
179+
global _session, _scoped
180+
if _session is not None:
181+
return _session
182+
with _lock:
183+
if _session is not None:
184+
return _session
185+
import boto3
186+
187+
role_arn = os.environ.get(SESSION_ROLE_ARN_ENV, "").strip()
188+
if role_arn:
189+
# Scoping was requested. Build the scoped session or FAIL CLOSED —
190+
# never silently downgrade to the ambient compute role, which can
191+
# reach every tenant's data (that is exactly what this control
192+
# prevents).
193+
try:
194+
_session = _build_scoped_session(role_arn)
195+
_scoped = True
196+
except Exception as exc:
197+
from shell import log_error_cw
198+
199+
log_error_cw(
200+
"SESSION_SCOPING_FAILED: AGENT_SESSION_ROLE_ARN is set but the "
201+
f"scoped session could not be built ({type(exc).__name__}: {exc}). "
202+
"Failing closed — refusing to run on unscoped ambient credentials, "
203+
"which would disable tenant isolation.",
204+
task_id=_tags.get("task_id") or None,
205+
)
206+
raise SessionScopingError(
207+
"per-session IAM scoping requested via "
208+
f"{SESSION_ROLE_ARN_ENV} but could not be established"
209+
) from exc
210+
else:
211+
# Scoping not requested (local/dev/tests, or pre-provisioning):
212+
# plain ambient session, behaviorally identical to pre-feature code.
213+
_session = boto3.Session(
214+
region_name=os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION")
215+
)
216+
_scoped = False
217+
return _session
218+
219+
220+
def is_scoped() -> bool:
221+
"""Whether the current session uses tag-scoped assumed-role credentials."""
222+
if _scoped is None:
223+
get_session()
224+
return bool(_scoped)
225+
226+
227+
def tenant_client(service_name: str, **kwargs: Any) -> Any:
228+
"""boto3 client for tenant data.
229+
230+
When the per-task SessionRole is configured, the client is built from the
231+
tag-scoped, refreshable session. Otherwise it delegates directly to
232+
``boto3.client`` — behaviorally identical to the pre-feature code path
233+
(and transparent to callers/tests that mock ``boto3.client``).
234+
"""
235+
session = get_session()
236+
if is_scoped():
237+
return session.client(service_name, **kwargs)
238+
import boto3
239+
240+
return boto3.client(service_name, **kwargs)
241+
242+
243+
def tenant_resource(service_name: str, **kwargs: Any) -> Any:
244+
"""boto3 resource for tenant data. See :func:`tenant_client`."""
245+
session = get_session()
246+
if is_scoped():
247+
return session.resource(service_name, **kwargs)
248+
import boto3
249+
250+
return boto3.resource(service_name, **kwargs)

agent/src/nudge_reader.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -76,10 +76,10 @@ def _get_table() -> Any | None:
7676
return None
7777

7878
try:
79-
import boto3
79+
from aws_session import tenant_resource
8080

8181
region = os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION")
82-
dynamodb = boto3.resource("dynamodb", region_name=region)
82+
dynamodb = tenant_resource("dynamodb", region_name=region)
8383
_TABLE_CACHE = dynamodb.Table(table_name)
8484
return _TABLE_CACHE
8585
except Exception as exc:

agent/src/pipeline.py

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -324,6 +324,22 @@ def run_task(
324324
if cedar_policies:
325325
config.cedar_policies = cedar_policies
326326

327+
# Export session-tag values so tenant-data boto3 clients (DDB/S3) assume
328+
# the per-task SessionRole with {user_id, repo, task_id} tags. No-op when
329+
# AGENT_SESSION_ROLE_ARN is unset (local/dev/tests).
330+
from aws_session import configure_session, is_scoped
331+
332+
configure_session(
333+
user_id=config.user_id,
334+
repo=config.repo_url,
335+
task_id=config.task_id,
336+
)
337+
# Surface the credential-scoping posture once per task so every task's logs
338+
# state plainly whether tenant-data isolation was active. is_scoped()
339+
# resolves the session; if scoping was requested but unbuildable it raises
340+
# SessionScopingError here (fail closed) rather than running unscoped.
341+
log("TASK", f"Tenant-data credential scoping: {'SCOPED' if is_scoped() else 'UNSCOPED'}")
342+
327343
log("TASK", f"Task ID: {config.task_id}")
328344
log("TASK", f"Repository: {config.repo_url}")
329345
log("TASK", f"Issue: {config.issue_number or '(none)'}")

agent/src/progress_writer.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -374,10 +374,10 @@ def _ensure_table(self):
374374
self._disabled = True
375375
return
376376

377-
import boto3
377+
from aws_session import tenant_resource
378378

379379
region = os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION")
380-
dynamodb = boto3.resource("dynamodb", region_name=region)
380+
dynamodb = tenant_resource("dynamodb", region_name=region)
381381
self._table = dynamodb.Table(self._table_name)
382382

383383
# -- core write ------------------------------------------------------------

agent/src/task_state.py

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -53,10 +53,10 @@ def _get_table():
5353
if not table_name:
5454
return None
5555

56-
import boto3
56+
from aws_session import tenant_resource
5757

5858
region = os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION")
59-
dynamodb = boto3.resource("dynamodb", region_name=region)
59+
dynamodb = tenant_resource("dynamodb", region_name=region)
6060
_table = dynamodb.Table(table_name)
6161
return _table
6262

@@ -543,10 +543,10 @@ def _get_ddb_client(*, client=None):
543543
"""
544544
if client is not None:
545545
return client
546-
import boto3
546+
from aws_session import tenant_client
547547

548548
region = os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION")
549-
return boto3.client("dynamodb", region_name=region)
549+
return tenant_client("dynamodb", region_name=region)
550550

551551

552552
def _require_tables() -> tuple[str, str]:

agent/src/telemetry.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -450,10 +450,10 @@ def upload_trace_to_s3(
450450

451451
key = f"traces/{user_id}/{task_id}.jsonl.gz"
452452
try:
453-
import boto3
453+
from aws_session import tenant_client
454454

455455
region = os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION")
456-
client = boto3.client("s3", region_name=region)
456+
client = tenant_client("s3", region_name=region)
457457
# Intentionally omit ContentEncoding=gzip: Node's fetch (undici) auto-
458458
# decompresses responses whose metadata declares gzip encoding, which
459459
# violates the CLI's `-o <file>` "raw gzipped bytes" contract and

0 commit comments

Comments
 (0)