AO-427: Add monte-carlo-instrument-agent skill#78
Conversation
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d stale) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The shared _headers() helper sent Authorization: Bearer $GITHUB_TOKEN with every request, including the PyPI metadata fetch. Split into a per-call with_github_auth flag and route only GitHub-host URLs through it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the GitHub README fetch fails (private repo, 404, rate limit), use the PyPI metadata description field as a backup README source. PyPI's description mirrors the GitHub README for montecarlo-opentelemetry, so the same parser produces the same instrumentor list. live_failed now reflects whether we have a usable instrumentor list, not which source supplied it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The committed instrumentor_map.json uses `supported_instrumentors`, matching fetch_sdk_docs.py and the fallback contract. detect_libraries.py was looking for `libraries`, so every run silently warned and discarded the snapshot (losing snapshot_date, notes, and extra dependency coverage like langchain-openai). Accept `supported_instrumentors` as the canonical key, keep `libraries` as a backwards-compat read for any older local snapshot. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
requirements.txt was already routed through _safe_read_text (5 MB cap), but pyproject.toml and Pipfile opened directly via tomllib.load(fh), so a malicious or malformed manifest could load arbitrarily large content into memory. Switch to _safe_read_text + tomllib.loads so the same cap applies to all manifest formats. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The harness mirrored push-ingestion's check()-counter pattern, but pytest would not fail individual test_* functions because there were no assertions — only main() exited nonzero. Make check() raise AssertionError on failure so pytest catches per-test failures, and have main() wrap each test in try/except so standalone runs still report the full pass/fail summary across all tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Substring matching on the URL ("github.com" in url) was exploitable via an
override like https://example.com/github.com/README.md, which would still
send Authorization: Bearer $GITHUB_TOKEN to a non-GitHub host. Parse the
URL with urllib.parse.urlparse and compare hostname against an exact
allowlist of trusted GitHub hosts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The header-based parser only caught Langchain/LangGraph and OpenAI because those are the only "# For X" examples in the SDK README — Anthropic, Bedrock, CrewAI, SageMaker, Vertex AI, and the long tail live in a markdown bullet list further down. The script reported source: live with 3 of 8 PRD core libraries and downstream skill logic could treat that partial list as complete. Add a second parsing pass over PyPI-link bullets, deduplicate by (library, package), and require all PRD core libraries (langchain, langgraph, openai, anthropic, crewai, bedrock, sagemaker, vertexai) to be present before declaring live success. Missing any one of them now falls back to the snapshot's known-complete coverage instead of shipping a partial list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 1 router with locked frontmatter (name: monte-carlo-instrument-agent; description 241 chars; bucket: Setup; combined description+when_to_use 1093 chars) and a CRITICAL no-silent-edit guardrail that covers all file modifications: dependency files, source code, and env files. Routing table links to the seven Tier 3 references that are added in subsequent commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "Available slash commands" section advertised /instrument-agent before the matching plugins/claude-code/commands/instrument-agent/ registration landed (Phase 4 work). Documenting a slash command before it exists fails per-commit consistency review. Phase 4's command-registration commit will re-add the section once the command file and plugin.json entry are in place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
11-step Tier 2 procedural document — entry point after the SKILL.md router. Walks the full flow: detect libraries/runtime/existing-setup, branch on self-hosted vs MC-hosted OTel, redaction gate, BEFORE-snapshot via get_agent_metadata, idempotent endpoint resolution, propose dependency edits, propose mc.setup() (serverless variant if needed), propose decorator diffs, confirm env vars (MC-hosted only, presence-only), AFTER verification, branch to troubleshooting on failure. CRITICAL callouts surface the never-modify guardrail, never-trace_with_tags rule, and never-echo-credential rule. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 3 reference for the library/runtime detection step. Documents the PRD core-library list as a stable contract (langchain, langgraph, openai, anthropic, crewai, bedrock, sagemaker, vertexai) shared across detect_libraries.py, fetch_sdk_docs.py, and instrumentor_map.json. Covers two-tier handling (core with fallback templates vs live-fetch-required), the bedrock/sagemaker boto3 ambiguity rule (route to unsupported, not detected), serverless-framework signal table, existing-mc.setup() routing, and clean no-match exit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 3 reference for mc.setup() generation. Cites both production templates by URL+SHA: long-running container variant from ai-agent (default BatchSpanProcessor) and serverless variant from saas-serverless (SimpleSpanProcessor required — Lambda freezes the process and the batch processor would otherwise be suspended before flushing). Includes idempotent OTLP endpoint normalization (never double-append /v1/traces), existing-mc.setup() decision matrix, prompts-disabled-by-default rationale, and presence-only credential checks (never read or echo MCD_DEFAULT_API_TOKEN). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 3 reference for selecting and placing the two in-scope decorators (@mc.trace_with_workflow on orchestration functions; @mc.trace_with_task on LLM-calling task functions). Per Bayard, the skill must NEVER add @mc.trace_with_tags. Cites the canonical placement example from saas-serverless graph_creation.py @ ec2af0c L162. Includes the unified diff format for proposing decorator placements per file, conservative defaults (one workflow + one task per LLM call to start), and the no-edits-without-approval guardrail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 3 reference for the two-call get_agent_metadata flow used in workflow steps 4 (BEFORE snapshot) and 10 (AFTER verification). Documents the dev/ prod twin disambiguation via MCON: the same agent_name with a different MCON is a genuinely new agent. Pre-flight testConnection check on the Monte Carlo MCP server bails early if MCP is unavailable (verification step would silently fail otherwise). NEVER-poll guidance — the skill calls the MCP tool exactly twice; the user driving when they ran the agent triggers the AFTER call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 3 reference for V1 redaction pathways. Documents what the SDK captures by default (full prompt + completion content as span attributes, sent over OTLP to the configured endpoint) so the customer can make an informed opt-in decision. Walks the three PRD-named pathways: (A) manual mc.create_llm_span with redacted prompts_to_record — cites the saas- serverless graph_creation.py @ ec2af0c L92 example; (B) disable prompt/ completion capture entirely via instrumentor env vars (the default in setup-template.md); (C) selective hybrid for codebases with isolated sensitive paths. Out of scope for v1: auto-detection of sensitive content and auto-scaffolded redactor functions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 3 reference covering the five failure modes for missing traces, in diagnostic priority order: (1) serverless BatchSpanProcessor foot-gun — checked first because it is by far the most common cause of missing traces on Lambda; (2) SDK init not running — module never imported, env guard not satisfied; (3) missing credentials on the MC-hosted path with presence-only checks; (4) wrong instrumentor versions — cross-checked against the live PyPI compatibility table; (5) upstream pipeline not deployed (out of scope for the skill — escalate to the customer's AO setup team). Cites the saas-serverless tracing.py production reference for the SimpleSpanProcessor fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The strict no-mention gate (grep -rin "trace_with_tags") is the right primitive after all. Telling an LLM "don't add @trace_with_tags" puts the decorator name in its context window — combined with the SDK README's Travel Assistant example showing @mc.trace_with_tags, that made the negative callout an attractor rather than a deterrent. Replaced explicit exclusions in SKILL.md, workflow.md, and decorator-placement.md with positive-only scope language: "the only two decorators in scope for v1 are @trace_with_workflow and @trace_with_task." The pair propagates and filters appropriately for monitoring-advisor; other SDK tracing primitives are simply not part of v1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Minimal LangGraph-shaped agent (~50 LOC) with ChatOpenAI in a graph node, plus requirements.txt listing langchain, langgraph, langchain-openai, and openai. Targets the long-running container template — no serverless framework, no lambda_handler. Used by Phase 3's structural smoke tests to validate detect_libraries.py against a realistic agent codebase. detect_libraries on this fixture reports detected=[langchain, langgraph, openai], runtime=long_running, no serverless signals, no existing setup — which is what workflow.md step #1 expects to drive the rest of the flow toward the long-running mc.setup() template. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same LangGraph agent as sample_agent/, but exposed as a Lambda handler with serverless.yml + aws-lambda-powertools dependency. Targets the serverless mc.setup() variant (SimpleSpanProcessor required). detect_libraries on this fixture flips runtime to serverless and surfaces three signals (serverless.yml, aws-lambda-powertools, lambda_handler) — which drives the workflow toward the SimpleSpanProcessor template per references/setup-template.md. Note: the agent.py docstring intentionally avoids the literal token combination that detect_libraries.py's regex-based existing-setup grep matches — that grep has docstring-vs-call-site false positives that we work around here at the fixture level. Tracking the broader detector correctness as a follow-up in Lessons. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test_sample_agent and test_sample_serverless_agent assert that detect_libraries produces JSON matching what each workflow step in references/workflow.md describes — for the long-running container path and the serverless path respectively. Validates the structural surface end-to-end: detected libraries, runtime classification, serverless signals (serverless.yml, lambda_handler, aws-lambda-powertools), and absence of false-positive existing-setup hits. Total: 57/57 checks pass (15 new). Live verification (testConnection + get_agent_metadata against an MC AO tenant) is deferred to QA per the Phase 3 scope decision; this commit covers the structural-only portion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Relative symlinks from each editor plugin's skills/ directory to the canonical skill source under skills/instrument-agent/. Required by the mc-agent-toolkit plugin architecture per CONTRIBUTING.md so the skill ships through claude-code, cursor, codex, copilot, and opencode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds plugins/claude-code/commands/instrument-agent/instrument-agent.md (per CONTRIBUTING.md L170-174 — required for the skill to be discoverable as mc-agent-toolkit:instrument-agent), threads the directory through plugin.json's commands array, and re-adds the "Available slash commands" section to SKILL.md (deferred from acfcf8b since the registration didn't exist yet). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
21 cases — 11 should-trigger covering all 8 PRD core libraries (LangChain,
LangGraph, OpenAI, Anthropic, CrewAI, Bedrock, SageMaker, Vertex AI), the
serverless variant ("instrument my Lambda agent"), a generic MC tracing
prompt, and the intentionally-ambiguous "set up agent observability" case
(routes here per Bayard's DM); 10 should-not-trigger lifting verbatim
prompts from monitoring-advisor (3 — locking the bidirectional routing
contract), plus incident-response (2 — alert triage), push-ingestion (2),
and connection-auth-rules (1).
Uses "no-trigger" with hyphen per run_evals.py:120 — earlier underscore
form would have silently passed --dry-run but failed real runs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6 behavior cases covering the workflow's load-bearing rules: three silent-edit guardrails (decorator, dependencies, mc.setup) — each prompt asks the skill to skip approval and the rubric expects propose-then-wait; endpoint-normalization multi-turn case verifying http://localhost:4318/v1/traces is not double-appended (output_must_not_contain blocks /v1/traces/v1/traces); serverless detection from inlined serverless.yml content (the eval model has no per-case fixture/cwd field, so filesystem state is expressed in the prompt); before/after verification multi-turn case requiring get_agent_metadata calls at both ends and MCON-based dev/prod twin disambiguation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Locks the routing contract on the monitoring-advisor side: three new should-not cases covering instrument-agent's domain (mc.setup() wire-up, generic instrument-agent activation, SDK install + decorator scaffolding for Lambda). Symmetric with instrument-agent/trigger-evals.json's should-not cases that lift verbatim from this file's should-trigger list. Now if anyone flips the meaning of "instrument" vs "monitor" in either skill's description, both eval files break — the contract lives in code, not just docs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add must_not_call: [Edit, Write, NotebookEdit] to live-01, live-02, live-03 (silent-edit guardrails). Previously the guardrail relied on the judge model reading the rubric prose; now Edit/Write/NotebookEdit are blocked deterministically. PRD #8 is the single most important guardrail in the skill — checking it mechanically beats trusting prose interpretation. - Remove turn-level judge_rubric entries from live-04 and live-06. TurnCriteria.from_dict() in plugins/claude-code/evals/models.py ignores judge_rubric — that field only exists on the case-level CaseCriteria subclass — so those turn-level rubrics were silently not evaluated. Merged the per-turn specifics into each case's case-level rubric with explicit "Turn 1 / Turn 2 / Turn 3" sections so nothing's lost. - Trim trailing blank line at end of SKILL.md (git diff --check was failing on commit 08f8c62's introduction of "new blank line at EOF"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end testing summaryWalked the skill end-to-end against two real Python codebases and verified trace Setup
Resulting trace treesLong-running (9 spans): Serverless (5 spans, no LangGraph layer): Both confirm the workflow span as root, task spans nested correctly, Findings (all fixed in 7b0a5ca)
Re-test from clean stateRolled both fixtures back to uninstrumented (deleted
What this doesn't cover
|
The skill's setup-template.md and redaction.md documented per-instrumentor env
vars (`OTEL_INSTRUMENTATION_LANGCHAIN_TRACE_PROMPTS=false`, etc.) as the
"prompts disabled" mechanism, but those env vars do not exist in the OpenLLMetry
instrumentors at <=0.53.4. The instrumentors instead read
TRACELOOP_TRACE_CONTENT, which defaults to "true" (content captured) — meaning
a comment alone in the template did NOT change the actual default.
Surfaced empirically by inspecting an e2e test trace: ChatAnthropic.chat spans
included `gen_ai.prompt.0.content` and `gen_ai.completion.0.content` despite
the template promising "by default this template DOES NOT capture prompt /
completion content". The privacy property the skill claimed wasn't real.
Fix:
- setup-template.md long-running and serverless templates now set
`os.environ.setdefault("TRACELOOP_TRACE_CONTENT", "false")` at module top,
before importing any instrumentor. Privacy default-off is now a code
guarantee, not just a comment. Operators who want capture override by
exporting TRACELOOP_TRACE_CONTENT=true.
- setup-template.md Section 3 explains the actual mechanism and adds a
CRITICAL callout that a comment alone does not flip the default.
- redaction.md drops the non-existent per-instrumentor env vars and
documents TRACELOOP_TRACE_CONTENT as the single source of truth for
OpenLLMetry's instrumentor family.
Verified: re-running both e2e fixtures with the corrected template, the
ChatAnthropic.chat spans now have no prompt or completion content, while
token counts, model identifiers, and request metadata remain captured.
Tests: 107 + 42 helper-script tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6th finding (after-the-fact follow-up)While walking through the trace UI to take a closer look at the spans, I noticed: Despite the long-running template's comment claiming "By default this template DOES NOT capture prompt/completion content. To capture prompts ... set OTEL_INSTRUMENTATION_LANGCHAIN_TRACE_PROMPTS=true", the spans had full prompt and completion text. Root cause: the env var name Fix (in 62e94d5):
Empirical verification: re-running both fixtures with the corrected template,
Privacy property restored. Tests still green (107 + 42). This is the kind of bug only e2e testing catches — the unit tests assert on script output, not on the instrumented spans' actual content. |
Major themes from the review: - Data residency correction: trace content lives in the customer's environment; the MC-hosted collector is a write-back pass-through that never persists trace content on MC's side. Reframed redaction.md, setup-template.md, workflow.md, SKILL.md accordingly. - Capture-on as the canonical default: prompts/completions are part of the value proposition (PRD). Redaction is opt-in for customers with stricter privacy requirements (compliance, contractual restrictions). Workflow Step 3 question rewritten; default template no longer disables TRACELOOP_TRACE_CONTENT. - Redaction model corrected: under auto-instrumentation, TRACELOOP_TRACE_CONTENT=false is a mandatory prerequisite for any redaction (otherwise duplicate spans). Manual mc.create_llm_span with placeholder-substitution is an optional additional layer. Dropped Pathway/Strategy C entirely (decorators don't gate auto-instrumentor capture). Replaced the hash+length recommendation with the placeholder-substitution technique. - PyPI-only refactor of fetch_sdk_docs.py: removed the GitHub README fetch (private repo always 404s anonymously), the snapshot fallback (samkcrespo confirmed not a hard requirement), and the PRD_CORE_LIBRARIES constant. Now PyPI-only, fails closed on errors. Pruned test_fetch_sdk_docs.py to match (-46 lines). - instrumentor_map.json reframed as the static detection map used by detect_libraries.py for offline package recognition — not a fallback for fetch_sdk_docs.py. Stripped snapshot/STALE metadata. Kept additional_pins (wrapt<2) because PyPI doesn't expose transitive pins. - Library-detection.md: removed the "Tier A core / Tier B" two-tier framing. Single tier — supported set is whatever PyPI shows. Decoupled decorator/manual-span availability from auto-instrumentor presence. Added guidance to ask the user when serverless detection is ambiguous. - Setup-template.md restructured: four self-contained templates (long-running, serverless+MC-hosted+MCD_DEFAULT, serverless+MC-hosted+ OTEL_HEADERS, serverless+self-hosted) instead of two base + conditional callouts. Removed "Earlier versions of this template" historical framing. - Pre-flight degradation: MC MCP availability is no longer a hard requirement. If test_connection fails, the workflow proceeds and asks the customer to verify in the MC UI manually instead of exiting. - Verify-traces.md timing bumped to acknowledge 10-15 minutes for first-time visibility on low-traffic / dev agents. - Decorator-placement.md: removed saas-serverless private-repo links; softened the task-name rule (function name fine when meaningful and distinct); reworded "LLM-calling" to "LLM-calling or tool-calling". - All private-repo references removed (saas-serverless, ai-agent, montecarlo-opentelemetry — all private). Public docs source is now https://pypi.org/project/montecarlo-opentelemetry/. - Customer-facing CHANGELOG cleanup: slimmed the 1.11.0 entries in all five plugin CHANGELOGs to a short customer-facing description; dropped the "### Fixed" sections (this is the first release of the skill). Same shortening applied to README.md and skills/README.md skill rows. - New GitHub Actions workflow runs test_detect_libraries.py + test_fetch_sdk_docs.py on PRs/pushes touching skills/instrument-agent/**. - test_fixtures/no-deps/.gitkeep replaced with a one-line README explaining the directory tests the no-deps detection path. All 107 detect_libraries tests and 33 fetch_sdk_docs tests pass locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review feedback addressed in 709107aThanks for the thorough pass, @bayardneville — all 44 threads are answered inline. Summary of the themes: Factual corrections
Structural changes
Polish
All 140 tests pass locally. CI should pick them up on this push. |
…-build-v1-instrument-agent-skill # Conflicts: # plugins/claude-code/.claude-plugin/plugin.json # plugins/claude-code/CHANGELOG.md # plugins/codex/.codex-plugin/plugin.json # plugins/codex/CHANGELOG.md # plugins/copilot/CHANGELOG.md # plugins/copilot/plugin.json # plugins/cursor/.cursor-plugin/plugin.json # plugins/cursor/CHANGELOG.md # plugins/opencode/CHANGELOG.md # plugins/opencode/package.json
bayardneville
left a comment
There was a problem hiding this comment.
From Claude:
Verification re-review of 65b3fd6 (delta 62e94d5..65b3fd6). All 7 prior BLOCKERs and 11 prior ISSUEs are addressed. Three new ISSUEs and four lower-priority items from the rewrite, posted as inline comments. Full review doc at .work/AO-427/reviews/ao-427-build-v1-instrument-agent-skill.review.2026-05-12-65b3fd6.md.
Six items from the second review pass: F1 (setup-template.md) — Template 2 partial-config crash. Replaced os.environ[...] reads of MCD_DEFAULT_API_ID/TOKEN with os.environ.get + a logging.warning skip-and-log block, mirroring the existing endpoint presence check. A Lambda with OTEL_ENDPOINT set but credentials missing now runs uninstrumented instead of crashing at cold start with KeyError. F2 (setup-template.md) — concrete serverless splice for the prompts-disabled variant. Added a CRITICAL callout plus a full patched Template 2 snippet showing the three-line os.environ.setdefault block at module scope (between `import os` and any `opentelemetry.instrumentation.*` import). Previous "apply at the top of the chosen Section 1 template" wording invited customers to drop the line inside `init_tracing()`, which is a no-op because the instrumentors read the env var at import time. F3 (detect_libraries.py + instrumentor_map.json + references) — biggest change. Deleted instrumentor_map.json and rewrote detect_libraries.py as a thin discovery layer. Output is now `dependencies` (sorted raw pip names) + runtime/serverless_signals/existing_setup. AI-library matching is the LLM's job, using fetch_sdk_docs.py's live PyPI list as the ground-truth supported set. Removed AMBIGUOUS_AWS_* hardcoded disambiguation — ambiguous deps (boto3, google-cloud-aiplatform, etc.) are now surfaced via user prompt per library-detection.md section 4. Moved the wrapt<2 transitive-pin guidance to troubleshooting.md as a symptom-driven fix (TypeError: wrap_function_wrapper() got an unexpected keyword argument 'module') rather than baking it into every install diff. Updated library-detection.md, workflow.md, and SKILL.md to reflect the new contract. F7 (setup-template.md) — env-var naming consistency. Standardized on OTEL_ENDPOINT across all four templates (Templates 2/3/4 already used it; Template 1 + the prompts-disabled variant were updated from MC_OTEL_ENDPOINT). Also renamed Template 1's local variable from the uppercase MC_OTEL_ENDPOINT to lowercase otel_endpoint so the env var name and the local variable name aren't textually identical. F9 (test reorg) — moved skills/instrument-agent/scripts/test_*.py and test_fixtures/ to a sibling skills/instrument-agent/tests/ directory with the fixtures renamed to tests/fixtures/. Updated path constants in both test files, sys.path.insert in test_fetch_sdk_docs.py, and the CI workflow paths in .github/workflows/instrument-agent-tests.yml. CI `paths:` filter (skills/instrument-agent/**) unchanged. Both test suites still pass after the move (53 + 33 = 86 checks). redaction.md:27 — fixed factual claim about the default OTLP endpoint. The SDK has no built-in default; the customer always supplies it. The templates resolve it from an OTEL_ENDPOINT env var so they can flip between MC-hosted and self-hosted without code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Round 2 review fixes — bayardneville (2026-05-12) Pushed
Diff stats: 33 files changed, 535 insertions(+), 862 deletions(-). Local verification: |
Summary
Adds the
monte-carlo-instrument-agentskill (AO-427) — a Setup-bucket skill that walks an MC Agent Observability customer through instrumenting a new AI agent in their Python codebase end-to-end. The skill detects AI libraries (LangChain/LangGraph, OpenAI, Anthropic, CrewAI, Bedrock, SageMaker, Vertex AI, plus the long tail via live PyPI fetch), classifies the runtime as serverless or long-running, and proposesmc.setup()plus@trace_with_workflow/@trace_with_taskdecorator diffs — always asking for explicit per-file approval before any edit.The skill produces traces;
monitoring-advisorconsumes them. Sequential, not overlapping — bidirectional routing contract locked in code via symmetric should-not eval cases on both skills.Plugin version bumped to 1.11.0 in lockstep across all five editor plugins.
What this PR enables
/instrument-agentslash command in Claude Code — kicks off the workflow against the current Python codebase. Ships via the same SKILL.md routing on Cursor, Codex, Copilot CLI, and OpenCode (each via the editor's standard skill discovery; no slash-command surface in those four).scripts/fetch_sdk_docs.pyto pull the current supported-instrumentor list from the SDK README on GitHub or from PyPI'sinfo.description(fail-closed to a snapshottedinstrumentor_map.jsonwith stale-data warnings if both fail). Live success requires all 8 PRD core libraries present — partial parses fall back rather than ship incomplete coverage.scripts/detect_libraries.pyflagsserverless.yml/template.yaml/vercel.json/wrangler.toml/ Lambda handler patterns /mangum/chalice/zappa/aws-lambda-powertoolsdeps. When detected, the skill proposes theSimpleSpanProcessorvariant ofmc.setup()— the defaultBatchSpanProcessorsilently drops traces on Lambda, which is documented as the Restructure #1 failure mode inreferences/troubleshooting.md./v1/traces, use as-is; otherwise append. Never double-append. Resolved final URL is rendered back to the user before code-gen.mc.create_llm_spanwith redactedprompts_to_record, env-var disable of prompt/completion capture (the default insetup-template.md), and selective hybrid. Redaction is a gating step beforemc.setup()is generated, not a post-hoc consideration.get_agent_metadata— snapshot existing agents before changes; confirm the new agent appears with a fresh MCON after the user runs the instrumented agent. Distinguishes dev/prod twins via MCON, not display name.must_not_call: [Edit, Write, NotebookEdit]on three guardrail cases — not just judge-rubric prose.instrument-agent/trigger-evals.jsonlifts 3monitoring-advisorshould-trigger prompts as should-not cases;monitoring-advisor/trigger-evals.jsonadds 3 should-not cases formc.setup()and SDK-install requests.Agent instrumentationConversation Signal incontext-detection(Step 0 fast-path, Step 1 categorize-intent, Step 4 routing) so ambiguous prompts route correctly and clear ones bypass the signal catalog entirely.Key Decisions
See PRD: instrument-agent Skill for full design context.
@trace_with_tags,@trace_with_workflow,@trace_with_task). Per Bayard's DM,@trace_with_tagswas explicitly excluded — the skill must never propose it. Tasks-nested-in-workflows already provides the filtering surface MC's evaluation pipeline needs. Strict no-mention gate enforces this (grep -rin "trace_with_tags" skills/instrument-agent/returns no matches): naming the forbidden token in NEVER callouts puts it in the LLM's context, which combined with the SDK README's Travel Assistant example would have made the negative callout an attractor rather than a deterrent.instrumentor_map.jsonsnapshot ships last-known-compatible pins with asnapshot_dateandSTALEwarning when used; fail-closed if both live-fetch fails AND the snapshot is older than ~6 months. Reconciles Bayard's intent with the reviewer concern that unpinned-latest installs silently break compatibility.span_processorkwarg is undocumented in the SDK README. Discovered frommonte-carlo-data/saas-serverless/ai-recommendations/common/tracing.py. The skill bakes that knowledge intosetup-template.mduntil the upstream README catches up. Both production examples (ai-agentfor long-running,saas-serverlessfor serverless) are cited inline by URL+SHA.detect_libraries.pysmoke against realistic LangGraph fixtures. The live verification path (testConnection+get_agent_metadataagainst a real MC AO tenant) needs MCP credentials we don't have locally and shouldn't pollute external tenants from a PR. Captured under "Open Questions" in the plan; QA item before/after merge."github.com" in url; reviewer caught thathttps://example.com/github.com/pathwould still leak the token. Fixed by parsing withurllib.parse.urlparseand comparing hostname against an exact allowlist{github.com, raw.githubusercontent.com, api.github.com}.Test plan
python3 skills/instrument-agent/scripts/test_detect_libraries.py— 57/57 pass against 11 fixtures (long-running, serverless, mixed manifests, boto3 ambiguity, existing setup, no-deps, etc.).python3 plugins/claude-code/evals/run_evals.py --skill instrument-agent --dry-run— 21 trigger-eval cases load.python3 plugins/claude-code/evals/run_evals.py --skill monitoring-advisor --dry-run— 36 cases load (3 new should-not for bidirectional routing).instrument-agent/live-evals-dev.yaml— 6 behavior cases (silent-edit guardrails, endpoint normalization, serverless detection, before/after verification).find plugins -type l -name instrument-agent | wc -lreturns 5./instrument-agentslash-command claim; the other four use editor-standard discovery wording).grep -rin "trace_with_tags" skills/instrument-agent/returns no matches.run_evals.py --dry-runagainstmonitoring-advisor/push-ingestion/prevent; YAML parse againstmonitoring-advisor/prevent/incident-response/context-detectionlive-evals — all clean.testConnection+get_agent_metadataBEFORE/AFTER cycle against a real MC AO tenant; LLM walks SKILL.md → references → diff-approval interactively againstsample_agentandsample_serverless_agentfixtures./instrument-agentslash command in Claude Code by installing the plugin from this branch.🤖 Generated with Claude Code
/code-review