Skip to content

AO-427: Add monte-carlo-instrument-agent skill#78

Merged
bayardneville merged 49 commits into
mainfrom
Jonathan-Avila/ao-427-build-v1-instrument-agent-skill
May 13, 2026
Merged

AO-427: Add monte-carlo-instrument-agent skill#78
bayardneville merged 49 commits into
mainfrom
Jonathan-Avila/ao-427-build-v1-instrument-agent-skill

Conversation

@jonavila
Copy link
Copy Markdown

@jonavila jonavila commented May 7, 2026

Summary

Adds the monte-carlo-instrument-agent skill (AO-427) — a Setup-bucket skill that walks an MC Agent Observability customer through instrumenting a new AI agent in their Python codebase end-to-end. The skill detects AI libraries (LangChain/LangGraph, OpenAI, Anthropic, CrewAI, Bedrock, SageMaker, Vertex AI, plus the long tail via live PyPI fetch), classifies the runtime as serverless or long-running, and proposes mc.setup() plus @trace_with_workflow / @trace_with_task decorator diffs — always asking for explicit per-file approval before any edit.

The skill produces traces; monitoring-advisor consumes them. Sequential, not overlapping — bidirectional routing contract locked in code via symmetric should-not eval cases on both skills.

Plugin version bumped to 1.11.0 in lockstep across all five editor plugins.

What this PR enables

  • /instrument-agent slash command in Claude Code — kicks off the workflow against the current Python codebase. Ships via the same SKILL.md routing on Cursor, Codex, Copilot CLI, and OpenCode (each via the editor's standard skill discovery; no slash-command surface in those four).
  • Live + fallback library detection. The skill calls scripts/fetch_sdk_docs.py to pull the current supported-instrumentor list from the SDK README on GitHub or from PyPI's info.description (fail-closed to a snapshotted instrumentor_map.json with stale-data warnings if both fail). Live success requires all 8 PRD core libraries present — partial parses fall back rather than ship incomplete coverage.
  • Serverless detection. scripts/detect_libraries.py flags serverless.yml / template.yaml / vercel.json / wrangler.toml / Lambda handler patterns / mangum/chalice/zappa/aws-lambda-powertools deps. When detected, the skill proposes the SimpleSpanProcessor variant of mc.setup() — the default BatchSpanProcessor silently drops traces on Lambda, which is documented as the Restructure #1 failure mode in references/troubleshooting.md.
  • Idempotent OTLP endpoint normalization. If the user-supplied URL already ends in /v1/traces, use as-is; otherwise append. Never double-append. Resolved final URL is rendered back to the user before code-gen.
  • Three V1 redaction pathways — manual mc.create_llm_span with redacted prompts_to_record, env-var disable of prompt/completion capture (the default in setup-template.md), and selective hybrid. Redaction is a gating step before mc.setup() is generated, not a post-hoc consideration.
  • Before/after verification via get_agent_metadata — snapshot existing agents before changes; confirm the new agent appears with a fresh MCON after the user runs the instrumented agent. Distinguishes dev/prod twins via MCON, not display name.
  • No-silent-edit guardrail covering dependency files, source code, and env files. Mechanically enforced in live evals via must_not_call: [Edit, Write, NotebookEdit] on three guardrail cases — not just judge-rubric prose.
  • Bidirectional disambiguationinstrument-agent/trigger-evals.json lifts 3 monitoring-advisor should-trigger prompts as should-not cases; monitoring-advisor/trigger-evals.json adds 3 should-not cases for mc.setup() and SDK-install requests.
  • Agent instrumentation Conversation Signal in context-detection (Step 0 fast-path, Step 1 categorize-intent, Step 4 routing) so ambiguous prompts route correctly and clear ones bypass the signal catalog entirely.

Key Decisions

See PRD: instrument-agent Skill for full design context.

  • Skill scope vs PRD Bump to 1.0.11: fix skill script paths #4. PRD requirement Bump to 1.0.11: fix skill script paths #4 originally listed three decorators (@trace_with_tags, @trace_with_workflow, @trace_with_task). Per Bayard's DM, @trace_with_tags was explicitly excluded — the skill must never propose it. Tasks-nested-in-workflows already provides the filtering surface MC's evaluation pipeline needs. Strict no-mention gate enforces this (grep -rin "trace_with_tags" skills/instrument-agent/ returns no matches): naming the forbidden token in NEVER callouts puts it in the LLM's context, which combined with the SDK README's Travel Assistant example would have made the negative callout an attractor rather than a deterrent.
  • Hybrid version-pin strategy. PyPI live-fetch is the primary source of truth (per Bayard's "don't hardcode versions"); the instrumentor_map.json snapshot ships last-known-compatible pins with a snapshot_date and STALE warning when used; fail-closed if both live-fetch fails AND the snapshot is older than ~6 months. Reconciles Bayard's intent with the reviewer concern that unpinned-latest installs silently break compatibility.
  • span_processor kwarg is undocumented in the SDK README. Discovered from monte-carlo-data/saas-serverless/ai-recommendations/common/tracing.py. The skill bakes that knowledge into setup-template.md until the upstream README catches up. Both production examples (ai-agent for long-running, saas-serverless for serverless) are cited inline by URL+SHA.
  • Live verification deferred to QA. Phase 3 ran structural-only — programmatic detect_libraries.py smoke against realistic LangGraph fixtures. The live verification path (testConnection + get_agent_metadata against a real MC AO tenant) needs MCP credentials we don't have locally and shouldn't pollute external tenants from a PR. Captured under "Open Questions" in the plan; QA item before/after merge.
  • Substring matching is a foot-gun for credential routing. First-round fix for the GitHub-token leak used "github.com" in url; reviewer caught that https://example.com/github.com/path would still leak the token. Fixed by parsing with urllib.parse.urlparse and comparing hostname against an exact allowlist {github.com, raw.githubusercontent.com, api.github.com}.

Test plan

  • python3 skills/instrument-agent/scripts/test_detect_libraries.py — 57/57 pass against 11 fixtures (long-running, serverless, mixed manifests, boto3 ambiguity, existing setup, no-deps, etc.).
  • python3 plugins/claude-code/evals/run_evals.py --skill instrument-agent --dry-run — 21 trigger-eval cases load.
  • python3 plugins/claude-code/evals/run_evals.py --skill monitoring-advisor --dry-run — 36 cases load (3 new should-not for bidirectional routing).
  • YAML parse for instrument-agent/live-evals-dev.yaml — 6 behavior cases (silent-edit guardrails, endpoint normalization, serverless detection, before/after verification).
  • find plugins -type l -name instrument-agent | wc -l returns 5.
  • All 5 plugin manifests at version 1.11.0; matching CHANGELOG entries (claude-code keeps /instrument-agent slash-command claim; the other four use editor-standard discovery wording).
  • grep -rin "trace_with_tags" skills/instrument-agent/ returns no matches.
  • Regression check: run_evals.py --dry-run against monitoring-advisor / push-ingestion / prevent; YAML parse against monitoring-advisor / prevent / incident-response / context-detection live-evals — all clean.
  • QA / pre-merge: live testConnection + get_agent_metadata BEFORE/AFTER cycle against a real MC AO tenant; LLM walks SKILL.md → references → diff-approval interactively against sample_agent and sample_serverless_agent fixtures.
  • Smoke test the new /instrument-agent slash command in Claude Code by installing the plugin from this branch.

🤖 Generated with Claude Code

  • Ran /code-review

jonavila and others added 30 commits May 6, 2026 18:42
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d stale)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The shared _headers() helper sent Authorization: Bearer $GITHUB_TOKEN with
every request, including the PyPI metadata fetch. Split into a per-call
with_github_auth flag and route only GitHub-host URLs through it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the GitHub README fetch fails (private repo, 404, rate limit), use the
PyPI metadata description field as a backup README source. PyPI's description
mirrors the GitHub README for montecarlo-opentelemetry, so the same parser
produces the same instrumentor list. live_failed now reflects whether we
have a usable instrumentor list, not which source supplied it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The committed instrumentor_map.json uses `supported_instrumentors`, matching
fetch_sdk_docs.py and the fallback contract. detect_libraries.py was looking
for `libraries`, so every run silently warned and discarded the snapshot
(losing snapshot_date, notes, and extra dependency coverage like
langchain-openai). Accept `supported_instrumentors` as the canonical key,
keep `libraries` as a backwards-compat read for any older local snapshot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
requirements.txt was already routed through _safe_read_text (5 MB cap), but
pyproject.toml and Pipfile opened directly via tomllib.load(fh), so a
malicious or malformed manifest could load arbitrarily large content into
memory. Switch to _safe_read_text + tomllib.loads so the same cap applies
to all manifest formats.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The harness mirrored push-ingestion's check()-counter pattern, but pytest
would not fail individual test_* functions because there were no
assertions — only main() exited nonzero. Make check() raise AssertionError
on failure so pytest catches per-test failures, and have main() wrap each
test in try/except so standalone runs still report the full pass/fail
summary across all tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Substring matching on the URL ("github.com" in url) was exploitable via an
override like https://example.com/github.com/README.md, which would still
send Authorization: Bearer $GITHUB_TOKEN to a non-GitHub host. Parse the
URL with urllib.parse.urlparse and compare hostname against an exact
allowlist of trusted GitHub hosts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The header-based parser only caught Langchain/LangGraph and OpenAI because
those are the only "# For X" examples in the SDK README — Anthropic,
Bedrock, CrewAI, SageMaker, Vertex AI, and the long tail live in a markdown
bullet list further down. The script reported source: live with 3 of 8 PRD
core libraries and downstream skill logic could treat that partial list as
complete.

Add a second parsing pass over PyPI-link bullets, deduplicate by
(library, package), and require all PRD core libraries (langchain,
langgraph, openai, anthropic, crewai, bedrock, sagemaker, vertexai) to be
present before declaring live success. Missing any one of them now falls
back to the snapshot's known-complete coverage instead of shipping a
partial list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 1 router with locked frontmatter (name: monte-carlo-instrument-agent;
description 241 chars; bucket: Setup; combined description+when_to_use 1093
chars) and a CRITICAL no-silent-edit guardrail that covers all file
modifications: dependency files, source code, and env files. Routing table
links to the seven Tier 3 references that are added in subsequent commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "Available slash commands" section advertised /instrument-agent before
the matching plugins/claude-code/commands/instrument-agent/ registration
landed (Phase 4 work). Documenting a slash command before it exists fails
per-commit consistency review. Phase 4's command-registration commit will
re-add the section once the command file and plugin.json entry are in
place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
11-step Tier 2 procedural document — entry point after the SKILL.md router.
Walks the full flow: detect libraries/runtime/existing-setup, branch on
self-hosted vs MC-hosted OTel, redaction gate, BEFORE-snapshot via
get_agent_metadata, idempotent endpoint resolution, propose dependency edits,
propose mc.setup() (serverless variant if needed), propose decorator diffs,
confirm env vars (MC-hosted only, presence-only), AFTER verification, branch
to troubleshooting on failure. CRITICAL callouts surface the never-modify
guardrail, never-trace_with_tags rule, and never-echo-credential rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 3 reference for the library/runtime detection step. Documents the PRD
core-library list as a stable contract (langchain, langgraph, openai,
anthropic, crewai, bedrock, sagemaker, vertexai) shared across
detect_libraries.py, fetch_sdk_docs.py, and instrumentor_map.json. Covers
two-tier handling (core with fallback templates vs live-fetch-required),
the bedrock/sagemaker boto3 ambiguity rule (route to unsupported, not
detected), serverless-framework signal table, existing-mc.setup() routing,
and clean no-match exit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 3 reference for mc.setup() generation. Cites both production
templates by URL+SHA: long-running container variant from ai-agent
(default BatchSpanProcessor) and serverless variant from saas-serverless
(SimpleSpanProcessor required — Lambda freezes the process and the batch
processor would otherwise be suspended before flushing). Includes
idempotent OTLP endpoint normalization (never double-append /v1/traces),
existing-mc.setup() decision matrix, prompts-disabled-by-default rationale,
and presence-only credential checks (never read or echo MCD_DEFAULT_API_TOKEN).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 3 reference for selecting and placing the two in-scope decorators
(@mc.trace_with_workflow on orchestration functions; @mc.trace_with_task
on LLM-calling task functions). Per Bayard, the skill must NEVER add
@mc.trace_with_tags. Cites the canonical placement example from
saas-serverless graph_creation.py @ ec2af0c L162. Includes the unified
diff format for proposing decorator placements per file, conservative
defaults (one workflow + one task per LLM call to start), and the
no-edits-without-approval guardrail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 3 reference for the two-call get_agent_metadata flow used in workflow
steps 4 (BEFORE snapshot) and 10 (AFTER verification). Documents the dev/
prod twin disambiguation via MCON: the same agent_name with a different
MCON is a genuinely new agent. Pre-flight testConnection check on the
Monte Carlo MCP server bails early if MCP is unavailable (verification
step would silently fail otherwise). NEVER-poll guidance — the skill
calls the MCP tool exactly twice; the user driving when they ran the
agent triggers the AFTER call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 3 reference for V1 redaction pathways. Documents what the SDK captures
by default (full prompt + completion content as span attributes, sent over
OTLP to the configured endpoint) so the customer can make an informed
opt-in decision. Walks the three PRD-named pathways: (A) manual
mc.create_llm_span with redacted prompts_to_record — cites the saas-
serverless graph_creation.py @ ec2af0c L92 example; (B) disable prompt/
completion capture entirely via instrumentor env vars (the default in
setup-template.md); (C) selective hybrid for codebases with isolated
sensitive paths. Out of scope for v1: auto-detection of sensitive content
and auto-scaffolded redactor functions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tier 3 reference covering the five failure modes for missing traces, in
diagnostic priority order: (1) serverless BatchSpanProcessor foot-gun —
checked first because it is by far the most common cause of missing
traces on Lambda; (2) SDK init not running — module never imported, env
guard not satisfied; (3) missing credentials on the MC-hosted path with
presence-only checks; (4) wrong instrumentor versions — cross-checked
against the live PyPI compatibility table; (5) upstream pipeline not
deployed (out of scope for the skill — escalate to the customer's AO
setup team). Cites the saas-serverless tracing.py production reference
for the SimpleSpanProcessor fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The strict no-mention gate (grep -rin "trace_with_tags") is the right
primitive after all. Telling an LLM "don't add @trace_with_tags" puts
the decorator name in its context window — combined with the SDK
README's Travel Assistant example showing @mc.trace_with_tags, that
made the negative callout an attractor rather than a deterrent.

Replaced explicit exclusions in SKILL.md, workflow.md, and
decorator-placement.md with positive-only scope language: "the only
two decorators in scope for v1 are @trace_with_workflow and
@trace_with_task." The pair propagates and filters appropriately for
monitoring-advisor; other SDK tracing primitives are simply not part
of v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Minimal LangGraph-shaped agent (~50 LOC) with ChatOpenAI in a graph node,
plus requirements.txt listing langchain, langgraph, langchain-openai, and
openai. Targets the long-running container template — no serverless
framework, no lambda_handler. Used by Phase 3's structural smoke tests
to validate detect_libraries.py against a realistic agent codebase.

detect_libraries on this fixture reports detected=[langchain, langgraph,
openai], runtime=long_running, no serverless signals, no existing setup —
which is what workflow.md step #1 expects to drive the rest of the flow
toward the long-running mc.setup() template.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same LangGraph agent as sample_agent/, but exposed as a Lambda handler
with serverless.yml + aws-lambda-powertools dependency. Targets the
serverless mc.setup() variant (SimpleSpanProcessor required).

detect_libraries on this fixture flips runtime to serverless and surfaces
three signals (serverless.yml, aws-lambda-powertools, lambda_handler) —
which drives the workflow toward the SimpleSpanProcessor template per
references/setup-template.md.

Note: the agent.py docstring intentionally avoids the literal token
combination that detect_libraries.py's regex-based existing-setup grep
matches — that grep has docstring-vs-call-site false positives that we
work around here at the fixture level. Tracking the broader detector
correctness as a follow-up in Lessons.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test_sample_agent and test_sample_serverless_agent assert that
detect_libraries produces JSON matching what each workflow step in
references/workflow.md describes — for the long-running container path
and the serverless path respectively. Validates the structural surface
end-to-end: detected libraries, runtime classification, serverless
signals (serverless.yml, lambda_handler, aws-lambda-powertools),
and absence of false-positive existing-setup hits.

Total: 57/57 checks pass (15 new).

Live verification (testConnection + get_agent_metadata against an MC AO
tenant) is deferred to QA per the Phase 3 scope decision; this commit
covers the structural-only portion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Relative symlinks from each editor plugin's skills/ directory to the
canonical skill source under skills/instrument-agent/. Required by the
mc-agent-toolkit plugin architecture per CONTRIBUTING.md so the skill
ships through claude-code, cursor, codex, copilot, and opencode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds plugins/claude-code/commands/instrument-agent/instrument-agent.md
(per CONTRIBUTING.md L170-174 — required for the skill to be discoverable
as mc-agent-toolkit:instrument-agent), threads the directory through
plugin.json's commands array, and re-adds the "Available slash commands"
section to SKILL.md (deferred from acfcf8b since the registration didn't
exist yet).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
21 cases — 11 should-trigger covering all 8 PRD core libraries (LangChain,
LangGraph, OpenAI, Anthropic, CrewAI, Bedrock, SageMaker, Vertex AI), the
serverless variant ("instrument my Lambda agent"), a generic MC tracing
prompt, and the intentionally-ambiguous "set up agent observability" case
(routes here per Bayard's DM); 10 should-not-trigger lifting verbatim
prompts from monitoring-advisor (3 — locking the bidirectional routing
contract), plus incident-response (2 — alert triage), push-ingestion (2),
and connection-auth-rules (1).

Uses "no-trigger" with hyphen per run_evals.py:120 — earlier underscore
form would have silently passed --dry-run but failed real runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6 behavior cases covering the workflow's load-bearing rules: three
silent-edit guardrails (decorator, dependencies, mc.setup) — each prompt
asks the skill to skip approval and the rubric expects propose-then-wait;
endpoint-normalization multi-turn case verifying http://localhost:4318/v1/traces
is not double-appended (output_must_not_contain blocks /v1/traces/v1/traces);
serverless detection from inlined serverless.yml content (the eval model
has no per-case fixture/cwd field, so filesystem state is expressed in
the prompt); before/after verification multi-turn case requiring
get_agent_metadata calls at both ends and MCON-based dev/prod twin
disambiguation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Locks the routing contract on the monitoring-advisor side: three new
should-not cases covering instrument-agent's domain (mc.setup() wire-up,
generic instrument-agent activation, SDK install + decorator scaffolding
for Lambda). Symmetric with instrument-agent/trigger-evals.json's
should-not cases that lift verbatim from this file's should-trigger list.

Now if anyone flips the meaning of "instrument" vs "monitor" in either
skill's description, both eval files break — the contract lives in code,
not just docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add must_not_call: [Edit, Write, NotebookEdit] to live-01, live-02,
  live-03 (silent-edit guardrails). Previously the guardrail relied on
  the judge model reading the rubric prose; now Edit/Write/NotebookEdit
  are blocked deterministically. PRD #8 is the single most important
  guardrail in the skill — checking it mechanically beats trusting
  prose interpretation.

- Remove turn-level judge_rubric entries from live-04 and live-06.
  TurnCriteria.from_dict() in plugins/claude-code/evals/models.py
  ignores judge_rubric — that field only exists on the case-level
  CaseCriteria subclass — so those turn-level rubrics were silently
  not evaluated. Merged the per-turn specifics into each case's
  case-level rubric with explicit "Turn 1 / Turn 2 / Turn 3" sections
  so nothing's lost.

- Trim trailing blank line at end of SKILL.md (git diff --check
  was failing on commit 08f8c62's introduction of "new blank line
  at EOF").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jonavila
Copy link
Copy Markdown
Author

jonavila commented May 7, 2026

End-to-end testing summary

Walked the skill end-to-end against two real Python codebases and verified trace
emission with a local OTLP receiver. This surfaced 5 issues, all fixed in 7b0a5ca,
and a clean re-test confirms the fixes work. Recording the test plan + results
here so reviewers don't have to re-derive them.

Setup

  • Receiver: otel-desktop-viewer in Docker (ghcr.io/ctrlspice/otel-desktop-viewer:latest-arm64),
    binding ports 4317/4318/8000.
  • Long-running fixture: Fresh LangGraph + LangChain + Anthropic project, two-node
    graph (call_llmsummarize), no MC instrumentation up front.
  • Serverless fixture: Same shape but as a Lambda handler (def lambda_handler(...)
    • serverless.yml) to trip the skill's serverless detection.
  • Pre-flight + verify steps (0, 4, 10) skipped — those are MC-MCP-gated and
    separately validated against real customer engagements.

Resulting trace trees

Long-running (9 spans):

run_agent                         [@trace_with_workflow → workflow=qa-with-summary]
└── invoke_agent LangGraph
    └── LangGraph.workflow
        ├── execute_task call_llm
        │   ├── call_llm          [@trace_with_task → task=answer-question]
        │   └── ChatAnthropic.chat
        └── execute_task summarize
            ├── summarize_answer  [@trace_with_task → task=summarize]
            └── ChatAnthropic.chat

Serverless (5 spans, no LangGraph layer):

lambda_handler           [@trace_with_workflow → workflow=qa-with-summary-lambda]
├── answer_question      [@trace_with_task → task=answer-question]
│   └── ChatAnthropic.chat
└── summarize            [@trace_with_task → task=summarize]
    └── ChatAnthropic.chat

Both confirm the workflow span as root, task spans nested correctly,
langchain auto-instrumentor producing LLM spans inside the task,
montecarlo.workflow / montecarlo.task propagating to all descendants,
and service.name matching mc.setup(agent_name=…).

Findings (all fixed in 7b0a5ca)

# Issue Fix
1 detect_libraries.py dropped version_constraint from the snapshot when emitting suggested_instrumentors Pass through version_constraint; workflow step 6 + library-detection.md updated to require pinning
2 OpenLLMetry instrumentors at <=0.53.4 pass module= to wrap_function_wrapper, which wrapt 2.x renamed to target= — without a wrapt<2 pin, mc.setup() raises TypeError at import New additional_pins: ["wrapt<2"] field in instrumentor_map.json, flowed through suggested_instrumentors
3 fetch_sdk_docs.py GitHub README 404 (private repo) printed "Live fetch failed" — misleading since PyPI fallback worked Reword warning + hint at GITHUB_TOKEN for private-repo access
4 existing_setup detection false-positive on files that import montecarlo_opentelemetry only to use decorators (handlers, not setup files) Tighten _detect_existing_setup to require both an SDK import AND an actual mc.setup(...) call
5 No documented local verification path for customers iterating in dev Document otel-desktop-viewer (or any local OTLP receiver) as an optional pre-MC sanity check in verify-traces.md

Re-test from clean state

Rolled both fixtures back to uninstrumented (deleted tracing.py, removed
decorators, restored unpinned requirements.txt, recreated venvs) and
re-walked the skill workflow end-to-end. This time the skill's
suggested_instrumentors output drove the pinning automatically — no
manual intervention.

Check Pre-fix (manual debug) Post-fix (skill-driven)
pip install -r requirements.txt Failed twice (wrapt 2.x → 0.60 instrumentor → final pinned set) Clean first try
Long-running trace 9 spans, correct hierarchy 9 spans, identical hierarchy
Lambda trace 5 spans, correct hierarchy 5 spans, identical hierarchy
existing_setup.files (after instrumentation) ["handler.py", "tracing.py"] (handler false-positive) ["tracing.py"] only

What this doesn't cover

  • MC-MCP-gated steps (0/4/10) — separately validated.
  • Real Lambda freeze behavior — local python handler.py flushes via process
    exit, not via Lambda runtime suspend. The SimpleSpanProcessor choice is
    structurally correct (vs. BatchSpanProcessor losing spans at freeze), but
    freeze-tolerance has to be tested in a real Lambda environment.

The skill's setup-template.md and redaction.md documented per-instrumentor env
vars (`OTEL_INSTRUMENTATION_LANGCHAIN_TRACE_PROMPTS=false`, etc.) as the
"prompts disabled" mechanism, but those env vars do not exist in the OpenLLMetry
instrumentors at <=0.53.4. The instrumentors instead read
TRACELOOP_TRACE_CONTENT, which defaults to "true" (content captured) — meaning
a comment alone in the template did NOT change the actual default.

Surfaced empirically by inspecting an e2e test trace: ChatAnthropic.chat spans
included `gen_ai.prompt.0.content` and `gen_ai.completion.0.content` despite
the template promising "by default this template DOES NOT capture prompt /
completion content". The privacy property the skill claimed wasn't real.

Fix:
- setup-template.md long-running and serverless templates now set
  `os.environ.setdefault("TRACELOOP_TRACE_CONTENT", "false")` at module top,
  before importing any instrumentor. Privacy default-off is now a code
  guarantee, not just a comment. Operators who want capture override by
  exporting TRACELOOP_TRACE_CONTENT=true.
- setup-template.md Section 3 explains the actual mechanism and adds a
  CRITICAL callout that a comment alone does not flip the default.
- redaction.md drops the non-existent per-instrumentor env vars and
  documents TRACELOOP_TRACE_CONTENT as the single source of truth for
  OpenLLMetry's instrumentor family.

Verified: re-running both e2e fixtures with the corrected template, the
ChatAnthropic.chat spans now have no prompt or completion content, while
token counts, model identifiers, and request metadata remain captured.

Tests: 107 + 42 helper-script tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jonavila
Copy link
Copy Markdown
Author

jonavila commented May 7, 2026

6th finding (after-the-fact follow-up)

While walking through the trace UI to take a closer look at the spans, I noticed:

gen_ai.prompt.0.content: What is the capital of France?
gen_ai.completion.0.content: The capital of France is Paris.

Despite the long-running template's comment claiming "By default this template DOES NOT capture prompt/completion content. To capture prompts ... set OTEL_INSTRUMENTATION_LANGCHAIN_TRACE_PROMPTS=true", the spans had full prompt and completion text.

Root cause: the env var name OTEL_INSTRUMENTATION_LANGCHAIN_TRACE_PROMPTS does not exist in the OpenLLMetry instrumentors at <=0.53.4 (verified by grepping the installed source for langchain, openai, and anthropic instrumentors). They read a single env var, TRACELOOP_TRACE_CONTENT, defaulted to "true". So the template's comment was both naming a non-existent env var and wouldn't have flipped the default even if the var did exist.

Fix (in 62e94d5):

  • Templates now set os.environ.setdefault("TRACELOOP_TRACE_CONTENT", "false") at module top, before importing any instrumentor. Privacy default-off is a code guarantee, not just a comment.
  • Operators who want capture override by exporting TRACELOOP_TRACE_CONTENT=true.
  • setup-template.md Section 3 rewritten to explain the actual mechanism with a CRITICAL callout: a comment alone is not the privacy default.
  • redaction.md drops the four non-existent per-instrumentor env vars; documents TRACELOOP_TRACE_CONTENT as the single source of truth for the OpenLLMetry instrumentor family.

Empirical verification: re-running both fixtures with the corrected template, ChatAnthropic.chat spans have:

Before fix After fix
gen_ai.prompt.0.content "What is the capital of France?" (absent)
gen_ai.completion.0.content "The capital of France is Paris." (absent)
gen_ai.usage.input_tokens 14 14
gen_ai.usage.output_tokens 10 10
gen_ai.request.model claude-haiku-4-5-... claude-haiku-4-5-...

Privacy property restored. Tests still green (107 + 42).

This is the kind of bug only e2e testing catches — the unit tests assert on script output, not on the instrumented spans' actual content.

@toffer toffer self-requested a review May 9, 2026 05:25
@jonavila jonavila requested a review from bayardneville May 11, 2026 16:42
Comment thread skills/instrument-agent/tests/fixtures/boto3-only/requirements.txt
Comment thread skills/instrument-agent/scripts/test_fixtures/no-deps/.gitkeep Outdated
Comment thread skills/instrument-agent/references/library-detection.md Outdated
Comment thread skills/instrument-agent/references/setup-template.md Outdated
Comment thread skills/instrument-agent/references/setup-template.md Outdated
Comment thread skills/instrument-agent/references/setup-template.md Outdated
Comment thread skills/instrument-agent/references/workflow.md Outdated
Comment thread skills/instrument-agent/references/workflow.md Outdated
Comment thread skills/instrument-agent/SKILL.md Outdated
Comment thread skills/instrument-agent/SKILL.md Outdated
Comment thread skills/instrument-agent/SKILL.md Outdated
Major themes from the review:

- Data residency correction: trace content lives in the customer's
  environment; the MC-hosted collector is a write-back pass-through that
  never persists trace content on MC's side. Reframed redaction.md,
  setup-template.md, workflow.md, SKILL.md accordingly.
- Capture-on as the canonical default: prompts/completions are part of
  the value proposition (PRD). Redaction is opt-in for customers with
  stricter privacy requirements (compliance, contractual restrictions).
  Workflow Step 3 question rewritten; default template no longer disables
  TRACELOOP_TRACE_CONTENT.
- Redaction model corrected: under auto-instrumentation,
  TRACELOOP_TRACE_CONTENT=false is a mandatory prerequisite for any
  redaction (otherwise duplicate spans). Manual mc.create_llm_span with
  placeholder-substitution is an optional additional layer. Dropped
  Pathway/Strategy C entirely (decorators don't gate auto-instrumentor
  capture). Replaced the hash+length recommendation with the
  placeholder-substitution technique.
- PyPI-only refactor of fetch_sdk_docs.py: removed the GitHub README
  fetch (private repo always 404s anonymously), the snapshot fallback
  (samkcrespo confirmed not a hard requirement), and the PRD_CORE_LIBRARIES
  constant. Now PyPI-only, fails closed on errors. Pruned
  test_fetch_sdk_docs.py to match (-46 lines).
- instrumentor_map.json reframed as the static detection map used by
  detect_libraries.py for offline package recognition — not a fallback
  for fetch_sdk_docs.py. Stripped snapshot/STALE metadata. Kept
  additional_pins (wrapt<2) because PyPI doesn't expose transitive pins.
- Library-detection.md: removed the "Tier A core / Tier B" two-tier
  framing. Single tier — supported set is whatever PyPI shows. Decoupled
  decorator/manual-span availability from auto-instrumentor presence.
  Added guidance to ask the user when serverless detection is ambiguous.
- Setup-template.md restructured: four self-contained templates
  (long-running, serverless+MC-hosted+MCD_DEFAULT, serverless+MC-hosted+
  OTEL_HEADERS, serverless+self-hosted) instead of two base + conditional
  callouts. Removed "Earlier versions of this template" historical framing.
- Pre-flight degradation: MC MCP availability is no longer a hard
  requirement. If test_connection fails, the workflow proceeds and asks
  the customer to verify in the MC UI manually instead of exiting.
- Verify-traces.md timing bumped to acknowledge 10-15 minutes for
  first-time visibility on low-traffic / dev agents.
- Decorator-placement.md: removed saas-serverless private-repo links;
  softened the task-name rule (function name fine when meaningful and
  distinct); reworded "LLM-calling" to "LLM-calling or tool-calling".
- All private-repo references removed (saas-serverless, ai-agent,
  montecarlo-opentelemetry — all private). Public docs source is now
  https://pypi.org/project/montecarlo-opentelemetry/.
- Customer-facing CHANGELOG cleanup: slimmed the 1.11.0 entries in all
  five plugin CHANGELOGs to a short customer-facing description; dropped
  the "### Fixed" sections (this is the first release of the skill).
  Same shortening applied to README.md and skills/README.md skill rows.
- New GitHub Actions workflow runs test_detect_libraries.py +
  test_fetch_sdk_docs.py on PRs/pushes touching skills/instrument-agent/**.
- test_fixtures/no-deps/.gitkeep replaced with a one-line README
  explaining the directory tests the no-deps detection path.

All 107 detect_libraries tests and 33 fetch_sdk_docs tests pass locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jonavila
Copy link
Copy Markdown
Author

Review feedback addressed in 709107a

Thanks for the thorough pass, @bayardneville — all 44 threads are answered inline. Summary of the themes:

Factual corrections

  • Data residency: trace content lives in the customer's environment; the MC-hosted collector is a write-back pass-through. Reflected in redaction.md, setup-template.md, workflow.md, SKILL.md.
  • Capture-on is the canonical default (PRD value proposition). Redaction is opt-in for customers with stricter privacy requirements. Workflow Step 3 question + SKILL.md routing + default template all reflect this.
  • Redaction model rewritten: under auto-instrumentation, TRACELOOP_TRACE_CONTENT=false is a mandatory prerequisite for any redaction (else duplicate spans). Manual mc.create_llm_span with placeholder-substitution is an optional additional layer on top. Pathway/Strategy C is removed entirely. Hash + length is replaced with placeholder-substitution.
  • No "PRD core list"library-detection.md is single-tier (whatever PyPI shows). The eight previously-called-core libraries are presented as non-exhaustive examples that defer to PyPI.
  • Decorator/manual-span availability decoupled from auto-instrumentor presence (they're independent).

Structural changes

  • setup-template.md Section 1: four self-contained templates (long-running, serverless+MC-hosted+MCD_DEFAULT, serverless+MC-hosted+OTEL_HEADERS, serverless+self-hosted). No more conditional-callouts variant block.
  • fetch_sdk_docs.py refactored to PyPI-only: -292 lines. Dropped GitHub fetch (private repo always 404s), snapshot fallback (per your + samkcrespo's confirmation), PRD_CORE_LIBRARIES. Fails closed on PyPI errors. instrumentor_map.json reframed as the static detection map for detect_libraries.py only (no version-pin fallback role).
  • Pre-flight degradation: MC MCP availability is no longer required. If test_connection fails, the skill proceeds and asks the customer to verify in the MC UI manually.

Polish

  • All private-repo links (saas-serverless, ai-agent, montecarlo-opentelemetry) removed. Public docs source is now https://pypi.org/project/montecarlo-opentelemetry/.
  • verify-traces.md timing bumped to 10–15 minutes for first-time / low-traffic / dev agents.
  • Decorator-placement: softened task-name rule (function name OK when meaningful); "LLM-calling or tool-calling" wording.
  • Plugin CHANGELOGs slimmed across all five (claude-code, codex, copilot, cursor, opencode). Dropped "### Fixed" sections (this is the first release). README.md + skills/README.md skill rows slimmed.
  • Added .github/workflows/instrument-agent-tests.yml — runs test_detect_libraries.py (107 cases) + test_fetch_sdk_docs.py (33 cases) on every PR/push touching skills/instrument-agent/**. Replaced test_fixtures/no-deps/.gitkeep with an explanatory README.

All 140 tests pass locally. CI should pick them up on this push.

…-build-v1-instrument-agent-skill

# Conflicts:
#	plugins/claude-code/.claude-plugin/plugin.json
#	plugins/claude-code/CHANGELOG.md
#	plugins/codex/.codex-plugin/plugin.json
#	plugins/codex/CHANGELOG.md
#	plugins/copilot/CHANGELOG.md
#	plugins/copilot/plugin.json
#	plugins/cursor/.cursor-plugin/plugin.json
#	plugins/cursor/CHANGELOG.md
#	plugins/opencode/CHANGELOG.md
#	plugins/opencode/package.json
Comment thread skills/instrument-agent/references/redaction.md Outdated
Copy link
Copy Markdown

@bayardneville bayardneville left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Claude:

Verification re-review of 65b3fd6 (delta 62e94d5..65b3fd6). All 7 prior BLOCKERs and 11 prior ISSUEs are addressed. Three new ISSUEs and four lower-priority items from the rewrite, posted as inline comments. Full review doc at .work/AO-427/reviews/ao-427-build-v1-instrument-agent-skill.review.2026-05-12-65b3fd6.md.

Comment thread skills/instrument-agent/references/setup-template.md Outdated
Comment thread skills/instrument-agent/references/setup-template.md Outdated
Comment thread skills/instrument-agent/scripts/instrumentor_map.json Outdated
Comment thread skills/instrument-agent/references/setup-template.md Outdated
Comment thread skills/instrument-agent/scripts/test_detect_libraries.py Outdated
Six items from the second review pass:

F1 (setup-template.md) — Template 2 partial-config crash. Replaced
os.environ[...] reads of MCD_DEFAULT_API_ID/TOKEN with os.environ.get +
a logging.warning skip-and-log block, mirroring the existing endpoint
presence check. A Lambda with OTEL_ENDPOINT set but credentials missing
now runs uninstrumented instead of crashing at cold start with KeyError.

F2 (setup-template.md) — concrete serverless splice for the
prompts-disabled variant. Added a CRITICAL callout plus a full patched
Template 2 snippet showing the three-line os.environ.setdefault block at
module scope (between `import os` and any
`opentelemetry.instrumentation.*` import). Previous "apply at the top of
the chosen Section 1 template" wording invited customers to drop the
line inside `init_tracing()`, which is a no-op because the instrumentors
read the env var at import time.

F3 (detect_libraries.py + instrumentor_map.json + references) — biggest
change. Deleted instrumentor_map.json and rewrote detect_libraries.py as
a thin discovery layer. Output is now `dependencies` (sorted raw pip
names) + runtime/serverless_signals/existing_setup. AI-library matching
is the LLM's job, using fetch_sdk_docs.py's live PyPI list as the
ground-truth supported set. Removed AMBIGUOUS_AWS_* hardcoded
disambiguation — ambiguous deps (boto3, google-cloud-aiplatform, etc.)
are now surfaced via user prompt per library-detection.md section 4.
Moved the wrapt<2 transitive-pin guidance to troubleshooting.md as a
symptom-driven fix (TypeError: wrap_function_wrapper() got an unexpected
keyword argument 'module') rather than baking it into every install
diff. Updated library-detection.md, workflow.md, and SKILL.md to reflect
the new contract.

F7 (setup-template.md) — env-var naming consistency. Standardized on
OTEL_ENDPOINT across all four templates (Templates 2/3/4 already used
it; Template 1 + the prompts-disabled variant were updated from
MC_OTEL_ENDPOINT). Also renamed Template 1's local variable from the
uppercase MC_OTEL_ENDPOINT to lowercase otel_endpoint so the env var
name and the local variable name aren't textually identical.

F9 (test reorg) — moved skills/instrument-agent/scripts/test_*.py and
test_fixtures/ to a sibling skills/instrument-agent/tests/ directory
with the fixtures renamed to tests/fixtures/. Updated path constants in
both test files, sys.path.insert in test_fetch_sdk_docs.py, and the CI
workflow paths in .github/workflows/instrument-agent-tests.yml. CI
`paths:` filter (skills/instrument-agent/**) unchanged. Both test suites
still pass after the move (53 + 33 = 86 checks).

redaction.md:27 — fixed factual claim about the default OTLP endpoint.
The SDK has no built-in default; the customer always supplies it. The
templates resolve it from an OTEL_ENDPOINT env var so they can flip
between MC-hosted and self-hosted without code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jonavila
Copy link
Copy Markdown
Author

Round 2 review fixes — bayardneville (2026-05-12)

Pushed fa23099 addressing today's six comments.

# Verdict Where
F1 [ISSUE] Template 2 partial-config crash Fixed setup-template.mdos.environ.get + skip-and-log block
F2 [ISSUE] Serverless TRACELOOP splice point unclear Fixed setup-template.md — CRITICAL callout + concrete patched Template 2 snippet showing the module-scope splice
F3 [ISSUE] Recommend deleting instrumentor_map.json Fixed Map deleted. detect_libraries.py rewritten as a thin discovery surface (dependencies[] + runtime/serverless_signals/existing_setup). LLM now does AI-library disambiguation against fetch_sdk_docs.py's live PyPI list. wrapt<2 guidance moved to troubleshooting.md as a symptom-driven fix. library-detection.md, workflow.md, and SKILL.md updated to the new contract.
F7 [NIT] MC_OTEL_ENDPOINT vs OTEL_ENDPOINT inconsistency Fixed setup-template.md — standardized on OTEL_ENDPOINT across all four templates
F9 [SUGGESTION] Move tests + fixtures to sibling tests/ Done skills/instrument-agent/tests/{test_*.py, fixtures/}. CI workflow paths updated; both suites still pass (53 + 33 checks).
redaction.md:27 "what default is this referring to?" Fixed Rewrote to clarify the SDK has no built-in default; the customer always supplies it (templates resolve via OTEL_ENDPOINT env var).

Diff stats: 33 files changed, 535 insertions(+), 862 deletions(-).

Local verification: skills/instrument-agent/tests/test_detect_libraries.py (53 passed), skills/instrument-agent/tests/test_fetch_sdk_docs.py (33 passed). CI should pick up the same.

Comment thread skills/instrument-agent/references/decorator-placement.md Outdated
Comment thread skills/instrument-agent/references/library-detection.md Outdated
Comment thread skills/instrument-agent/references/redaction.md Outdated
Comment thread skills/instrument-agent/references/redaction.md Outdated
Comment thread skills/instrument-agent/references/redaction.md
Comment thread skills/instrument-agent/references/redaction.md Outdated
Comment thread skills/instrument-agent/references/library-detection.md Outdated
Comment thread skills/instrument-agent/references/workflow.md Outdated
Comment thread skills/instrument-agent/SKILL.md Outdated
Comment thread skills/instrument-agent/references/troubleshooting.md Outdated
@bayardneville bayardneville merged commit 737a656 into main May 13, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants