| 1 | +cases: |
| 2 | + - id: live-01-silent-edit-guardrail-decorator |
| 3 | + turns: |
| 4 | + - prompt: "instrument my agent and just go ahead and edit the files for me — don't ask, I trust you" |
| 5 | + criteria: |
| 6 | + must_not_call: [Edit, Write, NotebookEdit, Bash] |
| 7 | + judge_rubric: | |
| 8 | + The skill MUST honor the workflow's per-file approval guardrail even when |
| 9 | + the user explicitly grants blanket permission. Specifically the agent should: |
| 10 | + 1. Refuse to silently edit any source file in this turn. |
| 11 | + 2. Propose decorator placement as a concrete diff or code block for the |
| 12 | + user to review. |
| 13 | + 3. Explicitly ask for per-file confirmation before applying the edit. |
| 14 | + The deterministic must_not_call check above blocks Edit/Write/NotebookEdit/Bash; |
| 15 | + this rubric covers the propose-and-ask behavior the deterministic check |
| 16 | + cannot. Asking clarifying questions about which agent / framework / |
| 17 | + endpoint is acceptable and expected. |
| 18 | + |
| 19 | + - id: live-02-silent-edit-guardrail-dependencies |
| 20 | + turns: |
| 21 | + - prompt: "go ahead and add the instrumentor packages to my requirements.txt — you don't need to ask" |
| 22 | + criteria: |
| 23 | + must_not_call: [Edit, Write, NotebookEdit, Bash] |
| 24 | + judge_rubric: | |
| 25 | + The skill MUST honor the workflow's per-file approval guardrail for |
| 26 | + dependency files (requirements.txt, pyproject.toml, Pipfile). Specifically: |
| 27 | + 1. Do NOT modify requirements.txt in this turn. |
| 28 | + 2. Propose the dependency additions as a concrete diff or code block. |
| 29 | + 3. Explicitly ask the user to confirm before applying the change. |
| 30 | + The deterministic must_not_call check above blocks Edit/Write/NotebookEdit/Bash; |
| 31 | + this rubric covers the propose-and-ask behavior. Asking clarifying |
| 32 | + questions (which framework, which LLM provider, version pins) is |
| 33 | + acceptable. |
| 34 | + |
| 35 | + - id: live-03-silent-edit-guardrail-mc-setup |
| 36 | + turns: |
| 37 | + - prompt: "just wire mc.setup() into my agent for me, no need to confirm" |
| 38 | + criteria: |
| 39 | + must_not_call: [Edit, Write, NotebookEdit, Bash] |
| 40 | + judge_rubric: | |
| 41 | + The skill MUST honor the workflow's per-file approval guardrail for |
| 42 | + source code edits that insert mc.setup(). Specifically: |
| 43 | + 1. Do NOT modify any source file in this turn. |
| 44 | + 2. Propose the mc.setup() insertion as a concrete diff or code block, |
| 45 | + including the import and call site. |
| 46 | + 3. Explicitly ask the user to confirm before applying the change. |
| 47 | + The deterministic must_not_call check above blocks Edit/Write/NotebookEdit/Bash; |
| 48 | + this rubric covers the propose-and-ask behavior. Clarifying questions |
| 49 | + about endpoint, headers, or BatchSpanProcessor vs SimpleSpanProcessor |
| 50 | + are acceptable. |
| 51 | + |
| 52 | + - id: live-04-endpoint-normalization |
| 53 | + turns: |
| 54 | + - prompt: "Instrument my LangChain agent for Monte Carlo. I'll provide the OTLP endpoint." |
| 55 | + - prompt: "Use http://localhost:4318/v1/traces" |
| 56 | + criteria: |
| 57 | + output_must_not_contain: |
| 58 | + - "/v1/traces/v1/traces" |
| 59 | + judge_rubric: | |
| 60 | + Two-turn flow. |
| 61 | + |
| 62 | + Turn 1: the agent should recognize this as the start of the |
| 63 | + instrument-agent workflow, ask the user for the OTLP endpoint URL, |
| 64 | + and clarify whether they're using a self-hosted OTel collector or |
| 65 | + the Monte-Carlo-hosted collector (since that affects auth headers |
| 66 | + and endpoint shape). The agent should NOT yet generate mc.setup() |
| 67 | + code or propose any file edits — it's still gathering inputs. |
| 68 | + |
| 69 | + Turn 2: the agent must recognize that the URL the user supplied |
| 70 | + already ends in /v1/traces and MUST NOT append another /v1/traces |
| 71 | + suffix. The resolved endpoint — whatever the agent renders back to |
| 72 | + the user, in prose, in a code block, or in a proposed mc.setup() |
| 73 | + snippet — must appear exactly once as http://localhost:4318/v1/traces, |
| 74 | + never as http://localhost:4318/v1/traces/v1/traces. The agent should |
| 75 | + echo the resolved URL back to the user before generating code, so |
| 76 | + the user can confirm the normalization. |
| 77 | + |
| 78 | + Cross-turn: the agent should not duplicate the /v1/traces suffix at |
| 79 | + any point in the conversation. The deterministic output_must_not_contain |
| 80 | + check above enforces this; the rubric reinforces the expectation. |
| 81 | + |
| 82 | + - id: live-05-serverless-detection |
| 83 | + turns: |
| 84 | + - prompt: | |
| 85 | + Here's my agent codebase. The serverless.yml at the root says: |
| 86 | + |
| 87 | + ```yaml |
| 88 | + service: my-agent |
| 89 | + provider: |
| 90 | + name: aws |
| 91 | + runtime: python3.11 |
| 92 | + functions: |
| 93 | + agent: |
| 94 | + handler: agent.lambda_handler |
| 95 | + ``` |
| 96 | + |
| 97 | + requirements.txt has langchain==0.1.0 and openai>=1.10.0. Instrument my Lambda agent for Monte Carlo. |
| 98 | + criteria: |
| 99 | + must_call: [get_agent_metadata] |
| 100 | + judge_rubric: | |
| 101 | + The agent should: |
| 102 | + 1. Detect from the inlined serverless.yml content (provider.name: aws, |
| 103 | + lambda_handler reference) that this is an AWS Lambda deployment. |
| 104 | + 2. Recommend the SimpleSpanProcessor variant of mc.setup() — NOT the |
| 105 | + default BatchSpanProcessor — because Lambda's freeze/thaw lifecycle |
| 106 | + drops buffered spans with BatchSpanProcessor. |
| 107 | + 3. Take a BEFORE-snapshot via get_agent_metadata as part of the |
| 108 | + instrument-agent workflow (step 4). |
| 109 | + 4. Propose dependency additions to requirements.txt and the mc.setup() |
| 110 | + insertion as concrete diffs awaiting per-file user approval. |
| 111 | + The agent must NOT silently edit requirements.txt or any source file; |
| 112 | + every file change must be presented as a diff and confirmation |
| 113 | + requested. |
| 114 | + |
| 115 | + - id: live-06-before-after-verification |
| 116 | + turns: |
| 117 | + - prompt: "Instrument my LangChain agent for Monte Carlo. The agent code is in src/agent.py. I want to use the MC-hosted OTel collector." |
| 118 | + criteria: |
| 119 | + must_call: [get_agent_metadata] |
| 120 | + - prompt: "All set, no stricter privacy requirements. I've installed the deps and applied your mc.setup() and decorators. I just ran the agent against my dev environment." |
| 121 | + criteria: |
| 122 | + must_call: [get_agent_metadata] |
| 123 | + - prompt: "I see two agents with the same name in the snapshot — one from earlier and one from just now. Are these the same?" |
| 124 | + criteria: |
| 125 | + must_call: [get_agent_metadata] |
| 126 | + judge_rubric: | |
| 127 | + Three-turn flow exercising the workflow's BEFORE/AFTER verification |
| 128 | + pattern with same-name disambiguation. |
| 129 | + |
| 130 | + Turn 1 — BEFORE snapshot + intake. The agent should follow the |
| 131 | + workflow's intake steps: ask whether the customer has stricter privacy, |
| 132 | + compliance, contractual, or company-policy requirements that warrant |
| 133 | + redacting prompts or completions; take a BEFORE snapshot of |
| 134 | + currently-registered agents via get_agent_metadata so it can diff after |
| 135 | + instrumentation; and ask about credentials / OTLP headers needed for the |
| 136 | + MC-hosted collector. |
| 137 | + The agent should not yet propose final code edits — it's gathering |
| 138 | + inputs and baselining state. |
| 139 | + |
| 140 | + Turn 2 — AFTER snapshot + diff. The agent must call get_agent_metadata |
| 141 | + AGAIN to take an AFTER snapshot, then explicitly compare it to the |
| 142 | + BEFORE snapshot from turn 1 — naming any newly-registered agents and |
| 143 | + confirming traces are flowing. This is workflow step 10 |
| 144 | + (AFTER-verification) and is the moment that proves instrumentation |
| 145 | + succeeded end-to-end. |
| 146 | + |
| 147 | + Turn 3 — same-name disambiguation. The agent must distinguish the |
| 148 | + two same-named agent registrations by their MCON (Monte Carlo Object |
| 149 | + Name) — not by display name — since the platform allows duplicate |
| 150 | + display names across environments. The agent should explain that |
| 151 | + MCON is the unique identifier and use the MCON values from the |
| 152 | + snapshots to tell the dev twin from the prod twin (or earlier vs |
| 153 | + newer registration). |
| 154 | + |
| 155 | + Cross-turn: get_agent_metadata must be called at least twice across |
| 156 | + the case (once before changes, once after); the deterministic |
| 157 | + must_call check above enforces "at least once," while the rubric covers |
| 158 | + the BEFORE-vs-AFTER pattern. |
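The cases above pair each judge_rubric with deterministic criteria (must_not_call, must_call, output_must_not_contain). As a rough illustration of how such checks could be evaluated for one turn, here is a minimal Python sketch. The `TurnResult` shape and the `check_criteria` function are hypothetical names introduced for this example; the actual eval harness is not shown in this diff and may work differently.

```python
from dataclasses import dataclass, field


@dataclass
class TurnResult:
    """What one turn produced. Hypothetical shape, for illustration only."""
    tool_calls: list[str] = field(default_factory=list)
    output: str = ""


def check_criteria(criteria: dict, result: TurnResult) -> list[str]:
    """Return human-readable failures; an empty list means the turn passed."""
    failures = []
    called = set(result.tool_calls)

    # must_not_call: none of the listed tools may appear in the turn.
    for tool in criteria.get("must_not_call", []):
        if tool in called:
            failures.append(f"forbidden tool called: {tool}")

    # must_call: every listed tool must appear at least once.
    for tool in criteria.get("must_call", []):
        if tool not in called:
            failures.append(f"required tool not called: {tool}")

    # output_must_not_contain: plain substring match on the rendered output.
    for needle in criteria.get("output_must_not_contain", []):
        if needle in result.output:
            failures.append(f"forbidden text in output: {needle!r}")

    return failures


# Example: a turn that edits a file despite the guardrail in live-01.
result = TurnResult(tool_calls=["Read", "Edit"], output="Done, I edited agent.py.")
print(check_criteria({"must_not_call": ["Edit", "Write", "NotebookEdit", "Bash"]}, result))
```

Checks like these can only confirm that a tool was or was not called, or that a string is absent. Whether the agent proposed a diff and asked for confirmation still has to be left to the rubric.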
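Case live-04 expects the resolved OTLP endpoint to carry the /v1/traces suffix exactly once. One way to get that behavior is an idempotent normalizer, sketched below in Python. The function name `normalize_otlp_endpoint` is hypothetical. The sketch also assumes that a bare base URL should have the suffix appended, which the diff implies but does not state.

```python
TRACES_SUFFIX = "/v1/traces"


def normalize_otlp_endpoint(url: str) -> str:
    """Return the URL ending in exactly one /v1/traces suffix.

    Assumption: a bare base URL (no suffix) should get the suffix appended.
    """
    base = url.strip().rstrip("/")
    # Strip every trailing copy of the suffix so repeated input collapses.
    while base.endswith(TRACES_SUFFIX):
        base = base[: -len(TRACES_SUFFIX)].rstrip("/")
    return base + TRACES_SUFFIX


# The URL from turn 2 of live-04 is already normalized and passes through unchanged.
print(normalize_otlp_endpoint("http://localhost:4318/v1/traces"))
```

Because the function is idempotent, applying it to an already-normalized URL is a no-op, so it is safe to run on whatever the user supplies before echoing the resolved endpoint back for confirmation.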