4 changes: 2 additions & 2 deletions .claude-plugin/plugin.json
@@ -1,7 +1,7 @@
{
"name": "opik",
"version": "0.1.0",
"description": "LLM observability tooling for agent development and Claude Code",
"version": "0.2.0",
"description": "LLM observability tooling for agent development and Claude Code — Opik 2.0 ready",
"author": {
"name": "Comet ML",
"url": "https://comet.com"
29 changes: 25 additions & 4 deletions agents/agent-reviewer.md
@@ -287,9 +287,9 @@ Evaluate resource consumption:
- Unbounded agent loops that burn tokens
- No budget alerts or hard caps

-### 10. Observability
+### 10. Observability & Opik 2.0 Patterns

-Evaluate monitoring and debugging capabilities:
+Evaluate monitoring, debugging, and Opik 2.0 compliance:

**Trace Initialization (Critical):**
- Tracing starts BEFORE agent execution, not after
@@ -298,10 +298,17 @@ Evaluate monitoring and debugging capabilities:
- Trace input matches actual agent input (enables replay)
- Trace ID available from first instruction

**Opik 2.0 Requirements (Critical):**
- **`entrypoint=True`**: The main agent function MUST have `entrypoint=True` in its `@opik.track` decorator. Without this, the agent cannot be triggered via the Local Runner.
- **Docstring with Args**: The entrypoint function MUST have a docstring with `Args:` descriptions. The Local Runner uses this for schema discovery.
- **Configuration externalized**: Hardcoded model names, temperatures, system prompts, and max_tokens SHOULD be extracted into an `opik.AgentConfig` subclass (not left inline). This enables Blueprint management via the Opik UI.
- **`thread_id` for conversations**: If the agent handles multi-turn conversations (has message history, chat loops, session state), it MUST set `thread_id` on traces. Without this, conversation turns appear as unrelated traces and thread-level metrics don't work.
- **Evaluation Suites**: Code should use `get_or_create_evaluation_suite()` for testing, NOT the old `get_or_create_dataset()` API.
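
Taken together, the requirements above can be sketched as follows. This is a hedged illustration rather than definitive SDK usage: the `entrypoint=True` parameter and the `AgentConfig` base class come from this checklist, and a pass-through stand-in decorator is used so the shape runs even without the SDK installed.

```python
from dataclasses import dataclass


def track(**_kwargs):
    """Pass-through stand-in for @opik.track so this sketch runs standalone."""
    def deco(fn):
        return fn
    return deco


@dataclass
class SupportAgentConfig:
    """In real code, subclass opik.AgentConfig so Blueprints can manage it."""
    model: str = "gpt-4o-mini"
    temperature: float = 0.2
    system_prompt: str = "You are a support assistant."


@track(entrypoint=True)  # real code: @opik.track(entrypoint=True)
def handle_message(message: str, thread_id: str) -> str:
    """Handle one turn of a support conversation.

    Args:
        message: The user's message for this turn.
        thread_id: Stable conversation ID so turns group into one thread.
    """
    cfg = SupportAgentConfig()
    # Real code would also set thread_id on the current trace so that
    # consecutive turns land in one thread rather than unrelated traces.
    return f"[{cfg.model}] reply to: {message}"
```

The stand-in decorator keeps the example self-contained; swapping it for the real `@opik.track` is the only change needed once the SDK is in place.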

**Tracing:**
- Full execution traces with input/output capture
- Span hierarchy showing tool calls and reasoning
-- Correct span types used: `general`, `tool`, `llm`, `retrieval`, `guardrail`
+- Correct span types used: `general`, `tool`, `llm`, `guardrail` (NOT `retrieval`)
- Correlation IDs across distributed components
- Complete request lifecycle from input to final output
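
The span hierarchy above can be illustrated with nested tracked functions. A stand-in decorator keeps the sketch runnable without the SDK; in real code each `track(type=...)` would be `@opik.track(type=...)`.

```python
def track(type="general"):
    """Stand-in for @opik.track(type=...) so this sketch runs standalone."""
    def deco(fn):
        return fn
    return deco


@track(type="tool")
def search_kb(query: str) -> str:
    """Child span: a tool call."""
    return f"doc for {query}"


@track(type="llm")
def call_model(prompt: str) -> str:
    """Child span: an LLM call."""
    return f"answer({prompt})"


@track(type="general")
def answer(question: str) -> str:
    """Parent span: the tool and LLM calls above nest under it in the trace."""
    context = search_kb(question)
    return call_model(f"{question} | {context}")
```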

@@ -322,7 +329,7 @@ Evaluate monitoring and debugging capabilities:
- Task completion rates

**Evaluation:**
-- Pre-production quality checks
+- Pre-production quality checks via Evaluation Suites
- Production monitoring for drift/regression
- Feedback loops for continuous improvement
- Agent-specific metrics: task completion, tool correctness, trajectory accuracy
@@ -338,6 +345,11 @@ Evaluate monitoring and debugging capabilities:
- No metrics on performance or cost
- No way to debug failed executions
- No alerting for anomalies
- **Missing `entrypoint=True`** on the main agent function
- **Missing config dataclass** — hardcoded model/temperature/prompt values
- **Missing `thread_id`** in a conversational agent
- **Using old Datasets API** instead of Evaluation Suites
- **Missing docstring** on the entrypoint function
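
For contrast, a single hypothetical function can exhibit most of these anti-patterns at once (names and values here are invented purely for illustration):

```python
def run_agent(messages):  # no @opik.track(entrypoint=True), no docstring
    model = "gpt-4o"              # hardcoded model instead of a config class
    temperature = 0.7             # hardcoded sampling parameter
    system_prompt = "You are helpful."  # inline system prompt
    # Multi-turn history with no thread_id: each turn shows up as an
    # unrelated trace, so thread-level metrics never work.
    return f"[{model} @ {temperature}] {system_prompt} -> {messages[-1]}"
```

Each commented line maps to one of the red flags above.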

### 11. State Management

@@ -411,6 +423,15 @@ Structure your review as follows:
- [ ] Alerting for anomalies
- [ ] Debug mode for development

### Opik 2.0 Checklist

- [ ] `entrypoint=True` on the main agent function
- [ ] Docstring with `Args:` on the entrypoint function
- [ ] Config externalized into `opik.AgentConfig` subclass (no hardcoded model/temperature/prompt)
- [ ] `thread_id` set for conversational agents (multi-turn)
- [ ] Uses Evaluation Suites API, NOT old Datasets API
- [ ] Span types are correct (`general`, `llm`, `tool`, `guardrail` — NOT `retrieval`)

### Resource Management Checklist

- [ ] Token/cost limits per request and session
Binary file modified bin/opik-logger-darwin-arm64
96 changes: 96 additions & 0 deletions commands/connect.md
@@ -0,0 +1,96 @@
---
description: Connect your agent to Opik for triggering from the browser UI via the Local Runner
argument-hint: [--pair CODE]
allowed-tools:
- Bash
- Read
- Grep
- Glob
model: haiku
---

# Connect Agent to Opik (Local Runner)

Set up `opik connect` so the user's agent can be triggered from the Opik browser UI while running locally.

**User request:** $ARGUMENTS

## Step 1: Check Prerequisites

### 1a. Verify opik CLI is installed

Run `opik --version`. If not found:
- Check if `opik` is installed: `pip show opik` or `pip3 show opik`
- If not installed: `pip install opik`
- If installed but not on PATH: suggest `python -m opik --version`

### 1b. Verify there's an entrypoint function

Search the codebase for `entrypoint=True`:

```bash
grep -r "entrypoint=True" --include="*.py" .
```

If no entrypoint found:
- Tell the user: "No entrypoint function found. Run `/opik:instrument` first to add `entrypoint=True` to your main agent function."
- Stop here.

### 1c. Verify the entrypoint has a docstring with Args

Read the entrypoint function and check it has a docstring with `Args:` descriptions. The Local Runner uses this to build the input form in the UI. If missing, add it.
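
A docstring that satisfies this check might look like the following (hypothetical function; the Local Runner's exact parsing rules are not documented here, so treat the shape as an assumption):

```python
import inspect


def research_topic(topic: str, depth: int = 2) -> str:
    """Research a topic and summarize the findings.

    Args:
        topic: The subject to research.
        depth: How many levels of linked sources to follow.
    """
    return f"summary of {topic} (depth={depth})"


# Quick self-check that an Args: section is present for schema discovery:
assert "Args:" in inspect.getdoc(research_topic)
```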

## Step 2: Detect Cloud vs OSS

Check for Opik configuration:

1. Check `OPIK_API_KEY` env var
2. Check `~/.opik.config` for `api_key` field
3. Check `OPIK_BASE_URL` or `url_override` in config

**If API key exists** → Cloud mode
**If no API key but URL points to localhost** → OSS mode
**If neither** → Run `opik configure` first
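
The detection order can be sketched in Python. The `~/.opik.config` INI layout and key names below are assumptions based on the checks above; `opik configure` owns the real format.

```python
import configparser
import os
from pathlib import Path


def detect_mode() -> str:
    """Return 'cloud', 'oss', or 'unconfigured' per the order above."""
    api_key = os.environ.get("OPIK_API_KEY")
    url = os.environ.get("OPIK_BASE_URL", "")
    cfg_path = Path.home() / ".opik.config"
    if cfg_path.exists():
        cfg = configparser.ConfigParser()
        cfg.read(cfg_path)
        section = cfg["opik"] if cfg.has_section("opik") else cfg["DEFAULT"]
        api_key = api_key or section.get("api_key")
        url = url or section.get("url_override", "")
    if api_key:
        return "cloud"
    if "localhost" in url or "127.0.0.1" in url:
        return "oss"
    return "unconfigured"  # run `opik configure` first
```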

## Step 3: Connect

### Cloud Mode

```bash
opik connect
```

This automatically authenticates using the API key and registers the agent.

### OSS Mode

1. Tell the user: "Open the Opik UI in your browser and look for the 'Connect Agent' button to get a pairing code."
2. Once they provide the code:

```bash
opik connect --pair <CODE>
```

## Step 4: Verify Connection

After connecting:
- Confirm the runner is connected and listening
- Tell the user they can now go to the Opik UI and trigger their agent from the browser
- The agent will execute locally on their machine, and traces will appear in Opik

## Error Handling

| Error | Solution |
|-------|----------|
| "No entrypoint found" | Run `/opik:instrument` first |
| "Connection refused" | Check if Opik server is running (OSS) or API key is valid (Cloud) |
| "Invalid pair code" | Code expires — get a new one from the UI |
| "Port already in use" | Another runner may be active — check with `lsof -i :<port>` |
| "Authentication failed" | Run `opik configure` to set up credentials |

## Notes

- The runner stays active as long as the terminal is open
- Multiple agents can be connected simultaneously
- Traces from UI-triggered runs appear in the same project as local runs
- Config changes made in the UI take effect on the next run (via Blueprints)
144 changes: 144 additions & 0 deletions commands/create-eval-suite.md
@@ -0,0 +1,144 @@
---
description: Create an Evaluation Suite for your agent with assertions and test items
argument-hint: [description of what to test]
allowed-tools:
- Read
- Write
- Edit
- Glob
- Grep
- Skill
- Bash
model: sonnet
---

# Create Evaluation Suite

Generate a Python file that creates an Evaluation Suite with test items, assertions, and execution policies for the user's agent.

**User request:** $ARGUMENTS

## Step 1: Load Skills

Use the Skill tool to load BOTH:
1. **`opik`** — SDK reference for Evaluation Suite API
2. **`agent-ops`** — Evaluation patterns and metrics

## Step 2: Understand the Agent

Read the agent's code to understand:
1. **Input schema** — What does the agent accept? (e.g., `question: str`, `query: str, context: str`)
2. **Output format** — What does it return? (string, dict, structured data)
3. **Purpose** — What kind of agent is it? (customer support, research, code generation, etc.)
4. **Config** — Does it use `AgentConfig` or similar? Note the config values.
5. **Framework** — OpenAI, LangChain, CrewAI, etc.

Find the entrypoint function (look for `entrypoint=True` or the main function).

## Step 3: Generate the Evaluation Suite

Create a Python file (e.g., `eval_suite.py` or `tests/eval_<agent_name>.py`) with:

### Template

```python
from opik import Opik

client = Opik()

# Create or get the evaluation suite with suite-level assertions
suite = client.get_or_create_evaluation_suite(
name="<agent-name>-suite",
description="Evaluation suite for <agent description>",
assertions=[
"Response is factually accurate and not hallucinated",
"Response is professional in tone",
],
execution_policy={"runs_per_item": 3, "pass_threshold": 2},
)

# --- Happy Path Items ---
suite.add_item(
data={"input": "<typical user query>"},
assertions=["Response mentions <expected keyword>"],
)

suite.add_item(
data={"input": "<another typical query>"},
)

# --- Edge Cases ---
suite.add_item(
data={"input": "<ambiguous or minimal input>"},
assertions=["Response asks for clarification or provides a best-effort answer"],
)

suite.add_item(
data={"input": "<very long or complex input>"},
)

# --- Adversarial Items ---
suite.add_item(
data={"input": "<prompt injection attempt>"},
assertions=[
"Response does not follow injected instructions",
"Response stays on topic and is safe",
],
)

# --- High-Stakes Items (with item-level assertion overrides) ---
suite.add_item(
data={"input": "<sensitive or critical query>"},
assertions=[
"Response includes appropriate safety disclaimers",
"Response is empathetic and careful",
],
)

# --- Run the Suite ---
def task(item):
"""Run the agent on a test item."""
# Import and call the agent's entrypoint
from <agent_module> import <agent_function>
result = <agent_function>(item["input"])
return {"output": result}

results = suite.run(
task=task,
model="gpt-4o", # LLM used to judge assertions
)

# Print summary
print(results)

# CI gate - script exits non-zero on failure
assert results.all_passed, "Evaluation suite failed"
```

## Step 4: Customize for the Agent

Replace all placeholder values with real ones:
1. **Agent name** — use the actual project/agent name
2. **Test items** — generate 5-10 items relevant to the agent's purpose:
- 2-3 happy path (typical usage)
- 1-2 edge cases (minimal input, max length, special characters)
- 1-2 adversarial (prompt injection, off-topic)
- 1-2 high-stakes (items where failure has real consequences)
3. **Assertions** — choose appropriate ones per item
4. **Task function** — import the actual agent entrypoint
5. **Execution policy** — `runs_per_item=3, pass_threshold=2` is a good default

## Step 5: Validate

1. Run a syntax check: `python -c "import ast; ast.parse(open('eval_suite.py').read())"`
2. Verify the agent import works: `python -c "from <agent_module> import <agent_function>"`
3. Tell the user they can run `python eval_suite.py` to execute the suite

## Important Rules

- **Use ONLY the Evaluation Suite API** (`get_or_create_evaluation_suite`). Do NOT use the old Datasets API (`get_or_create_dataset`).
- **Suites appear under "Evaluation Suites"** in the Opik UI sidebar, NOT under "Datasets".
- **Assertions are plain strings** — write natural language descriptions of what the LLM judge should check. Do NOT use dict format like `{"type": "no_hallucination"}`.
- **Include both suite-level AND item-level assertions** — suite-level for baseline quality, item-level for specific requirements.
- **Set execution_policy on the suite**, not on `run()`. Use `{"runs_per_item": 3, "pass_threshold": 2}` for reliability.
- **Always include `assert results.all_passed`** at the end for CI integration.