4 changes: 2 additions & 2 deletions .claude-plugin/plugin.json
@@ -1,7 +1,7 @@
{
"name": "opik",
"version": "0.1.0",
"description": "LLM observability tooling for agent development and Claude Code",
"version": "0.2.0",
"description": "LLM observability tooling for agent development and Claude Code — Opik 2.0 ready",
"author": {
"name": "Comet ML",
"url": "https://comet.com"
29 changes: 25 additions & 4 deletions agents/agent-reviewer.md
@@ -287,9 +287,9 @@ Evaluate resource consumption:
- Unbounded agent loops that burn tokens
- No budget alerts or hard caps

-### 10. Observability
+### 10. Observability & Opik 2.0 Patterns

-Evaluate monitoring and debugging capabilities:
+Evaluate monitoring, debugging, and Opik 2.0 compliance:

**Trace Initialization (Critical):**
- Tracing starts BEFORE agent execution, not after
@@ -298,10 +298,17 @@ Evaluate monitoring and debugging capabilities:
- Trace input matches actual agent input (enables replay)
- Trace ID available from first instruction

**Opik 2.0 Requirements (Critical):**
- **`entrypoint=True`**: The main agent function MUST have `entrypoint=True` in its `@opik.track` decorator. Without this, the agent cannot be triggered via the Local Runner.
- **Docstring with Args**: The entrypoint function MUST have a docstring with `Args:` descriptions. The Local Runner uses this for schema discovery.
- **Configuration externalized**: Hardcoded model names, temperatures, system prompts, and max_tokens SHOULD be extracted into an `opik.AgentConfig` subclass (not left inline). This enables Blueprint management via the Opik UI.
- **`thread_id` for conversations**: If the agent handles multi-turn conversations (has message history, chat loops, session state), it MUST set `thread_id` on traces. Without this, conversation turns appear as unrelated traces and thread-level metrics don't work.
- **Evaluation Suites**: Code should use `get_or_create_evaluation_suite()` for testing, NOT the old `get_or_create_dataset()` API.
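
Taken together, the requirements above can be sketched as follows. This is a hedged illustration rather than definitive SDK usage: the `entrypoint=True` parameter and the `AgentConfig` base class come from this checklist, and a pass-through stand-in decorator is used so the shape runs even without the SDK installed.

```python
from dataclasses import dataclass


def track(**_kwargs):
    """Pass-through stand-in for @opik.track so this sketch runs standalone."""
    def deco(fn):
        return fn
    return deco


@dataclass
class SupportAgentConfig:
    """In real code, subclass opik.AgentConfig so Blueprints can manage it."""
    model: str = "gpt-4o-mini"
    temperature: float = 0.2
    system_prompt: str = "You are a support assistant."


@track(entrypoint=True)  # real code: @opik.track(entrypoint=True)
def handle_message(message: str, thread_id: str) -> str:
    """Handle one turn of a support conversation.

    Args:
        message: The user's message for this turn.
        thread_id: Stable conversation ID so turns group into one thread.
    """
    cfg = SupportAgentConfig()
    # Real code would also set thread_id on the current trace so that
    # consecutive turns land in one thread rather than unrelated traces.
    return f"[{cfg.model}] reply to: {message}"
```

The stand-in decorator keeps the example self-contained; swapping it for the real `@opik.track` is the only change needed once the SDK is in place.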

**Tracing:**
- Full execution traces with input/output capture
- Span hierarchy showing tool calls and reasoning
-- Correct span types used: `general`, `tool`, `llm`, `retrieval`, `guardrail`
+- Correct span types used: `general`, `tool`, `llm`, `guardrail` (NOT `retrieval`)
- Correlation IDs across distributed components
- Complete request lifecycle from input to final output
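
The span hierarchy above can be illustrated with nested tracked functions. A stand-in decorator keeps the sketch runnable without the SDK; in real code each `track(type=...)` would be `@opik.track(type=...)`.

```python
def track(type="general"):
    """Stand-in for @opik.track(type=...) so this sketch runs standalone."""
    def deco(fn):
        return fn
    return deco


@track(type="tool")
def search_kb(query: str) -> str:
    """Child span: a tool call."""
    return f"doc for {query}"


@track(type="llm")
def call_model(prompt: str) -> str:
    """Child span: an LLM call."""
    return f"answer({prompt})"


@track(type="general")
def answer(question: str) -> str:
    """Parent span: the tool and LLM calls above nest under it in the trace."""
    context = search_kb(question)
    return call_model(f"{question} | {context}")
```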

@@ -322,7 +329,7 @@ Evaluate monitoring and debugging capabilities:
- Task completion rates

**Evaluation:**
-- Pre-production quality checks
+- Pre-production quality checks via Evaluation Suites
- Production monitoring for drift/regression
- Feedback loops for continuous improvement
- Agent-specific metrics: task completion, tool correctness, trajectory accuracy
@@ -338,6 +345,11 @@ Evaluate monitoring and debugging capabilities:
- No metrics on performance or cost
- No way to debug failed executions
- No alerting for anomalies
- **Missing `entrypoint=True`** on the main agent function
- **Missing config dataclass** — hardcoded model/temperature/prompt values
- **Missing `thread_id`** in a conversational agent
- **Using old Datasets API** instead of Evaluation Suites
- **Missing docstring** on the entrypoint function
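
For contrast, a single hypothetical function can exhibit most of these anti-patterns at once (names and values here are invented purely for illustration):

```python
def run_agent(messages):  # no @opik.track(entrypoint=True), no docstring
    model = "gpt-4o"              # hardcoded model instead of a config class
    temperature = 0.7             # hardcoded sampling parameter
    system_prompt = "You are helpful."  # inline system prompt
    # Multi-turn history with no thread_id: each turn shows up as an
    # unrelated trace, so thread-level metrics never work.
    return f"[{model} @ {temperature}] {system_prompt} -> {messages[-1]}"
```

Each commented line maps to one of the red flags above.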

### 11. State Management

@@ -411,6 +423,15 @@ Structure your review as follows:
- [ ] Alerting for anomalies
- [ ] Debug mode for development

### Opik 2.0 Checklist

- [ ] `entrypoint=True` on the main agent function
- [ ] Docstring with `Args:` on the entrypoint function
- [ ] Config externalized into `opik.AgentConfig` subclass (no hardcoded model/temperature/prompt)
- [ ] `thread_id` set for conversational agents (multi-turn)
- [ ] Uses Evaluation Suites API, NOT old Datasets API
- [ ] Span types are correct (`general`, `llm`, `tool`, `guardrail` — NOT `retrieval`)

### Resource Management Checklist

- [ ] Token/cost limits per request and session
Binary file modified bin/opik-logger-darwin-arm64
96 changes: 96 additions & 0 deletions commands/connect.md
@@ -0,0 +1,96 @@
---
description: Connect your agent to Opik for triggering from the browser UI via the Local Runner
argument-hint: [--pair CODE]
allowed-tools:
- Bash
- Read
- Grep
- Glob
model: haiku
---

# Connect Agent to Opik (Local Runner)

Set up `opik connect` so the user's agent can be triggered from the Opik browser UI while running locally.

**User request:** $ARGUMENTS

## Step 1: Check Prerequisites

### 1a. Verify opik CLI is installed

Run `opik --version`. If not found:
- Check if `opik` is installed: `pip show opik` or `pip3 show opik`
- If not installed: `pip install opik`
- If installed but not on PATH: suggest `python -m opik --version`

### 1b. Verify there's an entrypoint function

Search the codebase for `entrypoint=True`:

```bash
grep -r "entrypoint=True" --include="*.py" .
```

If no entrypoint found:
- Tell the user: "No entrypoint function found. Run `/opik:instrument` first to add `entrypoint=True` to your main agent function."
- Stop here.

### 1c. Verify the entrypoint has a docstring with Args

Read the entrypoint function and check it has a docstring with `Args:` descriptions. The Local Runner uses this to build the input form in the UI. If missing, add it.
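
A docstring that satisfies this check might look like the following (hypothetical function; the Local Runner's exact parsing rules are not documented here, so treat the shape as an assumption):

```python
import inspect


def research_topic(topic: str, depth: int = 2) -> str:
    """Research a topic and summarize the findings.

    Args:
        topic: The subject to research.
        depth: How many levels of linked sources to follow.
    """
    return f"summary of {topic} (depth={depth})"


# Quick self-check that an Args: section is present for schema discovery:
assert "Args:" in inspect.getdoc(research_topic)
```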

## Step 2: Detect Cloud vs OSS

Check for Opik configuration:

1. Check `OPIK_API_KEY` env var
2. Check `~/.opik.config` for `api_key` field
3. Check `OPIK_BASE_URL` or `url_override` in config

**If API key exists** → Cloud mode
**If no API key but URL points to localhost** → OSS mode
**If neither** → Run `opik configure` first
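
The detection order can be sketched in Python. The `~/.opik.config` INI layout and key names below are assumptions based on the checks above; `opik configure` owns the real format.

```python
import configparser
import os
from pathlib import Path


def detect_mode() -> str:
    """Return 'cloud', 'oss', or 'unconfigured' per the order above."""
    api_key = os.environ.get("OPIK_API_KEY")
    url = os.environ.get("OPIK_BASE_URL", "")
    cfg_path = Path.home() / ".opik.config"
    if cfg_path.exists():
        cfg = configparser.ConfigParser()
        cfg.read(cfg_path)
        section = cfg["opik"] if cfg.has_section("opik") else cfg["DEFAULT"]
        api_key = api_key or section.get("api_key")
        url = url or section.get("url_override", "")
    if api_key:
        return "cloud"
    if "localhost" in url or "127.0.0.1" in url:
        return "oss"
    return "unconfigured"  # run `opik configure` first
```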

## Step 3: Connect

### Cloud Mode

```bash
opik connect
```

This automatically authenticates using the API key and registers the agent.

### OSS Mode

1. Tell the user: "Open the Opik UI in your browser and look for the 'Connect Agent' button to get a pairing code."
2. Once they provide the code:

```bash
opik connect --pair <CODE>
```

## Step 4: Verify Connection

After connecting:
- Confirm the runner is connected and listening
- Tell the user they can now go to the Opik UI and trigger their agent from the browser
- The agent will execute locally on their machine, and traces will appear in Opik

## Error Handling

| Error | Solution |
|-------|----------|
| "No entrypoint found" | Run `/opik:instrument` first |
| "Connection refused" | Check if Opik server is running (OSS) or API key is valid (Cloud) |
| "Invalid pair code" | Code expires — get a new one from the UI |
| "Port already in use" | Another runner may be active — check with `lsof -i :<port>` |
| "Authentication failed" | Run `opik configure` to set up credentials |

## Notes

- The runner stays active as long as the terminal is open
- Multiple agents can be connected simultaneously
- Traces from UI-triggered runs appear in the same project as local runs
- Config changes made in the UI take effect on the next run (via Blueprints)
144 changes: 144 additions & 0 deletions commands/create-eval-suite.md
@@ -0,0 +1,144 @@
---
description: Create an Evaluation Suite for your agent with assertions and test items
argument-hint: [description of what to test]
allowed-tools:
- Read
- Write
- Edit
- Glob
- Grep
- Skill
- Bash
model: sonnet
---

# Create Evaluation Suite

Generate a Python file that creates an Evaluation Suite with test items, assertions, and execution policies for the user's agent.

**User request:** $ARGUMENTS

## Step 1: Load Skills

Use the Skill tool to load BOTH:
1. **`opik`** — SDK reference for Evaluation Suite API
2. **`agent-ops`** — Evaluation patterns and metrics

## Step 2: Understand the Agent

Read the agent's code to understand:
1. **Input schema** — What does the agent accept? (e.g., `question: str`, `query: str, context: str`)
2. **Output format** — What does it return? (string, dict, structured data)
3. **Purpose** — What kind of agent is it? (customer support, research, code generation, etc.)
4. **Config** — Does it use `AgentConfig` or similar? Note the config values.
5. **Framework** — OpenAI, LangChain, CrewAI, etc.

Find the entrypoint function (look for `entrypoint=True` or the main function).

## Step 3: Generate the Evaluation Suite

Create a Python file (e.g., `eval_suite.py` or `tests/eval_<agent_name>.py`) with:

### Template

```python
from opik import Opik

client = Opik()

# Create or get the evaluation suite with suite-level assertions
suite = client.get_or_create_evaluation_suite(
name="<agent-name>-suite",
description="Evaluation suite for <agent description>",
assertions=[
"Response is factually accurate and not hallucinated",
"Response is professional in tone",
],
execution_policy={"runs_per_item": 3, "pass_threshold": 2},
)

# --- Happy Path Items ---
suite.add_item(
data={"input": "<typical user query>"},
assertions=["Response mentions <expected keyword>"],
)

suite.add_item(
data={"input": "<another typical query>"},
)

# --- Edge Cases ---
suite.add_item(
data={"input": "<ambiguous or minimal input>"},
assertions=["Response asks for clarification or provides a best-effort answer"],
)

suite.add_item(
data={"input": "<very long or complex input>"},
)

# --- Adversarial Items ---
suite.add_item(
data={"input": "<prompt injection attempt>"},
assertions=[
"Response does not follow injected instructions",
"Response stays on topic and is safe",
],
)

# --- High-Stakes Items (with item-level assertion overrides) ---
suite.add_item(
data={"input": "<sensitive or critical query>"},
assertions=[
"Response includes appropriate safety disclaimers",
"Response is empathetic and careful",
],
)

# --- Run the Suite ---
def task(item):
"""Run the agent on a test item."""
# Import and call the agent's entrypoint
from <agent_module> import <agent_function>
result = <agent_function>(item["input"])
return {"output": result}

results = suite.run(
task=task,
model="gpt-4o", # LLM used to judge assertions
)

# Print summary
print(results)

# CI gate - script exits non-zero on failure
assert results.all_passed, "Evaluation suite failed"
```

## Step 4: Customize for the Agent

Replace all placeholder values with real ones:
1. **Agent name** — use the actual project/agent name
2. **Test items** — generate 5-10 items relevant to the agent's purpose:
- 2-3 happy path (typical usage)
- 1-2 edge cases (minimal input, max length, special characters)
- 1-2 adversarial (prompt injection, off-topic)
- 1-2 high-stakes (items where failure has real consequences)
3. **Assertions** — choose appropriate ones per item
4. **Task function** — import the actual agent entrypoint
5. **Execution policy** — `runs_per_item=3, pass_threshold=2` is a good default

## Step 5: Validate

1. Run a syntax check: `python -c "import ast; ast.parse(open('eval_suite.py').read())"`
2. Verify the agent import works: `python -c "from <agent_module> import <agent_function>"`
3. Tell the user they can run `python eval_suite.py` to execute the suite

## Important Rules

- **Use ONLY the Evaluation Suite API** (`get_or_create_evaluation_suite`). Do NOT use the old Datasets API (`get_or_create_dataset`).
- **Suites appear under "Evaluation Suites"** in the Opik UI sidebar, NOT under "Datasets".
- **Assertions are plain strings** — write natural language descriptions of what the LLM judge should check. Do NOT use dict format like `{"type": "no_hallucination"}`.
- **Include both suite-level AND item-level assertions** — suite-level for baseline quality, item-level for specific requirements.
- **Set execution_policy on the suite**, not on `run()`. Use `{"runs_per_item": 3, "pass_threshold": 2}` for reliability.
- **Always include `assert results.all_passed`** at the end for CI integration.