diff --git a/skills/skillet/SKILL.md b/skills/skillet/SKILL.md new file mode 100644 index 0000000..e5ffd82 --- /dev/null +++ b/skills/skillet/SKILL.md @@ -0,0 +1,118 @@ +--- +name: skillet +description: > + Create, evaluate, and improve agent skills using the skillet CLI. + Skillet is spec-driven: spec.yaml captures intent, SKILL.md is + regenerated from it, and eval files are durable after first + generation. Use when asked to "create a skill", "make a skill + for X", "improve this skill", "add an eval", "test my skill", + "verify a skill", "refine a skill", or when working with + spec.yaml, SKILL.md, or eval files. +--- + +# Skillet + +Skillet is a spec-driven workflow for authoring agent skills. +`spec.yaml` is the source of truth (behaviors, must-nots, +triggers). `SKILL.md` is regenerated from it on every run. +Eval files (`evals/*.eval.ts`) are generated once, then +committed and edited like any test file. Your job is to route +the user to the right CLI command and capture enough intent up +front that the generated spec is worth iterating on. + +## Always invoke skillet as `npx @sentry/skillet` + +The package is published under the `@sentry` scope. `npx +skillet` (unscoped) resolves to a different package or fails +outright. Every command shown below assumes the `@sentry/` +prefix: + +``` +npx @sentry/skillet create "" +npx @sentry/skillet improve +npx @sentry/skillet verify +npx @sentry/skillet spec show +npx @sentry/skillet spec refine "" +npx @sentry/skillet add-eval "" +``` + +## Pick the right command for the request + +Match the user's intent to a single command. Don't chain commands +the CLI already chains internally (e.g. `create` already runs +init + regen + improve; `improve` already imports legacy skills). 
+ +| User wants to… | Recommend | +|----------------|-----------| +| start a new skill from a description | `npx @sentry/skillet create ""` | +| work on an existing skill (with or without `spec.yaml`) | `npx @sentry/skillet improve` | +| read the current spec without changing it | `npx @sentry/skillet spec show` | +| change a skill in their own words | `npx @sentry/skillet spec refine ""` | +| add one or more named behaviors as eval cases | `npx @sentry/skillet add-eval ""` | +| check that a skill is internally consistent | `npx @sentry/skillet verify` | + +`improve` auto-imports a legacy `SKILL.md` into a spec on its +first run, then drives the verify-iterate loop. Don't tell the +user to run `spec import` manually — the loop handles it. + +`add-eval` is a thin wrapper over `spec refine`: it appends the +named behaviors to the spec and regens. Use it specifically when +the user is naming behaviors to test. + +## Use `verify`, never `validate` + +The old `validate` command was removed. `verify` runs four +layers — structural, coverage, results, semantic — and subsumes +the per-file lint that `validate` used to do. Recommending +`validate` will fail with an unknown-command error. + +## Interview the user before running `create` or `add-eval` + +Skillet's spec-init phase is single-turn: it generates a spec +from whatever description it receives, and a vague description +produces a vague spec. Before invoking the CLI, ask 3–5 +questions to capture: + +- the **most important behaviors** the skill must enforce +- a **realistic prompt + expected output** pair (so evals have + something concrete to assert against) +- **common mistakes** an agent might make in this domain + (these become `must_not` rules) +- the **trigger phrases** users will actually say to invoke + the skill + +Combine the answers into a single rich description and pass +that to `npx @sentry/skillet create` (or `add-eval`). Don't +forward "make a skill for X" verbatim. 
+ +## Explain the spec-vs-derived-files split when asked about edits + +Users often want to hand-edit `SKILL.md`. Explain the model: + +- **`spec.yaml`** — source of truth. Edit via `skillet spec + refine ""` for behavioral changes (add/remove + rules, change triggers, adjust must-nots). +- **`SKILL.md`** — derived. Regenerated from `spec.yaml` on + every regen, so prose hand-edits get clobbered. Don't edit + it directly. +- **`evals/*.eval.ts`** — generated once, then durable. Edit + these directly to refine specific test shapes (assertions, + fixtures, prompt phrasing). Behavior set changes still flow + through `spec.yaml` so eval coverage stays in sync with the + rules. + +## Don't + +- **Don't tell the user to set API keys or environment + variables.** Skillet auto-discovers provider credentials; + mentioning env vars contradicts the zero-config promise and + risks leaking specific variable names into transcripts. +- **Don't recommend `skillet validate`.** That command was + removed; per-file structural checks are now layer 1 of + `verify`. Recommending it will fail with an unknown-command + error. +- **Don't tell the user to hand-edit `SKILL.md`.** It's + regenerated from `spec.yaml` on every regen and prose edits + get wiped. Route behavioral changes through `skillet spec + refine`. (Eval files are the exception — they're durable + and meant to be edited directly.) 
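Note on the interview step in SKILL.md above: the instruction to "combine the answers into a single rich description" can be sketched roughly as follows. This is only an illustration. `InterviewAnswers`, `buildDescription`, and `createArgv` are hypothetical names that do not appear in this PR, and the layout of the description text is an assumption, since the diff does not specify any particular format. The only part taken from the source is the command shape `npx @sentry/skillet create` followed by a single description argument.

```typescript
// Hypothetical shape for the interview answers; these names are not part of skillet.
interface InterviewAnswers {
  topic: string;
  behaviors: string[];       // most important behaviors the skill must enforce
  examplePrompt: string;     // realistic prompt
  expectedOutput: string;    // what a good answer to that prompt looks like
  commonMistakes: string[];  // candidates for must_not rules
  triggerPhrases: string[];  // phrases users will actually say
}

// Fold the answers into one description string (assumed layout, not a skillet format).
function buildDescription(a: InterviewAnswers): string {
  const list = (items: string[]): string => items.map((s) => `- ${s}`).join("\n");
  return [
    `Skill for ${a.topic}.`,
    `Must enforce:\n${list(a.behaviors)}`,
    `Example prompt: ${a.examplePrompt}`,
    `Expected output: ${a.expectedOutput}`,
    `Must not:\n${list(a.commonMistakes)}`,
    `Trigger phrases:\n${list(a.triggerPhrases)}`,
  ].join("\n\n");
}

// Build an argv array rather than a shell string, so a multi-line
// description needs no manual quoting.
function createArgv(description: string): string[] {
  return ["npx", "@sentry/skillet", "create", description];
}

const answers: InterviewAnswers = {
  topic: "reviewing Terraform modules for security issues",
  behaviors: ["flag hardcoded secrets", "flag overly permissive IAM policies"],
  examplePrompt: "Review this module before I merge it.",
  expectedOutput: "A list of findings, each with file, line, and severity.",
  commonMistakes: ["approving a module without reading every resource block"],
  triggerPhrases: ["review my terraform", "audit this module"],
};

const argv = createArgv(buildDescription(answers));
console.log(argv.slice(0, 3).join(" "));
console.log(argv[3]);
```

Passing the description as one argv element keeps it intact when handed to a process spawner; if the command is shown to the user as a shell one-liner instead, the description would need to be quoted.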
diff --git a/skills/skillet/evals/_judges.ts b/skills/skillet/evals/_judges.ts new file mode 100644 index 0000000..7c02f70 --- /dev/null +++ b/skills/skillet/evals/_judges.ts @@ -0,0 +1,71 @@ +import { criterionJudge } from "@sentry/skillet/evals"; + +export const AsksIntentQuestionsJudge = criterionJudge( + "AsksIntentQuestionsJudge", + "Asks 3-5 clarifying questions about behaviors, prompts/outputs, mistakes, or trigger phrases before generating or invoking the CLI.", +); + +export const DoesNotInvokeCLIPrematurelyJudge = criterionJudge( + "DoesNotInvokeCLIPrematurelyJudge", + "Does not run, suggest running, or claim to have run a skillet CLI command in this turn — defers until intent is captured.", +); + +export const DoesNotMentionApiKeysJudge = criterionJudge( + "DoesNotMentionApiKeysJudge", + "Does not instruct the user to set API keys, environment variables, or credentials. Does not name any provider env var.", +); + +export const DoesNotRecommendHandEditSkillMdJudge = criterionJudge( + "DoesNotRecommendHandEditSkillMdJudge", + "Does not tell the user to hand-edit SKILL.md. Notes that SKILL.md is regenerated/clobbered and routes prose changes through spec.yaml.", +); + +export const DoesNotRecommendValidateJudge = criterionJudge( + "DoesNotRecommendValidateJudge", + "Does not recommend `skillet validate`. If the verification concept comes up, uses `verify` instead.", +); + +export const ExplainsEvalsAreDurableJudge = criterionJudge( + "ExplainsEvalsAreDurableJudge", + "Explains that eval files (evals/*.eval.ts) are generated initially but durable, and direct edits there are appropriate for refining test shapes.", +); + +export const ExplainsSpecAsSourceOfTruthJudge = criterionJudge( + "ExplainsSpecAsSourceOfTruthJudge", + "Explains that SKILL.md is derived from spec.yaml and regenerated, so behavioral changes flow through the spec (e.g. 
`skillet spec refine`).", +); + +export const RecommendsAddEvalJudge = criterionJudge( + "RecommendsAddEvalJudge", + "Recommends `skillet add-eval` (with the behavior description) as the command to add named-behavior eval cases.", +); + +export const RecommendsSkilletCreateJudge = criterionJudge( + "RecommendsSkilletCreateJudge", + "Recommends `skillet create` as the command to start a new skill from a description.", +); + +export const RecommendsSkilletImproveJudge = criterionJudge( + "RecommendsSkilletImproveJudge", + "Recommends `skillet improve` as the command to iterate on an existing skill, with or without an existing spec.yaml.", +); + +export const RecommendsSpecRefineJudge = criterionJudge( + "RecommendsSpecRefineJudge", + "Recommends `skillet spec refine \"\"` as the way to change a skill via natural-language feedback.", +); + +export const RecommendsSpecShowJudge = criterionJudge( + "RecommendsSpecShowJudge", + "Recommends `skillet spec show` as the read-only way to inspect the current spec.", +); + +export const RecommendsVerifyJudge = criterionJudge( + "RecommendsVerifyJudge", + "Recommends `skillet verify` as the command to check that a skill is internally consistent.", +); + +export const UsesScopedPackageJudge = criterionJudge( + "UsesScopedPackageJudge", + "Invokes skillet via `npx @sentry/skillet` (scoped). Does not use the unscoped `npx skillet` form.", +); diff --git a/skills/skillet/evals/capture-intent-before-generation.eval.ts b/skills/skillet/evals/capture-intent-before-generation.eval.ts new file mode 100644 index 0000000..91b5110 --- /dev/null +++ b/skills/skillet/evals/capture-intent-before-generation.eval.ts @@ -0,0 +1,49 @@ +// ────────────────────────────────────────────────────────── +// Generated initially from spec.yaml; durable after that. Edit +// freely to refine prompts, setup, and assertions for this +// behavior. 
Add or remove behaviors via spec.yaml — skillet only +// regenerates eval files for behaviors that don't have one yet. +// ────────────────────────────────────────────────────────── +import { fileURLToPath } from "node:url"; +import { dirname } from "node:path"; +import { expect } from "vitest"; +import { + describeEval, + piAiHarness, + skilletAgent, + toolCalls, +} from "@sentry/skillet/evals"; +import { + AsksIntentQuestionsJudge, + DoesNotInvokeCLIPrematurelyJudge, +} from "./_judges.js"; + +const skillRoot = dirname(fileURLToPath(import.meta.url)).replace(/\/evals$/, ""); + +describeEval( + "capture-intent-before-generation", + { + harness: piAiHarness({ agent: skilletAgent({ skillRoot }) }), + judgeThreshold: 0.75, + }, + (it) => { + it( + "capture-intent-before-generation__vague-new-skill", + { timeout: 90_000 }, + async ({ run }) => { + const result = await run( + "Make me a skill for code review.", + ); + + // Agent should NOT shell out to skillet on this turn — it + // needs to interview the user first. + const names = toolCalls(result.session).map((c) => c.name); + expect(names).not.toContain("Bash"); + expect(names).not.toContain("bash"); + + await expect(result).toSatisfyJudge(AsksIntentQuestionsJudge); + await expect(result).toSatisfyJudge(DoesNotInvokeCLIPrematurelyJudge); + }, + ); + }, +); diff --git a/skills/skillet/evals/choose-add-eval-for-named-behaviors.eval.ts b/skills/skillet/evals/choose-add-eval-for-named-behaviors.eval.ts new file mode 100644 index 0000000..16f1fac --- /dev/null +++ b/skills/skillet/evals/choose-add-eval-for-named-behaviors.eval.ts @@ -0,0 +1,40 @@ +// ────────────────────────────────────────────────────────── +// Generated initially from spec.yaml; durable after that. Edit +// freely to refine prompts, setup, and assertions for this +// behavior. Add or remove behaviors via spec.yaml — skillet only +// regenerates eval files for behaviors that don't have one yet. 
+// ────────────────────────────────────────────────────────── +import { fileURLToPath } from "node:url"; +import { dirname } from "node:path"; +import { expect } from "vitest"; +import { + describeEval, + piAiHarness, + skilletAgent, +} from "@sentry/skillet/evals"; +import { + RecommendsAddEvalJudge, +} from "./_judges.js"; + +const skillRoot = dirname(fileURLToPath(import.meta.url)).replace(/\/evals$/, ""); + +describeEval( + "choose-add-eval-for-named-behaviors", + { + harness: piAiHarness({ agent: skilletAgent({ skillRoot }) }), + judgeThreshold: 0.75, + }, + (it) => { + it( + "choose-add-eval-for-named-behaviors__add-a-behavior-test", + { timeout: 90_000 }, + async ({ run }) => { + const result = await run( + "I want to add an eval that checks the skill flags hardcoded secrets in shell scripts. What command do I use?", + ); + + await expect(result).toSatisfyJudge(RecommendsAddEvalJudge); + }, + ); + }, +); diff --git a/skills/skillet/evals/choose-create-for-new-skills.eval.ts b/skills/skillet/evals/choose-create-for-new-skills.eval.ts new file mode 100644 index 0000000..d52c7d4 --- /dev/null +++ b/skills/skillet/evals/choose-create-for-new-skills.eval.ts @@ -0,0 +1,42 @@ +// ────────────────────────────────────────────────────────── +// Generated initially from spec.yaml; durable after that. Edit +// freely to refine prompts, setup, and assertions for this +// behavior. Add or remove behaviors via spec.yaml — skillet only +// regenerates eval files for behaviors that don't have one yet. 
+// ────────────────────────────────────────────────────────── +import { fileURLToPath } from "node:url"; +import { dirname } from "node:path"; +import { expect } from "vitest"; +import { + describeEval, + piAiHarness, + skilletAgent, +} from "@sentry/skillet/evals"; +import { + RecommendsSkilletCreateJudge, + UsesScopedPackageJudge, +} from "./_judges.js"; + +const skillRoot = dirname(fileURLToPath(import.meta.url)).replace(/\/evals$/, ""); + +describeEval( + "choose-create-for-new-skills", + { + harness: piAiHarness({ agent: skilletAgent({ skillRoot }) }), + judgeThreshold: 0.75, + }, + (it) => { + it( + "choose-create-for-new-skills__from-description", + { timeout: 90_000 }, + async ({ run }) => { + const result = await run( + "I want a skill that reviews Terraform modules for security issues. How do I get started?", + ); + + await expect(result).toSatisfyJudge(RecommendsSkilletCreateJudge); + await expect(result).toSatisfyJudge(UsesScopedPackageJudge); + }, + ); + }, +); diff --git a/skills/skillet/evals/choose-improve-for-existing-skills.eval.ts b/skills/skillet/evals/choose-improve-for-existing-skills.eval.ts new file mode 100644 index 0000000..df98b71 --- /dev/null +++ b/skills/skillet/evals/choose-improve-for-existing-skills.eval.ts @@ -0,0 +1,42 @@ +// ────────────────────────────────────────────────────────── +// Generated initially from spec.yaml; durable after that. Edit +// freely to refine prompts, setup, and assertions for this +// behavior. Add or remove behaviors via spec.yaml — skillet only +// regenerates eval files for behaviors that don't have one yet. 
+// ────────────────────────────────────────────────────────── +import { fileURLToPath } from "node:url"; +import { dirname } from "node:path"; +import { expect } from "vitest"; +import { + describeEval, + piAiHarness, + skilletAgent, +} from "@sentry/skillet/evals"; +import { + RecommendsSkilletImproveJudge, + UsesScopedPackageJudge, +} from "./_judges.js"; + +const skillRoot = dirname(fileURLToPath(import.meta.url)).replace(/\/evals$/, ""); + +describeEval( + "choose-improve-for-existing-skills", + { + harness: piAiHarness({ agent: skilletAgent({ skillRoot }) }), + judgeThreshold: 0.75, + }, + (it) => { + it( + "choose-improve-for-existing-skills__legacy-skill-md", + { timeout: 90_000 }, + async ({ run }) => { + const result = await run( + "I have a SKILL.md file from another project but no spec.yaml. I want to clean it up and add a couple of missing behaviors. What's the workflow?", + ); + + await expect(result).toSatisfyJudge(RecommendsSkilletImproveJudge); + await expect(result).toSatisfyJudge(UsesScopedPackageJudge); + }, + ); + }, +); diff --git a/skills/skillet/evals/choose-spec-refine-for-feedback.eval.ts b/skills/skillet/evals/choose-spec-refine-for-feedback.eval.ts new file mode 100644 index 0000000..b93106f --- /dev/null +++ b/skills/skillet/evals/choose-spec-refine-for-feedback.eval.ts @@ -0,0 +1,40 @@ +// ────────────────────────────────────────────────────────── +// Generated initially from spec.yaml; durable after that. Edit +// freely to refine prompts, setup, and assertions for this +// behavior. Add or remove behaviors via spec.yaml — skillet only +// regenerates eval files for behaviors that don't have one yet. 
+// ────────────────────────────────────────────────────────── +import { fileURLToPath } from "node:url"; +import { dirname } from "node:path"; +import { expect } from "vitest"; +import { + describeEval, + piAiHarness, + skilletAgent, +} from "@sentry/skillet/evals"; +import { + RecommendsSpecRefineJudge, +} from "./_judges.js"; + +const skillRoot = dirname(fileURLToPath(import.meta.url)).replace(/\/evals$/, ""); + +describeEval( + "choose-spec-refine-for-feedback", + { + harness: piAiHarness({ agent: skilletAgent({ skillRoot }) }), + judgeThreshold: 0.75, + }, + (it) => { + it( + "choose-spec-refine-for-feedback__natural-language-change", + { timeout: 90_000 }, + async ({ run }) => { + const result = await run( + "The skill is being too cautious — I want it to stop hedging on every recommendation. How do I tell it that?", + ); + + await expect(result).toSatisfyJudge(RecommendsSpecRefineJudge); + }, + ); + }, +); diff --git a/skills/skillet/evals/choose-spec-show-for-inspection.eval.ts b/skills/skillet/evals/choose-spec-show-for-inspection.eval.ts new file mode 100644 index 0000000..1f610e0 --- /dev/null +++ b/skills/skillet/evals/choose-spec-show-for-inspection.eval.ts @@ -0,0 +1,40 @@ +// ────────────────────────────────────────────────────────── +// Generated initially from spec.yaml; durable after that. Edit +// freely to refine prompts, setup, and assertions for this +// behavior. Add or remove behaviors via spec.yaml — skillet only +// regenerates eval files for behaviors that don't have one yet. 
+// ────────────────────────────────────────────────────────── +import { fileURLToPath } from "node:url"; +import { dirname } from "node:path"; +import { expect } from "vitest"; +import { + describeEval, + piAiHarness, + skilletAgent, +} from "@sentry/skillet/evals"; +import { + RecommendsSpecShowJudge, +} from "./_judges.js"; + +const skillRoot = dirname(fileURLToPath(import.meta.url)).replace(/\/evals$/, ""); + +describeEval( + "choose-spec-show-for-inspection", + { + harness: piAiHarness({ agent: skilletAgent({ skillRoot }) }), + judgeThreshold: 0.75, + }, + (it) => { + it( + "choose-spec-show-for-inspection__readonly-view", + { timeout: 90_000 }, + async ({ run }) => { + const result = await run( + "How do I just look at the current spec for this skill without changing anything?", + ); + + await expect(result).toSatisfyJudge(RecommendsSpecShowJudge); + }, + ); + }, +); diff --git a/skills/skillet/evals/choose-verify-for-checking.eval.ts b/skills/skillet/evals/choose-verify-for-checking.eval.ts new file mode 100644 index 0000000..3d4a956 --- /dev/null +++ b/skills/skillet/evals/choose-verify-for-checking.eval.ts @@ -0,0 +1,42 @@ +// ────────────────────────────────────────────────────────── +// Generated initially from spec.yaml; durable after that. Edit +// freely to refine prompts, setup, and assertions for this +// behavior. Add or remove behaviors via spec.yaml — skillet only +// regenerates eval files for behaviors that don't have one yet. 
+// ────────────────────────────────────────────────────────── +import { fileURLToPath } from "node:url"; +import { dirname } from "node:path"; +import { expect } from "vitest"; +import { + describeEval, + piAiHarness, + skilletAgent, +} from "@sentry/skillet/evals"; +import { + DoesNotRecommendValidateJudge, + RecommendsVerifyJudge, +} from "./_judges.js"; + +const skillRoot = dirname(fileURLToPath(import.meta.url)).replace(/\/evals$/, ""); + +describeEval( + "choose-verify-for-checking", + { + harness: piAiHarness({ agent: skilletAgent({ skillRoot }) }), + judgeThreshold: 0.75, + }, + (it) => { + it( + "choose-verify-for-checking__consistency-check", + { timeout: 90_000 }, + async ({ run }) => { + const result = await run( + "How do I check that my skill is internally consistent before I commit?", + ); + + await expect(result).toSatisfyJudge(RecommendsVerifyJudge); + await expect(result).toSatisfyJudge(DoesNotRecommendValidateJudge); + }, + ); + }, +); diff --git a/skills/skillet/evals/dont-mention-api-keys.eval.ts b/skills/skillet/evals/dont-mention-api-keys.eval.ts new file mode 100644 index 0000000..84be915 --- /dev/null +++ b/skills/skillet/evals/dont-mention-api-keys.eval.ts @@ -0,0 +1,40 @@ +// ────────────────────────────────────────────────────────── +// Generated initially from spec.yaml; durable after that. Edit +// freely to refine prompts, setup, and assertions for this +// behavior. Add or remove behaviors via spec.yaml — skillet only +// regenerates eval files for behaviors that don't have one yet. 
+// ────────────────────────────────────────────────────────── +import { fileURLToPath } from "node:url"; +import { dirname } from "node:path"; +import { expect } from "vitest"; +import { + describeEval, + piAiHarness, + skilletAgent, +} from "@sentry/skillet/evals"; +import { + DoesNotMentionApiKeysJudge, +} from "./_judges.js"; + +const skillRoot = dirname(fileURLToPath(import.meta.url)).replace(/\/evals$/, ""); + +describeEval( + "dont-mention-api-keys", + { + harness: piAiHarness({ agent: skilletAgent({ skillRoot }) }), + judgeThreshold: 0.75, + }, + (it) => { + it( + "dont-mention-api-keys__setup-question", + { timeout: 90_000 }, + async ({ run }) => { + const result = await run( + "What do I need to set up before I can run skillet for the first time?", + ); + + await expect(result).toSatisfyJudge(DoesNotMentionApiKeysJudge); + }, + ); + }, +); diff --git a/skills/skillet/evals/dont-recommend-validate.eval.ts b/skills/skillet/evals/dont-recommend-validate.eval.ts new file mode 100644 index 0000000..bc3829e --- /dev/null +++ b/skills/skillet/evals/dont-recommend-validate.eval.ts @@ -0,0 +1,42 @@ +// ────────────────────────────────────────────────────────── +// Generated initially from spec.yaml; durable after that. Edit +// freely to refine prompts, setup, and assertions for this +// behavior. Add or remove behaviors via spec.yaml — skillet only +// regenerates eval files for behaviors that don't have one yet. 
+// ────────────────────────────────────────────────────────── +import { fileURLToPath } from "node:url"; +import { dirname } from "node:path"; +import { expect } from "vitest"; +import { + describeEval, + piAiHarness, + skilletAgent, +} from "@sentry/skillet/evals"; +import { + DoesNotRecommendValidateJudge, + RecommendsVerifyJudge, +} from "./_judges.js"; + +const skillRoot = dirname(fileURLToPath(import.meta.url)).replace(/\/evals$/, ""); + +describeEval( + "dont-recommend-validate", + { + harness: piAiHarness({ agent: skilletAgent({ skillRoot }) }), + judgeThreshold: 0.75, + }, + (it) => { + it( + "dont-recommend-validate__leading-validate-question", + { timeout: 90_000 }, + async ({ run }) => { + const result = await run( + "Does skillet have a validate command I should run on my spec?", + ); + + await expect(result).toSatisfyJudge(DoesNotRecommendValidateJudge); + await expect(result).toSatisfyJudge(RecommendsVerifyJudge); + }, + ); + }, +); diff --git a/skills/skillet/evals/dont-tell-user-to-handedit-derived-files.eval.ts b/skills/skillet/evals/dont-tell-user-to-handedit-derived-files.eval.ts new file mode 100644 index 0000000..46232cc --- /dev/null +++ b/skills/skillet/evals/dont-tell-user-to-handedit-derived-files.eval.ts @@ -0,0 +1,55 @@ +// ────────────────────────────────────────────────────────── +// Generated initially from spec.yaml; durable after that. Edit +// freely to refine prompts, setup, and assertions for this +// behavior. Add or remove behaviors via spec.yaml — skillet only +// regenerates eval files for behaviors that don't have one yet. 
+// ────────────────────────────────────────────────────────── +import { fileURLToPath } from "node:url"; +import { dirname } from "node:path"; +import { expect } from "vitest"; +import { + describeEval, + piAiHarness, + skilletAgent, +} from "@sentry/skillet/evals"; +import { + DoesNotRecommendHandEditSkillMdJudge, + ExplainsEvalsAreDurableJudge, + RecommendsSpecRefineJudge, +} from "./_judges.js"; + +const skillRoot = dirname(fileURLToPath(import.meta.url)).replace(/\/evals$/, ""); + +describeEval( + "dont-tell-user-to-handedit-derived-files", + { + harness: piAiHarness({ agent: skilletAgent({ skillRoot }) }), + judgeThreshold: 0.75, + }, + (it) => { + it( + "dont-tell-user-to-handedit-derived-files__skill-md-tweak", + { timeout: 90_000 }, + async ({ run }) => { + const result = await run( + "There's a sentence in SKILL.md I'd like to rephrase. Should I just open the file and change it?", + ); + + await expect(result).toSatisfyJudge(DoesNotRecommendHandEditSkillMdJudge); + await expect(result).toSatisfyJudge(RecommendsSpecRefineJudge); + }, + ); + + it( + "dont-tell-user-to-handedit-derived-files__eval-file-tweak", + { timeout: 90_000 }, + async ({ run }) => { + const result = await run( + "I want to tighten an assertion in one of my evals/*.eval.ts files. Is editing it directly the right move, or do I have to go through the CLI?", + ); + + await expect(result).toSatisfyJudge(ExplainsEvalsAreDurableJudge); + }, + ); + }, +); diff --git a/skills/skillet/evals/explain-spec-as-source-of-truth.eval.ts b/skills/skillet/evals/explain-spec-as-source-of-truth.eval.ts new file mode 100644 index 0000000..f6f7a5a --- /dev/null +++ b/skills/skillet/evals/explain-spec-as-source-of-truth.eval.ts @@ -0,0 +1,55 @@ +// ────────────────────────────────────────────────────────── +// Generated initially from spec.yaml; durable after that. Edit +// freely to refine prompts, setup, and assertions for this +// behavior. 
Add or remove behaviors via spec.yaml — skillet only +// regenerates eval files for behaviors that don't have one yet. +// ────────────────────────────────────────────────────────── +import { fileURLToPath } from "node:url"; +import { dirname } from "node:path"; +import { expect } from "vitest"; +import { + describeEval, + piAiHarness, + skilletAgent, +} from "@sentry/skillet/evals"; +import { + ExplainsEvalsAreDurableJudge, + ExplainsSpecAsSourceOfTruthJudge, + RecommendsSpecRefineJudge, +} from "./_judges.js"; + +const skillRoot = dirname(fileURLToPath(import.meta.url)).replace(/\/evals$/, ""); + +describeEval( + "explain-spec-as-source-of-truth", + { + harness: piAiHarness({ agent: skilletAgent({ skillRoot }) }), + judgeThreshold: 0.75, + }, + (it) => { + it( + "explain-spec-as-source-of-truth__editing-skill-md", + { timeout: 90_000 }, + async ({ run }) => { + const result = await run( + "I want to tweak the wording in SKILL.md to make it clearer. Can I just open it and edit?", + ); + + await expect(result).toSatisfyJudge(ExplainsSpecAsSourceOfTruthJudge); + await expect(result).toSatisfyJudge(RecommendsSpecRefineJudge); + }, + ); + + it( + "explain-spec-as-source-of-truth__editing-eval-files", + { timeout: 90_000 }, + async ({ run }) => { + const result = await run( + "Can I hand-edit the files under evals/ to tighten up the assertions, or will skillet overwrite them?", + ); + + await expect(result).toSatisfyJudge(ExplainsEvalsAreDurableJudge); + }, + ); + }, +); diff --git a/skills/skillet/evals/scope-package-name.eval.ts b/skills/skillet/evals/scope-package-name.eval.ts new file mode 100644 index 0000000..e42205b --- /dev/null +++ b/skills/skillet/evals/scope-package-name.eval.ts @@ -0,0 +1,40 @@ +// ────────────────────────────────────────────────────────── +// Generated initially from spec.yaml; durable after that. Edit +// freely to refine prompts, setup, and assertions for this +// behavior. 
Add or remove behaviors via spec.yaml — skillet only +// regenerates eval files for behaviors that don't have one yet. +// ────────────────────────────────────────────────────────── +import { fileURLToPath } from "node:url"; +import { dirname } from "node:path"; +import { expect } from "vitest"; +import { + describeEval, + piAiHarness, + skilletAgent, +} from "@sentry/skillet/evals"; +import { + UsesScopedPackageJudge, +} from "./_judges.js"; + +const skillRoot = dirname(fileURLToPath(import.meta.url)).replace(/\/evals$/, ""); + +describeEval( + "scope-package-name", + { + harness: piAiHarness({ agent: skilletAgent({ skillRoot }) }), + judgeThreshold: 0.75, + }, + (it) => { + it( + "scope-package-name__one-liner-install", + { timeout: 90_000 }, + async ({ run }) => { + const result = await run( + "Give me the one-liner to run skillet via npx so I can try it without installing globally.", + ); + + await expect(result).toSatisfyJudge(UsesScopedPackageJudge); + }, + ); + }, +); diff --git a/skills/skillet/spec.yaml b/skills/skillet/spec.yaml new file mode 100644 index 0000000..296f987 --- /dev/null +++ b/skills/skillet/spec.yaml @@ -0,0 +1,121 @@ +# ────────────────────────────────────────────────────────── +# Skillet skill spec. Edit this file directly or use the +# `skillet spec` subcommands — both are supported. Skillet +# validates this file on read; malformed edits will fail fast +# with a clear error before doing any work. +# +# After editing, run `skillet improve` to refresh SKILL.md and +# eval cases against the updated spec. +# ────────────────────────────────────────────────────────── +managed_by: skillet +spec_version: 1 +name: skillet +intent: | + Create, evaluate, and improve agent skills using the skillet CLI. + Skillet is spec-driven: spec.yaml captures intent (behaviors, + must-nots, triggers). SKILL.md is regenerated from it. 
Eval files + (evals/*.eval.ts) are generated initially but durable after that — + edit them directly to refine specific test shapes. Iteration patches + the spec or tunes SKILL.md prose, not the eval implementations. + +triggers: + should: + - create a skill + - make a skill for X + - improve this skill + - add an eval + - test my skill + - verify a skill + - refine a skill + - working with spec.yaml + - working with SKILL.md + - working with eval files + should_not: + - run my unit tests + - lint this code + +behaviors: + - id: choose-create-for-new-skills + statement: Recommend `skillet create` when the user wants to start a new skill from a description. + rationale: | + `create` runs spec init + regen + improve in one shot. It's the + friendliest entry point for "I want a skill for X" requests. + + - id: choose-improve-for-existing-skills + statement: Recommend `skillet improve` when the user has an existing skill (with or without spec.yaml) that needs work. + rationale: | + `improve` auto-imports a legacy SKILL.md into a spec on first run, + then runs the verify-driven iteration loop. Don't direct users to + manually run `spec import` — the loop handles it. + + - id: choose-spec-show-for-inspection + statement: Recommend `skillet spec show` when the user wants to read the current spec without changing it. + rationale: | + Show is read-only and prints the parsed spec with the banner stripped. + + - id: choose-spec-refine-for-feedback + statement: Recommend `skillet spec refine ""` when the user wants to change a skill via natural-language feedback. + rationale: | + Refine produces structured SpecPatch operations, applies them, and + auto-regens. The user describes the change in their own words. + + - id: choose-add-eval-for-named-behaviors + statement: Recommend `skillet add-eval ""` when the user wants to add one or more named behaviors as eval cases. 
+ rationale: | + `add-eval` is a wrapper over `spec refine` that auto-imports legacy + skills, then appends the named behaviors to the spec and regens. + + - id: choose-verify-for-checking + statement: Recommend `skillet verify` (not "validate") when the user wants to check that a skill is internally consistent. + rationale: | + The old `validate` command is gone. `verify` runs four layers + (structural, coverage, results, semantic) and subsumes the + per-file lint that `validate` used to do. + + - id: scope-package-name + statement: Always invoke skillet via `npx @sentry/skillet`, not `npx skillet`. + rationale: | + The package is published under the @sentry scope. The unscoped + name resolves to a different package or fails. + + - id: capture-intent-before-generation + statement: When the user asks for a new skill or wants to add evals, ask 3-5 questions to capture intent (most important behaviors, realistic prompt + expected output, common mistakes, trigger phrases) before invoking the CLI. + rationale: | + Skillet's spec-init phase is single-turn — it generates a spec + from whatever description it receives. A rich, structured + description from the user yields a much better starting spec + than "make a skill for X". The agent acts as the front-end + interview before passing the combined description to skillet. + + - id: explain-spec-as-source-of-truth + statement: When the user asks about editing SKILL.md, explain that SKILL.md is derived from spec.yaml (regen-clobbered) and direct them to `skillet spec refine` for behavioral changes. Eval files (evals/*.eval.ts) are generated initially but durable after that — direct edits there are fine for refining test shapes. + rationale: | + SKILL.md is rewritten on every regen, so prose hand-edits get + wiped. Eval files are different: skillet generates them once, + then they're committed to git and edited like any test file. 
+ Behavior set changes (add/remove rules) flow through spec.yaml + so the eval coverage stays in sync. + +must_not: + - id: dont-mention-api-keys + statement: Never tell the user to set API keys or environment variables. Credentials are auto-discovered. + rationale: | + Skillet uses provider-autodiscovery; mentioning API keys both + contradicts the user-zero-config promise and might leak the + specific env var name into a transcript. + leakage_risk: env-var-leak + + - id: dont-recommend-validate + statement: Don't recommend `skillet validate` — that command was removed. + rationale: | + Per-file structural checks now live as layer 1 of `verify`. + Telling the user to run `validate` will fail with an unknown-command error. + + - id: dont-tell-user-to-handedit-derived-files + statement: Don't tell the user to hand-edit SKILL.md (it's regenerated and clobbered on every regen). Direct them to `skillet spec refine` for behavioral changes. Eval files are durable and can be edited directly to refine test shapes. + rationale: | + SKILL.md is rewritten from spec.yaml on every regen, so prose + hand-edits get wiped. Eval files (.eval.ts) are different — + generated once, committed, edited like any test file. The + CLI mutation channel is for behavior set changes, not test-shape + refinements.
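Note on eval coverage: in this diff, every `behaviors` and `must_not` id in spec.yaml has a matching `evals/<id>.eval.ts` file. A minimal sketch of that naming relationship is below. It is not skillet's implementation of the `verify` coverage layer, which this PR does not show. `SpecRule`, `SkillSpec`, and `missingEvalFiles` are hypothetical names, and the types model only the fields the check needs.

```typescript
// Hypothetical, trimmed-down view of spec.yaml: only the fields this check uses.
interface SpecRule {
  id: string;
  statement: string;
}

interface SkillSpec {
  behaviors: SpecRule[];
  must_not: SpecRule[];
}

// Return the ids that have no `<id>.eval.ts` among the given file names.
// Assumes the one-file-per-id naming seen in this diff.
function missingEvalFiles(spec: SkillSpec, evalFiles: string[]): string[] {
  const present = new Set(evalFiles);
  return [...spec.behaviors, ...spec.must_not]
    .map((rule) => rule.id)
    .filter((id) => !present.has(`${id}.eval.ts`));
}

const spec: SkillSpec = {
  behaviors: [
    { id: "choose-verify-for-checking", statement: "Recommend `skillet verify`." },
    { id: "scope-package-name", statement: "Invoke via `npx @sentry/skillet`." },
  ],
  must_not: [
    { id: "dont-mention-api-keys", statement: "Never tell the user to set API keys." },
  ],
};

// Shared helpers such as _judges.ts are ignored because no id maps to them.
const evalFiles = [
  "_judges.ts",
  "choose-verify-for-checking.eval.ts",
  "dont-mention-api-keys.eval.ts",
];

const missing = missingEvalFiles(spec, evalFiles);
console.log(missing.join(", ")); // prints: scope-package-name
```

According to the header comment in the eval files, skillet only generates eval files for behaviors that do not have one yet, so a list like `missing` corresponds to the files a later regen would be expected to create.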