Make TeaAgent's dynamic skill workflow objectively testable:
- A skill can be generated from a successful run.
- The generated skill enters a reviewed candidate bundle.
- The skill can be installed and activated.
- The later run uses the skill in a way that changes behavior.
- Long WebSearch/RSS/skill results remain usable through truncation, artifact pointers, pagination, and compaction.
- Final outputs are checked mechanically before the run claims success.
Related June 5 documentation package:
- Agent Ecosystem Core Values
- RSS Dynamic Skill Failure Case Study
- Dynamic Skill Critical Questioning
- Dynamic Skill Lifecycle And Result Flow
- Dynamic Skill And Long Result Work Items
Status: Complete. Lifecycle state machine implemented in `skill_lifecycle.py`, activation explain via `explain_skill_activation()`, shadowed skill detection, governance classification.
Define explicit states:
- `discovered`
- `indexed`
- `selected`
- `activated`
- `resource_read`
- `candidate_proposed`
- `candidate_eval_passed`
- `review_passed`
- `installed`
- `used_in_run`
- `output_verified`
- `superseded`
- `blocked`
Acceptance checks:
- `skill explain` reports search path, winning path, shadowed paths, write target, governance status, and activation mode.
- Audit log records state transitions, not only `skill_load`.
- A skill cannot be reported as `used_in_run` unless a runtime action references that skill name.
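The last two acceptance checks can be illustrated with a minimal sketch. The state names come from the list above; the `SkillRecord` class, the audit entry shape, and the `skill_name` key on runtime actions are hypothetical and not taken from `skill_lifecycle.py`.

```python
from dataclasses import dataclass, field

# State names as listed above.
STATES = {
    "discovered", "indexed", "selected", "activated", "resource_read",
    "candidate_proposed", "candidate_eval_passed", "review_passed",
    "installed", "used_in_run", "output_verified", "superseded", "blocked",
}


@dataclass
class SkillRecord:
    """Hypothetical per-skill record that logs every state transition."""
    name: str
    state: str = "discovered"
    audit: list = field(default_factory=list)

    def transition(self, new_state, runtime_actions=()):
        if new_state not in STATES:
            raise ValueError(f"unknown state: {new_state}")
        # used_in_run needs evidence: a runtime action that names this skill.
        if new_state == "used_in_run" and not any(
            action.get("skill_name") == self.name for action in runtime_actions
        ):
            raise ValueError("no runtime action references this skill")
        self.audit.append(
            {"skill": self.name, "from": self.state, "to": new_state}
        )
        self.state = new_state
```

A caller that cannot supply a matching runtime action gets an error instead of a `used_in_run` claim, and nothing is written to the audit list for the rejected transition.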
ROI:
- Very high. It directly turns vague success claims into inspectable facts.
Risk:
- Low. Mostly metadata and tests.
Status: Complete. Active skill write guard implemented via provenance gate and protected path rules in `skill_candidates.py`.
Problem:
- The RSS failure wrote directly into `.opencode/skill`. Compatibility discovery made the skill visible, but the user could not tell whether this was a governed TeaAgent skill or an unmanaged file.
Plan:
- Add a protected path rule for active skill directories:
  - `.config/agent/skills/**`
  - `.claude/skills/**`
  - `.opencode/skill/**`
  - `.opencode/skills/**`
- Allow writes only through:
  - `skill candidate propose`
  - `skill candidate install`
  - an explicit `--allow-direct-skill-write` development flag
- If blocked, write the proposed content into `.teaagent/skill-candidates/` or a rejected proposal artifact with an actionable message.
Acceptance checks:
- Workspace write tool cannot write directly to active skill dirs by default.
- Candidate install can still write to `.config/agent/skills`.
- TUI message explains where the candidate was quarantined.
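The guard described in this plan could look roughly like the sketch below. The protected roots and the quarantine directory come from the plan; the function names, the `source` values, and the returned dictionary are assumptions for illustration, and the sketch does not resolve symlinks or `..` segments.

```python
from pathlib import PurePosixPath

# Protected roots from the plan above (each stands for "root/**").
PROTECTED_ROOTS = (
    ".config/agent/skills",
    ".claude/skills",
    ".opencode/skill",
    ".opencode/skills",
)
QUARANTINE_DIR = ".teaagent/skill-candidates"


def protected_tail(rel_path):
    """Return the path parts below a protected root, or None if unprotected."""
    parts = PurePosixPath(rel_path).parts
    for root in PROTECTED_ROOTS:
        root_parts = PurePosixPath(root).parts
        if len(parts) > len(root_parts) and parts[:len(root_parts)] == root_parts:
            return parts[len(root_parts):]
    return None


def check_write(rel_path, source="workspace_write", allow_direct=False):
    """Decide whether a write may proceed (hypothetical decision shape)."""
    tail = protected_tail(rel_path)
    if tail is None or allow_direct or source == "candidate_install":
        return {"allowed": True, "target": rel_path}
    quarantined = str(PurePosixPath(QUARANTINE_DIR, *tail))
    return {
        "allowed": False,
        "target": quarantined,
        "message": (
            f"Direct write to {rel_path} is blocked; content was saved to "
            f"{quarantined}. Use the skill candidate workflow to install it."
        ),
    }
```

Comparing whole path components rather than string prefixes keeps `.opencode/skill` and `.opencode/skills` distinct and avoids matching unrelated names such as `.opencode/skillset`.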
ROI:
- Very high. Prevents the exact "it went to `.opencode`" confusion.
Risk:
- Medium. Existing users may intentionally keep shared skills in `.opencode`. Provide explicit opt-in and migration notes.
Status: Complete. Offline fixtures under `tests/skills/fixtures/rss/`, acceptance test `test_skill_rss_fixtures.py`, plus built-in RSS starter skill at `teaagent/skills/builtin/rss-summary/`.
Build an offline deterministic RSS test fixture.
Fixture files:
- `fixtures/rss/feedbro-subscriptions.opml`
- `fixtures/rss/ai-news.xml`
- `fixtures/rss/security.xml`
- `fixtures/rss/devtools.xml`
- `fixtures/rss/large-feed.xml`
Test prompt:
Use the RSS summary skill to summarize the feeds in feedbro-subscriptions.opml.
Write categorized markdown into outputs/rss-summary.md with source links,
dates, and at least three bullets per category.
Assertions:
- Output file exists.
- Output file has minimum size, for example > 2000 bytes.
- Output includes at least three categories.
- Output includes at least N feed item titles from fixtures.
- Output includes source URLs.
- Output does not include fixture prompt-injection text as instructions.
- Audit shows selected or activated `rss-summary`.
- The helper script, if generated, was executed.
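Most of the content assertions above can be checked mechanically. The following is a sketch only: the function name, the failure labels, and the convention that each category is a level-2 markdown heading are assumptions, and the injection check is a simple phrase scan rather than proof that no instruction was followed.

```python
import re


def validate_rss_summary(markdown, fixture_titles, injection_phrases,
                         min_bytes=2000, min_categories=3, min_titles=3):
    """Return a list of failure labels; an empty list means the checks passed."""
    failures = []
    if len(markdown.encode("utf-8")) <= min_bytes:
        failures.append("too_small")
    # Assumption: one "## Heading" per category.
    categories = re.findall(r"^##[ \t]+\S", markdown, flags=re.MULTILINE)
    if len(categories) < min_categories:
        failures.append("too_few_categories")
    found = [title for title in fixture_titles if title in markdown]
    if len(found) < min_titles:
        failures.append("too_few_titles")
    if not re.search(r"https?://\S+", markdown):
        failures.append("no_source_urls")
    lowered = markdown.lower()
    if any(phrase.lower() in lowered for phrase in injection_phrases):
        failures.append("injection_text_present")
    return failures
```

Returning labels rather than a single boolean lets the acceptance test report which assertion a weak output failed.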
ROI:
- Very high. RSS is the user's concrete failed scenario.
Risk:
- Low if offline fixtures are used.
Status: Complete. `LongResultEnvelope` in `long_result_envelope.py` with preview, truncation, artifact path, hash, and cursor. Readback via CLI `artifact read` handler.
Add a standard envelope for large tool results:
{
"content_type": "text/markdown",
"preview": "...",
"truncated": true,
"total_bytes": 123456,
"preview_bytes": 50000,
"artifact_path": ".teaagent/artifacts/tool-results/run-id/tool-id.txt",
"content_hash": "sha256:...",
"cursor": "offset:50000",
"suggested_next_action": "read artifact_path with offset cursor"
}

Acceptance checks:
- A fake WebSearch/RSS tool returning > 80 KB produces an envelope.
- The model-visible observation includes the preview and artifact pointer.
- Full content is retrievable by offset.
- Compaction preserves artifact pointer and content hash.
- Final summary cites source IDs that exist in the full artifact.
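As a rough illustration of the envelope and the offset readback, the sketch below writes the full result to disk and returns the fields shown in the JSON example. The function names and the byte-oriented readback are assumptions, not the `LongResultEnvelope` implementation.

```python
import hashlib
from pathlib import Path

PREVIEW_BYTES = 50_000  # matches the preview_bytes value in the example above


def make_envelope(content, artifact_path, content_type="text/plain",
                  preview_bytes=PREVIEW_BYTES):
    """Store the full result as an artifact and return a preview envelope."""
    data = content.encode("utf-8")
    artifact_path = Path(artifact_path)
    artifact_path.parent.mkdir(parents=True, exist_ok=True)
    artifact_path.write_bytes(data)
    truncated = len(data) > preview_bytes
    return {
        "content_type": content_type,
        # errors="ignore" drops a multi-byte character split at the cut.
        "preview": data[:preview_bytes].decode("utf-8", errors="ignore"),
        "truncated": truncated,
        "total_bytes": len(data),
        "preview_bytes": min(len(data), preview_bytes),
        "artifact_path": str(artifact_path),
        "content_hash": "sha256:" + hashlib.sha256(data).hexdigest(),
        "cursor": f"offset:{preview_bytes}" if truncated else None,
        "suggested_next_action": "read artifact_path with offset cursor",
    }


def read_artifact(envelope, cursor, max_bytes=PREVIEW_BYTES):
    """Return (chunk_bytes, next_cursor); next_cursor is None at the end."""
    offset = int(cursor.split(":", 1)[1])
    data = Path(envelope["artifact_path"]).read_bytes()
    chunk = data[offset:offset + max_bytes]
    next_offset = offset + len(chunk)
    next_cursor = f"offset:{next_offset}" if next_offset < len(data) else None
    return chunk, next_cursor
```

Readback returns raw bytes so that a caller can reassemble the artifact exactly and compare it against `content_hash`, even when a chunk boundary falls inside a multi-byte character.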
ROI:
- Very high for daily usage. Long tool results are unavoidable.
Risk:
- Medium. It touches tool result shape and audit expectations.
Status: Complete. `SkillCandidateEval` in `skill_eval.py` with `without_skill`/`with_skill` modes, fixture-based assertions.
Extend candidate eval from structural checks to behavioral checks.
Run each eval case in two modes:
- `without_skill`
- `with_skill`
Record:
- status
- produced files
- tokens
- duration
- tool calls
- assertion results
Assertions should be mechanical when possible:
- file exists
- JSON parses
- markdown contains source IDs
- row count matches fixture
- no prompt-injection phrase was followed
- no placeholder scripts remain
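One way to compare the two modes is sketched below. The recorded fields follow the list above; the `EvalRun` class and the rule "the skill adds value when the `with_skill` run passes every assertion and passes strictly more assertions than the baseline" are assumptions, not the behavior of `SkillCandidateEval`.

```python
from dataclasses import dataclass, field


@dataclass
class EvalRun:
    """Hypothetical record of one eval case in one mode."""
    mode: str                      # "without_skill" or "with_skill"
    status: str                    # for example "ok" or "error"
    produced_files: list = field(default_factory=list)
    tokens: int = 0
    duration_s: float = 0.0
    tool_calls: int = 0
    assertions: dict = field(default_factory=dict)  # assertion name -> bool

    def passed_count(self):
        return sum(1 for ok in self.assertions.values() if ok)


def skill_adds_value(baseline, candidate):
    """True when the skill run is fully green and beats the baseline."""
    if baseline.mode != "without_skill" or candidate.mode != "with_skill":
        raise ValueError("expected a without_skill run and a with_skill run")
    fully_green = (
        candidate.status == "ok"
        and bool(candidate.assertions)
        and all(candidate.assertions.values())
    )
    return fully_green and candidate.passed_count() > baseline.passed_count()
```

Under this rule a skill that merely matches the baseline is reported as adding no value, which is the case the behavioral eval is meant to catch.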
ROI:
- High. It detects skills that look valid but do not improve behavior.
Risk:
- Medium. Requires a scripted adapter or deterministic model harness for CI.
Status: Complete. Skill lifecycle audit events (`skill_indexed`, `skill_selected`, `skill_activated`, etc.) emitted through `SkillLifecycleTracker`.
Add first-class events:
- `skill_indexed`
- `skill_selected`
- `skill_activated`
- `skill_resource_read`
- `skill_used_for_output`
- `skill_output_verified`
The event payload should include:
- skill name
- source path
- governance status
- token estimate
- activation cause: user explicit, model selected, config selected, eager
- run ID
- final output artifact paths
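A payload with these fields might be modeled as below. The event names and payload fields come from the lists above; the field identifiers, the `SkillEvent` class, and the JSON serialization are illustrative assumptions and do not describe `SkillLifecycleTracker`.

```python
import json
from dataclasses import asdict, dataclass

EVENT_NAMES = {
    "skill_indexed", "skill_selected", "skill_activated",
    "skill_resource_read", "skill_used_for_output", "skill_output_verified",
}
# Assumed identifiers for the four activation causes listed above.
ACTIVATION_CAUSES = {"user_explicit", "model_selected", "config_selected", "eager"}


@dataclass(frozen=True)
class SkillEvent:
    event: str
    skill_name: str
    source_path: str
    governance_status: str
    token_estimate: int
    activation_cause: str
    run_id: str
    output_artifact_paths: tuple = ()

    def __post_init__(self):
        # Reject unknown names early so event naming stays stable.
        if self.event not in EVENT_NAMES:
            raise ValueError(f"unknown event: {self.event}")
        if self.activation_cause not in ACTIVATION_CAUSES:
            raise ValueError(f"unknown activation cause: {self.activation_cause}")

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)
```

Validating names at construction time is one way to address the stability risk: a renamed or misspelled event fails in tests instead of silently appearing in logs.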
ROI:
- High. Debugging becomes possible in TUI and logs.
Risk:
- Low to medium. Event naming must remain stable.
Status: Complete. `activate_skill` runtime tool registered when skills exist, with `skill_name` enum validation and audit event.
Activating a skill by injecting its content into the prompt is currently ambiguous. Add an optional runtime tool:
{
"name": "activate_skill",
"schema": {
"skill_name": {"enum": ["rss-summary", "code-review", "..."]}
}
}

Behavior:
- Registered only when skills exist.
- `skill_name` enum prevents hallucinated skill names.
- Returns tagged skill content, resource list, skill root, governance status, and usage constraints.
- Deduplicates if already activated.
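The behavior listed above can be sketched as follows. The tool definition mirrors the JSON example; the `SkillActivator` class, the shape of the skill dictionaries, and the `<skill>` tag format are hypothetical.

```python
def build_activate_skill_tool(skills):
    """Return the tool definition, or None when no skills exist."""
    if not skills:
        return None
    return {
        "name": "activate_skill",
        "schema": {"skill_name": {"enum": sorted(skills)}},
    }


class SkillActivator:
    """Hypothetical runtime handler for the activate_skill tool."""

    def __init__(self, skills):
        self.skills = skills      # skill name -> metadata dict (assumed shape)
        self.active = set()
        self.audit = []

    def activate(self, skill_name):
        if skill_name not in self.skills:
            raise ValueError(f"unknown skill: {skill_name}")
        if skill_name in self.active:
            # Deduplicate: do not return the content a second time.
            return {"skill_name": skill_name, "already_active": True}
        skill = self.skills[skill_name]
        self.active.add(skill_name)
        self.audit.append({"event": "skill_activated", "skill_name": skill_name})
        return {
            "skill_name": skill_name,
            "already_active": False,
            "content": f'<skill name="{skill_name}">\n{skill["body"]}\n</skill>',
            "resources": list(skill["resources"]),
            "skill_root": skill["root"],
            "governance_status": skill["governance_status"],
            "usage_constraints": list(skill.get("usage_constraints", [])),
        }
```

Because the audit entry is written only on the first activation, the audit log shows one activation event per skill per session in this sketch.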
ROI:
- High. It moves activation from implicit prompt hope to explicit runtime event.
Risk:
- Medium. Requires careful prompt and compaction integration.
If a generated skill requires a helper script:
- The script belongs in `scripts/`.
- The skill must describe when to run it.
- The candidate eval must execute it against fixtures.
- The script must not be an untested one-off file in the workspace root.
Acceptance checks:
- Candidate with a script but no script eval fails review.
- Candidate with placeholder script fails minimum content/execution checks.
- Candidate with a working RSS parser script passes fixture eval.
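These three checks could be enforced by a review gate along the lines of the sketch below. The function name, the failure labels, the 100-byte placeholder threshold, and the convention that the script takes the fixture path as its only argument and prints to stdout are all assumptions.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Optional

MIN_SCRIPT_BYTES = 100  # assumed threshold for rejecting placeholder scripts


def review_script(script: Path, fixture: Optional[Path], timeout: int = 30) -> List[str]:
    """Return failure labels for a candidate's helper script."""
    failures = []
    if script.parent.name != "scripts":
        failures.append("script_not_in_scripts_dir")
    if script.stat().st_size < MIN_SCRIPT_BYTES:
        failures.append("placeholder_script")
    if fixture is None:
        # A script without a script eval must fail review.
        failures.append("no_script_eval")
        return failures
    result = subprocess.run(
        [sys.executable, str(script), str(fixture)],
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0 or not result.stdout.strip():
        failures.append("script_eval_failed")
    return failures
```

The size check is deliberately crude; the execution check is what separates a parser that works on the fixture from a file that only looks plausible.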
ROI:
- High for RSS and data-processing skills.
Risk:
- Low.
Status: Complete. TUI diagnostics panel via `get_skill_diagnostics()` in `skill_loader.py`, `/skill-diagnostics` TUI command, enhanced skills panel.
Show:
- loaded skills
- selected skills
- active skill
- governance status
- shadowed skills
- long-result artifacts
- output verification result
ROI:
- Medium to high. Helps users trust and debug the workflow.
Risk:
- Medium. TUI space is limited.
Status: Complete. Built-in RSS starter skill at `teaagent/skills/builtin/rss-summary/` with `SKILL.md`, offline fixtures, and acceptance tests.
Ship a conservative built-in starter skill for RSS:
- `SKILL.md` under built-in or reviewed project skills.
- `scripts/rss_summarize.py`.
- `references/rss-output-contract.md`.
- Offline fixtures.
- Acceptance tests.
This should be an example of how dynamic skills should evolve: start with a tested seed, then allow candidate patches.
ROI:
- Medium. It directly addresses user need and creates a reference pattern.
Risk:
- Medium if network fetching is included. Keep CI offline and make live network fetching optional.
For any future live web search tool, require:
- source URL
- title
- fetched date
- content hash
- preview
- full artifact path
- trust label: untrusted external content
- prompt-injection scan result
- citation/source ID
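A required-field check for such results could be as small as the sketch below. The field identifiers are assumed spellings of the items listed above, not an existing schema.

```python
# Assumed key names for the required items listed above.
REQUIRED_FIELDS = (
    "source_url", "title", "fetched_date", "content_hash", "preview",
    "artifact_path", "trust_label", "injection_scan", "source_id",
)


def missing_fields(result):
    """Return the required keys that are absent, None, or empty strings."""
    return [
        name for name in REQUIRED_FIELDS
        if name not in result or result[name] in (None, "")
    ]
```

A falsy but meaningful value, such as an injection scan result of `False`, still counts as present in this sketch.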
ROI:
- Medium to high. Prevents web outputs from becoming untraceable context.
Risk:
- Medium. Requires source-specific adapters.
| Area | Test | Why It Matters | Priority |
|---|---|---|---|
| Candidate lifecycle | propose -> eval -> review -> install -> explain | Prevents direct-write masquerade | P0 |
| Explicit activation | `--skill rss-summary` and activation tool | Proves user can force use | P0/P1 |
| Implicit activation | model sees index and activates by name | Proves normal UX | P1 |
| Shadowing | same name in `.config`, `.claude`, `.opencode`, user dir | Prevents wrong skill use | P0 |
| Direct write block | workspace write to active skill dir | Prevents unreviewed persistence | P0 |
| RSS fixture | OPML + multiple XML feeds | Replays failed user scenario | P0 |
| Long result | 80 KB/200 KB synthetic results | Proves truncation and retrieval | P0 |
| Prompt injection | feed item says "ignore instructions" | Prevents external content persistence | P0 |
| Compaction | activated skill survives compaction | Prevents mid-run skill amnesia | P1 |
| TUI diagnostics | visible skill status and artifact pointer | Makes failures debuggable | P2 |
Scenarios:
- Generate candidate from a completed run.
- Verify artifacts exist.
- Run offline eval.
- Review passes.
- Install project skill.
- `skill explain` reports `candidate_installed`.
- A new run with `--skill` receives the skill content.
Scenarios:
- Seed or generate `rss-summary` candidate.
- Install it.
- Run against offline OPML/RSS fixtures.
- Verify `outputs/rss-summary.md` mechanically.
- Confirm no placeholder script is accepted.
Scenarios:
- Fake tool returns large content.
- Envelope writes full artifact and preview.
- Agent reads continuation by cursor.
- Final output references source IDs from the full artifact.
Scenarios:
- Activate skill.
- Inject many observations.
- Trigger compaction.
- Verify skill content tags, artifact pointers, and active-skill state remain.
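The compaction scenario can be made concrete with a toy model. This is a sketch under assumptions: observations are dictionaries with `kind` and `text` keys, the three pinned kinds are hypothetical labels, and compaction simply drops the oldest unpinned observations until a character budget is met.

```python
# Hypothetical observation kinds that compaction must never drop.
PINNED_KINDS = {"skill_content", "artifact_pointer", "active_skill_state"}


def compact(observations, max_chars):
    """Drop the oldest unpinned observations until the total fits the budget."""
    kept = list(observations)
    total = sum(len(obs["text"]) for obs in kept)
    index = 0
    while total > max_chars and index < len(kept):
        if kept[index]["kind"] in PINNED_KINDS:
            index += 1          # pinned: skip over it and keep it
            continue
        total -= len(kept[index]["text"])
        del kept[index]         # unpinned: drop and re-check the same index
    return kept
```

If the pinned items alone exceed the budget, this sketch returns them anyway, since losing the active skill or an artifact pointer is the failure the scenario is testing for.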
Scenarios:
- Direct `workspace_write_file` into `.opencode/skill/name/SKILL.md` is blocked or quarantined.
- Candidate install remains allowed.
- Error message points to candidate workflow.
- Add active skill write guard.
- Add `skill_activation` audit event and expose it in `skill explain`.
- Add RSS fixture files.
- Add RSS summary output validator.
- Add acceptance test for installed candidate governance.
Exit criteria:
- RSS fixture cannot pass with a fake 12-byte script.
- Direct active skill writes are visible or blocked.
- Add `ToolResultEnvelope`.
- Add long-result artifact store under `.teaagent/artifacts/tool-results/`.
- Add cursor-based artifact read helper.
- Preserve envelope metadata during compaction.
- Add synthetic long-result acceptance test.
Exit criteria:
- A 200 KB fake search result can be summarized using artifact continuation.
- Add dedicated `activate_skill` tool.
- Add behavioral with-skill vs without-skill eval mode.
- Add TUI diagnostics panel or command.
- Add built-in RSS starter skill or project fixture skill.
Exit criteria:
- A generated skill has measurable behavioral value over baseline.
- RSS fixture acceptance test: highest ROI because it targets a real failure.
- Active skill write guard: highest safety ROI.
- Long result envelope: highest reliability ROI for WebSearch/RSS.
- Skill activation audit: highest debug UX ROI.
- Behavioral eval harness: highest long-term quality ROI.
- TUI diagnostics: high user trust ROI after the backend facts exist.
- Built-in RSS starter skill: useful product feature, but should come after the lifecycle harness can prove it.
- Do not add live web dependencies to CI.
- Do not trust arbitrary community skills by default.
- Do not let background agents patch bundled or installed skills silently.
- Do not treat a final-answer sentence as verification.
- Do not solve all RSS feed formats in the first pass.
The workflow is usable when this command sequence passes end to end, both in CI and locally:
generate skill -> review/eval -> install -> explain -> activate -> run fixture ->
verify artifacts -> audit proves lifecycle
For the RSS case specifically:
- The summary markdown exists.
- It contains real fixture item titles and links.
- It is categorized.
- It records source coverage.
- It ignores feed-level prompt injection.
- It is impossible for a placeholder script or empty summary to pass.