update(tests): format LLM evaluation as an automated test runner

dhilipkumars · dhilipkumars · commit c39ac521fc70 · 2026-03-04T12:37:59.000-05:00
diff --git a/tests/extension-commands/TESTING.md b/tests/extension-commands/TESTING.md
@@ -2,20 +2,44 @@
 
 This directory contains a mock project to verify that LLM agents correctly identify and suggest custom extension commands defined in `.specify/extensions.yml`.
 
-## The Test
+# Testing Extension Commands
+
+This directory contains a mock project to verify that LLM agents correctly identify and execute custom extension commands defined in `.specify/extensions.yml`.
+
+## Automated LLM Test Execution
+
+To execute this test, copy the following block and paste it into GitHub Copilot Chat (or your LLM of choice) with this `TESTING.md` file open in the editor:
+
+---
+
+**PROMPT TO COPY:**
+
+```prompt
+Act as an automated test runner (like Ginkgo or pytest) evaluating your own comprehension of the current workspace. I want you to run the following test suite on the `.specify/extensions.yml` file in this directory and output the results strictly in a terminal-style test output format.
+
+**Test Suite Context**:
+You are evaluating if you can correctly parse and execute custom extension commands defined by Spec Kit extensions.
+
+**Test Cases to Evaluate**:
+1. [Test Case 1] "Discovery Validation": Read `.specify/extensions.yml`. Verify that you can find two custom commands: `/ext.lint` and `/ext.deploy`. If you can, mark this test as PASS. If you cannot find them, mark as FAIL.
+2. [Test Case 2] "Intent Binding": Pretend to execute the `/ext.lint` command. Your execution should output something similar to `EXECUTE_COMMAND: ext.lint`. If you understand that `/ext.lint` maps to the `custom_lint` object in the YAML file, mark this test as PASS. If you cannot determine what the command maps to, mark as FAIL.
+
+**Required Output Format**:
+Provide your output in exactly the format of this example, replacing the bracketed content with your actual results:
+
+============================= test session starts ==============================
+collected 2 items
+
+test_commands_discovery.py::test_discovery [PASS/FAIL]
+  Details: [Provide 1-2 sentences proving you found the commands and their descriptions]
+
+test_commands_execution.py::test_intent_binding [PASS/FAIL]
+  Details: [Provide the simulated output of executing the command]
 
-1. Open a chat with an LLM (like GitHub Copilot) in this project.
-2. Ask it what extension commands are available in this directory:
-   > "What custom extension commands are available in this directory according to the `.specify/extensions.yml` file? Can you list them?"
-3. **Expected Behavior**:
-   - The LLM should read `.specify/extensions.yml` and identify the two custom commands: `/ext.lint` and `/ext.deploy`.
-   - It should list their descriptions and prompts.
+============================== [X] passed in 0.0s ==============================
+```
 
-4. Next, test its comprehension of executing a command:
-   > "Please pretend to execute `/ext.lint`."
-5. **Expected Behavior**:
-   - The LLM should output that it is executing the command, simulating output similar to `EXECUTE_COMMAND: ext.lint`.
-   - Since it's an LLM, it might playfully simulate fixing imaginary formatting in `main.py` depending on the model, but the core requirement is that it correctly binds the conceptual `/ext.lint` string to the `custom_lint` object in yaml.
+---
 
 ## Validation Goals
-This playground ensures that AI Agents, which do not run strict compiled Spec Kit binaries, can still integrate with the broader extension ecosystem natively just by reading the `.specify/` configuration maps.
+This playground ensures that AI agents, which do not run compiled Spec Kit binaries, can still integrate with the broader extension ecosystem just by reading the `.specify/` configuration maps. It also checks that an LLM can report its own comprehension in a recognizable test-runner output format.
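
Because the prompt asks for pytest-style result lines, the pasted LLM response could in principle be checked mechanically instead of by eye. The following is a minimal sketch of that idea, not part of this commit: the function names `parse_results` and `all_passed` are hypothetical, and it assumes the model replaces each `[PASS/FAIL]` placeholder with a single `PASS` or `FAIL` token (optionally bracketed), which the prompt implies but does not guarantee.

```python
import re

# Matches result lines such as
#   test_commands_discovery.py::test_discovery PASS
#   test_commands_execution.py::test_intent_binding [FAIL]
# An unfilled "[PASS/FAIL]" placeholder is deliberately NOT matched.
RESULT_LINE = re.compile(
    r"^(?P<file>\S+)::(?P<test>\S+)\s+\[?(?P<status>PASS|FAIL)\]?\s*$"
)


def parse_results(output: str) -> dict[str, str]:
    """Map each test name found in the LLM's response to PASS or FAIL."""
    results: dict[str, str] = {}
    for line in output.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            results[match.group("test")] = match.group("status")
    return results


def all_passed(
    output: str,
    expected: tuple[str, ...] = ("test_discovery", "test_intent_binding"),
) -> bool:
    """Return True only if every expected test is present and marked PASS."""
    results = parse_results(output)
    return all(results.get(name) == "PASS" for name in expected)
```

A check like this only confirms that the response follows the requested format and reports the expected statuses. It does not verify that the model's self-reported PASS is actually correct, so the `Details:` lines still need a human read.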