DEMO: intentional skill regression for eval pipeline validation (DO NOT MERGE) #61

Draft

saurabhrb wants to merge 1 commit into main from users/saurabhrb/evals-bad-skill-demo-v2

DEMO: break bulk-create guidance (regression test target)

Azure Pipelines / DVSkillsPlugin-Evals-PR failed May 16, 2026 in 6m 24s

Build #20260516.6 • dry-run • plugin-main • dv_data.biceval.json had test failures

2 errors / 2 warnings

Details

Failed: 2 (66.67%)
Passed: 1 (33.33%)
Other: 0 (0.00%)
Total: 3

Annotations

Check failure on line 488 in Build log

azure-pipelines / DVSkillsPlugin-Evals-PR

Build log #L488

Test pass rate 33.3% is below 100% threshold. Failing pipeline.
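
The gate behind this annotation can be sketched as follows. This is a minimal illustration, assuming the pipeline compares passed/total against a fixed percentage threshold. The names `pass_rate` and `gate` are hypothetical and are not taken from the pipeline's scripts.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage of total test cases."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


def gate(passed: int, total: int, threshold: float = 100.0) -> tuple[bool, str]:
    """Hypothetical gate: succeed only when the pass rate meets the threshold."""
    rate = pass_rate(passed, total)
    if rate >= threshold:
        return True, f"Test pass rate {rate:.1f}% meets {threshold:.0f}% threshold."
    return False, (
        f"Test pass rate {rate:.1f}% is below {threshold:.0f}% threshold. "
        "Failing pipeline."
    )


# Counts from this build: 1 passed out of 3 total.
ok, message = gate(passed=1, total=3)
print(ok, message)
```

With a 100% threshold, a single failing test case is enough to fail the run, which is what makes this build a useful regression target.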

Check failure on line 490 in Build log

azure-pipelines / DVSkillsPlugin-Evals-PR

Build log #L490

Script failed with exit code: 1

Check failure on line 1 in data.data_001

azure-pipelines / DVSkillsPlugin-Evals-PR

data.data_001

4 evaluator(s) failed:
  [P1] CortexConfigurations:Common/Skills/correctness.prompty: score=2 < threshold=3
    Reasoning: Assertion=[Agent uses the official Python SDK for the create. Hand-rolled urllib/requests POST is NOT acceptable.]: The documented skill pattern for creating a single record ([L106-L123] in dataverse_mcp_v2_mock_plugin_config.json) prescribes using a tool named 'create_record', which requires two arguments: 'tablename' (string) and 'item' (JSON object). The agent's script uses DataverseClient and calls client.records.create('new_ticket', record) [L11-L20], which is not a documented method or API in the skill/plugin schema. Furthermore, the import 'from PowerPlatform.Dataverse.client import DataverseClient' [L5] is not verified as a real package/module in the official Dataverse Python SDK (usually pyPowerApps or dynamics-client, but neither matches this import), indicating a potentially fabricated API. This violates skill compliance and the assertion. The recommended approach per the skill/plugin is to invoke the tool 'create_record' via the documented interface, not a hand-rolled or non-existent SDK. Recommendation: Plugin should document the correct, real Python SDK (if one exists), or clarify that only the tool/invocation pattern is supported, to prevent fabrications and confusion.
  [P1] CortexConfigurations:Common/Skills/correctness.prompty: score=2 < threshold=3
    Reasoning: Assertion=[Agent calls client.records.create with the table name and a record dict.]: The agent's script [L1-L21] does NOT use the documented dv-data skill/plugin pattern for creating records. The documented pattern (see LocalAgentConfig/dataverse_mcp_v2_mock_plugin_config.json [L106-L123]) states the expected tool/method is create_record, accepting 'tablename' and 'item' (item is a JSON object with property values using correct column names). The agent instead uses fabricated APIs: from PowerPlatform.Dataverse.client import DataverseClient and client.records.create(), which do not correspond to any real module, documented SDK, or the plugin skill/tooling ([L6-L7], [L17]). Further, the column names 'new_title' and 'new_priority' are used without evidence these are correct logical column names; Dataverse custom tables typically have a prefix, e.g., crb69_title. The skill expects full logical names found via the describe_table tool.

Recommendation: The dv-data skill/plugin should document the correct Python package/sdk (if one exists), including example import/module usage, correct method signature, and logical column naming conventions in custom tables. If there is no real Dataverse Python SDK, the skill should clarify only HTTP or tool-based usage is supported.
  [P1] CortexConfigurations:Common/Skills/correctness.prompty: score=2 < threshold=3
    Reasoning: Assertion=[Agent's executable code is Python (no JavaScript/TypeScript/Node.js or PowerShell).]: The skill/plugin dataverse_mcp_v2_mock_plugin_config.json [L106-L123] documents the 'create_record' tool for Dataverse row creation: expected args are 'tablename' (logical name) and 'item' (JSON object with column keys). The agent code instead assumes a fictitious Python SDK (PowerPlatform.Dataverse.client import), using DataverseClient and client.records.create [L7-L18]. There is no evidence of any such official Dataverse Python SDK in the skill/plugin or in public docs; the documented pattern is to use a tool call (e.g., create_record), not a custom Python client. 'new_title' and 'new_priority' might be valid field names but should be in a dictionary passed as the 'item' parameter; the method call and SDK context are fabricated. Recommendation: The skill/plugin should explicitly document which SDKs or wrapper libraries exist, including their module names/examples, or clarify that Dataverse APIs are not available in a Python native client, preventing agents from hallucinating SDKs.
  [P1] CortexConfigurations:Common/Skills/correctness.prompty: score=2 < threshold=3
    Reasoni
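
For reference, the `create_record` call shape that the evaluator reasoning describes ('tablename' as a string, 'item' as a JSON object) can be sketched as below. This is only an illustration under stated assumptions: the helper `build_create_record_args` and the `{"tool": ..., "arguments": ...}` envelope are hypothetical, the column values are made up, and nothing here is taken from dataverse_mcp_v2_mock_plugin_config.json itself.

```python
import json


def build_create_record_args(tablename: str, item: dict) -> dict:
    """Hypothetical helper: validate and package the two arguments the
    evaluator says 'create_record' expects."""
    if not isinstance(tablename, str) or not tablename:
        raise ValueError("tablename must be a non-empty string")
    if not isinstance(item, dict) or not item:
        raise ValueError("item must be a non-empty dict of column values")
    return {"tablename": tablename, "item": item}


# Table and column names are the ones quoted in the evaluator output;
# the values are illustrative only.
args = build_create_record_args(
    "new_ticket",
    {"new_title": "Example ticket", "new_priority": 1},
)

# Assumed envelope for a tool invocation; the real wire format may differ.
print(json.dumps({"tool": "create_record", "arguments": args}, indent=2))
```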

Check failure on line 1 in data.data_003_skill_contract

azure-pipelines / DVSkillsPlugin-Evals-PR

data.data_003_skill_contract

1 evaluator(s) failed:
  [P1] CortexConfigurations:Common/Skills/correctness.prompty: score=1 < threshold=3
    Reasoning: Assertion=[Agent's summary mentions bulk/batch SDK methods (CreateMultiple, UpdateMultiple, or UpsertMultiple) OR bulk helpers (bulk_create, bulk_upsert) OR passing a list to client.records.create. If the agent says the skill recommends per-record loops or says there is no batch API, this FAILS.]: The agent quotes and paraphrases patterns from the dv-data skill that explicitly do NOT include methods like CreateMultiple, bulk_create, or passing a list to client.records.create. Instead, it states ("There is no batch API; just call client.records.create() per record.") and demonstrates only per-record loop patterns ([L7-L11], [L47-L52]). The presence of upsert multiple and update multiple methods is noted ([L25-L36]), but for CREATION, the skill teaches per-record HTTP requests, confirming a regressed skill state, not the expected contract for bulk creation.
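
The difference between the regressed per-record pattern and the expected list-based pattern can be sketched with a stand-in client. `StubRecords` below is hypothetical: it only counts calls and is not the SDK's implementation. It assumes, as the assertion does, that `create` accepts either a single record dict or a list of them.

```python
from typing import Union


class StubRecords:
    """Hypothetical stand-in that counts create() calls; NOT the real SDK."""

    def __init__(self) -> None:
        self.calls = 0
        self._next_id = 0

    def create(self, table: str, data: Union[dict, list]) -> Union[str, list]:
        # One call is counted per invocation, regardless of how many rows it carries.
        self.calls += 1
        rows = data if isinstance(data, list) else [data]
        ids = []
        for _ in rows:
            ids.append(f"{table}-{self._next_id}")
            self._next_id += 1
        return ids if isinstance(data, list) else ids[0]


tickets = [{"new_title": f"Ticket {i}"} for i in range(3)]

# Regressed guidance: one create() call per record.
loop_client = StubRecords()
loop_ids = [loop_client.create("new_ticket", t) for t in tickets]

# Expected contract: pass the whole list in a single create() call.
bulk_client = StubRecords()
bulk_ids = bulk_client.create("new_ticket", tickets)

print(loop_client.calls, bulk_client.calls)
```

Both paths produce the same records in this stub; the only difference is the number of calls, which is the behavior the `data_003_skill_contract` assertion is checking the skill text for.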
