DEMO: intentional skill regression for eval pipeline validation (DO NOT MERGE) #61
Azure Pipelines / DVSkillsPlugin-Evals-PR
failed
May 16, 2026 in 6m 24s
Build #20260516.6 • dry-run • plugin-main • dv_data.biceval.json had test failures
Details
- Failed: 2 (66.67%)
- Passed: 1 (33.33%)
- Other: 0 (0.00%)
- Total: 3
Annotations
Check failure on line 488 in Build log
azure-pipelines / DVSkillsPlugin-Evals-PR
Build log #L488
Test pass rate 33.3% is below 100% threshold. Failing pipeline.
Check failure on line 490 in Build log
azure-pipelines / DVSkillsPlugin-Evals-PR
Build log #L490
Script failed with exit code: 1
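The two annotations above (pass rate below threshold, then exit code 1) can be reproduced with a small gate. This is a minimal sketch, not the pipeline's actual script: the function names `pass_rate` and `gate` are hypothetical, and only the counts (1 passed of 3) and the 100% threshold come from the build summary.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


def gate(passed: int, total: int, threshold: float = 100.0) -> int:
    """Return a process exit code: 1 if the pass rate is below the threshold, else 0."""
    rate = pass_rate(passed, total)
    if rate < threshold:
        print(f"Test pass rate {rate:.1f}% is below {threshold:.0f}% threshold. Failing pipeline.")
        return 1
    return 0


# Counts taken from the Details block: 1 passed out of 3 total.
exit_code = gate(passed=1, total=3)
```

A calling script would pass `exit_code` to `sys.exit`, which is what would surface as "Script failed with exit code: 1".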
Check failure on line 1 in data.data_001
azure-pipelines / DVSkillsPlugin-Evals-PR
data.data_001
4 evaluator(s) failed:
[P1] CortexConfigurations:Common/Skills/correctness.prompty: score=2 < threshold=3
Reasoning: Assertion=[Agent uses the official Python SDK for the create. Hand-rolled urllib/requests POST is NOT acceptable.]: The documented skill pattern for creating a single record ([L106-L123] in dataverse_mcp_v2_mock_plugin_config.json) prescribes using a tool named 'create_record', which requires two arguments: 'tablename' (string) and 'item' (JSON object). The agent's script uses DataverseClient and calls client.records.create('new_ticket', record) [L11-L20], which is not a documented method or API in the skill/plugin schema. Furthermore, the import 'from PowerPlatform.Dataverse.client import DataverseClient' [L5] is not verified as a real package/module in the official Dataverse Python SDK (usually pyPowerApps or dynamics-client, but neither matches this import), indicating a potentially fabricated API. This violates skill compliance and the assertion. The recommended approach per the skill/plugin is to invoke the tool 'create_record' via the documented interface, not a hand-rolled or non-existent SDK. Recommendation: Plugin should document the correct, real Python SDK (if one exists), or clarify that only the tool/invocation pattern is supported, to prevent fabrications and confusion.
[P1] CortexConfigurations:Common/Skills/correctness.prompty: score=2 < threshold=3
Reasoning: Assertion=[Agent calls client.records.create with the table name and a record dict.]: The agent's script [L1-L21] does NOT use the documented dv-data skill/plugin pattern for creating records. The documented pattern (see LocalAgentConfig/dataverse_mcp_v2_mock_plugin_config.json [L106-L123]) states the expected tool/method is create_record, accepting 'tablename' and 'item' (item is a JSON object with property values using correct column names). The agent instead uses fabricated APIs: from PowerPlatform.Dataverse.client import DataverseClient and client.records.create(), which do not correspond to any real module, documented SDK, or the plugin skill/tooling ([L6-L7], [L17]). Further, the column names 'new_title' and 'new_priority' are used without evidence these are correct logical column names; Dataverse custom tables typically have a prefix, e.g., crb69_title. The skill expects full logical names found via the describe_table tool.
Recommendation: The dv-data skill/plugin should document the correct Python package/SDK (if one exists), including example import/module usage, correct method signature, and logical column naming conventions in custom tables. If there is no real Dataverse Python SDK, the skill should clarify only HTTP or tool-based usage is supported.
[P1] CortexConfigurations:Common/Skills/correctness.prompty: score=2 < threshold=3
Reasoning: Assertion=[Agent's executable code is Python (no JavaScript/TypeScript/Node.js or PowerShell).]: The skill/plugin dataverse_mcp_v2_mock_plugin_config.json [L106-L123] documents the 'create_record' tool for Dataverse row creation: expected args are 'tablename' (logical name) and 'item' (JSON object with column keys). The agent code instead assumes a fictitious Python SDK (PowerPlatform.Dataverse.client import), using DataverseClient and client.records.create [L7-L18]. There is no evidence of any such official Dataverse Python SDK in the skill/plugin or in public docs; the documented pattern is to use a tool call (e.g., create_record), not a custom Python client. 'new_title' and 'new_priority' might be valid field names but should be in a dictionary passed as the 'item' parameter; the method call and SDK context are fabricated. Recommendation: The skill/plugin should explicitly document which SDKs or wrapper libraries exist, including their module names/examples, or clarify that Dataverse APIs are not available in a Python native client, preventing agents from hallucinating SDKs.
[P1] CortexConfigurations:Common/Skills/correctness.prompty: score=2 < threshold=3
Reasoni
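The evaluator reasoning for data.data_001 describes the `create_record` tool as taking two arguments, `tablename` (string) and `item` (JSON object keyed by column name). A rough sketch of that argument shape follows. The helper `build_create_record_args` is hypothetical, the field values are made up for illustration, and how the tool is actually invoked is not shown in this log, so only the two argument names are taken from the source.

```python
import json
from typing import Any


def build_create_record_args(tablename: str, item: dict[str, Any]) -> dict[str, Any]:
    """Build the argument object for a 'create_record' tool call (assumed shape)."""
    if not isinstance(tablename, str) or not tablename:
        raise TypeError("tablename must be a non-empty string")
    if not isinstance(item, dict):
        raise TypeError("item must be a dict of column name -> value")
    json.dumps(item)  # raises TypeError if a value is not JSON-serializable
    return {"tablename": tablename, "item": item}


# Table and column names are the ones quoted in the annotation; values are illustrative.
args = build_create_record_args(
    "new_ticket",
    {"new_title": "Printer offline", "new_priority": 2},
)
print(json.dumps(args))
```

As the reasoning notes, the column names here are unverified; the skill expects logical names to be confirmed with the describe_table tool.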
Check failure on line 1 in data.data_003_skill_contract
azure-pipelines / DVSkillsPlugin-Evals-PR
data.data_003_skill_contract
1 evaluator(s) failed:
[P1] CortexConfigurations:Common/Skills/correctness.prompty: score=1 < threshold=3
Reasoning: Assertion=[Agent's summary mentions bulk/batch SDK methods (CreateMultiple, UpdateMultiple, or UpsertMultiple) OR bulk helpers (bulk_create, bulk_upsert) OR passing a list to client.records.create. If the agent says the skill recommends per-record loops or says there is no batch API, this FAILS.]: The agent quotes and paraphrases patterns from the dv-data skill that explicitly do NOT include methods like CreateMultiple, bulk_create, or passing a list to client.records.create. Instead, it states ("There is no batch API; just call client.records.create() per record.") and demonstrates only per-record loop patterns ([L7-L11], [L47-L52]). The presence of upsert multiple and update multiple methods is noted ([L25-L36]), but for CREATION, the skill teaches per-record HTTP requests, confirming a regressed skill state, not the expected contract for bulk creation.
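Each evaluator failure in these annotations has the form `[P1] <evaluator>: score=N < threshold=M`. As an aid to triage, the sketch below pulls those fields out of one such line. The regular expression and the `parse_failure` helper are assumptions based only on the lines visible in this log, not a documented output format.

```python
import re
from typing import Optional

# Pattern inferred from the annotation lines in this log (an assumption, not a spec).
FAILURE_LINE = re.compile(
    r"\[(?P<priority>P\d+)\]\s+(?P<evaluator>\S+):\s+"
    r"score=(?P<score>\d+)\s+<\s+threshold=(?P<threshold>\d+)"
)


def parse_failure(line: str) -> Optional[dict]:
    """Return priority, evaluator, score, and threshold, or None if the line does not match."""
    match = FAILURE_LINE.search(line)
    if match is None:
        return None
    return {
        "priority": match.group("priority"),
        "evaluator": match.group("evaluator"),
        "score": int(match.group("score")),
        "threshold": int(match.group("threshold")),
    }


sample = "[P1] CortexConfigurations:Common/Skills/correctness.prompty: score=1 < threshold=3"
failure = parse_failure(sample)
print(failure)
```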