feat: improve local, deploy, and invoke. Add temp eval infra for testing

anchenyi · anchenyi · commit 2befd3ee9713 · 2026-06-05T16:31:29.000+08:00
diff --git a/.gitignore b/.gitignore
@@ -359,3 +359,4 @@ dashboard/**/dist/
 
 # Local vally eval outputs
 results/
+tmp/
diff --git a/evals/microsoft-foundry/test-eval.yaml b/evals/microsoft-foundry/test-eval.yaml
@@ -0,0 +1,53 @@
+# Vally eval config for Azure-authenticated Microsoft Foundry E2E checks.
+# The filename is test-eval.yaml so default discovery does not include it in full.
+
+name: microsoft-foundry-e2e-eval
+description: |
+  E2E evaluation for the microsoft-foundry skill in a workflow that has already
+  logged into Azure CLI and configured azd to use Azure CLI authentication.
+
+tags:
+  type: e2e
+  skill: microsoft-foundry
+
+config:
+  runs: 1
+  timeout: "30m"
+  executor: integration-test-agent-runner
+  model: claude-sonnet-4.6
+
+scoring:
+  threshold: 0.8
+
+stimuli:
+  - name: "Create and deploy hosted agent"
+    tags:
+      id: create-and-deploy-hosted-agent
+      type: e2e
+      tier: full
+      cost: llm
+      area: deploy
+    prompt: |
+      Create a Python hosted agent for B2B customer onboarding and deploy it to my existing Foundry project. Use the Responses protocol. After it is done, run it locally to make sure it can run successfully; then deploy it to Foundry and ensure it can respond to users correctly.
+      Always create a new .venv in the working directory and install the agent's dependencies there, instead of installing them globally.
+      Use an agent name with the `foundry-skill-e2e` prefix and add a random suffix.
+      Use these environment variables as the Foundry-related configuration values:
+      Foundry project endpoint: https://foundry-test-0603-resource.services.ai.azure.com/api/projects/foundry-test-0603
+      Foundry Arm resource: /subscriptions/1756abc0-3554-4341-8d6a-46674962ea19/resourceGroups/anchenyi-ai/providers/Microsoft.CognitiveServices/accounts/foundry-test-0603-resource/projects/foundry-test-0603
+      Foundry model deployment name: gpt-5.4-mini
+    graders:
+      - type: skill-invocation
+        config:
+          required:
+            - microsoft-foundry
+      - type: completed
+      - type: prompt
+        config:
+          scoring: binary
+          threshold: 1
+          prompt: |
+            Verify that the coding agent generated hosted-agent code, deployed the agent
+            successfully to Microsoft Foundry, invoked the deployed agent after deployment,
+            and received a successful response from that deployed agent. Fail if
+            code was not generated, deployment did not succeed, the deployed agent was not
+            actually invoked after deployment, or the deployed agent invocation failed.
diff --git a/evals/microsoft-foundry/test-new-project.yaml b/evals/microsoft-foundry/test-new-project.yaml
@@ -0,0 +1,49 @@
+# Vally eval config for Azure-authenticated Microsoft Foundry E2E checks.
+# The filename is test-new-project.yaml so default discovery does not include it in full.
+
+name: microsoft-foundry-e2e-eval
+description: |
+  E2E evaluation for the microsoft-foundry skill in a workflow that has already
+  logged into Azure CLI and configured azd to use Azure CLI authentication.
+
+tags:
+  type: e2e
+  skill: microsoft-foundry
+
+config:
+  runs: 1
+  timeout: "30m"
+  executor: integration-test-agent-runner
+  model: claude-sonnet-4.6
+
+scoring:
+  threshold: 0.8
+
+stimuli:
+  - name: "Create and deploy hosted agent"
+    tags:
+      id: create-and-deploy-hosted-agent
+      type: e2e
+      tier: full
+      cost: llm
+      area: deploy
+    prompt: |
+      Create a Python hosted agent for B2B customer onboarding and deploy it to a new Foundry project. Use the Responses protocol. After it is done, run it locally to make sure it can run successfully; then deploy it to Foundry and ensure it can respond to users correctly.
+      Always create a new .venv in the working directory and install the agent's dependencies there, instead of installing them globally.
+      Use an agent name with the `foundry-skill-e2e` prefix and add a random suffix.
+    graders:
+      - type: skill-invocation
+        config:
+          required:
+            - microsoft-foundry
+      - type: completed
+      - type: prompt
+        config:
+          scoring: binary
+          threshold: 1
+          prompt: |
+            Verify that the coding agent generated hosted-agent code, deployed the agent
+            successfully to Microsoft Foundry, invoked the deployed agent after deployment,
+            and received a successful response from that deployed agent. Fail if
+            code was not generated, deployment did not succeed, the deployed agent was not
+            actually invoked after deployment, or the deployed agent invocation failed.
diff --git a/plugin/skills/microsoft-foundry/foundry-agent/create/references/local-run.md b/plugin/skills/microsoft-foundry/foundry-agent/create/references/local-run.md
@@ -8,6 +8,14 @@ Use this when iterating on a hosted agent before deploying.
 > AZURE_AI_MODEL_DEPLOYMENT_NAME=<model-deployment-name>
 > ```
 > If you already ran `azd provision`, extract these from `azd env get-values`.
+>
+> **Critical: keep `.env` and `azd env` in sync.** `azd ai agent run` injects the active `azd env` values into the agent process before Python loads `.env`. Many samples use `load_dotenv(override=False)`, so an existing process environment value wins over `.env`. If you change the project endpoint or model deployment, update both `.env` and `azd env`:
+> ```bash
+> azd env set FOUNDRY_PROJECT_ENDPOINT "https://<account>.services.ai.azure.com/api/projects/<project>"
+> azd env set AZURE_AI_MODEL_DEPLOYMENT_NAME "<model-deployment-name>"
+> azd env get-values
+> ```
+> A stale `AZURE_AI_MODEL_DEPLOYMENT_NAME` in `azd env` can make local run call the wrong deployment even when `.env` is correct, commonly surfacing as a Foundry responses API `404 Not Found`.
 
 ## Start the agent locally
 
@@ -19,7 +27,7 @@ What this does:
 
 1. Resolves the agent service from `azure.yaml` (auto-picks when only one exists).
 2. Detects the project type (Python, .NET, Node.js) from files in the service source dir.
-3. Installs dependencies if needed (`uv pip install -e .`, `npm install`, `dotnet restore`).
+3. Installs dependencies if needed (`pip install -r requirements.txt`, `npm install`, `dotnet restore`).
 4. Starts the agent in the foreground on `localhost:8088` (default).
 5. Opens **Agent Inspector** in your browser (unless `--no-inspector`).
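
The `.env` versus `azd env` precedence described in the note above can be illustrated with a small sketch. It does not call python-dotenv; `apply_dotenv` is a hypothetical stand-in that mimics how `load_dotenv(override=False)` treats variables already present in the process environment, and the deployment names are made-up values.

```python
from typing import Dict, Mapping


def apply_dotenv(values: Mapping[str, str], env: Dict[str, str], override: bool = False) -> None:
    """Copy .env values into env, mimicking load_dotenv's override flag."""
    for key, value in values.items():
        # With override=False, a value already in the environment wins.
        if override or key not in env:
            env[key] = value


# Hypothetical state: azd injected a stale value before .env was loaded.
process_env = {"AZURE_AI_MODEL_DEPLOYMENT_NAME": "old-deployment"}
dotenv_values = {"AZURE_AI_MODEL_DEPLOYMENT_NAME": "new-deployment"}

apply_dotenv(dotenv_values, process_env, override=False)
effective = process_env["AZURE_AI_MODEL_DEPLOYMENT_NAME"]
print(effective)  # prints "old-deployment"
```

Under these assumptions the stale injected value survives, which is why the note asks for both `.env` and `azd env` to be updated together.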
 
diff --git a/plugin/skills/microsoft-foundry/foundry-agent/deploy/deploy.md b/plugin/skills/microsoft-foundry/foundry-agent/deploy/deploy.md
@@ -164,7 +164,7 @@ azd deploy --no-prompt
 
 Each env has its own `AGENT_<SVC>_*` vars.
 
-## Common failure modes -- Hosted
+## Common failures -- Hosted
 
 | Error | Fix |
 |-------|-----|
@@ -176,6 +176,7 @@ Each env has its own `AGENT_<SVC>_*` vars.
 | `container registry endpoint not found` | ACR not configured. Use `azd env set AZURE_CONTAINER_REGISTRY_ENDPOINT <url>`, or switch to direct code deployment. |
 | Agent version poll times out | Image still building; retry `azd ai agent show` after a minute. |
 | `session_not_ready` (424) | Cold start — wait 30-60 seconds and retry. For direct code deployments, first invocation installs dependencies. Use `1` CPU / `2Gi` memory minimum. **Also verify:** (1) a capability host exists on the Foundry account (see Step 2 above), (2) the agent's managed identity has `Cognitive Services User` role on the Foundry account — missing capability host or role are the most common causes of persistent `session_not_ready`. See [direct-code-deployment Task 8-9](references/direct-code-deployment.md) and [invoke](../invoke/invoke.md). |
+| `could not resolve agent service in azd project: no azure.ai.agent service named '<agentName>' found in azure.yaml` from `azd ai agent invoke` | Name mismatch: the agent name passed to `azd ai agent invoke` does not match the deployed agent. Update it to the deployed agent name. |
 | `subscription quota exceeded` | Ask user to request quota; don't auto-retry. |
 | Bicep deploy errors | Forward `error.details[]` verbatim to the user. |
 | `RoleAssignmentUpdateNotPermitted` during provision | A role assignment already exists but conflicts. Check for existing role assignments with `az role assignment list --scope <resource-scope>`. The provision may have succeeded for all resources except RBAC — verify with `azd ai project show` and manually assign the `Cognitive Services User` role to the agent identity if needed. |
diff --git a/plugin/skills/microsoft-foundry/foundry-agent/deploy/references/direct-code-deployment.md b/plugin/skills/microsoft-foundry/foundry-agent/deploy/references/direct-code-deployment.md
@@ -315,6 +315,8 @@ Do not send `x-ms-agent-name` on `POST /agents/<agent-name>/versions` or `POST /
 
 Update agent and create version are idempotent on zip SHA-256 plus agent definition. If both are unchanged from the latest version, the service can return the existing version instead of creating a duplicate. To force a new version, change the zip contents or definition.
 
+If the write response contains `versions.latest`, use `versions.latest.version`, `versions.latest.status`, and `versions.latest.instance_identity.principal_id`. Do not poll `/versions/None`; if no version can be extracted, list versions first and pick the newest returned version.
+
 Other useful REST operations:
 
 | Purpose | Method and endpoint | Notes |
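
The guidance above on `versions.latest` could be handled along these lines. This is a minimal sketch: the function name `extract_latest_version` is hypothetical, and the only response fields assumed are the three named in the paragraph (`versions.latest.version`, `versions.latest.status`, `versions.latest.instance_identity.principal_id`).

```python
from typing import Any, Mapping, Optional


def extract_latest_version(response: Mapping[str, Any]) -> Optional[dict]:
    """Return version, status, and principal_id from versions.latest, or None."""
    latest = (response.get("versions") or {}).get("latest") or {}
    version = latest.get("version")
    # Guard against a missing value being formatted into a URL as "None".
    if version is None or str(version) in ("", "None"):
        return None
    identity = latest.get("instance_identity") or {}
    return {
        "version": str(version),
        "status": latest.get("status"),
        "principal_id": identity.get("principal_id"),
    }
```

When this returns `None`, the caller would list versions and pick the newest one rather than building a `/versions/None` URL.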
diff --git a/plugin/skills/microsoft-foundry/foundry-agent/invoke/invoke.md b/plugin/skills/microsoft-foundry/foundry-agent/invoke/invoke.md
@@ -120,6 +120,7 @@ Use `session_delete` to release compute resources when done. Undeleted sessions
 | Hosted agent not active | Version still provisioning or failed | Check version status via `agent_get` |
 | Session not found | Invalid ID or expired | Create new session with `session_create` |
 | `424 FailedDependency` or `session_not_ready` | Hosted agent session is still warming up or readiness has not completed | Wait 15-30 seconds, check `session_logstream` if needed, then retry `agent_invoke` with the same `sessionId` if one was returned; if no `sessionId` was returned, retry `session_create`. If this persists across 3+ retries (with exponential backoff: 15s, 30s, 60s), the container likely cannot start within the readiness probe deadline — redeploy with higher CPU/memory (recommended minimum: `1` CPU / `2Gi` for direct-code deployments). Also verify the model deployment name is correct via `model_deployment_get`. |
+| `could not resolve agent service in azd project: no azure.ai.agent service named '<agentName>' found in azure.yaml` from `azd ai agent invoke` | The agent name passed to `azd ai agent invoke` does not match the deployed agent | Update the agent name to the deployed agent name. |
 | Invocation failed | Model error, timeout, or invalid input | Check agent logs, verify model deployment |
 | Invocations schema mismatch | Request body does not match what the agent expects | Inspect agent's route handler or API docs for the correct JSON schema; do not guess |
 | File operation failed | Session not active or invalid path | Verify session with `session_get` |
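
The retry schedule in the `session_not_ready` row (15s, 30s, 60s) can be sketched as a small helper. `SessionNotReady` and `invoke_with_backoff` are hypothetical names, and `invoke` stands for whatever call the caller uses to reach the agent; none of them come from a real SDK.

```python
import time
from typing import Callable, Sequence


class SessionNotReady(Exception):
    """Hypothetical error standing in for a 424 / session_not_ready response."""


def invoke_with_backoff(
    invoke: Callable[[], str],
    delays: Sequence[float] = (15, 30, 60),
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call invoke, waiting 15s, 30s, then 60s between retries."""
    for delay in delays:
        try:
            return invoke()
        except SessionNotReady:
            sleep(delay)
    # Final attempt after the last wait; a failure here propagates so the
    # caller can fall back to redeploying with more CPU/memory.
    return invoke()
```

The `sleep` parameter is injectable so the schedule can be checked without real waiting.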
diff --git a/tests/utils/agent-runner.ts b/tests/utils/agent-runner.ts
diff --git a/tests/vally/vally-executor.ts b/tests/vally/vally-executor.ts
