diff --git a/skills/locale-validation/README.md b/skills/locale-validation/README.md new file mode 100644 index 0000000..807e0ed --- /dev/null +++ b/skills/locale-validation/README.md @@ -0,0 +1,348 @@ +# locale-validation + +A skill for validating Agentforce agent responses across multiple locales. Given any agent script (`.agent` file) or `genAiPluginMetadata`, it derives realistic test utterances per topic, translates them into the target languages, runs them against the agent in preview or batch mode, and validates that responses are in the correct language. + +## Supported languages + +All 23 Agentforce-supported locales. Source: [Agentforce Employee Agent Considerations](https://help.salesforce.com/s/articleView?id=ai.agent_employee_agent_considerations.htm&type=5) — Salesforce updates language support monthly, so check that page for new additions. + +| Code | Language | +|------|----------| +| `ar` | Arabic | +| `zh_CN` | Chinese (Simplified) | +| `zh_TW` | Chinese (Traditional) | +| `da` | Danish | +| `nl` | Dutch | +| `fi` | Finnish | +| `fr` | French | +| `de` | German | +| `in` | Indonesian | +| `it` | Italian | +| `ja` | Japanese | +| `ko` | Korean | +| `ms` | Malay | +| `no` | Norwegian | +| `pl` | Polish | +| `pt_BR` | Portuguese (Brazil) | +| `pt_PT` | Portuguese (European) | +| `ru` | Russian | +| `es` | Spanish | +| `es_MX` | Spanish (Mexico) | +| `sv` | Swedish | +| `th` | Thai | +| `tr` | Turkish | + +**Default set** (used when you say "use defaults"): `ja fr it de es es_MX pt_BR` + +--- + +## Files + +``` +locale-validation/ +├── SKILL.md # Skill definition & 5-phase workflow +├── README.md # This file +├── references/ +│ ├── adlc-mode.md # sf agent preview + sf agent test execution guide +│ └── fit-tests-mode.md # Maven/JUnit utterances.json + test class guide +└── scripts/ + └── validate_locale_responses.py # Python validator for batch test results +``` + +--- + +## How to invoke + +This skill is automatically triggered when Claude detects locale/language testing intent. You can also invoke it explicitly. + +### Natural language triggers (automatic) + +``` +"Run locale validation on MySDRAgent" +"Test MyAgent in Japanese, French, and German" +"Generate multilingual test cases for EngagementAgent" +"Create a batch locale test suite for MyAgent" +"Validate that MyAgent responds in Spanish and Arabic" +"Check if additional_locales are working in MyAgent" +``` + +If your `.agent` file has `additional_locales` set and you ask about testing or quality, the skill also activates proactively. + +### Explicit invocation + +In any Claude Code session with `agentforce-adlc` loaded: + +``` +/locale-validation MyAgent.agent +``` + +Or with arguments: + +``` +/locale-validation force-app/main/default/aiAuthoringBundles/MyAgent/MyAgent.agent --locales ja fr de ko --mode preview +``` + +--- + +## Workflow + +The skill runs five phases (plus two automatic checks after Phase 1). It pauses at Phase 1b if a locale patch is needed, and again after Phase 2 for utterance review. + +| Phase | What happens | +|-------|-------------| +| 1. Introspect | Reads `.agent` or `genAiPluginMetadata`, extracts topics + actions | +| **1b. Check `additional_locales`** | Reads the `language:` block. If `additional_locales` is missing or empty, **presents the full 23-language list and asks you to pick locales**, then patches the `.agent` file before continuing | +| **1c. Check language-response instruction** | Searches `system.instructions` for the language-response rule. 
If missing, **patches it automatically** (no confirmation needed) so the agent responds in the user's language | +| 2. Derive utterances | Generates 2–3 English utterances per topic, **shows you for review** | +| 3. Translate | Translates each utterance into all target locales using Claude inline (no external API call) | +| 4. Run tests | Executes via `sf agent preview` (Mode A) or `sf agent test` (Mode B). **In Mode B, always writes both a `testSpec.yaml` and a companion `-input.csv`** for manual Testing Center UI upload | +| 5. Validate & report | Checks responses for correct language, reports ✅/❌ per locale per topic | + +### Phase 1b — locale picker + +When the skill detects a missing or empty `additional_locales`, it presents the full Agentforce language table and asks: + +> "Reply with the codes you want (e.g. `ja fr de ko`) or say **"use defaults"** to add the standard seven: `ja, fr, it, de, es, es_MX, pt_BR`." + +It then writes the confirmed locales into the `language:` block using the required format — a **quoted comma-separated string with no spaces** and 4-space indentation: + +``` +language: + default_locale: "en_US" + additional_locales: "ja,fr,de" +``` + +The patched locales become the working locale set for the rest of the workflow (merged with any `--locales` argument you passed). + +--- + +## Execution modes + +### Mode A — Preview (smoke testing) + +Best for iterative development. Runs `sf agent preview` per locale and extracts responses from local trace files. + +``` +"Run locale validation on MyAgent in preview mode" +"Quick locale smoke test for MyAgent" +``` + +Claude reads `references/adlc-mode.md` for the exact `sf agent preview` commands. + +### Mode B — Batch (regression testing) + +Best for CI/CD and regression suites. Generates a `testSpec.yaml` and companion `-input.csv`, then runs via `sf agent test`. + +``` +"Create a batch locale test suite for MyAgent" +"Run locale regression tests for MyAgent in batch mode" +``` + +### fit-tests mode + +When working in the `einstein-copilot-fit-tests` Maven project: + +``` +"Generate multilingual eval data for EngagementAgent" +``` + +--- + +## Validation logic + +The validator flags two severity levels: + +| Severity | Condition | Example | +|----------|-----------|---------| +| **CRITICAL** | Response is in English when target locale ≠ `en_US` | `ja` utterance → English response | +| **Warning** | No locale-specific characters detected in a Latin-script response | `fr` utterance → no accented characters in a long response | + +Script detection method per locale: + +| Script type | Locales | Detection method | +|-------------|---------|-----------------| +| Unicode range | `ja`, `zh_CN`, `zh_TW`, `ar`, `ko`, `th` | Character range match — CRITICAL if absent | +| Cyrillic range | `ru` | Cyrillic range (U+0400–U+04FF) — CRITICAL if absent | +| Diacritics | `fr`, `fr_CA`, `de`, `es`, `es_MX`, `it`, `pt_BR`, `pt_PT`, `nl`, `da`, `sv`, `no`, `fi`, `pl`, `tr` | Locale-specific accent characters — Warning if absent | +| Latin-only | `ms`, `in` | No diacritic check (these languages use unaccented Latin) — LLM-as-judge only | +| Skip | `en_US`, `en_GB` | Always passes | + +--- + +## Using the Python validator directly + +After a batch run, validate a results JSON file without Claude. + +### Input JSON format + +The script auto-detects two input formats. 
+ +**Format 1 — Raw `sf agent test results` output** (detected automatically): + +```bash +sf agent test results --json --job-id --result-format json -o \ + | tee /tmp/results.json +``` + +The script reads `inputs.utterance`, `generatedData.outcome`, and `generatedData.topic` from each test case. Because the raw format has no `locale` field, pass `--spec` (recommended) or `--locales` to assign locales. + +**Format 2 — Custom intermediate format** (used by fit-tests and Claude-generated results): + +```json +{ + "result": { + "testCases": [ + { + "testCaseName": "test_ja_web_reply", + "locale": "ja", + "utterance": "製品について教えてください", + "botResponse": "Weloは...", + "status": "pass", + "topic": "web_reply" + } + ] + } +} +``` + +Required fields: `locale`, `botResponse` (or `response` as fallback). Optional: `testCaseName`, `utterance`, `status`, `topic`. + +### Options reference + +| Option | Default | Description | +|--------|---------|-------------| +| `--results ` | *(required)* | Path to the JSON results file | +| `--spec ` | *(optional)* | Path to testSpec YAML — enables exact per-utterance locale assignment and preserves spec row order in the report | +| `--locales ` | `ja fr it de es es_MX pt_BR` | Space-separated locale codes to validate | +| `--agent-name ` | `Agent` | Agent name shown in the report header | +| `--output ` | stdout | Write markdown report to a file | +| `--llm-validate` | off | Enable LLM-as-judge on top of heuristic checks | +| `--llm-provider ` | `anthropic` | `anthropic` (default) or `openai` | +| `--llm-api-key ` | env var → interactive prompt | API key. Reads `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` by default | +| `--llm-model ` | `claude-haiku-4-5` / `gpt-4o` | Model name (provider-specific default applied automatically) | +| `--llm-endpoint ` | `https://api.openai.com/v1/chat/completions` | OpenAI-compatible URL (only used with `--llm-provider openai`) | +| `--llm-call-delay ` | `1.0` | Pause between LLM calls — increase to 3–5 for low-rate-limit keys | + +### Heuristic-only (fast, no API cost) + +```bash +python3 skills/locale-validation/scripts/validate_locale_responses.py \ + --results /tmp/locale-test-results.json \ + --locales ja fr it de es es_MX pt_BR \ + --agent-name MyAgent \ + --output /tmp/locale-validation-report.md +``` + +### LLM-as-judge via Claude (default) + +```bash +# Key is read from ANTHROPIC_API_KEY; if unset, the script prompts at runtime +python3 skills/locale-validation/scripts/validate_locale_responses.py \ + --results /tmp/locale-test-results.json \ + --locales ja fr de ko ar \ + --agent-name MyAgent \ + --llm-validate \ + --output /tmp/locale-validation-report.md +``` + +### LLM-as-judge via OpenAI (opt-in) + +```bash +python3 skills/locale-validation/scripts/validate_locale_responses.py \ + --results /tmp/locale-test-results.json \ + --locales ja fr de \ + --agent-name MyAgent \ + --llm-validate \ + --llm-provider openai \ + --llm-api-key "$OPENAI_API_KEY" \ + --llm-model gpt-4o \ + --output /tmp/locale-validation-report.md +``` + +### Azure OpenAI endpoint + +```bash +python3 skills/locale-validation/scripts/validate_locale_responses.py \ + --results /tmp/locale-test-results.json \ + --locales ja fr \ + --llm-validate \ + --llm-provider openai \ + --llm-endpoint "https://.openai.azure.com/openai/deployments//chat/completions?api-version=2024-02-01" \ + --llm-api-key "$AZURE_OPENAI_KEY" \ + --llm-model gpt-4o +``` + +### Exit codes + +| Code | Meaning | +|------|---------| +| `0` | All validations passed | +| `1` | One or more 
**critical** failures (English response in non-English locale) — use as a CI gate | + +--- + +## Validating the skill itself + +### Quick sanity check + +1. Open a Claude Code session in the `agentforce-adlc` directory +2. Verify the skill is loaded: + ``` + What skills do you have available? + ``` + You should see `locale-validation` in the list. + +3. Trigger it with a test prompt: + ``` + I want to test MyAgent in Japanese, French, and Korean. + ``` + Claude should start the 5-phase workflow at Phase 1 (introspect). + +4. Confirm it does NOT activate for unrelated prompts: + ``` + Deploy my agent to production. + ``` + Claude should use `developing-agentforce` instead. + +### Functional validation with an agent file + +```bash +cd /path/to/agentforce-adlc +claude + +# Ask Claude to run locale validation on the example agent +"Run locale validation on force-app/main/default/aiAuthoringBundles/MS_Agent_hp_Apr22_adlc/MS_Agent_hp_Apr22_adlc.agent — just derive and show me the utterances, don't run them yet." +``` + +Expected: Claude reads the `.agent` file, lists topics, proposes 2–3 English utterances per topic, and waits for your review before translating or running anything. + +### Validate the Python script + +```bash +python3 -c " +import json +mock = {'result': {'testCases': [ + {'testCaseName': 'test_ja', 'locale': 'ja', 'utterance': 'check order', 'botResponse': 'Your order is ready.', 'status': 'pass'}, + {'testCaseName': 'test_fr', 'locale': 'fr', 'utterance': 'check order', 'botResponse': 'Votre commande est prête.', 'status': 'pass'}, + {'testCaseName': 'test_ar', 'locale': 'ar', 'utterance': 'check order', 'botResponse': 'Your order is ready.', 'status': 'pass'}, +]}} +print(json.dumps(mock)) +" > /tmp/mock-results.json + +python3 skills/locale-validation/scripts/validate_locale_responses.py \ + --results /tmp/mock-results.json \ + --locales ja fr ar \ + --agent-name TestAgent +``` + +Expected: `test_ja` and `test_ar` flagged as CRITICAL (English response for non-English locale), `test_fr` passes. + +--- + +## Related skills + +| Skill | When to use instead | +|-------|-------------------| +| `testing-agentforce` | General agent testing without locale focus | +| `developing-agentforce` | Authoring/editing `.agent` files | +| `observing-agentforce` | Analyzing production session traces for locale failures | diff --git a/skills/locale-validation/SKILL.md b/skills/locale-validation/SKILL.md new file mode 100644 index 0000000..e20adee --- /dev/null +++ b/skills/locale-validation/SKILL.md @@ -0,0 +1,445 @@ +--- +name: locale-validation +description: "Validate Agentforce agent responses across multiple locales by reading the agent script or genAiPluginMetadata, deriving test utterances per topic, and running them in all target languages. Automatically checks whether the agent declares additional_locales — if missing or empty, presents the full Agentforce supported language list and asks the user to pick locales, then patches the .agent file before proceeding. 
TRIGGER when: user asks to test an agent in multiple languages or locales; mentions 'locale validation', 'multilingual testing', 'language support', 'additional_locales', or 'localization'; wants to verify agent responds correctly in Japanese, French, Italian, German, Spanish, Portuguese, Arabic, Chinese, Korean, Dutch, Russian, Turkish, or any other supported language; asks to generate locale test cases from an agent script; mentions any locale code such as 'ja', 'fr', 'it', 'de', 'es', 'es_MX', 'pt_BR', 'ar', 'zh_CN', 'ko', 'nl', 'ru', 'tr', etc. Also trigger proactively when you notice a .agent file has a non-empty additional_locales field or all_additional_locales: True and the user asks about testing or quality." +allowed-tools: Bash Read Write Edit Glob Grep +license: Apache-2.0 +metadata: + version: "0.6.0" + last_updated: "2026-05-01" + argument-hint: " [--locales ja,fr,de,...] [--mode preview|batch]" + compatibility: claude-code, agentforce-adlc, einstein-copilot-fit-tests +--- + +# Locale Validation for Agentforce Agents + +Derive locale-specific test cases from any agent script or genAiPluginMetadata, then run them in preview or batch mode to verify the agent responds in the correct language. + +## Agentforce supported languages + +All locales available for `additional_locales`. Source: [Agentforce Employee Agent Considerations](https://help.salesforce.com/s/articleView?id=ai.agent_employee_agent_considerations.htm&type=5) — check that page for the latest additions, as Salesforce updates language support monthly. + +| Code | Language | +|------|----------| +| `ar` | Arabic | +| `zh_CN` | Chinese (Simplified) | +| `zh_TW` | Chinese (Traditional) | +| `da` | Danish | +| `nl` | Dutch | +| `fi` | Finnish | +| `fr` | French | +| `de` | German | +| `in` | Indonesian | +| `it` | Italian | +| `ja` | Japanese | +| `ko` | Korean | +| `ms` | Malay | +| `no` | Norwegian | +| `pl` | Polish | +| `pt_BR` | Portuguese (Brazil) | +| `pt_PT` | Portuguese (European) | +| `ru` | Russian | +| `es` | Spanish | +| `es_MX` | Spanish (Mexico) | +| `sv` | Swedish | +| `th` | Thai | +| `tr` | Turkish | + +## Default locale set (when user says "use defaults") + +| Code | Language | +|------|----------| +| `ja` | Japanese | +| `fr` | French | +| `it` | Italian | +| `de` | German | +| `es` | Spanish | +| `es_MX` | Spanish (Mexico) | +| `pt_BR` | Portuguese (Brazil) | + +Override with `--locales` if the user specifies a subset or adds others. + +## Workflow overview + +``` +1. Introspect agent → 1b. Check & patch additional_locales → 1c. Check & patch language-response instruction → 2. Derive utterances → 3. Translate → 4. Run tests → 5. Validate & report +``` + +Work through these phases in order. Pause after Phase 1b if a locale patch is needed. Phase 1c is fully automatic (no confirmation). Pause after Phase 2 for utterance review. + +--- + +## Phase 1 — Introspect the agent + +Accept: a path to a `.agent` file, a `genAiPluginMetadata` directory, or just an agent name (search for it). + +**Find the agent file:** +```bash +# By path +cat path/to/MyAgent.agent + +# By name in force-app +find . -name "*.agent" | xargs grep -l "MyAgent" 2>/dev/null + +# genAiPluginMetadata (Salesforce metadata XML) +find . 
-path "*/genAiPlugins/*.genAiPlugin-meta.xml" | head -5 +``` + +**Extract from `.agent`:** +- `config.developer_name` — agent API name +- `language.default_locale` and `language.additional_locales` — extend the default locale set with any locales already declared +- Every `topic ` block — collect the topic name and `description` +- Every `action` inside topics — collect action name and `description` +- Any `system.instructions` welcome/greeting text — use as inspiration for entry utterances + +**Extract from `genAiPluginMetadata`:** +```bash +# List topics/actions from XML +grep -E "||" path/to/plugin.genAiPlugin-meta.xml +``` + +Build a table of topics and actions with their descriptions — this is the source for utterance derivation. + +--- + +## Phase 1b — Check and patch `additional_locales` + +**Always run this step immediately after reading the agent file, before deriving utterances.** + +### Check + +Look for the `language:` block in the `.agent` file. Three possible states: + +| State | Condition | Action | +|---|---|---| +| **Present and populated** | `additional_locales` exists and is non-empty | Extract the declared locales, merge with `--locales` argument, proceed to Phase 2 | +| **Present but empty** | `additional_locales: ""` or `additional_locales:` with no value | Treat as missing — go to Ask step | +| **Missing** | No `language:` block, or block exists but has no `additional_locales` line | Go to Ask step | + +### Ask + +When `additional_locales` is missing or empty, stop and ask the user: + +> "The agent script does not declare any `additional_locales`. Which locales should I add? +> +> Agentforce supported languages: +> +> | Code | Language | +> |------|----------| +> | `ar` | Arabic | +> | `zh_CN` | Chinese (Simplified) | +> | `zh_TW` | Chinese (Traditional) | +> | `da` | Danish | +> | `nl` | Dutch | +> | `fi` | Finnish | +> | `fr` | French | +> | `de` | German | +> | `in` | Indonesian | +> | `it` | Italian | +> | `ja` | Japanese | +> | `ko` | Korean | +> | `ms` | Malay | +> | `no` | Norwegian | +> | `pl` | Polish | +> | `pt_BR` | Portuguese (Brazil) | +> | `pt_PT` | Portuguese (European) | +> | `ru` | Russian | +> | `es` | Spanish | +> | `es_MX` | Spanish (Mexico) | +> | `sv` | Swedish | +> | `th` | Thai | +> | `tr` | Turkish | +> +> Reply with the codes you want (e.g. `ja fr de ko`) or say **"use defaults"** to add the standard seven: `ja, fr, it, de, es, es_MX, pt_BR`." + +Wait for the user's answer before proceeding. + +### Patch + +Once the user confirms the locale list, update the `.agent` file using the Edit tool. + +**If `language:` block exists but `additional_locales` is missing or empty**, add/replace the line: + +``` +language: + default_locale: "en_US" + additional_locales: "ja,fr,de" ← replace with user's choices, comma-separated, no spaces +``` + +**If no `language:` block exists at all**, insert one after the `config:` block: + +``` +language: + default_locale: "en_US" + additional_locales: "ja,fr,de" +``` + +**Format rules (required for Agent Script compiler):** +- `additional_locales` value is a **quoted comma-separated string** with no spaces: `"ja,fr,de"` not `"ja, fr, de"` +- Indentation is **4 spaces** (tabs break the compiler) +- `default_locale` must always be present in the `language:` block + +**After patching**, show the user the diff and confirm: + +> "I've added `additional_locales: \"ja,fr,de\"` to the `language:` block. Continuing to Phase 2." 
+ +Use the confirmed locale list (union of the patched `additional_locales` and any `--locales` argument) as the working locale set for the rest of the workflow. + +--- + +## Phase 1c — Check and patch language-response instruction + +**Always run this step immediately after Phase 1b, before deriving utterances.** + +### Check + +Search `system: instructions:` in the `.agent` file for the presence of the language-response instruction. The canonical marker to search for is: + +``` +Always respond in the same language the user writes in +``` + +Use a case-insensitive substring match. Two states: + +| State | Condition | Action | +|---|---|---| +| **Present** | The marker string is found anywhere in `system.instructions` | Nothing to do — proceed to Phase 2 silently | +| **Missing** | Marker string not found | Patch the file automatically (no user confirmation needed) | + +### Patch + +When the instruction is missing, append it as the last bullet inside `system: instructions:`, immediately before the blank line that follows the instruction block. Match the indentation of the surrounding bullet points (4 spaces). + +**Instruction to insert (exact text):** + +``` + - Always respond in the same language the user writes in. If the user writes in Japanese, respond entirely in Japanese. If French, respond entirely in French. Never mix languages in a single response. Use {!@variables.EndUserLanguage} as a locale hint when available. +``` + +**How to locate the insertion point:** + +1. Find the last `- ` bullet line inside `system: instructions:` (before `messages:` or any other top-level key) +2. Insert the new bullet immediately after that line + +**After patching**, tell the user: + +> "Added language-response instruction to `system.instructions` — agent will now respond in the user's language. Continuing." + +### Why this matters + +`additional_locales` only declares platform support. Without an explicit instruction in `system.instructions`, the LLM defaults to English regardless of the utterance language or `EndUserLanguage` session variable. + +--- + +## Phase 2 — Derive English utterances + +For each topic, generate **2–3 realistic English utterances** a real user would send. Draw from the topic and action descriptions — the utterances should be natural questions or instructions that would route to that topic. + +Aim for variety: one direct command ("Book a meeting for tomorrow"), one question ("Can you help me schedule something?"), one edge-case phrasing. + +Present these to the user: +> "Here are the derived test utterances I'll use. Want to add, remove, or adjust any before I proceed?" + +--- + +## Phase 3 — Translate utterances + +For each English utterance, produce a translation in each target locale. Rules (matching `EvalLocaleTestUtil` and `UtteranceTranslationUtil` conventions): + +- Keep **proper nouns in English**: Salesforce object names, field names, company names, person names, API identifiers (e.g., "Opportunity", "Chatter", "Einstein Copilot") +- Preserve the **intent and tone** exactly (approval → approval, command → command) +- Do not add quotes, extra punctuation, or explanations +- For `es` vs `es_MX`: es_MX is Mexican Spanish — use natural regional phrasing where it differs + +Use Claude's own translation capability for this step (no external LLM call required in adlc mode). 
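A convenient shape for the Phase 3 output is the `{locale: {topic: utterance}}` map that the preview loop in `references/adlc-mode.md` reads from `/tmp/locale_utterances.json`. For example — topic key and utterance text illustrative:

```json
{
  "ja": { "order_status": "注文の状況を確認していただけますか?" },
  "fr": { "order_status": "Pouvez-vous vérifier le statut de ma commande ?" }
}
```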
+ +--- + +## Phase 4 — Run tests + +Read the appropriate reference based on execution mode: + +- **Preview / smoke testing (Mode A):** Read `references/adlc-mode.md` → section "Preview mode" +- **Batch / regression testing (Mode B):** Read `references/adlc-mode.md` → section "Batch mode" +- **einstein-copilot-fit-tests (Maven):** Read `references/fit-tests-mode.md` + +The mode is inferred from context: +- User says "preview", "smoke", "quick test", or you're iterating during development → Mode A +- User says "batch", "regression", "CI", or "test suite" → Mode B +- Current working directory is `einstein-copilot-fit-tests` or user mentions Maven/JUnit → fit-tests mode + +### Mode B — companion CSV (always generate alongside YAML) + +**Every time you write a `testSpec.yaml` for Mode B, you must also write a companion `.csv` file at the same path with `-input.csv` replacing `-testSpec.yaml`.** + +The CSV is a Testing Center UI upload template — it lets the user import the same test cases manually via the browser without running CLI commands. + +**CSV format rules:** +- Header row (exactly): `utterance,expectedTopic,expectedActions,expectedOutcome` +- One row per test case, same order as the YAML +- `utterance` — the translated utterance (same value as YAML `utterance:`) +- `expectedTopic` — leave **blank** for locale test cases (runtime topic names hash-drift; outcome assertion is sufficient). Only fill if you have a confirmed stable runtime name. +- `expectedActions` — leave **blank** unless the test case explicitly asserts a specific action invocation +- `expectedOutcome` — same value as YAML `expectedOutcome:` (natural-language LLM-as-judge description) +- Wrap values containing commas or apostrophes in double quotes +- No trailing spaces or BOM + +**Naming convention:** +``` +tests/-locale--testSpec.yaml ← YAML (sf agent test create) +tests/-locale--input.csv ← CSV (Testing Center UI upload) +``` + +**testSpec YAML — always include `locale:` per test case** so the validator can assign the correct locale without inference and preserve spec order in the report: + +```yaml +testCases: + - utterance: "WeloはどのようなITソリューションを提供していますか?" + locale: "ja" + expectedOutcome: "Agent responds in Japanese. ..." + + - utterance: "Quelles solutions IT Welo propose-t-elle ?" + locale: "fr" + expectedOutcome: "Agent responds in French. ..." +``` + +**Validator command — always pass `--spec`** alongside `--results` so locale assignment and row order match the spec exactly: + +```bash +python3 skills/locale-validation/scripts/validate_locale_responses.py \ + --results /tmp/locale-test-results.json \ + --spec tests/-locale--testSpec.yaml \ + --locales ja fr \ + --agent-name \ + --llm-validate \ + --output /tmp/locale-validation-report.md +``` + +**Example CSV output** for a ja+fr locale suite: + +```csv +utterance,expectedTopic,expectedActions,expectedOutcome +WeloはどのようなITソリューションを提供していますか?,,,Agent responds in Japanese. Response describes Welo's IT solutions or data center services. Response does not contain English text. +データセンターサービスについて教えてください,,,Agent responds in Japanese. Response provides information about data center services. Response does not contain English text. +Quelles solutions IT Welo propose-t-elle ?,,,Agent responds in French. Response describes Welo's IT solutions or data center services. Response does not contain English text. +Parlez-moi de vos services pour centres de données,,,Agent responds in French. Response provides information about data center services. 
Response does not contain English text. +``` + +After writing both files, tell the user: + +> "I've written two files: +> - `tests/-testSpec.yaml` — deploy with `sf agent test create --spec ...` +> - `tests/-input.csv` — upload manually via Testing Center UI → Import from CSV" + +--- + +## Phase 5 — Validate and report + +After collecting responses, validate each one using the language validation logic from `EvalLocaleTestUtil`. + +### Option A — In-context validation (Claude) + +Evaluate responses directly without the Python script. Apply these rules: + +**Critical check (run first):** If the target locale is NOT `en_US` and the response is entirely in English, that is a **CRITICAL FAILURE** — report it immediately with `overall_evaluation: POOR`. + +**For each response, check:** +1. Language correctness — is it actually in the target language? +2. Cultural appropriateness and business tone +3. No untranslated fragments (mixed-language responses) +4. Grammar and formatting quality (lenient — flag only obvious issues) + +### Option B — Python validator script (post-batch) + +When the user has a batch results JSON file, run the script for fast automated analysis: + +```bash +# Heuristic only (fast, no API cost) +python3 skills/locale-validation/scripts/validate_locale_responses.py \ + --results /tmp/locale-test-results.json \ + --locales ja fr it de es es_MX pt_BR \ + --agent-name \ + --output /tmp/locale-validation-report.md + +# With LLM-as-judge via Claude (default — uses ANTHROPIC_API_KEY) +python3 skills/locale-validation/scripts/validate_locale_responses.py \ + --results /tmp/locale-test-results.json \ + --locales ja fr it de es es_MX pt_BR \ + --agent-name \ + --llm-validate \ + --llm-api-key "$ANTHROPIC_API_KEY" \ + --output /tmp/locale-validation-report.md + +# With LLM-as-judge via OpenAI (opt-in) +python3 skills/locale-validation/scripts/validate_locale_responses.py \ + --results /tmp/locale-test-results.json \ + --locales ja fr it de es es_MX pt_BR \ + --agent-name \ + --llm-validate \ + --llm-provider openai \ + --llm-api-key "$OPENAI_API_KEY" \ + --llm-model gpt-4o \ + --output /tmp/locale-validation-report.md +``` + +**Provider defaults:** +- `--llm-provider anthropic` (default): calls `api.anthropic.com/v1/messages`, model `claude-haiku-4-5`, key from `ANTHROPIC_API_KEY` +- `--llm-provider openai`: calls `https://api.openai.com/v1/chat/completions`, model `gpt-4o`, key from `OPENAI_API_KEY`. Pass `--llm-endpoint` to use Azure OpenAI or another compatible endpoint. + +**Input format:** The `--results` file must be a JSON object shaped as `{"result": {"testCases": [...]}}` where each test case has fields `locale`, `botResponse` (or `response`), and optionally `testCaseName`, `utterance`, `status`, `topic`. This is a custom intermediate format — not the raw `sf agent test results --result-format json` output. + +**Exit codes:** `0` = all passed; `1` = one or more critical failures (English response in non-English locale) — suitable as a CI gate. + +**LLM API key resolution order:** `--llm-api-key` flag → `ANTHROPIC_API_KEY` (or `OPENAI_API_KEY`) env var → interactive prompt at runtime. + +**If no API key is available:** Run the script with `--llm-validate` directly. The script will prompt for the key interactively: + +``` +ANTHROPIC_API_KEY is not set. Enter your Anthropic API key to continue: +``` + +The user types the key at the prompt — it is passed directly to the script and never written to disk or shell history. 
Do not pre-check for the key or ask the user to `export` it first; just run the script and let it prompt. + +Do not silently skip LLM validation or fall back to heuristic-only when the user explicitly requested LLM-as-judge. + +### Report format (both options) + +``` +## Locale Validation Report — +Date: +Locales tested: ja, fr, it, de, es, es_MX, pt_BR + +| Topic | Locale | Utterance (EN) | Result | Issues | +|-------|--------|----------------|--------|--------| +| OrderTopic | ja | "Check my order" | ✅ PASS | — | +| OrderTopic | fr | "Check my order" | ❌ FAIL | Response in English | + +### Summary +- Total: N tests across M topics and K locales +- Passed: X Failed: Y +- Critical failures (English response): Z + +### Failures detail + +``` + +--- + +## Quick-start examples + +**Run preview locale test on a named agent:** +``` +User: "Run locale validation on MySDRAgent in preview mode" +→ Find MySDRAgent.agent, derive utterances, translate, run sf agent preview per locale +``` + +**Generate batch test spec with locale variants:** +``` +User: "Create a batch locale test suite for MyAgent" +→ Introspect agent, derive utterances, translate all locales, write test-spec-locales.yaml +``` + +**fit-tests integration:** +``` +User: "Generate multilingual eval test data for EngagementAgent" +→ Read references/fit-tests-mode.md for utterances.json + test-case.json generation +``` diff --git a/skills/locale-validation/references/adlc-mode.md b/skills/locale-validation/references/adlc-mode.md new file mode 100644 index 0000000..5df3e88 --- /dev/null +++ b/skills/locale-validation/references/adlc-mode.md @@ -0,0 +1,207 @@ +# Locale Validation — ADLC Execution Reference + +## Preview mode (Mode A) + +Use `sf agent preview` to run each locale test interactively. Each locale requires its own session because `$Context.EndUserLanguage` is set at session creation time. + +### Start a session with a locale + +```bash +# Start session for a specific locale +sf agent preview start \ + --json \ + --authoring-bundle \ + -o \ + > /tmp/locale_session_.json + +SESSION_ID=$(cat /tmp/locale_session_.json | python3 -c "import json,sys; print(json.load(sys.stdin)['result']['sessionId'])") +``` + +> **Note:** `sf agent preview` does not natively accept `$Context.EndUserLanguage` as a CLI flag. To test locale routing, either: +> 1. Deploy a test variant of the agent with `default_locale` set to the target locale, or +> 2. Include the locale in the utterance context (e.g., "Respond only in Japanese: check my order") for quick smoke-testing, or +> 3. Use the org's user language setting if your org supports it. +> +> For rigorous locale testing, prefer Mode B (batch) which allows full session context injection. + +### Send a translated utterance + +```bash +sf agent preview send \ + --json \ + --session-id "$SESSION_ID" \ + --utterance "" \ + --authoring-bundle \ + -o \ + > /tmp/locale_response__.json +``` + +### End the session + +```bash +sf agent preview end \ + --json \ + --session-id "$SESSION_ID" \ + --authoring-bundle \ + -o +``` + +### Extract the agent's response + +```bash +# Get the last bot response from the trace +cat /tmp/locale_response__.json | \ + python3 -c "import json,sys; d=json.load(sys.stdin); print(d['result'].get('response', ''))" +``` + +### Validate the response language + +For each response, check: +1. Is the response in the target language? (Not English when target ≠ en_US) +2. Does it correctly route to the expected topic? 
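For a fast pass on check (1) before running the full validator, the same Unicode-range heuristic that `scripts/validate_locale_responses.py` applies can be run inline. Japanese shown; the response path is a hypothetical example following the naming convention above:

```bash
python3 -c "
import json, re
# Example path — substitute your locale and topic
resp = json.load(open('/tmp/locale_response_ja_topic1.json'))['result'].get('response', '')
# CRITICAL if a ja response contains no hiragana/katakana/kanji at all
ok = bool(re.search(r'[\u3040-\u30ff\u4e00-\u9fff]', resp))
print('PASS' if ok else 'CRITICAL: Latin-only response for ja')
"
```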
+ +Read the topic routing from the trace: +```bash +TRACE_DIR=".sfdx/agents//sessions/$SESSION_ID/traces/" +ls "$TRACE_DIR" +python3 -c " +import json, glob, sys +for f in glob.glob('$TRACE_DIR/*.json'): + d = json.load(open(f)) + topic = d.get('topic', {}).get('name', 'unknown') + print(f'Topic: {topic}') +" +``` + +### Loop pattern for all locales + +```bash +AGENT=MyAgent +ORG=my-org +LOCALES="ja fr it de es es_MX pt_BR" +UTTERANCES_FILE=/tmp/locale_utterances.json # {locale: {topic: utterance}} + +for LOCALE in $LOCALES; do + echo "=== Testing locale: $LOCALE ===" + SESSION=$(sf agent preview start --json --authoring-bundle $AGENT -o $ORG | python3 -c "import json,sys; print(json.load(sys.stdin)['result']['sessionId'])") + + # Read utterances for this locale from your utterances file + UTTERANCE=$(python3 -c "import json; u=json.load(open('$UTTERANCES_FILE')); print(u.get('$LOCALE', {}).get('topic1', ''))") + + sf agent preview send --json --session-id "$SESSION" --utterance "$UTTERANCE" \ + --authoring-bundle $AGENT -o $ORG > /tmp/resp_${LOCALE}.json + + sf agent preview end --json --session-id "$SESSION" --authoring-bundle $AGENT -o $ORG +done +``` + +--- + +## Batch mode (Mode B) + +Mode B lets you inject full session context including `$Context.EndUserLanguage`, making it the most accurate way to test locale behavior. + +### Build a locale-aware test spec + +Generate a test spec YAML with utterances in each locale. Each test case entry should include a locale prefix in its ID for clarity: + +```yaml +# test-spec-locales.yaml +name: " Locale Validation" +subjectType: AGENT +subjectName: + +testCases: + # Japanese + - utterance: "" + expectedTopic: "" + expectedOutcome: "Agent responds in Japanese. Response is culturally appropriate, polite, and does not contain English text." + + - utterance: "" + expectedTopic: "" + expectedOutcome: "Agent responds in Japanese. Response correctly addresses the user's request." + + # French + - utterance: "" + expectedTopic: "" + expectedOutcome: "Agent responds in French. Response is culturally appropriate for French-speaking users." + + # Italian + - utterance: "" + expectedTopic: "" + expectedOutcome: "Agent responds in Italian. Response is grammatically correct and professional." + + # German + - utterance: "" + expectedTopic: "" + expectedOutcome: "Agent responds in German. Response uses appropriate formal register (Sie form)." + + # Spanish + - utterance: "" + expectedTopic: "" + expectedOutcome: "Agent responds in Spanish. Response is appropriate for Latin American or European Spanish context." + + # Spanish Mexico + - utterance: "" + expectedTopic: "" + expectedOutcome: "Agent responds in Mexican Spanish. Response reflects appropriate regional phrasing." + + # Portuguese Brazil + - utterance: "" + expectedTopic: "" + expectedOutcome: "Agent responds in Brazilian Portuguese. Response is grammatically correct and professional." + + # Regression: English should still work + - utterance: "" + expectedTopic: "" + expectedOutcome: "Agent responds in English." +``` + +> **On session context injection:** The `AiEvaluationDefinition` format (used by `sf agent test create`) does not currently support per-test `$Context.EndUserLanguage` injection in the public YAML spec. If your agent relies on this variable for locale routing, set up dedicated test orgs or profiles with the user language set to the target locale, or use the fit-tests approach (see `fit-tests-mode.md`) which supports full session context via JSON. 
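If Phase 3 produced the `{locale: {topic: utterance}}` map, the `testCases` body can be generated rather than hand-written. A minimal sketch — the `expectedOutcome` wording is illustrative, and `locale:` is included per test case so the validator's `--spec` assignment works without inference:

```python
import json

LANGS = {"ja": "Japanese", "fr": "French", "it": "Italian", "de": "German",
         "es": "Spanish", "es_MX": "Spanish (Mexico)", "pt_BR": "Portuguese (Brazil)"}

utterances = json.load(open("/tmp/locale_utterances.json"))  # {locale: {topic: utterance}}
lines = ["testCases:"]
for locale, topics in utterances.items():
    language = LANGS.get(locale, locale)
    for topic, utterance in topics.items():
        lines += [
            f'  - utterance: "{utterance}"',
            f'    locale: "{locale}"',
            f'    expectedOutcome: "Agent responds in {language}. '
            f'Response does not contain English text."',
            "",
        ]
print("\n".join(lines))
```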
+ +### Deploy and run the test suite + +```bash +# Create the test suite from spec +sf agent test create \ + --json \ + --spec test-spec-locales.yaml \ + --api-name LocaleValidation \ + -o + +# Run it +sf agent test run \ + --json \ + --api-name LocaleValidation \ + --wait 30 \ + --result-format json \ + -o \ + > /tmp/locale-test-results.json +``` + +### Parse results + +```bash +python3 -c " +import json +results = json.load(open('/tmp/locale-test-results.json')) +for tc in results.get('result', {}).get('testCases', []): + name = tc.get('testCaseName', 'unknown') + status = tc.get('status', 'unknown') + verdict = tc.get('verdict', '') + print(f'{name}: {status} — {verdict}') +" +``` + +### Evaluate responses with the language validator + +For richer validation, run the locale validator after getting results: + +```bash +python3 skills/locale-validation/scripts/validate_locale_responses.py \ + --results /tmp/locale-test-results.json \ + --locales ja fr it de es es_MX pt_BR \ + --output /tmp/locale-validation-report.md +``` + +See `scripts/validate_locale_responses.py` for the implementation. diff --git a/skills/locale-validation/references/fit-tests-mode.md b/skills/locale-validation/references/fit-tests-mode.md new file mode 100644 index 0000000..304a635 --- /dev/null +++ b/skills/locale-validation/references/fit-tests-mode.md @@ -0,0 +1,268 @@ +# Locale Validation — einstein-copilot-fit-tests Reference + +This reference covers how to generate and run multilingual locale validation tests using the +`einstein-copilot-fit-tests` Maven/JUnit framework. It is also applicable when the skill is +invoked from within that repository. + +## Framework overview + +The framework is built on `EvalParameterizedTest` (a JUnit 5 abstract base class). The key mechanism is: + +- `testCaseFilePath` — JSON template with placeholders like `{{utterance}}`, `{{languageKey}}`, `{{language}}` +- `utteranceFilePath` — JSON with utterances nested by test type and locale +- `utteranceSourceMultiLang()` / `utteranceSourceMultiLangSanity()` / `utteranceSourceMultiLangSmoke()` — method sources that expand the matrix of (utterance × locale) into parameterized test arguments + +`EvalLocaleTestUtil` provides the LLM-as-judge validation prompt. `UtteranceTranslationUtil` provides LLM-backed utterance translation at runtime. + +--- + +## Step 1 — Generate utterances.json + +The utterances file must follow the nested structure: `{ "TestType": { "localeCode": { "testKey": "utterance" } } }`. + +**Template:** + +```json +{ + "Sanity": { + "en_US": { + "topic1_utterance1": "Your English utterance here" + }, + "ja": { + "topic1_utterance1": "日本語の発話(EvalLocaleTestUtil翻訳または手動)" + }, + "fr": { + "topic1_utterance1": "Votre énoncé en français" + }, + "it": { + "topic1_utterance1": "Il tuo enunciato in italiano" + }, + "de": { + "topic1_utterance1": "Ihre Äußerung auf Deutsch" + }, + "es": { + "topic1_utterance1": "Su enunciado en español" + }, + "es_MX": { + "topic1_utterance1": "Su enunciado en español mexicano" + }, + "pt_BR": { + "topic1_utterance1": "Seu enunciado em português brasileiro" + } + }, + "Smoke": { + "en_US": { + "topic1_utterance1": "Smoke-level English utterance" + }, + "ja": { + "topic1_utterance1": "スモークレベルの日本語発話" + } + } +} +``` + +When using `UtteranceTranslationUtil` at runtime (preferred), you only need to supply `en_US` +utterances — the framework translates them dynamically. Pre-translated utterances are better for +regression stability (same input every run). 
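Either way, the locale blocks must stay keyed consistently — the method sources expand the (utterance × locale) matrix from this file, so a test key present in `en_US` but absent from another locale's block is easy to miss. A quick structural check, illustrative and assuming the nesting shown above:

```python
import json

data = json.load(open("utterances.json"))
for test_type, locales in data.items():
    base = set(locales.get("en_US", {}))
    for code, cases in locales.items():
        missing = base - set(cases)
        if missing:
            print(f"{test_type}/{code} missing keys: {sorted(missing)}")
```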
+ +Canonical file location: +``` +src/test/resources/testdata/evals////utterances.json +``` + +--- + +## Step 2 — Generate test-case.json + +The test case file is a JSON template with `{{placeholder}}` tokens. The framework replaces: + +| Placeholder | Value | +|-------------|-------| +| `{{utterance}}` | The translated utterance for this locale | +| `{{languageKey}}` | The locale code, e.g. `ja` | +| `{{language}}` | The human-readable language name, e.g. `Japanese` | +| `{{plannerId}}` | The agent planner ID (set in `placeHolderContext`) | +| `{{testType}}` | `Sanity`, `Smoke`, or `Regression` | + +**Minimal test-case.json template:** + +```json +{ + "tests": [{ + "id": "locale_validation_{{testType}}", + "description": "Locale validation test for {{language}}", + "steps": [ + { + "type": "agent.create_session", + "planner_id": "{{plannerId}}", + "setup_session_context": { + "variables": [ + { + "name": "$Context.EndUserLanguage", + "value": "{{languageKey}}" + } + ] + } + }, + { + "type": "agent.send_message", + "utterance": "{{utterance}}" + }, + { + "type": "evaluator.llm_assertion", + "prompt": "{{languageCheckPrompt}}", + "operator": "contains", + "expected": "\"overall_evaluation\": \"Good\"" + } + ] + }] +} +``` + +The `{{languageCheckPrompt}}` placeholder is populated at runtime by calling: +```java +placeHolderContext.put("languageCheckPrompt", + EvalLocaleTestUtil.buildLanguageValidationPrompt(languageKey, agentResponse)); +``` + +Canonical file location: +``` +src/test/resources/testdata/evals////test-case.json +``` + +--- + +## Step 3 — Generate the Java test class + +Create a class that extends `EvalParameterizedTest` (or the appropriate cloud-specific subclass). + +**Template:** + +```java +package com.salesforce.einstein_copilot.test.evals...; + +import com.salesforce.einstein_copilot.test.evals.util.EvalLocaleTestUtil; +import com.salesforce.einstein_copilot.test.evals.util.LocalizationTranslatorHolder; +import com.salesforce.einstein_copilot.test.evals.util.UtteranceTranslationUtil; +import com.salesforce.einstein_copilot.test.evalsv2.EvalParameterizedTest; +import com.salesforce.atf.context.TestRunContext; +import org.junit.jupiter.api.BeforeAll; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.params.ParameterizedTest; +import org.junit.jupiter.params.provider.MethodSource; + +// Import your @EvalTestParameters annotation and eval utilities + +public class LocaleEvalTest extends EvalParameterizedTest { + + private static final String AGENT_NAME = ""; + private static final String CREDENTIAL_NAME = ""; + + @BeforeAll + public static void initInputFiles() { + testCaseFilePath.set("/testdata/evals////test-case.json"); + utteranceFilePath.set("/testdata/evals////utterances.json"); + } + + @BeforeEach + public void setupTranslator(TestRunContext context) { + // Set up LLM-based translator if available (falls back to pre-translated utterances) + var translator = UtteranceTranslationUtil.createTranslatorFromContext(context); + if (translator != null) { + LocalizationTranslatorHolder.setTranslator(translator); + } + } + + @ParameterizedTest(name = "LocaleValidation-Sanity-{0}") + @MethodSource("utteranceSourceMultiLangSanity") + public void testLocaleValidation_Sanity(String testName, org.json.JSONObject request) throws Exception { + // Extract language key from the request for validation + String languageKey = EvalLocaleTestUtil.extractLanguageKeyFromRequest(request.toString()); + + // Set the planner ID and any other context your agent needs + 
placeHolderContext.put("plannerId", /* resolve planner id */); + + // Run the eval + var evalResponse = evalUtil.evaluate(request.toString(), placeHolderContext); + + // Set language validation prompt for the eval assertion step + String agentResponse = /* extract response from evalResponse */; + placeHolderContext.put("languageCheckPrompt", + EvalLocaleTestUtil.buildLanguageValidationPrompt(languageKey, agentResponse)); + + runEvalValidations(evalResponse, request.toString(), placeHolderContext); + } + + @ParameterizedTest(name = "LocaleValidation-Smoke-{0}") + @MethodSource("utteranceSourceMultiLangSmoke") + public void testLocaleValidation_Smoke(String testName, org.json.JSONObject request) throws Exception { + testLocaleValidation_Sanity(testName, request); + } +} +``` + +**Placement:** `src/test/java/com/salesforce/einstein_copilot/test/evals////` + +--- + +## Step 4 — Run the tests + +```bash +# Run sanity locale tests +mvn clean test \ + -Dtest="LocaleEvalTest#testLocaleValidation_Sanity" \ + -Ptestlocal \ + -Dsut_config=src/test/resources/stc-config-sdb15.json + +# Run a specific locale only (JUnit filter) +mvn clean test \ + -Dtest="LocaleEvalTest#testLocaleValidation_Sanity[LocaleValidation-Sanity-topic1_utterance1_ja]" \ + -Ptestlocal \ + -Dsut_config=src/test/resources/stc-config-sdb15.json + +# Run all locales +mvn clean test \ + -Dtest="LocaleEvalTest" \ + -Ptestlocal \ + -Dsut_config=src/test/resources/stc-config-sdb15.json +``` + +--- + +## EvalLocaleTestUtil — key methods + +| Method | Use | +|--------|-----| +| `extractLanguageKeyFromRequest(String requestJson)` | Pulls `$Context.EndUserLanguage` from test request JSON; defaults to `en_US` | +| `buildLanguageValidationPrompt(String languageKey, String responseString)` | Builds the LLM-as-judge prompt for validating response language/quality | +| `buildLanguageValidationPromptFromRequest(String requestJson, String responseString)` | Convenience: extracts locale from request then builds prompt | +| `translateText(String text, String targetLanguageCode)` | Translates via `LocalizationTranslatorHolder`; returns original if no translator | + +## Language validation prompt behavior + +The prompt (in `EvalLocaleTestUtil.languagePrompt`) instructs the LLM judge to: + +- **Critical failure:** Return `overall_evaluation: POOR` immediately if the target is not English but the response is in English +- **Lenient pass:** Allow minor regional/dialect differences, minor formality deviations +- **Issue types:** `Untranslated Text`, `Grammar/Spelling`, `Tone/Structure`, `Formatting`, `Culture & Business Alignment`, `Sensitive Content` +- **Output:** JSON with `validation_results` list and `overall_evaluation` (`Good` or `Bad`) + +The `overall_evaluation` key in the JSON output is what the `evaluator.llm_assertion` step checks via `contains "\"overall_evaluation\": \"Good\""`. 
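For reference, a shaped example of a failing judgement — field values are illustrative, and the exact output can vary by judge model:

```json
{
  "validation_results": [
    {
      "issue_type": "Untranslated Text",
      "problematic_text": "Your order is ready.",
      "description": "The response is in English although the target language is Japanese.",
      "suggestion": "Translate the full response into Japanese before sending."
    }
  ],
  "overall_evaluation": "Bad"
}
```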
+ +--- + +## Supported locale codes + +From `EvalParameterizedTest.LANGUAGE_MAPPING`: + +| Code | Language | Code | Language | +|------|----------|------|----------| +| `en_US` | English | `fr` | French | +| `ja` | Japanese | `it` | Italian | +| `de` | German | `es` | Spanish | +| `es_MX` | Spanish (Mexico) | `pt_BR` | Portuguese (Brazil) | +| `pt_PT` | Portuguese (European) | `fr_CA` | French (Canadian) | +| `en_GB` | English (UK) | `en_AU` | English (Australian) | +| `zh_CN` | Chinese (Simplified) | `zh_TW` | Chinese (Traditional) | +| `ar` | Arabic | `ko` | Korean | +| `nl` | Dutch | `da` | Danish | diff --git a/skills/locale-validation/scripts/validate_locale_responses.py b/skills/locale-validation/scripts/validate_locale_responses.py new file mode 100644 index 0000000..f18c4f2 --- /dev/null +++ b/skills/locale-validation/scripts/validate_locale_responses.py @@ -0,0 +1,735 @@ +#!/usr/bin/env python3 +""" +Validate agent responses for locale correctness after a batch test run. + +Usage — heuristic only (fast, no LLM): + python3 validate_locale_responses.py \ + --results /tmp/locale-test-results.json \ + --locales ja fr it de es es_MX pt_BR \ + --output /tmp/locale-validation-report.md + +Usage — heuristic + LLM-as-judge via Claude (default): + python3 validate_locale_responses.py \ + --results /tmp/locale-test-results.json \ + --locales ja fr it de es es_MX pt_BR \ + --llm-validate \ + --llm-api-key $ANTHROPIC_API_KEY \ + --output /tmp/locale-validation-report.md + +Usage — heuristic + LLM-as-judge via OpenAI: + python3 validate_locale_responses.py \ + --results /tmp/locale-test-results.json \ + --locales ja fr it de es es_MX pt_BR \ + --llm-validate \ + --llm-provider openai \ + --llm-api-key $OPENAI_API_KEY \ + --llm-model gpt-4o \ + --output /tmp/locale-validation-report.md + +The heuristic layer runs first (fast, zero cost). +The LLM layer is only invoked when --llm-validate is passed and runs a Python +reimplementation of EvalLocaleTestUtil.buildLanguageValidationPrompt() / languagePrompt +from com.salesforce.einstein_copilot.test.evals.util.EvalLocaleTestUtil. + +Provider defaults: anthropic (claude-haiku-4-5 via api.anthropic.com). +For OpenAI-compatible endpoints (Azure, etc.) use --llm-provider openai. +Azure OpenAI: https://.openai.azure.com/openai/deployments//chat/completions?api-version=2024-02-01 +""" + +import argparse +import json +import re +import sys +import urllib.request +import urllib.error +from pathlib import Path +from datetime import date +from typing import Optional + +# --------------------------------------------------------------------------- +# Language prompt — exact text of EvalLocaleTestUtil.languagePrompt +# (Java source: src/test/java/com/salesforce/einstein_copilot/test/evals/util/EvalLocaleTestUtil.java) +# Placeholders: +# {{languageKey}} — target locale code, e.g. "es", "ja" +# {ResponseString} — the agent response text to validate +# --------------------------------------------------------------------------- +LANGUAGE_VALIDATION_PROMPT = ( + 'You are a localization expert specializing in business communication for **{{languageKey}}**. ' + 'Your task is to analyze the provided response for cultural and linguistic issues.' + '**Critical Language Check:**' + 'If the Target Language is NOT English (en_US) and the response is written entirely or predominantly in English, ' + "you MUST immediately return overall_evaluation as POOR with explanation " + "'Response is in English but should be in {{languageKey}}'." 
+ 'This is a critical failure that overrides all other evaluations. ' + 'Your analysis must be lenient, prioritizing cultural appropriateness, fluency, and business tone over minor regionalisms (like spelling differences). ' + 'Identify all issues and provide your feedback in a strict JSON format in en-US. ' + 'The root object must contain a validation_results list. If there are no issues, return an empty list. ' + 'Each issue object in the list must contain the following keys: ' + '* issue_type: (String) Must be one of: "Untranslated Text", "Grammar/Spelling", "Tone/Structure", "Formatting", "Culture & Business Alignment", "Sensitive Content". ' + '* problematic_text: (String) The exact text snippet that contains the error. ' + '* Be Lenient in your cultural evaluation: Your goal is to be a cultural guide, not a strict grammar checker. ' + '* Be Lenient in your Tone evaluation: Your goal is to validate the tone is polite, respectful & formal. ' + '* Do not penalize minor deviations in fluency or formality if the message is understandable and would be acceptable in a professional setting. ' + '* description: (String) A clear explanation of why it is an issue based on the target language and culture. ' + 'Consider the expected level of formality. When identifying cultural mismatches, consider whether the phrasing might seem slightly off but still acceptable' + '\u2014or whether it risks alienating, confusing, or disengaging the recipient. ' + '* suggestion: (String) A concrete, corrected version of the text. Do not suggest if the overall evaluation looks good. Provide suggestions only in English. ' + '* Provide an **overall_evaluation** as Good or Bad with an explanation. Provide explanation only in English. ' + '--- **Target Language:** {{languageKey}} **Response to Analyze:**{ResponseString}' +) + +LANGUAGE_MAPPING = { + # Agentforce supported languages + # https://help.salesforce.com/s/articleView?id=ai.agent_employee_agent_considerations.htm + "ar": "Arabic", + "zh_CN": "Chinese (Simplified)", + "zh_TW": "Chinese (Traditional)", + "da": "Danish", + "nl": "Dutch", + "fi": "Finnish", + "fr": "French", + "de": "German", + "in": "Indonesian", + "it": "Italian", + "ja": "Japanese", + "ko": "Korean", + "ms": "Malay", + "no": "Norwegian", + "pl": "Polish", + "pt_BR": "Portuguese (Brazil)", + "pt_PT": "Portuguese (European)", + "ru": "Russian", + "es": "Spanish", + "es_MX": "Spanish (Mexico)", + "sv": "Swedish", + "th": "Thai", + "tr": "Turkish", + # English variants + "en_US": "English", + "en_GB": "English (UK)", + # Additional common variants + "fr_CA": "French (Canadian)", +} + +# Unicode range checks for non-Latin scripts +JAPANESE_RANGE = re.compile(r'[\u3040-\u30ff\u4e00-\u9fff]') +CHINESE_RANGE = re.compile(r'[\u4e00-\u9fff\u3400-\u4dbf]') +ARABIC_RANGE = re.compile(r'[\u0600-\u06ff]') +KOREAN_RANGE = re.compile(r'[\uac00-\ud7af\u1100-\u11ff]') +THAI_RANGE = re.compile(r'[\u0e00-\u0e7f]') +LATIN_RANGE = re.compile(r'[a-zA-Z]') + +NON_LATIN_LOCALES = { + "ja": JAPANESE_RANGE, + "zh_CN": CHINESE_RANGE, + "zh_TW": CHINESE_RANGE, + "ar": ARABIC_RANGE, + "ko": KOREAN_RANGE, + "th": THAI_RANGE, +} + + +# --------------------------------------------------------------------------- +# LLM-backed validation (mirrors EvalLocaleTestUtil.buildLanguageValidationPrompt) +# --------------------------------------------------------------------------- + +def build_language_validation_prompt(language_key: str, response_string: str) -> str: + """Mirrors 
EvalLocaleTestUtil.buildLanguageValidationPrompt(languageKey, responseString).""" + return ( + LANGUAGE_VALIDATION_PROMPT + .replace("{{languageKey}}", language_key) + .replace("{ResponseString}", response_string) + ) + + +def call_llm_anthropic(prompt: str, api_key: str, model: str) -> str: + """Call the Anthropic Messages API. Returns assistant text or raises on error.""" + payload = json.dumps( + { + "model": model, + "max_tokens": 1024, + "messages": [{"role": "user", "content": prompt}], + }, + ensure_ascii=False, + ).encode("utf-8") + + req = urllib.request.Request( + "https://api.anthropic.com/v1/messages", data=payload, method="POST" + ) + req.add_unredirected_header("Content-Type", "application/json; charset=utf-8") + req.add_unredirected_header("x-api-key", api_key) + req.add_unredirected_header("anthropic-version", "2023-06-01") + + with urllib.request.urlopen(req, timeout=60) as resp: + body = json.loads(resp.read().decode("utf-8")) + + return body["content"][0]["text"] + + +def call_llm_openai(prompt: str, endpoint: str, api_key: str, model: str) -> str: + """Call an OpenAI-compatible chat completions endpoint. Returns assistant text or raises.""" + payload = json.dumps( + { + "model": model, + "messages": [{"role": "user", "content": prompt}], + "temperature": 0, + }, + ensure_ascii=False, + ).encode("utf-8") + + # Build request without passing headers dict to avoid Python's latin-1 header encoding + req = urllib.request.Request(endpoint, data=payload, method="POST") + req.add_unredirected_header("Content-Type", "application/json; charset=utf-8") + req.add_unredirected_header("Authorization", f"Bearer {api_key}") + + with urllib.request.urlopen(req, timeout=60) as resp: + body = json.loads(resp.read().decode("utf-8")) + + return body["choices"][0]["message"]["content"] + + +def parse_llm_result(llm_text: str) -> dict: + """ + Extract the JSON block from the LLM response. + Returns dict with keys: overall_evaluation, validation_results, raw. + overall_evaluation: "Good" | "Bad" | "POOR" | "unknown" + """ + # Try to extract a JSON object from the response text + json_match = re.search(r'\{.*\}', llm_text, re.DOTALL) + if json_match: + try: + parsed = json.loads(json_match.group()) + return { + "overall_evaluation": parsed.get("overall_evaluation", "unknown"), + "validation_results": parsed.get("validation_results", []), + "raw": llm_text, + } + except json.JSONDecodeError: + pass + + # Fallback: look for overall_evaluation as a plain string + match = re.search(r'"overall_evaluation"\s*:\s*"([^"]+)"', llm_text, re.IGNORECASE) + overall = match.group(1) if match else "unknown" + return {"overall_evaluation": overall, "validation_results": [], "raw": llm_text} + + +def llm_validate(response: str, locale: str, llm_config: dict) -> dict: + """ + Run EvalLocaleTestUtil.languagePrompt against a single response via the LLM. + Returns parse_llm_result dict, or {"overall_evaluation": "error", "error": str, "raw": ""}. 
+ """ + if locale == "en_US": + return {"overall_evaluation": "skipped", "validation_results": [], "raw": ""} + + import time + prompt = build_language_validation_prompt(locale, response) + for attempt in range(3): + try: + if llm_config.get("provider", "anthropic") == "anthropic": + text = call_llm_anthropic( + prompt, + api_key=llm_config["api_key"], + model=llm_config["model"], + ) + else: + text = call_llm_openai( + prompt, + endpoint=llm_config["endpoint"], + api_key=llm_config["api_key"], + model=llm_config["model"], + ) + return parse_llm_result(text) + except urllib.error.HTTPError as e: + if e.code == 429 and attempt < 2: + retry_after = e.headers.get("Retry-After") or e.headers.get("x-ratelimit-reset-requests") + try: + wait = float(retry_after) if retry_after else 20 * (attempt + 1) + except (ValueError, TypeError): + wait = 20 * (attempt + 1) + print(f" rate limited (429), waiting {wait:.0f}s before retry {attempt + 1}/2...", file=sys.stderr) + time.sleep(wait) + continue + return {"overall_evaluation": "error", "error": f"HTTP {e.code}: {e.reason}", "raw": ""} + except Exception as e: + return {"overall_evaluation": "error", "error": str(e), "raw": ""} + return {"overall_evaluation": "error", "error": "max retries exceeded", "raw": ""} + + +# --------------------------------------------------------------------------- +# Heuristic validation +# --------------------------------------------------------------------------- + +def detect_language_issue(response: str, locale: str) -> tuple[bool, str]: + """ + Fast heuristic: check if response appears to be in the target locale. + Returns (has_issue, description). + """ + if not response or not response.strip(): + return True, "Empty response" + + if locale == "en_US": + return False, "" + + if locale in NON_LATIN_LOCALES: + pattern = NON_LATIN_LOCALES[locale] + if not pattern.search(response): + return True, f"CRITICAL: Response appears to be in English/Latin script, not {LANGUAGE_MAPPING.get(locale, locale)}" + return False, "" + + # Cyrillic-script locales are treated like non-Latin: require Cyrillic characters + cyrillic = re.compile(r'[Ѐ-ӿ]') + if locale == "ru" and not cyrillic.search(response): + return True, f"CRITICAL: Response appears to be in English/Latin script, not {LANGUAGE_MAPPING.get(locale, locale)}" + + locale_hints = { + "fr": re.compile(r'[àâçéèêëîïôùûü]', re.IGNORECASE), + "fr_CA": re.compile(r'[àâçéèêëîïôùûü]', re.IGNORECASE), + "de": re.compile(r'[äöüß]', re.IGNORECASE), + "es": re.compile(r'[áéíóúüñ¿¡]', re.IGNORECASE), + "es_MX": re.compile(r'[áéíóúüñ¿¡]', re.IGNORECASE), + "it": re.compile(r'[àèéìòù]', re.IGNORECASE), + "pt_BR": re.compile(r'[ãõáéíóúâêôàç]', re.IGNORECASE), + "pt_PT": re.compile(r'[ãõáéíóúâêôàç]', re.IGNORECASE), + "nl": re.compile(r'[áéíóúäëïöüèàê]', re.IGNORECASE), + "da": re.compile(r'[æøå]', re.IGNORECASE), + "sv": re.compile(r'[åäö]', re.IGNORECASE), + "no": re.compile(r'[æøå]', re.IGNORECASE), + "fi": re.compile(r'[äöå]', re.IGNORECASE), + "pl": re.compile(r'[ąćęłńóśźż]', re.IGNORECASE), + "tr": re.compile(r'[çğışöü]', re.IGNORECASE), + "ms": re.compile(r'[a-z]', re.IGNORECASE), # Malay is Latin-only; skip diacritic check + "in": re.compile(r'[a-z]', re.IGNORECASE), # Indonesian is Latin-only; skip diacritic check + } + + if len(response.strip()) < 20: + return False, "" + + hint_pattern = locale_hints.get(locale) + if hint_pattern and not hint_pattern.search(response): + latin_chars = len(LATIN_RANGE.findall(response)) + total_chars = len(response.replace(" ", "")) + if total_chars > 
0 and latin_chars / total_chars > 0.95: + return True, f"POSSIBLE ISSUE: Response may be in English, not {LANGUAGE_MAPPING.get(locale, locale)} — no locale-specific characters detected" + return False, "" + + +# --------------------------------------------------------------------------- +# Core pipeline +# --------------------------------------------------------------------------- + +def _is_raw_sf_format(tc: dict) -> bool: + """Return True if this test case is raw sf agent test results format.""" + return "inputs" in tc or "generatedData" in tc or "testNumber" in tc + + +def extract_responses_from_results(results_data: dict) -> list[dict]: + entries = [] + test_cases = results_data.get("result", {}).get("testCases", []) + for tc in test_cases: + if _is_raw_sf_format(tc): + # Raw sf agent test results format: + # inputs.utterance, generatedData.outcome, generatedData.topic + # locale is not present in raw output — caller must pass --locales to assert + gen = tc.get("generatedData", {}) + response = gen.get("outcome", "") + # Also check testResults[].actualValue for output_validation if outcome missing + if not response: + for tr in tc.get("testResults", []): + if tr.get("name") == "output_validation" and tr.get("actualValue"): + response = tr["actualValue"] + break + entries.append({ + "name": str(tc.get("testNumber", "unknown")), + "utterance": tc.get("inputs", {}).get("utterance", ""), + "response": response, + "locale": tc.get("locale", "unknown"), + "status": tc.get("status", "unknown"), + "topic": gen.get("topic", ""), + }) + else: + # Custom intermediate format: top-level testCaseName, botResponse, locale + entries.append({ + "name": tc.get("testCaseName", "unknown"), + "utterance": tc.get("utterance", ""), + "response": tc.get("botResponse", tc.get("response", "")), + "locale": tc.get("locale", "unknown"), + "status": tc.get("status", "unknown"), + "topic": tc.get("topic", ""), + }) + return entries + + +def _infer_locale_from_utterance(utterance: str, target_locales: list[str]) -> Optional[str]: + """ + Infer the locale of an utterance from its script/characters. + Returns the first matching locale from target_locales, or None. + """ + if not utterance: + return None + + # Non-Latin script detection (unambiguous) + script_map = [ + ("ja", JAPANESE_RANGE), + ("ko", KOREAN_RANGE), + ("ar", ARABIC_RANGE), + ("zh_CN", CHINESE_RANGE), + ("zh_TW", CHINESE_RANGE), + ] + for code, pattern in script_map: + if code in target_locales and pattern.search(utterance): + return code + + # Latin-script locale detection via accent characters + accent_map = [ + ("fr", re.compile(r'[àâçéèêëîïôùûü]', re.IGNORECASE)), + ("fr_CA", re.compile(r'[àâçéèêëîïôùûü]', re.IGNORECASE)), + ("de", re.compile(r'[äöüß]', re.IGNORECASE)), + ("es", re.compile(r'[áéíóúüñ¿¡]', re.IGNORECASE)), + ("es_MX", re.compile(r'[áéíóúüñ¿¡]', re.IGNORECASE)), + ("it", re.compile(r'[àèéìòù]', re.IGNORECASE)), + ("pt_BR", re.compile(r'[ãõáéíóúâêôàç]', re.IGNORECASE)), + ("pt_PT", re.compile(r'[ãõáéíóúâêôàç]', re.IGNORECASE)), + ] + for code, pattern in accent_map: + if code in target_locales and pattern.search(utterance): + return code + + # Fall back to first English locale if utterance looks purely ASCII/English + for code in ("en_US", "en_GB"): + if code in target_locales: + return code + + return None + + +def _load_spec_order(spec_path: str) -> list[dict]: + """ + Load utterance→locale ordering from a testSpec YAML. + Returns list of {utterance, locale} in spec order. 
+ Requires PyYAML; silently returns [] if unavailable or file missing. + """ + try: + import yaml + with open(spec_path, encoding="utf-8") as f: + spec = yaml.safe_load(f) + return [ + {"utterance": tc.get("utterance", ""), "locale": tc.get("locale", "")} + for tc in spec.get("testCases", []) + ] + except Exception: + return [] + + +def _expand_entries(entries: list[dict], target_locales: list[str], + spec_order: Optional[list] = None) -> list[dict]: + """ + Assign a locale to each entry using (in priority order): + 1. spec_order — matches entry by utterance text to the spec's declared locale + 2. utterance inference — detects locale from script/accent characters + 3. target_locales expansion — last resort for entries with no detectable locale + Entries with a known locale (custom format) are filtered to target_locales unchanged. + Result is sorted to match spec order when spec_order is provided. + """ + # Build utterance→locale lookup from spec + spec_lookup = {} + if spec_order: + for item in spec_order: + utt = item.get("utterance", "").strip() + loc = item.get("locale", "").strip() + if utt and loc: + spec_lookup[utt] = loc + + expanded = [] + for entry in entries: + locale = entry.get("locale", "unknown") + utterance = entry.get("utterance", "").strip() + + if locale != "unknown": + # Custom format — already has locale; filter to requested locales + if not target_locales or locale in target_locales: + expanded.append(entry) + continue + + # Raw sf format — no locale field; resolve it + inferred = ( + spec_lookup.get(utterance) + or _infer_locale_from_utterance(utterance, target_locales) + ) + + if inferred: + if not target_locales or inferred in target_locales: + expanded.append({**entry, "locale": inferred}) + else: + # Cannot infer — expand one copy per target locale as last resort + for loc in (target_locales or []): + expanded.append({**entry, "locale": loc}) + + # Re-sort to match spec order if available + if spec_order: + spec_positions = { + item.get("utterance", "").strip(): i + for i, item in enumerate(spec_order) + } + expanded.sort(key=lambda e: spec_positions.get(e.get("utterance", "").strip(), 9999)) + + return expanded + + +def validate_responses(entries: list[dict], target_locales: list[str], + llm_config: Optional[dict] = None, + spec_order: Optional[list] = None) -> dict: + results = { + "total": 0, "passed": 0, "failed": 0, "critical": 0, + "llm_enabled": llm_config is not None, + "llm_provider": llm_config.get("provider", "anthropic") if llm_config else "", + "details": [], + } + + for entry in _expand_entries(entries, target_locales, spec_order=spec_order): + locale = entry.get("locale", "") + + results["total"] += 1 + response = entry.get("response", "") + + # Heuristic layer + heuristic_issue, heuristic_desc = detect_language_issue(response, locale) + is_critical = "CRITICAL" in heuristic_desc + + # LLM layer (optional) + llm_result = None + if llm_config and response and locale != "en_US": + import time as _time + print(f" LLM validating: {entry.get('name', '')} [{locale}]...", file=sys.stderr) + _time.sleep(llm_config.get("call_delay", 1)) + llm_result = llm_validate(response, locale, llm_config) + + # Determine overall pass/fail: + # Fail if heuristic found an issue OR LLM returned Bad/POOR + llm_overall = llm_result["overall_evaluation"] if llm_result else None + llm_failed = llm_overall in ("Bad", "POOR") if llm_overall else False + llm_error = llm_result.get("error") if llm_result else None + + # LLM POOR = critical (English response in non-English locale) 
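+        # (An LLM "Bad" alone stays non-critical: it flags tone/grammar problems,
+        # not a wrong-language response.)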
+        if llm_overall == "POOR":
+            is_critical = True
+
+        has_issue = heuristic_issue or llm_failed
+
+        if has_issue:
+            results["failed"] += 1
+            if is_critical:
+                results["critical"] += 1
+        else:
+            results["passed"] += 1
+
+        # Build combined issue description
+        issue_parts = []
+        if heuristic_desc:
+            issue_parts.append(f"[heuristic] {heuristic_desc}")
+        if llm_result:
+            if llm_error:
+                issue_parts.append(f"[llm] error: {llm_error}")
+            elif llm_failed:
+                llm_issues = llm_result.get("validation_results", [])
+                summary = "; ".join(
+                    i.get("issue_type", "") + ": " + i.get("description", "")[:80]
+                    for i in llm_issues[:3]
+                ) if llm_issues else f"overall_evaluation={llm_overall}"
+                issue_parts.append(f"[llm] {summary}")
+
+        results["details"].append({
+            "name": entry.get("name", ""),
+            "locale": locale,
+            "language": LANGUAGE_MAPPING.get(locale, locale),
+            "utterance": entry.get("utterance", "")[:80],
+            "response_excerpt": response[:150] if response else "",
+            "status": "FAIL" if has_issue else "PASS",
+            "heuristic_issue": heuristic_desc,
+            "llm_overall": llm_overall,
+            "llm_issues": llm_result.get("validation_results", []) if llm_result else [],
+            "llm_error": llm_error,
+            "issue": " | ".join(issue_parts) if issue_parts else "",
+            "critical": is_critical,
+        })
+
+    return results
+
+
+# ---------------------------------------------------------------------------
+# Report rendering
+# ---------------------------------------------------------------------------
+
+def render_report(validation: dict, agent_name: str) -> str:
+    llm_enabled = validation.get("llm_enabled", False)
+    llm_provider = validation.get("llm_provider", "anthropic")
+    if llm_enabled:
+        llm_label = f"enabled ({llm_provider}, EvalLocaleTestUtil.languagePrompt)"
+    else:
+        llm_label = "disabled (heuristic only)"
+    lines = [
+        f"## Locale Validation Report — {agent_name}",
+        f"Date: {date.today()}",
+        f"LLM validation: {llm_label}",
+        "",
+        "| Name | Locale | Utterance | Heuristic | LLM | Actual Response | Issues |",
+        "|------|--------|-----------|-----------|-----|-----------------|--------|",
+    ]
+
+    for d in validation["details"]:
+        h_col = "❌" if d["heuristic_issue"] else "✅"
+        llm_col = {
+            "Good": "✅ Good",
+            "Bad": "❌ Bad",
+            "POOR": "🚨 POOR",
+            "skipped": "—",
+            "error": "⚠️ err",
+            None: "—",
+        }.get(d["llm_overall"], d["llm_overall"] or "—")
+        # Escape pipes/newlines so multilingual text and " | "-joined issue
+        # strings cannot break the Markdown table
+        issue = d["issue"].replace("|", "\\|") if d["issue"] else "—"
+        utterance = d["utterance"][:60].replace("|", "\\|").replace("\n", " ")
+        response = d["response_excerpt"].replace("|", "\\|").replace("\n", " ") if d["response_excerpt"] else "—"
+        lines.append(
+            f"| {d['name']} | {d['locale']} | {utterance} | {h_col} | {llm_col} | {response} | {issue} |"
+        )
+
+    lines += [
+        "",
+        "### Summary",
+        f"- Total: {validation['total']}",
+        f"- Passed: {validation['passed']}",
+        f"- Failed: {validation['failed']}",
+        f"- Critical failures (English response in non-English locale): {validation['critical']}",
+    ]
+
+    failures = [d for d in validation["details"] if d["status"] == "FAIL"]
+    if failures:
+        lines += ["", "### Failures detail"]
+        for d in failures:
+            lines.append(f"\n**{d['name']}** ({d['locale']} — {d['language']})")
+            lines.append(f"- Utterance: {d['utterance']}")
+            if d["heuristic_issue"]:
+                lines.append(f"- Heuristic: {d['heuristic_issue']}")
+            if d["llm_overall"] and d["llm_overall"] not in ("skipped", None):
+                lines.append(f"- LLM overall_evaluation: {d['llm_overall']}")
+                for issue in d.get("llm_issues", [])[:5]:
+                    lines.append(f"  - [{issue.get('issue_type', '?')}] {issue.get('description', '')[:120]}")
+                    if issue.get("suggestion"):
+                        lines.append(f"    Suggestion: {issue['suggestion'][:100]}")
+            if d["llm_error"]:
+                lines.append(f"- LLM error: {d['llm_error']}")
+            if d["response_excerpt"]:
+                lines.append(f"- Response excerpt: `{d['response_excerpt']}...`")
+
+    return "\n".join(lines)
+
+
+# ---------------------------------------------------------------------------
+# Entry point
+# ---------------------------------------------------------------------------
+
+def main():
+    parser = argparse.ArgumentParser(description="Validate locale responses from sf agent test results")
+    parser.add_argument("--results", required=True, help="Path to sf agent test JSON results file")
+    parser.add_argument("--locales", nargs="+", default=["ja", "fr", "it", "de", "es", "es_MX", "pt_BR"])
+    parser.add_argument("--output", help="Path to write markdown report (default: stdout)")
+    parser.add_argument("--agent-name", default="Agent", help="Agent name for the report header")
+    parser.add_argument("--spec", default="", help="Path to testSpec YAML — used to infer per-utterance locale and preserve spec row order in the report")
+
+    llm_group = parser.add_argument_group("LLM validation (Python reimplementation of EvalLocaleTestUtil.languagePrompt)")
+    llm_group.add_argument(
+        "--llm-validate", action="store_true",
+        help="Enable LLM-as-judge validation on top of heuristic checks",
+    )
+    llm_group.add_argument(
+        "--llm-provider", choices=["anthropic", "openai"], default="anthropic",
+        help="LLM provider: anthropic (default) or openai",
+    )
+    llm_group.add_argument(
+        "--llm-endpoint",
+        default="https://api.openai.com/v1/chat/completions",
+        help="OpenAI-compatible chat completions URL (only used with --llm-provider openai)",
+    )
+    llm_group.add_argument("--llm-api-key", default="", help="API key (or set ANTHROPIC_API_KEY / OPENAI_API_KEY env var)")
+    llm_group.add_argument(
+        "--llm-model", default="",
+        help="Model name (default: claude-haiku-4-5 for anthropic, gpt-4o for openai)",
+    )
+    llm_group.add_argument("--llm-call-delay", type=float, default=1.0, help="Seconds to wait between LLM calls (default: 1.0, increase to 3-5 for low rate-limit keys)")
+
+    args = parser.parse_args()
+
+    import os
+
+    # Apply provider-specific defaults for model and API key env var
+    if args.llm_validate:
+        if args.llm_provider == "anthropic":
+            if not args.llm_model:
+                args.llm_model = "claude-haiku-4-5"
+            if not args.llm_api_key:
+                args.llm_api_key = os.environ.get("ANTHROPIC_API_KEY", "")
+            if not args.llm_api_key:
+                try:
+                    args.llm_api_key = input(
+                        "\nANTHROPIC_API_KEY is not set. "
+                        "Enter your Anthropic API key to continue: "
+                    ).strip()
+                except (EOFError, KeyboardInterrupt):
+                    print("\nAborted.", file=sys.stderr)
+                    sys.exit(1)
+        else:
+            if not args.llm_model:
+                args.llm_model = "gpt-4o"
+            if not args.llm_api_key:
+                args.llm_api_key = os.environ.get("OPENAI_API_KEY", "")
+            if not args.llm_api_key:
+                try:
+                    args.llm_api_key = input(
+                        "\nOPENAI_API_KEY is not set. "
+                        "Enter your OpenAI API key to continue: "
+                    ).strip()
+                except (EOFError, KeyboardInterrupt):
+                    print("\nAborted.", file=sys.stderr)
+                    sys.exit(1)
+
+        if not args.llm_api_key:
+            print("Error: no key provided — aborting LLM validation.", file=sys.stderr)
+            sys.exit(1)
+
+    results_path = Path(args.results)
+    if not results_path.exists():
+        print(f"Error: results file not found: {args.results}", file=sys.stderr)
+        sys.exit(1)
+
+    # Results and reports are multilingual; read/write UTF-8 explicitly so the
+    # platform default codec (e.g. cp1252 on Windows) cannot mangle them
+    results_data = json.loads(results_path.read_text(encoding="utf-8"))
+    entries = extract_responses_from_results(results_data)
+
+    if not entries:
+        print("Warning: no test case entries found in results file", file=sys.stderr)
+
+    llm_config = None
+    if args.llm_validate:
+        llm_config = {
+            "provider": args.llm_provider,
+            "endpoint": args.llm_endpoint,
+            "api_key": args.llm_api_key,
+            "model": args.llm_model,
+            "call_delay": args.llm_call_delay,
+        }
+        location = "api.anthropic.com" if args.llm_provider == "anthropic" else args.llm_endpoint
+        print(f"LLM validation enabled — provider: {args.llm_provider}, model: {args.llm_model}, endpoint: {location}", file=sys.stderr)
+
+    spec_order = _load_spec_order(args.spec) if args.spec else []
+    validation = validate_responses(entries, args.locales, llm_config=llm_config, spec_order=spec_order)
+    report = render_report(validation, args.agent_name)
+
+    if args.output:
+        Path(args.output).write_text(report, encoding="utf-8")
+        print(f"Report written to {args.output}")
+    else:
+        print(report)
+
+    if validation["critical"] > 0:
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()