Commit 2a8e1d4

@W-22264356-first draft of the locale-validation skill
1 parent 4fa57e3 commit 2a8e1d4

5 files changed

Lines changed: 1811 additions & 0 deletions

skills/locale-validation/README.md

Lines changed: 347 additions & 0 deletions
@@ -0,0 +1,347 @@
# locale-validation

A skill for validating Agentforce agent responses across multiple locales. Given any agent script (`.agent` file) or `genAiPluginMetadata`, it derives realistic test utterances per topic, translates them into the target languages, runs them against the agent in preview or batch mode, and validates that responses are in the correct language.

## Target locales

`ja` · `fr` · `it` · `de` · `es` · `es_MX` · `pt_BR`

Override the set at any time by telling Claude which locales you want.

---

## Files

```
locale-validation/
├── SKILL.md                          # Skill definition & 5-phase workflow
├── README.md                         # This file
├── references/
│   ├── adlc-mode.md                  # sf agent preview + sf agent test execution guide
│   └── fit-tests-mode.md             # Maven/JUnit utterances.json + test class guide
└── scripts/
    └── validate_locale_responses.py  # Python validator for batch test results
```

---

## How to invoke

This skill is automatically triggered when Claude detects locale/language testing intent. You can also invoke it explicitly.

### Natural language triggers (automatic)

```
"Run locale validation on MySDRAgent"
"Test MyAgent in Japanese, French, and German"
"Generate multilingual test cases for EngagementAgent"
"Create a batch locale test suite for MyAgent"
"Validate that MyAgent responds in Spanish"
"Check if additional_locales are working in MyAgent"
```

If your `.agent` file has `additional_locales` set and you ask about testing or quality, the skill also activates proactively.

### Explicit invocation

In any Claude Code session with `agentforce-adlc` loaded:

```
/locale-validation MyAgent.agent
```

Or with arguments:

```
/locale-validation force-app/main/default/aiAuthoringBundles/MyAgent/MyAgent.agent --locales ja fr de --mode preview
```

---

## Workflow

The skill runs five phases (plus an automatic locale gate after Phase 1). It pauses at Phase 1b if a patch is needed, and again after Phase 2 for utterance review.

| Phase | What happens |
|-------|-------------|
| 1. Introspect | Reads `.agent` or `genAiPluginMetadata`, extracts topics + actions |
| **1b. Check `additional_locales`** | Reads the `language:` block. If `additional_locales` is missing or empty, **asks you which locales to add** and patches the `.agent` file before continuing |
| **1c. Check language-response instruction** | Searches `system.instructions` for the language-response rule. If missing, **patches it automatically** (no confirmation needed) so the agent responds in the user's language |
| 2. Derive utterances | Generates 2–3 English utterances per topic, **shows you for review** |
| 3. Translate | Translates each utterance into all target locales |
| 4. Run tests | Executes via `sf agent preview` (Mode A) or `sf agent test` (Mode B). **In Mode B, always writes both a `testSpec.yaml` and a companion `-input.csv`** for manual Testing Center UI upload |
| 5. Validate & report | Checks responses for correct language, reports ✅/❌ per locale per topic |

### Phase 1b: `additional_locales` patch detail

When the skill detects a missing or empty `additional_locales`, it asks:

> "The agent script does not declare any `additional_locales`. Which locales should I add?
> Default set: `ja, fr, it, de, es, es_MX, pt_BR`
> Reply with the list you want or say **"use defaults"**."

It then writes the confirmed locales into the `language:` block using the required format: a **quoted comma-separated string with no spaces**, with 4-space indentation:

```
language:
    default_locale: "en_US"
    additional_locales: "ja,fr,de"
```

The patched locales become the working locale set for the rest of the workflow (merged with any `--locales` argument you passed).

93+
---
94+
95+
## Execution modes
96+
97+
### Mode A — Preview (smoke testing)
98+
99+
Best for iterative development. Runs `sf agent preview` per locale and extracts responses from local trace files.
100+
101+
```
102+
"Run locale validation on MyAgent in preview mode"
103+
"Quick locale smoke test for MyAgent"
104+
```
105+
106+
Claude reads `references/adlc-mode.md` for the exact `sf agent preview` commands.
107+
108+
### Mode B — Batch (regression testing)
109+
110+
Best for CI/CD and regression suites. Generates a `test-spec-locales.yaml` and runs it via `sf agent test`.
111+
112+
```
113+
"Create a batch locale test suite for MyAgent"
114+
"Run locale regression tests for MyAgent in batch mode"
115+
```
116+
117+
### fit-tests mode
118+
119+
When working in the `einstein-copilot-fit-tests` Maven project (see that repo's skill copy for full detail):
120+
121+
```
122+
"Generate multilingual eval data for EngagementAgent"
123+
```
124+
125+
---
126+
## Validation logic

The validator flags two severity levels:

| Severity | Condition | Example |
|----------|-----------|---------|
| **CRITICAL** | Response is in English when target locale ≠ `en_US` | `ja` utterance → English response |
| **Warning** | No locale-specific characters detected in a Latin-script response | `fr` utterance → no accented characters in a long response |

For Japanese, Chinese, Arabic, and Korean, the validator uses Unicode range detection (no LLM call needed for the critical check). For Latin-script languages, it looks for locale-specific accented characters.

For deep LLM-backed validation, the skill uses the same prompt template as `EvalLocaleTestUtil.languagePrompt` from `einstein-copilot-fit-tests`.

---

## Using the Python validator directly

After a batch run, validate a results JSON file without Claude.

### Input JSON format

The script auto-detects two input formats.

**Format 1: Raw `sf agent test results` output** (detected automatically):

```bash
sf agent test results --json --job-id <JOB_ID> --result-format json -o <org> \
  | tee /tmp/results.json
```

The script reads `inputs.utterance`, `generatedData.outcome`, and `generatedData.topic` from each test case. Because the raw format has no `locale` field, you must pass `--locales` to specify which locale(s) to assert against; the same locale is applied to all entries.

**Format 2: Custom intermediate format** (used by fit-tests and Claude-generated results):

```json
{
  "result": {
    "testCases": [
      {
        "testCaseName": "test_ja_web_reply",
        "locale": "ja",
        "utterance": "製品について教えてください",
        "botResponse": "Weloは...",
        "status": "pass",
        "topic": "web_reply"
      }
    ]
  }
}
```

Required fields: `locale`, `botResponse` (or `response` as fallback). Optional: `testCaseName`, `utterance`, `status`, `topic`.

The format is detected per test case, so mixed files (some raw, some custom) are handled correctly.

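Per-test-case detection of the two formats might look like the sketch below. The function `normalize_case` is hypothetical, and using the presence of `inputs` or `generatedData` as the marker of the raw format is an assumption; the actual script may detect the format differently.

```python
def normalize_case(case, fallback_locale=None):
    """Flatten one test case into a common record, whichever format it uses."""
    # Assumed marker of the raw `sf agent test results` shape.
    if "inputs" in case or "generatedData" in case:
        inputs = case.get("inputs", {})
        generated = case.get("generatedData", {})
        return {
            "locale": fallback_locale,  # raw cases carry no locale of their own
            "utterance": inputs.get("utterance"),
            "response": generated.get("outcome"),
            "topic": generated.get("topic"),
        }
    # Custom intermediate format: `locale` is required, `response` is a fallback.
    return {
        "locale": case["locale"],
        "utterance": case.get("utterance"),
        "response": case.get("botResponse", case.get("response")),
        "topic": case.get("topic"),
    }


if __name__ == "__main__":
    raw = {"inputs": {"utterance": "hola"}, "generatedData": {"outcome": "¡Hola!", "topic": "greeting"}}
    custom = {"locale": "fr", "response": "Bonjour."}
    print(normalize_case(raw, fallback_locale="es"))
    print(normalize_case(custom))
```

Because each case is normalized on its own, a file that mixes both shapes needs no special handling.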
### Options reference

| Option | Default | Description |
|--------|---------|-------------|
| `--results <path>` | *(required)* | Path to the JSON results file (see format above) |
| `--spec <path>` | *(optional)* | Path to the testSpec YAML. When provided, per-utterance `locale:` fields are used for exact locale assignment and report rows are sorted to match spec order |
| `--locales <codes...>` | `ja fr it de es es_MX pt_BR` | Space-separated locale codes to validate. Pass a subset to limit scope. |
| `--agent-name <name>` | `Agent` | Agent name shown in the report header |
| `--output <path>` | stdout | Write markdown report to a file instead of printing it |
| `--llm-validate` | off | Enable LLM-as-judge validation on top of heuristic checks |
| `--llm-endpoint <url>` | `https://api.openai.com/v1/chat/completions` | Any OpenAI-compatible chat completions URL |
| `--llm-api-key <key>` | `$OPENAI_API_KEY` → interactive prompt | API key. Omit to read from `OPENAI_API_KEY`; if unset, the script prompts for it at runtime |
| `--llm-model <name>` | `gpt-4o` | Model name passed to the endpoint |

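For readers who want to see how these options fit together, here is a sketch of an equivalent `argparse` declaration. It mirrors the table only; `build_parser` and `resolve_api_key` are hypothetical names, and the real script may be structured differently.

```python
import argparse
import getpass
import os

DEFAULT_LOCALES = ["ja", "fr", "it", "de", "es", "es_MX", "pt_BR"]


def build_parser():
    """Declare the options from the table with their documented defaults."""
    parser = argparse.ArgumentParser(description="Validate agent responses per locale")
    parser.add_argument("--results", required=True, help="JSON results file")
    parser.add_argument("--spec", help="testSpec YAML (optional)")
    parser.add_argument("--locales", nargs="+", default=DEFAULT_LOCALES)
    parser.add_argument("--agent-name", default="Agent")
    parser.add_argument("--output", help="report path; stdout when omitted")
    parser.add_argument("--llm-validate", action="store_true")
    parser.add_argument("--llm-endpoint", default="https://api.openai.com/v1/chat/completions")
    parser.add_argument("--llm-api-key")
    parser.add_argument("--llm-model", default="gpt-4o")
    return parser


def resolve_api_key(args, env=os.environ, prompt=getpass.getpass):
    """Flag first, then OPENAI_API_KEY, then an interactive prompt."""
    return args.llm_api_key or env.get("OPENAI_API_KEY") or prompt("OpenAI API key: ")


if __name__ == "__main__":
    args = build_parser().parse_args(["--results", "results.json", "--locales", "ja", "fr"])
    print(args.locales, args.agent_name, args.llm_model)
```

The key lookup is kept in its own function so the environment and the prompt can be swapped out in tests.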
### Heuristic-only (fast, no API cost)

```bash
python3 skills/locale-validation/scripts/validate_locale_responses.py \
  --results /tmp/locale-test-results.json \
  --locales ja fr it de es es_MX pt_BR \
  --agent-name MyAgent \
  --output /tmp/locale-validation-report.md
```

### With LLM-as-judge (uses `EvalLocaleTestUtil.languagePrompt`)

```bash
# Optional: if not set, the script will prompt at runtime
export OPENAI_API_KEY=sk-...

python3 skills/locale-validation/scripts/validate_locale_responses.py \
  --results /tmp/locale-test-results.json \
  --locales ja fr de \
  --agent-name MyAgent \
  --llm-validate \
  --output /tmp/locale-validation-report.md
```

Override model or endpoint:

```bash
# Azure OpenAI
python3 skills/locale-validation/scripts/validate_locale_responses.py \
  --results /tmp/locale-test-results.json \
  --locales ja fr \
  --llm-validate \
  --llm-endpoint "https://<resource>.openai.azure.com/openai/deployments/<deployment>/chat/completions?api-version=2024-02-01" \
  --llm-api-key "$AZURE_OPENAI_KEY" \
  --llm-model gpt-4o

# Different model
python3 skills/locale-validation/scripts/validate_locale_responses.py \
  --results /tmp/locale-test-results.json \
  --locales es es_MX \
  --llm-validate \
  --llm-model gpt-4-turbo
```

### Supported locales

| Code | Language | Detection method |
|------|----------|-----------------|
| `ja` | Japanese | Unicode range (U+3040–U+30FF, U+4E00–U+9FFF) |
| `zh_CN` | Chinese (Simplified) | Unicode range (U+4E00–U+9FFF) |
| `zh_TW` | Chinese (Traditional) | Unicode range (U+4E00–U+9FFF) |
| `ar` | Arabic | Unicode range (U+0600–U+06FF) |
| `ko` | Korean | Unicode range (U+AC00–U+D7AF) |
| `fr` | French | Accent characters (à â ç é è ê ë…) |
| `fr_CA` | French (Canadian) | LLM-only (no heuristic accent check) |
| `de` | German | Accent characters (ä ö ü ß) |
| `es` | Spanish | Accent characters (á é í ó ú ñ ¿ ¡) |
| `es_MX` | Spanish (Mexico) | Accent characters (same as `es`) |
| `it` | Italian | Accent characters (à è é ì ò ù) |
| `pt_BR` | Portuguese (Brazil) | Accent characters (ã õ á é â ô ç…) |
| `pt_PT` | Portuguese (European) | Accent characters (same as `pt_BR`) |
| `en_US` | English | Skipped (no-op) |
| `en_GB` | English (UK) | Skipped (no-op) |

**Note on `fr_CA`:** The heuristic has no accent check for `fr_CA`, even though Canadian French uses the same accents as French. Use `--llm-validate` for reliable `fr_CA` validation.

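The table can be read as a small lookup-driven check. The sketch below is an illustration, not the validator's code: `check_response` is a hypothetical name, the accent sets contain only the characters listed above (the table elides the rest), and mapping "script missing" to critical and "no accents" to warning simplifies the real rules, which also consider response length and English detection.

```python
import re

# Unicode ranges copied from the table above.
SCRIPT_PATTERNS = {
    "ja": re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]"),
    "zh_CN": re.compile(r"[\u4e00-\u9fff]"),
    "zh_TW": re.compile(r"[\u4e00-\u9fff]"),
    "ar": re.compile(r"[\u0600-\u06ff]"),
    "ko": re.compile(r"[\uac00-\ud7af]"),
}

# Only the characters the table lists; the real sets may be larger.
ACCENTS = {
    "fr": "àâçéèêë",
    "de": "äöüß",
    "es": "áéíóúñ¿¡",
    "es_MX": "áéíóúñ¿¡",
    "it": "àèéìòù",
    "pt_BR": "ãõáéâôç",
    "pt_PT": "ãõáéâôç",
}


def check_response(locale, response):
    """Return 'ok', 'critical', 'warning', or 'skipped' for one response."""
    if locale in ("en_US", "en_GB"):
        return "skipped"
    pattern = SCRIPT_PATTERNS.get(locale)
    if pattern is not None:
        return "ok" if pattern.search(response) else "critical"
    accents = ACCENTS.get(locale)
    if accents is None:
        return "skipped"  # e.g. fr_CA: no heuristic, needs the LLM judge
    return "ok" if any(ch in accents for ch in response.lower()) else "warning"


if __name__ == "__main__":
    print(check_response("ja", "Your order is ready."))
    print(check_response("fr", "Votre commande est prête."))
```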
### Exit codes

| Code | Meaning |
|------|---------|
| `0` | All validations passed |
| `1` | One or more **critical** failures (English response in non-English locale); use as a CI gate |

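A CI step can consume these exit codes without parsing the report. The wrapper below is a sketch: `run_gate` is a hypothetical helper, and the command it runs here is a stand-in Python one-liner rather than the real validator invocation.

```python
import subprocess
import sys


def run_gate(command):
    """Run a validator command; True means exit code 0 (gate passed)."""
    completed = subprocess.run(command)
    return completed.returncode == 0


if __name__ == "__main__":
    # Stand-in for the validator: a child process that exits with code 1.
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    if not run_gate(failing):
        print("locale gate failed: critical findings present")
```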
---

## Validating the skill itself

### Quick sanity check

1. Open a Claude Code session in the `agentforce-adlc` directory
2. Verify the skill is loaded:
   ```
   What skills do you have available?
   ```
   You should see `locale-validation` in the list.

3. Trigger it with a test prompt that should activate it:
   ```
   I want to test MyAgent in Japanese and French.
   ```
   Claude should respond by starting the 5-phase locale validation workflow (Phase 1: introspect).

4. Trigger it with a prompt that should NOT activate it:
   ```
   Deploy my agent to production.
   ```
   Claude should use `developing-agentforce` instead, not `locale-validation`.

### Functional validation with an agent file

Use the example agent bundled in this repo:

```bash
# 1. Start Claude Code in agentforce-adlc
cd /path/to/agentforce-adlc
claude

# 2. In the Claude session (not the shell), ask Claude to run locale validation on the example agent
"Run locale validation on force-app/main/default/aiAuthoringBundles/MS_Agent_hp_Apr22_adlc/MS_Agent_hp_Apr22_adlc.agent. Just derive and show me the utterances, don't run them yet."
```

Expected: Claude reads the `.agent` file, lists topics, proposes 2–3 English utterances per topic, and waits for your go-ahead before translating or running anything.

### End-to-end batch validation

```bash
# Requires: authenticated SF org, agent deployed
# Prompt for the Claude session (not a shell command):
"Generate a locale test spec YAML for MS_Agent_hp_Apr22_adlc and run it in batch mode against org my-dev-org."
```

Expected output includes a `test-spec-locales.yaml` with utterances in all 7 default locales, the `sf agent test` commands, and a summary table.

### Validate the Python script

```bash
# Create a minimal mock results file
python3 -c "
import json
mock = {'result': {'testCases': [
    {'testCaseName': 'test_ja', 'locale': 'ja', 'utterance': 'check order', 'botResponse': 'Your order is ready.', 'status': 'pass'},
    {'testCaseName': 'test_fr', 'locale': 'fr', 'utterance': 'check order', 'botResponse': 'Votre commande est prête.', 'status': 'pass'},
]}}
print(json.dumps(mock))
" > /tmp/mock-results.json

python3 skills/locale-validation/scripts/validate_locale_responses.py \
  --results /tmp/mock-results.json \
  --locales ja fr \
  --agent-name TestAgent
```

Expected: `test_ja` is flagged as CRITICAL (English response for Japanese locale) and `test_fr` passes.

---

## Related skills

| Skill | When to use instead |
|-------|-------------------|
| `testing-agentforce` | General agent testing without locale focus |
| `developing-agentforce` | Authoring/editing `.agent` files |
| `observing-agentforce` | Analyzing production session traces for locale failures |
