
Commit 34511d9

feat: add support for input overrides for multimodal evals (#1101)

Authored by AAgnihotry and claude
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent 13adc91

16 files changed: 2805 additions & 4 deletions

Lines changed: 206 additions & 0 deletions
@@ -0,0 +1,206 @@
---
allowed-tools: Read, Write, Bash, Glob, Grep, Edit
description: Create a new integration test case following the testcases/ pattern
argument-hint: <test-name> <description>
---

I'll help you create a new integration test case following the established pattern in the `testcases/` directory, similar to the `eval-input-overrides` example from PR #1101.

## Understanding the Test Structure

Based on the existing test pattern, each integration test should have:

- `run.sh` - Main test execution script
- `pyproject.toml` - Python dependencies
- `entry-points.json` - Entry point configuration
- `uipath.json` - UiPath configuration
- `src/` directory containing:
  - Evaluation set JSON files
  - Input/configuration JSON files
  - `assert.py` - Validation script

## Step 1: Gather Information

I need to understand what you're testing. Please provide:

1. **Test Name**: A descriptive name for your test (e.g., "eval-multimodal-inputs")
2. **Test Purpose**: What feature or scenario are you testing?
3. **Evaluation Set**: What evaluations will run?
4. **Expected Behavior**: What should the test verify?

Let me check the existing testcases structure:

!ls -1 testcases/

## Step 2: Create Test Directory Structure

Based on your test name `${test-name}`, I'll create:

```bash
testcases/${test-name}/
├── run.sh
├── pyproject.toml
├── entry-points.json
├── uipath.json
└── src/
    ├── eval-set.json
    ├── config.json (if needed)
    └── assert.py
```

Let me read the reference implementation to understand the pattern:

!cat testcases/eval-input-overrides/run.sh
!cat testcases/eval-input-overrides/pyproject.toml
!cat testcases/eval-input-overrides/src/assert.py

## Step 3: Create the Test Files

I'll create each file following the established pattern:

### 1. run.sh - Test Execution Script

```bash
#!/bin/bash
set -e

echo "Syncing dependencies..."
uv sync

echo ""
echo "Running ${test-name} integration test..."
echo ""

# Create output directory
mkdir -p __uipath

# Run evaluations
uv run uipath eval main src/eval-set.json \
  --no-report \
  --output-file __uipath/output.json

echo ""
echo "Test completed! Verifying results..."
echo ""

# Run assertion script to verify results
uv run python src/assert.py

echo ""
echo "${test-name} integration test completed successfully!"
```

### 2. pyproject.toml - Dependencies

```toml
[project]
name = "${test-name}"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "uipath>=2.4.0",
]

[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"
```

### 3. entry-points.json - Entry Points Configuration

```json
{
    "main": "src/main.json"
}
```

### 4. uipath.json - UiPath Configuration

```json
{
    "name": "${test-name}",
    "version": "1.0.0"
}
```

### 5. src/eval-set.json - Evaluation Set

(You'll need to provide the specific evaluation configuration)

### 6. src/assert.py - Validation Script

```python
"""Assertions for ${test-name} testcase."""
import json
import os


def main() -> None:
    """Main assertion logic."""
    output_file = "__uipath/output.json"

    assert os.path.isfile(output_file), (
        f"Evaluation output file '{output_file}' not found"
    )
    print(f"✓ Found evaluation output file: {output_file}")

    with open(output_file, "r", encoding="utf-8") as f:
        output_data = json.load(f)

    print("✓ Loaded evaluation output")

    # Add your specific assertions here
    assert "evaluationSetResults" in output_data

    evaluation_results = output_data["evaluationSetResults"]
    assert len(evaluation_results) > 0, "No evaluation results found"

    print(f"✓ Found {len(evaluation_results)} evaluation result(s)")

    # Add test-specific validations

    print("\n✅ All assertions passed!")


if __name__ == "__main__":
    main()
```

## Step 4: Make run.sh Executable

!chmod +x testcases/${test-name}/run.sh

## Step 5: Test the Integration Test

Let's validate the test runs correctly:

!cd testcases/${test-name} && ./run.sh

## Step 6: Add to Documentation

Consider documenting your test in the project README or test documentation:

- What scenario it tests
- How to run it manually
- What it validates

---

## Summary

Your new integration test `${test-name}` has been created following the established pattern:

- **Directory Structure**: Matches the testcases/ pattern
- **Dependencies**: Configured in pyproject.toml
- **Test Script**: run.sh with proper error handling
- **Assertions**: Validation logic in assert.py
- **Configuration**: UiPath and entry points configured

## Next Steps

1. **Customize** the eval-set.json with your specific test data
2. **Update** assert.py with test-specific validations
3. **Run** the test: `cd testcases/${test-name} && ./run.sh`
4. **Document** the test purpose and usage
5. **Commit** the new test to version control

## Tips

- Keep tests focused on a single feature or scenario
- Use descriptive evaluation names in eval-set.json
- Add clear assertion messages for debugging
- Follow the existing echo statement pattern for progress output
- Ensure all JSON files are properly formatted

Need help customizing any specific part of the test? Just ask!
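The validation logic in the assert.py template can be smoke-tested without running a real evaluation by pointing the same checks at a synthetic output file. A minimal sketch — only the `evaluationSetResults` key comes from the template above; the per-result fields (`evalId`, `score`) are hypothetical:

```python
import json
import os
import tempfile

# Synthetic output shaped like the file assert.py expects. The fields
# inside each result ("evalId", "score") are invented for this demo.
fake_output = {"evaluationSetResults": [{"evalId": "eval-1", "score": 1.0}]}

with tempfile.TemporaryDirectory() as tmp:
    output_file = os.path.join(tmp, "output.json")
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(fake_output, f)

    # Same checks as the template's main()
    assert os.path.isfile(output_file), (
        f"Evaluation output file '{output_file}' not found"
    )
    with open(output_file, "r", encoding="utf-8") as f:
        output_data = json.load(f)
    assert "evaluationSetResults" in output_data
    results = output_data["evaluationSetResults"]
    assert len(results) > 0, "No evaluation results found"
    print(f"Found {len(results)} evaluation result(s)")
```

This keeps the assertion script honest before wiring it into run.sh, where the real `__uipath/output.json` is produced by `uipath eval`.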

pyproject.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,6 +1,6 @@
 [project]
 name = "uipath"
-version = "2.4.21"
+version = "2.4.22"
 description = "Python SDK and CLI for UiPath Platform, enabling programmatic interaction with automation services, process management, and deployment tools."
 readme = { file = "README.md", content-type = "text/markdown" }
 requires-python = ">=3.11"
```
Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
```python
"""Utility functions for applying input overrides to evaluation inputs."""

import copy
import logging
from typing import Any

logger = logging.getLogger(__name__)


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Recursively merge override into base dictionary.

    Args:
        base: The base dictionary to merge into
        override: The override dictionary to merge from

    Returns:
        A new dictionary with overrides recursively merged into base
    """
    result = copy.deepcopy(base)
    for key, value in override.items():
        if key in result and isinstance(result[key], dict) and isinstance(value, dict):
            # Recursively merge nested dicts
            result[key] = deep_merge(result[key], value)
        else:
            # Direct replacement for non-dict or new keys
            result[key] = value
    return result


def apply_input_overrides(
    inputs: dict[str, Any],
    input_overrides: dict[str, Any],
    eval_id: str | None = None,
) -> dict[str, Any]:
    """Apply input overrides to inputs using direct field override.

    Format: Per-evaluation overrides (keys are evaluation IDs):
        {"eval-1": {"operator": "*"}, "eval-2": {"a": 100}}

    Deep merge is supported for nested objects:
        - {"filePath": {"ID": "new-id"}} - deep merges inputs["filePath"] with {"ID": "new-id"}

    Args:
        inputs: The original inputs dictionary
        input_overrides: Dictionary mapping evaluation IDs to their override values
        eval_id: The evaluation ID (required)

    Returns:
        A new dictionary with overrides applied
    """
    if not input_overrides:
        return inputs

    if not eval_id:
        logger.warning(
            "eval_id not provided, cannot apply input overrides. Input overrides require eval_id."
        )
        return inputs

    result = copy.deepcopy(inputs)

    # Check if there are overrides for this specific eval_id
    if eval_id not in input_overrides:
        logger.debug(f"No overrides found for eval_id='{eval_id}'")
        return result

    overrides_to_apply = input_overrides[eval_id]
    logger.debug(f"Applying overrides for eval_id='{eval_id}': {overrides_to_apply}")

    # Apply direct field overrides with recursive deep merge
    for key, value in overrides_to_apply.items():
        if key in result and isinstance(result[key], dict) and isinstance(value, dict):
            # Recursive deep merge for dict values
            result[key] = deep_merge(result[key], value)
        else:
            # Direct replacement for non-dict or new keys
            result[key] = value

    return result
```

src/uipath/_cli/_evals/_runtime.py

Lines changed: 14 additions & 2 deletions
```diff
@@ -67,6 +67,7 @@
 from .._utils._eval_set import EvalHelpers
 from .._utils._parallelization import execute_parallel
 from ._configurable_factory import ConfigurableRuntimeFactory
+from ._eval_util import apply_input_overrides
 from ._evaluator_factory import EvaluatorFactory
 from ._models._evaluation_set import (
     EvaluationItem,
@@ -262,6 +263,7 @@ class UiPathEvalContext:
     verbose: bool = False
     enable_mocker_cache: bool = False
     report_coverage: bool = False
+    input_overrides: dict[str, Any] | None = None
     model_settings_id: str = "default"


@@ -524,7 +526,10 @@ async def _execute_eval(
                 ),
             )
             agent_execution_output = await self.execute_runtime(
-                eval_item, execution_id, runtime
+                eval_item,
+                execution_id,
+                runtime,
+                input_overrides=self.context.input_overrides,
             )
         except Exception as e:
             if self.context.verbose:
@@ -759,6 +764,7 @@ async def execute_runtime(
         eval_item: EvaluationItem,
         execution_id: str,
         runtime: UiPathRuntimeProtocol,
+        input_overrides: dict[str, Any] | None = None,
     ) -> UiPathEvalRunExecutionOutput:
         log_handler = self._setup_execution_logging(execution_id)
         attributes = {
@@ -785,8 +791,14 @@ async def execute_runtime(

         start_time = time()
         try:
+            # Apply input overrides to inputs if configured
+            inputs_with_overrides = apply_input_overrides(
+                eval_item.inputs,
+                input_overrides or {},
+                eval_id=eval_item.id,
+            )
             result = await execution_runtime.execute(
-                input=eval_item.inputs,
+                input=inputs_with_overrides,
             )
         except Exception as e:
             end_time = time()
```

src/uipath/_cli/cli_eval.py

Lines changed: 10 additions & 0 deletions
```diff
@@ -1,6 +1,7 @@
 import ast
 import asyncio
 import os
+from typing import Any

 import click
 from uipath.core.tracing import UiPathTraceManager
@@ -113,6 +114,12 @@ def setup_reporting_prereq(no_report: bool) -> bool:
     default=20,
     help="Maximum concurrent LLM requests (default: 20)",
 )
+@click.option(
+    "--input-overrides",
+    cls=LiteralOption,
+    default="{}",
+    help='Input field overrides per evaluation ID: \'{"eval-1": {"operator": "*"}, "eval-2": {"a": 100}}\'. Supports deep merge for nested objects.',
+)
 def eval(
     entrypoint: str | None,
     eval_set: str | None,
@@ -126,6 +133,7 @@ def eval(
     model_settings_id: str,
     trace_file: str | None,
     max_llm_concurrency: int,
+    input_overrides: dict[str, Any],
 ) -> None:
     """Run an evaluation set against the agent.
@@ -141,6 +149,7 @@ def eval(
         model_settings_id: Model settings ID to override agent settings
         trace_file: File path where traces will be written in JSONL format
         max_llm_concurrency: Maximum concurrent LLM requests
+        input_overrides: Input field overrides mapping (direct field override with deep merge)
     """
     set_llm_concurrency(max_llm_concurrency)

@@ -178,6 +187,7 @@ def eval(
     eval_context.eval_ids = eval_ids
     eval_context.report_coverage = report_coverage
     eval_context.model_settings_id = model_settings_id
+    eval_context.input_overrides = input_overrides

     try:
```

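The flag arrives as a string on the command line; given the `ast` import above, `LiteralOption` presumably parses it via `ast.literal_eval` into the per-evaluation dict that `eval_context.input_overrides` receives — that parsing mechanism is an assumption here. A sketch of the step under that assumption:

```python
import ast

# What a user would pass on the command line:
#   uipath eval ... --input-overrides '{"eval-1": {"operator": "*"}, "eval-2": {"a": 100}}'
raw = '{"eval-1": {"operator": "*"}, "eval-2": {"a": 100}}'

# ast.literal_eval evaluates only Python literals, so the string becomes
# a dict without executing arbitrary code.
input_overrides = ast.literal_eval(raw)

print(input_overrides["eval-1"])
# → {'operator': '*'}

# The option's default of "{}" parses to an empty mapping, i.e. no overrides.
print(ast.literal_eval("{}"))
# → {}
```

The outer keys are evaluation IDs, matching the `eval_id` lookup performed by `apply_input_overrides` at runtime.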
0 commit comments