Exposing Behavior test run as a tool call to LLMs#949
Conversation
|
Codeflash Bot seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Explore these optional code suggestions:
|
⚡️ Codeflash found optimizations for this PR📄 128% (1.28x) speedup for
|
|
@claude what are the flaws in this approach? |
PR Review SummaryPrek Checks✅ All prek checks passing after fixes. Fixes applied (committed and pushed):
Mypy✅ All 20 mypy errors resolved (0 remaining). Code ReviewCritical issues found:
Test CoverageAll changed files are new (did not exist on
Last updated: 2026-02-17 |
|
@claude what are the flaws in this approach? |
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| pytest_timeout=30, | ||
| ) | ||
|
|
||
| assert result["success"] is True |
There was a problem hiding this comment.
Bug: test_run_behavioral_tests_tool_pytest is failing in CI — result["success"] is False. The test doesn't print result["error"] on failure, making it hard to diagnose. Consider adding a diagnostic assertion message:
| assert result["success"] is True | |
| assert result["success"] is True, f"Test run failed: {result.get('error')}" |
| project_root: str, | ||
| test_framework: str = "pytest", | ||
| pytest_timeout: int | None = 30, | ||
| verbose: bool = False, |
There was a problem hiding this comment.
Bug: The verbose parameter is accepted and advertised in the tool schema (line 119) but is silently ignored — it was removed from the run_behavioral_tests() call (which never accepted it). Either remove verbose from the function signature and schema, or implement its intended behavior.
| class TestFileInput(BaseModel): | ||
| """Input schema for a single test file.""" | ||
|
|
||
| test_file_path: str = Field(description="Absolute path to the test file to run") | ||
| test_type: str = Field( | ||
| default="existing_unit_test", | ||
| description="Type of test: 'existing_unit_test', 'generated_regression', 'replay_test', or 'concolic_coverage_test'", | ||
| ) | ||
|
|
||
|
|
||
| class RunBehavioralTestsInput(BaseModel): | ||
| """Input schema for the run_behavioral_tests tool.""" | ||
|
|
||
| test_files: list[TestFileInput] = Field(description="List of test files to run") | ||
| test_framework: str = Field(default="pytest", description="Test framework to use: 'pytest' or 'unittest'") | ||
| project_root: str = Field(description="Absolute path to the project root directory") | ||
| pytest_timeout: int | None = Field(default=30, description="Timeout in seconds for each pytest test") | ||
| verbose: bool = Field(default=False, description="Enable verbose output") | ||
|
|
||
|
|
||
| class TestResultOutput(BaseModel): | ||
| """Output schema for a single test result.""" | ||
|
|
||
| test_id: str = Field(description="Unique identifier for the test") | ||
| test_file: str = Field(description="Path to the test file") | ||
| test_function: str | None = Field(description="Name of the test function") | ||
| passed: bool = Field(description="Whether the test passed") | ||
| runtime_ns: int | None = Field(description="Runtime in nanoseconds, if available") | ||
| timed_out: bool = Field(description="Whether the test timed out") | ||
|
|
||
|
|
||
| class RunBehavioralTestsOutput(BaseModel): | ||
| """Output schema for the run_behavioral_tests tool.""" | ||
|
|
||
| success: bool = Field(description="Whether the test run completed successfully") | ||
| total_tests: int = Field(description="Total number of tests run") | ||
| passed_tests: int = Field(description="Number of tests that passed") | ||
| failed_tests: int = Field(description="Number of tests that failed") | ||
| results: list[TestResultOutput] = Field(description="Detailed results for each test") | ||
| stdout: str = Field(description="Standard output from the test run") | ||
| stderr: str = Field(description="Standard error from the test run") | ||
| error: str | None = Field(default=None, description="Error message if the run failed") | ||
|
|
There was a problem hiding this comment.
Dead code: TestFileInput, RunBehavioralTestsInput, TestResultOutput, and RunBehavioralTestsOutput are defined but never used anywhere. The actual tool function uses list[dict[str, Any]] for inputs and dict[str, Any] for outputs. Consider either removing these or using them for validation in run_behavioral_tests_tool.
| "error": None, | ||
| } | ||
|
|
||
| except Exception as e: |
There was a problem hiding this comment.
Concern: The bare except Exception catch means callers never see the actual error type — they get a stringified error in the response dict. For LLM-facing tools this might be intentional, but note that errors like FileNotFoundError, PermissionError, or subprocess.CalledProcessError are all treated identically. Consider logging the exception (e.g., via logging.exception) before returning the error dict so that issues are at least visible in server logs.
PR Type
Enhancement, Tests
Description
Expose behavioral tests as LLM tool
Add tool schema and execution registry
Provide lazy imports to avoid cycles
Add comprehensive tests for tool API
Diagram Walkthrough
File Walkthrough
__init__.py
Lazy export of verification LLM toolscodeflash/verification/init.py
llm_tools.py
LLM tool schema and behavior test wrappercodeflash/verification/llm_tools.py
run_behavioral_testsrun_behavioral_tests_toolwrappertest_llm_tools.py
Tests for verification LLM tools interfacetests/test_llm_tools.py