Exposing Behavior test run as a tool call to LLMs by aseembits93 · Pull Request #949 · codeflash-ai/codeflash

aseembits93 · 2025-12-03T04:12:26Z

PR Type

Enhancement, Tests

Description

Expose behavioral tests as LLM tool
Add tool schema and execution registry
Provide lazy imports to avoid cycles
Add comprehensive tests for tool API

Diagram Walkthrough

flowchart LR
  LLM["LLM"] -- "calls tool" --> Exec["execute_tool()"]
  Exec -- "dispatch" --> Tool["run_behavioral_tests_tool()"]
  Tool -- "invoke" --> Runner["run_behavioral_tests()"]
  Runner -- "produce JUnit XML" --> Parser["parse_test_xml()"]
  Parser -- "results" --> Tool
  Tool -- "structured dict" --> LLM

File Walkthrough

Relevant files

Enhancement

__init__.py `Lazy export of verification LLM tools` codeflash/verification/init.py Add lazy attribute loader for LLM tools Re-export tool APIs via __all__	+31/-0
llm_tools.py `LLM tool schema and behavior test wrapper` codeflash/verification/llm_tools.py Define JSON schema for `run_behavioral_tests` Implement `run_behavioral_tests_tool` wrapper Add tool registry and execution helpers Map string test types to enum	+321/-0

Tests

test_llm_tools.py `Tests for verification LLM tools interface` tests/test_llm_tools.py Add tests for tool schema and registry Validate execute_tool dispatch and errors Run real pytest samples through tool Test handling of failing and invalid paths	+193/-0

CLAassistant · 2025-12-03T04:12:32Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Codeflash Bot seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

github-actions · 2025-12-03T04:13:26Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review Possible Issue The mapping in `_test_type_from_string` uses the key `concolic_test` instead of `concolic_coverage_test` (already present). If callers pass `concolic_test` per earlier conventions, it's fine, but if schemas or other parts expect only `concolic_coverage_test`, the extra alias may mask typos. Conversely, if `concolic_test` was not intended as an alias, this could be a bug. Validate intended enum values and align with `TestType`. def _test_type_from_string(test_type_str: str) -> TestType: """Convert a string test type to TestType enum.""" mapping = { "existing_unit_test": TestType.EXISTING_UNIT_TEST, "generated_regression": TestType.GENERATED_REGRESSION, "replay_test": TestType.REPLAY_TEST, "concolic_test": TestType.CONCOLIC_COVERAGE_TEST, "concolic_coverage_test": TestType.CONCOLIC_COVERAGE_TEST, } return mapping.get(test_type_str.lower(), TestType.EXISTING_UNIT_TEST) Robustness When constructing `PYTHONPATH`, the code appends `os.pathsep + project_root_path` without checking for duplication or normalizing. Also, if `PYTHONPATH` exists but is empty, the leading separator can occur. Consider normalizing and avoiding duplicates. # Ensure PYTHONPATH includes project root if "PYTHONPATH" not in test_env: test_env["PYTHONPATH"] = str(project_root_path) else: test_env["PYTHONPATH"] += os.pathsep + str(project_root_path) Error Handling The broad `except Exception` swallows all errors and returns `success: False` with minimal context. Consider logging or including more structured error info (e.g., traceback) to aid debugging, and ensure sensitive paths are handled appropriately. except Exception as e: return { "success": False, "total_tests": 0, "passed_tests": 0, "failed_tests": 0, "results": [], "stdout": "", "stderr": "", "error": str(e), }

github-actions · 2025-12-03T04:14:02Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Align schema enum with mapper The accepted enum in the tool schema does not include 'concolic_test', yet the mapping supports it. This mismatch can cause user inputs to be rejected before reaching your mapper. Align the schema enum with supported inputs to prevent validation errors. codeflash/verification/llm_tools.py [123-131] mapping = { "existing_unit_test": TestType.EXISTING_UNIT_TEST, "generated_regression": TestType.GENERATED_REGRESSION, "replay_test": TestType.REPLAY_TEST, "concolic_test": TestType.CONCOLIC_COVERAGE_TEST, "concolic_coverage_test": TestType.CONCOLIC_COVERAGE_TEST, } +... +RUN_BEHAVIORAL_TESTS_TOOL_SCHEMA = { + "type": "function", + "function": { + "name": "run_behavioral_tests", + "description": ( + "Run behavioral tests to verify code correctness. " + "This executes test files using pytest or unittest and returns detailed results " + "including pass/fail status, runtime information, and any errors encountered." + ), + "parameters": { + "type": "object", + "properties": { + "test_files": { + "type": "array", + "description": "List of test files to run", + "items": { + "type": "object", + "properties": { + "test_file_path": { + "type": "string", + "description": "Absolute path to the test file to run", + }, + "test_type": { + "type": "string", + "enum": [ + "existing_unit_test", + "generated_regression", + "replay_test", + "concolic_test", + "concolic_coverage_test", + ], + "default": "existing_unit_test", + "description": "Type of test being run", + }, + }, + "required": ["test_file_path"], + }, + }, + ... + }, + "required": ["test_files", "project_root"], + }, + }, +} Suggestion importance[1-10]: 7 __ Why: Correctly identifies a mismatch: `_test_type_from_string` accepts `concolic_test` but the tool schema enum omits it, causing premature validation failures. The proposed change is accurate and improves usability, though not critical to core functionality.	Medium
General	Prevent PYTHONPATH duplication, ensure precedence Appending to an existing PYTHONPATH without checking for duplicates can grow unbounded across calls and may break module resolution order. Prepend the project root if not already present to ensure it takes precedence and avoid duplication. codeflash/verification/llm_tools.py [191-196] -# Ensure PYTHONPATH includes project root -if "PYTHONPATH" not in test_env: - test_env["PYTHONPATH"] = str(project_root_path) -else: - test_env["PYTHONPATH"] += os.pathsep + str(project_root_path) +# Ensure PYTHONPATH includes project root once, and with precedence +current_pp = test_env.get("PYTHONPATH", "") +pp_parts = current_pp.split(os.pathsep) if current_pp else [] +project_root_str = str(project_root_path) +if project_root_str not in pp_parts: + test_env["PYTHONPATH"] = os.pathsep.join([project_root_str] + pp_parts) if pp_parts else project_root_str Suggestion importance[1-10]: 6 __ Why: Sensible enhancement to avoid unbounded PYTHONPATH growth and ensure project root precedence. It’s a maintainability/robustness improvement; impact is moderate and the code change aligns with the existing snippet.	Low
General	Normalize subprocess output to text Accessing 'process.stdout' and 'process.stderr' assumes they are strings; they may be bytes or None depending on how the subprocess was run. Normalize to string to avoid type issues for JSON serialization and downstream consumers. codeflash/verification/llm_tools.py [248-258] +def _to_text(s: Any) -> str: + if s is None: + return "" + return s.decode("utf-8", errors="replace") if isinstance(s, (bytes, bytearray)) else str(s) + return { "success": True, "total_tests": len(test_results), "passed_tests": passed_count, "failed_tests": failed_count, "results": results_list, - "stdout": process.stdout if process.stdout else "", - "stderr": process.stderr if process.stderr else "", + "stdout": _to_text(process.stdout), + "stderr": _to_text(process.stderr), "error": None, } Suggestion importance[1-10]: 6 __ Why: Normalizing `process.stdout`/`stderr` to strings improves robustness for serialization and consumer consistency. While likely already strings, handling bytes/None is a reasonable defensive improvement without altering behavior.	Low

codeflash-ai · 2025-12-03T04:19:12Z

⚡️ Codeflash found optimizations for this PR

📄 128% (1.28x) speedup for `_test_type_from_string` in `codeflash/verification/llm_tools.py`

⏱️ Runtime : 3.07 milliseconds → 1.34 milliseconds (best of 128 runs)

A dependent PR with the suggested changes has been created. Please review:

⚡️ Speed up function _test_type_from_string by 128% in PR #949 (feat/behavior-test-as-tool) #950

If you approve, it will be merged into this PR (branch feat/behavior-test-as-tool).

aseembits93 · 2026-02-17T14:45:41Z

@claude what are the flaws in this approach?

claude · 2026-02-17T14:46:17Z

PR Review Summary

Prek Checks

✅ All prek checks passing after fixes.

Fixes applied (committed and pushed):

Fixed TestType import to use direct module path (codeflash.models.test_type)
Removed invalid verbose kwarg from run_behavioral_tests() call (function does not accept it)
Removed invalid test_framework kwarg from TestConfig() constructor (it is a read-only property)
Added ToolEntry TypedDict for proper AVAILABLE_TOOLS typing (fixes mypy errors on get_tool_schema, get_all_tool_schemas, execute_tool)
Moved Callable import into TYPE_CHECKING block (TC003)
Added -> object return type to __getattr__ in __init__.py
Added -> None return type annotations to all test functions
Fixed dict indexing in tests with isinstance assertions

Mypy

✅ All 20 mypy errors resolved (0 remaining).

Code Review

Critical issues found:

Test failure (tests/test_llm_tools.py:102): test_run_behavioral_tests_tool_pytest fails in CI — result["success"] is False. The test does not print the error message on failure, making diagnosis difficult.
verbose parameter is silently ignored (llm_tools.py:144): The tool schema advertises verbose and the function accepts it, but it is never passed downstream since run_behavioral_tests() does not accept a verbose parameter. This is misleading for LLM callers.
Unused Pydantic models (llm_tools.py:25-67): TestFileInput, RunBehavioralTestsInput, TestResultOutput, RunBehavioralTestsOutput are defined but never used anywhere. The actual function uses plain dicts.
Broad exception catch (llm_tools.py:260): except Exception silently catches all errors and returns them as strings. Consider logging the exception before returning so issues are visible in server logs.

Test Coverage

All changed files are new (did not exist on main):

File	Stmts	Miss	Coverage
`codeflash/verification/__init__.py`	7	5	29%
`codeflash/verification/llm_tools.py`	86	3	97%
`tests/test_llm_tools.py`	85	0	100%
Total	178	8	96%

llm_tools.py has excellent coverage (97%)
__init__.py has 29% coverage — the lazy __getattr__ paths for most names are not exercised since tests import directly from llm_tools. This is expected since the tests import directly.
Overall project coverage: 79% (unchanged from main)
⚠️ 1 test failure: test_run_behavioral_tests_tool_pytest — needs investigation

Last updated: 2026-02-17

aseembits93 · 2026-02-17T15:59:28Z

@claude what are the flaws in this approach?

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude · 2026-02-17T16:15:06Z

+            pytest_timeout=30,
+        )
+
+        assert result["success"] is True


Bug: test_run_behavioral_tests_tool_pytest is failing in CI — result["success"] is False. The test doesn't print result["error"] on failure, making it hard to diagnose. Consider adding a diagnostic assertion message:

Suggested change

assert result["success"] is True

assert result["success"] is True, f"Test run failed: {result.get('error')}"

claude · 2026-02-17T16:15:17Z

+    project_root: str,
+    test_framework: str = "pytest",
+    pytest_timeout: int | None = 30,
+    verbose: bool = False,


Bug: The verbose parameter is accepted and advertised in the tool schema (line 119) but is silently ignored — it was removed from the run_behavioral_tests() call (which never accepted it). Either remove verbose from the function signature and schema, or implement its intended behavior.

claude · 2026-02-17T16:15:23Z

+class TestFileInput(BaseModel):
+    """Input schema for a single test file."""
+
+    test_file_path: str = Field(description="Absolute path to the test file to run")
+    test_type: str = Field(
+        default="existing_unit_test",
+        description="Type of test: 'existing_unit_test', 'generated_regression', 'replay_test', or 'concolic_coverage_test'",
+    )
+
+
+class RunBehavioralTestsInput(BaseModel):
+    """Input schema for the run_behavioral_tests tool."""
+
+    test_files: list[TestFileInput] = Field(description="List of test files to run")
+    test_framework: str = Field(default="pytest", description="Test framework to use: 'pytest' or 'unittest'")
+    project_root: str = Field(description="Absolute path to the project root directory")
+    pytest_timeout: int | None = Field(default=30, description="Timeout in seconds for each pytest test")
+    verbose: bool = Field(default=False, description="Enable verbose output")
+
+
+class TestResultOutput(BaseModel):
+    """Output schema for a single test result."""
+
+    test_id: str = Field(description="Unique identifier for the test")
+    test_file: str = Field(description="Path to the test file")
+    test_function: str | None = Field(description="Name of the test function")
+    passed: bool = Field(description="Whether the test passed")
+    runtime_ns: int | None = Field(description="Runtime in nanoseconds, if available")
+    timed_out: bool = Field(description="Whether the test timed out")
+
+
+class RunBehavioralTestsOutput(BaseModel):
+    """Output schema for the run_behavioral_tests tool."""
+
+    success: bool = Field(description="Whether the test run completed successfully")
+    total_tests: int = Field(description="Total number of tests run")
+    passed_tests: int = Field(description="Number of tests that passed")
+    failed_tests: int = Field(description="Number of tests that failed")
+    results: list[TestResultOutput] = Field(description="Detailed results for each test")
+    stdout: str = Field(description="Standard output from the test run")
+    stderr: str = Field(description="Standard error from the test run")
+    error: str | None = Field(default=None, description="Error message if the run failed")
+


Dead code: TestFileInput, RunBehavioralTestsInput, TestResultOutput, and RunBehavioralTestsOutput are defined but never used anywhere. The actual tool function uses list[dict[str, Any]] for inputs and dict[str, Any] for outputs. Consider either removing these or using them for validation in run_behavioral_tests_tool.

claude · 2026-02-17T16:15:30Z

+            "error": None,
+        }
+
+    except Exception as e:


Concern: The bare except Exception catch means callers never see the actual error type — they get a stringified error in the response dict. For LLM-facing tools this might be intentional, but note that errors like FileNotFoundError, PermissionError, or subprocess.CalledProcessError are all treated identically. Consider logging the exception (e.g., via logging.exception) before returning the error dict so that issues are at least visible in server logs.

first commit

6be5697

aseembits93 marked this pull request as draft December 3, 2025 04:12

github-actions Bot added the Review effort 3/5 label Dec 3, 2025

codeflash-ai Bot mentioned this pull request Dec 3, 2025

⚡️ Speed up function _test_type_from_string by 128% in PR #949 (feat/behavior-test-as-tool) #950

Closed

Merge branch 'main' into feat/behavior-test-as-tool

6730fdc

Merge branch 'main' into feat/behavior-test-as-tool

6a9b162

style: fix linting and mypy type errors in llm_tools

bbd57d7

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude Bot reviewed Feb 17, 2026

View reviewed changes

aseembits93 closed this Feb 23, 2026

aseembits93 deleted the feat/behavior-test-as-tool branch March 17, 2026 00:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exposing Behavior test run as a tool call to LLMs#949

Exposing Behavior test run as a tool call to LLMs#949
aseembits93 wants to merge 4 commits into
mainfrom
feat/behavior-test-as-tool

aseembits93 commented Dec 3, 2025 •

edited by github-actions Bot

Loading

Uh oh!

CLAassistant commented Dec 3, 2025

Uh oh!

github-actions Bot commented Dec 3, 2025

Uh oh!

github-actions Bot commented Dec 3, 2025

Uh oh!

codeflash-ai Bot commented Dec 3, 2025

⚡️ Speed up function `_test_type_from_string` by 128% in PR #949 (`feat/behavior-test-as-tool`) #950

Uh oh!

aseembits93 commented Feb 17, 2026

Uh oh!

claude Bot commented Feb 17, 2026 •

edited

Loading

Uh oh!

aseembits93 commented Feb 17, 2026

Uh oh!

claude Bot Feb 17, 2026

Uh oh!

claude Bot Feb 17, 2026

Uh oh!

claude Bot Feb 17, 2026

Uh oh!

claude Bot Feb 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	assert result["success"] is True
	assert result["success"] is True, f"Test run failed: {result.get('error')}"

Conversation

aseembits93 commented Dec 3, 2025 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Type

Description

Diagram Walkthrough

File Walkthrough

Uh oh!

CLAassistant commented Dec 3, 2025

Uh oh!

github-actions Bot commented Dec 3, 2025

PR Reviewer Guide 🔍

Uh oh!

github-actions Bot commented Dec 3, 2025

PR Code Suggestions ✨

Uh oh!

codeflash-ai Bot commented Dec 3, 2025

⚡️ Codeflash found optimizations for this PR

📄 128% (1.28x) speedup for _test_type_from_string in codeflash/verification/llm_tools.py

A dependent PR with the suggested changes has been created. Please review:

⚡️ Speed up function _test_type_from_string by 128% in PR #949 (feat/behavior-test-as-tool) #950

Uh oh!

aseembits93 commented Feb 17, 2026

Uh oh!

claude Bot commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review Summary

Prek Checks

Mypy

Code Review

Test Coverage

Uh oh!

aseembits93 commented Feb 17, 2026

Uh oh!

claude Bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

aseembits93 commented Dec 3, 2025 •

edited by github-actions Bot

Loading

📄 128% (1.28x) speedup for `_test_type_from_string` in `codeflash/verification/llm_tools.py`

⚡️ Speed up function `_test_type_from_string` by 128% in PR #949 (`feat/behavior-test-as-tool`) #950

claude Bot commented Feb 17, 2026 •

edited

Loading