Skip to content

Exposing Behavior test run as a tool call to LLMs#949

Closed
aseembits93 wants to merge 4 commits into
mainfrom
feat/behavior-test-as-tool
Closed

Exposing Behavior test run as a tool call to LLMs#949
aseembits93 wants to merge 4 commits into
mainfrom
feat/behavior-test-as-tool

Conversation

@aseembits93
Copy link
Copy Markdown
Contributor

@aseembits93 aseembits93 commented Dec 3, 2025

PR Type

Enhancement, Tests


Description

  • Expose behavioral tests as LLM tool

  • Add tool schema and execution registry

  • Provide lazy imports to avoid cycles

  • Add comprehensive tests for tool API


Diagram Walkthrough

flowchart LR
  LLM["LLM"] -- "calls tool" --> Exec["execute_tool()"]
  Exec -- "dispatch" --> Tool["run_behavioral_tests_tool()"]
  Tool -- "invoke" --> Runner["run_behavioral_tests()"]
  Runner -- "produce JUnit XML" --> Parser["parse_test_xml()"]
  Parser -- "results" --> Tool
  Tool -- "structured dict" --> LLM
Loading

File Walkthrough

Relevant files
Enhancement
__init__.py
Lazy export of verification LLM tools                                       

codeflash/verification/init.py

  • Add lazy attribute loader for LLM tools
  • Re-export tool APIs via __all__
+31/-0   
llm_tools.py
LLM tool schema and behavior test wrapper                               

codeflash/verification/llm_tools.py

  • Define JSON schema for run_behavioral_tests
  • Implement run_behavioral_tests_tool wrapper
  • Add tool registry and execution helpers
  • Map string test types to enum
+321/-0 
Tests
test_llm_tools.py
Tests for verification LLM tools interface                             

tests/test_llm_tools.py

  • Add tests for tool schema and registry
  • Validate execute_tool dispatch and errors
  • Run real pytest samples through tool
  • Test handling of failing and invalid paths
+193/-0 

@aseembits93 aseembits93 marked this pull request as draft December 3, 2025 04:12
@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Codeflash Bot seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Dec 3, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

The mapping in _test_type_from_string uses the key concolic_test instead of concolic_coverage_test (already present). If callers pass concolic_test per earlier conventions, it's fine, but if schemas or other parts expect only concolic_coverage_test, the extra alias may mask typos. Conversely, if concolic_test was not intended as an alias, this could be a bug. Validate intended enum values and align with TestType.

def _test_type_from_string(test_type_str: str) -> TestType:
    """Convert a string test type to TestType enum."""
    mapping = {
        "existing_unit_test": TestType.EXISTING_UNIT_TEST,
        "generated_regression": TestType.GENERATED_REGRESSION,
        "replay_test": TestType.REPLAY_TEST,
        "concolic_test": TestType.CONCOLIC_COVERAGE_TEST,
        "concolic_coverage_test": TestType.CONCOLIC_COVERAGE_TEST,
    }
    return mapping.get(test_type_str.lower(), TestType.EXISTING_UNIT_TEST)
Robustness

When constructing PYTHONPATH, the code appends os.pathsep + project_root_path without checking for duplication or normalizing. Also, if PYTHONPATH exists but is empty, the leading separator can occur. Consider normalizing and avoiding duplicates.

# Ensure PYTHONPATH includes project root
if "PYTHONPATH" not in test_env:
    test_env["PYTHONPATH"] = str(project_root_path)
else:
    test_env["PYTHONPATH"] += os.pathsep + str(project_root_path)
Error Handling

The broad except Exception swallows all errors and returns success: False with minimal context. Consider logging or including more structured error info (e.g., traceback) to aid debugging, and ensure sensitive paths are handled appropriately.

except Exception as e:
    return {
        "success": False,
        "total_tests": 0,
        "passed_tests": 0,
        "failed_tests": 0,
        "results": [],
        "stdout": "",
        "stderr": "",
        "error": str(e),
    }

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Dec 3, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Align schema enum with mapper

The accepted enum in the tool schema does not include 'concolic_test', yet the
mapping supports it. This mismatch can cause user inputs to be rejected before
reaching your mapper. Align the schema enum with supported inputs to prevent
validation errors.

codeflash/verification/llm_tools.py [123-131]

 mapping = {
     "existing_unit_test": TestType.EXISTING_UNIT_TEST,
     "generated_regression": TestType.GENERATED_REGRESSION,
     "replay_test": TestType.REPLAY_TEST,
     "concolic_test": TestType.CONCOLIC_COVERAGE_TEST,
     "concolic_coverage_test": TestType.CONCOLIC_COVERAGE_TEST,
 }
+...
+RUN_BEHAVIORAL_TESTS_TOOL_SCHEMA = {
+    "type": "function",
+    "function": {
+        "name": "run_behavioral_tests",
+        "description": (
+            "Run behavioral tests to verify code correctness. "
+            "This executes test files using pytest or unittest and returns detailed results "
+            "including pass/fail status, runtime information, and any errors encountered."
+        ),
+        "parameters": {
+            "type": "object",
+            "properties": {
+                "test_files": {
+                    "type": "array",
+                    "description": "List of test files to run",
+                    "items": {
+                        "type": "object",
+                        "properties": {
+                            "test_file_path": {
+                                "type": "string",
+                                "description": "Absolute path to the test file to run",
+                            },
+                            "test_type": {
+                                "type": "string",
+                                "enum": [
+                                    "existing_unit_test",
+                                    "generated_regression",
+                                    "replay_test",
+                                    "concolic_test",
+                                    "concolic_coverage_test",
+                                ],
+                                "default": "existing_unit_test",
+                                "description": "Type of test being run",
+                            },
+                        },
+                        "required": ["test_file_path"],
+                    },
+                },
+                ...
+            },
+            "required": ["test_files", "project_root"],
+        },
+    },
+}
Suggestion importance[1-10]: 7

__

Why: Correctly identifies a mismatch: _test_type_from_string accepts concolic_test but the tool schema enum omits it, causing premature validation failures. The proposed change is accurate and improves usability, though not critical to core functionality.

Medium
General
Prevent PYTHONPATH duplication, ensure precedence

Appending to an existing PYTHONPATH without checking for duplicates can grow
unbounded across calls and may break module resolution order. Prepend the project
root if not already present to ensure it takes precedence and avoid duplication.

codeflash/verification/llm_tools.py [191-196]

-# Ensure PYTHONPATH includes project root
-if "PYTHONPATH" not in test_env:
-    test_env["PYTHONPATH"] = str(project_root_path)
-else:
-    test_env["PYTHONPATH"] += os.pathsep + str(project_root_path)
+# Ensure PYTHONPATH includes project root once, and with precedence
+current_pp = test_env.get("PYTHONPATH", "")
+pp_parts = current_pp.split(os.pathsep) if current_pp else []
+project_root_str = str(project_root_path)
+if project_root_str not in pp_parts:
+    test_env["PYTHONPATH"] = os.pathsep.join([project_root_str] + pp_parts) if pp_parts else project_root_str
Suggestion importance[1-10]: 6

__

Why: Sensible enhancement to avoid unbounded PYTHONPATH growth and ensure project root precedence. It’s a maintainability/robustness improvement; impact is moderate and the code change aligns with the existing snippet.

Low
Normalize subprocess output to text

Accessing 'process.stdout' and 'process.stderr' assumes they are strings; they may
be bytes or None depending on how the subprocess was run. Normalize to string to
avoid type issues for JSON serialization and downstream consumers.

codeflash/verification/llm_tools.py [248-258]

+def _to_text(s: Any) -> str:
+    if s is None:
+        return ""
+    return s.decode("utf-8", errors="replace") if isinstance(s, (bytes, bytearray)) else str(s)
+
 return {
     "success": True,
     "total_tests": len(test_results),
     "passed_tests": passed_count,
     "failed_tests": failed_count,
     "results": results_list,
-    "stdout": process.stdout if process.stdout else "",
-    "stderr": process.stderr if process.stderr else "",
+    "stdout": _to_text(process.stdout),
+    "stderr": _to_text(process.stderr),
     "error": None,
 }
Suggestion importance[1-10]: 6

__

Why: Normalizing process.stdout/stderr to strings improves robustness for serialization and consumer consistency. While likely already strings, handling bytes/None is a reasonable defensive improvement without altering behavior.

Low

@codeflash-ai
Copy link
Copy Markdown
Contributor

codeflash-ai Bot commented Dec 3, 2025

⚡️ Codeflash found optimizations for this PR

📄 128% (1.28x) speedup for _test_type_from_string in codeflash/verification/llm_tools.py

⏱️ Runtime : 3.07 milliseconds 1.34 milliseconds (best of 128 runs)

A dependent PR with the suggested changes has been created. Please review:

If you approve, it will be merged into this PR (branch feat/behavior-test-as-tool).

Static Badge

@aseembits93
Copy link
Copy Markdown
Contributor Author

@claude what are the flaws in this approach?

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Feb 17, 2026

PR Review Summary

Prek Checks

✅ All prek checks passing after fixes.

Fixes applied (committed and pushed):

  • Fixed TestType import to use direct module path (codeflash.models.test_type)
  • Removed invalid verbose kwarg from run_behavioral_tests() call (function does not accept it)
  • Removed invalid test_framework kwarg from TestConfig() constructor (it is a read-only property)
  • Added ToolEntry TypedDict for proper AVAILABLE_TOOLS typing (fixes mypy errors on get_tool_schema, get_all_tool_schemas, execute_tool)
  • Moved Callable import into TYPE_CHECKING block (TC003)
  • Added -> object return type to __getattr__ in __init__.py
  • Added -> None return type annotations to all test functions
  • Fixed dict indexing in tests with isinstance assertions

Mypy

✅ All 20 mypy errors resolved (0 remaining).

Code Review

Critical issues found:

  1. Test failure (tests/test_llm_tools.py:102): test_run_behavioral_tests_tool_pytest fails in CI — result["success"] is False. The test does not print the error message on failure, making diagnosis difficult.

  2. verbose parameter is silently ignored (llm_tools.py:144): The tool schema advertises verbose and the function accepts it, but it is never passed downstream since run_behavioral_tests() does not accept a verbose parameter. This is misleading for LLM callers.

  3. Unused Pydantic models (llm_tools.py:25-67): TestFileInput, RunBehavioralTestsInput, TestResultOutput, RunBehavioralTestsOutput are defined but never used anywhere. The actual function uses plain dicts.

  4. Broad exception catch (llm_tools.py:260): except Exception silently catches all errors and returns them as strings. Consider logging the exception before returning so issues are visible in server logs.

Test Coverage

All changed files are new (did not exist on main):

File Stmts Miss Coverage
codeflash/verification/__init__.py 7 5 29%
codeflash/verification/llm_tools.py 86 3 97%
tests/test_llm_tools.py 85 0 100%
Total 178 8 96%
  • llm_tools.py has excellent coverage (97%)
  • __init__.py has 29% coverage — the lazy __getattr__ paths for most names are not exercised since tests import directly from llm_tools. This is expected since the tests import directly.
  • Overall project coverage: 79% (unchanged from main)
  • ⚠️ 1 test failure: test_run_behavioral_tests_tool_pytest — needs investigation

Last updated: 2026-02-17

@aseembits93
Copy link
Copy Markdown
Contributor Author

@claude what are the flaws in this approach?

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread tests/test_llm_tools.py
pytest_timeout=30,
)

assert result["success"] is True
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: test_run_behavioral_tests_tool_pytest is failing in CI — result["success"] is False. The test doesn't print result["error"] on failure, making it hard to diagnose. Consider adding a diagnostic assertion message:

Suggested change
assert result["success"] is True
assert result["success"] is True, f"Test run failed: {result.get('error')}"

project_root: str,
test_framework: str = "pytest",
pytest_timeout: int | None = 30,
verbose: bool = False,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The verbose parameter is accepted and advertised in the tool schema (line 119) but is silently ignored — it was removed from the run_behavioral_tests() call (which never accepted it). Either remove verbose from the function signature and schema, or implement its intended behavior.

Comment on lines +25 to +67
class TestFileInput(BaseModel):
"""Input schema for a single test file."""

test_file_path: str = Field(description="Absolute path to the test file to run")
test_type: str = Field(
default="existing_unit_test",
description="Type of test: 'existing_unit_test', 'generated_regression', 'replay_test', or 'concolic_coverage_test'",
)


class RunBehavioralTestsInput(BaseModel):
"""Input schema for the run_behavioral_tests tool."""

test_files: list[TestFileInput] = Field(description="List of test files to run")
test_framework: str = Field(default="pytest", description="Test framework to use: 'pytest' or 'unittest'")
project_root: str = Field(description="Absolute path to the project root directory")
pytest_timeout: int | None = Field(default=30, description="Timeout in seconds for each pytest test")
verbose: bool = Field(default=False, description="Enable verbose output")


class TestResultOutput(BaseModel):
"""Output schema for a single test result."""

test_id: str = Field(description="Unique identifier for the test")
test_file: str = Field(description="Path to the test file")
test_function: str | None = Field(description="Name of the test function")
passed: bool = Field(description="Whether the test passed")
runtime_ns: int | None = Field(description="Runtime in nanoseconds, if available")
timed_out: bool = Field(description="Whether the test timed out")


class RunBehavioralTestsOutput(BaseModel):
"""Output schema for the run_behavioral_tests tool."""

success: bool = Field(description="Whether the test run completed successfully")
total_tests: int = Field(description="Total number of tests run")
passed_tests: int = Field(description="Number of tests that passed")
failed_tests: int = Field(description="Number of tests that failed")
results: list[TestResultOutput] = Field(description="Detailed results for each test")
stdout: str = Field(description="Standard output from the test run")
stderr: str = Field(description="Standard error from the test run")
error: str | None = Field(default=None, description="Error message if the run failed")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead code: TestFileInput, RunBehavioralTestsInput, TestResultOutput, and RunBehavioralTestsOutput are defined but never used anywhere. The actual tool function uses list[dict[str, Any]] for inputs and dict[str, Any] for outputs. Consider either removing these or using them for validation in run_behavioral_tests_tool.

"error": None,
}

except Exception as e:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concern: The bare except Exception catch means callers never see the actual error type — they get a stringified error in the response dict. For LLM-facing tools this might be intentional, but note that errors like FileNotFoundError, PermissionError, or subprocess.CalledProcessError are all treated identically. Consider logging the exception (e.g., via logging.exception) before returning the error dict so that issues are at least visible in server logs.

@aseembits93 aseembits93 deleted the feat/behavior-test-as-tool branch March 17, 2026 00:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants