Skip to content

LCORE-1270: E2e tests responses tools#1414

Merged
tisnik merged 3 commits into
lightspeed-core:mainfrom
asimurka:e2e_tests_responses_tools
Mar 31, 2026
Merged

LCORE-1270: E2e tests responses tools#1414
tisnik merged 3 commits into
lightspeed-core:mainfrom
asimurka:e2e_tests_responses_tools

Conversation

@asimurka

@asimurka asimurka commented Mar 27, 2026

Copy link
Copy Markdown
Contributor

Description

This PR adds E2E tests for tool choices in responses endpoint

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement
  • Benchmarks improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: (e.g., Claude, CodeRabbit, Ollama, etc., N/A if not used)
  • Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • Tests
    • Added end-to-end coverage for Responses API behavior around tool selection, verifying when tool invocations occur or not, that response text contains expected fragments, and that token metrics change across scenarios.

@asimurka asimurka marked this pull request as draft March 27, 2026 15:53
@coderabbitai

coderabbitai Bot commented Mar 27, 2026

Copy link
Copy Markdown
Contributor

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

Adds end-to-end Gherkin scenarios and new Behave step helpers to test responses endpoint tool_choice behaviors, asserting HTTP 200, presence/absence of tool-invocation output items, expected output text fragments, and token metric changes across multiple tool-choice configurations.

Changes

Cohort / File(s) Summary
Responses endpoint tool-choice E2E scenarios
tests/e2e/features/responses.feature
Added ~169 lines of Gherkin scenarios covering tool_choice variations ("none", "auto", "required", explicit {"type":"file_search"}) and allowed_tools filters; assertions include HTTP 200, existence/non-existence of tool invocation output item types (e.g., file_search_call), expected output text fragments when file search is used, and token metric increases.
Test validation helpers and steps
tests/e2e/features/steps/llm_query_response.py
Added _RESPONSE_TOOL_OUTPUT_ITEM_TYPES constant and _collect_output_item_types(response_body: dict[str, Any]) -> list[str]; added Behave @then steps to assert: no tool invocation item types present, presence/absence of a specific output[*].type, presence of one of multiple types, and substring fragment checks against response_json["output_text"]. Additional guards ensure context.response is present and parse Responses JSON structure.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning The PR implements most stated objectives: tool_choice modes (none, auto, required) are tested, tool invocation presence/absence is asserted, output_text fragments for file_search are verified, and token metrics are captured. However, reviewers note a missing scenario for allowed_tools name-based filtering of specific MCP tool names. Add test scenario(s) verifying that allowed_tools can filter MCP invocations by specific MCP tool names in auto and required modes, as recommended in the review guidance.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'LCORE-1270: E2e tests responses tools' directly references the issue number and clearly indicates the PR adds end-to-end tests for tool choices in the responses endpoint.
Out of Scope Changes check ✅ Passed All changes are scoped to end-to-end tests for the responses endpoint's tool choice behavior. The additions to the feature file and test step helpers directly support the PR objectives without unrelated modifications.
Docstring Coverage ✅ Passed Docstring coverage is 97.92% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
✨ Simplify code
  • Create PR with simplified code

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@asimurka asimurka changed the title E2e tests responses tools LCORE-1270: E2e tests responses tools Mar 27, 2026
@asimurka

Copy link
Copy Markdown
Contributor Author

Do not close, needs rebase at the top of #1412

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/responses.md`:
- Around line 283-284: Update the docs to state that the `allowed_tools` filter
is applied to both request-supplied `tools` and the implicit tool set produced
by `prepare_tools()` (including the LCORE tool set), i.e., `allowed_tools`
filters the prepared/implicit tools when the request omits `tools` as well as
the explicit `tools` array; clarify that `mode` remains `"auto"` or `"required"`
and `tools` is a list of key-valued filters that apply to all candidate tools,
including LCORE.

In `@src/utils/responses.py`:
- Around line 493-505: The helper mcp_strip_name_from_allowlist_entries
currently removes "name" for any entry with type "mcp", which can turn
{"type":"mcp","name":"foo"} without a server_label into a permissive
{"type":"mcp"}; change the logic so you only remove the "name" when the entry is
an MCP and also contains a "server_label" (i.e., check new_entry.get("type") ==
"mcp" and new_entry.get("server_label") is not None) so malformed/narrow MCP
filters without server_label are left intact (and thus handled/ignored by
group_mcp_tools_by_server/filter_tools_by_allowed_entries) instead of widening
the allowlist.
- Around line 1515-1517: The early return when tool_choice ==
ToolChoiceMode.none is currently returning None for all three outputs and clears
explicit vector_store_ids; change the behavior so disabling tools only disables
tool-related outputs (e.g., tool config and tool inputs) but preserves and
returns the original vector_store_ids so build_rag_context still receives
file_search.vector_store_ids; locate the check against ToolChoiceMode.none
(variable tool_choice, enum ToolChoiceMode) and adjust the returned tuple to
keep the third value (vector_store_ids) instead of None.

In `@tests/e2e/features/responses.feature`:
- Around line 555-575: Update the scenario's request body so the model is forced
to attempt a file search: inside the JSON passed to the step "When I use
\"responses\" to ask question with authorization header" (the request that
contains "tool_choice" and "instructions"), change the "instructions" value to
explicitly require the file_search tool (e.g., include a sentence like "You MUST
use the file_search tool to answer this question.") so the negative check on
"responses output should not include an item with type \"file_search_call\""
actually verifies that the allowlist filtering (the "tool_choice" -> "tools":
[{"type":"mcp"}]) prevented the model from using file_search.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f40a15ef-64e0-4d60-8dbd-d81eebc4eb50

📥 Commits

Reviewing files that changed from the base of the PR and between e184541 and c7b90c3.

📒 Files selected for processing (5)
  • docs/responses.md
  • src/utils/responses.py
  • tests/e2e/features/responses.feature
  • tests/e2e/features/steps/llm_query_response.py
  • tests/unit/utils/test_responses.py

Comment thread docs/responses.md
Comment thread src/utils/responses.py
Comment thread src/utils/responses.py Outdated
Comment thread tests/e2e/features/responses.feature Outdated
@asimurka asimurka force-pushed the e2e_tests_responses_tools branch from c7b90c3 to 0b1f3ba Compare March 27, 2026 16:38
@tisnik

tisnik commented Mar 28, 2026

Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

1406 - Partially compliant

Compliant requirements:

  • Add end-to-end tests for tool configuration and restrictions in the responses endpoint
  • Cover tool_choice modes none, auto, required
  • Assert tool invocation items presence/absence
  • Verify output_text fragments for file_search scenarios
  • Capture token metrics in tests

Non-compliant requirements:

  • End-to-end tests for allowed_tools filtering of MCP tools by specific name are missing

Requires further human verification:

(none)

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Missing Test

There is no end-to-end scenario verifying that allowed_tools with specific MCP tool name values restricts MCP invocations to those names. Without it, the MCP name-filtering behavior added in code isn’t validated in e2e tests.

Scenario: Check if responses endpoint with allowed tools in automatic mode answers knowledge question using file search
  Given The system is in default state
    And I set the Authorization header to Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6Ikpva
    And I capture the current token metrics
  When I use "responses" to ask question with authorization header
  """
  {
    "input": "What is the title of the article from Paul?",
    "model": "{PROVIDER}/{MODEL}",
    "stream": false,
    "instructions": "You are an assistant. You MUST use the file_search tool to answer. Answer in lowercase.",
    "tool_choice": {
      "type": "allowed_tools",
      "mode": "auto",
      "tools": [{"type": "file_search"}]
    }
  }
  """
  Then The status code of the response is 200
    And The responses output should include an item with type "file_search_call"
    And The responses output_text should contain following fragments
      | Fragments in LLM response |
      | great work                |
    And The token metrics should have increased

Scenario: Check if responses endpoint with allowed tools in required mode invokes file search for a basic question
  Given The system is in default state
    And I set the Authorization header to Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6Ikpva
    And I capture the current token metrics
  When I use "responses" to ask question with authorization header
  """
  {
    "input": "Hello world!",
    "model": "{PROVIDER}/{MODEL}",
    "stream": false,
    "tool_choice": {
      "type": "allowed_tools",
      "mode": "required",
      "tools": [{"type": "file_search"}]
    }
  }
  """
  Then The status code of the response is 200
    And The responses output should include an item with type "file_search_call"
    And The token metrics should have increased

@asimurka asimurka force-pushed the e2e_tests_responses_tools branch 2 times, most recently from ae2d522 to 7265e3a Compare March 31, 2026 14:20
@asimurka asimurka force-pushed the e2e_tests_responses_tools branch from 7265e3a to ba99db0 Compare March 31, 2026 14:20
@asimurka asimurka marked this pull request as ready for review March 31, 2026 14:21
@asimurka asimurka requested a review from radofuchs March 31, 2026 14:21

@radofuchs radofuchs left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/e2e/features/responses.feature`:
- Around line 555-598: Tests currently only filter MCP by type and miss
scenarios that whitelist a specific MCP tool name; add two new cases (auto and
required) under the existing responses.feature scenarios that use tool_choice
with type "allowed_tools" and tools set to
[{"type":"mcp","name":"<concrete_tool_name>"}] (replace <concrete_tool_name>
with a real MCP tool identifier used in the suite) and assert that only
invocations for that named MCP occur (for auto: responses output should include
only the specified MCP invocation and no other MCP/file_search_call items; for
required: responses output should contain no other tool invocation item types if
the name is invalid). Make sure to reference the same request structure keys
("input", "model", "tool_choice") and reuse the existing assertions about status
200 and token metric increases.

In `@tests/e2e/features/steps/llm_query_response.py`:
- Around line 234-253: The test check_fragments_in_responses_output_text
currently looks for a non-existent "output_text" field; update it to use the
Responses aggregated text field "n": ensure you check that "n" exists in
response_json (replace the "output_text" existence assert), set output_text =
response_json["n"], and then run the same fragment membership assertions against
that value in the loop inside check_fragments_in_responses_output_text.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7a436334-319c-43a7-804b-ca5e9ac3c8f3

📥 Commits

Reviewing files that changed from the base of the PR and between c7b90c3 and ba99db0.

📒 Files selected for processing (2)
  • tests/e2e/features/responses.feature
  • tests/e2e/features/steps/llm_query_response.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
📓 Path-based instructions (4)
tests/e2e/features/**/*.feature

📄 CodeRabbit inference engine (AGENTS.md)

Use behave (BDD) framework with Gherkin feature files for end-to-end tests

Files:

  • tests/e2e/features/responses.feature
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use absolute imports for internal modules from the Lightspeed Core Stack (e.g., from authentication import get_auth_dependency)

Files:

  • tests/e2e/features/steps/llm_query_response.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Maintain test coverage of at least 60% for unit tests and 10% for integration tests

Files:

  • tests/e2e/features/steps/llm_query_response.py
tests/e2e/features/steps/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Define E2E test step implementations in tests/e2e/features/steps/ directory

Files:

  • tests/e2e/features/steps/llm_query_response.py
🧠 Learnings (2)
📚 Learning: 2026-02-25T07:46:39.608Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 1211
File: src/models/responses.py:8-16
Timestamp: 2026-02-25T07:46:39.608Z
Learning: In the lightspeed-stack codebase, src/models/requests.py uses OpenAIResponseInputTool as Tool while src/models/responses.py uses OpenAIResponseTool as Tool. This type difference is intentional - input tools and output/response tools have different schemas in llama-stack-api.

Applied to files:

  • tests/e2e/features/responses.feature
📚 Learning: 2025-08-25T09:05:18.350Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 445
File: tests/e2e/features/health.feature:50-53
Timestamp: 2025-08-25T09:05:18.350Z
Learning: The step definition for ignoring specific fields in response body comparisons is implemented in tests/e2e/features/steps/common_http.py at line 175 as `then('The body of the response, ignoring the "{field}" field, is the following')`.

Applied to files:

  • tests/e2e/features/steps/llm_query_response.py
🔇 Additional comments (2)
tests/e2e/features/responses.feature (1)

432-554: Good tool_choice mode coverage expansion.

These scenarios add solid end-to-end checks for none/auto/required, explicit file-search selection, output item-type assertions, response fragment checks, and token-metric movement.

tests/e2e/features/steps/llm_query_response.py (1)

16-84: Nice addition of reusable Responses output-type assertions.

The shared collector + dedicated @then steps improve readability and scenario intent for tool invocation checks.

Comment thread tests/e2e/features/responses.feature
Comment thread tests/e2e/features/steps/llm_query_response.py

@tisnik tisnik left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tisnik tisnik merged commit 18b9a10 into lightspeed-core:main Mar 31, 2026
19 of 23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants