Skip to content

feat(subtasks): segment trajectories into subtasks before tip generation#198

Merged
visahak merged 19 commits into
AgentToolkit:mainfrom
jayaramkr:trajectory-segmentation
Apr 28, 2026
Merged

feat(subtasks): segment trajectories into subtasks before tip generation#198
visahak merged 19 commits into
AgentToolkit:mainfrom
jayaramkr:trajectory-segmentation

Conversation

@jayaramkr

@jayaramkr jayaramkr commented Apr 18, 2026

Copy link
Copy Markdown
Collaborator

In Claude/Bob plugin contexts the first user message is often conversational or vague, making the task_description stored in tip metadata low-quality and clustering unreliable. Trajectory segmentation fixes this by running an LLM pass over the full trajectory to infer logical subtasks with generalized, reusable descriptions (e.g. 'Authenticate with GitHub API'). Tips are then generated per subtask using only the steps in that subtask's slice, and each tip entity is stored with the subtask's generalized description.

  • Add SubtaskSegment and SegmentationResponse Pydantic models to schema/tips.py
  • Add EVOLVE_SEGMENTATION_ENABLED config flag (default: True) to evolve.py
  • Add segment_trajectory.jinja2 prompt encouraging contiguous subtask ranges
  • Add altk_evolve/llm/tips/segmentation.py with segment_trajectory() (3-retry, graceful empty-list fallback on failure)
  • Refactor generate_tips() to return list[TipGenerationResult] (one per subtask), expose steps_list from parse_openai_agents_trajectory() for step-range slicing, and extract _generate_tips_for_segment() helper
  • Update phoenix_sync._process_trajectory() and mcp_server.save_trajectory() to iterate over the results list with nested comprehensions
  • Add tests/unit/test_segmentation.py with 4 unit tests
  • Update test_phoenix_sync.py and test_mcp_server.py mocks for list return type

Summary by CodeRabbit

  • New Features

    • Trajectory segmentation: agent runs can be split into logical subtasks.
    • Tips generation now produces per-segment tip sets and aggregates them for storage.
    • New configuration toggle to enable/disable segmentation (enabled by default).
  • Bug Fixes / Improvements

    • More robust parsing with retries and validation of segment ranges.
    • Safer handling and persistence of multi-result tip outputs; avoids empty updates.
  • Tests

    • Added and updated unit and end-to-end tests covering segmentation, retries, parsing, and validation.

In Claude/Bob plugin contexts the first user message is often conversational
or vague, making the task_description stored in tip metadata low-quality and
clustering unreliable. Trajectory segmentation fixes this by running an LLM
pass over the full trajectory to infer logical subtasks with generalized,
reusable descriptions (e.g. 'Authenticate with GitHub API'). Tips are then
generated per subtask using only the steps in that subtask's slice, and each
tip entity is stored with the subtask's generalized description.

- Add SubtaskSegment and SegmentationResponse Pydantic models to schema/tips.py
- Add EVOLVE_SEGMENTATION_ENABLED config flag (default: True) to evolve.py
- Add segment_trajectory.jinja2 prompt encouraging contiguous subtask ranges
- Add altk_evolve/llm/tips/segmentation.py with segment_trajectory() (3-retry,
  graceful empty-list fallback on failure)
- Refactor generate_tips() to return list[TipGenerationResult] (one per subtask),
  expose steps_list from parse_openai_agents_trajectory() for step-range slicing,
  and extract _generate_tips_for_segment() helper
- Update phoenix_sync._process_trajectory() and mcp_server.save_trajectory()
  to iterate over the results list with nested comprehensions
- Add tests/unit/test_segmentation.py with 4 unit tests
- Update test_phoenix_sync.py and test_mcp_server.py mocks for list return type
@coderabbitai

coderabbitai Bot commented Apr 18, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds trajectory segmentation and per-subtask tip generation: new segmentation schema and prompt, segmentation logic calling the LLM with retries and validation, tip-generation refactor to produce multiple results, a config toggle, and downstream updates to flatten and persist multi-result tips.

Changes

Cohort / File(s) Summary
Configuration
altk_evolve/config/evolve.py
Added segmentation_enabled: bool = True to EvolveConfig.
Schema
altk_evolve/schema/tips.py
Added SubtaskSegment and SegmentationResponse Pydantic models; validates end_step >= start_step.
Segmentation prompt
altk_evolve/llm/tips/prompts/segment_trajectory.jinja2
New Jinja2 template instructing LLM to segment trajectories and providing a JSON-only fallback format.
Segmentation logic
altk_evolve/llm/tips/segmentation.py
New segment_trajectory(messages) function: renders prompt, calls LLM (with optional constrained decoding), retries on parse/validation errors, validates response, returns list[SubtaskSegment].
Tips generation
altk_evolve/llm/tips/tips.py
Cached generate-tips template; parse_openai_agents_trajectory returns steps_list; added _generate_tips_for_segment; changed generate_tips to return list[TipGenerationResult] and to optionally segment trajectories when enabled.
Frontend / MCP
altk_evolve/frontend/mcp/mcp_server.py
save_trajectory updated to consume list results from generate_tips, flatten tip entities, and conditionally call update_entities.
Sync / Phoenix
altk_evolve/sync/phoenix_sync.py
_process_trajectory updated to iterate multiple tip results, flatten tips, use per-result metadata, enable conflict resolution, and return total tip count.
Tests & fixtures
tests/unit/test_segmentation.py, tests/unit/test_mcp_server.py, tests/unit/test_phoenix_sync.py, tests/e2e/test_segmentation.py, tests/fixtures/appworld_venmo_task_trajectory.json
Added unit/e2e segmentation tests; updated mocks to return lists of TipGenerationResult; added trajectory fixture.
Other
.secrets.baseline
Updated baseline metadata and findings to reference the new fixture.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Tips as "Tips Module\n(altk_evolve/llm/tips)"
    participant Seg as "Segmentation Module\n(altk_evolve/llm/tips/segmentation)"
    participant LLM as "LLM Service\n(litellm)"
    participant Schema as "Schema Validation\n(Pydantic)"
    participant Consumer as "Consumer\n(MCP / Phoenix)"

    Client->>Tips: generate_tips(messages)
    alt segmentation_enabled
        Tips->>Seg: segment_trajectory(messages)
        Seg->>LLM: completion(segment_prompt)
        LLM-->>Seg: raw JSON/text
        Seg->>Schema: validate SegmentationResponse
        Schema-->>Seg: SubtaskSegment[]
        Seg-->>Tips: subtask_list
        loop per subtask
            Tips->>LLM: completion(tips_prompt_for_segment)
            LLM-->>Tips: tips JSON/text
            Tips->>Schema: validate TipGenerationResult
            Schema-->>Tips: TipGenerationResult
        end
    else
        Tips->>LLM: completion(tips_prompt_full)
        LLM-->>Tips: tips JSON/text
        Tips->>Schema: validate TipGenerationResult
        Schema-->>Tips: TipGenerationResult
    end
    Tips-->>Client: list[TipGenerationResult]
    Client->>Consumer: update_entities(flattened_tips)
    Consumer->>Consumer: store tips with metadata
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • visahak
  • vinodmut

Poem

🐰 I nibble steps and split the trail,
Into subtasks where small tips hail.
Prompts whisper, retries hum along,
Each tip hops out to sing its song.
Hooray — the pipeline's split and strong!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 55.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main feature added: trajectory segmentation before tip generation, which is the core change across multiple files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
altk_evolve/llm/tips/tips.py (1)

87-108: ⚠️ Potential issue | 🟠 Major

num_steps and steps_list can disagree for long trajectories.

steps_list is built from agent_steps[:50] (at most 50 entries), while num_steps on line 107 counts every action/reasoning entry in agent_steps without the [:50] cap. For any trajectory with more than 50 action/reasoning steps:

  • segment_trajectory (segmentation.py lines 42-43) passes the full num_steps into the prompt together with a trajectory_summary that only shows steps 1..50. The LLM is told "there are 80 steps" but only sees 50, so it will happily return end_step=80, and generate_tips at line 202 clamps to n_steps = len(steps_list) = 50 — silently mis-attributing the trailing subtask to the visible prefix.
  • _generate_tips_for_segment receives num_steps=trajectory_data["num_steps"] on line 219 but a trajectory_slice containing only the visible steps, same discrepancy.

Either cap num_steps to len(steps_list) (and truncate consistently), or stop truncating to 50 here and do the truncation only when rendering.

Separately, the elif step_type == "observation": branch on lines 99-100 is dead code — agent_steps is only ever populated with "reasoning" (line 50) and "action" (line 74) entries — and can be removed or the producer extended to emit observations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@altk_evolve/llm/tips/tips.py` around lines 87 - 108, The returned num_steps
can disagree with steps_list because steps_list is built from agent_steps[:50]
while num_steps counts all action/reasoning entries; update the function that
builds steps_list/trajectory_summary so num_steps is capped to len(steps_list)
(or compute num_steps from the same sliced source) and ensure
trajectory_summary/steps_list and the value passed into
_generate_tips_for_segment/segment_trajectory are consistent (use the same
agent_steps slice or avoid slicing here and move truncation to rendering); also
remove the dead elif for "observation" or add producing code to emit
observations so the branch is reachable—references: steps_list, agent_steps,
num_steps, trajectory_summary, generate_tips, _generate_tips_for_segment,
segment_trajectory, trajectory_slice.
🧹 Nitpick comments (9)
altk_evolve/llm/tips/prompts/segment_trajectory.jinja2 (1)

17-20: Clarify step indexing base in the prompt.

The prompt says start_step/end_step are "the step numbers shown in the trajectory above (inclusive)" but doesn't state the base (1-indexed vs 0-indexed). The example uses 19, implying 1-based, and downstream slicing via steps_list must match. Making this explicit (e.g., "1-indexed, inclusive") reduces the chance of off-by-one errors when trajectory_summary rendering changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@altk_evolve/llm/tips/prompts/segment_trajectory.jinja2` around lines 17 - 20,
The prompt is ambiguous about whether start_step/end_step are 0- or 1-indexed
which can cause off-by-one bugs when matching trajectory_summary or slicing via
steps_list; update the text around `start_step` and `end_step` in
segment_trajectory.jinja2 to explicitly state that these are 1-indexed and
inclusive (e.g., "1-indexed, inclusive") and ensure any example and any use of
`steps_list` or references to the trajectory in the template and downstream code
assume this 1-based indexing consistently.
tests/unit/test_phoenix_sync.py (1)

511-635: LGTM on the mock-shape updates.

The three updated mocks correctly adopt the new list[TipGenerationResult] contract and test_sync_processes_valid_spans continues to assert provenance metadata (task_description, source_task_id, source_span_id, creation_mode) on the flattened tip entities, which exercises the nested comprehension in _process_trajectory.

One coverage gap worth considering: there's no test asserting that update_entities is not called for tips when generate_tips returns [] or a list of results with empty tips — the new if tip_entities: guard in phoenix_sync.py (line 492) is currently unexercised.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_phoenix_sync.py` around lines 511 - 635, Add a unit test that
exercises the new guard in phoenix_sync._process_trajectory by verifying
phoenix_sync.client.update_entities is NOT called when generate_tips returns
either an empty list or a list of TipGenerationResult objects whose tips lists
are empty; specifically, patch "altk_evolve.sync.phoenix_sync.generate_tips" to
return [] and another case returning [TipGenerationResult(tips=[],
task_description="...")] and call phoenix_sync.sync(...) then assert
phoenix_sync.client.update_entities.assert_not_called() to cover the `if
tip_entities:` branch.
tests/unit/test_segmentation.py (1)

28-86: Tests only cover the non-constrained-decoding path.

All four tests stub supports_response_schema=False and get_supported_openai_params=[], so the constrained_decoding_supported=True branch in segment_trajectory (which sets litellm.enable_json_schema_validation = True and passes response_format=SegmentationResponse) is never exercised. Given this branch is the preferred path when the model supports it, a fifth test that flips both mocks to True/["response_format"] and asserts completion was called with response_format=SegmentationResponse would prevent silent regressions.

Also consider one test with a response that parses as JSON but fails SegmentationResponse validation (e.g., missing subtasks key or wrong field types) to confirm ValidationError triggers retry rather than propagating — the current "bad" fixture only exercises the JSONDecodeError path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_segmentation.py` around lines 28 - 86, Add tests exercising
the constrained-decoding branch by mocking supports_response_schema to True and
get_supported_openai_params to return ["response_format"], call
segment_trajectory and assert the mocked completion was invoked with
response_format=SegmentationResponse and that
litellm.enable_json_schema_validation is effectively used; additionally add a
test where completion returns syntactically valid JSON that fails
SegmentationResponse validation (e.g., missing "subtasks" or wrong types) and
assert segment_trajectory treats this like a parse/validation failure and
retries (incrementing completion.call_count) rather than propagating a
ValidationError.
altk_evolve/frontend/mcp/mcp_server.py (1)

205-229: Correct, but duplicated with phoenix_sync._process_trajectory.

The nested comprehension + if tip_entities: guard is byte-for-byte parallel to altk_evolve/sync/phoenix_sync.py lines 472-497 aside from the creation_mode and span id metadata. Consider extracting a small helper like build_tip_entities(results, *, source_task_id, source_span_id=None, creation_mode) in altk_evolve/llm/tips/tips.py (or a neighboring utility) so both call sites stay in lockstep as fields evolve.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@altk_evolve/frontend/mcp/mcp_server.py` around lines 205 - 229, The code
builds tip_entities via a nested comprehension after calling generate_tips and
then calls get_client().update_entities, which duplicates logic present in
phoenix_sync._process_trajectory; extract the entity-construction logic into a
shared helper (e.g. build_tip_entities(results, *, source_task_id,
source_span_id=None, creation_mode)) placed in altk_evolve/llm/tips/tips.py (or
a nearby util) and replace the comprehension in both mcp_server.py (after
generate_tips) and phoenix_sync._process_trajectory to call
build_tip_entities(...) so fields like category, rationale, trigger,
implementation_steps, task_description, source_task_id, source_span_id and
creation_mode stay consistent as they evolve.
tests/unit/test_mcp_server.py (1)

22-30: Consider asserting task_description propagation in tip metadata.

The mock now returns a list and save_trajectory builds task_description from result.task_description ("Add feature"). The existing assertions don't cover this new field, so a regression in the per-subtask description propagation wouldn't be caught here. Adding assert tip_entity.metadata["task_description"] == "Add feature" would close that gap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_mcp_server.py` around lines 22 - 30, The test currently mocks
mock_generate_tips to return mock_result with result.task_description="Add
feature", but does not assert that save_trajectory propagated that into each tip
entity; add an assertion after the saved tip is retrieved to verify
tip_entity.metadata["task_description"] == "Add feature" (locate the check near
existing assertions for tip_entity/metadata in tests/unit/test_mcp_server.py,
referencing mock_result, mock_tip, save_trajectory and tip_entity to ensure the
per-subtask description is covered).
altk_evolve/sync/phoenix_sync.py (1)

471-499: Behavior looks correct; consider counter semantics when all segments fail.

The nested comprehension and sum(len(r.tips) for r in results) correctly handle multi-segment results, and the if tip_entities: guard avoids a pointless update_entities call when segmentation yields no tips.

Minor semantic note: when segmentation is enabled and every per-segment tip generation fails (or when segment_trajectory returns [] and the fallback produces no tips), _process_trajectory returns 0 and sync() still counts the trajectory as processed += 1 at line 562. That matches the pre-refactor behavior for a failed single call, but given segmentation multiplies LLM calls, you may want either (a) a log line distinguishing "processed with 0 tips" or (b) surfacing partial-failure counts in SyncResult. Not a blocker — just worth a thought for observability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@altk_evolve/sync/phoenix_sync.py` around lines 471 - 499, The code currently
returns 0 tips from _process_trajectory even when segmentation ran and all
per-segment tip generations failed; add observability by logging when tip
generation completed but produced zero tips and/or surface partial-failure
counts in SyncResult. Specifically, in _process_trajectory (where
generate_tips(...) and tip_entities are computed) compute total_tips =
sum(len(r.tips) for r in results) and if total_tips == 0 and results: emit a
warning via self.logger (or process logger) like "trajectory processed with 0
tips (segmented failures)" including trajectory["trace_id"]/span_id;
alternatively, update the method signature to return (total_tips,
partial_failure_count) and propagate that into sync() and SyncResult so the
caller can differentiate processed-but-empty from unprocessed. Use the symbols
generate_tips, tip_entities, _process_trajectory, sync, and SyncResult to locate
changes.
altk_evolve/llm/tips/segmentation.py (2)

17-17: Specify encoding="utf-8" when reading the prompt.

Path.read_text() without an explicit encoding uses the platform-default, which can cause UnicodeDecodeError at import time on non-UTF-8 systems (e.g., Windows cp1252). Ruff's PLW1514 flags this.

♻️ Proposed change
-_SEGMENT_TEMPLATE = Template((Path(__file__).parent / "prompts/segment_trajectory.jinja2").read_text())
+_SEGMENT_TEMPLATE = Template((Path(__file__).parent / "prompts/segment_trajectory.jinja2").read_text(encoding="utf-8"))

The same applies to _GENERATE_TIPS_TEMPLATE in altk_evolve/llm/tips/tips.py at line 19.

As per coding guidelines: **/*.py: Use Ruff for linting and formatting (configured in pyproject.toml).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@altk_evolve/llm/tips/segmentation.py` at line 17, The template loading uses
Path.read_text() without an explicit encoding which can raise UnicodeDecodeError
on non-UTF-8 platforms; update the Template constructors for _SEGMENT_TEMPLATE
and _GENERATE_TIPS_TEMPLATE to call read_text(encoding="utf-8") so the prompt
files are always read as UTF-8, preserving current behavior and satisfying
Ruff's PLW1514 rule.

30-73: Consider extracting a shared helper for constrained-decoding detection and completion.

The capability detection at lines 30-39 and the completion(...) branching at lines 50-73 are near-duplicates of altk_evolve/llm/tips/tips.py lines 176-185 and 125-148 respectively. A shared helper (e.g., _completion_with_schema(prompt, response_model) returning the cleaned string) in altk_evolve/llm/ would eliminate drift risk between the two sites, since both already use the same llm_settings and clean_llm_response.

Not blocking; can be deferred.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@altk_evolve/llm/tips/segmentation.py` around lines 30 - 73, Extract the
duplicated capability-detection and completion logic into a shared helper (e.g.,
_completion_with_schema(prompt: str, response_model: Type | None) -> str) in
altk_evolve/llm/, which should encapsulate the calls to
get_supported_openai_params and supports_response_schema for the given
llm_settings, toggle litellm.enable_json_schema_validation, call completion(...)
with appropriate response_format or without it, and return the final cleaned
string using clean_llm_response when necessary; then replace the duplicated
blocks in segmentation.py (the use of get_supported_openai_params,
supports_response_format, supports_response_schema,
constrained_decoding_supported, and the completion(...) branches including
SegmentationResponse and clean_llm_response) with a call to that new helper,
passing the prompt and SegmentationResponse where required.
altk_evolve/llm/tips/tips.py (1)

198-211: Defensively guard against malformed step ranges from the segmenter.

If the segmenter returns start_step > end_step, start_step > n_steps, or overlapping/out-of-order ranges, slice_steps can be empty, producing a TipGenerationResult with num_steps=0 and a trajectory slice of "". _generate_tips_for_segment will then prompt the LLM with an empty trajectory, almost certainly yielding empty/garbage tips but consuming an LLM call per bad segment.

Consider dropping segments where len(slice_steps) == 0 (or end <= start) with a logger.warning, so downstream clustering isn't polluted with empty results and token spend is capped.

         for subtask in subtasks:
             start = max(0, subtask.start_step - 1)
             end = min(n_steps, subtask.end_step)
             slice_steps = steps_list[start:end]
+            if not slice_steps:
+                logger.warning(
+                    f"Skipping empty subtask segment "
+                    f"[{subtask.start_step}, {subtask.end_step}] (n_steps={n_steps}): "
+                    f"{subtask.generalized_description!r}"
+                )
+                continue
             result = _generate_tips_for_segment(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@altk_evolve/llm/tips/tips.py` around lines 198 - 211, The segment loop in the
tips generation currently calls _generate_tips_for_segment even when slice_steps
is empty due to malformed ranges (e.g., subtask.start_step > subtask.end_step,
start_step > n_steps, or overlapping/out-of-order ranges); update the loop that
iterates over subtasks to compute start = max(0, subtask.start_step - 1) and end
= min(n_steps, subtask.end_step) then skip any segment where end <= start or
len(slice_steps) == 0 by emitting a logger.warning (include
subtask.generalized_description, start, end, and n_steps) and not appending a
TipGenerationResult for that segment so _generate_tips_for_segment is only
called with non-empty trajectory_slice, preventing empty results and wasted LLM
calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@altk_evolve/llm/tips/segmentation.py`:
- Around line 48-83: Narrow the retry except to only JSONDecodeError and
ValidationError (remove the broad Exception) so unexpected errors from
completion(...) or attribute/key errors propagate; before the loop compute and
set litellm.enable_json_schema_validation once (instead of mutating it each
attempt) or use a context manager to avoid race conditions with concurrent
_generate_tips_for_segment calls; after each failed attempt log the exception
with context (use logger.debug/info) including attempt index, prompt, and
last_error, and if clean_response is empty/None (from completion(...) or
clean_llm_response) treat it as a non-retriable error (log and break/raise)
rather than retrying; continue to validate using
SegmentationResponse.model_validate(json.loads(clean_response)) and only retry
on the two JSON/Validation exceptions, re-raising other exceptions.

In `@altk_evolve/schema/tips.py`:
- Around line 20-28: Add Field descriptions to SubtaskSegment's fields
(generalized_description, start_step, end_step, purpose) using
pydantic.Field(description="...") so the JSON schema passed as
response_format=SegmentationResponse includes grounding text; also enforce
step-range validation by constraining start_step and end_step (e.g., use
pydantic.conint or Field(ge=1) for start_step/end_step and add a validator
method on SubtaskSegment that raises ValueError if end_step < start_step) so
invalid ranges fail validation and trigger retries in the segmentation consumer.

---

Outside diff comments:
In `@altk_evolve/llm/tips/tips.py`:
- Around line 87-108: The returned num_steps can disagree with steps_list
because steps_list is built from agent_steps[:50] while num_steps counts all
action/reasoning entries; update the function that builds
steps_list/trajectory_summary so num_steps is capped to len(steps_list) (or
compute num_steps from the same sliced source) and ensure
trajectory_summary/steps_list and the value passed into
_generate_tips_for_segment/segment_trajectory are consistent (use the same
agent_steps slice or avoid slicing here and move truncation to rendering); also
remove the dead elif for "observation" or add producing code to emit
observations so the branch is reachable—references: steps_list, agent_steps,
num_steps, trajectory_summary, generate_tips, _generate_tips_for_segment,
segment_trajectory, trajectory_slice.

---

Nitpick comments:
In `@altk_evolve/frontend/mcp/mcp_server.py`:
- Around line 205-229: The code builds tip_entities via a nested comprehension
after calling generate_tips and then calls get_client().update_entities, which
duplicates logic present in phoenix_sync._process_trajectory; extract the
entity-construction logic into a shared helper (e.g. build_tip_entities(results,
*, source_task_id, source_span_id=None, creation_mode)) placed in
altk_evolve/llm/tips/tips.py (or a nearby util) and replace the comprehension in
both mcp_server.py (after generate_tips) and phoenix_sync._process_trajectory to
call build_tip_entities(...) so fields like category, rationale, trigger,
implementation_steps, task_description, source_task_id, source_span_id and
creation_mode stay consistent as they evolve.

In `@altk_evolve/llm/tips/prompts/segment_trajectory.jinja2`:
- Around line 17-20: The prompt is ambiguous about whether start_step/end_step
are 0- or 1-indexed which can cause off-by-one bugs when matching
trajectory_summary or slicing via steps_list; update the text around
`start_step` and `end_step` in segment_trajectory.jinja2 to explicitly state
that these are 1-indexed and inclusive (e.g., "1-indexed, inclusive") and ensure
any example and any use of `steps_list` or references to the trajectory in the
template and downstream code assume this 1-based indexing consistently.

In `@altk_evolve/llm/tips/segmentation.py`:
- Line 17: The template loading uses Path.read_text() without an explicit
encoding which can raise UnicodeDecodeError on non-UTF-8 platforms; update the
Template constructors for _SEGMENT_TEMPLATE and _GENERATE_TIPS_TEMPLATE to call
read_text(encoding="utf-8") so the prompt files are always read as UTF-8,
preserving current behavior and satisfying Ruff's PLW1514 rule.
- Around line 30-73: Extract the duplicated capability-detection and completion
logic into a shared helper (e.g., _completion_with_schema(prompt: str,
response_model: Type | None) -> str) in altk_evolve/llm/, which should
encapsulate the calls to get_supported_openai_params and
supports_response_schema for the given llm_settings, toggle
litellm.enable_json_schema_validation, call completion(...) with appropriate
response_format or without it, and return the final cleaned string using
clean_llm_response when necessary; then replace the duplicated blocks in
segmentation.py (the use of get_supported_openai_params,
supports_response_format, supports_response_schema,
constrained_decoding_supported, and the completion(...) branches including
SegmentationResponse and clean_llm_response) with a call to that new helper,
passing the prompt and SegmentationResponse where required.

In `@altk_evolve/llm/tips/tips.py`:
- Around line 198-211: The segment loop in the tips generation currently calls
_generate_tips_for_segment even when slice_steps is empty due to malformed
ranges (e.g., subtask.start_step > subtask.end_step, start_step > n_steps, or
overlapping/out-of-order ranges); update the loop that iterates over subtasks to
compute start = max(0, subtask.start_step - 1) and end = min(n_steps,
subtask.end_step) then skip any segment where end <= start or len(slice_steps)
== 0 by emitting a logger.warning (include subtask.generalized_description,
start, end, and n_steps) and not appending a TipGenerationResult for that
segment so _generate_tips_for_segment is only called with non-empty
trajectory_slice, preventing empty results and wasted LLM calls.

In `@altk_evolve/sync/phoenix_sync.py`:
- Around line 471-499: The code currently returns 0 tips from
_process_trajectory even when segmentation ran and all per-segment tip
generations failed; add observability by logging when tip generation completed
but produced zero tips and/or surface partial-failure counts in SyncResult.
Specifically, in _process_trajectory (where generate_tips(...) and tip_entities
are computed) compute total_tips = sum(len(r.tips) for r in results) and if
total_tips == 0 and results: emit a warning via self.logger (or process logger)
like "trajectory processed with 0 tips (segmented failures)" including
trajectory["trace_id"]/span_id; alternatively, update the method signature to
return (total_tips, partial_failure_count) and propagate that into sync() and
SyncResult so the caller can differentiate processed-but-empty from unprocessed.
Use the symbols generate_tips, tip_entities, _process_trajectory, sync, and
SyncResult to locate changes.

In `@tests/unit/test_mcp_server.py`:
- Around line 22-30: The test currently mocks mock_generate_tips to return
mock_result with result.task_description="Add feature", but does not assert that
save_trajectory propagated that into each tip entity; add an assertion after the
saved tip is retrieved to verify tip_entity.metadata["task_description"] == "Add
feature" (locate the check near existing assertions for tip_entity/metadata in
tests/unit/test_mcp_server.py, referencing mock_result, mock_tip,
save_trajectory and tip_entity to ensure the per-subtask description is
covered).

In `@tests/unit/test_phoenix_sync.py`:
- Around line 511-635: Add a unit test that exercises the new guard in
phoenix_sync._process_trajectory by verifying
phoenix_sync.client.update_entities is NOT called when generate_tips returns
either an empty list or a list of TipGenerationResult objects whose tips lists
are empty; specifically, patch "altk_evolve.sync.phoenix_sync.generate_tips" to
return [] and another case returning [TipGenerationResult(tips=[],
task_description="...")] and call phoenix_sync.sync(...) then assert
phoenix_sync.client.update_entities.assert_not_called() to cover the `if
tip_entities:` branch.

In `@tests/unit/test_segmentation.py`:
- Around line 28-86: Add tests exercising the constrained-decoding branch by
mocking supports_response_schema to True and get_supported_openai_params to
return ["response_format"], call segment_trajectory and assert the mocked
completion was invoked with response_format=SegmentationResponse and that
litellm.enable_json_schema_validation is effectively used; additionally add a
test where completion returns syntactically valid JSON that fails
SegmentationResponse validation (e.g., missing "subtasks" or wrong types) and
assert segment_trajectory treats this like a parse/validation failure and
retries (incrementing completion.call_count) rather than propagating a
ValidationError.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d209fb78-be14-4c74-8288-ce5d7c44b802

📥 Commits

Reviewing files that changed from the base of the PR and between 034de13 and ff58475.

📒 Files selected for processing (10)
  • altk_evolve/config/evolve.py
  • altk_evolve/frontend/mcp/mcp_server.py
  • altk_evolve/llm/tips/prompts/segment_trajectory.jinja2
  • altk_evolve/llm/tips/segmentation.py
  • altk_evolve/llm/tips/tips.py
  • altk_evolve/schema/tips.py
  • altk_evolve/sync/phoenix_sync.py
  • tests/unit/test_mcp_server.py
  • tests/unit/test_phoenix_sync.py
  • tests/unit/test_segmentation.py

Comment thread altk_evolve/llm/guidelines/segmentation.py
Comment thread altk_evolve/schema/tips.py Outdated
jayaramkr and others added 2 commits April 18, 2026 14:28
- tips.py: cap num_steps to len(steps_list) so segmentation LLM step
  count matches the visible trajectory; remove dead observation branch
- schema/tips.py: add Field descriptions to SubtaskSegment for JSON
  schema grounding; add model_validator rejecting end_step < start_step
- segmentation.py: narrow except to (JSONDecodeError, ValidationError),
  add explicit empty-response guard, log each failed attempt at debug;
  move litellm.enable_json_schema_validation outside the retry loop
- tips.py: catch broad Exception from segment_trajectory() and fall back
  to full-trajectory mode with a warning log
- test_segmentation.py: add tests for inverted range rejection and
  non-parse error propagation from segment_trajectory
@jayaramkr jayaramkr requested review from vinodmut and visahak April 20, 2026 01:30

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@altk_evolve/llm/tips/tips.py`:
- Around line 200-213: The loop over subtasks can pass an empty trajectory slice
into _generate_tips_for_segment when subtask.start_step/end_step lie outside the
visible range; clamp and validate the slice bounds before calling the LLM:
compute start = min(max(0, subtask.start_step - 1), n_steps) and end =
min(max(0, subtask.end_step), n_steps), then if start >= end (or slice_steps
would be empty) skip this subtask (or log/warn) instead of calling
_generate_tips_for_segment with num_steps==0; update the loop that builds
slice_steps/num_steps accordingly to use these validated bounds.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 969c8342-9af4-40e9-a383-a5f2b9528f03

📥 Commits

Reviewing files that changed from the base of the PR and between ff58475 and eee4ea6.

📒 Files selected for processing (7)
  • altk_evolve/config/evolve.py
  • altk_evolve/frontend/mcp/mcp_server.py
  • altk_evolve/llm/tips/segmentation.py
  • altk_evolve/llm/tips/tips.py
  • altk_evolve/schema/tips.py
  • tests/unit/test_mcp_server.py
  • tests/unit/test_segmentation.py
✅ Files skipped from review due to trivial changes (2)
  • tests/unit/test_mcp_server.py
  • altk_evolve/config/evolve.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • altk_evolve/schema/tips.py
  • altk_evolve/llm/tips/segmentation.py
  • tests/unit/test_segmentation.py

Comment thread altk_evolve/llm/tips/tips.py Outdated
jayaramkr and others added 4 commits April 21, 2026 12:43
Clamp both bounds to [0, n_steps] before slicing, then skip the subtask
with a warning if start >= end. Prevents passing num_steps=0 and an empty
trajectory slice into _generate_tips_for_segment when the LLM returns
step indices outside the visible 50-step window.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@altk_evolve/llm/tips/tips.py`:
- Around line 200-218: The code currently returns whenever any subtask produced
a result, which violates the requirement to only use per-subtask tips when at
least two valid subtasks exist; change the post-filter check so that after
iterating subtasks and building results you only return results if len(results)
>= 2 (i.e., at least two valid subtask slices were generated); otherwise fall
through to the full-trajectory fallback. Update the logic around the loop that
builds results (referencing variables/subroutines: subtasks, results, start/end
calculation, steps_list, and _generate_tips_for_segment) to perform this
stricter count check before returning.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9b88539e-fad3-4cfd-936f-65944f80719f

📥 Commits

Reviewing files that changed from the base of the PR and between 972c113 and d0a885b.

📒 Files selected for processing (1)
  • altk_evolve/llm/tips/tips.py

Comment thread altk_evolve/llm/tips/tips.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
altk_evolve/llm/tips/tips.py (1)

87-105: ⚠️ Potential issue | 🟡 Minor

Add logging and extract the 50-step trajectory cap as a constant for discoverability.

The hardcoded agent_steps[:50] at lines 88 and 105 silently truncates trajectories longer than 50 steps with no logging or warning. When segmentation is enabled, parse_openai_agents_trajectory is called twice (lines 185 and again at segmentation.py:28), both seeing the same capped trajectory. This means long trajectories silently lose their tail—a user has no visibility that steps beyond 50 are being dropped.

Additionally, the type filter at line 105 is redundant: agent_steps only ever contains entries with type in {"reasoning", "action"} (appended only at lines 50 and 74).

Suggested action:

  • Extract 50 as a module-level constant (e.g., _MAX_TRAJECTORY_STEPS = 50) for clarity and tunability.
  • Log at debug or info level when truncation occurs, so it's observable.
  • Replace the redundant filter at line 105 with len(visible_steps).
🛠️ Suggested refinement
+_MAX_TRAJECTORY_STEPS = 50
+
 def parse_openai_agents_trajectory(messages: list[dict]) -> dict:
@@
-    steps_list = []
-    for i, step in enumerate(agent_steps[:50], 1):
+    visible_steps = agent_steps[:_MAX_TRAJECTORY_STEPS]
+    if len(agent_steps) > _MAX_TRAJECTORY_STEPS:
+        logger.debug(
+            f"Trajectory truncated from {len(agent_steps)} to {_MAX_TRAJECTORY_STEPS} steps"
+        )
+    steps_list = []
+    for i, step in enumerate(visible_steps, 1):
         step_type = step["type"]
@@
     return {
         "task_instruction": task_instruction or DEFAULT_TASK_DESCRIPTION,
         "trajectory_summary": "\n\n".join(steps_list),
         "steps_list": steps_list,
         "function_calls": function_calls,
-        "num_steps": len([s for s in agent_steps[:50] if s["type"] in ["action", "reasoning"]]),
+        "num_steps": len(visible_steps),
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@altk_evolve/llm/tips/tips.py` around lines 87 - 105, The code silently
truncates trajectories using agent_steps[:50] and redundantly re-filters types
when computing num_steps; extract the magic number into a module-level constant
(e.g., _MAX_TRAJECTORY_STEPS = 50), compute visible_steps =
agent_steps[:_MAX_TRAJECTORY_STEPS], log at debug/info when len(agent_steps) >
_MAX_TRAJECTORY_STEPS (including original and truncated lengths) before building
steps_list, use visible_steps when iterating to build steps_list and replace the
redundant len([s for s in agent_steps[:50] if ...]) with len(visible_steps), and
ensure parse_openai_agents_trajectory callers now see the same visible_steps so
truncation is explicit and discoverable.
🧹 Nitpick comments (1)
altk_evolve/llm/tips/tips.py (1)

185-198: Trajectory is parsed twice when segmentation is enabled.

parse_openai_agents_trajectory(messages) runs here at line 185, and segment_trajectory() in altk_evolve/llm/tips/segmentation.py calls it again internally on the same messages. For large trajectories with many function-call arguments to json.loads/json.dumps, this doubles the parsing cost on every tip-generation call.

Consider either threading the parsed trajectory_data (or just trajectory_summary/num_steps) into segment_trajectory, or moving the parse inside generate_tips and letting segment_trajectory accept already-parsed data. This also keeps the "what does the segmenter see?" invariant explicit rather than relying on both sides independently capping at 50.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@altk_evolve/llm/tips/tips.py` around lines 185 - 198, The code currently
calls parse_openai_agents_trajectory(messages) in generate_tips and then calls
segment_trajectory(messages), which re-parses the same messages; to fix, change
the call/signature so segment_trajectory accepts the already-parsed
trajectory_data (or the derived trajectory_summary/steps_list/num_steps) instead
of raw messages: update the call site in generate_tips to pass trajectory_data
(or trajectory_summary/steps_list) when evolve_config.segmentation_enabled and
modify altk_evolve.llm.tips.segmentation.segment_trajectory to accept and use
that parsed object (or add an overload) so parsing only happens once and the
50-step capping remains explicit via trajectory_data/steps_list.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/fixtures/appworld_venmo_task_trajectory.json`:
- Around line 231-244: The fixture omits Nancy from the taxi split causing an
incorrect net settlement; update the code that builds the three venmo requests
(symbols taxi_share, taxi_note, venmo_access_token and calls
apis.venmo.create_payment_request) to include a third request for Nancy (e.g.,
create result_nancy =
apis.venmo.create_payment_request(user_email=\"nancy@example.com\",
amount=taxi_share, access_token=venmo_access_token, description=taxi_note)), and
update the printed/output line to include the simulator’s actual returned
payment_request_id for that third request instead of the current two-request
output (adjust result_jordan/result_kathryn print to include result_nancy).

---

Outside diff comments:
In `@altk_evolve/llm/tips/tips.py`:
- Around line 87-105: The code silently truncates trajectories using
agent_steps[:50] and redundantly re-filters types when computing num_steps;
extract the magic number into a module-level constant (e.g.,
_MAX_TRAJECTORY_STEPS = 50), compute visible_steps =
agent_steps[:_MAX_TRAJECTORY_STEPS], log at debug/info when len(agent_steps) >
_MAX_TRAJECTORY_STEPS (including original and truncated lengths) before building
steps_list, use visible_steps when iterating to build steps_list and replace the
redundant len([s for s in agent_steps[:50] if ...]) with len(visible_steps), and
ensure parse_openai_agents_trajectory callers now see the same visible_steps so
truncation is explicit and discoverable.

---

Nitpick comments:
In `@altk_evolve/llm/tips/tips.py`:
- Around line 185-198: The code currently calls
parse_openai_agents_trajectory(messages) in generate_tips and then calls
segment_trajectory(messages), which re-parses the same messages; to fix, change
the call/signature so segment_trajectory accepts the already-parsed
trajectory_data (or the derived trajectory_summary/steps_list/num_steps) instead
of raw messages: update the call site in generate_tips to pass trajectory_data
(or trajectory_summary/steps_list) when evolve_config.segmentation_enabled and
modify altk_evolve.llm.tips.segmentation.segment_trajectory to accept and use
that parsed object (or add an overload) so parsing only happens once and the
50-step capping remains explicit via trajectory_data/steps_list.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 86859d49-c6b7-4cc0-8046-3fe674c8a7cd

📥 Commits

Reviewing files that changed from the base of the PR and between d0a885b and b3b9efb.

📒 Files selected for processing (5)
  • .secrets.baseline
  • altk_evolve/config/evolve.py
  • altk_evolve/llm/tips/tips.py
  • tests/e2e/test_segmentation.py
  • tests/fixtures/appworld_venmo_task_trajectory.json
🚧 Files skipped from review as they are similar to previous changes (1)
  • altk_evolve/config/evolve.py

Comment thread tests/fixtures/appworld_venmo_task_trajectory.json
@visahak

visahak commented Apr 24, 2026

Copy link
Copy Markdown
Collaborator

Check if this is true.

Bug: Step numbering misalignment in trajectory segmentation

In altk_evolve/llm/tips/tips.py, there's a potential step numbering bug when trajectory segmentation is enabled.

The Issue

The step numbering in steps_list (used for slicing) is based on enumeration of agent_steps, which only contains reasoning and action steps. However, segment_trajectory() analyzes the raw messages, which include observation steps (tool responses) between actions. This creates a mismatch:

  • steps_list counts: reasoning=1, action=2, reasoning=3
  • Raw trajectory counts: reasoning=1, observation=2, action=3, observation=4, reasoning=5

Concrete Example

Given this agent trajectory:

Message 0: User → "Do X"
Message 1: Assistant → "I'll analyze this..." (REASONING)
Message 2: Tool → {"data": "..."} (OBSERVATION)
Message 3: Assistant → search(...) (ACTION)
Message 4: Tool → {"result": "..."} (OBSERVATION)
Message 5: Assistant → "Based on results..." (REASONING)

What gets built:

  • agent_steps = [reasoning, action, reasoning]
  • steps_list = ["Step 1 - Reasoning: ...", "Step 2 - Action: ...", "Step 3 - Reasoning: ..."]

When segment_trajectory() analyzes the raw messages and returns start_step=2, end_step=4 (wanting the observation→action→observation range):

start = max(0, 2 - 1) = 1
end = min(max(0, 4), 3) = 3  # n_steps=3
slice_steps = steps_list[1:3]  # ["Step 2 - Action: search(...)", "Step 3 - Reasoning: ..."]

Result: You requested steps 2-4 from the actual trajectory, but got steps 2-3 from steps_listcompletely misaligned content with wrong step labels.

Solution

Filter agent_steps for reasoning/action types before enumeration to ensure step numbers match list indices:

reasoning_action_steps = [s for s in agent_steps[:50] if s["type"] in ["reasoning", "action"]]
steps_list = []
for i, step in enumerate(reasoning_action_steps, 1):
    # ... rest of logic

This way, step numbering stays consistent between segmentation analysis and list slicing.

@visahak

visahak commented Apr 24, 2026

Copy link
Copy Markdown
Collaborator

Bug: Missing clean_llm_response in constrained decoding branch of segmentation

In altk_evolve/llm/tips/segmentation.py, when constrained_decoding_supported is True, the raw LLM string is passed directly into json.loads() without sanitization.

The Issue

LLMs frequently wrap valid JSON in markdown fences (e.g. ```json ... ```) even when schema validation is enabled, causing an immediate JSONDecodeError. The False branch correctly guards against this with clean_llm_response(raw), but the True branch does not:

# True branch — raw string goes straight to json.loads() on line 79
clean_response = (
    completion(...).choices[0].message.content  # no sanitization
)

# False branch — correctly sanitized
clean_response = clean_llm_response(raw)  # ✓

Both branches then hit json.loads(clean_response) on line 79. When the True branch receives a markdown-fenced response, it raises JSONDecodeError, retries 3 times, then returns [] — silently disabling segmentation entirely.

Note: tips.py has the same asymmetry but it's less impactful there since the JSONDecodeError is caught, logged, and returns an empty result rather than disabling a feature.

Solution

Apply clean_llm_response to the True branch, matching the False branch:

if constrained_decoding_supported:
    clean_response = clean_llm_response(
        completion(
            model=llm_settings.tips_model,
            messages=[{"role": "user", "content": prompt}],
            response_format=SegmentationResponse,
            custom_llm_provider=llm_settings.custom_llm_provider,
        )
        .choices[0]
        .message.content
    )

@visahak

visahak commented Apr 24, 2026

Copy link
Copy Markdown
Collaborator

Wasted LLM API calls when subtask bounds are invalid

In altk_evolve/llm/tips/tips.py, generate_tips makes expensive LLM calls inside the subtask loop before knowing whether enough valid slices exist to use the results.

The Issue

If some subtasks have out-of-range steps and are skipped, and fewer than 2 valid results remain, the code falls through to a full-trajectory fallback — making yet another LLM call. Any calls made in the loop are completely discarded.

Concrete Example

Given 3 subtasks: [valid, out-of-range, out-of-range]

Loop iteration 1: bounds check passes → LLM call made ✓
Loop iteration 2: bounds check fails → skipped
Loop iteration 3: bounds check fails → skipped

len(results) == 1 → fall through to full-trajectory fallback → another LLM call

The first LLM call is wasted and the result is thrown away.

Solution

Validate all bounds first, then only make LLM calls if >= 2 valid slices exist:

valid_slices = []
for subtask in subtasks:
    start = min(max(0, subtask.start_step - 1), n_steps)
    end = min(max(0, subtask.end_step), n_steps)
    if start < end:
        valid_slices.append((subtask, steps_list[start:end]))

if len(valid_slices) >= 2:
    results = [
        _generate_tips_for_segment(
            task_description=subtask.generalized_description,
            trajectory_slice="\n\n".join(slice_steps),
            num_steps=len(slice_steps),
            constrained_decoding_supported=constrained_decoding_supported,
        )
        for subtask, slice_steps in valid_slices
    ]
    return results

@visahak visahak left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created issues in the PR for you to check

@visahak

visahak commented Apr 24, 2026

Copy link
Copy Markdown
Collaborator

The 13 e2e tests pass
EVOLVE_BACKEND=filesystem ./.venv/bin/python -m pytest tests/e2e --run-e2e -v

JAYARAM RADHAKRISHNAN added 2 commits April 28, 2026 07:06
…tion

# Conflicts:
#	altk_evolve/frontend/mcp/mcp_server.py
#	altk_evolve/llm/guidelines/guidelines.py
#	altk_evolve/llm/guidelines/prompts/segment_trajectory.jinja2
#	altk_evolve/llm/guidelines/segmentation.py
#	altk_evolve/schema/tips.py
#	altk_evolve/sync/phoenix_sync.py
#	tests/unit/test_mcp_server.py
#	tests/unit/test_phoenix_sync.py
@jayaramkr

Copy link
Copy Markdown
Collaborator Author

Bug: Step numbering misalignment in trajectory segmentation

fixed.

@jayaramkr

Copy link
Copy Markdown
Collaborator Author

Wasted LLM API calls when subtask bounds are invalid

In altk_evolve/llm/tips/tips.py, generate_tips makes expensive LLM calls inside the subtask loop before knowing whether enough valid slices exist to use the results.

The Issue

If some subtasks have out-of-range steps and are skipped, and fewer than 2 valid results remain, the code falls through to a full-trajectory fallback — making yet another LLM call. Any calls made in the loop are completely discarded.

Concrete Example

Given 3 subtasks: [valid, out-of-range, out-of-range]

Loop iteration 1: bounds check passes → LLM call made ✓
Loop iteration 2: bounds check fails → skipped
Loop iteration 3: bounds check fails → skipped

len(results) == 1 → fall through to full-trajectory fallback → another LLM call

The first LLM call is wasted and the result is thrown away.

Solution

Validate all bounds first, then only make LLM calls if >= 2 valid slices exist:

valid_slices = []
for subtask in subtasks:
    start = min(max(0, subtask.start_step - 1), n_steps)
    end = min(max(0, subtask.end_step), n_steps)
    if start < end:
        valid_slices.append((subtask, steps_list[start:end]))

if len(valid_slices) >= 2:
    results = [
        _generate_tips_for_segment(
            task_description=subtask.generalized_description,
            trajectory_slice="\n\n".join(slice_steps),
            num_steps=len(slice_steps),
            constrained_decoding_supported=constrained_decoding_supported,
        )
        for subtask, slice_steps in valid_slices
    ]
    return results

FIXED

@jayaramkr

Copy link
Copy Markdown
Collaborator Author

Check if this is true.

Bug: Step numbering misalignment in trajectory segmentation

In altk_evolve/llm/tips/tips.py, there's a potential step numbering bug when trajectory segmentation is enabled.

The Issue

The step numbering in steps_list (used for slicing) is based on enumeration of agent_steps, which only contains reasoning and action steps. However, segment_trajectory() analyzes the raw messages, which include observation steps (tool responses) between actions. This creates a mismatch:

  • steps_list counts: reasoning=1, action=2, reasoning=3
  • Raw trajectory counts: reasoning=1, observation=2, action=3, observation=4, reasoning=5

Concrete Example

Given this agent trajectory:

Message 0: User → "Do X"
Message 1: Assistant → "I'll analyze this..." (REASONING)
Message 2: Tool → {"data": "..."} (OBSERVATION)
Message 3: Assistant → search(...) (ACTION)
Message 4: Tool → {"result": "..."} (OBSERVATION)
Message 5: Assistant → "Based on results..." (REASONING)

What gets built:

  • agent_steps = [reasoning, action, reasoning]
  • steps_list = ["Step 1 - Reasoning: ...", "Step 2 - Action: ...", "Step 3 - Reasoning: ..."]

When segment_trajectory() analyzes the raw messages and returns start_step=2, end_step=4 (wanting the observation→action→observation range):

start = max(0, 2 - 1) = 1
end = min(max(0, 4), 3) = 3  # n_steps=3
slice_steps = steps_list[1:3]  # ["Step 2 - Action: search(...)", "Step 3 - Reasoning: ..."]

Result: You requested steps 2-4 from the actual trajectory, but got steps 2-3 from steps_listcompletely misaligned content with wrong step labels.

Solution

Filter agent_steps for reasoning/action types before enumeration to ensure step numbers match list indices:

reasoning_action_steps = [s for s in agent_steps[:50] if s["type"] in ["reasoning", "action"]]
steps_list = []
for i, step in enumerate(reasoning_action_steps, 1):
    # ... rest of logic

This way, step numbering stays consistent between segmentation analysis and list slicing.

altk_evolve/llm/guidelines/segmentation.py:28-45 calls parse_openai_agents_trajectory(messages) and passes only trajectory_data["trajectory_summary"] to the LLM — the same numbered string built from the filtered steps_list that generate_guidelines() uses for slicing.

The LLM never sees observation messages. It sees:

Step 1 - Reasoning: ...
Step 2 - Action: ...
Step 3 - Reasoning: ...

So when it returns start_step=2, end_step=3, those are indices into that filtered numbered list.

The current caller (generate_guidelines) does it right because it also calls parse_openai_agents_trajectory and slices steps_list. But that's an implicit contract — nothing enforces it, and a future caller could misuse it. Doing a docstring fix for now.

@visahak visahak self-requested a review April 28, 2026 14:07
@visahak visahak merged commit 7364300 into AgentToolkit:main Apr 28, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants