feat(subtasks): segment trajectories into subtasks before tip generation#198
Conversation
In Claude/Bob plugin contexts the first user message is often conversational or vague, making the task_description stored in tip metadata low-quality and clustering unreliable. Trajectory segmentation fixes this by running an LLM pass over the full trajectory to infer logical subtasks with generalized, reusable descriptions (e.g. 'Authenticate with GitHub API'). Tips are then generated per subtask using only the steps in that subtask's slice, and each tip entity is stored with the subtask's generalized description. - Add SubtaskSegment and SegmentationResponse Pydantic models to schema/tips.py - Add EVOLVE_SEGMENTATION_ENABLED config flag (default: True) to evolve.py - Add segment_trajectory.jinja2 prompt encouraging contiguous subtask ranges - Add altk_evolve/llm/tips/segmentation.py with segment_trajectory() (3-retry, graceful empty-list fallback on failure) - Refactor generate_tips() to return list[TipGenerationResult] (one per subtask), expose steps_list from parse_openai_agents_trajectory() for step-range slicing, and extract _generate_tips_for_segment() helper - Update phoenix_sync._process_trajectory() and mcp_server.save_trajectory() to iterate over the results list with nested comprehensions - Add tests/unit/test_segmentation.py with 4 unit tests - Update test_phoenix_sync.py and test_mcp_server.py mocks for list return type
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds trajectory segmentation and per-subtask tip generation: new segmentation schema and prompt, segmentation logic calling the LLM with retries and validation, tip-generation refactor to produce multiple results, a config toggle, and downstream updates to flatten and persist multi-result tips. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Tips as "Tips Module\n(altk_evolve/llm/tips)"
participant Seg as "Segmentation Module\n(altk_evolve/llm/tips/segmentation)"
participant LLM as "LLM Service\n(litellm)"
participant Schema as "Schema Validation\n(Pydantic)"
participant Consumer as "Consumer\n(MCP / Phoenix)"
Client->>Tips: generate_tips(messages)
alt segmentation_enabled
Tips->>Seg: segment_trajectory(messages)
Seg->>LLM: completion(segment_prompt)
LLM-->>Seg: raw JSON/text
Seg->>Schema: validate SegmentationResponse
Schema-->>Seg: SubtaskSegment[]
Seg-->>Tips: subtask_list
loop per subtask
Tips->>LLM: completion(tips_prompt_for_segment)
LLM-->>Tips: tips JSON/text
Tips->>Schema: validate TipGenerationResult
Schema-->>Tips: TipGenerationResult
end
else
Tips->>LLM: completion(tips_prompt_full)
LLM-->>Tips: tips JSON/text
Tips->>Schema: validate TipGenerationResult
Schema-->>Tips: TipGenerationResult
end
Tips-->>Client: list[TipGenerationResult]
Client->>Consumer: update_entities(flattened_tips)
Consumer->>Consumer: store tips with metadata
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
altk_evolve/llm/tips/tips.py (1)
87-108:⚠️ Potential issue | 🟠 Major
num_stepsandsteps_listcan disagree for long trajectories.
steps_listis built fromagent_steps[:50](at most 50 entries), whilenum_stepson line 107 counts every action/reasoning entry inagent_stepswithout the[:50]cap. For any trajectory with more than 50 action/reasoning steps:
segment_trajectory(segmentation.py lines 42-43) passes the fullnum_stepsinto the prompt together with atrajectory_summarythat only shows steps 1..50. The LLM is told "there are 80 steps" but only sees 50, so it will happily returnend_step=80, andgenerate_tipsat line 202 clamps ton_steps = len(steps_list) = 50— silently mis-attributing the trailing subtask to the visible prefix._generate_tips_for_segmentreceivesnum_steps=trajectory_data["num_steps"]on line 219 but atrajectory_slicecontaining only the visible steps, same discrepancy.Either cap
num_stepstolen(steps_list)(and truncate consistently), or stop truncating to 50 here and do the truncation only when rendering.Separately, the
elif step_type == "observation":branch on lines 99-100 is dead code —agent_stepsis only ever populated with"reasoning"(line 50) and"action"(line 74) entries — and can be removed or the producer extended to emit observations.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@altk_evolve/llm/tips/tips.py` around lines 87 - 108, The returned num_steps can disagree with steps_list because steps_list is built from agent_steps[:50] while num_steps counts all action/reasoning entries; update the function that builds steps_list/trajectory_summary so num_steps is capped to len(steps_list) (or compute num_steps from the same sliced source) and ensure trajectory_summary/steps_list and the value passed into _generate_tips_for_segment/segment_trajectory are consistent (use the same agent_steps slice or avoid slicing here and move truncation to rendering); also remove the dead elif for "observation" or add producing code to emit observations so the branch is reachable—references: steps_list, agent_steps, num_steps, trajectory_summary, generate_tips, _generate_tips_for_segment, segment_trajectory, trajectory_slice.
🧹 Nitpick comments (9)
altk_evolve/llm/tips/prompts/segment_trajectory.jinja2 (1)
17-20: Clarify step indexing base in the prompt.The prompt says
start_step/end_stepare "the step numbers shown in the trajectory above (inclusive)" but doesn't state the base (1-indexed vs 0-indexed). The example uses1…9, implying 1-based, and downstream slicing viasteps_listmust match. Making this explicit (e.g., "1-indexed, inclusive") reduces the chance of off-by-one errors whentrajectory_summaryrendering changes.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@altk_evolve/llm/tips/prompts/segment_trajectory.jinja2` around lines 17 - 20, The prompt is ambiguous about whether start_step/end_step are 0- or 1-indexed which can cause off-by-one bugs when matching trajectory_summary or slicing via steps_list; update the text around `start_step` and `end_step` in segment_trajectory.jinja2 to explicitly state that these are 1-indexed and inclusive (e.g., "1-indexed, inclusive") and ensure any example and any use of `steps_list` or references to the trajectory in the template and downstream code assume this 1-based indexing consistently.tests/unit/test_phoenix_sync.py (1)
511-635: LGTM on the mock-shape updates.The three updated mocks correctly adopt the new
list[TipGenerationResult]contract andtest_sync_processes_valid_spanscontinues to assert provenance metadata (task_description,source_task_id,source_span_id,creation_mode) on the flattened tip entities, which exercises the nested comprehension in_process_trajectory.One coverage gap worth considering: there's no test asserting that
update_entitiesis not called for tips whengenerate_tipsreturns[]or a list of results with emptytips— the newif tip_entities:guard inphoenix_sync.py(line 492) is currently unexercised.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_phoenix_sync.py` around lines 511 - 635, Add a unit test that exercises the new guard in phoenix_sync._process_trajectory by verifying phoenix_sync.client.update_entities is NOT called when generate_tips returns either an empty list or a list of TipGenerationResult objects whose tips lists are empty; specifically, patch "altk_evolve.sync.phoenix_sync.generate_tips" to return [] and another case returning [TipGenerationResult(tips=[], task_description="...")] and call phoenix_sync.sync(...) then assert phoenix_sync.client.update_entities.assert_not_called() to cover the `if tip_entities:` branch.tests/unit/test_segmentation.py (1)
28-86: Tests only cover the non-constrained-decoding path.All four tests stub
supports_response_schema=Falseandget_supported_openai_params=[], so theconstrained_decoding_supported=Truebranch insegment_trajectory(which setslitellm.enable_json_schema_validation = Trueand passesresponse_format=SegmentationResponse) is never exercised. Given this branch is the preferred path when the model supports it, a fifth test that flips both mocks toTrue/["response_format"]and assertscompletionwas called withresponse_format=SegmentationResponsewould prevent silent regressions.Also consider one test with a response that parses as JSON but fails
SegmentationResponsevalidation (e.g., missingsubtaskskey or wrong field types) to confirmValidationErrortriggers retry rather than propagating — the current "bad" fixture only exercises theJSONDecodeErrorpath.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_segmentation.py` around lines 28 - 86, Add tests exercising the constrained-decoding branch by mocking supports_response_schema to True and get_supported_openai_params to return ["response_format"], call segment_trajectory and assert the mocked completion was invoked with response_format=SegmentationResponse and that litellm.enable_json_schema_validation is effectively used; additionally add a test where completion returns syntactically valid JSON that fails SegmentationResponse validation (e.g., missing "subtasks" or wrong types) and assert segment_trajectory treats this like a parse/validation failure and retries (incrementing completion.call_count) rather than propagating a ValidationError.altk_evolve/frontend/mcp/mcp_server.py (1)
205-229: Correct, but duplicated withphoenix_sync._process_trajectory.The nested comprehension +
if tip_entities:guard is byte-for-byte parallel toaltk_evolve/sync/phoenix_sync.pylines 472-497 aside from thecreation_modeand span id metadata. Consider extracting a small helper likebuild_tip_entities(results, *, source_task_id, source_span_id=None, creation_mode)inaltk_evolve/llm/tips/tips.py(or a neighboring utility) so both call sites stay in lockstep as fields evolve.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@altk_evolve/frontend/mcp/mcp_server.py` around lines 205 - 229, The code builds tip_entities via a nested comprehension after calling generate_tips and then calls get_client().update_entities, which duplicates logic present in phoenix_sync._process_trajectory; extract the entity-construction logic into a shared helper (e.g. build_tip_entities(results, *, source_task_id, source_span_id=None, creation_mode)) placed in altk_evolve/llm/tips/tips.py (or a nearby util) and replace the comprehension in both mcp_server.py (after generate_tips) and phoenix_sync._process_trajectory to call build_tip_entities(...) so fields like category, rationale, trigger, implementation_steps, task_description, source_task_id, source_span_id and creation_mode stay consistent as they evolve.tests/unit/test_mcp_server.py (1)
22-30: Consider assertingtask_descriptionpropagation in tip metadata.The mock now returns a list and
save_trajectorybuildstask_descriptionfromresult.task_description("Add feature"). The existing assertions don't cover this new field, so a regression in the per-subtask description propagation wouldn't be caught here. Addingassert tip_entity.metadata["task_description"] == "Add feature"would close that gap.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_mcp_server.py` around lines 22 - 30, The test currently mocks mock_generate_tips to return mock_result with result.task_description="Add feature", but does not assert that save_trajectory propagated that into each tip entity; add an assertion after the saved tip is retrieved to verify tip_entity.metadata["task_description"] == "Add feature" (locate the check near existing assertions for tip_entity/metadata in tests/unit/test_mcp_server.py, referencing mock_result, mock_tip, save_trajectory and tip_entity to ensure the per-subtask description is covered).altk_evolve/sync/phoenix_sync.py (1)
471-499: Behavior looks correct; consider counter semantics when all segments fail.The nested comprehension and
sum(len(r.tips) for r in results)correctly handle multi-segment results, and theif tip_entities:guard avoids a pointlessupdate_entitiescall when segmentation yields no tips.Minor semantic note: when segmentation is enabled and every per-segment tip generation fails (or when
segment_trajectoryreturns[]and the fallback produces no tips),_process_trajectoryreturns 0 andsync()still counts the trajectory asprocessed += 1at line 562. That matches the pre-refactor behavior for a failed single call, but given segmentation multiplies LLM calls, you may want either (a) a log line distinguishing "processed with 0 tips" or (b) surfacing partial-failure counts inSyncResult. Not a blocker — just worth a thought for observability.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@altk_evolve/sync/phoenix_sync.py` around lines 471 - 499, The code currently returns 0 tips from _process_trajectory even when segmentation ran and all per-segment tip generations failed; add observability by logging when tip generation completed but produced zero tips and/or surface partial-failure counts in SyncResult. Specifically, in _process_trajectory (where generate_tips(...) and tip_entities are computed) compute total_tips = sum(len(r.tips) for r in results) and if total_tips == 0 and results: emit a warning via self.logger (or process logger) like "trajectory processed with 0 tips (segmented failures)" including trajectory["trace_id"]/span_id; alternatively, update the method signature to return (total_tips, partial_failure_count) and propagate that into sync() and SyncResult so the caller can differentiate processed-but-empty from unprocessed. Use the symbols generate_tips, tip_entities, _process_trajectory, sync, and SyncResult to locate changes.altk_evolve/llm/tips/segmentation.py (2)
17-17: Specifyencoding="utf-8"when reading the prompt.
Path.read_text()without an explicit encoding uses the platform-default, which can causeUnicodeDecodeErrorat import time on non-UTF-8 systems (e.g., Windows cp1252). Ruff'sPLW1514flags this.♻️ Proposed change
-_SEGMENT_TEMPLATE = Template((Path(__file__).parent / "prompts/segment_trajectory.jinja2").read_text()) +_SEGMENT_TEMPLATE = Template((Path(__file__).parent / "prompts/segment_trajectory.jinja2").read_text(encoding="utf-8"))The same applies to
_GENERATE_TIPS_TEMPLATEinaltk_evolve/llm/tips/tips.pyat line 19.As per coding guidelines:
**/*.py: Use Ruff for linting and formatting (configured in pyproject.toml).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@altk_evolve/llm/tips/segmentation.py` at line 17, The template loading uses Path.read_text() without an explicit encoding which can raise UnicodeDecodeError on non-UTF-8 platforms; update the Template constructors for _SEGMENT_TEMPLATE and _GENERATE_TIPS_TEMPLATE to call read_text(encoding="utf-8") so the prompt files are always read as UTF-8, preserving current behavior and satisfying Ruff's PLW1514 rule.
30-73: Consider extracting a shared helper for constrained-decoding detection and completion.The capability detection at lines 30-39 and the
completion(...)branching at lines 50-73 are near-duplicates ofaltk_evolve/llm/tips/tips.pylines 176-185 and 125-148 respectively. A shared helper (e.g.,_completion_with_schema(prompt, response_model)returning the cleaned string) inaltk_evolve/llm/would eliminate drift risk between the two sites, since both already use the samellm_settingsandclean_llm_response.Not blocking; can be deferred.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@altk_evolve/llm/tips/segmentation.py` around lines 30 - 73, Extract the duplicated capability-detection and completion logic into a shared helper (e.g., _completion_with_schema(prompt: str, response_model: Type | None) -> str) in altk_evolve/llm/, which should encapsulate the calls to get_supported_openai_params and supports_response_schema for the given llm_settings, toggle litellm.enable_json_schema_validation, call completion(...) with appropriate response_format or without it, and return the final cleaned string using clean_llm_response when necessary; then replace the duplicated blocks in segmentation.py (the use of get_supported_openai_params, supports_response_format, supports_response_schema, constrained_decoding_supported, and the completion(...) branches including SegmentationResponse and clean_llm_response) with a call to that new helper, passing the prompt and SegmentationResponse where required.altk_evolve/llm/tips/tips.py (1)
198-211: Defensively guard against malformed step ranges from the segmenter.If the segmenter returns
start_step > end_step,start_step > n_steps, or overlapping/out-of-order ranges,slice_stepscan be empty, producing aTipGenerationResultwithnum_steps=0and a trajectory slice of""._generate_tips_for_segmentwill then prompt the LLM with an empty trajectory, almost certainly yielding empty/garbage tips but consuming an LLM call per bad segment.Consider dropping segments where
len(slice_steps) == 0(orend <= start) with alogger.warning, so downstream clustering isn't polluted with empty results and token spend is capped.for subtask in subtasks: start = max(0, subtask.start_step - 1) end = min(n_steps, subtask.end_step) slice_steps = steps_list[start:end] + if not slice_steps: + logger.warning( + f"Skipping empty subtask segment " + f"[{subtask.start_step}, {subtask.end_step}] (n_steps={n_steps}): " + f"{subtask.generalized_description!r}" + ) + continue result = _generate_tips_for_segment(🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@altk_evolve/llm/tips/tips.py` around lines 198 - 211, The segment loop in the tips generation currently calls _generate_tips_for_segment even when slice_steps is empty due to malformed ranges (e.g., subtask.start_step > subtask.end_step, start_step > n_steps, or overlapping/out-of-order ranges); update the loop that iterates over subtasks to compute start = max(0, subtask.start_step - 1) and end = min(n_steps, subtask.end_step) then skip any segment where end <= start or len(slice_steps) == 0 by emitting a logger.warning (include subtask.generalized_description, start, end, and n_steps) and not appending a TipGenerationResult for that segment so _generate_tips_for_segment is only called with non-empty trajectory_slice, preventing empty results and wasted LLM calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@altk_evolve/llm/tips/segmentation.py`:
- Around line 48-83: Narrow the retry except to only JSONDecodeError and
ValidationError (remove the broad Exception) so unexpected errors from
completion(...) or attribute/key errors propagate; before the loop compute and
set litellm.enable_json_schema_validation once (instead of mutating it each
attempt) or use a context manager to avoid race conditions with concurrent
_generate_tips_for_segment calls; after each failed attempt log the exception
with context (use logger.debug/info) including attempt index, prompt, and
last_error, and if clean_response is empty/None (from completion(...) or
clean_llm_response) treat it as a non-retriable error (log and break/raise)
rather than retrying; continue to validate using
SegmentationResponse.model_validate(json.loads(clean_response)) and only retry
on the two JSON/Validation exceptions, re-raising other exceptions.
In `@altk_evolve/schema/tips.py`:
- Around line 20-28: Add Field descriptions to SubtaskSegment's fields
(generalized_description, start_step, end_step, purpose) using
pydantic.Field(description="...") so the JSON schema passed as
response_format=SegmentationResponse includes grounding text; also enforce
step-range validation by constraining start_step and end_step (e.g., use
pydantic.conint or Field(ge=1) for start_step/end_step and add a validator
method on SubtaskSegment that raises ValueError if end_step < start_step) so
invalid ranges fail validation and trigger retries in the segmentation consumer.
---
Outside diff comments:
In `@altk_evolve/llm/tips/tips.py`:
- Around line 87-108: The returned num_steps can disagree with steps_list
because steps_list is built from agent_steps[:50] while num_steps counts all
action/reasoning entries; update the function that builds
steps_list/trajectory_summary so num_steps is capped to len(steps_list) (or
compute num_steps from the same sliced source) and ensure
trajectory_summary/steps_list and the value passed into
_generate_tips_for_segment/segment_trajectory are consistent (use the same
agent_steps slice or avoid slicing here and move truncation to rendering); also
remove the dead elif for "observation" or add producing code to emit
observations so the branch is reachable—references: steps_list, agent_steps,
num_steps, trajectory_summary, generate_tips, _generate_tips_for_segment,
segment_trajectory, trajectory_slice.
---
Nitpick comments:
In `@altk_evolve/frontend/mcp/mcp_server.py`:
- Around line 205-229: The code builds tip_entities via a nested comprehension
after calling generate_tips and then calls get_client().update_entities, which
duplicates logic present in phoenix_sync._process_trajectory; extract the
entity-construction logic into a shared helper (e.g. build_tip_entities(results,
*, source_task_id, source_span_id=None, creation_mode)) placed in
altk_evolve/llm/tips/tips.py (or a nearby util) and replace the comprehension in
both mcp_server.py (after generate_tips) and phoenix_sync._process_trajectory to
call build_tip_entities(...) so fields like category, rationale, trigger,
implementation_steps, task_description, source_task_id, source_span_id and
creation_mode stay consistent as they evolve.
In `@altk_evolve/llm/tips/prompts/segment_trajectory.jinja2`:
- Around line 17-20: The prompt is ambiguous about whether start_step/end_step
are 0- or 1-indexed which can cause off-by-one bugs when matching
trajectory_summary or slicing via steps_list; update the text around
`start_step` and `end_step` in segment_trajectory.jinja2 to explicitly state
that these are 1-indexed and inclusive (e.g., "1-indexed, inclusive") and ensure
any example and any use of `steps_list` or references to the trajectory in the
template and downstream code assume this 1-based indexing consistently.
In `@altk_evolve/llm/tips/segmentation.py`:
- Line 17: The template loading uses Path.read_text() without an explicit
encoding which can raise UnicodeDecodeError on non-UTF-8 platforms; update the
Template constructors for _SEGMENT_TEMPLATE and _GENERATE_TIPS_TEMPLATE to call
read_text(encoding="utf-8") so the prompt files are always read as UTF-8,
preserving current behavior and satisfying Ruff's PLW1514 rule.
- Around line 30-73: Extract the duplicated capability-detection and completion
logic into a shared helper (e.g., _completion_with_schema(prompt: str,
response_model: Type | None) -> str) in altk_evolve/llm/, which should
encapsulate the calls to get_supported_openai_params and
supports_response_schema for the given llm_settings, toggle
litellm.enable_json_schema_validation, call completion(...) with appropriate
response_format or without it, and return the final cleaned string using
clean_llm_response when necessary; then replace the duplicated blocks in
segmentation.py (the use of get_supported_openai_params,
supports_response_format, supports_response_schema,
constrained_decoding_supported, and the completion(...) branches including
SegmentationResponse and clean_llm_response) with a call to that new helper,
passing the prompt and SegmentationResponse where required.
In `@altk_evolve/llm/tips/tips.py`:
- Around line 198-211: The segment loop in the tips generation currently calls
_generate_tips_for_segment even when slice_steps is empty due to malformed
ranges (e.g., subtask.start_step > subtask.end_step, start_step > n_steps, or
overlapping/out-of-order ranges); update the loop that iterates over subtasks to
compute start = max(0, subtask.start_step - 1) and end = min(n_steps,
subtask.end_step) then skip any segment where end <= start or len(slice_steps)
== 0 by emitting a logger.warning (include subtask.generalized_description,
start, end, and n_steps) and not appending a TipGenerationResult for that
segment so _generate_tips_for_segment is only called with non-empty
trajectory_slice, preventing empty results and wasted LLM calls.
In `@altk_evolve/sync/phoenix_sync.py`:
- Around line 471-499: The code currently returns 0 tips from
_process_trajectory even when segmentation ran and all per-segment tip
generations failed; add observability by logging when tip generation completed
but produced zero tips and/or surface partial-failure counts in SyncResult.
Specifically, in _process_trajectory (where generate_tips(...) and tip_entities
are computed) compute total_tips = sum(len(r.tips) for r in results) and if
total_tips == 0 and results: emit a warning via self.logger (or process logger)
like "trajectory processed with 0 tips (segmented failures)" including
trajectory["trace_id"]/span_id; alternatively, update the method signature to
return (total_tips, partial_failure_count) and propagate that into sync() and
SyncResult so the caller can differentiate processed-but-empty from unprocessed.
Use the symbols generate_tips, tip_entities, _process_trajectory, sync, and
SyncResult to locate changes.
In `@tests/unit/test_mcp_server.py`:
- Around line 22-30: The test currently mocks mock_generate_tips to return
mock_result with result.task_description="Add feature", but does not assert that
save_trajectory propagated that into each tip entity; add an assertion after the
saved tip is retrieved to verify tip_entity.metadata["task_description"] == "Add
feature" (locate the check near existing assertions for tip_entity/metadata in
tests/unit/test_mcp_server.py, referencing mock_result, mock_tip,
save_trajectory and tip_entity to ensure the per-subtask description is
covered).
In `@tests/unit/test_phoenix_sync.py`:
- Around line 511-635: Add a unit test that exercises the new guard in
phoenix_sync._process_trajectory by verifying
phoenix_sync.client.update_entities is NOT called when generate_tips returns
either an empty list or a list of TipGenerationResult objects whose tips lists
are empty; specifically, patch "altk_evolve.sync.phoenix_sync.generate_tips" to
return [] and another case returning [TipGenerationResult(tips=[],
task_description="...")] and call phoenix_sync.sync(...) then assert
phoenix_sync.client.update_entities.assert_not_called() to cover the `if
tip_entities:` branch.
In `@tests/unit/test_segmentation.py`:
- Around line 28-86: Add tests exercising the constrained-decoding branch by
mocking supports_response_schema to True and get_supported_openai_params to
return ["response_format"], call segment_trajectory and assert the mocked
completion was invoked with response_format=SegmentationResponse and that
litellm.enable_json_schema_validation is effectively used; additionally add a
test where completion returns syntactically valid JSON that fails
SegmentationResponse validation (e.g., missing "subtasks" or wrong types) and
assert segment_trajectory treats this like a parse/validation failure and
retries (incrementing completion.call_count) rather than propagating a
ValidationError.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d209fb78-be14-4c74-8288-ce5d7c44b802
📒 Files selected for processing (10)
altk_evolve/config/evolve.pyaltk_evolve/frontend/mcp/mcp_server.pyaltk_evolve/llm/tips/prompts/segment_trajectory.jinja2altk_evolve/llm/tips/segmentation.pyaltk_evolve/llm/tips/tips.pyaltk_evolve/schema/tips.pyaltk_evolve/sync/phoenix_sync.pytests/unit/test_mcp_server.pytests/unit/test_phoenix_sync.pytests/unit/test_segmentation.py
- tips.py: cap num_steps to len(steps_list) so segmentation LLM step count matches the visible trajectory; remove dead observation branch - schema/tips.py: add Field descriptions to SubtaskSegment for JSON schema grounding; add model_validator rejecting end_step < start_step - segmentation.py: narrow except to (JSONDecodeError, ValidationError), add explicit empty-response guard, log each failed attempt at debug; move litellm.enable_json_schema_validation outside the retry loop - tips.py: catch broad Exception from segment_trajectory() and fall back to full-trajectory mode with a warning log - test_segmentation.py: add tests for inverted range rejection and non-parse error propagation from segment_trajectory
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@altk_evolve/llm/tips/tips.py`:
- Around line 200-213: The loop over subtasks can pass an empty trajectory slice
into _generate_tips_for_segment when subtask.start_step/end_step lie outside the
visible range; clamp and validate the slice bounds before calling the LLM:
compute start = min(max(0, subtask.start_step - 1), n_steps) and end =
min(max(0, subtask.end_step), n_steps), then if start >= end (or slice_steps
would be empty) skip this subtask (or log/warn) instead of calling
_generate_tips_for_segment with num_steps==0; update the loop that builds
slice_steps/num_steps accordingly to use these validated bounds.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 969c8342-9af4-40e9-a383-a5f2b9528f03
📒 Files selected for processing (7)
altk_evolve/config/evolve.pyaltk_evolve/frontend/mcp/mcp_server.pyaltk_evolve/llm/tips/segmentation.pyaltk_evolve/llm/tips/tips.pyaltk_evolve/schema/tips.pytests/unit/test_mcp_server.pytests/unit/test_segmentation.py
✅ Files skipped from review due to trivial changes (2)
- tests/unit/test_mcp_server.py
- altk_evolve/config/evolve.py
🚧 Files skipped from review as they are similar to previous changes (3)
- altk_evolve/schema/tips.py
- altk_evolve/llm/tips/segmentation.py
- tests/unit/test_segmentation.py
Clamp both bounds to [0, n_steps] before slicing, then skip the subtask with a warning if start >= end. Prevents passing num_steps=0 and an empty trajectory slice into _generate_tips_for_segment when the LLM returns step indices outside the visible 50-step window.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@altk_evolve/llm/tips/tips.py`:
- Around line 200-218: The code currently returns whenever any subtask produced
a result, which violates the requirement to only use per-subtask tips when at
least two valid subtasks exist; change the post-filter check so that after
iterating subtasks and building results you only return results if len(results)
>= 2 (i.e., at least two valid subtask slices were generated); otherwise fall
through to the full-trajectory fallback. Update the logic around the loop that
builds results (referencing variables/subroutines: subtasks, results, start/end
calculation, steps_list, and _generate_tips_for_segment) to perform this
stricter count check before returning.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9b88539e-fad3-4cfd-936f-65944f80719f
📒 Files selected for processing (1)
altk_evolve/llm/tips/tips.py
…simulator tokens)
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
altk_evolve/llm/tips/tips.py (1)
87-105:⚠️ Potential issue | 🟡 MinorAdd logging and extract the 50-step trajectory cap as a constant for discoverability.
The hardcoded
agent_steps[:50]at lines 88 and 105 silently truncates trajectories longer than 50 steps with no logging or warning. When segmentation is enabled,parse_openai_agents_trajectoryis called twice (lines 185 and again atsegmentation.py:28), both seeing the same capped trajectory. This means long trajectories silently lose their tail—a user has no visibility that steps beyond 50 are being dropped.Additionally, the type filter at line 105 is redundant:
agent_stepsonly ever contains entries withtypein{"reasoning", "action"}(appended only at lines 50 and 74).Suggested action:
- Extract
50as a module-level constant (e.g.,_MAX_TRAJECTORY_STEPS = 50) for clarity and tunability.- Log at
debugorinfolevel when truncation occurs, so it's observable.- Replace the redundant filter at line 105 with
len(visible_steps).🛠️ Suggested refinement
+_MAX_TRAJECTORY_STEPS = 50 + def parse_openai_agents_trajectory(messages: list[dict]) -> dict: @@ - steps_list = [] - for i, step in enumerate(agent_steps[:50], 1): + visible_steps = agent_steps[:_MAX_TRAJECTORY_STEPS] + if len(agent_steps) > _MAX_TRAJECTORY_STEPS: + logger.debug( + f"Trajectory truncated from {len(agent_steps)} to {_MAX_TRAJECTORY_STEPS} steps" + ) + steps_list = [] + for i, step in enumerate(visible_steps, 1): step_type = step["type"] @@ return { "task_instruction": task_instruction or DEFAULT_TASK_DESCRIPTION, "trajectory_summary": "\n\n".join(steps_list), "steps_list": steps_list, "function_calls": function_calls, - "num_steps": len([s for s in agent_steps[:50] if s["type"] in ["action", "reasoning"]]), + "num_steps": len(visible_steps), }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@altk_evolve/llm/tips/tips.py` around lines 87 - 105, The code silently truncates trajectories using agent_steps[:50] and redundantly re-filters types when computing num_steps; extract the magic number into a module-level constant (e.g., _MAX_TRAJECTORY_STEPS = 50), compute visible_steps = agent_steps[:_MAX_TRAJECTORY_STEPS], log at debug/info when len(agent_steps) > _MAX_TRAJECTORY_STEPS (including original and truncated lengths) before building steps_list, use visible_steps when iterating to build steps_list and replace the redundant len([s for s in agent_steps[:50] if ...]) with len(visible_steps), and ensure parse_openai_agents_trajectory callers now see the same visible_steps so truncation is explicit and discoverable.
🧹 Nitpick comments (1)
altk_evolve/llm/tips/tips.py (1)
185-198: Trajectory is parsed twice when segmentation is enabled.
parse_openai_agents_trajectory(messages)runs here at line 185, andsegment_trajectory()inaltk_evolve/llm/tips/segmentation.pycalls it again internally on the samemessages. For large trajectories with many function-call arguments tojson.loads/json.dumps, this doubles the parsing cost on every tip-generation call.Consider either threading the parsed
trajectory_data(or justtrajectory_summary/num_steps) intosegment_trajectory, or moving the parse insidegenerate_tipsand lettingsegment_trajectoryaccept already-parsed data. This also keeps the "what does the segmenter see?" invariant explicit rather than relying on both sides independently capping at 50.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@altk_evolve/llm/tips/tips.py` around lines 185 - 198, The code currently calls parse_openai_agents_trajectory(messages) in generate_tips and then calls segment_trajectory(messages), which re-parses the same messages; to fix, change the call/signature so segment_trajectory accepts the already-parsed trajectory_data (or the derived trajectory_summary/steps_list/num_steps) instead of raw messages: update the call site in generate_tips to pass trajectory_data (or trajectory_summary/steps_list) when evolve_config.segmentation_enabled and modify altk_evolve.llm.tips.segmentation.segment_trajectory to accept and use that parsed object (or add an overload) so parsing only happens once and the 50-step capping remains explicit via trajectory_data/steps_list.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/fixtures/appworld_venmo_task_trajectory.json`:
- Around line 231-244: The fixture omits Nancy from the taxi split causing an
incorrect net settlement; update the code that builds the three venmo requests
(symbols taxi_share, taxi_note, venmo_access_token and calls
apis.venmo.create_payment_request) to include a third request for Nancy (e.g.,
create result_nancy =
apis.venmo.create_payment_request(user_email=\"nancy@example.com\",
amount=taxi_share, access_token=venmo_access_token, description=taxi_note)), and
update the printed/output line to include the simulator’s actual returned
payment_request_id for that third request instead of the current two-request
output (adjust result_jordan/result_kathryn print to include result_nancy).
---
Outside diff comments:
In `@altk_evolve/llm/tips/tips.py`:
- Around line 87-105: The code silently truncates trajectories using
agent_steps[:50] and redundantly re-filters types when computing num_steps;
extract the magic number into a module-level constant (e.g.,
_MAX_TRAJECTORY_STEPS = 50), compute visible_steps =
agent_steps[:_MAX_TRAJECTORY_STEPS], log at debug/info when len(agent_steps) >
_MAX_TRAJECTORY_STEPS (including original and truncated lengths) before building
steps_list, use visible_steps when iterating to build steps_list and replace the
redundant len([s for s in agent_steps[:50] if ...]) with len(visible_steps), and
ensure parse_openai_agents_trajectory callers now see the same visible_steps so
truncation is explicit and discoverable.
---
Nitpick comments:
In `@altk_evolve/llm/tips/tips.py`:
- Around line 185-198: The code currently calls
parse_openai_agents_trajectory(messages) in generate_tips and then calls
segment_trajectory(messages), which re-parses the same messages; to fix, change
the call/signature so segment_trajectory accepts the already-parsed
trajectory_data (or the derived trajectory_summary/steps_list/num_steps) instead
of raw messages: update the call site in generate_tips to pass trajectory_data
(or trajectory_summary/steps_list) when evolve_config.segmentation_enabled and
modify altk_evolve.llm.tips.segmentation.segment_trajectory to accept and use
that parsed object (or add an overload) so parsing only happens once and the
50-step capping remains explicit via trajectory_data/steps_list.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 86859d49-c6b7-4cc0-8046-3fe674c8a7cd
📒 Files selected for processing (5)
.secrets.baselinealtk_evolve/config/evolve.pyaltk_evolve/llm/tips/tips.pytests/e2e/test_segmentation.pytests/fixtures/appworld_venmo_task_trajectory.json
🚧 Files skipped from review as they are similar to previous changes (1)
- altk_evolve/config/evolve.py
|
Check if this is true. Bug: Step numbering misalignment in trajectory segmentationIn The IssueThe step numbering in
Concrete ExampleGiven this agent trajectory: What gets built:
When start = max(0, 2 - 1) = 1
end = min(max(0, 4), 3) = 3 # n_steps=3
slice_steps = steps_list[1:3] # ["Step 2 - Action: search(...)", "Step 3 - Reasoning: ..."]Result: You requested steps 2-4 from the actual trajectory, but got steps 2-3 from SolutionFilter reasoning_action_steps = [s for s in agent_steps[:50] if s["type"] in ["reasoning", "action"]]
steps_list = []
for i, step in enumerate(reasoning_action_steps, 1):
# ... rest of logicThis way, step numbering stays consistent between segmentation analysis and list slicing. |
Bug: Missing
|
|
Wasted LLM API calls when subtask bounds are invalid In The IssueIf some subtasks have out-of-range steps and are skipped, and fewer than 2 valid results remain, the code falls through to a full-trajectory fallback — making yet another LLM call. Any calls made in the loop are completely discarded. Concrete ExampleGiven 3 subtasks: The first LLM call is wasted and the result is thrown away. SolutionValidate all bounds first, then only make LLM calls if valid_slices = []
for subtask in subtasks:
start = min(max(0, subtask.start_step - 1), n_steps)
end = min(max(0, subtask.end_step), n_steps)
if start < end:
valid_slices.append((subtask, steps_list[start:end]))
if len(valid_slices) >= 2:
results = [
_generate_tips_for_segment(
task_description=subtask.generalized_description,
trajectory_slice="\n\n".join(slice_steps),
num_steps=len(slice_steps),
constrained_decoding_supported=constrained_decoding_supported,
)
for subtask, slice_steps in valid_slices
]
return results |
visahak
left a comment
There was a problem hiding this comment.
Created issues in the PR for you to check
|
The 13 e2e tests pass |
…tion # Conflicts: # altk_evolve/frontend/mcp/mcp_server.py # altk_evolve/llm/guidelines/guidelines.py # altk_evolve/llm/guidelines/prompts/segment_trajectory.jinja2 # altk_evolve/llm/guidelines/segmentation.py # altk_evolve/schema/tips.py # altk_evolve/sync/phoenix_sync.py # tests/unit/test_mcp_server.py # tests/unit/test_phoenix_sync.py
…kip wasted LLM calls on invalid bounds
fixed. |
FIXED |
…ered steps_list, not raw messages
altk_evolve/llm/guidelines/segmentation.py:28-45 calls parse_openai_agents_trajectory(messages) and passes only trajectory_data["trajectory_summary"] to the LLM — the same numbered string built from the filtered steps_list that generate_guidelines() uses for slicing. The LLM never sees observation messages. It sees: Step 1 - Reasoning: ... So when it returns start_step=2, end_step=3, those are indices into that filtered numbered list. The current caller (generate_guidelines) does it right because it also calls parse_openai_agents_trajectory and slices steps_list. But that's an implicit contract — nothing enforces it, and a future caller could misuse it. Doing a docstring fix for now. |
In Claude/Bob plugin contexts the first user message is often conversational or vague, making the task_description stored in tip metadata low-quality and clustering unreliable. Trajectory segmentation fixes this by running an LLM pass over the full trajectory to infer logical subtasks with generalized, reusable descriptions (e.g. 'Authenticate with GitHub API'). Tips are then generated per subtask using only the steps in that subtask's slice, and each tip entity is stored with the subtask's generalized description.
Summary by CodeRabbit
New Features
Bug Fixes / Improvements
Tests