feat: agent can regex through Skill reference files#1304
feat: agent can regex through Skill reference files#1304
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a new SkillSearchTool that allows agents to perform regex searches across markdown reference files within a skill. It also enhances the existing SkillTool to support directory listings with line counts. Key changes include updates to the prompt builder to provide instructions for the new search capability, new data model methods for iterating through references and counting lines, and corresponding updates to the adapter logic for tool resolution and caching. Feedback focuses on performance and memory efficiency, specifically regarding the simultaneous loading of multiple files into memory and the non-streaming approach to line counting, as well as a potential ReDoS vulnerability from unconstrained regex patterns.
| prefix_rel = None | ||
| scope_label = None | ||
| try: | ||
| corpus = list(skill.iter_markdown_references(prefix=prefix_rel)) |
There was a problem hiding this comment.
Converting the corpus generator to a list forces all markdown file contents into memory simultaneously. For skills with many or large reference files, this can lead to excessive memory consumption or OOM errors. It is safer to iterate over the generator directly to process files one by one. Note that if you remove list(), you should also ensure that any potential ValueError from the generator initialization is handled correctly.
| corpus = list(skill.iter_markdown_references(prefix=prefix_rel)) | |
| corpus = skill.iter_markdown_references(prefix=prefix_rel) |
| elif entry.suffix == ".md": | ||
| rel = entry.relative_to(base_dir).as_posix() | ||
| try: | ||
| line_count = skill.count_file_lines(root, rel) |
There was a problem hiding this comment.
Calling count_file_lines for every markdown file during a directory listing is a major performance bottleneck. Since the current implementation of count_file_lines reads the full content of each file, listing a directory with many files results in O(Total Size of All Files) I/O operations. This makes the tool very slow and memory-intensive for large reference sets.
| raise ValueError( | ||
| f"Unknown resource root: {resource_root!r}. Expected 'references' or 'assets'." | ||
| ) | ||
| text = self._read_resource(base_dir, relative_path) |
There was a problem hiding this comment.
| return ToolCallResult(output="Error: 'pattern' parameter is required.") | ||
|
|
||
| try: | ||
| regex = re.compile(pattern, re.IGNORECASE) |
There was a problem hiding this comment.
Compiling and executing a regex pattern provided by the agent without constraints or timeouts poses a risk of Regular Expression Denial of Service (ReDoS). A maliciously crafted or accidentally complex pattern can cause catastrophic backtracking, hanging the execution thread. Consider using a regex engine with built-in protections or implementing a timeout for the search operation.
📊 Coverage ReportOverall Coverage: 92% Diff: origin/main...HEAD
Summary
Line-by-lineView line-by-line diff coveragelibs/core/kiln_ai/adapters/fine_tune/dataset_formatter.pyLines 484-508 484 self._tool_cache[cache_key] = skill_def
485 tool_definitions.append(skill_def)
486
487 if skill_search_tool_ids:
! 488 cache_key = "search::" + "::".join(sorted(skill_search_tool_ids))
! 489 if cache_key in self._tool_cache:
! 490 tool_definitions.append(self._tool_cache[cache_key])
491 else:
! 492 from kiln_ai.adapters.adapter_registry import (
493 load_skills_from_tool_ids,
494 )
! 495 from kiln_ai.tools.skill_search_tool import SkillSearchTool
496
! 497 skills_dict = load_skills_from_tool_ids(task, skill_search_tool_ids)
! 498 if skills_dict:
! 499 search_tool = SkillSearchTool(
500 f"{SKILL_SEARCH_TOOL_ID_PREFIX}_combined",
501 list(skills_dict.values()),
502 )
! 503 search_def = await search_tool.toolcall_definition()
! 504 self._tool_cache[cache_key] = search_def
! 505 tool_definitions.append(search_def)
506
507 return tool_definitions if tool_definitions else Nonelibs/core/kiln_ai/adapters/model_adapters/base_adapter.pyLines 834-849 834 tools.append(SkillTool(f"{SKILL_TOOL_ID_PREFIX}_combined", skills))
835
836 search_skills = self._resolve_skill_search_tool_skills()
837 if search_skills:
! 838 seen_names = set()
! 839 for skill in search_skills:
! 840 if skill.name in seen_names:
! 841 raise ValueError(
842 f"Duplicate skill name '{skill.name}'. Each skill must have a unique name."
843 )
! 844 seen_names.add(skill.name)
! 845 tools.append(
846 SkillSearchTool(
847 f"{SKILL_SEARCH_TOOL_ID_PREFIX}_combined", search_skills
848 )
849 )libs/core/kiln_ai/datamodel/skill.pyLines 129-137 129 return
130
131 for md_path in sorted(target.rglob("*.md")):
132 if not md_path.is_file():
! 133 continue
134 try:
135 content = md_path.read_text(encoding="utf-8")
136 except UnicodeDecodeError:
137 continuelibs/core/kiln_ai/tools/skill_search_tool.pyLines 230-238 230 output="Error: Only markdown files (.md) are searchable"
231 )
232 rel = resource[len(REFERENCES_PREFIX) :]
233 if not rel:
! 234 return ToolCallResult(
235 output="Error: 'resource' must be a file path. Use 'path_prefix' to search a directory."
236 )
237 base_dir = skill.references_dir()
238 try:Lines 248-259 248 output="Error: 'resource' must be a file path. Use 'path_prefix' to search a directory."
249 )
250 try:
251 content = skill.read_reference(rel)
! 252 except FileNotFoundError:
! 253 return ToolCallResult(output=f"Error: Resource not found: {resource}")
! 254 except ValueError as e:
! 255 return ToolCallResult(output=f"Error: {e}")
256 corpus = [(rel, content)]
257 scope_label = resource
258 else:
259 if path_prefix is not None:Lines 378-387 378 if body.startswith("---\n"):
379 try:
380 end_idx = body.index("\n---\n", 3)
381 body = body[end_idx + 5 :]
! 382 except ValueError:
! 383 pass
384 body = body.lstrip("\n")
385 for i, line in enumerate(body.splitlines()):
386 if i >= _H1_SEARCH_LINES:
387 breaklibs/core/kiln_ai/tools/skill_tool.pyLines 187-196 187 f"{prefix}{connector}{entry.name} ({line_count} lines)"
188 )
189 totals["md_files"] += 1
190 totals["md_lines"] += line_count
! 191 except (ValueError, OSError, UnicodeDecodeError):
! 192 tree_lines.append(f"{prefix}{connector}{entry.name}")
193 else:
194 tree_lines.append(f"{prefix}{connector}{entry.name}")
195
196 walk(dir_path, "")
|
What does this PR do?
WIP prototype for letting agent regex through Skill reference files.
Checklists