Skip to content

feat: agent can regex through Skill reference files#1304

Draft
leonardmq wants to merge 1 commit intomainfrom
leonard/skill-search
Draft

feat: agent can regex through Skill reference files#1304
leonardmq wants to merge 1 commit intomainfrom
leonard/skill-search

Conversation

@leonardmq
Copy link
Copy Markdown
Collaborator

What does this PR do?

WIP prototype for letting agent regex through Skill reference files.

Checklists

  • Tests have been run locally and passed
  • New tests have been added to any work in /lib

@leonardmq leonardmq marked this pull request as draft April 21, 2026 09:03
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 21, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6ffd18fa-69d1-421a-bec1-17c73ed8b9bb

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch leonard/skill-search

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new SkillSearchTool that allows agents to perform regex searches across markdown reference files within a skill. It also enhances the existing SkillTool to support directory listings with line counts. Key changes include updates to the prompt builder to provide instructions for the new search capability, new data model methods for iterating through references and counting lines, and corresponding updates to the adapter logic for tool resolution and caching. Feedback focuses on performance and memory efficiency, specifically regarding the simultaneous loading of multiple files into memory and the non-streaming approach to line counting, as well as a potential ReDoS vulnerability from unconstrained regex patterns.

prefix_rel = None
scope_label = None
try:
corpus = list(skill.iter_markdown_references(prefix=prefix_rel))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Converting the corpus generator to a list forces all markdown file contents into memory simultaneously. For skills with many or large reference files, this can lead to excessive memory consumption or OOM errors. It is safer to iterate over the generator directly to process files one by one. Note that if you remove list(), you should also ensure that any potential ValueError from the generator initialization is handled correctly.

Suggested change
corpus = list(skill.iter_markdown_references(prefix=prefix_rel))
corpus = skill.iter_markdown_references(prefix=prefix_rel)

elif entry.suffix == ".md":
rel = entry.relative_to(base_dir).as_posix()
try:
line_count = skill.count_file_lines(root, rel)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Calling count_file_lines for every markdown file during a directory listing is a major performance bottleneck. Since the current implementation of count_file_lines reads the full content of each file, listing a directory with many files results in O(Total Size of All Files) I/O operations. This makes the tool very slow and memory-intensive for large reference sets.

raise ValueError(
f"Unknown resource root: {resource_root!r}. Expected 'references' or 'assets'."
)
text = self._read_resource(base_dir, relative_path)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The count_file_lines method reads the entire file content into memory just to count newlines. This is inefficient for large reference or asset files. A streaming approach (e.g., reading the file in chunks or line-by-line) would be significantly more memory-efficient.

return ToolCallResult(output="Error: 'pattern' parameter is required.")

try:
regex = re.compile(pattern, re.IGNORECASE)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

Compiling and executing a regex pattern provided by the agent without constraints or timeouts poses a risk of Regular Expression Denial of Service (ReDoS). A maliciously crafted or accidentally complex pattern can cause catastrophic backtracking, hanging the execution thread. Consider using a regex engine with built-in protections or implementing a timeout for the search operation.

@github-actions
Copy link
Copy Markdown

📊 Coverage Report

Overall Coverage: 92%

Diff: origin/main...HEAD

  • libs/core/kiln_ai/adapters/fine_tune/dataset_formatter.py (21.4%): Missing lines 488-490,492,495,497-499,503-505
  • libs/core/kiln_ai/adapters/model_adapters/base_adapter.py (88.7%): Missing lines 838-841,844-845
  • libs/core/kiln_ai/adapters/prompt_builders.py (100%)
  • libs/core/kiln_ai/datamodel/skill.py (97.1%): Missing lines 133
  • libs/core/kiln_ai/datamodel/tool_id.py (100%)
  • libs/core/kiln_ai/tools/skill_search_tool.py (96.9%): Missing lines 234,252-255,382-383
  • libs/core/kiln_ai/tools/skill_tool.py (96.4%): Missing lines 191-192

Summary

  • Total: 409 lines
  • Missing: 27 lines
  • Coverage: 93%

Line-by-line

View line-by-line diff coverage

libs/core/kiln_ai/adapters/fine_tune/dataset_formatter.py

Lines 484-508

  484                     self._tool_cache[cache_key] = skill_def
  485                     tool_definitions.append(skill_def)
  486 
  487         if skill_search_tool_ids:
! 488             cache_key = "search::" + "::".join(sorted(skill_search_tool_ids))
! 489             if cache_key in self._tool_cache:
! 490                 tool_definitions.append(self._tool_cache[cache_key])
  491             else:
! 492                 from kiln_ai.adapters.adapter_registry import (
  493                     load_skills_from_tool_ids,
  494                 )
! 495                 from kiln_ai.tools.skill_search_tool import SkillSearchTool
  496 
! 497                 skills_dict = load_skills_from_tool_ids(task, skill_search_tool_ids)
! 498                 if skills_dict:
! 499                     search_tool = SkillSearchTool(
  500                         f"{SKILL_SEARCH_TOOL_ID_PREFIX}_combined",
  501                         list(skills_dict.values()),
  502                     )
! 503                     search_def = await search_tool.toolcall_definition()
! 504                     self._tool_cache[cache_key] = search_def
! 505                     tool_definitions.append(search_def)
  506 
  507         return tool_definitions if tool_definitions else None

libs/core/kiln_ai/adapters/model_adapters/base_adapter.py

Lines 834-849

  834             tools.append(SkillTool(f"{SKILL_TOOL_ID_PREFIX}_combined", skills))
  835 
  836         search_skills = self._resolve_skill_search_tool_skills()
  837         if search_skills:
! 838             seen_names = set()
! 839             for skill in search_skills:
! 840                 if skill.name in seen_names:
! 841                     raise ValueError(
  842                         f"Duplicate skill name '{skill.name}'. Each skill must have a unique name."
  843                     )
! 844                 seen_names.add(skill.name)
! 845             tools.append(
  846                 SkillSearchTool(
  847                     f"{SKILL_SEARCH_TOOL_ID_PREFIX}_combined", search_skills
  848                 )
  849             )

libs/core/kiln_ai/datamodel/skill.py

Lines 129-137

  129             return
  130 
  131         for md_path in sorted(target.rglob("*.md")):
  132             if not md_path.is_file():
! 133                 continue
  134             try:
  135                 content = md_path.read_text(encoding="utf-8")
  136             except UnicodeDecodeError:
  137                 continue

libs/core/kiln_ai/tools/skill_search_tool.py

Lines 230-238

  230                     output="Error: Only markdown files (.md) are searchable"
  231                 )
  232             rel = resource[len(REFERENCES_PREFIX) :]
  233             if not rel:
! 234                 return ToolCallResult(
  235                     output="Error: 'resource' must be a file path. Use 'path_prefix' to search a directory."
  236                 )
  237             base_dir = skill.references_dir()
  238             try:

Lines 248-259

  248                     output="Error: 'resource' must be a file path. Use 'path_prefix' to search a directory."
  249                 )
  250             try:
  251                 content = skill.read_reference(rel)
! 252             except FileNotFoundError:
! 253                 return ToolCallResult(output=f"Error: Resource not found: {resource}")
! 254             except ValueError as e:
! 255                 return ToolCallResult(output=f"Error: {e}")
  256             corpus = [(rel, content)]
  257             scope_label = resource
  258         else:
  259             if path_prefix is not None:

Lines 378-387

  378         if body.startswith("---\n"):
  379             try:
  380                 end_idx = body.index("\n---\n", 3)
  381                 body = body[end_idx + 5 :]
! 382             except ValueError:
! 383                 pass
  384         body = body.lstrip("\n")
  385         for i, line in enumerate(body.splitlines()):
  386             if i >= _H1_SEARCH_LINES:
  387                 break

libs/core/kiln_ai/tools/skill_tool.py

Lines 187-196

  187                             f"{prefix}{connector}{entry.name} ({line_count} lines)"
  188                         )
  189                         totals["md_files"] += 1
  190                         totals["md_lines"] += line_count
! 191                     except (ValueError, OSError, UnicodeDecodeError):
! 192                         tree_lines.append(f"{prefix}{connector}{entry.name}")
  193                 else:
  194                     tree_lines.append(f"{prefix}{connector}{entry.name}")
  195 
  196         walk(dir_path, "")


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant