Commit 987f8f5

Improve semantic code scanning for Skill generation
1 parent 49c9231 commit 987f8f5

21 files changed

Lines changed: 660 additions & 77 deletions

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -5,6 +5,10 @@ Detailed notes for each tagged release live under [`docs/releases/`](./docs/rele
 
 ## Unreleased
 
+- Added semantic Python extraction for call targets, dynamic imports, type references, raised exceptions, class attributes, and lightweight data-flow edges.
+- Improved internal import resolution for detailed `from ... import ...` records and dynamic imports.
+- Fed call/type/data-flow evidence into file prioritization, planning prompts, generated skeletons, and project summaries.
+- Added algorithm notes documenting the paper-backed ideas behind the scanner improvements.
 - Added `doctor` readiness checks for generated bundles, Skill plans, state snapshots, and adapted target files.
 - Added repository-specific `adoption-guide.md` output and updated README/docs around first adoption, CI refresh, and multi-tool publishing workflows.
 - Changed merge-style adapters to preserve hand-written content through a managed code2skill block.

README.md

Lines changed: 2 additions & 1 deletion
@@ -14,7 +14,7 @@ Use it when a Python project needs coding assistants to follow the current modul
 
 ## What This Repository Can Do
 
-- Analyze a Python repository with AST parsing, import graph checks, config extraction, and file-role inference.
+- Analyze a Python repository with AST semantic extraction, import graph checks, call/type/data-flow evidence, config extraction, and file-role inference.
 - Write a `.code2skill/` bundle with a project summary, references, a Skill plan, generated Skills, a report, and incremental state.
 - Estimate model cost and affected Skills before generation.
 - Generate Skill Markdown from repository evidence using OpenAI Responses API, OpenAI-compatible Responses endpoints, Claude, or Qwen.
@@ -240,6 +240,7 @@ For lower-level automation, use `create_scan_config(...)` with `scan_repository(
 - [CI Guide](https://github.com/oceanusXXD/code2skill/blob/main/docs/ci.md)
 - [Python API](https://github.com/oceanusXXD/code2skill/blob/main/docs/python-api.md)
 - [Output Layout](https://github.com/oceanusXXD/code2skill/blob/main/docs/output-layout.md)
+- [Algorithm Notes](https://github.com/oceanusXXD/code2skill/blob/main/docs/algorithm-notes.md)
 - [Release Guide](https://github.com/oceanusXXD/code2skill/blob/main/docs/release.md)
 - [Changelog](https://github.com/oceanusXXD/code2skill/blob/main/CHANGELOG.md)
 

README.zh-CN.md

Lines changed: 2 additions & 1 deletion
@@ -14,7 +14,7 @@
 
 ## What This Repository Can Do
 
-- Analyze a Python repository with AST, import graph, config extraction, and file-role inference.
+- Analyze a Python repository with AST semantic extraction, import graph, call/type/data-flow evidence, config extraction, and file-role inference.
 - Write a `.code2skill/` bundle containing a project summary, reference docs, a Skill plan, generated Skills, an execution report, and incremental state.
 - Estimate model cost and affected Skills before generation.
 - Generate Skill Markdown from repository evidence using the OpenAI Responses API, an OpenAI-compatible Responses endpoint, Claude, or Qwen.
@@ -240,6 +240,7 @@ print(readiness.ready, readiness.score)
 - [CI Guide](https://github.com/oceanusXXD/code2skill/blob/main/docs/ci.md)
 - [Python API](https://github.com/oceanusXXD/code2skill/blob/main/docs/python-api.md)
 - [Output Layout](https://github.com/oceanusXXD/code2skill/blob/main/docs/output-layout.md)
+- [Algorithm Notes](https://github.com/oceanusXXD/code2skill/blob/main/docs/algorithm-notes.md)
 - [Release Guide](https://github.com/oceanusXXD/code2skill/blob/main/docs/release.md)
 - [Changelog](https://github.com/oceanusXXD/code2skill/blob/main/CHANGELOG.md)
 

docs/algorithm-notes.md

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+# Algorithm Notes
+
+`code2skill` does not try to train a model over the repository. It builds a
+small structural graph and a compact AST skeleton, then gives the LLM grounded
+evidence for planning and Skill generation.
+
+## What Was Borrowed
+
+- AST path evidence: inspired by code2vec, which showed that paths through code
+  structure are stronger signals than plain token lists.
+- Program graph evidence: inspired by graph-based program representation work
+  and Code Property Graphs, which combine syntax and semantic edges instead of
+  treating code as isolated files.
+- Data-flow evidence: inspired by GraphCodeBERT, which uses data-flow structure
+  to connect variables and operations beyond lexical proximity.
+
+## Current Implementation
+
+- Python AST extraction records imports, exports, functions, classes, methods,
+  route decorators, model/schema signals, call targets, type references,
+  raised exceptions, dynamic imports, class attributes, and simple data-flow
+  edges such as `scope:target<-source`.
+- Import graph construction uses detailed `ImportInfo`, including `from ...
+  import ...` names and dynamic imports, so package-level imports resolve to
+  concrete internal files when possible.
+- File priority combines path heuristics with content evidence. Route, service,
+  model, main-guard, call-target, type-reference, and data-flow signals can
+  raise selection priority.
+- Planner prompts receive dependency, call, type, and flow evidence for core
+  modules. Generation prompts use the same skeleton lines when large files are
+  summarized instead of inlined.
+
+## Boundaries
+
+The extractor is deliberately conservative. It records shallow data-flow edges
+from assignments, loops, and context managers, but it does not attempt full
+interprocedural static analysis, control-flow reconstruction, type inference, or
+runtime import evaluation. Missing or ambiguous evidence should still be marked
+as uncertain by generated Skills.
+
+## References
+
+- Alon et al., code2vec: Learning Distributed Representations of Code:
+  https://arxiv.org/abs/1803.09473
+- Allamanis et al., Learning to Represent Programs with Graphs:
+  https://arxiv.org/abs/1711.00740
+- Yamaguchi et al., Modeling and Discovering Vulnerabilities with Code Property
+  Graphs: https://ieeexplore.ieee.org/document/6956581
+- Guo et al., GraphCodeBERT: Pre-training Code Representations with Data Flow:
+  https://arxiv.org/abs/2009.08366
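
Note on the extraction described in docs/algorithm-notes.md: the commit's extractor source is not shown on this page, so the following is only a minimal sketch of how call targets and shallow `scope:target<-source` edges could be collected with the standard `ast` module. The names `ShallowExtractor`, `dotted_name`, and `extract` are hypothetical, and using the enclosing function name as the scope is an assumption. The sketch covers only assignments, `for` loops, and `with` items, matching the stated boundaries.

```python
import ast


def dotted_name(node):
    """Return 'a.b.c' for Name/Attribute chains, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


class ShallowExtractor(ast.NodeVisitor):
    """Hypothetical visitor: collects call targets and one-hop data-flow edges."""

    def __init__(self):
        self.scope = ["module"]  # assumption: scope is the enclosing function name
        self.call_targets = set()
        self.data_flow_edges = set()

    def visit_FunctionDef(self, node):
        self.scope.append(node.name)
        self.generic_visit(node)
        self.scope.pop()

    visit_AsyncFunctionDef = visit_FunctionDef

    def visit_Call(self, node):
        name = dotted_name(node.func)
        if name:
            self.call_targets.add(name)
        self.generic_visit(node)

    def _record(self, target, value):
        # Only simple names on the left-hand side; no tuple unpacking, no inference.
        if not isinstance(target, ast.Name):
            return
        source = value.func if isinstance(value, ast.Call) else value
        name = dotted_name(source)
        if name:
            self.data_flow_edges.add(f"{self.scope[-1]}:{target.id}<-{name}")

    def visit_Assign(self, node):
        for target in node.targets:
            self._record(target, node.value)
        self.generic_visit(node)

    def visit_For(self, node):
        self._record(node.target, node.iter)
        self.generic_visit(node)

    def visit_With(self, node):
        for item in node.items:
            if item.optional_vars is not None:
                self._record(item.optional_vars, item.context_expr)
        self.generic_visit(node)


def extract(source):
    extractor = ShallowExtractor()
    extractor.visit(ast.parse(source))
    return sorted(extractor.call_targets), sorted(extractor.data_flow_edges)


SAMPLE = """
def load(path):
    with open(path) as handle:
        rows = parse(handle)
    for row in rows:
        emit(row)
"""

calls, edges = extract(SAMPLE)
# calls -> ['emit', 'open', 'parse']
# edges -> ['load:handle<-open', 'load:row<-rows', 'load:rows<-parse']
```

The visitor never follows a call into its definition, which keeps it in line with the "no interprocedural analysis" boundary in the notes.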

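Note on import resolution: the notes say that detailed `from ... import ...` records let package-level imports resolve to concrete internal files when possible. The resolver itself is not part of the visible diff, so this is a hedged sketch of one way that lookup could work. `resolve_import`, `known_files`, and the `src` root are hypothetical; the real `ImportInfo` structure may differ.

```python
from pathlib import PurePosixPath


def resolve_import(module, names, known_files, package_root="src"):
    """Map `from module import names` to internal file paths when possible.

    known_files is a set of repository-relative POSIX paths (an assumption).
    Imports that match no internal file return an empty list.
    """
    base = PurePosixPath(package_root, *module.split("."))
    resolved = []
    for name in names:
        # Prefer a submodule or subpackage named after the imported symbol.
        candidates = [
            base / f"{name}.py",
            base / name / "__init__.py",
            # Otherwise fall back to the module that defines the symbol.
            base.with_suffix(".py"),
            base / "__init__.py",
        ]
        hit = next((str(c) for c in candidates if str(c) in known_files), None)
        if hit is not None and hit not in resolved:
            resolved.append(hit)
    return resolved


KNOWN = {
    "src/app/__init__.py",
    "src/app/services/__init__.py",
    "src/app/services/billing.py",
    "src/app/models.py",
}

submodule = resolve_import("app.services", ["billing"], KNOWN)
symbol = resolve_import("app.models", ["User"], KNOWN)
external = resolve_import("requests", ["get"], KNOWN)
```

Here `from app.services import billing` lands on `billing.py` rather than the package `__init__.py`, while an external package such as `requests` resolves to nothing and stays out of the internal graph.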
src/code2skill/analyzers/skill_blueprint_builder.py

Lines changed: 5 additions & 1 deletion
@@ -201,7 +201,8 @@ def _build_important_apis(
                 )
             )
         if summary.inferred_role == "service":
-            for function_name in summary.functions[:3]:
+            service_entries = [*summary.functions, *summary.methods]
+            for function_name in service_entries[:4]:
                 apis.append(
                     ApiSummary(
                         kind=summary.inferred_role,
@@ -305,6 +306,9 @@ def _core_module_sort_key(summary: SourceFileSummary) -> tuple[int, float, int,
         + len(summary.methods)
         + len(summary.routes) * 2
         + len(summary.internal_dependencies)
+        + min(len(summary.call_targets), 8)
+        + min(len(summary.type_references), 4)
+        + min(len(summary.data_flow_edges), 4)
     )
     return (
         role_priority,
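
Note on the capped terms in `_core_module_sort_key`: the `min(..., 8)` and `min(..., 4)` wrappers bound how much the new evidence can add, so a file with dozens of calls cannot outrank a route-heavy module on call count alone. The sketch below reproduces only the terms visible in this hunk. `FileEvidence` and `evidence_score` are hypothetical stand-ins, and the terms above the hunk boundary are not shown here and are left out.

```python
from dataclasses import dataclass, field


@dataclass
class FileEvidence:
    """Hypothetical stand-in for the fields of SourceFileSummary used in the hunk."""

    methods: list[str] = field(default_factory=list)
    routes: list[str] = field(default_factory=list)
    internal_dependencies: list[str] = field(default_factory=list)
    call_targets: list[str] = field(default_factory=list)
    type_references: list[str] = field(default_factory=list)
    data_flow_edges: list[str] = field(default_factory=list)


def evidence_score(summary: FileEvidence) -> int:
    # Only the terms visible in the diff hunk; earlier terms are omitted.
    return (
        len(summary.methods)
        + len(summary.routes) * 2
        + len(summary.internal_dependencies)
        + min(len(summary.call_targets), 8)      # capped at 8
        + min(len(summary.type_references), 4)   # capped at 4
        + min(len(summary.data_flow_edges), 4)   # capped at 4
    )


call_heavy = FileEvidence(call_targets=[f"helper_{i}" for i in range(50)])
route_heavy = FileEvidence(routes=["/a", "/b", "/c", "/d", "/e"])

call_heavy_score = evidence_score(call_heavy)    # 8: fifty calls hit the cap
route_heavy_score = evidence_score(route_heavy)  # 10: five routes, weight 2 each
```

Under these weights the three new signals together contribute at most 16 points.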
