| 1 | +# Algorithm Notes |
| 2 | + |
| 3 | +`code2skill` does not try to train a model over the repository. Instead, it |
| 4 | +builds a small structural graph and a compact AST skeleton, then gives the LLM |
| 5 | +grounded evidence for planning and Skill generation. |
| 6 | + |
| 7 | +## What Was Borrowed |
| 8 | + |
| 9 | +- AST path evidence: inspired by code2vec, which showed that paths through code |
| 10 | + structure are stronger signals than plain token lists. |
| 11 | +- Program graph evidence: inspired by graph-based program representation work |
| 12 | + and Code Property Graphs, which combine syntax and semantic edges instead of |
| 13 | + treating code as isolated files. |
| 14 | +- Data-flow evidence: inspired by GraphCodeBERT, which uses data-flow structure |
| 15 | + to connect variables and operations beyond lexical proximity. |
| 16 | + |
| 17 | +## Current Implementation |
| 18 | + |
| 19 | +- Python AST extraction records imports, exports, functions, classes, methods, |
| 20 | + route decorators, model/schema signals, call targets, type references, |
| 21 | + raised exceptions, dynamic imports, class attributes, and simple data-flow |
| 22 | + edges such as `scope:target<-source`. |
| 23 | +- Import graph construction uses detailed `ImportInfo`, including `from ... |
| 24 | + import ...` names and dynamic imports, so package-level imports resolve to |
| 25 | + concrete internal files when possible. |
| 26 | +- File priority combines path heuristics with content evidence. Route, service, |
| 27 | + model, main-guard, call-target, type-reference, and data-flow signals can |
| 28 | + raise selection priority. |
| 29 | +- Planner prompts receive dependency, call, type, and flow evidence for core |
| 30 | + modules. Generation prompts use the same skeleton lines when large files are |
| 31 | + summarized instead of inlined. |
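The import-resolution idea described in the list above can be illustrated with a minimal sketch. It is not the `code2skill` implementation: `resolve_from_import` and the in-memory set of known files are hypothetical, introduced only to show how a `from ... import ...` name might be mapped to a concrete internal file, with a fallback to the package's `__init__.py`.

```python
from __future__ import annotations


def resolve_from_import(module: str, name: str, known_files: set[str]) -> str | None:
    """Map `from <module> import <name>` to an internal file path, if any.

    Hypothetical helper: tries the imported name as a submodule first, then
    the module itself, then the package's __init__.py.
    """
    base = module.replace(".", "/")
    candidates = [
        f"{base}/{name}.py",           # name is a submodule file
        f"{base}/{name}/__init__.py",  # name is a subpackage
        f"{base}.py",                  # name is defined in a plain module
        f"{base}/__init__.py",         # name is re-exported by the package
    ]
    for path in candidates:
        if path in known_files:
            return path
    return None  # external or unresolvable: no internal edge is recorded


# Assumed repository layout, for illustration only.
FILES = {
    "app/__init__.py",
    "app/services/__init__.py",
    "app/services/billing.py",
    "app/models.py",
}
```

With this layout, `resolve_from_import("app.services", "billing", FILES)` picks the concrete submodule file, while a third-party import such as `from requests import get` resolves to `None` and contributes no internal edge.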
| 32 | + |
| 33 | +## Boundaries |
| 34 | + |
| 35 | +The extractor is deliberately conservative. It records shallow data-flow edges |
| 36 | +from assignments, loops, and context managers, but it does not attempt full |
| 37 | +interprocedural static analysis, control-flow reconstruction, type inference, or |
| 38 | +runtime import evaluation. Generated Skills should therefore still mark missing |
| 39 | +or ambiguous evidence as uncertain. |
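
As a rough illustration of what "shallow" means here, the sketch below collects `scope:target<-source` edges from assignments, `for` loops, and `with` statements using Python's standard `ast` module. The `FlowVisitor` class and `flow_edges` function are hypothetical names, not the extractor's actual code, and the sketch is deliberately naive: it treats every name in the right-hand expression (including called functions) as a source and ignores augmented or annotated assignments and async functions.

```python
import ast


class FlowVisitor(ast.NodeVisitor):
    """Collect shallow data-flow edges formatted as 'scope:target<-source'."""

    def __init__(self):
        self.scope = ["<module>"]  # stack of enclosing function names
        self.edges = []

    def _names(self, node):
        # Every bare identifier appearing anywhere inside the expression.
        return [n.id for n in ast.walk(node) if isinstance(n, ast.Name)]

    def _record(self, target, source):
        for t in self._names(target):
            for s in self._names(source):
                self.edges.append(f"{self.scope[-1]}:{t}<-{s}")

    def visit_FunctionDef(self, node):
        self.scope.append(node.name)
        self.generic_visit(node)
        self.scope.pop()

    def visit_Assign(self, node):
        for target in node.targets:
            self._record(target, node.value)
        self.generic_visit(node)

    def visit_For(self, node):
        self._record(node.target, node.iter)
        self.generic_visit(node)

    def visit_With(self, node):
        for item in node.items:
            if item.optional_vars is not None:  # only `with ... as name`
                self._record(item.optional_vars, item.context_expr)
        self.generic_visit(node)


def flow_edges(source):
    """Parse source text and return its shallow data-flow edges."""
    visitor = FlowVisitor()
    visitor.visit(ast.parse(source))
    return visitor.edges


SAMPLE = '''
def load(path):
    with open(path) as handle:
        rows = parse(handle)
    for row in rows:
        total = row
    return total
'''
```

Calling `flow_edges(SAMPLE)` yields edges such as `load:handle<-path`, `load:rows<-handle`, and `load:row<-rows`. Nothing is recorded about what `parse` does internally, which matches the boundary stated above: no interprocedural analysis is attempted.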
| 40 | + |
| 41 | +## References |
| 42 | + |
| 43 | +- Alon et al., code2vec: Learning Distributed Representations of Code: |
| 44 | + https://arxiv.org/abs/1803.09473 |
| 45 | +- Allamanis et al., Learning to Represent Programs with Graphs: |
| 46 | + https://arxiv.org/abs/1711.00740 |
| 47 | +- Yamaguchi et al., Modeling and Discovering Vulnerabilities with Code Property |
| 48 | + Graphs: https://ieeexplore.ieee.org/document/6956581 |
| 49 | +- Guo et al., GraphCodeBERT: Pre-training Code Representations with Data Flow: |
| 50 | + https://arxiv.org/abs/2009.08366 |