Commit 834ea3a

feat: Add comprehensive codebase evaluation framework
This commit adds a complete framework for evaluating whether AI agents improve or deteriorate codebases using Terraphim's knowledge graph capabilities.

Components Added:

1. Design Document (CODEBASE_EVALUATION_DESIGN.md):
   - Complete evaluation methodology
   - Step-by-step procedures
   - Metrics reference and verdict logic
   - CI/CD integration examples
   - Firecracker VM integration patterns

2. Evaluation Scripts (examples/codebase-evaluation/scripts/):
   - baseline-evaluation.sh: Capture initial metrics
   - post-evaluation.sh: Capture post-change metrics
   - compare-evaluations.sh: Generate verdict with delta analysis
   - evaluate-ai-agent.sh: Master workflow script

3. Knowledge Graph Templates (kg-templates/):
   - code-quality.md: Code smells and technical debt patterns
   - bug-patterns.md: Common errors and anti-patterns
   - performance.md: Bottleneck detection patterns
   - security.md: Vulnerability patterns

4. Documentation:
   - README.md: Quick start and usage guide
   - Integration examples for GitHub Actions, GitLab CI
   - Updated TERRAPHIM_CLAUDE_INTEGRATION.md with evaluation section

Key Features:
- Deterministic evaluation using Aho-Corasick automata
- Role-based perspectives (Code Reviewer, Security Auditor, etc.)
- Quantifiable metrics: clippy warnings, anti-patterns, TODOs
- Automated verdict generation: Improvement, Deterioration, or Neutral
- CI/CD ready with exit code 1 on deterioration
- Privacy-first: All evaluation runs locally

Use Cases:
1. Evaluate PR from AI agent before merge
2. Continuous quality monitoring in CI/CD
3. Historical trend analysis across evaluations
4. Multi-dimensional evaluation (security + performance + quality)

This framework enables objective, repeatable assessment of AI-generated code changes using Terraphim's existing infrastructure.
1 parent aef178e commit 834ea3a

11 files changed

Lines changed: 1939 additions & 0 deletions

examples/CODEBASE_EVALUATION_DESIGN.md

Lines changed: 914 additions & 0 deletions
Large diffs are not rendered by default.

examples/TERRAPHIM_CLAUDE_INTEGRATION.md

Lines changed: 203 additions & 0 deletions
@@ -8,6 +8,7 @@ This guide explains how to integrate Terraphim's knowledge graph capabilities wi
- [Approach Comparison](#approach-comparison)
- [Claude Code Hooks](#claude-code-hooks)
- [Claude Skills](#claude-skills)
- [Codebase Evaluation](#codebase-evaluation)
- [Which Approach to Use](#which-approach-to-use)
- [Getting Started](#getting-started)
- [Advanced Integration](#advanced-integration)
@@ -267,6 +268,208 @@ Skills activate automatically when Claude detects relevant context:
**Documentation**: See `examples/claude-skills/terraphim-package-manager/README.md`

## Codebase Evaluation

Beyond text replacement, Terraphim AI provides a framework for **evaluating whether AI agents improve or deteriorate codebases**. This deterministic, knowledge-graph-based evaluation system measures code quality before and after AI changes.

### Overview

The evaluation system uses Terraphim's core capabilities to:
- **Index codebases** as searchable haystacks
- **Build knowledge graphs** for quality, security, and performance patterns
- **Run standardized queries** to detect issues
- **Compare metrics** before and after AI changes
- **Generate verdicts**: Improvement, Deterioration, or Neutral

### Key Features

- **Deterministic**: Aho-Corasick automata provide consistent, repeatable scoring
- **Local & Private**: No external API dependencies for evaluation
- **Role-Based**: Evaluate from multiple perspectives (security, performance, quality)
- **Quantifiable**: Numeric scores for objective comparison
- **CI/CD Ready**: Integrate with GitHub Actions, GitLab CI, etc.

### Quick Start

```bash
# Run complete evaluation
cd examples/codebase-evaluation
./scripts/evaluate-ai-agent.sh /path/to/your/codebase

# The script will:
# 1. Create baseline evaluation
# 2. Prompt you to apply AI changes
# 3. Re-evaluate after changes
# 4. Generate verdict report
```

### Evaluation Metrics

**Knowledge Graph Metrics**:
- Semantic matches for quality issues
- Pattern detection using Aho-Corasick
- Concept relationship density

**Code Quality Metrics** (Rust example):
- Clippy warnings count
- Test pass/fail rates
- Anti-pattern occurrences (unwrap, panic, etc.)
- TODO/FIXME counts
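
The anti-pattern and TODO/FIXME counts above can be approximated with standard tools. The following is a minimal sketch, not the framework's own scripts: the function names and the regular expressions are assumptions chosen for illustration, and it counts matching lines in `*.rs` files rather than individual occurrences.

```bash
# Hypothetical helper (not part of the evaluation scripts): count the lines
# in Rust sources under a directory that match an extended regex.
count_matching_lines() {
    dir=$1
    pattern=$2
    # grep -c prints 0 and exits 1 when nothing matches, so mask the status.
    find "$dir" -type f -name '*.rs' -exec cat {} + 2>/dev/null \
        | grep -c -E -e "$pattern" || true
}

# Example patterns (assumptions); adjust to whatever the evaluation tracks.
count_anti_patterns() { count_matching_lines "$1" '\.unwrap\(\)|panic!\('; }
count_todos()         { count_matching_lines "$1" 'TODO|FIXME'; }
```

Running `count_anti_patterns .` before and after a change gives two numbers that can be compared directly.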

**Verdict Logic**:
- **IMPROVEMENT**: More metrics improved than deteriorated
- **DETERIORATION**: More metrics deteriorated than improved
- **NEUTRAL**: Mixed or minimal changes
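
The three rules can be sketched as a small shell function. This is an illustration only: the name `verdict` is hypothetical, and treating a tie as NEUTRAL is an assumption about what "mixed" means. The non-zero return on deterioration mirrors the exit code 1 described in the commit message.

```bash
# Hypothetical helper: map improved/deteriorated metric counts to a verdict.
# Assumption: equal counts are reported as NEUTRAL.
verdict() {
    improved=$1
    deteriorated=$2
    if [ "$improved" -gt "$deteriorated" ]; then
        echo "IMPROVEMENT"
    elif [ "$deteriorated" -gt "$improved" ]; then
        echo "DETERIORATION"
        return 1   # non-zero status lets a CI step fail on deterioration
    else
        echo "NEUTRAL"
    fi
}
```

With the counts from the sample report (3 improved, 0 deteriorated), `verdict 3 0` prints `IMPROVEMENT`.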

### Example Use Cases

**1. Evaluate Pull Request from AI Agent**

```bash
# Checkout baseline (main branch)
git checkout main
./scripts/baseline-evaluation.sh . "Code Reviewer"

# Checkout AI-generated PR
git checkout ai-agent-pr-123
./scripts/post-evaluation.sh . "Code Reviewer"

# Generate verdict
./scripts/compare-evaluations.sh
```

**2. Continuous Evaluation in CI/CD**

```yaml
# GitHub Actions example (steps of a job)
- name: Baseline evaluation
  run: ./scripts/baseline-evaluation.sh "${{ github.workspace }}"

- name: Apply AI changes
  # Placeholder: replace with the command that runs your AI agent
  run: echo "Run your AI agent step here"

- name: Post-change evaluation
  run: ./scripts/post-evaluation.sh "${{ github.workspace }}"

- name: Generate verdict (fails on deterioration)
  run: ./scripts/compare-evaluations.sh
```

**3. Multi-Role Evaluation**

Evaluate from different perspectives:

```bash
# Code quality focus
./scripts/evaluate-ai-agent.sh ./codebase claude-code "Code Reviewer"

# Security focus
./scripts/evaluate-ai-agent.sh ./codebase claude-code "Security Auditor"

# Performance focus
./scripts/evaluate-ai-agent.sh ./codebase claude-code "Performance Analyst"
```
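
The three invocations above can also be driven from one loop. The sketch below is an assumption-laden illustration: `evaluate_all_roles` and the `EVAL_SCRIPT` override are hypothetical names, and it assumes the master script exits non-zero when a role's evaluation fails.

```bash
# Hypothetical wrapper: run the master script once per role and report
# whether any role failed. EVAL_SCRIPT is an assumed override for testing.
evaluate_all_roles() {
    codebase=$1
    agent=$2
    script=${EVAL_SCRIPT:-./scripts/evaluate-ai-agent.sh}
    failed=0
    for role in "Code Reviewer" "Security Auditor" "Performance Analyst"; do
        echo "== $role =="
        # Keep going after a failure so every role is still evaluated.
        "$script" "$codebase" "$agent" "$role" || failed=1
    done
    return "$failed"
}
```

Usage would be `evaluate_all_roles ./codebase claude-code`; the function returns 1 if any role's run failed.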

### Evaluation Roles

Define custom evaluation perspectives using knowledge graphs:

**Code Reviewer Role** (`code-quality.md`):
```markdown
# Code Quality

synonyms:: code smell, technical debt, refactoring opportunity, bad practice
```

**Security Auditor Role** (`security.md`):
```markdown
# Security Vulnerability

synonyms:: SQL injection, XSS, CSRF, authentication flaw, command injection
```

**Performance Analyst Role** (`performance.md`):
```markdown
# Performance Bottleneck

synonyms:: slow code, inefficient algorithm, O(n^2) complexity, blocking operation
```
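
To make the role templates concrete, the sketch below extracts the terms from a `synonyms::` line and counts the lines of a target file that mention any of them. It is an illustration only: the function names are hypothetical, and `grep -F` is used as a stand-in for Terraphim's Aho-Corasick matcher, not as a description of how Terraphim processes these files.

```bash
# Hypothetical helper: print one term per line from the "synonyms::" line
# of a knowledge graph template.
kg_terms() {
    sed -n 's/^synonyms::[[:space:]]*//p' "$1" \
        | tr ',' '\n' \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Count lines in a target file that mention any term (case-insensitive,
# fixed-string matching). grep -F stands in for the real matcher.
count_term_lines() {
    kg_file=$1
    target=$2
    terms_file=$(mktemp)
    kg_terms "$kg_file" > "$terms_file"
    # grep -c exits 1 on zero matches but still prints the count.
    grep -c -i -F -f "$terms_file" "$target" || true
    rm -f "$terms_file"
}
```

For example, `count_term_lines security.md review-notes.txt` reports how many lines of a (hypothetical) notes file mention a security term from the template.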

### Sample Verdict Report

```markdown
# Codebase Evaluation Verdict

## Summary

### Clippy Warnings
| Metric | Baseline | After | Delta |
|----------|----------|-------|-------|
| Warnings | 15 | 8 | -7 |

**Improvement**: Reduced warnings by 7

### Anti-Patterns
| Metric | Baseline | After | Delta |
|--------|----------|-------|-------|
| Count | 23 | 18 | -5 |

**Improvement**: Removed 5 anti-patterns

## Overall Verdict

**IMPROVEMENT**: The AI agent improved the codebase quality.

- ✅ Improved metrics: **3**
- ❌ Deteriorated metrics: **0**
- ➖ Neutral metrics: **1**

## Recommendations

- ✅ No critical issues found
- 📝 Review remaining 8 clippy warnings for completion
```
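
Each table row in the report is a baseline value, an after value, and their difference. A minimal sketch of producing such a row follows; `delta_row` is a hypothetical name, not a function from `compare-evaluations.sh`.

```bash
# Hypothetical helper: print one Markdown table row with the delta
# computed as (after - baseline), so reductions show up as negative numbers.
delta_row() {
    name=$1
    baseline=$2
    after=$3
    delta=$((after - baseline))
    printf '| %s | %s | %s | %s |\n' "$name" "$baseline" "$after" "$delta"
}
```

For example, `delta_row Warnings 15 8` prints `| Warnings | 15 | 8 | -7 |`, matching the Clippy row in the sample report.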

### Integration with Claude

Combine codebase evaluation with hooks or skills:

**Hook Integration**: Automatically evaluate changes before commits
```bash
# Capture the baseline before Claude edits anything
./scripts/baseline-evaluation.sh .
# ... make changes with Claude ...
# In the pre-commit hook: re-evaluate and block the commit on deterioration
./scripts/post-evaluation.sh .
./scripts/compare-evaluations.sh || exit 1
```

**Skill Integration**: Ask Claude to evaluate changes
```markdown
---
name: terraphim-codebase-eval
description: Evaluate code quality using Terraphim's knowledge graph system
---

When the user asks to evaluate code quality or AI changes, run:
./scripts/evaluate-ai-agent.sh <codebase>
```

### Documentation

Complete documentation and scripts are available:
- **Design Document**: `examples/CODEBASE_EVALUATION_DESIGN.md`
- **Quick Start Guide**: `examples/codebase-evaluation/README.md`
- **Evaluation Scripts**: `examples/codebase-evaluation/scripts/`
- **KG Templates**: `examples/codebase-evaluation/kg-templates/`

### Benefits

- **Objective Assessment**: Quantifiable metrics over subjective opinions
- **Early Detection**: Catch quality issues before they reach production
- **CI/CD Integration**: Automated quality gates in pipelines
- **Historical Tracking**: Monitor quality trends over time
- **Multi-Dimensional**: Evaluate security, performance, and quality simultaneously

## Which Approach to Use

### Decision Matrix
