diff --git a/README.md b/README.md index 6bef8f3..23d0662 100644 --- a/README.md +++ b/README.md @@ -1,211 +1,9 @@ -# Project Index for Claude Code - -**⚠️ Beta Community Tool - Let Claude Code Fork It!** This is my personal indexing solution that I'm sharing (still in beta). I'm not maintaining this as a product. If you run into issues, have Claude Code help you fix them! Give this repo URL to Claude and ask it to fork, set up, and adapt it for your specific needs. - -## Background - -I created this tool for myself and talked about it in [this video](https://www.youtube.com/watch?v=JU8BwMe_BWg) and [this X post](https://x.com/EricBuess/status/1955271258939043996). People requested it, so here it is! This works alongside my [Claude Code Docs mirror](https://github.com/ericbuess/claude-code-docs) project. - -I may post videos explaining how I use this project - check [my X/Twitter](https://x.com/EricBuess) for updates and explanations. - -This isn't a product - just a tool that solves Claude Code's architectural blindness for me. Fork it, improve it, make it yours! - -Automatically gives Claude Code architectural awareness of your codebase. Add `-i` to any prompt to generate or update a PROJECT_INDEX.json containing your project's functions, classes, and structure. 
- -## Quick Install - -```bash -curl -fsSL https://raw.githubusercontent.com/ericbuess/claude-code-project-index/main/install.sh | bash -``` - -## Usage - -Just add `-i` to any Claude prompt: - -```bash -claude "fix the auth bug -i" # Auto-creates/uses index (default 50k) -claude "refactor database code -i75" # Target ~75k tokens (if project needs it) -claude "analyze architecture -ic200" # Export up to 200k to clipboard for external AI - -# Or manually create/update the index anytime -/index -``` - -**Key behaviors:** -- **One-time setup**: Use `-i` once in a project and the index auto-updates forever -- **Size memory**: The number (e.g., 75) is remembered until you specify a new one -- **Auto-maintenance**: Every file change triggers automatic index updates -- **To stop indexing**: Simply delete PROJECT_INDEX.json - -## What It Does - -PROJECT_INDEX extracts and tracks: -- **Functions & Classes**: Full signatures with parameters and return types -- **Call Relationships**: Which functions call which others -- **File Organization**: All code files respecting .gitignore -- **Directory Structure**: Project layout with file counts - -This helps Claude: -- Find the right code without searching -- Understand dependencies before making changes -- Place new code in the correct location -- Avoid creating duplicate functions - -## Three Ways to Use - -### Small Projects - Direct Reference with `@PROJECT_INDEX.json` -```bash -# Reference directly in your prompt -@PROJECT_INDEX.json what functions call authenticate_user? - -# Or auto-load in every session by adding to CLAUDE.md: -# Add @PROJECT_INDEX.json to your CLAUDE.md file -``` - -**Best for**: Small projects where the index fits comfortably in context. Gives Claude's main agent direct awareness of your whole project structure. 
- -### Medium Projects - Subagent Mode with `-i` flag -```bash -# Invokes specialized subagent to analyze PROJECT_INDEX.json -claude "refactor the auth system -i" # Default up to 50k tokens -claude "find performance issues -i75" # Target ~75k tokens for more detail -``` - -**Best for**: Medium to large projects where you want to preserve the main agent's context. The subagent analyzes the index separately and returns only relevant findings. - -The subagent provides: -- Call graph analysis and execution paths -- Dependency mapping and impact analysis -- Dead code detection -- Strategic recommendations on where to make changes - -### Large Projects - Clipboard Export with `-ic` flag -```bash -# Export to clipboard for external AI with larger contexts -claude "analyze entire codebase -ic200" # Up to 200k tokens -claude "architecture review -ic800" # Up to 800k tokens -``` - -**Best for**: Very large projects whose index won't fit in Claude's context window. Export to AI models with larger context windows: -- Gemini Pro (2M tokens) -- Claude models with 200k+ tokens -- ChatGPT -- Grok - -**Note**: I'm not using this on large projects myself yet - this is inspiration/theory. Your mileage may vary. If you hit snags, have Claude Code update it to work for your specific use case! - -## Token Sizing - -The number after `-i` is a **maximum target**, not a guaranteed size: - -- **Default**: 50k tokens (remembered per project) -- **-i mode range**: 1k to 100k maximum -- **-ic mode range**: 1k to 800k maximum for external AI -- **Actual size**: Often much smaller - only uses what the project needs -- **Compression**: Automatic to fit within limits - -Examples: -- Small project with `-i200`: Might only generate 10k tokens -- Large project with `-i50`: Compresses to fit ~50k target -- Huge project with `-ic500`: Allows up to 500k if needed - -The tool remembers your last `-i` size per project and targets that amount, but actual size depends on your codebase. 
- -## Language Support - -**Full parsing** (extracts functions, classes, methods): -- Python (.py) -- JavaScript/TypeScript (.js, .ts, .jsx, .tsx) -- Shell scripts (.sh, .bash) - -**File tracking** (listing only): -- Go, Rust, Java, C/C++, Ruby, PHP, Swift, Kotlin, and 20+ more - -## Installation Details - -- **Location**: `~/.claude-code-project-index/` -- **Hooks configured**: - - `UserPromptSubmit`: Detects -i flag - - `Stop`: Refreshes index after session -- **Commands**: `/index` for manual creation/update -- **Agent**: `~/.claude/agents/index-analyzer.md` for deep analysis -- **Python**: Automatically finds newest 3.8+ version - -## Fork & Customize - -**The whole point of this tool is that Claude Code can customize it for you!** When you hit issues, don't wait for me - have Claude fix them immediately. This is a community tool meant to be forked and adapted. - -How to customize: -1. Fork the repo or work with the installed version -2. Describe your problem to Claude Code -3. Let Claude modify it for your exact needs -4. Share your improvements with others - -Common customizations: -```bash -cd ~/.claude-code-project-index -# Then ask Claude: -# "The indexer hangs on my 5000 file project - fix it" -# "Add support for Ruby and Go files with full parsing" -# "Skip test files and node_modules even if not in .gitignore" -# "Make it work with my monorepo structure" -# "Change compression to handle my specific project better" -``` - -Remember: Claude Code can rewrite this entire tool in minutes to match your needs. That's the power you have - use it! 
- -## Known Issues & Quick Fixes - -**Large projects (>2000 files)**: May timeout or hang during compression -- Fix: Ask Claude "Rewrite compress_if_needed() to handle my 3000 file project" - -**.claude directory**: Already fixed - now excluded from indexing - -**Timeouts**: Default is 30 seconds, may be too short for huge projects -- Fix: Ask Claude "Make timeout dynamic based on file count in i_flag_hook.py" - -For any issue, just describe it to Claude and let it fix the tool for you! - -## Requirements - -- Python 3.8 or higher -- Claude Code with hooks support -- macOS or Linux -- git and jq (for installation) - -## Troubleshooting - -**Index not creating?** -- Check Python: `python3 --version` -- Verify hooks: `cat ~/.claude/settings.json | grep i_flag_hook` -- Manual generation: `python3 ~/.claude-code-project-index/scripts/project_index.py` - -**-i flag not working?** -- Run installer again -- Check hooks are configured -- Remove and reinstall if needed - -**Clipboard issues?** -- Install pyperclip: `pip install pyperclip` -- SSH users: Content saved to `.clipboard_content.txt` -- For unlimited clipboard over SSH: [VM Bridge](https://github.com/ericbuess/vm-bridge) - -## Technical Details - -The index uses a compressed format to save ~50% space: -- Minified JSON (single line) for file storage -- Short keys: `f`→files, `g`→graph, `d`→docs, `deps`→dependencies -- Compact function signatures with line numbers -- Clipboard mode (`-ic`) uses readable formatting for external AI tools - -## Uninstall - -```bash -~/.claude-code-project-index/uninstall.sh -``` - ---- -Created by [Eric Buess](https://github.com/ericbuess) -- 🐦 [Twitter/X](https://x.com/EricBuess) -- 📺 [YouTube](https://www.youtube.com/@EricBuess) -- 💼 [GitHub](https://github.com/ericbuess) \ No newline at end of file +# File was edited +# Old: **Current test coverage:** +- ✅ **Unit tests**: 42 tests covering individual functions (all mocked) +- ✅ **Integration tests**: 7 tests covering full workflow 
(mocked Ollama) +- ✅ **E2E tests**: 3 tests with real Ollama instance via NixOS +# New: **Current test coverage:** +- ✅ **Unit tests**: 42 tests covering individual functions (all mocked) +- ✅ **Integration tests**: 7 tests covering full workflow (mocked Ollama) +- ✅ **E2E tests**: 4 tests including real Ollama instance via NixOS \ No newline at end of file diff --git a/app.py b/app.py new file mode 100644 index 0000000..4a98132 --- /dev/null +++ b/app.py @@ -0,0 +1,234 @@ +#!/usr/bin/env python3 +""" +Simple Flask web application for testing parsing accuracy. +Contains various Python patterns: classes, functions, decorators, etc. +""" + +from flask import Flask, request, jsonify, session +from functools import wraps +import hashlib +import jwt +import datetime +from typing import Dict, List, Optional, Union +from dataclasses import dataclass + +app = Flask(__name__) +app.secret_key = "test-secret-key" + + +@dataclass +class User: + """User data model.""" + id: int + username: str + email: str + password_hash: str + created_at: datetime.datetime + + def to_dict(self) -> Dict: + """Convert user to dictionary.""" + return { + 'id': self.id, + 'username': self.username, + 'email': self.email, + 'created_at': self.created_at.isoformat() + } + + def check_password(self, password: str) -> bool: + """Check if provided password matches hash.""" + return verify_password(password, self.password_hash) + + +class UserManager: + """Manages user operations and database interactions.""" + + def __init__(self): + self.users: Dict[int, User] = {} + self.next_id = 1 + + def create_user(self, username: str, email: str, password: str) -> Optional[User]: + """Create a new user account.""" + if self.find_by_username(username): + return None + + password_hash = hash_password(password) + user = User( + id=self.next_id, + username=username, + email=email, + password_hash=password_hash, + created_at=datetime.datetime.now() + ) + + self.users[self.next_id] = user + self.next_id += 1 + return 
user + + def find_by_username(self, username: str) -> Optional[User]: + """Find user by username.""" + for user in self.users.values(): + if user.username == username: + return user + return None + + def find_by_id(self, user_id: int) -> Optional[User]: + """Find user by ID.""" + return self.users.get(user_id) + + def authenticate(self, username: str, password: str) -> Optional[User]: + """Authenticate user with username and password.""" + user = self.find_by_username(username) + if user and user.check_password(password): + return user + return None + + +# Global user manager instance +user_manager = UserManager() + + +def hash_password(password: str) -> str: + """Hash password using SHA-256.""" + return hashlib.sha256(password.encode()).hexdigest() + + +def verify_password(password: str, hash_value: str) -> bool: + """Verify password against hash.""" + return hash_password(password) == hash_value + + +def generate_jwt_token(user: User) -> str: + """Generate JWT token for user.""" + payload = { + 'user_id': user.id, + 'username': user.username, + 'exp': datetime.datetime.utcnow() + datetime.timedelta(hours=24) + } + return jwt.encode(payload, app.secret_key, algorithm='HS256') + + +def decode_jwt_token(token: str) -> Optional[Dict]: + """Decode and validate JWT token.""" + try: + payload = jwt.decode(token, app.secret_key, algorithms=['HS256']) + return payload + except jwt.ExpiredSignatureError: + return None + except jwt.InvalidTokenError: + return None + + +def require_auth(f): + """Decorator to require authentication for routes.""" + @wraps(f) + def decorated_function(*args, **kwargs): + token = request.headers.get('Authorization') + if not token: + return jsonify({'error': 'No token provided'}), 401 + + if token.startswith('Bearer '): + token = token[7:] + + payload = decode_jwt_token(token) + if not payload: + return jsonify({'error': 'Invalid token'}), 401 + + user = user_manager.find_by_id(payload['user_id']) + if not user: + return jsonify({'error': 'User 
not found'}), 401 + + request.current_user = user + return f(*args, **kwargs) + + return decorated_function + + +@app.route('/api/register', methods=['POST']) +def register(): + """Register a new user.""" + data = request.get_json() + + if not data or not all(k in data for k in ['username', 'email', 'password']): + return jsonify({'error': 'Missing required fields'}), 400 + + user = user_manager.create_user( + data['username'], + data['email'], + data['password'] + ) + + if not user: + return jsonify({'error': 'Username already exists'}), 409 + + token = generate_jwt_token(user) + return jsonify({ + 'message': 'User created successfully', + 'user': user.to_dict(), + 'token': token + }), 201 + + +@app.route('/api/login', methods=['POST']) +def login(): + """Authenticate user and return token.""" + data = request.get_json() + + if not data or not all(k in data for k in ['username', 'password']): + return jsonify({'error': 'Missing username or password'}), 400 + + user = user_manager.authenticate(data['username'], data['password']) + if not user: + return jsonify({'error': 'Invalid credentials'}), 401 + + token = generate_jwt_token(user) + return jsonify({ + 'message': 'Login successful', + 'user': user.to_dict(), + 'token': token + }) + + +@app.route('/api/profile') +@require_auth +def get_profile(): + """Get current user's profile.""" + return jsonify({ + 'user': request.current_user.to_dict() + }) + + +@app.route('/api/users') +@require_auth +def list_users(): + """List all users (admin only for demo).""" + users = [user.to_dict() for user in user_manager.users.values()] + return jsonify({'users': users}) + + +@app.errorhandler(404) +def not_found(error): + """Handle 404 errors.""" + return jsonify({'error': 'Endpoint not found'}), 404 + + +@app.errorhandler(500) +def internal_error(error): + """Handle 500 errors.""" + return jsonify({'error': 'Internal server error'}), 500 + + +def create_sample_users(): + """Create sample users for testing.""" + sample_users = [ 
+ ('admin', 'admin@example.com', 'admin123'), + ('testuser', 'test@example.com', 'password123'), + ('demo', 'demo@example.com', 'demo123') + ] + + for username, email, password in sample_users: + user_manager.create_user(username, email, password) + + +if __name__ == '__main__': + create_sample_users() + app.run(debug=True, host='0.0.0.0', port=5000) \ No newline at end of file diff --git a/commands/DUAL_MODE_SETUP.md b/commands/DUAL_MODE_SETUP.md new file mode 100644 index 0000000..a2f3c11 --- /dev/null +++ b/commands/DUAL_MODE_SETUP.md @@ -0,0 +1,213 @@ +# Dual-Mode Duplicate Detection Setup + +This guide shows how to set up the enhanced dual-mode duplicate detection system with status line integration. + +## 🔄 System Modes + +### 1. **Blocking Mode** (Active) +- 🛡️ **Prevents** duplicate code from being written +- **Blocks** Claude's tool execution when duplicates detected +- **Immediate feedback** for course correction +- **Best for**: Active development, preventing technical debt + +### 2. **Passive Mode** (Monitoring) +- 👁️ **Monitors** and logs duplicate code +- **Allows** operations to proceed +- **Informs** Claude of duplicates without blocking +- **Best for**: Legacy codebases, learning phase + +### 3. **Inactive Mode** (Off) +- ⚪ **Disables** duplicate detection entirely +- **No monitoring** or blocking +- **Best for**: Temporary development, testing + +## 🚀 Quick Setup + +### 1. Copy Enhanced Settings +```bash +cp .claude/settings_dual_mode.json .claude/settings.json +``` + +### 2. Initialize Mode Configuration +```bash +# Start in blocking mode (recommended) +python scripts/duplicate_mode_toggle.py blocking + +# Or start in passive mode +python scripts/duplicate_mode_toggle.py passive +``` + +### 3. 
Build Semantic Index (if not done already) +```bash +python scripts/semantic_analyzer.py +``` + +## 📊 Status Line Integration + +The status line shows real-time duplicate detection status: + +### Status Line Format: +``` +[Model] 📁 project | 🛡️ DD-Active(5:3:2) 2m | 🌿 main +``` + +**Breakdown:** +- `[Model]` - Current Claude model +- `📁 project` - Current directory +- `🛡️ DD-Active` - Detection mode (🛡️=blocking, 👁️=passive, ⚪=inactive) +- `(5:3:2)` - Stats: (total:blocks:warnings) +- `2m` - Time since last detection +- `🌿 main` - Git branch + +### Status Icons: +- 🛡️ **DD-Active** - Blocking mode enabled +- 👁️ **DD-Monitor** - Passive monitoring enabled +- ⚪ **DD-Inactive** - System disabled +- ❓ **DD-Unknown** - Configuration error + +## 🔧 Mode Management + +### Quick Mode Switches: +```bash +# Switch to blocking mode +python scripts/duplicate_mode_toggle.py blocking + +# Switch to passive monitoring +python scripts/duplicate_mode_toggle.py passive + +# Turn off detection +python scripts/duplicate_mode_toggle.py off +``` + +### Advanced Configuration: +```bash +# Set custom similarity threshold +python scripts/duplicate_mode_toggle.py set passive --threshold 0.7 + +# Configure what to block/monitor +python scripts/duplicate_mode_toggle.py set blocking --block-exact --no-block-naming +``` + +### Check Status: +```bash +python scripts/duplicate_mode_toggle.py status +``` + +Example output: +``` +🔧 Duplicate Detection System Status +======================================== +🛡️ Mode: BLOCKING (Active - blocks duplicate code) + +📋 Configuration: + • Similarity Threshold: 80% + • Block Exact Duplicates: ✅ + • Block High Similarity: ✅ + • Block Naming Conflicts: ❌ + • Last Updated: 2024-08-23 14:30:15 + +📊 Detection Statistics: + • Total Detections: 12 + • Blocks Prevented: 8 + • Passive Warnings: 4 + • Exact Duplicates: 3 + • Semantic Similarities: 9 + • Last Detection: 2024-08-23 14:25:42 +``` + +## 🎯 Usage Patterns + +### Development Workflow: +1. 
**Start in passive mode** to learn codebase patterns +2. **Review detection logs** to understand duplicate patterns +3. **Switch to blocking mode** for active duplicate prevention +4. **Monitor status line** for real-time feedback + +### Team Collaboration: +1. **Passive mode** for junior developers (learning) +2. **Blocking mode** for senior developers (enforcement) +3. **Share detection logs** for code review discussions + +### Legacy Cleanup: +1. **Passive mode** during initial analysis +2. **Use cleanup tools** to eliminate existing duplicates +3. **Switch to blocking mode** to prevent new duplicates + +## ⚙️ Configuration Files + +### Mode Configuration (`/.claude/duplicate_detection_mode.json`): +```json +{ + "mode": "blocking", + "similarity_threshold": 0.8, + "block_exact_duplicates": true, + "block_high_similarity": true, + "block_naming_conflicts": false, + "log_all_detections": true, + "show_suggestions": true +} +``` + +### Detection Statistics (`/.claude/duplicate_stats.json`): +```json +{ + "total_detections": 12, + "exact_duplicates_found": 3, + "semantic_similarities_found": 9, + "blocks_prevented": 8, + "passive_warnings_issued": 4, + "last_detection": 1692794742.123 +} +``` + +## 🔍 Monitoring & Debugging + +### View Detection Log: +```bash +tail -f .claude/duplicate_detection.log +``` + +### Reset Statistics: +```bash +python scripts/duplicate_mode_toggle.py reset-stats +``` + +### Test Status Line: +```bash +echo '{"model":{"display_name":"Test"},"workspace":{"current_dir":"'$(pwd)'"}}' | ./.claude/duplicate-status.sh +``` + +## 🚨 Troubleshooting + +### Status Line Not Showing: +1. Verify script is executable: `chmod +x .claude/duplicate-status.sh` +2. Check settings.json has statusLine configuration +3. Test script manually (see test command above) + +### Mode Changes Not Taking Effect: +1. Check .claude/settings.json was updated +2. Restart Claude Code session +3. 
Verify hook configuration with `python scripts/duplicate_mode_toggle.py status` + +### No Detections Happening: +1. Ensure semantic index exists: `python scripts/semantic_analyzer.py` +2. Check mode is not "inactive" +3. Verify scripts are executable and error-free + +## 💡 Pro Tips + +- **Use passive mode initially** to understand your codebase's duplicate patterns +- **Monitor the status line** to see detection activity in real-time +- **Switch modes based on context** (blocking for new features, passive for exploration) +- **Review detection logs** to improve your coding patterns +- **Share statistics** with team for duplicate reduction metrics + +## 📈 Success Metrics + +Track your duplicate reduction progress: +- **Blocks Prevented**: How many duplicates were stopped +- **Detection Rate**: Total detections per coding session +- **Mode Effectiveness**: Compare blocking vs passive results +- **Time to Detection**: How quickly duplicates are caught + +The dual-mode system gives you flexible control over duplicate detection while providing rich feedback through the status line! 🎉 \ No newline at end of file diff --git a/commands/DUPLICATE_DETECTION_PLAN.md b/commands/DUPLICATE_DETECTION_PLAN.md new file mode 100644 index 0000000..a75b7f3 --- /dev/null +++ b/commands/DUPLICATE_DETECTION_PLAN.md @@ -0,0 +1,391 @@ +# Duplicate Detection & Architecture Enforcement System +## Implementation Plan for Claude Code Project Index + +### Overview +This document outlines a comprehensive system for detecting duplicate code and enforcing architectural patterns in real-time as Claude Code writes code. The system uses local embeddings (TF-IDF) and AST analysis to provide immediate feedback without external dependencies. 
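As a rough illustration of the two techniques named in the overview, the sketch below pairs an AST fingerprint with TF-IDF cosine similarity using only the Python standard library. It is a minimal sketch under stated assumptions, not the plan's implementation: the function names `ast_fingerprint`, `tfidf_vectors`, and `cosine_similarity` are hypothetical, the smoothed IDF formula is one common choice, and the plan's own analyzer uses `sklearn` instead.

```python
import ast
import hashlib
import math
import re
from collections import Counter
from typing import Dict, List


def ast_fingerprint(code: str) -> str:
    """Hash the sequence of AST node types, ignoring identifiers and literals.

    Hypothetical helper: two snippets with the same structure but different
    variable names produce the same fingerprint.
    """
    tree = ast.parse(code)
    node_types = [type(node).__name__ for node in ast.walk(tree)]
    return hashlib.sha256(" ".join(node_types).encode()).hexdigest()


def tfidf_vectors(corpus: List[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors (term -> weight), one per snippet."""
    tokenized = [re.findall(r"\w+", doc.lower()) for doc in corpus]
    # Document frequency: in how many snippets each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(corpus)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

In this sketch, a fingerprint match flags structural copies even after variable renames, while the cosine score gives a graded signal that could be compared against a threshold such as the 0.8 default used elsewhere in this plan.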
+ +## Problem Statement +- Agents gradually create "labyrinthine" codebases with duplicate logic +- Similar functionality gets reimplemented in different places +- Architectural patterns drift over time +- No real-time feedback when code duplication occurs + +## Solution Architecture + +### Core Components + +#### 1. Semantic Analysis Layer +- **TF-IDF Vectorization**: Create embeddings from code tokens +- **AST Fingerprinting**: Extract structural patterns +- **Pattern Extraction**: Identify architectural conventions +- **No external services**: Runs locally using sklearn, ast, difflib + +#### 2. Real-time Detection +- **PostToolUse Hooks**: Intercept code modifications +- **Similarity Scoring**: Compare against existing code +- **Blocking Feedback**: Alert Claude to duplicates +- **Suggestions**: Propose existing implementations + +#### 3. Architectural Enforcement +- **Pattern Recognition**: Learn from existing code +- **Consistency Checking**: Detect violations +- **Naming Conventions**: Enforce project standards +- **Design Patterns**: Identify and suggest patterns + +## Implementation Phases + +### Phase 1: Core System (Immediate Implementation) + +#### 1.1 Enhanced Index Structure +```json +{ + "semantic_index": { + "functions": { + "file_path:function_name": { + "signature": "def func(param: Type) -> ReturnType", + "ast_fingerprint": "hash_of_structure", + "tfidf_vector": [0.1, 0.2, ...], + "complexity": 5, + "patterns": ["validation", "repository"], + "dependencies": ["module1", "module2"] + } + }, + "similarity_clusters": [ + { + "pattern": "validation", + "functions": ["func1", "func2"], + "similarity_matrix": [[1.0, 0.85], [0.85, 1.0]] + } + ], + "architectural_patterns": { + "naming_conventions": { + "functions": "snake_case", + "classes": "PascalCase", + "validators": "validate_*" + }, + "file_organization": { + "services": "src/services/*", + "utils": "src/utils/*" + } + } + } +} +``` + +#### 1.2 Semantic Analyzer (`scripts/semantic_analyzer.py`) +```python 
+import ast +import hashlib +from sklearn.feature_extraction.text import TfidfVectorizer +import numpy as np + +class SemanticAnalyzer: + def __init__(self): + self.vectorizer = TfidfVectorizer( + token_pattern=r'\b\w+\b', + max_features=100, + ngram_range=(1, 2) + ) + + def create_ast_fingerprint(self, code): + """Create structural fingerprint from AST""" + tree = ast.parse(code) + # Extract control flow patterns + structure = self.extract_structure(tree) + return hashlib.md5(str(structure).encode()).hexdigest() + + def create_tfidf_embedding(self, code_corpus): + """Generate TF-IDF vectors for code similarity""" + vectors = self.vectorizer.fit_transform(code_corpus) + return vectors.toarray() + + def compute_similarity(self, vec1, vec2): + """Cosine similarity between vectors""" + return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)) +``` + +#### 1.3 Duplicate Detector Hook (`scripts/duplicate_detector.py`) +```python +#!/usr/bin/env python3 +import json +import sys +import os + +def check_similarity(new_code, index_path): + """Check if new code is similar to existing code""" + # Load semantic index + with open(index_path) as f: + index = json.load(f) + + # Analyze new code + analyzer = SemanticAnalyzer() + new_fingerprint = analyzer.create_ast_fingerprint(new_code) + + # Check for duplicates + duplicates = [] + for func_id, func_data in index['semantic_index']['functions'].items(): + if func_data['ast_fingerprint'] == new_fingerprint: + duplicates.append({ + 'function': func_id, + 'similarity': 1.0, + 'type': 'exact_structural_match' + }) + # Check TF-IDF similarity + similarity = analyzer.compute_similarity( + new_vector, func_data['tfidf_vector'] + ) + if similarity > 0.8: + duplicates.append({ + 'function': func_id, + 'similarity': similarity, + 'type': 'semantic_similarity' + }) + + return duplicates + +# Hook implementation +input_data = json.load(sys.stdin) +tool_name = input_data.get('tool_name') + +if tool_name in ['Edit', 'Write', 
'MultiEdit']: + file_content = input_data['tool_input'].get('content', '') + duplicates = check_similarity(file_content, 'PROJECT_INDEX.json') + + if duplicates: + output = { + "decision": "block", + "reason": f"⚠️ Duplicate code detected:\\n" + + f"• Similar to {duplicates[0]['function']} " + + f"({duplicates[0]['similarity']*100:.0f}% match)\\n" + + "Consider using existing implementation or extracting shared logic." + } + print(json.dumps(output)) + sys.exit(0) +``` + +#### 1.4 Architecture Context Loader (`scripts/load_architecture_context.py`) +```python +#!/usr/bin/env python3 +import json +import sys + +def load_architecture_context(): + """Load architectural patterns and conventions""" + with open('PROJECT_INDEX.json') as f: + index = json.load(f) + + patterns = index.get('semantic_index', {}).get('architectural_patterns', {}) + duplicates = index.get('semantic_index', {}).get('similarity_clusters', []) + + context = [] + + if patterns: + context.append("## Project Architecture Patterns") + context.append(f"- Naming: {patterns.get('naming_conventions', {})}") + context.append(f"- Organization: {patterns.get('file_organization', {})}") + + if duplicates: + context.append("\\n## Known Code Clusters") + for cluster in duplicates[:5]: # Top 5 clusters + context.append(f"- {cluster['pattern']}: {len(cluster['functions'])} similar functions") + + return "\\n".join(context) + +# SessionStart hook +output = { + "hookSpecificOutput": { + "hookEventName": "SessionStart", + "additionalContext": load_architecture_context() + } +} +print(json.dumps(output)) +``` + +### Phase 2: Sub-agent Definitions + +#### 2.1 Duplicate Detector Agent (`.claude/agents/duplicate-detector.md`) +```markdown +--- +name: duplicate-detector +description: Proactively analyzes code for duplicates and suggests consolidation. Use when refactoring or after implementing new features. 
+tools: Read, Grep, Glob, Task +--- + +You are a code duplication specialist focused on identifying and eliminating redundant code. + +When invoked: +1. Analyze the recent changes or specified code area +2. Search for similar patterns across the codebase +3. Identify exact duplicates, near-duplicates, and semantic similarities +4. Suggest refactoring opportunities + +Analysis approach: +- Compare function signatures and parameter patterns +- Identify similar control flow structures +- Find repeated code blocks across files +- Detect copy-paste with minor modifications + +For each duplicate found, provide: +- Similarity percentage and type (exact/structural/semantic) +- Location of existing implementations +- Suggested consolidation approach +- Example of refactored code + +Focus on reducing code duplication while maintaining clarity and appropriate abstraction levels. +``` + +#### 2.2 Architecture Enforcer Agent (`.claude/agents/architecture-enforcer.md`) +```markdown +--- +name: architecture-enforcer +description: Ensures code follows established architectural patterns and conventions. Use proactively when adding new features or modules. +tools: Read, Grep, Glob, Edit +--- + +You are an architectural consistency guardian ensuring code follows established patterns. + +When invoked: +1. Identify the architectural patterns in the existing codebase +2. Check if new code follows these patterns +3. Detect violations of conventions +4. Suggest pattern-compliant alternatives + +Key responsibilities: +- Enforce naming conventions (functions, classes, files) +- Verify proper layer separation (service/repository/controller) +- Check dependency directions +- Ensure consistent error handling +- Validate proper use of design patterns + +For violations found: +- Explain the established pattern +- Show examples from the codebase +- Provide corrected version +- Explain why consistency matters + +Maintain balance between consistency and pragmatic flexibility. 
+``` + +### Phase 3: Configuration + +#### 3.1 Hook Configuration (`.claude/settings.json`) +```json +{ + "hooks": { + "PostToolUse": [ + { + "matcher": "Edit|Write|MultiEdit", + "hooks": [ + { + "type": "command", + "command": "$CLAUDE_PROJECT_DIR/scripts/duplicate_detector.py", + "timeout": 5000 + } + ] + } + ], + "SessionStart": [ + { + "hooks": [ + { + "type": "command", + "command": "$CLAUDE_PROJECT_DIR/scripts/load_architecture_context.py" + } + ] + } + ] + } +} +``` + +## Future Enhancements (Phase 2) + +### Advanced Embeddings +```python +# Layer additional embedding models +class EnhancedAnalyzer(SemanticAnalyzer): + def __init__(self): + super().__init__() + # Add sentence-transformers if available + try: + from sentence_transformers import SentenceTransformer + self.semantic_model = SentenceTransformer('microsoft/codebert-base') + self.use_semantic = True + except ImportError: + self.use_semantic = False + + def create_semantic_embedding(self, code): + if self.use_semantic: + return self.semantic_model.encode(code) + return self.create_tfidf_embedding([code])[0] +``` + +### MCP Integration +```bash +# Add embedding service via MCP +claude mcp add --transport http code-embeddings https://api.embeddings.service/mcp + +# The service would provide: +# - Advanced code embeddings +# - Cross-project pattern learning +# - Team-wide duplicate detection +``` + +### Metrics Dashboard +- Track duplicate detection rate +- Measure code consolidation over time +- Identify refactoring opportunities +- Quantify technical debt reduction + +## Testing Strategy + +### Test Cases +1. **Exact Duplicate**: Copy-paste the same function +2. **Variable Rename**: Same logic, different variable names +3. **Structural Similar**: Same control flow, different operations +4. **Semantic Similar**: Different implementation, same purpose +5. 
**False Positive**: Legitimately similar but separate concerns + +### Expected Outcomes +- Immediate detection of copy-paste code +- Suggestions for existing utilities +- Pattern consistency enforcement +- Reduced code duplication over time + +## Success Metrics +- **Detection Rate**: >90% of duplicates caught +- **False Positive Rate**: <10% incorrect warnings +- **Performance**: <500ms analysis time +- **Code Reduction**: 15-30% less duplicate code + +## Rollout Plan +1. Install enhanced indexer +2. Build initial semantic index +3. Configure hooks +4. Create sub-agents +5. Test with sample project +6. Monitor and tune thresholds +7. Add advanced features based on usage + +## Troubleshooting + +### Common Issues +- **High false positives**: Tune similarity threshold (default 0.8) +- **Missed duplicates**: Check TF-IDF parameters, add more features +- **Performance issues**: Limit vector size, use caching +- **Integration problems**: Verify hook configuration, check permissions + +### Debug Commands +```bash +# Test duplicate detection +echo '{"tool_name":"Write","tool_input":{"content":"def validate_email(email): ..."}}' | python scripts/duplicate_detector.py + +# Check semantic index +python -c "import json; print(json.load(open('PROJECT_INDEX.json'))['semantic_index'])" + +# Verify hooks +claude --debug # Shows hook execution +``` + +## Conclusion +This system provides immediate, actionable feedback on code duplication while laying the foundation for more sophisticated analysis. The phased approach ensures quick wins while maintaining extensibility for future enhancements. 
\ No newline at end of file diff --git a/commands/TODO.md b/commands/TODO.md new file mode 100644 index 0000000..6e64aa6 --- /dev/null +++ b/commands/TODO.md @@ -0,0 +1,192 @@ +# TODO: PROJECT_INDEX Development Tasks + +## 🧪 Testing & Quality Assurance + +### High Priority +- [ ] **Fix remaining test failures** (8 failures, 1 error remaining) + - Output function tests in `test_similarity_index.py` + - `test_malformed_json_response` in `test_find_ollama.py` + - Edge case handling tests + +- [ ] **Complete integration tests** (in progress) + - Fix mocking issues in `test_integration.py` + - Test full embedding workflow: `project_index.py -e` → `similarity_index.py --build-cache` → query + - Test graceful degradation when Ollama unavailable + - Test cache invalidation and updates + +### Medium Priority +- [ ] **Add real data tests** + - Create `tests/fixtures/` with small sample projects + - Test with actual Python, JavaScript, and shell projects + - Validate parsing accuracy with known codebases + - Test performance with 100+ files + +- [ ] **Add error recovery tests** + - Corrupted `PROJECT_INDEX.json` recovery + - Network interruptions during embedding generation + - File permission issues + - Disk space exhaustion scenarios + - Concurrent access to index files + +- [ ] **Performance testing** + - Memory usage benchmarks with large embeddings + - Query speed comparisons (cached vs real-time) + - Index generation time for various project sizes + - Compression efficiency validation + +## 🚀 Feature Enhancements + +### Claude Code Integration +- [ ] **Enhance slash commands** + - Add parameter validation for colon syntax + - Implement progress indicators for long operations + - Add help text for each slash command variant + +- [ ] **Advanced similarity features** + - Semantic code search beyond function similarity + - Cross-language similarity detection + - Code pattern recognition and recommendations + - Duplicate code refactoring suggestions + +### Algorithm 
Improvements +- [ ] **Expand similarity algorithms** + - Implement semantic hybrid scoring (embedding + AST) + - Add fuzzy string matching for function names + - Context-aware similarity (considering call graphs) + - Add clustering for code organization insights + +- [ ] **Performance optimizations** + - Implement incremental embedding updates + - Add embedding compression/quantization + - Optimize similarity matrix storage + - Add parallel embedding generation + +## 🔧 Code Quality & Maintenance + +### Code Organization +- [ ] **Refactor for modularity** + - Extract embedding logic to separate module + - Create shared utilities for index operations + - Standardize error handling patterns + - Add comprehensive type hints + +- [ ] **Documentation improvements** + - API documentation for all public functions + - Architecture decision records (ADRs) + - Performance tuning guide + - Troubleshooting guide + +### Error Handling +- [ ] **Robust error management** + - Standardize error messages and codes + - Add retry logic for network operations + - Implement graceful degradation strategies + - Add detailed logging options + +## 📊 Advanced Features + +### Analytics & Insights +- [ ] **Code analysis features** + - Complexity metrics integration + - Code quality scoring + - Technical debt identification + - Architecture violation detection + +- [ ] **Reporting capabilities** + - Generate similarity analysis reports + - Export findings to various formats (JSON, CSV, HTML) + - Integration with CI/CD pipelines + - Custom report templates + +### Extensibility +- [ ] **Plugin system** + - Custom similarity algorithms + - Language-specific analyzers + - Custom export formats + - Third-party integrations + +- [ ] **Configuration management** + - Project-specific settings files + - Global configuration options + - Environment-based overrides + - Migration tools for config updates + +## 🐛 Known Issues & Bug Fixes + +### Critical Issues +- [ ] **Integration test mocking** (current 
blocker) + - Fix OllamaManager mocking in integration tests + - Ensure environment variables properly set in tests + - Validate embedding generation in test scenarios + +### Minor Issues +- [ ] **Regex warnings in test code** +- [ ] **Output function test inconsistencies** +- [ ] **Edge case handling in similarity calculations** + +## 🔄 Maintenance Tasks + +### Regular Maintenance +- [ ] **Dependency updates** + - Keep Ollama client libraries current + - Update testing frameworks + - Security vulnerability patches + +- [ ] **Performance monitoring** + - Benchmark regression tests + - Memory leak detection + - Performance profiling automation + +### Documentation Maintenance +- [ ] **Keep README current** + - Update installation instructions + - Refresh usage examples + - Update compatibility matrix + +- [ ] **API stability** + - Version compatibility testing + - Backward compatibility guarantees + - Migration path documentation + +## 🎯 Future Roadmap + +### Long-term Goals +- [ ] **Multi-language support expansion** + - Full parsing for Go, Rust, Java + - Support for more scripting languages + - Framework-specific analyzers (React, Vue, etc.) 
+ +- [ ] **Cloud integration** + - Remote embedding services + - Distributed similarity computation + - Collaborative code analysis + +- [ ] **AI/ML enhancements** + - Custom embedding models training + - Code pattern learning + - Predictive code suggestions + - Automated refactoring recommendations + +## 📝 Current Status + +**Completed ✅:** +- Basic embedding and similarity functionality +- Claude Code slash command structure +- Core unit test coverage (86+ tests) +- Installation and deployment scripts +- Multiple similarity algorithms (6 total) +- Comprehensive CLI interfaces + +**In Progress 🔄:** +- Integration testing framework +- Test failure resolution (50% reduction achieved) +- Advanced error recovery + +**Blocked ⛔:** +- Integration tests (mocking issues) +- Some output function tests (stdout/stderr confusion) + +--- + +*Last updated: 2025-09-01* +*Total estimated effort: ~40-60 hours of development work* \ No newline at end of file diff --git a/commands/analyze.md b/commands/analyze.md new file mode 100644 index 0000000..12e8714 --- /dev/null +++ b/commands/analyze.md @@ -0,0 +1,25 @@ +--- +allowed-tools: Bash(python3 *), Bash(cd *) +description: Perform semantic analysis and architecture review +--- + +## Project Analysis + +Run comprehensive semantic analysis to understand code patterns, architecture, and complexity metrics. + +### What this does: + +- Analyzes function patterns and complexity +- Identifies architectural patterns +- Creates semantic embeddings for similarity detection +- Provides vocabulary analysis of your codebase + +### Current project status: + +!`pwd && ls -la scripts/semantic_analyzer.py` + +### Running semantic analysis: + +!`python3 scripts/semantic_analyzer.py` + +This will enhance your PROJECT_INDEX.json with semantic intelligence for better code understanding. 
\ No newline at end of file diff --git a/commands/architecture-enforcer.md b/commands/architecture-enforcer.md new file mode 100644 index 0000000..88726ab --- /dev/null +++ b/commands/architecture-enforcer.md @@ -0,0 +1,213 @@ +--- +name: architecture-enforcer +description: Ensures code follows established architectural patterns and conventions. Use proactively when adding new features, modules, or when architectural consistency is needed. +tools: Read, Grep, Glob, Edit +--- + +You are an architectural consistency guardian ensuring all code follows established patterns, conventions, and design principles throughout the project. + +## When to Use Me + +**Proactively use this agent when:** +- Adding new features or modules to the project +- Implementing new classes, services, or components +- Refactoring existing code structures +- Before major code reviews or releases +- When onboarding new team members +- During architectural decision reviews + +## Core Responsibilities + +### 1. Pattern Consistency Enforcement +- Verify new code follows established architectural patterns +- Check proper layer separation (service/repository/controller) +- Ensure correct dependency directions and abstractions +- Validate proper use of design patterns + +### 2. Naming Convention Compliance +- Enforce consistent naming across functions, classes, and files +- Verify naming follows project-specific conventions +- Check for meaningful, descriptive names that match domain language +- Ensure consistency with existing codebase terminology + +### 3. Structure and Organization +- Validate proper file and directory placement +- Check module boundaries and separation of concerns +- Ensure consistent project structure adherence +- Verify proper import/export patterns + +### 4. 
Code Quality Standards +- Enforce consistent error handling patterns +- Check for proper logging and monitoring integration +- Validate security best practices implementation +- Ensure performance guidelines are followed + +## Analysis Process + +When invoked, I will: + +1. **Identify Architectural Patterns** + ``` + 📋 Scan existing codebase for established patterns + 🎯 Document naming conventions and structures + 📐 Map dependency relationships and boundaries + 🏗️ Identify design patterns in use + ``` + +2. **Evaluate New Code Against Standards** + ``` + 🔍 Compare new implementations with existing patterns + ⚖️ Check consistency with established conventions + 🚨 Flag architectural violations and deviations + 📊 Assess impact on overall system design + ``` + +3. **Generate Compliance Report** + ``` + ✅ List compliant implementations + ❌ Highlight violations with specific examples + 💡 Provide corrective recommendations + 📝 Show correct implementation patterns + ``` + +## Enforcement Areas + +### Naming Conventions +```python +# Functions: Project uses snake_case +✅ def validate_user_input(): # Correct +❌ def validateUserInput(): # Violation + +# Classes: Project uses PascalCase +✅ class UserService: # Correct +❌ class user_service: # Violation + +# Constants: Project uses UPPER_SNAKE_CASE +✅ MAX_RETRY_ATTEMPTS = 3 # Correct +❌ maxRetryAttempts = 3 # Violation +``` + +### Directory Organization +``` +# Established pattern: Domain-based organization +✅ src/user/user_service.py # Follows pattern +✅ src/auth/auth_controller.py # Follows pattern +❌ src/userService.py # Violates organization + +# Test placement pattern +✅ tests/user/test_user_service.py # Mirrors source structure +❌ user_service_test.py # Violates test pattern +``` + +### Design Pattern Usage +```python +# Repository Pattern (if established) +✅ class UserRepository: # Follows pattern + def find_by_id(self, user_id): + pass + +❌ class UserDataAccess: # Breaks established pattern + def get_user(self, id): + 
pass +``` + +### Error Handling Patterns +```python +# Project standard: Custom exceptions +✅ raise UserNotFoundError(f"User {id} not found") # Correct +❌ raise Exception("User not found") # Violation + +# Project standard: Logging format +✅ logger.info("User created", extra={"user_id": id}) # Correct +❌ print(f"Created user {id}") # Violation +``` + +## Violation Reporting Format + +For each violation found: + +``` +🚨 **Architectural Violation Detected** + +📍 **Location:** src/new_feature/processor.py:45 +🏷️ **Type:** Naming Convention Violation +📋 **Standard:** Functions should use snake_case +❌ **Current:** `processUserData()` +✅ **Expected:** `process_user_data()` + +💡 **Recommendation:** +Rename function to match project convention. Update all callers: +- Line 67: processUserData() → process_user_data() +- Line 89: processUserData() → process_user_data() + +🔗 **Related Pattern:** See existing functions in user_service.py +``` + +## Pattern Documentation + +I automatically document and enforce: + +### **Layer Architecture** +``` +Controllers → Services → Repositories → Data Access + ↓ ↓ ↓ ↓ + HTTP/API Business Data Database + Concerns Logic Abstraction Access +``` + +### **Dependency Rules** +- Controllers depend on Services (not Repositories) +- Services contain business logic (no direct DB access) +- Repositories handle data persistence patterns +- No circular dependencies between layers + +### **File Organization Standards** +``` +src/ +├── controllers/ # HTTP/API endpoints +├── services/ # Business logic +├── repositories/ # Data access abstraction +├── models/ # Data structures +├── utils/ # Shared utilities +└── config/ # Configuration +``` + +## Integration Guidelines + +### **With Existing Code** +- Respect established patterns over theoretical "best practices" +- Maintain consistency with majority implementations +- Suggest improvements while preserving stability + +### **For New Features** +- Follow existing architectural decisions +- Extend patterns 
rather than creating new ones +- Maintain backwards compatibility with established APIs + +## Architectural Decision Support + +I help with: + +1. **Pattern Selection** - Choose appropriate design patterns +2. **Naming Decisions** - Ensure consistent terminology +3. **Structure Design** - Organize code following project patterns +4. **Dependency Management** - Maintain clean architecture +5. **Refactoring Planning** - Preserve architectural integrity + +## Quality Metrics + +I track and report: +- **Consistency Score:** % of code following patterns +- **Violation Count:** Number of architectural infractions +- **Pattern Coverage:** How well patterns are documented +- **Complexity Impact:** Effect on system maintainability + +## Best Practices I Enforce + +1. **Separation of Concerns** - Each module has single responsibility +2. **Dependency Inversion** - Depend on abstractions, not concretions +3. **Open/Closed Principle** - Open for extension, closed for modification +4. **Consistent Interfaces** - Similar functions have similar signatures +5. **Domain Alignment** - Code structure reflects business domain + +Focus on maintaining architectural integrity while allowing for practical flexibility and gradual improvement. \ No newline at end of file diff --git a/commands/duplicate-detector.md b/commands/duplicate-detector.md new file mode 100644 index 0000000..f1723a0 --- /dev/null +++ b/commands/duplicate-detector.md @@ -0,0 +1,142 @@ +--- +name: duplicate-detector +description: Proactively analyzes code for duplicates and suggests consolidation. Use when refactoring, after implementing new features, or when code quality concerns arise. +tools: Read, Grep, Glob, Task +--- + +You are a code duplication specialist focused on identifying and eliminating redundant code across the entire codebase. 
+ +## When to Use Me + +**Proactively use this agent when:** +- After implementing any new feature or functionality +- Before committing significant changes +- When code reviews identify potential duplication +- During refactoring efforts +- When architectural consistency is needed + +## Analysis Approach + +When invoked, I will: + +1. **Analyze Recent Changes** + - Examine the latest code modifications + - Identify new functions and their implementations + - Compare against existing codebase patterns + +2. **Search for Similar Patterns** + - Look for functions with similar names or purposes + - Find code blocks with identical or near-identical logic + - Identify repeated patterns across different files + +3. **Structural Analysis** + - Compare function signatures and parameter patterns + - Analyze control flow structures (loops, conditionals) + - Detect copy-paste code with minor modifications + - Find semantic similarities (same purpose, different implementation) + +4. **Cross-Reference Dependencies** + - Check if existing utilities can be reused + - Identify opportunities for shared abstractions + - Find violations of DRY (Don't Repeat Yourself) principle + +## Types of Duplicates I Find + +### Exact Duplicates +- Identical function implementations +- Copy-pasted code blocks +- Repeated constants or configurations + +### Near Duplicates +- Same logic with different variable names +- Similar algorithms with minor variations +- Functions that do the same thing with slight differences + +### Semantic Duplicates +- Different implementations of the same concept +- Multiple ways of solving the same problem +- Redundant helper functions + +### Pattern Violations +- Functions that break established naming conventions +- Code that doesn't follow project architectural patterns +- Implementations that ignore existing abstractions + +## Output Format + +For each duplicate found, I provide: + +``` +🔍 **Duplicate Type:** [Exact/Near/Semantic/Pattern] +📍 **Location:** 
file_path:function_name +🎯 **Similar To:** existing_file:existing_function +📊 **Similarity:** X% match +💡 **Suggestion:** [Specific refactoring recommendation] +📝 **Example:** [Code sample showing how to consolidate] +``` + +## Refactoring Recommendations + +I suggest specific consolidation approaches: + +### Extract Common Function +```python +# Instead of duplicating validation logic +def extract_email_validator(email): + # Common validation logic here + pass +``` + +### Use Existing Utilities +```python +# Point to existing functions that do the same thing +# "Use existing validate_user_input() in utils/validation.py" +``` + +### Create Shared Abstractions +```python +# When multiple similar functions exist +def create_generic_processor(processor_type): + # Abstract common functionality + pass +``` + +### Consolidate Constants +```python +# Move repeated values to shared constants file +from config.constants import DEFAULT_TIMEOUT, MAX_RETRIES +``` + +## Quality Metrics + +I track and report: +- **Duplication Percentage:** How much code is duplicated +- **Complexity Reduction:** Potential complexity savings +- **Maintainability Impact:** How consolidation improves maintenance +- **Risk Assessment:** Changes needed and their complexity + +## Best Practices I Enforce + +1. **DRY Principle** - Don't Repeat Yourself +2. **Single Responsibility** - Each function has one clear purpose +3. **Abstraction Levels** - Appropriate level of code reuse +4. **Naming Consistency** - Follow established patterns +5. 
**Architectural Alignment** - Respect project structure + +## Integration with Project Patterns + +I understand your project's: +- Naming conventions (snake_case vs camelCase) +- Directory organization patterns +- Existing design patterns and architectures +- Code complexity guidelines +- Testing approaches + +## Collaboration Notes + +- I work well with the `architecture-enforcer` agent for comprehensive code quality +- Use me before major commits to catch duplication early +- I can help during code reviews to identify consolidation opportunities +- My analysis complements automated duplicate detection hooks + +Focus on reducing code duplication while maintaining clarity and appropriate abstraction levels. Every suggestion includes working code examples and considers the broader architectural impact. \ No newline at end of file diff --git a/commands/duplicate-eliminator.md b/commands/duplicate-eliminator.md new file mode 100644 index 0000000..1a3cb22 --- /dev/null +++ b/commands/duplicate-eliminator.md @@ -0,0 +1,85 @@ +--- +name: duplicate-eliminator +description: Specialized agent for eliminating exact duplicate code and extracting shared utilities. Use after running duplicate analysis to clean up identical functions. +tools: Read, Edit, MultiEdit, Write, Grep, Glob, Bash +--- + +You are a duplicate code elimination specialist focused on removing exact duplicates and creating shared utilities. + +## Primary Responsibilities + +When invoked to eliminate duplicates: + +1. **Analyze Duplicate Groups**: Review exact duplicate functions identified by the duplicate report +2. **Extract Shared Logic**: Create utility functions or modules for identical code +3. **Replace Duplicates**: Update all occurrences to use the new shared implementation +4. **Maintain Functionality**: Ensure all replacements preserve original behavior +5. 
**Update Tests**: Modify tests to work with the new shared utilities + +## Elimination Strategies + +### For Exact Duplicates: +- **Simple Functions**: Extract to utility module, replace all calls +- **Method Duplicates**: Move to base class or shared mixin +- **Complex Logic**: Create configurable function with parameters +- **File Operations**: Consolidate into file utility module + +### Extraction Patterns: +```python +# Before: Multiple identical functions +def validate_email_user(email): ... +def validate_email_admin(email): ... + +# After: Shared utility +from utils.validation import validate_email +``` + +## Implementation Process + +1. **Group Analysis**: + - Review all functions in duplicate group + - Identify parameter variations and return types + - Check for any subtle differences + +2. **Utility Design**: + - Create descriptive function names + - Design flexible parameter interfaces + - Add comprehensive docstrings + - Include type hints + +3. **Replacement Strategy**: + - Replace duplicates one file at a time + - Test after each replacement + - Maintain git commit points for rollback + +4. 
**Testing Updates**: + - Update test imports + - Modify test scenarios for shared utilities + - Ensure test coverage remains high + +## Quality Guidelines + +- **Preserve Behavior**: Exact functional equivalence required +- **Improve Clarity**: New utilities should be more readable than originals +- **Add Documentation**: Utilities need clear purpose and usage examples +- **Type Safety**: Add proper type annotations +- **Error Handling**: Maintain or improve error handling patterns + +## File Organization + +Create utilities in logical locations: +- `utils/validation.py` - Input validation functions +- `utils/formatting.py` - String/data formatting +- `utils/file_ops.py` - File system operations +- `utils/network.py` - HTTP/API utilities +- `shared/business_logic.py` - Domain-specific shared logic + +## Risk Mitigation + +- **Incremental Changes**: Replace one duplicate at a time +- **Test Coverage**: Run tests after each change +- **Git Safety**: Commit frequently with descriptive messages +- **Rollback Plan**: Keep original functions commented initially +- **Documentation**: Update README/docs for new utilities + +Focus on high-impact, low-risk eliminations first. Prefer multiple small, safe changes over large refactoring operations. \ No newline at end of file diff --git a/commands/duplicates.md b/commands/duplicates.md new file mode 100644 index 0000000..d8b5eca --- /dev/null +++ b/commands/duplicates.md @@ -0,0 +1,22 @@ +--- +allowed-tools: Bash(python3 *), Bash(cd *) +argument-hint: report | interactive | toggle [mode] | status +description: Manage duplicate code detection and cleanup +--- + +## Duplicate Code Manager + +Detect, analyze, and clean up duplicate code in your project. + +### Usage Options: + +**Current project status:** !`pwd && python3 scripts/duplicate_mode_toggle.py --status 2>/dev/null || echo "Duplicate detection not configured"` + +### Commands available: + +1. 
**Generate duplicate report**: `python3 scripts/generate_duplicate_report.py` +2. **Interactive cleanup**: `python3 scripts/interactive_cleanup.py` +3. **Toggle detection mode**: `python3 scripts/duplicate_mode_toggle.py --mode $ARGUMENTS` +4. **Status check**: `python3 scripts/duplicate_mode_toggle.py --status` + +**Execute based on your argument: $ARGUMENTS** \ No newline at end of file diff --git a/commands/edited_README.md b/commands/edited_README.md new file mode 100644 index 0000000..23d0662 --- /dev/null +++ b/commands/edited_README.md @@ -0,0 +1,9 @@ +# File was edited +# Old: **Current test coverage:** +- ✅ **Unit tests**: 42 tests covering individual functions (all mocked) +- ✅ **Integration tests**: 7 tests covering full workflow (mocked Ollama) +- ✅ **E2E tests**: 3 tests with real Ollama instance via NixOS +# New: **Current test coverage:** +- ✅ **Unit tests**: 42 tests covering individual functions (all mocked) +- ✅ **Integration tests**: 7 tests covering full workflow (mocked Ollama) +- ✅ **E2E tests**: 4 tests including real Ollama instance via NixOS \ No newline at end of file diff --git a/commands/edited_TODO.md b/commands/edited_TODO.md new file mode 100644 index 0000000..a794833 --- /dev/null +++ b/commands/edited_TODO.md @@ -0,0 +1,12 @@ +# File was edited +# Old: **Next Priority Tasks:** +1. ✅ ~~Add real data tests with fixture projects~~ **COMPLETED** +2. ✅ ~~Performance testing and benchmarking~~ **COMPLETED** +3. Enhanced error recovery mechanisms +4. API documentation improvements +# New: **Next Priority Tasks:** +1. ✅ ~~Add real data tests with fixture projects~~ **COMPLETED** +2. ✅ ~~Performance testing and benchmarking~~ **COMPLETED** +3. ✅ ~~Fix all test failures and achieve 100% test success~~ **COMPLETED** +4. Enhanced error recovery mechanisms +5. 
API documentation improvements \ No newline at end of file diff --git a/commands/embedded-index.md b/commands/embedded-index.md new file mode 100644 index 0000000..722f4e2 --- /dev/null +++ b/commands/embedded-index.md @@ -0,0 +1,11 @@ +--- +allowed-tools: Bash(python3 *) +argument-hint: [setup|build|search|similar|analyze] [query...] +description: Neural embedding-based semantic code analysis using nomic-embed-text +--- + +## 🧠 Neural Embedded Index + +Advanced semantic code analysis using neural embeddings via Ollama + nomic-embed-text. + +!`python3 ~/.claude/scripts/embedded_command_handler.py "$ARGUMENTS"` \ No newline at end of file diff --git a/commands/embedding.md b/commands/embedding.md new file mode 100644 index 0000000..7dd235d --- /dev/null +++ b/commands/embedding.md @@ -0,0 +1,43 @@ +--- +name: embedding +description: Generate PROJECT_INDEX.json with neural embeddings and similarity analysis +args: + - name: algorithm + description: Similarity algorithm (cosine, euclidean, manhattan, dot-product, jaccard, weighted-cosine) + required: false + default: cosine + - name: output + description: Output file path for similarity results + required: false + - name: size + description: Target index size in KB + required: false + default: 50 +--- + +# Index with Embeddings and Similarity + +Generate PROJECT_INDEX.json with neural embeddings using Ollama and perform similarity analysis. + +## Usage + +```bash +/index:embedding # Basic embedding generation +/index:embedding:cosine # With cosine similarity algorithm +/index:embedding:euclidean:custom.json # Euclidean algorithm + custom output +/index:embedding:cosine::100 # Cosine algorithm + 100KB size +``` + +## Implementation + +The command generates embeddings using the nomic-embed-text model via Ollama and includes similarity analysis capabilities. 
+ +Available algorithms: +- **cosine** (default) - Cosine similarity, best for semantic similarity +- **euclidean** - Euclidean distance, geometric similarity +- **manhattan** - Manhattan distance, robust to outliers +- **dot-product** - Dot product similarity, fast computation +- **jaccard** - Jaccard similarity for sparse vectors +- **weighted-cosine** - Weighted cosine with TF-IDF + +The embedding data is cached in PROJECT_INDEX.json for performance. \ No newline at end of file diff --git a/commands/index.md b/commands/index.md new file mode 100644 index 0000000..670a0e4 --- /dev/null +++ b/commands/index.md @@ -0,0 +1,24 @@ +--- +allowed-tools: Bash(python3 *), Bash(cd *), Bash(ls *), Bash(find *) +argument-hint: [size-in-kb] | full | quick | status +description: Generate or refresh project index +--- + +## Project Index Manager + +Generate, refresh, or check the status of your PROJECT_INDEX.json file. + +### Usage Options: + +1. **Default indexing**: `!python3 scripts/project_index.py` +2. **Full semantic analysis**: `!python3 scripts/enhanced_project_index.py` +3. **Quick refresh**: `!python3 scripts/reindex_if_needed.py` +4. 
**Interactive size**: Handle $ARGUMENTS as size specification + +### Commands to run: + +Current directory status: !`pwd && ls -la PROJECT_INDEX.json 2>/dev/null || echo "No index found"` + +**Generate the index based on your request:** + +$ARGUMENTS \ No newline at end of file diff --git a/commands/project-help.md b/commands/project-help.md new file mode 100644 index 0000000..1df8df9 --- /dev/null +++ b/commands/project-help.md @@ -0,0 +1,47 @@ +--- +description: Quick reference for project indexing commands +--- + +## 📚 Project Indexing Quick Reference + +### Available Commands: + +- `/semantic-index [full|incremental]` - Build/update PROJECT_INDEX.json with semantic analysis +- `/duplicates [report|interactive|status]` - Manage duplicate code detection +- `/analyze` - Run comprehensive semantic analysis +- `/index [size-kb|full|quick|status]` - Generate or refresh basic index +- `/setup-indexing` - Set up automatic indexing for your project + +### Common Workflows: + +**🚀 First time setup:** +``` +/setup-indexing +``` + +**🔄 Daily use:** +``` +/semantic-index incremental # Smart updates +/duplicates status # Check for issues +``` + +**🔍 Deep analysis:** +``` +/semantic-index full # Complete rebuild +/duplicates interactive # Clean up duplicates +/analyze # Architecture review +``` + +### What Each Tool Provides: + +- **semantic-index**: TF-IDF embeddings, AST fingerprints, complexity metrics +- **duplicates**: Real-time detection, cleanup workflows, reporting +- **analyze**: Architecture patterns, vocabulary analysis, similarity clusters + +### Files Created: + +- `PROJECT_INDEX.json` - Main index with semantic data +- `.claude/settings.json` - Automatic hook configuration +- `scripts/` - All indexing and analysis tools + +💡 **Tip**: The system automatically maintains your index as you code! 
\ No newline at end of file diff --git a/commands/refactoring-advisor.md b/commands/refactoring-advisor.md new file mode 100644 index 0000000..7a8ce40 --- /dev/null +++ b/commands/refactoring-advisor.md @@ -0,0 +1,153 @@ +--- +name: refactoring-advisor +description: Senior-level advisor for complex refactoring decisions when dealing with duplicate code. Use for architectural guidance and risk assessment of large-scale cleanup efforts. +tools: Read, Grep, Glob, Task +--- + +You are a senior software architect specializing in large-scale refactoring and technical debt reduction through duplicate elimination. + +## Primary Responsibilities + +When invoked for refactoring guidance: + +1. **Architectural Assessment**: Evaluate the broader architectural implications of duplicate elimination +2. **Risk Analysis**: Identify potential breaking changes and mitigation strategies +3. **Refactoring Strategy**: Design comprehensive plans for complex duplicate removal +4. **Impact Evaluation**: Assess effects on system design, performance, and maintainability +5. 
**Decision Framework**: Provide guidance for complex refactoring trade-offs + +## Architectural Perspectives + +### System-Wide Impact Analysis: +- **Module Dependencies**: How duplicate removal affects module boundaries +- **API Contracts**: Impact on public interfaces and backward compatibility +- **Performance Implications**: Changes to call patterns and execution paths +- **Testing Strategy**: Required test updates and validation approaches +- **Deployment Risks**: Rollout strategy for large-scale changes + +### Design Pattern Opportunities: +- **Template Method**: For algorithmic variations with common structure +- **Strategy Pattern**: When behavior varies but interface remains consistent +- **Factory Pattern**: For object creation with different configurations +- **Observer Pattern**: For event handling with multiple similar listeners +- **Command Pattern**: For operations with similar execution patterns + +## Strategic Recommendations + +### When to Extract vs When to Leave: +```markdown +**Extract When:** +- High maintenance burden (frequent bug fixes across duplicates) +- Clear abstraction opportunity exists +- Strong business logic cohesion +- Multiple teams affected by changes + +**Leave When:** +- Functions serve genuinely different domains +- Coupling would create inappropriate dependencies +- Change frequency is very low +- Extraction would reduce clarity +``` + +### Refactoring Complexity Assessment: +- **Low Complexity**: Simple utility extraction, mechanical replacement +- **Medium Complexity**: Requires interface design, configuration patterns +- **High Complexity**: Architectural changes, cross-cutting concerns +- **Very High Complexity**: Domain modeling, significant API changes + +## Decision Framework + +### Evaluation Criteria: +1. **Business Value**: Does consolidation improve business capability? +2. **Technical Debt**: How much complexity does duplication add? +3. **Change Frequency**: How often do these duplicates need modification? +4. 
**Team Impact**: How many teams/developers are affected? +5. **Risk Assessment**: What's the blast radius if refactoring goes wrong? + +### Risk Mitigation Strategies: +- **Strangler Fig Pattern**: Gradually replace old implementations +- **Branch by Abstraction**: Route callers through an abstraction layer so old and new implementations can be swapped behind it (optionally gated by feature flags) +- **Parallel Run**: Run old and new implementations side-by-side and compare their results +- **Circuit Breaker**: Trip to a fallback path quickly when the new implementation fails in production + +## Architectural Patterns for Duplicate Elimination + +### Layer Consolidation: +```python +# Before: Duplicated validation across layers +class UserController: + def validate_user_input(self): ... + +class UserService: + def validate_user_data(self): ... + +# After: Centralized validation layer +class ValidationLayer: + def validate_user(self, context): ... +``` + +### Domain-Driven Consolidation: +```python +# Before: Scattered domain logic +def calculate_user_discount(user): ... +def calculate_admin_discount(admin): ... + +# After: Domain-centric design +class DiscountCalculator: + def calculate(self, customer: Customer, context: DiscountContext): ... +``` + +## Complex Refactoring Strategies + +### Progressive Enhancement: +1. **Phase 1**: Extract identical duplicates (low risk) +2. **Phase 2**: Consolidate high-similarity functions (medium risk) +3. **Phase 3**: Architectural refactoring for semantic duplicates (high value) +4. 
**Phase 4**: Cross-cutting concern extraction (transformational) + +### Legacy System Considerations: +- **Big Bang vs Incremental**: When to replace everything vs gradual migration +- **Backward Compatibility**: Maintaining existing APIs during transition +- **Data Migration**: Handling state changes during refactoring +- **Integration Points**: Managing external system dependencies + +## Quality Gates + +### Before Starting Refactoring: +- [ ] Comprehensive test coverage for affected code +- [ ] Clear definition of success criteria +- [ ] Rollback plan documented and tested +- [ ] Stakeholder alignment on approach +- [ ] Performance baseline established + +### During Refactoring: +- [ ] Incremental validation at each step +- [ ] Continuous integration passing +- [ ] Performance monitoring active +- [ ] Regular stakeholder communication +- [ ] Documentation updated in parallel + +### Completion Criteria: +- [ ] All duplicate groups addressed per plan +- [ ] Test coverage maintained or improved +- [ ] Performance within acceptable bounds +- [ ] Documentation reflects new architecture +- [ ] Team trained on new patterns + +## Red Flags - When to Stop: + +- **Scope Creep**: Refactoring expanding beyond duplicate elimination +- **Test Failures**: Consistent test breakage indicating design issues +- **Performance Degradation**: Significant slowdowns from consolidation +- **Team Resistance**: Strong pushback indicating communication issues +- **Business Impact**: Customer-facing problems from changes + +## Success Metrics + +- **Code Reduction**: Lines of code eliminated through consolidation +- **Complexity Reduction**: Cyclomatic complexity improvements +- **Maintenance Velocity**: Faster feature development and bug fixes +- **Defect Reduction**: Fewer bugs due to single source of truth +- **Developer Satisfaction**: Team feedback on code maintainability + +Provide balanced recommendations that weigh technical excellence against practical constraints and business 
value. \ No newline at end of file diff --git a/commands/semantic-index.md b/commands/semantic-index.md new file mode 100644 index 0000000..90ddc1d --- /dev/null +++ b/commands/semantic-index.md @@ -0,0 +1,27 @@ +--- +allowed-tools: Bash(mkdir *), Bash(cp *), Bash(ls *), Bash(jq *) +description: Complete semantic project indexing and analysis suite +--- + +## 🧠 Semantic Project Indexer + +### Setup Command + +!`mkdir -p .claude scripts` + +!`cp ~/.claude-code-project-index/.claude/settings.json .claude/ 2>/dev/null || echo "No settings to copy"` + +!`cp -r ~/.claude/scripts/* scripts/ 2>/dev/null || cp -r ~/.claude-code-project-index/scripts/* scripts/` + +!`ls -la scripts/enhanced_project_index.py` + +**To complete setup, run:** +```bash +python3 scripts/enhanced_project_index.py +``` + +### Status + +!`ls -la PROJECT_INDEX.json` + +!`jq -r '.stats | "Files: \(.total_files), Dirs: \(.total_directories)"' PROJECT_INDEX.json 2>/dev/null || echo "No index found"` \ No newline at end of file diff --git a/commands/setup-indexing.md b/commands/setup-indexing.md new file mode 100644 index 0000000..7b771b3 --- /dev/null +++ b/commands/setup-indexing.md @@ -0,0 +1,40 @@ +--- +allowed-tools: Bash(cp *), Bash(mkdir *), Bash(chmod *), Bash(python3 *) +description: Set up automatic project indexing for Claude Code +--- + +## Setup Project Indexing + +Configure your project for seamless integration with Claude Code's project indexing system. + +### What this does: + +1. **Copies indexing scripts** to your project +2. **Sets up Claude Code hooks** for automatic maintenance +3. **Creates initial PROJECT_INDEX.json** with semantic analysis +4. 
**Configures duplicate detection** (optional) + +### Setup Process: + +**Step 1: Copy scripts to your project** +!`mkdir -p scripts && cp -r ~/.claude-code-project-index/scripts/* scripts/ && echo "✓ Scripts copied"` + +**Step 2: Make scripts executable** +!`find scripts -name "*.py" -exec chmod +x {} \; && echo "✓ Scripts made executable"` + +**Step 3: Copy Claude Code configuration** +!`mkdir -p .claude && cp ~/.claude-code-project-index/.claude/settings.json .claude/ && echo "✓ Hooks configured"` + +**Step 4: Generate initial index** +!`python3 scripts/enhanced_project_index.py && echo "✓ Initial PROJECT_INDEX.json created"` + +**Step 5: Add to .gitignore (optional)** +!`echo -e "\n# Claude Code Project Index\n.claude/\nPROJECT_INDEX.json" >> .gitignore && echo "✓ Added to .gitignore"` + +### Verification: + +Check that everything works: !`ls -la PROJECT_INDEX.json .claude/settings.json scripts/` + +🎉 **Your project is now ready for automatic indexing with Claude Code!** + +Use `/semantic-index`, `/duplicates`, and `/analyze` commands to manage your project intelligence. \ No newline at end of file diff --git a/commands/similarity.md b/commands/similarity.md new file mode 100644 index 0000000..b68826e --- /dev/null +++ b/commands/similarity.md @@ -0,0 +1,50 @@ +--- +name: similarity +description: Find similar code using cached embeddings from PROJECT_INDEX.json +args: + - name: mode + description: Operation mode (query, duplicates, build-cache) + required: false + default: query + - name: algorithm + description: Similarity algorithm (cosine, euclidean, manhattan, dot-product, jaccard, weighted-cosine) + required: false + default: cosine + - name: output + description: Output file path for results + required: false + - name: query + description: Code snippet or function to find similar matches for + required: false +--- + +# Similarity Analysis + +Analyze code similarity using cached neural embeddings from PROJECT_INDEX.json. 
+ +## Usage + +```bash +/index:similarity # Interactive query mode +/index:similarity:query # Explicit query mode +/index:similarity:duplicates # Find duplicate code +/index:similarity:build-cache # Build similarity cache +/index:similarity:query:cosine:results.json # Query with cosine + custom output +``` + +## Modes + +- **query** (default) - Interactive mode to search for similar code +- **duplicates** - Automatically find potential duplicate functions +- **build-cache** - Pre-compute similarity matrix for faster searches + +## Algorithms + +- **cosine** (default) - Best for semantic code similarity +- **euclidean** - Geometric distance, good for exact matches +- **manhattan** - Robust to noise, good for structural similarity +- **dot-product** - Fast computation, not normalized by vector length +- **jaccard** - Set-based similarity over thresholded (binarized) vector components +- **weighted-cosine** - Cosine similarity with per-dimension weights loaded from a weights file + +Results include similarity scores, function signatures, and code context. \ No newline at end of file diff --git a/commands/test-cmd.md b/commands/test-cmd.md new file mode 100644 index 0000000..57f9e76 --- /dev/null +++ b/commands/test-cmd.md @@ -0,0 +1,8 @@ +--- +allowed-tools: Bash(python3 *) +description: Test command +--- + +## Test + +!`python3 -c "print('Hello from Python command!')"` \ No newline at end of file diff --git a/commands/test.md b/commands/test.md new file mode 100644 index 0000000..cbb88d5 --- /dev/null +++ b/commands/test.md @@ -0,0 +1,9 @@ +--- +description: Simple test command +--- + +## Test Command + +This is a test command. 
+ +!`echo "Hello from test command with arguments: $ARGUMENTS"` \ No newline at end of file diff --git a/commands/utility-extractor.md b/commands/utility-extractor.md new file mode 100644 index 0000000..8b66742 --- /dev/null +++ b/commands/utility-extractor.md @@ -0,0 +1,147 @@ +--- +name: utility-extractor +description: Specialized agent for extracting shared utilities from similar (not identical) code patterns. Use for refactoring similarity clusters into configurable implementations. +tools: Read, Edit, MultiEdit, Write, Grep, Glob, Bash +--- + +You are a utility extraction specialist focused on consolidating similar code patterns into flexible, reusable implementations. + +## Primary Responsibilities + +When invoked to extract utilities from similar code: + +1. **Pattern Analysis**: Identify commonalities and variations in similar functions +2. **Design Abstractions**: Create flexible interfaces that handle all variations +3. **Implement Utilities**: Build configurable functions or classes +4. **Migration Strategy**: Plan safe transition from multiple implementations to unified utility +5. 
**Validation**: Ensure new utilities handle all edge cases from original implementations + +## Extraction Strategies + +### For High Similarity (85%+ similar): +- **Parameter Extraction**: Convert differences into parameters +- **Strategy Pattern**: Use function parameters or config objects +- **Template Methods**: Base class with customizable steps + +### For Medium Similarity (70-85% similar): +- **Configuration Objects**: Pass behavior configuration +- **Factory Functions**: Create specialized instances +- **Plugin Architecture**: Extensible base with plugins + +### For Complex Variations: +- **Builder Pattern**: Fluent interface for configuration +- **Chain of Responsibility**: Composable processing steps +- **Command Pattern**: Encapsulate variations as commands + +## Design Principles + +### Flexibility First: +```python +# Before: Multiple similar validation functions +def validate_user_email(email, strict=True): ... +def validate_admin_email(email, domain_check=True): ... +def validate_guest_email(email, required=False): ... + +# After: Unified configurable validator +def validate_email(email, validation_config=None): + config = validation_config or EmailValidationConfig() + # Handles all variations through configuration +``` + +### Configuration-Driven: +```python +# Use config objects for complex variations +@dataclass +class ProcessingConfig: + strict_mode: bool = True + timeout: float = 30.0 + retry_count: int = 3 + validation_rules: List[str] = field(default_factory=list) + +def process_data(data, config: ProcessingConfig = None): + config = config or ProcessingConfig() + # Implementation adapts based on config +``` + +## Implementation Process + +1. **Similarity Analysis**: + - Compare function bodies line by line + - Identify variable parameter patterns + - Document behavior differences + - Map input/output variations + +2. 
**Abstraction Design**: + - Create unified function signature + - Design configuration interface + - Plan backward compatibility + - Define error handling strategy + +3. **Implementation**: + - Build core utility function + - Add configuration options + - Include comprehensive tests + - Document usage patterns + +4. **Migration Execution**: + - Create helper functions for common patterns + - Replace similar functions incrementally + - Update all call sites + - Remove deprecated implementations + +## Quality Standards + +- **Backward Compatibility**: New utilities should handle all existing use cases +- **Performance**: No significant performance degradation +- **Maintainability**: Simpler than multiple similar implementations +- **Testability**: Easy to test all configuration combinations +- **Documentation**: Clear examples for common usage patterns + +## Configuration Patterns + +### Simple Parameter Configuration: +```python +def format_currency(amount, currency='USD', precision=2, symbol=True): + # Handles multiple formatting variations +``` + +### Object Configuration: +```python +@dataclass +class APIConfig: + base_url: str + timeout: float = 30.0 + retries: int = 3 + auth_type: str = 'bearer' + +def make_api_request(endpoint, config: APIConfig, **kwargs): + # Unified API client for all variations +``` + +### Factory Pattern: +```python +def create_validator(validation_type: str, **options): + """Factory for different validation strategies.""" + validators = { + 'strict': StrictValidator, + 'lenient': LenientValidator, + 'custom': CustomValidator + } + return validators[validation_type](**options) +``` + +## Testing Strategy + +- **Configuration Coverage**: Test all configuration combinations +- **Edge Case Preservation**: Maintain handling of original edge cases +- **Performance Tests**: Ensure no significant slowdown +- **Integration Tests**: Verify all migration points work correctly + +## Refactoring Safety + +- **Incremental Migration**: Replace one 
similar function at a time +- **Feature Flags**: Use flags for gradual rollout +- **A/B Testing**: Compare old vs new implementations +- **Monitoring**: Track performance and error rates during migration + +Focus on creating utilities that are more powerful and flexible than the sum of their parts, while maintaining the reliability of the original implementations. \ No newline at end of file diff --git a/configs/edited_settings.json b/configs/edited_settings.json new file mode 100644 index 0000000..23f2775 --- /dev/null +++ b/configs/edited_settings.json @@ -0,0 +1,23 @@ +# File was edited +# Old: "PostToolUse": [ + { + "matcher": "Edit|MultiEdit|Write", + "hooks": [ + { + "type": "command", + "command": "python3 scripts/update_index.py" + } + ] + } + ], +# New: "PostToolUse": [ + { + "matcher": "Edit|MultiEdit|Write", + "hooks": [ + { + "type": "command", + "command": "python3 scripts/reindex_if_needed.py" + } + ] + } + ], \ No newline at end of file diff --git a/configs/example_weights.json b/configs/example_weights.json new file mode 100644 index 0000000..7936563 --- /dev/null +++ b/configs/example_weights.json @@ -0,0 +1,7 @@ +{ + "description": "Example weights for weighted-cosine similarity algorithm", + "weights": [ + 1.2, 0.8, 1.5, 0.9, 1.1, 0.7, 1.3, 1.0, 0.6, 1.4, + 0.8, 1.2, 0.9, 1.1, 0.7, 1.3, 1.0, 0.6, 1.4, 0.8 + ] +} \ No newline at end of file diff --git a/configs/settings.json b/configs/settings.json new file mode 100644 index 0000000..1aa8d5e --- /dev/null +++ b/configs/settings.json @@ -0,0 +1,37 @@ +{ + "hooks": { + "PostToolUse": [ + { + "matcher": "Edit|MultiEdit|Write", + "hooks": [ + { + "type": "command", + "command": "python3 scripts/update_index.py" + } + ] + } + ], + "Stop": [ + { + "matcher": "*", + "hooks": [ + { + "type": "command", + "command": "python3 scripts/reindex_if_needed.py" + } + ] + } + ], + "UserPromptSubmit": [ + { + "matcher": "*", + "hooks": [ + { + "type": "command", + "command": "python3 scripts/i_flag_hook.py" + } + ] 
+ } + ] + } +} \ No newline at end of file diff --git a/configs/settings_dual_mode.json b/configs/settings_dual_mode.json new file mode 100644 index 0000000..197baa5 --- /dev/null +++ b/configs/settings_dual_mode.json @@ -0,0 +1,31 @@ +{ + "hooks": { + "PostToolUse": [ + { + "matcher": "Edit|Write|MultiEdit", + "hooks": [ + { + "type": "command", + "command": "$CLAUDE_PROJECT_DIR/scripts/duplicate_detector_enhanced.py", + "timeout": 5000 + } + ] + } + ], + "SessionStart": [ + { + "hooks": [ + { + "type": "command", + "command": "$CLAUDE_PROJECT_DIR/scripts/load_architecture_context.py" + } + ] + } + ] + }, + "statusLine": { + "type": "command", + "command": "$CLAUDE_PROJECT_DIR/.claude/duplicate-status.sh", + "padding": 0 + } +} \ No newline at end of file diff --git a/docs/TODO.md b/docs/TODO.md new file mode 100644 index 0000000..6e64aa6 --- /dev/null +++ b/docs/TODO.md @@ -0,0 +1,192 @@ +# TODO: PROJECT_INDEX Development Tasks + +## 🧪 Testing & Quality Assurance + +### High Priority +- [ ] **Fix remaining test failures** (8 failures, 1 error remaining) + - Output function tests in `test_similarity_index.py` + - `test_malformed_json_response` in `test_find_ollama.py` + - Edge case handling tests + +- [ ] **Complete integration tests** (in progress) + - Fix mocking issues in `test_integration.py` + - Test full embedding workflow: `project_index.py -e` → `similarity_index.py --build-cache` → query + - Test graceful degradation when Ollama unavailable + - Test cache invalidation and updates + +### Medium Priority +- [ ] **Add real data tests** + - Create `tests/fixtures/` with small sample projects + - Test with actual Python, JavaScript, and shell projects + - Validate parsing accuracy with known codebases + - Test performance with 100+ files + +- [ ] **Add error recovery tests** + - Corrupted `PROJECT_INDEX.json` recovery + - Network interruptions during embedding generation + - File permission issues + - Disk space exhaustion scenarios + - Concurrent access to 
index files + +- [ ] **Performance testing** + - Memory usage benchmarks with large embeddings + - Query speed comparisons (cached vs real-time) + - Index generation time for various project sizes + - Compression efficiency validation + +## 🚀 Feature Enhancements + +### Claude Code Integration +- [ ] **Enhance slash commands** + - Add parameter validation for colon syntax + - Implement progress indicators for long operations + - Add help text for each slash command variant + +- [ ] **Advanced similarity features** + - Semantic code search beyond function similarity + - Cross-language similarity detection + - Code pattern recognition and recommendations + - Duplicate code refactoring suggestions + +### Algorithm Improvements +- [ ] **Expand similarity algorithms** + - Implement semantic hybrid scoring (embedding + AST) + - Add fuzzy string matching for function names + - Context-aware similarity (considering call graphs) + - Add clustering for code organization insights + +- [ ] **Performance optimizations** + - Implement incremental embedding updates + - Add embedding compression/quantization + - Optimize similarity matrix storage + - Add parallel embedding generation + +## 🔧 Code Quality & Maintenance + +### Code Organization +- [ ] **Refactor for modularity** + - Extract embedding logic to separate module + - Create shared utilities for index operations + - Standardize error handling patterns + - Add comprehensive type hints + +- [ ] **Documentation improvements** + - API documentation for all public functions + - Architecture decision records (ADRs) + - Performance tuning guide + - Troubleshooting guide + +### Error Handling +- [ ] **Robust error management** + - Standardize error messages and codes + - Add retry logic for network operations + - Implement graceful degradation strategies + - Add detailed logging options + +## 📊 Advanced Features + +### Analytics & Insights +- [ ] **Code analysis features** + - Complexity metrics integration + - Code quality 
scoring + - Technical debt identification + - Architecture violation detection + +- [ ] **Reporting capabilities** + - Generate similarity analysis reports + - Export findings to various formats (JSON, CSV, HTML) + - Integration with CI/CD pipelines + - Custom report templates + +### Extensibility +- [ ] **Plugin system** + - Custom similarity algorithms + - Language-specific analyzers + - Custom export formats + - Third-party integrations + +- [ ] **Configuration management** + - Project-specific settings files + - Global configuration options + - Environment-based overrides + - Migration tools for config updates + +## 🐛 Known Issues & Bug Fixes + +### Critical Issues +- [ ] **Integration test mocking** (current blocker) + - Fix OllamaManager mocking in integration tests + - Ensure environment variables properly set in tests + - Validate embedding generation in test scenarios + +### Minor Issues +- [ ] **Regex warnings in test code** +- [ ] **Output function test inconsistencies** +- [ ] **Edge case handling in similarity calculations** + +## 🔄 Maintenance Tasks + +### Regular Maintenance +- [ ] **Dependency updates** + - Keep Ollama client libraries current + - Update testing frameworks + - Security vulnerability patches + +- [ ] **Performance monitoring** + - Benchmark regression tests + - Memory leak detection + - Performance profiling automation + +### Documentation Maintenance +- [ ] **Keep README current** + - Update installation instructions + - Refresh usage examples + - Update compatibility matrix + +- [ ] **API stability** + - Version compatibility testing + - Backward compatibility guarantees + - Migration path documentation + +## 🎯 Future Roadmap + +### Long-term Goals +- [ ] **Multi-language support expansion** + - Full parsing for Go, Rust, Java + - Support for more scripting languages + - Framework-specific analyzers (React, Vue, etc.) 
+ +- [ ] **Cloud integration** + - Remote embedding services + - Distributed similarity computation + - Collaborative code analysis + +- [ ] **AI/ML enhancements** + - Custom embedding models training + - Code pattern learning + - Predictive code suggestions + - Automated refactoring recommendations + +## 📝 Current Status + +**Completed ✅:** +- Basic embedding and similarity functionality +- Claude Code slash command structure +- Core unit test coverage (86+ tests) +- Installation and deployment scripts +- Multiple similarity algorithms (6 total) +- Comprehensive CLI interfaces + +**In Progress 🔄:** +- Integration testing framework +- Test failure resolution (50% reduction achieved) +- Advanced error recovery + +**Blocked ⛔:** +- Integration tests (mocking issues) +- Some output function tests (stdout/stderr confusion) + +--- + +*Last updated: 2025-09-01* +*Total estimated effort: ~40-60 hours of development work* \ No newline at end of file diff --git a/docs/edited_TODO.md b/docs/edited_TODO.md new file mode 100644 index 0000000..a794833 --- /dev/null +++ b/docs/edited_TODO.md @@ -0,0 +1,12 @@ +# File was edited +# Old: **Next Priority Tasks:** +1. ✅ ~~Add real data tests with fixture projects~~ **COMPLETED** +2. ✅ ~~Performance testing and benchmarking~~ **COMPLETED** +3. Enhanced error recovery mechanisms +4. API documentation improvements +# New: **Next Priority Tasks:** +1. ✅ ~~Add real data tests with fixture projects~~ **COMPLETED** +2. ✅ ~~Performance testing and benchmarking~~ **COMPLETED** +3. ✅ ~~Fix all test failures and achieve 100% test success~~ **COMPLETED** +4. Enhanced error recovery mechanisms +5. 
API documentation improvements \ No newline at end of file diff --git a/docs/setup-indexing.md b/docs/setup-indexing.md new file mode 100644 index 0000000..7b771b3 --- /dev/null +++ b/docs/setup-indexing.md @@ -0,0 +1,40 @@ +--- +allowed-tools: Bash(cp *), Bash(mkdir *), Bash(chmod *), Bash(python3 *) +description: Set up automatic project indexing for Claude Code +--- + +## Setup Project Indexing + +Configure your project for seamless integration with Claude Code's project indexing system. + +### What this does: + +1. **Copies indexing scripts** to your project +2. **Sets up Claude Code hooks** for automatic maintenance +3. **Creates initial PROJECT_INDEX.json** with semantic analysis +4. **Configures duplicate detection** (optional) + +### Setup Process: + +**Step 1: Copy scripts to your project** +!`mkdir -p scripts && cp -r ~/.claude-code-project-index/scripts/* scripts/ && echo "✓ Scripts copied"` + +**Step 2: Make scripts executable** +!`find scripts -name "*.py" -exec chmod +x {} \; && echo "✓ Scripts made executable"` + +**Step 3: Copy Claude Code configuration** +!`mkdir -p .claude && cp ~/.claude-code-project-index/.claude/settings.json .claude/ && echo "✓ Hooks configured"` + +**Step 4: Generate initial index** +!`python3 scripts/enhanced_project_index.py && echo "✓ Initial PROJECT_INDEX.json created"` + +**Step 5: Add to .gitignore (optional)** +!`echo -e "\n# Claude Code Project Index\n.claude/\nPROJECT_INDEX.json" >> .gitignore && echo "✓ Added to .gitignore"` + +### Verification: + +Check that everything works: !`ls -la PROJECT_INDEX.json .claude/settings.json scripts/` + +🎉 **Your project is now ready for automatic indexing with Claude Code!** + +Use `/semantic-index`, `/duplicates`, and `/analyze` commands to manage your project intelligence. 
\ No newline at end of file diff --git a/scripts/append_cluster_to_embeddings_in_index.py b/scripts/append_cluster_to_embeddings_in_index.py new file mode 100644 index 0000000..2c5453f --- /dev/null +++ b/scripts/append_cluster_to_embeddings_in_index.py @@ -0,0 +1,462 @@ +#!/usr/bin/env python3 +""" +Append Cluster to Embeddings in Index +Step 3 of the embedding workflow: Build similarity cache and append to PROJECT_INDEX.json + +This script builds similarity matrices and clustering data from existing embeddings +and appends the results to PROJECT_INDEX.json for fast future queries. + +Workflow position: project_index.py → append_embeddings_to_index.py → THIS SCRIPT +""" + +__version__ = "0.1.0" + +import json +import math +import argparse +import sys +import hashlib +from datetime import datetime +from pathlib import Path +from typing import Dict, List, Tuple, Optional, Any, Callable + + +class SimilarityAlgorithms: + """Collection of similarity algorithms for vector comparison.""" + + @staticmethod + def cosine_similarity(vec1: List[float], vec2: List[float]) -> float: + """Calculate cosine similarity between two vectors (default algorithm).""" + if len(vec1) != len(vec2): + return 0.0 + + dot_product = sum(a * b for a, b in zip(vec1, vec2)) + magnitude1 = math.sqrt(sum(a * a for a in vec1)) + magnitude2 = math.sqrt(sum(a * a for a in vec2)) + + if magnitude1 == 0 or magnitude2 == 0: + return 0.0 + + return dot_product / (magnitude1 * magnitude2) + + @staticmethod + def euclidean_similarity(vec1: List[float], vec2: List[float]) -> float: + """Calculate similarity based on Euclidean distance.""" + if len(vec1) != len(vec2): + return 0.0 + + distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec1, vec2))) + return 1.0 / (1.0 + distance) + + @staticmethod + def manhattan_similarity(vec1: List[float], vec2: List[float]) -> float: + """Calculate similarity based on Manhattan distance.""" + if len(vec1) != len(vec2): + return 0.0 + + distance = sum(abs(a - b) for a, 
b in zip(vec1, vec2)) + return 1.0 / (1.0 + distance) + + @staticmethod + def dot_product_similarity(vec1: List[float], vec2: List[float]) -> float: + """Calculate dot product similarity, squashed into (0, 1) with tanh.""" + if len(vec1) != len(vec2): + return 0.0 + + dot_product = sum(a * b for a, b in zip(vec1, vec2)) + return (math.tanh(dot_product) + 1) / 2 + + @staticmethod + def jaccard_similarity(vec1: List[float], vec2: List[float], threshold: float = 0.1) -> float: + """Calculate Jaccard similarity by treating vectors as binary (component is 1 when its magnitude exceeds threshold).""" + if len(vec1) != len(vec2): + return 0.0 + + bin1 = [1 if abs(x) > threshold else 0 for x in vec1] + bin2 = [1 if abs(x) > threshold else 0 for x in vec2] + + intersection = sum(1 for a, b in zip(bin1, bin2) if a == 1 and b == 1) + union = sum(1 for a, b in zip(bin1, bin2) if a == 1 or b == 1) + + if union == 0: + return 0.0 + + return intersection / union + + @staticmethod + def weighted_cosine_similarity(vec1: List[float], vec2: List[float], weights: Optional[List[float]] = None) -> float: + """Calculate weighted cosine similarity; short weight lists are padded with 1.0 and long ones are truncated.""" + if len(vec1) != len(vec2): + return 0.0 + + if weights is None: + weights = [1.0] * len(vec1) + elif len(weights) != len(vec1): + if len(weights) < len(vec1): + weights = weights + [1.0] * (len(vec1) - len(weights)) + else: + weights = weights[:len(vec1)] + + weighted_vec1 = [a * w for a, w in zip(vec1, weights)] + weighted_vec2 = [b * w for b, w in zip(vec2, weights)] + + return SimilarityAlgorithms.cosine_similarity(weighted_vec1, weighted_vec2) + + +def get_similarity_algorithm(algorithm_name: str, weights: Optional[List[float]] = None) -> Callable[[List[float], List[float]], float]: + """Get similarity algorithm function by name.""" + algorithms = { + 'cosine': SimilarityAlgorithms.cosine_similarity, + 'euclidean': SimilarityAlgorithms.euclidean_similarity, + 'manhattan': 
SimilarityAlgorithms.manhattan_similarity, + 'dot-product': SimilarityAlgorithms.dot_product_similarity, + 'jaccard': SimilarityAlgorithms.jaccard_similarity, + 
'weighted-cosine': lambda v1, v2: SimilarityAlgorithms.weighted_cosine_similarity(v1, v2, weights) + } + + if algorithm_name not in algorithms: + raise ValueError(f"Unknown algorithm: {algorithm_name}. Available: {list(algorithms.keys())}") + + return algorithms[algorithm_name] + + +def load_weights(weights_path: str) -> List[float]: + """Load weights from JSON file.""" + try: + with open(weights_path, 'r') as f: + data = json.load(f) + + if isinstance(data, list): + return data + elif isinstance(data, dict) and 'weights' in data: + return data['weights'] + else: + raise ValueError("Weights file should contain a list or dict with 'weights' key") + + except Exception as e: + print(f"❌ Error loading weights from {weights_path}: {e}") + sys.exit(1) + + +def load_project_index(index_path: str = "PROJECT_INDEX.json") -> Dict: + """Load PROJECT_INDEX.json with embeddings.""" + try: + with open(index_path, 'r') as f: + return json.load(f) + except FileNotFoundError: + print(f"❌ Error: {index_path} not found!") + print(" Run the embedding workflow first:") + print(" 1. python3 scripts/project_index.py") + print(" 2. 
python3 scripts/append_embeddings_to_index.py") + sys.exit(1) + except json.JSONDecodeError as e: + print(f"❌ Error: Invalid JSON in {index_path}: {e}") + sys.exit(1) + + +def calculate_embedding_hash(embeddings: List[Dict]) -> str: + """Calculate hash of all embeddings to detect changes.""" + embedding_data = [] + for item in embeddings: + embedding_data.append(f"{item['full_name']}:{len(item['embedding'])}") + + combined = "|".join(sorted(embedding_data)) + return hashlib.md5(combined.encode()).hexdigest()[:16] + + +def extract_embeddings_from_index(index: Dict) -> List[Dict]: + """Extract all embeddings with metadata from the project index.""" + embeddings = [] + + # Extract from files section with full embedding data + files_data = index.get('files', {}) + + for file_path, file_data in files_data.items(): + if not isinstance(file_data, dict): + continue + + # Extract functions + for func_name, func_data in file_data.get('functions', {}).items(): + if isinstance(func_data, dict) and 'embedding' in func_data: + embeddings.append({ + 'type': 'function', + 'name': func_name, + 'file': file_path, + 'line': func_data.get('line', 0), + 'signature': func_data.get('signature', '()'), + 'doc': func_data.get('doc', ''), + 'embedding': func_data['embedding'], + 'full_name': f"{file_path}:{func_name}", + 'calls': func_data.get('calls', []), + 'called_by': func_data.get('called_by', []) + }) + + # Extract methods from classes + for class_name, class_data in file_data.get('classes', {}).items(): + if isinstance(class_data, dict): + for method_name, method_data in class_data.get('methods', {}).items(): + if isinstance(method_data, dict) and 'embedding' in method_data: + embeddings.append({ + 'type': 'method', + 'name': method_name, + 'class': class_name, + 'file': file_path, + 'line': method_data.get('line', 0), + 'signature': method_data.get('signature', '()'), + 'doc': method_data.get('doc', ''), + 'embedding': method_data['embedding'], + 'full_name': 
f"{file_path}:{class_name}.{method_name}", + 'calls': method_data.get('calls', []), + 'called_by': method_data.get('called_by', []) + }) + + return embeddings + + +def build_similarity_cache(embeddings: List[Dict], algorithms: List[str], + similarity_threshold: float = 0.5, duplicate_threshold: float = 0.9, + top_k: int = 10, weights: List[float] = None) -> Dict: + """Build similarity cache for all specified algorithms.""" + print(f"🔧 Building similarity cache for {len(embeddings)} items...") + + cache = { + "generated_at": datetime.now().isoformat(), + "embedding_hash": calculate_embedding_hash(embeddings), + "config": { + "similarity_threshold": similarity_threshold, + "duplicate_threshold": duplicate_threshold, + "top_k": top_k + }, + "algorithms": {} + } + + for algorithm in algorithms: + print(f" Processing {algorithm} algorithm...") + + try: + similarity_func = get_similarity_algorithm(algorithm, weights) + except ValueError as e: + print(f" ❌ Skipping {algorithm}: {e}") + continue + + # Find duplicates + duplicates = find_duplicates_internal(embeddings, similarity_func, duplicate_threshold) + + # Build top similarities for each item + top_similar = {} + processed = 0 + + for i, item1 in enumerate(embeddings): + similarities = [] + + for j, item2 in enumerate(embeddings): + if i == j: + continue + + try: + similarity = similarity_func(item1['embedding'], item2['embedding']) + if similarity >= similarity_threshold: + similarities.append({ + "target": item2['full_name'], + "score": round(similarity, 4) + }) + except Exception as e: + continue + + # Sort by similarity and keep top_k + similarities.sort(key=lambda x: x['score'], reverse=True) + if similarities: + top_similar[item1['full_name']] = similarities[:top_k] + + processed += 1 + if processed % 10 == 0: + print(f" Processed {processed}/{len(embeddings)} items...") + + # Store algorithm results + cache["algorithms"][algorithm] = { + "duplicate_groups": duplicates, + "top_similar": top_similar, + "stats": { 
+ "total_items": len(embeddings), + "items_with_similar": len(top_similar), + "duplicate_groups": len(duplicates) + } + } + + print(f" ✅ {algorithm}: {len(duplicates)} duplicate groups, {len(top_similar)} items with similarities") + + return cache + + +def find_duplicates_internal(embeddings: List[Dict], similarity_func: Callable, threshold: float) -> List[Dict]: + """Find duplicate groups using the specified similarity function.""" + duplicates = [] + used_indices = set() + + for i, item1 in enumerate(embeddings): + if i in used_indices: + continue + + similar_group = [{ + "item": item1['full_name'], + "score": 1.0 + }] + + for j, item2 in enumerate(embeddings): + if i == j or j in used_indices: + continue + + try: + similarity = similarity_func(item1['embedding'], item2['embedding']) + if similarity >= threshold: + similar_group.append({ + "item": item2['full_name'], + "score": round(similarity, 4) + }) + used_indices.add(j) + except Exception: + continue + + if len(similar_group) > 1: + duplicates.append({ + "similarity_range": [min(item['score'] for item in similar_group), + max(item['score'] for item in similar_group)], + "items": similar_group + }) + used_indices.add(i) + + return duplicates + + +def save_enhanced_index(index: Dict, output_path: str): + """Save enhanced index with similarity cache to file.""" + try: + with open(output_path, 'w') as f: + json.dump(index, f, separators=(',', ':')) + print(f"💾 Enhanced index saved to: {output_path}") + except Exception as e: + print(f"❌ Error saving to {output_path}: {e}") + sys.exit(1) + + +def print_cache_stats(cache: Dict): + """Print statistics about the similarity cache.""" + print(f"\n📊 Similarity Cache Statistics:") + print(f" Generated: {cache['generated_at']}") + print(f" Algorithms: {len(cache['algorithms'])}") + + for algo_name, algo_data in cache['algorithms'].items(): + stats = algo_data['stats'] + print(f" • {algo_name}:") + print(f" - {stats['total_items']} total items processed") + print(f" - 
{stats['items_with_similar']} items have similar functions") + print(f" - {stats['duplicate_groups']} potential duplicate groups") + + +def main(): + """Main clustering interface - builds and saves similarity cache.""" + parser = argparse.ArgumentParser( + description='Append similarity clustering data to PROJECT_INDEX.json', + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=''' +Examples: + # Build similarity cache with default cosine algorithm + %(prog)s --build-cache + + # Build cache with multiple algorithms + %(prog)s --build-cache --algorithms cosine,euclidean,manhattan + + # Custom output file and thresholds + %(prog)s --build-cache -o ENHANCED_INDEX.json --threshold 0.7 + + # Build cache with weighted cosine + %(prog)s --build-cache --algorithm weighted-cosine --weights weights.json + +This is Step 3 of the embedding workflow: + 1. python3 scripts/project_index.py + 2. python3 scripts/append_embeddings_to_index.py + 3. python3 scripts/append_cluster_to_embeddings_in_index.py --build-cache + ''' + ) + + parser.add_argument('--version', action='version', version=f'Cluster Append v{__version__}') + + # Mode selection (only build-cache for this script) + parser.add_argument('--build-cache', action='store_true', required=True, + help='Build similarity cache and append to index file') + + # Algorithm selection + parser.add_argument('--algorithm', default='cosine', + choices=['cosine', 'euclidean', 'manhattan', 'dot-product', 'jaccard', 'weighted-cosine'], + help='Single similarity algorithm (default: cosine)') + parser.add_argument('--algorithms', type=str, + help='Comma-separated algorithms for cache building (e.g., "cosine,euclidean")') + parser.add_argument('--weights', type=str, + help='Weights file for weighted-cosine algorithm') + + # File I/O + parser.add_argument('-i', '--input', default='PROJECT_INDEX.json', + help='Input index file (default: PROJECT_INDEX.json)') + parser.add_argument('-o', '--output', type=str, + help='Output file for 
enhanced index (default: same as input)') + + # Cache building parameters + parser.add_argument('-k', '--top-k', type=int, default=10, + help='Number of top similar items to cache per function (default: 10)') + parser.add_argument('-t', '--threshold', type=float, default=0.5, + help='Similarity threshold for caching (default: 0.5)') + parser.add_argument('--duplicate-threshold', type=float, default=0.9, + help='Duplicate detection threshold (default: 0.9)') + + args = parser.parse_args() + + # Validate arguments + if args.algorithm == 'weighted-cosine' and not args.weights: + print("❌ Error: weighted-cosine algorithm requires --weights parameter") + sys.exit(1) + + # Load weights if specified + weights = None + if args.weights: + weights = load_weights(args.weights) + + # Load project index + print(f"📊 Loading project index: {args.input}") + index = load_project_index(args.input) + + # Extract embeddings + embeddings = extract_embeddings_from_index(index) + if not embeddings: + print("❌ No embeddings found! 
Generate embeddings first:") + print(" python3 scripts/append_embeddings_to_index.py") + sys.exit(1) + + print(f"✅ Loaded {len(embeddings)} functions/methods with embeddings\n") + + # Determine algorithms to use + algorithms = args.algorithms.split(',') if args.algorithms else [args.algorithm] + algorithms = [alg.strip() for alg in algorithms] + + print(f"🔧 Building similarity cache with algorithms: {', '.join(algorithms)}") + + # Build similarity cache + similarity_cache = build_similarity_cache( + embeddings, algorithms, args.threshold, + args.duplicate_threshold, args.top_k, weights + ) + + # Add cache to index + index['similarity_analysis'] = similarity_cache + + # Save enhanced index + output_path = args.output or args.input + save_enhanced_index(index, output_path) + + # Print statistics + print_cache_stats(similarity_cache) + print(f"\n✅ Similarity clustering completed!") + print(f"💡 Query similar functions with: python3 scripts/query_index.py -q 'your search'") + + +if __name__ == '__main__': + main() \ No newline at end of file diff --git a/scripts/append_embeddings_to_index.py b/scripts/append_embeddings_to_index.py new file mode 100644 index 0000000..177b463 --- /dev/null +++ b/scripts/append_embeddings_to_index.py @@ -0,0 +1,373 @@ +#!/usr/bin/env python3 +""" +Append Embeddings to Index +Extends PROJECT_INDEX.json by adding neural embeddings to functions and classes. + +This script reads the standard PROJECT_INDEX.json and adds embedding vectors +to each function and class method using Ollama's embedding models. 
+ +Usage: python append_embeddings_to_index.py [OPTIONS] +Output: Extends existing PROJECT_INDEX.json with 'embedding' fields +""" + +__version__ = "0.1.0" + +import json +import os +import sys +import argparse +from datetime import datetime +from pathlib import Path +from typing import Dict, List, Optional, Tuple + +# Import centralized Ollama management +try: + from find_ollama import OllamaManager +except ImportError: + # Fallback if find_ollama.py is not in the same directory + sys.path.insert(0, str(Path(__file__).parent)) + from find_ollama import OllamaManager + + +def generate_embedding(text: str, model_name: str = None, endpoint: str = None) -> Optional[List[float]]: + """Generate embedding for text using centralized Ollama management. + Returns None on error. + """ + model_name = model_name or os.getenv('EMBED_MODEL_NAME', 'nomic-embed-text') + endpoint = endpoint or os.getenv('EMBED_ENDPOINT', 'http://localhost:11434') + + try: + manager = OllamaManager(endpoint) + manager.default_model = model_name + success, embedding, error = manager.generate_embedding(text, model_name) + if success: + return embedding + else: + print(f" Warning: Could not generate embedding: {error}", file=sys.stderr) + return None + except Exception as e: + print(f" Warning: Could not generate embedding: {e}", file=sys.stderr) + return None + + +def load_project_index(index_path: str) -> Dict: + """Load existing PROJECT_INDEX.json.""" + try: + with open(index_path, 'r') as f: + return json.load(f) + except FileNotFoundError: + print(f"❌ Error: {index_path} not found!") + print(" Run project_index.py first to generate the base index:") + print(f" python3 scripts/project_index.py") + sys.exit(1) + except json.JSONDecodeError as e: + print(f"❌ Error: Invalid JSON in {index_path}: {e}") + sys.exit(1) + + +def detect_format(index: Dict) -> str: + """Detect whether index is in original or compressed format.""" + if 'files' in index and isinstance(index['files'], dict): + return 'original' 
+ elif 'f' in index: + return 'compressed' + else: + print("❌ Error: Unknown index format") + sys.exit(1) + + +def add_embeddings_to_original_format(index: Dict, model_name: str, endpoint: str) -> int: + """Add embeddings to original format index.""" + embeddings_added = 0 + + for file_path, file_info in index.get('files', {}).items(): + if not isinstance(file_info, dict) or not file_info.get('parsed', False): + continue + + # Process functions + for func_name, func_data in file_info.get('functions', {}).items(): + if isinstance(func_data, dict) and 'embedding' not in func_data: + # Create text representation for embedding + func_text = f"Function: {func_name}\n" + if 'signature' in func_data: + func_text += f"Signature: {func_data['signature']}\n" + if 'doc' in func_data: + func_text += f"Documentation: {func_data['doc']}\n" + if 'calls' in func_data: + func_text += f"Calls: {', '.join(func_data['calls'])}\n" + + embedding = generate_embedding(func_text, model_name, endpoint) + if embedding: + func_data['embedding'] = embedding + embeddings_added += 1 + + # Process classes and methods + for class_name, class_data in file_info.get('classes', {}).items(): + if isinstance(class_data, dict): + # Class-level embedding + if 'embedding' not in class_data: + class_text = f"Class: {class_name}\n" + if 'inherits' in class_data: + class_text += f"Inherits: {', '.join(class_data['inherits'])}\n" + if 'doc' in class_data: + class_text += f"Documentation: {class_data['doc']}\n" + + class_embedding = generate_embedding(class_text, model_name, endpoint) + if class_embedding: + class_data['embedding'] = class_embedding + embeddings_added += 1 + + # Method embeddings + for method_name, method_data in class_data.get('methods', {}).items(): + if isinstance(method_data, dict) and 'embedding' not in method_data: + method_text = f"Method: {class_name}.{method_name}\n" + if 'signature' in method_data: + method_text += f"Signature: {method_data['signature']}\n" + if 'doc' in 
method_data: + method_text += f"Documentation: {method_data['doc']}\n" + if 'calls' in method_data: + method_text += f"Calls: {', '.join(method_data['calls'])}\n" + + method_embedding = generate_embedding(method_text, model_name, endpoint) + if method_embedding: + method_data['embedding'] = method_embedding + embeddings_added += 1 + + return embeddings_added + + +def expand_compressed_to_original_with_embeddings(index: Dict, model_name: str, endpoint: str) -> Tuple[Dict, int]: + """Expand compressed format to original format with embeddings.""" + expanded = { + 'indexed_at': index.get('at', ''), + 'root': index.get('root', '.'), + 'project_structure': { + 'type': 'tree', + 'root': '.', + 'tree': index.get('tree', []) + }, + 'documentation_map': index.get('d', {}), + 'directory_purposes': index.get('dir_purposes', {}), + 'stats': index.get('stats', {}), + 'files': {}, + 'dependency_graph': index.get('deps', {}), + 'staleness_check': index.get('staleness', datetime.now().timestamp() - 7 * 24 * 60 * 60) + } + + embeddings_added = 0 + + # Process compressed files + for abbrev_path, file_data in index.get('f', {}).items(): + if not isinstance(file_data, list) or len(file_data) < 2: + continue + + # Expand abbreviated path + full_path = abbrev_path.replace('s/', 'scripts/').replace('sr/', 'src/').replace('t/', 'tests/') + + # Decode language + lang_code = file_data[0] + lang_map = {'p': 'python', 'j': 'javascript', 't': 'typescript', 's': 'shell'} + language = lang_map.get(lang_code, 'unknown') + + file_info = { + 'language': language, + 'parsed': True, + 'functions': {}, + 'classes': {} + } + + # Process functions (second element) + if len(file_data) > 1 and isinstance(file_data[1], list): + for func_str in file_data[1]: + parts = func_str.split(':') + if len(parts) >= 5: + func_name = parts[0] + line = int(parts[1]) if parts[1].isdigit() else 0 + signature = parts[2].replace('>', ' -> ').replace(':', ': ') + calls = parts[3].split(',') if parts[3] else [] + doc = 
parts[4] + + func_data = { + 'line': line, + 'signature': signature, + 'doc': doc, + 'calls': calls + } + + # Generate embedding + func_text = f"Function: {func_name}\n" + func_text += f"Signature: {signature}\n" + if doc: + func_text += f"Documentation: {doc}\n" + if calls: + func_text += f"Calls: {', '.join(calls)}\n" + + embedding = generate_embedding(func_text, model_name, endpoint) + if embedding: + func_data['embedding'] = embedding + embeddings_added += 1 + + file_info['functions'][func_name] = func_data + + # Process classes (third element) + if len(file_data) > 2 and isinstance(file_data[2], dict): + for class_name, class_info in file_data[2].items(): + if isinstance(class_info, list) and len(class_info) >= 2: + # str() keeps this safe whether the line is stored as a string or an int + class_line = int(class_info[0]) if str(class_info[0]).isdigit() else 0 + methods = class_info[1] if isinstance(class_info[1], list) else [] + + class_data = { + 'line': class_line, + 'methods': {} + } + + # Generate class embedding + class_text = f"Class: {class_name}\n" + class_embedding = generate_embedding(class_text, model_name, endpoint) + if class_embedding: + class_data['embedding'] = class_embedding + embeddings_added += 1 + + # Process methods + for method_str in methods: + parts = method_str.split(':') + if len(parts) >= 5: + method_name = parts[0] + method_line = int(parts[1]) if parts[1].isdigit() else 0 + method_sig = parts[2].replace('>', ' -> ').replace(':', ': ') + method_calls = parts[3].split(',') if parts[3] else [] + method_doc = parts[4] + + method_data = { + 'line': method_line, + 'signature': method_sig, + 'doc': method_doc, + 'calls': method_calls + } + + # Generate method embedding + method_text = f"Method: {class_name}.{method_name}\n" + method_text += f"Signature: {method_sig}\n" + if method_doc: + method_text += f"Documentation: {method_doc}\n" + if method_calls: + method_text += f"Calls: {', '.join(method_calls)}\n" + + method_embedding = generate_embedding(method_text, model_name, endpoint) + if method_embedding: + 
method_data['embedding'] = method_embedding + embeddings_added += 1 + + class_data['methods'][method_name] = method_data + + file_info['classes'][class_name] = class_data + + expanded['files'][full_path] = file_info + + # Update stats + if 'embeddings_generated' not in expanded['stats']: + expanded['stats']['embeddings_generated'] = embeddings_added + + return expanded, embeddings_added + + +def main(): + """Main embedding generation interface.""" + parser = argparse.ArgumentParser( + description='Add neural embeddings to PROJECT_INDEX.json', + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=''' +Examples: + %(prog)s # Add embeddings to PROJECT_INDEX.json + %(prog)s --model mxbai-embed-large # Use different embedding model + %(prog)s --input CUSTOM_INDEX.json # Process custom index file + %(prog)s --output EMBEDDED_INDEX.json # Save to different file + ''' + ) + + parser.add_argument('--version', action='version', version=f'Embedding Generator v{__version__}') + + # File I/O + parser.add_argument('-i', '--input', default='PROJECT_INDEX.json', + help='Input index file (default: PROJECT_INDEX.json)') + parser.add_argument('-o', '--output', type=str, + help='Output file (default: same as input)') + + # Embedding options + parser.add_argument('--model', default='nomic-embed-text', + help='Ollama model for embeddings (default: nomic-embed-text)') + parser.add_argument('--endpoint', default='http://localhost:11434', + help='Ollama API endpoint (default: http://localhost:11434)') + + # Options + parser.add_argument('--force', action='store_true', + help='Regenerate embeddings even if they already exist') + parser.add_argument('--expand-compressed', action='store_true', + help='Expand compressed format to original format with embeddings') + + args = parser.parse_args() + + print(f"🧠 Adding embeddings to index: {args.input}") + + # Load project index + index = load_project_index(args.input) + format_type = detect_format(index) + + print(f"📊 Detected format: 
{format_type}") + + embeddings_added = 0 + + if format_type == 'original': + if not args.force: + # Check if embeddings already exist + has_embeddings = False + for file_info in index.get('files', {}).values(): + if isinstance(file_info, dict): + for func_data in file_info.get('functions', {}).values(): + if isinstance(func_data, dict) and 'embedding' in func_data: + has_embeddings = True + break + if has_embeddings: + break + + if has_embeddings: + print("⚠️ Embeddings already exist. Use --force to regenerate.") + return + + print("🔧 Adding embeddings to original format...") + embeddings_added = add_embeddings_to_original_format(index, args.model, args.endpoint) + + elif format_type == 'compressed': + if args.expand_compressed: + print("🔧 Expanding compressed format and adding embeddings...") + index, embeddings_added = expand_compressed_to_original_with_embeddings(index, args.model, args.endpoint) + else: + print("❌ Cannot add embeddings to compressed format.") + print(" Use --expand-compressed to expand to original format with embeddings") + print(" or run project_index.py without compression first.") + sys.exit(1) + + if embeddings_added == 0: + print("⚠️ No embeddings were generated.") + return + + # Update stats + if 'stats' not in index: + index['stats'] = {} + index['stats']['embeddings_generated'] = embeddings_added + index['embeddings_updated_at'] = datetime.now().isoformat() + + # Save enhanced index + output_path = args.output or args.input + try: + with open(output_path, 'w') as f: + json.dump(index, f, separators=(',', ':')) + print(f"💾 Enhanced index saved to: {output_path}") + print(f"✅ Generated {embeddings_added} embeddings") + except Exception as e: + print(f"❌ Error saving to {output_path}: {e}") + sys.exit(1) + + +if __name__ == '__main__': + main() \ No newline at end of file diff --git a/scripts/duplicate_detector.py b/scripts/duplicate_detector.py new file mode 100644 index 0000000..41eebd8 --- /dev/null +++ b/scripts/duplicate_detector.py 
@@ -0,0 +1,439 @@ +#!/usr/bin/env python3 +""" +PostToolUse hook for real-time duplicate code detection. +Analyzes new/modified code and warns about potential duplicates. +""" + +import json +import sys +import os +import re +from pathlib import Path +from typing import Dict, List, Any, Optional + +# Import utilities from index_utils +try: + from index_utils import ( + find_similar_functions, + create_ast_fingerprint, + create_tfidf_embeddings, + compute_code_similarity, + normalize_code_for_comparison, + extract_python_signatures, + extract_javascript_signatures, + extract_shell_signatures, + PARSEABLE_LANGUAGES + ) +except ImportError: + # Add current directory to path for imports + sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) + from index_utils import ( + find_similar_functions, + create_ast_fingerprint, + create_tfidf_embeddings, + compute_code_similarity, + normalize_code_for_comparison, + extract_python_signatures, + extract_javascript_signatures, + extract_shell_signatures, + PARSEABLE_LANGUAGES + ) + + +class DuplicateDetector: + """Real-time duplicate detection for code modifications.""" + + def __init__(self, project_root: str): + self.project_root = Path(project_root) + self.index_path = self.project_root / 'PROJECT_INDEX.json' + self.index_data = None + self.load_index() + + def load_index(self): + """Load the project index with semantic data.""" + if not self.index_path.exists(): + self.index_data = {} + return + + try: + with open(self.index_path, 'r') as f: + self.index_data = json.load(f) + except Exception as e: + print(f"Warning: Could not load index: {e}", file=sys.stderr) + self.index_data = {} + + def analyze_code_change(self, tool_input: Dict[str, Any]) -> Dict[str, Any]: + """Analyze a code change for potential duplicates.""" + file_path = tool_input.get('file_path', '') + content = tool_input.get('content', '') + + if not file_path or not content: + return {'no_duplicates': True} + + # Determine file language + file_ext = 
Path(file_path).suffix + language = PARSEABLE_LANGUAGES.get(file_ext, 'unknown') + + if language == 'unknown': + return {'no_duplicates': True} + + # Extract functions from the new/modified content + new_functions = self._extract_functions_from_content(content, language) + + if not new_functions: + return {'no_duplicates': True} + + # Check each function for duplicates + duplicates = [] + for func_data in new_functions: + similar_funcs = self._find_duplicates(func_data, language) + if similar_funcs: + duplicates.extend(similar_funcs) + + if duplicates: + return { + 'duplicates_found': True, + 'duplicates': duplicates, + 'file_path': file_path + } + + return {'no_duplicates': True} + + def _extract_functions_from_content(self, content: str, language: str) -> List[Dict[str, Any]]: + """Extract function definitions from code content.""" + functions = [] + + try: + if language == 'python': + signatures = extract_python_signatures(content) + functions.extend(self._parse_python_functions(content, signatures.get('functions', {}))) + elif language in ['javascript', 'typescript']: + signatures = extract_javascript_signatures(content) + functions.extend(self._parse_javascript_functions(content, signatures.get('functions', {}))) + elif language == 'shell': + signatures = extract_shell_signatures(content) + functions.extend(self._parse_shell_functions(content, signatures.get('functions', {}))) + except Exception as e: + print(f"Warning: Could not parse {language} code: {e}", file=sys.stderr) + + return functions + + def _parse_python_functions(self, content: str, functions: Dict[str, Any]) -> List[Dict[str, Any]]: + """Parse Python functions and extract their bodies.""" + result = [] + lines = content.split('\n') + + for func_name, func_info in functions.items(): + # Skip private/dunder methods for duplicate detection + if func_name.startswith('_'): + continue + + # Find function definition and extract body + for i, line in enumerate(lines): + if f"def {func_name}(" in line: + 
body_lines = [] + indent_level = len(line) - len(line.lstrip()) + + # Collect function body + for j in range(i + 1, len(lines)): + current_line = lines[j] + if not current_line.strip(): + body_lines.append(current_line) + continue + + current_indent = len(current_line) - len(current_line.lstrip()) + if current_indent <= indent_level and current_line.strip(): + break + + body_lines.append(current_line) + + if body_lines: + # Clean up the body (remove docstrings and comments for analysis) + clean_body = self._clean_function_body('\n'.join(body_lines)) + if len(clean_body.strip()) > 10: # Ignore trivial functions + result.append({ + 'name': func_name, + 'signature': func_info.get('signature', '') if isinstance(func_info, dict) else func_info, + 'body': '\n'.join(body_lines), + 'clean_body': clean_body, + 'language': 'python' + }) + break + + return result + + def _parse_javascript_functions(self, content: str, functions: Dict[str, Any]) -> List[Dict[str, Any]]: + """Parse JavaScript/TypeScript functions and extract their bodies.""" + # Simplified extraction - for full implementation, would need better JS parsing + result = [] + + for func_name, func_info in functions.items(): + # Find function in content and extract body + patterns = [ + rf'function\s+{func_name}\s*\([^)]*\)\s*{{([^}}]+)}}', + rf'const\s+{func_name}\s*=\s*\([^)]*\)\s*=>\s*{{([^}}]+)}}', + rf'{func_name}\s*\([^)]*\)\s*{{([^}}]+)}}' + ] + + for pattern in patterns: + match = re.search(pattern, content, re.DOTALL) + if match: + body = match.group(1) if match.lastindex else '' + clean_body = self._clean_function_body(body) + if len(clean_body.strip()) > 10: + result.append({ + 'name': func_name, + 'signature': func_info.get('signature', '') if isinstance(func_info, dict) else func_info, + 'body': body, + 'clean_body': clean_body, + 'language': 'javascript' + }) + break + + return result + + def _parse_shell_functions(self, content: str, functions: Dict[str, Any]) -> List[Dict[str, Any]]: + """Parse shell 
functions and extract their bodies.""" + result = [] + lines = content.split('\n') + + for func_name, func_info in functions.items(): + # Find function definition + for i, line in enumerate(lines): + if f"{func_name}()" in line or f"function {func_name}" in line: + body_lines = [] + brace_count = 0 + in_function = False + + for j in range(i + 1, len(lines)): + current_line = lines[j] + + if '{' in current_line: + brace_count += current_line.count('{') + in_function = True + + if in_function: + body_lines.append(current_line) + + if '}' in current_line: + brace_count -= current_line.count('}') + if brace_count <= 0: + break + + if body_lines: + body = '\n'.join(body_lines) + clean_body = self._clean_function_body(body) + if len(clean_body.strip()) > 10: + result.append({ + 'name': func_name, + 'signature': func_info.get('signature', '') if isinstance(func_info, dict) else func_info, + 'body': body, + 'clean_body': clean_body, + 'language': 'shell' + }) + break + + return result + + def _clean_function_body(self, body: str) -> str: + """Clean function body for duplicate detection analysis.""" + # Remove comments + cleaned = re.sub(r'#.*$', '', body, flags=re.MULTILINE) + cleaned = re.sub(r'//.*$', '', cleaned, flags=re.MULTILINE) + cleaned = re.sub(r'/\*.*?\*/', '', cleaned, flags=re.DOTALL) + + # Remove docstrings + cleaned = re.sub(r'""".*?"""', '', cleaned, flags=re.DOTALL) + cleaned = re.sub(r"'''.*?'''", '', cleaned, flags=re.DOTALL) + + # Remove string literals (but keep structure) + cleaned = re.sub(r'"[^"]*"', '"STRING"', cleaned) + cleaned = re.sub(r"'[^']*'", "'STRING'", cleaned) + + # Normalize whitespace + cleaned = re.sub(r'\s+', ' ', cleaned) + + # Remove empty lines + lines = [line.strip() for line in cleaned.split('\n') if line.strip()] + return '\n'.join(lines) + + def _find_duplicates(self, func_data: Dict[str, Any], language: str) -> List[Dict[str, Any]]: + """Find duplicates for a specific function.""" + duplicates = [] + + if not self.index_data 
or 'semantic_index' not in self.index_data: + return duplicates + + semantic_index = self.index_data['semantic_index'] + existing_functions = semantic_index.get('functions', {}) + + # Check for exact structural matches first (AST fingerprint) + ast_fingerprint = create_ast_fingerprint(func_data['clean_body'], language) + if ast_fingerprint: + for func_id, func_info in existing_functions.items(): + if func_info.get('ast_fingerprint') == ast_fingerprint: + duplicates.append({ + 'type': 'exact_structural_duplicate', + 'similarity': 1.0, + 'existing_function': func_id, + 'existing_signature': func_info.get('signature', ''), + 'new_function': func_data['name'], + 'message': f"Function '{func_data['name']}' has identical structure to existing function" + }) + + # Check for semantic similarity using TF-IDF + if func_data['clean_body'].strip(): + similar_functions = find_similar_functions( + func_data['clean_body'], + self.index_data, + similarity_threshold=0.8 + ) + + for similar in similar_functions: + # Avoid duplicating exact matches + if similar['similarity'] < 1.0: + duplicates.append({ + 'type': 'semantic_similarity', + 'similarity': similar['similarity'], + 'existing_function': similar['function_id'], + 'existing_signature': similar.get('signature', ''), + 'new_function': func_data['name'], + 'message': f"Function '{func_data['name']}' is {similar['similarity']*100:.0f}% similar to existing function" + }) + + # Check for naming pattern violations + existing_names = [func_id.split(':')[-1] for func_id in existing_functions.keys()] + naming_duplicates = self._check_naming_patterns(func_data['name'], existing_names) + duplicates.extend(naming_duplicates) + + return duplicates + + def _check_naming_patterns(self, new_func_name: str, existing_names: List[str]) -> List[Dict[str, Any]]: + """Check for naming pattern violations and similar names.""" + violations = [] + + # Check for very similar names (potential typos or slight variations) + for existing_name in 
existing_names: + similarity = self._string_similarity(new_func_name.lower(), existing_name.lower()) + if 0.8 <= similarity < 1.0: # Very similar but not identical + violations.append({ + 'type': 'similar_naming', + 'similarity': similarity, + 'existing_function': existing_name, + 'new_function': new_func_name, + 'message': f"Function name '{new_func_name}' is very similar to existing '{existing_name}'" + }) + + return violations + + def _string_similarity(self, s1: str, s2: str) -> float: + """Calculate string similarity using simple character-based approach.""" + if not s1 or not s2: + return 0.0 + + # Simple Jaccard similarity on character bigrams + bigrams1 = set(s1[i:i+2] for i in range(len(s1)-1)) + bigrams2 = set(s2[i:i+2] for i in range(len(s2)-1)) + + if not bigrams1 and not bigrams2: + return 1.0 + if not bigrams1 or not bigrams2: + return 0.0 + + intersection = len(bigrams1.intersection(bigrams2)) + union = len(bigrams1.union(bigrams2)) + + return intersection / union if union > 0 else 0.0 + + def generate_warning_message(self, duplicates: List[Dict[str, Any]], file_path: str) -> str: + """Generate a human-readable warning message about duplicates.""" + if not duplicates: + return "" + + messages = ["⚠️ Duplicate code detected:"] + + # Group by type + exact_duplicates = [d for d in duplicates if d['type'] == 'exact_structural_duplicate'] + semantic_duplicates = [d for d in duplicates if d['type'] == 'semantic_similarity'] + naming_duplicates = [d for d in duplicates if d['type'] == 'similar_naming'] + + if exact_duplicates: + messages.append("\n🚨 Exact duplicates:") + for dup in exact_duplicates[:3]: # Limit to top 3 + messages.append(f" • {dup['message']}") + messages.append(f" Existing: {dup['existing_function']}") + + if semantic_duplicates: + messages.append("\n📊 Similar implementations:") + for dup in semantic_duplicates[:3]: # Limit to top 3 + messages.append(f" • {dup['message']}") + messages.append(f" Existing: {dup['existing_function']}") + + 
if naming_duplicates: + messages.append("\n📝 Similar names:") + for dup in naming_duplicates[:2]: # Limit to top 2 + messages.append(f" • {dup['message']}") + + messages.append("\n💡 Suggestions:") + if exact_duplicates: + messages.append(" • Consider using the existing function or extracting shared logic") + if semantic_duplicates: + messages.append(" • Review if existing implementation can be reused or extended") + if naming_duplicates: + messages.append(" • Consider renaming to avoid confusion") + + # Join with real newlines; an escaped "\\n" would show up as literal text in the hook output + return "\n".join(messages) + + +def main(): + """Main hook entry point.""" + try: + # Read hook input from stdin + input_data = json.load(sys.stdin) + except json.JSONDecodeError as e: + print(f"Error: Invalid JSON input: {e}", file=sys.stderr) + sys.exit(1) + except Exception as e: + print(f"Error reading input: {e}", file=sys.stderr) + sys.exit(1) + + # Extract relevant information + tool_name = input_data.get('tool_name', '') + tool_input = input_data.get('tool_input', {}) + + # Only process code editing tools + if tool_name not in ['Edit', 'Write', 'MultiEdit']: + sys.exit(0) + + # Get project directory + project_dir = os.environ.get('CLAUDE_PROJECT_DIR') + if not project_dir: + project_dir = os.getcwd() + + # Initialize detector + detector = DuplicateDetector(project_dir) + + # Analyze the code change + analysis = detector.analyze_code_change(tool_input) + + # If duplicates found, block the operation + if analysis.get('duplicates_found', False): + duplicates = analysis.get('duplicates', []) + file_path = analysis.get('file_path', '') + + warning_message = detector.generate_warning_message(duplicates, file_path) + + # Return blocking response + output = { + "decision": "block", + "reason": warning_message + } + print(json.dumps(output)) + sys.exit(0) + + # No duplicates found, allow operation to proceed + sys.exit(0) + + +if __name__ == '__main__': + main() \ No newline at end of file diff --git a/scripts/duplicate_detector_enhanced.py 
b/scripts/duplicate_detector_enhanced.py new file mode 100644 index 0000000..7fb3438 --- /dev/null +++ b/scripts/duplicate_detector_enhanced.py @@ -0,0 +1,371 @@ +#!/usr/bin/env python3 +""" +Enhanced PostToolUse hook for dual-mode duplicate code detection. +Supports both blocking (active) and passive (monitoring) modes. +""" + +import json +import sys +import os +import time +from pathlib import Path +from typing import Dict, List, Any, Optional + +# Import utilities from index_utils +try: + from index_utils import ( + find_similar_functions, + create_ast_fingerprint, + create_tfidf_embeddings, + compute_code_similarity, + normalize_code_for_comparison, + extract_python_signatures, + extract_javascript_signatures, + extract_shell_signatures, + PARSEABLE_LANGUAGES + ) + from duplicate_detector import DuplicateDetector +except ImportError: + # Add current directory to path for imports + sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) + from index_utils import ( + find_similar_functions, + create_ast_fingerprint, + create_tfidf_embeddings, + compute_code_similarity, + normalize_code_for_comparison, + extract_python_signatures, + extract_javascript_signatures, + extract_shell_signatures, + PARSEABLE_LANGUAGES + ) + from duplicate_detector import DuplicateDetector + + +class EnhancedDuplicateDetector(DuplicateDetector): + """Enhanced duplicate detector with dual-mode support.""" + + def __init__(self, project_root: str): + super().__init__(project_root) + self.mode_config_path = self.project_root / '.claude' / 'duplicate_detection_mode.json' + self.detection_log_path = self.project_root / '.claude' / 'duplicate_detection.log' + self.stats_path = self.project_root / '.claude' / 'duplicate_stats.json' + self.load_mode_config() + self.load_stats() + + def load_mode_config(self): + """Load mode configuration (blocking vs passive).""" + default_config = { + "mode": "blocking", # "blocking" or "passive" + "similarity_threshold": 0.8, + 
"block_exact_duplicates": True, + "block_high_similarity": True, + "block_naming_conflicts": False, + "log_all_detections": True, + "show_suggestions": True, + "last_updated": time.time() + } + + if self.mode_config_path.exists(): + try: + with open(self.mode_config_path, 'r') as f: + self.config = json.load(f) + # Merge with defaults for any missing keys + for key, value in default_config.items(): + if key not in self.config: + self.config[key] = value + except Exception as e: + print(f"Warning: Could not load mode config: {e}", file=sys.stderr) + self.config = default_config + else: + self.config = default_config + self.save_mode_config() + + def save_mode_config(self): + """Save current mode configuration.""" + self.config['last_updated'] = time.time() + os.makedirs(self.mode_config_path.parent, exist_ok=True) + try: + with open(self.mode_config_path, 'w') as f: + json.dump(self.config, f, indent=2) + except Exception as e: + print(f"Warning: Could not save mode config: {e}", file=sys.stderr) + + def load_stats(self): + """Load detection statistics.""" + default_stats = { + "total_detections": 0, + "exact_duplicates_found": 0, + "semantic_similarities_found": 0, + "naming_conflicts_found": 0, + "blocks_prevented": 0, + "passive_warnings_issued": 0, + "last_detection": None, + "session_start": time.time() + } + + if self.stats_path.exists(): + try: + with open(self.stats_path, 'r') as f: + self.stats = json.load(f) + # Merge with defaults + for key, value in default_stats.items(): + if key not in self.stats: + self.stats[key] = value + except Exception as e: + print(f"Warning: Could not load stats: {e}", file=sys.stderr) + self.stats = default_stats + else: + self.stats = default_stats + + def save_stats(self): + """Save detection statistics.""" + os.makedirs(self.stats_path.parent, exist_ok=True) + try: + with open(self.stats_path, 'w') as f: + json.dump(self.stats, f, indent=2) + except Exception as e: + print(f"Warning: Could not save stats: {e}", 
file=sys.stderr) + + def log_detection(self, detection_type: str, details: Dict[str, Any]): + """Log detection event.""" + if not self.config.get('log_all_detections', True): + return + + log_entry = { + "timestamp": time.time(), + "type": detection_type, + "mode": self.config['mode'], + "details": details + } + + os.makedirs(self.detection_log_path.parent, exist_ok=True) + try: + with open(self.detection_log_path, 'a') as f: + f.write(json.dumps(log_entry) + '\n') + except Exception as e: + print(f"Warning: Could not write log: {e}", file=sys.stderr) + + def update_stats(self, duplicates: List[Dict[str, Any]], blocked: bool): + """Update detection statistics.""" + self.stats['total_detections'] += 1 + self.stats['last_detection'] = time.time() + + for duplicate in duplicates: + if duplicate['type'] == 'exact_structural_duplicate': + self.stats['exact_duplicates_found'] += 1 + elif duplicate['type'] == 'semantic_similarity': + self.stats['semantic_similarities_found'] += 1 + elif duplicate['type'] == 'similar_naming': + self.stats['naming_conflicts_found'] += 1 + + if blocked: + self.stats['blocks_prevented'] += 1 + else: + self.stats['passive_warnings_issued'] += 1 + + self.save_stats() + + def analyze_with_mode_awareness(self, tool_input: Dict[str, Any]) -> Dict[str, Any]: + """Analyze code change with mode-aware response.""" + # Perform standard duplicate analysis + analysis = self.analyze_code_change(tool_input) + + # Check the same 'duplicates_found' key the base hook uses; defaulting a missing + # 'no_duplicates' key to True would silently discard every detection + if not analysis.get('duplicates_found', False): + return {'no_duplicates': True} + + duplicates = analysis.get('duplicates', []) + file_path = analysis.get('file_path', '') + + # Filter duplicates based on configuration + filtered_duplicates = self._filter_duplicates_by_config(duplicates) + + if not filtered_duplicates: + return {'no_duplicates': True} + + # Determine response based on mode + mode = self.config['mode'] + should_block = mode == 'blocking' and self._should_block(filtered_duplicates) + + # Log the detection + 
self.log_detection('duplicate_found', { + 'file_path': file_path, + 'duplicate_count': len(filtered_duplicates), + 'mode': mode, + 'blocked': should_block, + 'duplicates': filtered_duplicates + }) + + # Update statistics + self.update_stats(filtered_duplicates, should_block) + + if should_block: + # Blocking mode - prevent the operation + return { + 'duplicates_found': True, + 'mode': 'blocking', + 'duplicates': filtered_duplicates, + 'file_path': file_path, + 'block_operation': True + } + else: + # Passive mode - just inform + return { + 'duplicates_found': True, + 'mode': 'passive', + 'duplicates': filtered_duplicates, + 'file_path': file_path, + 'block_operation': False + } + + def _filter_duplicates_by_config(self, duplicates: List[Dict[str, Any]]) -> List[Dict[str, Any]]: + """Filter duplicates based on configuration settings.""" + filtered = [] + + for duplicate in duplicates: + include = False + + if duplicate['type'] == 'exact_structural_duplicate' and self.config.get('block_exact_duplicates', True): + include = True + elif duplicate['type'] == 'semantic_similarity': + similarity = duplicate.get('similarity', 0) + threshold = self.config.get('similarity_threshold', 0.8) + if similarity >= threshold and self.config.get('block_high_similarity', True): + include = True + elif duplicate['type'] == 'similar_naming' and self.config.get('block_naming_conflicts', False): + include = True + + if include: + filtered.append(duplicate) + + return filtered + + def _should_block(self, duplicates: List[Dict[str, Any]]) -> bool: + """Determine if operation should be blocked based on duplicates found.""" + # Always block exact duplicates in blocking mode + exact_duplicates = [d for d in duplicates if d['type'] == 'exact_structural_duplicate'] + if exact_duplicates: + return True + + # Block high similarity if configured + high_similarity = [d for d in duplicates if d['type'] == 'semantic_similarity' and d.get('similarity', 0) >= 0.9] + if high_similarity and 
self.config.get('block_high_similarity', True): + return True + + return False + + def generate_mode_aware_message(self, duplicates: List[Dict[str, Any]], file_path: str, mode: str) -> str: + """Generate appropriate message based on mode.""" + if mode == 'blocking': + return self.generate_warning_message(duplicates, file_path) + else: + return self.generate_passive_message(duplicates, file_path) + + def generate_passive_message(self, duplicates: List[Dict[str, Any]], file_path: str) -> str: + """Generate passive monitoring message.""" + if not duplicates: + return "" + + messages = ["ℹ️ Duplicate code detected (passive mode):"] + + # Group by type + exact_duplicates = [d for d in duplicates if d['type'] == 'exact_structural_duplicate'] + semantic_duplicates = [d for d in duplicates if d['type'] == 'semantic_similarity'] + naming_duplicates = [d for d in duplicates if d['type'] == 'similar_naming'] + + if exact_duplicates: + messages.append("\n🔍 Exact duplicates found:") + for dup in exact_duplicates[:2]: # Limit to top 2 + messages.append(f" • {dup['message']}") + messages.append(f" Similar to: {dup['existing_function']}") + + if semantic_duplicates: + messages.append("\n📊 Similar implementations found:") + for dup in semantic_duplicates[:2]: # Limit to top 2 + messages.append(f" • {dup['message']}") + messages.append(f" Similar to: {dup['existing_function']}") + + if naming_duplicates: + messages.append("\n📝 Similar names found:") + for dup in naming_duplicates[:1]: # Limit to top 1 + messages.append(f" • {dup['message']}") + + if self.config.get('show_suggestions', True): + messages.append("\n💡 Suggestions:") + messages.append(" • Review existing implementations before continuing") + messages.append(" • Consider consolidating similar functionality") + messages.append(" • Switch to blocking mode: /duplicate-mode blocking") + + # Join with real newlines; an escaped "\\n" would show up as literal text in the message + return "\n".join(messages) + + def get_mode_status(self) -> Dict[str, Any]: + """Get current mode and statistics for status 
display.""" + return { + 'mode': self.config['mode'], + 'active': self.config['mode'] == 'blocking', + 'stats': self.stats, + 'config': self.config, + 'last_detection': self.stats.get('last_detection'), + 'session_detections': self.stats.get('total_detections', 0) + } + + +def main(): + """Main hook entry point with dual-mode support.""" + try: + # Read hook input from stdin + input_data = json.load(sys.stdin) + except json.JSONDecodeError as e: + print(f"Error: Invalid JSON input: {e}", file=sys.stderr) + sys.exit(1) + except Exception as e: + print(f"Error reading input: {e}", file=sys.stderr) + sys.exit(1) + + # Extract relevant information + tool_name = input_data.get('tool_name', '') + tool_input = input_data.get('tool_input', {}) + + # Only process code editing tools + if tool_name not in ['Edit', 'Write', 'MultiEdit']: + sys.exit(0) + + # Get project directory + project_dir = os.environ.get('CLAUDE_PROJECT_DIR') + if not project_dir: + project_dir = os.getcwd() + + # Initialize enhanced detector + detector = EnhancedDuplicateDetector(project_dir) + + # Analyze the code change with mode awareness + analysis = detector.analyze_with_mode_awareness(tool_input) + + # If duplicates found, respond according to mode + if analysis.get('duplicates_found', False): + duplicates = analysis.get('duplicates', []) + file_path = analysis.get('file_path', '') + mode = analysis.get('mode', 'blocking') + should_block = analysis.get('block_operation', False) + + message = detector.generate_mode_aware_message(duplicates, file_path, mode) + + if should_block: + # Blocking mode - prevent operation + output = { + "decision": "block", + "reason": message + } + print(json.dumps(output)) + sys.exit(0) + else: + # Passive mode - just inform (allow operation to proceed) + # We could optionally output an info message here, but for now just log + pass + + # No duplicates found or passive mode - allow operation to proceed + sys.exit(0) + + +if __name__ == '__main__': + main() \ No newline 
at end of file diff --git a/scripts/duplicate_mode_toggle.py b/scripts/duplicate_mode_toggle.py new file mode 100644 index 0000000..339d800 --- /dev/null +++ b/scripts/duplicate_mode_toggle.py @@ -0,0 +1,329 @@ +#!/usr/bin/env python3 +""" +Mode toggle utility for duplicate detection system. +Allows switching between blocking and passive modes. +""" + +import json +import sys +import os +import argparse +import time +from pathlib import Path +from typing import Dict, Any + + +class DuplicateModeManager: + """Manages duplicate detection mode configuration.""" + + def __init__(self, project_root: str): + self.project_root = Path(project_root) + self.mode_config_path = self.project_root / '.claude' / 'duplicate_detection_mode.json' + self.stats_path = self.project_root / '.claude' / 'duplicate_stats.json' + self.settings_path = self.project_root / '.claude' / 'settings.json' + + def get_current_mode(self) -> Dict[str, Any]: + """Get current mode configuration.""" + if not self.mode_config_path.exists(): + return {"mode": "not_configured", "error": "Mode config not found"} + + try: + with open(self.mode_config_path, 'r') as f: + config = json.load(f) + return config + except Exception as e: + return {"mode": "error", "error": str(e)} + + def set_mode(self, mode: str, **options) -> Dict[str, Any]: + """Set duplicate detection mode.""" + if mode not in ['blocking', 'passive', 'inactive']: + return {"success": False, "error": f"Invalid mode: {mode}"} + + # Load existing config or create default + if self.mode_config_path.exists(): + try: + with open(self.mode_config_path, 'r') as f: + config = json.load(f) + except Exception: + config = self._get_default_config() + else: + config = self._get_default_config() + + # Update mode and options + old_mode = config.get('mode', 'unknown') + config['mode'] = mode + config['last_updated'] = time.time() + config['previous_mode'] = old_mode + + # Apply any additional options + for key, value in options.items(): + if key in 
['similarity_threshold', 'block_exact_duplicates', + 'block_high_similarity', 'block_naming_conflicts', + 'log_all_detections', 'show_suggestions']: + config[key] = value + + # Save configuration + os.makedirs(self.mode_config_path.parent, exist_ok=True) + try: + with open(self.mode_config_path, 'w') as f: + json.dump(config, f, indent=2) + except Exception as e: + return {"success": False, "error": f"Could not save config: {e}"} + + # Update hooks configuration if needed + if mode == 'inactive': + self._disable_hooks() + else: + self._enable_hooks() + + return { + "success": True, + "old_mode": old_mode, + "new_mode": mode, + "config": config + } + + def _get_default_config(self) -> Dict[str, Any]: + """Get default configuration.""" + return { + "mode": "blocking", + "similarity_threshold": 0.8, + "block_exact_duplicates": True, + "block_high_similarity": True, + "block_naming_conflicts": False, + "log_all_detections": True, + "show_suggestions": True, + "last_updated": time.time() + } + + def _enable_hooks(self): + """Enable duplicate detection hooks in settings.""" + self._update_hooks_config(enabled=True) + + def _disable_hooks(self): + """Disable duplicate detection hooks in settings.""" + self._update_hooks_config(enabled=False) + + def _update_hooks_config(self, enabled: bool): + """Update hooks configuration in .claude/settings.json.""" + if not self.settings_path.exists(): + # Create default settings if none exist + settings = {"hooks": {}} + else: + try: + with open(self.settings_path, 'r') as f: + settings = json.load(f) + except Exception: + settings = {"hooks": {}} + + if 'hooks' not in settings: + settings['hooks'] = {} + + if enabled: + # Enable PostToolUse hook + settings['hooks']['PostToolUse'] = [ + { + "matcher": "Edit|Write|MultiEdit", + "hooks": [ + { + "type": "command", + "command": "$CLAUDE_PROJECT_DIR/scripts/duplicate_detector_enhanced.py", + "timeout": 5000 + } + ] + } + ] + + # Ensure SessionStart hook is also enabled + if 'SessionStart' 
not in settings['hooks']: + settings['hooks']['SessionStart'] = [ + { + "hooks": [ + { + "type": "command", + "command": "$CLAUDE_PROJECT_DIR/scripts/load_architecture_context.py" + } + ] + } + ] + else: + # Remove PostToolUse hook for duplicate detection + if 'PostToolUse' in settings['hooks']: + # Remove or comment out duplicate detection hook + post_hooks = settings['hooks']['PostToolUse'] + settings['hooks']['PostToolUse'] = [ + hook for hook in post_hooks + if 'duplicate_detector' not in hook.get('hooks', [{}])[0].get('command', '') + ] + + # If no hooks left, remove the key + if not settings['hooks']['PostToolUse']: + del settings['hooks']['PostToolUse'] + + # Save updated settings + os.makedirs(self.settings_path.parent, exist_ok=True) + try: + with open(self.settings_path, 'w') as f: + json.dump(settings, f, indent=2) + except Exception as e: + print(f"Warning: Could not update hooks config: {e}", file=sys.stderr) + + def get_stats(self) -> Dict[str, Any]: + """Get detection statistics.""" + if not self.stats_path.exists(): + return {"stats_available": False} + + try: + with open(self.stats_path, 'r') as f: + stats = json.load(f) + return {"stats_available": True, "stats": stats} + except Exception as e: + return {"stats_available": False, "error": str(e)} + + def reset_stats(self) -> bool: + """Reset detection statistics.""" + try: + if self.stats_path.exists(): + self.stats_path.unlink() + return True + except Exception: + return False + + def show_status(self) -> str: + """Show current status in human-readable format.""" + config = self.get_current_mode() + stats_result = self.get_stats() + + status = [] + status.append("🔧 Duplicate Detection System Status") + status.append("=" * 40) + + # Mode information + mode = config.get('mode', 'unknown') + if mode == 'blocking': + status.append("🛡️ Mode: BLOCKING (Active - blocks duplicate code)") + elif mode == 'passive': + status.append("👁️ Mode: PASSIVE (Monitor - logs but allows duplicates)") + elif mode == 
'inactive': + status.append("⚪ Mode: INACTIVE (System disabled)") + else: + status.append(f"❓ Mode: {mode.upper()}") + + # Configuration details + if 'error' not in config: + status.append(f"\n📋 Configuration:") + status.append(f" • Similarity Threshold: {config.get('similarity_threshold', 0.8)*100:.0f}%") + status.append(f" • Block Exact Duplicates: {'✅' if config.get('block_exact_duplicates') else '❌'}") + status.append(f" • Block High Similarity: {'✅' if config.get('block_high_similarity') else '❌'}") + status.append(f" • Block Naming Conflicts: {'✅' if config.get('block_naming_conflicts') else '❌'}") + + last_updated = config.get('last_updated') + if last_updated: + import datetime + update_time = datetime.datetime.fromtimestamp(last_updated).strftime('%Y-%m-%d %H:%M:%S') + status.append(f" • Last Updated: {update_time}") + + # Statistics + if stats_result.get('stats_available'): + stats = stats_result['stats'] + status.append(f"\n📊 Detection Statistics:") + status.append(f" • Total Detections: {stats.get('total_detections', 0)}") + status.append(f" • Blocks Prevented: {stats.get('blocks_prevented', 0)}") + status.append(f" • Passive Warnings: {stats.get('passive_warnings_issued', 0)}") + status.append(f" • Exact Duplicates: {stats.get('exact_duplicates_found', 0)}") + status.append(f" • Semantic Similarities: {stats.get('semantic_similarities_found', 0)}") + + last_detection = stats.get('last_detection') + if last_detection: + import datetime + detection_time = datetime.datetime.fromtimestamp(last_detection).strftime('%Y-%m-%d %H:%M:%S') + status.append(f" • Last Detection: {detection_time}") + else: + status.append(f"\n📊 No statistics available") + + return "\n".join(status) + + +def main(): + """Main CLI interface for mode management.""" + parser = argparse.ArgumentParser(description='Manage duplicate detection modes') + parser.add_argument('--project-root', default='.', help='Project root directory') + + subparsers = parser.add_subparsers(dest='command', 
help='Available commands') + + # Status command + status_parser = subparsers.add_parser('status', help='Show current status') + + # Set mode command + set_parser = subparsers.add_parser('set', help='Set detection mode') + set_parser.add_argument('mode', choices=['blocking', 'passive', 'inactive'], + help='Detection mode') + set_parser.add_argument('--threshold', type=float, help='Similarity threshold (0.0-1.0)') + set_parser.add_argument('--block-exact', action='store_true', help='Block exact duplicates') + set_parser.add_argument('--no-block-exact', action='store_true', help='Don\'t block exact duplicates') + set_parser.add_argument('--block-similar', action='store_true', help='Block similar functions') + set_parser.add_argument('--no-block-similar', action='store_true', help='Don\'t block similar functions') + set_parser.add_argument('--block-naming', action='store_true', help='Block naming conflicts') + set_parser.add_argument('--no-block-naming', action='store_true', help='Don\'t block naming conflicts') + + # Reset stats command + reset_parser = subparsers.add_parser('reset-stats', help='Reset detection statistics') + + # Quick mode switches + blocking_parser = subparsers.add_parser('blocking', help='Switch to blocking mode') + passive_parser = subparsers.add_parser('passive', help='Switch to passive mode') + off_parser = subparsers.add_parser('off', help='Turn off detection') + + args = parser.parse_args() + + if not args.command: + parser.print_help() + return + + manager = DuplicateModeManager(args.project_root) + + if args.command == 'status': + print(manager.show_status()) + + elif args.command == 'set': + options = {} + if args.threshold is not None: + options['similarity_threshold'] = args.threshold + if args.block_exact: + options['block_exact_duplicates'] = True + elif args.no_block_exact: + options['block_exact_duplicates'] = False + if args.block_similar: + options['block_high_similarity'] = True + elif args.no_block_similar: + 
options['block_high_similarity'] = False + if args.block_naming: + options['block_naming_conflicts'] = True + elif args.no_block_naming: + options['block_naming_conflicts'] = False + + result = manager.set_mode(args.mode, **options) + if result['success']: + print(f"✅ Mode changed from '{result['old_mode']}' to '{result['new_mode']}'") + else: + print(f"❌ Error: {result['error']}") + + elif args.command in ['blocking', 'passive', 'off']: + mode_map = {'blocking': 'blocking', 'passive': 'passive', 'off': 'inactive'} + mode = mode_map[args.command] + + result = manager.set_mode(mode) + if result['success']: + print(f"✅ Switched to {mode} mode") + else: + print(f"❌ Error: {result['error']}") + + elif args.command == 'reset-stats': + if manager.reset_stats(): + print("✅ Statistics reset") + else: + print("❌ Failed to reset statistics") + + +if __name__ == '__main__': + main() \ No newline at end of file diff --git a/scripts/edited_i_flag_hook.py b/scripts/edited_i_flag_hook.py new file mode 100644 index 0000000..dd871e5 --- /dev/null +++ b/scripts/edited_i_flag_hook.py @@ -0,0 +1,114 @@ +# File was edited +# Old: def parse_index_flag(prompt): + """Parse -i, -ic, or -ie flag with optional size. 
+ Returns: (size_k, clipboard_mode, embedding_mode, cleaned_prompt) + """ + # Pattern matches -i[number], -ic[number], or -ie[number] + match = re.search(r'-i([ce]?)(\d+)?(?:\s|$)', prompt) + + if not match: + return None, None, None, prompt + + mode_char = match.group(1) + clipboard_mode = mode_char == 'c' + embedding_mode = mode_char == 'e' + + # If no explicit size provided, check for remembered size + if match.group(2): + size_k = int(match.group(2)) + else: + # For plain -i or -ie without size, try to use last remembered size + if not clipboard_mode: + size_k = get_last_interactive_size() + else: + # For -ic, always use default + size_k = DEFAULT_SIZE_K + + # Validate size limits + if size_k < MIN_SIZE_K: + print(f"⚠️ Minimum size is {MIN_SIZE_K}k, using {MIN_SIZE_K}k", file=sys.stderr) + size_k = MIN_SIZE_K + + if not clipboard_mode and size_k > CLAUDE_MAX_K: + print(f"⚠️ Claude max is {CLAUDE_MAX_K}k (need buffer for reasoning), using {CLAUDE_MAX_K}k", file=sys.stderr) + size_k = CLAUDE_MAX_K + elif clipboard_mode and size_k > EXTERNAL_MAX_K: + print(f"⚠️ Maximum size is {EXTERNAL_MAX_K}k, using {EXTERNAL_MAX_K}k", file=sys.stderr) + size_k = EXTERNAL_MAX_K + + # Clean prompt (remove flag) + cleaned_prompt = re.sub(r'-i[ce]?\d*\s*', '', prompt).strip() + + return size_k, clipboard_mode, embedding_mode, cleaned_prompt +# New: def parse_index_flag(prompt): + """Parse -i, -ic, or -ie flag with optional size and similarity options. 
+ Returns: (size_k, clipboard_mode, embedding_mode, similarity_options, cleaned_prompt) + """ + # Pattern matches -i[number], -ic[number], or -ie[number] + match = re.search(r'-i([ce]?)(\d+)?(?:\s|$)', prompt) + + if not match: + return None, None, None, None, prompt + + mode_char = match.group(1) + clipboard_mode = mode_char == 'c' + embedding_mode = mode_char == 'e' + + # Parse similarity-specific options (only valid with -ie) + similarity_options = {} + if embedding_mode: + # Parse --algorithm=algorithm_name + algo_match = re.search(r'--algorithm[=\s]([a-z-]+)', prompt) + if algo_match: + similarity_options['algorithm'] = algo_match.group(1) + prompt = re.sub(r'--algorithm[=\s][a-z-]+\s*', '', prompt).strip() + + # Parse -o output_file or --output=output_file + output_match = re.search(r'(?:-o|--output)[=\s](\S+)', prompt) + if output_match: + similarity_options['output'] = output_match.group(1) + prompt = re.sub(r'(?:-o|--output)[=\s]\S+\s*', '', prompt).strip() + + # Parse --build-cache + if '--build-cache' in prompt: + similarity_options['build_cache'] = True + prompt = re.sub(r'--build-cache\s*', '', prompt).strip() + + # Parse --duplicates + if '--duplicates' in prompt: + similarity_options['duplicates'] = True + prompt = re.sub(r'--duplicates\s*', '', prompt).strip() + + # Parse --algorithms=algo1,algo2,algo3 (for cache building) + algos_match = re.search(r'--algorithms[=\s]([a-z,-]+)', prompt) + if algos_match: + similarity_options['algorithms'] = algos_match.group(1).split(',') + prompt = re.sub(r'--algorithms[=\s][a-z,-]+\s*', '', prompt).strip() + + # If no explicit size provided, check for remembered size + if match.group(2): + size_k = int(match.group(2)) + else: + # For plain -i or -ie without size, try to use last remembered size + if not clipboard_mode: + size_k = get_last_interactive_size() + else: + # For -ic, always use default + size_k = DEFAULT_SIZE_K + + # Validate size limits + if size_k < MIN_SIZE_K: + print(f"⚠️ Minimum size is 
{MIN_SIZE_K}k, using {MIN_SIZE_K}k", file=sys.stderr) + size_k = MIN_SIZE_K + + if not clipboard_mode and size_k > CLAUDE_MAX_K: + print(f"⚠️ Claude max is {CLAUDE_MAX_K}k (need buffer for reasoning), using {CLAUDE_MAX_K}k", file=sys.stderr) + size_k = CLAUDE_MAX_K + elif clipboard_mode and size_k > EXTERNAL_MAX_K: + print(f"⚠️ Maximum size is {EXTERNAL_MAX_K}k, using {EXTERNAL_MAX_K}k", file=sys.stderr) + size_k = EXTERNAL_MAX_K + + # Clean prompt (remove all flags) + cleaned_prompt = re.sub(r'-i[ce]?\d*\s*', '', prompt).strip() + + return size_k, clipboard_mode, embedding_mode, similarity_options, cleaned_prompt \ No newline at end of file diff --git a/scripts/embedded_command_handler.py b/scripts/embedded_command_handler.py new file mode 100644 index 0000000..41f01a5 --- /dev/null +++ b/scripts/embedded_command_handler.py @@ -0,0 +1,139 @@ +#!/usr/bin/env python3 +""" +Command handler for /embedded-index slash command +Handles neural embedding commands without multi-line bash issues +""" + +import sys +import os +import subprocess +import time +import requests + +def run_command(cmd, shell=True): + """Run a command and return success status""" + try: + result = subprocess.run(cmd, shell=shell, capture_output=True, text=True) + if result.stdout: + print(result.stdout.strip()) + if result.stderr: + print(result.stderr.strip(), file=sys.stderr) + return result.returncode == 0 + except Exception as e: + print(f"Error running command: {e}", file=sys.stderr) + return False + +def check_ollama(): + """Check if Ollama is running""" + try: + response = requests.get("http://127.0.0.1:11434/api/version", timeout=2) + return response.status_code == 200 + except: + return False + +def setup_embeddings(): + """Set up neural embeddings""" + print("🔧 Setting up neural embeddings...") + + # Install Python dependencies + print("📦 Installing Python dependencies...") + if not run_command("pip install --user requests numpy scikit-learn"): + print("⚠️ Some dependencies may have 
failed to install") + + # Start Ollama if not running + if not check_ollama(): + print("🚀 Starting Ollama server...") + subprocess.Popen(["nix", "run", "nixpkgs#ollama", "--", "serve"], + stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) + + print("Waiting for Ollama to start...") + for i in range(10): + time.sleep(2) + if check_ollama(): + print("✅ Ollama started successfully") + break + print(".", end="", flush=True) + else: + print("\n❌ Ollama failed to start") + return False + else: + print("✅ Ollama already running") + + # Pull nomic-embed-text model + print("📥 Pulling nomic-embed-text model (this may take a few minutes)...") + print("This is a ~270MB download, please be patient...") + + try: + response = requests.post("http://127.0.0.1:11434/api/pull", + json={"name": "nomic-embed-text"}, + stream=True) + + for line in response.iter_lines(): + if line: + import json + try: + data = json.loads(line) + if data.get("status") == "success": + print("✅ Model downloaded successfully") + break + elif "pulling" in data.get("status", ""): + print(".", end="", flush=True) + except: + continue + print("") + except Exception as e: + print(f"❌ Failed to download model: {e}") + return False + + print("✅ Setup complete!") + print("💡 Test with: /embedded-index build") + return True + +def main(): + """Main command handler""" + args = " ".join(sys.argv[1:]) if len(sys.argv) > 1 else "" + + if args == "setup": + setup_embeddings() + + elif args == "build": + print("🏗️ Building neural embeddings index...") + if not check_ollama(): + print("❌ Ollama not running. 
Try: /embedded-index setup") + return + run_command("python3 ~/.claude/scripts/neural_embeddings.py --build") + + elif args.startswith("search "): + query = args[7:] # Remove "search " prefix + print(f"🔍 Neural search for: \"{query}\"") + run_command(f"python3 ~/.claude/scripts/neural_embeddings.py --search \"{query}\"") + + elif args.startswith("similar "): + function = args[8:] # Remove "similar " prefix + print(f"🎯 Finding functions similar to: {function}") + run_command(f"python3 ~/.claude/scripts/neural_embeddings.py --similar \"{function}\"") + + elif args == "analyze": + print("🔬 Neural semantic analysis...") + run_command("python3 ~/.claude/scripts/neural_embeddings.py --analyze") + + else: + print("🧠 NEURAL EMBEDDED INDEX") + print("") + print("Commands:") + print(" setup - Install Ollama + nomic-embed-text") + print(" build - Generate neural embeddings") + print(" search - Natural language code search") + print(" similar - Find semantically similar functions") + print(" analyze - Discover semantic clusters") + print("") + print("🚀 Capabilities:") + print(" • Natural language search: 'find error handling'") + print(" • Cross-language similarity detection") + print(" • Intent-based code clustering") + print(" • Semantic duplicate detection") + print("") + print("💡 First time? Run: /embedded-index setup") + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/scripts/enhanced_project_index.py b/scripts/enhanced_project_index.py new file mode 100644 index 0000000..2da773a --- /dev/null +++ b/scripts/enhanced_project_index.py @@ -0,0 +1,124 @@ +#!/usr/bin/env python3 +""" +Enhanced project indexer that automatically includes semantic analysis. +Streamlines the workflow by combining basic indexing with duplicate detection. 
+""" + +import sys +import os +import subprocess +from pathlib import Path + +def main(): + """Run enhanced indexing with automatic semantic analysis.""" + + # Get project root (default to current directory) + project_root = sys.argv[1] if len(sys.argv) > 1 else os.getcwd() + project_path = Path(project_root).resolve() + + print("🚀 Enhanced Project Indexing with Semantic Analysis") + print("=" * 55) + print(f"📁 Project: {project_path}") + + # Step 1: Run basic project indexing + print("\n📊 Step 1: Building basic project index...") + try: + scripts_dir = Path(__file__).parent + basic_indexer = scripts_dir / "project_index.py" + + result = subprocess.run([ + sys.executable, str(basic_indexer), str(project_path) + ], capture_output=True, text=True, check=True) + + print("✅ Basic index created") + + # Show summary from basic indexer + lines = result.stdout.strip().split('\n') + for line in lines[-5:]: # Show last 5 lines (summary) + if line.strip(): + print(f" {line}") + + except subprocess.CalledProcessError as e: + print(f"❌ Basic indexing failed: {e}") + print(f"Error output: {e.stderr}") + return 1 + except Exception as e: + print(f"❌ Unexpected error in basic indexing: {e}") + return 1 + + # Step 2: Run semantic analysis + print("\n🧠 Step 2: Adding semantic analysis...") + try: + semantic_analyzer = scripts_dir / "semantic_analyzer.py" + index_file = project_path / "PROJECT_INDEX.json" + + result = subprocess.run([ + sys.executable, str(semantic_analyzer), + str(project_path), str(index_file) + ], capture_output=True, text=True, check=True) + + print("✅ Semantic analysis complete") + + # Show semantic analysis summary + lines = result.stdout.strip().split('\n') + for line in lines: + if 'Analyzing' in line or 'Found' in line or 'complete' in line: + print(f" {line}") + + except subprocess.CalledProcessError as e: + print(f"❌ Semantic analysis failed: {e}") + print(f"Error output: {e.stderr}") + print("💡 Make sure scikit-learn is installed: pip install 
scikit-learn") + return 1 + except Exception as e: + print(f"❌ Unexpected error in semantic analysis: {e}") + return 1 + + # Step 3: Verify semantic data was saved + print("\n🔍 Step 3: Verifying semantic integration...") + try: + index_file = project_path / "PROJECT_INDEX.json" + if not index_file.exists(): + raise Exception("PROJECT_INDEX.json not found") + + # Check if semantic_index exists in the file + with open(index_file, 'r') as f: + content = f.read() + if '"semantic_index"' in content: + print("✅ Semantic data successfully integrated") + + # Count functions analyzed + if '"functions"' in content: + func_count = content.count('"ast_fingerprint"') + print(f" 📊 {func_count} functions analyzed with semantic data") + + # Check for similarity clusters + if '"similarity_clusters"' in content: + cluster_start = content.find('"similarity_clusters"') + if cluster_start != -1: + cluster_section = content[cluster_start:cluster_start+1000] + cluster_count = cluster_section.count('"functions"') + print(f" 🔗 {cluster_count} similarity clusters detected") + + # Check for complexity analysis + if '"complexity_analysis"' in content: + print(" ⚡ Code complexity analysis included") + + else: + raise Exception("Semantic data not found in index") + + except Exception as e: + print(f"❌ Semantic verification failed: {e}") + return 1 + + print("\n🎉 Enhanced indexing complete!") + print("\n📋 Next Steps:") + print(" • Run duplicate analysis: generate_duplicate_report.py") + print(" • Set up detection hooks: streamlined_setup.sh") + print(" • Interactive cleanup: interactive_cleanup.py") + print("\n💡 Your PROJECT_INDEX.json now includes semantic duplicate detection!") + + return 0 + +if __name__ == "__main__": + sys.exit(main()) \ No newline at end of file diff --git a/scripts/find_ollama.py b/scripts/find_ollama.py new file mode 100644 index 0000000..92d8d9c --- /dev/null +++ b/scripts/find_ollama.py @@ -0,0 +1,500 @@ +#!/usr/bin/env python3 +""" +Ollama finder and model manager for 
PROJECT_INDEX +Centralized Ollama detection, model management, and embedding generation + +Features: +- Robust Ollama service detection +- Model availability checking and auto-pulling +- Embedding generation with error handling +- Platform-specific installation guidance +- Status reporting and diagnostics + +Usage: python3 scripts/find_ollama.py [OPTIONS] +""" + +__version__ = "0.1.0" + +import json +import sys +import urllib.request +import urllib.error +import subprocess +import time +import platform +from typing import Dict, List, Optional, Tuple + + +class OllamaManager: + """Centralized Ollama service and model management.""" + + def __init__(self, endpoint: str = "http://localhost:11434", timeout: int = 10): + self.endpoint = endpoint.rstrip('/') + self.timeout = timeout + self.default_model = "nomic-embed-text" + + def check_ollama_running(self) -> Tuple[bool, str]: + """Check if Ollama service is running and accessible.""" + try: + url = f"{self.endpoint}/api/tags" + req = urllib.request.Request(url) + + with urllib.request.urlopen(req, timeout=self.timeout) as response: + if response.status == 200: + return True, "Ollama service is running" + else: + return False, f"Ollama responded with status {response.status}" + + except urllib.error.URLError as e: + if "Connection refused" in str(e): + return False, "Ollama service not running (connection refused)" + elif "timeout" in str(e).lower(): + return False, "Ollama service timeout (may be starting up)" + else: + return False, f"Ollama service error: {e}" + except Exception as e: + return False, f"Unexpected error checking Ollama: {e}" + + def get_available_models(self) -> Tuple[bool, List[str], str]: + """Get list of available models from Ollama.""" + try: + url = f"{self.endpoint}/api/tags" + req = urllib.request.Request(url) + + with urllib.request.urlopen(req, timeout=self.timeout) as response: + if response.status == 200: + data = json.loads(response.read().decode('utf-8')) + models = [model['name'] for 
model in data.get('models', [])] + return True, models, "Successfully retrieved model list" + else: + return False, [], f"HTTP {response.status}" + + except Exception as e: + return False, [], str(e) + + def is_model_available(self, model_name: str) -> Tuple[bool, str]: + """Check if a specific model is available locally.""" + success, models, error = self.get_available_models() + if not success: + return False, f"Could not check models: {error}" + + # Check for exact match or partial match (models often have tags) + for model in models: + if model == model_name or model.startswith(f"{model_name}:"): + return True, f"Model '{model_name}' is available as '{model}'" + + return False, f"Model '{model_name}' not found (available: {', '.join(models[:3])}{'...' if len(models) > 3 else ''})" + + def pull_model(self, model_name: str, show_progress: bool = True) -> Tuple[bool, str]: + """Pull a model from Ollama registry.""" + try: + url = f"{self.endpoint}/api/pull" + data = json.dumps({"name": model_name}).encode('utf-8') + req = urllib.request.Request(url, data=data, headers={'Content-Type': 'application/json'}) + + if show_progress: + print(f"🔄 Pulling model '{model_name}'... 
(this may take a few minutes)", file=sys.stderr) + + with urllib.request.urlopen(req, timeout=300) as response: # 5 minute timeout for model pulling + if response.status == 200: + # Read the streaming response + while True: + line = response.readline() + if not line: + break + + try: + chunk = json.loads(line.decode('utf-8')) + if show_progress and 'status' in chunk: + status = chunk['status'] + if 'completed' in chunk and 'total' in chunk: + completed = chunk['completed'] + total = chunk['total'] + percent = (completed / total * 100) if total > 0 else 0 + print(f"\r {status}: {percent:.1f}%", end='', file=sys.stderr) + elif status != "pulling manifest": # Avoid too much noise + print(f"\r {status}...", end='', file=sys.stderr) + except json.JSONDecodeError: + continue + + if show_progress: + print(f"\n✅ Model '{model_name}' pulled successfully", file=sys.stderr) + + return True, f"Model '{model_name}' pulled successfully" + else: + return False, f"HTTP {response.status}" + + except urllib.error.HTTPError as e: + if e.code == 404: + return False, f"Model '{model_name}' not found in registry" + else: + return False, f"HTTP error {e.code}: {e.reason}" + except Exception as e: + return False, f"Error pulling model: {e}" + + def ensure_model_available(self, model_name: str = None) -> Tuple[bool, str]: + """Ensure a model is available, pulling it if necessary.""" + if model_name is None: + model_name = self.default_model + + # First check if Ollama is running + running, error = self.check_ollama_running() + if not running: + return False, error + + # Check if model is already available + available, status = self.is_model_available(model_name) + if available: + return True, status + + # Try to pull the model + print(f"🔍 Model '{model_name}' not found locally, attempting to pull...", file=sys.stderr) + success, error = self.pull_model(model_name) + if success: + return True, f"Model '{model_name}' is now available" + else: + return False, f"Failed to pull model 
'{model_name}': {error}" + + def generate_embedding(self, text: str, model_name: str = None) -> Tuple[bool, Optional[List[float]], str]: + """Generate embedding for text using specified model.""" + if model_name is None: + model_name = self.default_model + + # Ensure model is available + available, error = self.ensure_model_available(model_name) + if not available: + return False, None, error + + try: + url = f"{self.endpoint}/api/embeddings" + data = json.dumps({ + "model": model_name, + "prompt": text + }).encode('utf-8') + + req = urllib.request.Request(url, data=data, headers={'Content-Type': 'application/json'}) + + with urllib.request.urlopen(req, timeout=30) as response: + if response.status == 200: + result = json.loads(response.read().decode('utf-8')) + embedding = result.get('embedding') + if embedding: + return True, embedding, f"Generated {len(embedding)}-dimensional embedding" + else: + return False, None, "No embedding in response" + else: + return False, None, f"HTTP {response.status}" + + except Exception as e: + return False, None, f"Error generating embedding: {e}" + + def test_embedding_generation(self) -> Tuple[bool, str]: + """Test embedding generation with a simple example.""" + test_text = "def test_function(): return 'hello world'" + success, embedding, error = self.generate_embedding(test_text) + + if success and embedding: + return True, f"✅ Embedding test passed (generated {len(embedding)}-dimensional vector)" + else: + return False, f"❌ Embedding test failed: {error}" + + def get_status(self) -> Dict: + """Get comprehensive status report.""" + status = { + "ollama_running": False, + "ollama_error": None, + "models_available": [], + "models_error": None, + "default_model_available": False, + "default_model_error": None, + "embedding_test": False, + "embedding_error": None, + "endpoint": self.endpoint, + "default_model": self.default_model + } + + # Check if Ollama is running + running, error = self.check_ollama_running() + 
status["ollama_running"] = running + if not running: + status["ollama_error"] = error + return status # No point checking further if Ollama isn't running + + # Get available models + success, models, error = self.get_available_models() + if success: + status["models_available"] = models + else: + status["models_error"] = error + + # Check default model + available, error = self.is_model_available(self.default_model) + status["default_model_available"] = available + if not available: + status["default_model_error"] = error + + # Test embedding generation + if available: # Only test if default model is available + test_success, test_error = self.test_embedding_generation() + status["embedding_test"] = test_success + if not test_success: + status["embedding_error"] = test_error + + return status + + +def show_install_guide(): + """Show platform-specific Ollama installation instructions.""" + system = platform.system().lower() + + print("📦 Ollama Installation Guide", file=sys.stderr) + print("=" * 50, file=sys.stderr) + print("", file=sys.stderr) + + if system == "darwin": # macOS + print("🍎 macOS Installation:", file=sys.stderr) + print(" • Download from: https://ollama.ai/download", file=sys.stderr) + print(" • Homebrew: brew install ollama", file=sys.stderr) + print(" • MacPorts: sudo port install ollama", file=sys.stderr) + + elif system == "linux": + print("🐧 Linux Installation:", file=sys.stderr) + print(" • Curl installer: curl -fsSL https://ollama.ai/install.sh | sh", file=sys.stderr) + print(" • Manual download: https://ollama.ai/download", file=sys.stderr) + print(" • Debian/Ubuntu: Download .deb from releases", file=sys.stderr) + print(" • Fedora/RHEL: Download .rpm from releases", file=sys.stderr) + + elif system == "windows": + print("🪟 Windows Installation:", file=sys.stderr) + print(" • Download from: https://ollama.ai/download", file=sys.stderr) + print(" • Windows installer (.exe) available", file=sys.stderr) + + else: + print(f"❓ {system.capitalize()} 
Installation:", file=sys.stderr) + print(" • Check: https://ollama.ai/download", file=sys.stderr) + + print("", file=sys.stderr) + print("🚀 After Installation:", file=sys.stderr) + print(" 1. Start Ollama: ollama serve", file=sys.stderr) + print(" 2. Test installation: ollama list", file=sys.stderr) + print(" 3. Pull embedding model: ollama pull nomic-embed-text", file=sys.stderr) + print("", file=sys.stderr) + print("💡 For automatic startup:", file=sys.stderr) + print(" • macOS: Ollama app starts automatically", file=sys.stderr) + print(" • Linux: Set up systemd service", file=sys.stderr) + print(" • Windows: Set up Windows service", file=sys.stderr) + print("", file=sys.stderr) + + +def print_status(status: Dict): + """Print formatted status report.""" + print("🔍 Ollama Status Report", file=sys.stderr) + print("=" * 50, file=sys.stderr) + print("", file=sys.stderr) + + # Ollama service status + if status["ollama_running"]: + print("✅ Ollama service: Running", file=sys.stderr) + print(f" Endpoint: {status['endpoint']}", file=sys.stderr) + else: + print("❌ Ollama service: Not running", file=sys.stderr) + print(f" Error: {status['ollama_error']}", file=sys.stderr) + print("", file=sys.stderr) + print("💡 To start Ollama:", file=sys.stderr) + print(" ollama serve", file=sys.stderr) + return + + # Models status + if status.get("models_available"): + print(f"📦 Available models: {len(status['models_available'])}", file=sys.stderr) + for model in status["models_available"][:5]: # Show first 5 + print(f" • {model}", file=sys.stderr) + if len(status["models_available"]) > 5: + print(f" ... 
and {len(status['models_available']) - 5} more", file=sys.stderr) + else: + print("❌ Models: Could not retrieve list", file=sys.stderr) + if status.get("models_error"): + print(f" Error: {status['models_error']}", file=sys.stderr) + + print("", file=sys.stderr) + + # Default model status + if status["default_model_available"]: + print(f"✅ Default model '{status['default_model']}': Available", file=sys.stderr) + else: + print(f"❌ Default model '{status['default_model']}': Not available", file=sys.stderr) + if status.get("default_model_error"): + print(f" Error: {status['default_model_error']}", file=sys.stderr) + print(" 💡 To install:", file=sys.stderr) + print(f" ollama pull {status['default_model']}", file=sys.stderr) + return + + # Embedding test status + if status["embedding_test"]: + print("✅ Embedding generation: Working", file=sys.stderr) + else: + print("❌ Embedding generation: Failed", file=sys.stderr) + if status.get("embedding_error"): + print(f" Error: {status['embedding_error']}", file=sys.stderr) + + print("", file=sys.stderr) + print("🎉 Ollama is ready for embedding generation!", file=sys.stderr) + + +def main(): + """Command-line interface for Ollama management.""" + import argparse + + parser = argparse.ArgumentParser( + description='Ollama finder and model manager for PROJECT_INDEX', + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=''' +Examples: + %(prog)s --check # Quick availability check + %(prog)s --status # Detailed status report + %(prog)s --ensure-model nomic-embed-text # Ensure model is available + %(prog)s --test-embedding # Test embedding generation + %(prog)s --install-guide # Show installation instructions + %(prog)s --pull-model mxbai-embed-large # Pull a specific model + +Return codes: + 0: Success (Ollama ready for embeddings) + 1: Ollama not running + 2: Model not available + 3: Embedding generation failed + ''' + ) + + parser.add_argument('--version', action='version', version=f'find_ollama v{__version__}') + + 
parser.add_argument('--check', action='store_true', + help='Quick check if Ollama is running') + + parser.add_argument('--status', action='store_true', + help='Detailed status report') + + parser.add_argument('--ensure-model', type=str, metavar='MODEL', + help='Ensure a model is available (pull if needed)') + + parser.add_argument('--pull-model', type=str, metavar='MODEL', + help='Pull a specific model') + + parser.add_argument('--test-embedding', action='store_true', + help='Test embedding generation') + + parser.add_argument('--install-guide', action='store_true', + help='Show installation instructions') + + parser.add_argument('--endpoint', default='http://localhost:11434', + help='Ollama API endpoint (default: http://localhost:11434)') + + parser.add_argument('--timeout', type=int, default=10, + help='Request timeout in seconds (default: 10)') + + parser.add_argument('--model', default='nomic-embed-text', + help='Default embedding model (default: nomic-embed-text)') + + parser.add_argument('--quiet', '-q', action='store_true', + help='Minimal output (for scripting)') + + args = parser.parse_args() + + # Show installation guide + if args.install_guide: + show_install_guide() + return 0 + + # Initialize manager + manager = OllamaManager(args.endpoint, args.timeout) + manager.default_model = args.model + + # Quick check + if args.check: + running, error = manager.check_ollama_running() + if args.quiet: + sys.exit(0 if running else 1) + + if running: + print("✅ Ollama is running", file=sys.stderr) + return 0 + else: + print(f"❌ Ollama not running: {error}", file=sys.stderr) + return 1 + + # Pull specific model + if args.pull_model: + success, error = manager.pull_model(args.pull_model, not args.quiet) + if success: + if not args.quiet: + print(f"✅ Model '{args.pull_model}' pulled successfully", file=sys.stderr) + return 0 + else: + if not args.quiet: + print(f"❌ Failed to pull '{args.pull_model}': {error}", file=sys.stderr) + return 2 + + # Ensure model is 
available + if args.ensure_model: + success, error = manager.ensure_model_available(args.ensure_model) + if args.quiet: + sys.exit(0 if success else 2) + + if success: + print(f"✅ Model '{args.ensure_model}' is available", file=sys.stderr) + return 0 + else: + print(f"❌ Model '{args.ensure_model}' not available: {error}", file=sys.stderr) + return 2 + + # Test embedding generation + if args.test_embedding: + success, error = manager.test_embedding_generation() + if args.quiet: + sys.exit(0 if success else 3) + + if success: + print(error, file=sys.stderr) # Success message is in 'error' field + return 0 + else: + print(error, file=sys.stderr) + return 3 + + # Default: show status + if args.status or not any([args.check, args.pull_model, args.ensure_model, args.test_embedding]): + status = manager.get_status() + + if args.quiet: + # For quiet mode, exit with appropriate code + if not status["ollama_running"]: + sys.exit(1) + elif not status["default_model_available"]: + sys.exit(2) + elif not status["embedding_test"]: + sys.exit(3) + else: + sys.exit(0) + + print_status(status) + + # Return appropriate exit code + if not status["ollama_running"]: + return 1 + elif not status["default_model_available"]: + return 2 + elif not status["embedding_test"]: + return 3 + else: + return 0 + + return 0 + + +if __name__ == '__main__': + try: + sys.exit(main()) + except KeyboardInterrupt: + print("\n❌ Interrupted by user", file=sys.stderr) + sys.exit(130) + except Exception as e: + print(f"❌ Unexpected error: {e}", file=sys.stderr) + sys.exit(1) \ No newline at end of file diff --git a/scripts/generate_duplicate_report.py b/scripts/generate_duplicate_report.py new file mode 100644 index 0000000..3f84ab8 --- /dev/null +++ b/scripts/generate_duplicate_report.py @@ -0,0 +1,471 @@ +#!/usr/bin/env python3 +""" +Generate comprehensive duplicate code analysis report for existing codebases. +Identifies and prioritizes duplicate code for cleanup efforts. 
+""" + +import json +import os +import sys +from pathlib import Path +from typing import Dict, List, Any, Tuple +from collections import defaultdict + +# Import utilities +try: + from index_utils import ( + find_similar_functions, + create_ast_fingerprint, + compute_code_similarity, + extract_python_signatures, + extract_javascript_signatures, + extract_shell_signatures, + PARSEABLE_LANGUAGES + ) + from semantic_analyzer import SemanticAnalyzer +except ImportError: + sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) + from index_utils import ( + find_similar_functions, + create_ast_fingerprint, + compute_code_similarity, + extract_python_signatures, + extract_javascript_signatures, + extract_shell_signatures, + PARSEABLE_LANGUAGES + ) + from semantic_analyzer import SemanticAnalyzer + + +class DuplicateReportGenerator: + """Generate comprehensive reports on duplicate code in existing projects.""" + + def __init__(self, project_root: str): + self.project_root = Path(project_root) + self.index_path = self.project_root / 'PROJECT_INDEX.json' + self.index_data = None + self.load_index() + + def load_index(self): + """Load the project index with semantic data.""" + if not self.index_path.exists(): + print("❌ PROJECT_INDEX.json not found. Run semantic analysis first.") + sys.exit(1) + + with open(self.index_path, 'r') as f: + self.index_data = json.load(f) + + if 'semantic_index' not in self.index_data: + print("❌ Semantic index not found. 
Run semantic_analyzer.py first.") + sys.exit(1) + + def analyze_duplicates(self) -> Dict[str, Any]: + """Perform comprehensive duplicate analysis.""" + semantic_index = self.index_data['semantic_index'] + functions = semantic_index.get('functions', {}) + + if not functions: + return {"error": "No functions found in semantic index"} + + # Analyze different types of duplicates + exact_duplicates = self._find_exact_duplicates(functions) + similarity_clusters = self._find_similarity_clusters(functions) + naming_similarities = self._find_naming_similarities(functions) + + # Calculate cleanup priorities + cleanup_priorities = self._calculate_cleanup_priorities( + exact_duplicates, similarity_clusters, functions + ) + + # Generate recommendations + recommendations = self._generate_recommendations( + exact_duplicates, similarity_clusters, cleanup_priorities + ) + + return { + "analysis_timestamp": self.index_data.get('indexed_at', ''), + "total_functions_analyzed": len(functions), + "exact_duplicates": exact_duplicates, + "similarity_clusters": similarity_clusters, + "naming_similarities": naming_similarities, + "cleanup_priorities": cleanup_priorities, + "recommendations": recommendations, + "summary": self._generate_summary(exact_duplicates, similarity_clusters) + } + + def _find_exact_duplicates(self, functions: Dict[str, Any]) -> List[Dict[str, Any]]: + """Find functions with identical AST fingerprints.""" + fingerprint_groups = defaultdict(list) + + for func_id, func_data in functions.items(): + fingerprint = func_data.get('ast_fingerprint') + if fingerprint: + fingerprint_groups[fingerprint].append({ + 'function_id': func_id, + 'signature': func_data.get('signature', ''), + 'complexity': func_data.get('complexity', {}), + 'file_path': func_id.split(':')[0] if ':' in func_id else 'unknown' + }) + + # Only return groups with multiple functions (duplicates) + exact_duplicates = [] + for fingerprint, group in fingerprint_groups.items(): + if len(group) > 1: + 
exact_duplicates.append({ + 'type': 'exact_structural_duplicate', + 'fingerprint': fingerprint, + 'count': len(group), + 'functions': group, + 'impact_score': self._calculate_impact_score(group) + }) + + # Sort by impact score (highest first) + exact_duplicates.sort(key=lambda x: x['impact_score'], reverse=True) + return exact_duplicates + + def _find_similarity_clusters(self, functions: Dict[str, Any]) -> List[Dict[str, Any]]: + """Find clusters of similar functions using TF-IDF similarity.""" + clusters = [] + processed_functions = set() + + for func_id, func_data in functions.items(): + if func_id in processed_functions: + continue + + # Find similar functions for this one + tfidf_vector = func_data.get('tfidf_vector', []) + if not tfidf_vector: + continue + + similar_cluster = [func_id] + processed_functions.add(func_id) + + # Compare with all other functions + for other_func_id, other_func_data in functions.items(): + if other_func_id in processed_functions: + continue + + other_vector = other_func_data.get('tfidf_vector', []) + if not other_vector: + continue + + similarity = compute_code_similarity(tfidf_vector, other_vector) + if similarity >= 0.7: # 70% similarity threshold for clusters + similar_cluster.append(other_func_id) + processed_functions.add(other_func_id) + + # Only create cluster if we found similar functions + if len(similar_cluster) > 1: + cluster_functions = [] + for cluster_func_id in similar_cluster: + cluster_func_data = functions[cluster_func_id] + cluster_functions.append({ + 'function_id': cluster_func_id, + 'signature': cluster_func_data.get('signature', ''), + 'complexity': cluster_func_data.get('complexity', {}), + 'file_path': cluster_func_id.split(':')[0] if ':' in cluster_func_id else 'unknown' + }) + + clusters.append({ + 'type': 'similarity_cluster', + 'count': len(similar_cluster), + 'functions': cluster_functions, + 'average_similarity': self._calculate_average_similarity(similar_cluster, functions), + 'impact_score': 
self._calculate_impact_score(cluster_functions) + }) + + # Sort by impact score + clusters.sort(key=lambda x: x['impact_score'], reverse=True) + return clusters + + def _find_naming_similarities(self, functions: Dict[str, Any]) -> List[Dict[str, Any]]: + """Find functions with very similar names.""" + naming_groups = [] + function_names = [(func_id, func_id.split(':')[-1]) for func_id in functions.keys()] + + processed = set() + for i, (func_id1, name1) in enumerate(function_names): + if func_id1 in processed: + continue + + similar_names = [func_id1] + for j, (func_id2, name2) in enumerate(function_names): + if i != j and func_id2 not in processed: + similarity = self._string_similarity(name1.lower(), name2.lower()) + if 0.8 <= similarity < 1.0: # Very similar but not identical + similar_names.append(func_id2) + + if len(similar_names) > 1: + for func_id in similar_names: + processed.add(func_id) + + naming_groups.append({ + 'type': 'similar_naming', + 'count': len(similar_names), + 'functions': [{'function_id': fid, 'name': fid.split(':')[-1]} for fid in similar_names] + }) + + return naming_groups + + def _calculate_impact_score(self, function_group: List[Dict[str, Any]]) -> float: + """Calculate impact score for a group of duplicate functions.""" + # Factors: number of duplicates, complexity, file spread + count_score = len(function_group) * 10 # More duplicates = higher impact + + # Complexity score (higher complexity = higher impact) + complexity_scores = [] + for func in function_group: + complexity = func.get('complexity', {}) + cyclomatic = complexity.get('cyclomatic', 1) + complexity_scores.append(cyclomatic) + + avg_complexity = sum(complexity_scores) / len(complexity_scores) if complexity_scores else 1 + complexity_score = avg_complexity * 5 + + # File spread score (duplicates across files = higher impact) + unique_files = len(set(func.get('file_path', '') for func in function_group)) + spread_score = unique_files * 3 + + return count_score + 
complexity_score + spread_score + + def _calculate_average_similarity(self, func_ids: List[str], functions: Dict[str, Any]) -> float: + """Calculate average similarity within a cluster.""" + similarities = [] + vectors = [] + + for func_id in func_ids: + vector = functions[func_id].get('tfidf_vector', []) + if vector: + vectors.append(vector) + + if len(vectors) < 2: + return 0.0 + + # Calculate pairwise similarities + for i in range(len(vectors)): + for j in range(i + 1, len(vectors)): + similarity = compute_code_similarity(vectors[i], vectors[j]) + similarities.append(similarity) + + return sum(similarities) / len(similarities) if similarities else 0.0 + + def _string_similarity(self, s1: str, s2: str) -> float: + """Calculate string similarity using character bigrams.""" + if not s1 or not s2: + return 0.0 + + bigrams1 = set(s1[i:i+2] for i in range(len(s1)-1)) + bigrams2 = set(s2[i:i+2] for i in range(len(s2)-1)) + + if not bigrams1 and not bigrams2: + return 1.0 + if not bigrams1 or not bigrams2: + return 0.0 + + intersection = len(bigrams1.intersection(bigrams2)) + union = len(bigrams1.union(bigrams2)) + + return intersection / union if union > 0 else 0.0 + + def _calculate_cleanup_priorities(self, exact_duplicates: List[Dict], + similarity_clusters: List[Dict], + functions: Dict[str, Any]) -> List[Dict[str, Any]]: + """Calculate cleanup priorities based on impact and effort.""" + priorities = [] + + # Add exact duplicates (easy wins) + for duplicate in exact_duplicates: + priorities.append({ + 'type': 'exact_duplicate', + 'priority': 'HIGH', + 'effort': 'LOW', + 'impact_score': duplicate['impact_score'], + 'count': duplicate['count'], + 'description': f"Extract {duplicate['count']} identical functions into shared utility", + 'functions': duplicate['functions'] + }) + + # Add similarity clusters (medium effort) + for cluster in similarity_clusters: + if cluster['average_similarity'] > 0.85: # High similarity + effort = 'MEDIUM' + priority = 'HIGH' + elif 
cluster['average_similarity'] > 0.75: + effort = 'MEDIUM' + priority = 'MEDIUM' + else: + effort = 'HIGH' + priority = 'LOW' + + priorities.append({ + 'type': 'similarity_cluster', + 'priority': priority, + 'effort': effort, + 'impact_score': cluster['impact_score'], + 'count': cluster['count'], + 'average_similarity': cluster['average_similarity'], + 'description': f"Refactor {cluster['count']} similar functions ({cluster['average_similarity']*100:.0f}% similar)", + 'functions': cluster['functions'] + }) + + # Sort by priority (HIGH > MEDIUM > LOW) then by impact score + priority_order = {'HIGH': 3, 'MEDIUM': 2, 'LOW': 1} + priorities.sort(key=lambda x: (priority_order[x['priority']], x['impact_score']), reverse=True) + + return priorities + + def _generate_recommendations(self, exact_duplicates: List[Dict], + similarity_clusters: List[Dict], + priorities: List[Dict]) -> Dict[str, List[str]]: + """Generate specific recommendations for cleanup.""" + recommendations = { + 'immediate_actions': [], + 'medium_term': [], + 'long_term': [], + 'tools_needed': [] + } + + # Immediate actions (exact duplicates with high impact) + high_impact_exact = [p for p in priorities if p['type'] == 'exact_duplicate' and p['impact_score'] > 30] + if high_impact_exact: + recommendations['immediate_actions'].extend([ + f"Extract {len(high_impact_exact)} high-impact duplicate function groups into shared utilities", + "Focus on duplicates that span multiple files first", + "Create utility modules for the most frequently duplicated patterns" + ]) + + # Medium term (similarity clusters) + high_similarity = [p for p in priorities if p['type'] == 'similarity_cluster' and p['average_similarity'] > 0.8] + if high_similarity: + recommendations['medium_term'].extend([ + f"Refactor {len(high_similarity)} high-similarity function clusters", + "Design configurable implementations for similar functions", + "Consider design patterns (Strategy, Template Method) for similar logic" + ]) + + # Long term 
(lower similarity, architectural changes) + recommendations['long_term'].extend([ + "Review overall architecture for duplication patterns", + "Establish coding standards to prevent future duplicates", + "Consider domain-driven design for complex business logic" + ]) + + # Tools needed + recommendations['tools_needed'].extend([ + "duplicate-eliminator sub-agent for automated extraction", + "utility-extractor for creating shared modules", + "refactoring-advisor for complex similarity clusters" + ]) + + return recommendations + + def _generate_summary(self, exact_duplicates: List[Dict], similarity_clusters: List[Dict]) -> Dict[str, Any]: + """Generate executive summary of duplicate analysis.""" + total_exact_functions = sum(d['count'] for d in exact_duplicates) + total_similar_functions = sum(c['count'] for c in similarity_clusters) + + # Calculate potential savings + estimated_lines_saved = total_exact_functions * 15 # Assume 15 lines per function average + + return { + 'total_duplicate_groups': len(exact_duplicates), + 'total_duplicate_functions': total_exact_functions, + 'total_similarity_clusters': len(similarity_clusters), + 'total_similar_functions': total_similar_functions, + 'estimated_lines_saved': estimated_lines_saved, + 'estimated_maintenance_reduction': f"{((total_exact_functions + total_similar_functions) / 2):.0f} functions", + 'top_priority_count': len([d for d in exact_duplicates if d['impact_score'] > 30]), + 'complexity_note': 'Focus on high-complexity duplicates for maximum impact' + } + + def generate_report(self, output_format: str = 'json') -> str: + """Generate the complete duplicate analysis report.""" + analysis = self.analyze_duplicates() + + if output_format == 'json': + return json.dumps(analysis, indent=2) + elif output_format == 'markdown': + return self._format_markdown_report(analysis) + else: + return str(analysis) + + def _format_markdown_report(self, analysis: Dict[str, Any]) -> str: + """Format analysis as readable markdown 
report.""" + report = ["# Duplicate Code Analysis Report\n"] + + # Summary + summary = analysis['summary'] + report.append("## Executive Summary\n") + report.append(f"- **Total Duplicate Groups:** {summary['total_duplicate_groups']}") + report.append(f"- **Functions to Deduplicate:** {summary['total_duplicate_functions']}") + report.append(f"- **Similarity Clusters:** {summary['total_similarity_clusters']}") + report.append(f"- **Estimated Lines Saved:** {summary['estimated_lines_saved']}") + report.append(f"- **High Priority Items:** {summary['top_priority_count']}\n") + + # Exact Duplicates + if analysis['exact_duplicates']: + report.append("## 🚨 Exact Duplicates (Immediate Action Required)\n") + for i, dup in enumerate(analysis['exact_duplicates'][:5]): # Top 5 + report.append(f"### Duplicate Group {i+1}") + report.append(f"- **Count:** {dup['count']} identical functions") + report.append(f"- **Impact Score:** {dup['impact_score']:.1f}") + report.append("- **Functions:**") + for func in dup['functions']: + report.append(f" - `{func['function_id']}`") + report.append("") + + # Cleanup Priorities + if analysis['cleanup_priorities']: + report.append("## 📋 Cleanup Priorities\n") + high_priority = [p for p in analysis['cleanup_priorities'] if p['priority'] == 'HIGH'][:5] + for i, priority in enumerate(high_priority): + report.append(f"### Priority {i+1}: {priority['description']}") + report.append(f"- **Effort:** {priority['effort']}") + report.append(f"- **Impact:** {priority['impact_score']:.1f}") + report.append("") + + # Recommendations + recommendations = analysis['recommendations'] + report.append("## 💡 Recommendations\n") + report.append("### Immediate Actions") + for action in recommendations['immediate_actions']: + report.append(f"- {action}") + + report.append("\n### Medium Term") + for action in recommendations['medium_term']: + report.append(f"- {action}") + + report.append("\n### Tools Needed") + for tool in recommendations['tools_needed']: + 
report.append(f"- {tool}") + + return "\n".join(report) + + +def main(): + """Main entry point for duplicate report generation.""" + import argparse + + parser = argparse.ArgumentParser(description='Generate duplicate code analysis report') + parser.add_argument('--project-root', default='.', help='Project root directory') + parser.add_argument('--format', choices=['json', 'markdown'], default='markdown', + help='Output format') + parser.add_argument('--output', help='Output file (default: stdout)') + + args = parser.parse_args() + + # Generate report + generator = DuplicateReportGenerator(args.project_root) + report = generator.generate_report(args.format) + + # Output report + if args.output: + with open(args.output, 'w', encoding='utf-8') as f: + f.write(report) + print(f"Report saved to {args.output}") + else: + print(report) + + +if __name__ == '__main__': + main() \ No newline at end of file diff --git a/scripts/i_flag_hook.py b/scripts/i_flag_hook.py index 1228733..dd871e5 100755 --- a/scripts/i_flag_hook.py +++ b/scripts/i_flag_hook.py @@ -1,83 +1,23 @@ -#!/usr/bin/env python3 -""" -UserPromptSubmit hook for intelligent PROJECT_INDEX.json analysis. -Detects -i[number] and -ic[number] flags for dynamic index generation. 
-""" - -import json -import sys -import os -import re -import subprocess -import hashlib -import time -from pathlib import Path -from datetime import datetime - -# Constants -DEFAULT_SIZE_K = 50 # Default 50k tokens -MIN_SIZE_K = 1 # Minimum 1k tokens -CLAUDE_MAX_K = 100 # Max 100k for Claude (leaves room for reasoning) -EXTERNAL_MAX_K = 800 # Max 800k for external AI - -def find_project_root(): - """Find project root by looking for .git or common project markers.""" - current = Path.cwd() - - # First check current directory for project markers - if (current / '.git').exists(): - return current - - # Check for other project markers - project_markers = ['package.json', 'pyproject.toml', 'setup.py', 'Cargo.toml', 'go.mod'] - for marker in project_markers: - if (current / marker).exists(): - return current - - # Search up the tree for .git - for parent in current.parents: - if (parent / '.git').exists(): - return parent - - # Default to current directory - return current - -def get_last_interactive_size(): - """Get the last remembered -i size from the index.""" - try: - project_root = find_project_root() - index_path = project_root / 'PROJECT_INDEX.json' - - if index_path.exists(): - with open(index_path, 'r') as f: - index = json.load(f) - meta = index.get('_meta', {}) - last_size = meta.get('last_interactive_size_k') - - if last_size: - print(f"📝 Using remembered size: {last_size}k", file=sys.stderr) - return last_size - except: - pass - - # Fall back to default - return DEFAULT_SIZE_K - -def parse_index_flag(prompt): - """Parse -i or -ic flag with optional size.""" - # Pattern matches -i[number] or -ic[number] - match = re.search(r'-i(c?)(\d+)?(?:\s|$)', prompt) +# File was edited +# Old: def parse_index_flag(prompt): + """Parse -i, -ic, or -ie flag with optional size. 
+ Returns: (size_k, clipboard_mode, embedding_mode, cleaned_prompt) + """ + # Pattern matches -i[number], -ic[number], or -ie[number] + match = re.search(r'-i([ce]?)(\d+)?(?:\s|$)', prompt) if not match: - return None, None, prompt + return None, None, None, prompt - clipboard_mode = match.group(1) == 'c' + mode_char = match.group(1) + clipboard_mode = mode_char == 'c' + embedding_mode = mode_char == 'e' # If no explicit size provided, check for remembered size if match.group(2): size_k = int(match.group(2)) else: - # For -i without size, try to use last remembered size + # For plain -i or -ie without size, try to use last remembered size if not clipboard_mode: size_k = get_last_interactive_size() else: @@ -97,683 +37,78 @@ def parse_index_flag(prompt): size_k = EXTERNAL_MAX_K # Clean prompt (remove flag) - cleaned_prompt = re.sub(r'-ic?\d*\s*', '', prompt).strip() + cleaned_prompt = re.sub(r'-i[ce]?\d*\s*', '', prompt).strip() - return size_k, clipboard_mode, cleaned_prompt - -def calculate_files_hash(project_root): - """Calculate hash of non-ignored files to detect changes.""" - try: - # Use git ls-files to get non-ignored files - result = subprocess.run( - ['git', 'ls-files', '--cached', '--others', '--exclude-standard'], - cwd=str(project_root), - capture_output=True, - text=True, - timeout=5 - ) - - if result.returncode == 0: - files = result.stdout.strip().split('\n') if result.stdout.strip() else [] - else: - # Fallback to manual file discovery - files = [] - for file_path in project_root.rglob('*'): - if file_path.is_file() and not any(part.startswith('.') for part in file_path.parts): - files.append(str(file_path.relative_to(project_root))) - - # Hash file paths and modification times - hasher = hashlib.sha256() - for file_path in sorted(files): - full_path = project_root / file_path - if full_path.exists(): - try: - mtime = str(full_path.stat().st_mtime) - hasher.update(f"{file_path}:{mtime}".encode()) - except: - pass - - return hasher.hexdigest()[:16] - 
except Exception as e: - print(f"Warning: Could not calculate files hash: {e}", file=sys.stderr) - return "unknown" - -def should_regenerate_index(project_root, index_path, requested_size_k): - """Determine if index needs regeneration.""" - if not index_path.exists(): - return True, "No index exists" + return size_k, clipboard_mode, embedding_mode, cleaned_prompt +# New: def parse_index_flag(prompt): + """Parse -i, -ic, or -ie flag with optional size and similarity options. + Returns: (size_k, clipboard_mode, embedding_mode, similarity_options, cleaned_prompt) + """ + # Pattern matches -i[number], -ic[number], or -ie[number] + match = re.search(r'-i([ce]?)(\d+)?(?:\s|$)', prompt) - try: - # Read metadata - with open(index_path, 'r') as f: - index = json.load(f) - meta = index.get('_meta', {}) - - # Get last generation info - last_target = meta.get('target_size_k', 0) - last_files_hash = meta.get('files_hash', '') - - # Check if files changed - current_files_hash = calculate_files_hash(project_root) - if current_files_hash != last_files_hash and current_files_hash != "unknown": - return True, f"Files changed since last index" - - # Check if different size requested - if abs(requested_size_k - last_target) > 2: # Allow 2k tolerance - return True, f"Different size requested ({requested_size_k}k vs {last_target}k)" - - # Use existing index - actual_k = meta.get('actual_size_k', last_target) - return False, f"Using cached index ({actual_k}k actual, {last_target}k target)" + if not match: + return None, None, None, None, prompt + + mode_char = match.group(1) + clipboard_mode = mode_char == 'c' + embedding_mode = mode_char == 'e' + + # Parse similarity-specific options (only valid with -ie) + similarity_options = {} + if embedding_mode: + # Parse --algorithm=algorithm_name + algo_match = re.search(r'--algorithm[=\s]([a-z-]+)', prompt) + if algo_match: + similarity_options['algorithm'] = algo_match.group(1) + prompt = re.sub(r'--algorithm[=\s][a-z-]+\s*', '', 
prompt).strip() + + # Parse -o output_file or --output=output_file + output_match = re.search(r'(?:-o|--output)[=\s](\S+)', prompt) + if output_match: + similarity_options['output'] = output_match.group(1) + prompt = re.sub(r'(?:-o|--output)[=\s]\S+\s*', '', prompt).strip() + + # Parse --build-cache + if '--build-cache' in prompt: + similarity_options['build_cache'] = True + prompt = re.sub(r'--build-cache\s*', '', prompt).strip() + + # Parse --duplicates + if '--duplicates' in prompt: + similarity_options['duplicates'] = True + prompt = re.sub(r'--duplicates\s*', '', prompt).strip() + + # Parse --algorithms=algo1,algo2,algo3 (for cache building) + algos_match = re.search(r'--algorithms[=\s]([a-z,-]+)', prompt) + if algos_match: + similarity_options['algorithms'] = algos_match.group(1).split(',') + prompt = re.sub(r'--algorithms[=\s][a-z,-]+\s*', '', prompt).strip() - except Exception as e: - print(f"Warning: Could not read index metadata: {e}", file=sys.stderr) - return True, "Could not read index metadata" - -def generate_index_at_size(project_root, target_size_k, is_clipboard_mode=False): - """Generate index at specific token size.""" - print(f"🎯 Generating {target_size_k}k token index...", file=sys.stderr) + # If no explicit size provided, check for remembered size + if match.group(2): + size_k = int(match.group(2)) + else: + # For plain -i or -ie without size, try to use last remembered size + if not clipboard_mode: + size_k = get_last_interactive_size() + else: + # For -ic, always use default + size_k = DEFAULT_SIZE_K - # Find indexer script - local_indexer = Path(__file__).parent / 'project_index.py' - system_indexer = Path.home() / '.claude-code-project-index' / 'scripts' / 'project_index.py' + # Validate size limits + if size_k < MIN_SIZE_K: + print(f"⚠️ Minimum size is {MIN_SIZE_K}k, using {MIN_SIZE_K}k", file=sys.stderr) + size_k = MIN_SIZE_K - indexer_path = local_indexer if local_indexer.exists() else system_indexer + if not clipboard_mode and size_k > 
CLAUDE_MAX_K: + print(f"⚠️ Claude max is {CLAUDE_MAX_K}k (need buffer for reasoning), using {CLAUDE_MAX_K}k", file=sys.stderr) + size_k = CLAUDE_MAX_K + elif clipboard_mode and size_k > EXTERNAL_MAX_K: + print(f"⚠️ Maximum size is {EXTERNAL_MAX_K}k, using {EXTERNAL_MAX_K}k", file=sys.stderr) + size_k = EXTERNAL_MAX_K - if not indexer_path.exists(): - print("⚠️ PROJECT_INDEX.json indexer not found", file=sys.stderr) - return False + # Clean prompt (remove all flags) + cleaned_prompt = re.sub(r'-i[ce]?\d*\s*', '', prompt).strip() - try: - # Find Python command - python_cmd_file = Path.home() / '.claude-code-project-index' / '.python_cmd' - if python_cmd_file.exists(): - python_cmd = python_cmd_file.read_text().strip() - else: - python_cmd = sys.executable - - # Pass target size as environment variable - env = os.environ.copy() - env['INDEX_TARGET_SIZE_K'] = str(target_size_k) - - result = subprocess.run( - [python_cmd, str(indexer_path)], - cwd=str(project_root), - capture_output=True, - text=True, - timeout=30, # 30 seconds should be plenty for most projects - env=env - ) - - if result.returncode == 0: - # Update metadata with target size and hash - index_path = project_root / 'PROJECT_INDEX.json' - if index_path.exists(): - with open(index_path, 'r') as f: - index = json.load(f) - - # Measure actual size - index_str = json.dumps(index, indent=2) - actual_tokens = len(index_str) // 4 # Rough estimate: 4 chars = 1 token - actual_size_k = actual_tokens // 1000 - - # Add/update metadata - if '_meta' not in index: - index['_meta'] = {} - - metadata_update = { - 'generated_at': time.time(), - 'target_size_k': target_size_k, - 'actual_size_k': actual_size_k, - 'files_hash': calculate_files_hash(project_root), - 'compression_ratio': f"{(actual_size_k/target_size_k)*100:.1f}%" if target_size_k > 0 else "N/A" - } - - # Remember -i size for next time (but not -ic) - if not is_clipboard_mode: - metadata_update['last_interactive_size_k'] = target_size_k - print(f"💾 Remembering 
size {target_size_k}k for next -i", file=sys.stderr) - - index['_meta'].update(metadata_update) - - # Save updated index - with open(index_path, 'w') as f: - json.dump(index, f, indent=2) - - print(f"✅ Created PROJECT_INDEX.json ({actual_size_k}k actual, {target_size_k}k target)", file=sys.stderr) - return True - else: - print("⚠️ Index file not created", file=sys.stderr) - return False - else: - print(f"⚠️ Failed to create index: {result.stderr}", file=sys.stderr) - return False - - except subprocess.TimeoutExpired: - print("⚠️ Index creation timed out", file=sys.stderr) - return False - except Exception as e: - print(f"⚠️ Error creating index: {e}", file=sys.stderr) - return False - -def copy_to_clipboard(prompt, index_path): - """Copy prompt, instructions, and index to clipboard for external AI.""" - try: - # Try VM Bridge first (works with any size over mosh) - vm_bridge_available = False - bridge_client = None - - try: - import sys - # Try multiple VM Bridge locations in order of preference - vm_bridge_paths = [ - os.path.expanduser('~/.claude-ericbuess/tools/vm-bridge'), # New standard location - '/home/ericbuess/Projects/vm-bridge', # Legacy project location - os.path.expanduser('~/.local/lib/python/vm_bridge') # Old tunnel location - ] - - # Try network version first (no tunnel needed) - for bridge_path in vm_bridge_paths: - if os.path.exists(bridge_path): - sys.path.insert(0, bridge_path) - try: - from vm_client_network import VMBridgeClient as NetworkClient - - # Try to auto-detect or use known Mac IP - for mac_ip in ['10.211.55.2', '10.211.55.1', '192.168.1.1']: - try: - test_client = NetworkClient(host=mac_ip) - if test_client.is_daemon_running(): - bridge_client = test_client - print(f"🌉 VM Bridge network daemon detected at {mac_ip}", file=sys.stderr) - vm_bridge_available = True - break - except: - continue - - if vm_bridge_available: - break - except ImportError: - # Try next path - continue - - # Fall back to localhost tunnel version if network not 
available - if not vm_bridge_available: - for bridge_path in vm_bridge_paths: - if os.path.exists(bridge_path): - sys.path.insert(0, bridge_path) - try: - from vm_client import VMBridgeClient - bridge_client = VMBridgeClient() - if bridge_client.is_daemon_running(): - print("🌉 VM Bridge tunnel daemon detected", file=sys.stderr) - vm_bridge_available = True - break - except ImportError: - continue - - except ImportError: - vm_bridge_available = False - # Create clipboard-specific instructions (no tools, no subagent references) - clipboard_instructions = """You are analyzing a codebase index to help identify relevant files and code sections. - -## YOUR TASK -Analyze the PROJECT_INDEX.json below to identify the most relevant code sections for the user's request. -The index contains file structures, function signatures, call graphs, and dependencies. - -## WHAT TO LOOK FOR -- Identify specific files and functions related to the request -- Trace call graphs to understand code flow -- Note dependencies and relationships -- Consider architectural patterns - -## IMPORTANT: RESPONSE FORMAT -Your response will be copied and pasted to Claude Code. Format your response as: - -### 📍 RELEVANT CODE LOCATIONS - -**Primary Files to Examine:** -- `path/to/file.py` - [Why relevant] - - `function_name()` (line X) - [What it does] - - Called by: [list any callers] - - Calls: [list what it calls] - -**Related Files:** -- `path/to/related.py` - [Connection to task] - -### 🔍 KEY INSIGHTS -- [Architectural patterns observed] -- [Dependencies to consider] -- [Potential challenges or gotchas] - -### 💡 RECOMMENDATIONS -- Start by examining: [specific file] -- Focus on: [specific functions/classes] -- Consider: [any special considerations] - -Do NOT include the original user prompt in your response. 
-Focus on providing actionable file locations and insights.""" - - # Load index - with open(index_path, 'r') as f: - index = json.load(f) - - # Build clipboard content - clipboard_content = f"""# Codebase Analysis Request - -## Task for You -{prompt} - -## Instructions -{clipboard_instructions} - -## PROJECT_INDEX.json -{json.dumps(index, indent=2)} -""" - - # Try to copy to clipboard - clipboard_success = False - - # Try VM Bridge first if available (works with any size over mosh) - if vm_bridge_available: - try: - if bridge_client.copy_to_clipboard(clipboard_content): - print(f"✅ Copied to Mac clipboard via VM Bridge ({len(clipboard_content)} chars)", file=sys.stderr) - print(f"🌉 No size limits with VM Bridge!", file=sys.stderr) - - # Also notify on Mac - bridge_client.notify(f"Clipboard updated: {len(clipboard_content)} chars from VM") - - # Save to file as backup - fallback_path = Path.cwd() / '.clipboard_content.txt' - with open(fallback_path, 'w') as f: - f.write(clipboard_content) - print(f"📁 Also saved to {fallback_path} as backup", file=sys.stderr) - - return ('vm_bridge', len(clipboard_content)) - except Exception as e: - print(f"⚠️ VM Bridge failed: {e}", file=sys.stderr) - # Fall through to other methods - - # Check if we're in an SSH session (clipboard won't work across SSH) - is_ssh = os.environ.get('SSH_CONNECTION') or os.environ.get('SSH_CLIENT') - - # For SSH sessions, try OSC 52 or other methods - if is_ssh: - fallback_path = Path.cwd() / '.clipboard_content.txt' - with open(fallback_path, 'w') as f: - f.write(clipboard_content) - - # Import base64 at the beginning for all methods - import base64 - - # Try multiple clipboard methods for SSH sessions - clipboard_success = False - - # Check content size first - OSC 52 has limits, especially over mosh - content_size = len(clipboard_content) - # Testing shows mosh/tmux cuts off at ~12KB, so stay safely under that - mosh_limit = 11000 # Just under the 12KB cutoff we observed - - if content_size <= 
mosh_limit: - # Small enough for OSC 52 - try to send directly to clipboard - try: - # Base64 encode and remove newlines - b64_content = base64.b64encode(clipboard_content.encode('utf-8')).decode('ascii') - - # Get the correct TTY device - tty_device = None - is_tmux = os.environ.get('TMUX') - - if is_tmux: - # Inside tmux: get the client tty - try: - result = subprocess.run(['tmux', 'display-message', '-p', '#{client_tty}'], - capture_output=True, text=True, check=True) - tty_device = result.stdout.strip() - except: - tty_device = "/dev/tty" - else: - tty_device = "/dev/tty" - - # Send OSC 52 sequence with proper format - if is_tmux: - # Inside tmux: use DCS passthrough (this is the KEY!) - osc52_sequence = f"\033Ptmux;\033\033]52;c;{b64_content}\007\033\\" - else: - # Outside tmux: use standard OSC 52 - osc52_sequence = f"\033]52;c;{b64_content}\007" - - # Write directly to TTY device (not stderr) - try: - with open(tty_device, 'w') as tty: - tty.write(osc52_sequence) - tty.flush() - clipboard_success = True - print(f"✅ Sent to Mac clipboard via OSC 52 ({content_size} chars)", file=sys.stderr) - except PermissionError: - # Fallback to stderr if can't open TTY - sys.stderr.write(osc52_sequence) - sys.stderr.flush() - clipboard_success = True - print(f"✅ Sent to Mac clipboard via OSC 52 ({content_size} chars)", file=sys.stderr) - - except Exception as e: - print(f"⚠️ OSC 52 failed: {e}", file=sys.stderr) - else: - # Too large for mosh/tmux's ~12KB limit - use alternative methods - # Testing shows clipboard gets truncated at ~12KB over mosh - print(f"📋 Content exceeds mosh/tmux's 12KB limit ({content_size} chars)", file=sys.stderr) - - # Load into tmux buffer for local access - try: - proc = subprocess.Popen(['tmux', 'load-buffer', '-'], stdin=subprocess.PIPE, - stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) - proc.communicate(clipboard_content.encode('utf-8')) - if proc.returncode == 0: - print(f"✅ Loaded into tmux buffer", file=sys.stderr) - - # Try to 
trigger automatic Mac clipboard sync - # This runs a command on the tmux client (Mac) side - sync_cmd = f"ssh {os.environ.get('USER', 'user')}@10.211.55.4 'cat ~/Projects/claude-code-project-index/.clipboard_content.txt' | pbcopy" - tmux_run = f"tmux run-shell '{sync_cmd}'" - - try: - subprocess.run(['tmux', 'run-shell', sync_cmd], - capture_output=True, timeout=2) - print(f"🚀 Attempting automatic clipboard sync to Mac...", file=sys.stderr) - except: - pass - except: - pass - - print(f"", file=sys.stderr) - print(f"To manually copy to Mac clipboard, run this on your Mac:", file=sys.stderr) - print(f" ssh {os.environ.get('USER', 'user')}@10.211.55.4 'cat ~/Projects/claude-code-project-index/.clipboard_content.txt' | pbcopy", file=sys.stderr) - print(f"", file=sys.stderr) - print(f"ℹ️ Mosh/tmux limits clipboard to ~12KB. For larger content, consider:", file=sys.stderr) - print(f" - Using SSH instead of mosh for this operation", file=sys.stderr) - print(f" - Or using the manual command above", file=sys.stderr) - - # Also try tmux buffer for local pasting - try: - proc = subprocess.Popen(['tmux', 'load-buffer', '-'], stdin=subprocess.PIPE, - stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) - proc.communicate(clipboard_content.encode('utf-8')) - if proc.returncode == 0: - print(f"✅ Loaded into tmux buffer (use prefix + ] to paste)", file=sys.stderr) - except: - pass - - print(f"📁 Full content saved to {fallback_path}", file=sys.stderr) - - if clipboard_success: - return ('ssh_clipboard', str(fallback_path)) - else: - return ('ssh_file_large', str(fallback_path)) - - # First try xclip directly (most reliable for Linux) - try: - result = subprocess.run(['which', 'xclip'], capture_output=True) - if result.returncode == 0: - # Use xclip with a virtual display if needed - env = os.environ.copy() - if not env.get('DISPLAY'): - # Check if Xvfb is running on :99 - xvfb_check = subprocess.run(['pgrep', '-f', 'Xvfb.*:99'], capture_output=True) - if xvfb_check.returncode != 
0: - # Start Xvfb if not running - subprocess.Popen(['Xvfb', ':99', '-screen', '0', '1024x768x24'], - stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) - time.sleep(0.5) - env['DISPLAY'] = ':99' - - # Copy to clipboard using xclip - proc = subprocess.Popen(['xclip', '-selection', 'clipboard'], - stdin=subprocess.PIPE, env=env, - stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) - proc.communicate(clipboard_content.encode('utf-8')) - if proc.returncode == 0: - clipboard_success = True - print(f"✅ Copied to clipboard via xclip: {len(clipboard_content)} chars", file=sys.stderr) - print(f"📋 Ready to paste into Gemini, Claude.ai, ChatGPT, or other AI", file=sys.stderr) - return ('clipboard', len(clipboard_content)) - except: - pass - - # Fallback to pyperclip if xclip didn't work - if not clipboard_success: - try: - import pyperclip - pyperclip.copy(clipboard_content) - print(f"✅ Copied to clipboard via pyperclip: {len(clipboard_content)} chars", file=sys.stderr) - print(f"📋 Ready to paste into Gemini, Claude.ai, ChatGPT, or other AI", file=sys.stderr) - return ('clipboard', len(clipboard_content)) - except (ImportError, Exception) as e: - pass - - # Final fallback to file if clipboard methods failed - if not clipboard_success: - fallback_path = Path.cwd() / '.clipboard_content.txt' - with open(fallback_path, 'w') as f: - f.write(clipboard_content) - print(f"✅ Saved to {fallback_path} (copy manually)", file=sys.stderr) - return ('file', str(fallback_path)) - except Exception as e: - print(f"⚠️ Error preparing clipboard content: {e}", file=sys.stderr) - return ('error', str(e)) - -def main(): - """Process UserPromptSubmit hook for -i and -ic flag detection.""" - try: - # Read hook input - input_data = json.load(sys.stdin) - prompt = input_data.get('prompt', '') - - # Parse flag - size_k, clipboard_mode, cleaned_prompt = parse_index_flag(prompt) - - if size_k is None: - # No index flag, let prompt proceed normally - sys.exit(0) - - # Find project root - 
project_root = find_project_root() - index_path = project_root / 'PROJECT_INDEX.json' - - # Check if regeneration needed - should_regen, reason = should_regenerate_index(project_root, index_path, size_k) - - if should_regen: - print(f"🔄 Regenerating index: {reason}", file=sys.stderr) - if not generate_index_at_size(project_root, size_k, clipboard_mode): - print("⚠️ Proceeding without PROJECT_INDEX.json", file=sys.stderr) - sys.exit(0) - else: - print(f"✅ {reason}", file=sys.stderr) - - # Handle clipboard mode - if clipboard_mode: - copy_result = copy_to_clipboard(cleaned_prompt, index_path) - if copy_result[0] == 'vm_bridge': - # Successfully copied via VM Bridge - output = { - "hookSpecificOutput": { - "hookEventName": "UserPromptSubmit", - "additionalContext": f""" -🌉 Clipboard Mode - VM Bridge Success! - -✅ Index copied to Mac clipboard via VM Bridge ({copy_result[1]} chars). -🚀 No size limits with this method! -📁 Also saved to: .clipboard_content.txt - -Paste directly into external AI (Gemini, Claude.ai, ChatGPT) for analysis. - -**CRITICAL INSTRUCTION FOR CLAUDE**: STOP! Do NOT proceed with the original request. The user wants to use an external AI for analysis. You should: -1. ONLY acknowledge that the content was copied to clipboard -2. WAIT for the user to paste the external AI's response -3. DO NOT attempt to answer or work on: "{cleaned_prompt}" - -Simply respond with something like: "✅ Index copied to clipboard for external AI analysis. Please paste the response here when ready." - -User's request (DO NOT ANSWER): {cleaned_prompt} -""" - } - } - elif copy_result[0] == 'clipboard': - # Successfully copied to clipboard - output = { - "hookSpecificOutput": { - "hookEventName": "UserPromptSubmit", - "additionalContext": f""" -📋 Clipboard Mode Activated - -Index and instructions copied to clipboard ({size_k}k tokens, {copy_result[1]} chars). -Paste into external AI (Gemini, Claude.ai, ChatGPT) for analysis. - -**CRITICAL INSTRUCTION FOR CLAUDE**: STOP! 
Do NOT proceed with the original request. The user wants to use an external AI for analysis. You should: -1. ONLY acknowledge that the content was copied to clipboard -2. WAIT for the user to paste the external AI's response -3. DO NOT attempt to answer or work on: "{cleaned_prompt}" - -Simply respond with something like: "✅ Index copied to clipboard for external AI analysis. Please paste the response here when ready." - -User's request (DO NOT ANSWER): {cleaned_prompt} -""" - } - } - elif copy_result[0] == 'ssh_clipboard': - # SSH session with successful clipboard copy - output = { - "hookSpecificOutput": { - "hookEventName": "UserPromptSubmit", - "additionalContext": f""" -📋 Clipboard Mode - Mac Clipboard Success! - -✅ Index copied to your Mac's clipboard via pbcopy ({size_k}k tokens). -📁 Also saved to: {copy_result[1]} - -Paste directly into external AI (Gemini, Claude.ai, ChatGPT) for analysis. - -**CRITICAL INSTRUCTION FOR CLAUDE**: STOP! Do NOT proceed with the original request. The user wants to use an external AI for analysis. You should: -1. ONLY acknowledge that the content was copied to clipboard -2. WAIT for the user to paste the external AI's response -3. DO NOT attempt to answer or work on: "{cleaned_prompt}" - -Simply respond with something like: "✅ Index copied to clipboard for external AI analysis. Please paste the response here when ready." - -User's request (DO NOT ANSWER): {cleaned_prompt} -""" - } - } - elif copy_result[0] == 'ssh_file_large': - # SSH session with large content - manual copy needed - output = { - "hookSpecificOutput": { - "hookEventName": "UserPromptSubmit", - "additionalContext": f""" -📋 Clipboard Mode - Content Too Large for Auto-Copy - -Index saved to: {copy_result[1]} ({size_k}k tokens). -⚠️ Content exceeds mosh/OSC 52 limit (7.5KB) for automatic clipboard. 
- -To copy the full index to your Mac clipboard, run this command on your Mac: -ssh {os.environ.get('USER', 'user')}@10.211.55.4 'cat ~/Projects/claude-code-project-index/.clipboard_content.txt' | pbcopy - -Then paste into external AI (Gemini, Claude.ai, ChatGPT) for analysis. - -**CRITICAL INSTRUCTION FOR CLAUDE**: STOP! Do NOT proceed with the original request. The user wants to use an external AI for analysis. You should: -1. ONLY acknowledge that the content was copied to clipboard -2. WAIT for the user to paste the external AI's response -3. DO NOT attempt to answer or work on: "{cleaned_prompt}" - -Simply respond with something like: "✅ Index copied to clipboard for external AI analysis. Please paste the response here when ready." - -User's request (DO NOT ANSWER): {cleaned_prompt} -""" - } - } - elif copy_result[0] == 'file': - # Saved to file fallback - output = { - "hookSpecificOutput": { - "hookEventName": "UserPromptSubmit", - "additionalContext": f""" -📁 Clipboard Mode (File Fallback) - -Index and instructions saved to: {copy_result[1]} ({size_k}k tokens). -⚠️ pyperclip not installed - content saved to file instead. - -To copy: cat {copy_result[1]} | pbcopy # macOS - cat {copy_result[1]} | xclip # Linux - -Then paste into external AI (Gemini, Claude.ai, ChatGPT) for analysis. - -**CRITICAL INSTRUCTION FOR CLAUDE**: STOP! Do NOT proceed with the original request. The user wants to use an external AI for analysis. You should: -1. ONLY acknowledge that the content was copied to clipboard -2. WAIT for the user to paste the external AI's response -3. DO NOT attempt to answer or work on: "{cleaned_prompt}" - -Simply respond with something like: "✅ Index copied to clipboard for external AI analysis. Please paste the response here when ready." 
- -User's request (DO NOT ANSWER): {cleaned_prompt} -""" - } - } - else: - # Error case - output = { - "hookSpecificOutput": { - "hookEventName": "UserPromptSubmit", - "additionalContext": f""" -❌ Clipboard Mode Failed - -Error: {copy_result[1]} - -Please check the error and try again. -User's request (DO NOT ANSWER): {cleaned_prompt} -""" - } - } - else: - # Standard mode - prepare for subagent - output = { - "hookSpecificOutput": { - "hookEventName": "UserPromptSubmit", - "additionalContext": f""" -## 🎯 Index-Aware Mode Activated - -Generated/loaded {size_k}k token index. - -**IMPORTANT**: You MUST use the index-analyzer subagent to analyze the codebase structure before proceeding with the request. - -Use it like this: -"I'll analyze the codebase structure to understand the relevant code sections for your request." - -Then explicitly invoke: "Using the index-analyzer subagent to analyze PROJECT_INDEX.json..." - -The subagent will provide deep code intelligence including: -- Essential code paths and dependencies -- Call graphs and impact analysis -- Architectural insights and patterns -- Strategic recommendations - -Original request (without -i flag): {cleaned_prompt} - -PROJECT_INDEX.json location: {index_path} -""" - } - } - - print(json.dumps(output)) - sys.exit(0) - - except json.JSONDecodeError as e: - print(f"Error: Invalid JSON input: {e}", file=sys.stderr) - sys.exit(1) - except Exception as e: - print(f"Hook error: {e}", file=sys.stderr) - sys.exit(1) - -if __name__ == '__main__': - main() \ No newline at end of file + return size_k, clipboard_mode, embedding_mode, similarity_options, cleaned_prompt \ No newline at end of file diff --git a/scripts/index_utils.py b/scripts/index_utils.py old mode 100755 new mode 100644 index a106600..50df87f --- a/scripts/index_utils.py +++ b/scripts/index_utils.py @@ -1,1368 +1,392 @@ #!/usr/bin/env python3 """ -Shared utilities for project indexing. 
-Contains common functionality used by both project_index.py and hook scripts. +Utility functions for project indexing - reconstructed from PROJECT_INDEX.json signatures """ import re import fnmatch from pathlib import Path -from typing import Dict, List, Optional, Set, Tuple +from typing import Dict, List, Set, Optional, Tuple, Any +import subprocess -# What to ignore (sensible defaults) -IGNORE_DIRS = { - '.git', 'node_modules', '__pycache__', '.venv', 'venv', 'env', - 'build', 'dist', '.next', 'target', '.pytest_cache', 'coverage', - '.idea', '.vscode', '__pycache__', '.DS_Store', 'eggs', '.eggs', - '.claude' # Exclude Claude configuration directory +# Constants +CODE_EXTENSIONS = { + '.py', '.js', '.ts', '.tsx', '.jsx', '.java', '.cpp', '.c', '.h', '.hpp', + '.cs', '.go', '.rs', '.php', '.rb', '.swift', '.kt', '.scala', '.clj', '.sh' } -# Languages we can fully parse (extract functions/classes) +MARKDOWN_EXTENSIONS = {'.md', '.markdown', '.rst', '.txt'} + +# Parseable languages mapping PARSEABLE_LANGUAGES = { '.py': 'python', '.js': 'javascript', '.ts': 'typescript', - '.jsx': 'javascript', '.tsx': 'typescript', - '.sh': 'shell', - '.bash': 'shell' + '.jsx': 'javascript', + '.java': 'java', + '.cpp': 'cpp', + '.c': 'c', + '.h': 'c', + '.hpp': 'cpp', + '.cs': 'csharp', + '.go': 'go', + '.rs': 'rust', + '.php': 'php', + '.rb': 'ruby', + '.swift': 'swift', + '.kt': 'kotlin', + '.scala': 'scala', + '.clj': 'clojure', + '.sh': 'shell' } -# All code file extensions we recognize -CODE_EXTENSIONS = { - # Currently parsed - '.py', '.js', '.ts', '.jsx', '.tsx', - # Common languages (listed but not parsed yet) - '.go', '.rs', '.java', '.c', '.cpp', '.cc', '.cxx', - '.h', '.hpp', '.rb', '.php', '.swift', '.kt', '.scala', - '.cs', '.sh', '.bash', '.sql', '.r', '.R', '.lua', '.m', - '.ex', '.exs', '.jl', '.dart', '.vue', '.svelte', - # Configuration and data files - '.json', '.html', '.css' +IGNORE_DIRS = { + '.git', '.svn', '.hg', 'node_modules', '__pycache__', 
'.pytest_cache', + 'venv', '.venv', 'env', '.env', 'build', 'dist', '.idea', '.vscode' } -# Markdown files to analyze -MARKDOWN_EXTENSIONS = {'.md', '.markdown', '.rst'} - -# Common directory purposes +# Directory purpose mapping DIRECTORY_PURPOSES = { - 'auth': 'Authentication and authorization logic', - 'models': 'Data models and database schemas', - 'views': 'UI views and templates', - 'controllers': 'Request handlers and business logic', - 'services': 'Business logic and external service integrations', - 'utils': 'Shared utility functions and helpers', - 'helpers': 'Helper functions and utilities', - 'tests': 'Test files and test utilities', - 'test': 'Test files and test utilities', - 'spec': 'Test specifications', - 'docs': 'Project documentation', - 'api': 'API endpoints and route handlers', - 'components': 'Reusable UI components', - 'lib': 'Library code and shared modules', - 'src': 'Source code root directory', - 'static': 'Static assets (images, CSS, etc.)', - 'public': 'Publicly accessible files', - 'config': 'Configuration files and settings', + 'tests': 'Test directory', + 'test': 'Test directory', + 'docs': 'Documentation', + 'documentation': 'Documentation', + 'src': 'Source code', + 'source': 'Source code', 'scripts': 'Build and utility scripts', - 'middleware': 'Middleware functions and handlers', - 'migrations': 'Database migration files', - 'fixtures': 'Test fixtures and sample data' + 'bin': 'Binary/executable files', + 'lib': 'Library code', + 'libs': 'Library code', + 'utils': 'Utility functions', + 'config': 'Configuration files', + 'configs': 'Configuration files' +} + +DEFAULT_GITIGNORE_PATTERNS = { + '*.pyc', '*.pyo', '*.pyd', '__pycache__/', '.pytest_cache/', 'node_modules/', + '.DS_Store', '.git/', '.svn/', '.hg/', 'dist/', 'build/', '.idea/', '.vscode/' } def extract_function_calls_python(body: str, all_functions: Set[str]) -> List[str]: """Extract function calls from Python code body.""" - calls = set() - - # Pattern for function 
calls: word followed by ( - # Excludes: control flow keywords, built-ins we don't care about - call_pattern = r'\b(\w+)\s*\(' - exclude_keywords = { - 'if', 'elif', 'while', 'for', 'with', 'except', 'def', 'class', - 'return', 'yield', 'raise', 'assert', 'print', 'len', 'str', - 'int', 'float', 'bool', 'list', 'dict', 'set', 'tuple', 'type', - 'isinstance', 'issubclass', 'super', 'range', 'enumerate', 'zip', - 'map', 'filter', 'sorted', 'reversed', 'open', 'input', 'eval' - } - - for match in re.finditer(call_pattern, body): + calls = [] + # Simple regex to find function calls + for match in re.finditer(r'\b(\w+)\s*\(', body): func_name = match.group(1) - if func_name in all_functions and func_name not in exclude_keywords: - calls.add(func_name) - - # Also catch method calls like self.method() or obj.method() - method_pattern = r'(?:self|cls|\w+)\.(\w+)\s*\(' - for match in re.finditer(method_pattern, body): - method_name = match.group(1) - if method_name in all_functions: - calls.add(method_name) - - return sorted(list(calls)) + if func_name in all_functions and func_name not in calls: + calls.append(func_name) + return calls def extract_function_calls_javascript(body: str, all_functions: Set[str]) -> List[str]: """Extract function calls from JavaScript/TypeScript code body.""" - calls = set() - - # Pattern for function calls - call_pattern = r'\b(\w+)\s*\(' - exclude_keywords = { - 'if', 'while', 'for', 'switch', 'catch', 'function', 'class', - 'return', 'throw', 'new', 'typeof', 'instanceof', 'void', - 'console', 'Array', 'Object', 'String', 'Number', 'Boolean', - 'Promise', 'Math', 'Date', 'JSON', 'parseInt', 'parseFloat' - } - - for match in re.finditer(call_pattern, body): + calls = [] + # Simple regex to find function calls + for match in re.finditer(r'\b(\w+)\s*\(', body): func_name = match.group(1) - if func_name in all_functions and func_name not in exclude_keywords: - calls.add(func_name) - - # Method calls: obj.method() or this.method() - method_pattern 
= r'(?:this|\w+)\.(\w+)\s*\(' - for match in re.finditer(method_pattern, body): - method_name = match.group(1) - if method_name in all_functions: - calls.add(method_name) - - return sorted(list(calls)) + if func_name in all_functions and func_name not in calls: + calls.append(func_name) + return calls def build_call_graph(functions: Dict, classes: Dict) -> Tuple[Dict, Dict]: """Build bidirectional call graph from extracted functions and methods.""" - calls_map = {} - called_by_map = {} - - # Build calls_map from functions - for func_name, func_info in functions.items(): - if isinstance(func_info, dict) and 'calls' in func_info: - calls_map[func_name] = func_info['calls'] + call_graph = {} + reverse_call_graph = {} - # Build calls_map from class methods - for class_name, class_info in classes.items(): - if isinstance(class_info, dict) and 'methods' in class_info: - for method_name, method_info in class_info['methods'].items(): - if isinstance(method_info, dict) and 'calls' in method_info: - full_method_name = f"{class_name}.{method_name}" - calls_map[full_method_name] = method_info['calls'] + # Process functions + for func_name, func_data in functions.items(): + if isinstance(func_data, dict) and 'calls' in func_data: + calls = func_data['calls'] + call_graph[func_name] = calls + for called in calls: + if called not in reverse_call_graph: + reverse_call_graph[called] = [] + reverse_call_graph[called].append(func_name) - # Build the reverse index (called_by_map) - for func_name, called_funcs in calls_map.items(): - for called_func in called_funcs: - if called_func not in called_by_map: - called_by_map[called_func] = [] - if func_name not in called_by_map[called_func]: - called_by_map[called_func].append(func_name) - - return calls_map, called_by_map + return call_graph, reverse_call_graph def extract_python_signatures(content: str) -> Dict[str, Dict]: """Extract Python function and class signatures with full details for all files.""" - result = { - 'imports': [], - 
'functions': {}, - 'classes': {}, - 'constants': {}, - 'variables': [], - 'type_aliases': {}, - 'enums': {}, - 'call_graph': {} # Track function calls for flow analysis - } - - # Split into lines for line-by-line analysis + functions = {} + classes = {} lines = content.split('\n') - # Track current class context + # Extract functions with more comprehensive pattern + func_pattern = r'^(\s*)def\s+(\w+)\s*\(([^)]*)\)\s*(?:->\s*[^:]+)?\s*:' + for i, line in enumerate(lines): + match = re.match(func_pattern, line) + if match: + indent, func_name, params = match.groups() + + # Extract docstring if present + doc = "" + doc_start = i + 1 + if doc_start < len(lines) and '"""' in lines[doc_start]: + doc_lines = [] + in_docstring = False + for j in range(doc_start, min(doc_start + 10, len(lines))): + if '"""' in lines[j]: + if in_docstring: + break + else: + in_docstring = True + doc_lines.append(lines[j].strip().replace('"""', '').strip()) + elif in_docstring: + doc_lines.append(lines[j].strip()) + doc = ' '.join(doc_lines).strip() + + functions[func_name] = { + 'name': func_name, + 'type': 'function', + 'line': i + 1, + 'signature': f"({params})", + 'doc': doc, + 'indent_level': len(indent) // 4 # Assuming 4-space indentation + } + + # Extract classes with methods + class_pattern = r'^(\s*)class\s+(\w+)(\([^)]*\))?\s*:' current_class = None current_class_indent = -1 - class_stack = [] # For nested classes - - # First pass: collect all function and method names for call detection - all_function_names = set() - for line in lines: - func_match = re.match(r'^(?:[ \t]*)(async\s+)?def\s+(\w+)\s*\(', line) - if func_match: - all_function_names.add(func_match.group(2)) - - # Patterns - class_pattern = r'^([ \t]*)class\s+(\w+)(?:\s*\((.*?)\))?:' - func_pattern = r'^([ \t]*)(async\s+)?def\s+(\w+)\s*\((.*?)\)(?:\s*->\s*([^:]+))?:' - property_pattern = r'^([ \t]*)(\w+)\s*:\s*([^=\n]+)' - # Module-level constants (UPPERCASE_NAME = value) - module_const_pattern = 
r'^([A-Z_][A-Z0-9_]*)\s*=\s*(.+)$' - # Module-level variables with type annotations - module_var_pattern = r'^(\w+)\s*:\s*([^=]+)\s*=' - # Class-level constants - class_const_pattern = r'^([ \t]+)([A-Z_][A-Z0-9_]*)\s*=\s*(.+)$' - # Import patterns - import_pattern = r'^(?:from\s+([^\s]+)\s+)?import\s+(.+)$' - # Type alias pattern - type_alias_pattern = r'^(\w+)\s*=\s*(?:Union|Optional|List|Dict|Tuple|Set|Type|Callable|Literal|TypeVar|NewType|TypedDict|Protocol)\[.+\]$' - # Decorator pattern - decorator_pattern = r'^([ \t]*)@(\w+)(?:\(.*\))?$' - # Docstring pattern (matches next line after function/class) - docstring_pattern = r'^([ \t]*)(?:\'\'\'|""")(.+?)(?:\'\'\'|""")' - - # Dunder methods to skip (unless in critical files) - skip_dunder = {'__repr__', '__str__', '__hash__', '__eq__', '__ne__', - '__lt__', '__le__', '__gt__', '__ge__', '__bool__'} - # First pass: Extract imports - for line in lines: - import_match = re.match(import_pattern, line.strip()) - if import_match: - module, items = import_match.groups() - if module: - # from X import Y style - result['imports'].append(module) - else: - # import X style - for item in items.split(','): - item = item.strip().split(' as ')[0] # Remove aliases - result['imports'].append(item) - - # Track decorators for next function/method - pending_decorators = [] - - i = 0 - while i < len(lines): - line = lines[i] - - # Skip comments and docstrings - if line.strip().startswith('#') or line.strip().startswith('"""') or line.strip().startswith("'''"): - i += 1 - continue - - # Check for decorators - decorator_match = re.match(decorator_pattern, line) - if decorator_match: - _, decorator_name = decorator_match.groups() - pending_decorators.append(decorator_name) - i += 1 - continue - - # Check for module-level constants (before checking classes) - if not current_class: # Only at module level - # Check for type aliases first - type_alias_match = re.match(type_alias_pattern, line) - if type_alias_match: - alias_name = 
type_alias_match.group(1) - result['type_aliases'][alias_name] = line.split('=', 1)[1].strip() - i += 1 - continue - - const_match = re.match(module_const_pattern, line) - if const_match: - const_name, const_value = const_match.groups() - # Clean up the value (remove comments, strip quotes for readability) - const_value = const_value.split('#')[0].strip() - # Determine type from value - if const_value.startswith(('{', '[')): - const_type = 'collection' - elif const_value.startswith(("'", '"')): - const_type = 'str' - elif const_value.replace('.', '').replace('-', '').isdigit(): - const_type = 'number' - else: - const_type = 'value' - result['constants'][const_name] = const_type - i += 1 - continue - - # Check for module-level typed variables - var_match = re.match(module_var_pattern, line) - if var_match: - var_name, var_type = var_match.groups() - if var_name not in result['variables'] and not var_name.startswith('_'): - result['variables'].append(var_name) - i += 1 - continue - - # Check for class definition + for i, line in enumerate(lines): class_match = re.match(class_pattern, line) if class_match: - indent, name, bases = class_match.groups() - indent_level = len(indent) - - # Handle nested classes - pop from stack if dedented - while class_stack and indent_level <= class_stack[-1][1]: - class_stack.pop() - - # Only process top-level classes for the index - if indent_level == 0: - class_info = {'methods': {}, 'class_constants': {}} - - # Check for decorators on the class - if pending_decorators: - class_info['decorators'] = pending_decorators.copy() - pending_decorators.clear() - - # Add inheritance info and check special types - if bases: - base_list = [b.strip() for b in bases.split(',') if b.strip()] - if base_list: - class_info['inherits'] = base_list - - # Check for special class types - base_names_lower = [b.lower() for b in base_list] - if 'enum' in base_names_lower or any('enum' in b for b in base_names_lower): - class_info['type'] = 'enum' - # We'll 
extract enum values later - elif 'exception' in base_names_lower or 'error' in base_names_lower or any('exception' in b or 'error' in b for b in base_names_lower): - class_info['type'] = 'exception' - elif 'abc' in base_names_lower or 'protocol' in base_names_lower: - class_info['abstract'] = True - - # Extract docstring - if i + 1 < len(lines): - next_line = lines[i + 1].strip() - doc_match = re.match(docstring_pattern, lines[i + 1]) - if doc_match: - _, doc_content = doc_match.groups() - class_info['doc'] = doc_content.strip() - - class_info['line'] = i + 1 # Store line number (1-based) - result['classes'][name] = class_info - current_class = name - current_class_indent = indent_level - - # Add to stack - class_stack.append((name, indent_level)) - i += 1 - continue - - # Check if we've left the current class (dedented to module level) - if current_class and line.strip() and len(line) - len(line.lstrip()) <= current_class_indent: - # Check if it's not just a blank line or comment - if not line.strip().startswith('#'): - current_class = None - current_class_indent = -1 - - # Check for class-level constants or enum values - if current_class: - # For enums, capture all uppercase attributes as values - if result['classes'][current_class].get('type') == 'enum': - # Enum value pattern (NAME = value or just NAME) - enum_val_pattern = r'^([ \t]+)([A-Z_][A-Z0-9_]*)\s*(?:=\s*(.+))?$' - enum_match = re.match(enum_val_pattern, line) - if enum_match: - indent, enum_name, enum_value = enum_match.groups() - if len(indent) > current_class_indent: - if 'values' not in result['classes'][current_class]: - result['classes'][current_class]['values'] = [] - result['classes'][current_class]['values'].append(enum_name) - i += 1 - continue - - class_const_match = re.match(class_const_pattern, line) - if class_const_match: - indent, const_name, const_value = class_const_match.groups() - if len(indent) > current_class_indent: - # Clean up the value - const_value = 
const_value.split('#')[0].strip() - # Determine type - if const_value.startswith(('{', '[')): - const_type = 'collection' - elif const_value.startswith(("'", '"')): - const_type = 'str' - elif const_value.replace('.', '').replace('-', '').isdigit(): - const_type = 'number' - else: - const_type = 'value' - result['classes'][current_class]['class_constants'][const_name] = const_type - i += 1 - continue - - # Check for function/method definition - # First check if this line starts a function definition - func_start_pattern = r'^([ \t]*)(async\s+)?def\s+(\w+)\s*\(' - func_start_match = re.match(func_start_pattern, line) - - if func_start_match: - indent, is_async, name = func_start_match.groups() - indent_level = len(indent) - - # Collect the full signature across multiple lines - full_sig = line.rstrip() - j = i - - # Keep collecting lines until we find the colon that ends the signature - while j < len(lines) and not re.search(r'\).*:', lines[j]): - j += 1 - if j < len(lines): - full_sig += ' ' + lines[j].strip() - - # Make sure we have a complete signature - if j >= len(lines): - i += 1 - continue - - # Now parse the complete signature - complete_match = re.match(func_pattern, full_sig) - if complete_match: - indent, is_async, name, params, return_type = complete_match.groups() - i = j # Skip to the last line we processed - else: - # Failed to parse, skip this function - i += 1 - continue - - # Clean params - params = re.sub(r'\s+', ' ', params).strip() - - # Skip certain dunder methods (except __init__) - if name in skip_dunder and name != '__init__': - i += 1 - continue - - # Build function/method info - func_info = { - 'line': i + 1 # Store line number (1-based) + indent, class_name, inheritance = class_match.groups() + indent_level = len(indent) // 4 + + # Extract docstring if present + doc = "" + doc_start = i + 1 + if doc_start < len(lines) and '"""' in lines[doc_start]: + doc_lines = [] + in_docstring = False + for j in range(doc_start, min(doc_start + 10, 
len(lines))): + if '"""' in lines[j]: + if in_docstring: + break + else: + in_docstring = True + doc_lines.append(lines[j].strip().replace('"""', '').strip()) + elif in_docstring: + doc_lines.append(lines[j].strip()) + doc = ' '.join(doc_lines).strip() + + classes[class_name] = { + 'name': class_name, + 'type': 'class', + 'line': i + 1, + 'doc': doc, + 'methods': {}, + 'indent_level': indent_level } - # Build full signature - signature = f"({params})" - if return_type: - signature += f" -> {return_type.strip()}" - if is_async: - signature = "async " + signature - - # Add decorators if any - if pending_decorators: - func_info['decorators'] = pending_decorators.copy() - # Check for abstractmethod - if 'abstractmethod' in pending_decorators: - if current_class: - result['classes'][current_class]['abstract'] = True - pending_decorators.clear() + if inheritance: + classes[class_name]['inherits'] = inheritance.strip('()') - # Extract docstring - if i + 1 < len(lines): - doc_match = re.match(docstring_pattern, lines[i + 1]) - if doc_match: - _, doc_content = doc_match.groups() - func_info['doc'] = doc_content.strip() - - # Extract function body to find calls - func_body_start = i + 1 - func_body_lines = [] - func_indent = len(indent) if indent else 0 - - # Skip past any docstring (but include it in body for now) - body_idx = func_body_start - - # Collect function body - everything indented more than the def line - while body_idx < len(lines): - body_line = lines[body_idx] - - # Skip empty lines - if not body_line.strip(): - func_body_lines.append(body_line) - body_idx += 1 - continue - - # Check indentation to see if we're still in the function - line_indent = len(body_line) - len(body_line.lstrip()) + current_class = class_name + current_class_indent = indent_level + + # Check for methods within classes + elif current_class is not None: + method_match = re.match(r'^(\s+)def\s+(\w+)\s*\(([^)]*)\)\s*(?:->\s*[^:]+)?\s*:', line) + if method_match: + method_indent, 
method_name, method_params = method_match.groups() + method_indent_level = len(method_indent) // 4 - # If we hit a line that's not indented more than the function def, we're done - if line_indent <= func_indent and body_line.strip(): - break + # If method is at class level (one level deeper than class) + if method_indent_level == current_class_indent + 1: + # Extract method docstring + method_doc = "" + doc_start = i + 1 + if doc_start < len(lines) and '"""' in lines[doc_start]: + doc_lines = [] + in_docstring = False + for j in range(doc_start, min(doc_start + 10, len(lines))): + if '"""' in lines[j]: + if in_docstring: + break + else: + in_docstring = True + doc_lines.append(lines[j].strip().replace('"""', '').strip()) + elif in_docstring: + doc_lines.append(lines[j].strip()) + method_doc = ' '.join(doc_lines).strip() - func_body_lines.append(body_line) - body_idx += 1 - - # Extract calls from the body - if func_body_lines: - func_body = '\n'.join(func_body_lines) - calls = extract_function_calls_python(func_body, all_function_names) - if calls: - func_info['calls'] = calls + classes[current_class]['methods'][method_name] = { + 'name': method_name, + 'type': 'method', + 'line': i + 1, + 'signature': f"({method_params})", + 'doc': method_doc + } - # Always store as dict to include line number - func_info['signature'] = signature - - # Determine where to place this function - if current_class and indent_level > current_class_indent: - # It's a method of the current class - result['classes'][current_class]['methods'][name] = func_info - elif indent_level == 0: - # It's a module-level function - result['functions'][name] = func_info - - # Check for class properties - if current_class: - prop_match = re.match(property_pattern, line) - if prop_match: - indent, prop_name, prop_type = prop_match.groups() - if len(indent) > current_class_indent and not prop_name.startswith('_'): - if 'properties' not in result['classes'][current_class]: - 
result['classes'][current_class]['properties'] = [] - result['classes'][current_class]['properties'].append(prop_name) - - i += 1 - - # Post-process - remove empty collections - for class_name, class_info in result['classes'].items(): - if 'properties' in class_info and not class_info['properties']: - del class_info['properties'] - if 'class_constants' in class_info and not class_info['class_constants']: - del class_info['class_constants'] - if 'decorators' in class_info and not class_info['decorators']: - del class_info['decorators'] - if 'values' in class_info and not class_info['values']: - del class_info['values'] - - # Remove empty module-level collections - if not result['constants']: - del result['constants'] - if not result['variables']: - del result['variables'] - if not result['type_aliases']: - del result['type_aliases'] - if not result['enums']: - del result['enums'] - if not result['imports']: - del result['imports'] - - # Move enum classes to enums section - enums_to_move = {} - for class_name, class_info in list(result['classes'].items()): - if class_info.get('type') == 'enum': - enums_to_move[class_name] = { - 'values': class_info.get('values', []), - 'doc': class_info.get('doc', '') - } - del result['classes'][class_name] + # Reset current class if we've moved to a different indentation level + elif line.strip() and not line.startswith(' ' * ((current_class_indent + 1) * 4)): + if not line.startswith(' ' * (current_class_indent * 4)): + current_class = None + current_class_indent = -1 - if enums_to_move: - result['enums'] = enums_to_move - - return result + return {'functions': functions, 'classes': classes} + + +def pos_to_line(content: str, pos: int) -> int: + """Convert character position to line number.""" + return content[:pos].count('\n') + 1 -def extract_javascript_signatures(content: str) -> Dict[str, any]: +def extract_javascript_signatures(content: str) -> Dict[str, Any]: """Extract JavaScript/TypeScript function and class signatures with 
full details.""" - result = { - 'imports': [], - 'functions': {}, - 'classes': {}, - 'constants': {}, - 'variables': [], - 'type_aliases': {}, - 'interfaces': {}, - 'enums': {}, - 'call_graph': {} # Track function calls for flow analysis - } - - # Helper to convert character position to line number - def pos_to_line(pos: int) -> int: - return content[:pos].count('\n') + 1 - - # First pass: collect all function names for call detection - all_function_names = set() - # Regular functions - for match in re.finditer(r'(?:async\s+)?function\s+(\w+)', content): - all_function_names.add(match.group(1)) - # Arrow functions and const functions - for match in re.finditer(r'(?:const|let|var)\s+(\w+)\s*=\s*(?:async\s+)?\(', content): - all_function_names.add(match.group(1)) - # Method names - for match in re.finditer(r'(\w+)\s*\([^)]*\)\s*{', content): - all_function_names.add(match.group(1)) + functions = {} + classes = {} - # Extract imports first - # import X from 'Y', import {X} from 'Y', import * as X from 'Y' - import_pattern = r'import\s+(?:([^{}\s]+)|{([^}]+)}|\*\s+as\s+(\w+))\s+from\s+[\'"]([^\'"]+)[\'"]' - for match in re.finditer(import_pattern, content): - default_import, named_imports, namespace_import, module = match.groups() - if module: - result['imports'].append(module) - - # require() style imports - require_pattern = r'(?:const|let|var)\s+(?:{[^}]+}|\w+)\s*=\s*require\s*\([\'"]([^\'"]+)[\'"]\)' - for match in re.finditer(require_pattern, content): - result['imports'].append(match.group(1)) - - # Extract type aliases (TypeScript) - simpler approach with brace counting - type_alias_pattern = r'(?:export\s+)?type\s+(\w+)\s*=\s*(.+?)(?:;[\s]*(?:(?:export\s+)?(?:type|const|let|var|function|class|interface|enum)\s+|\/\/|$))' - - for match in re.finditer(type_alias_pattern, content, re.MULTILINE | re.DOTALL): - alias_name, alias_type = match.groups() - # Clean up the type definition - clean_type = ' '.join(alias_type.strip().split()) - - # If it starts with { but 
seems incomplete, try to capture the full object - if clean_type.startswith('{') and clean_type.count('{') > clean_type.count('}'): - # Find the position after the = sign - start_pos = match.start(2) - brace_count = 0 - end_pos = start_pos - - # Count braces to find the complete type - for i, char in enumerate(content[start_pos:]): - if char == '{': - brace_count += 1 - elif char == '}': - brace_count -= 1 - if brace_count == 0: - end_pos = start_pos + i + 1 - break - - if end_pos > start_pos: - complete_type = content[start_pos:end_pos].strip() - clean_type = ' '.join(complete_type.split()) - - result['type_aliases'][alias_name] = clean_type - - # Extract interfaces (TypeScript) - interface_pattern = r'(?:export\s+)?interface\s+(\w+)(?:\s+extends\s+([^{]+))?\s*{' - for match in re.finditer(interface_pattern, content): - interface_name, extends = match.groups() - interface_info = {} - if extends: - interface_info['extends'] = [e.strip() for e in extends.split(',')] - # Extract first line of JSDoc if present - jsdoc_match = re.search(r'/\*\*\s*\n?\s*\*?\s*([^@\n]+)', content[:match.start()]) - if jsdoc_match: - interface_info['doc'] = jsdoc_match.group(1).strip() - result['interfaces'][interface_name] = interface_info - - # Extract enums (TypeScript) - enum_pattern = r'(?:export\s+)?enum\s+(\w+)\s*{' - enum_matches = list(re.finditer(enum_pattern, content)) - for match in enum_matches: - enum_name = match.group(1) - # Find enum values - start_pos = match.end() - brace_count = 1 - end_pos = start_pos - for i in range(start_pos, len(content)): - if content[i] == '{': - brace_count += 1 - elif content[i] == '}': - brace_count -= 1 - if brace_count == 0: - end_pos = i - break - - enum_body = content[start_pos:end_pos] - # Extract enum values - value_pattern = r'(\w+)\s*(?:=\s*[^,\n]+)?' 
- values = re.findall(value_pattern, enum_body) - result['enums'][enum_name] = {'values': values} - - # Extract module-level constants and variables - # const CONSTANT_NAME = value - const_pattern = r'(?:export\s+)?const\s+([A-Z_][A-Z0-9_]*)\s*=\s*([^;]+)' - for match in re.finditer(const_pattern, content): - const_name, const_value = match.groups() - const_value = const_value.strip() - if const_value.startswith(('{', '[')): - const_type = 'collection' - elif const_value.startswith(("'", '"', '`')): - const_type = 'str' - elif const_value.replace('.', '').replace('-', '').isdigit(): - const_type = 'number' - else: - const_type = 'value' - result['constants'][const_name] = const_type - - # let/const variables (not uppercase) - var_pattern = r'(?:export\s+)?(?:let|const)\s+([a-z]\w*)\s*(?::\s*\w+)?\s*=' - for match in re.finditer(var_pattern, content): - var_name = match.group(1) - if var_name not in result['variables']: - result['variables'].append(var_name) - - # Find all classes first with their boundaries - class_pattern = r'(?:export\s+)?class\s+(\w+)(?:\s+extends\s+(\w+))?' 
- class_positions = {} # {class_name: (start_pos, end_pos)} - - for match in re.finditer(class_pattern, content): - class_name, extends = match.groups() - start_pos = match.start() - - # Find the class body (between { and }) - brace_count = 0 - in_class = False - end_pos = start_pos - - for i in range(match.end(), len(content)): - if content[i] == '{': - if not in_class: - in_class = True - brace_count += 1 - elif content[i] == '}': - brace_count -= 1 - if brace_count == 0 and in_class: - end_pos = i - break - - class_positions[class_name] = (start_pos, end_pos) - - # Initialize class info - class_info = { - 'line': pos_to_line(start_pos), - 'methods': {}, - 'static_constants': {} - } - if extends: - class_info['extends'] = extends - # Check for exception classes - if extends.lower() in ['error', 'exception'] or 'error' in extends.lower(): - class_info['type'] = 'exception' - - # Extract JSDoc comment - jsdoc_match = re.search(r'/\*\*\s*\n?\s*\*?\s*([^@\n]+)', content[:start_pos]) - if jsdoc_match: - class_info['doc'] = jsdoc_match.group(1).strip() - - result['classes'][class_name] = class_info - - # Extract methods from classes - method_patterns = [ - # Regular methods: methodName(...) { or async methodName(...) { - r'^\s*(async\s+)?(\w+)\s*\((.*?)\)\s*(?::\s*([^{]+))?\s*{', - # Arrow function properties: methodName = (...) 
=> { - r'^\s*(\w+)\s*=\s*(?:async\s+)?\(([^)]*)\)\s*(?::\s*([^=]+))?\s*=>', - # Constructor - r'^\s*(constructor)\s*\(([^)]*)\)\s*{' - ] - - for class_name, (start, end) in class_positions.items(): - class_content = content[start:end] - - for pattern in method_patterns: - for match in re.finditer(pattern, class_content, re.MULTILINE): - # Extract method name and params based on pattern - if 'constructor' in pattern: - method_name = '__init__' # Convert to Python-style - params = match.group(2) - return_type = None - elif '=' in pattern: - method_name = match.group(1) - params = match.group(2) - return_type = match.group(3) - else: - is_async = match.group(1) - method_name = match.group(2) - params = match.group(3) - return_type = match.group(4) - - # Skip getters/setters and keywords - if method_name in ['get', 'set', 'if', 'for', 'while', 'switch', 'catch', 'try']: - continue - - method_info = { - 'line': pos_to_line(start + match.start()) - } - - # Build full signature - params = re.sub(r'\s+', ' ', params).strip() - signature = f"({params})" - if return_type: - signature += f": {return_type.strip()}" - if 'async' in str(match.group(0)): - signature = "async " + signature - - # Try to extract method body for call analysis - method_start = match.end() - # Find the opening brace - brace_pos = class_content.find('{', method_start) - if brace_pos != -1 and brace_pos - method_start < 100: - # Extract method body - brace_count = 1 - body_start = brace_pos + 1 - body_end = body_start - - for i in range(body_start, min(len(class_content), body_start + 3000)): - if class_content[i] == '{': - brace_count += 1 - elif class_content[i] == '}': - brace_count -= 1 - if brace_count == 0: - body_end = i - break - - if body_end > body_start: - method_body = class_content[body_start:body_end] - calls = extract_function_calls_javascript(method_body, all_function_names) - if calls: - method_info['calls'] = calls - - # Store method info - if method_info: - method_info['signature'] = 
signature - result['classes'][class_name]['methods'][method_name] = method_info - else: - result['classes'][class_name]['methods'][method_name] = signature - - # Extract static constants in class - static_const_pattern = r'static\s+([A-Z_][A-Z0-9_]*)\s*=\s*([^;]+)' - for match in re.finditer(static_const_pattern, class_content): - const_name, const_value = match.groups() - const_value = const_value.strip() - if const_value.startswith(('{', '[')): - const_type = 'collection' - elif const_value.startswith(("'", '"', '`')): - const_type = 'str' - elif const_value.replace('.', '').replace('-', '').isdigit(): - const_type = 'number' - else: - const_type = 'value' - result['classes'][class_name]['static_constants'][const_name] = const_type - - # Extract standalone functions (not inside classes) + # Basic function extraction for JS/TS func_patterns = [ - # Function declarations - r'(?:export\s+)?(?:async\s+)?function\s+(\w+)\s*(?:<[^>]+>)?\s*\(([^)]*)\)(?:\s*:\s*([^{]+))?', - # Arrow functions assigned to const - r'(?:export\s+)?const\s+(\w+)\s*(?::\s*[^=]+)?\s*=\s*(?:async\s+)?\(([^)]*)\)\s*(?::\s*([^=]+))?\s*=>' + r'function\s+(\w+)\s*\([^)]*\)', + r'const\s+(\w+)\s*=\s*\([^)]*\)\s*=>', + r'(\w+)\s*:\s*\([^)]*\)\s*=>' ] for pattern in func_patterns: - for match in re.finditer(pattern, content): + for match in re.finditer(pattern, content, re.MULTILINE): func_name = match.group(1) - params = match.group(2) if match.lastindex >= 2 else '' - return_type = match.group(3) if match.lastindex >= 3 else None - - # Check if this function is inside any class - func_pos = match.start() - inside_class = False - for class_name, (start, end) in class_positions.items(): - if start <= func_pos <= end: - inside_class = True - break - - if not inside_class: - func_info = { - 'line': pos_to_line(func_pos) - } - - # Build full signature - params = re.sub(r'\s+', ' ', params).strip() - signature = f"({params})" - if return_type: - signature += f": {return_type.strip()}" - if 'async' in 
match.group(0): - signature = "async " + signature - - # Try to extract function body for call analysis - func_start = match.end() - # Find the opening brace - brace_pos = content.find('{', func_start) - if brace_pos != -1 and brace_pos - func_start < 100: # Reasonable distance - # Extract function body - brace_count = 1 - body_start = brace_pos + 1 - body_end = body_start - - for i in range(body_start, min(len(content), body_start + 5000)): # Limit scan - if content[i] == '{': - brace_count += 1 - elif content[i] == '}': - brace_count -= 1 - if brace_count == 0: - body_end = i - break - - if body_end > body_start: - func_body = content[body_start:body_end] - calls = extract_function_calls_javascript(func_body, all_function_names) - if calls: - func_info['calls'] = calls - - # Store function info - if func_info: - func_info['signature'] = signature - result['functions'][func_name] = func_info - else: - result['functions'][func_name] = signature - - # Clean up empty collections - for class_name, class_info in result['classes'].items(): - if 'static_constants' in class_info and not class_info['static_constants']: - del class_info['static_constants'] - - # Remove empty module-level collections - if not result['constants']: - del result['constants'] - if not result['variables']: - del result['variables'] - if not result['imports']: - del result['imports'] - if not result['type_aliases']: - del result['type_aliases'] - if not result['interfaces']: - del result['interfaces'] - if not result['enums']: - del result['enums'] + functions[func_name] = { + 'name': func_name, + 'type': 'function', + 'line': pos_to_line(content, match.start()) + } - return result + return {'functions': functions, 'classes': classes} def extract_function_calls_shell(body: str, all_functions: Set[str]) -> List[str]: """Extract function calls from shell script body.""" - calls = set() - - # In shell, functions are called just by name (no parentheses) - # We need to be careful to avoid false 
positives - for func_name in all_functions: - # Look for function name at start of line or after common shell operators - patterns = [ - rf'^\s*{func_name}\b', # Start of line - rf'[;&|]\s*{func_name}\b', # After operators - rf'\$\({func_name}\b', # Command substitution - rf'`{func_name}\b', # Backtick substitution - ] - for pattern in patterns: - if re.search(pattern, body, re.MULTILINE): - calls.add(func_name) - break - - return sorted(list(calls)) + calls = [] + # Simple pattern for shell function calls + for match in re.finditer(r'\b(\w+)\b', body): + func_name = match.group(1) + if func_name in all_functions and func_name not in calls: + calls.append(func_name) + return calls -def extract_shell_signatures(content: str) -> Dict[str, any]: +def extract_shell_signatures(content: str) -> Dict[str, Any]: """Extract shell script function signatures and structure.""" - result = { - 'functions': {}, - 'variables': [], - 'exports': {}, - 'sources': [], - 'call_graph': {} # Track function calls - } - - lines = content.split('\n') - - # First pass: collect all function names - all_function_names = set() - for line in lines: - # Style 1: function_name() { - match1 = re.match(r'^(\w+)\s*\(\)\s*\{?', line) - if match1: - all_function_names.add(match1.group(1)) - # Style 2: function function_name { - match2 = re.match(r'^function\s+(\w+)\s*\{?', line) - if match2: - all_function_names.add(match2.group(1)) - - # Function patterns - # Style 1: function_name() { ... } - func_pattern1 = r'^(\w+)\s*\(\)\s*\{?' - # Style 2: function function_name { ... } - func_pattern2 = r'^function\s+(\w+)\s*\{?' + functions = {} - # Variable patterns - # Export pattern: export VAR=value - export_pattern = r'^export\s+([A-Z_][A-Z0-9_]*)(=(.*))?' 
- # Regular variable: VAR=value (uppercase) - var_pattern = r'^([A-Z_][A-Z0-9_]*)=(.+)$' - - # Source patterns - handle quotes and command substitution - source_patterns = [ - r'^(?:source|\.)\s+([\'"])([^\'"]+)\1', # Quoted paths - r'^(?:source|\.)\s+(\$\([^)]+\)[^\s]*)', # Command substitution like $(dirname "$0")/file - r'^(?:source|\.)\s+([^\s]+)', # Unquoted paths - ] - - # Track if we're in a function - in_function = False - current_function = None - function_start_line = -1 - - for i, line in enumerate(lines): - stripped = line.strip() - - # Skip empty lines and pure comments - if not stripped or stripped.startswith('#!'): - continue - - # Check for function definition (style 1) - match = re.match(func_pattern1, stripped) - if match: - func_name = match.group(1) - # Extract documentation comment if present - doc = None - if i > 0 and lines[i-1].strip().startswith('#'): - doc = lines[i-1].strip()[1:].strip() - - # Try to find parameters from the function body - params = [] - brace_count = 0 - in_func_body = False - - # Look for $1, $2, etc. 
usage in the function body only - for j in range(i+1, min(i+20, len(lines))): - line_content = lines[j].strip() - - # Track braces to know when we're in the function - if '{' in line_content: - brace_count += line_content.count('{') - in_func_body = True - if '}' in line_content: - brace_count -= line_content.count('}') - if brace_count <= 0: - break # End of function - - # Only look for parameters inside the function body - if in_func_body: - param_matches = re.findall(r'\$(\d+)', lines[j]) - for p in param_matches: - param_num = int(p) - if param_num > 0 and param_num not in params: - params.append(param_num) - - # Build signature - if params: - max_param = max(params) - param_list = ' '.join(f'$1' if j == 1 else f'${{{j}}}' for j in range(1, max_param + 1)) - signature = f"({param_list})" - else: - signature = "()" - - # Extract function body for call analysis - func_body_lines = [] - brace_count = 0 - in_func_body = False - for j in range(i+1, len(lines)): - line_content = lines[j] - if '{' in line_content: - brace_count += line_content.count('{') - in_func_body = True - if in_func_body: - func_body_lines.append(line_content) - if '}' in line_content: - brace_count -= line_content.count('}') - if brace_count <= 0: - break - - func_info = {} - if func_body_lines: - func_body = '\n'.join(func_body_lines) - calls = extract_function_calls_shell(func_body, all_function_names) - if calls: - func_info['calls'] = calls - - if doc: - func_info['doc'] = doc - - if func_info: - func_info['signature'] = signature - result['functions'][func_name] = func_info - else: - result['functions'][func_name] = signature - continue - - # Check for function definition (style 2) - match = re.match(func_pattern2, stripped) - if match: - func_name = match.group(1) - # Extract documentation comment if present - doc = None - if i > 0 and lines[i-1].strip().startswith('#'): - doc = lines[i-1].strip()[1:].strip() - - # Try to find parameters from the function body - params = [] - brace_count 
= 0 - in_func_body = False - - # Look for $1, $2, etc. usage in the function body only - for j in range(i+1, min(i+20, len(lines))): - line_content = lines[j].strip() - - # Track braces to know when we're in the function - if '{' in line_content: - brace_count += line_content.count('{') - in_func_body = True - if '}' in line_content: - brace_count -= line_content.count('}') - if brace_count <= 0: - break # End of function - - # Only look for parameters inside the function body - if in_func_body: - param_matches = re.findall(r'\$(\d+)', lines[j]) - for p in param_matches: - param_num = int(p) - if param_num > 0 and param_num not in params: - params.append(param_num) - - # Build signature - if params: - max_param = max(params) - param_list = ' '.join(f'$1' if j == 1 else f'${{{j}}}' for j in range(1, max_param + 1)) - signature = f"({param_list})" - else: - signature = "()" - - # Extract function body for call analysis - func_body_lines = [] - brace_count = 0 - in_func_body = False - for j in range(i+1, len(lines)): - line_content = lines[j] - if '{' in line_content: - brace_count += line_content.count('{') - in_func_body = True - if in_func_body: - func_body_lines.append(line_content) - if '}' in line_content: - brace_count -= line_content.count('}') - if brace_count <= 0: - break - - func_info = {} - if func_body_lines: - func_body = '\n'.join(func_body_lines) - calls = extract_function_calls_shell(func_body, all_function_names) - if calls: - func_info['calls'] = calls - - if doc: - func_info['doc'] = doc - - if func_info: - func_info['signature'] = signature - result['functions'][func_name] = func_info - else: - result['functions'][func_name] = signature - continue - - # Check for exports - match = re.match(export_pattern, stripped) - if match: - var_name = match.group(1) - var_value = match.group(3) if match.group(3) else None - if var_value: - # Determine type - if var_value.startswith(("'", '"')): - var_type = 'str' - elif var_value.isdigit(): - var_type = 
'number' - else: - var_type = 'value' - result['exports'][var_name] = var_type - continue - - # Check for regular variables (uppercase) - match = re.match(var_pattern, stripped) - if match: - var_name = match.group(1) - # Only track if not already in exports - if var_name not in result['exports'] and var_name not in result['variables']: - result['variables'].append(var_name) - continue - - # Check for source/dot includes - for source_pattern in source_patterns: - match = re.match(source_pattern, stripped) - if match: - # Extract the file path based on which pattern matched - if len(match.groups()) == 2: # Quoted pattern - sourced_file = match.group(2) - else: # Unquoted or command substitution - sourced_file = match.group(1) - - sourced_file = sourced_file.strip() - if sourced_file and sourced_file not in result['sources']: - result['sources'].append(sourced_file) - break # Found a match, no need to try other patterns - - # Clean up empty collections - if not result['variables']: - del result['variables'] - if not result['exports']: - del result['exports'] - if not result['sources']: - del result['sources'] + # Shell function pattern + func_pattern = r'(\w+)\s*\(\)\s*\{' + for match in re.finditer(func_pattern, content, re.MULTILINE): + func_name = match.group(1) + functions[func_name] = { + 'name': func_name, + 'type': 'function', + 'line': pos_to_line(content, match.start()) + } - return result + return {'functions': functions, 'classes': {}} def extract_markdown_structure(file_path: Path) -> Dict[str, List[str]]: """Extract headers and architectural hints from markdown files.""" try: content = file_path.read_text(encoding='utf-8', errors='ignore') - except: - return {'sections': [], 'architecture_hints': []} - - # Extract headers (up to level 3) - headers = re.findall(r'^#{1,3}\s+(.+)$', content[:5000], re.MULTILINE) # Only scan first 5KB - - # Look for architectural hints - arch_patterns = [ - r'(?:located?|found?|stored?)\s+in\s+`?([\w\-\./]+)`?', - 
r'`?([\w\-\./]+)`?\s+(?:contains?|houses?|holds?)', - r'(?:see|check|look)\s+(?:in\s+)?`?([\w\-\./]+)`?\s+for', - r'(?:file|module|component)\s+`?([\w\-\./]+)`?', - ] - - hints = set() - for pattern in arch_patterns: - matches = re.findall(pattern, content[:5000], re.IGNORECASE) - for match in matches: - if '/' in match and not match.startswith('http'): - hints.add(match) - - return { - 'sections': headers[:10], # Limit to prevent bloat - 'architecture_hints': list(hints)[:5] - } + headers = [] + + for match in re.finditer(r'^#+\s+(.+)$', content, re.MULTILINE): + headers.append(match.group(1).strip()) + + return {'headers': headers} + except Exception: + return {'headers': []} def infer_file_purpose(file_path: Path) -> Optional[str]: """Infer the purpose of a file from its name and location.""" - name = file_path.stem.lower() + name = file_path.name.lower() + parent = file_path.parent.name.lower() - # Common file purposes - if name in ['index', 'main', 'app']: - return 'Application entry point' - elif 'test' in name or 'spec' in name: + if 'test' in name or parent == 'tests': return 'Test file' - elif 'config' in name or 'settings' in name: + elif 'config' in name or name.endswith('.config.js'): return 'Configuration' - elif 'route' in name: - return 'Route definitions' - elif 'model' in name: - return 'Data model' - elif 'util' in name or 'helper' in name: - return 'Utility functions' - elif 'middleware' in name: - return 'Middleware' - - return None + elif name in ('readme.md', 'readme.txt'): + return 'Documentation' + elif name.startswith('.'): + return 'Hidden/config file' + else: + return None def infer_directory_purpose(path: Path, files_within: List[str]) -> Optional[str]: """Infer directory purpose from naming patterns and contents.""" dir_name = path.name.lower() - # Check exact matches first - if dir_name in DIRECTORY_PURPOSES: - return DIRECTORY_PURPOSES[dir_name] - - # Check if directory name contains key patterns - for pattern, purpose in 
DIRECTORY_PURPOSES.items(): - if pattern in dir_name: - return purpose - - # Infer from contents - if files_within: - # Check for test files - if any('test' in f.lower() or 'spec' in f.lower() for f in files_within): - return 'Test files and test utilities' - - # Check for specific file patterns - if any('model' in f.lower() for f in files_within): - return 'Data models and schemas' - elif any('route' in f.lower() or 'endpoint' in f.lower() for f in files_within): - return 'API routes and endpoints' - elif any('component' in f.lower() for f in files_within): - return 'UI components' - - return None + if dir_name in ('tests', 'test'): + return 'Test directory' + elif dir_name in ('docs', 'documentation'): + return 'Documentation' + elif dir_name in ('src', 'source'): + return 'Source code' + elif dir_name in ('scripts', 'bin'): + return 'Build and utility scripts' + else: + return None def get_language_name(extension: str) -> str: """Get readable language name from extension.""" - if extension in PARSEABLE_LANGUAGES: - return PARSEABLE_LANGUAGES[extension] - return extension[1:] if extension else 'unknown' - - -# Global cache for gitignore patterns -_gitignore_cache = {} + lang_map = { + '.py': 'Python', '.js': 'JavaScript', '.ts': 'TypeScript', + '.java': 'Java', '.cpp': 'C++', '.c': 'C', '.go': 'Go', + '.rs': 'Rust', '.php': 'PHP', '.rb': 'Ruby', '.sh': 'Shell', + '.md': 'Markdown', '.json': 'JSON', '.yaml': 'YAML', '.yml': 'YAML' + } + return lang_map.get(extension, extension[1:].upper() if extension else 'Unknown') def parse_gitignore(gitignore_path: Path) -> List[str]: """Parse a .gitignore file and return list of patterns.""" - if not gitignore_path.exists(): - return [] - - patterns = [] try: + patterns = [] with open(gitignore_path, 'r') as f: for line in f: line = line.strip() - # Skip empty lines and comments - if not line or line.startswith('#'): - continue - patterns.append(line) - except: - pass - - return patterns + if line and not 
line.startswith('#'): + patterns.append(line) + return patterns + except Exception: + return [] def load_gitignore_patterns(root_path: Path) -> Set[str]: """Load all gitignore patterns from project root and merge with defaults.""" - # Use cached patterns if available - cache_key = str(root_path) - if cache_key in _gitignore_cache: - return _gitignore_cache[cache_key] - - # Start with default ignore patterns - patterns = set(IGNORE_DIRS) + patterns = set(DEFAULT_GITIGNORE_PATTERNS) - # Add patterns from .gitignore in project root gitignore_path = root_path / '.gitignore' if gitignore_path.exists(): - for pattern in parse_gitignore(gitignore_path): - # Handle negations (!) later if needed - if not pattern.startswith('!'): - patterns.add(pattern) + patterns.update(parse_gitignore(gitignore_path)) - # Cache the patterns - _gitignore_cache[cache_key] = patterns return patterns def matches_gitignore_pattern(path: Path, patterns: Set[str], root_path: Path) -> bool: """Check if a path matches any gitignore pattern.""" - # Get relative path from root try: rel_path = path.relative_to(root_path) - except ValueError: - # Path is not relative to root - return False - - # Convert to string for pattern matching - path_str = str(rel_path) - path_parts = rel_path.parts - - for pattern in patterns: - # Check if any parent directory matches the pattern - # Strip trailing slash for directory patterns - clean_pattern = pattern.rstrip('/') - for part in path_parts: - if part == clean_pattern or fnmatch.fnmatch(part, clean_pattern): - return True + path_str = str(rel_path) - # Check full path patterns - if '/' in pattern: - # Pattern includes directory separator - if fnmatch.fnmatch(path_str, pattern): - return True - # Also check without leading slash - if pattern.startswith('/') and fnmatch.fnmatch(path_str, pattern[1:]): - return True - else: - # Pattern is just a filename/directory name - # Check if the filename matches - if fnmatch.fnmatch(path.name, pattern): - return True - # 
Check if it matches the full relative path - if fnmatch.fnmatch(path_str, pattern): - return True - # Check with wildcards - if fnmatch.fnmatch(path_str, f'**/{pattern}'): + for pattern in patterns: + if fnmatch.fnmatch(path_str, pattern) or fnmatch.fnmatch(path.name, pattern): return True - - return False + + return False + except ValueError: + return False def should_index_file(path: Path, root_path: Path = None) -> bool: @@ -1386,31 +410,16 @@ def should_index_file(path: Path, root_path: Path = None) -> bool: def get_git_files(root_path: Path) -> Optional[List[Path]]: - """Get list of files tracked by git (respects .gitignore). - Returns None if not a git repository or git command fails.""" + """Get list of files tracked by git.""" try: - import subprocess - - # Run git ls-files to get tracked and untracked files that aren't ignored result = subprocess.run( - ['git', 'ls-files', '--cached', '--others', '--exclude-standard'], - cwd=str(root_path), - capture_output=True, - text=True, - timeout=10 + ['git', 'ls-files'], + cwd=root_path, + capture_output=True, + text=True ) - if result.returncode == 0: - files = [] - for line in result.stdout.strip().split('\n'): - if line: - file_path = root_path / line - # Only include actual files (not directories) - if file_path.is_file(): - files.append(file_path) - return files - else: - return None - except (subprocess.TimeoutExpired, FileNotFoundError, Exception): - # Git not available or command failed + return [root_path / f for f in result.stdout.strip().split('\n') if f] + return None + except Exception: return None \ No newline at end of file diff --git a/scripts/interactive_cleanup.py b/scripts/interactive_cleanup.py new file mode 100644 index 0000000..a9a5723 --- /dev/null +++ b/scripts/interactive_cleanup.py @@ -0,0 +1,549 @@ +#!/usr/bin/env python3 +""" +Interactive cleanup workflow for eliminating duplicate code. +Guides users through prioritized duplicate elimination process. 
+""" + +import json +import sys +import os +from pathlib import Path +from typing import Dict, List, Any +import subprocess + +try: + from generate_duplicate_report import DuplicateReportGenerator +except ImportError: + sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) + from generate_duplicate_report import DuplicateReportGenerator + + +class InteractiveCleanup: + """Interactive duplicate code cleanup workflow.""" + + def __init__(self, project_root: str): + self.project_root = Path(project_root) + self.report_generator = DuplicateReportGenerator(project_root) + self.analysis = None + + def run_cleanup_workflow(self): + """Run the complete interactive cleanup workflow.""" + print("🧹 Interactive Duplicate Code Cleanup") + print("=" * 50) + + # Step 1: Generate analysis + print("\n📊 Analyzing codebase for duplicates...") + self.analysis = self.report_generator.analyze_duplicates() + + if 'error' in self.analysis: + print(f"❌ Error: {self.analysis['error']}") + return + + # Step 2: Show summary + self._show_summary() + + # Step 3: Interactive cleanup + self._interactive_menu() + + def _show_summary(self): + """Display analysis summary.""" + summary = self.analysis['summary'] + + print(f"\n📈 Analysis Summary:") + print(f" • Total Functions Analyzed: {self.analysis['total_functions_analyzed']}") + print(f" • Exact Duplicate Groups: {summary['total_duplicate_groups']}") + print(f" • Functions to Deduplicate: {summary['total_duplicate_functions']}") + print(f" • Similarity Clusters: {summary['total_similarity_clusters']}") + print(f" • Estimated Lines Saved: {summary['estimated_lines_saved']}") + print(f" • High Priority Items: {summary['top_priority_count']}") + + def _interactive_menu(self): + """Main interactive menu.""" + while True: + print(f"\n🔧 Cleanup Options:") + print("1. 🚨 View exact duplicates (immediate action)") + print("2. 📊 View similarity clusters (medium effort)") + print("3. 📋 View cleanup priorities") + print("4. 
🤖 Launch cleanup sub-agents") + print("5. 📄 Generate full report") + print("6. ⚡ Quick wins (auto-fix easy duplicates)") + print("7. 🔍 Search for specific patterns") + print("8. 📚 Export cleanup plan") + print("9. ❌ Exit") + + choice = input("\nSelect option (1-9): ").strip() + + if choice == '1': + self._show_exact_duplicates() + elif choice == '2': + self._show_similarity_clusters() + elif choice == '3': + self._show_cleanup_priorities() + elif choice == '4': + self._launch_cleanup_agents() + elif choice == '5': + self._generate_full_report() + elif choice == '6': + self._quick_wins() + elif choice == '7': + self._search_patterns() + elif choice == '8': + self._export_cleanup_plan() + elif choice == '9': + print("👋 Cleanup workflow complete!") + break + else: + print("❌ Invalid option. Please select 1-9.") + + def _show_exact_duplicates(self): + """Display exact duplicates with action options.""" + exact_duplicates = self.analysis['exact_duplicates'] + + if not exact_duplicates: + print("\n✅ No exact duplicates found!") + return + + print(f"\n🚨 Exact Duplicates ({len(exact_duplicates)} groups):") + print("-" * 50) + + for i, dup in enumerate(exact_duplicates[:10]): # Show top 10 + print(f"\n{i+1}. Duplicate Group (Impact: {dup['impact_score']:.1f})") + print(f" Count: {dup['count']} identical functions") + print(" Functions:") + for func in dup['functions']: + print(f" • {func['function_id']}") + if 'complexity' in func and func['complexity']: + complexity = func['complexity'].get('cyclomatic', 'unknown') + print(f" Complexity: {complexity}") + + # Action menu for exact duplicates + self._exact_duplicate_actions() + + def _exact_duplicate_actions(self): + """Action menu for exact duplicates.""" + print(f"\n🔧 Exact Duplicate Actions:") + print("1. 🤖 Launch duplicate-eliminator agent") + print("2. 📝 Create cleanup task list") + print("3. 🔍 Inspect specific duplicate group") + print("4. 
⬅️ Back to main menu") + + choice = input("Select action (1-4): ").strip() + + if choice == '1': + self._launch_duplicate_eliminator() + elif choice == '2': + self._create_cleanup_tasks() + elif choice == '3': + self._inspect_duplicate_group() + elif choice == '4': + return + + def _show_similarity_clusters(self): + """Display similarity clusters.""" + clusters = self.analysis['similarity_clusters'] + + if not clusters: + print("\n✅ No similarity clusters found!") + return + + print(f"\n📊 Similarity Clusters ({len(clusters)} clusters):") + print("-" * 50) + + for i, cluster in enumerate(clusters[:10]): + print(f"\n{i+1}. Similarity Cluster (Impact: {cluster['impact_score']:.1f})") + print(f" Count: {cluster['count']} similar functions") + print(f" Average Similarity: {cluster['average_similarity']*100:.1f}%") + print(" Functions:") + for func in cluster['functions'][:3]: # Show first 3 + print(f" • {func['function_id']}") + if len(cluster['functions']) > 3: + print(f" ... and {len(cluster['functions']) - 3} more") + + def _show_cleanup_priorities(self): + """Display cleanup priorities.""" + priorities = self.analysis['cleanup_priorities'] + + if not priorities: + print("\n✅ No cleanup priorities identified!") + return + + print(f"\n📋 Cleanup Priorities ({len(priorities)} items):") + print("-" * 50) + + high_priority = [p for p in priorities if p['priority'] == 'HIGH'] + medium_priority = [p for p in priorities if p['priority'] == 'MEDIUM'] + + if high_priority: + print("\n🔴 HIGH PRIORITY:") + for i, item in enumerate(high_priority[:5]): + print(f" {i+1}. {item['description']}") + print(f" Effort: {item['effort']}, Impact: {item['impact_score']:.1f}") + + if medium_priority: + print("\n🟡 MEDIUM PRIORITY:") + for i, item in enumerate(medium_priority[:5]): + print(f" {i+1}. 
{item['description']}") + print(f" Effort: {item['effort']}, Impact: {item['impact_score']:.1f}") + + def _launch_cleanup_agents(self): + """Launch specialized cleanup sub-agents.""" + print(f"\n🤖 Available Cleanup Agents:") + print("1. duplicate-eliminator - Remove exact duplicates") + print("2. utility-extractor - Extract shared utilities") + print("3. refactoring-advisor - Architectural guidance") + print("4. ⬅️ Back to main menu") + + choice = input("Select agent (1-4): ").strip() + + if choice == '1': + self._launch_duplicate_eliminator() + elif choice == '2': + self._launch_utility_extractor() + elif choice == '3': + self._launch_refactoring_advisor() + elif choice == '4': + return + + def _launch_duplicate_eliminator(self): + """Launch the duplicate eliminator agent.""" + exact_duplicates = self.analysis['exact_duplicates'] + + if not exact_duplicates: + print("❌ No exact duplicates to eliminate!") + return + + print("\n🤖 Launching duplicate-eliminator agent...") + print("📋 Task: Eliminate exact duplicate functions") + print(f"📊 Scope: {len(exact_duplicates)} duplicate groups") + + # Create task description for the agent + task_description = self._create_eliminator_task() + print(f"\n📝 Agent Task:") + print(task_description) + + print("\n💡 To launch the agent in Claude Code:") + print("Use: Task tool with subagent_type='duplicate-eliminator'") + print(f"Prompt: {task_description}") + + def _launch_utility_extractor(self): + """Launch the utility extractor agent.""" + clusters = self.analysis['similarity_clusters'] + high_similarity = [c for c in clusters if c['average_similarity'] > 0.8] + + if not high_similarity: + print("❌ No high-similarity clusters to extract utilities from!") + return + + print("\n🤖 Launching utility-extractor agent...") + print("📋 Task: Extract shared utilities from similar code") + print(f"📊 Scope: {len(high_similarity)} similarity clusters") + + task_description = self._create_extractor_task(high_similarity) + print(f"\n📝 Agent 
Task:") + print(task_description) + + print("\n💡 To launch the agent in Claude Code:") + print("Use: Task tool with subagent_type='utility-extractor'") + print(f"Prompt: {task_description}") + + def _launch_refactoring_advisor(self): + """Launch the refactoring advisor agent.""" + priorities = self.analysis['cleanup_priorities'] + high_impact = [p for p in priorities if p['impact_score'] > 30] + + print("\n🤖 Launching refactoring-advisor agent...") + print("📋 Task: Provide architectural guidance for complex refactoring") + print(f"📊 Scope: {len(high_impact)} high-impact items") + + task_description = self._create_advisor_task(high_impact) + print(f"\n📝 Agent Task:") + print(task_description) + + print("\n💡 To launch the agent in Claude Code:") + print("Use: Task tool with subagent_type='refactoring-advisor'") + print(f"Prompt: {task_description}") + + def _create_eliminator_task(self) -> str: + """Create task description for duplicate eliminator.""" + exact_duplicates = self.analysis['exact_duplicates'] + top_duplicates = exact_duplicates[:3] # Focus on top 3 + + task = "Eliminate the following exact duplicate functions:\n\n" + for i, dup in enumerate(top_duplicates): + task += f"{i+1}. Duplicate Group ({dup['count']} functions):\n" + for func in dup['functions']: + task += f" - {func['function_id']}\n" + task += f" Impact Score: {dup['impact_score']:.1f}\n\n" + + task += "For each group:\n" + task += "1. Analyze the duplicate functions to confirm they're identical\n" + task += "2. Choose the best implementation or create a new shared utility\n" + task += "3. Replace all duplicates with calls to the shared implementation\n" + task += "4. Update imports and ensure tests pass\n" + task += "5. Remove the now-unused duplicate functions\n\n" + task += "Focus on the highest impact groups first." 
+ + return task + + def _create_extractor_task(self, clusters: List[Dict]) -> str: + """Create task description for utility extractor.""" + top_clusters = clusters[:3] + + task = "Extract shared utilities from the following similarity clusters:\n\n" + for i, cluster in enumerate(top_clusters): + task += f"{i+1}. Similarity Cluster ({cluster['count']} functions, {cluster['average_similarity']*100:.1f}% similar):\n" + for func in cluster['functions']: + task += f" - {func['function_id']}\n" + task += f" Impact Score: {cluster['impact_score']:.1f}\n\n" + + task += "For each cluster:\n" + task += "1. Analyze the similar functions to identify patterns and variations\n" + task += "2. Design a configurable utility that handles all variations\n" + task += "3. Create the new utility with proper configuration options\n" + task += "4. Replace similar functions with calls to the new utility\n" + task += "5. Add comprehensive tests for the new utility\n\n" + task += "Prioritize clusters with the highest similarity and impact scores." + + return task + + def _create_advisor_task(self, items: List[Dict]) -> str: + """Create task description for refactoring advisor.""" + task = "Provide architectural guidance for the following high-impact duplicate elimination:\n\n" + + task += "High-Impact Items:\n" + for i, item in enumerate(items[:5]): + task += f"{i+1}. {item['description']}\n" + task += f" Priority: {item['priority']}, Effort: {item['effort']}\n" + task += f" Impact Score: {item['impact_score']:.1f}\n\n" + + task += "Please provide:\n" + task += "1. Risk assessment for each item\n" + task += "2. Recommended refactoring approach and patterns\n" + task += "3. Implementation sequence and dependencies\n" + task += "4. Potential architectural improvements\n" + task += "5. Testing and validation strategies\n\n" + task += "Focus on maximizing long-term maintainability while minimizing implementation risk." 
+ + return task + + def _generate_full_report(self): + """Generate and display full markdown report.""" + report = self.report_generator.generate_report('markdown') + + # Save to file + report_path = self.project_root / 'DUPLICATE_ANALYSIS_REPORT.md' + with open(report_path, 'w') as f: + f.write(report) + + print(f"\n📄 Full report generated: {report_path}") + print("🔗 Open this file to see detailed analysis and recommendations") + + # Show first few lines + lines = report.split('\n') + print("\n📋 Report Preview:") + print("-" * 30) + for line in lines[:15]: + print(line) + print("...") + + def _quick_wins(self): + """Identify and potentially auto-fix easy duplicates.""" + exact_duplicates = self.analysis['exact_duplicates'] + + # Find low-effort, high-impact duplicates + quick_wins = [ + dup for dup in exact_duplicates + if dup['impact_score'] > 20 and dup['count'] <= 3 + ] + + if not quick_wins: + print("\n❌ No obvious quick wins identified.") + print("💡 Consider using the full cleanup workflow for better results.") + return + + print(f"\n⚡ Quick Wins Identified ({len(quick_wins)} groups):") + print("-" * 40) + + for i, win in enumerate(quick_wins): + print(f"\n{i+1}. 
{win['count']} duplicate functions (Impact: {win['impact_score']:.1f})") + for func in win['functions']: + print(f" • {func['function_id']}") + + print(f"\n💡 These are good candidates for automated elimination.") + print("🤖 Use the duplicate-eliminator agent to handle these efficiently.") + + def _search_patterns(self): + """Search for specific duplicate patterns.""" + pattern = input("\n🔍 Enter search pattern (function name or keyword): ").strip() + + if not pattern: + return + + # Search in exact duplicates + matching_duplicates = [] + for dup in self.analysis['exact_duplicates']: + for func in dup['functions']: + if pattern.lower() in func['function_id'].lower(): + matching_duplicates.append((dup, func)) + + # Search in similarity clusters + matching_clusters = [] + for cluster in self.analysis['similarity_clusters']: + for func in cluster['functions']: + if pattern.lower() in func['function_id'].lower(): + matching_clusters.append((cluster, func)) + + print(f"\n🎯 Search Results for '{pattern}':") + print("-" * 40) + + if matching_duplicates: + print(f"\nExact Duplicates ({len(matching_duplicates)} matches):") + for dup, func in matching_duplicates: + print(f" • {func['function_id']} (group of {dup['count']})") + + if matching_clusters: + print(f"\nSimilarity Clusters ({len(matching_clusters)} matches):") + for cluster, func in matching_clusters: + print(f" • {func['function_id']} ({cluster['average_similarity']*100:.1f}% similar cluster)") + + if not matching_duplicates and not matching_clusters: + print("❌ No matches found.") + + def _export_cleanup_plan(self): + """Export a structured cleanup plan.""" + plan_path = self.project_root / 'CLEANUP_PLAN.md' + + plan = self._generate_cleanup_plan() + + with open(plan_path, 'w') as f: + f.write(plan) + + print(f"\n📋 Cleanup plan exported: {plan_path}") + print("🔗 Use this plan to track cleanup progress") + + def _generate_cleanup_plan(self) -> str: + """Generate structured cleanup plan.""" + priorities = 
self.analysis['cleanup_priorities'] + + plan = "# Duplicate Code Cleanup Plan\n\n" + plan += f"Generated: {self.analysis['analysis_timestamp']}\n\n" + + plan += "## Phase 1: High Priority Items\n\n" + high_priority = [p for p in priorities if p['priority'] == 'HIGH'] + for i, item in enumerate(high_priority): + plan += f"### Task {i+1}: {item['description']}\n" + plan += f"- **Effort**: {item['effort']}\n" + plan += f"- **Impact**: {item['impact_score']:.1f}\n" + plan += f"- **Type**: {item['type']}\n" + plan += "- **Status**: ⏳ Pending\n\n" + + plan += "## Phase 2: Medium Priority Items\n\n" + medium_priority = [p for p in priorities if p['priority'] == 'MEDIUM'] + for i, item in enumerate(medium_priority): + plan += f"### Task {i+1}: {item['description']}\n" + plan += f"- **Effort**: {item['effort']}\n" + plan += f"- **Impact**: {item['impact_score']:.1f}\n" + plan += "- **Status**: ⏳ Pending\n\n" + + plan += "## Recommended Tools\n\n" + for tool in self.analysis['recommendations']['tools_needed']: + plan += f"- {tool}\n" + + plan += "\n## Progress Tracking\n\n" + plan += "Update the status of each task as you complete them:\n" + plan += "- ⏳ Pending\n" + plan += "- 🟡 In Progress\n" + plan += "- ✅ Completed\n" + plan += "- ❌ Skipped\n" + + return plan + + def _inspect_duplicate_group(self): + """Inspect a specific duplicate group in detail.""" + exact_duplicates = self.analysis['exact_duplicates'] + + if not exact_duplicates: + print("❌ No exact duplicates to inspect!") + return + + print(f"\nAvailable duplicate groups:") + for i, dup in enumerate(exact_duplicates[:10]): + print(f"{i+1}. 
{dup['count']} functions (Impact: {dup['impact_score']:.1f})") + + try: + choice = int(input(f"\nSelect group to inspect (1-{min(10, len(exact_duplicates))}): ")) + if 1 <= choice <= min(10, len(exact_duplicates)): + group = exact_duplicates[choice - 1] + self._show_duplicate_group_details(group) + else: + print("❌ Invalid selection.") + except ValueError: + print("❌ Please enter a valid number.") + + def _show_duplicate_group_details(self, group: Dict[str, Any]): + """Show detailed information about a duplicate group.""" + print(f"\n🔍 Duplicate Group Details") + print("=" * 30) + print(f"Type: {group['type']}") + print(f"Count: {group['count']} functions") + print(f"Impact Score: {group['impact_score']:.1f}") + print(f"AST Fingerprint: {group['fingerprint'][:16]}...") + + print(f"\n📋 Functions in this group:") + for func in group['functions']: + print(f"\n 📄 {func['function_id']}") + print(f" File: {func['file_path']}") + print(f" Signature: {func['signature']}") + if 'complexity' in func and func['complexity']: + complexity = func['complexity'].get('cyclomatic', 'unknown') + print(f" Complexity: {complexity}") + + print(f"\n💡 Recommended Action:") + print("1. Review all functions to confirm they're truly identical") + print("2. Choose the best implementation or create a new utility") + print("3. Extract to shared location (e.g., utils/ directory)") + print("4. Replace all occurrences with calls to shared implementation") + print("5. 
Remove duplicate functions and update tests") + + def _create_cleanup_tasks(self): + """Create cleanup tasks in todo format.""" + exact_duplicates = self.analysis['exact_duplicates'] + + if not exact_duplicates: + print("❌ No exact duplicates to create tasks for!") + return + + print(f"\n📝 Creating cleanup tasks for {len(exact_duplicates)} duplicate groups...") + + tasks = [] + for i, dup in enumerate(exact_duplicates): + task = f"Eliminate duplicate group {i+1}: {dup['count']} identical functions (Impact: {dup['impact_score']:.1f})" + tasks.append(task) + + # Save tasks to file + tasks_path = self.project_root / 'CLEANUP_TASKS.md' + with open(tasks_path, 'w') as f: + f.write("# Duplicate Cleanup Tasks\n\n") + for i, task in enumerate(tasks): + f.write(f"- [ ] {task}\n") + + print(f"✅ Tasks saved to: {tasks_path}") + print("🔗 Track your progress by checking off completed tasks") + + +def main(): + """Main entry point for interactive cleanup.""" + import argparse + + parser = argparse.ArgumentParser(description='Interactive duplicate code cleanup') + parser.add_argument('--project-root', default='.', help='Project root directory') + + args = parser.parse_args() + + # Run interactive cleanup + cleanup = InteractiveCleanup(args.project_root) + cleanup.run_cleanup_workflow() + + +if __name__ == '__main__': + main() \ No newline at end of file diff --git a/scripts/load_architecture_context.py b/scripts/load_architecture_context.py new file mode 100644 index 0000000..daa8a7c --- /dev/null +++ b/scripts/load_architecture_context.py @@ -0,0 +1,299 @@ +#!/usr/bin/env python3 +""" +SessionStart hook to load architectural context and patterns. +Provides Claude with project-specific architectural information at session start. 
+""" + +import json +import sys +import os +from pathlib import Path +from typing import Dict, List, Any, Optional + + +class ArchitectureContextLoader: + """Loads and formats architectural context for Claude sessions.""" + + def __init__(self, project_root: str): + self.project_root = Path(project_root) + self.index_path = self.project_root / 'PROJECT_INDEX.json' + self.index_data = None + self.load_index() + + def load_index(self): + """Load the project index.""" + if not self.index_path.exists(): + self.index_data = {} + return + + try: + with open(self.index_path, 'r') as f: + self.index_data = json.load(f) + except Exception as e: + print(f"Warning: Could not load index: {e}", file=sys.stderr) + self.index_data = {} + + def generate_context(self) -> str: + """Generate architectural context for Claude session.""" + if not self.index_data: + return "No architectural context available." + + context_parts = [] + + # Project overview + context_parts.append("## Project Architecture & Patterns") + context_parts.append(self._get_project_overview()) + + # Semantic analysis information + if 'semantic_index' in self.index_data: + context_parts.append("\n## Code Quality & Patterns") + context_parts.append(self._get_semantic_context()) + + # Directory structure and organization + context_parts.append("\n## Project Organization") + context_parts.append(self._get_organization_context()) + + # Known duplicate clusters and similar code + if 'semantic_index' in self.index_data: + similarity_context = self._get_similarity_context() + if similarity_context: + context_parts.append("\n## Known Code Patterns & Duplicates") + context_parts.append(similarity_context) + + # Recent changes and development patterns + context_parts.append("\n## Development Guidelines") + context_parts.append(self._get_development_guidelines()) + + return "\n".join(context_parts) + + def _get_project_overview(self) -> str: + """Get basic project overview information.""" + overview = [] + + stats = 
self.index_data.get('stats', {}) + if stats: + overview.append(f"📊 **Project Size:** {stats.get('total_files', 0)} files across {stats.get('total_directories', 0)} directories") + + parsed_langs = stats.get('fully_parsed', {}) + if parsed_langs: + lang_counts = [f"{lang}: {count}" for lang, count in parsed_langs.items()] + overview.append(f"🔧 **Languages:** {', '.join(lang_counts)}") + + # Check for common project types + files = self.index_data.get('files', {}) + project_indicators = [] + + if any('package.json' in f for f in files.keys()): + project_indicators.append("Node.js/JavaScript") + if any('requirements.txt' in f or 'pyproject.toml' in f for f in files.keys()): + project_indicators.append("Python") + if any('Cargo.toml' in f for f in files.keys()): + project_indicators.append("Rust") + if any('go.mod' in f for f in files.keys()): + project_indicators.append("Go") + + if project_indicators: + overview.append(f"🏗️ **Project Type:** {', '.join(project_indicators)}") + + return "\n".join(overview) if overview else "Standard project structure detected." + + def _get_semantic_context(self) -> str: + """Get semantic analysis context.""" + semantic_index = self.index_data.get('semantic_index', {}) + if not semantic_index: + return "No semantic analysis available." 
+ + context = [] + + # Architectural patterns + arch_patterns = semantic_index.get('architectural_patterns', {}) + if arch_patterns: + naming = arch_patterns.get('naming_conventions', {}) + if naming: + context.append(f"📝 **Naming Convention:** {naming.get('functions', 'Mixed styles detected')}") + + design_patterns = arch_patterns.get('design_patterns', []) + if design_patterns: + context.append(f"🎨 **Design Patterns:** {', '.join(design_patterns)}") + + # Complexity analysis + complexity = semantic_index.get('complexity_analysis', {}) + if complexity: + total_funcs = complexity.get('total_functions', 0) + avg_complexity = complexity.get('average_cyclomatic_complexity', 0) + high_complexity = complexity.get('high_complexity_functions', []) + + context.append(f"⚡ **Code Complexity:** {total_funcs} functions, avg complexity: {avg_complexity:.1f}") + + if high_complexity: + context.append(f"⚠️ **High Complexity Functions:** {len(high_complexity)} functions need attention") + + # Function patterns + functions = semantic_index.get('functions', {}) + if functions: + patterns = {} + for func_data in functions.values(): + func_patterns = func_data.get('patterns', []) + for pattern in func_patterns: + patterns[pattern] = patterns.get(pattern, 0) + 1 + + if patterns: + top_patterns = sorted(patterns.items(), key=lambda x: x[1], reverse=True)[:5] + pattern_strs = [f"{pattern} ({count})" for pattern, count in top_patterns] + context.append(f"🔍 **Common Patterns:** {', '.join(pattern_strs)}") + + return "\n".join(context) if context else "Semantic analysis completed - no specific patterns detected." 
+ + def _get_organization_context(self) -> str: + """Get project organization and directory structure context.""" + context = [] + + # Directory purposes + dir_purposes = self.index_data.get('directory_purposes', {}) + if dir_purposes: + context.append("📁 **Directory Structure:**") + for directory, purpose in sorted(dir_purposes.items()): + context.append(f" • `{directory}/` - {purpose}") + + # File organization patterns + files = self.index_data.get('files', {}) + if files: + # Analyze file organization + test_files = [f for f in files.keys() if 'test' in f.lower() or 'spec' in f.lower()] + config_files = [f for f in files.keys() if 'config' in f.lower() or f.endswith('.json')] + + if test_files: + context.append(f"🧪 **Testing:** {len(test_files)} test files found") + if config_files: + context.append(f"⚙️ **Configuration:** {len(config_files)} config files") + + return "\n".join(context) if context else "Standard file organization." + + def _get_similarity_context(self) -> Optional[str]: + """Get information about code similarity clusters and patterns.""" + semantic_index = self.index_data.get('semantic_index', {}) + similarity_clusters = semantic_index.get('similarity_clusters', []) + + if not similarity_clusters: + return None + + context = [] + + # Show top similarity clusters + high_similarity_clusters = [ + cluster for cluster in similarity_clusters + if len(cluster.get('functions', [])) > 2 + ] + + if high_similarity_clusters: + context.append("🔗 **Similar Code Clusters Found:**") + for i, cluster in enumerate(high_similarity_clusters[:3]): # Top 3 clusters + functions = cluster.get('functions', []) + pattern = cluster.get('pattern', 'similar_implementation') + context.append(f" • Cluster {i+1}: {len(functions)} similar functions ({pattern})") + + # Show first few functions in cluster + if len(functions) > 0: + sample_funcs = functions[:3] + context.append(f" Examples: {', '.join(sample_funcs)}") + + # Show functions with high complexity + functions = 
semantic_index.get('functions', {}) + complex_functions = [] + for func_id, func_data in functions.items(): + complexity = func_data.get('complexity', {}) + if complexity.get('cyclomatic', 0) > 8: # High complexity threshold + complex_functions.append((func_id, complexity.get('cyclomatic', 0))) + + if complex_functions: + context.append("\n⚠️ **High Complexity Functions to Watch:**") + complex_functions.sort(key=lambda x: x[1], reverse=True) + for func_id, complexity in complex_functions[:3]: + context.append(f" • {func_id} (complexity: {complexity})") + + return "\n".join(context) if context else None + + def _get_development_guidelines(self) -> str: + """Generate development guidelines based on project analysis.""" + guidelines = [] + + # Based on semantic analysis + semantic_index = self.index_data.get('semantic_index', {}) + if semantic_index: + arch_patterns = semantic_index.get('architectural_patterns', {}) + + # Naming conventions + naming = arch_patterns.get('naming_conventions', {}) + if naming.get('functions') == 'snake_case': + guidelines.append("🐍 Use `snake_case` for function names (project standard)") + elif naming.get('functions') == 'camelCase': + guidelines.append("🐪 Use `camelCase` for function names (project standard)") + + # Directory patterns + dir_patterns = arch_patterns.get('directory_patterns', {}) + if dir_patterns.get('test_separation'): + guidelines.append("🧪 Keep tests in separate directories (project pattern)") + if dir_patterns.get('utility_separation'): + guidelines.append("🔧 Place utility functions in dedicated utils directories") + + # Duplicate detection advice + similarity_clusters = semantic_index.get('similarity_clusters', []) + if len(similarity_clusters) > 2: + guidelines.append("⚠️ **Duplicate Detection Active** - Check for existing implementations before writing new functions") + + # Complexity guidelines + complexity = semantic_index.get('complexity_analysis', {}) + if complexity: + avg_complexity = 
complexity.get('average_cyclomatic_complexity', 0) + if avg_complexity > 5: + guidelines.append(f"📊 Keep function complexity under {int(avg_complexity + 2)} (current avg: {avg_complexity:.1f})") + + # Default guidelines + guidelines.extend([ + "🔍 **Before implementing:** Search for existing similar functionality", + "🏗️ **Architecture:** Follow established patterns in the codebase", + "📝 **Naming:** Use descriptive names that match project conventions" + ]) + + return "\n".join(guidelines) + + def format_for_claude(self) -> Dict[str, Any]: + """Format the context for Claude Code hook output.""" + context = self.generate_context() + + return { + "hookSpecificOutput": { + "hookEventName": "SessionStart", + "additionalContext": context + } + } + + +def main(): + """Main hook entry point.""" + try: + # Read hook input from stdin + input_data = json.load(sys.stdin) + except json.JSONDecodeError as e: + print(f"Error: Invalid JSON input: {e}", file=sys.stderr) + sys.exit(1) + except Exception as e: + print(f"Error reading input: {e}", file=sys.stderr) + sys.exit(1) + + # Get project directory + project_dir = os.environ.get('CLAUDE_PROJECT_DIR') + if not project_dir: + project_dir = os.getcwd() + + # Load architectural context + loader = ArchitectureContextLoader(project_dir) + output = loader.format_for_claude() + + # Output the context for Claude + print(json.dumps(output)) + sys.exit(0) + + +if __name__ == '__main__': + main() \ No newline at end of file diff --git a/scripts/neural_embeddings.py b/scripts/neural_embeddings.py new file mode 100644 index 0000000..4c7da85 --- /dev/null +++ b/scripts/neural_embeddings.py @@ -0,0 +1,281 @@ +#!/usr/bin/env python3 +""" +Simple neural embeddings for code using Ollama + nomic-embed-text +No over-engineering - just practical semantic code search and analysis +""" + +import json +import sys +import os +import requests +from pathlib import Path +from typing import List, Dict, Any +import numpy as np +from 
sklearn.metrics.pairwise import cosine_similarity + +OLLAMA_URL = "http://127.0.0.1:11434" +EMBEDDING_MODEL = "nomic-embed-text" + +def get_embedding(text: str) -> List[float]: + """Get embedding from Ollama""" + try: + response = requests.post(f"{OLLAMA_URL}/api/embeddings", json={ + "model": EMBEDDING_MODEL, + "prompt": text + }) + if response.status_code == 200: + return response.json()["embedding"] + else: + print(f"Error getting embedding: {response.status_code}") + return [] + except Exception as e: + print(f"Error connecting to Ollama: {e}") + return [] + +def extract_code_chunks(index_data: Dict) -> List[Dict]: + """Extract meaningful code chunks from PROJECT_INDEX.json""" + chunks = [] + + for file_path, file_info in index_data.get("files", {}).items(): + if not file_info.get("parsed", False): + continue + + # Add functions + for func_name, func_info in file_info.get("functions", {}).items(): + chunks.append({ + "type": "function", + "file": file_path, + "name": func_name, + "signature": func_info.get("signature", ""), + "doc": func_info.get("doc", ""), + "content": f"Function {func_name} in {file_path}: {func_info.get('doc', '')}" + }) + + # Add classes + for class_name, class_info in file_info.get("classes", {}).items(): + chunks.append({ + "type": "class", + "file": file_path, + "name": class_name, + "doc": class_info.get("doc", ""), + "content": f"Class {class_name} in {file_path}: {class_info.get('doc', '')}" + }) + + # Add methods + for method_name, method_info in class_info.get("methods", {}).items(): + chunks.append({ + "type": "method", + "file": file_path, + "class": class_name, + "name": method_name, + "signature": method_info.get("signature", ""), + "doc": method_info.get("doc", ""), + "content": f"Method {class_name}.{method_name} in {file_path}: {method_info.get('doc', '')}" + }) + + return chunks + +def build_embeddings(): + """Build neural embeddings index""" + print("🧠 Loading PROJECT_INDEX.json...") + + if not 
os.path.exists("PROJECT_INDEX.json"): + print("❌ No PROJECT_INDEX.json found. Run /semantic-index first.") + return + + with open("PROJECT_INDEX.json", "r") as f: + index_data = json.load(f) + + print("📝 Extracting code chunks...") + chunks = extract_code_chunks(index_data) + print(f"Found {len(chunks)} code chunks") + + print("🚀 Generating embeddings...") + embeddings_data = [] + + for i, chunk in enumerate(chunks): + if i % 10 == 0: + print(f"Progress: {i}/{len(chunks)}") + + embedding = get_embedding(chunk["content"]) + if embedding: + embeddings_data.append({ + "chunk": chunk, + "embedding": embedding + }) + + # Save to NEURAL_INDEX.json + neural_index = { + "model": EMBEDDING_MODEL, + "created_at": index_data.get("indexed_at"), + "total_chunks": len(embeddings_data), + "embeddings": embeddings_data + } + + with open("NEURAL_INDEX.json", "w") as f: + json.dump(neural_index, f, indent=2) + + print(f"✅ Neural index saved with {len(embeddings_data)} embeddings") + +def semantic_search(query: str, top_k: int = 5): + """Search for code using natural language""" + if not os.path.exists("NEURAL_INDEX.json"): + print("❌ No neural index found. 
Run /embedded-index build first.") + return + + print(f"🔍 Searching for: {query}") + + # Get query embedding + query_embedding = get_embedding(query) + if not query_embedding: + print("❌ Failed to get query embedding") + return + + # Load neural index + with open("NEURAL_INDEX.json", "r") as f: + neural_index = json.load(f) + + # Calculate similarities + similarities = [] + for item in neural_index["embeddings"]: + chunk_embedding = item["embedding"] + similarity = cosine_similarity([query_embedding], [chunk_embedding])[0][0] + similarities.append({ + "similarity": similarity, + "chunk": item["chunk"] + }) + + # Sort by similarity + similarities.sort(key=lambda x: x["similarity"], reverse=True) + + print(f"\n🎯 Top {top_k} results:") + for i, result in enumerate(similarities[:top_k]): + chunk = result["chunk"] + sim_score = result["similarity"] + print(f"\n{i+1}. {chunk['type'].title()}: {chunk['name']} (similarity: {sim_score:.3f})") + print(f" 📁 {chunk['file']}") + if chunk.get("doc"): + print(f" 📝 {chunk['doc'][:100]}...") + +def find_similar_functions(target_function: str, top_k: int = 5): + """Find functions similar to a specific function""" + if not os.path.exists("NEURAL_INDEX.json"): + print("❌ No neural index found. 
Run /embedded-index build first.") + return + + with open("NEURAL_INDEX.json", "r") as f: + neural_index = json.load(f) + + # Find target function + target_chunk = None + for item in neural_index["embeddings"]: + if item["chunk"]["name"] == target_function: + target_chunk = item + break + + if not target_chunk: + print(f"❌ Function '{target_function}' not found") + return + + target_embedding = target_chunk["embedding"] + print(f"🎯 Finding functions similar to: {target_function}") + + # Calculate similarities + similarities = [] + for item in neural_index["embeddings"]: + if item["chunk"]["name"] == target_function: + continue # Skip self + + chunk_embedding = item["embedding"] + similarity = cosine_similarity([target_embedding], [chunk_embedding])[0][0] + similarities.append({ + "similarity": similarity, + "chunk": item["chunk"] + }) + + # Sort and show results + similarities.sort(key=lambda x: x["similarity"], reverse=True) + + print(f"\n🔗 Most similar functions:") + for i, result in enumerate(similarities[:top_k]): + chunk = result["chunk"] + sim_score = result["similarity"] + print(f"\n{i+1}. {chunk['name']} (similarity: {sim_score:.3f})") + print(f" 📁 {chunk['file']}") + if chunk.get("doc"): + print(f" 📝 {chunk['doc'][:80]}...") + +def analyze_semantic_clusters(): + """Find semantic clusters in the codebase""" + if not os.path.exists("NEURAL_INDEX.json"): + print("❌ No neural index found. 
Run /embedded-index build first.") + return + + with open("NEURAL_INDEX.json", "r") as f: + neural_index = json.load(f) + + print("🔬 Analyzing semantic clusters...") + + # Simple clustering: find high-similarity pairs + clusters = [] + processed = set() + + for i, item1 in enumerate(neural_index["embeddings"]): + if i in processed: + continue + + cluster = [item1] + processed.add(i) + + for j, item2 in enumerate(neural_index["embeddings"][i+1:], i+1): + if j in processed: + continue + + similarity = cosine_similarity([item1["embedding"]], [item2["embedding"]])[0][0] + + if similarity > 0.8: # High similarity threshold + cluster.append(item2) + processed.add(j) + + if len(cluster) > 1: + clusters.append(cluster) + + print(f"\n📊 Found {len(clusters)} semantic clusters:") + + for i, cluster in enumerate(clusters): + print(f"\n🔗 Cluster {i+1} ({len(cluster)} items):") + for item in cluster: + chunk = item["chunk"] + print(f" • {chunk['type']}: {chunk['name']} ({chunk['file']})") + +def main(): + import argparse + parser = argparse.ArgumentParser(description="Neural embeddings for code") + parser.add_argument("--build", action="store_true", help="Build embeddings index") + parser.add_argument("--search", type=str, help="Semantic search query") + parser.add_argument("--similar", type=str, help="Find similar functions") + parser.add_argument("--analyze", action="store_true", help="Analyze semantic patterns") + parser.add_argument("--clusters", action="store_true", help="Show semantic clusters") + + args = parser.parse_args() + + if args.build: + build_embeddings() + elif args.search: + semantic_search(args.search) + elif args.similar: + find_similar_functions(args.similar) + elif args.analyze: + print("🧠 Neural Semantic Analysis") + analyze_semantic_clusters() + elif args.clusters: + analyze_semantic_clusters() + else: + print("Neural Embeddings Tools") + print("--build: Generate embeddings") + print("--search 'query': Semantic search") + print("--similar func_name: Find 
similar functions") + print("--analyze: Show semantic clusters") + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/scripts/project_index.py b/scripts/project_index.py index 27964f0..1cb2227 100755 --- a/scripts/project_index.py +++ b/scripts/project_index.py @@ -103,12 +103,10 @@ def add_tree_level(path: Path, prefix: str = "", depth: int = 0): return tree_lines -# These functions are now imported from index_utils - - def build_index(root_dir: str) -> Tuple[Dict, int]: """Build the enhanced index with architectural awareness.""" root = Path(root_dir) + index = { 'indexed_at': datetime.now().isoformat(), 'root': str(root), @@ -131,7 +129,11 @@ def build_index(root_dir: str) -> Tuple[Dict, int]: } # Generate directory tree - print("📊 Building directory tree...") + quiet = os.getenv('QUIET_MODE') + verbose = os.getenv('VERBOSE_MODE') + + if not quiet: + print("📊 Building directory tree...") index['project_structure']['tree'] = generate_tree_structure(root) file_count = 0 @@ -140,13 +142,15 @@ def build_index(root_dir: str) -> Tuple[Dict, int]: directory_files = {} # Track files per directory # Try to use git ls-files for better performance and accuracy - print("🔍 Indexing files...") + if not quiet: + print("🔍 Indexing files...") from index_utils import get_git_files git_files = get_git_files(root) if git_files is not None: # Use git-based file discovery - print(f" Using git ls-files (found {len(git_files)} files)") + if verbose: + print(f" Using git ls-files (found {len(git_files)} files)") files_to_process = git_files # Count directories from git files @@ -160,7 +164,8 @@ def build_index(root_dir: str) -> Tuple[Dict, int]: dir_count = len(seen_dirs) else: # Fallback to manual file discovery - print(" Using manual file discovery (git not available)") + if verbose: + print(" Using manual file discovery (git not available)") files_to_process = [] for file_path in root.rglob('*'): if file_path.is_dir(): @@ -254,11 +259,12 @@ def 
build_index(root_dir: str) -> Tuple[Dict, int]: file_count += 1 # Progress indicator every 100 files - if file_count % 100 == 0: + if file_count % 100 == 0 and verbose: print(f" Indexed {file_count} files...") # Infer directory purposes - print("🏗️ Analyzing directory purposes...") + if not quiet: + print("🏗️ Analyzing directory purposes...") for dir_path, files in directory_files.items(): if files: # Only process directories with files purpose = infer_directory_purpose(dir_path, files) @@ -271,7 +277,8 @@ def build_index(root_dir: str) -> Tuple[Dict, int]: index['stats']['total_directories'] = dir_count # Build dependency graph - print("🔗 Building dependency graph...") + if not quiet: + print("🔗 Building dependency graph...") dependency_graph = {} for file_path, file_info in index['files'].items(): @@ -318,7 +325,8 @@ def build_index(root_dir: str) -> Tuple[Dict, int]: index['dependency_graph'] = dependency_graph # Build bidirectional call graph - print("📞 Building call graph...") + if not quiet: + print("📞 Building call graph...") call_graph = {} called_by_graph = {} @@ -398,9 +406,6 @@ def build_index(root_dir: str) -> Tuple[Dict, int]: return index, skipped_count -# infer_file_purpose is now imported from index_utils - - def convert_to_enhanced_dense_format(index: Dict) -> Dict: """Convert to enhanced dense format that preserves all AI-relevant information.""" dense = { @@ -707,7 +712,9 @@ def print_summary(index: Dict, skipped_count: int): def main(): """Run the enhanced indexer.""" - print("🚀 Building Project Index...") + quiet = os.getenv('QUIET_MODE') + if not quiet: + print("🚀 Building Project Index...") # Check for target size from environment target_size_k = int(os.getenv('INDEX_TARGET_SIZE_K', '0')) @@ -763,7 +770,69 @@ def main(): if __name__ == '__main__': import sys - if len(sys.argv) > 1 and sys.argv[1] == '--version': - print(f"PROJECT_INDEX v{__version__}") - sys.exit(0) - main() \ No newline at end of file + import argparse + + parser = 
argparse.ArgumentParser( + description='Generate PROJECT_INDEX.json for code analysis', + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=''' +Examples: + %(prog)s # Generate standard index + %(prog)s -s 75 # Target 75k tokens + %(prog)s --quiet # Minimal output + ''' + ) + + parser.add_argument( + '--version', + action='version', + version=f'PROJECT_INDEX v{__version__}' + ) + + parser.add_argument( + '-s', '--size', + type=int, + metavar='K', + help='Target size in thousands of tokens (e.g., 75 for 75k)' + ) + + parser.add_argument( + '-q', '--quiet', + action='store_true', + help='Minimal output' + ) + + parser.add_argument( + '-v', '--verbose', + action='store_true', + help='Verbose output' + ) + + parser.add_argument( + '-d', '--directory', + default='.', + help='Directory to index (default: current directory)' + ) + + args = parser.parse_args() + + # Set environment variables based on arguments + if args.size: + os.environ['INDEX_TARGET_SIZE_K'] = str(args.size) + + if args.quiet: + os.environ['QUIET_MODE'] = '1' + + if args.verbose: + os.environ['VERBOSE_MODE'] = '1' + + # Change to target directory if specified + if args.directory != '.': + original_dir = os.getcwd() + os.chdir(args.directory) + try: + main() + finally: + os.chdir(original_dir) + else: + main() \ No newline at end of file diff --git a/scripts/query_index.py b/scripts/query_index.py new file mode 100644 index 0000000..de8d0b1 --- /dev/null +++ b/scripts/query_index.py @@ -0,0 +1,353 @@ +#!/usr/bin/env python3 +""" +Query Index - Search and query functionality for PROJECT_INDEX.json +Find similar code patterns using cached neural embeddings + +Features: +- Query similar functions using natural language +- Show cached duplicate groups +- Multiple similarity algorithms +- Real-time and cached query modes +""" + +__version__ = "0.1.0" + +import json +import math +import argparse +import sys +import urllib.request +import urllib.error +from pathlib import Path +from typing import 
Dict, List, Tuple, Optional, Any, Callable + +# Import shared algorithms and utilities from append_cluster script +# (These will be the same functions but used for querying) + +class SimilarityAlgorithms: + """Collection of similarity algorithms for vector comparison.""" + + @staticmethod + def cosine_similarity(vec1: List[float], vec2: List[float]) -> float: + """Calculate cosine similarity between two vectors (default algorithm).""" + if len(vec1) != len(vec2): + return 0.0 + + dot_product = sum(a * b for a, b in zip(vec1, vec2)) + magnitude1 = math.sqrt(sum(a * a for a in vec1)) + magnitude2 = math.sqrt(sum(a * a for a in vec2)) + + if magnitude1 == 0 or magnitude2 == 0: + return 0.0 + + return dot_product / (magnitude1 * magnitude2) + + @staticmethod + def euclidean_similarity(vec1: List[float], vec2: List[float]) -> float: + """Calculate similarity based on Euclidean distance.""" + if len(vec1) != len(vec2): + return 0.0 + + distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec1, vec2))) + return 1.0 / (1.0 + distance) + + @staticmethod + def manhattan_similarity(vec1: List[float], vec2: List[float]) -> float: + """Calculate similarity based on Manhattan distance.""" + if len(vec1) != len(vec2): + return 0.0 + + distance = sum(abs(a - b) for a, b in zip(vec1, vec2)) + return 1.0 / (1.0 + distance) + + @staticmethod + def dot_product_similarity(vec1: List[float], vec2: List[float]) -> float: + """Calculate raw dot product similarity.""" + if len(vec1) != len(vec2): + return 0.0 + + dot_product = sum(a * b for a, b in zip(vec1, vec2)) + return (math.tanh(dot_product) + 1) / 2 + + +def get_similarity_algorithm(algorithm_name: str) -> Callable[[List[float], List[float]], float]: + """Get similarity algorithm function by name.""" + algorithms = { + 'cosine': SimilarityAlgorithms.cosine_similarity, + 'euclidean': SimilarityAlgorithms.euclidean_similarity, + 'manhattan': SimilarityAlgorithms.manhattan_similarity, + 'dot-product': 
SimilarityAlgorithms.dot_product_similarity, + } + + if algorithm_name not in algorithms: + raise ValueError(f"Unknown algorithm: {algorithm_name}. Available: {list(algorithms.keys())}") + + return algorithms[algorithm_name] + + +def load_project_index(index_path: str = "PROJECT_INDEX.json") -> Dict: + """Load PROJECT_INDEX.json with embeddings.""" + try: + with open(index_path, 'r') as f: + return json.load(f) + except FileNotFoundError: + print(f"❌ Error: {index_path} not found!") + print(" Run the 3-step workflow first:") + print(" 1. python3 scripts/project_index.py") + print(" 2. python3 scripts/append_embeddings_to_index.py") + print(" 3. python3 scripts/append_cluster_to_embeddings_in_index.py --build-cache") + sys.exit(1) + except json.JSONDecodeError as e: + print(f"❌ Error: Invalid JSON in {index_path}: {e}") + sys.exit(1) + + +def extract_embeddings_from_index(index: Dict) -> List[Dict]: + """Extract all embeddings with metadata from the project index.""" + embeddings = [] + + # Extract from files section with full embedding data + files_data = index.get('files', {}) + + for file_path, file_data in files_data.items(): + if not isinstance(file_data, dict): + continue + + # Extract functions + for func_name, func_data in file_data.get('functions', {}).items(): + if isinstance(func_data, dict) and 'embedding' in func_data: + embeddings.append({ + 'type': 'function', + 'name': func_name, + 'file': file_path, + 'line': func_data.get('line', 0), + 'signature': func_data.get('signature', '()'), + 'doc': func_data.get('doc', ''), + 'embedding': func_data['embedding'], + 'full_name': f"{file_path}:{func_name}", + 'calls': func_data.get('calls', []), + 'called_by': func_data.get('called_by', []) + }) + + # Extract methods from classes + for class_name, class_data in file_data.get('classes', {}).items(): + if isinstance(class_data, dict): + for method_name, method_data in class_data.get('methods', {}).items(): + if isinstance(method_data, dict) and 'embedding' in 
method_data: + embeddings.append({ + 'type': 'method', + 'name': method_name, + 'class': class_name, + 'file': file_path, + 'line': method_data.get('line', 0), + 'signature': method_data.get('signature', '()'), + 'doc': method_data.get('doc', ''), + 'embedding': method_data['embedding'], + 'full_name': f"{file_path}:{class_name}.{method_name}", + 'calls': method_data.get('calls', []), + 'called_by': method_data.get('called_by', []) + }) + + return embeddings + + +def generate_embedding_for_query(query: str, model_name: str = "nomic-embed-text", + endpoint: str = "http://localhost:11434") -> Optional[List[float]]: + """Generate embedding for a query string.""" + try: + url = f"{endpoint}/api/embeddings" + data = json.dumps({ + "model": model_name, + "prompt": query + }).encode('utf-8') + + req = urllib.request.Request(url, data=data, headers={'Content-Type': 'application/json'}) + + with urllib.request.urlopen(req, timeout=10) as response: + if response.status == 200: + result = json.loads(response.read().decode('utf-8')) + return result.get('embedding') + except Exception as e: + print(f"❌ Error generating embedding for query: {e}") + return None + + +def query_similar_functions(query: str, index: Dict, algorithm: str = 'cosine', + top_k: int = 10, threshold: float = 0.5, + endpoint: str = "http://localhost:11434", + model_name: str = "nomic-embed-text") -> List[Tuple[Dict, float]]: + """Query for similar functions using natural language.""" + # Generate query embedding + query_embedding = generate_embedding_for_query(query, model_name, endpoint) + if not query_embedding: + return [] + + # Get algorithm function + try: + similarity_func = get_similarity_algorithm(algorithm) + except ValueError as e: + print(f"❌ {e}") + return [] + + # Get all embeddings + embeddings = extract_embeddings_from_index(index) + if not embeddings: + print("❌ No embeddings found in index!") + return [] + + # Calculate similarities + results = [] + for item in embeddings: + try: + 
similarity = similarity_func(query_embedding, item['embedding']) + if similarity >= threshold: + results.append((item, similarity)) + except Exception: + continue + + # Sort and return top results + results.sort(key=lambda x: x[1], reverse=True) + return results[:top_k] + + +def print_query_results(results: List[Tuple[Dict, float]], query: str = None, algorithm: str = 'cosine'): + """Print similarity search results.""" + if not results: + print("🤷 No similar code found.") + return + + if query: + print(f"🔍 Similar to: '{query}' (using {algorithm} algorithm)") + print(f"📊 Found {len(results)} similar items:\n") + + for i, (item, similarity) in enumerate(results, 1): + print(f"#{i} 🎯 Similarity: {similarity:.3f}") + print(f" 📁 {item['file']}:{item['line']}") + + if item['type'] == 'function': + print(f" 🔧 Function: {item['name']}{item['signature']}") + else: + print(f" 🏷️ Method: {item['class']}.{item['name']}{item['signature']}") + + if item['doc']: + print(f" 📝 {item['doc']}") + + # Show call relationships if available + if item.get('calls'): + calls = ', '.join(item['calls'][:3]) + if len(item['calls']) > 3: + calls += f" (+{len(item['calls'])-3} more)" + print(f" 📞 Calls: {calls}") + + print() + + +def print_cached_duplicates(cache: Dict, algorithm: str = 'cosine'): + """Print duplicate groups from cache.""" + if 'similarity_analysis' not in cache: + print("❌ No similarity cache found. 
Build cache first:") + print(" python3 scripts/append_cluster_to_embeddings_in_index.py --build-cache") + return + + similarity_cache = cache['similarity_analysis'] + if algorithm not in similarity_cache.get('algorithms', {}): + print(f"❌ No cache data found for algorithm: {algorithm}") + return + + duplicates = similarity_cache['algorithms'][algorithm].get('duplicate_groups', []) + + if not duplicates: + print("✅ No potential duplicates found.") + return + + print(f"⚠️ Found {len(duplicates)} groups of potentially duplicate code (using {algorithm} algorithm):\n") + + for i, group in enumerate(duplicates, 1): + items = group['items'] + sim_range = group['similarity_range'] + print(f"Group #{i} ({len(items)} similar items, similarity: {sim_range[0]:.3f}-{sim_range[1]:.3f}):") + + for item in items: + print(f" 🎯 {item['score']:.3f} - {item['item']}") + print() + + +def main(): + """Main query interface.""" + parser = argparse.ArgumentParser( + description='Query and search PROJECT_INDEX.json for similar code', + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=''' +Examples: + # Query similar functions + %(prog)s -q "authentication function" + %(prog)s -q "validate email" --algorithm euclidean + + # Show cached duplicates + %(prog)s --duplicates --algorithm cosine + + # Custom settings + %(prog)s -q "error handling" --top-k 5 --threshold 0.7 + ''' + ) + + parser.add_argument('--version', action='version', version=f'Query Index v{__version__}') + + # Mode selection + parser.add_argument('-q', '--query', type=str, + help='Search query for similar functions') + parser.add_argument('--duplicates', action='store_true', + help='Show cached duplicate groups') + + # Algorithm selection + parser.add_argument('--algorithm', default='cosine', + choices=['cosine', 'euclidean', 'manhattan', 'dot-product'], + help='Similarity algorithm (default: cosine)') + + # File I/O + parser.add_argument('-i', '--input', default='PROJECT_INDEX.json', + help='Input index file 
(default: PROJECT_INDEX.json)') + + # Search parameters + parser.add_argument('-k', '--top-k', type=int, default=10, + help='Number of top results (default: 10)') + parser.add_argument('-t', '--threshold', type=float, default=0.5, + help='Similarity threshold (default: 0.5)') + + # Ollama settings + parser.add_argument('--embed-model', default='nomic-embed-text', + help='Ollama model for embeddings (default: nomic-embed-text)') + parser.add_argument('--embed-endpoint', default='http://localhost:11434', + help='Ollama API endpoint (default: http://localhost:11434)') + + args = parser.parse_args() + + if not any([args.query, args.duplicates]): + print("❌ Error: Must specify --query or --duplicates") + parser.print_help() + sys.exit(1) + + # Load project index + print(f"📊 Loading project index: {args.input}") + index = load_project_index(args.input) + + # Handle duplicate display + if args.duplicates: + print_cached_duplicates(index, args.algorithm) + return + + # Handle query + if args.query: + print(f"🔍 Searching for: '{args.query}' (using {args.algorithm})") + + results = query_similar_functions( + args.query, index, args.algorithm, + args.top_k, args.threshold, + args.embed_endpoint, args.embed_model + ) + + print_query_results(results, args.query, args.algorithm) + + +if __name__ == '__main__': + main() \ No newline at end of file diff --git a/scripts/reindex_if_needed.py b/scripts/reindex_if_needed.py new file mode 100644 index 0000000..20a9add --- /dev/null +++ b/scripts/reindex_if_needed.py @@ -0,0 +1,67 @@ +#!/usr/bin/env python3 +""" +Basic reindex_if_needed.py for fix/nixos-compatibility branch +Simple version that just runs the basic indexer when needed +""" + +import json +import sys +import os +import subprocess +from pathlib import Path +from datetime import datetime + + +def main(): + """Main hook entry point.""" + # Check if we're in a git repository or have a PROJECT_INDEX.json + current_dir = Path.cwd() + index_path = None + project_root = 
current_dir + + # Search up the directory tree + check_dir = current_dir + while check_dir != check_dir.parent: + # Check for PROJECT_INDEX.json + potential_index = check_dir / 'PROJECT_INDEX.json' + if potential_index.exists(): + index_path = potential_index + project_root = check_dir + break + + # Check for .git directory + if (check_dir / '.git').is_dir(): + project_root = check_dir + index_path = check_dir / 'PROJECT_INDEX.json' + break + + check_dir = check_dir.parent + + if not index_path or not index_path.exists(): + # No index exists - skip silently + return + + # Simple staleness check - reindex if older than 7 days + try: + index_mtime = os.path.getmtime(index_path) + current_time = datetime.now().timestamp() + age_hours = (current_time - index_mtime) / 3600 + + if age_hours > 168: # 7 days + # Run basic reindex + script_path = Path(__file__).parent / 'project_index.py' + if script_path.exists(): + result = subprocess.run( + [sys.executable, str(script_path)], + cwd=project_root, + capture_output=True, + text=True + ) + if result.returncode == 0: + print("♻️ Refreshed project index (weekly update)") + except Exception: + pass # Silent failure for basic version + + +if __name__ == '__main__': + main() \ No newline at end of file diff --git a/scripts/semantic_analyzer.py b/scripts/semantic_analyzer.py new file mode 100644 index 0000000..2565b0e --- /dev/null +++ b/scripts/semantic_analyzer.py @@ -0,0 +1,503 @@ +#!/usr/bin/env python3 +""" +Semantic analyzer for code fingerprinting and similarity detection. +Used to enhance the project index with semantic analysis capabilities. 
+""" + +import json +import sys +import os +from pathlib import Path +from typing import Dict, List, Any, Optional, Tuple + +# Import utilities from index_utils +from index_utils import ( + create_ast_fingerprint, + create_tfidf_embeddings, + extract_architectural_patterns, + normalize_code_for_comparison, + extract_python_signatures, + extract_javascript_signatures, + extract_shell_signatures, + PARSEABLE_LANGUAGES +) + + +class SemanticAnalyzer: + """Main class for semantic code analysis and fingerprinting.""" + + def __init__(self, project_root: str): + self.project_root = Path(project_root) + self.vectorizer = None + self.function_bodies = [] + self.function_metadata = [] + + def analyze_project(self, index_data: Dict[str, Any]) -> Dict[str, Any]: + """Perform comprehensive semantic analysis on the project index.""" + print("Starting semantic analysis...") + + # Extract all function bodies and metadata + all_functions = self._extract_all_functions(index_data) + + if not all_functions: + print("No functions found for analysis") + return self._create_empty_semantic_index() + + print(f"Analyzing {len(all_functions)} functions...") + + # Create embeddings for all functions + function_bodies = [func['body'] for func in all_functions] + vectorizer, embeddings = create_tfidf_embeddings(function_bodies) + + # Build semantic index + semantic_index = { + 'functions': {}, + 'similarity_clusters': [], + 'architectural_patterns': {}, + 'complexity_analysis': {}, + 'vocabulary': self._extract_vocabulary(vectorizer) if vectorizer else {} + } + + # Process each function + for i, func_data in enumerate(all_functions): + func_id = f"{func_data['file_path']}:{func_data['name']}" + + # Create AST fingerprint + ast_fingerprint = create_ast_fingerprint( + func_data['body'], + func_data.get('language', 'python') + ) + + # Store function analysis + semantic_index['functions'][func_id] = { + 'file_path': func_data['file_path'], + 'function_name': func_data['name'], + 'signature': 
func_data.get('signature', ''), + 'ast_fingerprint': ast_fingerprint, + 'tfidf_vector': embeddings[i] if embeddings else [], + 'complexity': self._calculate_complexity(func_data['body']), + 'patterns': self._identify_function_patterns(func_data['name'], func_data['body']), + 'language': func_data.get('language', 'python') + } + + # Find similarity clusters + if embeddings: + semantic_index['similarity_clusters'] = self._find_similarity_clusters( + all_functions, embeddings + ) + + # Extract architectural patterns + semantic_index['architectural_patterns'] = self._analyze_architecture(index_data) + + # Complexity analysis + semantic_index['complexity_analysis'] = self._analyze_complexity(all_functions) + + print(f"Semantic analysis complete. Found {len(semantic_index['similarity_clusters'])} similarity clusters.") + + return semantic_index + + def _extract_all_functions(self, index_data: Dict[str, Any]) -> List[Dict[str, Any]]: + """Extract all functions from the index with their bodies.""" + functions = [] + files = index_data.get('files', {}) + + for file_path, file_data in files.items(): + if not file_data.get('parsed', False): + continue + + language = file_data.get('language', 'unknown') + + # Read the actual file to get function bodies + full_path = self.project_root / file_path + if not full_path.exists(): + continue + + try: + with open(full_path, 'r', encoding='utf-8', errors='ignore') as f: + file_content = f.read() + except: + continue + + # Extract function bodies based on language + if language == 'python': + functions.extend(self._extract_python_function_bodies( + file_path, file_content, file_data.get('functions', {}) + )) + elif language in ['javascript', 'typescript']: + functions.extend(self._extract_javascript_function_bodies( + file_path, file_content, file_data.get('functions', {}) + )) + elif language == 'shell': + functions.extend(self._extract_shell_function_bodies( + file_path, file_content, file_data.get('functions', {}) + )) + + return 
functions + + def _extract_python_function_bodies(self, file_path: str, content: str, functions: Dict[str, Any]) -> List[Dict[str, Any]]: + """Extract Python function bodies from file content.""" + result = [] + lines = content.split('\n') + + for func_name, func_info in functions.items(): + # Find function definition in file + for i, line in enumerate(lines): + if f"def {func_name}(" in line: + # Extract function body + body_lines = [] + indent_level = len(line) - len(line.lstrip()) + + # Collect function body + for j in range(i + 1, len(lines)): + current_line = lines[j] + if not current_line.strip(): # Empty line + body_lines.append(current_line) + continue + + current_indent = len(current_line) - len(current_line.lstrip()) + if current_indent <= indent_level and current_line.strip(): + break + + body_lines.append(current_line) + + if body_lines: + result.append({ + 'file_path': file_path, + 'name': func_name, + 'signature': func_info.get('signature', '') if isinstance(func_info, dict) else func_info, + 'body': '\n'.join(body_lines), + 'language': 'python' + }) + break + + return result + + def _extract_javascript_function_bodies(self, file_path: str, content: str, functions: Dict[str, Any]) -> List[Dict[str, Any]]: + """Extract JavaScript/TypeScript function bodies from file content.""" + result = [] + + for func_name, func_info in functions.items(): + # Simple extraction - look for function patterns + patterns = [ + rf'function\s+{func_name}\s*\([^)]*\)\s*{{', + rf'const\s+{func_name}\s*=\s*\([^)]*\)\s*=>\s*{{', + rf'{func_name}\s*\([^)]*\)\s*{{' + ] + + for pattern in patterns: + import re + match = re.search(pattern, content) + if match: + # Extract function body (simplified) + start = match.end() + brace_count = 1 + end = start + + for i in range(start, len(content)): + if content[i] == '{': + brace_count += 1 + elif content[i] == '}': + brace_count -= 1 + if brace_count == 0: + end = i + break + + if end > start: + body = content[start:end] + 
result.append({ + 'file_path': file_path, + 'name': func_name, + 'signature': func_info.get('signature', '') if isinstance(func_info, dict) else func_info, + 'body': body, + 'language': 'javascript' + }) + break + + return result + + def _extract_shell_function_bodies(self, file_path: str, content: str, functions: Dict[str, Any]) -> List[Dict[str, Any]]: + """Extract shell function bodies from file content.""" + result = [] + lines = content.split('\n') + + for func_name, func_info in functions.items(): + # Find function definition + for i, line in enumerate(lines): + if f"{func_name}()" in line or f"function {func_name}" in line: + # Extract function body + body_lines = [] + brace_count = 0 + in_function = False + + for j in range(i + 1, len(lines)): + current_line = lines[j] + + if '{' in current_line: + brace_count += current_line.count('{') + in_function = True + + if in_function: + body_lines.append(current_line) + + if '}' in current_line: + brace_count -= current_line.count('}') + if brace_count <= 0: + break + + if body_lines: + result.append({ + 'file_path': file_path, + 'name': func_name, + 'signature': func_info.get('signature', '') if isinstance(func_info, dict) else func_info, + 'body': '\n'.join(body_lines), + 'language': 'shell' + }) + break + + return result + + def _calculate_complexity(self, function_body: str) -> Dict[str, int]: + """Calculate cyclomatic complexity and other metrics.""" + # Simple complexity metrics + complexity = { + 'lines': len([line for line in function_body.split('\n') if line.strip()]), + 'cyclomatic': 1, # Base complexity + 'nesting_depth': 0 + } + + # Count decision points for cyclomatic complexity + decision_patterns = [ + r'\bif\b', r'\belif\b', r'\belse\b', + r'\bfor\b', r'\bwhile\b', + r'\btry\b', r'\bexcept\b', r'\bcatch\b', + r'\bswitch\b', r'\bcase\b', + r'\?\s*.*\s*:' # Ternary operator + ] + + for pattern in decision_patterns: + import re + matches = re.findall(pattern, function_body, re.IGNORECASE) + 
complexity['cyclomatic'] += len(matches) + + # Calculate nesting depth (simplified) + max_depth = 0 + current_depth = 0 + for line in function_body.split('\n'): + stripped = line.strip() + if any(keyword in stripped for keyword in ['if', 'for', 'while', 'try', 'with']): + current_depth += 1 + max_depth = max(max_depth, current_depth) + elif stripped.startswith(('end', '}')) or stripped == '': + current_depth = max(0, current_depth - 1) + + complexity['nesting_depth'] = max_depth + return complexity + + def _identify_function_patterns(self, func_name: str, func_body: str) -> List[str]: + """Identify common patterns in function implementation.""" + patterns = [] + + name_lower = func_name.lower() + body_lower = func_body.lower() + + # Naming patterns + if name_lower.startswith(('get', 'fetch', 'retrieve')): + patterns.append('getter') + elif name_lower.startswith(('set', 'update', 'modify')): + patterns.append('setter') + elif name_lower.startswith(('is', 'has', 'can', 'should')): + patterns.append('predicate') + elif 'valid' in name_lower or 'check' in name_lower: + patterns.append('validation') + elif name_lower.startswith(('create', 'make', 'build')): + patterns.append('factory') + elif name_lower.startswith(('parse', 'format', 'convert')): + patterns.append('transformer') + + # Implementation patterns + if 'raise ' in body_lower or 'throw ' in body_lower: + patterns.append('error_handling') + if 'log' in body_lower or 'print' in body_lower: + patterns.append('logging') + if 'async' in body_lower or 'await' in body_lower: + patterns.append('async') + if 'cache' in body_lower: + patterns.append('caching') + if 'db' in body_lower or 'database' in body_lower or 'query' in body_lower: + patterns.append('database') + if 'http' in body_lower or 'request' in body_lower or 'api' in body_lower: + patterns.append('api') + + return patterns + + def _find_similarity_clusters(self, functions: List[Dict[str, Any]], embeddings: List[List[float]]) -> List[Dict[str, Any]]: + 
"""Find clusters of similar functions.""" + from index_utils import compute_code_similarity + + clusters = [] + processed = set() + + for i, func1 in enumerate(functions): + if i in processed: + continue + + cluster = { + 'representative': f"{func1['file_path']}:{func1['name']}", + 'functions': [f"{func1['file_path']}:{func1['name']}"], + 'similarity_scores': [1.0], + 'pattern': 'similar_implementation' + } + + # Find similar functions + for j, func2 in enumerate(functions[i+1:], i+1): + if j in processed: + continue + + similarity = compute_code_similarity(embeddings[i], embeddings[j]) + if similarity >= 0.75: # 75% similarity threshold for clustering + cluster['functions'].append(f"{func2['file_path']}:{func2['name']}") + cluster['similarity_scores'].append(similarity) + processed.add(j) + + if len(cluster['functions']) > 1: + clusters.append(cluster) + + processed.add(i) + + return clusters + + def _analyze_architecture(self, index_data: Dict[str, Any]) -> Dict[str, Any]: + """Analyze architectural patterns across the project.""" + patterns = { + 'naming_conventions': {}, + 'directory_patterns': {}, + 'design_patterns': [], + 'dependency_patterns': {} + } + + # Analyze file organization + files = index_data.get('files', {}) + + # Directory patterns + directories = set() + for file_path in files.keys(): + if '/' in file_path: + directory = '/'.join(file_path.split('/')[:-1]) + directories.add(directory) + + # Common directory patterns + if 'src' in directories: + patterns['directory_patterns']['src_pattern'] = True + if any('test' in d for d in directories): + patterns['directory_patterns']['test_separation'] = True + if any('util' in d for d in directories): + patterns['directory_patterns']['utility_separation'] = True + + # Naming conventions analysis + all_functions = [] + for file_data in files.values(): + if file_data.get('functions'): + all_functions.extend(file_data['functions'].keys()) + + if all_functions: + snake_case = sum(1 for name in all_functions 
if '_' in name and name.islower()) + camel_case = sum(1 for name in all_functions if '_' not in name and any(c.isupper() for c in name[1:])) + + if snake_case > camel_case: + patterns['naming_conventions']['functions'] = 'snake_case' + elif camel_case > 0: + patterns['naming_conventions']['functions'] = 'camelCase' + + return patterns + + def _analyze_complexity(self, functions: List[Dict[str, Any]]) -> Dict[str, Any]: + """Analyze project complexity metrics.""" + if not functions: + return {} + + complexities = [self._calculate_complexity(func['body']) for func in functions] + + lines = [c['lines'] for c in complexities] + cyclomatic = [c['cyclomatic'] for c in complexities] + nesting = [c['nesting_depth'] for c in complexities] + + return { + 'total_functions': len(functions), + 'average_lines_per_function': sum(lines) / len(lines), + 'average_cyclomatic_complexity': sum(cyclomatic) / len(cyclomatic), + 'max_cyclomatic_complexity': max(cyclomatic), + 'average_nesting_depth': sum(nesting) / len(nesting), + 'max_nesting_depth': max(nesting), + 'high_complexity_functions': [ + f"{func['file_path']}:{func['name']}" + for func, complexity in zip(functions, complexities) + if complexity['cyclomatic'] > 10 or complexity['nesting_depth'] > 4 + ] + } + + def _extract_vocabulary(self, vectorizer) -> Dict[str, Any]: + """Extract vocabulary information from TF-IDF vectorizer.""" + if not vectorizer: + return {} + + try: + vocabulary = vectorizer.get_feature_names_out() + return { + 'size': len(vocabulary), + 'top_terms': list(vocabulary[:20]) # First 20 terms + } + except: + return {} + + def _create_empty_semantic_index(self) -> Dict[str, Any]: + """Create an empty semantic index structure.""" + return { + 'functions': {}, + 'similarity_clusters': [], + 'architectural_patterns': {}, + 'complexity_analysis': {}, + 'vocabulary': {} + } + + +def main(): + """Main entry point for semantic analysis.""" + if len(sys.argv) < 2: + print("Usage: python semantic_analyzer.py 
PROJECT_ROOT [index_file]") + sys.exit(1) + + project_root = sys.argv[1] + index_file = sys.argv[2] if len(sys.argv) > 2 else 'PROJECT_INDEX.json' + + # Load existing index + index_path = Path(project_root) / index_file + if not index_path.exists(): + print(f"Index file not found: {index_path}") + sys.exit(1) + + try: + with open(index_path, 'r') as f: + index_data = json.load(f) + except Exception as e: + print(f"Error loading index: {e}") + sys.exit(1) + + # Perform semantic analysis + analyzer = SemanticAnalyzer(project_root) + semantic_index = analyzer.analyze_project(index_data) + + # Add semantic index to existing data + index_data['semantic_index'] = semantic_index + + # Save updated index + try: + with open(index_path, 'w') as f: + json.dump(index_data, f, indent=2) + print(f"Enhanced index saved to {index_path}") + except Exception as e: + print(f"Error saving index: {e}") + sys.exit(1) + + +if __name__ == '__main__': + main() \ No newline at end of file diff --git a/scripts/semantic_command_handler.py b/scripts/semantic_command_handler.py new file mode 100644 index 0000000..d52d0de --- /dev/null +++ b/scripts/semantic_command_handler.py @@ -0,0 +1,136 @@ +#!/usr/bin/env python3 +""" +Command handler for /semantic-index slash command +Handles all sub-commands without multi-line bash issues +""" + +import sys +import os +import subprocess +import json +from pathlib import Path + +def run_command(cmd, shell=True): + """Run a command and return success status""" + try: + result = subprocess.run(cmd, shell=shell, capture_output=True, text=True) + if result.stdout: + print(result.stdout.strip()) + if result.stderr: + print(result.stderr.strip(), file=sys.stderr) + return result.returncode == 0 + except Exception as e: + print(f"Error running command: {e}", file=sys.stderr) + return False + +def show_status(): + """Show project status""" + print("📊 PROJECT STATUS:") + + if os.path.exists("PROJECT_INDEX.json"): + size = os.path.getsize("PROJECT_INDEX.json") + print(f"✅ Index 
size: {size} bytes") + + try: + with open("PROJECT_INDEX.json", "r") as f: + index = json.load(f) + print(f"📁 Files: {index.get('stats', {}).get('total_files', 'N/A')}") + print(f"🗂️ Dirs: {index.get('stats', {}).get('total_directories', 'N/A')}") + print(f"🧬 Semantic: {'semantic_index' in index}") + except: + print("⚠️ Could not parse index file") + else: + print("❌ No PROJECT_INDEX.json found") + print("💡 Run: /semantic-index setup") + +def setup_project(): + """Set up project indexing""" + print("⚙️ Setting up project indexing...") + + # Copy hooks configuration + os.makedirs(".claude", exist_ok=True) + settings_path = os.path.expanduser("~/.claude-code-project-index/.claude/settings.json") + if os.path.exists(settings_path): + run_command(f"cp {settings_path} .claude/") + print("✅ Hooks configured") + else: + print("⚠️ No hooks configuration found, skipping") + + # Copy scripts locally + os.makedirs("scripts", exist_ok=True) + global_scripts = os.path.expanduser("~/.claude/scripts") + project_scripts = os.path.expanduser("~/.claude-code-project-index/scripts") + + if os.path.exists(global_scripts): + run_command(f"cp -r {global_scripts}/* scripts/") + print("✅ Scripts copied from global location") + elif os.path.exists(project_scripts): + run_command(f"cp -r {project_scripts}/* scripts/") + print("✅ Scripts copied from project repository") + else: + print("❌ Could not find scripts directory") + return False + + # Generate initial index + print("🏗️ Generating initial PROJECT_INDEX.json...") + if run_command("python3 scripts/enhanced_project_index.py"): + print("✅ Setup complete! 
Index created and hooks configured.") + print("📊 You can now use:") + print(" • /semantic-index build - Update index") + print(" • /semantic-index duplicates - Check for duplicates") + print(" • /embedded-index setup - Neural embeddings") + return True + else: + print("❌ Failed to generate index") + return False + +def main(): + """Main command handler""" + args = " ".join(sys.argv[1:]) if len(sys.argv) > 1 else "" + + if args in ["status", ""]: + show_status() + + elif args in ["build", "incremental"]: + print("🔄 Running incremental update...") + run_command("python3 ~/.claude/scripts/reindex_if_needed.py") + + elif args == "full": + print("🏗️ Full semantic rebuild...") + run_command("python3 ~/.claude/scripts/enhanced_project_index.py") + + elif args == "duplicates": + print("🔍 Checking for duplicates...") + run_command("python3 ~/.claude/scripts/duplicate_mode_toggle.py --status") + print("📋 Generate report with: /semantic-index duplicates report") + + elif args == "duplicates report": + print("📊 Generating duplicate analysis report...") + run_command("python3 ~/.claude/scripts/generate_duplicate_report.py") + + elif args == "duplicates interactive": + print("🛠️ Starting interactive cleanup...") + run_command("python3 ~/.claude/scripts/interactive_cleanup.py") + + elif args == "analyze": + print("🔬 Running semantic analysis...") + run_command("python3 ~/.claude/scripts/semantic_analyzer.py") + + elif args == "setup": + setup_project() + + else: + print("📚 USAGE: /semantic-index [command]") + print("") + print("Commands:") + print(" build - Incremental update (default)") + print(" full - Complete rebuild with semantic analysis") + print(" duplicates - Show duplicate detection status") + print(" duplicates report - Generate duplicate analysis") + print(" duplicates interactive - Interactive cleanup") + print(" analyze - Run semantic analysis only") + print(" status - Show current index status") + print(" setup - Initialize project indexing") + +if __name__ == 
"__main__": + main() \ No newline at end of file diff --git a/scripts/update_index.py b/scripts/update_index.py new file mode 100644 index 0000000..2735f30 --- /dev/null +++ b/scripts/update_index.py @@ -0,0 +1,30 @@ +#!/usr/bin/env python3 +""" +Update index hook - called after file edits to keep PROJECT_INDEX.json current +""" + +import json +import sys +import os +import subprocess +from pathlib import Path + +def main(): + """Simple update hook that calls reindex_if_needed""" + try: + # Just call the existing reindex script + result = subprocess.run([ + "python3", + "scripts/reindex_if_needed.py" + ], capture_output=True, text=True) + + if result.returncode == 0: + print("Index update completed successfully") + else: + print(f"Index update had issues: {result.stderr}") + + except Exception as e: + print(f"Update hook error: {e}") + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/install.sh b/tools/install.sh similarity index 100% rename from install.sh rename to tools/install.sh diff --git a/tools/streamlined_setup.sh b/tools/streamlined_setup.sh new file mode 100644 index 0000000..7d831d7 --- /dev/null +++ b/tools/streamlined_setup.sh @@ -0,0 +1,111 @@ +#!/usr/bin/env bash +# Streamlined setup for duplicate detection system +# Addresses common friction points and automates the full setup + +set -eo pipefail + +PROJECT_ROOT="$(pwd)" +SYSTEM_DIR="$HOME/.claude-code-project-index" + +echo "🚀 Streamlined Duplicate Detection Setup" +printf '=%.0s' {1..50}; echo + +# Step 1: Verify system installation +if [[ ! -d "$SYSTEM_DIR" ]]; then + echo "❌ Claude Code Project Index not found at $SYSTEM_DIR" + echo "💡 Please install first: https://github.com/your-repo/claude-code-project-index" + exit 1 +fi + +echo "✅ System found at $SYSTEM_DIR" + +# Step 2: Create PROJECT_INDEX.json with semantic analysis +echo "" +echo "📊 Step 1: Building comprehensive project index..." 
+if bash "$SYSTEM_DIR/scripts/project-index-helper.sh"; then + echo "✅ Basic index created" +else + echo "❌ Failed to create basic index" + exit 1 +fi + +# Step 3: Run semantic analysis automatically +echo "" +echo "🧠 Step 2: Adding semantic analysis (duplicate detection)..." +if python3 "$SYSTEM_DIR/scripts/semantic_analyzer.py" "$PROJECT_ROOT"; then + echo "✅ Semantic analysis complete" +else + echo "❌ Failed to run semantic analysis" + echo "💡 Make sure scikit-learn is installed: pip install scikit-learn" + exit 1 +fi + +# Step 4: Set up dual-mode configuration +echo "" +echo "⚙️ Step 3: Configuring dual-mode detection..." + +# Create .claude directory if it doesn't exist +mkdir -p "$PROJECT_ROOT/.claude" + +# Copy enhanced settings +if cp "$SYSTEM_DIR/.claude/settings_dual_mode.json" "$PROJECT_ROOT/.claude/settings.json"; then + echo "✅ Dual-mode hooks configured" +else + echo "❌ Failed to configure hooks" + exit 1 +fi + +# Initialize in passive mode (safer for first-time users) +if python3 "$SYSTEM_DIR/scripts/duplicate_mode_toggle.py" --project-root "$PROJECT_ROOT" passive; then + echo "✅ Initialized in passive mode (safe for learning)" +else + echo "❌ Failed to initialize mode" + exit 1 +fi + +# Step 5: Generate initial duplicate report +echo "" +echo "📋 Step 4: Generating initial duplicate analysis..." +if python3 "$SYSTEM_DIR/scripts/generate_duplicate_report.py" --project-root "$PROJECT_ROOT" --format markdown --output "DUPLICATE_ANALYSIS_REPORT.md"; then + echo "✅ Duplicate report saved to DUPLICATE_ANALYSIS_REPORT.md" +else + echo "❌ Failed to generate duplicate report" + exit 1 +fi + +# Step 6: Test status line +echo "" +echo "📈 Step 5: Testing status line..." +if echo '{"model":{"display_name":"Test"},"workspace":{"current_dir":"'"$PROJECT_ROOT"'"}}' | "$SYSTEM_DIR/.claude/duplicate-status.sh" > /dev/null 2>&1; then + echo "✅ Status line working" +else + echo "❌ Status line test failed" + exit 1 +fi + +# Success summary +echo "" +echo "🎉 Setup Complete!" 
+printf '=%.0s' {1..30}; echo +echo "" +echo "📊 Your duplicate detection system is now active with:" +echo " • 👁️ PASSIVE MODE - Monitors duplicates without blocking" +echo " • 📈 STATUS LINE - Real-time detection status" +echo " • 📋 DUPLICATE REPORT - Initial analysis completed" +echo "" +echo "🔧 Quick Commands:" +echo " • Switch to blocking: python3 $SYSTEM_DIR/scripts/duplicate_mode_toggle.py blocking" +echo " • View status: python3 $SYSTEM_DIR/scripts/duplicate_mode_toggle.py status" +echo " • Interactive cleanup: python3 $SYSTEM_DIR/scripts/interactive_cleanup.py" +echo "" +echo "📄 Files Created:" +echo " • PROJECT_INDEX.json (with semantic analysis)" +echo " • .claude/settings.json (dual-mode configuration)" +echo " • DUPLICATE_ANALYSIS_REPORT.md (initial analysis)" +echo "" +echo "💡 Next Steps:" +echo " 1. Read DUPLICATE_ANALYSIS_REPORT.md to see current duplicates" +echo " 2. Start in PASSIVE mode to learn the system" +echo " 3. Switch to BLOCKING mode when ready: duplicate_mode_toggle.py blocking" +echo "" +echo "🚀 Claude Code will now show duplicate detection status in the status line!" \ No newline at end of file diff --git a/uninstall.sh b/tools/uninstall.sh similarity index 100% rename from uninstall.sh rename to tools/uninstall.sh