DISCOVERIES.md

This file documents non-obvious problems, solutions, and patterns discovered during development. Review it regularly: remove entries that are outdated or superseded by better practices, code, or tools, and update entries where the best practice has evolved.

DevContainer Setup: Using Official Features Instead of Custom Scripts (2025-10-22)

Issue

Claude CLI was not reliably available in DevContainers, and there was no visibility into what tools were installed during container creation.

Root Cause

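  1. Custom installation approach: An earlier attempt installed Claude CLI via npm in the post-create script; that step had since been commented out, which suggests it was unreliable
  2. Broken pipx feature URL: The pipx feature was referenced under the devcontainers-contrib namespace, which was incorrect (the working reference uses devcontainers-extra)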
  3. No logging: Post-create script had no output to help diagnose issues
  4. No status reporting: Users couldn't easily see what tools were available

Solution

Switched to declarative DevContainer features instead of custom installation scripts:

devcontainer.json changes:

// Fixed broken pipx feature URL
"ghcr.io/devcontainers-extra/features/pipx-package:1": { ... }

// Added official Claude Code feature
"ghcr.io/anthropics/devcontainer-features/claude-code:1": {},

// Added VSCode extension
"extensions": ["anthropic.claude-code", ...]

// Named container for easier identification
"runArgs": ["--name=amplifier_devcontainer"]

post-create.sh improvements:

# Added logging to persistent file for troubleshooting
LOG_FILE="/tmp/devcontainer-post-create.log"
exec > >(tee -a "$LOG_FILE") 2>&1

# Added development environment status report
echo "📋 Development Environment Ready:"
echo "  • Python: $(python3 --version 2>&1 | cut -d' ' -f2)"
echo "  • Claude CLI: $(claude --version 2>&1 || echo 'NOT INSTALLED')"
# ... other tools

Key Learnings

  1. Use official DevContainer features over custom scripts: Features are tested, maintained, and more reliable than custom npm installs
  2. Declarative > imperative: Define what you need in devcontainer.json rather than scripting installations
  3. Add logging for troubleshooting: Persistent logs help diagnose container build issues
  4. Provide status reporting: Show users what tools are available after container creation
  5. Test with fresh containers: Only way to verify DevContainer configuration works

Prevention

  • Prefer official DevContainer features from ghcr.io/anthropics/, ghcr.io/devcontainers/, etc.
  • Add logging (tee to a log file) in post-create scripts for troubleshooting
  • Include tool version reporting to confirm installations
  • Use named containers (runArgs) for easier identification in Docker Desktop
  • Test DevContainer changes by rebuilding containers from scratch
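
The version reporting above can also be written in Python, which makes the NOT INSTALLED fallback easy to verify. This is a minimal sketch, not the project's post-create.sh: `tool_version` and `status_report` are hypothetical names, and it assumes each tool prints its version when called with `--version`.

```python
import shutil
import subprocess


def tool_version(command: str, flag: str = "--version") -> str:
    """Return the first line of a tool's version output, or 'NOT INSTALLED'."""
    if shutil.which(command) is None:
        return "NOT INSTALLED"
    try:
        result = subprocess.run(
            [command, flag],
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "NOT INSTALLED"
    # Some tools print their version to stderr, so fall back to it.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else "UNKNOWN"


def status_report(commands: list[str]) -> dict[str, str]:
    """Map each command name to its reported version string."""
    return {name: tool_version(name) for name in commands}
```

Calling `status_report(["python3", "claude", "pnpm"])` at the end of container creation gives one place to see which tools are missing, without relying on each tool's exit code.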

pnpm Global Bin Directory Not Configured (2025-10-23)

Issue

make install fails with ERR_PNPM_NO_GLOBAL_BIN_DIR error when trying to install global npm packages via pnpm in fresh DevContainer builds.

Root Cause

Three issues combined to cause the failure:

  1. Missing SHELL environment variable: During DevContainer post-create script execution, the SHELL environment variable is not set
  2. pnpm setup requires SHELL: The pnpm setup command fails with ERR_PNPM_UNKNOWN_SHELL when SHELL is not set
  3. Silent failure: The error was hidden by || true in the script, allowing the script to continue and report success even though pnpm wasn't configured

From the post-create log:

🔧  Setting up pnpm global bin directory...
 ERR_PNPM_UNKNOWN_SHELL  Could not infer shell type.
Set the SHELL environment variable to your active shell.
    ✅ pnpm configured  # <-- False success!

Solution

Fixed post-create script to explicitly set SHELL before running pnpm setup:

post-create.sh addition:

echo "🔧  Setting up pnpm global bin directory..."
# Ensure SHELL is set for pnpm setup
export SHELL="${SHELL:-/bin/bash}"
# Configure pnpm to use a global bin directory
pnpm setup 2>&1 | grep -v "^$" || true
# Export for current session (will also be in ~/.bashrc for future sessions)
export PNPM_HOME="/home/vscode/.local/share/pnpm"
export PATH="$PNPM_HOME:$PATH"
echo "    ✅ pnpm configured"

This ensures:

  1. SHELL is explicitly set before pnpm setup runs
  2. pnpm's global bin directory is configured on first container build
  3. The configuration is added to ~/.bashrc for all future sessions
  4. The environment variables are set for the post-create script itself

Key Learnings

  1. SHELL not set in post-create context - DevContainer post-create scripts run in an environment where SHELL may not be set
  2. pnpm requires SHELL - Unlike npm, pnpm needs to know the shell type to modify the correct config file
  3. Silent failures are dangerous - Using || true hid the actual error; consider logging errors even when continuing
  4. Check the logs - The /tmp/devcontainer-post-create.log revealed the actual error that was hidden from the console

Prevention

  • Always set SHELL explicitly in post-create scripts before running shell-dependent commands
  • Check post-create logs (/tmp/devcontainer-post-create.log) after rebuilding containers
  • Consider conditional error handling instead of blanket || true to catch real failures
  • Test make install as part of DevContainer validation
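
One way to replace a blanket `|| true` is to keep going after a failed step but record the real error. The sketch below illustrates that idea in Python under stated assumptions: `run_step` is a hypothetical helper, not part of the project, and the `/bin/bash` default mirrors the SHELL fix described above.

```python
import logging
import os
import subprocess

logger = logging.getLogger("post_create")


def run_step(name: str, command: list[str]) -> bool:
    """Run one setup step and log the real error instead of hiding it."""
    env = dict(os.environ)
    # Default SHELL when the environment lacks it (assumed default: /bin/bash).
    env.setdefault("SHELL", "/bin/bash")
    try:
        result = subprocess.run(
            command, env=env, capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        logger.error("%s: command not found: %s", name, command[0])
        return False
    if result.returncode != 0:
        details = (result.stderr or result.stdout).strip()
        logger.error("%s failed (exit %d): %s", name, result.returncode, details)
        return False
    logger.info("%s succeeded", name)
    return True
```

The caller can still continue after a `False` result, but the success message is only printed when the step actually succeeded, for example `if run_step("pnpm setup", ["pnpm", "setup"]): print("pnpm configured")`.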

OneDrive/Cloud Sync File I/O Errors (2025-01-21)

Issue

Knowledge synthesis and other file operations hit intermittent I/O errors (OSError errno 5) in a WSL2 environment. The errors appeared random but were actually caused by OneDrive cloud sync delays.

Root Cause

The ~/amplifier directory was symlinked to a OneDrive folder on Windows (C:\ drive). When files weren't downloaded locally ("cloud-only" files), file operations would fail with I/O errors while OneDrive fetched them from the cloud. This affects:

  1. WSL2 + OneDrive: Symlinked directories from Windows OneDrive folders
  2. Other cloud sync services: Dropbox, Google Drive, iCloud Drive can cause similar issues
  3. Network drives: Similar delays can occur with network-mounted filesystems

Solution

Two-part solution implemented:

  1. Immediate fix: Added retry logic with exponential backoff and informative warnings
  2. Long-term fix: Created centralized file I/O utility module

# Enhanced retry logic in events.py with cloud sync warning:
for attempt in range(max_retries):
    try:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(rec), ensure_ascii=False) + "\n")
            f.flush()
        return
    except OSError as e:
        if e.errno == 5 and attempt < max_retries - 1:
            if attempt == 0:  # Log warning on first retry
                logger.warning(
                    f"File I/O error writing to {self.path} - retrying. "
                    "This may be due to cloud-synced files (OneDrive, Dropbox, etc.). "
                    "If using cloud sync, consider enabling 'Always keep on this device' "
                    f"for the data folder: {self.path.parent}"
                )
            time.sleep(retry_delay)
            retry_delay *= 2
        else:
            raise

# New centralized utility (amplifier/utils/file_io.py):
from amplifier.utils.file_io import write_json, read_json
write_json(data, filepath)  # Automatically handles retries

Affected Operations Identified

High-priority file operations requiring retry protection:

  1. Memory Store (memory/core.py) - Saves after every operation
  2. Knowledge Store (knowledge_synthesis/store.py) - Append operations
  3. Content Processing - Document and image saves
  4. Knowledge Integration - Graph saves and entity cache
  5. Synthesis Engine - Results saving

Key Learnings

  1. Cloud sync can cause mysterious I/O errors - Not immediately obvious from error messages
  2. Symlinked directories inherit cloud sync behavior - WSL directories linked to OneDrive folders are affected
  3. "Always keep on device" setting fixes it - Ensures files are locally available
  4. Retry logic should be informative - Tell users WHY retries are happening
  5. Centralized utilities prevent duplication - One retry utility for all file operations

Prevention

  • Enable "Always keep on this device" for any OneDrive folders used in development
  • Use the centralized file_io utility for all file operations
  • Add retry logic proactively for user-facing file operations
  • Consider data directory location when setting up projects (prefer local over cloud-synced)
  • Test file operations with cloud sync scenarios during development

Tool Generation Pattern Failures (2025-01-23)

Issue

Generated CLI tools consistently fail in the same predictable ways:

  • Non-recursive file discovery (using *.md instead of **/*.md)
  • No minimum input validation (synthesis with 1 file when 2+ needed)
  • Silent failures without user feedback
  • Poor visibility into what's being processed

Root Cause

  • Missing standard patterns: No enforced template for common requirements
  • Agent guidance confusion: Documentation references examples/ as primary location
  • Philosophy violations: Generated code adds complexity instead of embracing simplicity

Solutions

Standard tool patterns (enforced in all generated tools):

# Recursive file discovery
files = list(Path(input_dir).glob("**/*.md"))  # NOT "*.md"; input_dir avoids shadowing the built-in dir()

# Minimum input validation
if len(files) < required_min:
    logger.error(f"Need at least {required_min} files, found {len(files)}")
    sys.exit(1)

# Clear progress visibility
logger.info(f"Processing {len(files)} files:")
for f in files[:5]:
    logger.info(f"  • {f.name}")

Tool generation checklist:

  • Uses recursive glob patterns for file discovery
  • Validates minimum inputs before processing
  • Shows clear progress/activity to user
  • Fails fast with descriptive errors
  • Uses defensive utilities from toolkit

Key Learnings

  1. Templates prevent predictable failures: Common patterns should be enforced
  2. Visibility prevents confusion: Always show what's being processed
  3. Fail fast and loud: Silent failures create debugging nightmares
  4. Philosophy must be enforced: Generated code often violates simplicity

Prevention

  • Validate against checklist before accepting generated tools
  • Update agent guidance to specify correct directories
  • Test with edge cases (empty dirs, single file, nested structures)
  • Review generated code for philosophy compliance
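
The three standard patterns (recursive discovery, minimum input validation, progress visibility) can be combined into one helper that is easy to test against the edge cases listed above. This is a sketch only: `discover_files` and `InsufficientInputError` are hypothetical names, and it raises an exception where the snippet earlier in this entry calls `sys.exit(1)`, so the CLI entry point decides how to exit.

```python
import logging
from pathlib import Path
from typing import Union

logger = logging.getLogger(__name__)


class InsufficientInputError(ValueError):
    """Raised when fewer input files are found than the tool requires."""


def discover_files(
    input_dir: Union[str, Path],
    pattern: str = "**/*.md",
    required_min: int = 1,
    preview: int = 5,
) -> list[Path]:
    """Recursively find files, validate the count, and log what was found."""
    root = Path(input_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"Input directory not found: {root}")

    # "**/" makes the search recursive; it also matches files directly in root.
    files = sorted(p for p in root.glob(pattern) if p.is_file())

    if len(files) < required_min:
        raise InsufficientInputError(
            f"Need at least {required_min} files matching {pattern!r} in {root}, "
            f"found {len(files)}"
        )

    logger.info("Processing %d files:", len(files))
    for path in files[:preview]:
        logger.info("  • %s", path.relative_to(root))
    if len(files) > preview:
        logger.info("  ... and %d more", len(files) - preview)
    return files
```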

LLM Response Handling and Defensive Utilities (2025-01-19)

Issue

Some CCSDK tools experienced multiple failure modes when processing LLM responses:

  • JSON parsing errors when LLMs returned markdown-wrapped JSON or explanatory text
  • Context contamination where LLMs referenced system instructions in their outputs
  • Transient failures with no retry mechanism causing tool crashes

Root Cause

LLMs don't reliably return pure JSON responses, even with explicit instructions. Common issues:

  1. Format variations: LLMs wrap JSON in markdown blocks, add explanations, or include preambles
  2. Context leakage: System prompts and instructions bleed into generated content
  3. Transient failures: API timeouts, rate limits, and temporary errors not handled gracefully

Solution

Created minimal defensive utilities in amplifier/ccsdk_toolkit/defensive/:

# parse_llm_json() - Extracts JSON from any LLM response format
result = parse_llm_json(llm_response)
# Handles: markdown blocks, explanations, nested JSON, malformed quotes

# retry_with_feedback() - Intelligent retry with error correction
result = await retry_with_feedback(
    async_func=generate_synthesis,
    prompt=prompt,
    max_retries=3
)
# Provides error feedback to LLM for self-correction on retry

# isolate_prompt() - Prevents context contamination
clean_prompt = isolate_prompt(user_prompt)
# Adds barriers to prevent system instruction leakage

Real-World Validation (2025-09-19)

Test Results: Fresh md_synthesizer run with defensive utilities showed dramatic improvement:

  • ✅ Zero JSON parsing errors (was 100% failure rate in original versions)
  • ✅ Zero context contamination (was synthesizing from wrong system files)
  • ✅ Zero crashes (was failing with exceptions on basic operations)
  • ✅ 62.5% completion rate (5 of 8 ideas expanded before timeout vs. 0% before)
  • ✅ High-quality output - Generated 8 relevant, insightful ideas from 3 documents

Performance Profile:

  • Stage 1 (Summarization): ~10-12 seconds per file - Excellent
  • Stage 2 (Synthesis): ~3 seconds per idea - Excellent with zero JSON failures
  • Stage 3 (Expansion): ~45 seconds per idea - Reasonable but could be optimized

Key Wins:

  1. parse_llm_json() eliminated all JSON parsing failures
  2. isolate_prompt() prevented system context leakage
  3. Progress checkpoint system preserved work through timeout
  4. Tool now fundamentally sound - remaining work is optimization, not bug fixing

Key Patterns

  1. Extraction over validation: Don't expect perfect JSON, extract it from whatever format arrives
  2. Feedback loops: When retrying, tell the LLM what went wrong so it can correct
  3. Context isolation: Use clear delimiters to separate user content from system instructions
  4. Defensive by default: All CCSDK tools should assume LLM responses need cleaning
  5. Test early with real data: Defensive utilities prove their worth only under real conditions
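
To make "extraction over validation" concrete, here is a minimal sketch of the idea. It is not the implementation of parse_llm_json(): `extract_json` is a hypothetical function that covers only three cases (pure JSON, a fenced block, JSON embedded in prose) and does not attempt to repair malformed quotes.

```python
import json
import re
from typing import Any

# Built from single backticks so this example can itself sit in a fenced block.
_FENCE = "`" * 3
_FENCED_BLOCK = re.compile(
    _FENCE + r"(?:json)?\s*(.*?)" + _FENCE, re.DOTALL | re.IGNORECASE
)


def extract_json(text: str) -> Any:
    """Return the first JSON value found in an LLM response, or raise ValueError."""
    # 1. The response is already pure JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # 2. The JSON is wrapped in a fenced markdown block.
    for match in _FENCED_BLOCK.finditer(text):
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            continue

    # 3. The JSON object or array is embedded in explanatory prose.
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char in "{[":
            try:
                value, _end = decoder.raw_decode(text, index)
                return value
            except json.JSONDecodeError:
                continue

    raise ValueError("No JSON object or array found in response")
```

The order matters: the cheapest check runs first, and the prose scan runs last because it accepts the first decodable object it meets, which may not be the intended one in a long response.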

Prevention

  • Use parse_llm_json() for all LLM JSON responses - never use raw json.loads()
  • Wrap LLM operations with retry_with_feedback() for automatic error recovery
  • Apply isolate_prompt() when user content might be confused with instructions
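
The feedback-loop pattern can be sketched as follows. This is an illustration under stated assumptions, not the toolkit's retry_with_feedback(): `retry_with_error_feedback` is a hypothetical name, the wrapped function is assumed to take the prompt as its only argument and raise ValueError on unusable output, and the wording of the feedback message is invented for the example.

```python
from typing import Any, Awaitable, Callable, Optional


async def retry_with_error_feedback(
    async_func: Callable[[str], Awaitable[Any]],
    prompt: str,
    max_retries: int = 3,
) -> Any:
    """Call async_func(prompt); on failure, retry with the error appended."""
    current_prompt = prompt
    last_error: Optional[Exception] = None
    for _attempt in range(max_retries):
        try:
            return await async_func(current_prompt)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = exc
            # Rebuild from the original prompt so feedback does not pile up.
            current_prompt = (
                f"{prompt}\n\n"
                f"The previous response could not be used: {exc}\n"
                "Respond again with valid JSON only."
            )
    raise RuntimeError(f"Gave up after {max_retries} attempts") from last_error
```

Only ValueError is retried here; timeouts and rate limits would need their own handling, typically with a delay rather than a modified prompt.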

Dual Backend Integration: Claude Code vs Codex (2025-10-24)

Issue

Implementing dual backend support (Claude Code and Codex) revealed several architectural differences and limitations that affect feature parity and testing strategies.

Root Cause

Claude Code and Codex have fundamentally different architectures:

  1. Automation Model: Claude Code uses automatic hooks (SessionStart, PostToolUse, PreCompact, Stop) while Codex requires explicit MCP tool invocation or wrapper scripts
  2. Agent Spawning: Claude Code has native Task tool for seamless agent spawning; Codex uses codex exec subprocess with different invocation model
  3. Configuration: Claude Code uses JSON (settings.json) with limited profiles; Codex uses TOML (config.toml) with rich profile support
  4. Transcript Format: Claude Code uses single text files (compact_*.txt); Codex uses session directories with multiple files (transcript.md, transcript_extended.md, history.jsonl)
  5. Tool Availability: Claude Code has Task, TodoWrite, WebFetch, WebSearch; Codex has Read, Write, Edit, Grep, Glob, Bash

Solutions Implemented

1. Backend Abstraction Layer (amplifier/core/backend.py):

  • Created AmplifierBackend abstract base class with methods: initialize_session(), finalize_session(), run_quality_checks(), export_transcript()
  • Implemented ClaudeCodeBackend and CodexBackend concrete classes
  • Both backends delegate to same amplifier modules (memory, extraction, search) ensuring consistency
  • Factory pattern (BackendFactory) for backend instantiation based on environment/config
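
The shape of this layer can be sketched as below. Class and method names come from the list above, but everything else is an assumption for illustration: the method signatures, the dictionary return values, the `_StubBackend` helper, and the "claude" / "codex" registry keys are not taken from amplifier/core/backend.py.

```python
from abc import ABC, abstractmethod


class AmplifierBackend(ABC):
    """Common interface both backends implement (signatures are hypothetical)."""

    name: str = ""

    @abstractmethod
    def initialize_session(self, prompt: str) -> dict: ...

    @abstractmethod
    def finalize_session(self, messages: list) -> dict: ...

    @abstractmethod
    def run_quality_checks(self, file_paths: list) -> dict: ...

    @abstractmethod
    def export_transcript(self, session_id: str) -> dict: ...


class _StubBackend(AmplifierBackend):
    """Placeholder bodies; real backends would delegate to shared modules."""

    def _result(self, operation: str) -> dict:
        return {"backend": self.name, "operation": operation, "success": True}

    def initialize_session(self, prompt: str) -> dict:
        return self._result("initialize_session")

    def finalize_session(self, messages: list) -> dict:
        return self._result("finalize_session")

    def run_quality_checks(self, file_paths: list) -> dict:
        return self._result("run_quality_checks")

    def export_transcript(self, session_id: str) -> dict:
        return self._result("export_transcript")


class ClaudeCodeBackend(_StubBackend):
    name = "claude"


class CodexBackend(_StubBackend):
    name = "codex"


class BackendFactory:
    """Create a backend from a type string (keys are assumed values)."""

    _registry = {"claude": ClaudeCodeBackend, "codex": CodexBackend}

    @classmethod
    def create(cls, backend_type: str) -> AmplifierBackend:
        try:
            return cls._registry[backend_type]()
        except KeyError:
            valid = ", ".join(sorted(cls._registry))
            raise ValueError(
                f"Unknown backend {backend_type!r}; expected one of: {valid}"
            ) from None
```

Callers depend only on `AmplifierBackend`, so workflow code such as `BackendFactory.create("codex").run_quality_checks(paths)` does not change when the backend does.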

2. Agent Abstraction Layer (amplifier/core/agent_backend.py):

  • Created AgentBackend abstract base class with spawn_agent() method
  • ClaudeCodeAgentBackend uses Claude Code SDK Task tool
  • CodexAgentBackend uses codex exec subprocess
  • Agent definitions converted from Claude format to Codex format (removed Task tool references, adapted tools array)

3. MCP Servers for Codex (.codex/mcp_servers/):

  • Implemented three MCP servers to replace Claude Code hooks:
    • session_manager - Replaces SessionStart/Stop hooks
    • quality_checker - Replaces PostToolUse hook
    • transcript_saver - Replaces PreCompact hook
  • Used FastMCP framework for rapid development
  • Servers expose tools that must be explicitly invoked (vs automatic hooks)

4. Wrapper Scripts:

  • amplify-codex.sh - Bash wrapper providing hook-like experience for Codex
  • amplify.py - Unified Python CLI for both backends
  • .codex/tools/session_init.py and session_cleanup.py - Standalone session management

5. Configuration System (amplifier/core/config.py):

  • Pydantic BackendConfig with environment variable support
  • Configuration precedence: CLI args > env vars > .env file > defaults
  • Auto-detection when AMPLIFIER_BACKEND not set
  • Validation for backend types and profiles
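
The precedence rule (CLI args > env vars > .env file > defaults) can be illustrated with a small resolver. This sketch uses only the standard library rather than Pydantic, leaves out auto-detection, and assumes the valid values are "claude" and "codex" with "claude" as the default; none of these details are taken from amplifier/core/config.py.

```python
from typing import Mapping, Optional

VALID_BACKENDS = ("claude", "codex")  # assumed values
DEFAULT_BACKEND = "claude"  # assumed default
ENV_VAR = "AMPLIFIER_BACKEND"


def resolve_backend(
    cli_value: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
    dotenv: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the backend from the highest-precedence source that sets it."""
    environ = environ or {}
    dotenv = dotenv or {}
    # Ordered from highest to lowest precedence.
    candidates = (
        ("CLI argument", cli_value),
        ("environment variable", environ.get(ENV_VAR)),
        (".env file", dotenv.get(ENV_VAR)),
        ("default", DEFAULT_BACKEND),
    )
    for source, value in candidates:
        if value:
            if value not in VALID_BACKENDS:
                raise ValueError(
                    f"Invalid backend {value!r} from {source}; "
                    f"expected one of {VALID_BACKENDS}"
                )
            return value
    raise AssertionError("unreachable: the default is always set")
```

Passing the environment and .env contents in as mappings, instead of reading `os.environ` inside the function, keeps the precedence logic testable without mutating real environment variables.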

Feature Parity Status

Full Parity:

  • ✅ Memory system (both use same MemoryStore, MemorySearcher, MemoryExtractor)
  • ✅ Quality checks (both use same make check command)
  • ✅ Agent spawning (different invocation, same agent definitions)
  • ✅ Transcript export (different formats, both functional)
  • ✅ Configuration management (different formats, both comprehensive)

Partial Parity:

  • ⚠️ Automation: Claude Code hooks are automatic; Codex requires explicit tool calls or wrapper script
  • ⚠️ Task tracking: Claude Code has TodoWrite; Codex has no equivalent (use external tools)
  • ⚠️ Slash commands: Claude Code has native support; Codex has no equivalent (use MCP tools or natural language)
  • ⚠️ Notifications: Claude Code has desktop notifications; Codex returns tool responses only

No Parity (Intentional):

  • ❌ VS Code integration: Claude Code only (Codex is CLI-first)
  • ❌ Profiles: Codex only (Claude Code has single configuration)
  • ❌ MCP servers: Codex only (Claude Code uses hooks)

Testing Challenges Discovered

1. CLI Availability in Tests:

  • Challenge: Integration tests require Claude CLI or Codex CLI to be installed
  • Solution: Mock subprocess calls at the boundary; test backend abstraction logic without requiring real CLIs
  • Impact: Tests validate command construction and orchestration but not actual CLI behavior

2. MCP Protocol Testing:

  • Challenge: Testing MCP servers requires JSON-RPC communication over stdio
  • Solution: Start servers as subprocesses and communicate via stdin/stdout; alternatively mock FastMCP for unit tests
  • Impact: Integration tests are more complex but validate real protocol compliance

3. Async Testing:

  • Challenge: Many backend operations are async (memory extraction, agent spawning)
  • Solution: Use @pytest.mark.asyncio decorator and pytest-asyncio plugin
  • Impact: Tests must handle async/await correctly; some fixtures need async variants

4. Environment Isolation:

  • Challenge: Tests must not interfere with each other or real project data
  • Solution: Use temp_dir fixtures, mock environment variables, create isolated project structures
  • Impact: Tests are slower due to setup/teardown but are reliable and deterministic

5. Cross-Backend Validation:

  • Challenge: Verifying both backends produce identical results for same operations
  • Solution: Run same test scenarios with both backends, compare outputs
  • Impact: Test suite is larger but provides confidence in feature parity
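
Mocking subprocess calls at the boundary (challenge 1) might look like the sketch below. `spawn_codex_agent` is a hypothetical wrapper written for this example, and passing the task as a single positional argument to `codex exec` is an assumption about the invocation, not a documented contract.

```python
import subprocess
from unittest.mock import patch


def spawn_codex_agent(task: str, timeout: int = 300) -> str:
    """Hypothetical wrapper: run `codex exec <task>` and return its stdout."""
    result = subprocess.run(
        ["codex", "exec", task],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    if result.returncode != 0:
        raise RuntimeError(f"codex exec failed: {result.stderr.strip()}")
    return result.stdout.strip()


def test_spawn_codex_agent_builds_expected_command() -> None:
    """Checks command construction without needing the Codex CLI installed."""
    fake = subprocess.CompletedProcess(args=[], returncode=0, stdout="done\n", stderr="")
    with patch("subprocess.run", return_value=fake) as mock_run:
        assert spawn_codex_agent("summarize README") == "done"
    command = mock_run.call_args.args[0]
    assert command == ["codex", "exec", "summarize README"]


def test_spawn_codex_agent_surfaces_failures() -> None:
    """A non-zero exit code becomes an exception carrying stderr."""
    fake = subprocess.CompletedProcess(args=[], returncode=1, stdout="", stderr="boom\n")
    with patch("subprocess.run", return_value=fake):
        try:
            spawn_codex_agent("summarize README")
        except RuntimeError as exc:
            assert "boom" in str(exc)
        else:
            raise AssertionError("expected RuntimeError")
```

As noted under Impact, tests like these validate command construction and error handling only; they say nothing about how the real CLI behaves.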

Key Learnings

  1. Abstraction enables testing: Backend abstraction layer allows testing workflows without requiring real CLIs
  2. Mock at boundaries: Mock subprocess calls and file I/O, but test real backend logic
  3. Shared modules ensure consistency: Both backends using same amplifier modules (memory, extraction, search) guarantees identical behavior
  4. Configuration is critical: Proper configuration management (precedence, validation, defaults) is essential for dual-backend support
  5. Documentation prevents confusion: Comprehensive docs (CODEX_INTEGRATION.md, BACKEND_COMPARISON.md, MIGRATION_GUIDE.md) are essential for users
  6. Smoke tests validate critical paths: Fast smoke tests catch regressions without full integration test suite
  7. Wrapper scripts bridge gaps: amplify-codex.sh provides hook-like experience for Codex despite lack of native hooks

Limitations Documented

Claude Code Limitations:

  • No profile support (single configuration for all workflows)
  • Limited CI/CD integration (requires VS Code)
  • No headless operation (VS Code extension only)
  • Hooks can't be easily disabled (always run)

Codex Limitations:

  • No automatic hooks (must invoke tools explicitly or use wrapper)
  • No slash commands (use MCP tools or natural language)
  • No TodoWrite equivalent (use external task tracking)
  • No desktop notifications (tool responses only)
  • Requires wrapper script for convenient session management

Testing Limitations:

  • Integration tests mock CLI calls (don't test actual Claude/Codex behavior)
  • MCP server tests require subprocess communication (more complex)
  • Agent spawning tests mock SDK/subprocess (don't test actual agent execution)
  • Cross-backend tests assume both backends are available (may not be true in all environments)

Prevention

  • Use backend abstraction layer for all backend operations (don't call CLIs directly)
  • Test both backends for any new feature to ensure parity
  • Document limitations clearly when features can't be replicated
  • Use wrapper scripts to provide consistent user experience across backends
  • Keep backend-specific code isolated in .claude/ and .codex/ directories
  • Maintain comprehensive documentation for both backends
  • Run smoke tests in CI to catch regressions early
  • Update DISCOVERIES.md when new limitations are found