Skip to content

Latest commit

 

History

History
857 lines (674 loc) · 25.6 KB

File metadata and controls

857 lines (674 loc) · 25.6 KB

Code Review & Refactoring Plan v2.0

Review Date: 2026-01-31 Review Scope: Last 3-4 days of development (37 commits, ~8000 LOC changed) Reviewer Perspective: Staff Engineer Assessment Overall Grade: 6.5/10 - Functional but Needs Refactoring


Executive Summary

The codebase demonstrates impressive velocity and strong technical fundamentals but has accumulated significant technical debt that will impact maintainability. The work shows senior-level execution speed with mid-level architectural discipline.

Key Achievements ✅

  • Shipped comprehensive observability (Langfuse integration)
  • Implemented streaming responses with SSE
  • Built React frontend with TypeScript
  • Added session resumption and token rate limiting
  • Created multi-agent orchestration with fallback chains
  • Integrated A/B testing and evaluation infrastructure

Critical Issues 🚨

  • 995-line TutorEngine violating Single Responsibility Principle
  • Observability code bleeding into business logic (150+ lines of repetitive Langfuse calls)
  • Documentation churn (911 lines deleted as "AI slop")
  • Test coverage not keeping pace with features
  • Global state in agent implementations

Detailed Analysis

1. Architecture Assessment

Current State: Monolithic Engine

File: app/tutor/engine.py (995 lines)

Responsibilities (Too Many):

  • Orchestration of judge/tutor/feedback agents
  • Observability and tracing
  • Token usage tracking and reporting
  • Rate limiting enforcement
  • State management
  • Streaming logic
  • Response formatting

Problem: Single class doing 7+ distinct jobs violates SRP.

Recommended Architecture

Current (Monolithic):
┌─────────────────────────────────────────────┐
│           TutorEngine (995 lines)           │
│  ┌─────────────────────────────────────┐   │
│  │ Orchestration                       │   │
│  │ + Observability                     │   │
│  │ + Token Tracking                    │   │
│  │ + Rate Limiting                     │   │
│  │ + State Management                  │   │
│  │ + Streaming                         │   │
│  └─────────────────────────────────────┘   │
└─────────────────────────────────────────────┘

Proposed (Modular):
┌──────────────────────────┐
│  TutorOrchestrator       │  ← Pure coordination (200 lines)
│  (coordination only)     │
└────────┬─────────────────┘
         │
    ┌────┴────────────────────────────────┐
    │                                     │
┌───▼────────────┐  ┌──────────────┐  ┌─▼──────────────┐
│ AgentService   │  │ Metrics      │  │ RateLimiter    │
│ (observability)│  │ Collector    │  │ (policy)       │
└────┬───────────┘  └──────────────┘  └────────────────┘
     │
┌────┴─────────────────────────────┐
│ JudgeService                     │
│ TutorService                     │
│ FeedbackService                  │
└──────────────────────────────────┘

2. Code Quality Issues

Issue #1: Repetitive Agent Wrapper Code (DRY Violation)

Location: app/tutor/engine.py:249-421

Problem: Three nearly identical methods with 20+ lines of Langfuse reporting each.

Current Code Pattern:

@observe(as_type="generation", name="judge_assessment")
async def _run_judge_agent(self, deps: JudgeDependencies, session_id: str) -> AgentResult[BloomAssessment]:
    result = await run_agent_with_fallback(
        agent=judge_agent,
        prompt="Evaluate this response.",
        deps=deps,
        output_type=BloomAssessment,
        model_config=self.llm_config.judge_model,
        agent_name="judge",
        session_id=session_id
    )

    link_judge_prompt_to_generation()

    # 20+ lines of token reporting (repeated 3x)
    if result.usage.total_tokens > 0:
        try:
            langfuse_client = get_client()
            if langfuse_client:
                model_name = result.model_used.split(":")[-1] if result.model_used else None
                update_params = {
                    "usage_details": {
                        "input": result.usage.input_tokens,
                        "output": result.usage.output_tokens,
                        "total": result.usage.total_tokens
                    }
                }
                if model_name:
                    update_params["model"] = model_name
                langfuse_client.update_current_generation(**update_params)
                logger.debug(f"[judge] Reported {result.usage.total_tokens} tokens...")
        except Exception as e:
            logger.warning(f"[judge] Failed to report usage to Langfuse: {e}")

    return result

Recommended Solution:

# Create a generic wrapper
class ObservableAgentService:
    def __init__(self, langfuse_client: Langfuse):
        self.client = langfuse_client

    @observe_with_tokens
    async def run_agent(
        self,
        agent_name: str,
        agent: Agent,
        deps: Any,
        output_type: Type[T],
        model_config: ModelConfig,
        session_id: str
    ) -> AgentResult[T]:
        result = await run_agent_with_fallback(
            agent=agent,
            prompt=f"{agent_name} execution",
            deps=deps,
            output_type=output_type,
            model_config=model_config,
            agent_name=agent_name,
            session_id=session_id
        )
        return result

# Then use it
judge_result = await self.agent_service.run_agent(
    agent_name="judge",
    agent=judge_agent,
    deps=deps,
    output_type=BloomAssessment,
    model_config=self.llm_config.judge_model,
    session_id=session_id
)

Impact:

  • Remove ~150 lines of repetitive code
  • Centralize observability logic
  • Easier to test and maintain

Issue #2: Observability Bleeding into Business Logic

Problem: Langfuse code appears in 5+ locations throughout engine.py, mixed with business logic.

Examples:

  • Lines 462-495: Trace update in handle_turn()
  • Lines 278-303: Token reporting in _run_judge_agent()
  • Lines 336-361: Token reporting in _run_tutor_agent()
  • Lines 394-419: Token reporting in _run_feedback_agent()
  • Lines 735-772: Trace updates in handle_turn_stream()

Why This Is Bad:

  • Violates separation of concerns
  • Makes business logic harder to read
  • Observability failures can break business logic
  • Testing requires mocking Langfuse everywhere

Recommended Pattern - Decorator Approach:

# Create reusable decorators
def trace_generation(agent_name: str, report_tokens: bool = True):
    """Decorator to add Langfuse tracing to agent methods."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            with observe(as_type="generation", name=agent_name):
                result = await func(*args, **kwargs)

                if report_tokens and result.usage.total_tokens > 0:
                    _report_usage_to_langfuse(result, agent_name)

                return result
        return wrapper
    return decorator

# Then use it
@trace_generation(agent_name="judge", report_tokens=True)
async def _run_judge_agent(self, deps: JudgeDependencies, session_id: str):
    return await run_agent_with_fallback(...)
    # That's it. No Langfuse code here.

Impact:

  • Business logic becomes pure and readable
  • Observability is opt-in via decorators
  • Easy to disable for testing
  • Centralized error handling for observability

Issue #3: Global State in Agent Implementations

Location: app/llm/judge_agent.py:16-17

Problem:

_judge_prompt_service: Optional[PromptService] = None
_judge_prompt_metadata: Optional[Dict[str, Any]] = None

Why This Is Bad:

  1. Testing nightmare - Must reset globals between tests
  2. Thread safety - Potential race conditions in async context
  3. Hidden dependencies - Function behavior depends on external state
  4. Difficult to reason about - Who sets these? When? In what order?

Recommended Solution - Dependency Injection:

# Instead of globals
class JudgeAgentService:
    def __init__(
        self,
        prompt_service: Optional[PromptService] = None,
        llm_config: LLMConfig = None
    ):
        self.prompt_service = prompt_service
        self.llm_config = llm_config or get_llm_config()
        self.prompt_metadata: Dict[str, Any] = {}

        # Initialize agent
        self.agent = Agent(
            model=create_model_with_fallback(...),
            deps_type=JudgeDependencies,
            retries=self.llm_config.judge_model.max_retries,
            output_type=BloomAssessment
        )

    async def evaluate(self, deps: JudgeDependencies) -> AgentResult[BloomAssessment]:
        """Execute judge agent with proper dependency management."""
        ...

Impact:

  • Testable with dependency injection
  • Thread-safe by design
  • Clear ownership of state
  • Easier to mock in tests

Issue #4: Magic Parameter Names (Implicit Coupling)

Location: app/tutor/engine.py:430

Problem:

async def handle_turn(
    self,
    session_id: str,
    user_message: str,
    user_id: str = "default",
    topic: str = "General",
    langfuse_trace_id: Optional[str] = None  # Special parameter - @observe decorator uses this!
):

Issue: Function signature depends on decorator implementation details. Renaming this parameter breaks observability silently.

Recommended Solution:

# Option 1: Explicit context object
@dataclass
class TurnContext:
    session_id: str
    user_message: str
    user_id: str = "default"
    topic: str = "General"
    trace_id: Optional[str] = None  # Explicit, not magic

async def handle_turn(self, context: TurnContext) -> TutorResponse:
    """Process a single turn with explicit context."""
    ...

# Option 2: Use context manager
async def handle_turn(self, session_id: str, user_message: str, ...):
    trace_id = session_id.replace("-", "")
    with langfuse.trace(trace_id=trace_id):
        # Observability is explicit, not parameter-based
        ...

Issue #5: Hardcoded Magic Numbers

Location: app/tutor/engine.py:502

Problem:

await self.rate_limiter.check_budget(user_id=user_id, estimated_tokens=2000)

Issues:

  • Token estimate is hardcoded
  • No visibility into why 2000 was chosen
  • Will become inaccurate as models/prompts change

Recommended Solution:

# Add to ModelConfig
@dataclass
class ModelConfig:
    model_name: str
    fallback_models: list[str]
    max_retries: int
    timeout_seconds: int
    estimated_tokens_per_request: int = 2000  # Make it configurable

# Then use it
estimated = (
    self.llm_config.judge_model.estimated_tokens_per_request +
    self.llm_config.tutor_model.estimated_tokens_per_request +
    self.llm_config.feedback_model.estimated_tokens_per_request
)
await self.rate_limiter.check_budget(user_id=user_id, estimated_tokens=estimated)

3. Documentation Issues

Problem: High Documentation Churn

Evidence from Git History:

7313549 Remove multiple documentation files. (911 lines deleted)
27e2842 Remove the files with AI slop and over written verbose markdown info.
46911d8 docs: remove redundant scoring documentation files

What This Signals:

  • Unclear requirements leading to exploratory documentation
  • Possible over-reliance on AI-generated content
  • Documentation not aligned with actual implementation
  • Lack of lightweight Architecture Decision Records (ADRs)

Recommended Approach:

  1. Replace verbose docs with concise ADRs:
# ADR-001: Why Langfuse for Observability

**Status:** Accepted
**Date:** 2026-01-30
**Deciders:** [Team]

## Context
Need observability for multi-agent LLM system to track costs, performance, and quality.

## Decision
Use Langfuse for tracing, prompt management, and evaluation.

## Alternatives Considered
- LangSmith: Better integration with LangChain, but we use PydanticAI
- Custom logging: Full control but high maintenance burden
- Weights & Biases: Strong ML focus but overkill for our use case

## Consequences
**Positive:**
- Built-in prompt versioning and A/B testing
- Cost tracking across providers
- Production-ready UI for analytics

**Negative:**
- Tight coupling to Langfuse APIs
- Additional infrastructure dependency
- Learning curve for team
  1. Keep implementation docs focused and current:
    • One doc per major component (already done: docs/observability.md)
    • Code examples over prose
    • Clear "last updated" dates

4. Testing Gaps

Current State

  • ✅ 19 test files exist in tests/
  • No test updates in last 10 commits despite adding:
    • app/evaluation/scorer.py (356 lines) - ZERO tests
    • app/prompts/service.py (313 lines) - Tests added but need more coverage
    • app/prompts/ab_testing.py (126 lines) - ZERO tests
    • app/api/feedback_router.py (172 lines) - ZERO tests

Test Debt Calculation

New production code:   1,167 lines
New test code:         ~271 lines (prompt_service only)
Test coverage ratio:   23% (target should be 70-80%)

Priority Testing Needs

P0 (Critical - No Production Deployment Without These):

# tests/test_evaluation_scorer.py
async def test_scorer_reports_quality_metrics():
    """Verify scorer creates Langfuse scores correctly."""

async def test_scorer_handles_langfuse_failures_gracefully():
    """Scoring failures shouldn't break learning sessions."""

# tests/test_feedback_router.py
async def test_user_feedback_endpoint():
    """Test POST /api/v1/feedback with satisfaction scores."""

async def test_feedback_links_to_trace():
    """Verify feedback is linked to correct Langfuse trace."""

P1 (High - Should Be Added This Week):

# tests/test_ab_testing.py
def test_cohort_assignment_is_consistent():
    """Same user_id always gets same variant."""

def test_cohort_distribution_is_balanced():
    """50/50 split across large user sample."""

5. Dependency Audit

Current Dependencies (29 packages)

# Web Framework
fastapi>=0.128.0
uvicorn[standard]>=0.40.0

# LLM Orchestration
pydantic-ai>=1.48.0
pydantic>=2.12.5

# LLM Providers (4 providers)
google-genai>=1.60.0
anthropic>=0.76.0
openai>=2.16.0
deepseek>=1.0.0

# Observability
langfuse>=3.12.1

# Database (UNUSED?)
sqlalchemy>=2.0.46
aiosqlite>=0.22.1
alembic>=1.18.1

# Utilities
python-dotenv>=1.0.1
httpx>=0.28.1
tenacity>=9.1.2

# CLI
rich>=14.3.1
typer>=0.21.1
prompt-toolkit>=3.0.0
redis>=5.0.1

Issues Identified

1. SQLAlchemy + Alembic appear unused:

# Search results
grep -r "sqlalchemy" app/  # No imports found
grep -r "alembic" app/      # No imports found

Recommendation: Remove unless planned for future use. Redis is handling state storage.

2. Four LLM providers:

  • OpenAI (primary)
  • Google/Gemini (fallback)
  • Anthropic (not actively used?)
  • DeepSeek (not actively used?)

Recommendation: Keep OpenAI + Gemini. Remove unused providers or document why they're needed.


Refactoring Roadmap

Phase 1: Critical Path (Week 1) - 5 days

Day 1-2: Extract Observability Middleware

Goal: Remove 150+ lines of repetitive Langfuse code.

Tasks:

  1. Create app/observability/middleware.py:

    class ObservabilityMiddleware:
        async def wrap_agent_call(self, agent_name: str, fn: Callable, *args, **kwargs)
  2. Create decorators:

    @trace_generation(agent_name="judge")
    async def _run_judge_agent(...)
  3. Refactor engine.py to use middleware

  4. Add tests for middleware in isolation

Success Metrics:

  • engine.py reduced by 150+ lines
  • All existing tests pass
  • No Langfuse code in handle_turn() business logic

Day 3: Split TutorEngine into Services

Goal: Break 995-line file into focused modules.

New Structure:

app/tutor/
├── engine.py (200 lines - orchestration only)
├── services/
│   ├── __init__.py
│   ├── judge_service.py (100 lines)
│   ├── tutor_service.py (100 lines)
│   ├── feedback_service.py (100 lines)
│   └── metrics_collector.py (80 lines)

Migration Steps:

  1. Create AgentServiceBase class with common logic
  2. Extract judge/tutor/feedback services
  3. Extract MetricsCollector for token tracking
  4. Update TutorEngine.__init__() to use services
  5. Run full test suite after each extraction

Success Metrics:

  • Each file < 250 lines
  • All services independently testable
  • Test suite passes

Day 4: Add Missing Critical Tests

Goal: Get scorer and feedback router to 80% coverage.

Tests to Write:

# tests/test_evaluation_scorer.py (4 tests)
- test_score_turn_complete_success()
- test_score_turn_complete_with_langfuse_failure()
- test_score_first_turn_handling()
- test_cost_tracking_accuracy()

# tests/test_feedback_router.py (3 tests)
- test_user_feedback_endpoint_success()
- test_feedback_validates_score_range()
- test_feedback_links_to_correct_trace()

# tests/test_ab_testing.py (3 tests)
- test_cohort_assignment_consistency()
- test_cohort_distribution()
- test_variant_selection()

Success Metrics:

  • All new production code has tests
  • Coverage report shows 70%+ on new modules
  • CI pipeline includes coverage checks

Day 5: Dependency Cleanup & ADR Documentation

Goal: Remove unused dependencies, document key decisions.

Tasks:

  1. Dependency Audit:

    # Remove unused dependencies
    uv remove sqlalchemy aiosqlite alembic
    
    # Audit provider usage
    grep -r "anthropic\|deepseek" app/
    # If unused, remove them
  2. Write ADRs:

    • ADR-001: Langfuse for observability
    • ADR-002: Redis over in-memory state
    • ADR-003: Multi-provider fallback strategy
    • ADR-004: PydanticAI over LangChain
  3. Update docs:

    • Consolidate implementation docs
    • Add "Architecture" section to README
    • Remove outdated/verbose docs

Success Metrics:

  • pyproject.toml only lists used dependencies
  • 4 ADRs written and reviewed
  • Documentation reduced by 30% but clarity improved

Phase 2: Quality Improvements (Week 2) - 3 days

Day 6: Remove Global State from Agents

Goal: Convert agents to use dependency injection.

Tasks:

  1. Create JudgeAgentService class
  2. Create TutorAgentService class
  3. Create FeedbackAgentService class
  4. Update TutorOrchestrator to inject services
  5. Remove global variables from app/llm/*_agent.py

Success Metrics:

  • No global state in agent files
  • Services are thread-safe
  • Integration tests pass

Day 7: Improve Type Safety

Goal: Add stricter type checking and reduce Any usage.

Tasks:

  1. Run mypy --strict app/
  2. Fix type errors in order:
    • app/tutor/engine.py
    • app/llm/*.py
    • app/evaluation/*.py
  3. Remove Any types where possible
  4. Add py.typed marker for library consumers

Success Metrics:

  • mypy --strict passes with 0 errors
  • No type: ignore comments added
  • Type coverage > 95%

Day 8: Performance Optimization

Goal: Identify and fix performance bottlenecks.

Tasks:

  1. Add profiling to key endpoints
  2. Identify slow queries/operations
  3. Add caching where appropriate:
    # Cache prompt fetches (they don't change often)
    @lru_cache(maxsize=100)
    async def get_prompt(self, name: str, label: str = "production"):
        ...
  4. Parallelize independent operations
  5. Add performance tests

Success Metrics:

  • P95 latency < 2 seconds for /chat endpoint
  • Token rate limiter overhead < 50ms
  • Langfuse reporting happens async (doesn't block response)

Phase 3: Production Hardening (Week 3) - 2 days

Day 9: Error Handling & Resilience

Goal: Comprehensive error handling and graceful degradation.

Tasks:

  1. Audit all error handlers in app/api/
  2. Add circuit breaker for Langfuse calls
  3. Implement retry budgets for rate limiter
  4. Add structured error responses
  5. Test failure scenarios:
    • Redis down
    • Langfuse unavailable
    • All LLM providers failing
    • Rate limit exceeded

Success Metrics:

  • System degrades gracefully (logs errors but serves users)
  • No 500 errors for external service failures
  • Circuit breakers prevent cascade failures

Day 10: Monitoring & Alerting Setup

Goal: Production-ready observability beyond Langfuse.

Tasks:

  1. Add Prometheus metrics:

    from prometheus_client import Counter, Histogram
    
    request_latency = Histogram('tutor_request_duration_seconds', 'Request latency')
    llm_calls = Counter('llm_calls_total', 'Total LLM calls', ['provider', 'status'])
  2. Add health check endpoint:

    @router.get("/health")
    async def health_check():
        return {
            "redis": await check_redis(),
            "langfuse": await check_langfuse(),
            "llm_providers": await check_llm_providers()
        }
  3. Document alerting rules:

    • Alert if error rate > 5%
    • Alert if P95 latency > 5s
    • Alert if Redis is down

Success Metrics:

  • /health endpoint returns detailed status
  • Prometheus metrics exported at /metrics
  • Runbook created for on-call engineers

Success Metrics (Overall)

Code Quality Metrics

Metric Current Target Priority
Largest file size 995 lines < 300 lines P0
Test coverage ~60% > 80% P0
Repetitive code blocks 5+ 0 P1
Global state usage 6 instances 0 P1
Type coverage ~85% > 95% P2
Dependency count 29 < 20 P2

Performance Metrics

Metric Current Target Priority
P95 latency (/chat) ~3-4s < 2s P1
Test suite runtime Unknown < 30s P2
Rate limiter overhead ~100ms < 50ms P2

Architecture Metrics

Metric Current Target Priority
Max class responsibilities 7 1-2 P0
Service coupling High Loose P1
Observability separation Mixed Clean P0

Risk Assessment

High Risk (Must Address Before Next Feature)

  1. 995-line TutorEngine - Will become unmaintainable
  2. Test coverage gaps - Scorer and feedback router have zero tests
  3. Observability coupling - Changes to Langfuse break business logic

Medium Risk (Should Address This Sprint)

  1. Global state in agents - Potential thread safety issues
  2. Documentation debt - Team knowledge is in code, not docs
  3. Unused dependencies - Bloating deployment size

Low Risk (Can Defer)

  1. Type coverage - Already decent, not blocking
  2. Performance optimization - Current performance is acceptable
  3. Additional LLM providers - Not needed yet

Conclusion

What We Did Right

  • Fast execution - Shipped impressive scope in 4 days
  • Error handling - Fallback chains are production-grade
  • Type safety - Good use of Pydantic and type hints
  • Observability - Comprehensive (though needs cleanup)

What Needs Improvement

  • Architecture - Extract concerns into focused modules
  • Testing - Keep pace with feature development
  • Documentation - Use lightweight ADRs, not verbose guides
  • Dependencies - Audit and remove unused packages

Recommendation

Pause feature development for 1-2 weeks to address technical debt. The current architecture is unsustainable for the next 3-6 months of planned features.

If we continue at current pace without refactoring, we'll hit a complexity wall where:

  • New features take 2-3x longer to implement
  • Bug rate increases exponentially
  • New team members need weeks to onboard
  • Refactoring becomes riskier and more expensive

The best time to refactor is now, while the codebase is still comprehensible.


Next Steps

  1. Review this document with the team
  2. Prioritize refactoring items based on team capacity
  3. Create tickets for Phase 1 tasks (Week 1)
  4. Set up monitoring for success metrics
  5. Schedule weekly check-ins to track progress

Document Version: 2.0 Last Updated: 2026-01-31 Next Review: After Phase 1 completion (2026-02-07)