Code Review & Refactoring Plan v2.0

Review Date: 2026-01-31 Review Scope: Last 3-4 days of development (37 commits, ~8000 LOC changed) Reviewer Perspective: Staff Engineer Assessment Overall Grade: 6.5/10 - Functional but Needs Refactoring

Executive Summary

The codebase demonstrates impressive velocity and strong technical fundamentals but has accumulated significant technical debt that will impact maintainability. The work shows senior-level execution speed with mid-level architectural discipline.

Key Achievements ✅

Shipped comprehensive observability (Langfuse integration)
Implemented streaming responses with SSE
Built React frontend with TypeScript
Added session resumption and token rate limiting
Created multi-agent orchestration with fallback chains
Integrated A/B testing and evaluation infrastructure

Critical Issues 🚨

995-line TutorEngine violating Single Responsibility Principle
Observability code bleeding into business logic (150+ lines of repetitive Langfuse calls)
Documentation churn (911 lines deleted as "AI slop")
Test coverage not keeping pace with features
Global state in agent implementations

Detailed Analysis

1. Architecture Assessment

Current State: Monolithic Engine

File: app/tutor/engine.py (995 lines)

Responsibilities (Too Many):

Orchestration of judge/tutor/feedback agents
Observability and tracing
Token usage tracking and reporting
Rate limiting enforcement
State management
Streaming logic
Response formatting

Problem: Single class doing 7+ distinct jobs violates SRP.

Recommended Architecture

Current (Monolithic):
┌─────────────────────────────────────────────┐
│           TutorEngine (995 lines)           │
│  ┌─────────────────────────────────────┐   │
│  │ Orchestration                       │   │
│  │ + Observability                     │   │
│  │ + Token Tracking                    │   │
│  │ + Rate Limiting                     │   │
│  │ + State Management                  │   │
│  │ + Streaming                         │   │
│  └─────────────────────────────────────┘   │
└─────────────────────────────────────────────┘

Proposed (Modular):
┌──────────────────────────┐
│  TutorOrchestrator       │  ← Pure coordination (200 lines)
│  (coordination only)     │
└────────┬─────────────────┘
         │
    ┌────┴────────────────────────────────┐
    │                                     │
┌───▼────────────┐  ┌──────────────┐  ┌─▼──────────────┐
│ AgentService   │  │ Metrics      │  │ RateLimiter    │
│ (observability)│  │ Collector    │  │ (policy)       │
└────┬───────────┘  └──────────────┘  └────────────────┘
     │
┌────┴─────────────────────────────┐
│ JudgeService                     │
│ TutorService                     │
│ FeedbackService                  │
└──────────────────────────────────┘

2. Code Quality Issues

Issue #1: Repetitive Agent Wrapper Code (DRY Violation)

Location: app/tutor/engine.py:249-421

Problem: Three nearly identical methods with 20+ lines of Langfuse reporting each.

Current Code Pattern:

@observe(as_type="generation", name="judge_assessment")
async def _run_judge_agent(self, deps: JudgeDependencies, session_id: str) -> AgentResult[BloomAssessment]:
    result = await run_agent_with_fallback(
        agent=judge_agent,
        prompt="Evaluate this response.",
        deps=deps,
        output_type=BloomAssessment,
        model_config=self.llm_config.judge_model,
        agent_name="judge",
        session_id=session_id
    )

    link_judge_prompt_to_generation()

    # 20+ lines of token reporting (repeated 3x)
    if result.usage.total_tokens > 0:
        try:
            langfuse_client = get_client()
            if langfuse_client:
                model_name = result.model_used.split(":")[-1] if result.model_used else None
                update_params = {
                    "usage_details": {
                        "input": result.usage.input_tokens,
                        "output": result.usage.output_tokens,
                        "total": result.usage.total_tokens
                    }
                }
                if model_name:
                    update_params["model"] = model_name
                langfuse_client.update_current_generation(**update_params)
                logger.debug(f"[judge] Reported {result.usage.total_tokens} tokens...")
        except Exception as e:
            logger.warning(f"[judge] Failed to report usage to Langfuse: {e}")

    return result

Recommended Solution:

# Create a generic wrapper
class ObservableAgentService:
    def __init__(self, langfuse_client: Langfuse):
        self.client = langfuse_client

    @observe_with_tokens
    async def run_agent(
        self,
        agent_name: str,
        agent: Agent,
        deps: Any,
        output_type: Type[T],
        model_config: ModelConfig,
        session_id: str
    ) -> AgentResult[T]:
        result = await run_agent_with_fallback(
            agent=agent,
            prompt=f"{agent_name} execution",
            deps=deps,
            output_type=output_type,
            model_config=model_config,
            agent_name=agent_name,
            session_id=session_id
        )
        return result

# Then use it
judge_result = await self.agent_service.run_agent(
    agent_name="judge",
    agent=judge_agent,
    deps=deps,
    output_type=BloomAssessment,
    model_config=self.llm_config.judge_model,
    session_id=session_id
)

Impact:

Remove ~150 lines of repetitive code
Centralize observability logic
Easier to test and maintain

Issue #2: Observability Bleeding into Business Logic

Problem: Langfuse code appears in 5+ locations throughout engine.py, mixed with business logic.

Examples:

Lines 462-495: Trace update in handle_turn()
Lines 278-303: Token reporting in _run_judge_agent()
Lines 336-361: Token reporting in _run_tutor_agent()
Lines 394-419: Token reporting in _run_feedback_agent()
Lines 735-772: Trace updates in handle_turn_stream()

Why This Is Bad:

Violates separation of concerns
Makes business logic harder to read
Observability failures can break business logic
Testing requires mocking Langfuse everywhere

Recommended Pattern - Decorator Approach:

# Create reusable decorators
def trace_generation(agent_name: str, report_tokens: bool = True):
    """Decorator to add Langfuse tracing to agent methods."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            with observe(as_type="generation", name=agent_name):
                result = await func(*args, **kwargs)

                if report_tokens and result.usage.total_tokens > 0:
                    _report_usage_to_langfuse(result, agent_name)

                return result
        return wrapper
    return decorator

# Then use it
@trace_generation(agent_name="judge", report_tokens=True)
async def _run_judge_agent(self, deps: JudgeDependencies, session_id: str):
    return await run_agent_with_fallback(...)
    # That's it. No Langfuse code here.

Impact:

Business logic becomes pure and readable
Observability is opt-in via decorators
Easy to disable for testing
Centralized error handling for observability

Issue #3: Global State in Agent Implementations

Location: app/llm/judge_agent.py:16-17

Problem:

_judge_prompt_service: Optional[PromptService] = None
_judge_prompt_metadata: Optional[Dict[str, Any]] = None

Why This Is Bad:

Testing nightmare - Must reset globals between tests
Thread safety - Potential race conditions in async context
Hidden dependencies - Function behavior depends on external state
Difficult to reason about - Who sets these? When? In what order?

Recommended Solution - Dependency Injection:

# Instead of globals
class JudgeAgentService:
    def __init__(
        self,
        prompt_service: Optional[PromptService] = None,
        llm_config: LLMConfig = None
    ):
        self.prompt_service = prompt_service
        self.llm_config = llm_config or get_llm_config()
        self.prompt_metadata: Dict[str, Any] = {}

        # Initialize agent
        self.agent = Agent(
            model=create_model_with_fallback(...),
            deps_type=JudgeDependencies,
            retries=self.llm_config.judge_model.max_retries,
            output_type=BloomAssessment
        )

    async def evaluate(self, deps: JudgeDependencies) -> AgentResult[BloomAssessment]:
        """Execute judge agent with proper dependency management."""
        ...

Impact:

Testable with dependency injection
Thread-safe by design
Clear ownership of state
Easier to mock in tests

Issue #4: Magic Parameter Names (Implicit Coupling)

Location: app/tutor/engine.py:430

Problem:

async def handle_turn(
    self,
    session_id: str,
    user_message: str,
    user_id: str = "default",
    topic: str = "General",
    langfuse_trace_id: Optional[str] = None  # Special parameter - @observe decorator uses this!
):

Issue: Function signature depends on decorator implementation details. Renaming this parameter breaks observability silently.

Recommended Solution:

# Option 1: Explicit context object
@dataclass
class TurnContext:
    session_id: str
    user_message: str
    user_id: str = "default"
    topic: str = "General"
    trace_id: Optional[str] = None  # Explicit, not magic

async def handle_turn(self, context: TurnContext) -> TutorResponse:
    """Process a single turn with explicit context."""
    ...

# Option 2: Use context manager
async def handle_turn(self, session_id: str, user_message: str, ...):
    trace_id = session_id.replace("-", "")
    with langfuse.trace(trace_id=trace_id):
        # Observability is explicit, not parameter-based
        ...

Issue #5: Hardcoded Magic Numbers

Location: app/tutor/engine.py:502

Problem:

await self.rate_limiter.check_budget(user_id=user_id, estimated_tokens=2000)

Issues:

Token estimate is hardcoded
No visibility into why 2000 was chosen
Will become inaccurate as models/prompts change

Recommended Solution:

# Add to ModelConfig
@dataclass
class ModelConfig:
    model_name: str
    fallback_models: list[str]
    max_retries: int
    timeout_seconds: int
    estimated_tokens_per_request: int = 2000  # Make it configurable

# Then use it
estimated = (
    self.llm_config.judge_model.estimated_tokens_per_request +
    self.llm_config.tutor_model.estimated_tokens_per_request +
    self.llm_config.feedback_model.estimated_tokens_per_request
)
await self.rate_limiter.check_budget(user_id=user_id, estimated_tokens=estimated)

3. Documentation Issues

Problem: High Documentation Churn

Evidence from Git History:

7313549 Remove multiple documentation files. (911 lines deleted)
27e2842 Remove the files with AI slop and over written verbose markdown info.
46911d8 docs: remove redundant scoring documentation files

What This Signals:

Unclear requirements leading to exploratory documentation
Possible over-reliance on AI-generated content
Documentation not aligned with actual implementation
Lack of lightweight Architecture Decision Records (ADRs)

Recommended Approach:

Replace verbose docs with concise ADRs:

# ADR-001: Why Langfuse for Observability

**Status:** Accepted
**Date:** 2026-01-30
**Deciders:** [Team]

## Context
Need observability for multi-agent LLM system to track costs, performance, and quality.

## Decision
Use Langfuse for tracing, prompt management, and evaluation.

## Alternatives Considered
- LangSmith: Better integration with LangChain, but we use PydanticAI
- Custom logging: Full control but high maintenance burden
- Weights & Biases: Strong ML focus but overkill for our use case

## Consequences
**Positive:**
- Built-in prompt versioning and A/B testing
- Cost tracking across providers
- Production-ready UI for analytics

**Negative:**
- Tight coupling to Langfuse APIs
- Additional infrastructure dependency
- Learning curve for team

Keep implementation docs focused and current:
- One doc per major component (already done: docs/observability.md)
- Code examples over prose
- Clear "last updated" dates

4. Testing Gaps

Current State

✅ 19 test files exist in tests/
❌ No test updates in last 10 commits despite adding:
- app/evaluation/scorer.py (356 lines) - ZERO tests
- app/prompts/service.py (313 lines) - Tests added but need more coverage
- app/prompts/ab_testing.py (126 lines) - ZERO tests
- app/api/feedback_router.py (172 lines) - ZERO tests

Test Debt Calculation

New production code:   1,167 lines
New test code:         ~271 lines (prompt_service only)
Test coverage ratio:   23% (target should be 70-80%)

Priority Testing Needs

P0 (Critical - No Production Deployment Without These):

# tests/test_evaluation_scorer.py
async def test_scorer_reports_quality_metrics():
    """Verify scorer creates Langfuse scores correctly."""

async def test_scorer_handles_langfuse_failures_gracefully():
    """Scoring failures shouldn't break learning sessions."""

# tests/test_feedback_router.py
async def test_user_feedback_endpoint():
    """Test POST /api/v1/feedback with satisfaction scores."""

async def test_feedback_links_to_trace():
    """Verify feedback is linked to correct Langfuse trace."""

P1 (High - Should Be Added This Week):

# tests/test_ab_testing.py
def test_cohort_assignment_is_consistent():
    """Same user_id always gets same variant."""

def test_cohort_distribution_is_balanced():
    """50/50 split across large user sample."""

5. Dependency Audit

Current Dependencies (29 packages)

# Web Framework
fastapi>=0.128.0
uvicorn[standard]>=0.40.0

# LLM Orchestration
pydantic-ai>=1.48.0
pydantic>=2.12.5

# LLM Providers (4 providers)
google-genai>=1.60.0
anthropic>=0.76.0
openai>=2.16.0
deepseek>=1.0.0

# Observability
langfuse>=3.12.1

# Database (UNUSED?)
sqlalchemy>=2.0.46
aiosqlite>=0.22.1
alembic>=1.18.1

# Utilities
python-dotenv>=1.0.1
httpx>=0.28.1
tenacity>=9.1.2

# CLI
rich>=14.3.1
typer>=0.21.1
prompt-toolkit>=3.0.0
redis>=5.0.1

Issues Identified

1. SQLAlchemy + Alembic appear unused:

# Search results
grep -r "sqlalchemy" app/  # No imports found
grep -r "alembic" app/      # No imports found

Recommendation: Remove unless planned for future use. Redis is handling state storage.

2. Four LLM providers:

OpenAI (primary)
Google/Gemini (fallback)
Anthropic (not actively used?)
DeepSeek (not actively used?)

Recommendation: Keep OpenAI + Gemini. Remove unused providers or document why they're needed.

Refactoring Roadmap

Phase 1: Critical Path (Week 1) - 5 days

Day 1-2: Extract Observability Middleware

Goal: Remove 150+ lines of repetitive Langfuse code.

Tasks:

Create app/observability/middleware.py:

class ObservabilityMiddleware:
    async def wrap_agent_call(self, agent_name: str, fn: Callable, *args, **kwargs)

Create decorators:

@trace_generation(agent_name="judge")
async def _run_judge_agent(...)

Refactor engine.py to use middleware
Add tests for middleware in isolation

Success Metrics:

engine.py reduced by 150+ lines
All existing tests pass
No Langfuse code in handle_turn() business logic

Day 3: Split TutorEngine into Services

Goal: Break 995-line file into focused modules.

New Structure:

app/tutor/
├── engine.py (200 lines - orchestration only)
├── services/
│   ├── __init__.py
│   ├── judge_service.py (100 lines)
│   ├── tutor_service.py (100 lines)
│   ├── feedback_service.py (100 lines)
│   └── metrics_collector.py (80 lines)

Migration Steps:

Create AgentServiceBase class with common logic
Extract judge/tutor/feedback services
Extract MetricsCollector for token tracking
Update TutorEngine.__init__() to use services
Run full test suite after each extraction

Success Metrics:

Each file < 250 lines
All services independently testable
Test suite passes

Day 4: Add Missing Critical Tests

Goal: Get scorer and feedback router to 80% coverage.

Tests to Write:

# tests/test_evaluation_scorer.py (4 tests)
- test_score_turn_complete_success()
- test_score_turn_complete_with_langfuse_failure()
- test_score_first_turn_handling()
- test_cost_tracking_accuracy()

# tests/test_feedback_router.py (3 tests)
- test_user_feedback_endpoint_success()
- test_feedback_validates_score_range()
- test_feedback_links_to_correct_trace()

# tests/test_ab_testing.py (3 tests)
- test_cohort_assignment_consistency()
- test_cohort_distribution()
- test_variant_selection()

Success Metrics:

All new production code has tests
Coverage report shows 70%+ on new modules
CI pipeline includes coverage checks

Day 5: Dependency Cleanup & ADR Documentation

Goal: Remove unused dependencies, document key decisions.

Tasks:

Dependency Audit:

# Remove unused dependencies
uv remove sqlalchemy aiosqlite alembic

# Audit provider usage
grep -r "anthropic\|deepseek" app/
# If unused, remove them

Write ADRs:
- ADR-001: Langfuse for observability
- ADR-002: Redis over in-memory state
- ADR-003: Multi-provider fallback strategy
- ADR-004: PydanticAI over LangChain
Update docs:
- Consolidate implementation docs
- Add "Architecture" section to README
- Remove outdated/verbose docs

Success Metrics:

pyproject.toml only lists used dependencies
4 ADRs written and reviewed
Documentation reduced by 30% but clarity improved

Phase 2: Quality Improvements (Week 2) - 3 days

Day 6: Remove Global State from Agents

Goal: Convert agents to use dependency injection.

Tasks:

Create JudgeAgentService class
Create TutorAgentService class
Create FeedbackAgentService class
Update TutorOrchestrator to inject services
Remove global variables from app/llm/*_agent.py

Success Metrics:

No global state in agent files
Services are thread-safe
Integration tests pass

Day 7: Improve Type Safety

Goal: Add stricter type checking and reduce Any usage.

Tasks:

Run mypy --strict app/
Fix type errors in order:
- app/tutor/engine.py
- app/llm/*.py
- app/evaluation/*.py
Remove Any types where possible
Add py.typed marker for library consumers

Success Metrics:

mypy --strict passes with 0 errors
No type: ignore comments added
Type coverage > 95%

Day 8: Performance Optimization

Goal: Identify and fix performance bottlenecks.

Tasks:

Add profiling to key endpoints
Identify slow queries/operations

Add caching where appropriate:

# Cache prompt fetches (they don't change often)
@lru_cache(maxsize=100)
async def get_prompt(self, name: str, label: str = "production"):
    ...

Parallelize independent operations
Add performance tests

Success Metrics:

P95 latency < 2 seconds for /chat endpoint
Token rate limiter overhead < 50ms
Langfuse reporting happens async (doesn't block response)

Phase 3: Production Hardening (Week 3) - 2 days

Day 9: Error Handling & Resilience

Goal: Comprehensive error handling and graceful degradation.

Tasks:

Audit all error handlers in app/api/
Add circuit breaker for Langfuse calls
Implement retry budgets for rate limiter
Add structured error responses
Test failure scenarios:
- Redis down
- Langfuse unavailable
- All LLM providers failing
- Rate limit exceeded

Success Metrics:

System degrades gracefully (logs errors but serves users)
No 500 errors for external service failures
Circuit breakers prevent cascade failures

Day 10: Monitoring & Alerting Setup

Goal: Production-ready observability beyond Langfuse.

Tasks:

Add Prometheus metrics:

from prometheus_client import Counter, Histogram

request_latency = Histogram('tutor_request_duration_seconds', 'Request latency')
llm_calls = Counter('llm_calls_total', 'Total LLM calls', ['provider', 'status'])

Add health check endpoint:

@router.get("/health")
async def health_check():
    return {
        "redis": await check_redis(),
        "langfuse": await check_langfuse(),
        "llm_providers": await check_llm_providers()
    }

Document alerting rules:
- Alert if error rate > 5%
- Alert if P95 latency > 5s
- Alert if Redis is down

Success Metrics:

/health endpoint returns detailed status
Prometheus metrics exported at /metrics
Runbook created for on-call engineers

Success Metrics (Overall)

Code Quality Metrics

Metric	Current	Target	Priority
Largest file size	995 lines	< 300 lines	P0
Test coverage	~60%	> 80%	P0
Repetitive code blocks	5+	0	P1
Global state usage	6 instances	0	P1
Type coverage	~85%	> 95%	P2
Dependency count	29	< 20	P2

Performance Metrics

Metric	Current	Target	Priority
P95 latency (/chat)	~3-4s	< 2s	P1
Test suite runtime	Unknown	< 30s	P2
Rate limiter overhead	~100ms	< 50ms	P2

Architecture Metrics

Metric	Current	Target	Priority
Max class responsibilities	7	1-2	P0
Service coupling	High	Loose	P1
Observability separation	Mixed	Clean	P0

Risk Assessment

High Risk (Must Address Before Next Feature)

995-line TutorEngine - Will become unmaintainable
Test coverage gaps - Scorer and feedback router have zero tests
Observability coupling - Changes to Langfuse break business logic

Medium Risk (Should Address This Sprint)

Global state in agents - Potential thread safety issues
Documentation debt - Team knowledge is in code, not docs
Unused dependencies - Bloating deployment size

Low Risk (Can Defer)

Type coverage - Already decent, not blocking
Performance optimization - Current performance is acceptable
Additional LLM providers - Not needed yet

Conclusion

What We Did Right

Fast execution - Shipped impressive scope in 4 days
Error handling - Fallback chains are production-grade
Type safety - Good use of Pydantic and type hints
Observability - Comprehensive (though needs cleanup)

What Needs Improvement

Architecture - Extract concerns into focused modules
Testing - Keep pace with feature development
Documentation - Use lightweight ADRs, not verbose guides
Dependencies - Audit and remove unused packages

Recommendation

Pause feature development for 1-2 weeks to address technical debt. The current architecture is unsustainable for the next 3-6 months of planned features.

If we continue at current pace without refactoring, we'll hit a complexity wall where:

New features take 2-3x longer to implement
Bug rate increases exponentially
New team members need weeks to onboard
Refactoring becomes riskier and more expensive

The best time to refactor is now, while the codebase is still comprehensible.

Next Steps

Review this document with the team
Prioritize refactoring items based on team capacity
Create tickets for Phase 1 tasks (Week 1)
Set up monitoring for success metrics
Schedule weekly check-ins to track progress

Document Version: 2.0 Last Updated: 2026-01-31 Next Review: After Phase 1 completion (2026-02-07)

FilesExpand file tree

refactor-v2.md

Latest commit

History

refactor-v2.md

File metadata and controls

Code Review & Refactoring Plan v2.0

Executive Summary

Key Achievements ✅

Critical Issues 🚨

Detailed Analysis

1. Architecture Assessment

Current State: Monolithic Engine

Recommended Architecture

2. Code Quality Issues

Issue #1: Repetitive Agent Wrapper Code (DRY Violation)

Issue #2: Observability Bleeding into Business Logic

Issue #3: Global State in Agent Implementations

Issue #4: Magic Parameter Names (Implicit Coupling)

Issue #5: Hardcoded Magic Numbers

3. Documentation Issues

Problem: High Documentation Churn

4. Testing Gaps

Current State

Test Debt Calculation

Priority Testing Needs

5. Dependency Audit

Current Dependencies (29 packages)

Issues Identified

Refactoring Roadmap

Phase 1: Critical Path (Week 1) - 5 days

Day 1-2: Extract Observability Middleware

Day 3: Split TutorEngine into Services

Day 4: Add Missing Critical Tests

Day 5: Dependency Cleanup & ADR Documentation

Phase 2: Quality Improvements (Week 2) - 3 days

Day 6: Remove Global State from Agents

Day 7: Improve Type Safety

Day 8: Performance Optimization

Phase 3: Production Hardening (Week 3) - 2 days

Day 9: Error Handling & Resilience

Day 10: Monitoring & Alerting Setup

Success Metrics (Overall)

Code Quality Metrics

Performance Metrics

Architecture Metrics

Risk Assessment

High Risk (Must Address Before Next Feature)

Medium Risk (Should Address This Sprint)

Low Risk (Can Defer)

Conclusion

What We Did Right

What Needs Improvement

Recommendation

Next Steps