Review Date: 2026-01-31 Review Scope: Last 3-4 days of development (37 commits, ~8000 LOC changed) Reviewer Perspective: Staff Engineer Assessment Overall Grade: 6.5/10 - Functional but Needs Refactoring
The codebase demonstrates impressive velocity and strong technical fundamentals but has accumulated significant technical debt that will impact maintainability. The work shows senior-level execution speed with mid-level architectural discipline.
- Shipped comprehensive observability (Langfuse integration)
- Implemented streaming responses with SSE
- Built React frontend with TypeScript
- Added session resumption and token rate limiting
- Created multi-agent orchestration with fallback chains
- Integrated A/B testing and evaluation infrastructure
- 995-line
TutorEngineviolating Single Responsibility Principle - Observability code bleeding into business logic (150+ lines of repetitive Langfuse calls)
- Documentation churn (911 lines deleted as "AI slop")
- Test coverage not keeping pace with features
- Global state in agent implementations
File: app/tutor/engine.py (995 lines)
Responsibilities (Too Many):
- Orchestration of judge/tutor/feedback agents
- Observability and tracing
- Token usage tracking and reporting
- Rate limiting enforcement
- State management
- Streaming logic
- Response formatting
Problem: Single class doing 7+ distinct jobs violates SRP.
Current (Monolithic):
┌─────────────────────────────────────────────┐
│ TutorEngine (995 lines) │
│ ┌─────────────────────────────────────┐ │
│ │ Orchestration │ │
│ │ + Observability │ │
│ │ + Token Tracking │ │
│ │ + Rate Limiting │ │
│ │ + State Management │ │
│ │ + Streaming │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
Proposed (Modular):
┌──────────────────────────┐
│ TutorOrchestrator │ ← Pure coordination (200 lines)
│ (coordination only) │
└────────┬─────────────────┘
│
┌────┴────────────────────────────────┐
│ │
┌───▼────────────┐ ┌──────────────┐ ┌─▼──────────────┐
│ AgentService │ │ Metrics │ │ RateLimiter │
│ (observability)│ │ Collector │ │ (policy) │
└────┬───────────┘ └──────────────┘ └────────────────┘
│
┌────┴─────────────────────────────┐
│ JudgeService │
│ TutorService │
│ FeedbackService │
└──────────────────────────────────┘
Location: app/tutor/engine.py:249-421
Problem: Three nearly identical methods with 20+ lines of Langfuse reporting each.
Current Code Pattern:
@observe(as_type="generation", name="judge_assessment")
async def _run_judge_agent(self, deps: JudgeDependencies, session_id: str) -> AgentResult[BloomAssessment]:
result = await run_agent_with_fallback(
agent=judge_agent,
prompt="Evaluate this response.",
deps=deps,
output_type=BloomAssessment,
model_config=self.llm_config.judge_model,
agent_name="judge",
session_id=session_id
)
link_judge_prompt_to_generation()
# 20+ lines of token reporting (repeated 3x)
if result.usage.total_tokens > 0:
try:
langfuse_client = get_client()
if langfuse_client:
model_name = result.model_used.split(":")[-1] if result.model_used else None
update_params = {
"usage_details": {
"input": result.usage.input_tokens,
"output": result.usage.output_tokens,
"total": result.usage.total_tokens
}
}
if model_name:
update_params["model"] = model_name
langfuse_client.update_current_generation(**update_params)
logger.debug(f"[judge] Reported {result.usage.total_tokens} tokens...")
except Exception as e:
logger.warning(f"[judge] Failed to report usage to Langfuse: {e}")
return resultRecommended Solution:
# Create a generic wrapper
class ObservableAgentService:
def __init__(self, langfuse_client: Langfuse):
self.client = langfuse_client
@observe_with_tokens
async def run_agent(
self,
agent_name: str,
agent: Agent,
deps: Any,
output_type: Type[T],
model_config: ModelConfig,
session_id: str
) -> AgentResult[T]:
result = await run_agent_with_fallback(
agent=agent,
prompt=f"{agent_name} execution",
deps=deps,
output_type=output_type,
model_config=model_config,
agent_name=agent_name,
session_id=session_id
)
return result
# Then use it
judge_result = await self.agent_service.run_agent(
agent_name="judge",
agent=judge_agent,
deps=deps,
output_type=BloomAssessment,
model_config=self.llm_config.judge_model,
session_id=session_id
)Impact:
- Remove ~150 lines of repetitive code
- Centralize observability logic
- Easier to test and maintain
Problem: Langfuse code appears in 5+ locations throughout engine.py, mixed with business logic.
Examples:
- Lines 462-495: Trace update in
handle_turn() - Lines 278-303: Token reporting in
_run_judge_agent() - Lines 336-361: Token reporting in
_run_tutor_agent() - Lines 394-419: Token reporting in
_run_feedback_agent() - Lines 735-772: Trace updates in
handle_turn_stream()
Why This Is Bad:
- Violates separation of concerns
- Makes business logic harder to read
- Observability failures can break business logic
- Testing requires mocking Langfuse everywhere
Recommended Pattern - Decorator Approach:
# Create reusable decorators
def trace_generation(agent_name: str, report_tokens: bool = True):
"""Decorator to add Langfuse tracing to agent methods."""
def decorator(func):
@wraps(func)
async def wrapper(*args, **kwargs):
with observe(as_type="generation", name=agent_name):
result = await func(*args, **kwargs)
if report_tokens and result.usage.total_tokens > 0:
_report_usage_to_langfuse(result, agent_name)
return result
return wrapper
return decorator
# Then use it
@trace_generation(agent_name="judge", report_tokens=True)
async def _run_judge_agent(self, deps: JudgeDependencies, session_id: str):
return await run_agent_with_fallback(...)
# That's it. No Langfuse code here.Impact:
- Business logic becomes pure and readable
- Observability is opt-in via decorators
- Easy to disable for testing
- Centralized error handling for observability
Location: app/llm/judge_agent.py:16-17
Problem:
_judge_prompt_service: Optional[PromptService] = None
_judge_prompt_metadata: Optional[Dict[str, Any]] = NoneWhy This Is Bad:
- Testing nightmare - Must reset globals between tests
- Thread safety - Potential race conditions in async context
- Hidden dependencies - Function behavior depends on external state
- Difficult to reason about - Who sets these? When? In what order?
Recommended Solution - Dependency Injection:
# Instead of globals
class JudgeAgentService:
def __init__(
self,
prompt_service: Optional[PromptService] = None,
llm_config: LLMConfig = None
):
self.prompt_service = prompt_service
self.llm_config = llm_config or get_llm_config()
self.prompt_metadata: Dict[str, Any] = {}
# Initialize agent
self.agent = Agent(
model=create_model_with_fallback(...),
deps_type=JudgeDependencies,
retries=self.llm_config.judge_model.max_retries,
output_type=BloomAssessment
)
async def evaluate(self, deps: JudgeDependencies) -> AgentResult[BloomAssessment]:
"""Execute judge agent with proper dependency management."""
...Impact:
- Testable with dependency injection
- Thread-safe by design
- Clear ownership of state
- Easier to mock in tests
Location: app/tutor/engine.py:430
Problem:
async def handle_turn(
self,
session_id: str,
user_message: str,
user_id: str = "default",
topic: str = "General",
langfuse_trace_id: Optional[str] = None # Special parameter - @observe decorator uses this!
):Issue: Function signature depends on decorator implementation details. Renaming this parameter breaks observability silently.
Recommended Solution:
# Option 1: Explicit context object
@dataclass
class TurnContext:
session_id: str
user_message: str
user_id: str = "default"
topic: str = "General"
trace_id: Optional[str] = None # Explicit, not magic
async def handle_turn(self, context: TurnContext) -> TutorResponse:
"""Process a single turn with explicit context."""
...
# Option 2: Use context manager
async def handle_turn(self, session_id: str, user_message: str, ...):
trace_id = session_id.replace("-", "")
with langfuse.trace(trace_id=trace_id):
# Observability is explicit, not parameter-based
...Location: app/tutor/engine.py:502
Problem:
await self.rate_limiter.check_budget(user_id=user_id, estimated_tokens=2000)Issues:
- Token estimate is hardcoded
- No visibility into why 2000 was chosen
- Will become inaccurate as models/prompts change
Recommended Solution:
# Add to ModelConfig
@dataclass
class ModelConfig:
model_name: str
fallback_models: list[str]
max_retries: int
timeout_seconds: int
estimated_tokens_per_request: int = 2000 # Make it configurable
# Then use it
estimated = (
self.llm_config.judge_model.estimated_tokens_per_request +
self.llm_config.tutor_model.estimated_tokens_per_request +
self.llm_config.feedback_model.estimated_tokens_per_request
)
await self.rate_limiter.check_budget(user_id=user_id, estimated_tokens=estimated)Evidence from Git History:
7313549 Remove multiple documentation files. (911 lines deleted)
27e2842 Remove the files with AI slop and over written verbose markdown info.
46911d8 docs: remove redundant scoring documentation files
What This Signals:
- Unclear requirements leading to exploratory documentation
- Possible over-reliance on AI-generated content
- Documentation not aligned with actual implementation
- Lack of lightweight Architecture Decision Records (ADRs)
Recommended Approach:
- Replace verbose docs with concise ADRs:
# ADR-001: Why Langfuse for Observability
**Status:** Accepted
**Date:** 2026-01-30
**Deciders:** [Team]
## Context
Need observability for multi-agent LLM system to track costs, performance, and quality.
## Decision
Use Langfuse for tracing, prompt management, and evaluation.
## Alternatives Considered
- LangSmith: Better integration with LangChain, but we use PydanticAI
- Custom logging: Full control but high maintenance burden
- Weights & Biases: Strong ML focus but overkill for our use case
## Consequences
**Positive:**
- Built-in prompt versioning and A/B testing
- Cost tracking across providers
- Production-ready UI for analytics
**Negative:**
- Tight coupling to Langfuse APIs
- Additional infrastructure dependency
- Learning curve for team- Keep implementation docs focused and current:
- One doc per major component (already done:
docs/observability.md) - Code examples over prose
- Clear "last updated" dates
- One doc per major component (already done:
- ✅ 19 test files exist in
tests/ - ❌ No test updates in last 10 commits despite adding:
app/evaluation/scorer.py(356 lines) - ZERO testsapp/prompts/service.py(313 lines) - Tests added but need more coverageapp/prompts/ab_testing.py(126 lines) - ZERO testsapp/api/feedback_router.py(172 lines) - ZERO tests
New production code: 1,167 lines
New test code: ~271 lines (prompt_service only)
Test coverage ratio: 23% (target should be 70-80%)
P0 (Critical - No Production Deployment Without These):
# tests/test_evaluation_scorer.py
async def test_scorer_reports_quality_metrics():
"""Verify scorer creates Langfuse scores correctly."""
async def test_scorer_handles_langfuse_failures_gracefully():
"""Scoring failures shouldn't break learning sessions."""
# tests/test_feedback_router.py
async def test_user_feedback_endpoint():
"""Test POST /api/v1/feedback with satisfaction scores."""
async def test_feedback_links_to_trace():
"""Verify feedback is linked to correct Langfuse trace."""P1 (High - Should Be Added This Week):
# tests/test_ab_testing.py
def test_cohort_assignment_is_consistent():
"""Same user_id always gets same variant."""
def test_cohort_distribution_is_balanced():
"""50/50 split across large user sample."""# Web Framework
fastapi>=0.128.0
uvicorn[standard]>=0.40.0
# LLM Orchestration
pydantic-ai>=1.48.0
pydantic>=2.12.5
# LLM Providers (4 providers)
google-genai>=1.60.0
anthropic>=0.76.0
openai>=2.16.0
deepseek>=1.0.0
# Observability
langfuse>=3.12.1
# Database (UNUSED?)
sqlalchemy>=2.0.46
aiosqlite>=0.22.1
alembic>=1.18.1
# Utilities
python-dotenv>=1.0.1
httpx>=0.28.1
tenacity>=9.1.2
# CLI
rich>=14.3.1
typer>=0.21.1
prompt-toolkit>=3.0.0
redis>=5.0.11. SQLAlchemy + Alembic appear unused:
# Search results
grep -r "sqlalchemy" app/ # No imports found
grep -r "alembic" app/ # No imports foundRecommendation: Remove unless planned for future use. Redis is handling state storage.
2. Four LLM providers:
- OpenAI (primary)
- Google/Gemini (fallback)
- Anthropic (not actively used?)
- DeepSeek (not actively used?)
Recommendation: Keep OpenAI + Gemini. Remove unused providers or document why they're needed.
Goal: Remove 150+ lines of repetitive Langfuse code.
Tasks:
-
Create
app/observability/middleware.py:class ObservabilityMiddleware: async def wrap_agent_call(self, agent_name: str, fn: Callable, *args, **kwargs)
-
Create decorators:
@trace_generation(agent_name="judge") async def _run_judge_agent(...)
-
Refactor
engine.pyto use middleware -
Add tests for middleware in isolation
Success Metrics:
engine.pyreduced by 150+ lines- All existing tests pass
- No Langfuse code in
handle_turn()business logic
Goal: Break 995-line file into focused modules.
New Structure:
app/tutor/
├── engine.py (200 lines - orchestration only)
├── services/
│ ├── __init__.py
│ ├── judge_service.py (100 lines)
│ ├── tutor_service.py (100 lines)
│ ├── feedback_service.py (100 lines)
│ └── metrics_collector.py (80 lines)
Migration Steps:
- Create
AgentServiceBaseclass with common logic - Extract judge/tutor/feedback services
- Extract
MetricsCollectorfor token tracking - Update
TutorEngine.__init__()to use services - Run full test suite after each extraction
Success Metrics:
- Each file < 250 lines
- All services independently testable
- Test suite passes
Goal: Get scorer and feedback router to 80% coverage.
Tests to Write:
# tests/test_evaluation_scorer.py (4 tests)
- test_score_turn_complete_success()
- test_score_turn_complete_with_langfuse_failure()
- test_score_first_turn_handling()
- test_cost_tracking_accuracy()
# tests/test_feedback_router.py (3 tests)
- test_user_feedback_endpoint_success()
- test_feedback_validates_score_range()
- test_feedback_links_to_correct_trace()
# tests/test_ab_testing.py (3 tests)
- test_cohort_assignment_consistency()
- test_cohort_distribution()
- test_variant_selection()Success Metrics:
- All new production code has tests
- Coverage report shows 70%+ on new modules
- CI pipeline includes coverage checks
Goal: Remove unused dependencies, document key decisions.
Tasks:
-
Dependency Audit:
# Remove unused dependencies uv remove sqlalchemy aiosqlite alembic # Audit provider usage grep -r "anthropic\|deepseek" app/ # If unused, remove them
-
Write ADRs:
- ADR-001: Langfuse for observability
- ADR-002: Redis over in-memory state
- ADR-003: Multi-provider fallback strategy
- ADR-004: PydanticAI over LangChain
-
Update docs:
- Consolidate implementation docs
- Add "Architecture" section to README
- Remove outdated/verbose docs
Success Metrics:
pyproject.tomlonly lists used dependencies- 4 ADRs written and reviewed
- Documentation reduced by 30% but clarity improved
Goal: Convert agents to use dependency injection.
Tasks:
- Create
JudgeAgentServiceclass - Create
TutorAgentServiceclass - Create
FeedbackAgentServiceclass - Update
TutorOrchestratorto inject services - Remove global variables from
app/llm/*_agent.py
Success Metrics:
- No global state in agent files
- Services are thread-safe
- Integration tests pass
Goal: Add stricter type checking and reduce Any usage.
Tasks:
- Run
mypy --strict app/ - Fix type errors in order:
app/tutor/engine.pyapp/llm/*.pyapp/evaluation/*.py
- Remove
Anytypes where possible - Add
py.typedmarker for library consumers
Success Metrics:
mypy --strictpasses with 0 errors- No
type: ignorecomments added - Type coverage > 95%
Goal: Identify and fix performance bottlenecks.
Tasks:
- Add profiling to key endpoints
- Identify slow queries/operations
- Add caching where appropriate:
# Cache prompt fetches (they don't change often) @lru_cache(maxsize=100) async def get_prompt(self, name: str, label: str = "production"): ...
- Parallelize independent operations
- Add performance tests
Success Metrics:
- P95 latency < 2 seconds for
/chatendpoint - Token rate limiter overhead < 50ms
- Langfuse reporting happens async (doesn't block response)
Goal: Comprehensive error handling and graceful degradation.
Tasks:
- Audit all error handlers in
app/api/ - Add circuit breaker for Langfuse calls
- Implement retry budgets for rate limiter
- Add structured error responses
- Test failure scenarios:
- Redis down
- Langfuse unavailable
- All LLM providers failing
- Rate limit exceeded
Success Metrics:
- System degrades gracefully (logs errors but serves users)
- No 500 errors for external service failures
- Circuit breakers prevent cascade failures
Goal: Production-ready observability beyond Langfuse.
Tasks:
-
Add Prometheus metrics:
from prometheus_client import Counter, Histogram request_latency = Histogram('tutor_request_duration_seconds', 'Request latency') llm_calls = Counter('llm_calls_total', 'Total LLM calls', ['provider', 'status'])
-
Add health check endpoint:
@router.get("/health") async def health_check(): return { "redis": await check_redis(), "langfuse": await check_langfuse(), "llm_providers": await check_llm_providers() }
-
Document alerting rules:
- Alert if error rate > 5%
- Alert if P95 latency > 5s
- Alert if Redis is down
Success Metrics:
/healthendpoint returns detailed status- Prometheus metrics exported at
/metrics - Runbook created for on-call engineers
| Metric | Current | Target | Priority |
|---|---|---|---|
| Largest file size | 995 lines | < 300 lines | P0 |
| Test coverage | ~60% | > 80% | P0 |
| Repetitive code blocks | 5+ | 0 | P1 |
| Global state usage | 6 instances | 0 | P1 |
| Type coverage | ~85% | > 95% | P2 |
| Dependency count | 29 | < 20 | P2 |
| Metric | Current | Target | Priority |
|---|---|---|---|
| P95 latency (/chat) | ~3-4s | < 2s | P1 |
| Test suite runtime | Unknown | < 30s | P2 |
| Rate limiter overhead | ~100ms | < 50ms | P2 |
| Metric | Current | Target | Priority |
|---|---|---|---|
| Max class responsibilities | 7 | 1-2 | P0 |
| Service coupling | High | Loose | P1 |
| Observability separation | Mixed | Clean | P0 |
- 995-line TutorEngine - Will become unmaintainable
- Test coverage gaps - Scorer and feedback router have zero tests
- Observability coupling - Changes to Langfuse break business logic
- Global state in agents - Potential thread safety issues
- Documentation debt - Team knowledge is in code, not docs
- Unused dependencies - Bloating deployment size
- Type coverage - Already decent, not blocking
- Performance optimization - Current performance is acceptable
- Additional LLM providers - Not needed yet
- Fast execution - Shipped impressive scope in 4 days
- Error handling - Fallback chains are production-grade
- Type safety - Good use of Pydantic and type hints
- Observability - Comprehensive (though needs cleanup)
- Architecture - Extract concerns into focused modules
- Testing - Keep pace with feature development
- Documentation - Use lightweight ADRs, not verbose guides
- Dependencies - Audit and remove unused packages
Pause feature development for 1-2 weeks to address technical debt. The current architecture is unsustainable for the next 3-6 months of planned features.
If we continue at current pace without refactoring, we'll hit a complexity wall where:
- New features take 2-3x longer to implement
- Bug rate increases exponentially
- New team members need weeks to onboard
- Refactoring becomes riskier and more expensive
The best time to refactor is now, while the codebase is still comprehensible.
- Review this document with the team
- Prioritize refactoring items based on team capacity
- Create tickets for Phase 1 tasks (Week 1)
- Set up monitoring for success metrics
- Schedule weekly check-ins to track progress
Document Version: 2.0 Last Updated: 2026-01-31 Next Review: After Phase 1 completion (2026-02-07)