569 changes: 569 additions & 0 deletions specs/SPEC-10 Unified Deployment Workflow and Event Tracking.md

Large diffs are not rendered by default.

79 changes: 10 additions & 69 deletions specs/SPEC-11 Basic Memory API Performance Optimization.md
@@ -31,8 +31,6 @@ HTTP requests to the API suffer from 350ms-2.6s latency overhead **before** any

This creates compounding effects with tenant auto-start delays and increases timeout risk in cloud deployments.

Github issue: https://github.com/basicmachines-co/basic-memory-cloud/issues/82

## What

This optimization affects the **core basic-memory repository** components:
@@ -170,76 +168,19 @@ Validation Checklist
- Documentation: Performance optimization documented in README
- Cloud Integration: basic-memory-cloud sees performance benefits

## Implementation Status ✅ COMPLETED

**Implementation Date**: 2025-09-26
**Branch**: `feature/spec-11-api-performance-optimization`
**Commit**: `771f60b`

### ✅ Phase 1: Database Connection Caching - IMPLEMENTED

**Files Modified:**
- `src/basic_memory/api/app.py` - Added database connection caching in app.state
- `src/basic_memory/deps.py` - Updated get_engine_factory() to use cached connections
- `src/basic_memory/config.py` - Added skip_initialization_sync configuration flag

**Implementation Details:**
1. **API Lifespan Caching**: Database engine and session_maker cached in app.state during startup
2. **Dependency Injection Optimization**: get_engine_factory() now returns cached connections instead of calling get_or_create_db()
3. **Project Reconciliation Removal**: Eliminated expensive reconcile_projects_with_config() from API startup
4. **CLI Fallback Preserved**: Non-API contexts continue to work with fallback database initialization

### ✅ Performance Validation - ACHIEVED

**Live Testing Results** (2025-09-26 14:03-14:09):

| Operation | Before | After | Improvement |
|-----------|--------|-------|-------------|
| `read_note` | 350ms-2.6s | **20ms** | **95-99% faster** |
| `edit_note` | 350ms-2.6s | **218ms** | **75-92% faster** |
| `search_notes` | 350ms-2.6s | **<500ms** | **Responsive** |
| `list_memory_projects` | N/A | **<100ms** | **Fast** |

**Key Achievements:**
- ✅ **95-99% improvement** in read operations (primary workflow)
- ✅ **75-92% improvement** in edit operations
- ✅ **Zero overhead** for project switching
- ✅ **Database connection overhead eliminated** (0ms vs 50-100ms)
- ✅ **Project reconciliation delays removed** from API requests
- ✅ **<500ms target achieved** for all operations except write (which includes file sync)

### ✅ Backwards Compatibility - MAINTAINED

- All existing functionality preserved
- CLI operations unaffected
- Fallback for non-API contexts maintained
- No breaking changes to existing APIs
- Optional configuration with safe defaults

### ✅ Testing Validation - PASSED

- Integration tests passing
- Type checking clear
- Linting checks passed
- Live testing with real MCP tools successful
- Multi-project workflows validated
- Rapid project switching validated

## Notes
Notes

Implementation Priority:
- Phase 1 COMPLETED: Database connection caching provides 95%+ performance gains
- Phase 2 NOT NEEDED: Project reconciliation removal achieved the goals
- Phase 3 INCLUDED: skip_initialization_sync flag added
- Phase 1 provides 80% of performance gains and should be implemented first
- Phase 2 provides remaining 20% and addresses edge cases
- Phase 3 is optional for maximum cloud optimization

Risk Mitigation:
- All implemented changes are backwards compatible
- Gradual implementation successful (Phase 1 → validation)
- Easy rollback via configuration flags available
- All changes backwards compatible
- Gradual rollout possible (Phase 1 → 2 → 3)
- Easy rollback via configuration flags

Cloud Integration:
- ✅ This optimization directly addresses basic-memory-cloud issue #82
- ✅ Changes in core basic-memory will benefit all cloud tenants
- ✅ No changes needed in basic-memory-cloud itself

**Result**: SPEC-11 performance optimizations successfully implemented and validated. The 95-99% improvement in MCP tool response times exceeds the original 50-80% target, providing exceptional performance gains for cloud deployments and local usage.
- This optimization directly addresses basic-memory-cloud issue #82
- Changes in core basic-memory will benefit all cloud tenants
- No changes needed in basic-memory-cloud itself
182 changes: 182 additions & 0 deletions specs/SPEC-12 OpenTelemetry Observability.md
@@ -0,0 +1,182 @@
# SPEC-12: OpenTelemetry Observability

## Why

We need comprehensive observability for basic-memory-cloud to:
- Track request flows across our multi-tenant architecture (MCP → Cloud → API services)
- Debug performance issues and errors in production
- Understand user behavior and system usage patterns
- Correlate issues to specific tenants for targeted debugging
- Monitor service health and latency across the distributed system

Currently, we only have basic logging without request correlation or distributed tracing capabilities.

## What

Implement OpenTelemetry instrumentation across all basic-memory-cloud services with:

### Core Requirements
1. **Distributed Tracing**: End-to-end request tracing from MCP gateway through to tenant API instances
2. **Tenant Correlation**: All traces tagged with tenant_id, user_id, and workos_user_id
3. **Service Identification**: Clear service naming and namespace separation
4. **Auto-instrumentation**: Automatic tracing for FastAPI, SQLAlchemy, HTTP clients
5. **Grafana Cloud Integration**: Direct OTLP export to Grafana Cloud Tempo

### Services to Instrument
- **MCP Gateway** (basic-memory-mcp): Entry point with JWT extraction
- **Cloud Service** (basic-memory-cloud): Provisioning and management operations
- **API Service** (basic-memory-api): Tenant-specific instances
- **Worker Processes** (ARQ workers): Background job processing

### Key Trace Attributes
- `tenant.id`: UUID from UserProfile.tenant_id
- `user.id`: WorkOS user identifier
- `user.email`: User email for debugging
- `service.name`: Specific service identifier
- `service.namespace`: Environment (development/production)
- `operation.type`: Business operation (provision/update/delete)
- `tenant.app_name`: Fly.io app name for tenant instances

## How

### Phase 1: Setup OpenTelemetry SDK
1. Add OpenTelemetry dependencies to each service's pyproject.toml:
```python
"opentelemetry-distro[otlp]>=1.29.0",
"opentelemetry-instrumentation-fastapi>=0.50b0",
"opentelemetry-instrumentation-httpx>=0.50b0",
"opentelemetry-instrumentation-sqlalchemy>=0.50b0",
"opentelemetry-instrumentation-logging>=0.50b0",
```

2. Create shared telemetry initialization module (`apps/shared/telemetry.py`)

3. Configure Grafana Cloud OTLP endpoint via environment variables:
```bash
OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway-prod-us-east-2.grafana.net/otlp
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Basic[token]
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
```

### Phase 2: Instrument MCP Gateway
1. Extract tenant context from AuthKit JWT in middleware
2. Create root span with tenant attributes
3. Propagate trace context to downstream services via headers

### Phase 3: Instrument Cloud Service
1. Continue trace from MCP gateway
2. Add operation-specific attributes (provisioning events)
3. Instrument ARQ worker jobs for async operations
4. Track Fly.io API calls and latency

### Phase 4: Instrument API Service
1. Extract tenant context from JWT
2. Add machine-specific metadata (instance ID, region)
3. Instrument database operations with SQLAlchemy
4. Track MCP protocol operations

### Phase 5: Configure and Deploy
1. Add OTLP configuration to `.env.example` and `.env.example.secrets`
2. Set Fly.io secrets for production deployment
3. Update Dockerfiles to use `opentelemetry-instrument` wrapper
4. Deploy to development environment first for testing
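
The Dockerfile change in step 3 boils down to prefixing the entrypoint with the wrapper, roughly as below; the module path and port are assumptions for illustration.

```dockerfile
# Sketch: run the service under the auto-instrumentation wrapper so
# FastAPI, SQLAlchemy, and httpx are traced without code changes.
# The module path "basic_memory.api.app:app" is an assumed example.
CMD ["opentelemetry-instrument", "uvicorn", "basic_memory.api.app:app", \
     "--host", "0.0.0.0", "--port", "8000"]
```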

## How to Evaluate

### Success Criteria
1. **End-to-end traces visible in Grafana Cloud** showing complete request flow
2. **Tenant filtering works** - Can filter traces by tenant_id to see all requests for a user
3. **Service maps accurate** - Grafana shows correct service dependencies
4. **Performance overhead < 5%** - Minimal latency impact from instrumentation
5. **Error correlation** - Can trace errors back to specific tenant and operation

### Testing Checklist
- [x] Single request creates connected trace across all services
- [x] Tenant attributes present on all spans
- [x] Background jobs (ARQ) appear in traces
- [x] Database queries show in trace timeline
- [x] HTTP calls to Fly.io API tracked
- [x] Traces exported successfully to Grafana Cloud
- [x] Can search traces by tenant_id in Grafana
- [x] Service dependency graph shows correct flow

### Monitoring Success
- All services reporting traces to Grafana Cloud
- No OTLP export errors in logs
- Trace sampling working correctly (if implemented)
- Resource usage acceptable (CPU/memory)

## Dependencies
- Grafana Cloud account with OTLP endpoint configured
- OpenTelemetry Python SDK v1.29.0+
- FastAPI instrumentation compatibility
- Network access from Fly.io to Grafana Cloud

## Implementation Assignment
**Recommended Agent**: python-developer
- Requires Python/FastAPI expertise
- Needs understanding of distributed systems
- Must implement middleware and context propagation
- Should understand OpenTelemetry SDK and instrumentation

## Follow-up Tasks

### Enhanced Log Correlation
While basic trace-to-log correlation works automatically via OpenTelemetry logging instrumentation, consider adding structured logging for improved log filtering:

1. **Structured Logging Context**: Add `logger.bind()` calls to inject tenant/user context directly into log records
2. **Custom Loguru Formatter**: Extract OpenTelemetry span attributes for better log readability
3. **Direct Log Filtering**: Enable searching logs directly by tenant_id, workflow_id without going through traces

This would complement the existing automatic trace correlation and provide better log search capabilities.

## Alternative Solution: Logfire

After implementing OpenTelemetry with Grafana Cloud, we discovered limitations in the observability experience:
- Traces work but lack useful context without correlated logs
- Setting up log correlation with Grafana is complex and requires additional infrastructure
- The developer experience for Python observability is suboptimal

### Logfire Evaluation

**Pydantic Logfire** offers a compelling alternative that addresses our specific requirements:

#### Core Requirements Match
- ✅ **User Activity Tracking**: Automatic request tracing with business context
- ✅ **Error Monitoring**: Built-in exception tracking with full context
- ✅ **Performance Metrics**: Automatic latency and performance monitoring
- ✅ **Request Tracing**: Native distributed tracing across services
- ✅ **Log Correlation**: Seamless trace-to-log correlation without setup

#### Key Advantages
1. **Python-First Design**: Built specifically for Python/FastAPI applications by the Pydantic team
2. **Simple Integration**: `pip install logfire` + `logfire.configure()` vs complex OTLP setup
3. **Automatic Correlation**: Logs automatically include trace context without manual configuration
4. **Real-time SQL Interface**: Query spans and logs using SQL with auto-completion
5. **Better Developer UX**: Purpose-built observability UI vs generic Grafana dashboards
6. **Loguru Integration**: `logger.configure(handlers=[logfire.loguru_handler()])` maintains existing logging

#### Pricing Assessment
- **Free Tier**: 10M spans/month (suitable for development and small production workloads)
- **Transparent Pricing**: $1 per million spans/metrics after free tier
- **No Hidden Costs**: No per-host fees, only usage-based metering
- **Production Ready**: Recently exited beta, enterprise features available

#### Migration Path
The existing OpenTelemetry instrumentation is compatible: Logfire uses OpenTelemetry under the hood, so the current spans and attributes would work unchanged.

### Recommendation

**Consider migrating to Logfire** for the following reasons:
1. It directly addresses the "next to useless" traces problem by providing integrated logs
2. Dramatically simpler setup and maintenance compared to Grafana Cloud + custom log correlation
3. Better ROI on observability investment with purpose-built Python tooling
4. Free tier sufficient for current development needs with clear scaling path

The current Grafana Cloud implementation provides a solid foundation and could remain as a backup/export target, while Logfire becomes the primary observability platform.

## Status
**Created**: 2024-01-28
**Status**: Completed (OpenTelemetry + Grafana Cloud)
**Next Phase**: Evaluate Logfire migration
**Priority**: High - Critical for production observability