569 changes: 569 additions & 0 deletions specs/SPEC-10 Unified Deployment Workflow and Event Tracking.md

Large diffs are not rendered by default.

79 changes: 10 additions & 69 deletions specs/SPEC-11 Basic Memory API Performance Optimization.md
@@ -31,8 +31,6 @@ HTTP requests to the API suffer from 350ms-2.6s latency overhead **before** any

This creates compounding effects with tenant auto-start delays and increases timeout risk in cloud deployments.

Github issue: https://github.com/basicmachines-co/basic-memory-cloud/issues/82

## What

This optimization affects the **core basic-memory repository** components:
@@ -170,76 +168,19 @@ Validation Checklist
- Documentation: Performance optimization documented in README
- Cloud Integration: basic-memory-cloud sees performance benefits

## Implementation Status ✅ COMPLETED

**Implementation Date**: 2025-09-26
**Branch**: `feature/spec-11-api-performance-optimization`
**Commit**: `771f60b`

### ✅ Phase 1: Database Connection Caching - IMPLEMENTED

**Files Modified:**
- `src/basic_memory/api/app.py` - Added database connection caching in app.state
- `src/basic_memory/deps.py` - Updated get_engine_factory() to use cached connections
- `src/basic_memory/config.py` - Added skip_initialization_sync configuration flag

**Implementation Details:**
1. **API Lifespan Caching**: Database engine and session_maker cached in app.state during startup
2. **Dependency Injection Optimization**: get_engine_factory() now returns cached connections instead of calling get_or_create_db()
3. **Project Reconciliation Removal**: Eliminated expensive reconcile_projects_with_config() from API startup
4. **CLI Fallback Preserved**: Non-API contexts continue to work with fallback database initialization

### ✅ Performance Validation - ACHIEVED

**Live Testing Results** (2025-09-26 14:03-14:09):

| Operation | Before | After | Improvement |
|-----------|--------|-------|-------------|
| `read_note` | 350ms-2.6s | **20ms** | **95-99% faster** |
| `edit_note` | 350ms-2.6s | **218ms** | **75-92% faster** |
| `search_notes` | 350ms-2.6s | **<500ms** | **Responsive** |
| `list_memory_projects` | N/A | **<100ms** | **Fast** |

**Key Achievements:**
- ✅ **95-99% improvement** in read operations (primary workflow)
- ✅ **75-92% improvement** in edit operations
- ✅ **Zero overhead** for project switching
- ✅ **Database connection overhead eliminated** (0ms vs 50-100ms)
- ✅ **Project reconciliation delays removed** from API requests
- ✅ **<500ms target achieved** for all operations except write (which includes file sync)

### ✅ Backwards Compatibility - MAINTAINED

- All existing functionality preserved
- CLI operations unaffected
- Fallback for non-API contexts maintained
- No breaking changes to existing APIs
- Optional configuration with safe defaults

### ✅ Testing Validation - PASSED

- Integration tests passing
- Type checking clear
- Linting checks passed
- Live testing with real MCP tools successful
- Multi-project workflows validated
- Rapid project switching validated

## Notes
Notes

Implementation Priority:
- Phase 1 COMPLETED: Database connection caching provides 95%+ performance gains
- Phase 2 NOT NEEDED: Project reconciliation removal achieved the goals
- Phase 3 INCLUDED: skip_initialization_sync flag added
- Phase 1 provides 80% of performance gains and should be implemented first
- Phase 2 provides remaining 20% and addresses edge cases
- Phase 3 is optional for maximum cloud optimization

Risk Mitigation:
- All implemented changes are backwards compatible
- Gradual implementation successful (Phase 1 → validation)
- Easy rollback via configuration flags available
- All changes backwards compatible
- Gradual rollout possible (Phase 1 → 2 → 3)
- Easy rollback via configuration flags

Cloud Integration:
- ✅ This optimization directly addresses basic-memory-cloud issue #82
- ✅ Changes in core basic-memory will benefit all cloud tenants
- ✅ No changes needed in basic-memory-cloud itself

**Result**: SPEC-11 performance optimizations successfully implemented and validated. The 95-99% improvement in MCP tool response times exceeds the original 50-80% target, providing exceptional performance gains for cloud deployments and local usage.
- This optimization directly addresses basic-memory-cloud issue #82
- Changes in core basic-memory will benefit all cloud tenants
- No changes needed in basic-memory-cloud itself
182 changes: 182 additions & 0 deletions specs/SPEC-12 OpenTelemetry Observability.md
@@ -0,0 +1,182 @@
# SPEC-12: OpenTelemetry Observability

## Why

We need comprehensive observability for basic-memory-cloud to:
- Track request flows across our multi-tenant architecture (MCP → Cloud → API services)
- Debug performance issues and errors in production
- Understand user behavior and system usage patterns
- Correlate issues to specific tenants for targeted debugging
- Monitor service health and latency across the distributed system

Currently, we only have basic logging without request correlation or distributed tracing capabilities.

## What

Implement OpenTelemetry instrumentation across all basic-memory-cloud services with:

### Core Requirements
1. **Distributed Tracing**: End-to-end request tracing from MCP gateway through to tenant API instances
2. **Tenant Correlation**: All traces tagged with tenant_id, user_id, and workos_user_id
3. **Service Identification**: Clear service naming and namespace separation
4. **Auto-instrumentation**: Automatic tracing for FastAPI, SQLAlchemy, HTTP clients
5. **Grafana Cloud Integration**: Direct OTLP export to Grafana Cloud Tempo

### Services to Instrument
- **MCP Gateway** (basic-memory-mcp): Entry point with JWT extraction
- **Cloud Service** (basic-memory-cloud): Provisioning and management operations
- **API Service** (basic-memory-api): Tenant-specific instances
- **Worker Processes** (ARQ workers): Background job processing

### Key Trace Attributes
- `tenant.id`: UUID from UserProfile.tenant_id
- `user.id`: WorkOS user identifier
- `user.email`: User email for debugging
- `service.name`: Specific service identifier
- `service.namespace`: Environment (development/production)
- `operation.type`: Business operation (provision/update/delete)
- `tenant.app_name`: Fly.io app name for tenant instances

## How

### Phase 1: Setup OpenTelemetry SDK
1. Add OpenTelemetry dependencies to each service's pyproject.toml:
```python
"opentelemetry-distro[otlp]>=1.29.0",
"opentelemetry-instrumentation-fastapi>=0.50b0",
"opentelemetry-instrumentation-httpx>=0.50b0",
"opentelemetry-instrumentation-sqlalchemy>=0.50b0",
"opentelemetry-instrumentation-logging>=0.50b0",
```

2. Create shared telemetry initialization module (`apps/shared/telemetry.py`)

3. Configure Grafana Cloud OTLP endpoint via environment variables:
```bash
OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway-prod-us-east-2.grafana.net/otlp
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Basic[token]
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
```

### Phase 2: Instrument MCP Gateway
1. Extract tenant context from AuthKit JWT in middleware
2. Create root span with tenant attributes
3. Propagate trace context to downstream services via headers

### Phase 3: Instrument Cloud Service
1. Continue trace from MCP gateway
2. Add operation-specific attributes (provisioning events)
3. Instrument ARQ worker jobs for async operations
4. Track Fly.io API calls and latency

### Phase 4: Instrument API Service
1. Extract tenant context from JWT
2. Add machine-specific metadata (instance ID, region)
3. Instrument database operations with SQLAlchemy
4. Track MCP protocol operations

### Phase 5: Configure and Deploy
1. Add OTLP configuration to `.env.example` and `.env.example.secrets`
2. Set Fly.io secrets for production deployment
3. Update Dockerfiles to use `opentelemetry-instrument` wrapper
4. Deploy to development environment first for testing
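
The Dockerfile change in step 3 boils down to prefixing the entrypoint with the wrapper, roughly as below; the module path and port are assumptions for illustration.

```dockerfile
# Sketch: run the service under the auto-instrumentation wrapper so
# FastAPI, SQLAlchemy, and httpx are traced without code changes.
# The module path "basic_memory.api.app:app" is an assumed example.
CMD ["opentelemetry-instrument", "uvicorn", "basic_memory.api.app:app", \
     "--host", "0.0.0.0", "--port", "8000"]
```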

## How to Evaluate

### Success Criteria
1. **End-to-end traces visible in Grafana Cloud** showing complete request flow
2. **Tenant filtering works** - Can filter traces by tenant_id to see all requests for a user
3. **Service maps accurate** - Grafana shows correct service dependencies
4. **Performance overhead < 5%** - Minimal latency impact from instrumentation
5. **Error correlation** - Can trace errors back to specific tenant and operation

### Testing Checklist
- [x] Single request creates connected trace across all services
- [x] Tenant attributes present on all spans
- [x] Background jobs (ARQ) appear in traces
- [x] Database queries show in trace timeline
- [x] HTTP calls to Fly.io API tracked
- [x] Traces exported successfully to Grafana Cloud
- [x] Can search traces by tenant_id in Grafana
- [x] Service dependency graph shows correct flow

### Monitoring Success
- All services reporting traces to Grafana Cloud
- No OTLP export errors in logs
- Trace sampling working correctly (if implemented)
- Resource usage acceptable (CPU/memory)

## Dependencies
- Grafana Cloud account with OTLP endpoint configured
- OpenTelemetry Python SDK v1.29.0+
- FastAPI instrumentation compatibility
- Network access from Fly.io to Grafana Cloud

## Implementation Assignment
**Recommended Agent**: python-developer
- Requires Python/FastAPI expertise
- Needs understanding of distributed systems
- Must implement middleware and context propagation
- Should understand OpenTelemetry SDK and instrumentation

## Follow-up Tasks

### Enhanced Log Correlation
While basic trace-to-log correlation works automatically via OpenTelemetry logging instrumentation, consider adding structured logging for improved log filtering:

1. **Structured Logging Context**: Add `logger.bind()` calls to inject tenant/user context directly into log records
2. **Custom Loguru Formatter**: Extract OpenTelemetry span attributes for better log readability
3. **Direct Log Filtering**: Enable searching logs directly by tenant_id, workflow_id without going through traces

This would complement the existing automatic trace correlation and provide better log search capabilities.

## Alternative Solution: Logfire

After implementing OpenTelemetry with Grafana Cloud, we discovered limitations in the observability experience:
- Traces work but lack useful context without correlated logs
- Setting up log correlation with Grafana is complex and requires additional infrastructure
- The developer experience for Python observability is suboptimal

### Logfire Evaluation

**Pydantic Logfire** offers a compelling alternative that addresses our specific requirements:

#### Core Requirements Match
- ✅ **User Activity Tracking**: Automatic request tracing with business context
- ✅ **Error Monitoring**: Built-in exception tracking with full context
- ✅ **Performance Metrics**: Automatic latency and performance monitoring
- ✅ **Request Tracing**: Native distributed tracing across services
- ✅ **Log Correlation**: Seamless trace-to-log correlation without setup

#### Key Advantages
1. **Python-First Design**: Built specifically for Python/FastAPI applications by the Pydantic team
2. **Simple Integration**: `pip install logfire` + `logfire.configure()` vs complex OTLP setup
3. **Automatic Correlation**: Logs automatically include trace context without manual configuration
4. **Real-time SQL Interface**: Query spans and logs using SQL with auto-completion
5. **Better Developer UX**: Purpose-built observability UI vs generic Grafana dashboards
6. **Loguru Integration**: `logger.configure(handlers=[logfire.loguru_handler()])` maintains existing logging

#### Pricing Assessment
- **Free Tier**: 10M spans/month (suitable for development and small production workloads)
- **Transparent Pricing**: $1 per million spans/metrics after free tier
- **No Hidden Costs**: No per-host fees, only usage-based metering
- **Production Ready**: Recently exited beta, enterprise features available

#### Migration Path
The existing OpenTelemetry instrumentation is compatible: Logfire uses OpenTelemetry under the hood, so the current spans and attributes would work unchanged.

### Recommendation

**Consider migrating to Logfire** for the following reasons:
1. It directly addresses the "next to useless" traces problem by providing integrated logs
2. Dramatically simpler setup and maintenance compared to Grafana Cloud + custom log correlation
3. Better ROI on observability investment with purpose-built Python tooling
4. Free tier sufficient for current development needs with clear scaling path

The current Grafana Cloud implementation provides a solid foundation and could remain as a backup/export target, while Logfire becomes the primary observability platform.

## Status
**Created**: 2024-01-28
**Status**: Completed (OpenTelemetry + Grafana Cloud)
**Next Phase**: Evaluate Logfire migration
**Priority**: High - Critical for production observability