
Commit 256c851

phernandezclaude committed
Make project creation endpoint idempotent
When a user tries to create a project that already exists (e.g., using the --create-project CLI flag multiple times), return 200 OK with the existing project info instead of 400 Bad Request. This improves the user experience by making the endpoint idempotent, which is particularly useful when uploading files with the --create-project flag.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
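The behavior the commit message describes can be sketched framework-free. This is an illustration only, not the actual endpoint code: the `create_project` function and dict-backed store are hypothetical, and the 201 status on first creation is an assumption.

```python
# Sketch of the idempotent create-project behavior described above.
# Names (store layout, create_project) are illustrative, not the real code.

def create_project(store: dict, name: str, path: str) -> tuple[int, dict]:
    """Return (status_code, project). Creating a project that already
    exists returns 200 with the existing record instead of 400."""
    if name in store:
        return 200, store[name]      # idempotent: echo the existing project
    project = {"name": name, "path": path}
    store[name] = project
    return 201, project              # first creation
```

Calling it twice with the same name yields 201 then 200, so a client passing `--create-project` repeatedly can retry safely.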
1 parent 994c8b8 commit 256c851

12 files changed

Lines changed: 1988 additions & 842 deletions

specs/SPEC-10 Unified Deployment Workflow and Event Tracking.md

Lines changed: 569 additions & 0 deletions
Large diffs are not rendered by default.

specs/SPEC-11 Basic Memory API Performance Optimization.md

Lines changed: 10 additions & 69 deletions
@@ -31,8 +31,6 @@ HTTP requests to the API suffer from 350ms-2.6s latency overhead **before** any
 
 This creates compounding effects with tenant auto-start delays and increases timeout risk in cloud deployments.
 
-Github issue: https://github.com/basicmachines-co/basic-memory-cloud/issues/82
-
 ## What
 
 This optimization affects the **core basic-memory repository** components:
@@ -170,76 +168,19 @@ Validation Checklist
 - Documentation: Performance optimization documented in README
 - Cloud Integration: basic-memory-cloud sees performance benefits
 
-## Implementation Status ✅ COMPLETED
-
-**Implementation Date**: 2025-09-26
-**Branch**: `feature/spec-11-api-performance-optimization`
-**Commit**: `771f60b`
-
-### ✅ Phase 1: Database Connection Caching - IMPLEMENTED
-
-**Files Modified:**
-- `src/basic_memory/api/app.py` - Added database connection caching in app.state
-- `src/basic_memory/deps.py` - Updated get_engine_factory() to use cached connections
-- `src/basic_memory/config.py` - Added skip_initialization_sync configuration flag
-
-**Implementation Details:**
-1. **API Lifespan Caching**: Database engine and session_maker cached in app.state during startup
-2. **Dependency Injection Optimization**: get_engine_factory() now returns cached connections instead of calling get_or_create_db()
-3. **Project Reconciliation Removal**: Eliminated expensive reconcile_projects_with_config() from API startup
-4. **CLI Fallback Preserved**: Non-API contexts continue to work with fallback database initialization
-
-### ✅ Performance Validation - ACHIEVED
-
-**Live Testing Results** (2025-09-26 14:03-14:09):
-
-| Operation | Before | After | Improvement |
-|-----------|--------|-------|-------------|
-| `read_note` | 350ms-2.6s | **20ms** | **95-99% faster** |
-| `edit_note` | 350ms-2.6s | **218ms** | **75-92% faster** |
-| `search_notes` | 350ms-2.6s | **<500ms** | **Responsive** |
-| `list_memory_projects` | N/A | **<100ms** | **Fast** |
-
-**Key Achievements:**
-- **95-99% improvement** in read operations (primary workflow)
-- **75-92% improvement** in edit operations
-- **Zero overhead** for project switching
-- **Database connection overhead eliminated** (0ms vs 50-100ms)
-- **Project reconciliation delays removed** from API requests
-- **<500ms target achieved** for all operations except write (which includes file sync)
-
-### ✅ Backwards Compatibility - MAINTAINED
-
-- All existing functionality preserved
-- CLI operations unaffected
-- Fallback for non-API contexts maintained
-- No breaking changes to existing APIs
-- Optional configuration with safe defaults
-
-### ✅ Testing Validation - PASSED
-
-- Integration tests passing
-- Type checking clear
-- Linting checks passed
-- Live testing with real MCP tools successful
-- Multi-project workflows validated
-- Rapid project switching validated
-
-## Notes
+## Notes
 
 Implementation Priority:
-- Phase 1 COMPLETED: Database connection caching provides 95%+ performance gains
-- Phase 2 NOT NEEDED: Project reconciliation removal achieved the goals
-- Phase 3 INCLUDED: skip_initialization_sync flag added
+- Phase 1 provides 80% of performance gains and should be implemented first
+- Phase 2 provides remaining 20% and addresses edge cases
+- Phase 3 is optional for maximum cloud optimization
 
 Risk Mitigation:
-- All changes backwards compatible implemented
-- Gradual implementation successful (Phase 1 → validation)
-- Easy rollback via configuration flags available
+- All changes backwards compatible
+- Gradual rollout possible (Phase 1 → 2 → 3)
+- Easy rollback via configuration flags
 
 Cloud Integration:
-- ✅ This optimization directly addresses basic-memory-cloud issue #82
-- ✅ Changes in core basic-memory will benefit all cloud tenants
-- ✅ No changes needed in basic-memory-cloud itself
-
-**Result**: SPEC-11 performance optimizations successfully implemented and validated. The 95-99% improvement in MCP tool response times exceeds the original 50-80% target, providing exceptional performance gains for cloud deployments and local usage.
+- This optimization directly addresses basic-memory-cloud issue #82
+- Changes in core basic-memory will benefit all cloud tenants
+- No changes needed in basic-memory-cloud itself
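The Phase 1 change SPEC-11 describes (cache the engine and session_maker at API startup so per-request dependencies reuse them, with a fallback for CLI contexts) follows a common lifespan-caching pattern. A minimal stdlib sketch of the pattern, assuming hypothetical names since the actual `app.py`/`deps.py` code is not shown in this diff:

```python
# Sketch of the connection-caching pattern from SPEC-11 Phase 1.
# AppState, make_engine, startup, and get_engine are illustrative names,
# not the real basic-memory code (which caches a SQLAlchemy engine).
import sqlite3


class AppState:
    """Stands in for FastAPI's app.state."""
    def __init__(self) -> None:
        self.engine = None


def make_engine(db_path: str) -> sqlite3.Connection:
    # Expensive setup happens once, at startup, not on every request.
    return sqlite3.connect(db_path)


def startup(state: AppState, db_path: str = ":memory:") -> None:
    # Lifespan hook: create and cache the connection in app.state.
    state.engine = make_engine(db_path)


def get_engine(state: AppState, db_path: str = ":memory:"):
    # Dependency: return the cached connection; fall back to creating one
    # for non-API (CLI) contexts, as the spec's notes describe.
    if state.engine is None:
        state.engine = make_engine(db_path)
    return state.engine
```

Because every request receives the same cached object, the per-request connection setup cost drops to a dictionary lookup.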
Lines changed: 182 additions & 0 deletions
# SPEC-12: OpenTelemetry Observability

## Why

We need comprehensive observability for basic-memory-cloud to:
- Track request flows across our multi-tenant architecture (MCP → Cloud → API services)
- Debug performance issues and errors in production
- Understand user behavior and system usage patterns
- Correlate issues to specific tenants for targeted debugging
- Monitor service health and latency across the distributed system

Currently, we only have basic logging without request correlation or distributed tracing capabilities.

## What

Implement OpenTelemetry instrumentation across all basic-memory-cloud services with:

### Core Requirements
1. **Distributed Tracing**: End-to-end request tracing from the MCP gateway through to tenant API instances
2. **Tenant Correlation**: All traces tagged with tenant_id, user_id, and workos_user_id
3. **Service Identification**: Clear service naming and namespace separation
4. **Auto-instrumentation**: Automatic tracing for FastAPI, SQLAlchemy, and HTTP clients
5. **Grafana Cloud Integration**: Direct OTLP export to Grafana Cloud Tempo

### Services to Instrument
- **MCP Gateway** (basic-memory-mcp): Entry point with JWT extraction
- **Cloud Service** (basic-memory-cloud): Provisioning and management operations
- **API Service** (basic-memory-api): Tenant-specific instances
- **Worker Processes** (ARQ workers): Background job processing

### Key Trace Attributes
- `tenant.id`: UUID from UserProfile.tenant_id
- `user.id`: WorkOS user identifier
- `user.email`: User email for debugging
- `service.name`: Specific service identifier
- `service.namespace`: Environment (development/production)
- `operation.type`: Business operation (provision/update/delete)
- `tenant.app_name`: Fly.io app name for tenant instances
## How

### Phase 1: Setup OpenTelemetry SDK
1. Add OpenTelemetry dependencies to each service's pyproject.toml:

```python
"opentelemetry-distro[otlp]>=1.29.0",
"opentelemetry-instrumentation-fastapi>=0.50b0",
"opentelemetry-instrumentation-httpx>=0.50b0",
"opentelemetry-instrumentation-sqlalchemy>=0.50b0",
"opentelemetry-instrumentation-logging>=0.50b0",
```

2. Create a shared telemetry initialization module (`apps/shared/telemetry.py`)

3. Configure the Grafana Cloud OTLP endpoint via environment variables:

```bash
OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway-prod-us-east-2.grafana.net/otlp
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Basic[token]
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
```
### Phase 2: Instrument MCP Gateway
1. Extract tenant context from the AuthKit JWT in middleware
2. Create a root span with tenant attributes
3. Propagate trace context to downstream services via headers
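Step 1 only needs the token's payload claims, which can be read without signature verification (verification happens in the auth layer). A stdlib sketch, where the claim names (`tenant_id`, `sub`, `email`) are assumptions about the AuthKit token shape and the attribute keys come from the Key Trace Attributes list:

```python
# Illustrative only: decode a JWT payload (no signature verification) and
# map its claims onto the span attributes defined earlier in this spec.
import base64
import json


def span_attributes_from_jwt(token: str) -> dict:
    payload_b64 = token.split(".")[1]
    # JWT base64url payloads may omit padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    # Claim names are assumptions about the AuthKit token, not confirmed.
    return {
        "tenant.id": claims.get("tenant_id"),
        "user.id": claims.get("sub"),
        "user.email": claims.get("email"),
    }
```

Middleware would set these on the root span and then forward the W3C `traceparent` header so downstream services join the same trace.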
### Phase 3: Instrument Cloud Service
1. Continue the trace from the MCP gateway
2. Add operation-specific attributes (provisioning events)
3. Instrument ARQ worker jobs for async operations
4. Track Fly.io API calls and latency

### Phase 4: Instrument API Service
1. Extract tenant context from the JWT
2. Add machine-specific metadata (instance ID, region)
3. Instrument database operations with SQLAlchemy
4. Track MCP protocol operations
### Phase 5: Configure and Deploy
1. Add OTLP configuration to `.env.example` and `.env.example.secrets`
2. Set Fly.io secrets for production deployment
3. Update Dockerfiles to use the `opentelemetry-instrument` wrapper
4. Deploy to the development environment first for testing
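Step 3 amounts to prefixing the container entrypoint with the wrapper. One possible Dockerfile ending, as a sketch — the `app.main:app` module path is a placeholder, not the actual service layout:

```dockerfile
# Zero-code auto-instrumentation: the wrapper configures exporters from
# the OTEL_* environment variables set as Fly.io secrets.
CMD ["opentelemetry-instrument", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```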
## How to Evaluate

### Success Criteria
1. **End-to-end traces visible in Grafana Cloud** showing the complete request flow
2. **Tenant filtering works** - can filter traces by tenant_id to see all requests for a user
3. **Service maps accurate** - Grafana shows correct service dependencies
4. **Performance overhead < 5%** - minimal latency impact from instrumentation
5. **Error correlation** - can trace errors back to a specific tenant and operation

### Testing Checklist
- [x] Single request creates a connected trace across all services
- [x] Tenant attributes present on all spans
- [x] Background jobs (ARQ) appear in traces
- [x] Database queries show in the trace timeline
- [x] HTTP calls to the Fly.io API tracked
- [x] Traces exported successfully to Grafana Cloud
- [x] Can search traces by tenant_id in Grafana
- [x] Service dependency graph shows the correct flow

### Monitoring Success
- All services reporting traces to Grafana Cloud
- No OTLP export errors in logs
- Trace sampling working correctly (if implemented)
- Resource usage acceptable (CPU/memory)

## Dependencies
- Grafana Cloud account with an OTLP endpoint configured
- OpenTelemetry Python SDK v1.29.0+
- FastAPI instrumentation compatibility
- Network access from Fly.io to Grafana Cloud

## Implementation Assignment
**Recommended Agent**: python-developer
- Requires Python/FastAPI expertise
- Needs understanding of distributed systems
- Must implement middleware and context propagation
- Should understand the OpenTelemetry SDK and instrumentation

## Follow-up Tasks

### Enhanced Log Correlation
While basic trace-to-log correlation works automatically via OpenTelemetry logging instrumentation, consider adding structured logging for improved log filtering:

1. **Structured Logging Context**: Add `logger.bind()` calls to inject tenant/user context directly into log records
2. **Custom Loguru Formatter**: Extract OpenTelemetry span attributes for better log readability
3. **Direct Log Filtering**: Enable searching logs directly by tenant_id or workflow_id without going through traces

This would complement the existing automatic trace correlation and provide better log search capabilities.
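Item 1 refers to loguru's `logger.bind()`, which returns a logger whose records always carry the bound key-value context. The same idea expressed with the stdlib's `LoggerAdapter`, purely as an illustration (the services themselves use loguru):

```python
# Illustration of structured logging context; in the actual services,
# loguru's logger.bind(tenant_id=...) achieves the same effect.
import logging

logging.basicConfig(format="%(levelname)s tenant=%(tenant_id)s %(message)s")


def bind(logger: logging.Logger, **context) -> logging.LoggerAdapter:
    # Every record emitted through the adapter carries the bound context,
    # so logs become searchable by tenant_id without going through traces.
    return logging.LoggerAdapter(logger, context)


log = bind(logging.getLogger("basic-memory"), tenant_id="t-123")
log.warning("provisioning started")  # record carries tenant_id=t-123
```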
## Alternative Solution: Logfire

After implementing OpenTelemetry with Grafana Cloud, we discovered limitations in the observability experience:
- Traces work but lack useful context without correlated logs
- Setting up log correlation with Grafana is complex and requires additional infrastructure
- The developer experience for Python observability is suboptimal

### Logfire Evaluation

**Pydantic Logfire** offers a compelling alternative that addresses our specific requirements:

#### Core Requirements Match
- ✅ **User Activity Tracking**: Automatic request tracing with business context
- ✅ **Error Monitoring**: Built-in exception tracking with full context
- ✅ **Performance Metrics**: Automatic latency and performance monitoring
- ✅ **Request Tracing**: Native distributed tracing across services
- ✅ **Log Correlation**: Seamless trace-to-log correlation without setup

#### Key Advantages
1. **Python-First Design**: Built specifically for Python/FastAPI applications by the Pydantic team
2. **Simple Integration**: `pip install logfire` + `logfire.configure()` vs a complex OTLP setup
3. **Automatic Correlation**: Logs automatically include trace context without manual configuration
4. **Real-time SQL Interface**: Query spans and logs using SQL with auto-completion
5. **Better Developer UX**: Purpose-built observability UI vs generic Grafana dashboards
6. **Loguru Integration**: `logger.configure(handlers=[logfire.loguru_handler()])` maintains existing logging

#### Pricing Assessment
- **Free Tier**: 10M spans/month (suitable for development and small production workloads)
- **Transparent Pricing**: $1 per million spans/metrics after the free tier
- **No Hidden Costs**: No per-host fees, only usage-based metering
- **Production Ready**: Recently exited beta; enterprise features available

#### Migration Path
The existing OpenTelemetry instrumentation is compatible: Logfire uses OpenTelemetry under the hood, so the current spans and attributes would work unchanged.

### Recommendation

**Consider migrating to Logfire** for the following reasons:
1. It directly addresses the "next to useless" traces problem by providing integrated logs
2. Dramatically simpler setup and maintenance compared to Grafana Cloud plus custom log correlation
3. Better ROI on the observability investment with purpose-built Python tooling
4. The free tier is sufficient for current development needs, with a clear scaling path

The current Grafana Cloud implementation provides a solid foundation and could remain as a backup/export target, while Logfire becomes the primary observability platform.

## Status
**Created**: 2024-01-28
**Status**: Completed (OpenTelemetry + Grafana Cloud)
**Next Phase**: Evaluate Logfire migration
**Priority**: High - Critical for production observability
