Commit 0549750

Add Tenant Lifecycle Automation design document
Expanded `tenant-lifecycle-automation.md` with a detailed framework for automating tenant provisioning, activation, and health verification.

- Defined goals, scope, and non-goals for the automation process.
- Outlined personas (Platform Admin, Tenant Admin, SRE/DevOps).
- Documented high-level workflow for provisioning steps.
- Specified functional, operational, and security requirements.
- Added acceptance criteria for successful provisioning.
- Included failure/recovery criteria for error handling and retries.

This update ensures a robust, secure, and observable tenant lifecycle management process.
1 parent affa69f commit 0549750

1 file changed

Lines changed: 81 additions & 0 deletions

@@ -0,0 +1,81 @@
# Tenant Lifecycle Automation

Goal: automate tenant provisioning, activation, and health verification so new tenants are production-ready with minimal manual steps while preserving multi-tenant safety, auditing, and observability.

## Scope (In)
- Creating or activating a tenant triggers a background provisioning workflow.
- Per-tenant database (or schema) creation, migrations, and seed data.
- Default identity bootstrap (admin user/roles/permissions) tied to tenant.
- Health verification and status reporting.
- Idempotent, retryable orchestration with audit and telemetry.
- Admin endpoints/UX to view workflow state and retry/re-run steps.

## Non-Goals (Out)
- Full-featured feature-flag platform.
- Billing/usage metering.
- Cross-cloud infrastructure automation (K8s, DNS, CDN).

## Personas
- Platform Admin: initiates tenant creation, monitors status, retries failed steps.
- Tenant Admin: receives bootstrap credentials, validates app access post-provision.
- SRE/DevOps: monitors health, investigates failed jobs, tunes resilience.

## High-Level Flow
1) Admin issues `CreateTenant` (or activates an existing tenant).
2) System enqueues a provisioning job (Hangfire) keyed by TenantId and a correlation ID.
3) Workflow steps (all idempotent):
   - Validate tenant metadata (provider, connection string template, validity).
   - Create tenant database/schema (or ensure it exists) using a provider-specific strategy.
   - Apply EF Core migrations for each enabled module (Multitenancy, Identity, Auditing, etc.).
   - Seed baseline data (roles, permissions, admin user with reset token, root tenant data if applicable).
   - Warm caches if enabled (e.g., permissions).
   - Emit audit + telemetry events for each step.
4) Mark tenant as `Active` when all steps succeed; surface status via API.
5) On failure: capture error, mark status `Failed`, allow retry/resume from failed step.
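
The flow above can be illustrated with a minimal, language-neutral sketch. It is written in Python purely for brevity; the design itself targets Hangfire and EF Core, and none of the names below (`ProvisioningRun`, `StepStatus`, `run_workflow`) come from the design. They are hypothetical, and the in-memory record stands in for whatever per-tenant persistence is eventually chosen.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Optional, Tuple


class StepStatus(Enum):
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


@dataclass
class ProvisioningRun:
    """Hypothetical per-tenant record of step outcomes (would be persisted)."""
    tenant_id: str
    steps: Dict[str, StepStatus] = field(default_factory=dict)
    tenant_status: str = "Provisioning"
    error: Optional[str] = None


Step = Tuple[str, Callable[[str], None]]


def run_workflow(run: ProvisioningRun, steps: List[Step]) -> ProvisioningRun:
    """Run steps in order; on retry, skip steps that already succeeded."""
    run.error = None
    run.tenant_status = "Provisioning"
    for name, action in steps:
        if run.steps.get(name) is StepStatus.SUCCEEDED:
            continue  # resume: completed steps are not repeated
        try:
            action(run.tenant_id)
        except Exception as exc:
            # Capture the error, mark the run Failed, and stop at this step.
            run.steps[name] = StepStatus.FAILED
            run.tenant_status = "Failed"
            run.error = f"{name}: {exc}"
            return run
        run.steps[name] = StepStatus.SUCCEEDED
    run.tenant_status = "Active"
    return run
```

Calling `run_workflow` again with the same `ProvisioningRun` after a failure resumes from the failed step, because earlier steps are already recorded as succeeded. Steps still need to be idempotent on their own, since a step can fail partway through.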

## Functional Requirements
- Provisioning job:
  - Runs as Hangfire background job; supports manual trigger and automatic trigger on create/activate.
  - Stores per-step status, timestamps, and error messages (persisted per tenant).
  - Uses correlation/trace IDs; logs to OpenTelemetry.
  - Supports cancellation and exponential backoff retries.
- Database orchestration:
  - Provider-aware strategies (PostgreSQL initial target; hooks for SQL Server).
  - Option to create database if missing; else validate connectivity.
  - Runs module migrations in deterministic order; stops on first failure.
- Seeding:
  - Seeds Identity admin user, default roles/permissions, and tenant metadata.
  - Issues one-time admin credential or password reset token for Tenant Admin.
  - Seeds demo data optionally (flag).
- Status surface:
  - API to fetch provisioning status history per tenant.
  - Health check should include tenant provisioning status (ready/degraded/failed).
- Safety & idempotency:
  - All steps re-runnable without corrupting state (check-before-create).
  - Guard against concurrent provisioning for same tenant.
  - Respect tenant validity/activation flags.
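
Two of the requirements above, the concurrency guard and exponential backoff, are easy to misread, so a small sketch may help. This is an illustration in Python, not the Hangfire implementation: `TenantProvisioningGuard` and `backoff_delays` are hypothetical names, and the in-memory set is only a stand-in for a distributed lock or a unique job key, which a multi-node deployment would need.

```python
import threading
from typing import List


class TenantProvisioningGuard:
    """In-memory stand-in (assumption) for a distributed per-tenant lock."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self._active = set()  # tenant IDs with a workflow in flight

    def try_acquire(self, tenant_id: str) -> bool:
        """Return True if the caller may provision; False if a run is active."""
        with self._mutex:
            if tenant_id in self._active:
                return False
            self._active.add(tenant_id)
            return True

    def release(self, tenant_id: str) -> None:
        with self._mutex:
            self._active.discard(tenant_id)


def backoff_delays(base_seconds: float, max_attempts: int, cap_seconds: float) -> List[float]:
    """Delay before each retry: base * 2**attempt, capped at cap_seconds."""
    return [min(cap_seconds, base_seconds * (2 ** n)) for n in range(max_attempts)]
```

A job would call `try_acquire` before its first step and `release` in a `finally` block, so a second submission for the same tenant is rejected rather than run in parallel. The cap in `backoff_delays` keeps late retries from waiting unboundedly long; adding random jitter is a common refinement not shown here.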

## Operational/Observability Requirements
- Emit structured logs with TenantId, correlationId, step name, duration, outcome.
- Create OpenTelemetry spans for each step (db create, migrate, seed, cache warm).
- Publish audit events for lifecycle changes (Requested, Started, StepFailed, Completed).
- Expose metrics: provision_duration_seconds, provision_step_failures_total, active_tenants.
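
As a rough illustration of the per-step log fields and the failure counter listed above, the sketch below wraps a step in a context manager. It deliberately avoids any OpenTelemetry API: `log_records` and `step_failures` are hypothetical in-memory stand-ins for a structured log sink and the `provision_step_failures_total` metric, and `observed_step` is an invented helper name.

```python
import time
from collections import Counter
from contextlib import contextmanager

step_failures = Counter()  # stand-in for provision_step_failures_total
log_records = []           # stand-in for a structured log sink


@contextmanager
def observed_step(tenant_id, correlation_id, step):
    """Record duration and outcome for one provisioning step."""
    started = time.perf_counter()
    outcome = "Succeeded"
    try:
        yield
    except Exception:
        outcome = "Failed"
        step_failures[step] += 1
        raise  # the workflow still sees the original error
    finally:
        log_records.append({
            "TenantId": tenant_id,
            "correlationId": correlation_id,
            "step": step,
            "durationSeconds": time.perf_counter() - started,
            "outcome": outcome,
        })
```

A step would then run as `with observed_step(tenant_id, correlation_id, "migrate"): ...`. In a real implementation the same wrapper is a natural place to open and close the span for that step.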

## Security Requirements
- No secrets in logs/audits; hash/scrub credentials.
- Bootstrap credentials delivered via secure channel (email with reset token or out-of-band).
- Enforce tenant isolation during provisioning (context scopes, connection string guards).
- Authorization: only platform admins can trigger or retry provisioning.
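
One way to approach the "no secrets in logs" requirement is to scrub log fields before they are written. The sketch below is an assumption-laden example in Python: the key list in `SENSITIVE_KEYS` and the `scrub` helper are hypothetical, and the regular expression only covers the common `Password=...;` / `Pwd=...;` form found in connection strings.

```python
import re

# Assumed set of field names that must never be logged verbatim.
SENSITIVE_KEYS = {"password", "resettoken", "connectionstring", "secret"}

# Matches "Password=value" or "Pwd=value" up to the next semicolon.
_PASSWORD_PAIR = re.compile(r"(password|pwd)\s*=\s*[^;]*", re.IGNORECASE)


def scrub(fields: dict) -> dict:
    """Return a copy of log fields with credentials masked."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "***"  # mask the whole value
        elif isinstance(value, str):
            # Mask passwords embedded in free text such as error messages.
            clean[key] = _PASSWORD_PAIR.sub(lambda m: f"{m.group(1)}=***", value)
        else:
            clean[key] = value
    return clean
```

Masking by key name catches known fields, while the pattern catches credentials that leak into error messages, for example when a database driver echoes the connection string. Neither is exhaustive, so this complements rather than replaces keeping secrets out of log calls in the first place.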

## Acceptance Criteria (Happy Path)
- Creating a tenant triggers a job that:
  - Creates/validates DB, applies migrations for all enabled modules, seeds identity/admin, warms caches.
  - Marks tenant Active and Ready; status endpoint shows completed steps with durations.
- Audit trail shows Requested -> Started -> Completed with TenantId and correlationId.
- Metrics and traces include the provisioning spans and surface in health checks.

## Failure/Recovery Criteria
- If migrations fail, status is Failed with error details; job can be retried and resumes idempotently.
- Double-submit provisioning for same tenant does not run concurrent workflows (dedupe/lock).
- Partial seeds are safe to re-run (no duplicate roles/users; admin user upsert).
- Health check reports degraded for tenants with failed provisioning; improves after successful retry.
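
The "partial seeds are safe to re-run" criterion amounts to writing the seed step as an upsert. A minimal sketch, assuming an in-memory dictionary in place of the tenant database and a hypothetical `seed_identity` helper (the real step would go through the Identity module and EF Core):

```python
from typing import Dict, List


def seed_identity(store: Dict[str, dict], tenant_id: str,
                  roles: List[str], admin_email: str) -> dict:
    """Seed roles and the admin user; safe to call any number of times."""
    tenant = store.setdefault(tenant_id, {"roles": set(), "users": {}})
    tenant["roles"].update(roles)  # set union: re-runs add no duplicates
    # Upsert: create the admin user only if missing, then ensure the role.
    user = tenant["users"].setdefault(
        admin_email, {"email": admin_email, "roles": set()}
    )
    user["roles"].add("Admin")
    return tenant
```

Running the function twice with the same arguments leaves the store unchanged after the first call, which is the property a retried provisioning job relies on.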
