OpenSIN-AI
diff --git a/docs/best-practices/hf-fleet-keepalive.md b/docs/best-practices/hf-fleet-keepalive.md
Lines changed: 102 additions & 23 deletions
diff --git a/docs/best-practices/index.md b/docs/best-practices/index.md
Lines changed: 40 additions & 37 deletions
@@ -1,32 +1,111 @@
-# Hugging Face Fleet Keep-Alive
+---
+title: Ultimate HF Fleet Keep-Alive & Persistence Protocol
+description: How Hugging Face Spaces must stay awake, preserve state, and avoid accidental data loss.
+---
+
+# Ultimate HF Fleet Keep-Alive & Persistence Protocol
+
+> **RULE:** A free-tier Hugging Face Space is disposable by default. If you do not actively keep it warm and persist critical state, you are building on sand.
+
+---
+
+## 1. Why This Exists
+
+HF Spaces on free tiers can sleep. Sleep means:
+- memory is gone
+- in-flight sessions die
+- temporary files vanish
+- browser/auth state may be lost
+
+Therefore every meaningful HF-hosted agent must have:
+1. a keep-alive strategy
+2. a persistence strategy
+3. a recovery strategy
+
+---
+
+## 2. Keep-Alive Strategy
+
+### Canonical pattern
+Use a centralized scheduler (n8n on the OCI VM) that pings the health endpoint of every active HF Space on a fixed interval.
+
+### Why centralized
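+A decentralized “each agent pings itself” model is fragile and harder to audit: a Space that has already gone to sleep cannot ping itself awake, and per-agent timers are scattered across the fleet.
+The OCI scheduler is stable, visible, and cheap.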
+
+### Recommended interval
+Ping every ~45 minutes to prevent free-tier sleep; revisit the interval if platform behavior changes.
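The canonical pattern above runs as an n8n workflow, which this diff does not show. As a rough illustration of what a single scheduler run does, here is a minimal Python sketch. The URL pattern `https://<space-name>.hf.space/health` comes from the previous revision of this page; the names `health_url`, `http_status`, and `ping_fleet` are hypothetical, and the 30-second timeout is an assumption.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable

# ~45 minutes, the interval recommended in this document.
PING_INTERVAL_SECONDS = 45 * 60


def health_url(space_name: str) -> str:
    """Build the health URL for a Space (pattern from the previous doc revision)."""
    return f"https://{space_name}.hf.space/health"


def http_status(url: str, timeout: float = 30.0) -> int:
    """Return the HTTP status of a GET request, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the Space answered, but with an error status
    except OSError:
        return 0  # DNS failure, timeout, refused connection, and similar


def ping_fleet(
    space_names: Iterable[str],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Ping every Space once and report which ones answered with a 2xx status."""
    results: Dict[str, bool] = {}
    for name in space_names:
        status = fetch(health_url(name))
        results[name] = 200 <= status < 300
    return results
```

The `fetch` parameter is injectable so the loop can be exercised without network access; a scheduler would call `ping_fleet` once per `PING_INTERVAL_SECONDS` and alert on any `False` entry.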
+
+---
+
+## 3. Persistence Strategy
 
-To ensure that our A2A agents and MCP servers hosted on Hugging Face (HF) Spaces remain active and do not fall into sleep mode (Free Tier), we implement a centralized **Keep-Alive Pinger** via n8n.
+Never rely on the HF VM filesystem for important long-lived state.
+Use one of:
+- Hugging Face Dataset as persistence store
+- Supabase / remote DB for structured state
+- Git repository for durable text/config state
 
-## Architecture
+### Typical persisted items
+- auth/session bundles
+- queue checkpoints
+- agent state snapshots
+- generated artifacts requiring recovery
 
-1.  **n8n Workflow:** A scheduled workflow running on our OCI VM.
-2.  **Frequency:** Every **45 minutes**.
-3.  **Target:** Every HF Space URL in the OpenSIN-AI fleet.
-4.  **Endpoint:** `https://<space-name>.hf.space/health`
+---
+
+## 4. Recovery Strategy
+
+On restart, the agent should:
+1. restore persisted state
+2. verify auth/session validity
+3. resume pending jobs safely
+4. emit a recovery log/heartbeat
+
+### Why
+Without a recovery path, every HF restart becomes hidden data loss.
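The four restart steps can be sketched as follows. This is an illustration under stated assumptions, not the fleet's actual implementation: `StateStore`, `InMemoryStore`, `AgentState`, `checkpoint`, and `recover` are hypothetical names, and the in-memory store only stands in for a remote backend (HF Dataset, Supabase, or Git) so the example runs locally.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict, List, Optional, Protocol


class StateStore(Protocol):
    """Hypothetical interface: one JSON document per key, held remotely."""

    def save(self, key: str, payload: str) -> None: ...
    def load(self, key: str) -> Optional[str]: ...


class InMemoryStore:
    """Stand-in so the sketch runs locally. NOT durable; swap in a remote store."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def save(self, key: str, payload: str) -> None:
        self._data[key] = payload

    def load(self, key: str) -> Optional[str]:
        return self._data.get(key)


@dataclass
class AgentState:
    session_token: str = ""
    pending_jobs: List[str] = field(default_factory=list)
    done_jobs: List[str] = field(default_factory=list)


def checkpoint(store: StateStore, agent_id: str, state: AgentState) -> None:
    """Persist a snapshot; call this after every meaningful state change."""
    store.save(agent_id, json.dumps(asdict(state)))


def recover(
    store: StateStore,
    agent_id: str,
    session_is_valid: Callable[[str], bool],
    log: Callable[[str], None] = print,
) -> AgentState:
    # 1. Restore persisted state (or start empty if nothing was saved).
    raw = store.load(agent_id)
    state = AgentState(**json.loads(raw)) if raw is not None else AgentState()
    # 2. Verify the session; drop it so the agent re-authenticates if invalid.
    session_ok = bool(state.session_token) and session_is_valid(state.session_token)
    if not session_ok:
        state.session_token = ""
    # 3. Resume safely: never re-run a job that is already recorded as done.
    state.pending_jobs = [j for j in state.pending_jobs if j not in state.done_jobs]
    # 4. Emit an observable recovery record.
    log(json.dumps({
        "event": "recovered",
        "agent": agent_id,
        "restored": raw is not None,
        "session_ok": session_ok,
        "pending_jobs": len(state.pending_jobs),
        "ts": time.time(),
    }))
    return state
```

Filtering pending jobs against the done list is one simple way to make resumption idempotent; a real queue may need stronger guarantees.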
+
+---
+
+## 5. What Must Never Be Local-Only
+
+Do **not** keep any of the following only on the HF filesystem:
+- browser auth you cannot cheaply recreate
+- workflow queue state
+- issue mapping state
+- important logs or evidence
+- generated outputs the user depends on
+
+---
+
+## 6. Health Endpoint Requirement
 
-## Covered Agents (as of April 2026)
+Every HF-hosted service should expose a lightweight health endpoint such as:
+- `/health`
+- `/status`
+- a tiny JSON heartbeat route
 
-| Agent | Purpose |
-| :--- | :--- |
-| **openjerro-opensin-bridge-mcp** | Prolific Bridge / Chrome Extension Bridge |
-| **delqhi-sin-authenticator** | Fleet Auth & OTP |
-| **delqhi-sin-github-issues** | Autonomous Issue Management |
-| **delqhi-sin-stripe** | SaaS Billing & Payments |
-| **delqhi-sin-passwordmanager** | Secure Credential Storage |
-| **delqhi-sin-code-ai** | Autonomous Coding Engine |
-| **10+ Frontend Agents** | Accessibility, App-Shell, Commerce-UI, etc. |
+This is what the keep-alive scheduler pings and what operators use for status checks.
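A lightweight health route needs nothing beyond the Python standard library. The sketch below is one possible shape, not a prescribed implementation: the payload fields, `HealthHandler`, and `serve` are hypothetical, and port 7860 is assumed to be the port the Space exposes.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional

STARTED_AT = time.time()


def health_payload(now: Optional[float] = None) -> dict:
    """Small JSON heartbeat: a status flag plus process uptime in seconds."""
    current = time.time() if now is None else now
    return {"status": "ok", "uptime_seconds": round(current - STARTED_AT, 1)}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path in ("/health", "/status"):
            body = json.dumps(health_payload()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format: str, *args) -> None:
        pass  # keep routine keep-alive pings out of the access log


def serve(port: int = 7860) -> None:
    """Block forever serving the health route (port is an assumption)."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint; in a real agent the same route would normally live inside the existing web framework rather than a separate server.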
 
-## How to add a new Space
+---
 
-1.  Locate the workflow in `OpenSIN-backend/n8n-workflows/hf-keepalive.json`.
-2.  Add the new HF URL to the `Full HF Fleet List` node.
-3.  Commit and push to the backend repository.
-4.  The OCI CI Runner will automatically update the production workflow.
+## 7. Operational Checklist
+
+- [ ] health endpoint exists
+- [ ] n8n keep-alive poller includes the Space URL
+- [ ] important state is persisted remotely
+- [ ] restart flow restores state
+- [ ] recovery produces observable logs
 
 ---
-*Note: This mechanism is essential for the reliability of the OpenSIN Bridge SaaS.*
+
+## 8. Final Rule
+
+**HF Spaces are excellent workers, but terrible memory.**
+Treat them like resumable executors, not durable homes.
+
+---
+
+*Last updated:* 2026-04-10  
+*Status:* **ACTIVE & MANDATORY**  
+*Maintainer:* sin-zeus
@@ -4,48 +4,51 @@ title: "Best Practices"
 
 # Best Practices
 
-Production-tested guidelines for building, deploying, and operating OpenSIN agents and infrastructure.
+This section is the rulebook for building, operating, debugging, and scaling the OpenSIN fleet.
+It is not a loose suggestion shelf. It is the operational memory of the system.
 
-## Core Practices
+## Core Mandates
 
 | Document | Focus |
 |----------|-------|
-| [Agent Design](/best-practices/agent-design) | Single responsibility, model selection, system prompts |
-| [Security](/best-practices/security) | Credential management, input validation, MCP security |
-| [Performance](/best-practices/performance) | Model routing, context management, caching |
+| [Agent Design](/best-practices/agent-design) | Ultimate fleet mandates, no-silo rules, self-healing, test-proof culture |
+| [Code Quality](/best-practices/code-quality) | Extreme commenting mandate, anti-AI-slop, review discipline |
+| [Error Handling](/best-practices/error-handling) | Immediate bug registry, no-assumptions, self-healing escalation |
+| [Browser Automation](/best-practices/browser-automation) | DevTools-first, anti-bot bypass, Chrome profile law |
+| [A2A Communication](/best-practices/a2a-communication) | Pure agentic paradigm, inbound governance, opencode-only LLM usage |
 
-## Extended Practices
+## System Reliability & Execution
 
 | Document | Focus |
 |----------|-------|
-| [Testing](/best-practices/testing) | Unit, integration, E2E testing strategies |
-| [A2A Communication](/best-practices/a2a-communication) | Message design, reliability, security |
-| [Plugin Development](/best-practices/plugin-development) | Plugin structure, commands, agents, skills, hooks |
-| [MCP Integration](/best-practices/mcp-integration) | Transport selection, connection management, security |
-| [Team Orchestration](/best-practices/team-orchestration) | Delegation strategies, retry, monitoring |
-| [Error Handling](/best-practices/error-handling) | Error classification, retry patterns, recovery |
-| [Monitoring & Observability](/best-practices/monitoring-observability) | Metrics, health checks, alerting, dashboards |
-| [Code Quality](/best-practices/code-quality) | Code style, architecture, review standards |
-| [**CI/CD with n8n + sin-github-action**](/best-practices/ci-cd-n8n) | **🚨 MANDATORY**: Zero-billing CI via n8n OCI Runner. NEVER use regular GitHub Actions! |
-
-## Quick Reference
-
-### Before Deploying an Agent
-
-- [ ] All secrets in environment variables (not hardcoded)
-- [ ] Permission manager configured
-- [ ] Error handling covers edge cases
-- [ ] Tests pass with adequate coverage
-- [ ] Logging configured with redaction
-- [ ] Model routing optimized for cost
-- [ ] A2A endpoints authenticated
-
-### Before Merging Code
-
-- [ ] ESLint passing
-- [ ] TypeScript strict mode clean
-- [ ] No `as any` or `@ts-ignore`
-- [ ] Tests included
-- [ ] Documentation updated
-- [ ] Security review completed
-- [Software 3.0: Neural-Bus](/docs/best-practices/software-3.0-neural-bus)
+| [Testing](/best-practices/testing) | Runtime proof, workflow validation, UI/browser verification |
+| [Monitoring & Observability](/best-practices/monitoring-observability) | Health, metrics, evidence retention, alert usefulness |
+| [Team Orchestration](/best-practices/team-orchestration) | Parallel vs sequential work, retries, specialist routing |
+| [HF Fleet Keep-Alive](/best-practices/hf-fleet-keepalive) | Hugging Face wake strategy, persistence, recovery |
+| [CI/CD with n8n + sin-github-action](/best-practices/ci-cd-n8n) | Zero-billing CI via OCI + n8n |
+
+## Advanced / Specialized
+
+| Document | Focus |
+|----------|-------|
+| [MCP Integration](/best-practices/mcp-integration) | MCP transport, safety, integration patterns |
+| [Plugin Development](/best-practices/plugin-development) | Plugin architecture and extension rules |
+| [Performance](/best-practices/performance) | Cost, model routing, latency, efficiency |
+| [Security](/best-practices/security) | Secrets, auth boundaries, operator trust |
+| [Software 3.0: Neural-Bus](/docs/best-practices/software-3.0-neural-bus) | Higher-level architecture doctrine |
+| [SEO Pipeline](/best-practices/seo-pipeline) | Proof-of-work blog publishing pipeline |
+
+## Before You Call Something “Done”
+
+- [ ] issue exists and matches the work
+- [ ] code is commented with WHAT / WHY / WHY NOT / CONSEQUENCES
+- [ ] repo-native checks pass
+- [ ] runtime proof exists
+- [ ] screenshots/logs exist where relevant
+- [ ] docs updated if architecture or workflow changed
+- [ ] remaining risk is clearly stated
+
+## Final Reminder
+
+The OpenSIN fleet is allowed to move fast **only because** it is forced to leave evidence, structure, and recoverable knowledge behind.
+Without that, autonomy becomes chaos.