Commit 5bfed73

OpenSIN-AI committed
docs: expand core best-practices for testing, monitoring, orchestration, HF keepalive, and index
1 parent 0650111 commit 5bfed73

5 files changed

Lines changed: 654 additions & 680 deletions

Lines changed: 102 additions & 23 deletions
Original file line number | Diff line number | Diff line change
@@ -1,32 +1,111 @@
1-
# Hugging Face Fleet Keep-Alive
1+
---
2+
title: Ultimate HF Fleet Keep-Alive & Persistence Protocol
3+
description: How Hugging Face Spaces must stay awake, preserve state, and avoid accidental data loss.
4+
---
5+
6+
# Ultimate HF Fleet Keep-Alive & Persistence Protocol
7+
8+
> **RULE:** A Hugging Face Free VM is disposable by default. If you do not actively keep it warm and persist critical state, you are building on sand.
9+
10+
---
11+
12+
## 1. Why This Exists
13+
14+
HF Spaces on free tiers can sleep. Sleep means:
15+
- memory is gone
16+
- in-flight sessions die
17+
- temporary files vanish
18+
- browser/auth state may be lost
19+
20+
Therefore every meaningful HF-hosted agent must have:
21+
1. a keep-alive strategy
22+
2. a persistence strategy
23+
3. a recovery strategy
24+
25+
---
26+
27+
## 2. Keep-Alive Strategy
28+
29+
### Canonical pattern
30+
Use a centralized scheduler (n8n on the OCI VM) that pings the health endpoint of every active HF Space at a fixed interval.
31+
32+
### Why centralized
33+
A decentralized “each agent pings itself” model is fragile and harder to audit.
34+
The OCI scheduler is stable, visible, and cheap.
35+
36+
### Recommended interval
37+
Ping every ~45 minutes to prevent free-tier sleep; revisit the interval if platform behavior changes.
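
As an illustration of the canonical pattern, the following Python sketch shows what a single scheduler tick could do. It is not the n8n workflow itself: `ping_fleet`, `http_status`, `SPACE_URLS`, and the injectable `fetch` callable are hypothetical names, and the `/health` path is assumed from the endpoint convention this document describes.

```python
import urllib.request
from typing import Callable, Dict, Iterable

PING_INTERVAL_SECONDS = 45 * 60  # the ~45 minute interval recommended in this section

# Hypothetical placeholder; a real fleet list would live in the scheduler config.
SPACE_URLS = ["https://<space-name>.hf.space"]


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request to `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def ping_fleet(
    space_urls: Iterable[str],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Ping `<space>/health` once per Space and report which ones answered 200."""
    results: Dict[str, bool] = {}
    for base in space_urls:
        health_url = base.rstrip("/") + "/health"
        try:
            results[base] = fetch(health_url) == 200
        except Exception:
            # A failed ping is recorded, not raised, so one dead Space
            # does not stop the rest of the fleet from being pinged.
            results[base] = False
    return results
```

Keeping `fetch` injectable means the loop can be audited and tested without touching the network, which fits the argument for one visible, centralized pinger.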
38+
39+
---
40+
41+
## 3. Persistence Strategy
2 42

3-
To ensure that our A2A agents and MCP servers hosted on Hugging Face (HF) Spaces remain active and do not fall into sleep mode (Free Tier), we implement a centralized **Keep-Alive Pinger** via n8n.
43+
Never rely on the HF VM filesystem for important long-lived state.
44+
Use one of:
45+
- Hugging Face Dataset as persistence store
46+
- Supabase / remote DB for structured state
47+
- Git repository for durable text/config state
4 48

5-
## Architecture
49+
### Typical persisted items
50+
- auth/session bundles
51+
- queue checkpoints
52+
- agent state snapshots
53+
- generated artifacts requiring recovery
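
One way to picture the persistence rule: state is wrapped in a self-checking snapshot and handed to a remote store. In this sketch `RemoteStore`, `InMemoryStore`, `save_snapshot`, and `load_snapshot` are hypothetical names. The in-memory store only stands in for a Hugging Face Dataset, a remote database, or a Git repository; none of their real APIs are shown here.

```python
import hashlib
import json
from typing import Any, Dict, Optional, Protocol


class RemoteStore(Protocol):
    """Minimal interface a remote persistence backend would need (assumed)."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> Optional[bytes]: ...


class InMemoryStore:
    """Stand-in backend for illustration only; it is NOT durable."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> Optional[bytes]:
        return self._blobs.get(key)


def _digest(payload: str) -> str:
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def save_snapshot(store: RemoteStore, key: str, state: Dict[str, Any]) -> None:
    """Serialize `state` to JSON and store it together with a checksum."""
    payload = json.dumps(state, sort_keys=True)
    envelope = {"sha256": _digest(payload), "payload": payload}
    store.put(key, json.dumps(envelope).encode("utf-8"))


def load_snapshot(store: RemoteStore, key: str) -> Optional[Dict[str, Any]]:
    """Return the stored state, None if absent, or raise if it is corrupted."""
    raw = store.get(key)
    if raw is None:
        return None
    envelope = json.loads(raw.decode("utf-8"))
    payload = envelope["payload"]
    if _digest(payload) != envelope["sha256"]:
        raise ValueError(f"snapshot {key!r} failed its checksum")
    return json.loads(payload)
```

The checksum is a design choice of this sketch: a restart that restores a truncated snapshot silently is harder to notice than one that fails loudly.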
6 54

7-
1. **n8n Workflow:** A scheduled workflow running on our OCI VM.
8-
2. **Frequency:** Every **45 minutes**.
9-
3. **Target:** Every HF Space URL in the OpenSIN-AI fleet.
10-
4. **Endpoint:** `https://<space-name>.hf.space/health`
55+
---
56+
57+
## 4. Recovery Strategy
58+
59+
On restart, the agent should:
60+
1. restore persisted state
61+
2. verify auth/session validity
62+
3. resume pending jobs safely
63+
4. emit a recovery log/heartbeat
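
The four steps above can be sketched as a single startup routine. Everything here is hypothetical scaffolding: `recover`, `RecoveryReport`, the callables passed in, and the `pending_jobs` key are illustrative names, not an existing OpenSIN API.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

logger = logging.getLogger("recovery")


@dataclass
class RecoveryReport:
    state_restored: bool = False
    auth_valid: bool = False
    resumed_jobs: List[str] = field(default_factory=list)


def recover(
    load_state: Callable[[], Optional[Dict[str, Any]]],
    auth_is_valid: Callable[[Dict[str, Any]], bool],
    resume_job: Callable[[str], None],
) -> RecoveryReport:
    """Run the restart sequence once and report what was recovered."""
    report = RecoveryReport()

    # 1. Restore persisted state from the remote store.
    state = load_state()
    if state is None:
        logger.warning("recovery: no persisted state found, starting cold")
        return report
    report.state_restored = True

    # 2. Verify auth/session validity before touching any work.
    report.auth_valid = auth_is_valid(state)
    if not report.auth_valid:
        logger.warning("recovery: session invalid, pending jobs left untouched")
        return report

    # 3. Resume pending jobs safely: one failing job must not block the rest.
    for job_id in state.get("pending_jobs", []):  # key name is an assumption
        try:
            resume_job(job_id)
            report.resumed_jobs.append(job_id)
        except Exception:
            logger.exception("recovery: job %s could not be resumed", job_id)

    # 4. Emit a recovery log line that operators can observe.
    logger.info("recovery: complete, resumed=%d", len(report.resumed_jobs))
    return report
```

Returning a report object, rather than only logging, lets the caller forward the outcome to whatever heartbeat channel the fleet uses.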
64+
65+
### Why
66+
Without a recovery path, every HF restart becomes hidden data loss.
67+
68+
---
69+
70+
## 5. What Must Never Be Local-Only
71+
72+
Do **not** keep any of the following only on the HF filesystem:
73+
- browser auth you cannot cheaply recreate
74+
- workflow queue state
75+
- issue mapping state
76+
- important logs or evidence
77+
- generated outputs the user depends on
78+
79+
---
80+
81+
## 6. Health Endpoint Requirement
11 82

12-
## Covered Agents (as of April 2026)
83+
Every HF-hosted service should expose a lightweight health endpoint such as:
84+
- `/health`
85+
- `/status`
86+
- a tiny JSON heartbeat route
13 87

14-
| Agent | Purpose |
15-
| :--- | :--- |
16-
| **openjerro-opensin-bridge-mcp** | Prolific Bridge / Chrome Extension Bridge |
17-
| **delqhi-sin-authenticator** | Fleet Auth & OTP |
18-
| **delqhi-sin-github-issues** | Autonomous Issue Management |
19-
| **delqhi-sin-stripe** | SaaS Billing & Payments |
20-
| **delqhi-sin-passwordmanager** | Secure Credential Storage |
21-
| **delqhi-sin-code-ai** | Autonomous Coding Engine |
22-
| **10+ Frontend Agents** | Accessibility, App-Shell, Commerce-UI, etc. |
88+
This is what the keep-alive scheduler pings and what operators use for status checks.
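
A minimal sketch of such an endpoint, using only the Python standard library. `health_payload`, `HealthHandler`, `make_server`, the JSON fields, and the default port are assumptions for illustration; a real Space would more likely add the route to whatever web framework it already runs.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

STARTED_AT = time.time()


def health_payload() -> dict:
    """Build the tiny JSON heartbeat body (field names are assumptions)."""
    return {"status": "ok", "uptime_seconds": int(time.time() - STARTED_AT)}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Answer only the lightweight routes; everything else is a 404.
        if self.path not in ("/health", "/status"):
            self.send_error(404)
            return
        body = json.dumps(health_payload()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args) -> None:
        # Suppress per-request logging so keep-alive pings do not flood the logs.
        pass


def make_server(port: int = 7860) -> HTTPServer:
    """Create (but do not start) the server; the port is an assumed default."""
    return HTTPServer(("0.0.0.0", port), HealthHandler)
```

Calling `make_server().serve_forever()` would start it. The handler does no work beyond building a small dictionary, which keeps the route cheap enough to be pinged on every scheduler tick.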
23 89

24-
## How to add a new Space
90+
---
25 91

26-
1. Locate the workflow in `OpenSIN-backend/n8n-workflows/hf-keepalive.json`.
27-
2. Add the new HF URL to the `Full HF Fleet List` node.
28-
3. Commit and push to the backend repository.
29-
4. The OCI CI Runner will automatically update the production workflow.
92+
## 7. Operational Checklist
93+
94+
- [ ] health endpoint exists
95+
- [ ] n8n keep-alive poller includes the Space URL
96+
- [ ] important state is persisted remotely
97+
- [ ] restart flow restores state
98+
- [ ] recovery produces observable logs
30 99

31 100
---
32-
*Note: This mechanism is essential for the reliability of the OpenSIN Bridge SaaS.*
101+
102+
## 8. Final Rule
103+
104+
**HF Spaces are excellent workers, but terrible memory.**
105+
Treat them like resumable executors, not durable homes.
106+
107+
---
108+
109+
*Last updated:* 2026-04-10
110+
*Status:* **ACTIVE & MANDATORY**
111+
*Maintainer:* sin-zeus

docs/best-practices/index.md

Lines changed: 40 additions & 37 deletions
Original file line number | Diff line number | Diff line change
@@ -4,48 +4,51 @@ title: "Best Practices"
4 4

5 5
# Best Practices
6 6

7-
Production-tested guidelines for building, deploying, and operating OpenSIN agents and infrastructure.
7+
This section is the rulebook for building, operating, debugging, and scaling the OpenSIN fleet.
8+
It is not a loose suggestion shelf. It is the operational memory of the system.
8 9

9-
## Core Practices
10+
## Core Mandates
10 11

11 12
| Document | Focus |
12 13
|----------|-------|
13-
| [Agent Design](/best-practices/agent-design) | Single responsibility, model selection, system prompts |
14-
| [Security](/best-practices/security) | Credential management, input validation, MCP security |
15-
| [Performance](/best-practices/performance) | Model routing, context management, caching |
14+
| [Agent Design](/best-practices/agent-design) | Ultimate fleet mandates, no-silo rules, self-healing, test-proof culture |
15+
| [Code Quality](/best-practices/code-quality) | Extreme commenting mandate, anti-AI-slop, review discipline |
16+
| [Error Handling](/best-practices/error-handling) | Immediate bug registry, no-assumptions, self-healing escalation |
17+
| [Browser Automation](/best-practices/browser-automation) | DevTools-first, anti-bot bypass, Chrome profile law |
18+
| [A2A Communication](/best-practices/a2a-communication) | Pure agentic paradigm, inbound governance, opencode-only LLM usage |
16 19

17-
## Extended Practices
20+
## System Reliability & Execution
18 21

19 22
| Document | Focus |
20 23
|----------|-------|
21-
| [Testing](/best-practices/testing) | Unit, integration, E2E testing strategies |
22-
| [A2A Communication](/best-practices/a2a-communication) | Message design, reliability, security |
23-
| [Plugin Development](/best-practices/plugin-development) | Plugin structure, commands, agents, skills, hooks |
24-
| [MCP Integration](/best-practices/mcp-integration) | Transport selection, connection management, security |
25-
| [Team Orchestration](/best-practices/team-orchestration) | Delegation strategies, retry, monitoring |
26-
| [Error Handling](/best-practices/error-handling) | Error classification, retry patterns, recovery |
27-
| [Monitoring & Observability](/best-practices/monitoring-observability) | Metrics, health checks, alerting, dashboards |
28-
| [Code Quality](/best-practices/code-quality) | Code style, architecture, review standards |
29-
| [**CI/CD with n8n + sin-github-action**](/best-practices/ci-cd-n8n) | **🚨 MANDATORY**: Zero-billing CI via the n8n OCI Runner; NEVER regular GitHub Actions! |
30-
31-
## Quick Reference
32-
33-
### Before Deploying an Agent
34-
35-
- [ ] All secrets in environment variables (not hardcoded)
36-
- [ ] Permission manager configured
37-
- [ ] Error handling covers edge cases
38-
- [ ] Tests pass with adequate coverage
39-
- [ ] Logging configured with redaction
40-
- [ ] Model routing optimized for cost
41-
- [ ] A2A endpoints authenticated
42-
43-
### Before Merging Code
44-
45-
- [ ] ESLint passing
46-
- [ ] TypeScript strict mode clean
47-
- [ ] No `as any` or `@ts-ignore`
48-
- [ ] Tests included
49-
- [ ] Documentation updated
50-
- [ ] Security review completed
51-
- [Software 3.0: Neural-Bus](/docs/best-practices/software-3.0-neural-bus)
24+
| [Testing](/best-practices/testing) | Runtime proof, workflow validation, UI/browser verification |
25+
| [Monitoring & Observability](/best-practices/monitoring-observability) | Health, metrics, evidence retention, alert usefulness |
26+
| [Team Orchestration](/best-practices/team-orchestration) | Parallel vs sequential work, retries, specialist routing |
27+
| [HF Fleet Keep-Alive](/best-practices/hf-fleet-keepalive) | Hugging Face wake strategy, persistence, recovery |
28+
| [CI/CD with n8n + sin-github-action](/best-practices/ci-cd-n8n) | Zero-billing CI via OCI + n8n |
29+
30+
## Advanced / Specialized
31+
32+
| Document | Focus |
33+
|----------|-------|
34+
| [MCP Integration](/best-practices/mcp-integration) | MCP transport, safety, integration patterns |
35+
| [Plugin Development](/best-practices/plugin-development) | Plugin architecture and extension rules |
36+
| [Performance](/best-practices/performance) | Cost, model routing, latency, efficiency |
37+
| [Security](/best-practices/security) | Secrets, auth boundaries, operator trust |
38+
| [Software 3.0: Neural-Bus](/docs/best-practices/software-3.0-neural-bus) | Higher-level architecture doctrine |
39+
| [SEO Pipeline](/best-practices/seo-pipeline) | Proof-of-work blog publishing pipeline |
40+
41+
## Before You Call Something “Done”
42+
43+
- [ ] issue exists and matches the work
44+
- [ ] code is commented with WHAT / WHY / WHY NOT / CONSEQUENCES
45+
- [ ] repo-native checks pass
46+
- [ ] runtime proof exists
47+
- [ ] screenshots/logs exist where relevant
48+
- [ ] docs updated if architecture or workflow changed
49+
- [ ] remaining risk is clearly stated
50+
51+
## Final Reminder
52+
53+
The OpenSIN fleet is allowed to move fast **only because** it is forced to leave evidence, structure, and recoverable knowledge behind.
54+
Without that, autonomy becomes chaos.
