Skip to content

Commit fd74e44

Browse files
committed
initial commit
1 parent db51cc3 commit fd74e44

File tree

15 files changed

+4526
-1
lines changed

15 files changed

+4526
-1
lines changed

.github/workflows/pages.yml

Lines changed: 55 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,55 @@
1+
name: Deploy to GitHub Pages
2+
3+
on:
4+
push:
5+
branches: ["main"]
6+
workflow_dispatch:
7+
8+
permissions:
9+
contents: read
10+
pages: write
11+
id-token: write
12+
13+
concurrency:
14+
group: "pages"
15+
cancel-in-progress: false
16+
17+
jobs:
18+
build:
19+
runs-on: ubuntu-latest
20+
steps:
21+
- name: Checkout
22+
uses: actions/checkout@v4
23+
with:
24+
fetch-depth: 0
25+
26+
- name: Setup Node.js
27+
uses: actions/setup-node@v4
28+
with:
29+
node-version: 20
30+
cache: npm
31+
32+
- name: Install dependencies
33+
run: npm ci
34+
35+
- name: Build with VitePress
36+
run: npm run docs:build
37+
38+
- name: Setup Pages
39+
uses: actions/configure-pages@v5
40+
41+
- name: Upload artifact
42+
uses: actions/upload-pages-artifact@v3
43+
with:
44+
path: docs/.vitepress/dist
45+
46+
deploy:
47+
environment:
48+
name: github-pages
49+
url: ${{ steps.deployment.outputs.page_url }}
50+
runs-on: ubuntu-latest
51+
needs: build
52+
steps:
53+
- name: Deploy to GitHub Pages
54+
id: deployment
55+
uses: actions/deploy-pages@v4

.gitignore

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
node_modules/
2+
docs/.vitepress/dist/
3+
docs/.vitepress/cache/
4+
*.log
5+
.DS_Store

README.md

Lines changed: 45 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1,45 @@
1-
# pylet.github.io
1+
# PyLet Documentation
2+
3+
Documentation site for [PyLet](https://github.com/ServerlessLLM/pylet) — a simple distributed task execution system for GPU servers. Built with [VitePress](https://vitepress.dev/).
4+
5+
**Live site**: https://serverlessllm.github.io/pylet.github.io
6+
7+
## Local Development
8+
9+
```bash
10+
# Install dependencies
11+
npm install
12+
13+
# Start dev server with hot reload
14+
npm run docs:dev
15+
# Open http://localhost:5173
16+
17+
# Build for production
18+
npm run docs:build
19+
20+
# Preview production build
21+
npm run docs:preview
22+
```
23+
24+
## Deployment
25+
26+
Auto-deploys to GitHub Pages on push to `main` via `.github/workflows/pages.yml`.
27+
28+
## Structure
29+
30+
```
31+
docs/
32+
├── index.md # Landing page
33+
├── getting-started/
34+
│ └── quickstart.md # Step-by-step getting started
35+
├── guide/
36+
│ ├── concepts.md # Instances, workers, resources
37+
│ ├── configuration.md # TOML config & env vars
38+
│ └── examples.md # Practical recipes
39+
├── reference/
40+
│ ├── cli.md # All pylet commands
41+
│ └── python-api.md # import pylet reference
42+
└── advanced/
43+
├── architecture.md # How PyLet works internally
44+
└── troubleshooting.md # Common issues & fixes
45+
```

docs/.vitepress/config.mts

Lines changed: 74 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,74 @@
1+
import { defineConfig } from 'vitepress'
2+
3+
export default defineConfig({
4+
title: 'PyLet',
5+
description: 'A simple distributed task execution system for GPU servers',
6+
base: '/pylet.github.io/',
7+
8+
head: [
9+
['link', { rel: 'icon', type: 'image/svg+xml', href: '/pylet.github.io/logo.svg' }],
10+
],
11+
12+
themeConfig: {
13+
logo: '/logo.svg',
14+
15+
nav: [
16+
{ text: 'Guide', link: '/getting-started/quickstart' },
17+
{ text: 'Reference', link: '/reference/cli' },
18+
{ text: 'PyPI', link: 'https://pypi.org/project/pylet/' },
19+
],
20+
21+
sidebar: [
22+
{
23+
text: 'Getting Started',
24+
items: [
25+
{ text: 'Quick Start', link: '/getting-started/quickstart' },
26+
],
27+
},
28+
{
29+
text: 'Guide',
30+
items: [
31+
{ text: 'Core Concepts', link: '/guide/concepts' },
32+
{ text: 'Configuration', link: '/guide/configuration' },
33+
{ text: 'Examples', link: '/guide/examples' },
34+
],
35+
},
36+
{
37+
text: 'Reference',
38+
items: [
39+
{ text: 'CLI Reference', link: '/reference/cli' },
40+
{ text: 'Python API', link: '/reference/python-api' },
41+
],
42+
},
43+
{
44+
text: 'Advanced',
45+
items: [
46+
{ text: 'Architecture', link: '/advanced/architecture' },
47+
{ text: 'Troubleshooting', link: '/advanced/troubleshooting' },
48+
],
49+
},
50+
],
51+
52+
socialLinks: [
53+
{ icon: 'github', link: 'https://github.com/ServerlessLLM/pylet' },
54+
],
55+
56+
search: {
57+
provider: 'local',
58+
},
59+
60+
editLink: {
61+
pattern: 'https://github.com/ServerlessLLM/pylet.github.io/edit/main/docs/:path',
62+
text: 'Edit this page on GitHub',
63+
},
64+
65+
footer: {
66+
message: 'Released under the Apache 2.0 License.',
67+
copyright: 'Copyright © 2024-present ServerlessLLM',
68+
},
69+
70+
outline: {
71+
level: [2, 3],
72+
},
73+
},
74+
})

docs/advanced/architecture.md

Lines changed: 189 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,189 @@
1+
# Architecture
2+
3+
How PyLet works under the hood. Read this if you're curious — it's not required for using PyLet.
4+
5+
## System Design
6+
7+
```
8+
┌──────────────┐ poke ┌──────────────────┐
9+
│ Controller │──────────────>│ Scheduler │
10+
│ (FastAPI) │ │ (in-process) │
11+
└──────┬───────┘ └────────┬─────────┘
12+
│ │
13+
│ ┌──────────────┐ │
14+
└────────>│ SQLite │<──────┘
15+
│ (WAL) │
16+
└──────────────┘
17+
18+
│ heartbeat (long-poll)
19+
┌───────────────┴───────────────┐
20+
│ │ │
21+
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
22+
│ Worker │ │ Worker │ │ Worker │
23+
└─────────┘ └─────────┘ └─────────┘
24+
```
25+
26+
**Head node**: Runs the controller (FastAPI) and scheduler. SQLite is the single source of truth.
27+
28+
**Workers**: Connect to head, receive desired state via heartbeat, reconcile local processes.
29+
30+
---
31+
32+
## The One Primitive: Instance
33+
34+
PyLet has exactly one concept: the **instance** — a process with resource allocation.
35+
36+
An instance has:
37+
- A command to run
38+
- Resource requirements (CPU, GPU, memory)
39+
- A lifecycle (PENDING → ASSIGNED → RUNNING → COMPLETED/FAILED)
40+
- An optional endpoint (host:port) for service discovery
41+
42+
That's it. No pods, replicas, services, deployments, or jobs. Higher-level abstractions are left to the application.
43+
44+
---
45+
46+
## Worker Reconciliation
47+
48+
Workers don't receive "start X" / "stop Y" commands. Instead, they receive **desired state** and reconcile:
49+
50+
```
51+
Desired state (from head): [instance_a@attempt=2, instance_b@attempt=1]
52+
Actual state (local): [instance_a@attempt=1, instance_c@attempt=1]
53+
54+
Reconcile:
55+
instance_a@attempt=1 → stale attempt → kill
56+
instance_b@attempt=1 → not running → start
57+
instance_c@attempt=1 → not desired → kill
58+
```
59+
60+
This declarative model means:
61+
- **Crash recovery is automatic**: worker restarts, gets desired state, reconciles
62+
- **Network partitions are safe**: stale workers can't corrupt state (attempt fencing)
63+
- **No command queue**: simpler than ack/retry protocols
64+
65+
---
66+
67+
## Instance Lifecycle
68+
69+
```
70+
PENDING ──[assign]──> ASSIGNED ──[start]──> RUNNING ──[exit]──> COMPLETED
71+
│ │ │ │
72+
│ │ │ FAILED
73+
│ │ │
74+
│ └─[worker offline]────┴──> UNKNOWN
75+
│ │
76+
└──[cancel]──────────────────────────────────> CANCELLED
77+
```
78+
79+
| State | Meaning |
80+
|:------|:--------|
81+
| PENDING | Waiting in queue |
82+
| ASSIGNED | Worker selected, resources reserved |
83+
| RUNNING | Process executing |
84+
| UNKNOWN | Worker offline, outcome uncertain |
85+
| COMPLETED | Exit code 0 |
86+
| FAILED | Exit code ≠ 0 |
87+
| CANCELLED | User cancelled |
88+
89+
### Cancellation Model
90+
91+
Cancellation uses a timestamp model (like Kubernetes `deletionTimestamp`):
92+
93+
1. User requests cancel → `cancellation_requested_at` is set
94+
2. Instance excluded from desired state
95+
3. Worker sees absence, sends SIGTERM
96+
4. Grace period (default 30s)
97+
5. SIGKILL if still running
98+
6. Worker reports CANCELLED
99+
100+
---
101+
102+
## Heartbeat Protocol
103+
104+
Workers use **generation-based long-polling**:
105+
106+
1. Worker sends heartbeat with `last_seen_gen` and instance status reports
107+
2. Controller processes reports (with attempt fencing)
108+
3. Controller waits for state change or timeout (30s)
109+
4. Returns new `gen` and `desired_instances`
110+
111+
**Cancel-and-reissue**: When local state changes (process starts/exits), the worker cancels the in-flight heartbeat and sends a new one immediately. This means the head gets updates within milliseconds.
112+
113+
---
114+
115+
## Attempt-Based Fencing
116+
117+
Each instance has an `attempt` counter that increments on each assignment:
118+
119+
```
120+
Instance assigned to Worker A (attempt=1)
121+
Network partition...
122+
Instance reassigned to Worker B (attempt=2)
123+
Worker A reconnects, reports for attempt=1
124+
→ Controller ignores (stale attempt)
125+
```
126+
127+
Only reports matching the current attempt can change state. This prevents stale workers from corrupting cluster state.
128+
129+
---
130+
131+
## Fine-Grained GPU Scheduling
132+
133+
These features exist because real research workloads need them:
134+
135+
- **Physical GPU indices** (`gpu_indices`): Request specific GPUs. Exposed via `CUDA_VISIBLE_DEVICES`.
136+
- **GPU sharing** (`exclusive=False`): GPUs aren't reserved exclusively. Enables daemons to coexist with inference.
137+
- **Worker placement** (`target_worker`): Target a specific worker (e.g., where a model is cached).
138+
139+
---
140+
141+
## Log Capture
142+
143+
Instance logs are captured using a **sidecar pattern**:
144+
145+
1. Worker wraps each command: `(cmd) 2>&1 | python3 -m pylet.log_sidecar`
146+
2. Sidecar writes rotating log files in `~/.pylet/logs/`
147+
3. Worker runs an HTTP server (port 15599) for log retrieval
148+
4. Head proxies log requests to workers
149+
150+
The sidecar survives even if the instance crashes, so logs are never lost.
151+
152+
---
153+
154+
## Components
155+
156+
| File | Purpose |
157+
|:-----|:--------|
158+
| `controller.py` | Core scheduling and state management |
159+
| `worker.py` | Process management and reconciliation |
160+
| `schemas.py` | Pydantic models, state transitions |
161+
| `db.py` | SQLite persistence layer |
162+
| `server.py` | FastAPI HTTP endpoints |
163+
| `client.py` | Async HTTP client |
164+
165+
---
166+
167+
## Design Decisions
168+
169+
| Decision | Choice | Why |
170+
|:---------|:-------|:----|
171+
| Database | SQLite (WAL mode) | Single file, no dependencies, survives restarts |
172+
| Heartbeat | Long-poll | Instant updates, natural liveness check |
173+
| State model | Declarative reconciliation | Automatic crash recovery, no command queue |
174+
| Head topology | Single head | Simpler than consensus, good enough for ~100 nodes |
175+
| GPU scheduling | Integer-based (not fractional) | Predictable, no oversubscription surprises |
176+
177+
---
178+
179+
## Limitations
180+
181+
| Limitation | Value | Workaround |
182+
|:-----------|:------|:-----------|
183+
| Port range per worker | 101 ports (15600–15700) | Deploy fewer services per worker |
184+
| Log retention | 50 MB per instance | Use external log aggregation |
185+
| SQLite scale | ~10K instances | Archive completed instances |
186+
| Single head node | No redundancy | Run head on reliable hardware |
187+
| No load balancing | N/A | Use nginx/HAProxy externally |
188+
| No job dependencies | N/A | Handle in application logic |
189+
| No authentication | N/A | Use network-level security |

0 commit comments

Comments
 (0)