74 changes: 74 additions & 0 deletions .github/workflows/benchmark-competitors.yml
@@ -0,0 +1,74 @@
name: Competitor Benchmarks

on:
  schedule:
    - cron: "0 6 * * 1" # Weekly on Monday at 6 AM UTC
  workflow_dispatch:

jobs:
  benchmark_competitors:
    runs-on: ubicloud-standard-8
    timeout-minutes: 90
    steps:
      - uses: actions/checkout@v4

      - uses: denoland/setup-deno@v2
        with:
          deno-version: v2.x

      - uses: actions/setup-node@v4
        with:
          node-version: "20"

      - name: Pre-pull competitor Docker images
        run: |
          docker pull postgres:16 &
          docker pull ghcr.io/windmill-labs/windmill:main &
          docker pull temporalio/auto-setup:latest &
          docker pull inngest/inngest:latest &
          docker pull docker.io/restatedev/restate:latest &
          docker pull kestra/kestra:latest &
          wait

      - name: Run competitor benchmarks
        timeout-minutes: 60
        run: |
          cd benchmarks/competitors
          deno run -A competitor_suite.ts \
            -c competitor_suite_config.json \
            --output-dir ./results

      - name: Generate comparison graphs
        run: |
          cd benchmarks/competitors
          deno run -A competitor_graphs.ts \
            -c competitor_graphs_config.json \
            --results-dir ./results \
            --output-dir ./results

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: competitor_benchmarks
          path: |
            benchmarks/competitors/results/*.json
            benchmarks/competitors/results/*.svg

  commit_results:
    runs-on: ubicloud
    needs: benchmark_competitors
    steps:
      - uses: actions/checkout@v4
        with:
          ref: benchmarks
      - uses: actions/download-artifact@v4
        with:
          name: competitor_benchmarks
          path: competitors/
      - name: Push changes
        run: |
          git add .
          git config --local user.email "41898282+github-actions[bot]@users.noreply.github.com"
          git config --local user.name "github-actions[bot]"
          git commit -m "Update competitor benchmarks" || exit 0
          git push
12 changes: 9 additions & 3 deletions backend/write_latest_ee_ref.sh
@@ -15,11 +15,17 @@ else
   echo "Directory not found"; exit 1
 fi
 
-# Get the current commit hash
-commit_hash=$(git rev-parse HEAD)
+# If --main is passed, fetch and use latest main
+if [ "$1" = "--main" ]; then
+  git fetch origin main
+  commit_hash=$(git rev-parse origin/main)
+else
+  # Get the current commit hash
+  commit_hash=$(git rev-parse HEAD)
+fi
 
 # Navigate back to the original directory
-cd - || exit
+cd - > /dev/null || exit
 
 # Write the commit hash to ./ee-repo-ref.txt
 echo -n "$commit_hash" > ./ee-repo-ref.txt
2 changes: 2 additions & 0 deletions benchmarks/competitors/.gitignore
@@ -0,0 +1,2 @@
node_modules/
results/
178 changes: 178 additions & 0 deletions benchmarks/competitors/CONTEXT.md
@@ -0,0 +1,178 @@
# Competitor Benchmark Framework — Context for Continuation

## Goal

Reproducible performance benchmarks comparing Windmill's workflow-as-code (WAC) against competitors, for SEO comparison pages. All platforms run an equivalent **3-step sequential workflow** with inline step execution (no subprocess/child job dispatch). This isolates orchestration overhead from actual compute.

## Current Competitors (7)

| Competitor | Version Tested | Adapter | Workflow Type |
|---|---|---|---|
| **Windmill** | CE v1.673.0 | `windmill/adapter.ts` | `step()` from `windmill-client` (inline, no child jobs) |
| **Temporal** | auto-setup:latest | `temporal/adapter.ts` | `proxyActivities()` with 3 sequential activities |
| **Inngest** | inngest:latest | `inngest/adapter.ts` | `step.run()` × 3 via Express app |
| **Restate** | restate:latest | `restate/adapter.ts` | `ctx.run()` × 3 via Node.js endpoint |
| **Kestra** | 1.3.7 | `kestra/adapter.ts` | `io.kestra.plugin.core.debug.Return` × 3 |
| **Prefect** | 3.6.25 | `prefect/adapter.ts` | `@task` × 3 with `.serve()` runner |
| **Airflow** | 2.10.5 | `airflow/adapter.ts` | `PythonOperator` × 3 with LocalExecutor |

## Latest Results (local dev machine, 2026-04-03)

| Competitor | Cold Start | Median Latency | P95 Latency | Throughput | Step Overhead | Workers |
|---|---|---|---|---|---|---|
| Restate | 15ms | 4.1ms | 5.0ms | 2,057/s | 1.4ms | event loop |
| **Windmill** | **106ms** | **103ms** | **105ms** | **152/s** | **34ms** | 10 |
| Temporal | 166ms | 119ms | 283ms | 39/s | 40ms | 10 |
| Kestra | 410ms | 177ms | 326ms | 68/s | 59ms | JVM threads |
| Inngest | 313ms | 260ms | 261ms | 90/s | 87ms | 10 |
| Airflow | 3,858ms | 3,792ms | 4,277ms | 7.4/s | 1,264ms | LocalExecutor |
| Prefect | 2,764ms | 9,679ms | 12,631ms | 1.3/s | 3,226ms | .serve() |

## Architecture

```
benchmarks/competitors/
├── types.ts # CompetitorAdapter interface + BenchmarkResult types
├── competitor_harness.ts # Main orchestrator: for each competitor → setup → deploy → test → teardown
├── competitor_suite.ts # CLI wrapper (cliffy): --competitors, --latency-samples, --throughput-batch, --output-dir
├── competitor_graphs.ts # SVG bar chart generation (D3 + JSDOM)
├── competitor_suite_config.json
├── competitor_graphs_config.json
├── lib/
│ ├── docker.ts # composeUp(), composeDown(), waitForHealth(), composeLogs()
│ ├── timing.ts # measureLatency(), warmup(), collectLatencySamples(), computeStats()
│ └── results.ts # saveResult(), saveSummary(), getCpuCount(), getMachineId()
├── .gitignore # excludes node_modules/ and results/
├── {competitor}/
│ ├── adapter.ts # implements CompetitorAdapter
│ ├── docker-compose.yml # isolated container stack
│ ├── workflow.* # equivalent workflow definition
│ └── Dockerfile.* # (some competitors need custom app/worker images)
└── results/ # (gitignored) JSON output from benchmark runs
```

## CompetitorAdapter Interface

```typescript
interface CompetitorAdapter {
readonly name: string;
readonly composeFile: string;
setup(): Promise<void>; // docker compose up + health wait
deployWorkflow(): Promise<void>; // register/create the 3-step workflow
triggerOne(): Promise<{ latencyMs: number; result: unknown }>;
triggerBatch(n: number): Promise<{ totalMs: number; results: unknown[] }>;
teardown(): Promise<void>; // docker compose down -v
getVersion(): Promise<string>;
}
```

## Harness Flow (per competitor)

1. `setup()` — `docker compose up -d` + wait for health endpoint
2. `deployWorkflow()` — register the 3-step workflow via REST API / SQL / CLI
3. **Cold start** — `triggerOne()` immediately after deploy
4. **Warmup** — `triggerOne()` × N, discard results
5. **Single latency** — `triggerOne()` × N, collect `performance.now()` timings → compute median, P95, mean, stdev
6. **Throughput** — `triggerBatch(N)` (concurrent `Promise.all`), measure total wall-clock time
7. **Step overhead** — median_latency / 3 (trivial steps, so overhead ≈ latency)
8. `teardown()` — `docker compose down -v`
9. Save JSON results
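
A condensed sketch of that loop, using the `CompetitorAdapter` interface above (the stats math is illustrative; the real harness lives in `competitor_harness.ts` and `lib/timing.ts`):

```typescript
import type { CompetitorAdapter } from "./types.ts";

// Sketch of the per-competitor flow; numbers in comments match the steps above.
async function benchmark(adapter: CompetitorAdapter, samples = 100, batch = 200) {
  await adapter.setup(); // 1. compose up + health wait
  try {
    await adapter.deployWorkflow(); // 2.
    const coldStartMs = (await adapter.triggerOne()).latencyMs; // 3.
    for (let i = 0; i < 10; i++) await adapter.triggerOne(); // 4. warmup, discarded
    const latencies: number[] = [];
    for (let i = 0; i < samples; i++) { // 5. single latency
      latencies.push((await adapter.triggerOne()).latencyMs);
    }
    latencies.sort((a, b) => a - b);
    const medianMs = latencies[Math.floor(samples / 2)];
    const p95Ms = latencies[Math.min(samples - 1, Math.floor(samples * 0.95))];
    const { totalMs } = await adapter.triggerBatch(batch); // 6. throughput
    const throughputPerSec = batch / (totalMs / 1000);
    const stepOverheadMs = medianMs / 3; // 7. trivial steps, overhead ≈ latency
    return { coldStartMs, medianMs, p95Ms, throughputPerSec, stepOverheadMs };
  } finally {
    await adapter.teardown(); // 8. compose down -v
  }
}
```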

## How to Run

```bash
cd benchmarks/competitors

# Single competitor
deno run -A competitor_suite.ts --competitors windmill --latency-samples 50 --throughput-batch 100 --output-dir ./results

# All competitors
deno run -A competitor_suite.ts --latency-samples 100 --throughput-batch 200 --warmup-count 10 --output-dir ./results

# Generate SVG graphs
deno run -A --allow-import competitor_graphs.ts -c competitor_graphs_config.json --results-dir ./results
```

## Key Decisions & Gotchas

### Windmill
- Uses `step()` (inline), NOT `task()` (child jobs). Other competitors don't create separate jobs per step either, so this is the fair comparison.
- `WORKER_GROUP=main` + `NUM_WORKERS=10` + `I_ACK_NUM_WORKERS_IS_UNSAFE=1` + `SLEEP_QUEUE=50`. Without `SLEEP_QUEUE=50`, the queue poll interval scales to `50ms × NUM_WORKERS / 2 = 250ms`, killing latency. See `backend/windmill-worker/src/worker.rs:318-329`.
- `WORKER_TAGS=bun,flow,dependency` — must include `bun` since WAC scripts deploy as bun language.
- Cannot use `WORKER_GROUP=native` or `NATIVE_MODE=true` because native mode forces `NUM_WORKERS=8` and doesn't support the `bun` tag needed for WAC scripts.
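
As a sketch, the worker environment these notes imply (values straight from the bullets above; how the compose file wires them in is omitted):

```typescript
// Windmill worker env per the notes above (compose wiring omitted).
const windmillWorkerEnv: Record<string, string> = {
  WORKER_GROUP: "main",
  NUM_WORKERS: "10",
  I_ACK_NUM_WORKERS_IS_UNSAFE: "1",
  // Pins the queue poll interval at 50ms; the default would scale it to
  // 50ms × 10 / 2 = 250ms.
  SLEEP_QUEUE: "50",
  // Must include `bun`: WAC scripts deploy as bun language.
  WORKER_TAGS: "bun,flow,dependency",
};
```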

### Temporal
- Needs a **separate worker container** (Dockerfile.worker) because the Temporal TypeScript SDK has native gRPC bindings that don't work in Deno.
- The adapter uses `node -e` subprocess calls to run Temporal client operations (connect, start workflow, get result).
- `npm install` in `temporal/` is needed before running (the adapter does this in `setup()`).
- `maxConcurrentWorkflowTaskExecution: 10` and `maxConcurrentActivityTaskExecution: 10` in worker.ts to match Windmill's 10 workers.
- First boot is slow (~6s) because `temporalio/auto-setup` runs DB schema migration; since `setup()` waits for health before any measurement, this lands in setup time rather than the cold-start number, and is not representative of steady-state.
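
A sketch of the `node -e` subprocess pattern (`Deno.Command` is the real Deno API; the inline client script passed in is illustrative):

```typescript
// Runs a Temporal client snippet under Node: the TS SDK's native gRPC
// bindings don't load in Deno, so the adapter shells out instead.
async function runTemporalClient(script: string): Promise<string> {
  const out = await new Deno.Command("node", {
    args: ["-e", script],
    cwd: "./temporal", // where the adapter's `npm install` put the SDK
    stdout: "piped",
  }).output();
  if (out.code !== 0) throw new Error("temporal client call failed");
  return new TextDecoder().decode(out.stdout);
}
```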

### Inngest
- The dev server's event run status API (`/v1/events/{id}/runs`) **caches responses for 15 seconds**. Must add `?ts=${Date.now()}` cache-buster to polling requests, otherwise each step appears to take 5 seconds.
- `--queue-workers 10` caps concurrency to match other competitors.
- `--tick 10` (10ms) for fast queue polling, `--poll-interval 1` for fast app sync.
- The Express app needs `express.json()` middleware or Inngest SDK returns 500 "Missing body".
- `--no-discovery` is set because auto-discovery is unreliable in Docker networks.
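
The cache-buster in practice — a sketch of the status poll (endpoint path from the notes; the dev server's default port 8288 and the response shape are assumptions):

```typescript
// Poll the dev server's run-status API. Without ?ts=..., responses are
// cached ~15s and every step appears to take seconds.
async function waitForRun(eventId: string): Promise<unknown> {
  for (;;) {
    const res = await fetch(
      `http://localhost:8288/v1/events/${eventId}/runs?ts=${Date.now()}`,
    );
    const { data } = await res.json();
    if (data?.[0]?.status === "Completed") return data[0].output;
    await new Promise((r) => setTimeout(r, 10));
  }
}
```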

### Restate
- Extremely fast (4ms median) because it runs workflow steps **in-process** in the app's Node.js event loop. No queue hop, no network round-trip between steps. This is a fundamental architectural difference from Windmill/Temporal/Inngest.
- Uses `send` + `attach` pattern: POST `/benchmark/{id}/run/send` then GET `/restate/workflow/benchmark/{id}/attach`.
- The app must be registered with the Restate admin API: POST `http://admin:9070/deployments` with `{"uri": "http://app:9080"}`.
- Ports remapped to 8085/9075 to avoid conflict with port 8080 (often in use).
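
A sketch of the trigger path (routes from the notes; the ingress is assumed to sit on the remapped 8085 port):

```typescript
// Fire-and-attach: `send` starts the workflow without blocking, `attach`
// waits for its result.
async function triggerRestate(id: string): Promise<unknown> {
  await fetch(`http://localhost:8085/benchmark/${id}/run/send`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: "{}",
  });
  const res = await fetch(
    `http://localhost:8085/restate/workflow/benchmark/${id}/attach`,
  );
  return res.json();
}
```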

### Kestra
- Kestra 1.3.x has **mandatory basic auth** that cannot be disabled via config. The `/api/v1/flows` endpoint always returns 401.
- Workaround: deploy the flow via **direct SQL insertion** into Kestra's Postgres `flows` table. Key format is `main_{namespace}_{id}_{revision}`. The `value` JSONB must include `tenantId: "main"`, `source`, `deleted: false`, etc. (see existing tutorial flows for format).
- Trigger via the **webhook endpoint** (`/api/v1/executions/webhook/{namespace}/{flowId}/{key}`) which is public.
- Poll execution status via SQL: `SELECT value->>'state' FROM executions WHERE key = '{executionId}'`.
- Uses `io.kestra.plugin.core.debug.Return` tasks (not shell tasks) to avoid subprocess overhead.
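 
Putting the workaround together — a sketch of trigger + poll (webhook path and SQL from the notes; the port, the namespace/flow/key names, and the `queryPg` helper are all hypothetical):

```typescript
// Hypothetical helper standing in for a Postgres query against Kestra's DB.
declare function queryPg(sql: string): Promise<string>;

// Trigger via the public webhook (the REST API always 401s), then poll
// execution state directly in SQL.
async function runKestraFlow(): Promise<void> {
  const res = await fetch(
    "http://localhost:8080/api/v1/executions/webhook/benchmark/three-step/bench",
    { method: "POST" },
  );
  const { id } = await res.json();
  for (;;) {
    const state = await queryPg(
      `SELECT value->>'state' FROM executions WHERE key = '${id}'`,
    );
    if (state.includes("SUCCESS")) return;
    await new Promise((r) => setTimeout(r, 10));
  }
}
```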

### Prefect
- Very slow (~9.7s/workflow) because `.serve()` mode spawns a new subprocess per flow run. Prefect is designed for data pipeline tasks, not lightweight orchestration.
- The worker container runs `python flow.py` which calls `.serve(name="benchmark-deployment")` — this both registers the deployment and runs a worker loop.
- Deployment ID must be fetched via `GET /api/deployments/name/{flow_name}/{deployment_name}` before triggering.
- No auth required on self-hosted Prefect OSS server.
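
A sketch of the trigger sequence (the deployment-lookup path is from the notes; the default port 4200, the `create_flow_run` route, and the flow/deployment names are assumptions):

```typescript
// Resolve the deployment ID by name, then create a flow run. No auth on the
// self-hosted OSS server.
const api = "http://localhost:4200/api";
const dep = await (
  await fetch(`${api}/deployments/name/benchmark-flow/benchmark-deployment`)
).json();
await fetch(`${api}/deployments/${dep.id}/create_flow_run`, {
  method: "POST",
  headers: { "content-type": "application/json" },
  body: "{}",
});
```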

### Airflow
- Slow (~3.8s/workflow) due to scheduler overhead and PythonOperator subprocess execution.
- DAG file is volume-mounted to `./dags/`. Scheduler detects it after `DAG_DIR_LIST_INTERVAL` (set to 5s).
- DAGs are paused by default — must PATCH `/api/v1/dags/{dag_id}` with `{"is_paused": false}`.
- All API calls require Basic Auth: `Authorization: Basic YWRtaW46YWRtaW4=` (admin:admin).
- `airflow-init` container runs DB migration + user creation, then exits. Webserver and scheduler depend on it via `service_completed_successfully`.
- Webserver port remapped to 8090 to avoid conflict with 8080.
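
A sketch of unpause + trigger (paths, port, and the Basic Auth header come from the notes; the DAG id is hypothetical):

```typescript
const base = "http://localhost:8090/api/v1"; // webserver remapped to 8090
const headers = {
  Authorization: "Basic YWRtaW46YWRtaW4=", // admin:admin
  "content-type": "application/json",
};
// DAGs are paused by default; unpause first.
await fetch(`${base}/dags/benchmark_dag`, {
  method: "PATCH",
  headers,
  body: JSON.stringify({ is_paused: false }),
});
// Then create a run via the standard Airflow 2 dagRuns endpoint.
await fetch(`${base}/dags/benchmark_dag/dagRuns`, {
  method: "POST",
  headers,
  body: JSON.stringify({ conf: {} }),
});
```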

### Docker / Infrastructure
- All custom app/worker images must be **pre-built** before running (`docker build -t {name} -f Dockerfile ...`). Docker Compose's `build:` directive times out in some environments.
- Compose files use pre-built image references (e.g., `image: temporal-benchmark-worker`) not `build:` blocks.
- Docker Compose `--wait` flag was removed from `composeUp()` because it caused timeout failures. Each adapter handles its own readiness checking via `waitForHealth()` or custom polling.
- Competitors run sequentially with full `docker compose down -v` between runs to avoid resource contention and port conflicts.
- **Disk space**: Running all 7 competitors needs ~15GB for Docker images. Kestra (3.3GB) and Airflow (1.5GB) are the largest. Use `docker system prune -af` between runs if tight on space.
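
A minimal sketch of `composeUp()` as described above — no `--wait`, readiness is each adapter's job:

```typescript
// Brings a stack up detached; adapters do their own readiness checking via
// waitForHealth() or custom polling, per the note above.
export async function composeUp(composeFile: string): Promise<void> {
  const { code, stderr } = await new Deno.Command("docker", {
    args: ["compose", "-f", composeFile, "up", "-d"],
    stderr: "piped",
  }).output();
  if (code !== 0) {
    throw new Error(`compose up failed: ${new TextDecoder().decode(stderr)}`);
  }
}
```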

### Graphs
- `competitor_graphs.ts` generates SVG bar charts using D3 + JSDOM (same stack as existing `benchmarks/graph.ts`).
- Reads from `results/competitor_comparison_benchmark.json` (flat summary format).
- Config in `competitor_graphs_config.json` defines 5 charts: cold start, median latency, P95, throughput, step overhead.
- D3 callback params need explicit `any` types to pass Deno type checking.
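
The D3 + JSDOM pattern in miniature (bar scaling and sizes are illustrative; note the explicit `any` on D3 callbacks for Deno's type checker):

```typescript
import { JSDOM } from "npm:jsdom";
import * as d3 from "npm:d3";

// D3 renders into a JSDOM document; the resulting markup is saved as SVG.
const dom = new JSDOM("<body></body>");
const svg = d3
  .select(dom.window.document.body)
  .append("svg")
  .attr("width", 640)
  .attr("height", 360);
svg
  .selectAll("rect")
  .data([106, 166, 313]) // e.g. cold starts (ms) from the summary JSON
  .enter()
  .append("rect")
  .attr("x", (_d: any, i: any) => i * 40) // explicit any for Deno type checking
  .attr("y", (d: any) => 360 - d / 20)
  .attr("width", 30)
  .attr("height", (d: any) => d / 20);
await Deno.writeTextFile("chart.svg", dom.window.document.body.innerHTML);
```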

## CI

`.github/workflows/benchmark-competitors.yml`:
- Runs weekly (Monday 6AM UTC) + manual dispatch
- `ubicloud-standard-8` runner
- Pre-pulls all Docker images in parallel
- Sequential competitor runs
- Results committed to `benchmarks` branch

## Potential Improvements

1. **Dagster adapter** — Another popular Python DAG-based orchestrator (competitor to Airflow/Prefect)
2. **Hatchet adapter** — Rising Postgres-backed task orchestrator (YC W24)
3. **Multi-worker scaling test** — Run with 1, 4, 8, 16 workers to show scaling curves
4. **Windmill `task()` benchmark** — Add a separate test using `task()` (child jobs) for users who need per-step isolation
5. **Payload size test** — Steps that pass non-trivial data (1KB, 100KB, 1MB) to measure serialization overhead
6. **Error/retry test** — Steps that fail and retry to measure recovery overhead
7. **Long-running workflow test** — 100+ steps to measure checkpoint overhead at scale
8. **Graph improvements** — Add error bars, include version numbers in chart labels
9. **Prefect worker pool mode** — Test with `Process` work pool instead of `.serve()` for potentially better throughput
10. **Airflow CeleryExecutor** — Test with Celery + Redis for parallel task execution instead of LocalExecutor
85 changes: 85 additions & 0 deletions benchmarks/competitors/README.md
@@ -0,0 +1,85 @@
# Competitor Benchmarks

Performance benchmarks comparing Windmill's workflow-as-code against Temporal, Inngest, Restate, Kestra, Prefect, and Airflow.

## What's Measured

All platforms run the **same logical workflow**: 3 sequential steps, each returning a trivial integer. This isolates orchestration overhead from actual compute.

| Metric | Description |
|--------|-------------|
| **Cold start** | First execution latency after fresh deploy |
| **Single latency** | Median + P95 of individual end-to-end execution times |
| **Throughput** | Concurrent workflow completions per second |
| **Step overhead** | Per-step orchestration overhead (median_latency / 3) |

## Quick Start

```bash
# Install Deno
curl -fsSL https://deno.land/install.sh | sh

# Run all competitors (requires Docker)
deno run -A competitor_suite.ts

# Run specific competitors
deno run -A competitor_suite.ts --competitors windmill,temporal

# Custom parameters
deno run -A competitor_suite.ts \
--competitors windmill,inngest,restate \
--latency-samples 100 \
--throughput-batch 200 \
--output-dir ./results

# Generate graphs from results
deno run -A competitor_graphs.ts \
-c competitor_graphs_config.json \
--results-dir ./results
```

## Prerequisites

- [Deno](https://deno.land/) v2+
- [Docker](https://www.docker.com/) with Docker Compose v2
- [Node.js](https://nodejs.org/) 20+ (for Temporal client subprocess)
- ~8GB RAM (competitors run sequentially, one at a time)

## Workflow Equivalence

Each platform implements the same 3-step sequential workflow:

| Platform | Implementation |
|----------|---------------|
| **Windmill** | `step()` + `workflow()` from `windmill-client` (inline steps, no child jobs) |
| **Temporal** | `proxyActivities()` with 3 sequential activity calls |
| **Inngest** | `inngest.createFunction()` with 3 `step.run()` calls |
| **Restate** | `restate.workflow()` with 3 `ctx.run()` calls |
| **Kestra** | YAML flow with 3 `io.kestra.plugin.core.debug.Return` tasks |
| **Prefect** | Flow with 3 sequential `@task` calls, run via `.serve()` |
| **Airflow** | DAG with 3 sequential `PythonOperator` tasks (LocalExecutor) |

## Architecture

```
competitor_suite.ts CLI entrypoint
└── competitor_harness.ts Orchestrator (setup → deploy → test → teardown)
├── lib/docker.ts Docker Compose lifecycle
├── lib/timing.ts Latency/throughput measurement
├── lib/results.ts JSON output + statistics
└── {competitor}/
├── adapter.ts CompetitorAdapter implementation
├── docker-compose.yml Container stack
└── workflow.* Workflow definition
```

Each competitor gets its own Docker Compose stack. Competitors run sequentially with full teardown between runs to avoid resource contention.

## Output

Results are written to `./results/` as JSON:
- `{competitor}_competitor_benchmark.json` — Per-competitor detailed results
- `competitor_comparison_benchmark.json` — Flat summary for graphing
- `competitor_*.svg` — Bar chart visualizations

## CI

The GitHub Actions workflow (`.github/workflows/benchmark-competitors.yml`) runs weekly on Monday. Results are committed to the `benchmarks` branch.