238 changes: 236 additions & 2 deletions README.md
@@ -18,8 +18,184 @@ Running locally (startup may be slow for the first time since it needs to pull a
In case you change code and want to run the new version you should execute:
- `./deploy.sh rebuild`

## Security & Authentication

**⚠️ IMPORTANT: Authentication is now required for all code execution requests!**

Every code execution request must include an API key, which can be supplied in one of the following ways:

**HTTP Header (Recommended)**:
```bash
curl -X POST http://localhost:8080/lang/python \
-H "X-API-Key: dev-key-12345" \
-H "Content-Type: text/plain" \
-d "print('Hello World')"
```

**Query Parameter**:
```bash
curl -X POST "http://localhost:8080/lang/python?api_key=dev-key-12345" \
-H "Content-Type: text/plain" \
-d "print('Hello World')"
```
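The header-based request can also be prepared from a script. The following is a minimal sketch using only the Python standard library; `build_request` is a hypothetical helper name, and the request is built but not sent so that the snippet runs without a server.

```python
import urllib.request


def build_request(code: str, language: str, api_key: str,
                  base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Prepare (but do not send) a POST to the synchronous execution endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/lang/{language}",
        data=code.encode("utf-8"),  # the code travels as the raw request body
        headers={"X-API-Key": api_key, "Content-Type": "text/plain"},
        method="POST",
    )


req = build_request("print('Hello World')", "python", "dev-key-12345")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req).read().decode()
```

Passing the key in a header keeps it out of URLs, which tend to end up in access logs and shell history.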

### Default API Keys

For development and testing, the following API keys are available:
- `dev-key-12345` - Development key
- `prod-key-67890` - Production key
- `test-key-abcde` - Testing key

**Note**: In production, replace these with secure API keys stored in environment variables or a secrets manager.

### Rate Limiting

- **Default Limit**: 100 requests per hour per API key
- **Configuration**: Set `RATE_LIMIT_MAX_REQUESTS` environment variable to change the limit
- Rate limit information is returned in response headers:
  - `X-RateLimit-Remaining`: Number of requests remaining in the current window
  - `X-RateLimit-Retry-After`: Seconds to wait before retrying (when rate limited)
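
A client could use these headers to decide how long to back off. The sketch below is illustrative only: it assumes a rate-limited response carries HTTP status 429 (the README does not state the status code) and that the header values are integers. `seconds_to_wait` is a hypothetical helper.

```python
from typing import Mapping


def seconds_to_wait(status_code: int, headers: Mapping[str, str]) -> int:
    """Return how long a client should pause before its next request.

    Assumption: a rate-limited response uses HTTP 429. Adjust this check
    if the service signals rate limiting differently.
    """
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    if status_code == 429:
        # Fall back to 60 seconds (an arbitrary choice) if the header is absent.
        return int(lowered.get("x-ratelimit-retry-after", "60"))

    # Not rate limited: no wait needed, even when few requests remain.
    return 0


print(seconds_to_wait(429, {"X-RateLimit-Retry-After": "120"}))
print(seconds_to_wait(200, {"X-RateLimit-Remaining": "42"}))
```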

### Input Validation

All code submissions are validated for:
- **Maximum code size**: 100 KB (measured in bytes) or 50,000 characters
- **Language support**: Only supported languages are accepted
- **Security patterns**: Dangerous patterns (e.g., `rm -rf`, `wget`, `curl`) are blocked
- **Empty code**: Empty submissions are rejected
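
To make these checks concrete, here is a rough client-side approximation. It is not the server's implementation: the language identifiers are guessed from the resource-limits table, "100 KB" is taken to mean 100 × 1024 bytes, and only the three patterns named above are checked. `validate` is a hypothetical function.

```python
import re

MAX_BYTES = 100 * 1024  # assumption: "100 KB" means 102,400 bytes
MAX_CHARS = 50_000      # 50,000 characters, from the README
# Assumed identifiers; the server's accepted names may differ.
SUPPORTED = {"java", "python", "javascript", "ruby", "perl", "php"}
# Only the patterns the README names; the real list is not documented here.
DANGEROUS = [re.compile(r"rm\s+-rf"), re.compile(r"\bwget\b"), re.compile(r"\bcurl\b")]


def validate(code: str, language: str) -> list[str]:
    """Return a list of problems; an empty list means the input passes."""
    problems = []
    if not code.strip():
        problems.append("empty code")
    if len(code.encode("utf-8")) > MAX_BYTES or len(code) > MAX_CHARS:
        problems.append("code too large")
    if language.lower() not in SUPPORTED:
        problems.append("unsupported language")
    if any(pattern.search(code) for pattern in DANGEROUS):
        problems.append("dangerous pattern")
    return problems


print(validate("print('hi')", "python"))
print(validate("import os; os.system('rm -rf /')", "python"))
```

Running such a check before submitting can save a request against the rate limit, but the server's verdict remains authoritative.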

## Async Job Execution API

**NEW**: The system now supports asynchronous job execution, allowing you to submit code for execution and retrieve results later.

### Submit a Job (Async)

```bash
curl -X POST http://localhost:8080/jobs \
-H "X-API-Key: dev-key-12345" \
-H "Content-Type: application/json" \
-d '{"code": "print(\"Hello World\")", "language": "python"}'
```

**Response**:
```json
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "queued"
}
```

### Get Job Status

```bash
curl -X GET http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000 \
-H "X-API-Key: dev-key-12345"
```

**Response**:
```json
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "language": "python",
  "status": "completed",
  "output": "Hello World\n",
  "error": null,
  "created_at": "2025-01-15T10:30:00Z",
  "started_at": "2025-01-15T10:30:01Z",
  "completed_at": "2025-01-15T10:30:02Z",
  "execution_duration_ms": 1234
}
```

**Job Statuses**:
- `queued` - Job is waiting to be executed
- `running` - Job is currently executing
- `completed` - Job completed successfully
- `failed` - Job failed with an error
- `timedout` - Job exceeded execution time limit
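
Since `queued` and `running` are the only non-terminal statuses, a client can poll until one of the other three appears. The sketch below shows one way to do that; `wait_for_job` and the `fetch_status` callable (which would perform `GET /jobs/<job_id>` and return the decoded JSON) are hypothetical names, and the fetch is faked so the example runs without a server.

```python
import time
from typing import Callable

# Terminal statuses from the list above; "queued" and "running" mean keep waiting.
TERMINAL = {"completed", "failed", "timedout"}


def wait_for_job(fetch_status: Callable[[], dict], interval: float = 1.0,
                 max_polls: int = 30) -> dict:
    """Call fetch_status() until the job reaches a terminal status."""
    for attempt in range(max_polls):
        job = fetch_status()
        if job["status"] in TERMINAL:
            return job
        if attempt < max_polls - 1:
            time.sleep(interval)  # pause between polls, but not after the last one
    raise TimeoutError(f"job still not finished after {max_polls} polls")


# Simulated sequence of responses standing in for real HTTP calls.
responses = iter([
    {"status": "queued"},
    {"status": "running"},
    {"status": "completed", "output": "Hello World\n"},
])
result = wait_for_job(lambda: next(responses), interval=0)
print(result["status"])
```

Capping the number of polls keeps a client from waiting forever if a job is lost, and each poll counts against the rate limit, so a modest interval is advisable.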

### List All Jobs

```bash
curl -X GET "http://localhost:8080/jobs?limit=10&offset=0" \
-H "X-API-Key: dev-key-12345"
```

**Response**:
```json
{
  "jobs": [
    {
      "job_id": "550e8400-e29b-41d4-a716-446655440000",
      "language": "python",
      "status": "completed",
      "created_at": "2025-01-15T10:30:00Z",
      "completed_at": "2025-01-15T10:30:02Z",
      "execution_duration_ms": 1234
    }
  ],
  "pagination": {
    "total": 1,
    "limit": 10,
    "offset": 0
  }
}
```
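
The `limit`, `offset`, and `total` fields are enough to walk the full job list page by page. The following sketch assumes `total` is the overall number of jobs rather than the size of the current page; `iter_jobs` and the `fetch_page(limit, offset)` callable are hypothetical, and the endpoint is faked so the example is self-contained.

```python
from typing import Callable, Iterator


def iter_jobs(fetch_page: Callable[[int, int], dict], limit: int = 10) -> Iterator[dict]:
    """Yield every job by requesting successive pages of the list endpoint."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        jobs = page["jobs"]
        yield from jobs
        offset += len(jobs)
        # Stop when the reported total is reached or a page comes back empty.
        if not jobs or offset >= page["pagination"]["total"]:
            break


# Fake endpoint holding 25 jobs, standing in for GET /jobs?limit=...&offset=...
ALL_JOBS = [{"job_id": str(i)} for i in range(25)]


def fake_fetch(limit: int, offset: int) -> dict:
    return {
        "jobs": ALL_JOBS[offset:offset + limit],
        "pagination": {"total": len(ALL_JOBS), "limit": limit, "offset": offset},
    }


print(len(list(iter_jobs(fake_fetch))))  # 25
```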

### Job TTL

Completed jobs are automatically cleaned up after **1 hour** (configurable via `jobs.ttl` in `application.conf`).
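
A client that wants to know roughly how long a result stays retrievable can derive it from `completed_at`. This sketch assumes the TTL is counted from the completion timestamp and uses the default of one hour; the actual cleanup may run on a schedule and remove jobs somewhat later. `expires_at` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

JOB_TTL = timedelta(hours=1)  # README default; configurable via jobs.ttl


def expires_at(completed_at: str, ttl: timedelta = JOB_TTL) -> datetime:
    """Estimate when a completed job becomes eligible for cleanup."""
    # API timestamps end in "Z" (UTC). Replacing it keeps
    # datetime.fromisoformat working on Python versions before 3.11.
    finished = datetime.fromisoformat(completed_at.replace("Z", "+00:00"))
    return finished + ttl


print(expires_at("2025-01-15T10:30:02Z").isoformat())  # 2025-01-15T11:30:02+00:00
```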

## Per-Language Resource Limits

Each programming language has optimized resource limits for execution:

| Language | CPUs | Memory | Timeout |
|-----------|------|--------|---------|
| Java | 2 | 256 MB | 10s |
| Python | 1 | 50 MB | 5s |
| JavaScript| 1 | 50 MB | 5s |
| Ruby | 1 | 30 MB | 5s |
| Perl | 1 | 20 MB | 3s |
| PHP | 1 | 40 MB | 5s |

These limits can be customized in `application.conf` under the `resources` section.
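
As an illustration of how such a profile might be applied, the sketch below turns the table into `docker run` resource flags. This is an assumption about usage, not the project's actual worker code: `LIMITS` and `docker_limit_flags` are hypothetical names, and only the standard `--cpus` and `--memory` options are used.

```python
# Values copied from the table above: (cpus, memory in MB, timeout in seconds).
LIMITS = {
    "java": (2, 256, 10),
    "python": (1, 50, 5),
    "javascript": (1, 50, 5),
    "ruby": (1, 30, 5),
    "perl": (1, 20, 3),
    "php": (1, 40, 5),
}


def docker_limit_flags(language: str) -> list[str]:
    """Build docker run resource flags for a language (illustrative only)."""
    cpus, memory_mb, _timeout_s = LIMITS[language.lower()]
    # The timeout has no docker run flag; whoever launches the container
    # would have to enforce it separately.
    return [f"--cpus={cpus}", f"--memory={memory_mb}m"]


print(docker_limit_flags("java"))  # ['--cpus=2', '--memory=256m']
```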

## Monitoring & Health Checks

The system exposes several monitoring endpoints (no authentication required):

### Health Check
```bash
curl http://localhost:8080/health
```
Returns `200 OK` with "healthy" if the service is running.

### Readiness Check
```bash
curl http://localhost:8080/ready
```
Returns cluster readiness status and member count.

### Prometheus Metrics
```bash
curl http://localhost:8080/metrics
```
Exposes Prometheus-compatible metrics including:
- `braindrill_requests_total` - Total requests by language and status
- `braindrill_execution_duration_seconds` - Execution duration histogram
- `braindrill_active_executions` - Currently active executions
- `braindrill_auth_failures_total` - Authentication failure count
- `braindrill_rate_limit_hits_total` - Rate limit violations
- `braindrill_validation_errors_total` - Input validation errors
- `braindrill_worker_pool_size` - Worker pool size
- `braindrill_queue_depth` - Number of jobs waiting in queue (by language)
- `braindrill_queued_jobs` - Number of jobs in queued state (by language)
- `braindrill_jobs_submitted_total` - Total jobs submitted (by language)
- JVM metrics (memory, GC, threads, etc.)
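
The `/metrics` output uses the Prometheus text format, so a quick script can total a counter across its label combinations. The sketch below is a deliberately simple parser, not a full implementation of the format: it assumes samples carry no trailing timestamp, and the label names and values in `SAMPLE` are invented for illustration.

```python
def sum_metric(exposition: str, name: str) -> float:
    """Sum every sample of one metric across all label combinations."""
    total = 0.0
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample looks like "<name>{labels} <value>" or "<name> <value>".
        series, _, value = line.rpartition(" ")
        if series == name or series.startswith(name + "{"):
            total += float(value)
    return total


# Made-up sample output; real label names and values may differ.
SAMPLE = """\
# HELP braindrill_requests_total Total requests
# TYPE braindrill_requests_total counter
braindrill_requests_total{language="python",status="success"} 7.0
braindrill_requests_total{language="java",status="error"} 2.0
braindrill_active_executions 1.0
"""

print(sum_metric(SAMPLE, "braindrill_requests_total"))  # 9.0
```

For anything beyond ad hoc checks, scraping with a Prometheus server and querying there is the more robust route.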

Example:
- sending `POST` request at `localhost:8080/lang/python`
- sending `POST` request at `localhost:8080/lang/python` with API key
- attaching `python` code to request body

![My Image](assets/python_example.png)
@@ -65,7 +241,65 @@ Architecture Diagram:

![My Image](assets/diagram.png)

## Recent Improvements (Phase 1: Security & Monitoring)

### ✅ Security Features
- **API Key Authentication**: All code execution endpoints now require authentication
- **Rate Limiting**: 100 requests/hour per API key (configurable)
- **Input Validation**: Code size limits, language validation, and dangerous pattern detection
- **Security Hardening**: Removed insecure `seccomp=unconfined` from Docker containers

### ✅ Monitoring & Observability
- **Prometheus Metrics**: Comprehensive metrics for requests, executions, errors, and system health
- **Health Checks**: `/health` and `/ready` endpoints for Kubernetes/load balancer integration
- **JVM Metrics**: Built-in monitoring of memory, GC, and thread pools
- **Request Tracking**: Duration histograms, success/failure rates, and active execution counts

### ✅ Configuration
- Rate limit configuration via `RATE_LIMIT_MAX_REQUESTS` environment variable
- Centralized security configuration in `application.conf`
- API keys configurable for different environments (dev/prod/test)

## Recent Improvements (Phase 2: Async Execution & Resource Management)

### ✅ Async Job Execution
- **Job Queue System**: Submit jobs and retrieve results later via REST API
- **Job Manager Actor**: Centralized job state management with automatic cleanup
- **Job Lifecycle Tracking**: Queued → Running → Completed/Failed/TimedOut states
- **Job History**: List and query past executions with pagination
- **JSON API**: RESTful endpoints for job submission, status retrieval, and listing

### ✅ Advanced Resource Management
- **Per-Language Resource Profiles**: Optimized CPU, memory, and timeout limits for each language
- **Configurable Limits**: Java gets 256MB/10s, Python gets 50MB/5s, etc.
- **Resource Configuration**: Centralized resource management via `ResourceConfig`
- **Dynamic Resource Allocation**: Workers automatically use language-specific limits

### ✅ Enhanced Metrics
- **Job Queue Metrics**: Track queued jobs, queue depth, and job submission rates
- **Queue Depth Gauges**: Monitor per-language queue sizes
- **Job State Tracking**: Metrics for jobs in each state (queued/running/completed)

### ✅ Configuration
- Job TTL configuration via `jobs.ttl` in `application.conf`
- Per-language resource profiles in `ResourceConfig`
- Backward compatibility with synchronous `/lang/<language>` endpoint

## Architecture Improvements

The updated architecture now includes:
1. **Authentication Layer**: API key validation before request processing
2. **Rate Limiter Actor**: Token bucket-based rate limiting per API key
3. **Input Validator**: Multi-stage validation (size, language, security patterns)
4. **Metrics Collection**: Real-time Prometheus metrics export
5. **Health Endpoints**: Kubernetes-ready health and readiness probes
6. **Job Manager**: Async job execution with state tracking and TTL-based cleanup
7. **Resource Manager**: Per-language resource profiles with configurable limits
8. **Dual Execution Modes**: Both synchronous and asynchronous execution supported
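
For readers unfamiliar with the token bucket technique named in item 2, here is a minimal, generic sketch. It is not the project's rate limiter actor: `TokenBucket` is a hypothetical class, and the clock is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Generic token bucket: each request spends one token; tokens refill over time."""

    def __init__(self, capacity: int, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# 100 requests per hour, matching the README's default limit.
bucket = TokenBucket(capacity=100, refill_per_second=100 / 3600)
granted = sum(bucket.allow(now=0.0) for _ in range(101))
print(granted)                 # 100: the 101st immediate request is refused
print(bucket.allow(now=60.0))  # True: more than one token has refilled after a minute
```

Keeping one bucket per API key, for example in a dictionary keyed by the key string, gives the per-key behaviour described above.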

TODO:
- add support for C, Go, Rust and others - ❌
- use other `pekko` libraries to make cluster bootstrapping and management flexible and configurable - ❌
- wrap the cluster in k8s and enable autoscaling - ❌
- wrap the cluster in k8s and enable autoscaling - 🔄 (foundation in place)
- implement async job execution with job queue system - ✅ (completed in Phase 2)
- add multi-file project support and dependency management - ❌
6 changes: 5 additions & 1 deletion build.sbt
@@ -3,6 +3,7 @@ ThisBuild / scalaVersion := "3.4.1"
val PekkoVersion = "1.0.2"
val PekkoHttpVersion = "1.0.1"
val PekkoManagementVersion = "1.0.0"
val PrometheusVersion = "0.16.0"

assembly / assemblyMergeStrategy := {
case PathList("META-INF", "versions", "9", "module-info.class") => MergeStrategy.discard
@@ -25,7 +26,10 @@ libraryDependencies ++= Seq(
),
"org.apache.pekko" %% "pekko-cluster-typed" % PekkoVersion,
"org.apache.pekko" %% "pekko-serialization-jackson" % PekkoVersion,
"ch.qos.logback" % "logback-classic" % "1.5.6"
"ch.qos.logback" % "logback-classic" % "1.5.6",
"io.prometheus" % "simpleclient" % PrometheusVersion,
"io.prometheus" % "simpleclient_hotspot" % PrometheusVersion,
"io.prometheus" % "simpleclient_common" % PrometheusVersion
)

libraryDependencies ++= Seq(
6 changes: 0 additions & 6 deletions docker-compose.yaml
@@ -29,8 +29,6 @@ services:
stdin_open: true
ports:
- '17350:17350'
security_opt:
- 'seccomp=unconfined'
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- engine:/data
@@ -50,8 +48,6 @@ services:
stdin_open: true
ports:
- '17351:17351'
security_opt:
- 'seccomp=unconfined'
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- engine:/data
@@ -71,8 +67,6 @@ services:
stdin_open: true
ports:
- '17352:17352'
security_opt:
- 'seccomp=unconfined'
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- engine:/data
10 changes: 10 additions & 0 deletions src/main/resources/application.conf
@@ -34,6 +34,16 @@ http {
host = "0.0.0.0"
}

security {
  rate-limit {
    max-requests = 100 # Maximum requests per hour per API key
    max-requests = ${?RATE_LIMIT_MAX_REQUESTS}
  }
}

jobs {
  ttl = 1h # Time-to-live for completed jobs before cleanup
}

clustering {
ip = "127.0.0.1"