diff --git a/CLAUDE.md b/CLAUDE.md index 5fd65f5b..ed3e7c9f 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -69,8 +69,9 @@ helm/recotem/ # Helm chart (ServiceAccount, PDB, NetworkPolicy, HPA) envs/ # Environment files (dev.env, production.env) nginx.conf # Proxy config: SPA + /api/ + /ws/ + /admin/ + /inference/ + /static/ docs/ - guides/ # Feature guides (inference-api, api-keys, retraining, ab-testing) - deployment/ # Deployment guides (docker-compose, kubernetes, aws, gcp, env vars) + guide/ # User-facing guides (getting started, tuning, training, etc.) + specification/ # Developer-facing specs (architecture, data model, API, security) + deployment/ # Deployment and operations documentation ``` ## Development Setup diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index a520ef49..267c60f1 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -169,8 +169,9 @@ recotem/ nginx.conf # SPA + API + WS + Admin + Inference proxy helm/recotem/ # Helm chart for Kubernetes deployment docs/ - guides/ # Feature guides (inference API, API keys, etc.) - deployment/ # Deployment documentation + guide/ # User-facing guides (getting started, tuning, training, etc.) + specification/ # Developer-facing specs (architecture, data model, API, security) + deployment/ # Deployment and operations documentation ``` ## API Development Guide diff --git a/ONE_PAGER.md b/ONE_PAGER.md deleted file mode 100644 index 012deecf..00000000 --- a/ONE_PAGER.md +++ /dev/null @@ -1,169 +0,0 @@ -# Recotem One Pager - -## 1. Purpose - -Recotem is a Docker-first web application for building, tuning, training, and previewing recommender models through a UI. - -## 2. 
Product Scope - -Implemented user-facing capabilities: -- Create projects (user/item/time column definitions) -- Upload training data and item metadata files -- Create split and evaluation configurations -- Run parameter tuning jobs (Celery + Optuna + irspack) -- Auto-train model after tuning (optional) -- Train models manually from model configurations -- Preview recommendations from trained models -- Browse job logs and model/data artifacts - -Target usage: -- Multiple users use the same deployment -- Atlaskit-level component compliance is not required -- Same-domain operation is expected -- Separate frontend/backend instances may be introduced later - -## 3. High-Level Architecture - -Current deployment composition: -- `proxy` (Nginx): serves SPA, proxies `/api`, `/admin`, `/ws` -- `backend` (Django + DRF + Channels + Daphne): API + WebSocket endpoints -- `worker` (Celery): tuning/training background tasks -- `db` (PostgreSQL): application data + Optuna studies -- `redis` (Redis): Celery broker + channel layer + cache backend - -Tech stack: -- Frontend: Vue 3 + TypeScript + Vite + PrimeVue + Pinia -- Backend: Django 5.1, DRF, dj-rest-auth, simplejwt, celery, channels -- ML/Tuning: irspack + Optuna - -## 4. Core Data Model - -Main entities: -- `Project` -- `TrainingData` -- `ItemMetaData` -- `SplitConfig` -- `EvaluationConfig` -- `ModelConfiguration` -- `ParameterTuningJob` -- `TrainedModel` -- `TaskLog` - -Storage: -- DB metadata in PostgreSQL -- Uploaded datasets and trained model files in Django storage (`MEDIA_ROOT` or S3 if configured) - -## 5. Main Flow - -Standard workflow: -1. Create a project (`user_column`, `item_column`, optional `time_column`) -2. Upload training data file -3. Create split and evaluation configs -4. Create parameter tuning job -5. Background tasks run tuning; best config is persisted -6. Model can be trained (automatically or manually) -7. Recommendation preview endpoints are called from UI - -## 6. 
API and Realtime - -API exposure: -- Versioned: `/api/v1/...` -- Backward-compatible unversioned routes also exist via backend routing - -Realtime: -- WebSocket endpoints exist for job status/log channels (`/ws/job/{id}/...`) -- Current backend tasks persist logs to DB; websocket push wiring is incomplete (DB polling path exists in UI) - -## 7. Authentication and Multi-User Status - -Auth: -- dj-rest-auth + JWT endpoints are used for login and user fetch -- Access/refresh tokens are stored in browser localStorage - -Important multi-user note (current state): -- Data ownership boundaries are not modeled strongly yet (`Project` has no owner field) -- Most querysets are global and filtered by request parameters, not by user ownership -- This is a key gap for secure multi-user production usage - -## 8. Deployment and Operations - -### Same-domain deployment (current best-fit) - -Current Nginx setup is aligned to same-domain path-based routing: -- `/` -> frontend SPA -- `/api/` -> backend API -- `/ws/` -> backend websocket - -### Separate instances (future-compatible with changes) - -If frontend/backend are split across instances while keeping same domain: -- Prefer ingress/reverse-proxy path routing -- Add explicit backend base URL handling in frontend env configuration -- Add explicit CORS/CSRF trusted origin settings in backend - -## 9. Environment Variables - -Currently used core variables: -- `DATABASE_URL` -- `CELERY_BROKER_URL` -- `CACHE_REDIS_URL` -- `DEBUG` -- `SECRET_KEY` -- `DEFAULT_ADMIN_PASSWORD` -- `ACCESS_TOKEN_LIFETIME` -- `RECOTEM_STORAGE_TYPE` (+ optional S3 vars) - -Current status: -- `.env.example` exists -- `dev.env` / `production.env` templates exist -- Production template still contains weak placeholder-like defaults and must be hardened before public deployment - -## 10. 
CI/CD and Quality Gates - -GitHub Actions currently include: -- `pre-commit` (ruff + basic hooks) -- Test workflow (Playwright + pytest + coverage upload) -- Release workflow (multi-arch container build/push + Trivy scan + release artifact) -- CodeQL workflow -- Dependabot updates (pip/npm/actions/docker) - -Current quality status: -- Frontend unit tests: scripts exist but no test files committed -- Frontend E2E tests: scripts/workflow exist but no test files committed -- Frontend lint script currently requires ESLint flat config migration -- Backend tests exist for data upload and tuning flows - -## 11. Known Gaps (As-Is) - -Highest-impact items: -- Frontend/Backend API mismatch in model recommendation call path -- Incomplete data isolation for true multi-user security -- Missing/disabled frontend test assets despite CI expectations -- Security hardening required for production defaults (`SECRET_KEY`, hosts, credentials) -- Duplicate utility logic in backend service/util layers -- Documentation still contains partial legacy instructions in README - -## 12. Immediate Priorities - -P0: -- Enforce tenant/user ownership filtering in backend models/querysets/permissions -- Fix API contract mismatch in recommendation endpoint usage -- Restore runnable frontend test suites or adjust CI to match reality -- Harden production security defaults and env handling - -P1: -- Add configurable frontend API base URL for split-instance deployments -- Complete websocket event push pipeline from Celery/backend to channels -- Remove duplicated backend utility paths and consolidate service boundaries - -## 13. 
Reference Files - -Primary implementation references: -- `compose.yaml` -- `compose-dev.yaml` -- `backend/recotem/recotem/settings.py` -- `backend/recotem/recotem/api/tasks.py` -- `backend/recotem/recotem/api/views/` -- `frontend/src/api/client.ts` -- `frontend/src/pages/` -- `.github/workflows/` diff --git a/README.md b/README.md index a5e905c2..da53966e 100644 --- a/README.md +++ b/README.md @@ -136,19 +136,46 @@ cd frontend && npm run type-check # vue-tsc ## Documentation +### User Guide + +| Topic | Link | +|-------|------| +| Getting Started | [docs/guide/getting-started.md](docs/guide/getting-started.md) | +| Projects | [docs/guide/projects.md](docs/guide/projects.md) | +| Data Management | [docs/guide/data-management.md](docs/guide/data-management.md) | +| Hyperparameter Tuning | [docs/guide/tuning.md](docs/guide/tuning.md) | +| Model Training | [docs/guide/training.md](docs/guide/training.md) | +| API Keys | [docs/guide/api-keys.md](docs/guide/api-keys.md) | +| Inference API | [docs/guide/inference.md](docs/guide/inference.md) | +| Deployment Slots | [docs/guide/deployment-slots.md](docs/guide/deployment-slots.md) | +| A/B Testing | [docs/guide/ab-testing.md](docs/guide/ab-testing.md) | +| Scheduled Retraining | [docs/guide/retraining.md](docs/guide/retraining.md) | +| User Management | [docs/guide/user-management.md](docs/guide/user-management.md) | + +### Specification + +| Topic | Link | +|-------|------| +| Architecture | [docs/specification/architecture.md](docs/specification/architecture.md) | +| Data Model | [docs/specification/data-model.md](docs/specification/data-model.md) | +| API Reference | [docs/specification/api-reference.md](docs/specification/api-reference.md) | +| WebSocket Protocol | [docs/specification/websocket-protocol.md](docs/specification/websocket-protocol.md) | +| Security Design | [docs/specification/security-design.md](docs/specification/security-design.md) | +| Inference Service | 
[docs/specification/inference-service.md](docs/specification/inference-service.md) | +| Task System | [docs/specification/task-system.md](docs/specification/task-system.md) | + +### Deployment + | Topic | Link | |-------|------| -| Inference API | [docs/guides/inference-api.md](docs/guides/inference-api.md) | -| API Key Authentication | [docs/guides/api-keys.md](docs/guides/api-keys.md) | -| Scheduled Retraining | [docs/guides/retraining.md](docs/guides/retraining.md) | -| A/B Testing | [docs/guides/ab-testing.md](docs/guides/ab-testing.md) | -| Standalone Inference | [docs/guides/standalone-inference.md](docs/guides/standalone-inference.md) | -| Docker Compose Deployment | [docs/deployment/docker-compose.md](docs/deployment/docker-compose.md) | -| Kubernetes Deployment | [docs/deployment/kubernetes.md](docs/deployment/kubernetes.md) | -| AWS Deployment | [docs/deployment/aws.md](docs/deployment/aws.md) | -| GCP Deployment | [docs/deployment/gcp.md](docs/deployment/gcp.md) | +| Docker Compose | [docs/deployment/docker-compose.md](docs/deployment/docker-compose.md) | +| Kubernetes | [docs/deployment/kubernetes.md](docs/deployment/kubernetes.md) | +| AWS | [docs/deployment/aws.md](docs/deployment/aws.md) | +| GCP | [docs/deployment/gcp.md](docs/deployment/gcp.md) | | Environment Variables | [docs/deployment/environment-variables.md](docs/deployment/environment-variables.md) | +| Standalone Inference | [docs/deployment/standalone-inference.md](docs/deployment/standalone-inference.md) | | Separate Frontend | [docs/deployment/separate-frontend.md](docs/deployment/separate-frontend.md) | +| Management Commands | [docs/deployment/management-commands.md](docs/deployment/management-commands.md) | | Contributing | [CONTRIBUTING.md](CONTRIBUTING.md) | ## Links diff --git a/docs/deployment/docker-compose.md b/docs/deployment/docker-compose.md index 7884e413..03a3b987 100644 --- a/docs/deployment/docker-compose.md +++ b/docs/deployment/docker-compose.md @@ -108,7 +108,7 @@ 
Pre-load models on startup to avoid cold-start latency: INFERENCE_PRELOAD_MODEL_IDS=1,2,3 ``` -See [Standalone Inference Guide](../guides/standalone-inference.md) for the full workflow. +See [Standalone Inference Guide](standalone-inference.md) for the full workflow. ## Logs diff --git a/docs/deployment/management-commands.md b/docs/deployment/management-commands.md new file mode 100644 index 00000000..c8b672c1 --- /dev/null +++ b/docs/deployment/management-commands.md @@ -0,0 +1,320 @@ +# Management Commands + +Recotem includes several Django management commands for administration, deployment, and maintenance tasks. All commands are run via `python manage.py <command>` (or `uv run python manage.py <command>` in a local development setup). + +When running inside Docker, prefix with `docker compose exec backend`: + +```bash +docker compose exec backend python manage.py <command> +``` + +--- + +## create_superuser + +Create the initial admin account. This command runs automatically during container startup. If any users already exist in the database, it does nothing. + +### Usage + +```bash +python manage.py create_superuser +``` + +### Behavior + +- If **no users** exist in the database, creates a superuser with username `admin`. +- If the `DEFAULT_ADMIN_PASSWORD` environment variable is set, that value is used as the password. +- If `DEFAULT_ADMIN_PASSWORD` is not set, a random 12-character password is generated and printed to stdout. +- If **any users** already exist, the command exits immediately without creating anything. + +### Arguments + +This command takes no arguments. Configuration is via environment variable only. + +### Environment Variables + +| Variable | Required | Description | +|----------|----------|-------------| +| `DEFAULT_ADMIN_PASSWORD` | No | Password for the admin user. If omitted, a random password is generated and displayed.
| + +### When to Use + +- **Initial deployment**: The command is included in the Docker entrypoint so the first admin account is created automatically. +- **Manual setup**: Run it manually when setting up a development environment without Docker. + +### Examples + +```bash +# In Docker (automatic — included in entrypoint) +docker compose up backend + +# Local development with a specific password +DEFAULT_ADMIN_PASSWORD=mysecretpassword uv run python manage.py create_superuser + +# Local development with a random password (printed to stdout) +uv run python manage.py create_superuser +``` + +--- + +## create_api_key + +Create an API key for a project from the command line. The raw key is printed to stdout (this is the only time the full key is visible). + +### Usage + +```bash +python manage.py create_api_key \ + --project-id <id> \ + --name <name> \ + [--scopes <scopes>] \ + [--expires-in-days <days>] \ + [--owner <username>] +``` + +### Arguments + +| Argument | Required | Default | Description | +|----------|----------|---------|-------------| +| `--project-id` | Yes | -- | ID of the project the key belongs to. | +| `--name` | Yes | -- | A descriptive name for the key (must be unique within the project). | +| `--scopes` | No | `predict` | Comma-separated list of scopes: `read`, `write`, `predict`. | +| `--expires-in-days` | No | No expiry | Number of days until the key expires. | +| `--owner` | No | `admin` | Username of the key owner. Must match the project owner. | + +### Validations + +- The specified project ID must exist. +- The specified owner username must exist. +- The owner must match the project's owner (if the project has an owner set). +- The key name must not already exist for the same project. +- All scopes must be valid (`read`, `write`, or `predict`). + +### When to Use + +- **CI/CD pipelines**: Create API keys non-interactively as part of deployment scripts. +- **Docker entrypoints**: Provision inference keys during initial setup.
+- **Scripting**: Generate keys for automated integrations without using the web UI. + +### Examples + +```bash +# Create a predict-only key for project 1 +python manage.py create_api_key --project-id 1 --name "Production Inference" + +# Create a key with multiple scopes and 90-day expiry +python manage.py create_api_key \ + --project-id 1 \ + --name "CI Pipeline" \ + --scopes "read,write,predict" \ + --expires-in-days 90 + +# Create a key owned by a specific user +python manage.py create_api_key \ + --project-id 2 \ + --name "Partner Integration" \ + --scopes "predict" \ + --owner "alice" + +# Capture the key in a shell variable +API_KEY=$(python manage.py create_api_key --project-id 1 --name "Automated" 2>/dev/null) +echo "Key: $API_KEY" +``` + +--- + +## create_test_users + +Create or update test user accounts. Primarily used for E2E testing and development environments. + +### Usage + +```bash +python manage.py create_test_users \ + --user <username>:<password> \ + [--user <username>:<password> ...] +``` + +### Arguments + +| Argument | Required | Description | +|----------|----------|-------------| +| `--user` | Yes (repeatable) | User credential pair in the format `username:password`. Can be specified multiple times to create several users. | + +### Behavior + +- If the user already exists, their password is updated to the specified value. +- If the user does not exist, a new (non-superuser) account is created. +- Prints whether each user was created or updated. + +### When to Use + +- **E2E test setup**: Create deterministic test accounts before running Playwright tests. +- **Development environments**: Quickly set up multiple users for manual testing.
+ +### Examples + +```bash +# Create a single test user +python manage.py create_test_users --user testuser:testpassword + +# Create multiple test users +python manage.py create_test_users \ + --user alice:password123 \ + --user bob:password456 + +# In Docker (e.g., as part of test setup) +docker compose exec backend python manage.py create_test_users \ + --user e2e_user:e2e_password +``` + +--- + +## resign_models + +Sign all existing unsigned trained model files with HMAC-SHA256. This command is needed when migrating from an older Recotem version that did not sign model files, or when the `SECRET_KEY` has been rotated. + +### Usage + +```bash +python manage.py resign_models [--dry-run] +``` + +### Arguments + +| Argument | Required | Description | +|----------|----------|-------------| +| `--dry-run` | No | Show which models would be signed without actually modifying any files. | + +### Behavior + +1. Scans all `TrainedModel` records that have an associated file. +2. For each file, checks whether it already has a valid HMAC-SHA256 signature. +3. If unsigned, reads the file, prepends an HMAC signature, and writes it back. +4. Prints a summary showing how many models were already signed, newly signed, and any errors. + +### Output Summary + +The command prints counts for: +- **Already signed** -- files that already have a valid signature (skipped). +- **Newly signed** -- files that were unsigned and have now been signed. +- **Errors** -- files that could not be read or written (e.g., missing from disk). + +### When to Use + +- **After upgrading Recotem** to a version that introduced model signing. Run this once to sign all legacy model files. +- **After rotating `SECRET_KEY`**. Old signatures become invalid with a new key, so re-sign all models. +- **Before enforcing signature verification**. After running this command, set `PICKLE_ALLOW_LEGACY_UNSIGNED=false` in your environment to reject any unsigned model files. 
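The sign-and-prepend step described under Behavior can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Recotem's actual implementation: the helper names `sign_bytes` and `verify_and_strip`, the layout (a raw 32-byte digest placed directly before the payload), and the use of the secret string itself as the HMAC key are all hypothetical.

```python
import hashlib
import hmac

# SHA-256 digests are 32 bytes; the prefix length is an assumed file layout.
DIGEST_SIZE = hashlib.sha256().digest_size


def sign_bytes(payload: bytes, secret_key: str) -> bytes:
    """Return the payload with an HMAC-SHA256 signature prepended."""
    signature = hmac.new(secret_key.encode(), payload, hashlib.sha256).digest()
    return signature + payload


def verify_and_strip(signed: bytes, secret_key: str) -> bytes:
    """Check the prepended signature and return the original payload."""
    signature, payload = signed[:DIGEST_SIZE], signed[DIGEST_SIZE:]
    expected = hmac.new(secret_key.encode(), payload, hashlib.sha256).digest()
    # compare_digest performs a constant-time comparison.
    if not hmac.compare_digest(signature, expected):
        raise ValueError("signature mismatch")
    return payload
```

Under this assumed layout, a file signed with one key fails verification with any other key, which is consistent with the note above that models must be re-signed after `SECRET_KEY` is rotated.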
+ +### Examples + +```bash +# Preview which models need signing (no changes made) +python manage.py resign_models --dry-run + +# Sign all unsigned models +python manage.py resign_models + +# Full migration sequence +python manage.py resign_models +# Verify no errors in output, then enforce signing: +# Set PICKLE_ALLOW_LEGACY_UNSIGNED=false in your environment + +# In Docker +docker compose exec backend python manage.py resign_models --dry-run +docker compose exec backend python manage.py resign_models +``` + +--- + +## wait_db + +Wait for the PostgreSQL database to become available. Retries the connection with a 2-second delay between attempts. + +### Usage + +```bash +python manage.py wait_db +``` + +### Arguments + +This command takes no arguments. + +### Behavior + +- Attempts to connect to the default database up to **30 times** (60 seconds total). +- Waits 2 seconds between each attempt. +- Prints progress messages showing the current attempt number. +- Exits with code 0 on success, or code 1 if the database is still unavailable after all retries. + +### When to Use + +- **Docker entrypoints**: Run before `migrate` or `create_superuser` to ensure the database is accepting connections. This is critical because the `backend` container may start before `db` is ready. +- **Kubernetes init containers**: Use as a readiness check in init containers before the main application starts. + +### Examples + +```bash +# In a Docker entrypoint script (typical usage) +python manage.py wait_db && python manage.py migrate && python manage.py create_superuser + +# In Docker Compose (already included in the backend entrypoint) +docker compose up backend + +# In a Kubernetes init container +command: ["python", "manage.py", "wait_db"] +``` + +--- + +## assign_owners + +Assign an owner to Projects, SplitConfigs, and EvaluationConfigs that currently have no owner. This is a data migration tool for transitioning from single-user to multi-user mode. 
+ +### Usage + +```bash +python manage.py assign_owners --user <username> [--dry-run] +``` + +### Arguments + +| Argument | Required | Description | +|----------|----------|-------------| +| `--user` | Yes | Username to assign as the owner/created_by for all unowned records. | +| `--dry-run` | No | Show what would be changed without making any modifications. | + +### Behavior + +Scans three model types for records with no owner: + +| Model | Field Updated | +|-------|--------------| +| `Project` | `owner` | +| `SplitConfig` | `created_by` | +| `EvaluationConfig` | `created_by` | + +For each model, all records where the ownership field is `NULL` are updated to the specified user. The command prints a count of affected records for each model. + +### When to Use + +- **After upgrading to multi-user support**: If you have existing data from a single-user deployment, run this command to assign all legacy records to a user. Without an owner, these records may not be visible through the API's ownership-filtered views. +- **Data migration**: When consolidating records under a specific user account.
+ +### Examples + +```bash +# Preview which records would be updated +python manage.py assign_owners --user admin --dry-run + +# Assign all unowned records to the admin user +python manage.py assign_owners --user admin + +# Assign to a specific user +python manage.py assign_owners --user alice + +# In Docker +docker compose exec backend python manage.py assign_owners --user admin --dry-run +docker compose exec backend python manage.py assign_owners --user admin +``` diff --git a/docs/guides/standalone-inference.md b/docs/deployment/standalone-inference.md similarity index 98% rename from docs/guides/standalone-inference.md rename to docs/deployment/standalone-inference.md index d1318338..099e4138 100644 --- a/docs/guides/standalone-inference.md +++ b/docs/deployment/standalone-inference.md @@ -23,7 +23,7 @@ Start the full stack and train your models: docker compose up -d # Upload data, tune hyperparameters, and train models via the UI -# or use the REST API (see docs/guides/inference-api.md) +# or use the REST API (see docs/guide/inference.md) ``` ## Step 2: Create an API Key diff --git a/docs/guides/ab-testing.md b/docs/guide/ab-testing.md similarity index 84% rename from docs/guides/ab-testing.md rename to docs/guide/ab-testing.md index 52873c1c..c087d2ac 100644 --- a/docs/guides/ab-testing.md +++ b/docs/guide/ab-testing.md @@ -4,6 +4,18 @@ Recotem supports A/B testing of recommendation models through deployment slots w ## Concepts +### What is A/B Testing? + +A/B testing is a method for comparing two options to figure out which one works better. Imagine you have two recommendation models and you want to know which one leads to more clicks from your users. Instead of guessing, you split your traffic so that some users get recommendations from Model A and others get recommendations from Model B. After collecting enough data, you use statistics to determine which model actually performed better. 
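As one illustration of "using statistics" here, the sketch below applies a two-proportion z-test to click counts from two models. This is a generic textbook technique chosen for the example; this guide does not state which test Recotem itself applies, and the function name and inputs are hypothetical.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two_sided_p) for the difference in click rates, B minus A."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that both models perform equally.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    # Note: std_err is zero (division error) if the pooled rate is exactly 0 or 1.
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A small p-value (commonly below 0.05) suggests the observed difference is unlikely to be due to chance alone; with identical click rates the function returns `z = 0` and a p-value of 1.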
+ +**Why use A/B testing with recommendations?** + +- **Make data-driven decisions** -- instead of assuming a new model is better, prove it with real user behavior. +- **Reduce risk** -- roll out a new model gradually rather than switching all traffic at once. +- **Measure impact** -- quantify exactly how much a new model improves (or hurts) key metrics like click-through rate or purchase rate. + +In Recotem, A/B testing is built into the deployment system. You assign different models to deployment slots, split traffic between them, and Recotem tracks how each model performs. + ### Deployment Slots A deployment slot assigns a trained model to a project with a traffic weight. When the inference API receives a project-level prediction request, it selects a slot based on weights. diff --git a/docs/guide/api-keys.md b/docs/guide/api-keys.md new file mode 100644 index 00000000..99e47d19 --- /dev/null +++ b/docs/guide/api-keys.md @@ -0,0 +1,117 @@ +# API Key Authentication + +API keys let your applications and scripts talk to Recotem without requiring a user to log in. Think of an API key as a password for machines -- you give it to your app so it can fetch recommendations, upload data, or read project information on your behalf. + +## When to Use API Keys + +API keys are the right choice when you need to: + +- **Integrate recommendations into your application** -- your web or mobile app calls Recotem's inference API to show personalized recommendations to users. +- **Run automated scripts** -- batch jobs, data pipelines, or CI/CD workflows that need to interact with Recotem without human intervention. +- **Connect third-party services** -- external tools such as analytics platforms, marketing automation systems, or custom dashboards that pull data from Recotem. + +If you only need to manage projects and models through the web UI, you do not need an API key -- your normal user login is sufficient. 
+ +## Overview + +- Keys are prefixed with `rctm_` so you can easily recognize them in configuration files and logs +- Each key belongs to a specific project -- it cannot access other projects +- Permissions are controlled via scopes (`read`, `write`, `predict`), so you can limit what a key is allowed to do +- Keys are hashed before storage for security -- the full key is shown only once when you create it, so copy it right away +- Keys can have optional expiration dates to automatically stop working after a certain time + +## Creating an API Key + +### Via UI + +1. Navigate to your project +2. Go to **API Keys** in the sidebar +3. Click **Create API Key** +4. Enter a name and select scopes +5. Copy the displayed key immediately — it will not be shown again + +### Via API + +```bash +curl -X POST http://localhost:8000/api/v1/api_keys/ \ + -H "Authorization: Bearer " \ + -H "Content-Type: application/json" \ + -d '{ + "name": "Production Service", + "project": 1, + "scopes": ["predict"] + }' +``` + +**Response:** + +```json +{ + "id": 1, + "name": "Production Service", + "project": 1, + "key_prefix": "rctm_abc1", + "scopes": ["predict"], + "is_active": true, + "expires_at": null, + "last_used_at": null, + "key": "rctm_abc1defg2hijklmn3opqrstu4vwxyz..." +} +``` + +The `key` field is only included in the creation response. Copy and store it in a secure location (such as a secrets manager or environment variable) -- you will not be able to see it again. + +## Using an API Key + +Pass the key in the `X-API-Key` header: + +```bash +curl -H "X-API-Key: rctm_your_key_here" \ + http://localhost:8000/inference/predict/1 \ + -d '{"user_id": "42", "cutoff": 10}' +``` + +API keys work with both the management API (`/api/v1/`) and the inference API (`/inference/`). 
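The same call can be made from application code. The sketch below builds the request with Python's standard library only; the helper name `build_predict_request` is hypothetical, and the `Content-Type: application/json` header is an assumption that the curl example does not show.

```python
import json
import urllib.request


def build_predict_request(base_url, project_id, api_key, user_id, cutoff=10):
    """Build (but do not send) a POST request to the project-level predict endpoint."""
    body = json.dumps({"user_id": user_id, "cutoff": cutoff}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/inference/predict/{project_id}",
        data=body,
        method="POST",
        headers={
            "X-API-Key": api_key,
            # Assumed: the service expects a JSON body.
            "Content-Type": "application/json",
        },
    )
```

To send it, pass the returned object to `urllib.request.urlopen` while the inference service is running. Reading the key from an environment variable rather than hard-coding it keeps it out of source control.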
+ +## Scopes + +| Scope | Grants access to | +|-------|-----------------| +| `read` | Read project data, models, configurations | +| `write` | Create/update training data, configurations, models | +| `predict` | Call inference endpoints, record conversion events | + +A key can have multiple scopes. For most production integrations where you only need to serve recommendations, use `["predict"]`. Add `read` or `write` only if the key also needs to manage project data. + +## Managing Keys + +### List Keys + +```bash +curl -H "Authorization: Bearer " \ + "http://localhost:8000/api/v1/api_keys/?project=1" +``` + +### Revoke a Key + +Revoking deactivates a key immediately so it stops working, but the key record is kept for audit purposes: + +```bash +curl -X POST http://localhost:8000/api/v1/api_keys/1/revoke/ \ + -H "Authorization: Bearer " +``` + +### Delete a Key + +```bash +curl -X DELETE http://localhost:8000/api/v1/api_keys/1/ \ + -H "Authorization: Bearer " +``` + +## Security Best Practices + +- **Keys are hashed before storage** using PBKDF2-SHA256 -- even if the database is compromised, the raw keys cannot be recovered. +- The first 8 characters (the prefix, such as `rctm_abc1`) are stored in plaintext so Recotem can quickly look up which key is being used. +- Full keys are never stored and cannot be recovered. If you lose a key, revoke it and create a new one. +- The `last_used_at` field is updated on each successful use, so you can identify unused keys. +- Set `expires_at` for time-limited access -- this is especially useful for keys shared with external partners or temporary integrations. diff --git a/docs/guide/data-management.md b/docs/guide/data-management.md new file mode 100644 index 00000000..44c0c716 --- /dev/null +++ b/docs/guide/data-management.md @@ -0,0 +1,286 @@ +# Data Management + +This guide explains how to prepare, upload, and manage the data that powers your recommendation models. 
+ +## What You Need to Know + +Recotem works with interaction data -- records of users engaging with items. This data is the foundation for training recommendation models. You can also upload item metadata to enrich your recommendations with descriptive information about items. + +## Prerequisites + +- A project has been created (see [Projects](projects.md)) +- Your data is in CSV or TSV format +- You know which columns match your project's user, item, and time column definitions + +## Training Data + +### What Is Training Data? + +Training data is a file where each row represents one interaction between a user and an item. For example: + +- A user watched a movie +- A customer purchased a product +- A reader clicked on an article + +The system uses these interactions to learn patterns and generate recommendations. + +### CSV Format Requirements + +Your CSV file must include at least the columns defined in your project: + +- The **user column** (e.g., `user_id`) -- required +- The **item column** (e.g., `movie_id`) -- required +- The **time column** (e.g., `timestamp`) -- required only if defined in the project + +Additional columns (such as ratings or categories) are allowed and will be preserved, but only the user, item, and time columns are used for model training. 
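Before uploading, the column requirement above can be checked locally. The following is a small sketch using only the standard library; `missing_columns` is a hypothetical helper for illustration and is not part of Recotem, whose own validation runs server-side at upload time.

```python
import csv
import io


def missing_columns(csv_text, required, delimiter=","):
    """Return the required column names that are absent from the header row."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader, [])
    # The comparison is exact, so "User_ID" does not satisfy "user_id".
    return [name for name in required if name not in header]
```

For a TSV file, pass `delimiter="\t"`. An empty result means every required column is present in the header.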
+ +**Example -- Minimal format (no timestamps):** + +```csv +user_id,movie_id +1,101 +1,203 +2,101 +2,305 +3,203 +``` + +**Example -- With ratings and timestamps:** + +```csv +user_id,movie_id,rating,timestamp +1,101,5,2024-01-15 +1,203,4,2024-01-16 +2,101,3,2024-01-15 +2,305,5,2024-01-17 +3,203,4,2024-01-18 +``` + +**Example -- E-commerce interactions:** + +```csv +customer_id,product_id,action,date +C001,P100,purchase,2024-03-01 +C001,P200,view,2024-03-02 +C002,P100,view,2024-03-01 +C002,P300,purchase,2024-03-03 +``` + +### Supported File Formats + +| Format | Extension | Delimiter | +|--------|-----------|-----------| +| CSV | `.csv` | Comma (`,`) | +| TSV | `.tsv` | Tab | + +The system detects the format automatically based on the file extension. + +### Uploading Training Data + +#### Via the UI + +1. Navigate to your project +2. Click **Training Data** in the sidebar +3. Click **Upload** +4. Select your CSV or TSV file +5. The system validates the file immediately + + + + + +After a successful upload, you will see the file listed with its name, size, and upload date. + + + +#### Via the API + +```bash +curl -X POST http://localhost:8000/api/v1/training_data/ \ + -H "Authorization: Bearer $TOKEN" \ + -F "project=1" \ + -F "file=@/path/to/interactions.csv" +``` + +**Response:** + +```json +{ + "id": 1, + "project": 1, + "file": "/media/training_data/interactions.csv", + "ins_datetime": "2025-01-15T10:35:00Z", + "basename": "interactions.csv", + "filesize": 524288 +} +``` + +### Previewing Data + +You can preview the first rows of an uploaded file to verify it was parsed correctly. 
+ +**Via the API:** + +```bash +curl http://localhost:8000/api/v1/training_data/1/preview/?n_rows=10 \ + -H "Authorization: Bearer $TOKEN" +``` + +**Response:** + +```json +{ + "columns": ["user_id", "movie_id", "rating", "timestamp"], + "rows": [ + [1, 101, 5, "2024-01-15"], + [1, 203, 4, "2024-01-16"] + ], + "total_rows": 2 +} +``` + +### Data Validation + +When you upload a file, Recotem checks the following: + +| Check | Error Message | +|-------|--------------| +| User column exists | `Column "user_id" not found in the upload file.` | +| Item column exists | `Column "movie_id" not found in the upload file.` | +| Time column exists (if configured) | `Column "timestamp" not found in the upload file.` | +| Time column is parseable as dates (if configured) | `Could not interpret "timestamp" as datetime.` | +| File is not empty | `file is required.` | + +If validation fails, the upload is rejected and the error message tells you exactly what went wrong. + +### Tips for Preparing Training Data + +- **More data is better** -- Recommendation models improve with more interactions +- **Unique users and items** -- Ensure consistent identifiers (do not mix `user_1` and `User_1`) +- **Remove duplicates** -- If a user interacted with the same item multiple times, decide whether to keep all records or just the latest +- **Column names must match exactly** -- The column names in your CSV must match the project's column definitions (case-sensitive) + +## Item Metadata + +### What Is Item Metadata? + +Item metadata provides descriptive information about items, such as titles, categories, or prices. While not required for training, it enriches the sample recommendation view in the UI by showing you what items are actually being recommended. + +### Format + +The metadata CSV must include the item column defined in your project. All other columns are treated as descriptive attributes. 
+ +**Example:** + +```csv +movie_id,title,genre,year +101,The Matrix,Sci-Fi,1999 +203,Inception,Sci-Fi,2010 +305,The Godfather,Drama,1972 +420,Pulp Fiction,Crime,1994 +``` + +### Uploading Item Metadata + +#### Via the UI + +1. Navigate to your project +2. Click **Item Metadata** in the sidebar +3. Click **Upload** +4. Select your CSV file + + + +#### Via the API + +```bash +curl -X POST http://localhost:8000/api/v1/item_meta_data/ \ + -H "Authorization: Bearer $TOKEN" \ + -F "project=1" \ + -F "file=@/path/to/movies.csv" +``` + +**Response:** + +```json +{ + "id": 1, + "project": 1, + "file": "/media/item_meta_data/movies.csv", + "valid_columns_list_json": ["title", "genre", "year"], + "ins_datetime": "2025-01-15T10:40:00Z", + "basename": "movies.csv", + "filesize": 12345 +} +``` + +The `valid_columns_list_json` field lists which columns from the metadata file can be displayed in the UI (columns that cannot be serialized to JSON are excluded automatically). + +### Metadata Validation + +The system checks that the item column (e.g., `movie_id`) exists in the metadata file. If it is missing, the upload is rejected. + +## Managing Uploaded Files + +### Listing Files + +**Training data:** + +```bash +# All training data for a project +curl "http://localhost:8000/api/v1/training_data/?project=1" \ + -H "Authorization: Bearer $TOKEN" + +# A specific file +curl http://localhost:8000/api/v1/training_data/1/ \ + -H "Authorization: Bearer $TOKEN" +``` + +**Item metadata:** + +```bash +curl "http://localhost:8000/api/v1/item_meta_data/?project=1" \ + -H "Authorization: Bearer $TOKEN" +``` + +### Downloading Files + +You can download a previously uploaded file: + +```bash +curl http://localhost:8000/api/v1/training_data/1/download/ \ + -H "Authorization: Bearer $TOKEN" \ + -o training_data.csv +``` + +### Deleting Files + +Deleting a training data file removes the file from storage but keeps any models that were trained on it (those models remain usable). 
+ +```bash +curl -X DELETE http://localhost:8000/api/v1/training_data/1/ \ + -H "Authorization: Bearer $TOKEN" +``` + +## API Reference + +### Training Data + +| Method | Endpoint | Description | +|--------|----------|-------------| +| `GET` | `/api/v1/training_data/?project={id}` | List training data for a project | +| `POST` | `/api/v1/training_data/` | Upload new training data (multipart form) | +| `GET` | `/api/v1/training_data/{id}/` | Get file details | +| `GET` | `/api/v1/training_data/{id}/preview/` | Preview first N rows | +| `GET` | `/api/v1/training_data/{id}/download/` | Download the file | +| `DELETE` | `/api/v1/training_data/{id}/` | Delete the file | + +### Item Metadata + +| Method | Endpoint | Description | +|--------|----------|-------------| +| `GET` | `/api/v1/item_meta_data/?project={id}` | List metadata files for a project | +| `POST` | `/api/v1/item_meta_data/` | Upload new metadata (multipart form) | +| `GET` | `/api/v1/item_meta_data/{id}/` | Get file details | +| `GET` | `/api/v1/item_meta_data/{id}/download/` | Download the file | +| `DELETE` | `/api/v1/item_meta_data/{id}/` | Delete the file | diff --git a/docs/guide/deployment-slots.md b/docs/guide/deployment-slots.md new file mode 100644 index 00000000..6759e8c1 --- /dev/null +++ b/docs/guide/deployment-slots.md @@ -0,0 +1,258 @@ +# Deployment Slots + +This guide explains how to deploy trained models for serving real-time recommendations in production. + +## What Are Deployment Slots? + +A deployment slot connects a trained model to a project for serving predictions through the inference API. Think of it as assigning a model to "go live" for a project. + +Each project can have multiple deployment slots, each pointing to a different model. When the inference API receives a project-level prediction request, it selects one of the active slots based on their weights. This enables smooth traffic distribution and A/B testing. + +## Why Use Deployment Slots? 
+ +- **Controlled rollouts** -- Gradually shift traffic from an old model to a new one +- **A/B testing** -- Compare two models side by side with real users (see [A/B Testing](ab-testing.md)) +- **Easy rollback** -- If a new model underperforms, deactivate its slot and revert to the previous one +- **Zero-downtime updates** -- Swap models without restarting any services + +## Prerequisites + +- A trained model (see [Training](training.md)) +- An API key with `predict` scope for calling the inference API (see [API Keys](api-keys.md)) + +## Creating a Deployment Slot + +### Via the UI + +1. Navigate to your project +2. Click **Deployment Slots** in the sidebar +3. Click **Create Slot** +4. Fill in the details: + + + +| Field | Required | Description | +|-------|----------|-------------| +| **Name** | Yes | A descriptive label (e.g., "Production", "Candidate Model", "Variant A") | +| **Trained Model** | Yes | The model to serve. Must belong to the same project. | +| **Weight** | Yes | Traffic weight (0 to 100). Controls what fraction of requests this slot handles relative to other active slots. | +| **Active** | Yes | Whether this slot is currently serving traffic. Defaults to active. | + +5. Click **Save** + +### Via the API + +```bash +curl -X POST http://localhost:8000/api/v1/deployment_slot/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "project": 1, + "name": "Production", + "trained_model": 3, + "weight": 100, + "is_active": true + }' +``` + +**Response:** + +```json +{ + "id": 1, + "project": 1, + "name": "Production", + "trained_model": 3, + "weight": 100.0, + "is_active": true, + "ins_datetime": "2025-01-15T14:00:00Z", + "updated_at": "2025-01-15T14:00:00Z" +} +``` + +## How Traffic Distribution Works + +When you call the project-level inference endpoint (`POST /inference/predict/project/{project_id}`), the service selects a slot based on the weights of all **active** slots for that project. 
+ +**Example: Single slot** + +If you have one active slot with weight 100, all requests go to that slot's model. + +``` +Slot "Production" (weight: 100) --> 100% of traffic +``` + +**Example: Two slots for A/B testing** + +If you have two active slots with equal weights, traffic is split evenly: + +``` +Slot "Control" (weight: 50) --> 50% of traffic +Slot "Variant A" (weight: 50) --> 50% of traffic +``` + +**Example: Gradual rollout** + +Start with most traffic on the existing model, then gradually increase the new model's share: + +``` +Slot "Current Model" (weight: 90) --> 90% of traffic +Slot "New Model" (weight: 10) --> 10% of traffic +``` + +The weight values do not need to add up to 100. The system calculates proportions based on the relative weights. For instance, slots with weights 30 and 70 produce the same split as slots with weights 3 and 7. + +## Calling the Inference API with Slots + +Once you have active deployment slots, use the project-level inference endpoint: + +```bash +curl -X POST http://localhost:8000/inference/predict/project/1 \ + -H "X-API-Key: rctm_your_key_here" \ + -H "Content-Type: application/json" \ + -d '{ + "user_id": "42", + "cutoff": 10 + }' +``` + +**Response:** + +```json +{ + "items": [ + {"item_id": "305", "score": 0.95}, + {"item_id": "420", "score": 0.87} + ], + "model_id": 3, + "slot_id": 1, + "slot_name": "Production", + "request_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890" +} +``` + +The response includes `slot_id` and `slot_name` so your application can track which model served each request. The `request_id` is useful for recording conversion events in A/B tests. + +If no active deployment slots exist for the project, the API returns a 404 error. 
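
For intuition, the weight-based routing described in "How Traffic Distribution Works" can be simulated locally. The following is a minimal sketch, assuming weights act as relative probabilities and that slots with weight 0 receive no traffic. `Slot` and `pick_slot` are hypothetical names used for illustration only; this is not Recotem's actual routing code.

```python
import random
from dataclasses import dataclass


@dataclass
class Slot:
    # Hypothetical stand-in for a deployment slot record.
    name: str
    weight: float
    is_active: bool = True


def pick_slot(slots, rng=random):
    """Pick one active slot with probability proportional to its weight."""
    # Assumption: inactive slots and zero-weight slots never receive traffic.
    candidates = [s for s in slots if s.is_active and s.weight > 0]
    if not candidates:
        # Corresponds to the "no active slots" case, where the API returns 404.
        return None
    weights = [s.weight for s in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    slots = [
        Slot("Current Model", 90),
        Slot("New Model", 10),
        Slot("Paused", 50, is_active=False),
    ]
    rng = random.Random(0)  # Seeded so repeated runs give the same counts.
    counts = {}
    for _ in range(10_000):
        chosen = pick_slot(slots, rng)
        counts[chosen.name] = counts.get(chosen.name, 0) + 1
    print(counts)
```

Running the simulation should show roughly nine "Current Model" picks for every "New Model" pick, and no picks for the deactivated slot.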
+ +## Managing Deployment Slots + +### Listing Slots + +```bash +# All slots for a project +curl "http://localhost:8000/api/v1/deployment_slot/?project=1" \ + -H "Authorization: Bearer $TOKEN" + +# Only active slots +curl "http://localhost:8000/api/v1/deployment_slot/?project=1&is_active=true" \ + -H "Authorization: Bearer $TOKEN" +``` + +### Updating a Slot + +You can change a slot's model, weight, or active status: + +```bash +# Change the model a slot points to +curl -X PATCH http://localhost:8000/api/v1/deployment_slot/1/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"trained_model": 7}' + +# Adjust traffic weight +curl -X PATCH http://localhost:8000/api/v1/deployment_slot/1/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"weight": 75}' +``` + +### Activating and Deactivating Slots + +Deactivating a slot removes it from traffic routing without deleting it. This is useful for pausing a model or performing maintenance. + +```bash +# Deactivate a slot +curl -X PATCH http://localhost:8000/api/v1/deployment_slot/2/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"is_active": false}' + +# Reactivate a slot +curl -X PATCH http://localhost:8000/api/v1/deployment_slot/2/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"is_active": true}' +``` + +### Deleting a Slot + +```bash +curl -X DELETE http://localhost:8000/api/v1/deployment_slot/2/ \ + -H "Authorization: Bearer $TOKEN" +``` + +## Common Patterns + +### Single Production Model + +The simplest setup: one active slot serving all traffic. 
+ +```bash +curl -X POST http://localhost:8000/api/v1/deployment_slot/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "project": 1, + "name": "Production", + "trained_model": 3, + "weight": 100 + }' +``` + +### Model Swap + +To replace the production model with a new one: + +```bash +# Update the existing slot to point to the new model +curl -X PATCH http://localhost:8000/api/v1/deployment_slot/1/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"trained_model": 7}' +``` + +### Canary Deployment + +Send a small percentage of traffic to a new model to verify it works correctly: + +```bash +# Existing production slot (already at weight 100) +# Create a canary slot with low weight +curl -X POST http://localhost:8000/api/v1/deployment_slot/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "project": 1, + "name": "Canary", + "trained_model": 7, + "weight": 5 + }' +``` + +This sends about 5% of traffic to the new model (`5 / (100 + 5) = ~4.8%`). + +### A/B Testing + +For formal A/B testing with statistical analysis, use deployment slots in combination with the A/B Testing feature. See the [A/B Testing guide](ab-testing.md) for details. 
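
The canary arithmetic above (`5 / (100 + 5)`) generalizes to any set of weights. The sketch below, with the hypothetical helper names `traffic_shares` and `weight_for_share`, computes each slot's share and the weight a new slot would need to reach an exact target share. It assumes shares are simply weight divided by the sum of active weights, as the guide describes.

```python
def traffic_shares(weights):
    """Return each slot's share of traffic given relative weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one slot needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


def weight_for_share(target_share, other_weights_total):
    """Weight a new slot needs so that it receives target_share of traffic.

    Solves w / (w + other_weights_total) = target_share for w.
    """
    if not 0 < target_share < 1:
        raise ValueError("target_share must be between 0 and 1 (exclusive)")
    return target_share * other_weights_total / (1 - target_share)


if __name__ == "__main__":
    shares = traffic_shares({"Production": 100, "Canary": 5})
    print({name: round(share, 4) for name, share in shares.items()})
    # → {'Production': 0.9524, 'Canary': 0.0476}
    print(round(weight_for_share(0.05, 100), 2))
    # → 5.26
```

In other words, a canary weight of about 5.26 next to a production weight of 100 gives the canary an exact 5% share, while a weight of 5 gives about 4.8%.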
+ +## API Reference + +| Method | Endpoint | Description | +|--------|----------|-------------| +| `GET` | `/api/v1/deployment_slot/?project={id}` | List slots for a project | +| `POST` | `/api/v1/deployment_slot/` | Create a new slot | +| `GET` | `/api/v1/deployment_slot/{id}/` | Get slot details | +| `PATCH` | `/api/v1/deployment_slot/{id}/` | Update a slot (model, weight, active status) | +| `DELETE` | `/api/v1/deployment_slot/{id}/` | Delete a slot | +| `POST` | `/inference/predict/project/{project_id}` | Get recommendations via slot routing | diff --git a/docs/guide/getting-started.md b/docs/guide/getting-started.md new file mode 100644 index 00000000..d0752732 --- /dev/null +++ b/docs/guide/getting-started.md @@ -0,0 +1,351 @@ +# Getting Started with Recotem + +This guide walks you through the complete workflow: from your first login to getting real-time recommendations from a trained model. + +## What You Will Learn + +By the end of this guide, you will have: + +- Created a recommendation project +- Uploaded user-item interaction data +- Tuned and trained a recommendation model +- Retrieved personalized recommendations via the API + +## Prerequisites + +- Recotem is running and accessible in your browser (see [Docker Compose Deployment](../deployment/docker-compose.md) or [Kubernetes Deployment](../deployment/kubernetes.md)) +- A modern web browser (Chrome, Firefox, Safari, or Edge) +- A CSV file with user-item interaction data (or use the example below) + +## Step 1: Log In + +Open your browser and navigate to your Recotem instance (by default, `http://localhost:8000`). + +You will see the login page. Enter the admin credentials that were configured during deployment: + +- **Username**: `admin` (default) +- **Password**: The value of the `DEFAULT_ADMIN_PASSWORD` environment variable set during deployment + + + +After logging in, you will be taken to the Dashboard, which shows an overview of your projects. 
+ + + +## Step 2: Create a Project + +A **project** is the top-level container for a recommendation task. It defines which columns in your data represent users, items, and (optionally) timestamps. + +1. Click **Create Project** on the Dashboard +2. Fill in the project details: + +| Field | Description | Example | +|-------|-------------|---------| +| **Name** | A descriptive name for your project | `Movie Recommendations` | +| **User Column** | The column name in your CSV that identifies users | `user_id` | +| **Item Column** | The column name in your CSV that identifies items | `movie_id` | +| **Time Column** | (Optional) The column name for timestamps. Leave blank if your data has no timestamps | `timestamp` | + + + +3. Click **Save** + +You can also create a project via the API: + +```bash +# First, obtain a JWT token +TOKEN=$(curl -s -X POST http://localhost:8000/api/v1/auth/login/ \ + -H "Content-Type: application/json" \ + -d '{"username": "admin", "password": "your_password"}' \ + | jq -r '.access') + +# Create the project +curl -X POST http://localhost:8000/api/v1/project/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "name": "Movie Recommendations", + "user_column": "user_id", + "item_column": "movie_id", + "time_column": "timestamp" + }' +``` + +## Step 3: Upload Training Data + +Training data is a CSV file where each row represents an interaction between a user and an item. At minimum, your CSV must contain the user and item columns you defined in the project. + +**Example CSV format:** + +```csv +user_id,movie_id,rating,timestamp +1,101,5,2024-01-15 +1,203,4,2024-01-16 +2,101,3,2024-01-15 +2,305,5,2024-01-17 +3,203,4,2024-01-18 +3,305,2,2024-01-19 +``` + +To upload: + +1. Navigate to your project +2. Click **Training Data** in the sidebar +3. Click **Upload** +4. Select your CSV file +5. 
The system validates that the required columns exist in your file + + + +Recotem validates the file immediately after upload. If the required columns are missing, you will see an error message explaining which column was not found. + +**Via API:** + +```bash +curl -X POST http://localhost:8000/api/v1/training_data/ \ + -H "Authorization: Bearer $TOKEN" \ + -F "project=1" \ + -F "file=@/path/to/your/interactions.csv" +``` + +## Step 4: Create a Split Config + +A **split config** tells Recotem how to divide your data into training and test sets for evaluation. This is important because it lets the system measure how well the model performs on data it has not seen. + +1. Navigate to **Split Config** in the sidebar +2. Click **Create** +3. Configure the split: + +| Field | Description | Default | +|-------|-------------|---------| +| **Name** | A label for this config (optional) | | +| **Scheme** | How to split the data | `Random` | +| **Heldout Ratio** | Fraction of each user's interactions reserved for testing (0.0 to 1.0) | `0.1` | +| **Test User Ratio** | Fraction of users included in evaluation (0.0 to 1.0) | `1.0` | +| **Random Seed** | Seed for reproducibility | `42` | + + + +**Available split schemes:** + +- **Random** -- Randomly holds out a fraction of interactions per user +- **Time Global** -- Uses the most recent interactions globally as the test set +- **Time User** -- Uses the most recent interactions per user as the test set + +**Via API:** + +```bash +curl -X POST http://localhost:8000/api/v1/split_config/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "name": "Default Split", + "scheme": "RG", + "heldout_ratio": 0.1, + "test_user_ratio": 1.0, + "random_seed": 42 + }' +``` + +## Step 5: Create an Evaluation Config + +An **evaluation config** defines which metric to optimize and how many items to consider when measuring model quality. + +1. Navigate to **Evaluation Config** in the sidebar +2. 
Click **Create** +3. Configure the evaluation: + +| Field | Description | Default | +|-------|-------------|---------| +| **Name** | A label for this config (optional) | | +| **Cutoff** | Number of top items to consider when computing metrics | `20` | +| **Target Metric** | The metric to optimize during tuning | `ndcg` | + + + +**Available metrics:** + +| Metric | Full Name | What It Measures | +|--------|-----------|-----------------| +| `ndcg` | Normalized Discounted Cumulative Gain | How well the recommended items are ranked, giving more weight to items at the top of the list | +| `map` | Mean Average Precision | Average precision across all relevant items | +| `recall` | Recall | Fraction of relevant items that appear in the recommendations | +| `hit` | Hit Rate | Whether at least one relevant item appears in the recommendations | + +**Via API:** + +```bash +curl -X POST http://localhost:8000/api/v1/evaluation_config/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "name": "Default Evaluation", + "cutoff": 20, + "target_metric": "ndcg" + }' +``` + +## Step 6: Run Parameter Tuning + +**Parameter tuning** is an automated process that searches for the best algorithm and settings for your data. Recotem uses Optuna to efficiently explore different combinations. + +1. Navigate to **Tuning Jobs** in the sidebar +2. Click **Create Tuning Job** +3. Select: + - The **training data** you uploaded + - The **split config** you created + - The **evaluation config** you created +4. Optionally adjust: + - **Number of trials** (default: 40) -- how many different configurations to try + - **Train after tuning** (default: enabled) -- automatically train a model with the best configuration found + + + +5. Click **Start** + +The tuning job runs in the background. You can monitor its progress on the tuning job detail page. The UI updates in real time via WebSocket. 
+ + + +When the job completes, it saves the best model configuration automatically. If "Train after tuning" was enabled, a trained model is also created. + +**Via API:** + +```bash +curl -X POST http://localhost:8000/api/v1/parameter_tuning_job/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "data": 1, + "split": 1, + "evaluation": 1, + "n_trials": 40, + "train_after_tuning": true + }' +``` + +The response includes the job ID. You can check the status: + +```bash +curl http://localhost:8000/api/v1/parameter_tuning_job/1/ \ + -H "Authorization: Bearer $TOKEN" +``` + +Job statuses: `PENDING`, `RUNNING`, `COMPLETED`, `FAILED`. + +## Step 7: Train a Model + +If you did not enable "Train after tuning" in the previous step, or if you want to train additional models from a specific configuration, you can do so manually. + +1. Navigate to **Models** in the sidebar +2. Click **Train Model** +3. Select: + - The **model configuration** (the best config from tuning, or any other configuration) + - The **training data** to train on + + + +4. Click **Start Training** + +Training runs in the background. Once complete, the model file is saved and signed with HMAC-SHA256 for security. + +**Via API:** + +```bash +curl -X POST http://localhost:8000/api/v1/trained_model/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "configuration": 1, + "data_loc": 1 + }' +``` + +## Step 8: Create an API Key + +To call the inference API, you need an API key with the `predict` scope. + +1. Navigate to **API Keys** in the sidebar +2. Click **Create API Key** +3. Enter a name and select the `predict` scope +4. Click **Create** +5. 
**Copy the key immediately** -- it will not be shown again + + + +**Via API:** + +```bash +curl -X POST http://localhost:8000/api/v1/api_keys/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "name": "My First Key", + "project": 1, + "scopes": ["predict"] + }' +``` + +The response includes a `key` field with the full API key (prefixed with `rctm_`). Store it securely. + +For more details, see the [API Keys guide](api-keys.md). + +## Step 9: Get Recommendations + +Now you can call the inference API to get personalized recommendations for any user in your training data. + +**Single user recommendations:** + +```bash +curl -X POST http://localhost:8000/inference/predict/1 \ + -H "X-API-Key: rctm_your_key_here" \ + -H "Content-Type: application/json" \ + -d '{ + "user_id": "1", + "cutoff": 10 + }' +``` + +**Response:** + +```json +{ + "items": [ + {"item_id": "305", "score": 0.95}, + {"item_id": "420", "score": 0.87}, + {"item_id": "112", "score": 0.82} + ], + "model_id": 1, + "request_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890" +} +``` + +Each item in the list is a recommended item for the user, sorted by score (highest first). The `cutoff` parameter controls how many items to return. + +**Batch recommendations (multiple users at once):** + +```bash +curl -X POST http://localhost:8000/inference/predict/1/batch \ + -H "X-API-Key: rctm_your_key_here" \ + -H "Content-Type: application/json" \ + -d '{ + "user_ids": ["1", "2", "3"], + "cutoff": 10 + }' +``` + +For the complete inference API reference, see the [Inference API guide](inference.md). 
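
If your application is written in Python, the same calls can be made without `curl`. The following is a minimal sketch using only the standard library. The endpoint path, `X-API-Key` header, and request body come from the examples above; the helper names (`build_predict_request`, `chunk_user_ids`, `send`) are hypothetical, and the batch size of 100 follows the limit stated in the Inference API guide.

```python
import json
import urllib.request

MAX_BATCH_SIZE = 100  # Per-request user limit stated in the Inference API guide.


def build_predict_request(base_url, model_id, api_key, user_id, cutoff=10):
    """Build (but do not send) a single-user prediction request."""
    body = json.dumps({"user_id": str(user_id), "cutoff": cutoff}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/inference/predict/{model_id}",
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def chunk_user_ids(user_ids, size=MAX_BATCH_SIZE):
    """Split user IDs into lists no longer than the batch limit."""
    ids = [str(u) for u in user_ids]
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def send(request, timeout=5.0):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    request = build_predict_request(
        "http://localhost:8000", model_id=1, api_key="rctm_your_key_here", user_id=1
    )
    print(request.get_method(), request.full_url)
    # Uncomment once Recotem is running and the key is valid:
    # print(send(request))
```

Building the request separately from sending it keeps the helper easy to test and lets you reuse `chunk_user_ids` when preparing payloads for the batch endpoint.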
+ +## What's Next + +Now that you have a working recommendation pipeline, explore these topics: + +- **[Projects](projects.md)** -- Learn more about managing projects +- **[Data Management](data-management.md)** -- Preparing and uploading different data formats +- **[Tuning](tuning.md)** -- Advanced tuning options and available algorithms +- **[Training](training.md)** -- Model training details and comparison +- **[Deployment Slots](deployment-slots.md)** -- Deploy models for production serving with traffic splitting +- **[A/B Testing](ab-testing.md)** -- Compare models with statistical analysis +- **[Scheduled Retraining](retraining.md)** -- Keep models fresh with automatic retraining +- **[API Keys](api-keys.md)** -- Manage API access and scopes +- **[User Management](user-management.md)** -- Add users and manage permissions diff --git a/docs/guides/inference-api.md b/docs/guide/inference.md similarity index 63% rename from docs/guides/inference-api.md rename to docs/guide/inference.md index 743a88c1..a31c241e 100644 --- a/docs/guides/inference-api.md +++ b/docs/guide/inference.md @@ -1,6 +1,17 @@ # Inference API -The inference service is a standalone FastAPI application that serves real-time recommendations. It runs independently from the Django backend and connects directly to PostgreSQL (read-only) and Redis. +The inference API is how your application gets recommendations from Recotem. After you have trained a recommendation model, the inference API is the endpoint your app calls to ask "what items should I recommend to this user?" and get an answer back in milliseconds. + +## Where Inference Fits in the Workflow + +The inference API is the final step in the recommendation pipeline: + +1. **Upload data** -- you provide user interaction data (e.g., clicks, purchases) to Recotem. +2. **Tune and train** -- Recotem finds the best algorithm settings and trains a model. +3. **Deploy** -- you assign the trained model to a deployment slot so it can serve predictions. 
+4. **Get recommendations (you are here)** -- your application calls the inference API with a user ID and gets back a ranked list of recommended items. + +The inference service is a standalone FastAPI application that serves real-time recommendations. It runs independently from the Django backend and connects directly to PostgreSQL (read-only) and Redis. This separation means the inference service can be scaled independently to handle high prediction traffic without affecting the management interface. ## Base URL @@ -18,13 +29,13 @@ All prediction endpoints require an API key with `predict` scope. Pass it via th curl -H "X-API-Key: rctm_your_key_here" ... ``` -See [API Keys](api-keys.md) for how to create and manage keys. +You need to create an API key before you can call the inference API. See the [API Keys guide](api-keys.md) for how to create and manage keys. ## Endpoints ### POST /inference/predict/{model_id} -Get top-K recommendations for a single user from a specific model. +Get top-K recommendations for a single user from a specific model. This is the simplest endpoint -- use it when you know exactly which trained model you want to query. **Request:** @@ -64,7 +75,7 @@ Get top-K recommendations for a single user from a specific model. ### POST /inference/predict/{model_id}/batch -Get recommendations for multiple users in a single request (max 100 users). +Get recommendations for multiple users in a single request (max 100 users). This is more efficient than making individual calls when you need to generate recommendations for a batch of users at once, such as for email campaigns or pre-computed recommendation feeds. **Request:** @@ -103,7 +114,7 @@ Users not found in the model return an empty `items` list (no error). ### POST /inference/predict/project/{project_id} -Get recommendations using the project's deployment slots. The inference service selects a model based on deployment slot weights, enabling A/B testing. 
+Get recommendations using the project's deployment slots. Instead of specifying a model directly, you point to a project and Recotem automatically selects which model to use based on your deployment slot weights. This is the recommended endpoint for production use, as it enables A/B testing and seamless model updates without changing your application code. See the [A/B Testing guide](ab-testing.md) for details on setting up experiments. **Request:** @@ -161,13 +172,15 @@ List currently loaded models (no authentication required). ## Model Hot-Swap -When a new model is trained via the backend, the training service publishes a `model_trained` event to Redis Pub/Sub (channel `recotem:model_events` on db 3). Each inference service replica independently receives the event and loads the new model in a background thread. +When you train a new model, you do not need to restart the inference service. Recotem automatically pushes model updates to the inference service in the background. + +Here is how it works: the backend publishes a `model_trained` event to Redis Pub/Sub (channel `recotem:model_events` on db 3). Each inference service replica independently receives the event and loads the new model in a background thread. This means: - No restart needed when models are updated - All replicas update independently -- Model loading happens in the background without blocking requests -- Old models remain available until replaced in the LRU cache +- Model loading happens in the background without blocking ongoing requests +- Old models remain available until they are evicted from the LRU cache ## Configuration diff --git a/docs/guide/projects.md b/docs/guide/projects.md new file mode 100644 index 00000000..eaea9fbf --- /dev/null +++ b/docs/guide/projects.md @@ -0,0 +1,195 @@ +# Projects + +A project is the foundation of everything you do in Recotem. It defines a single recommendation task and acts as a container for all related data, models, and configurations. 
+ +## What Is a Project? + +Think of a project as a self-contained recommendation workspace. For example, you might create separate projects for: + +- "Movie Recommendations" for a streaming platform +- "Product Suggestions" for an e-commerce store +- "Article Recommendations" for a news site + +Each project holds its own training data, tuning jobs, trained models, deployment slots, and API keys. + +## Why Projects Matter + +Projects serve two important purposes: + +1. **Column mapping** -- They define how Recotem reads your data by specifying which columns contain user identifiers, item identifiers, and timestamps. +2. **Isolation** -- Everything within a project is self-contained. Models trained in one project cannot accidentally be used in another. + +## Creating a Project + +### Prerequisites + +- You are logged in to Recotem +- You know the column names in your CSV data + +### Via the UI + +1. From the Dashboard, click **Create Project** +2. Fill in the following fields: + + + +| Field | Required | Description | +|-------|----------|-------------| +| **Name** | Yes | A descriptive name for this recommendation task. Must be unique among your projects. | +| **User Column** | Yes | The name of the column in your CSV files that identifies individual users (e.g., `user_id`, `customer_id`, `uid`). | +| **Item Column** | Yes | The name of the column in your CSV files that identifies individual items (e.g., `item_id`, `product_id`, `movie_id`). | +| **Time Column** | No | The name of the column for timestamps (e.g., `timestamp`, `date`, `created_at`). If provided, enables time-based data splitting. | + +3. 
Click **Save** + +### Via the API + +```bash +curl -X POST http://localhost:8000/api/v1/project/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "name": "Movie Recommendations", + "user_column": "user_id", + "item_column": "movie_id", + "time_column": "timestamp" + }' +``` + +**Response:** + +```json +{ + "id": 1, + "name": "Movie Recommendations", + "owner": 1, + "user_column": "user_id", + "item_column": "movie_id", + "time_column": "timestamp", + "ins_datetime": "2025-01-15T10:30:00Z", + "updated_at": "2025-01-15T10:30:00Z" +} +``` + +## Understanding Column Definitions + +### User Column + +This column identifies the people (or entities) who receive recommendations. Every row in your training data must have a value in this column. + +**Examples:** +- An e-commerce site might use `customer_id` +- A streaming service might use `user_id` +- A content platform might use `reader_id` + +### Item Column + +This column identifies the things being recommended. Every row in your training data must have a value in this column. + +**Examples:** +- An e-commerce site might use `product_id` or `sku` +- A streaming service might use `movie_id` or `show_id` +- A content platform might use `article_id` + +### Time Column (Optional) + +If your data includes timestamps, specifying a time column enables time-based data splitting strategies. This can lead to more realistic evaluation because the system tests on future interactions rather than randomly held-out ones. + +If you do not have timestamps in your data, leave this field blank. Random splitting will be used instead. + +## Viewing Project Details + +### Project Summary + +The project summary gives you a quick overview of the current state of your project. + +**Via the UI:** + +Navigate to the project and view the summary panel on the project page. 
+ + + +**Via the API:** + +```bash +curl http://localhost:8000/api/v1/project_summary/1/ \ + -H "Authorization: Bearer $TOKEN" +``` + +**Response:** + +```json +{ + "n_data": 3, + "n_complete_jobs": 2, + "n_models": 5, + "ins_datetime": "2025-01-15T10:30:00Z" +} +``` + +| Field | Description | +|-------|-------------| +| `n_data` | Number of training data files uploaded | +| `n_complete_jobs` | Number of completed tuning jobs | +| `n_models` | Number of trained models | +| `ins_datetime` | When the project was created | + +### Listing Projects + +**Via the API:** + +```bash +# List all your projects +curl http://localhost:8000/api/v1/project/ \ + -H "Authorization: Bearer $TOKEN" + +# Filter by name +curl "http://localhost:8000/api/v1/project/?name=Movie%20Recommendations" \ + -H "Authorization: Bearer $TOKEN" +``` + +## Updating a Project + +You can update a project's name or column definitions at any time. However, changing column definitions after uploading data may cause validation errors for future uploads if the new column names do not match your CSV files. + +**Via the API:** + +```bash +curl -X PATCH http://localhost:8000/api/v1/project/1/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"name": "Updated Movie Recommendations"}' +``` + +## Deleting a Project + +Deleting a project permanently removes it and all associated data, models, configurations, and API keys. + +**Via the API:** + +```bash +curl -X DELETE http://localhost:8000/api/v1/project/1/ \ + -H "Authorization: Bearer $TOKEN" +``` + +This action cannot be undone. Make sure you no longer need any of the project's data or models before deleting. + +## Project Ownership + +Each project belongs to the user who created it. 
This means: + +- You can only see and manage projects you own +- Admin users (staff) can see all projects +- Legacy projects created before multi-user support was added are visible to all authenticated users +- Project names must be unique per user -- different users can have projects with the same name + +## API Reference + +| Method | Endpoint | Description | +|--------|----------|-------------| +| `GET` | `/api/v1/project/` | List your projects | +| `POST` | `/api/v1/project/` | Create a new project | +| `GET` | `/api/v1/project/{id}/` | Get project details | +| `PATCH` | `/api/v1/project/{id}/` | Update a project | +| `DELETE` | `/api/v1/project/{id}/` | Delete a project | +| `GET` | `/api/v1/project_summary/{id}/` | Get project summary statistics | diff --git a/docs/guides/retraining.md b/docs/guide/retraining.md similarity index 65% rename from docs/guides/retraining.md rename to docs/guide/retraining.md index 3f6181c7..506913d7 100644 --- a/docs/guides/retraining.md +++ b/docs/guide/retraining.md @@ -2,6 +2,16 @@ Recotem supports automatic periodic retraining of recommendation models using cron-based schedules powered by django-celery-beat. +## When to Set Up Retraining + +Scheduled retraining is useful when: + +- **Your product gets new user interaction data regularly** -- as users click, purchase, and browse, the underlying data changes. Models trained on old data gradually become less accurate. Automatic retraining keeps your recommendations fresh. +- **You want models to stay up-to-date without manual intervention** -- instead of remembering to retrain models yourself, set a schedule and let Recotem handle it automatically. +- **You want to retrain on a regular cadence** -- for example, retrain every night so that today's user activity is reflected in tomorrow's recommendations, or retrain weekly if your data changes more slowly. 
+ +If your training data rarely changes or you prefer to retrain manually after each data upload, you may not need scheduled retraining. + ## Overview Each project can have one retraining schedule. When enabled, Celery Beat triggers a retraining task at the specified interval. The task can either: @@ -46,14 +56,30 @@ curl -X POST http://localhost:8000/api/v1/retraining_schedule/1/trigger/ \ ## Cron Expression Format -Standard 5-field cron syntax: `minute hour day_of_month month day_of_week` +Cron expressions tell Recotem when to run retraining. They use 5 fields separated by spaces: + +``` +minute hour day_of_month month day_of_week + 0 2 * * 0 +``` + +- **minute** (0-59) -- which minute of the hour +- **hour** (0-23) -- which hour of the day (24-hour format) +- **day_of_month** (1-31) -- which day of the month +- **month** (1-12) -- which month +- **day_of_week** (0-6) -- which day of the week (0 = Sunday, 1 = Monday, ..., 6 = Saturday) + +Use `*` to mean "every" and `*/N` to mean "every N units." 
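To make the five-field format concrete, here is a minimal Python sketch that expands each field of an expression into the values it matches. It is an illustration only, not how Recotem or django-celery-beat parses schedules: `FIELDS`, `expand_field`, and `parse_cron` are hypothetical names, and only `*`, `*/N`, single numbers, ranges (`a-b`), and comma-separated lists are handled.

```python
# Hypothetical helper for illustration; not part of Recotem.
FIELDS = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day_of_month", 1, 31),
    ("month", 1, 12),
    ("day_of_week", 0, 6),
]


def expand_field(spec, low, high):
    """Return the sorted values matched by one cron field."""
    values = set()
    for part in spec.split(","):
        step = 1
        has_step = "/" in part
        if has_step:
            part, step_text = part.split("/", 1)
            step = int(step_text)
            if step < 1:
                raise ValueError(f"step must be positive in {spec!r}")
        if part == "*":
            start, end = low, high
        elif "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        elif has_step:
            # Keep the sketch small: steps are only accepted on '*' or a range.
            raise ValueError(f"unsupported step form in {spec!r}")
        else:
            start = end = int(part)
        if not (low <= start <= end <= high):
            raise ValueError(f"{spec!r} is outside {low}-{high}")
        values.update(range(start, end + 1, step))
    return sorted(values)


def parse_cron(expression):
    """Split a 5-field expression and expand every field."""
    parts = expression.split()
    if len(parts) != len(FIELDS):
        raise ValueError("expected exactly 5 space-separated fields")
    return {
        name: expand_field(spec, low, high)
        for spec, (name, low, high) in zip(parts, FIELDS)
    }


schedule = parse_cron("0 */6 * * *")
print(schedule["hour"])
```

Running a candidate expression through a check like this before saving a schedule can catch a swapped field or an out-of-range value early.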
+ +Here are common schedules you can copy directly: -| Expression | Schedule | +| Expression | What it means | |-----------|---------| -| `0 2 * * 0` | Every Sunday at 2:00 AM | -| `0 3 * * *` | Every day at 3:00 AM | -| `0 */6 * * *` | Every 6 hours | -| `0 2 1 * *` | First day of each month at 2:00 AM | +| `0 2 * * 0` | Every Sunday at 2:00 AM -- good for weekly retraining | +| `0 3 * * *` | Every day at 3:00 AM -- good for daily retraining | +| `0 */6 * * *` | Every 6 hours (at 0:00, 6:00, 12:00, 18:00) -- for frequently changing data | +| `0 2 1 * *` | First day of each month at 2:00 AM -- for monthly retraining | +| `30 1 * * 1-5` | Weekdays at 1:30 AM -- skip weekends | ## Retraining Logic diff --git a/docs/guide/training.md b/docs/guide/training.md new file mode 100644 index 00000000..d87e663c --- /dev/null +++ b/docs/guide/training.md @@ -0,0 +1,281 @@ +# Training Models + +This guide explains how to train recommendation models from configurations, compare their performance, and prepare them for serving. + +## What Is Model Training? + +Model training is the process of building a recommendation model from a configuration and a dataset. The configuration specifies which algorithm to use and what settings to apply. The training data provides the user-item interactions that the model learns from. + +Once training is complete, the resulting model file can generate personalized recommendations for any user in the data. + +## Why Train Models? 
+ +- **Serve recommendations** -- A trained model is required before you can call the inference API +- **Compare approaches** -- Train multiple models with different configurations and compare their quality +- **Keep models fresh** -- Retrain periodically as new interaction data becomes available + +## Prerequisites + +- A project with training data uploaded (see [Data Management](data-management.md)) +- A model configuration, either from a completed tuning job (see [Tuning](tuning.md)) or created manually + +## Auto-Train vs Manual Train + +There are two ways to train a model: + +### Auto-Train (After Tuning) + +When you create a tuning job with `train_after_tuning` set to `true` (the default), the system automatically trains a model using the best configuration found during tuning. This is the simplest approach -- the tuning job produces a ready-to-use model without any extra steps. + +### Manual Train + +You can also train a model explicitly by specifying a configuration and a training dataset. This is useful when you want to: + +- Train with a different dataset than the one used for tuning +- Retrain a model with updated data +- Train from a manually created configuration + +## Training a Model + +### Via the UI + +1. Navigate to **Models** in the sidebar +2. Click **Train Model** +3. Select: + - The **model configuration** to use (browse configurations from tuning results or manually created ones) + - The **training data** to train on +4. Click **Start Training** + + + +Training runs in the background. The model detail page shows the current status. 
+ + + +### Via the API + +```bash +curl -X POST http://localhost:8000/api/v1/trained_model/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "configuration": 5, + "data_loc": 1 + }' +``` + +| Field | Required | Description | +|-------|----------|-------------| +| `configuration` | Yes | ID of the model configuration to use | +| `data_loc` | Yes | ID of the training data to train on | + +**Response:** + +```json +{ + "id": 3, + "configuration": 5, + "data_loc": 1, + "file": null, + "irspack_version": null, + "ins_datetime": "2025-01-15T12:00:00Z", + "basename": null, + "filesize": null, + "task_links": [] +} +``` + +The `file`, `irspack_version`, `basename`, and `filesize` fields are populated after training completes. + +**Note:** The configuration and training data must belong to the same project. The API returns an error if they belong to different projects. + +## Monitoring Training + +Training runs as a background task. You can check progress: + +### Via the UI + +The model detail page updates in real time via WebSocket, showing the current training status. + + + +### Via the API + +```bash +curl http://localhost:8000/api/v1/trained_model/3/ \ + -H "Authorization: Bearer $TOKEN" +``` + +When training is complete, the response includes the model file details: + +```json +{ + "id": 3, + "configuration": 5, + "data_loc": 1, + "file": "/media/trained_models/model_3.pkl", + "irspack_version": "0.4.0", + "ins_datetime": "2025-01-15T12:00:00Z", + "basename": "model_3.pkl", + "filesize": 2456789, + "task_links": [ + { + "task": { + "task_id": "abc123", + "status": "SUCCESS" + } + } + ] +} +``` + +You can also view detailed training logs: + +```bash +curl "http://localhost:8000/api/v1/task_log/?model_id=3" \ + -H "Authorization: Bearer $TOKEN" +``` + +## Model File Security + +Every trained model file is signed with HMAC-SHA256 using the application's secret key. 
This ensures that: + +- Model files have not been tampered with +- Only models created by your Recotem instance can be loaded +- Corrupted files are detected before they are used for predictions + +This signing happens automatically -- you do not need to do anything special. + +## Comparing Models + +When you have multiple trained models, you can compare them by looking at the tuning scores and the configurations used. + +### Listing Models + +**Via the UI:** + +The Models page shows all trained models for your project with their configuration details. + + + +**Via the API:** + +```bash +# List all models for a project +curl "http://localhost:8000/api/v1/trained_model/?data_loc__project=1" \ + -H "Authorization: Bearer $TOKEN" +``` + +### Comparing Recommendation Quality + +You can test a model by getting sample recommendations: + +```bash +# Get sample recommendations from a model (randomly selects a user) +curl http://localhost:8000/api/v1/trained_model/3/sample_recommendation_raw/ \ + -H "Authorization: Bearer $TOKEN" +``` + +**Response:** + +```json +{ + "user_id": "42", + "user_profile": ["101", "203", "305"], + "recommendations": [ + {"item_id": "420", "score": 0.95}, + {"item_id": "112", "score": 0.87}, + {"item_id": "550", "score": 0.82} + ] +} +``` + +This shows: +- `user_id` -- The randomly selected user +- `user_profile` -- Items the user has previously interacted with +- `recommendations` -- The model's top recommendations for this user + +If you have uploaded item metadata, you can get enriched sample recommendations: + +```bash +curl http://localhost:8000/api/v1/trained_model/3/sample_recommendation_metadata/1/ \ + -H "Authorization: Bearer $TOKEN" +``` + +This replaces raw item IDs with metadata (such as titles and categories), making it easier to visually assess recommendation quality. 
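The idea behind the enriched view can be sketched on the client side in Python. This is an assumption-laden illustration: the exact response shape of the metadata endpoint is not shown in this guide, so the sketch simply joins the raw recommendation format with a hypothetical `metadata_by_item` lookup. The function name `enrich` and the `title` and `category` fields are made up for the example.

```python
# Hypothetical client-side join; the real endpoint may return a different shape.
def enrich(recommendations, metadata_by_item):
    """Attach metadata fields to each recommendation, keeping item_id and score."""
    enriched = []
    for rec in recommendations:
        # Items without metadata are kept unchanged rather than dropped.
        details = metadata_by_item.get(rec["item_id"], {})
        enriched.append({**rec, **details})
    return enriched


recommendations = [
    {"item_id": "420", "score": 0.95},
    {"item_id": "112", "score": 0.87},
]
metadata_by_item = {
    "420": {"title": "Example Movie A", "category": "Drama"},  # made-up values
}

for row in enrich(recommendations, metadata_by_item):
    print(row)
```

Reading titles and categories next to scores usually makes it quicker to judge whether a list of recommendations looks plausible for the sampled user.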
+ +### Getting Recommendations for a Specific User + +```bash +curl "http://localhost:8000/api/v1/trained_model/3/recommendation/?user_id=42&cutoff=10" \ + -H "Authorization: Bearer $TOKEN" +``` + +**Response:** + +```json +[ + {"item_id": "420", "score": 0.95}, + {"item_id": "112", "score": 0.87}, + {"item_id": "550", "score": 0.82} +] +``` + +### New User Recommendations + +You can also get recommendations for users who are not in the training data by providing a list of items they have interacted with: + +```bash +curl -X POST http://localhost:8000/api/v1/trained_model/3/recommend_using_profile_interaction/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "item_ids": ["101", "203"], + "cutoff": 10 + }' +``` + +**Response:** + +```json +{ + "recommendations": [ + {"item_id": "305", "score": 0.91}, + {"item_id": "420", "score": 0.85} + ] +} +``` + +## Deleting Models + +Models can be deleted when they are no longer needed: + +```bash +curl -X DELETE http://localhost:8000/api/v1/trained_model/3/ \ + -H "Authorization: Bearer $TOKEN" +``` + +Be careful not to delete models that are assigned to active deployment slots. 
+ +## Next Steps + +Once you have a trained model you are satisfied with: + +- **[Deployment Slots](deployment-slots.md)** -- Assign the model to a deployment slot for production serving +- **[Inference API](inference.md)** -- Call the inference API to get recommendations +- **[A/B Testing](ab-testing.md)** -- Compare models using real user interactions +- **[Scheduled Retraining](retraining.md)** -- Set up automatic retraining to keep models fresh + +## API Reference + +| Method | Endpoint | Description | +|--------|----------|-------------| +| `GET` | `/api/v1/trained_model/?data_loc__project={id}` | List models for a project | +| `POST` | `/api/v1/trained_model/` | Train a new model | +| `GET` | `/api/v1/trained_model/{id}/` | Get model details | +| `DELETE` | `/api/v1/trained_model/{id}/` | Delete a model | +| `GET` | `/api/v1/trained_model/{id}/download/` | Download the model file | +| `GET` | `/api/v1/trained_model/{id}/sample_recommendation_raw/` | Get sample recommendations | +| `GET` | `/api/v1/trained_model/{id}/sample_recommendation_metadata/{meta_id}/` | Get sample recommendations with metadata | +| `GET` | `/api/v1/trained_model/{id}/recommendation/?user_id={uid}&cutoff={n}` | Get recommendations for a specific user | +| `POST` | `/api/v1/trained_model/{id}/recommend_using_profile_interaction/` | Get recommendations for a new user profile | diff --git a/docs/guide/tuning.md b/docs/guide/tuning.md new file mode 100644 index 00000000..be91e6b9 --- /dev/null +++ b/docs/guide/tuning.md @@ -0,0 +1,310 @@ +# Hyperparameter Tuning + +This guide explains how Recotem automatically finds the best recommendation algorithm and settings for your data. + +## What Is Hyperparameter Tuning? + +Recommendation algorithms have settings (called hyperparameters) that control how they learn from data. For example, one setting might control how many hidden factors to use, while another controls the learning rate. 
+ +Finding the right combination of algorithm and settings can be tedious if done manually. Recotem automates this process using a technique called hyperparameter optimization, powered by Optuna. It systematically tries different algorithms and settings, evaluates each one, and keeps track of the best-performing configuration. + +## Why Tune? + +- **Better recommendations** -- The right settings can dramatically improve recommendation quality +- **No ML expertise needed** -- You do not need to understand the algorithms in detail; the system finds good settings automatically +- **Reproducible results** -- Tuning jobs record every configuration tried, so results can be reproduced + +## Prerequisites + +Before running a tuning job, you need: + +1. A **project** with training data uploaded (see [Getting Started](getting-started.md)) +2. A **split config** that defines how to divide data for evaluation +3. An **evaluation config** that defines which metric to optimize + +## Step 1: Create a Split Config + +A split config tells Recotem how to split your data into a training set and a test set. The training set is used to build the model, and the test set is used to measure how well the model predicts interactions it has not seen. + +### Split Schemes + +| Scheme | Code | How It Works | Best For | +|--------|------|-------------|----------| +| **Random** | `RG` | Randomly selects a fraction of each user's interactions for testing | General-purpose; works with any data | +| **Time Global** | `TG` | Uses the most recent interactions (by timestamp) across all users for testing | Data with timestamps where you want to simulate predicting future behavior | +| **Time User** | `TU` | Uses the most recent interactions per user for testing | Data with timestamps where each user has enough history | + +### Key Settings + +| Setting | Description | Default | +|---------|-------------|---------| +| **Heldout Ratio** | Fraction of interactions held out for testing (0.0 to 1.0). 
A value of 0.1 means 10% of data is used for testing. | `0.1` | +| **Test User Ratio** | Fraction of users included in the evaluation (0.0 to 1.0). A value of 1.0 means all users are evaluated. | `1.0` | +| **Random Seed** | A number that ensures the same split is produced every time. Use the same seed for reproducible comparisons. | `42` | + +### Creating via the UI + +1. Navigate to **Split Config** in the sidebar +2. Click **Create** +3. Choose a scheme and adjust settings as needed +4. Click **Save** + + + +### Creating via the API + +```bash +curl -X POST http://localhost:8000/api/v1/split_config/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "name": "80/20 Random Split", + "scheme": "RG", + "heldout_ratio": 0.2, + "test_user_ratio": 1.0, + "random_seed": 42 + }' +``` + +## Step 2: Create an Evaluation Config + +An evaluation config defines how model quality is measured. + +### Metrics + +| Metric | Description | When to Use | +|--------|-------------|-------------| +| **NDCG** (Normalized Discounted Cumulative Gain) | Measures ranking quality, giving more credit to relevant items ranked higher in the list | Default choice; good for most use cases | +| **MAP** (Mean Average Precision) | Average precision at each position where a relevant item appears | When you care about precision at every position | +| **Recall** | Fraction of all relevant items that appear in the top-K recommendations | When you want to maximize coverage of relevant items | +| **Hit** (Hit Rate) | Whether at least one relevant item appears in the top-K | When any relevant recommendation is a success | + +### Cutoff + +The cutoff determines how many top items to evaluate. For example, a cutoff of 20 means the system checks whether relevant items appear in the top 20 recommendations. Choose a cutoff that matches how many items you plan to show to users. + +### Creating via the UI + +1. Navigate to **Evaluation Config** in the sidebar +2. 
Click **Create** +3. Select a metric and set the cutoff +4. Click **Save** + + + +### Creating via the API + +```bash +curl -X POST http://localhost:8000/api/v1/evaluation_config/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "name": "NDCG@20", + "cutoff": 20, + "target_metric": "ndcg" + }' +``` + +## Step 3: Run a Tuning Job + +### Creating via the UI + +1. Navigate to **Tuning Jobs** in the sidebar +2. Click **Create Tuning Job** +3. Select your training data, split config, and evaluation config +4. Adjust tuning settings if desired +5. Click **Start** + + + +### Creating via the API + +```bash +curl -X POST http://localhost:8000/api/v1/parameter_tuning_job/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "data": 1, + "split": 1, + "evaluation": 1, + "n_trials": 40, + "train_after_tuning": true + }' +``` + +### Tuning Job Settings + +| Setting | Description | Default | +|---------|-------------|---------| +| **Number of Trials** (`n_trials`) | How many different configurations to try. More trials may find better settings but take longer. | `40` | +| **Train After Tuning** (`train_after_tuning`) | Automatically train a model using the best configuration found. | `true` | +| **Memory Budget** (`memory_budget`) | Maximum memory (in MB) available for the tuning process. | `8000` | +| **Timeout Overall** (`timeout_overall`) | Maximum time (in seconds) for the entire tuning job. Leave empty for no limit. | None | +| **Timeout Single Step** (`timeout_singlestep`) | Maximum time (in seconds) for a single trial. | None | +| **Random Seed** (`random_seed`) | Seed for reproducibility. | None | +| **Tried Algorithms** (`tried_algorithms_json`) | A list of specific algorithm names to try. If empty, the system tries a default set. | None (uses defaults) | + +## Available Algorithms + +Recotem uses the irspack library, which includes several recommendation algorithms. 
By default, the tuning process tries multiple algorithms and picks the best one. Common algorithms include: + +| Algorithm | Description | +|-----------|-------------| +| **IALSRecommender** | Implicit Alternating Least Squares -- a widely used collaborative filtering method that works well with implicit feedback data (views, clicks, purchases) | +| **CosineKNNRecommender** | K-Nearest Neighbors with cosine similarity -- recommends items similar to what a user has interacted with | +| **TopPopRecommender** | Recommends the most popular items -- serves as a simple baseline | +| **AsymmetricCosineKNNRecommender** | A variant of KNN that considers directional similarity between items | +| **TverskyIndexKNNRecommender** | KNN using Tversky similarity, which allows asymmetric comparison | +| **DenseSLIMRecommender** | A linear model approach that learns item-to-item weights | +| **P3alphaRecommender** | A graph-based method that uses random walks on the user-item graph | +| **RP3betaRecommender** | An extension of P3alpha that accounts for item popularity | + +You do not need to choose an algorithm manually. The tuning process explores multiple algorithms and their hyperparameters automatically. + +To restrict tuning to specific algorithms, pass them in the `tried_algorithms_json` field: + +```bash +curl -X POST http://localhost:8000/api/v1/parameter_tuning_job/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "data": 1, + "split": 1, + "evaluation": 1, + "n_trials": 40, + "tried_algorithms_json": ["IALSRecommender", "CosineKNNRecommender"] + }' +``` + +## Monitoring Progress + +### Via the UI + +The tuning job detail page shows real-time progress. As each trial completes, you can see: + +- The algorithm and parameters tried +- The score achieved +- Whether it is the best configuration so far + +Updates are delivered in real time via WebSocket, so you do not need to refresh the page. 
+ + + +### Via the API + +Check the job status: + +```bash +curl http://localhost:8000/api/v1/parameter_tuning_job/1/ \ + -H "Authorization: Bearer $TOKEN" +``` + +**Job statuses:** + +| Status | Meaning | +|--------|---------| +| `PENDING` | Job is queued and waiting to start | +| `RUNNING` | Tuning is in progress | +| `COMPLETED` | Tuning finished successfully | +| `FAILED` | An error occurred during tuning | + +You can also view the task logs for detailed output: + +```bash +curl "http://localhost:8000/api/v1/task_log/?tuning_job_id=1" \ + -H "Authorization: Bearer $TOKEN" +``` + +## Understanding Results + +When a tuning job completes, the results include: + +- **Best Configuration** (`best_config`) -- The ID of the model configuration that achieved the highest score. This configuration is saved automatically and can be used for training. +- **Best Score** (`best_score`) -- The value of the target metric achieved by the best configuration. +- **Tuned Model** (`tuned_model`) -- If "Train after tuning" was enabled, this is the ID of the model trained with the best configuration. 
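As a rough sketch of how a script might act on these fields, the Python function below decides what to do with a job payload. It assumes the payload is the parsed JSON of a tuning job and that `tuned_model` is `null` when no model was trained automatically; `next_action` is a hypothetical helper, not part of Recotem.

```python
# Hypothetical helper; field names follow the tuning job fields described above.
def next_action(job):
    """Suggest the next step for a tuning job payload (a parsed JSON dict)."""
    status = job.get("status")
    if status in ("PENDING", "RUNNING"):
        return "wait"
    if status == "FAILED":
        return "inspect task logs"
    if status != "COMPLETED":
        raise ValueError(f"unexpected status: {status!r}")
    if job.get("tuned_model") is not None:
        # Auto-train produced a model, so it can be used directly.
        return f"use trained model {job['tuned_model']}"
    # No model yet: train manually from the saved best configuration.
    return f"train manually from configuration {job['best_config']}"


job = {"status": "COMPLETED", "best_score": 0.342, "best_config": 5, "tuned_model": 3}
print(next_action(job))
```

Keeping the decision in one small function makes it easy to reuse in a polling loop or a scheduled script.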
+ +```bash +curl http://localhost:8000/api/v1/parameter_tuning_job/1/ \ + -H "Authorization: Bearer $TOKEN" +``` + +**Example response (completed job):** + +```json +{ + "id": 1, + "status": "COMPLETED", + "best_score": 0.342, + "best_config": 5, + "tuned_model": 3, + "n_trials": 40, + "data": 1, + "split": 1, + "evaluation": 1, + "train_after_tuning": true, + "ins_datetime": "2025-01-15T11:00:00Z" +} +``` + +You can then view the best configuration's details: + +```bash +curl http://localhost:8000/api/v1/model_configuration/5/ \ + -H "Authorization: Bearer $TOKEN" +``` + +```json +{ + "id": 5, + "name": "IALSRecommender-best", + "project": 1, + "recommender_class_name": "IALSRecommender", + "parameters_json": { + "n_components": 64, + "alpha": 1.0, + "reg": 0.01 + }, + "ins_datetime": "2025-01-15T11:30:00Z" +} +``` + +## API Reference + +### Split Config + +| Method | Endpoint | Description | +|--------|----------|-------------| +| `GET` | `/api/v1/split_config/` | List split configs | +| `POST` | `/api/v1/split_config/` | Create a split config | +| `GET` | `/api/v1/split_config/{id}/` | Get details | +| `PATCH` | `/api/v1/split_config/{id}/` | Update | +| `DELETE` | `/api/v1/split_config/{id}/` | Delete | + +### Evaluation Config + +| Method | Endpoint | Description | +|--------|----------|-------------| +| `GET` | `/api/v1/evaluation_config/` | List evaluation configs | +| `POST` | `/api/v1/evaluation_config/` | Create an evaluation config | +| `GET` | `/api/v1/evaluation_config/{id}/` | Get details | +| `PATCH` | `/api/v1/evaluation_config/{id}/` | Update | +| `DELETE` | `/api/v1/evaluation_config/{id}/` | Delete | + +### Tuning Jobs + +| Method | Endpoint | Description | +|--------|----------|-------------| +| `GET` | `/api/v1/parameter_tuning_job/` | List tuning jobs | +| `POST` | `/api/v1/parameter_tuning_job/` | Create and start a tuning job | +| `GET` | `/api/v1/parameter_tuning_job/{id}/` | Get job details and results | +| `DELETE` | 
`/api/v1/parameter_tuning_job/{id}/` | Delete a tuning job | + +### Model Configurations + +| Method | Endpoint | Description | +|--------|----------|-------------| +| `GET` | `/api/v1/model_configuration/?project={id}` | List configurations for a project | +| `POST` | `/api/v1/model_configuration/` | Create a configuration manually | +| `GET` | `/api/v1/model_configuration/{id}/` | Get configuration details | +| `PATCH` | `/api/v1/model_configuration/{id}/` | Update a configuration | +| `DELETE` | `/api/v1/model_configuration/{id}/` | Delete a configuration | diff --git a/docs/guide/user-management.md b/docs/guide/user-management.md new file mode 100644 index 00000000..8e882cf3 --- /dev/null +++ b/docs/guide/user-management.md @@ -0,0 +1,265 @@ +# User Management + +This guide covers how to manage users in Recotem, including creating accounts, assigning roles, and changing passwords. + +## What You Need to Know + +Recotem supports multiple users with role-based access. Each user has their own projects, data, and models. Admin users can manage other users and see all resources across the system. + +## User Roles + +Recotem has two user roles: + +| Role | Description | Capabilities | +|------|-------------|-------------| +| **Regular user** | A standard account for everyday use | Create and manage their own projects, data, models, and API keys | +| **Admin** (staff) | An administrator with elevated privileges | Everything a regular user can do, plus: manage other users, see all projects, and access the Django admin panel | + +## Data Ownership + +Resources in Recotem are owned by the user who created them: + +- **Projects** belong to their creator. Each user can only see and manage their own projects. +- **Training data, models, and configurations** are visible based on the project they belong to. +- **Admin users** can see all resources across all users. 
+- **Legacy resources** created before multi-user support was added (with no owner) are visible to all authenticated users. + +## Creating Users + +Only admin users can create new user accounts. + +### Via the UI + +1. Navigate to the **User Management** page (available in the admin sidebar) +2. Click **Create User** +3. Fill in the user details: + + + +| Field | Required | Description | +|-------|----------|-------------| +| **Username** | Yes | A unique username for login | +| **Email** | No | The user's email address | +| **Password** | Yes | Must meet Django's password validation rules (minimum 8 characters, not too common, not entirely numeric) | +| **Admin (Staff)** | No | Whether this user has admin privileges. Defaults to regular user. | + +4. Click **Create** + +### Via the API + +```bash +curl -X POST http://localhost:8000/api/v1/users/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "username": "alice", + "email": "alice@example.com", + "password": "secure_password_123", + "is_staff": false + }' +``` + +**Response:** + +```json +{ + "id": 2, + "username": "alice", + "email": "alice@example.com", + "is_staff": false, + "is_active": true, + "date_joined": "2025-01-15T15:00:00Z", + "last_login": null +} +``` + +The password is never included in the response. 
+ +## Listing Users + +Admin users can view all user accounts: + +### Via the API + +```bash +curl http://localhost:8000/api/v1/users/ \ + -H "Authorization: Bearer $TOKEN" +``` + +**Response:** + +```json +[ + { + "id": 1, + "username": "admin", + "email": "admin@example.com", + "is_staff": true, + "is_active": true, + "date_joined": "2025-01-01T00:00:00Z", + "last_login": "2025-01-15T10:00:00Z" + }, + { + "id": 2, + "username": "alice", + "email": "alice@example.com", + "is_staff": false, + "is_active": true, + "date_joined": "2025-01-15T15:00:00Z", + "last_login": null + } +] +``` + +## Updating User Details + +Admin users can update another user's email, staff status, or active status: + +```bash +curl -X PATCH http://localhost:8000/api/v1/users/2/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "email": "alice.new@example.com", + "is_staff": true + }' +``` + +Note that the username cannot be changed after creation. + +## Changing Passwords + +### Self-Service Password Change + +Any logged-in user can change their own password without needing admin help. This is available to all users regardless of role. + +**Via the UI:** + +1. Click on your username or profile area +2. Select **Change Password** +3. Enter your current password and your new password +4. Click **Save** + + + +**Via the API:** + +```bash +curl -X POST http://localhost:8000/api/v1/users/change_password/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "old_password": "current_password", + "new_password": "new_secure_password_456" + }' +``` + +**Response:** + +```json +{ + "detail": "Password changed successfully." 
+} +``` + +The new password must meet Django's password validation requirements: +- At least 8 characters long +- Not too similar to your username or email +- Not a commonly used password +- Not entirely numeric + +### Admin Password Reset + +Admin users can reset another user's password when the user has forgotten it or needs to be locked out and given new credentials. + +**Via the API:** + +```bash +curl -X POST http://localhost:8000/api/v1/users/2/reset_password/ \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "new_password": "temporary_password_789" + }' +``` + +**Response:** + +```json +{ + "detail": "Password has been reset." +} +``` + +After resetting, communicate the new password to the user through a secure channel. They should change it immediately after logging in. + +## Deactivating and Activating Users + +Instead of deleting user accounts, Recotem uses a soft-delete approach. Deactivating a user prevents them from logging in while preserving all their data. + +### Deactivating a User + +```bash +curl -X POST http://localhost:8000/api/v1/users/2/deactivate/ \ + -H "Authorization: Bearer $TOKEN" +``` + +**Response:** + +```json +{ + "id": 2, + "username": "alice", + "email": "alice@example.com", + "is_staff": false, + "is_active": false, + "date_joined": "2025-01-15T15:00:00Z", + "last_login": "2025-01-15T16:00:00Z" +} +``` + +A deactivated user: +- Cannot log in +- Cannot use API keys +- Their projects and data remain intact +- Can be reactivated at any time + +**Note:** You cannot deactivate your own account. This prevents accidentally locking yourself out. + +### Reactivating a User + +```bash +curl -X POST http://localhost:8000/api/v1/users/2/activate/ \ + -H "Authorization: Bearer $TOKEN" +``` + +The user can immediately log in again after reactivation. 
+ +## Initial Admin Account + +When Recotem is deployed for the first time, an admin account is created automatically using the `DEFAULT_ADMIN_PASSWORD` environment variable. The default username is `admin`. + +For production deployments, make sure to: + +1. Set a strong `DEFAULT_ADMIN_PASSWORD` before first deployment +2. Change the admin password after first login +3. Create individual user accounts for team members instead of sharing the admin account + +## Security Considerations + +- **API keys cannot manage users** -- User management endpoints reject API key authentication entirely. Only JWT-authenticated admin users can manage accounts. +- **Password validation** -- All passwords are validated against Django's built-in password validators (minimum length, complexity, common password check). +- **Session security** -- After changing a password, existing JWT tokens remain valid until they expire. For immediate revocation, the user should log out of all sessions. + +## API Reference + +| Method | Endpoint | Description | Who Can Use | +|--------|----------|-------------|-------------| +| `GET` | `/api/v1/users/` | List all users | Admin only | +| `POST` | `/api/v1/users/` | Create a new user | Admin only | +| `GET` | `/api/v1/users/{id}/` | Get user details | Admin only | +| `PATCH` | `/api/v1/users/{id}/` | Update user (email, staff status) | Admin only | +| `POST` | `/api/v1/users/{id}/deactivate/` | Deactivate a user | Admin only | +| `POST` | `/api/v1/users/{id}/activate/` | Reactivate a user | Admin only | +| `POST` | `/api/v1/users/{id}/reset_password/` | Reset another user's password | Admin only | +| `POST` | `/api/v1/users/change_password/` | Change your own password | Any logged-in user | diff --git a/docs/guides/api-keys.md b/docs/guides/api-keys.md deleted file mode 100644 index deac3e01..00000000 --- a/docs/guides/api-keys.md +++ /dev/null @@ -1,107 +0,0 @@ -# API Key Authentication - -API keys provide programmatic access to Recotem resources. 
Keys are scoped to a project and support granular permissions. - -## Overview - -- Keys are prefixed with `rctm_` for easy identification -- Each key is tied to a specific project -- Permissions are controlled via scopes: `read`, `write`, `predict` -- Keys are hashed before storage (the full key is shown only once at creation) -- Keys can have optional expiration dates - -## Creating an API Key - -### Via UI - -1. Navigate to your project -2. Go to **API Keys** in the sidebar -3. Click **Create API Key** -4. Enter a name and select scopes -5. Copy the displayed key immediately — it will not be shown again - -### Via API - -```bash -curl -X POST http://localhost:8000/api/v1/api_keys/ \ - -H "Authorization: Bearer " \ - -H "Content-Type: application/json" \ - -d '{ - "name": "Production Service", - "project": 1, - "scopes": ["predict"] - }' -``` - -**Response:** - -```json -{ - "id": 1, - "name": "Production Service", - "project": 1, - "key_prefix": "rctm_abc1", - "scopes": ["predict"], - "is_active": true, - "expires_at": null, - "last_used_at": null, - "key": "rctm_abc1defg2hijklmn3opqrstu4vwxyz..." -} -``` - -The `key` field is only included in the creation response. Store it securely. - -## Using an API Key - -Pass the key in the `X-API-Key` header: - -```bash -curl -H "X-API-Key: rctm_your_key_here" \ - http://localhost:8000/inference/predict/1 \ - -d '{"user_id": "42", "cutoff": 10}' -``` - -API keys work with both the management API (`/api/v1/`) and the inference API (`/inference/`). - -## Scopes - -| Scope | Grants access to | -|-------|-----------------| -| `read` | Read project data, models, configurations | -| `write` | Create/update training data, configurations, models | -| `predict` | Call inference endpoints, record conversion events | - -A key can have multiple scopes. For inference-only integrations, use `["predict"]`. 
- -## Managing Keys - -### List Keys - -```bash -curl -H "Authorization: Bearer " \ - "http://localhost:8000/api/v1/api_keys/?project=1" -``` - -### Revoke a Key - -Revoking deactivates a key without deleting it: - -```bash -curl -X POST http://localhost:8000/api/v1/api_keys/1/revoke/ \ - -H "Authorization: Bearer " -``` - -### Delete a Key - -```bash -curl -X DELETE http://localhost:8000/api/v1/api_keys/1/ \ - -H "Authorization: Bearer " -``` - -## Security - -- Keys are stored as hashed values using Django's `make_password` (PBKDF2-SHA256) -- The first 8 characters (prefix) are stored in plaintext for database lookup -- Full keys are never stored and cannot be recovered -- The `last_used_at` field is updated on each successful authentication -- Set `expires_at` for time-limited access diff --git a/docs/specification/api-reference.md b/docs/specification/api-reference.md new file mode 100644 index 00000000..1d888910 --- /dev/null +++ b/docs/specification/api-reference.md @@ -0,0 +1,434 @@ +# API Reference Specification + +## Overview + +Recotem exposes two APIs: a **Management API** (Django REST Framework) for managing projects, data, models, and configuration; and an **Inference API** (FastAPI) for real-time recommendation serving. Both APIs support authentication via API keys (`X-API-Key` header) and the Management API additionally supports JWT tokens. 
+ +## Base Paths + +| API | Base Path | Backward Compat | Framework | +|---|---|---|---| +| Management API | `/api/v1/` | `/api/` (deprecated) | Django REST Framework | +| Inference API | `/inference/` | -- | FastAPI | +| Django Admin | `/admin/` | -- | Django Admin | +| OpenAPI Schema | `/api/v1/schema/` | -- | drf-spectacular | + +## Authentication + +### JWT Authentication + +- **Obtain tokens**: `POST /api/v1/auth/login/` with `{"username": "...", "password": "..."}` +- **Response**: `{"access": "...", "refresh": "...", "user": {...}}` +- **Usage**: `Authorization: Bearer ` +- **Access token lifetime**: Configurable via `ACCESS_TOKEN_LIFETIME` env var (default 300 seconds) +- **Refresh token lifetime**: 1 day + +### API Key Authentication + +- **Header**: `X-API-Key: rctm_` +- **Scopes**: `read` (GET/HEAD/OPTIONS), `write` (POST/PUT/PATCH/DELETE), `predict` (inference) +- **Project-scoped**: Each API key is bound to a specific project + +### Session Authentication + +- **Cookie-based**: Django session auth (used by Django Admin) + +### Authentication Priority + +DRF evaluates authentication classes in order: +1. `ApiKeyAuthentication` (X-API-Key header) +2. `JWTAuthentication` (Authorization: Bearer) +3. `SessionAuthentication` (session cookie) + +## Management API Endpoints + +All management endpoints require authentication. API key users need appropriate scopes (`read` for safe methods, `write` for unsafe methods). Endpoints use `PageNumberPagination` with a default page size of 20. 
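The scope rule for API key users (`read` for safe methods, `write` for unsafe methods) can be illustrated with a small sketch. This is only an illustration of the rule as stated in this specification, not Recotem's actual permission class; the names `required_scope` and `key_allows` are hypothetical.

```python
# Method groups as listed for the `read` and `write` scopes in this spec.
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})
UNSAFE_METHODS = frozenset({"POST", "PUT", "PATCH", "DELETE"})


def required_scope(method: str) -> str:
    """Return the API key scope a Management API request would need."""
    normalized = method.upper()
    if normalized in SAFE_METHODS:
        return "read"
    if normalized in UNSAFE_METHODS:
        return "write"
    raise ValueError(f"unexpected HTTP method: {method}")


def key_allows(scopes: list[str], method: str) -> bool:
    """Check whether a key holding `scopes` may issue `method`."""
    return required_scope(method) in scopes


# A predict-only key cannot list projects; a read key can.
print(key_allows(["predict"], "GET"), key_allows(["read"], "GET"))
```

Under this rule, a key created with only the `predict` scope is rejected by every Management API endpoint, which matches the recommendation to use `["predict"]` for inference-only integrations.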
+ +### Projects + +| Method | Endpoint | Description | Auth | +|---|---|---|---| +| `GET` | `/api/v1/project/` | List projects owned by or shared with the user | JWT, API Key (read) | +| `POST` | `/api/v1/project/` | Create a new project | JWT, API Key (write) | +| `GET` | `/api/v1/project/{id}/` | Retrieve project details | JWT, API Key (read) | +| `PUT` | `/api/v1/project/{id}/` | Update a project | JWT, API Key (write) | +| `PATCH` | `/api/v1/project/{id}/` | Partial update a project | JWT, API Key (write) | +| `DELETE` | `/api/v1/project/{id}/` | Delete a project | JWT, API Key (write) | + +**Filter fields**: Filterable via DjangoFilterBackend. + +### Project Summary + +| Method | Endpoint | Description | Auth | +|---|---|---|---| +| `GET` | `/api/v1/project_summary/{id}/` | Get aggregated project statistics | JWT | + +### Training Data + +| Method | Endpoint | Description | Auth | +|---|---|---|---| +| `GET` | `/api/v1/training_data/` | List training data files | JWT, API Key (read) | +| `POST` | `/api/v1/training_data/` | Upload training data CSV | JWT, API Key (write) | +| `GET` | `/api/v1/training_data/{id}/` | Retrieve training data details | JWT, API Key (read) | +| `DELETE` | `/api/v1/training_data/{id}/` | Delete training data | JWT, API Key (write) | + +**Filter fields**: `project` + +### Item Metadata + +| Method | Endpoint | Description | Auth | +|---|---|---|---| +| `GET` | `/api/v1/item_meta_data/` | List item metadata files | JWT, API Key (read) | +| `POST` | `/api/v1/item_meta_data/` | Upload item metadata | JWT, API Key (write) | +| `GET` | `/api/v1/item_meta_data/{id}/` | Retrieve item metadata details | JWT, API Key (read) | +| `DELETE` | `/api/v1/item_meta_data/{id}/` | Delete item metadata | JWT, API Key (write) | + +**Filter fields**: `project` + +### Split Configuration + +| Method | Endpoint | Description | Auth | +|---|---|---|---| +| `GET` | `/api/v1/split_config/` | List split configurations | JWT, API Key (read) | +| `POST` | 
`/api/v1/split_config/` | Create split configuration | JWT, API Key (write) | +| `GET` | `/api/v1/split_config/{id}/` | Retrieve split configuration | JWT, API Key (read) | +| `PUT` | `/api/v1/split_config/{id}/` | Update split configuration | JWT, API Key (write) | +| `PATCH` | `/api/v1/split_config/{id}/` | Partial update | JWT, API Key (write) | +| `DELETE` | `/api/v1/split_config/{id}/` | Delete split configuration | JWT, API Key (write) | + +### Evaluation Configuration + +| Method | Endpoint | Description | Auth | +|---|---|---|---| +| `GET` | `/api/v1/evaluation_config/` | List evaluation configurations | JWT, API Key (read) | +| `POST` | `/api/v1/evaluation_config/` | Create evaluation configuration | JWT, API Key (write) | +| `GET` | `/api/v1/evaluation_config/{id}/` | Retrieve evaluation configuration | JWT, API Key (read) | +| `PUT` | `/api/v1/evaluation_config/{id}/` | Update evaluation configuration | JWT, API Key (write) | +| `PATCH` | `/api/v1/evaluation_config/{id}/` | Partial update | JWT, API Key (write) | +| `DELETE` | `/api/v1/evaluation_config/{id}/` | Delete evaluation configuration | JWT, API Key (write) | + +### Parameter Tuning Jobs + +| Method | Endpoint | Description | Auth | +|---|---|---|---| +| `GET` | `/api/v1/parameter_tuning_job/` | List tuning jobs | JWT, API Key (read) | +| `POST` | `/api/v1/parameter_tuning_job/` | Create and start a tuning job | JWT, API Key (write) | +| `GET` | `/api/v1/parameter_tuning_job/{id}/` | Retrieve tuning job details | JWT, API Key (read) | +| `DELETE` | `/api/v1/parameter_tuning_job/{id}/` | Delete tuning job | JWT, API Key (write) | + +**Filter fields**: `data`, `status` + +### Model Configuration + +| Method | Endpoint | Description | Auth | +|---|---|---|---| +| `GET` | `/api/v1/model_configuration/` | List model configurations | JWT, API Key (read) | +| `POST` | `/api/v1/model_configuration/` | Create model configuration | JWT, API Key (write) | +| `GET` | `/api/v1/model_configuration/{id}/` | 
Retrieve model configuration | JWT, API Key (read) | +| `PUT` | `/api/v1/model_configuration/{id}/` | Update model configuration | JWT, API Key (write) | +| `PATCH` | `/api/v1/model_configuration/{id}/` | Partial update | JWT, API Key (write) | +| `DELETE` | `/api/v1/model_configuration/{id}/` | Delete model configuration | JWT, API Key (write) | + +**Filter fields**: `project` + +### Trained Models + +| Method | Endpoint | Description | Auth | +|---|---|---|---| +| `GET` | `/api/v1/trained_model/` | List trained models | JWT, API Key (read) | +| `POST` | `/api/v1/trained_model/` | Create and train a model | JWT, API Key (write) | +| `GET` | `/api/v1/trained_model/{id}/` | Retrieve trained model details | JWT, API Key (read) | +| `DELETE` | `/api/v1/trained_model/{id}/` | Delete trained model | JWT, API Key (write) | + +**Filter fields**: `configuration`, `data_loc` + +### Task Logs + +| Method | Endpoint | Description | Auth | +|---|---|---|---| +| `GET` | `/api/v1/task_log/` | List task log entries | JWT | +| `GET` | `/api/v1/task_log/{id}/` | Retrieve task log entry | JWT | + +**Filter fields**: `task` + +### API Keys + +| Method | Endpoint | Description | Auth | +|---|---|---|---| +| `GET` | `/api/v1/api_keys/` | List API keys (keys are masked) | JWT | +| `POST` | `/api/v1/api_keys/` | Create a new API key (full key returned once) | JWT | +| `GET` | `/api/v1/api_keys/{id}/` | Retrieve API key details | JWT | +| `PUT` | `/api/v1/api_keys/{id}/` | Update API key metadata | JWT | +| `PATCH` | `/api/v1/api_keys/{id}/` | Partial update API key | JWT | +| `DELETE` | `/api/v1/api_keys/{id}/` | Revoke/delete API key | JWT | + +**Note**: API key management endpoints deny access to API-key-authenticated requests (`DenyApiKeyAccess` permission). Only JWT/session users can manage API keys. 
+ +### Retraining Schedule + +| Method | Endpoint | Description | Auth | +|---|---|---|---| +| `GET` | `/api/v1/retraining_schedule/` | List retraining schedules | JWT, API Key (read) | +| `POST` | `/api/v1/retraining_schedule/` | Create retraining schedule | JWT, API Key (write) | +| `GET` | `/api/v1/retraining_schedule/{id}/` | Retrieve schedule details | JWT, API Key (read) | +| `PUT` | `/api/v1/retraining_schedule/{id}/` | Update schedule | JWT, API Key (write) | +| `PATCH` | `/api/v1/retraining_schedule/{id}/` | Partial update schedule | JWT, API Key (write) | +| `DELETE` | `/api/v1/retraining_schedule/{id}/` | Delete schedule | JWT, API Key (write) | + +### Retraining Runs + +| Method | Endpoint | Description | Auth | +|---|---|---|---| +| `GET` | `/api/v1/retraining_run/` | List retraining run records | JWT, API Key (read) | +| `GET` | `/api/v1/retraining_run/{id}/` | Retrieve run details | JWT, API Key (read) | + +### Deployment Slots + +| Method | Endpoint | Description | Auth | +|---|---|---|---| +| `GET` | `/api/v1/deployment_slot/` | List deployment slots | JWT, API Key (read) | +| `POST` | `/api/v1/deployment_slot/` | Create deployment slot | JWT, API Key (write) | +| `GET` | `/api/v1/deployment_slot/{id}/` | Retrieve slot details | JWT, API Key (read) | +| `PUT` | `/api/v1/deployment_slot/{id}/` | Update slot | JWT, API Key (write) | +| `PATCH` | `/api/v1/deployment_slot/{id}/` | Partial update slot | JWT, API Key (write) | +| `DELETE` | `/api/v1/deployment_slot/{id}/` | Delete slot | JWT, API Key (write) | + +**Filter fields**: `project` + +### A/B Tests + +| Method | Endpoint | Description | Auth | +|---|---|---|---| +| `GET` | `/api/v1/ab_test/` | List A/B tests | JWT, API Key (read) | +| `POST` | `/api/v1/ab_test/` | Create A/B test | JWT, API Key (write) | +| `GET` | `/api/v1/ab_test/{id}/` | Retrieve A/B test details | JWT, API Key (read) | +| `PUT` | `/api/v1/ab_test/{id}/` | Update A/B test | JWT, API Key (write) | +| `PATCH` | 
`/api/v1/ab_test/{id}/` | Partial update | JWT, API Key (write) | +| `DELETE` | `/api/v1/ab_test/{id}/` | Delete A/B test | JWT, API Key (write) | + +**Custom actions**: + +| Method | Endpoint | Description | +|---|---|---| +| `POST` | `/api/v1/ab_test/{id}/start/` | Start test (DRAFT -> RUNNING) | +| `POST` | `/api/v1/ab_test/{id}/stop/` | Stop test (RUNNING -> COMPLETED) | +| `GET` | `/api/v1/ab_test/{id}/results/` | Get statistical results | +| `POST` | `/api/v1/ab_test/{id}/promote_winner/` | Promote winner slot (body: `{"slot_id": N}`) | + +**Filter fields**: `project`, `status` + +### Conversion Events + +| Method | Endpoint | Description | Auth | +|---|---|---|---| +| `GET` | `/api/v1/conversion_event/` | List conversion events | JWT, API Key (read) | +| `POST` | `/api/v1/conversion_event/` | Record conversion event | JWT, API Key (write) | +| `GET` | `/api/v1/conversion_event/{id}/` | Retrieve event details | JWT, API Key (read) | + +### Users + +| Method | Endpoint | Description | Auth | +|---|---|---|---| +| `GET` | `/api/v1/users/` | List users (admin) or self | JWT | +| `GET` | `/api/v1/users/{id}/` | Retrieve user details | JWT | +| `POST` | `/api/v1/users/{id}/change_password/` | Change own password | JWT | + +**Note**: User management denies API key access (`DenyApiKeyAccess`). + +### Utility Endpoints + +| Method | Endpoint | Auth | Description | +|---|---|---|---| +| `GET` | `/api/v1/ping/` | None | Health check, returns `{"status": "ok"}` | + +### Authentication Endpoints + +| Method | Endpoint | Description | +|---|---|---| +| `POST` | `/api/v1/auth/login/` | Obtain JWT access and refresh tokens | +| `POST` | `/api/v1/auth/logout/` | Invalidate tokens | +| `POST` | `/api/v1/auth/token/refresh/` | Refresh access token | +| `GET` | `/api/v1/auth/user/` | Get current user details | + +**Rate limiting**: Login endpoint is rate-limited to 5 requests/minute (`LoginRateThrottle`). 
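A client-side view of the login flow may help here. The sketch below is a hedged illustration that only builds the login request and parses a response of the shape documented in this specification (`access`, `refresh`, `user`). The helper names are hypothetical, the base URL is a placeholder, and nothing is sent over the network.

```python
import json
import urllib.request

# Assumption: local deployment behind the proxy on port 8000.
BASE_URL = "http://localhost:8000"


def login_request(username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) the POST for /api/v1/auth/login/."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/auth/login/",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def parse_login_response(raw: bytes) -> tuple[str, str]:
    """Extract the (access, refresh) token pair from a login response body."""
    payload = json.loads(raw)
    return payload["access"], payload["refresh"]


def bearer_headers(access_token: str) -> dict[str, str]:
    """Headers for later Management API calls authenticated with a JWT."""
    return {"Authorization": f"Bearer {access_token}"}


# Example with a made-up response body of the documented shape.
sample = b'{"access": "aaa", "refresh": "rrr", "user": {"username": "alice"}}'
access, refresh = parse_login_response(sample)
print(bearer_headers(access))
```

Because the login endpoint is rate-limited, a client would normally log in once, reuse the access token until it expires, and then call the refresh endpoint instead of logging in again.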
+ +### OpenAPI Schema + +| Method | Endpoint | Auth | Description | +|---|---|---|---| +| `GET` | `/api/v1/schema/` | None | OpenAPI 3.0 schema (YAML/JSON) | +| `GET` | `/api/v1/schema/swagger-ui/` | None | Swagger UI | +| `GET` | `/api/v1/schema/redoc/` | None | ReDoc documentation | + +## Inference API Endpoints + +All inference endpoints require API key authentication with `predict` scope. + +### Single User Prediction + +``` +POST /inference/predict/{model_id} +``` + +**Headers**: `X-API-Key: rctm_...`, `Content-Type: application/json` + +**Request body**: +```json +{ + "user_id": "42", + "cutoff": 10 +} +``` + +**Response** (`200 OK`): +```json +{ + "items": [ + {"item_id": "101", "score": 0.95}, + {"item_id": "203", "score": 0.87} + ], + "model_id": 5, + "request_id": "a1b2c3d4-..." +} +``` + +**Constraints**: `cutoff` must be between 1 and 1000. + +### Batch Prediction + +``` +POST /inference/predict/{model_id}/batch +``` + +**Request body**: +```json +{ + "user_ids": ["42", "43", "44"], + "cutoff": 10 +} +``` + +**Response** (`200 OK`): +```json +{ + "results": [ + {"items": [...], "model_id": 5, "request_id": "..."}, + {"items": [...], "model_id": 5, "request_id": "..."}, + {"items": [...], "model_id": 5, "request_id": "..."} + ] +} +``` + +**Constraints**: Maximum 100 users per batch. Unknown users return empty item lists. + +### Project-Level Prediction (A/B Routing) + +``` +POST /inference/predict/project/{project_id} +``` + +**Request body**: +```json +{ + "user_id": "42", + "cutoff": 10 +} +``` + +**Response** (`200 OK`): +```json +{ + "items": [...], + "model_id": 5, + "slot_id": 3, + "slot_name": "production-v2", + "request_id": "a1b2c3d4-..." +} +``` + +**Behavior**: Selects an active deployment slot using weighted random selection based on slot weights. The `slot_id` and `slot_name` in the response identify which slot (and therefore which model) served the recommendation. 
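The weighted random selection described above can be sketched in a few lines. This is a minimal illustration of the technique, not the inference service's actual code: the `Slot` dataclass, its field names, and `pick_slot` are hypothetical, and inactive or zero-weight slots are assumed to be excluded from routing.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Slot:
    """Hypothetical stand-in for a deployment slot record."""
    slot_id: int
    name: str
    model_id: int
    weight: float
    is_active: bool = True


def pick_slot(slots: Sequence[Slot], rng: Optional[random.Random] = None) -> Slot:
    """Choose one active slot with probability proportional to its weight."""
    chooser = rng if rng is not None else random
    candidates = [s for s in slots if s.is_active and s.weight > 0]
    if not candidates:
        raise LookupError("no active deployment slot available")
    weights = [s.weight for s in candidates]
    return chooser.choices(candidates, weights=weights, k=1)[0]


slots = [
    Slot(slot_id=3, name="production-v2", model_id=5, weight=80),
    Slot(slot_id=4, name="candidate", model_id=7, weight=20),
]
chosen = pick_slot(slots, random.Random(0))
print(chosen.slot_id, chosen.name, chosen.model_id)
```

With weights of 80 and 20, roughly four in five requests would be routed to the first slot over many calls. Returning the whole slot lets the caller echo `slot_id` and `slot_name` in the response, as the project-level endpoint does.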
+ +### Health Check + +``` +GET /inference/health +``` + +**Response** (`200 OK`, no auth required): +```json +{ + "status": "healthy", + "loaded_models": 3 +} +``` + +### List Loaded Models + +``` +GET /inference/models +``` + +**Response** (`200 OK`, no auth required): +```json +{ + "models": [1, 5, 12], + "count": 3 +} +``` + +## Rate Limiting + +### nginx Layer + +| Zone | Rate | Burst | Applies To | +|---|---|---|---| +| `api` | 30 req/s | 20 | `/api/` endpoints | +| `auth` | 5 req/min | 3 | `/api/auth/login/` | +| `recommendation` | 30 req/min | 10 | Model recommendation endpoints | + +### DRF Layer + +| Scope | Default Rate | Description | +|---|---|---| +| `anon` | 20/min | Anonymous requests | +| `user` | 100/min | Authenticated user requests | +| `login` | 5/min | Login attempts | +| `recommendation` | 30/min | Recommendation endpoint | + +### Inference Layer (slowapi) + +- Default: `100/minute` per API key (configurable via `INFERENCE_RATE_LIMIT`) +- Rate limit key: API key prefix (first 8 chars of random part), falls back to IP address + +## Error Responses + +Standard DRF error format: + +```json +{ + "detail": "Error message here." +} +``` + +For validation errors: + +```json +{ + "field_name": ["Error message."] +} +``` + +## Pagination + +All list endpoints use `PageNumberPagination`: + +```json +{ + "count": 42, + "next": "http://localhost:8000/api/v1/project/?page=2", + "previous": null, + "results": [...] +} +``` + +Default page size: 20. diff --git a/docs/specification/architecture.md b/docs/specification/architecture.md new file mode 100644 index 00000000..49fe86b3 --- /dev/null +++ b/docs/specification/architecture.md @@ -0,0 +1,292 @@ +# Architecture Specification + +## Overview + +Recotem is a Docker-first web application for building, tuning, training, deploying, and monitoring recommender models through a UI and REST API. The system follows a service-oriented architecture with 7 Docker services communicating over a single bridge network. 
+ +## System Architecture Diagram + +``` + ┌────────────────────┐ + │ Clients │ + │ (Browser / API) │ + └─────────┬──────────┘ + │ X-API-Key or JWT + ┌────────▼─────────┐ + │ nginx (proxy) │ :8000 + │ + Vue 3 SPA │ + └──┬────┬────┬─────┘ + │ │ │ + /api/ /ws/ │ │ │ /inference/ + /admin/ │ │ │ + ┌──────────▼┐ ┌▼────▼──────────┐ + │ Backend │ │ Inference │ + │ Django 5 │ │ FastAPI │ + │ (daphne) │ │ :8081 │ + │ :8080 │ │ │ + └──┬──┬──┬──┘ └──┬────┬────────┘ + │ │ │ │ │ read-only + ┌────▼┐│ │ │ ┌─▼──────────┐ + │Redis││ │ │ │ PostgreSQL │ + │ ││ │ │ │ :5432 │ + │db0-3│├──┘ │ └──────▲──────┘ + └──┬──┘│ │ │ + │ ┌─▼───────┐ │ │ + │ │ Celery │ │ │ + │ │ Worker ├──┼─────────┘ + │ └─────────┘ │ + │ │ + ┌──▼───────────────┘ + │ Celery Beat + │ (scheduler) + └────────────────── + + Redis databases: + db0 = Celery broker db2 = Django cache + db1 = Channels (WS) db3 = Model event Pub/Sub +``` + +## Service Inventory + +| Service | Image / Technology | Internal Port | Purpose | +|---|---|---|---| +| `db` | PostgreSQL 17.2 (Alpine) | 5432 | Persistent data store; Optuna study storage | +| `redis` | Redis 7.4 (Alpine) | 6379 | Broker, channel layer, cache, model events | +| `backend` | Django 5.1 + Daphne (ASGI) | 8080 | REST API, WebSocket, Django Admin | +| `worker` | Celery (same image as backend) | -- | Background tasks (tuning, training) | +| `beat` | Celery Beat (same image as backend) | -- | Scheduled retraining cron | +| `inference` | FastAPI + Uvicorn | 8081 | Real-time recommendation serving | +| `proxy` | nginx + Vue 3 SPA | 8000 (exposed) | Reverse proxy, static SPA hosting | + +## Service Details + +### 1. 
PostgreSQL (`db`) + +- **Image**: `postgres:17.2-alpine` +- **Volume**: `db-data` mounted at `/var/lib/postgresql/data/pgdata` +- **Health check**: `pg_isready -U recotem_user -d recotem` every 5 seconds +- **Responsibilities**: + - Application data (projects, models, training data metadata, API keys) + - Optuna study storage (hyperparameter tuning trials) + - Celery task results (`django-celery-results`) + - Celery Beat schedule storage (`django-celery-beat`) + +### 2. Redis (`redis`) + +- **Image**: `redis:7.4-alpine` +- **Memory limit**: 256 MB with `allkeys-lru` eviction policy +- **Optional auth**: `REDIS_PASSWORD` environment variable +- **Health check**: `redis-cli ping` every 5 seconds +- **Database allocation**: + + | DB | Purpose | Consumer | + |---|---|---| + | db0 | Celery broker | Worker, Beat, Backend | + | db1 | Django Channels layer | Backend (WebSocket) | + | db2 | Django cache | Backend | + | db3 | Model event Pub/Sub | Backend (publisher), Inference (subscriber) | + +### 3. Backend (`backend`) + +- **Framework**: Django 5.1 + Django REST Framework + Django Channels +- **ASGI server**: Daphne (serves both HTTP and WebSocket) +- **Internal port**: 8080 +- **User**: `appuser:1000` (non-root) +- **Volumes**: + - `data-location:/data` -- trained model files and uploaded datasets + - `static-files:/app/dist/static` -- Django Admin static assets +- **Health check**: `curl -f http://localhost:8080/api/ping/` every 10 seconds +- **Memory**: 512 MB reserved, 2 GB limit +- **Key dependencies**: + - `dj-rest-auth` + `djangorestframework-simplejwt` for JWT authentication + - `django-channels` + `channels-redis` for WebSocket + - `django-celery-results` for task result persistence + - `drf-spectacular` for OpenAPI schema generation + - `django-environ` for settings management + - `irspack 0.4.0` + `optuna` for ML operations + +### 4. 
Celery Worker (`worker`) + +- **Image**: Same Docker image as backend +- **Command**: `celery -A recotem worker --loglevel=INFO` +- **Health check**: `celery inspect ping` every 60 seconds +- **Memory**: 1 GB reserved, 4 GB limit (model training is memory-intensive) +- **Volumes**: `data-location:/data` -- reads training data, writes model files +- **Responsibilities**: + - Parameter tuning (parallel Optuna trials via `group`) + - Model training (irspack recommender training) + - Scheduled retraining execution + - WebSocket status/log push via Django Channels layer +- **Auto-retry**: `ConnectionError` and `OSError` with exponential backoff, max 3 retries +- **Time limits**: Controlled via `CELERY_TASK_TIME_LIMIT` (default 3600s hard) and `CELERY_TASK_SOFT_TIME_LIMIT` (default 3480s soft) + +### 5. Celery Beat (`beat`) + +- **Image**: Same Docker image as backend +- **Command**: `celery -A recotem beat --loglevel=INFO --scheduler django_celery_beat.schedulers:DatabaseScheduler` +- **Memory**: 128 MB reserved, 512 MB limit +- **Depends on**: `backend` (healthy) and `redis` (healthy) +- **Responsibilities**: + - Reads cron schedules from `RetrainingSchedule` model via database scheduler + - Dispatches `task_scheduled_retrain` tasks to the worker + +### 6. 
Inference Service (`inference`) + +- **Framework**: FastAPI +- **Internal port**: 8081 +- **Health check**: `curl -f http://localhost:8081/health` every 10 seconds +- **Memory**: 512 MB reserved, 4 GB limit +- **Volume**: `data-location:/data:ro` (read-only access to model files) +- **Design**: + - Separate Python process from Django, using SQLAlchemy for read-only database access + - Thread-safe LRU model cache (configurable max size via `INFERENCE_MAX_LOADED_MODELS`) + - Redis Pub/Sub listener for hot-swap model updates on channel `recotem:model_events` + - Rate limiting per API key via `slowapi` + - Model file integrity verification via HMAC-SHA256 (shared `SECRET_KEY`) + - API key authentication compatible with Django's PBKDF2-SHA256 hashing (via `passlib`) + +### 7. Proxy (`proxy`) + +- **Build**: Multi-stage -- builds Vue 3 SPA then copies into nginx image +- **Exposed port**: 8000 (mapped to host) +- **User**: `nginx` (non-root) +- **Memory**: 64 MB reserved, 256 MB limit +- **Routing rules**: + + | Path | Destination | Rate Limit | + |---|---|---| + | `/` | Vue 3 SPA (static files) | -- | + | `/api/` | `backend:8080/api/` | 30 req/s burst 20 | + | `/api/auth/login/` | `backend:8080/api/auth/login/` | 5 req/min burst 3 | + | `/ws/` | `backend:8080/ws/` (WebSocket upgrade) | -- | + | `/admin/` | `backend:8080/admin/` | -- | + | `/inference/` | `inference:8081/` | 30 req/s burst 50 | + | `/static/` | `/app/dist/static/` (file system) | -- | + +- **Security headers**: X-Frame-Options (DENY), CSP, HSTS, X-Content-Type-Options, Referrer-Policy, Permissions-Policy +- **Compression**: gzip for text, CSS, JS, JSON, SVG + +## Technology Stack + +### Backend + +| Component | Technology | Version | +|---|---|---| +| Language | Python | 3.12 | +| Web framework | Django | 5.1 | +| REST API | Django REST Framework | -- | +| ASGI server | Daphne | -- | +| WebSocket | Django Channels + channels-redis | -- | +| Authentication | dj-rest-auth + simplejwt | -- | +| Task queue | 
Celery | -- | +| Task scheduler | django-celery-beat | -- | +| ML / Tuning | irspack 0.4.0 + Optuna | -- | +| Package manager | uv | -- | +| Linting | Ruff | -- | + +### Frontend + +| Component | Technology | Version | +|---|---|---| +| Framework | Vue 3 (Composition API) | 3.5 | +| Build tool | Vite | 6 | +| UI components | PrimeVue | 4 | +| CSS | Tailwind CSS | 4 | +| State management | Pinia | -- | +| Data fetching | TanStack Query | -- | +| Language | TypeScript (strict mode) | -- | +| Package manager | npm | -- | + +### Inference + +| Component | Technology | +|---|---| +| Framework | FastAPI | +| ORM | SQLAlchemy (read-only) | +| Rate limiting | slowapi | +| Password compat | passlib (Django PBKDF2-SHA256) | + +## Networking + +All services communicate over a single Docker bridge network (`backend-net`). No ports are exposed except `proxy:8000` which is mapped to the host. + +``` +┌─────────────────────────── backend-net ───────────────────────────┐ +│ │ +│ proxy:8000 ──► backend:8080 │ +│ ──► inference:8081 │ +│ │ +│ backend:8080 ──► db:5432 │ +│ ──► redis:6379 │ +│ │ +│ worker ──► db:5432 │ +│ ──► redis:6379 │ +│ │ +│ beat ──► redis:6379 │ +│ │ +│ inference:8081 ──► db:5432 (read-only) │ +│ ──► redis:6379/db3 (Pub/Sub subscriber) │ +│ │ +└───────────────────────────────────────────────────────────────────┘ + │ + port 8000 exposed + │ + ┌────▼────┐ + │ Host │ + └─────────┘ +``` + +## Volumes + +| Volume | Mounted To | Services | Purpose | +|---|---|---|---| +| `db-data` | `/var/lib/postgresql/data/pgdata` | db | Persistent database storage | +| `data-location` | `/data` | backend, worker, beat, inference (ro) | Uploaded datasets and trained model files | +| `static-files` | `/app/dist/static` | backend (rw), proxy (ro) | Django Admin static assets (`collectstatic`) | + +## Deployment Variants + +### Full Production Stack (7 services) + +Defined in `compose.yaml`. All services, full capabilities. 
+ +### Development (2 services) + +Defined in `compose-dev.yaml`. Only PostgreSQL and Redis. Backend, worker, beat, and frontend run locally. + +### Inference-Only (3 services) + +Defined in `compose-inference.yaml`. Stripped-down deployment with only `db`, `inference`, and `proxy` (using `nginx-inference.conf`). For read-only recommendation serving without the management UI. + +## Data Flow + +``` +1. User creates Project ──► backend ──► PostgreSQL +2. User uploads TrainingData CSV ──► backend ──► /data volume + PostgreSQL +3. User creates tuning job ──► backend ──► Celery (via Redis db0) +4. Workers run Optuna trials ──► worker ──► PostgreSQL (Optuna storage) + ──► Redis db1 (WebSocket push) +5. Best config saved ──► worker ──► PostgreSQL +6. Model trained ──► worker ──► /data volume (signed model file) + ──► Redis db3 (model_trained event) +7. Inference picks up event ──► inference ◄── Redis db3 (Pub/Sub) + ──► /data volume (load model) +8. Client calls inference API ──► proxy ──► inference ──► in-memory model +9. Scheduled retrain (optional) ──► beat ──► Celery ──► worker (repeat 4-7) +``` + +## Design Decisions + +1. **Separate inference service**: Decouples the recommendation serving path from the Django application. The inference service uses SQLAlchemy for read-only database access and has no Django dependency, enabling independent scaling and deployment. + +2. **HMAC-signed model files**: All trained model files are signed with HMAC-SHA256 using the application's `SECRET_KEY`. This prevents loading tampered model files. The signing core module (`pickle_signing_core.py`) is Django-independent and shared with the inference service. + +3. **Redis database separation**: Four Redis databases isolate different concerns (broker, channels, cache, model events) to prevent key collisions and allow independent monitoring and eviction policies. + +4. 
**Single Docker image for backend/worker/beat**: All three services share the same Docker image (`backend/Dockerfile`), differing only in their entrypoint command. This simplifies builds and ensures code consistency. + +5. **WebSocket via query-string JWT**: Browsers cannot send custom headers on WebSocket upgrade requests. JWT tokens are passed as `?token=` query parameters and validated by `JwtAuthMiddleware`. + +6. **Daphne as ASGI server**: Daphne serves both HTTP and WebSocket protocols, eliminating the need for separate servers. The backend listens on port 8080 internally; nginx proxies to it on port 8000. + +7. **Model hot-swap via Pub/Sub**: When a model is trained, the backend publishes a `model_trained` event to Redis db3. The inference service's background listener picks up the event and loads the new model into its LRU cache, enabling zero-downtime model updates. diff --git a/docs/specification/data-model.md b/docs/specification/data-model.md new file mode 100644 index 00000000..583a4442 --- /dev/null +++ b/docs/specification/data-model.md @@ -0,0 +1,376 @@ +# Data Model Specification + +## Overview + +Recotem's data model is implemented as Django models in `backend/recotem/recotem/api/models/`. All domain models inherit from `ModelWithInsDatetime`, which provides automatic timestamping and reverse-chronological ordering. File-based models extend `BaseFileModel` for file storage and size tracking. 
+ +## Entity Relationship Diagram + +``` + ┌───────────┐ + │ User │ + │ (Django) │ + └─────┬─────┘ + │ + ┌────────────────┬┼────────────────┐ + │ owner ││ created_by │ owner + ▼ ▼│ ▼ + ┌───────────┐ ┌──────┴──────┐ ┌───────────┐ + │ Project │ │ SplitConfig │ │ ApiKey │ + │ │ └──────┬──────┘ │ │ + └─────┬──────┘ │ └───────────┘ + │ │ │ + ┌────────────┼────────────┐ │ FK to Project + │ │ │ │ + ▼ ▼ ▼ │ ┌──────────────────┐ + ┌──────────┐ ┌──────────┐ ┌──────────┐ │ EvaluationConfig │ + │ Training │ │ ItemMeta │ │ Model │ │ │ + │ Data │ │ Data │ │ Config │ └────────┬─────────┘ + │ (file) │ │ (file) │ │ │ │ + └────┬─────┘ └──────────┘ └────┬─────┘ │ + │ │ │ + │ ┌────────────────────┼──────────────────┘ + │ │ │ + ▼ ▼ ▼ + ┌─────────────────┐ ┌─────────────┐ + │ParameterTuning │ │ Trained │ + │ Job │───►│ Model │ + │ │ │ (file) │ + └────────┬────────┘ └──────┬──────┘ + │ │ + ┌─────┼─────┐ │ + ▼ ▼ ▼ + ┌──────┐ ┌──────┐ ┌──────────────┐ + │Task& │ │Task& │ │ Deployment │ + │Param │ │Model │ │ Slot │ + │Link │ │Link │ └──────┬───────┘ + └──────┘ └──────┘ │ + ┌────┼────┐ + ▼ ▼ + ┌──────────┐ ┌──────────────┐ + │ ABTest │ │ Conversion │ + │ │ │ Event │ + └──────────┘ └──────────────┘ + + ┌──────────────────┐ ┌──────────────────┐ + │ Retraining │────►│ Retraining │ + │ Schedule │ │ Run │ + └──────────────────┘ └──────────────────┘ + │ + FK to Project +``` + +## Base Classes + +### ModelWithInsDatetime + +All domain models inherit from this abstract base class. + +| Field | Type | Description | +|---|---|---| +| `ins_datetime` | `DateTimeField(auto_now_add=True)` | Creation timestamp | +| `updated_at` | `DateTimeField(auto_now=True)` | Last modification timestamp | + +- **Meta**: `abstract = True`, `ordering = ["-id"]` (newest first) + +### BaseFileModel + +Inherited by models that store uploaded files (`TrainingData`, `ItemMetaData`, `TrainedModel`). 
+ +| Field | Type | Description | +|---|---|---| +| `file` | `FileField` | Stored file reference | +| `filesize` | `IntegerField` | File size in bytes (populated on save signal) | + +## Model Definitions + +### Project + +The top-level organizational entity. Defines the column mapping for user/item interaction data. + +| Field | Type | Constraints | Description | +|---|---|---|---| +| `name` | `CharField(max_length=256)` | Unique per owner | Human-readable project name | +| `owner` | `ForeignKey(User)` | `null=True`, CASCADE | Project owner; NULL for legacy unowned data | +| `user_column` | `CharField(max_length=256)` | Required | Name of user ID column in training data | +| `item_column` | `CharField(max_length=256)` | Required | Name of item ID column in training data | +| `time_column` | `CharField(max_length=256)` | `null=True` | Optional timestamp column name | + +**Constraints**: `UniqueConstraint(fields=["owner", "name"], name="unique_project_name_per_owner")` + +**Design note**: `owner` is nullable for backward compatibility with data created before multi-user support. Unowned projects (`owner=NULL`) are visible to all authenticated users via `OwnedResourceMixin`. + +### TrainingData + +Uploaded CSV file containing user-item interaction records. + +| Field | Type | Constraints | Description | +|---|---|---|---| +| `project` | `ForeignKey(Project)` | CASCADE, indexed | Parent project | +| `file` | Inherited from `BaseFileModel` | -- | Uploaded CSV/TSV/Parquet file | +| `filesize` | Inherited from `BaseFileModel` | -- | File size (populated by `post_save` signal) | + +**Validation**: `validate_return_df()` verifies that the file contains the columns defined in the project (`user_column`, `item_column`, optional `time_column`). + +### ItemMetaData + +Optional item metadata file for feature-enriched recommendations. 
+ +| Field | Type | Constraints | Description | +|---|---|---|---| +| `project` | `ForeignKey(Project)` | CASCADE, indexed | Parent project | +| `valid_columns_list_json` | `JSONField` | `null=True` | List of valid feature columns | +| `file` | Inherited from `BaseFileModel` | -- | Uploaded metadata file | + +### SplitConfig + +Configuration for train/validation data splitting. + +| Field | Type | Default | Description | +|---|---|---|---| +| `name` | `CharField(max_length=256)` | `null=True` | Optional display name | +| `created_by` | `ForeignKey(User)` | `null=True`, SET_NULL | Creator (NULL for legacy) | +| `scheme` | `CharField(choices)` | `"RG"` (Random) | Split strategy: RG/TG/TU | +| `heldout_ratio` | `FloatField` | `0.1` | Fraction of interactions held out [0.0, 1.0] | +| `n_heldout` | `IntegerField` | `null=True` | Absolute number of heldout items | +| `test_user_ratio` | `FloatField` | `1.0` | Fraction of users used for testing [0.0, 1.0] | +| `n_test_users` | `IntegerField` | `null=True` | Absolute number of test users | +| `random_seed` | `IntegerField` | `42` | Random seed for reproducibility | + +**Split schemes**: +- `RG` (Random): Random interaction holdout +- `TG` (Time Global): Global time-based split +- `TU` (Time User): Per-user time-based split + +### EvaluationConfig + +Configuration for model evaluation metrics. + +| Field | Type | Default | Description | +|---|---|---|---| +| `name` | `CharField(max_length=256)` | `null=True` | Optional display name | +| `cutoff` | `IntegerField` | `20` | Top-K cutoff for evaluation | +| `created_by` | `ForeignKey(User)` | `null=True`, SET_NULL | Creator (NULL for legacy) | +| `target_metric` | `CharField(choices)` | `"ndcg"` | Metric to optimize | + +**Target metrics**: `ndcg`, `map`, `recall`, `hit` + +### ModelConfiguration + +Recommender algorithm configuration with hyperparameters. 
+ +| Field | Type | Constraints | Description | +|---|---|---|---| +| `name` | `CharField(max_length=256)` | `null=True` | Display name; unique per project | +| `project` | `ForeignKey(Project)` | CASCADE, indexed | Parent project | +| `recommender_class_name` | `CharField(max_length=128)` | Validated Python identifier | irspack recommender class name | +| `parameters_json` | `JSONField` | default `{}` | Hyperparameter key-value pairs | + +**Constraints**: `UniqueConstraint(fields=["project", "name"], name="unique_model_config_name_per_project")` + +**Validation**: `recommender_class_name` must match `^[A-Za-z_][A-Za-z0-9_]*$`. + +### TrainedModel + +A trained recommendation model stored as an HMAC-signed serialized file. + +| Field | Type | Constraints | Description | +|---|---|---|---| +| `configuration` | `ForeignKey(ModelConfiguration)` | CASCADE, indexed | Algorithm configuration used | +| `data_loc` | `ForeignKey(TrainingData)` | CASCADE, indexed | Training data used | +| `irspack_version` | `CharField(max_length=16)` | `null=True` | irspack version at training time | +| `file` | Inherited from `BaseFileModel` | -- | HMAC-SHA256 signed serialized file | + +**File format**: `HMAC_SIGNATURE (32 bytes) + SERIALIZED_PAYLOAD`. The payload contains a dict with keys `id_mapped_recommender`, `irspack_version`, `recotem_trained_model_id`. + +### ParameterTuningJob + +Orchestrates hyperparameter search using Optuna. 
+ +| Field | Type | Constraints | Description | +|---|---|---|---| +| `data` | `ForeignKey(TrainingData)` | CASCADE, indexed | Training data to tune on | +| `split` | `ForeignKey(SplitConfig)` | CASCADE | Data split configuration | +| `evaluation` | `ForeignKey(EvaluationConfig)` | CASCADE | Evaluation configuration | +| `status` | `CharField(choices)` | default `"PENDING"`, indexed | PENDING/RUNNING/COMPLETED/FAILED | +| `n_tasks_parallel` | `IntegerField` | default `1` | Number of parallel Celery workers | +| `n_trials` | `IntegerField` | default `40` | Total Optuna trials | +| `memory_budget` | `IntegerField` | default `8000` | Memory budget (MB) | +| `timeout_overall` | `IntegerField` | `null=True` | Overall timeout (seconds) | +| `timeout_singlestep` | `IntegerField` | `null=True` | Per-trial timeout (seconds) | +| `random_seed` | `IntegerField` | `null=True` | Random seed | +| `tried_algorithms_json` | `JSONField` | `null=True` | List of algorithm names to try | +| `irspack_version` | `CharField(max_length=16)` | `null=True` | irspack version used | +| `train_after_tuning` | `BooleanField` | default `True` | Auto-train best config | +| `tuned_model` | `OneToOneField(TrainedModel)` | `null=True`, SET_NULL | Resulting trained model | +| `best_config` | `OneToOneField(ModelConfiguration)` | `null=True`, SET_NULL | Best configuration found | +| `best_score` | `FloatField` | `null=True` | Best evaluation score achieved | + +**Methods**: `study_name()` returns `"job-{id}-{ins_datetime}"` for Optuna study identification. + +### TaskAndParameterJobLink + +Links Celery `TaskResult` entries to `ParameterTuningJob`. + +| Field | Type | Constraints | Description | +|---|---|---|---| +| `job` | `ForeignKey(ParameterTuningJob)` | CASCADE, related `task_links` | Parent tuning job | +| `task` | `OneToOneField(TaskResult)` | CASCADE, related `tuning_job_link` | Celery task result | + +### TaskAndTrainedModelLink + +Links Celery `TaskResult` entries to `TrainedModel`. 
+ +| Field | Type | Constraints | Description | +|---|---|---|---| +| `model` | `ForeignKey(TrainedModel)` | CASCADE, related `task_links` | Parent trained model | +| `task` | `OneToOneField(TaskResult)` | CASCADE, related `model_link` | Celery task result | + +### ApiKey + +API key for programmatic access to project resources. + +| Field | Type | Constraints | Description | +|---|---|---|---| +| `project` | `ForeignKey(Project)` | CASCADE, related `api_keys` | Scoped to this project | +| `owner` | `ForeignKey(User)` | CASCADE, related `api_keys` | Key creator/owner | +| `name` | `CharField(max_length=256)` | Unique per project | Human-readable key name | +| `key_prefix` | `CharField(max_length=16)` | indexed | First 8 chars of random part for lookup | +| `hashed_key` | `CharField(max_length=256)` | -- | PBKDF2-SHA256 hash of full key | +| `scopes` | `JSONField` | default `[]` | Permission scopes: `read`, `write`, `predict` | +| `is_active` | `BooleanField` | default `True` | Whether the key is active | +| `expires_at` | `DateTimeField` | `null=True` | Optional expiration timestamp | +| `last_used_at` | `DateTimeField` | `null=True` | Last usage timestamp | + +**Constraints**: `UniqueConstraint(fields=["project", "name"], name="unique_api_key_name_per_project")` + +**Key format**: `rctm_` (prefix `rctm_`, followed by a random part generated with `secrets.token_urlsafe(48)`, i.e. 48 random bytes encoded as 64 URL-safe characters) + +### TaskLog + +Log entries associated with Celery task execution. + +| Field | Type | Constraints | Description | +|---|---|---|---| +| `task` | `ForeignKey(TaskResult)` | CASCADE | Celery task result | +| `contents` | `TextField` | blank allowed | Log message content | + +### RetrainingSchedule + +Defines periodic model retraining configuration.
+ +| Field | Type | Default | Description | +|---|---|---|---| +| `project` | `OneToOneField(Project)` | CASCADE, related `retraining_schedule` | One schedule per project | +| `is_enabled` | `BooleanField` | `False` | Whether schedule is active | +| `cron_expression` | `CharField(max_length=100)` | `"0 2 * * 0"` | Cron schedule | +| `training_data` | `ForeignKey(TrainingData)` | `null=True`, SET_NULL | Specific data; or latest if NULL | +| `model_configuration` | `ForeignKey(ModelConfiguration)` | `null=True`, SET_NULL | Config for train-only mode | +| `retune` | `BooleanField` | `False` | Whether to re-run tuning | +| `split_config` | `ForeignKey(SplitConfig)` | `null=True`, SET_NULL | Required if retune=True | +| `evaluation_config` | `ForeignKey(EvaluationConfig)` | `null=True`, SET_NULL | Required if retune=True | +| `max_retries` | `IntegerField` | `3` | Max retry attempts | +| `notify_on_failure` | `BooleanField` | `True` | Send failure notifications | +| `last_run_at` | `DateTimeField` | `null=True` | Timestamp of last execution | +| `last_run_status` | `CharField(choices)` | `null=True` | SUCCESS/FAILED/SKIPPED | +| `next_run_at` | `DateTimeField` | `null=True` | Next scheduled execution | +| `auto_deploy` | `BooleanField` | `False` | Auto-deploy trained model to slot | + +### RetrainingRun + +Record of a single retraining execution. 
+ +| Field | Type | Default | Description | +|---|---|---|---| +| `schedule` | `ForeignKey(RetrainingSchedule)` | CASCADE, related `runs` | Parent schedule | +| `status` | `CharField(choices)` | `"PENDING"` | PENDING/RUNNING/COMPLETED/FAILED/SKIPPED | +| `trained_model` | `ForeignKey(TrainedModel)` | `null=True`, SET_NULL | Resulting trained model | +| `tuning_job` | `ForeignKey(ParameterTuningJob)` | `null=True`, SET_NULL | Associated tuning job (if retune) | +| `error_message` | `TextField` | `""` | Error details on failure | +| `completed_at` | `DateTimeField` | `null=True` | Completion timestamp | +| `data_rows_at_trigger` | `IntegerField` | `null=True` | Data size at trigger time | + +### DeploymentSlot + +A slot that maps a trained model to serving with a traffic weight for A/B testing. + +| Field | Type | Default | Description | +|---|---|---|---| +| `project` | `ForeignKey(Project)` | CASCADE, related `deployment_slots` | Parent project | +| `name` | `CharField(max_length=256)` | -- | Slot display name | +| `trained_model` | `ForeignKey(TrainedModel)` | CASCADE | Model served by this slot | +| `weight` | `FloatField` | `100` | Traffic weight [0.0, 100.0] | +| `is_active` | `BooleanField` | `True` | Whether slot is active | + +### ABTest + +A/B test comparing two deployment slots. 
+ +| Field | Type | Default | Description | +|---|---|---|---| +| `project` | `ForeignKey(Project)` | CASCADE, related `ab_tests` | Parent project | +| `name` | `CharField(max_length=256)` | -- | Test name | +| `status` | `CharField(choices)` | `"DRAFT"` | DRAFT/RUNNING/COMPLETED/CANCELLED | +| `control_slot` | `ForeignKey(DeploymentSlot)` | CASCADE, related `control_tests` | Control (baseline) slot | +| `variant_slot` | `ForeignKey(DeploymentSlot)` | CASCADE, related `variant_tests` | Variant (challenger) slot | +| `target_metric_name` | `CharField(max_length=50)` | `"ctr"` | Metric: ctr/purchase_rate/conversion_rate | +| `min_sample_size` | `IntegerField` | `1000` | Minimum impressions before analysis | +| `confidence_level` | `FloatField` | `0.95` | Statistical confidence [0.5, 0.99] | +| `started_at` | `DateTimeField` | `null=True` | Test start timestamp | +| `ended_at` | `DateTimeField` | `null=True` | Test end timestamp | +| `winner_slot` | `ForeignKey(DeploymentSlot)` | `null=True`, SET_NULL, related `won_tests` | Promoted winner slot | + +### ConversionEvent + +Tracking event for A/B test analysis. Note: This model inherits directly from `models.Model` (not `ModelWithInsDatetime`). 
+ +| Field | Type | Constraints | Description | +|---|---|---|---| +| `project` | `ForeignKey(Project)` | CASCADE | Parent project | +| `deployment_slot` | `ForeignKey(DeploymentSlot)` | CASCADE | Slot that served the recommendation | +| `user_id` | `CharField(max_length=256)` | -- | User identifier | +| `item_id` | `CharField(max_length=256)` | default `""` | Item identifier | +| `event_type` | `CharField(choices)` | -- | impression/click/purchase | +| `recommendation_request_id` | `UUIDField` | `null=True` | Links to inference request ID | +| `timestamp` | `DateTimeField(auto_now_add=True)` | -- | Event timestamp | +| `metadata_json` | `JSONField` | default `{}` | Arbitrary metadata | + +**Indexes**: Composite index on `(project, deployment_slot, event_type, timestamp)` for efficient A/B test result queries. + +## Signals + +1. **`create_auth_token`**: `post_save` on `User` -- creates a DRF `Token` for each new user (legacy auth support). +2. **`save_file_size`**: `post_save` on `TrainingData` -- populates `filesize` after file upload. + +## Key Relationships Summary + +``` +User ──1:N──► Project ──1:N──► TrainingData + ──1:N──► ItemMetaData + ──1:N──► ModelConfiguration + ──1:N──► DeploymentSlot + ──1:N──► ABTest + ──1:N──► ApiKey + ──1:1──► RetrainingSchedule + +TrainingData ──1:N──► ParameterTuningJob + ──1:N──► TrainedModel + +ModelConfiguration ──1:N──► TrainedModel + +ParameterTuningJob ──1:1──► ModelConfiguration (best_config) + ──1:1──► TrainedModel (tuned_model) + ──1:N──► TaskAndParameterJobLink + +TrainedModel ──1:N──► DeploymentSlot + ──1:N──► TaskAndTrainedModelLink + +DeploymentSlot ──1:N──► ABTest (control or variant) + ──1:N──► ConversionEvent + +RetrainingSchedule ──1:N──► RetrainingRun +``` + +## Database Tables + +All Django model tables use the prefix `api_` (from the app label `recotem.api`). Table names follow Django convention: `api_project`, `api_trainingdata`, `api_trainedmodel`, etc. 
Additionally, `django_celery_results_taskresult` and `django_celery_beat_*` tables are managed by their respective packages. diff --git a/docs/specification/inference-service.md b/docs/specification/inference-service.md new file mode 100644 index 00000000..826e3dbe --- /dev/null +++ b/docs/specification/inference-service.md @@ -0,0 +1,378 @@ +# Inference Service Specification + +## Overview + +The inference service is a standalone FastAPI application that serves real-time recommendation predictions. It operates independently from the Django backend, connecting to the same PostgreSQL database (read-only) and Redis instance (Pub/Sub on db3). This separation enables independent scaling and deployment of the serving layer. + +## Architecture + +``` ++-------------------------------------------------------------+ +| Inference Service | +| (FastAPI, port 8081) | +| | +| +-------------+ +-------------+ +----------------------+ | +| | Routes | | Auth | | Rate Limiter | | +| | - predict | | - API key | | - slowapi | | +| | - project | | - scope | | - per key/IP | | +| | - health | | check | | | | +| +------+------+ +------+------+ +----------------------+ | +| | | | +| +------v------------------------------+ | +| | Model Loader | | +| | +------------------------------+ | | +| | | LRU Cache (OrderedDict) | | | +| | | Thread-safe (Lock) | | | +| | | Max: INFERENCE_MAX_ | | | +| | | LOADED_MODELS | | | +| | +------------------------------+ | | +| +------+------------------------------+ | +| | | +| +------v----------+ +-----------------------+ | +| | HMAC Verifier | | Hot-Swap Listener | | +| | (signing.py) | | (Redis Pub/Sub) | | +| | | | Channel: | | +| | SECRET_KEY | | recotem:model_events | | +| +-----------------+ +-----------+-----------+ | +| | | ++-----------------------------------+--------------------------+ + | + +---------------------+---------------+ + | | | + +------v------+ +-------v------+ +------v------+ + | /data (ro) | | Redis db3 | | PostgreSQL | + | Model 
files | | Pub/Sub | | (read-only) | + +-------------+ +--------------+ +-------------+ +``` + +## Configuration + +All settings are managed via Pydantic `BaseSettings` (loaded from environment variables): + +| Setting | Default | Description | +|---|---|---| +| `DATABASE_URL` | `postgresql://recotem_user:recotem_pass@localhost:5432/recotem` | Read-only PostgreSQL connection | +| `MODEL_EVENTS_REDIS_URL` | `redis://localhost:6379/3` | Redis Pub/Sub URL (db3) | +| `SECRET_KEY` | `VeryBadSecret@ChangeThis` | Must match Django's SECRET_KEY for HMAC verification | +| `INFERENCE_PORT` | `8081` | Service listen port | +| `INFERENCE_MAX_LOADED_MODELS` | `10` | Maximum models in LRU cache | +| `INFERENCE_RATE_LIMIT` | `100/minute` | Rate limit per API key | +| `INFERENCE_PRELOAD_MODEL_IDS` | `""` | Comma-separated model IDs to load at startup | +| `MEDIA_ROOT` | `/data` | Root path for model file storage | +| `RECOTEM_STORAGE_TYPE` | `""` | Storage type (empty for local filesystem) | +| `PICKLE_ALLOW_LEGACY_UNSIGNED` | `True` | Accept unsigned legacy model files | + +## Database Access + +The inference service uses SQLAlchemy with read-only access to Django's PostgreSQL database. SQLAlchemy models mirror the Django schema: + +| SQLAlchemy Model | Django Table | Purpose | +|---|---|---| +| `TrainedModel` | `api_trainedmodel` | Model file paths and metadata | +| `Project` | `api_project` | Project definitions | +| `ApiKey` | `api_apikey` | API key verification | +| `ModelConfiguration` | `api_modelconfiguration` | Configuration metadata | +| `DeploymentSlot` | `api_deploymentslot` | Slot routing for A/B tests | +| `TrainingData` | `api_trainingdata` | Training data project linkage | + +Sessions are created per-request via the `get_db` FastAPI dependency and closed after request completion. 
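
The per-request session lifecycle described above can be illustrated with a generator dependency. This is a minimal sketch under stated assumptions: it swaps in the standard library's `sqlite3` for SQLAlchemy so that it runs without extra packages, and both the `get_db` body and the `handle_request` helper are hypothetical, not the service's actual code. FastAPI drives generator dependencies in the same shape: the code before `yield` runs when the request starts, and the `finally` block runs once the request has been handled.

```python
import sqlite3
from collections.abc import Iterator


def get_db() -> Iterator[sqlite3.Connection]:
    """Yield one connection per request and always close it afterwards."""
    # Stand-in for a SQLAlchemy session factory (assumption for illustration).
    conn = sqlite3.connect(":memory:")
    try:
        yield conn
    finally:
        # Runs even if the request handler raised an exception.
        conn.close()


def handle_request() -> int:
    """Simulate what the framework does around a single request."""
    dependency = get_db()
    conn = next(dependency)  # setup: open the session
    try:
        return conn.execute("SELECT 1").fetchone()[0]
    finally:
        dependency.close()  # teardown: triggers the finally block in get_db
```

Because cleanup lives in `finally`, a failing request cannot leak its connection back into the pool in an open state.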
+ +## LRU Model Cache + +### Design + +The `ModelCache` class implements a thread-safe LRU (Least Recently Used) cache for loaded recommendation models: + +```python +class ModelCache: + _cache: OrderedDict[int, IDMappedRecommender] + _lock: threading.Lock + _max_size: int +``` + +### Operations + +| Operation | Description | Thread Safety | +|---|---|---| +| `get(model_id)` | Retrieve model, move to end (most recent) | Lock-protected | +| `put(model_id, model)` | Insert/update model; evict LRU if at capacity | Lock-protected | +| `remove(model_id)` | Remove model from cache | Lock-protected | +| `loaded_models()` | List all cached model IDs | Lock-protected | +| `size()` | Return number of cached models | Lock-protected | + +### Eviction Policy + +When inserting a new model into a full cache, the least recently used model (front of `OrderedDict`) is evicted: + +``` +Cache state (max_size=3): [A, B, C] + ^ ^ + LRU MRU + +get(B) --> Cache: [A, C, B] (B moved to end) +put(D) --> Cache: [C, B, D] (A evicted) +``` + +### Model Loading Flow + +``` +get_or_load_model(model_id, file_path) + | + +-- Check cache: model_cache.get(model_id) + | +-- Cache hit: return cached model + | + +-- Load from disk: load_model_from_file(file_path) + | +-- 1. Resolve path: MEDIA_ROOT / file_path + | +-- 2. Read raw bytes + | +-- 3. Verify HMAC: verify_and_extract(SECRET_KEY, raw_data) + | +-- 4. Deserialize: load model from verified payload + | +-- 5. Extract: data["id_mapped_recommender"] + | + +-- Store in cache: model_cache.put(model_id, model) + +-- Return model +``` + +### Custom Deserializer + +A custom deserializer handles models serialized with different module paths. It redirects `IDMappedRecommender` class resolution to the local inference module's `id_mapper_compat` module, ensuring compatibility regardless of the original module path. 
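
The cache operations and eviction policy in this section can be sketched with `collections.OrderedDict` and a `threading.Lock`. This is a minimal illustration, not the service's implementation: the method names follow the operations table, while the constructor signature and the use of `Any` in place of `IDMappedRecommender` are assumptions made to keep the example self-contained.

```python
import threading
from collections import OrderedDict
from typing import Any, Optional


class ModelCache:
    """Thread-safe LRU cache: front of the dict is LRU, end is MRU."""

    def __init__(self, max_size: int) -> None:
        self._cache: OrderedDict[int, Any] = OrderedDict()
        self._lock = threading.Lock()
        self._max_size = max_size

    def get(self, model_id: int) -> Optional[Any]:
        with self._lock:
            if model_id not in self._cache:
                return None
            # A hit marks the entry as most recently used.
            self._cache.move_to_end(model_id)
            return self._cache[model_id]

    def put(self, model_id: int, model: Any) -> None:
        with self._lock:
            if model_id in self._cache:
                self._cache.move_to_end(model_id)
            self._cache[model_id] = model
            # Evict from the front (least recently used) while over capacity.
            while len(self._cache) > self._max_size:
                self._cache.popitem(last=False)

    def remove(self, model_id: int) -> None:
        with self._lock:
            self._cache.pop(model_id, None)

    def loaded_models(self) -> list[int]:
        with self._lock:
            return list(self._cache.keys())

    def size(self) -> int:
        with self._lock:
            return len(self._cache)
```

Replaying the eviction example with IDs 1, 2, 3 for A, B, C: after `get(2)` and `put(4, ...)` on a cache of size 3, `loaded_models()` lists 3, 2, 4 and model 1 has been evicted.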
+ +## Hot-Swap via Redis Pub/Sub + +### Event Flow + +``` +Celery Worker Redis db3 Inference Service + | | | + | train_and_save_model() | | + | ----------------------> | | + | | | + | PUBLISH | | + | recotem:model_events | | + | {"event":"model_trained",| | + | "model_id": 42, | | + | "project_id": 1} | | + | ---------------------> | | + | | SUBSCRIBE | + | | recotem:model_events | + | | <-----------------------| + | | | + | | MESSAGE | + | | ----------------------> | + | | | + | | _handle_model_event()| + | | +--------------------+ + | | | 1. Parse JSON | + | | | 2. Query DB | + | | | 3. Load model | + | | | 4. Update cache | + | | +--------------------+ +``` + +### Event Format + +Published by `training_service._publish_model_event()`: + +```json +{ + "event": "model_trained", + "model_id": 42, + "project_id": 1 +} +``` + +### Listener Thread + +The Pub/Sub listener runs as a daemon thread started during FastAPI lifespan: + +```python +def start_listener() -> threading.Thread: + thread = threading.Thread(target=_listen, daemon=True, name="model-event-listener") + thread.start() + return thread +``` + +- **Auto-reconnect**: On `redis.ConnectionError`, waits 5 seconds and reconnects +- **Error isolation**: Unexpected exceptions are logged and the listener continues +- **Channel**: `recotem:model_events` + +### Event Handling + +When a `model_trained` event is received: +1. Parse JSON payload +2. Query `TrainedModel` from the database via SQLAlchemy +3. If the model exists and has a file, call `get_or_load_model()` +4. The model is loaded into the LRU cache, replacing any stale version +5. 
Log success or failure + +## Slot Routing (A/B Testing) + +The project-level prediction endpoint implements weighted random routing across deployment slots: + +```python +def _select_slot_by_weight(slots: list[DeploymentSlot]) -> DeploymentSlot: + weights = [s.weight for s in slots] + return random.choices(slots, weights=weights, k=1)[0] +``` + +### Routing Flow + +``` +POST /inference/predict/project/{project_id} + | + +-- 1. Verify API key has predict scope and project access + | + +-- 2. Query active deployment slots for project + | SELECT * FROM api_deploymentslot + | WHERE project_id = ? AND is_active = true + | + +-- 3. Weighted random selection + | Example: Slot A (weight=70), Slot B (weight=30) + | -> 70% of requests route to Slot A + | + +-- 4. Load model from selected slot + | get_or_load_model(slot.trained_model_id, model.file) + | + +-- 5. Generate recommendations + | + +-- 6. Return response with slot_id and slot_name + (enables client-side event attribution) +``` + +The response includes `slot_id` and `slot_name` so clients can record which slot served each recommendation for A/B test analysis via `ConversionEvent`. 
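
To make the effect of the slot weights concrete, the routing step can be simulated offline. The sketch below is illustrative only: the `Slot` dataclass is a hypothetical stand-in for the `DeploymentSlot` row, and the explicit `random.Random` argument is an assumption added so the simulation is reproducible.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    """Hypothetical stand-in for an active DeploymentSlot row."""

    id: int
    name: str
    weight: float


def select_slot_by_weight(slots: list[Slot], rng: random.Random) -> Slot:
    # Same technique as the routing function above: one weighted draw.
    weights = [s.weight for s in slots]
    return rng.choices(slots, weights=weights, k=1)[0]


def simulate(slots: list[Slot], n_requests: int, seed: int = 0) -> Counter:
    """Count how many simulated requests each slot would receive."""
    rng = random.Random(seed)
    return Counter(
        select_slot_by_weight(slots, rng).name for _ in range(n_requests)
    )
```

Calling `simulate([Slot(1, "A", 70), Slot(2, "B", 30)], 10_000)` should show roughly 70% of requests going to slot A. Weights are relative, so they do not have to sum to 100. Note that `random.choices` raises `ValueError` when every weight is zero (Python 3.9+), so a project whose active slots all have weight 0 would need separate handling.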
+ +## Rate Limiting + +### Implementation + +Rate limiting uses `slowapi` (a Starlette-compatible wrapper around `limits`): + +```python +limiter = Limiter(key_func=get_api_key_or_ip) +``` + +### Rate Limit Key Resolution + +```python +def get_api_key_or_ip(request: Request) -> str: + api_key_header = request.headers.get("x-api-key", "") + if api_key_header.startswith("rctm_") and len(api_key_header) > 13: + return api_key_header[5:13] # Use key prefix as rate limit key + return get_remote_address(request) +``` + +- **API key requests**: Rate limited per API key prefix (first 8 chars of random part) +- **Unauthenticated requests**: Rate limited per IP address +- **Default limit**: `100/minute` (configurable via `INFERENCE_RATE_LIMIT`) + +### Rate limit exceeded response + +``` +HTTP 429 Too Many Requests +{"error": "Rate limit exceeded: 100 per 1 minute"} +``` + +## Pre-Loading Models + +Models can be pre-loaded into the cache at service startup via `INFERENCE_PRELOAD_MODEL_IDS`: + +```bash +INFERENCE_PRELOAD_MODEL_IDS=1,5,12 +``` + +During the FastAPI lifespan startup: +1. Parse comma-separated model IDs +2. For each ID, query the database for the model record +3. If found, call `get_or_load_model()` to load into cache +4. Log success or failure for each model + +This eliminates cold-start latency for frequently-used models. + +## API Key Authentication + +The inference service implements its own API key verification (independent of Django): + +``` +Request Header: X-API-Key: rctm_aBcDeFgH... + | + +-- 1. Check prefix: "rctm_" + +-- 2. Extract first 8 chars of random part as prefix + +-- 3. Query: SELECT * FROM api_apikey WHERE key_prefix = ? AND is_active = true + +-- 4. Check expiration + +-- 5. Verify hash: django_pbkdf2_sha256.verify(full_key, hashed_key) + +-- 6. Check scope: "predict" in api_key.scopes +``` + +Uses `passlib.hash.django_pbkdf2_sha256` for Django-compatible PBKDF2-SHA256 hash verification. 
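
The six verification steps above can be sketched with the standard library alone. This is a hedged illustration, not the service's code: it assumes `hashed_key` uses Django's standard `pbkdf2_sha256$<iterations>$<salt>$<base64 digest>` encoding, replaces the SQL lookup with a scan over an in-memory list, and introduces the hypothetical names `ApiKeyRecord` and `authenticate`. The real service uses `passlib` for the hash check.

```python
import base64
import hashlib
import hmac
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

KEY_PREFIX = "rctm_"


def django_pbkdf2_sha256(password: str, salt: str, iterations: int) -> str:
    """Encode a password in Django's pbkdf2_sha256 hash format."""
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode(), salt.encode(), iterations
    )
    encoded = base64.b64encode(digest).decode("ascii")
    return f"pbkdf2_sha256${iterations}${salt}${encoded}"


def verify_django_pbkdf2_sha256(password: str, hashed: str) -> bool:
    try:
        algorithm, iterations, salt, _ = hashed.split("$", 3)
        expected = django_pbkdf2_sha256(password, salt, int(iterations))
    except ValueError:
        return False  # malformed hash string
    if algorithm != "pbkdf2_sha256":
        return False
    return hmac.compare_digest(expected, hashed)  # constant-time compare


@dataclass
class ApiKeyRecord:
    """Hypothetical stand-in for a row of api_apikey."""

    key_prefix: str
    hashed_key: str
    scopes: list[str]
    is_active: bool = True
    expires_at: Optional[datetime] = None


def authenticate(
    raw_key: str,
    records: list[ApiKeyRecord],
    required_scope: str = "predict",
    now: Optional[datetime] = None,
) -> Optional[ApiKeyRecord]:
    if not raw_key.startswith(KEY_PREFIX):                    # step 1
        return None
    prefix = raw_key[len(KEY_PREFIX):][:8]                    # step 2
    now = now or datetime.now(timezone.utc)
    for record in records:                                    # step 3 (stand-in for SQL)
        if record.key_prefix != prefix or not record.is_active:
            continue
        if record.expires_at is not None and record.expires_at <= now:
            continue                                          # step 4
        if not verify_django_pbkdf2_sha256(raw_key, record.hashed_key):
            continue                                          # step 5
        if required_scope not in record.scopes:               # step 6
            continue
        return record
    return None
```

The prefix lookup keeps the expensive PBKDF2 check off the hot path for unknown keys: only rows whose `key_prefix` matches are hashed. The iteration count is stored inside the hash string, so the verifier does not need to know which Django version created the key.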
+ +## Health Check + +The `/health` endpoint returns service status and cache metrics: + +```json +{ + "status": "healthy", + "loaded_models": 3 +} +``` + +The `/models` endpoint lists all currently cached model IDs: + +```json +{ + "models": [1, 5, 12], + "count": 3 +} +``` + +These endpoints do not require authentication and are used by Docker health checks and monitoring systems. + +## Scaling Strategy + +### Horizontal Scaling + +The inference service is stateless (model cache is local to each instance). Multiple instances can run behind a load balancer: + +``` + +-- Inference Instance 1 (LRU cache) +Load Balancer ---+-- Inference Instance 2 (LRU cache) + +-- Inference Instance 3 (LRU cache) +``` + +- Each instance maintains its own LRU cache +- All instances subscribe to the same Redis Pub/Sub channel +- Model loading is idempotent (safe to load concurrently) +- Database connections are read-only (no write contention) + +### Memory Considerations + +- **Per-model memory**: Depends on the recommendation algorithm and data size +- **Cache limit**: Controlled by `INFERENCE_MAX_LOADED_MODELS` (default 10) +- **Docker memory**: 512 MB reserved, 4 GB limit +- **Eviction**: LRU eviction ensures memory stays bounded + +### Connection Pooling + +- SQLAlchemy engine uses default connection pooling for database access +- Redis Pub/Sub maintains a single persistent connection per instance +- nginx upstream uses `keepalive 8` connections to the inference service + +## Docker Configuration + +```yaml +inference: + depends_on: + db: { condition: service_healthy } + redis: { condition: service_healthy } + volumes: + - data-location:/data:ro # Read-only model files + healthcheck: + test: ["CMD-SHELL", "curl -f http://localhost:8081/health || exit 1"] + interval: 10s + start_period: 20s + deploy: + resources: + limits: { memory: 4G } + reservations: { memory: 512M } +``` + +The `/data` volume is mounted read-only (`ro`) since the inference service only reads model files; 
writing is done by the Celery worker. diff --git a/docs/specification/security-design.md b/docs/specification/security-design.md new file mode 100644 index 00000000..8cfb21cd --- /dev/null +++ b/docs/specification/security-design.md @@ -0,0 +1,396 @@ +# Security Design Specification + +## Overview + +Recotem implements defense-in-depth security across multiple layers: authentication and authorization at the API level, cryptographic integrity for serialized model files, multi-tenancy isolation via ownership filtering, and rate limiting at both the reverse proxy and application layers. + +## Authentication + +### Authentication Stack + +``` ++-------------------------------------------------------+ +| nginx (proxy) | +| - Rate limiting (api, auth, recommendation zones) | +| - Security headers (CSP, HSTS, X-Frame-Options) | +| - WebSocket log sanitization (query string excluded) | ++---------------------------+---------------------------+ + | + +-------------v-------------+ + | DRF Authentication | + | | + | 1. ApiKeyAuthentication | + | 2. JWTAuthentication | + | 3. 
SessionAuthentication | + +-------------+-------------+ + | + +-------------v-------------+ + | DRF Permissions | + | | + | - IsAuthenticated | + | - RequireManagementScope | + | - DenyApiKeyAccess | + +---------------------------+ +``` + +### JWT Authentication + +- **Library**: `djangorestframework-simplejwt` + `dj-rest-auth` +- **Token type**: Access token (short-lived) + Refresh token (1 day) +- **Access token lifetime**: Configurable via `ACCESS_TOKEN_LIFETIME` (default 300 seconds) +- **Storage**: Tokens are returned in response body (not cookies) +- **Login endpoint**: `POST /api/v1/auth/login/` +- **Refresh endpoint**: `POST /api/v1/auth/token/refresh/` + +``` +POST /api/v1/auth/login/ +Content-Type: application/json + +{"username": "admin", "password": "secret"} + +Response: +{"access": "eyJ...", "refresh": "eyJ...", "user": {...}} +``` + +Usage: `Authorization: Bearer eyJ...` + +### API Key Authentication + +API keys provide programmatic access scoped to a specific project. + +#### Key Format + +``` +rctm_ + +------------------------------+ + ^ + | + First 8 chars = key_prefix (stored for lookup) +``` + +- **Prefix**: `rctm_` (fixed, identifies Recotem API keys) +- **Random part**: Output of `secrets.token_urlsafe(48)` (48 random bytes, encoded as 64 URL-safe characters) +- **Lookup key**: First 8 characters of the random part (stored as `key_prefix`) +- **Storage**: Full key is hashed with Django's PBKDF2-SHA256 (`make_password()`) + +#### Key Generation + +```python +import secrets + +from django.contrib.auth.hashers import make_password + + +def generate_api_key() -> tuple[str, str, str]: + # token_urlsafe(48) draws 48 random bytes and returns 64 URL-safe characters + random_part = secrets.token_urlsafe(48) + full_key = f"rctm_{random_part}" + prefix = random_part[:8] + hashed_key = make_password(full_key) + return full_key, prefix, hashed_key +``` + +The full key is returned to the user exactly once at creation time. Only the prefix and hash are stored. + +#### Authentication Flow + +``` +Client Request + | + | X-API-Key: rctm_aBcDeFgHiJkLmNoPqRsTuVwX... + | + v +ApiKeyAuthentication.authenticate() + | + +-- 1. Extract header: HTTP_X_API_KEY + +-- 2.
Check prefix: starts with "rctm_"? + +-- 3. Extract random part, take first 8 chars as prefix + +-- 4. DB lookup: ApiKey.objects.get(key_prefix=prefix, is_active=True) + +-- 5. Check expiration: expires_at < now? + +-- 6. Verify hash: check_password(full_key, hashed_key) + +-- 7. Update last_used_at (fire-and-forget) + +-- 8. Attach key to request: request.api_key = key_obj + +-- 9. Return (key_obj.owner, key_obj) +``` + +#### Scopes + +API keys have JSON-array scopes that control access: + +| Scope | Grants Access To | +|---|---| +| `read` | GET, HEAD, OPTIONS on management endpoints | +| `write` | POST, PUT, PATCH, DELETE on management endpoints | +| `predict` | Inference API prediction endpoints | + +Scope enforcement: +- **Management API**: `RequireManagementScope` permission class checks `read`/`write` +- **Inference API**: `require_scope("predict")` FastAPI dependency +- **User management**: `DenyApiKeyAccess` unconditionally blocks API key access + +#### Inference Service Compatibility + +The inference service (FastAPI) verifies API keys independently using SQLAlchemy for database access. It uses `passlib.hash.django_pbkdf2_sha256` to verify keys against Django's PBKDF2-SHA256 hash format, ensuring compatibility without a Django dependency. + +### WebSocket Authentication + +See [WebSocket Protocol Specification](websocket-protocol.md) for details. JWT tokens are passed as `?token=` query parameters since browsers cannot send custom headers on WebSocket upgrade requests. + +## Model File Integrity (HMAC-SHA256 Signing) + +### Threat Model + +Trained recommendation models are serialized for persistence. Serialized files can potentially execute code on deserialization if tampered with. An attacker who can write to the model storage volume could inject malicious serialized files. + +### Signing Architecture + +``` +Training (Celery Worker) Serving (Inference / Backend) ++------------------------+ +------------------------+ +| | | | +| 1. Train model | | 1. 
Read file from | +| 2. Serialize model | | storage | +| 3. sign_bytes() | | 2. verify_and_extract | +| HMAC = SHA256( | | Verify HMAC | +| SECRET_KEY, | | Extract payload | +| payload) | | 3. Deserialize model | +| 4. Write: HMAC + | | | +| payload to file | | | +| | | | ++------------------------+ +------------------------+ +``` + +### File Format + +``` +Offset Length Content +0x00 32 HMAC-SHA256 signature +0x20 variable Serialized payload (model data) +``` + +Total file size = 32 + len(payload) bytes. + +### Signing (signing_core module) + +```python +def sign_bytes(key: bytes, payload: bytes) -> bytes: + signature = hmac.new(key, payload, hashlib.sha256).digest() + return signature + payload +``` + +- **Key**: `SECRET_KEY.encode("utf-8")` (Django's SECRET_KEY) +- **Algorithm**: HMAC-SHA256 (32-byte digest) +- **Called by**: `training_service.train_and_save_model()` after serialization + +### Verification (signing_core module) + +The verification function performs these steps: + +1. If data is 32 bytes or fewer: treat as legacy (if allowed) or reject +2. Split data: `signature = data[:32]`, `payload = data[32:]` +3. Compute expected HMAC: `HMAC-SHA256(key, payload)` +4. If `hmac.compare_digest(signature, expected)`: return payload (verified) +5. If `data[0] == 0x80` and `allow_legacy`: return data as-is (unsigned legacy file, warning logged) +6. Otherwise: raise `ValueError` (tampering detected) + +Uses `hmac.compare_digest()` for constant-time comparison to prevent timing attacks. + +### Legacy Unsigned File Handling + +Files created before HMAC signing was introduced do not have signatures. These are detected by the `0x80` byte at position 0, which is a protocol marker for the serialization format. 
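The verification steps and the legacy-file rule above can be sketched with the standard library alone. This is a minimal sketch, not the project's `signing_core` code: the name `verify_sketch`, the two constants, and the error messages are assumptions made for illustration. Only `sign_bytes`, the 32-byte signature layout, the `0x80` marker check, and the use of `hmac.compare_digest()` come from the text.

```python
import hashlib
import hmac

SIGNATURE_LEN = 32    # HMAC-SHA256 digest size in bytes (assumed constant name)
LEGACY_MARKER = 0x80  # first byte of an unsigned legacy file (assumed constant name)


def sign_bytes(key: bytes, payload: bytes) -> bytes:
    """Prepend the HMAC-SHA256 signature to the payload."""
    signature = hmac.new(key, payload, hashlib.sha256).digest()
    return signature + payload


def verify_sketch(key: bytes, data: bytes, allow_legacy: bool = False) -> bytes:
    """Hypothetical verifier following the six documented steps."""
    # Step 1: too short to hold a signature plus a payload.
    if len(data) <= SIGNATURE_LEN:
        if allow_legacy:
            return data
        raise ValueError("data too short to be signed")
    # Step 2: split into signature and payload.
    signature, payload = data[:SIGNATURE_LEN], data[SIGNATURE_LEN:]
    # Step 3: recompute the expected signature.
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # Step 4: constant-time comparison.
    if hmac.compare_digest(signature, expected):
        return payload
    # Step 5: unsigned legacy file, accepted only when explicitly allowed.
    if allow_legacy and data[0] == LEGACY_MARKER:
        return data
    # Step 6: anything else is treated as tampering.
    raise ValueError("signature verification failed")
```

A round trip such as `verify_sketch(key, sign_bytes(key, payload))` returns `payload`, while flipping any byte of the signed blob raises `ValueError` when `allow_legacy` is `False`.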
+ +| Scenario | `allow_legacy=True` | `allow_legacy=False` | +|---|---|---| +| Valid HMAC signature | Return payload | Return payload | +| No HMAC, starts with `0x80` | Return data (warning logged) | Raise `ValueError` | +| Invalid HMAC, no `0x80` marker | Raise `ValueError` | Raise `ValueError` | +| Data <= 32 bytes | Return data (warning logged) | Raise `ValueError` | + +Controlled by `PICKLE_ALLOW_LEGACY_UNSIGNED` setting (default `True`). After running `manage.py resign_models`, set to `False` to reject all unsigned files. + +### Shared Signing Module + +The signing core module has no Django dependencies. It is used by: +1. **Backend** (`services/signing.py`): Wraps core with Django settings for `SECRET_KEY` and `PICKLE_ALLOW_LEGACY_UNSIGNED` +2. **Inference service** (`signing.py`): Independent implementation using Pydantic settings + +Both services must share the same `SECRET_KEY` for signature verification. + +## Multi-Tenancy + +### Ownership Model + +``` ++--------------------------------------------------------+ +| User (Django) | +| | +| owns --> Project --> TrainingData, ModelConfig, etc. 
| +| | +| created_by --> SplitConfig, EvaluationConfig | +| | +| owns --> ApiKey (scoped to Project) | ++---------------------------------------------------------+ +``` + +### OwnedResourceMixin + +Applied to ViewSets for models with an ownership chain through `Project.owner`: + +```python +from django.db.models import Q + + +class OwnedResourceMixin: + owner_lookup: str = "owner" + + def get_owner_filter(self): + user = self.request.user + if user.is_staff: + return Q() # Staff sees everything + q = Q(**{self.owner_lookup: user}) | Q(**{f"{self.owner_lookup}__isnull": True}) + # API key project scope + # project_lookup is the ORM path to the project; its derivation is elided in this excerpt + api_key = getattr(self.request, "api_key", None) + if api_key is not None: + q &= Q(**{project_lookup: api_key.project_id}) + return q +``` + +**Behavior**: +- **Regular users**: See own resources + legacy unowned resources (`owner=NULL`) +- **Staff users**: See all resources +- **API key users**: Further filtered to the API key's project scope + +### CreatedByResourceMixin + +Applied to ViewSets for `SplitConfig` and `EvaluationConfig`, which use `created_by` instead of project ownership: + +```python +from django.db.models import Q + + +class CreatedByResourceMixin: + created_by_lookup: str = "created_by" + + def get_owner_filter(self): + user = self.request.user + if user.is_staff: + return Q() + return Q(**{self.created_by_lookup: user}) | Q(**{f"{self.created_by_lookup}__isnull": True}) +``` + +### ViewSet Owner Lookup Configuration + +| ViewSet | Mixin | `owner_lookup` | +|---|---|---| +| `ProjectViewSet` | `OwnedResourceMixin` | `"owner"` | +| `TrainingDataViewset` | `OwnedResourceMixin` | `"project__owner"` | +| `ModelConfigurationViewset` | `OwnedResourceMixin` | `"project__owner"` | +| `TrainedModelViewset` | `OwnedResourceMixin` | `"configuration__project__owner"` | +| `ParameterTuningJobViewSet` | `OwnedResourceMixin` | `"data__project__owner"` | +| `ABTestViewSet` | `OwnedResourceMixin` | `"project__owner"` | +| `DeploymentSlotViewSet` | `OwnedResourceMixin` | `"project__owner"` | +| `ApiKeyViewSet` | `OwnedResourceMixin` | `"owner"` | +| 
`SplitConfigViewSet` | `CreatedByResourceMixin` | `"created_by"` | +| `EvaluationConfigViewSet` | `CreatedByResourceMixin` | `"created_by"` | + +## Rate Limiting + +### Three-Layer Rate Limiting + +``` +Layer 1: nginx (connection-level) + | api: 30 req/s per IP + | auth: 5 req/min per IP + | recommendation: 30 req/min per IP + | + v +Layer 2: DRF Throttling (application-level) + | anon: 20/min + | user: 100/min + | login: 5/min + | recommendation: 30/min + | + v +Layer 3: slowapi (inference-level) + Per API key: 100/min (configurable) +``` + +### Login Brute-Force Protection + +Login is protected by three controls (two rate limits plus password-strength validation): +1. nginx `auth` zone: 5 requests/minute per IP with a burst of 3 +2. DRF `LoginRateThrottle`: 5/min scope (`AnonRateThrottle` subclass) +3. Django password validators (minimum length, common password check, numeric check) + +## Security Headers + +### nginx Security Headers + +``` +X-Frame-Options: DENY +X-Content-Type-Options: nosniff +Referrer-Policy: strict-origin-when-cross-origin +X-XSS-Protection: 0 +Content-Security-Policy: default-src 'self'; script-src 'self' 'unsafe-inline'; ... 
+Strict-Transport-Security: max-age=31536000; includeSubDomains +X-Permitted-Cross-Domain-Policies: none +Permissions-Policy: camera=(), microphone=(), geolocation=(), payment=() +X-Request-ID: +``` + +### Django Security Settings (Production) + +When `DEBUG=False`: +- `SECURE_HSTS_SECONDS`: 31536000 (1 year) +- `SECURE_HSTS_INCLUDE_SUBDOMAINS`: True +- `SECURE_HSTS_PRELOAD`: True +- `SECURE_SSL_REDIRECT`: True +- `SESSION_COOKIE_SECURE`: True +- `CSRF_COOKIE_SECURE`: True +- `SESSION_COOKIE_HTTPONLY`: True + +### Production Safety Checks + +Django settings include runtime assertions that prevent deployment with insecure defaults: +- `SECRET_KEY` must not be the default value +- `SECRET_KEY` must be at least 50 characters +- `ALLOWED_HOSTS` must not contain `"*"` or be empty + +## CORS and CSRF Configuration + +### CORS + +- **Same-origin deployments** (via nginx proxy): CORS is not needed +- **Cross-origin deployments**: Set `CORS_ALLOWED_ORIGINS` environment variable +- **Development**: Allows `localhost:5173` and `localhost:8000` +- **Credentials**: `CORS_ALLOW_CREDENTIALS = True` + +### CSRF + +- `CSRF_TRUSTED_ORIGINS`: Auto-derived from `ALLOWED_HOSTS` when not explicitly set +- Can be overridden via `CSRF_TRUSTED_ORIGINS` environment variable +- Required for Django Admin on non-HTTPS origins + +## Logging and Sensitive Data Protection + +### Sensitive Data Filter + +A custom logging filter (`_SensitiveDataFilter`) masks sensitive patterns in log output: + +| Pattern | Masked As | +|---|---| +| `://user:password@host` | `://user:***@host` | +| `AWS_SECRET_ACCESS_KEY=value` | `AWS_SECRET_ACCESS_KEY=***` | +| `AWS_SESSION_TOKEN=value` | `AWS_SESSION_TOKEN=***` | + +### WebSocket Log Sanitization + +nginx uses a custom log format (`ws_sanitized`) for WebSocket requests that excludes query strings, preventing JWT tokens from appearing in access logs: + +``` +log_format ws_sanitized '$remote_addr ... 
"$request_method $uri $server_protocol" ...'; +``` + +## Summary of Security Controls + +| Concern | Control | +|---|---| +| Authentication (API) | JWT access tokens + API keys with scoped permissions | +| Authentication (WebSocket) | JWT via query parameter | +| Authorization | OwnedResourceMixin / CreatedByResourceMixin for data isolation | +| Model file integrity | HMAC-SHA256 signing with SECRET_KEY | +| API key storage | PBKDF2-SHA256 hashing (Django `make_password`) | +| Rate limiting | nginx zones + DRF throttling + slowapi | +| Login protection | 3-layer rate limiting + Django password validators | +| Transport security | HSTS, SSL redirect, secure cookies (production) | +| Content security | CSP, X-Frame-Options DENY, X-Content-Type-Options nosniff | +| Log hygiene | Sensitive data filter + WebSocket log sanitization | +| Deployment safety | Runtime checks for SECRET_KEY and ALLOWED_HOSTS | diff --git a/docs/specification/task-system.md b/docs/specification/task-system.md new file mode 100644 index 00000000..d8ab5eb2 --- /dev/null +++ b/docs/specification/task-system.md @@ -0,0 +1,461 @@ +# Task System Specification + +## Overview + +Recotem's background task system is built on Celery with a Redis broker (db0). Tasks handle computationally intensive operations: hyperparameter tuning via Optuna, model training via irspack, and scheduled retraining. Task results are persisted to PostgreSQL via `django-celery-results`, and real-time progress is pushed to connected WebSocket clients via Django Channels. 
+ +## Architecture + +``` + Django Channels + (Redis db1) + ^ + | group_send() + | ++----------+ +----------+ +------+------+ +| Celery | | Redis | | Celery | +| Beat |--->| Broker |--->| Worker | +| (cron) | | (db0) | | | ++----------+ +----------+ +------+------+ + | + +------------+------------+ + | | | + +-----v----+ +----v-----+ +----v---------+ + | Optuna | | irspack | | Redis db3 | + | Storage | | Training | | (model event | + | (Postgres)| | | | Pub/Sub) | + +----------+ +----------+ +--------------+ +``` + +## Task Registry + +All tasks are defined in `backend/recotem/recotem/api/tasks.py` and registered with the Celery app. + +| Task | Signature | Purpose | +|---|---|---| +| `run_search` | `run_search(parameter_tuning_job_id, index)` | Single Optuna study worker | +| `task_create_best_config` | `task_create_best_config(parameter_tuning_job_id)` | Save best config after tuning | +| `task_create_best_config_train_rec` | `task_create_best_config_train_rec(parameter_tuning_job_id)` | Save config + auto-train model | +| `task_train_recommender` | `task_train_recommender(model_id)` | Train model from existing config | +| `task_scheduled_retrain` | `task_scheduled_retrain(schedule_id)` | Execute scheduled retraining | + +## Common Task Configuration + +All tasks share the following configuration: + +```python +@app.task( + bind=True, + time_limit=settings.CELERY_TASK_TIME_LIMIT, # default 3600s + soft_time_limit=settings.CELERY_TASK_SOFT_TIME_LIMIT, # default 3480s + autoretry_for=(ConnectionError, OSError), + retry_backoff=True, + max_retries=3, +) +``` + +| Setting | Default | Description | +|---|---|---| +| `CELERY_TASK_TIME_LIMIT` | 3600 (1 hour) | Hard kill timeout | +| `CELERY_TASK_SOFT_TIME_LIMIT` | 3480 (58 min) | Raises `SoftTimeLimitExceeded` | +| Auto-retry | `ConnectionError`, `OSError` | Network failures with exponential backoff | +| Max retries | 3 | Maximum retry attempts | + +## Task 1: run_search + +### Purpose + +Executes a subset of Optuna 
hyperparameter search trials. Multiple `run_search` tasks run in parallel (one per `n_tasks_parallel`) sharing the same Optuna study via PostgreSQL-backed storage. + +### Flow + +``` +run_search(parameter_tuning_job_id, index) + | + +-- 1. Create TaskResult and TaskAndParameterJobLink + +-- 2. Load job data: TrainingData, SplitConfig, EvaluationConfig + +-- 3. Resolve recommender algorithms to search + +-- 4. Atomically set job status: PENDING -> RUNNING (first worker wins) + +-- 5. Send WebSocket: status=running, log="Start job N / worker M" + +-- 6. Prepare dataset: + | split_dataframe_partial_user_holdout() -> train/val split + | Build sparse matrix + Evaluator + +-- 7. Create/load Optuna study (shared storage) + +-- 8. Run study.optimize(): + | For each trial: + | - suggest_categorical("recommender_class_name", [...]) + | - Get default parameters from recommender class + | - Train recommender on X_train + | - Evaluate on X_test + | - Return negative score (Optuna minimizes) + | - Callback: log trial results via WebSocket + bulk TaskLog + +-- 9. Flush remaining buffered TaskLog entries +``` + +### Trial Distribution + +Trials are distributed across parallel workers: + +```python +n_trials_per_worker = job.n_trials // job.n_tasks_parallel +# Extra trials assigned to early workers +if index < (job.n_trials % job.n_tasks_parallel): + n_trials_per_worker += 1 +``` + +Example: 40 trials across 3 workers = [14, 13, 13] + +### Algorithm Resolution + +The `_get_search_recommender_classes()` function resolves algorithm names: + +1. If `tried_algorithms_json` is `None`, use defaults: `IALSRecommender`, `CosineKNNRecommender`, `TopPopRecommender` +2. Otherwise, resolve each name (handling `*Optimizer` -> `*Recommender` suffix mapping) +3. 
If no names resolve, fall back to defaults + +### Log Buffering + +Trial log messages are buffered in memory and bulk-inserted every 10 trials for performance: + +```python +log_buffer: list[TaskLog] = [] +BULK_FLUSH_SIZE = 10 + +def callback(study, trial): + log_buffer.append(TaskLog(task=task_result, contents=message)) + if len(log_buffer) >= BULK_FLUSH_SIZE: + TaskLog.objects.bulk_create(log_buffer) + log_buffer.clear() + _send_ws_log(job_id, message) # Real-time WebSocket push +``` + +A `finally` block ensures remaining buffer entries are flushed even on failure. + +## Task 2: task_create_best_config + +### Purpose + +After all `run_search` workers complete, this task reads the best trial from the shared Optuna study and creates a `ModelConfiguration` record. + +### Flow + +``` +task_create_best_config(parameter_tuning_job_id) + | + +-- 1. Create TaskResult + +-- 2. create_best_config_fun(): + | +-- Lock job row (select_for_update) + | +-- Load best trial from Optuna storage + | +-- Extract parameters (strip prefixes) + | +-- Resolve recommender class name + | +-- Create ModelConfiguration record + | +-- Update job: best_config, best_score, irspack_version + | +-- If best_score == 0.0: set FAILED, raise error + | +-- Set job status: COMPLETED + | +-- Log and send WebSocket: completed with score + +-- 3. Return config_id +``` + +### Parameter Extraction + +Optuna stores parameters with algorithm-specific prefixes. The extraction logic: + +1. Collect `trial.params` and `trial.user_attrs` (if valid param names) +2. Strip prefixes: `re.sub(r"^([^\.]*\.)", "", key)` (removes `ClassName.` prefix) +3. Extract `recommender_class_name` from user attrs or params +4. Remove internal keys (`optimizer_name`, `recommender_class_name`) + +### Atomic Update + +The best config creation uses `transaction.atomic()` with `select_for_update()` on the job row to prevent race conditions if multiple config-save tasks execute. 
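The prefix-stripping rule from the Parameter Extraction steps above can be illustrated in isolation. This is a rough sketch under stated assumptions: the function name `extract_parameters` and the sample parameter names are hypothetical, and the handling of `trial.user_attrs` is left out. Only the regular expression and the two internal key names are taken from the text.

```python
import re
from typing import Any, Optional

# Regex quoted in the specification: drops everything up to the first dot.
_PREFIX_RE = re.compile(r"^([^\.]*\.)")

# Internal keys that must not reach the saved configuration.
INTERNAL_KEYS = {"optimizer_name", "recommender_class_name"}


def extract_parameters(raw: dict) -> tuple[Optional[str], dict]:
    """Hypothetical helper: strip 'ClassName.' prefixes and internal keys."""
    stripped: dict[str, Any] = {_PREFIX_RE.sub("", k): v for k, v in raw.items()}
    class_name = stripped.get("recommender_class_name")
    params = {k: v for k, v in stripped.items() if k not in INTERNAL_KEYS}
    return class_name, params


# Illustrative input only; the parameter names are made up for this example.
raw_params = {
    "IALSRecommender.n_components": 64,
    "IALSRecommender.alpha0": 0.1,
    "recommender_class_name": "IALSRecommender",
    "optimizer_name": "IALSOptimizer",
}
class_name, params = extract_parameters(raw_params)
```

Because the pattern is anchored at the start of the key and stops at the first dot, a key without a dot (such as `recommender_class_name`) passes through unchanged.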
+ +## Task 3: task_create_best_config_train_rec + +### Purpose + +Combines best config extraction and model training in one task. Used when `train_after_tuning=True`. + +### Flow + +``` +task_create_best_config_train_rec(parameter_tuning_job_id) + | + +-- 1. create_best_config_fun() -> config_id + +-- 2. Create TrainedModel record + +-- 3. train_recommender_func(task_result, model.id, job_id) + | +-- Train model using training_service.train_and_save_model() + | +-- Link task result to model + | +-- Update job.tuned_model on success + +-- 4. On success: _finalize_retraining_run(job, model) + | +-- Update RetrainingRun status -> COMPLETED + | +-- If auto_deploy: create/update DeploymentSlot + +-- 5. On failure: _fail_retraining_run_for_job(job_id) + | +-- Update RetrainingRun status -> FAILED +``` + +## Task 4: task_train_recommender + +### Purpose + +Trains a single model from an existing `ModelConfiguration`. Used for manual training (not part of a tuning flow). + +### Flow + +``` +task_train_recommender(model_id) + | + +-- 1. Create TaskResult + +-- 2. train_recommender_func(task_result, model_id) + | +-- Load TrainedModel + | +-- Create TaskAndTrainedModelLink + | +-- training_service.train_and_save_model(model) + | +-- Load training data CSV + | +-- Build sparse matrix + | +-- Train irspack recommender + | +-- Serialize model with IDMappedRecommender + | +-- Sign with HMAC-SHA256 + | +-- Save to storage + | +-- Publish model_trained event via Redis Pub/Sub + +-- 3. On SoftTimeLimitExceeded: log timeout, create TaskLog + +-- 4. On other error: log failure, create TaskLog +``` + +## Task 5: task_scheduled_retrain + +### Purpose + +Executed by Celery Beat on a cron schedule. Runs either a retune+train cycle or a train-only cycle depending on schedule configuration. + +### Flow + +``` +task_scheduled_retrain(schedule_id) + | + +-- 1. Load RetrainingSchedule with related objects + +-- 2. Check: is_enabled? If not, skip + +-- 3. 
Determine training data: + | If schedule.training_data is set, use it + | Otherwise, use the latest TrainingData for the project + +-- 4. Create RetrainingRun record (status=RUNNING) + +-- 5. Branch: + | + | [retune=True + split_config + evaluation_config] + | +-- Create ParameterTuningJob + | +-- start_tuning_job(job) -> async chord + | +-- Link run.tuning_job = job + | +-- Set schedule status = RUNNING + | +-- Return (tuning completion tracked asynchronously) + | + | [retune=False + model_configuration] + | +-- Create TrainedModel + | +-- train_and_save_model(model) -> synchronous + | +-- Link run.trained_model = model + | +-- Set run status = COMPLETED + | +-- If auto_deploy: create/update DeploymentSlot + | + | [Neither configured] + | +-- Set run status = SKIPPED + | + +-- 6. Update schedule: last_run_at, last_run_status +``` + +## Orchestration: start_tuning_job + +### Purpose + +Coordinates the parallel tuning workflow using Celery's `chain` and `group` primitives. + +### Task Graph + +``` + +------- run_search(job_id, 0) --------+ + | | +start_tuning_job() -> +------- run_search(job_id, 1) --------+ -> task_create_best_config + | (parallel group) | or + +------- run_search(job_id, N-1) ------+ task_create_best_config_train_rec + (chain continuation) +``` + +### Implementation + +```python +def start_tuning_job(job: ParameterTuningJob) -> None: + job.status = ParameterTuningJob.Status.PENDING + job.save(update_fields=["status"]) + + # Create Optuna study + optuna.create_study(storage=optuna_storage, study_name=study_name, ...) 
+ + if job.train_after_tuning: + chain( + group(run_search.si(job.id, i) for i in range(n_parallel)), + task_create_best_config_train_rec.si(job.id), + ).delay() + else: + chain( + group(run_search.si(job.id, i) for i in range(n_parallel)), + task_create_best_config.si(job.id), + ).delay() +``` + +### Celery Primitives + +- **`group(...)`**: Runs all `run_search` tasks in parallel across available workers +- **`chain(group, task)`**: After all group tasks complete, runs the config/train task +- **`.si()`**: Immutable signature -- prevents result passing between tasks (each task loads its own data) +- **`.delay()`**: Enqueues the entire chain to the broker + +### Error Handling in Orchestration + +If `start_tuning_job()` fails to enqueue tasks: +1. The exception is caught +2. Job status is set to FAILED +3. The error is re-raised + +## WebSocket Integration + +### Status Updates + +Tasks push status changes to WebSocket clients via Django Channels: + +```python +def _send_ws_status(job_id: int, status: str, data: dict = None) -> None: + channel_layer = get_channel_layer() + async_to_sync(channel_layer.group_send)( + f"job_{job_id}_status", + {"type": "job_status_update", "status": status, "data": data or {}}, + ) +``` + +### Log Messages + +Tasks push log entries to WebSocket clients: + +```python +def _send_ws_log(job_id: int, message: str) -> None: + channel_layer = get_channel_layer() + async_to_sync(channel_layer.group_send)( + f"job_{job_id}_logs", + {"type": "task_log_message", "message": message}, + ) +``` + +### Failure Resilience + +WebSocket push failures (Redis connection errors) are caught and logged as warnings. Tasks continue execution because: +1. Log messages are also persisted to `TaskLog` records in the database +2. Job status is persisted to the `ParameterTuningJob` model +3. 
WebSocket consumers support late-join buffering (clients reconnecting see the current state) + +## Auto-Deploy After Training + +When a `RetrainingSchedule` has `auto_deploy=True`, successfully trained models are automatically deployed: + +```python +def _auto_deploy_model(schedule, model): + slot_name = f"auto-deploy-{schedule.project.name}" + DeploymentSlot.objects.update_or_create( + project=schedule.project, + name=slot_name, + defaults={ + "trained_model": model, + "weight": 100, + "is_active": True, + }, + ) +``` + +- Creates a new deployment slot named `auto-deploy-` +- Or updates an existing slot with the same name +- Sets weight to 100 (full traffic) +- Sets the slot as active + +## Retraining Run Lifecycle + +### State Machine + +``` + +-- RUNNING --+ + | | + PENDING -+ +-- COMPLETED + | + +-- FAILED + | + +-- SKIPPED +``` + +### Tracking for Async Retraining + +When `retune=True`, the retraining flow is asynchronous: +1. `task_scheduled_retrain` creates the `RetrainingRun` and `ParameterTuningJob` +2. The tuning chord runs asynchronously +3. On completion, `_finalize_retraining_run()` updates the run status +4. On failure, `_fail_retraining_run_for_job()` marks the run as failed + +This is handled in `task_create_best_config_train_rec` which calls the finalization functions after the tuning+training pipeline completes or fails. 
+ +## Optuna Storage + +Optuna studies are stored in PostgreSQL (shared with the application database): + +```python +@lru_cache(maxsize=1) +def get_optuna_storage() -> RDBStorage: + db_url = settings.DATABASE_URL + # Convert to psycopg3 dialect for SQLAlchemy + if db_url.startswith("postgresql://"): + db_url = db_url.replace("postgresql://", "postgresql+psycopg://", 1) + return RDBStorage(db_url, engine_kwargs={"pool_size": 5, "max_overflow": 10}) +``` + +- Connection pooling: 5 connections + 10 overflow +- Cached singleton (one storage instance per process) +- Parallel workers share the study via database-level synchronization + +## Celery Beat Configuration + +Celery Beat is configured with `django-celery-beat`'s `DatabaseScheduler`: + +``` +celery -A recotem beat --loglevel=INFO --scheduler django_celery_beat.schedulers:DatabaseScheduler +``` + +This reads periodic task schedules from the Django database (`django_celery_beat_*` tables), which are managed via the `RetrainingSchedule` model and the Django Admin interface. + +## Result Storage + +Task results are stored in PostgreSQL via `django-celery-results`: + +```python +CELERY_RESULT_BACKEND = "django-db" +CELERY_RESULT_EXPIRES = 604800 # 7 days +``` + +Results are linked to domain objects via: +- `TaskAndParameterJobLink`: Links `TaskResult` to `ParameterTuningJob` +- `TaskAndTrainedModelLink`: Links `TaskResult` to `TrainedModel` +- `TaskLog`: Free-text log entries linked to `TaskResult` + +## Error Recovery + +### Task-Level Recovery + +| Error Type | Behavior | +|---|---| +| `ConnectionError` / `OSError` | Auto-retry with exponential backoff (max 3) | +| `SoftTimeLimitExceeded` | Log timeout, set job FAILED, create TaskLog, re-raise | +| Other exceptions | Log error, set job FAILED, create TaskLog, re-raise | + +### Retraining Run Recovery + +When a tuning job linked to a retraining run fails: +1. `_fail_retraining_run_for_job()` is called +2. 
Finds the `RetrainingRun` linked to the `ParameterTuningJob` +3. Sets run status to FAILED with error message +4. Updates schedule's `last_run_status` to FAILED + +### Data Consistency + +- Best config creation uses `transaction.atomic()` + `select_for_update()` to prevent race conditions +- Job status transitions use atomic field updates: `ParameterTuningJob.objects.filter(...).update(status=...)` +- Task log buffering uses `finally` blocks to ensure flushing even on errors diff --git a/docs/specification/websocket-protocol.md b/docs/specification/websocket-protocol.md new file mode 100644 index 00000000..51b3282c --- /dev/null +++ b/docs/specification/websocket-protocol.md @@ -0,0 +1,337 @@ +# WebSocket Protocol Specification + +## Overview + +Recotem uses WebSocket connections to deliver real-time updates from background Celery tasks to connected clients. The WebSocket layer is built on Django Channels with a Redis channel layer (db1). Two consumer types are provided: `JobStatusConsumer` for job status updates and `TaskLogConsumer` for streaming task log messages. + +## Connection Endpoints + +| Endpoint | Consumer | Description | +|---|---|---| +| `ws://host:8000/ws/job/{job_id}/status/` | `JobStatusConsumer` | Real-time job status updates | +| `ws://host:8000/ws/job/{job_id}/logs/` | `TaskLogConsumer` | Streaming task log messages | + +Routing is defined in `backend/recotem/recotem/api/routing.py`: +```python +websocket_urlpatterns = [ + re_path(r"^ws/job/(?P\d+)/status/$", JobStatusConsumer.as_asgi()), + re_path(r"^ws/job/(?P\d+)/logs/$", TaskLogConsumer.as_asgi()), +] +``` + +## Authentication + +### JWT via Query Parameter + +Browsers cannot send custom HTTP headers on WebSocket upgrade requests. Instead, JWT access tokens are passed as a `?token=` query parameter. + +``` +ws://host:8000/ws/job/42/status/?token=eyJhbGciOiJIUzI1NiIs... 
+``` + +### Authentication Flow + +``` +Client nginx Daphne (ASGI) + | | | + | GET /ws/job/42/status/ | | + | ?token= | | + | Upgrade: websocket | | + |------------------------------>| | + | | proxy_pass (ws upgrade) | + | |--------------------------->| + | | | + | | JwtAuthMiddleware | + | | +-----------------+ + | | | Parse ?token | + | | | Validate JWT | + | | | Load User | + | | | Set scope[user] | + | | +-----------------+ + | | | + | | Consumer.connect()| + | | +-----------------+ + | | | check_auth() | + | | | has_job_access()| + | | | Accept or Close | + | | +-----------------+ + | | | + |<--------------------------------------------- 101 / 4401 | +``` + +### JwtAuthMiddleware + +Defined in `backend/recotem/recotem/api/middleware.py`. Intercepts every WebSocket connection before it reaches the consumer: + +1. Extracts `token` from query string parameters +2. Validates the JWT access token via `rest_framework_simplejwt.tokens.AccessToken` +3. Loads the corresponding Django user +4. Sets `scope["user"]` for downstream consumers +5. Falls back to `AnonymousUser` if no token or validation fails + +### AuthenticatedConsumerMixin + +Both consumers use `AuthenticatedConsumerMixin` which provides: + +- **`check_auth()`**: Rejects unauthenticated users with close code `4401` +- **`has_job_access(job_id)`**: Verifies the user owns the job's project (or the project is unowned/legacy). Returns `False` and closes with code `4403` if access is denied. + +```python +# Access check query: +ParameterTuningJob.objects.filter(id=job_id).filter( + Q(data__project__owner_id=user.id) | Q(data__project__owner__isnull=True) +).exists() +``` + +### Close Codes + +| Code | Meaning | +|---|---| +| `4401` | Unauthenticated -- no valid JWT token | +| `4403` | Forbidden -- user does not have access to this job | + +## Heartbeat Mechanism + +Both consumers use `HeartbeatMixin` to keep connections alive. 
+ +### Configuration + +- **Interval**: 60 seconds (`HEARTBEAT_INTERVAL`) +- **Direction**: Server sends `ping`, client responds with `pong` + +### Protocol + +``` +Server Client + | | + | {"type": "ping"} | + |----------------------------------->| + | | + | {"type": "pong"} | + |<-----------------------------------| + | | + | ... 60 seconds ... | + | | + | {"type": "ping"} | + |----------------------------------->| +``` + +### Implementation + +- The heartbeat loop runs as an `asyncio` task, started on connection and cancelled on disconnect. +- Client `pong` messages are consumed by the `receive()` method and silently ignored (no processing). +- The heartbeat keeps the WebSocket connection open through proxies and load balancers that may have idle timeouts. +- nginx is configured with `proxy_read_timeout 300s` for WebSocket connections. + +## Consumer 1: JobStatusConsumer + +### Purpose + +Delivers real-time status updates for `ParameterTuningJob` instances. Used by the frontend to show job progress without polling. + +### Channel Group + +Group name: `job_{job_id}_status` + +### Connection Sequence + +1. Validate JWT authentication +2. Check job access permissions +3. Join channel group `job_{job_id}_status` +4. Accept WebSocket connection +5. Start heartbeat +6. Send current job status snapshot (late-join support) + +### Late-Join Buffer + +On connection, the consumer queries the current `ParameterTuningJob` status from the database and sends it as an initial message. This ensures clients that connect after a job has started (or completed) receive the current state: + +```json +{ + "type": "status_update", + "status": "running", + "data": { + "best_score": 0.85, + "buffered": true + }, + "seq": 0 +} +``` + +The `buffered: true` flag indicates this is historical data, not a live event. + +### Message Format: status_update + +Sent by the consumer to clients when a status change occurs. + +```json +{ + "type": "status_update", + "status": "", + "data": { ... 
}, + "seq": 0 +} +``` + +| Field | Type | Description | +|---|---|---| +| `type` | string | Always `"status_update"` | +| `status` | string | Job status: `"pending"`, `"running"`, `"completed"`, `"error"` | +| `data` | object | Status-specific payload | +| `seq` | integer | Monotonically increasing sequence number per connection | + +**Status-specific data payloads**: + +| Status | Data Fields | +|---|---| +| `running` | `{}` (empty) | +| `completed` | `{"best_score": }` | +| `error` | `{"error": ""}` | + +### Internal Channel Event: job_status_update + +Celery tasks send status updates to the channel group via: + +```python +channel_layer.group_send( + f"job_{job_id}_status", + {"type": "job_status_update", "status": "running", "data": {}} +) +``` + +The consumer's `job_status_update()` handler transforms this into the client-facing `status_update` message format with sequence numbering. + +## Consumer 2: TaskLogConsumer + +### Purpose + +Streams task log messages for a tuning job. Used by the frontend to display a live log feed during job execution. + +### Channel Group + +Group name: `job_{job_id}_logs` + +### Connection Sequence + +1. Validate JWT authentication +2. Check job access permissions +3. Join channel group `job_{job_id}_logs` +4. Accept WebSocket connection +5. Start heartbeat +6. 
Send existing log entries (late-join support, up to 500 entries) + +### Late-Join Buffer + +On connection, the consumer queries existing `TaskLog` entries linked to the job (via `task__tuning_job_link__job_id`) and sends them in chronological order: + +```json +{ + "type": "log", + "message": "Started the parameter tuning job 42", + "timestamp": "2025-01-15T10:30:00.123456+00:00", + "buffered": true, + "seq": 0 +} +``` + +- Maximum `500` entries are sent (`LATE_JOIN_BUFFER_LIMIT`) +- Entries are ordered by `ins_datetime` ascending +- The `buffered: true` flag distinguishes historical entries from live messages + +### Message Format: log + +Sent by the consumer to clients for each log message. + +```json +{ + "type": "log", + "message": "", + "timestamp": "", + "seq": 0 +} +``` + +| Field | Type | Description | +|---|---|---| +| `type` | string | Always `"log"` | +| `message` | string | Log message content | +| `timestamp` | string | ISO 8601 timestamp (may be empty for live messages) | +| `seq` | integer | Monotonically increasing sequence number per connection | + +### Internal Channel Event: task_log_message + +Celery tasks push log messages via: + +```python +channel_layer.group_send( + f"job_{job_id}_logs", + {"type": "task_log_message", "message": "Trial 3 with IALSRecommender..."} +) +``` + +## Sequence Numbering + +Both consumers maintain a per-connection sequence counter (`_seq`). Every outbound message includes a `seq` field that starts at 0 and increments by 1 for each message sent. 
This allows clients to: + +- Detect missed messages +- Maintain message ordering +- Distinguish the initial buffer from live updates + +## Complete Message Flow Diagram + +``` + Redis (db1) + Channel Layer +Client (Browser) Consumer Group Celery Worker + | | | | + | WS Connect | | | + |----------------->| | | + | | group_add() | | + | |------------------->| | + | status_update | | | + | (buffered) | | | + |<-----------------| | | + | | | | + | | | group_send() | + | | |<-----------------| + | | job_status_update | | + | |<-------------------| | + | status_update | | | + |<-----------------| | | + | | | | + | {"type":"ping"} | | | + |<-----------------| | | + | {"type":"pong"} | | | + |----------------->| | | + | | | | + | WS Disconnect | | | + |----------------->| | | + | | group_discard() | | + | |------------------->| | +``` + +## nginx WebSocket Configuration + +WebSocket connections are proxied through nginx with the following configuration: + +```nginx +location /ws/ { + access_log /dev/stdout ws_sanitized; + proxy_pass http://backend; + proxy_http_version 1.1; + proxy_set_header Upgrade $http_upgrade; + proxy_set_header Connection "upgrade"; + proxy_read_timeout 300s; +} +``` + +**Key points**: +- The `ws_sanitized` log format excludes query strings to prevent JWT token leakage in access logs. +- `proxy_read_timeout 300s` allows long-lived connections (5 minutes before nginx drops idle connections; heartbeat at 60s keeps them alive). +- The `Upgrade` and `Connection` headers enable the HTTP-to-WebSocket protocol switch. + +## Error Handling + +- If `_get_existing_logs()` fails during late-join buffer delivery, the error is logged but the connection remains open. Live messages will still be delivered. +- If `group_send()` fails in Celery tasks (e.g., Redis connection error), the exception is caught and logged as a warning. The task continues execution; log entries are also persisted to the database as `TaskLog` records. 
+- On disconnect the heartbeat task is cancelled; the loop catches the resulting `CancelledError` and exits cleanly.
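
The `group_send()` failure behaviour listed above can be sketched as a small best-effort wrapper. This is an illustration only, not the project's actual task code: the helper name `send_best_effort` and its `group_send` parameter are hypothetical, and the sketch assumes the caller passes in a synchronous callable (in a Celery task this would typically be the channel layer's `group_send` wrapped for synchronous use).

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def send_best_effort(
    group_send: Callable[[str, dict[str, Any]], None],
    group: str,
    event: dict[str, Any],
) -> bool:
    """Publish `event` to `group`, never letting a delivery failure propagate.

    `group_send` is a hypothetical injected callable standing in for the
    channel layer's send function; it is passed in so the sketch runs
    without a Redis instance.
    """
    try:
        group_send(group, event)
    except Exception as exc:  # e.g. a Redis connection error
        # Log at warning level and carry on: the task itself must not fail
        # just because live delivery to WebSocket clients failed.
        logger.warning("group_send to %s failed: %s", group, exc)
        return False
    return True
```

Returning a boolean rather than raising keeps the calling task on its normal path. The wrapper does not persist anything itself; in the design described above, the log entry is written to the database as a `TaskLog` record separately, so a failed send only affects the live feed.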