From 4957986afb5d457bcefa0663c5a553b35a45eb02 Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg Date: Wed, 1 Apr 2026 16:35:44 +0000 Subject: [PATCH 01/27] Enhance README and architectural documentation for clarity and completeness --- README.md | 67 ++++++++++++- docs/ARCHITECTURAL_DESIGN.md | 160 ++++++++++++++++---------------- middleware/sql_to_arc/README.md | 156 +++++++++++++++++-------------- 3 files changed, 232 insertions(+), 151 deletions(-) diff --git a/README.md b/README.md index 4da290a4..be56bad2 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,66 @@ -# m4.2_advanced_middleware_api +# FAIRagro SQL-to-ARC Middleware -The API component of the advanced middleware that accepts ARCs in RO-Create format and pushes them to the datahub +This repository contains the **SQL-to-ARC Converter**, a core component of the FAIRagro advanced middleware architecture. It enables Research Data Infrastructure (RDI) providers to transform their relational metadata into standardized Annotated Research Context (ARC) objects and transmit them to the central FAIRagro Middleware API. + +## 📁 Repository Structure + +| Folder | Description | +| :--- | :--- | +| [middleware/](middleware/) | Source code of the converter component. | +| [docs/](docs/) | Architectural design, database view specifications, and API documentation. | +| [dev_environment/](dev_environment/) | Docker-based local development setup (Postgres, Mock API). | +| [scripts/](scripts/) | Tooling for quality checks, environment setup, and Git LFS. | +| [docker/](docker/) | Dockerfiles and container structure tests. | + +## 🚀 Getting Started (Development) + +This project uses [uv](https://github.com/astral-sh/uv) for dependency management and workspace orchestration. + +### 1. Prerequisites + +- **Python 3.12+** +- **Docker & Docker Compose** +- **Git LFS** (installed via `./scripts/setup-git-lfs.sh`) + +### 2. Environment Setup + +Clone the repository and install all workspace dependencies: + +```bash +uv sync --all-packages +``` + +### 3. Start Local Development Environment + +The `dev_environment` folder provides a full stack including a PostgreSQL database pre-filled with edaphobase data: + +```bash +cd dev_environment +./start.sh --build +``` + +This will start the database and run a test iteration of the converter. + +## 🔧 Component Documentation + +Detailed information on how to use, configure, and deploy the specific components can be found in their respective subdirectories: + +- **[SQL-to-ARC Converter README](middleware/sql_to_arc/README.md)**: Configuration (YAML/Env), CLI options, and production deployment. +- **[Architectural Design](docs/ARCHITECTURAL_DESIGN.md)**: Deep dive into the concurrency model, memory management, and data flow. +- **[Database View Spec](docs/sql_to_arc_database_views.md)**: The SQL views required for the RDI provider database. + +## đŸ§Ș Quality Standards + +We maintain high code quality through automated checks: + +```bash +# Run all quality checks (Ruff, Mypy, Pylint, Bandit) +./scripts/quality-check.sh + +# Run unit and integration tests +uv run pytest middleware/sql_to_arc/tests/ +``` + +--- +**Maintained by:** FAIRagro Middleware Team +**License:** [LICENSE](LICENSE) diff --git a/docs/ARCHITECTURAL_DESIGN.md b/docs/ARCHITECTURAL_DESIGN.md index 4ce0700b..1b08433c 100644 --- a/docs/ARCHITECTURAL_DESIGN.md +++ b/docs/ARCHITECTURAL_DESIGN.md @@ -1,142 +1,142 @@ -# Architektur-Dokumentation: SQL-to-ARC Middleware +# Architectural Documentation: SQL-to-ARC Middleware -## 1. Übersicht +## 1. Overview -Die SQL-to-ARC Middleware ist fĂŒr die Konvertierung von Metadaten aus einer relationalen SQL-Datenbank in das **ARC (Annotated Research Context)** Format verantwortlich. Die Architektur ist auf **hohen Durchsatz**, **speichereffiziente Verarbeitung** und **StabilitĂ€t** bei großen Datenmengen ausgelegt. +The SQL-to-ARC Middleware is responsible for converting metadata from a relational SQL database into the **ARC (Annotated Research Context)** format. The architecture is designed for **high throughput**, **memory-efficient processing**, and **stability** when handling large volumes of data. -## 2. Kernkomponenten +## 2. Core Components -Die Middleware besteht aus drei Hauptschichten: +The middleware consists of three main layers: -1. **Async IO Loop (Controller):** Orchestriert den Datenfluss, verwaltet Datenbank-Streams und API-Uploads. -2. **Process Pool Executor (Worker):** Parallelisiert die CPU-lastige ARC-Berechnung in separaten Betriebssystem-Prozessen. -3. **Streaming Generator (Data Layer):** Liest Daten in Chunks aus der Datenbank, um den RAM-Verbrauch konstant zu halten. +1. **Async IO Loop (Controller):** Orchestrates the data flow, manages database streams, and handles API uploads. +2. **Process Pool Executor (Worker):** Parallelizes CPU-intensive ARC calculations in separate operating system processes. +3. **Streaming Generator (Data Layer):** Reads data in chunks from the database to keep RAM consumption constant. --- -## 3. Detaillierte Architekturkonzepte +## 3. Detailed Architectural Concepts -### 3.1 Parallelisierung & CPU-Offloading +### 3.1 Parallelization & CPU Offloading -Da die Generierung von ARCs (via `arctrl`) rechenintensiv ist und Python durch das Global Interpreter Lock (GIL) limitiert wird, nutzt die Middleware einen `ProcessPoolExecutor`. +Since generating ARCs (via `arctrl`) is computationally intensive and Python is limited by the Global Interpreter Lock (GIL), the middleware utilizes a `ProcessPoolExecutor`. -- **Vorteil:** Jede ARC-Berechnung lĂ€uft auf einem eigenen CPU-Kern. -- **Implementierung:** `loop.run_in_executor(executor, build_arc_for_investigation, ...)` -- **Multiprocessing Support:** Der Aufruf von `multiprocessing.freeze_support()` stellt sicher, dass die Middleware auch in "eingefrorenen" Umgebungen (wie PyInstaller-Binaries unter Windows) korrekt neue Prozesse starten kann. Unter Linux ist dies primĂ€r eine Best-Practice fĂŒr die Cross-Platform KompatibilitĂ€t. +* **Advantage:** Each ARC calculation runs on its own CPU core. +* **Implementation:** `loop.run_in_executor(executor, build_single_arc_task, ...)` +* **Multiprocessing Support:** Calling `multiprocessing.freeze_support()` ensures the middleware correctly starts new processes even in "frozen" environments (such as PyInstaller binaries on Windows). On Linux, this is primarily a best practice for cross-platform compatibility. -### 3.2 Concurrency & Flow Control (Die Semaphore) +### 3.2 Concurrency & Flow Control (The Semaphore) -ZusĂ€tzlich zum Prozess-Pool wird eine `asyncio.Semaphore` verwendet. Dies adressiert zwei kritische Probleme, die ein reiner Prozess-Pool nicht lösen kann: +In addition to the process pool, an `asyncio.Semaphore` is used. This addresses two critical issues that a pure process pool cannot solve: -1. **Memory Protection:** Ohne Semaphore wĂŒrde Python fĂŒr alle (z. B. 10.000) DatensĂ€tze gleichzeitig asynchrone Tasks starten und Daten aus der DB im RAM halten. Die Semaphore limitiert die Anzahl der *gleichzeitig aktiven* Workflows. -2. **Network/IO Backpressure:** Die Semaphore begrenzt auch die Anzahl der gleichzeitigen HTTP-Verbindungen zur API, um Timeouts und Rate-Limiting zu vermeiden. +1. **Memory Protection:** Without a semaphore, Python would start asynchronous tasks for all (e.g., 10,000) datasets simultaneously and keep the database data in RAM. The semaphore limits the number of *concurrently active* workflows. +2. **Network/IO Backpressure:** The semaphore also limits the number of simultaneous HTTP connections to the API to avoid timeouts and rate-limiting. -**Diskussionspunkt:** *Warum nicht einfach die GrĂ¶ĂŸe des Prozess-Pools limitieren?* -Der Prozess-Pool limitiert nur die CPU-Auslastung. Die Semaphore limitiert den **gesamten Lebenszyklus** (Datenvorbereitung -> Build -> Upload). Sie verhindert, dass der Speicher mit "wartenden" Daten ĂŒberlĂ€uft, bevor diese ĂŒberhaupt an den Pool ĂŒbergeben werden. +**Discussion Point:** *Why not simply limit the size of the process pool?* +The process pool only limits CPU usage. The semaphore limits the **entire lifecycle** (data preparation -> build -> upload). It prevents the memory from overflowing with "waiting" data before it is even handed over to the pool. -### 3.3 Speichereffizientes Daten-Streaming +### 3.3 Memory-Efficient Data Streaming -Die Middleware implementiert einen **Lazy-Loading** Ansatz fĂŒr Datenbank-EntitĂ€ten: +The middleware implements a **lazy-loading** approach for database entities: -- **Chunking:** Über den Generator `stream_investigation_datasets` werden Untersuchungen mit `fetchmany(batch_size)` geladen. -- **Relationales Batching:** FĂŒr jeden Chunk (z. B. 100 Untersuchungen) werden die zugehörigen Studies und Assays in einem einzigen Bulk-Query (`WHERE investigation_id = ANY(...)`) nachgeladen. -- **Effekt:** Wir vermeiden das "N+1 Query" Problem (extrem langsam) und gleichzeitig den "Full Table Load" (extrem speicherhungrig). +* **Chunking:** Using the `stream_investigations` generator (in the `Database` class), investigations are loaded in batches. +* **Relational Batching:** Associated studies, assays, contacts, and publications are fetched in bulk for each batch using specialized queries (e.g., `WHERE investigation_ref = ANY(...)`). +* **Effect:** We avoid the "N+1 Query" problem (extremely slow) while also avoiding a "Full Table Load" (extremely memory-intensive). --- -## 4. Speicher-Management & Performance-Optimierung +## 4. Memory Management & Performance Optimization -Bei der Verarbeitung von tausenden Investigations (ARC-Containern) kann der RAM-Verbrauch kritisch werden. Die Middleware implementiert hierfĂŒr drei Strategien: +When processing thousands of investigations (ARC containers), RAM consumption can become critical. The middleware implements three strategies for this: -### 4.1 Backlog Flow Control (Produzenten-Pause) +### 4.1 Backlog Flow Control (Producer Pause) -Der asynchrone Datenbank-Stream produziert Daten schneller, als der Prozess-Pool sie konvertieren kann. +The asynchronous database stream produces data faster than the process pool can convert it. -- **Problem:** Tausende `asyncio.Tasks` wĂŒrden gleichzeitig im RAM auf ihre AusfĂŒhrung warten, inklusive aller zugehörigen Datenbank-Zeilen. -- **Lösung:** Eine Drosselung im Haupt-Loop: `if len(running_tasks) >= max_concurrent_tasks: await asyncio.wait(...)`. Der Stream pausiert, bis wieder KapazitĂ€t frei ist. Dies limitiert die Anzahl der DatensĂ€tze, die sich gleichzeitig im Speicher befinden. +* **Problem:** Thousands of `asyncio.Tasks` would wait in RAM simultaneously for execution, including all associated database rows. +* **Solution:** Throttling in the main loop managed by the semaphore and task set management. The stream pauses until capacity becomes available. This limits the number of datasets residing in memory at once. ### 4.2 Worker-Side Serialization & GC -Die ARC-Objekte der `arctrl` Bibliothek sind komplex und beanspruchen sowohl Python- als auch .NET-Speicher. +ARC objects in the `arctrl` library are complex and consume both Python and .NET-bridge memory. -- **Strategie:** Die Konvertierung zum JSON-String erfolgt direkt im Worker-Prozess. -- **Memory Cleanup:** Nach der Serialisierung werden die ARC-Objekte im Worker explizit gelöscht (`del`) und der Garbage Collector (`gc.collect()`) aufgerufen, bevor der Prozess das Ergebnis an den Hauptprozess zurĂŒckgibt. Dies verhindert das "Anschwellen" der Worker-Prozesse. +* **Strategy:** Conversion to a JSON-LD string is performed directly within the worker process. +* **Memory Cleanup:** After serialization, ARC objects in the worker are explicitly deleted (`del`) and the garbage collector (`gc.collect()`) is called before the process returns the result to the main process. This prevents worker processes from "swelling." -### 4.3 JSON vs. Objekt-Transfer +### 4.3 JSON vs. Object Transfer -Zwischen dem Hauptprozess und den Workern werden keine komplexen ARC-Objekte ĂŒbertragen, sondern lediglich primitive Python-Datentypen (Dicts) als Input und fertige JSON-Strings als Output. Dies minimiert den Overhead der Inter-Prozess-Kommunikation (IPC). +In the current implementation, large ARC objects are not transferred between the main process and workers. Instead, primitive Python types (like dicts) are used as input, and serialized JSON-LD strings are returned as output. This minimizes Inter-Process Communication (IPC) overhead. -### 4.4 Entkopplung von I/O und CPU (Workload Balancing) +### 4.4 Decoupling I/O and CPU (Workload Balancing) -Um die CPU-Auslastung zu maximieren, wird die Anzahl der gleichzeitig aktiven Tasks (`max_concurrent_tasks`) unabhĂ€ngig von der Anzahl der CPU-Worker (`max_concurrent_arc_builds`) gesteuert. +To maximize CPU utilization, the number of concurrently active tasks (`max_concurrent_tasks`) is controlled independently of the number of CPU workers (`max_concurrent_arc_builds`). -- **Prinzip:** WĂ€hrend ein Teil der Tasks auf die Netzwerk-Antwort der API wartet (I/O), können die CPU-Worker bereits den nĂ€chsten ARC-Build aus der Warteschlange verarbeiten. -- **Konfiguration:** StandardmĂ€ĂŸig ist die Task-KapazitĂ€t doppelt so groß wie die Anzahl der CPU-Worker (einstellbar via `max_concurrent_tasks`), um Latenzen zu ĂŒberbrĂŒcken, ohne den RAM zu ĂŒberlasten. +* **Principle:** While some tasks wait for the API's network response (I/O), CPU workers can already process the next ARC build from the queue. +* **Configuration:** By default, task capacity is four times larger than the number of CPU workers (configurable via `max_concurrent_tasks`) to bridge latencies without overstretching RAM. --- -## 5. Datenfluss (Step-by-Step) +## 5. Data Flow (Step-by-Step) -1. **Producer:** Der Hauptprozess startet den Streaming-Generator. -2. **Throttle:** Der Loop wartet an der `Semaphore` auf einen freien Slot. -3. **Data Fetch:** Eine Untersuchung wird aus der DB gelesen. -4. **Build (CPU):** Der Datensatz wird an den `ProcessPoolExecutor` geschickt. Der Haupt-Loop bleibt wĂ€hrenddessen frei fĂŒr andere Aufgaben. -5. **Upload (I/O):** Das Ergebnis (JSON) wird asynchron per HTTP an die Middleware-API gesendet. -6. **Release:** Die Semaphore wird freigegeben, der nĂ€chste Datensatz fließt nach. +1. **Producer:** The main process starts the streaming generator. +2. **Throttle:** The loop waits on the `Semaphore` for an available slot. +3. **Data Fetch:** Investigation data and related entities (Studies, Assays, etc.) are fetched from the database. +4. **Build (CPU):** The dataset is sent to the `ProcessPoolExecutor`. The main loop remains free for other tasks in the meantime. +5. **Upload (I/O):** The result (JSON) is sent asynchronously via HTTP to the Middleware API using `ApiClient`. +6. **Release:** The semaphore is released, and the next dataset flows in. --- -## 6. Fehlerbehandlung & Monitoring +## 6. Error Handling & Monitoring -- **Gezieltes Exception Handling:** Fehler beim Upload oder beim Build fĂŒhren nicht zum Abbruch des gesamten Laufs. -- **ProcessingStats:** Jeder Erfolg und Fehler wird mit ID erfasst und am Ende als JSON-LD Report ausgegeben. -- **Tracing:** Die gesamte Kette ist mit OpenTelemetry (Tracing) instrumentiert, um Performance-EngpĂ€sse im Prozess-Pool oder Netzwerk zu identifizieren. +* **Targeted Exception Handling:** Errors during upload or build do not cause the entire run to abort. +* **ProcessingStats:** Every success and failure is recorded by ID and output as a JSON-LD report at the end. +* **Tracing:** The entire chain is instrumented with **OpenTelemetry** (tracing) to identify performance bottlenecks in the process pool or network. +* **Pre-flight Schema Validation:** The middleware verifies that all required database views and columns exist before starting the process. --- -## 7. Zusammenfassung der Design-Entscheidungen +## 7. Summary of Design Decisions -| Problem | Lösung | Grund | +| Problem | Solution | Reason | | :--- | :--- | :--- | -| GIL / CPU-Limit | `ProcessPoolExecutor` | Echte ParallelitĂ€t auf mehreren Kernen. | -| Niedrige CPU Auslastung | I/O-CPU Entkopplung | `max_concurrent_tasks` erlaubt API-Uploads parallel zu neuen ARC-Builds. | -| Memory Overflow (Backlog) | Producer Throttling | Verhindert, dass zu viele DatensĂ€tze gleichzeitig im RAM "warten". | -| Memory Leak (Worker) | `gc.collect()` + JSON Return | Gibt Speicher im Worker sofort nach der Konvertierung frei. | -| Datenbank-Last | `fetchmany` + `ANY()` | Optimale Balance zwischen Abfrage-Anzahl und Speicherlast. | -| Skalierbarkeit | Single ARC Processing | FrĂŒherer Erfolg/Fehler-Feedback pro Untersuchung statt nur pro Batch. | +| GIL / CPU Limit | `ProcessPoolExecutor` | True parallelism across multiple cores. | +| Low CPU Utilization | I/O-CPU Decoupling | `max_concurrent_tasks` allows API uploads in parallel with new ARC builds. | +| Memory Overflow (Backlog) | Producer Throttling | Prevents too many datasets from "waiting" in RAM simultaneously. | +| Memory Leak (Worker) | `gc.collect()` + JSON Return | Frees memory in the worker immediately after conversion. | +| Database Load | Server-side Cursors + `ANY()` | Optimal balance between number of queries and memory load. | +| Scalability | Single ARC Processing | Earlier success/error feedback per investigation instead of per batch only. | --- ## 8. Performance Tuning Guide -Um die Middleware optimal an die vorhandene Hardware und die Datenstruktur der Datenbank anzupassen, können folgende Parameter in der Konfigurationsdatei (`config.yaml`) optimiert werden: +To optimally adapt the middleware to existing hardware and database structures, the following parameters in the configuration file (`config.yaml`) can be optimized: -### 8.1 CPU & Parallelisierung +### 8.1 CPU & Parallelization -- **`max_concurrent_arc_builds`**: Bestimmt die Anzahl der Worker-Prozesse im `ProcessPoolExecutor`. - - **Empfehlung**: Setzen Sie diesen Wert auf die Anzahl der verfĂŒgbaren CPU-Kerne minus 1 (um Reserven fĂŒr den Hauptprozess und das Betriebssystem zu lassen). - - **Effekt**: Höhere CPU-Last, aber schnellere Verarbeitung der ARC-Generierung. +* **`max_concurrent_arc_builds`**: Determines the number of worker processes in the `ProcessPoolExecutor`. + * **Recommendation**: Set this value to the number of available CPU cores minus 1 (to leave reserves for the main process and the operating system). + * **Effect**: Higher CPU load, but faster execution of ARC generation. -### 8.2 Durchsatz & I/O Balancing +### 8.2 Throughput & I/O Balancing -- **`max_concurrent_tasks`**: Limitiert die Anzahl der gleichzeitig aktiven asynchronen Workflows (Datenfetch + Build + Upload). - - **Faustformel**: `4 * max_concurrent_arc_builds`. - - **Warum?**: WĂ€hrend z.B. 4 Kerne ARCs berechnen, können die restlichen Tasks auf die Netzwerk-Antwort der API warten (I/O). Ein zu hoher Wert fĂŒhrt zu erhöhtem RAM-Verbrauch; ein zu niedriger Wert lĂ€sst die CPU leerlaufen ("Stop-and-Go"). - - **Tuning**: Wenn die CPU-Auslastung trotz Arbeit stark schwankt, erhöhen Sie diesen Wert leicht (z.B. auf `5 * builds`). +* **`max_concurrent_tasks`**: Limits the number of concurrently active asynchronous workflows (data fetch + build + upload). + * **Rule of Thumb**: `4 * max_concurrent_arc_builds`. + * **Why?**: While 4 cores are calculating ARCs, other tasks can wait for the API's network response (I/O). A value that is too high leads to increased RAM consumption; a value that is too low causes the CPU to run dry ("Stop-and-Go"). -### 8.3 Datenbank-Effizienz +### 8.3 Database Efficiency -- **`db_batch_size`**: Anzahl der Investigations, die pro Datenbank-Chunk geladen werden. - - **Standard**: 100. - - **Tuning**: Erhöhen Sie diesen Wert bei sehr vielen kleinen Investigations (wenige Studies/Assays), um die Anzahl der SQL-Roundtrips zu senken. Senken Sie ihn, wenn einzelne Investigations extrem groß sind, um den RAM-Verbrauch des Hauptprozesses zu limitieren. +* **`db_batch_size`**: Number of investigations loaded per database chunk. + * **Default**: 100. + * **Tuning**: Increase this value if you have many small investigations (few studies/assays) to reduce SQL roundtrips. Decrease it if individual investigations are extremely large to limit the RAM consumption of the main process. -### 8.4 StabilitĂ€t & Timeouts +### 8.4 Stability & Timeouts -- **`arc_generation_timeout_minutes`**: Maximalzeit fĂŒr einen einzelnen `build_arc_for_investigation` Aufruf im Worker. - - **Tuning**: Erhöhen Sie diesen Wert, falls Sie im Log "Timeout" Fehler bei sehr großen DatensĂ€tzen (z.B. Tausende Assays) sehen. +* **`arc_generation_timeout_minutes`**: Maximum time for a single `build_single_arc_task` call in the worker. + * **Tuning**: Increase this value if you see "Timeout" errors in the log for very large datasets (e.g., thousands of assays). -### 8.5 Zusammenfassung: Das optimale Setup finden +### 8.5 Summary: Finding the Optimal Setup -1. **CPU-Limit finden**: Erhöhen Sie `max_concurrent_arc_builds` bis die CPU-Kerne ausgelastet sind. -2. **I/O-Löcher fĂŒllen**: Erhöhen Sie `max_concurrent_tasks`, wenn die CPU-Last zwischen den Builds auf 0% sinkt (Anzeichen fĂŒr Warten auf API-Uploads). -3. **RAM-Check**: Überwachen Sie den Speicherverbrauch. Der RAM-Bedarf steigt linear mit `max_concurrent_tasks` und der GrĂ¶ĂŸe der Investigations im Batch. +1. **Find CPU Limit:** Increase `max_concurrent_arc_builds` until CPU cores are saturated. +2. **Fill I/O Gaps:** Increase `max_concurrent_tasks` if CPU load drops to 0% between builds (an indication of waiting for API uploads). +3. **RAM Check:** Monitor memory consumption. RAM requirements increase linearly with `max_concurrent_tasks` and the size of investigations in the batch. diff --git a/middleware/sql_to_arc/README.md b/middleware/sql_to_arc/README.md index 78fbd20c..d7abc99f 100644 --- a/middleware/sql_to_arc/README.md +++ b/middleware/sql_to_arc/README.md @@ -1,106 +1,124 @@ -# SQL to ARC Converter +# SQL-to-ARC Middleware Converter -The `sql_to_arc` package converts data from a PostgreSQL database schema into FAIR ARC containers using ARCtrl, and uploads them to the Advanced Middleware API. +The **SQL-to-ARC Converter** is a high-performance middleware component designed to bridge research data infrastructures (RDIs) and the FAIRagro metadata ecosystem. -## Features +## Overview -- Async Database access via `sqlalchemy` (asyncio with asyncpg, aiosqlite, etc.) -- SQL View-based mapping of data to ARCtrl models -- Batch upload to the Middleware API using `ApiClient` -- Pydantic-based configuration with generic Connection String support +SQL-to-ARC runs locally at the RDI provider's infrastructure. It connects to the provider's SQL database, extracts metadata from a [pre-defined set of database views](docs/sql_to_arc_database_views.md), and transforms this data into **Annotated Research Context (ARC)** objects using the [ARCtrl Library](https://github.com/nfdi4plants/ARCtrl). -## Requirements +Once an ARC is constructed and validated, the tool uses the [middleware api_client library](https://github.com/fairagro/m4.2_advanced_middleware_api/tree/main/middleware/api_client) to transmit the resulting RO-Crate JSON-LD payloads to the [FAIRagro Middleware API](https://github.com/fairagro/m4.2_advanced_middleware_api). -- Python 3.12+ -- PostgreSQL reachable from runtime -- The workspace packages `shared` and `api_client` available (uv workspace) +### Key Features -## Install (uv) +- **High Throughput:** Parallel processing using a CPU-bound process pool for ARC generation. +- **Memory Efficient:** Streaming database access via server-side cursors and batching. +- **Robust:** Pre-flight schema validation and comprehensive OpenTelemetry tracing. +- **Configurable:** Fully customizable via YAML, Environment Variables, or Docker Secrets. -This repo uses `uv` for dependency management. +--- -```bash -# from repository root -uv sync --all-packages -uv run python -m middleware.sql_to_arc.main -``` +## Configuration -If you prefer a virtual environment only for this package: +The tool is configured using a YAML file, which can be overridden by Environment Variables. -```bash -cd middleware/sql_to_arc -uv sync -uv run python -m middleware.sql_to_arc.main -``` +### YAML Configuration (`config.yaml`) -## Configuration +| Field | Type | Description | +| :--- | :--- | :--- | +| `connection_string` | `string` | Database URI (e.g., `postgresql+psycopg://user:pass@host:5432/db`). | +| `rdi` | `string` | Unique identifier for your RDI (e.g., `edaphobase`). | +| `rdi_url` | `string` | Public URL of your RDI (used for provenance metadata). | +| `api_client` | `object` | Configuration for the Middleware API connection (see below). | +| `max_concurrent_arc_builds` | `int` | Number of parallel worker processes (Default: `5`). | +| `max_concurrent_tasks` | `int` | Max concurrent IO+CPU tasks (Default: `4 * builds`). | +| `db_batch_size` | `int` | Investigations to fetch per DB chunk (Default: `100`). | +| `debug_limit` | `int` | (Optional) Limit processing to the first N investigations. | -Configuration is defined by `middleware.sql_to_arc.config.Config` and can be provided as dict, env, or file. The default example in `main.py`: - -```python -config = Config.from_data({ - "connection_string": "postgresql+asyncpg://user:pass@localhost:5432/edaphobase", - "rdi": "edaphobase", - "api_client": { - "api_url": "http://localhost:8000", - "client_cert_path": "/path/to/cert.pem", - "client_key_path": "/path/to/key.pem", - "verify_ssl": "false", - }, -}) -``` +#### `api_client` Configuration -Environment variables can be supported by extending `Config` (e.g., `pydantic` `BaseSettings`). +| Field | Type | Description | +| :--- | :--- | :--- | +| `api_url` | `string` | URL of the Middleware API. | +| `client_cert_path` | `string` | Path to the client certificate (PEM). | +| `client_key_path` | `string` | Path to the client private key (PEM). | +| `verify_ssl` | `bool` | Whether to verify the API's SSL certificate (Default: `true`). | -## Running +### Environment Variables -Run the converter locally (async): +All configuration fields can be set via environment variables using the prefix **`SQL_TO_ARC_`**. -```bash -uv run python -m middleware.sql_to_arc.main -``` +- **Examples:** + - `SQL_TO_ARC_CONNECTION_STRING="postgresql://..."` + - `SQL_TO_ARC_RDI="my-rdi"` + - `SQL_TO_ARC_DEBUG_LIMIT=10` + +### Secrets Handling -This will: +In containerized environments, sensitive values like the `connection_string` or `client_key` can be provided via **Docker Secrets**. The tool looks for files in `/run/secrets/` with names matching the lowercase environment variable (e.g., `/run/secrets/sql_to_arc_connection_string`). -- Open a DB connection -- Fetch Investigations, Studies, Assays -- Map them to ARCtrl objects -- Upload batches to the Middleware API +--- -## Docker +## Usage -A Dockerfile for building a standalone binary exists at `docker/Dockerfile.sql_to_arc`. +### 1. From Source (Development) -Build: +Requires [uv](https://github.com/astral-sh/uv) installed. ```bash -docker build -f docker/Dockerfile.sql_to_arc -t sql-to-arc . +# Install dependencies for all workspace members +uv sync --all-packages + +# Run the converter with a specific config file +uv run python -m middleware.sql_to_arc.main -c my_config.yaml ``` -Run: +### 2. Local Docker Image + +Build the image from the repository root: ```bash -docker run --rm -e DB_HOST=... -e DB_USER=... -e DB_PASSWORD=... -e API_URL=... sql-to-arc +docker build -f docker/Dockerfile.sql_to_arc -t sql-to-arc:local . ``` -Adjust env variables or mount configuration as needed. +Run with environment variables: -## Development +```bash +docker run --rm \ + -e SQL_TO_ARC_CONNECTION_STRING="postgresql://..." \ + -e SQL_TO_ARC_RDI="my-rdi" \ + -v $(pwd)/certs:/certs \ + -e SQL_TO_ARC_API_CLIENT__CLIENT_CERT_PATH="/certs/client.crt" \ + sql-to-arc:local +``` -- Tests are under `middleware/sql_to_arc/tests` -- Mapping logic in `middleware/sql_to_arc/src/middleware/sql_to_arc/mapper.py` -- Main entrypoint `middleware/sql_to_arc/src/middleware/sql_to_arc/main.py` +### 3. Official Docker Image -Lint / format / type-check: +Pull the latest official image from Docker Hub (once available): ```bash -uv run ruff check -uv run ruff format -uv run mypy -p middleware.sql_to_arc +docker pull fairagro/sql-to-arc:latest + +docker run --rm \ + --env-file .env \ + -v $(pwd)/config.yaml:/etc/sql_to_arc/config.yaml:ro \ + fairagro/sql-to-arc:latest -c /etc/sql_to_arc/config.yaml ``` -## Troubleshooting +--- + +## CLI Options + +| Option | Description | +| :--- | :--- | +| `-c`, `--config` | Path to the YAML configuration file (Default: `config.yaml`). | +| `-v`, `--version` | Show the version and exit. | +| `-h`, `--help` | Show help and exit. | + +--- + +## Documentation Links -- Connection errors: verify DB host/port/user/password -- API errors: ensure `api_client` settings and server availability -- Type errors: run `uv run mypy` and update models/config accordingly +- [Architectural Design](docs/ARCHITECTURAL_DESIGN.md) +- [Database View Specification](docs/sql_to_arc_database_views.md) +- [ARCtrl Documentation](https://nfdi4plants.org/ARCtrl/) +- [Middleware API Client](https://github.com/fairagro/m4.2_advanced_middleware_api/tree/main/middleware/api_client) From 7085b91dd366a473d7a48a7f0f09b5da4a76ab23 Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg Date: Wed, 1 Apr 2026 17:00:40 +0000 Subject: [PATCH 02/27] Update VS Code settings and enhance README for Dev Container usage --- .vscode/settings.json | 3 +-- README.md | 18 ++++++++---------- 2 files changed, 9 insertions(+), 12 deletions(-) diff --git a/.vscode/settings.json b/.vscode/settings.json index b6a1e9e4..ed97bf02 100644 --- a/.vscode/settings.json +++ b/.vscode/settings.json @@ -11,7 +11,6 @@ "python.testing.autoTestDiscoverOnSaveEnabled": true, "python.testing.pytestPath": "${workspaceFolder}/.venv/bin/pytest", "python.testing.cwd": "${workspaceFolder}", - "python-envs.pythonProjects": [], // Code formatting - use ruff for both linting and formatting "[python]": { @@ -49,5 +48,5 @@ ], "pylint.importStrategy": "fromEnvironment", - "python-envs.defaultEnvManager": "ms-python.python:venv" + "python-envs.defaultEnvManager": "ms-python.python:system" } diff --git a/README.md b/README.md index be56bad2..791ad112 100644 --- a/README.md +++ b/README.md @@ -14,11 +14,14 @@ This repository contains the **SQL-to-ARC Converter**, a core component of the F ## 🚀 Getting Started (Development) -This project uses [uv](https://github.com/astral-sh/uv) for dependency management and workspace orchestration. +The preferred method for working with this repository is using the **Dev Container** (VS Code). -### 1. Prerequisites +While it is possible to develop without the Dev Container (see prerequisites below), this approach is not tested and is therefore neither documented nor officially supported. + +### 1. Prerequisites (for manual setups only) - **Python 3.12+** +- **[uv](https://github.com/astral-sh/uv)** (Dependency Management & Workspace Orchestration) - **Docker & Docker Compose** - **Git LFS** (installed via `./scripts/setup-git-lfs.sh`) @@ -32,14 +35,9 @@ uv sync --all-packages ### 3. Start Local Development Environment -The `dev_environment` folder provides a full stack including a PostgreSQL database pre-filled with edaphobase data: - -```bash -cd dev_environment -./start.sh --build -``` +The `dev_environment` folder provides a full stack including a PostgreSQL database pre-filled with edaphobase data. -This will start the database and run a test iteration of the converter. +Please refer to the **[Development Environment README](dev_environment/README.md)** for detailed instructions on prerequisites (like secret management and mTLS keys), setup, and usage. ## 🔧 Component Documentation @@ -62,5 +60,5 @@ uv run pytest middleware/sql_to_arc/tests/ ``` --- -**Maintained by:** FAIRagro Middleware Team +**Maintained by:** FAIRagro Middleware Team **License:** [LICENSE](LICENSE) From 01c24d1303a4904cc8d1a30836d35c7357d550cb Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg Date: Wed, 1 Apr 2026 17:07:13 +0000 Subject: [PATCH 03/27] Enhance README with updated development instructions and example YAML configuration --- README.md | 2 +- middleware/sql_to_arc/README.md | 21 +++++++++++++++++++++ 2 files changed, 22 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 791ad112..d159fb18 100644 --- a/README.md +++ b/README.md @@ -16,7 +16,7 @@ This repository contains the **SQL-to-ARC Converter**, a core component of the F The preferred method for working with this repository is using the **Dev Container** (VS Code). -While it is possible to develop without the Dev Container (see prerequisites below), this approach is not tested and is therefore neither documented nor officially supported. +While it is possible to develop without the Dev Container (see next steps below), this approach is not tested and is therefore neither documented nor officially supported. ### 1. Prerequisites (for manual setups only) diff --git a/middleware/sql_to_arc/README.md b/middleware/sql_to_arc/README.md index d7abc99f..1963eb4d 100644 --- a/middleware/sql_to_arc/README.md +++ b/middleware/sql_to_arc/README.md @@ -23,6 +23,22 @@ The tool is configured using a YAML file, which can be overridden by Environment ### YAML Configuration (`config.yaml`) +Example configuration file: + +```yaml +connection_string: "postgresql+psycopg://user:password@localhost:5432/edaphobase" +rdi: "edaphobase" +rdi_url: "https://portal.edaphobase.org" +max_concurrent_arc_builds: 4 +db_batch_size: 50 + +api_client: + api_url: "https://middleware.fairagro.net/api/v1" + client_cert_path: "/run/secrets/client.crt" + client_key_path: "/run/secrets/client.key" + verify_ssl: true +``` + | Field | Type | Description | | :--- | :--- | :--- | | `connection_string` | `string` | Database URI (e.g., `postgresql+psycopg://user:pass@host:5432/db`). | @@ -36,6 +52,11 @@ The tool is configured using a YAML file, which can be overridden by Environment #### `api_client` Configuration +The official **[FAIRagro Middleware API](https://middleware.fairagro.net/docs#/)** requires **mTLS (Mutual TLS)** for authentication. This means you must provide both a client certificate and a private key. + +To obtain a valid client certificate for your RDI, please contact the FAIRagro middleware team at: +`carsten (dot) scharfenberg (at) zalf (dot) de` + | Field | Type | Description | | :--- | :--- | :--- | | `api_url` | `string` | URL of the Middleware API. | From 33fc13aa0ed00b1fdfc70931e735459299724829 Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg Date: Wed, 1 Apr 2026 17:17:44 +0000 Subject: [PATCH 04/27] Update VS Code settings and enhance README with detailed configuration options --- .vscode/settings.json | 2 +- middleware/sql_to_arc/README.md | 53 +++++++++++++++++++++++---------- 2 files changed, 38 insertions(+), 17 deletions(-) diff --git a/.vscode/settings.json b/.vscode/settings.json index ed97bf02..38bccfc8 100644 --- a/.vscode/settings.json +++ b/.vscode/settings.json @@ -48,5 +48,5 @@ ], "pylint.importStrategy": "fromEnvironment", - "python-envs.defaultEnvManager": "ms-python.python:system" + "python-envs.defaultEnvManager": "ms-python.python:venv" } diff --git a/middleware/sql_to_arc/README.md b/middleware/sql_to_arc/README.md index 1963eb4d..e58ea85f 100644 --- a/middleware/sql_to_arc/README.md +++ b/middleware/sql_to_arc/README.md @@ -39,16 +39,21 @@ api_client: verify_ssl: true ``` -| Field | Type | Description | -| :--- | :--- | :--- | -| `connection_string` | `string` | Database URI (e.g., `postgresql+psycopg://user:pass@host:5432/db`). | -| `rdi` | `string` | Unique identifier for your RDI (e.g., `edaphobase`). | -| `rdi_url` | `string` | Public URL of your RDI (used for provenance metadata). | -| `api_client` | `object` | Configuration for the Middleware API connection (see below). | -| `max_concurrent_arc_builds` | `int` | Number of parallel worker processes (Default: `5`). | -| `max_concurrent_tasks` | `int` | Max concurrent IO+CPU tasks (Default: `4 * builds`). | -| `db_batch_size` | `int` | Investigations to fetch per DB chunk (Default: `100`). | -| `debug_limit` | `int` | (Optional) Limit processing to the first N investigations. | +| Field | Type | Description | Default | +| :--- | :--- | :--- | :--- | +| `connection_string` | `string` | Database URI (e.g., `postgresql+psycopg://user:pass@host:5432/db`). | **Required** | +| `rdi` | `string` | Unique identifier for your RDI (e.g., `edaphobase`). | **Required** | +| `rdi_url` | `string` | Public URL of your RDI (used for provenance metadata). | **Required** | +| `api_client` | `object` | Configuration for the Middleware API connection (see below). | **Required** | +| `log_level` | `string` | Console logging level (`DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`). | `INFO` | +| `otel` | `object` | OpenTelemetry configuration (see below). | `defaults` | +| `max_concurrent_arc_builds` | `int` | Number of parallel worker processes in the CPU pool. | `5` | +| `max_concurrent_tasks` | `int` | Max concurrent IO+CPU tasks. | `4 * builds` | +| `db_batch_size` | `int` | Investigations to fetch per DB chunk. | `100` | +| `max_studies` | `int` | Max studies per investigation (safety limit). | `5000` | +| `max_assays` | `int` | Max assays per investigation (safety limit). | `10000` | +| `arc_generation_timeout_minutes` | `int` | Timeout for a single ARC generation process. | `30` | +| `debug_limit` | `int` | (Optional) Limit processing to the first N investigations. | `None` | #### `api_client` Configuration @@ -57,12 +62,28 @@ The official **[FAIRagro Middleware API](https://middleware.fairagro.net/docs#/) To obtain a valid client certificate for your RDI, please contact the FAIRagro middleware team at: `carsten (dot) scharfenberg (at) zalf (dot) de` -| Field | Type | Description | -| :--- | :--- | :--- | -| `api_url` | `string` | URL of the Middleware API. | -| `client_cert_path` | `string` | Path to the client certificate (PEM). | -| `client_key_path` | `string` | Path to the client private key (PEM). | -| `verify_ssl` | `bool` | Whether to verify the API's SSL certificate (Default: `true`). | +| Field | Type | Description | Default | +| :--- | :--- | :--- | :--- | +| `api_url` | `string` | URL of the Middleware API. | **Required** | +| `client_cert_path` | `string` | Path to the client certificate (PEM). | `None` | +| `client_key_path` | `string` | Path to the client private key (PEM). | `None` | +| `ca_cert_path` | `string` | Path to a custom CA certificate for server verification. | `None` | +| `timeout` | `float` | Request timeout in seconds. | `30.0` | +| `verify_ssl` | `bool` | Whether to verify the API's SSL certificate. | `true` | +| `follow_redirects` | `bool` | Whether to follow HTTP redirects. | `true` | +| `max_concurrency` | `int` | Max concurrent HTTP requests. | `10` | +| `max_retries` | `int` | Max retries for transient HTTP errors. | `3` | +| `retry_backoff_factor` | `float` | Backoff factor for retries. | `2.0` | + +#### `otel` Configuration + +OpenTelemetry settings for distributed tracing and logging. + +| Field | Type | Description | Default | +| :--- | :--- | :--- | :--- | +| `endpoint` | `string` | OTel collector endpoint (e.g., `http://localhost:4318`). | `None` | +| `log_console_spans` | `bool` | Whether to print OTel spans to the console. | `false` | +| `log_level` | `string` | Logging level for OTLP log export. | `INFO` | ### Environment Variables From 6bb38254869ed797162541fa973a456e4f2ac5d0 Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg Date: Thu, 2 Apr 2026 13:22:21 +0000 Subject: [PATCH 05/27] Enhance development environment with demo setup and configuration updates - Added demo Docker Compose files for local testing with mock API - Updated README and AGENTS.md for clarity on demo usage - Introduced start-demo.sh and start-dev.sh scripts for easier environment setup - Modified configuration files for demo and development modes - Added demo SQL initialization script and mock API implementation - Updated .gitignore to exclude demo output directory --- .gitattributes | 1 + .gitignore | 2 + .vscode/copilot-instructions.md | 2 +- .vscode/launch.json | 2 +- AGENTS.md | 16 +- dev_environment/README.md | 53 ++++--- dev_environment/compose.demo.yaml | 60 ++++++++ .../{compose.yaml => compose.dev.yaml} | 22 +-- .../{debug_config.yaml => config.debug.yaml} | 0 dev_environment/config.demo.yaml | 8 + .../{config.yaml => config.dev.yaml} | 0 dev_environment/demo.sql | 108 +++++++++++++ dev_environment/demo_api_main.py | 143 ++++++++++++++++++ .../{FAIRagro.sql => edaphobase.sql} | 0 dev_environment/start-demo.sh | 49 ++++++ dev_environment/{start.sh => start-dev.sh} | 8 +- dev_environment/stop.sh | 24 --- 17 files changed, 431 insertions(+), 67 deletions(-) create mode 100644 dev_environment/compose.demo.yaml rename dev_environment/{compose.yaml => compose.dev.yaml} (72%) rename dev_environment/{debug_config.yaml => config.debug.yaml} (100%) create mode 100644 dev_environment/config.demo.yaml rename dev_environment/{config.yaml => config.dev.yaml} (100%) create mode 100644 dev_environment/demo.sql create mode 100644 dev_environment/demo_api_main.py rename dev_environment/{FAIRagro.sql => edaphobase.sql} (100%) create mode 100755 dev_environment/start-demo.sh rename dev_environment/{start.sh => start-dev.sh} (84%) delete mode 100755 dev_environment/stop.sh diff --git a/.gitattributes b/.gitattributes index 9b6967e8..62379026 100644 --- a/.gitattributes +++ b/.gitattributes @@ -30,4 +30,5 @@ Dockerfile text eol=lf *.dll binary *.so binary *.sql filter=lfs diff=lfs merge=lfs -text +dev_environment/demo.sql text !filter !diff !merge docs/create_empty_views.sql text !filter !diff !merge diff --git a/.gitignore b/.gitignore index 40933497..6d63a014 100644 --- a/.gitignore +++ b/.gitignore @@ -224,3 +224,5 @@ helmchart/**/server.conf # ggshield cache .cache_ggshield + +dev_environment/demo_output/ \ No newline at end of file diff --git a/.vscode/copilot-instructions.md b/.vscode/copilot-instructions.md index a9bf3a08..f431b04d 100644 --- a/.vscode/copilot-instructions.md +++ b/.vscode/copilot-instructions.md @@ -63,7 +63,7 @@ uv run mypy middleware/ source scripts/load-env.sh # Docker -cd dev_environment && ./start.sh --build +cd dev_environment && ./start-dev.sh --build ``` ## ⚠ Common Patterns diff --git a/.vscode/launch.json b/.vscode/launch.json index 236a314d..8e0f94cd 100644 --- a/.vscode/launch.json +++ b/.vscode/launch.json @@ -13,7 +13,7 @@ "envFile": "${workspaceFolder}/.env", "args": [ "-c", - "${workspaceFolder}/dev_environment/debug_config.yaml" + "${workspaceFolder}/dev_environment/config.debug.yaml" ] }, { diff --git a/AGENTS.md b/AGENTS.md index 70868640..70f93eea 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -38,9 +38,9 @@ scripts/ └── post-merge dev_environment/ -├── start.sh # Start Docker Compose (Postgres + Converter) -├── compose.yaml # Docker services definition -└── config.yaml # Development configuration for the converter +├── start-dev.sh # Start Docker Compose (Postgres + Converter) +├── compose.dev.yaml # Docker services definition +└── config.dev.yaml # Development configuration for the converter ``` ## 🔧 Important Commands @@ -61,9 +61,13 @@ uv sync --dev --all-packages ### Development Environment ```bash -# Start local database and run converter +# Start a full local demo (including mock API, no secrets/mTLS required) cd dev_environment -./start.sh --build +./start-demo.sh --build + +# Start local database and run converter (requires decryption via sops) +cd dev_environment +./start-dev.sh --build # View logs docker compose logs -f @@ -108,7 +112,7 @@ services: sql_to_arc: # The converter component (this repo) ``` -**Configuration**: `dev_environment/config.yaml` +**Configuration**: `dev_environment/config.dev.yaml` - Connects to `postgres` service on port 5432. - Uses `api_url` pointing to an external Middleware API if needed. diff --git a/dev_environment/README.md b/dev_environment/README.md index 2cd2c0e7..241975fb 100644 --- a/dev_environment/README.md +++ b/dev_environment/README.md @@ -30,19 +30,32 @@ The SQL-to-ARC converter that: - Waits for db-init to complete - Connects to PostgreSQL and Middleware API - Mounts encrypted secrets via sops -- Currently set to `sleep 3600` (modify compose.yaml to enable converter) +- Currently set to `sleep 3600` (modify compose.dev.yaml to enable converter) -### 4. middleware-api +### 4. middleware-api (Demo Only) -The FAIRagro Middleware API service that: +A simple mock service that simulates the Middleware API for local testing. -- Builds from `../docker/Dockerfile.api` -- Runs on port `8000` -- Provides REST API for ARC management -- No mTLS validation in dev mode (HTTP without client certs) -- Health check via `/live` endpoint +- No mTLS required (uses HTTP) +- Logs all incoming ARC uploads +- Available via `http://localhost:8000` -## Quick Start +## Quick Start (Demo Mode - Recommended for First Time) + +If you don't have the mTLS keys yet and just want to see the workflow in action with a local database and a mock API: + +```bash +./start-demo.sh --build +``` + +This starts: + +1. **postgres**: A local DB. +2. **db-init**: Fills the DB with sample data. +3. **middleware-api**: A local mock API. +4. **sql-to-arc**: The converter, pointing to the local mock. + +## Quick Start (Standard/External Mode) ### Prerequisites @@ -50,10 +63,10 @@ The FAIRagro Middleware API service that: - [sops](https://github.com/getsops/sops) for secret management - Age or PGP key configured for sops decryption -### Start Everything +### Start with Decryption ```bash -./start.sh +./start-dev.sh ``` This will: @@ -65,7 +78,7 @@ This will: With image rebuild: ```bash -./start.sh --build +./start-dev.sh --build ``` ### Start with External Middleware API @@ -126,9 +139,9 @@ sops client.key sops -d client.key ``` -The `start.sh` script uses `sops exec-file` to temporarily decrypt `client.key` during container startup. +The `start-dev.sh` script uses `sops exec-file` to temporarily decrypt `client.key` during container startup. -### config.yaml +### config.dev.yaml Application configuration for sql_to_arc: @@ -172,7 +185,7 @@ docker compose logs sql_to_arc Common issues: - Secrets not mounted → verify sops decryption works: `sops -d client.key` -- API unreachable → check `api_url` in config.yaml +- API unreachable → check `api_url` in config.dev.yaml - Database connection → verify db-init completed successfully ### Rebuild specific service @@ -182,7 +195,7 @@ docker compose build sql_to_arc docker compose up sql_to_arc ``` -## Manual Usage (without start.sh) +## Manual Usage (without start-dev.sh) If you don't want to use sops or the start script: @@ -201,15 +214,15 @@ sops exec-file client.key \ ## Development Workflow 1. Make changes to sql_to_arc code -2. Rebuild image: `./start.sh --build` +2. Rebuild image: `./start-dev.sh --build` 3. View logs: `docker compose logs -f sql_to_arc` 4. Iterate ## Files -- `compose.yaml` - Docker Compose service definitions -- `config.yaml` - Application configuration +- `compose.dev.yaml` - Docker Compose service definitions +- `config.dev.yaml` - Application configuration - `client.crt` - Client certificate (plain) - `client.key` - Client private key (encrypted with sops) -- `start.sh` - Startup script with sops integration +- `start-dev.sh` - Startup script with sops integration - `run.sh` - **DEPRECATED** - Old script (kept for reference) diff --git a/dev_environment/compose.demo.yaml b/dev_environment/compose.demo.yaml new file mode 100644 index 00000000..06a9055a --- /dev/null +++ b/dev_environment/compose.demo.yaml @@ -0,0 +1,60 @@ +services: + db-init: + image: postgres:15 + depends_on: + postgres: + condition: service_healthy + environment: + PGHOST: postgres + PGUSER: ${POSTGRES_USER} + PGPASSWORD: ${POSTGRES_PASSWORD} + PGDATABASE: postgres + volumes: + - ./demo.sql:/tmp/demo.sql:ro + entrypoint: /bin/bash + command: + - -c + - | + set -euo pipefail + echo "Recreating rdi database for demo..." + psql -c "DROP DATABASE IF EXISTS rdi;" + psql -c "CREATE DATABASE rdi;" + echo "Importing local demo.sql..." + PGDATABASE=rdi psql < /tmp/demo.sql + echo "Demo database initialization complete." + + middleware-api: + image: python:3.12-slim + container_name: middleware-api-demo + environment: + - LOG_LEVEL=INFO + - LOCAL_UID=${LOCAL_UID:-1000} + - LOCAL_GID=${LOCAL_GID:-1000} + ports: + - "8000:8000" + volumes: + - ./demo_output:/data/arcs + - ./demo_api_main.py:/app/main.py:ro + command: + - /bin/sh + - -c + - | + set -e + apt-get update + apt-get install -y --no-install-recommends curl + pip install fastapi uvicorn arctrl + uvicorn main:app --app-dir /app --host 0.0.0.0 --port 8000 + healthcheck: + test: ["CMD", "curl", "-f", "http://localhost:8000/live"] + interval: 5s + timeout: 5s + retries: 5 + + sql_to_arc: + depends_on: + middleware-api: + condition: service_healthy + volumes: + - ./config.demo.yaml:/etc/sql_to_arc/config.yaml:ro + environment: + SQL_TO_ARC_CONNECTION_STRING: "postgresql+psycopg://postgres:postgres@postgres:5432/rdi" diff --git a/dev_environment/compose.yaml b/dev_environment/compose.dev.yaml similarity index 72% rename from dev_environment/compose.yaml rename to dev_environment/compose.dev.yaml index 4a1e502f..4a02acf6 100644 --- a/dev_environment/compose.yaml +++ b/dev_environment/compose.dev.yaml @@ -28,7 +28,7 @@ services: PGPASSWORD: ${POSTGRES_PASSWORD} PGDATABASE: postgres volumes: - - ./FAIRagro.sql:/tmp/FAIRagro.sql:ro + - ./edaphobase.sql:/tmp/edaphobase.sql:ro entrypoint: /bin/bash command: - -c @@ -37,17 +37,17 @@ services: apt-get update && apt-get install -y --no-install-recommends wget ca-certificates - echo "Dropping and recreating database edaphobase..." - psql -c "DROP DATABASE IF EXISTS edaphobase;" - psql -c "CREATE DATABASE edaphobase;" + echo "Dropping and recreating database rdi..." + psql -c "DROP DATABASE IF EXISTS rdi;" + psql -c "CREATE DATABASE rdi;" - echo "Attempting to download Edaphobase dump..." - if wget -q -O /tmp/downloaded_FAIRagro.sql https://repo.edaphobase.org/rep/dumps/FAIRagro.sql; then + echo "Attempting to download latest Edaphobase dump..." + if wget -q -O /tmp/downloaded_edaphobase.sql https://repo.edaphobase.org/rep/dumps/FAIRagro.sql; then echo "Importing downloaded dump..." - PGDATABASE=edaphobase psql < /tmp/downloaded_FAIRagro.sql - elif [ -f /tmp/FAIRagro.sql ]; then - echo "Download failed. Importing local FAIRagro.sql fallback..." - PGDATABASE=edaphobase psql < /tmp/FAIRagro.sql + PGDATABASE=rdi psql < /tmp/downloaded_edaphobase.sql + elif [ -f /tmp/edaphobase.sql ]; then + echo "Download failed. Importing local edaphobase.sql fallback..." + PGDATABASE=rdi psql < /tmp/edaphobase.sql else echo "Error: Could not download dump and no local fallback found." exit 1 @@ -69,7 +69,7 @@ services: tmpfs: - /run/secrets:mode=1777 volumes: - - ./config.yaml:/etc/sql_to_arc/config.yaml:ro + - ./config.dev.yaml:/etc/sql_to_arc/config.yaml:ro - ./client.crt:/etc/sql_to_arc/client.crt:ro command: > sh -c "printf '%s' \"$$SQL_TO_ARC_CLIENT_KEY_DATA\" > /run/secrets/client.key && diff --git a/dev_environment/debug_config.yaml b/dev_environment/config.debug.yaml similarity index 100% rename from dev_environment/debug_config.yaml rename to dev_environment/config.debug.yaml diff --git a/dev_environment/config.demo.yaml b/dev_environment/config.demo.yaml new file mode 100644 index 00000000..5adac56b --- /dev/null +++ b/dev_environment/config.demo.yaml @@ -0,0 +1,8 @@ +connection_string: "postgresql+psycopg://postgres:postgres@postgres:5432/rdi" +rdi: "demo" +rdi_url: "https://example.org" + +api_client: + api_url: "http://middleware-api:8000/" + verify_ssl: false + diff --git a/dev_environment/config.yaml b/dev_environment/config.dev.yaml similarity index 100% rename from dev_environment/config.yaml rename to dev_environment/config.dev.yaml diff --git a/dev_environment/demo.sql b/dev_environment/demo.sql new file mode 100644 index 00000000..7f8c0ff9 --- /dev/null +++ b/dev_environment/demo.sql @@ -0,0 +1,108 @@ +-- Demo SQL for FAIRagro SQL-to-ARC +-- This is a minimal dataset with 10 investigations for demonstration purposes. + +CREATE TABLE IF NOT EXISTS "vInvestigation" ( + identifier TEXT PRIMARY KEY, + title TEXT NOT NULL, + description_text TEXT NOT NULL, + submission_date TIMESTAMP, + public_release_date TIMESTAMP +); + +CREATE TABLE IF NOT EXISTS "vContact" ( + last_name TEXT, + first_name TEXT, + mid_initials TEXT, + email TEXT, + phone TEXT, + fax TEXT, + postal_address TEXT, + affiliation TEXT, + roles TEXT, + investigation_ref TEXT, + target_type TEXT DEFAULT 'investigation', + target_ref TEXT +); + +CREATE TABLE IF NOT EXISTS "vStudy" ( + identifier TEXT PRIMARY KEY, + title TEXT, + description_text TEXT, + submission_date TIMESTAMP, + public_release_date TIMESTAMP, + investigation_ref TEXT +); + +CREATE TABLE IF NOT EXISTS "vAssay" ( + identifier TEXT PRIMARY KEY, + measurement_type_term TEXT, + technology_type_term TEXT, + technology_platform TEXT, + investigation_ref TEXT, + study_ref TEXT, + title TEXT, + description_text TEXT, + measurement_type_uri TEXT, + measurement_type_version TEXT, + technology_type_uri TEXT, + technology_type_version TEXT +); + +CREATE TABLE IF NOT EXISTS "vPublication" ( + investigation_ref TEXT, + target_type TEXT DEFAULT 'investigation', + pubmed_id TEXT, + doi TEXT, + authors TEXT, + title TEXT, + status_term TEXT, + status_uri TEXT, + status_version TEXT, + target_ref TEXT +); + +CREATE TABLE IF NOT EXISTS "vAnnotationTable" ( + table_name TEXT, + target_type TEXT, + target_ref TEXT, + investigation_ref TEXT, + column_type TEXT, + column_io_type TEXT, + column_value TEXT, + column_annotation_term TEXT, + column_annotation_uri TEXT, + column_annotation_version TEXT, + row_index INTEGER, + cell_value TEXT, + cell_annotation_term TEXT, + cell_annotation_uri TEXT, + cell_annotation_version TEXT +); + +-- Insert 10 demo investigations +INSERT INTO "vInvestigation" (identifier, title, description_text, submission_date, public_release_date) VALUES +('INV001', 'Soil Microbiome Study A', 'Analysis of soil microbiome in temperate forests.', '2023-01-15', '2023-06-01'), +('INV002', 'Wheat Yield Optimization', 'Field trials for drought-resistant wheat varieties.', '2023-02-10', '2023-07-15'), +('INV003', 'Peatland Carbon Sequestration', 'Long-term monitoring of carbon flux in northern peatlands.', '2023-03-05', '2023-08-20'), +('INV004', 'Invasive Species Impact', 'Study on the spread of Japanese Knotweed in riparian zones.', '2023-04-12', '2023-09-10'), +('INV005', 'Nitrate Leaching in Grasslands', 'Measuring nitrogen runoff after intensive fertilization.', '2023-05-20', '2023-10-05'), +('INV006', 'Urban Soil Contamination', 'Heavy metal analysis in community gardens.', '2023-06-15', '2023-11-15'), +('INV007', 'Forest Canopy Biodiversity', 'Arthropod diversity in old-growth beech forests.', '2023-07-01', '2023-12-01'), +('INV008', 'Maize Phenotyping Experiment', 'High-throughput imaging of maize growth stages.', '2023-08-10', '2024-01-15'), +('INV009', 'Groundwater Quality Assessment', 'Monitoring pesticides in agricultural watersheds.', '2023-09-05', '2024-02-10'), +('INV010', 'Alpine Meadow Phenology', 'Climate change impacts on flowering times in the Alps.', '2023-10-12', '2024-03-20'); + +-- Add some contacts +INSERT INTO "vContact" (first_name, last_name, email, affiliation, investigation_ref) VALUES +('Jane', 'Doe', 'jane.doe@example.org', 'University of Soil Science', 'INV001'), +('John', 'Smith', 'john.smith@example.org', 'Cereal Research Institute', 'INV002'); + +-- Add some studies +INSERT INTO "vStudy" (identifier, title, description_text, investigation_ref) VALUES +('STU001', 'Bacterial Profiling', '16S rRNA sequencing of soil samples.', 'INV001'), +('STU002', 'Drought Stress Assay', 'Greenhouse experiment with controlled watering.', 'INV002'); + +-- Add some assays +INSERT INTO "vAssay" (identifier, measurement_type_term, technology_type_term, investigation_ref, study_ref) VALUES +('ASS001', 'DNA Sequencing', 'Illumina MiSeq', 'INV001', '["STU001"]'), +('ASS002', 'Phenotyping', 'Infrared Imaging', 'INV002', '["STU002"]'); diff --git a/dev_environment/demo_api_main.py b/dev_environment/demo_api_main.py new file mode 100644 index 00000000..fe5d02c3 --- /dev/null +++ b/dev_environment/demo_api_main.py @@ -0,0 +1,143 @@ +""" +Demo API Mock for the FAIRagro SQL-to-ARC converter. + +This module provides a lightweight FastAPI server that simulates the Middleware API. +It receives ARC RO-Crate payloads, deserializes them using the arctrl library, +and writes the resulting ARC directory structure to the local file system. +""" + +import json +import os +import traceback +from datetime import UTC, datetime +from pathlib import Path + +from arctrl import ARC +from arctrl.py.fable_modules.fable_library.async_ import start_as_task +from fastapi import FastAPI, Request # type: ignore # pylint: disable=import-error + +app = FastAPI() + + +def _get_target_owner() -> tuple[int, int] | None: + uid_value = os.environ.get("LOCAL_UID") + gid_value = os.environ.get("LOCAL_GID") + if uid_value is None or gid_value is None: + return None + + try: + return int(uid_value), int(gid_value) + except ValueError: + print(f"Invalid LOCAL_UID/LOCAL_GID: {uid_value}/{gid_value}") + return None + + +def _chown_tree(path: Path) -> None: + owner = _get_target_owner() + if owner is None or not path.exists(): + return + + uid, gid = owner + + def apply_ownership(target: Path) -> None: + os.chown(target, uid, gid) + + apply_ownership(path) + if path.is_dir(): + for root, dirs, files in os.walk(path): + root_path = Path(root) + apply_ownership(root_path) + for name in dirs: + apply_ownership(root_path / name) + for name in files: + apply_ownership(root_path / name) + + +def _handle_error(arc_dir: Path, rdi: str, arc_id: str, exc: Exception) -> None: + """ + Log and store error information when ARC processing fails. + + Args: + arc_dir: The directory where the ARC was supposed to be saved. + rdi: The RDI identifier. + arc_id: The ARC identifier. + exc: The exception that occurred. + """ + tb = traceback.format_exc() + print(f"Error writing ARC for {rdi}/{arc_id}: {exc}") + arc_dir.mkdir(parents=True, exist_ok=True) + error_path = arc_dir / "error.txt" + with open(error_path, "w", encoding="utf-8") as handle: + handle.write(str(exc)) + handle.write("\n\n") + handle.write(tb) + _chown_tree(arc_dir) + + +@app.post("/v3/arcs") +async def upload_arc(request: Request): + """ + Handle the submission of an ARC RO-Crate. + + This endpoint receives the RO-Crate JSON-LD payload, validates it, + and uses the arctrl library to reconstruct the ARC directory structure. + The resulting files are saved to the local 'demo_output' volume. + + Args: + rdi: The identifier of the Research Data Infrastructure. + payload: The request body containing the 'arc' (RO-Crate JSON). + + Returns: + A dictionary matching the ArcResult schema expected by the ApiClient. + """ + rdi = request.query_params.get("rdi") + data = await request.json() + arc_payload = data.get("arc", data) + + if rdi is None: + rdi = data.get("rdi", "unknown") + + output_path = Path(f"/data/arcs/{rdi}") + output_path.mkdir(parents=True, exist_ok=True) + arc_id = arc_payload.get("identifier", f"arc_{os.urandom(4).hex()}") + now = datetime.now(UTC).isoformat() + arc_dir = output_path / arc_id + payload_path = arc_dir.with_suffix(".payload.json") + + with open(payload_path, "w", encoding="utf-8") as handle: + json.dump(arc_payload, handle, indent=2) + _chown_tree(payload_path) + + try: + arc_json = json.dumps(arc_payload) + arc = ARC.from_rocrate_json_string(arc_json) + await start_as_task(arc.WriteAsync(str(arc_dir))) + _chown_tree(arc_dir) + print(f"Saved ARC structure for {rdi} as {arc_id} using arctrl") + except (json.JSONDecodeError, OSError, RuntimeError) as exc: + _handle_error(arc_dir, rdi, arc_id, exc) + except Exception as exc: # noqa: BLE001 + _handle_error(arc_dir, rdi, arc_id, exc) + + return { + "arc_id": arc_id, + "status": "created", + "metadata": { + "rdi": rdi, + "arc_hash": "demo_hash", + "status": "ACTIVE", + "first_seen": now, + "last_seen": now, + }, + } + + +@app.get("/live") +def live(): + """ + Liveness probe for the demo API. + + Returns: + dict: A simple status indicator. + """ + return {"status": "ok"} diff --git a/dev_environment/FAIRagro.sql b/dev_environment/edaphobase.sql similarity index 100% rename from dev_environment/FAIRagro.sql rename to dev_environment/edaphobase.sql diff --git a/dev_environment/start-demo.sh b/dev_environment/start-demo.sh new file mode 100755 index 00000000..a608e588 --- /dev/null +++ b/dev_environment/start-demo.sh @@ -0,0 +1,49 @@ +#!/usr/bin/env bash +# +# Start a full local demo environment (SQL DB + Converter + Mock API) +# This setup DOES NOT require mTLS or secret decryption. +# +# Usage: +# ./start-demo.sh # Start services +# ./start-demo.sh --build # Build images and start +# + +set -euo pipefail + +script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +cd "$script_dir" + +# Parse arguments +BUILD_FLAG="" +if [[ "${1:-}" == "--build" ]]; then + BUILD_FLAG="--build" +fi + +echo "==> Starting FULL LOCAL DEMO..." +echo " - PostgreSQL will be started" +echo " - Database will be initialized with LOCAL demo.sql (10 records)" +echo " - Local Middleware API Mock will be started (no mTLS required)" +echo " - SQL-to-ARC will connect to local mock" +echo " - ARCs will be deserialized and saved as ISA folders" +echo "" + +# Export environment variables for compose.dev.yaml (usually from secrets, but hardcoded here for demo) +export POSTGRES_USER=postgres +export POSTGRES_PASSWORD=postgres +export CONNECTION_STRING="postgresql+psycopg://postgres:postgres@postgres:5432/rdi" +export CLIENT_KEY="not-needed-for-demo" +export LOCAL_UID="$(id -u)" +export LOCAL_GID="$(id -g)" + +# Ensure output directory exists for the volume mount +mkdir -p "${script_dir}/demo_output" + +# Start services using the base compose file + the demo override +# This now overrides the db-init service to use demo.sql without downloads +docker compose -f compose.dev.yaml -f compose.demo.yaml up $BUILD_FLAG + +echo "" +echo "==> Demo finished! You can find the generated ARCs in: dev_environment/demo_output/" +echo " (Wait a moment for files to appear if they are being processed...)" +echo " - View logs: docker compose -f compose.dev.yaml -f compose.demo.yaml logs" +echo " - Clean up: docker compose -f compose.dev.yaml -f compose.demo.yaml down -v" diff --git a/dev_environment/start.sh b/dev_environment/start-dev.sh similarity index 84% rename from dev_environment/start.sh rename to dev_environment/start-dev.sh index 28248799..92995949 100755 --- a/dev_environment/start.sh +++ b/dev_environment/start-dev.sh @@ -21,7 +21,7 @@ fi echo "==> Starting SQL-to-ARC with EXTERNAL API..." echo " - Local PostgreSQL will be started" echo " - Database will be initialized with Edaphobase dump" -echo " - SQL-to-ARC will connect to the API configured in config.yaml" +echo " - SQL-to-ARC will connect to the API configured in config.dev.yaml" echo " - Using client certificates: client.crt, secrets.enc.yaml" echo "" @@ -33,9 +33,9 @@ fi # Use sops exec-env to pass the decrypted secrets as environment variables # without writing them to physical disk files. sops exec-env "${script_dir}/secrets.enc.yaml" \ - "docker compose -f compose.yaml up $BUILD_FLAG" + "docker compose -f compose.dev.yaml up $BUILD_FLAG" echo "" echo "==> Services finished!" -echo " - View logs: docker compose -f compose.yaml logs" -echo " - Clean up: docker compose -f compose.yaml down" +echo " - View logs: docker compose -f compose.dev.yaml logs" +echo " - Clean up: docker compose -f compose.dev.yaml down" diff --git a/dev_environment/stop.sh b/dev_environment/stop.sh deleted file mode 100755 index 27157974..00000000 --- a/dev_environment/stop.sh +++ /dev/null @@ -1,24 +0,0 @@ -#!/usr/bin/env bash -# -# Stop all services and optionally clean up volumes -# -# Usage: -# ./stop.sh # Stop services, keep data -# ./stop.sh --clean # Stop services and remove all data -# - -set -euo pipefail - -script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" -cd "$script_dir" - -if [[ "${1:-}" == "--clean" ]]; then - echo "==> Stopping services and removing all data..." - docker compose down -v - echo "==> All services stopped and data removed." -else - echo "==> Stopping services (data will be preserved)..." - docker compose down - echo "==> Services stopped. Data preserved in volumes." - echo " To remove data: ./stop.sh --clean" -fi From 5dfb151afb805aa185cdbb76c1b964810dcdcaba Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg Date: Thu, 2 Apr 2026 13:34:57 +0000 Subject: [PATCH 06/27] Enhance demo API with type hints, error handling, and directory ownership management; update dependencies in pyproject.toml and uv.lock --- dev_environment/demo_api_main.py | 12 ++--- dev_environment/demo_output/.gitkeep | 0 pyproject.toml | 2 + uv.lock | 67 ++++++++++++++++++++++++++++ 4 files changed, 76 insertions(+), 5 deletions(-) create mode 100644 dev_environment/demo_output/.gitkeep diff --git a/dev_environment/demo_api_main.py b/dev_environment/demo_api_main.py index fe5d02c3..a393032d 100644 --- a/dev_environment/demo_api_main.py +++ b/dev_environment/demo_api_main.py @@ -13,8 +13,8 @@ from pathlib import Path from arctrl import ARC -from arctrl.py.fable_modules.fable_library.async_ import start_as_task -from fastapi import FastAPI, Request # type: ignore # pylint: disable=import-error +from arctrl.py.fable_modules.fable_library.async_ import start_as_task # type: ignore[import-untyped] +from fastapi import FastAPI, Request app = FastAPI() @@ -75,7 +75,7 @@ def _handle_error(arc_dir: Path, rdi: str, arc_id: str, exc: Exception) -> None: @app.post("/v3/arcs") -async def upload_arc(request: Request): +async def upload_arc(request: Request) -> dict[str, str | dict[str, str]]: """ Handle the submission of an ARC RO-Crate. @@ -97,8 +97,10 @@ async def upload_arc(request: Request): if rdi is None: rdi = data.get("rdi", "unknown") - output_path = Path(f"/data/arcs/{rdi}") + output_path = Path("/data/arcs") output_path.mkdir(parents=True, exist_ok=True) + _chown_tree(output_path) # Ensure the root output dir belongs to the host user + arc_id = arc_payload.get("identifier", f"arc_{os.urandom(4).hex()}") now = datetime.now(UTC).isoformat() arc_dir = output_path / arc_id @@ -133,7 +135,7 @@ async def upload_arc(request: Request): @app.get("/live") -def live(): +def live() -> dict[str, str]: """ Liveness probe for the demo API. diff --git a/dev_environment/demo_output/.gitkeep b/dev_environment/demo_output/.gitkeep new file mode 100644 index 00000000..e69de29b diff --git a/pyproject.toml b/pyproject.toml index 2f9208fc..e5ce1b62 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -22,6 +22,8 @@ dev = [ "ruff>=0.14.8", "pre-commit>=4.0.0", "respx>=0.20.0", + "fastapi>=0.135.3", + "uvicorn>=0.42.0", ] [tool.uv.sources] diff --git a/uv.lock b/uv.lock index deabd038..98df0dc6 100644 --- a/uv.lock +++ b/uv.lock @@ -37,6 +37,15 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/b0/80/4d1565bc16b53cd603c73dc4bc770e2e6418d957417e05031314760dc28c/aioodbc-0.5.0-py3-none-any.whl", hash = "sha256:bcaf16f007855fa4bf0ce6754b1f72c6c5a3d544188849577ddd55c5dc42985e", size = 19449, upload-time = "2023-10-28T21:37:28.51Z" }, ] +[[package]] +name = "annotated-doc" +version = "0.0.4" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/57/ba/046ceea27344560984e26a590f90bc7f4a75b06701f653222458922b558c/annotated_doc-0.0.4.tar.gz", hash = "sha256:fbcda96e87e9c92ad167c2e53839e57503ecfda18804ea28102353485033faa4", size = 7288, upload-time = "2025-11-10T22:07:42.062Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/1e/d3/26bf1008eb3d2daa8ef4cacc7f3bfdc11818d111f7e2d0201bc6e3b49d45/annotated_doc-0.0.4-py3-none-any.whl", hash = "sha256:571ac1dc6991c450b25a9c2d84a3705e2ae7a53467b5d111c24fa8baabbed320", size = 5303, upload-time = "2025-11-10T22:07:40.673Z" }, +] + [[package]] name = "annotated-types" version = "0.7.0" @@ -253,6 +262,18 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/2a/68/687187c7e26cb24ccbd88e5069f5ef00eba804d36dde11d99aad0838ab45/charset_normalizer-3.4.6-py3-none-any.whl", hash = "sha256:947cf925bc916d90adba35a64c82aace04fa39b46b52d4630ece166655905a69", size = 61455, upload-time = "2026-03-15T18:53:23.833Z" }, ] +[[package]] +name = "click" +version = "8.3.1" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "colorama", marker = "sys_platform == 'win32'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/3d/fa/656b739db8587d7b5dfa22e22ed02566950fbfbcdc20311993483657a5c0/click-8.3.1.tar.gz", hash = "sha256:12ff4785d337a1bb490bb7e9c2b1ee5da3112e94a8622f26a6c77f5d2fc6842a", size = 295065, upload-time = "2025-11-15T20:45:42.706Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/98/78/01c019cdb5d6498122777c1a43056ebb3ebfeef2076d9d026bfe15583b2b/click-8.3.1-py3-none-any.whl", hash = "sha256:981153a64e25f12d547d3426c367a4857371575ee7ad18df2a6183ab0545b2a6", size = 108274, upload-time = "2025-11-15T20:45:41.139Z" }, +] + [[package]] name = "colorama" version = "0.4.6" @@ -426,6 +447,22 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/c1/8b/5fe2cc11fee489817272089c4203e679c63b570a5aaeb18d852ae3cbba6a/et_xmlfile-2.0.0-py3-none-any.whl", hash = "sha256:7a91720bc756843502c3b7504c77b8fe44217c85c537d85037f0f536151b2caa", size = 18059, upload-time = "2024-10-25T17:25:39.051Z" }, ] +[[package]] +name = "fastapi" +version = "0.135.3" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "annotated-doc" }, + { name = "pydantic" }, + { name = "starlette" }, + { name = "typing-extensions" }, + { name = "typing-inspection" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/f7/e6/7adb4c5fa231e82c35b8f5741a9f2d055f520c29af5546fd70d3e8e1cd2e/fastapi-0.135.3.tar.gz", hash = "sha256:bd6d7caf1a2bdd8d676843cdcd2287729572a1ef524fc4d65c17ae002a1be654", size = 396524, upload-time = "2026-04-01T16:23:58.188Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/84/a4/5caa2de7f917a04ada20018eccf60d6cc6145b0199d55ca3711b0fc08312/fastapi-0.135.3-py3-none-any.whl", hash = "sha256:9b0f590c813acd13d0ab43dd8494138eb58e484bfac405db1f3187cfc5810d98", size = 117734, upload-time = "2026-04-01T16:23:59.328Z" }, +] + [[package]] name = "filelock" version = "3.25.2" @@ -686,6 +723,7 @@ dependencies = [ [package.dev-dependencies] dev = [ { name = "bandit" }, + { name = "fastapi" }, { name = "httpx" }, { name = "mypy" }, { name = "pre-commit" }, @@ -697,6 +735,7 @@ dev = [ { name = "pytest-mock" }, { name = "respx" }, { name = "ruff" }, + { name = "uvicorn" }, ] [package.metadata] @@ -705,6 +744,7 @@ requires-dist = [{ name = "sql-to-arc", editable = "middleware/sql_to_arc" }] [package.metadata.requires-dev] dev = [ { name = "bandit", specifier = ">=1.7.0" }, + { name = "fastapi", specifier = ">=0.135.3" }, { name = "httpx", specifier = ">=0.28.1" }, { name = "mypy", specifier = ">=1.19.0" }, { name = "pre-commit", specifier = ">=4.0.0" }, @@ -716,6 +756,7 @@ dev = [ { name = "pytest-mock", specifier = ">=3.10.0" }, { name = "respx", specifier = ">=0.20.0" }, { name = "ruff", specifier = ">=0.14.8" }, + { name = "uvicorn", specifier = ">=0.42.0" }, ] [[package]] @@ -1574,6 +1615,19 @@ asyncio = [ { name = "greenlet" }, ] +[[package]] +name = "starlette" +version = "1.0.0" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "anyio" }, + { name = "typing-extensions", marker = "python_full_version < '3.13'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/81/69/17425771797c36cded50b7fe44e850315d039f28b15901ab44839e70b593/starlette-1.0.0.tar.gz", hash = "sha256:6a4beaf1f81bb472fd19ea9b918b50dc3a77a6f2e190a12954b25e6ed5eea149", size = 2655289, upload-time = "2026-03-22T18:29:46.779Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/0b/c9/584bc9651441b4ba60cc4d557d8a547b5aff901af35bda3a4ee30c819b82/starlette-1.0.0-py3-none-any.whl", hash = "sha256:d3ec55e0bb321692d275455ddfd3df75fff145d009685eb40dc91fc66b03d38b", size = 72651, upload-time = "2026-03-22T18:29:45.111Z" }, +] + [[package]] name = "stevedore" version = "5.7.0" @@ -1631,6 +1685,19 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/39/08/aaaad47bc4e9dc8c725e68f9d04865dbcb2052843ff09c97b08904852d84/urllib3-2.6.3-py3-none-any.whl", hash = "sha256:bf272323e553dfb2e87d9bfd225ca7b0f467b919d7bbd355436d3fd37cb0acd4", size = 131584, upload-time = "2026-01-07T16:24:42.685Z" }, ] +[[package]] +name = "uvicorn" +version = "0.42.0" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "click" }, + { name = "h11" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/e3/ad/4a96c425be6fb67e0621e62d86c402b4a17ab2be7f7c055d9bd2f638b9e2/uvicorn-0.42.0.tar.gz", hash = "sha256:9b1f190ce15a2dd22e7758651d9b6d12df09a13d51ba5bf4fc33c383a48e1775", size = 85393, upload-time = "2026-03-16T06:19:50.077Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/0a/89/f8827ccff89c1586027a105e5630ff6139a64da2515e24dafe860bd9ae4d/uvicorn-0.42.0-py3-none-any.whl", hash = "sha256:96c30f5c7abe6f74ae8900a70e92b85ad6613b745d4879eb9b16ccad15645359", size = 68830, upload-time = "2026-03-16T06:19:48.325Z" }, +] + [[package]] name = "virtualenv" version = "21.2.0" From 28bb7b841c6c68828b5f09acb760895514d6e8e9 Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg Date: Thu, 2 Apr 2026 13:49:05 +0000 Subject: [PATCH 07/27] Enhance demo environment by adding health checks for db-init service and updating start-demo.sh for improved container management and cleanup --- dev_environment/compose.demo.yaml | 8 ++++++++ dev_environment/start-demo.sh | 8 ++++++-- 2 files changed, 14 insertions(+), 2 deletions(-) diff --git a/dev_environment/compose.demo.yaml b/dev_environment/compose.demo.yaml index 06a9055a..48d2e323 100644 --- a/dev_environment/compose.demo.yaml +++ b/dev_environment/compose.demo.yaml @@ -22,6 +22,12 @@ services: echo "Importing local demo.sql..." PGDATABASE=rdi psql < /tmp/demo.sql echo "Demo database initialization complete." + exec sleep infinity + healthcheck: + test: ["CMD-SHELL", "psql -d rdi -c 'SELECT 1' -U postgres"] + interval: 5s + timeout: 5s + retries: 5 middleware-api: image: python:3.12-slim @@ -52,6 +58,8 @@ services: sql_to_arc: depends_on: + db-init: + condition: service_healthy middleware-api: condition: service_healthy volumes: diff --git a/dev_environment/start-demo.sh b/dev_environment/start-demo.sh index a608e588..c10c1160 100755 --- a/dev_environment/start-demo.sh +++ b/dev_environment/start-demo.sh @@ -40,10 +40,14 @@ mkdir -p "${script_dir}/demo_output" # Start services using the base compose file + the demo override # This now overrides the db-init service to use demo.sql without downloads -docker compose -f compose.dev.yaml -f compose.demo.yaml up $BUILD_FLAG +docker compose -f compose.dev.yaml -f compose.demo.yaml up $BUILD_FLAG --abort-on-container-exit --exit-code-from sql_to_arc echo "" -echo "==> Demo finished! You can find the generated ARCs in: dev_environment/demo_output/" +echo "==> Demo finished! Cleaning up containers..." +docker compose -f compose.dev.yaml -f compose.demo.yaml down -v + +echo "" +echo "==> You can find the generated ARCs in: dev_environment/demo_output/" echo " (Wait a moment for files to appear if they are being processed...)" echo " - View logs: docker compose -f compose.dev.yaml -f compose.demo.yaml logs" echo " - Clean up: docker compose -f compose.dev.yaml -f compose.demo.yaml down -v" From 542e249e8e728cae6e95a52b6e5f0d1392ede583 Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg Date: Thu, 2 Apr 2026 14:33:19 +0000 Subject: [PATCH 08/27] Enhance README and configuration files with quick start instructions, example configurations, and improved environment setup for better usability --- README.md | 11 +++ middleware/sql_to_arc/README.md | 24 ++--- middleware/sql_to_arc/config.example.yaml | 113 +++++++++++++++------- scripts/load-env.sh | 10 +- 4 files changed, 107 insertions(+), 51 deletions(-) diff --git a/README.md b/README.md index d159fb18..ea46541e 100644 --- a/README.md +++ b/README.md @@ -12,6 +12,17 @@ This repository contains the **SQL-to-ARC Converter**, a core component of the F | [scripts/](scripts/) | Tooling for quality checks, environment setup, and Git LFS. | | [docker/](docker/) | Dockerfiles and container structure tests. | +## 🌟 Quick Start (Full Local Demo) + +For the best **out-of-the-box experience**, you can run a complete local demonstration. This setup starts a PostgreSQL database with demo data, a local Mock Middleware API, and the SQL-to-ARC converter to process and save results locally: + +```bash +# Start the full demo stack (requires Docker) +./dev_environment/start-demo.sh --build +``` + +> **Note:** This demo does not require any secrets or mTLS keys. Generated ARCs will be saved to `dev_environment/demo_output/`. + ## 🚀 Getting Started (Development) The preferred method for working with this repository is using the **Dev Container** (VS Code). diff --git a/middleware/sql_to_arc/README.md b/middleware/sql_to_arc/README.md index e58ea85f..29209387 100644 --- a/middleware/sql_to_arc/README.md +++ b/middleware/sql_to_arc/README.md @@ -102,6 +102,8 @@ In containerized environments, sensitive values like the `connection_string` or ## Usage +The following examples assume you are in the root of the repository. + ### 1. From Source (Development) Requires [uv](https://github.com/astral-sh/uv) installed. @@ -110,8 +112,8 @@ Requires [uv](https://github.com/astral-sh/uv) installed. # Install dependencies for all workspace members uv sync --all-packages -# Run the converter with a specific config file -uv run python -m middleware.sql_to_arc.main -c my_config.yaml +# Run the converter using the example config directly +uv run python -m middleware.sql_to_arc.main -c middleware/sql_to_arc/config.example.yaml ``` ### 2. Local Docker Image @@ -119,17 +121,13 @@ uv run python -m middleware.sql_to_arc.main -c my_config.yaml Build the image from the repository root: ```bash +# Build the converter image docker build -f docker/Dockerfile.sql_to_arc -t sql-to-arc:local . -``` - -Run with environment variables: -```bash +# Run using the example config via a volume mount docker run --rm \ - -e SQL_TO_ARC_CONNECTION_STRING="postgresql://..." \ - -e SQL_TO_ARC_RDI="my-rdi" \ - -v $(pwd)/certs:/certs \ - -e SQL_TO_ARC_API_CLIENT__CLIENT_CERT_PATH="/certs/client.crt" \ + --env-file .env \ + -v $(pwd)/middleware/sql_to_arc/config.example.yaml:/etc/sql_to_arc/config.yaml:ro \ sql-to-arc:local ``` @@ -138,12 +136,10 @@ docker run --rm \ Pull the latest official image from Docker Hub (once available): ```bash -docker pull fairagro/sql-to-arc:latest - docker run --rm \ --env-file .env \ - -v $(pwd)/config.yaml:/etc/sql_to_arc/config.yaml:ro \ - fairagro/sql-to-arc:latest -c /etc/sql_to_arc/config.yaml + -v $(pwd)/middleware/sql_to_arc/config.example.yaml:/etc/sql_to_arc/config.yaml:ro \ + zalf/fairagro-advanced-middleware-sql_to_arc/sql-to-arc:latest ``` --- diff --git a/middleware/sql_to_arc/config.example.yaml b/middleware/sql_to_arc/config.example.yaml index 58301553..917d7211 100644 --- a/middleware/sql_to_arc/config.example.yaml +++ b/middleware/sql_to_arc/config.example.yaml @@ -1,49 +1,96 @@ # SQL to ARC Converter Configuration Example -# Copy this file to config.yaml and adjust the values +# This file contains ALL available configuration options with their default or example values. -# Database Connection Settings -# --------------------------- -# Name of the source database (e.g., PostgreSQL database name) -db_name: "edaphobase" +# ------------------------------------------------------------------------------ +# 1. CORE SETTINGS +# ------------------------------------------------------------------------------ -# Database user with read access -db_user: "reader" +# Full SQLAlchemy connection string for the source database. +# Supported: postgresql+psycopg +connection_string: "postgresql+psycopg://postgres:postgres@localhost:5432/rdi" -# Database password (will be handled securely) -db_password: "secure_password_here" +# Unique identifier for the Research Data Infrastructure (RDI). +rdi: "edaphobase" -# Database host address (hostname or IP) -db_host: "localhost" +# Public URL of the RDI portal (used for provenance metadata). +rdi_url: "https://edaphobase.org" -# Database port (default: 5432 for PostgreSQL) -db_port: 5432 +# Console logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL). +log_level: "INFO" -# RDI Identifier -# ------------- -# Unique identifier for the Research Data Infrastructure (RDI) -# This is used to tag or namespace the converted ARCs -rdi: "edaphobase" +# ------------------------------------------------------------------------------ +# 2. PROCESSING & PERFORMANCE +# ------------------------------------------------------------------------------ + +# Number of parallel worker processes for ARC generation (CPU-bound). +# Recommended: Number of CPU cores available. +max_concurrent_arc_builds: 4 + +# Maximum concurrent tasks (IO + CPU). Defaults to 4 * max_concurrent_arc_builds. +max_concurrent_tasks: ~ + +# Number of investigations to fetch from the database per batch. +db_batch_size: 100 + +# Safety limit: Maximum number of studies per investigation. +max_studies: 5000 + +# Safety limit: Maximum number of assays per investigation. +max_assays: 10000 + +# Timeout in minutes for generating a single ARC. +arc_generation_timeout_minutes: 30 + +# (Optional) Limit processing to the first N investigations for debugging. +debug_limit: ~ -# API Client Configuration -# ----------------------- -# Settings for connecting to the Middleware API to upload ARCs -api_client: - # Base URL of the Middleware API - api_url: "http://localhost:8000" - # Path to the client certificate file (PEM format) for mTLS authentication - client_cert_path: "/path/to/client.crt" +# ------------------------------------------------------------------------------ +# 3. MIDDLEWARE API CLIENT (mTLS) +# ------------------------------------------------------------------------------ - # Path to the client private key file (PEM format) - client_key_path: "/path/to/client.key" +api_client: + # Base URL of the FAIRagro Middleware API. + api_url: "http://localhost:8000" - # Path to the CA certificate file (optional, for self-signed server certs) - # ca_cert_path: "/path/to/ca.crt" + # Mutual TLS (mTLS) Credentials + client_cert_path: "dev_environment/client.crt" + client_key_path: "dev_environment/client.key" + + # (Optional) Path to a custom CA certificate to verify the API server. + ca_cert_path: ~ - # Request timeout in seconds (default: 30.0) + # Request timeout in seconds. timeout: 30.0 - # Verify SSL certificates (default: true) - # Set to false only for testing with self-signed certs without CA + # Whether to verify the API server's SSL certificate. verify_ssl: true + + # Whether to follow HTTP redirects. + follow_redirects: true + + # Maximum concurrent HTTP requests to the API. + max_concurrency: 10 + + # Maximum retries for transient HTTP errors (5xx, timeouts). + max_retries: 3 + + # Exponential backoff factor for retries. + retry_backoff_factor: 2.0 + + +# ------------------------------------------------------------------------------ +# 4. OPENTELEMETRY TRACING +# ------------------------------------------------------------------------------ + +otel: + # OTel collector endpoint (e.g., http://localhost:4318). + # If null (~), tracing is disabled or uses default env vars. + endpoint: ~ + + # Whether to print OpenTelemetry spans to the console in a readable format. + log_console_spans: false + + # Logging level for OTLP log export. + log_level: "INFO" diff --git a/scripts/load-env.sh b/scripts/load-env.sh index 829823e8..ae0bf9ce 100755 --- a/scripts/load-env.sh +++ b/scripts/load-env.sh @@ -126,13 +126,15 @@ fi # Decrypt the encrypted file and write to .env if grep -q '"sops"' "$ENCRYPTED_FILE" 2>/dev/null; then # Decrypt encrypted file and write to .env - sops -d "$ENCRYPTED_FILE" > "$DECRYPTED_FILE" 2>/dev/null + # We remove the CLIENT_KEY for the .env file because it breaks Docker's --env-file parser. + # We use a perl regex to find the CLIENT_KEY="..." multiline block and delete it entirely. + sops -d "$ENCRYPTED_FILE" 2>/dev/null | perl -0777 -pe 's/CLIENT_KEY=".*?"\n?//gs' > "$DECRYPTED_FILE" if [ $? -eq 0 ]; then - echo "✅ Encrypted secrets decrypted to $DECRYPTED_FILE" + echo "✅ Encrypted secrets decrypted to $DECRYPTED_FILE (CLIENT_KEY omitted for Docker compatibility)" - # Also load for current shell + # Also load for current shell (the shell CAN handle the full file, so we re-decrypt for memory) set -a - source "$DECRYPTED_FILE" + source <(sops -d "$ENCRYPTED_FILE" 2>/dev/null) set +a else echo "❌ Error decrypting $ENCRYPTED_FILE" From a08081bdbf1734d42738b0ba47f757c6e1d6b4ab Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg Date: Thu, 2 Apr 2026 14:41:14 +0000 Subject: [PATCH 09/27] Fix documentation links in README for correct relative paths --- middleware/sql_to_arc/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/middleware/sql_to_arc/README.md b/middleware/sql_to_arc/README.md index 29209387..5c6f5045 100644 --- a/middleware/sql_to_arc/README.md +++ b/middleware/sql_to_arc/README.md @@ -156,7 +156,7 @@ docker run --rm \ ## Documentation Links -- [Architectural Design](docs/ARCHITECTURAL_DESIGN.md) -- [Database View Specification](docs/sql_to_arc_database_views.md) +- [Architectural Design](../docs/ARCHITECTURAL_DESIGN.md) +- [Database View Specification](../docs/sql_to_arc_database_views.md) - [ARCtrl Documentation](https://nfdi4plants.org/ARCtrl/) - [Middleware API Client](https://github.com/fairagro/m4.2_advanced_middleware_api/tree/main/middleware/api_client) From a53bafedf82a1b1ffff78c6df02ab24f65e6a9e9 Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg Date: Thu, 2 Apr 2026 14:42:57 +0000 Subject: [PATCH 10/27] Fix relative paths in README documentation links for consistency --- middleware/sql_to_arc/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/middleware/sql_to_arc/README.md b/middleware/sql_to_arc/README.md index 5c6f5045..33211e42 100644 --- a/middleware/sql_to_arc/README.md +++ b/middleware/sql_to_arc/README.md @@ -4,7 +4,7 @@ The **SQL-to-ARC Converter** is a high-performance middleware component designed ## Overview -SQL-to-ARC runs locally at the RDI provider's infrastructure. It connects to the provider's SQL database, extracts metadata from a [pre-defined set of database views](docs/sql_to_arc_database_views.md), and transforms this data into **Annotated Research Context (ARC)** objects using the [ARCtrl Library](https://github.com/nfdi4plants/ARCtrl). +SQL-to-ARC runs locally at the RDI provider's infrastructure. It connects to the provider's SQL database, extracts metadata from a [pre-defined set of database views](../../docs/sql_to_arc_database_views.md), and transforms this data into **Annotated Research Context (ARC)** objects using the [ARCtrl Library](https://github.com/nfdi4plants/ARCtrl). Once an ARC is constructed and validated, the tool uses the [middleware api_client library](https://github.com/fairagro/m4.2_advanced_middleware_api/tree/main/middleware/api_client) to transmit the resulting RO-Crate JSON-LD payloads to the [FAIRagro Middleware API](https://github.com/fairagro/m4.2_advanced_middleware_api). @@ -156,7 +156,7 @@ docker run --rm \ ## Documentation Links -- [Architectural Design](../docs/ARCHITECTURAL_DESIGN.md) -- [Database View Specification](../docs/sql_to_arc_database_views.md) +- [Architectural Design](../../docs/ARCHITECTURAL_DESIGN.md) +- [Database View Specification](../../docs/sql_to_arc_database_views.md) - [ARCtrl Documentation](https://nfdi4plants.org/ARCtrl/) - [Middleware API Client](https://github.com/fairagro/m4.2_advanced_middleware_api/tree/main/middleware/api_client) From 2999cb992e1f508912f97c8a8885c18d484312ba Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg Date: Thu, 2 Apr 2026 17:16:28 +0000 Subject: [PATCH 11/27] Enhance database and processor modules with improved comments and error handling; clarify function docstrings for better understanding and maintainability. --- .../src/middleware/sql_to_arc/database.py | 6 +- .../src/middleware/sql_to_arc/processor.py | 72 ++++++++++++++++--- 2 files changed, 69 insertions(+), 9 deletions(-) diff --git a/middleware/sql_to_arc/src/middleware/sql_to_arc/database.py b/middleware/sql_to_arc/src/middleware/sql_to_arc/database.py index e069118b..a40ee871 100644 --- a/middleware/sql_to_arc/src/middleware/sql_to_arc/database.py +++ b/middleware/sql_to_arc/src/middleware/sql_to_arc/database.py @@ -200,7 +200,7 @@ async def stream_investigations( try: async with self.engine.connect() as conn: # Use literal_column("*") to ensure SQLAlchemy generates 'SELECT *' - # instead of '"vInvestigation"."*"' + # instead of '"vInvestigation"."*"' which can cause issues with some dialects stmt: sqlalchemy.Select[Any] = ( select(sqlalchemy.literal_column("*")) .select_from(table(view_name)) @@ -209,11 +209,13 @@ async def stream_investigations( if limit: stmt = stmt.limit(limit) + # Execute stream to use server-side cursor (prevents loading all rows into RAM) result = await conn.stream(stmt) async for row in result.mappings(): # Count everything we find in the database stats.found_datasets += 1 + # Map raw DB row to Pydantic model with validation investigation = self._validate_and_map(row, InvestigationRow, "investigation") if investigation is None: # If validation fails, it's a found but failed dataset @@ -221,8 +223,10 @@ async def stream_investigations( stats.failed_ids.append(row.get("identifier", "unknown")) continue + # Yield validated model to the async loop in processor.py yield investigation except ProgrammingError as e: + # Handle missing view gracefully (e.g. during initial setup or empty DBs) if f'relation "{view_name.lower()}" does not exist' in str(e).lower(): logger.warning('Table or view "%s" does not exist. Treating as empty.', view_name) else: diff --git a/middleware/sql_to_arc/src/middleware/sql_to_arc/processor.py b/middleware/sql_to_arc/src/middleware/sql_to_arc/processor.py index 46bb859f..90d4f29e 100644 --- a/middleware/sql_to_arc/src/middleware/sql_to_arc/processor.py +++ b/middleware/sql_to_arc/src/middleware/sql_to_arc/processor.py @@ -68,7 +68,22 @@ async def _build_and_upload_single_arc( inv_info: str, semaphore: asyncio.Semaphore, ) -> None: - """Build a single ARC and upload it.""" + """ + Orchestrate the creation and transmission of a single ARC. + + This function handles the lifecycle of one research dataset: + 1. Acquires a semaphore slot (concurrency control). + 2. Gathers pre-fetched relational data into a bundle. + 3. Offloads the CPU-intensive ARC generation to a Process Pool. + 4. Uploads the resulting JSON to the Middleware API. + + Args: + ctx: Context containing executors and pre-fetched data. + investigation: The metadata row for the investigation. + stats: Global stats object to update on success/failure. + inv_info: Logging prefix for traceability. + semaphore: Semaphore to limit concurrent processing. + """ inv_id = str(investigation.identifier) # Acquire semaphore to limit concurrency async with semaphore: @@ -106,8 +121,7 @@ async def _build_and_upload_single_arc( stats.failed_ids.append(inv_id) return - json_size_kb = len(arc_json.encode("utf-8")) / 1024 - logger.info("%s: ARC JSON created: size=%.2fKB", inv_info, json_size_kb) + logger.info("%s: ARC JSON created: size=%.2fKB", inv_info, len(arc_json.encode("utf-8")) / 1024) # Upload single ARC await _upload_and_update_stats(ctx, arc_json, inv_id, stats, inv_info) @@ -154,10 +168,18 @@ async def _fetch_and_group_related_data(db: Database, investigation_ids: list[st async def group_stream( gen: AsyncGenerator[Any, None], ) -> tuple[dict[str, list[Any]], int]: + """ + Help to consume an async generator and group items by their investigation reference. + + This organizes relational data (like Studies or Assays) into a lookup table where + the key is the 'investigation_ref'. This is essential for the 1:N mapping during + ARC construction. + """ m = defaultdict(list) count = 0 async for r in gen: - # All models and the annotation dict have investigation_ref + # All models and the annotation dict have investigation_ref. + # We handle both Pydantic models (obj.field) and raw dicts (obj[field]). inv_ref = r["investigation_ref"] if isinstance(r, dict) else r.investigation_ref m[str(inv_ref)].append(r) count += 1 @@ -198,7 +220,16 @@ def _spawn_investigation_task( res: WorkerResources, running_tasks: set[asyncio.Task[None]], ) -> None: - """Create worker context and spawn a processing task.""" + """ + Create worker context and spawn a processing task. + + Args: + investigation: The data row from DB representing the research dataset. + idx: The global sequence number of this investigation (used for logging context). + batch_data: The pre-fetched relational data (studies, assays, etc.) for the current batch. + res: Shared resources (API client, config, stats, executor, semaphore). + running_tasks: The set of currently active asyncio tasks for backpressure tracking. + """ ctx = WorkerContext( client=res.client, rdi=res.config.rdi, @@ -213,9 +244,17 @@ def _spawn_investigation_task( arc_generation_timeout_minutes=res.config.arc_generation_timeout_minutes, ) + # Logging Context: 'inv_info' provides a human-readable prefix (e.g. "Investigation 42") + # that is passed down to sub-functions to keep log entries traceable back to their origin + # without repeating the logic of generating this string everywhere. inv_info = f"Investigation {idx}" + # MUTABLE ARGUMENTS: Note that 'res.stats' and 'running_tasks' are passed by reference. + # The 'process_investigation' task will update 'res.stats' directly on failure + # and 'running_tasks' will automatically discard this task via its done_callback. task = asyncio.create_task(process_investigation(ctx, investigation, res.stats, inv_info, res.semaphore)) running_tasks.add(task) + # Self-cleanup: When the task is finished (success or failure), it removes itself + # from the 'running_tasks' set to free up slots for the next batch. task.add_done_callback(running_tasks.discard) @@ -226,6 +265,8 @@ async def process_investigations( ) -> ProcessingStats: """Fetch investigations from DB and process them concurrently with flow control.""" stats = ProcessingStats() + # 1. Flow Control: Use a semaphore to limit the number of active tasks. + # This prevents the application from reading too much data into RAM at once. semaphore = asyncio.Semaphore(config.max_concurrent_tasks) logger.info( @@ -235,6 +276,8 @@ async def process_investigations( config.db_batch_size, ) + # 2. Parallelization: Setup a Process Pool to handle CPU-intensive ARC generation. + # We use "spawn" as the start method for better isolation and cross-platform compatibility. with ( concurrent.futures.ProcessPoolExecutor( max_workers=config.max_concurrent_arc_builds, @@ -244,15 +287,21 @@ async def process_investigations( ): running_tasks: set[asyncio.Task[None]] = set() inv_idx = 0 + # Initialize the streaming generator for investigations investigation_gen = db.stream_investigations(stats=stats, limit=config.debug_limit) while True: + # 3. Memory-Efficient Fetching: Get a chunk of investigations from the generator. + # We fetch in batches to balance between DB roundtrips and RAM usage. + # 'anext' is used to manually advance the async generator. batch = [] try: for _ in range(config.db_batch_size): try: + # Fetch next investigation from the server-side cursor batch.append(await anext(investigation_gen)) except StopAsyncIteration: + # End of stream reached, triggered by 'anext' break except (RuntimeError, OSError, ConnectionError) as e: logger.error("Database or connection error while fetching investigations: %s", e, exc_info=True) @@ -264,15 +313,18 @@ async def process_investigations( if not batch: break + # 4. Backpressure: If we reached the task limit, wait for at least one to finish. + # This is crucial for throttling the 'Producer' (the DB generator). if len(running_tasks) >= config.max_concurrent_tasks: await asyncio.wait(running_tasks, return_when=asyncio.FIRST_COMPLETED) - # 3. Relational Batching: Fetch all related data for this batch at once + # 5. Relational Batching: Fetch ALL related data (Studies, Assays, etc.) + # for the entire batch in ONE go to avoid N+1 query performance hits. batch_data = await _fetch_and_group_related_data(db, [str(inv.identifier) for inv in batch]) stats.total_studies += batch_data.study_count stats.total_assays += batch_data.assay_count - # 4. Prepare resources for spawning tasks + # 6. Resource Injection: Bundle dependencies to pass them to worker tasks. res = WorkerResources( client=client, config=config, @@ -281,7 +333,10 @@ async def process_investigations( semaphore=semaphore, ) - # 5. Spawn tasks for each investigation in the batch + # 7. Task Execution: Spawn an asynchronous task for each investigation. + # Each task will later use the Semaphore to enter the Process Pool. + # IN-PLACE UPDATES: These tasks will directly update 'stats' and + # manage their own lifecycle within 'running_tasks'. for investigation in batch: inv_idx += 1 _spawn_investigation_task( @@ -292,6 +347,7 @@ async def process_investigations( running_tasks, ) + # 8. Cleanup: Wait for all remaining background tasks (uploads, builds) to finish. if running_tasks: logger.info("Waiting for %d remaining tasks to complete...", len(running_tasks)) await asyncio.gather(*running_tasks) From 72abfa17f8dc645a8d28f96b783f93158f9619a5 Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg Date: Wed, 8 Apr 2026 12:45:11 +0000 Subject: [PATCH 12/27] Enhance development environment by adding new extensions for improved functionality and updating architecture rules documentation for clarity and structure --- .devcontainer/antigravity/devcontainer.json | 4 +- .devcontainer/vscode/devcontainer.json | 4 +- .vscode/extensions.json | 4 +- AGENTS.md | 17 ++- docs/ARCHITECTURE_RULES.md | 135 ++++++++++++++++++++ docs/workspace.dsl | 89 +++++++++++++ 6 files changed, 249 insertions(+), 4 deletions(-) create mode 100644 docs/ARCHITECTURE_RULES.md create mode 100644 docs/workspace.dsl diff --git a/.devcontainer/antigravity/devcontainer.json b/.devcontainer/antigravity/devcontainer.json index ce1bf3e4..23ebb372 100644 --- a/.devcontainer/antigravity/devcontainer.json +++ b/.devcontainer/antigravity/devcontainer.json @@ -101,7 +101,9 @@ "github.copilot-chat", "charliermarsh.ruff", "tim-koehler.helm-intellisense", - "vadzimnestsiarenka.helm-template-preview-and-more" + "vadzimnestsiarenka.helm-template-preview-and-more", + "jebbs.plantuml", + "systemticks.c4-dsl-extension" ] } }, diff --git a/.devcontainer/vscode/devcontainer.json b/.devcontainer/vscode/devcontainer.json index 61b28b7a..4a00e674 100644 --- a/.devcontainer/vscode/devcontainer.json +++ b/.devcontainer/vscode/devcontainer.json @@ -107,7 +107,9 @@ "github.copilot-chat", "charliermarsh.ruff", "tim-koehler.helm-intellisense", - "vadzimnestsiarenka.helm-template-preview-and-more" + "vadzimnestsiarenka.helm-template-preview-and-more", + "jebbs.plantuml", + "systemticks.c4-dsl-extension" ] } }, diff --git a/.vscode/extensions.json b/.vscode/extensions.json index 93272e30..b845a5ba 100644 --- a/.vscode/extensions.json +++ b/.vscode/extensions.json @@ -21,6 +21,8 @@ "charliermarsh.ruff", "tim-koehler.helm-intellisense", "vadzimnestsiarenka.helm-template-preview-and-more", - "ms-vscode-remote.remote-containers" + "ms-vscode-remote.remote-containers", + "jebbs.plantuml", + "systemticks.c4-dsl-extension" ], } diff --git a/AGENTS.md b/AGENTS.md index 70f93eea..f0d928c4 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -76,7 +76,22 @@ docker compose logs -f docker compose down ``` -## 📝 Key Implementation Details +## ïżœ Architecture Rules + +Before generating or modifying code, read **[docs/ARCHITECTURE_RULES.md](docs/ARCHITECTURE_RULES.md)**. + +It defines binding constraints that MUST be followed: + +- **Module Dependency Graph**: Which module may import from which (no circular imports). +- **Extension Points**: How to add new DB entities, mapper functions, or config values. +- **Concurrency Rules**: IPC contract for worker processes, Semaphore scope. +- **Error Handling**: Per-investigation failure isolation, stats update pattern. +- **Config**: NEVER use `os.environ` directly — always extend `Config` in `config.py`. +- **Database Access**: All SQL goes through `Database`; always use server-side cursors and bulk fetches. + +--- + +## ïżœđŸ“ Key Implementation Details ### External Dependencies diff --git a/docs/ARCHITECTURE_RULES.md b/docs/ARCHITECTURE_RULES.md new file mode 100644 index 00000000..6fb8025f --- /dev/null +++ b/docs/ARCHITECTURE_RULES.md @@ -0,0 +1,135 @@ +# Architecture Rules: SQL-to-ARC Middleware + +This document defines **binding rules** for the `middleware/sql_to_arc` package. +These constraints exist to preserve correctness, prevent circular imports, and enforce design patterns. +An AI assistant or developer modifying this codebase MUST follow these rules. + +--- + +## 1. Module Dependency Graph + +`sql_to_arc` is a single, self-contained component. There is no complex inter-package dependency policy to enforce within it. + +The one cross-package rule is: + +> `middleware.shared` and `middleware.api_client` are **read-only dependencies** of `sql_to_arc`. +> They must NEVER import from `middleware.sql_to_arc`. This is naturally enforced since both packages live in a separate repository. + +If intra-package layering rules become necessary in the future (e.g., forbidden imports between specific modules), document them here. + +--- + +## 2. Extension Points + +### 2.1 Adding a New Database Entity + +When adding a new entity type (e.g. `SampleRow`), ALL of the following steps are mandatory: + +1. **Define the model** in `models.py` by subclassing `BaseRow`: + + ```python + class SampleRow(BaseRow): + __view_name__: ClassVar[str] = "vSample" + identifier: str = spec_field(required=True) + ... + ``` + + - Use `spec_field()` (not `Field()` directly) for all ARC-spec-relevant fields. + - Set `__view_name__` to the exact database view name. + +2. **Add a streaming method** to `Database` in `database.py`: + + ```python + async def stream_samples(self, investigation_ids: list[str]) -> AsyncGenerator[SampleRow, None]: + async for r in self._stream_by_investigation(SampleRow, investigation_ids, "sample"): + yield r + ``` + +3. **Register for schema validation** in `Database.validate_schema()`: + + ```python + models = [..., SampleRow] + ``` + +4. **Add a mapper function** in `mapper.py`. + +5. **Link into the data bundle** `ArcBuildData` in `context.py` and populate it in `_fetch_and_group_related_data()` in `processor.py` using `group_stream()`. + +6. **Call the mapper** inside `builder.py` in `build_single_arc_task()`. + +### 2.2 Adding New Mapper Functions + +- Mapper functions live exclusively in `mapper.py`. +- They accept a single `*Row` Pydantic model as input and return an `arctrl` type. +- They MUST NOT perform I/O, logging, or access the database. + +### 2.3 Adding New Configuration Values + +- All configuration values MUST be added as typed, annotated fields in the `Config` class in `config.py` or in other config classes that are referenced by `Config`. +- MUST use `Annotated[..., Field(description="...")]` with a meaningful description. +- Provide a sensible default whenever possible. +- **NEVER** access `os.environ` directly in any module. The `Config` object is the single source of truth for all settings. +- **NEVER** introduce new environment variables outside of `Config`. + +--- + +## 3. Concurrency & IPC Rules + +### 3.1 Process Pool Entry Point + +- `build_single_arc_task()` in `builder.py` is the **only function** executed inside worker processes. +- It MUST be a plain, top-level function (not a method or lambda) because it is pickled for IPC. +- Its argument MUST be the frozen dataclass `ArcBuildData` (picklable, no locks, no sockets). +- Its return value MUST be a `str` (JSON-LD string) or `None`. Returning complex objects (e.g., `ARC`) is forbidden — they are not reliably picklable across process boundaries and waste IPC bandwidth. + +### 3.2 Memory Management in Workers + +- After serializing the ARC to JSON, `del arc` MUST be called, followed by `gc.collect()`. +- This prevents worker processes from accumulating memory across repeated calls. + +### 3.3 Semaphore Usage + +- The `asyncio.Semaphore` (from `config.max_concurrent_tasks`) limits the **full lifecycle** of each investigation: data bundling → CPU build → API upload. +- It is acquired inside `_build_and_upload_single_arc()`. +- NEVER acquire the semaphore in a different scope (e.g. before spawning a task). +- Do NOT use `asyncio.Semaphore` as a substitute for the process pool limit. Both controls serve different purposes: the semaphore manages memory/IO, the `ProcessPoolExecutor` manages CPU. + +--- + +## 4. Error Handling Rules + +- A failure for one investigation MUST NOT abort the entire run. +- Catch expected errors at the point closest to the failure (`_upload_and_update_stats`, `_build_and_upload_single_arc`). +- On failure: increment `stats.failed_datasets` and append the identifier to `stats.failed_ids`. +- Re-raise unexpected errors (i.e., programming errors) so they are visible immediately. +- NEVER use bare `except Exception` as the final catch — only use it in `process_investigations`'s batch loop where it is immediately re-raised after logging. + +--- + +## 5. Configuration & Secrets + +- Configuration is loaded once in `main.py` via `ConfigWrapper` from `middleware.shared`. +- The resulting `Config` object is passed explicitly to functions that need it (dependency injection). +- Secrets (e.g. `connection_string`, API keys) use `pydantic.SecretStr`. Never log them with `str()` directly; use `.get_secret_value()` only at the point of use (e.g., engine creation). + +--- + +## 6. Logging Conventions + +- Every module defines: `logger = logging.getLogger(__name__)`. +- Do NOT use `print()` for any diagnostic output. +- Log messages that occur inside concurrent tasks MUST include a traceability prefix. Use the pattern `"%s: message", inv_info` (see `inv_info` in `processor.py`) so parallel log lines are distinguishable. +- Log levels: + - `DEBUG`: internal state, loop iterations. + - `INFO`: successful milestones (fetch, build, upload). + - `WARNING`: recoverable issues (missing optional column, assay without study). + - `ERROR`: per-item failures (build failed, upload failed). Do not use for fatal errors. + +--- + +## 7. Database Access Rules + +- All database reads go through the `Database` class in `database.py`. No other module is allowed to instantiate `AsyncEngine` or execute SQL directly. +- Use `stream_results=True` on all large queries to enable server-side cursors and avoid loading full tables into RAM. +- Use `literal_column("*")` with `select()` rather than ORM field mappings to generate clean `SELECT *` SQL. +- Related data (studies, assays, etc.) is ALWAYS fetched in bulk per batch using `WHERE investigation_ref IN (...)`. Never fetch related data row-by-row in a loop. diff --git a/docs/workspace.dsl b/docs/workspace.dsl new file mode 100644 index 00000000..24e8e3d5 --- /dev/null +++ b/docs/workspace.dsl @@ -0,0 +1,89 @@ +workspace "SQL-to-ARC Middleware" "Middleware component to convert SQL views into Annotated Research Context (ARC) objects." { + + model { + user = person "RDI Data Manager" "Responsible for managing and providing metadata from a Research Data Infrastructure." + + group "FAIRagro Ecosystem" { + fairAgroApi = softwareSystem "FAIRagro Middleware API" "Receives RO-Crate JSON-LD payloads." "External" + } + + sqlToArc = softwareSystem "SQL-to-ARC Converter" "Central component that maps relational data to ARC objects and sends them to the API." { + database = container "RDI SQL Database" "PostgreSQL database serving standardized metadata views." "PostgreSQL" "Database" + + converter = container "Converter Service" "Core logic for database extraction, mapping, and API transmission." "Python" { + main = component "Main Entry Point" "CLI interface and orchestrator." "Python" + + group "Async IO Loop (Controller)" { + orchestrator = component "Workflow Orchestrator" "Coordinates the data flow, manages concurrent tasks via Semaphores." "Python/Asyncio" + stats = component "Processing Stats" "Collects success/failure metrics and generates final reports." "Python" + } + + group "Process Pool Executor (Worker)" { + mapper = component "ARC Mapper" "Transforms relational rows into ARC structures using arctrl. Runs in separate OS processes to bypass GIL." "Python/arctrl" + serializer = component "JSON-LD Serializer" "Converts ARC objects to JSON strings directly in the worker process." "Python" + } + + group "Streaming Generator (Data Layer)" { + db_client = component "Database Client" "Implements lazy-loading and relational batching via SQLAlchemy streaming cursors." "Python/SQLAlchemy" + } + + api_client = component "API Client" "Handles mTLS secured async HTTP uploads to the Middleware API." "Python/httpx" + } + + demo_api = container "Mock API" "Simulates the FAIRagro API for local testing and CI." "FastAPI" "Development" + } + + # Relationships + user -> main "Configures and starts" + + main -> orchestrator "Orchestrates through" + orchestrator -> db_client "Streams investigations from" + orchestrator -> mapper "Submits tasks to Process Pool" + orchestrator -> api_client "Enqueues uploads to" + orchestrator -> stats "Updates metrics in" + + mapper -> serializer "Serializes to JSON-LD via" + db_client -> database "Queries views (vInvestigation, vStudy, etc.)" + api_client -> fairAgroApi "Sends RO-Crate JSON-LD (mTLS)" + api_client -> demo_api "Sends data during local demo" + } + + views { + systemContext sqlToArc "SystemContext" { + include * + autoLayout + } + + container sqlToArc "Containers" { + include * + autoLayout + } + + component converter "Components" { + include * + autoLayout + } + + styles { + element "Software System" { + background #1168bd + color #ffffff + } + element "Container" { + background #438dd5 + color #ffffff + } + element "Database" { + shape Cylinder + } + element "External" { + background #999999 + color #ffffff + } + element "Group" { + color #666666 + border Dotted + } + } + } +} From f6e86c20667131322320acf3bd344619eab9c55d Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg <138563220+Zalfsten@users.noreply.github.com> Date: Wed, 8 Apr 2026 14:54:24 +0200 Subject: [PATCH 13/27] Potential fix for pull request finding 'CodeQL / Uncontrolled data used in path expression' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> --- dev_environment/demo_api_main.py | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/dev_environment/demo_api_main.py b/dev_environment/demo_api_main.py index a393032d..58527cef 100644 --- a/dev_environment/demo_api_main.py +++ b/dev_environment/demo_api_main.py @@ -101,7 +101,22 @@ async def upload_arc(request: Request) -> dict[str, str | dict[str, str]]: output_path.mkdir(parents=True, exist_ok=True) _chown_tree(output_path) # Ensure the root output dir belongs to the host user - arc_id = arc_payload.get("identifier", f"arc_{os.urandom(4).hex()}") + # Derive a safe ARC identifier from the payload. The ARC identifier is used + # as a directory name below, so ensure it cannot escape the output_path. + raw_arc_id = arc_payload.get("identifier") + if isinstance(raw_arc_id, str) and raw_arc_id.strip(): + candidate_id = raw_arc_id.strip() + # Prevent absolute paths and directory components; keep only the final name. + if Path(candidate_id).is_absolute(): + candidate_id = f"arc_{os.urandom(4).hex()}" + safe_name = Path(candidate_id).name + if not safe_name: + arc_id = f"arc_{os.urandom(4).hex()}" + else: + arc_id = safe_name + else: + arc_id = f"arc_{os.urandom(4).hex()}" + now = datetime.now(UTC).isoformat() arc_dir = output_path / arc_id payload_path = arc_dir.with_suffix(".payload.json") From cf2305fd45a7db23c6dbb622c31fed18bb688117 Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg <138563220+Zalfsten@users.noreply.github.com> Date: Wed, 8 Apr 2026 14:56:39 +0200 Subject: [PATCH 14/27] Potential fix for pull request finding 'CodeQL / Uncontrolled data used in path expression' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> --- dev_environment/demo_api_main.py | 34 +++++++++++++++++++++++--------- 1 file changed, 25 insertions(+), 9 deletions(-) diff --git a/dev_environment/demo_api_main.py b/dev_environment/demo_api_main.py index 58527cef..9a7937d5 100644 --- a/dev_environment/demo_api_main.py +++ b/dev_environment/demo_api_main.py @@ -102,20 +102,36 @@ async def upload_arc(request: Request) -> dict[str, str | dict[str, str]]: _chown_tree(output_path) # Ensure the root output dir belongs to the host user # Derive a safe ARC identifier from the payload. The ARC identifier is used - # as a directory name below, so ensure it cannot escape the output_path. + # as a directory name below, so ensure it cannot escape the output_path and + # does not contain any path traversal or directory separators. raw_arc_id = arc_payload.get("identifier") + + def _generate_random_arc_id() -> str: + return f"arc_{os.urandom(4).hex()}" + if isinstance(raw_arc_id, str) and raw_arc_id.strip(): candidate_id = raw_arc_id.strip() - # Prevent absolute paths and directory components; keep only the final name. - if Path(candidate_id).is_absolute(): - candidate_id = f"arc_{os.urandom(4).hex()}" - safe_name = Path(candidate_id).name - if not safe_name: - arc_id = f"arc_{os.urandom(4).hex()}" + # Reduce to a single path component and normalize it. + safe_name = os.path.normpath(Path(candidate_id).name) + # Reject empty names, current/parent directory markers, or anything that + # would reintroduce directory components on this platform. + if not safe_name or safe_name in {".", ".."} or "/" in safe_name or "\\" in safe_name: + arc_id = _generate_random_arc_id() else: - arc_id = safe_name + # Build the full path and ensure it stays within the output_path root. + full_path = (output_path / safe_name).resolve() + try: + common_root = os.path.commonpath([str(output_path.resolve()), str(full_path)]) + except ValueError: + # On error (e.g., different drives), fall back to a random ID. + arc_id = _generate_random_arc_id() + else: + if common_root != str(output_path.resolve()): + arc_id = _generate_random_arc_id() + else: + arc_id = safe_name else: - arc_id = f"arc_{os.urandom(4).hex()}" + arc_id = _generate_random_arc_id() now = datetime.now(UTC).isoformat() arc_dir = output_path / arc_id From 48a934e593c91f55bed95e2ab638f38864691127 Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg <138563220+Zalfsten@users.noreply.github.com> Date: Wed, 8 Apr 2026 14:56:59 +0200 Subject: [PATCH 15/27] Potential fix for pull request finding 'CodeQL / Uncontrolled data used in path expression' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> --- dev_environment/demo_api_main.py | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/dev_environment/demo_api_main.py b/dev_environment/demo_api_main.py index 9a7937d5..1ac981b3 100644 --- a/dev_environment/demo_api_main.py +++ b/dev_environment/demo_api_main.py @@ -135,6 +135,26 @@ def _generate_random_arc_id() -> str: now = datetime.now(UTC).isoformat() arc_dir = output_path / arc_id + + # Ensure the resolved target directory stays within the intended output root. + output_root_resolved = output_path.resolve() + arc_dir_resolved = arc_dir.resolve() + common_root = Path(os.path.commonpath([str(output_root_resolved), str(arc_dir_resolved)])) + if common_root != output_root_resolved: + # Reject paths that would escape the output root (for example via symlinks). + return { + "arc_id": arc_id, + "status": "error", + "metadata": { + "rdi": rdi, + "arc_hash": "demo_hash", + "status": "REJECTED", + "first_seen": now, + "last_seen": now, + }, + } + + arc_dir = arc_dir_resolved payload_path = arc_dir.with_suffix(".payload.json") with open(payload_path, "w", encoding="utf-8") as handle: From d76eb756f90eb0acc5e495cddb1bff1c24a930c4 Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg Date: Wed, 8 Apr 2026 13:00:29 +0000 Subject: [PATCH 16/27] Fix typo in Architecture Rules section header in AGENTS.md --- AGENTS.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/AGENTS.md b/AGENTS.md index f0d928c4..0d914376 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -76,7 +76,7 @@ docker compose logs -f docker compose down ``` -## ïżœ Architecture Rules +## Architecture Rules Before generating or modifying code, read **[docs/ARCHITECTURE_RULES.md](docs/ARCHITECTURE_RULES.md)**. From 5c84645abb3324652baafa52eda6f049da0c533e Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg Date: Wed, 8 Apr 2026 13:00:37 +0000 Subject: [PATCH 17/27] Refactor db-init service: remove infinite sleep and update healthcheck condition for sql_to_arc dependency --- dev_environment/compose.demo.yaml | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/dev_environment/compose.demo.yaml b/dev_environment/compose.demo.yaml index 48d2e323..41f051c6 100644 --- a/dev_environment/compose.demo.yaml +++ b/dev_environment/compose.demo.yaml @@ -22,12 +22,6 @@ services: echo "Importing local demo.sql..." PGDATABASE=rdi psql < /tmp/demo.sql echo "Demo database initialization complete." - exec sleep infinity - healthcheck: - test: ["CMD-SHELL", "psql -d rdi -c 'SELECT 1' -U postgres"] - interval: 5s - timeout: 5s - retries: 5 middleware-api: image: python:3.12-slim @@ -59,7 +53,7 @@ services: sql_to_arc: depends_on: db-init: - condition: service_healthy + condition: service_completed_successfully middleware-api: condition: service_healthy volumes: From d36578366c518c4e501bded4e492b57b94be22ee Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg <138563220+Zalfsten@users.noreply.github.com> Date: Wed, 8 Apr 2026 15:01:39 +0200 Subject: [PATCH 18/27] Potential fix for pull request finding 'CodeQL / Uncontrolled data used in path expression' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> --- dev_environment/demo_api_main.py | 69 ++++++++++++++++++-------------- 1 file changed, 38 insertions(+), 31 deletions(-) diff --git a/dev_environment/demo_api_main.py b/dev_environment/demo_api_main.py index 1ac981b3..99cd698d 100644 --- a/dev_environment/demo_api_main.py +++ b/dev_environment/demo_api_main.py @@ -109,41 +109,50 @@ async def upload_arc(request: Request) -> dict[str, str | dict[str, str]]: def _generate_random_arc_id() -> str: return f"arc_{os.urandom(4).hex()}" - if isinstance(raw_arc_id, str) and raw_arc_id.strip(): - candidate_id = raw_arc_id.strip() - # Reduce to a single path component and normalize it. - safe_name = os.path.normpath(Path(candidate_id).name) - # Reject empty names, current/parent directory markers, or anything that - # would reintroduce directory components on this platform. - if not safe_name or safe_name in {".", ".."} or "/" in safe_name or "\\" in safe_name: - arc_id = _generate_random_arc_id() - else: - # Build the full path and ensure it stays within the output_path root. - full_path = (output_path / safe_name).resolve() - try: - common_root = os.path.commonpath([str(output_path.resolve()), str(full_path)]) - except ValueError: - # On error (e.g., different drives), fall back to a random ID. - arc_id = _generate_random_arc_id() + def _derive_safe_arc_id(base_dir: Path, raw_id: object) -> tuple[str, Path] | tuple[None, None]: + """ + Derive a safe ARC identifier and corresponding directory path that + is guaranteed to stay within the given base_dir. Returns (None, None) + if no safe identifier can be derived. + """ + base_resolved = base_dir.resolve() + + def _fallback() -> tuple[str, Path]: + rid = _generate_random_arc_id() + target = (base_resolved / rid).resolve() + return rid, target + + if isinstance(raw_id, str) and raw_id.strip(): + candidate_id = raw_id.strip() + # Reduce to a single path component and normalize it. + safe_name = os.path.normpath(Path(candidate_id).name) + # Reject empty names, current/parent directory markers, or anything that + # would reintroduce directory components on this platform. + if not safe_name or safe_name in {".", ".."} or "/" in safe_name or "\\" in safe_name: + arc_id, candidate_dir = _fallback() else: - if common_root != str(output_path.resolve()): - arc_id = _generate_random_arc_id() - else: - arc_id = safe_name - else: - arc_id = _generate_random_arc_id() + candidate_dir = (base_resolved / safe_name).resolve() + arc_id = safe_name + else: + arc_id, candidate_dir = _fallback() + + try: + common_root = os.path.commonpath([str(base_resolved), str(candidate_dir)]) + except ValueError: + return None, None + + if common_root != str(base_resolved): + return None, None + + return arc_id, candidate_dir now = datetime.now(UTC).isoformat() - arc_dir = output_path / arc_id - # Ensure the resolved target directory stays within the intended output root. - output_root_resolved = output_path.resolve() - arc_dir_resolved = arc_dir.resolve() - common_root = Path(os.path.commonpath([str(output_root_resolved), str(arc_dir_resolved)])) - if common_root != output_root_resolved: + arc_id, arc_dir = _derive_safe_arc_id(output_path, raw_arc_id) + if arc_id is None or arc_dir is None: # Reject paths that would escape the output root (for example via symlinks). return { - "arc_id": arc_id, + "arc_id": "invalid", "status": "error", "metadata": { "rdi": rdi, @@ -153,8 +162,6 @@ def _generate_random_arc_id() -> str: "last_seen": now, }, } - - arc_dir = arc_dir_resolved payload_path = arc_dir.with_suffix(".payload.json") with open(payload_path, "w", encoding="utf-8") as handle: From e864741b7b817d210322c07c937c7bb59b25973d Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg <138563220+Zalfsten@users.noreply.github.com> Date: Wed, 8 Apr 2026 15:05:01 +0200 Subject: [PATCH 19/27] Potential fix for pull request finding 'CodeQL / Uncontrolled data used in path expression' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> --- dev_environment/demo_api_main.py | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/dev_environment/demo_api_main.py b/dev_environment/demo_api_main.py index 99cd698d..3f344fb6 100644 --- a/dev_environment/demo_api_main.py +++ b/dev_environment/demo_api_main.py @@ -11,6 +11,7 @@ import traceback from datetime import UTC, datetime from pathlib import Path +import re from arctrl import ARC from arctrl.py.fable_modules.fable_library.async_ import start_as_task # type: ignore[import-untyped] @@ -122,13 +123,24 @@ def _fallback() -> tuple[str, Path]: target = (base_resolved / rid).resolve() return rid, target + # Allow only simple, short directory names consisting of safe characters. + # This ensures that user-controlled identifiers cannot introduce path + # traversal or unexpected filesystem semantics. + safe_name_pattern = re.compile(r"^[A-Za-z0-9_.-]{1,64}$") + if isinstance(raw_id, str) and raw_id.strip(): candidate_id = raw_id.strip() # Reduce to a single path component and normalize it. safe_name = os.path.normpath(Path(candidate_id).name) - # Reject empty names, current/parent directory markers, or anything that - # would reintroduce directory components on this platform. - if not safe_name or safe_name in {".", ".."} or "/" in safe_name or "\\" in safe_name: + # Reject empty names, current/parent directory markers, any embedded + # separators, or names that do not match the allowed pattern. + if ( + not safe_name + or safe_name in {".", ".."} + or "/" in safe_name + or "\\" in safe_name + or not safe_name_pattern.match(safe_name) + ): arc_id, candidate_dir = _fallback() else: candidate_dir = (base_resolved / safe_name).resolve() From 5e8c2af8efcb46faf128f1910025479170922e78 Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg <138563220+Zalfsten@users.noreply.github.com> Date: Wed, 8 Apr 2026 15:09:32 +0200 Subject: [PATCH 20/27] Potential fix for pull request finding 'CodeQL / Uncontrolled data used in path expression' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> --- dev_environment/demo_api_main.py | 33 ++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/dev_environment/demo_api_main.py b/dev_environment/demo_api_main.py index 3f344fb6..857b5d0b 100644 --- a/dev_environment/demo_api_main.py +++ b/dev_environment/demo_api_main.py @@ -174,6 +174,39 @@ def _fallback() -> tuple[str, Path]: "last_seen": now, }, } + + # Final safety check: ensure the resolved ARC directory is strictly within + # the resolved output root directory before performing any filesystem I/O. + try: + base_real = os.path.realpath(output_path) + arc_real = os.path.realpath(arc_dir) + common_root = os.path.commonpath([base_real, arc_real]) + except ValueError: + return { + "arc_id": "invalid", + "status": "error", + "metadata": { + "rdi": rdi, + "arc_hash": "demo_hash", + "status": "REJECTED", + "first_seen": now, + "last_seen": now, + }, + } + + if common_root != base_real: + return { + "arc_id": "invalid", + "status": "error", + "metadata": { + "rdi": rdi, + "arc_hash": "demo_hash", + "status": "REJECTED", + "first_seen": now, + "last_seen": now, + }, + } + payload_path = arc_dir.with_suffix(".payload.json") with open(payload_path, "w", encoding="utf-8") as handle: From 9510a6330cfd4648fb2548bf011e719def9c8deb Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg <138563220+Zalfsten@users.noreply.github.com> Date: Wed, 8 Apr 2026 15:13:03 +0200 Subject: [PATCH 21/27] Potential fix for pull request finding 'CodeQL / Uncontrolled data used in path expression' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> --- dev_environment/demo_api_main.py | 52 +++++++++++++++++++++++++++++--- 1 file changed, 48 insertions(+), 4 deletions(-) diff --git a/dev_environment/demo_api_main.py b/dev_environment/demo_api_main.py index 857b5d0b..f7b5fcf3 100644 --- a/dev_environment/demo_api_main.py +++ b/dev_environment/demo_api_main.py @@ -19,6 +19,9 @@ app = FastAPI() +# Root directory under which all ARC data and error logs are stored. +OUTPUT_ROOT = Path("/data/arcs") + def _get_target_owner() -> tuple[int, int] | None: uid_value = os.environ.get("LOCAL_UID") @@ -66,13 +69,54 @@ def _handle_error(arc_dir: Path, rdi: str, arc_id: str, exc: Exception) -> None: """ tb = traceback.format_exc() print(f"Error writing ARC for {rdi}/{arc_id}: {exc}") - arc_dir.mkdir(parents=True, exist_ok=True) - error_path = arc_dir / "error.txt" + + # Ensure that error logging always happens under the configured OUTPUT_ROOT, + # regardless of how arc_dir was derived. This avoids using any potentially + # untrusted path prefixes. + base_root = OUTPUT_ROOT.resolve() + + # Derive a simple, safe subdirectory name from the provided arc_dir/arc_id. + # Prefer the final path component of arc_dir; fall back to arc_id; and + # finally to a generic name if necessary. + candidate_name = arc_dir.name if arc_dir.name not in {"", ".", ".."} else arc_id + if not isinstance(candidate_name, str) or not candidate_name: + candidate_name = "unknown" + + # Reuse the same safe-name pattern as in upload_arc. + safe_name_pattern = re.compile(r"^[A-Za-z0-9_.-]{1,64}$") + candidate_name = candidate_name.strip() + if ( + not candidate_name + or candidate_name in {".", ".."} + or "/" in candidate_name + or "\\" in candidate_name + or not safe_name_pattern.match(candidate_name) + ): + safe_name = "unknown" + else: + safe_name = candidate_name + + # Build the final directory for error logging under the safe root. + safe_arc_dir = (base_root / safe_name).resolve() + try: + common_root = os.path.commonpath([str(base_root), str(safe_arc_dir)]) + except ValueError: + # Fall back to logging under a generic error directory if something goes wrong. + safe_arc_dir = (base_root / "errors").resolve() + common_root = os.path.commonpath([str(base_root), str(safe_arc_dir)]) + + if common_root != str(base_root): + # As an additional safeguard, if the computed directory is not under + # OUTPUT_ROOT, force it into a fixed "errors" directory. + safe_arc_dir = (base_root / "errors").resolve() + + safe_arc_dir.mkdir(parents=True, exist_ok=True) + error_path = safe_arc_dir / "error.txt" with open(error_path, "w", encoding="utf-8") as handle: handle.write(str(exc)) handle.write("\n\n") handle.write(tb) - _chown_tree(arc_dir) + _chown_tree(safe_arc_dir) @app.post("/v3/arcs") @@ -98,7 +142,7 @@ async def upload_arc(request: Request) -> dict[str, str | dict[str, str]]: if rdi is None: rdi = data.get("rdi", "unknown") - output_path = Path("/data/arcs") + output_path = OUTPUT_ROOT output_path.mkdir(parents=True, exist_ok=True) _chown_tree(output_path) # Ensure the root output dir belongs to the host user From 986dcfd5433e0b0f76a593256fddd96f25bedbe9 Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg <138563220+Zalfsten@users.noreply.github.com> Date: Wed, 8 Apr 2026 15:14:40 +0200 Subject: [PATCH 22/27] Potential fix for pull request finding 'CodeQL / Uncontrolled data used in path expression' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> --- dev_environment/demo_api_main.py | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/dev_environment/demo_api_main.py b/dev_environment/demo_api_main.py index f7b5fcf3..fffefc20 100644 --- a/dev_environment/demo_api_main.py +++ b/dev_environment/demo_api_main.py @@ -107,8 +107,19 @@ def _handle_error(arc_dir: Path, rdi: str, arc_id: str, exc: Exception) -> None: if common_root != str(base_root): # As an additional safeguard, if the computed directory is not under - # OUTPUT_ROOT, force it into a fixed "errors" directory. + # OUTPUT_ROOT, force it into a fixed "errors" directory and re-validate. safe_arc_dir = (base_root / "errors").resolve() + try: + common_root = os.path.commonpath([str(base_root), str(safe_arc_dir)]) + except ValueError: + # In the unlikely event that commonpath still fails, log directly under + # the resolved OUTPUT_ROOT itself. + safe_arc_dir = base_root + else: + if common_root != str(base_root): + # If the resolved "errors" directory is still not under OUTPUT_ROOT, + # fall back to using the OUTPUT_ROOT directory itself. + safe_arc_dir = base_root safe_arc_dir.mkdir(parents=True, exist_ok=True) error_path = safe_arc_dir / "error.txt" From a3222d1bbdcf1c4b56b9a9f7b999797ee911e2dd Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg <138563220+Zalfsten@users.noreply.github.com> Date: Wed, 8 Apr 2026 15:15:06 +0200 Subject: [PATCH 23/27] Potential fix for pull request finding 'CodeQL / Uncontrolled data used in path expression' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> --- dev_environment/demo_api_main.py | 57 +++++++------------------------- 1 file changed, 12 insertions(+), 45 deletions(-) diff --git a/dev_environment/demo_api_main.py b/dev_environment/demo_api_main.py index fffefc20..6460a2ef 100644 --- a/dev_environment/demo_api_main.py +++ b/dev_environment/demo_api_main.py @@ -70,55 +70,22 @@ def _handle_error(arc_dir: Path, rdi: str, arc_id: str, exc: Exception) -> None: tb = traceback.format_exc() print(f"Error writing ARC for {rdi}/{arc_id}: {exc}") - # Ensure that error logging always happens under the configured OUTPUT_ROOT, - # regardless of how arc_dir was derived. This avoids using any potentially - # untrusted path prefixes. - base_root = OUTPUT_ROOT.resolve() - - # Derive a simple, safe subdirectory name from the provided arc_dir/arc_id. - # Prefer the final path component of arc_dir; fall back to arc_id; and - # finally to a generic name if necessary. - candidate_name = arc_dir.name if arc_dir.name not in {"", ".", ".."} else arc_id - if not isinstance(candidate_name, str) or not candidate_name: - candidate_name = "unknown" - - # Reuse the same safe-name pattern as in upload_arc. - safe_name_pattern = re.compile(r"^[A-Za-z0-9_.-]{1,64}$") - candidate_name = candidate_name.strip() - if ( - not candidate_name - or candidate_name in {".", ".."} - or "/" in candidate_name - or "\\" in candidate_name - or not safe_name_pattern.match(candidate_name) - ): - safe_name = "unknown" - else: - safe_name = candidate_name - - # Build the final directory for error logging under the safe root. - safe_arc_dir = (base_root / safe_name).resolve() - try: - common_root = os.path.commonpath([str(base_root), str(safe_arc_dir)]) - except ValueError: - # Fall back to logging under a generic error directory if something goes wrong. - safe_arc_dir = (base_root / "errors").resolve() - common_root = os.path.commonpath([str(base_root), str(safe_arc_dir)]) - - if common_root != str(base_root): - # As an additional safeguard, if the computed directory is not under - # OUTPUT_ROOT, force it into a fixed "errors" directory and re-validate. - safe_arc_dir = (base_root / "errors").resolve() - try: - common_root = os.path.commonpath([str(base_root), str(safe_arc_dir)]) - except ValueError: + # Always log errors under a fixed subdirectory of OUTPUT_ROOT to avoid + # any dependence on user-controlled identifiers when constructing paths. + errors_root = (OUTPUT_ROOT / "errors").resolve() + errors_root.mkdir(parents=True, exist_ok=True) + + error_path = errors_root / "error.txt" # In the unlikely event that commonpath still fails, log directly under - # the resolved OUTPUT_ROOT itself. - safe_arc_dir = base_root + handle.write(f"RDI: {rdi}\n") + handle.write(f"ARC ID: {arc_id}\n") + handle.write(f"ARC directory: {arc_dir}\n") + handle.write(f"Exception: {exc}\n\n") else: - if common_root != str(base_root): # If the resolved "errors" directory is still not under OUTPUT_ROOT, + _chown_tree(errors_root) # fall back to using the OUTPUT_ROOT directory itself. + safe_arc_dir = base_root safe_arc_dir.mkdir(parents=True, exist_ok=True) From 2a7e5494d1b001bf6e941ff6ed6968a164e1b143 Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg Date: Wed, 8 Apr 2026 13:44:03 +0000 Subject: [PATCH 24/27] Refactor error handling in ARC processing to simplify logging --- dev_environment/demo_api_main.py | 213 +++++++++++-------------------- 1 file changed, 71 insertions(+), 142 deletions(-) diff --git a/dev_environment/demo_api_main.py b/dev_environment/demo_api_main.py index 6460a2ef..2f5ebd47 100644 --- a/dev_environment/demo_api_main.py +++ b/dev_environment/demo_api_main.py @@ -8,10 +8,10 @@ import json import os +import re import traceback from datetime import UTC, datetime from pathlib import Path -import re from arctrl import ARC from arctrl.py.fable_modules.fable_library.async_ import start_as_task # type: ignore[import-untyped] @@ -58,60 +58,79 @@ def apply_ownership(target: Path) -> None: def _handle_error(arc_dir: Path, rdi: str, arc_id: str, exc: Exception) -> None: + tb = traceback.format_exc() + print(f"Error writing ARC for {rdi}/{arc_id} (dir={arc_dir}): {exc}\n{tb}") + + +# Pre-compiled pattern for safe ARC directory names (no path traversal, predictable charset). +_SAFE_NAME_PATTERN = re.compile(r"^[A-Za-z0-9_.-]{1,64}$") + + +def _generate_random_arc_id() -> str: + return f"arc_{os.urandom(4).hex()}" + + +def _derive_safe_arc_id(base_dir: Path, raw_id: object) -> tuple[str, Path] | tuple[None, None]: """ - Log and store error information when ARC processing fails. + Derive a safe ARC identifier and corresponding directory path. - Args: - arc_dir: The directory where the ARC was supposed to be saved. - rdi: The RDI identifier. - arc_id: The ARC identifier. - exc: The exception that occurred. + Guarantees that the returned path stays within base_dir. Returns (None, None) + if no safe identifier can be derived. """ - tb = traceback.format_exc() - print(f"Error writing ARC for {rdi}/{arc_id}: {exc}") - - # Always log errors under a fixed subdirectory of OUTPUT_ROOT to avoid - # any dependence on user-controlled identifiers when constructing paths. - errors_root = (OUTPUT_ROOT / "errors").resolve() - errors_root.mkdir(parents=True, exist_ok=True) - - error_path = errors_root / "error.txt" - # In the unlikely event that commonpath still fails, log directly under - handle.write(f"RDI: {rdi}\n") - handle.write(f"ARC ID: {arc_id}\n") - handle.write(f"ARC directory: {arc_dir}\n") - handle.write(f"Exception: {exc}\n\n") + base_resolved = base_dir.resolve() + + def _fallback() -> tuple[str, Path]: + rid = _generate_random_arc_id() + return rid, (base_resolved / rid).resolve() + + if isinstance(raw_id, str) and raw_id.strip(): + safe_name = os.path.normpath(Path(raw_id.strip()).name) + if ( + not safe_name + or safe_name in {".", ".."} + or "/" in safe_name + or "\\" in safe_name + or not _SAFE_NAME_PATTERN.match(safe_name) + ): + arc_id, candidate_dir = _fallback() else: - # If the resolved "errors" directory is still not under OUTPUT_ROOT, - _chown_tree(errors_root) - # fall back to using the OUTPUT_ROOT directory itself. + candidate_dir = (base_resolved / safe_name).resolve() + arc_id = safe_name + else: + arc_id, candidate_dir = _fallback() + + try: + common_root = os.path.commonpath([str(base_resolved), str(candidate_dir)]) + except ValueError: + return None, None - safe_arc_dir = base_root + if common_root != str(base_resolved): + return None, None - safe_arc_dir.mkdir(parents=True, exist_ok=True) - error_path = safe_arc_dir / "error.txt" - with open(error_path, "w", encoding="utf-8") as handle: - handle.write(str(exc)) - handle.write("\n\n") - handle.write(tb) - _chown_tree(safe_arc_dir) + return arc_id, candidate_dir -@app.post("/v3/arcs") -async def upload_arc(request: Request) -> dict[str, str | dict[str, str]]: - """ - Handle the submission of an ARC RO-Crate. +def _rejected_response(rdi: str | None, now: str) -> dict[str, str | dict[str, str]]: + return { + "arc_id": "invalid", + "status": "error", + "metadata": { + "rdi": rdi or "unknown", + "arc_hash": "demo_hash", + "status": "REJECTED", + "first_seen": now, + "last_seen": now, + }, + } - This endpoint receives the RO-Crate JSON-LD payload, validates it, - and uses the arctrl library to reconstruct the ARC directory structure. - The resulting files are saved to the local 'demo_output' volume. - Args: - rdi: The identifier of the Research Data Infrastructure. - payload: The request body containing the 'arc' (RO-Crate JSON). +@app.post("/v3/arcs") +async def upload_arc(request: Request) -> dict[str, str | dict[str, str]]: + """Handle the submission of an ARC RO-Crate. - Returns: - A dictionary matching the ArcResult schema expected by the ApiClient. + Receives the RO-Crate JSON-LD payload, validates it, and uses the arctrl + library to reconstruct the ARC directory structure. Results are saved to + the local 'demo_output' volume. """ rdi = request.query_params.get("rdi") data = await request.json() @@ -122,115 +141,25 @@ async def upload_arc(request: Request) -> dict[str, str | dict[str, str]]: output_path = OUTPUT_ROOT output_path.mkdir(parents=True, exist_ok=True) - _chown_tree(output_path) # Ensure the root output dir belongs to the host user - - # Derive a safe ARC identifier from the payload. The ARC identifier is used - # as a directory name below, so ensure it cannot escape the output_path and - # does not contain any path traversal or directory separators. - raw_arc_id = arc_payload.get("identifier") - - def _generate_random_arc_id() -> str: - return f"arc_{os.urandom(4).hex()}" - - def _derive_safe_arc_id(base_dir: Path, raw_id: object) -> tuple[str, Path] | tuple[None, None]: - """ - Derive a safe ARC identifier and corresponding directory path that - is guaranteed to stay within the given base_dir. Returns (None, None) - if no safe identifier can be derived. - """ - base_resolved = base_dir.resolve() - - def _fallback() -> tuple[str, Path]: - rid = _generate_random_arc_id() - target = (base_resolved / rid).resolve() - return rid, target - - # Allow only simple, short directory names consisting of safe characters. - # This ensures that user-controlled identifiers cannot introduce path - # traversal or unexpected filesystem semantics. - safe_name_pattern = re.compile(r"^[A-Za-z0-9_.-]{1,64}$") - - if isinstance(raw_id, str) and raw_id.strip(): - candidate_id = raw_id.strip() - # Reduce to a single path component and normalize it. - safe_name = os.path.normpath(Path(candidate_id).name) - # Reject empty names, current/parent directory markers, any embedded - # separators, or names that do not match the allowed pattern. - if ( - not safe_name - or safe_name in {".", ".."} - or "/" in safe_name - or "\\" in safe_name - or not safe_name_pattern.match(safe_name) - ): - arc_id, candidate_dir = _fallback() - else: - candidate_dir = (base_resolved / safe_name).resolve() - arc_id = safe_name - else: - arc_id, candidate_dir = _fallback() - - try: - common_root = os.path.commonpath([str(base_resolved), str(candidate_dir)]) - except ValueError: - return None, None - - if common_root != str(base_resolved): - return None, None - - return arc_id, candidate_dir + _chown_tree(output_path) now = datetime.now(UTC).isoformat() - arc_id, arc_dir = _derive_safe_arc_id(output_path, raw_arc_id) + arc_id, arc_dir = _derive_safe_arc_id(output_path, arc_payload.get("identifier")) if arc_id is None or arc_dir is None: - # Reject paths that would escape the output root (for example via symlinks). - return { - "arc_id": "invalid", - "status": "error", - "metadata": { - "rdi": rdi, - "arc_hash": "demo_hash", - "status": "REJECTED", - "first_seen": now, - "last_seen": now, - }, - } - - # Final safety check: ensure the resolved ARC directory is strictly within - # the resolved output root directory before performing any filesystem I/O. + return _rejected_response(rdi, now) + + # Final safety check via realpath to catch symlink escapes. try: base_real = os.path.realpath(output_path) - arc_real = os.path.realpath(arc_dir) - common_root = os.path.commonpath([base_real, arc_real]) + common_root = os.path.commonpath([base_real, os.path.realpath(arc_dir)]) except ValueError: - return { - "arc_id": "invalid", - "status": "error", - "metadata": { - "rdi": rdi, - "arc_hash": "demo_hash", - "status": "REJECTED", - "first_seen": now, - "last_seen": now, - }, - } + return _rejected_response(rdi, now) if common_root != base_real: - return { - "arc_id": "invalid", - "status": "error", - "metadata": { - "rdi": rdi, - "arc_hash": "demo_hash", - "status": "REJECTED", - "first_seen": now, - "last_seen": now, - }, - } + return _rejected_response(rdi, now) payload_path = arc_dir.with_suffix(".payload.json") - with open(payload_path, "w", encoding="utf-8") as handle: json.dump(arc_payload, handle, indent=2) _chown_tree(payload_path) From e1df71c5ac2f3ecc2252ca1c41941877ebb803fb Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg Date: Wed, 8 Apr 2026 13:44:19 +0000 Subject: [PATCH 25/27] Refactor Docker Compose setup to streamline PostgreSQL initialization and simplify demo script --- dev_environment/compose.demo.yaml | 47 ++++++++++++++----------------- dev_environment/start-demo.sh | 8 +++--- 2 files changed, 25 insertions(+), 30 deletions(-) diff --git a/dev_environment/compose.demo.yaml b/dev_environment/compose.demo.yaml index 41f051c6..78f906d0 100644 --- a/dev_environment/compose.demo.yaml +++ b/dev_environment/compose.demo.yaml @@ -1,27 +1,20 @@ services: - db-init: + postgres: image: postgres:15 - depends_on: - postgres: - condition: service_healthy + restart: unless-stopped environment: - PGHOST: postgres - PGUSER: ${POSTGRES_USER} - PGPASSWORD: ${POSTGRES_PASSWORD} - PGDATABASE: postgres + POSTGRES_USER: postgres + POSTGRES_PASSWORD: postgres + POSTGRES_DB: rdi + ports: + - "5432:5432" volumes: - - ./demo.sql:/tmp/demo.sql:ro - entrypoint: /bin/bash - command: - - -c - - | - set -euo pipefail - echo "Recreating rdi database for demo..." - psql -c "DROP DATABASE IF EXISTS rdi;" - psql -c "CREATE DATABASE rdi;" - echo "Importing local demo.sql..." - PGDATABASE=rdi psql < /tmp/demo.sql - echo "Demo database initialization complete." + - ./demo.sql:/docker-entrypoint-initdb.d/01-demo.sql:ro + healthcheck: + test: ["CMD-SHELL", "pg_isready -U postgres -d rdi"] + interval: 5s + timeout: 5s + retries: 20 middleware-api: image: python:3.12-slim @@ -40,20 +33,22 @@ services: - -c - | set -e - apt-get update - apt-get install -y --no-install-recommends curl pip install fastapi uvicorn arctrl uvicorn main:app --app-dir /app --host 0.0.0.0 --port 8000 healthcheck: - test: ["CMD", "curl", "-f", "http://localhost:8000/live"] + test: ["CMD-SHELL", "python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8000/live')\" 2>/dev/null || exit 1"] interval: 5s timeout: 5s - retries: 5 + retries: 10 sql_to_arc: + image: sql_to_arc:latest + build: + context: .. + dockerfile: docker/Dockerfile.sql_to_arc depends_on: - db-init: - condition: service_completed_successfully + postgres: + condition: service_healthy middleware-api: condition: service_healthy volumes: diff --git a/dev_environment/start-demo.sh b/dev_environment/start-demo.sh index c10c1160..d05cbe64 100755 --- a/dev_environment/start-demo.sh +++ b/dev_environment/start-demo.sh @@ -40,14 +40,14 @@ mkdir -p "${script_dir}/demo_output" # Start services using the base compose file + the demo override # This now overrides the db-init service to use demo.sql without downloads -docker compose -f compose.dev.yaml -f compose.demo.yaml up $BUILD_FLAG --abort-on-container-exit --exit-code-from sql_to_arc +docker compose -f compose.demo.yaml up $BUILD_FLAG --abort-on-container-exit --exit-code-from sql_to_arc echo "" echo "==> Demo finished! Cleaning up containers..." -docker compose -f compose.dev.yaml -f compose.demo.yaml down -v +docker compose -f compose.demo.yaml down -v echo "" echo "==> You can find the generated ARCs in: dev_environment/demo_output/" echo " (Wait a moment for files to appear if they are being processed...)" -echo " - View logs: docker compose -f compose.dev.yaml -f compose.demo.yaml logs" -echo " - Clean up: docker compose -f compose.dev.yaml -f compose.demo.yaml down -v" +echo " - View logs: docker compose -f compose.demo.yaml logs" +echo " - Clean up: docker compose -f compose.demo.yaml down -v" From 345a9bfd919a04c1c49ae41361b27970a4909a31 Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg Date: Wed, 8 Apr 2026 14:00:52 +0000 Subject: [PATCH 26/27] Refactor _derive_safe_arc_id to ensure valid ARC identifiers and prevent path traversal --- dev_environment/demo_api_main.py | 60 +++++++++++--------------------- 1 file changed, 20 insertions(+), 40 deletions(-) diff --git a/dev_environment/demo_api_main.py b/dev_environment/demo_api_main.py index 2f5ebd47..e754a45f 100644 --- a/dev_environment/demo_api_main.py +++ b/dev_environment/demo_api_main.py @@ -70,44 +70,36 @@ def _generate_random_arc_id() -> str: return f"arc_{os.urandom(4).hex()}" -def _derive_safe_arc_id(base_dir: Path, raw_id: object) -> tuple[str, Path] | tuple[None, None]: +def _derive_safe_arc_id(base_dir: Path, raw_id: object) -> tuple[str, Path]: """ Derive a safe ARC identifier and corresponding directory path. - Guarantees that the returned path stays within base_dir. Returns (None, None) - if no safe identifier can be derived. + Always returns a valid (arc_id, path) pair that is guaranteed to be + contained within base_dir. Falls back to a random ID when the provided + raw_id cannot be used safely. """ - base_resolved = base_dir.resolve() + # Resolve symlinks on the base directory once so all comparisons are stable. + base_real = Path(os.path.realpath(base_dir)) def _fallback() -> tuple[str, Path]: rid = _generate_random_arc_id() - return rid, (base_resolved / rid).resolve() - - if isinstance(raw_id, str) and raw_id.strip(): - safe_name = os.path.normpath(Path(raw_id.strip()).name) - if ( - not safe_name - or safe_name in {".", ".."} - or "/" in safe_name - or "\\" in safe_name - or not _SAFE_NAME_PATTERN.match(safe_name) - ): - arc_id, candidate_dir = _fallback() - else: - candidate_dir = (base_resolved / safe_name).resolve() - arc_id = safe_name - else: - arc_id, candidate_dir = _fallback() + return rid, base_real / rid - try: - common_root = os.path.commonpath([str(base_resolved), str(candidate_dir)]) - except ValueError: - return None, None + if not (isinstance(raw_id, str) and raw_id.strip()): + return _fallback() - if common_root != str(base_resolved): - return None, None + safe_name = os.path.normpath(Path(raw_id.strip()).name) + if not safe_name or safe_name in {".", ".."} or not _SAFE_NAME_PATTERN.match(safe_name): + return _fallback() - return arc_id, candidate_dir + # Normalize with realpath and verify containment *before* returning the path. + # This is the CodeQL-recommended pattern for preventing path traversal: + # construct → realpath → startswith-check. + candidate_real = Path(os.path.realpath(base_real / safe_name)) + if not str(candidate_real).startswith(str(base_real) + os.sep): + return _fallback() + + return safe_name, candidate_real def _rejected_response(rdi: str | None, now: str) -> dict[str, str | dict[str, str]]: @@ -146,18 +138,6 @@ async def upload_arc(request: Request) -> dict[str, str | dict[str, str]]: now = datetime.now(UTC).isoformat() arc_id, arc_dir = _derive_safe_arc_id(output_path, arc_payload.get("identifier")) - if arc_id is None or arc_dir is None: - return _rejected_response(rdi, now) - - # Final safety check via realpath to catch symlink escapes. - try: - base_real = os.path.realpath(output_path) - common_root = os.path.commonpath([base_real, os.path.realpath(arc_dir)]) - except ValueError: - return _rejected_response(rdi, now) - - if common_root != base_real: - return _rejected_response(rdi, now) payload_path = arc_dir.with_suffix(".payload.json") with open(payload_path, "w", encoding="utf-8") as handle: From 7e6becc554a26458d7e9fe7adca5251a4adc00dd Mon Sep 17 00:00:00 2001 From: Carsten Scharfenberg Date: Wed, 8 Apr 2026 14:01:04 +0000 Subject: [PATCH 27/27] Update PostgreSQL environment variables in Docker Compose to use dynamic values --- dev_environment/compose.demo.yaml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/dev_environment/compose.demo.yaml b/dev_environment/compose.demo.yaml index 78f906d0..2d14eebf 100644 --- a/dev_environment/compose.demo.yaml +++ b/dev_environment/compose.demo.yaml @@ -3,8 +3,8 @@ services: image: postgres:15 restart: unless-stopped environment: - POSTGRES_USER: postgres - POSTGRES_PASSWORD: postgres + POSTGRES_USER: ${POSTGRES_USER} + POSTGRES_PASSWORD: ${POSTGRES_PASSWORD} POSTGRES_DB: rdi ports: - "5432:5432"