diff --git a/.agents/skills/arctrl/SKILL.md b/.agents/skills/arctrl/SKILL.md new file mode 100644 index 0000000..fdaef73 --- /dev/null +++ b/.agents/skills/arctrl/SKILL.md @@ -0,0 +1,322 @@ +--- +name: arctrl +description: > + Reference for using the arctrl Python library (v3.x) to build ARC (Annotated + Research Context) objects and serialize them to RO-Crate JSON-LD. Use when + working with ArcInvestigation, ArcStudy, ArcAssay, ArcTable, CompositeHeader, + CompositeCell, OntologyAnnotation, OntologySourceReference, Person, or + Publication objects, or when calling ToROCrateJsonString / WriteAsync. +compatibility: Python 3.12+, arctrl (Fable-transpiled F# library) +--- + +# ARCtrl — Usage Reference + +ARCtrl is a Fable-transpiled F# library — the Python surface is idiomatic +but some internals are Fable runtime types. + +--- + +## Package & Imports + +arctrl ships no type stubs and no `py.typed` marker. Mypy will report +`[import-untyped]` for every `arctrl.*` import unless you suppress it. + +**Preferred: project-level override in `pyproject.toml`** (no per-import +comments needed, covers all submodules): + +```toml +[[tool.mypy.overrides]] +module = ["arctrl", "arctrl.*"] +ignore_missing_imports = true +``` + +The `arctrl.*` glob is required because the Fable-transpiled internals are +exposed under `arctrl.py.*` subpackages (e.g. +`arctrl.py.Core.Table.composite_cell`), which are a different dotted path +from the bare `arctrl` package. 
+ +**Alternative: per-import suppression** (only needed when the project-level +override is not in place): + +```python +from arctrl.py.fable_modules.fable_library.async_ import start_as_task # type: ignore[import-untyped] +from arctrl.py.Core.Table.composite_cell import Data # type: ignore[import-untyped] +``` + +--- + +```python +from arctrl import ( + ARC, + ArcAssay, + ArcInvestigation, + ArcStudy, + ArcTable, + CompositeCell, + CompositeHeader, + IOType, + OntologyAnnotation, + Person, + Publication, +) + +# Async write helper lives in the Fable internals: +from arctrl.py.fable_modules.fable_library.async_ import start_as_task # type: ignore[import-untyped] +``` + +--- + +## Core Objects + +### OntologyAnnotation + +```python +# Empty / unknown +oa = OntologyAnnotation() + +# With values — all parameters optional: +oa = OntologyAnnotation( + name="soil texture", # human-readable term + tan="http://purl.obolibrary.org/obo/ENVO_00002001", # TermAccessionNumber (URI) + tsr="ENVO", # TermSourceREF: short name of the ontology source +) +# tsr is a back-reference to an OntologySourceReference registered on the +# investigation (by its .Name). If no OntologySourceReference is registered, +# tsr can be left empty or omitted. +``` + +### OntologySourceReference + +Registered on `ArcInvestigation.OntologySourceReferences`. Describes an +ontology source and holds its version. + +```python +from arctrl import OntologySourceReference + +osr = OntologySourceReference( + name="ENVO", # short name — must match OntologyAnnotation.tsr + description="Environment Ontology", + file="http://purl.obolibrary.org/obo/envo.owl", + version="2024-01-01", # ontology version / access date +) +investigation.OntologySourceReferences.append(osr) +``` + +**Relationship:** `OntologyAnnotation.tsr` is a string key that references +`OntologySourceReference.name`. ARCtrl does not enforce referential integrity +at runtime, but the RO-Crate serialization will include both objects. 
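Since ARCtrl does not check this link at runtime, a small pre-serialization check can catch `tsr` values that point at no registered source. The sketch below is a hypothetical helper, not part of arctrl: `find_unregistered_sources` is an invented name, and it works on plain strings that you are assumed to have collected from your annotations and ontology source references yourself.

```python
from typing import Iterable, Optional, Set


def find_unregistered_sources(
    annotation_tsrs: Iterable[Optional[str]],
    registered_names: Iterable[str],
) -> Set[str]:
    """Return tsr values that match no registered ontology source name."""
    known = set(registered_names)
    # Empty or missing tsr values are allowed, so they are skipped.
    used = {tsr for tsr in annotation_tsrs if tsr}
    return used - known
```

For example, `find_unregistered_sources(["ENVO", "", None, "NCBITaxon"], ["ENVO"])` reports `NCBITaxon` as the only dangling reference.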
+ +### ArcInvestigation + +```python +inv = ArcInvestigation.create( + identifier="inv001", # required, must be non-empty + title="My Investigation", + description="...", + submission_date="2024-01-15", # ISO string or None + public_release_date="2025-01-01", +) +``` + +### ArcStudy + +```python +study = ArcStudy.create( + identifier="study001", + title="My Study", + description="...", + submission_date=None, + public_release_date=None, +) +``` + +### ArcAssay + +```python +assay = ArcAssay.create( + identifier="assay001", + measurement_type=OntologyAnnotation("soil metagenome", "http://...", ""), + technology_type=OntologyAnnotation("nucleotide sequencing", "http://...", ""), + technology_platform=OntologyAnnotation("Illumina", None, None), + # technology_platform=None is fine if unknown +) +``` + +### Person + +```python +person = Person( + last_name="Doe", + first_name="John", + mid_initials="A", + email="j.doe@example.com", + phone="+49 123 456789", + fax=None, + address="Somewhere", + affiliation="UFZ", + roles=[OntologyAnnotation("author", "http://...", "")], +) +``` + +### Publication + +```python +pub = Publication( + doi="10.1234/example", + pub_med_id="12345678", + authors="Doe J, Smith A", + title="Paper title", + status=OntologyAnnotation("published", "http://...", ""), +) +``` + +--- + +## Building an ARC + +```python +# 1. Wrap investigation +arc = ARC.from_arc_investigation(inv) + +# 2. Add studies (registers them in the investigation) +arc.AddRegisteredStudy(study) + +# 3. Add assays +arc.AddAssay(assay) + +# 4. Link assay → study +study.RegisterAssay(assay.Identifier) # pass the string identifier + +# 5. Attach contacts +arc.Contacts.append(person) # investigation-level +study.Contacts.append(person) # study-level +assay.Performers.append(person) # assay-level + +# 6. Attach publications +arc.Publications.append(pub) # investigation-level +study.Publications.append(pub) # study-level + +# 7. 
Serialize to RO-Crate JSON-LD string
+json_str: str = arc.ToROCrateJsonString()
+```
+
+---
+
+## ArcTable (Annotation Tables)
+
+```python
+# Create table
+table = ArcTable.init("my-table-name")
+
+# Build headers (pass canonical IOType strings, see the list below)
+header_input = CompositeHeader.input(IOType.of_string("Source Name"))
+header_output = CompositeHeader.output(IOType.of_string("Sample Name"))
+header_char = CompositeHeader.characteristic(OntologyAnnotation("pH", "", ""))
+header_factor = CompositeHeader.factor(OntologyAnnotation("temperature", "", ""))
+header_param = CompositeHeader.parameter(OntologyAnnotation("extraction", "", ""))
+header_comp = CompositeHeader.component(OntologyAnnotation("reagent", "", ""))
+header_cmt = CompositeHeader.comment("My comment label")
+header_perf = CompositeHeader.performer  # property, not callable
+header_date = CompositeHeader.date  # property, not callable
+# Fallback for unknown/simple header names:
+header_any = CompositeHeader.OfHeaderString("SomeColumnName")
+
+# IOType canonical strings recognised by IOType.of_string() (maps to named tags 0-3):
+#   "Source Name" / "Source" → tag 0 (Source)
+#   "Sample Name" / "Sample" → tag 1 (Sample)
+#   "Data" / "RawDataFile" / "Raw Data File" / "DerivedDataFile" /
+#   "Derived Data File" / "ImageFile" / "Image File" → tag 2 (Data)
+#   "Material" → tag 3 (Material)
+#   Any other string → tag 4 (FreeType — avoid for ISA compliance)
+
+# Build cells
+cell_text = CompositeCell.free_text("some value")
+cell_term = CompositeCell.term(OntologyAnnotation("sandy loam", "http://...", ""))
+cell_unitized = CompositeCell.unitized("6.8", OntologyAnnotation("pH", "http://...", ""))
+cell_empty = CompositeCell.free_text("")
+
+# Add column (header + matching cell list)
+table.AddColumn(header_char, [cell_term, cell_term, cell_empty])
+
+# Check whether a header expects a term cell
+# (`value` stands for the raw cell value from your data source)
+if header_char.IsTermColumn:
+    cell = CompositeCell.term(OntologyAnnotation(str(value), "", ""))
+else:
+    cell = CompositeCell.free_text(str(value))
+
+# Attach table
to study or assay
+study.AddTable(table)
+assay.AddTable(table)
+```
+
+---
+
+## Reading Back / Deserializing
+
+```python
+# From RO-Crate JSON-LD string
+arc = ARC.from_rocrate_json_string(json_str)
+
+# Async write to directory (creates ISA file structure on disk)
+await start_as_task(arc.WriteAsync("/path/to/output/dir"))
+```
+
+---
+
+## Identifiers
+
+- `assay.Identifier` — string property, read-only after creation
+- `study.Identifier`
+- `arc.Identifier`
+
+---
+
+## Known Pitfalls
+
+**`start_as_task` is untyped** — add `# type: ignore[import-untyped]` on the
+import unless the project-level mypy override is in place.
+
+**`CompositeHeader.performer` and `.date` are properties, not constructors**
+— call them without `()`:
+
+```python
+header = CompositeHeader.performer # CORRECT
+header = CompositeHeader.performer() # TypeError
+```
+
+**`OntologyAnnotation()` without args is valid** — use for empty/unknown terms
+instead of `None` to avoid null-ref errors in the F# layer.
+
+**ARC objects carry Fable runtime state** — do not pickle or transfer across
+multiprocessing boundaries. Serialize to JSON string first.
+
+**`ToROCrateJsonString()` + `gc.collect()`** — after serializing in a worker
+process, explicitly `del arc` and call `gc.collect()` to release the ARC
+object graph's memory promptly.
+
+**`ArcAssay.create(technology_platform=None)`** — `None` is safe. An empty
+`OntologyAnnotation()` is also accepted.
+
+---
+
+## RO-Crate JSON-LD Output Shape
+
+```json
+{
+  "@context": { "...": "..."
}, + "@graph": [ + { "@id": "inv001", "@type": "Dataset", "identifier": "inv001" }, + { "@id": "study001", "@type": "Dataset" }, + { "@id": "assay001", "@type": "Dataset" }, + { "@id": "#Doe_John", "@type": "Person", "familyName": "Doe" } + ] +} +``` + +Test assertion pattern: + +```python +graph = json.loads(arc.ToROCrateJsonString()).get("@graph", []) +inv_node = next(item for item in graph if item.get("identifier") == "inv001") +person = next(item for item in graph if item.get("familyName") == "Doe") +``` diff --git a/.agents/skills/config-wrapper/SKILL.md b/.agents/skills/config-wrapper/SKILL.md new file mode 100644 index 0000000..a9005e6 --- /dev/null +++ b/.agents/skills/config-wrapper/SKILL.md @@ -0,0 +1,130 @@ +--- +name: config-wrapper +description: > + Reference for the ConfigWrapper / ConfigBase pattern from middleware.shared. + Use when adding config fields, reading config values, overriding via + environment variables or Docker secrets, or extending ConfigBase in a new + component. ConfigWrapper is the single source of truth for all configuration. +compatibility: Python 3.12+, pydantic v2, middleware.shared +--- + +# ConfigWrapper — Usage Reference + +`ConfigWrapper` (from `middleware.shared.config`) wraps a YAML file and adds +environment variable and Docker secret overrides. A component's `Config` class +extends `ConfigBase` and is populated via `Config.from_config_wrapper(wrapper)`. + +--- + +## Loading Configuration + +```python +from middleware.shared.config.config_wrapper import ConfigWrapper +from mycomponent.config import Config # extends ConfigBase + +wrapper = ConfigWrapper.from_yaml_file(path, prefix="MY_PREFIX") +config = Config.from_config_wrapper(wrapper) +``` + +--- + +## Override Resolution Order + +For every config field, the wrapper resolves values in this order: + +1. **Environment variable**: `{PREFIX}_{FIELD_PATH}` (uppercase, `_` as separator) +2. **Docker secret file**: `/run/secrets/{prefix}_{field_path}` (lowercase) +3. 
**YAML file value** +4. **Pydantic field default** + +Nested fields use `_` as path separator: +- `api_client.api_url` with prefix `MY_APP` → `MY_APP_API_CLIENT_API_URL` + +--- + +## Type Coercion (env / secret values are always strings) + +| String value | Parsed as | +|---|---| +| `"true"` / `"True"` / `"TRUE"` | `True` (bool) | +| `"false"` / `"False"` / `"FALSE"` | `False` (bool) | +| `"123"` | `123` (int) | +| `"3.14"` | `3.14` (float) | +| `""` (empty) | `None` | +| anything else | `str` | + +--- + +## Extending ConfigBase + +`ConfigBase` is an optional convenience base class from `middleware.shared` +that bundles config options shared across FAIRagro middleware components. You +can subclass it to inherit those fields, or use plain `pydantic.BaseModel` if +your component doesn't need them. + +```python +from typing import Annotated +from pydantic import Field, SecretStr +from middleware.shared.config.config_base import ConfigBase # optional + + +class Config(ConfigBase): # or BaseModel if ConfigBase fields aren't needed + # Required field (no default) + connection_string: Annotated[SecretStr, Field(description="DB connection URI")] + + # Optional field with default + batch_size: Annotated[ + int, + Field(description="Records to fetch per batch.", ge=1), + ] = 100 +``` + +--- + +## ConfigBase (optional convenience base) + +`ConfigBase` from `middleware.shared` is a FAIRagro-specific convenience class. +Use it when your component should share the standard logging and OpenTelemetry +fields; skip it for components that don't need them. 
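Going back to the override resolution and type coercion rules above, the following sketch shows how the key names and string conversions could be computed. It is an illustration only, not the `ConfigWrapper` implementation: `env_var_name`, `secret_file_path`, and `coerce` are hypothetical names, and trying `int()` before `float()` is an assumption about parsing order.

```python
from typing import Union

Scalar = Union[str, int, float, bool, None]


def env_var_name(prefix: str, field_path: str) -> str:
    """Build the environment variable name for a dotted field path."""
    return f"{prefix}_{field_path.replace('.', '_')}".upper()


def secret_file_path(prefix: str, field_path: str) -> str:
    """Build the Docker secret path (same key, lowercase)."""
    return "/run/secrets/" + env_var_name(prefix, field_path).lower()


def coerce(raw: str) -> Scalar:
    """Convert an env/secret string following the coercion table."""
    if raw == "":
        return None
    if raw in ("true", "True", "TRUE"):
        return True
    if raw in ("false", "False", "FALSE"):
        return False
    # Assumption: int() is tried before float(), so "123" stays an int.
    for parse in (int, float):
        try:
            return parse(raw)
        except ValueError:
            continue
    return raw
```

With prefix `MY_APP`, `env_var_name("MY_APP", "api_client.api_url")` gives `MY_APP_API_CLIENT_API_URL`, matching the example above. Python's `int()` and `float()` accept some inputs, such as `" 12 "` or `"1e3"`, that the real wrapper may treat differently.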
+ +Inherited fields: + +```python +log_level: LogLevel = "INFO" # "DEBUG" | "INFO" | "WARNING" | "ERROR" | "CRITICAL" +otel: OtelConfig # OpenTelemetry settings +``` + +`OtelConfig` fields: +- `endpoint: str | None` — OTLP collector URL +- `log_console_spans: bool` — print spans to stdout +- `log_level: LogLevel` — OTLP log export level + +--- + +## Secrets Handling + +- `SecretStr` fields: access the value as `.get_secret_value()` only at the + point of use (e.g., when creating a DB engine). Never pass them to `str()` + or log them directly. +- Docker secrets: mount files to `/run/secrets/`; the wrapper resolves them + automatically using the full key name (lowercase). + +--- + +## Testing + +In unit tests, instantiate `Config` directly without the wrapper: + +```python +config = Config( + connection_string=SecretStr("postgresql+asyncpg://user:pass@localhost/db"), + # ... other required fields +) +``` + +In integration tests, mock at the wrapper boundary: + +```python +mocker.patch("mycomponent.main.ConfigWrapper.from_yaml_file") +mocker.patch("mycomponent.main.Config.from_config_wrapper", return_value=mock_config) +``` diff --git a/.agents/skills/create-specifica-feature/SKILL.md b/.agents/skills/create-specifica-feature/SKILL.md new file mode 100644 index 0000000..7cc66f0 --- /dev/null +++ b/.agents/skills/create-specifica-feature/SKILL.md @@ -0,0 +1,152 @@ +--- +name: create-specifica-feature +description: > + Step-by-step guide for creating a new Specifica feature folder with + spec.md and design.md. Use when adding a new feature, workflow, or + cross-cutting concern to spec/ or middleware/{component}/spec/. + Also covers where to place the folder (project-level vs component-level) + and how to register a link in AGENTS.md. +--- + +# Creating a Specifica Feature + +[Specifica](https://specifica.org) organises software specs as plain Markdown +files in a directory: one folder per feature, three optional files +(`spec.md`, `design.md`, `tasks.md`). 
+ +--- + +## 1. Choose the Right Location + +| Concern | Location | +| ------- | -------- | +| Affects multiple components, or belongs to no single component | `spec//` (project-level) | +| Internal to one component | `middleware//spec//` (component-level) | + +**Rule of thumb:** if the spec would need to be copied if a second component +appeared, it is project-level. + +--- + +## 2. Create the Folder + +Use kebab-case names that describe the feature. + +```bash +# component-level example +mkdir -p middleware/sql_to_arc/spec/ + +# project-level example +mkdir -p spec/ +``` + +--- + +## 3. Write `spec.md` — The *What* + +`spec.md` captures **requirements**: what the feature must do, in testable, +checkbox form. Keep implementation details out. + +```markdown +# + +One-sentence description of purpose and context. Include the trigger +condition and the expected output or side-effect. + +## Requirements + +- [ ] +- [ ] +- [ ] ... + +## Edge Cases + +. + +. +``` + +**Rules for requirements:** + +- One behaviour per checkbox — if you need "and" it is two requirements. +- State the outcome, not the implementation (`→ return 404` not `→ use Flask abort()`). +- Every edge case ends with a concrete outcome — no open-ended statements. + +--- + +## 4. Write `design.md` — The *How* + +`design.md` captures **decisions**: how it works and why. Skip obvious +implementation details; focus on non-obvious choices and trade-offs. + +```markdown +# — Design + +## + +Brief description or diagram of the main components and their +responsibilities. + +## Key Decisions + +1. **** + — + +2. **** + — +``` + +**Rules for Key Decisions:** + +- Every decision has a stated reason (the `—` clause is mandatory). +- "We chose X over Y because Z" is the target sentence structure. +- Decisions are numbered so they can be referenced from code comments + or other specs. + +--- + +## 5. Write `tasks.md` — The *Work* (optional) + +`tasks.md` is an ordered checklist. 
Use it for multi-step implementation +work or migrations. Omit it for completed or stable features. + +```markdown +# — Tasks + +- [ ] +- [ ] +- [x] +``` + +Tasks are ordered by dependency. Checked boxes = done. Tools can parse +and update `tasks.md` programmatically — keep entries flat and unambiguous. + +--- + +## 6. Register in `AGENTS.md` + +Add a link under the **Architecture & Design** section so every agent can +discover the new spec. + +```markdown +- **[`middleware/sql_to_arc/spec//`](...)** — Short description. +``` + +Use a relative path from the repository root. + +--- + +## 7. Project Conventions + +These rules apply specifically to this project (see also +[`spec/principles.md`](../../../spec/principles.md)): + +- `spec.md` never restates database view definitions — reference + `docs/sql_to_arc_database_views.md` instead. +- Requirements that are already captured in `spec/principles.md` + (typing, `uv`, `os.environ`) are **not** repeated in feature specs. +- Design decisions that affect public API types go in the component-level + spec, not the project-level spec. +- `tasks.md` is optional and should be removed once the feature is fully + implemented to avoid stale checklists. 
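Steps 1 to 4 can be sketched as a small scaffolding script. This is a hypothetical helper, not a Specifica tool: `feature_dir` and `scaffold_feature` are invented names, the kebab-case regex is one assumed definition, and the stubs contain only the section headings named in the templates above, leaving the title line for the author to write.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed definition of kebab-case: lowercase words or digits joined by "-".
KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def feature_dir(root: Path, feature: str, component: Optional[str] = None) -> Path:
    """Return the spec folder for a feature (project- or component-level)."""
    if not KEBAB_CASE.match(feature):
        raise ValueError(f"feature name must be kebab-case: {feature!r}")
    if component is None:
        return root / "spec" / feature
    return root / "middleware" / component / "spec" / feature


def scaffold_feature(root: Path, feature: str, component: Optional[str] = None) -> Path:
    """Create the folder plus minimal spec.md / design.md stubs."""
    folder = feature_dir(root, feature, component)
    folder.mkdir(parents=True, exist_ok=True)
    stubs = {
        "spec.md": "## Requirements\n\n- [ ] \n\n## Edge Cases\n",
        "design.md": "## Key Decisions\n",
    }
    for name, body in stubs.items():
        target = folder / name
        if not target.exists():  # never overwrite an existing spec
            target.write_text(body, encoding="utf-8")
    return folder
```

Calling `scaffold_feature(Path("."), "csv-export", component="sql_to_arc")` from the repository root would create `middleware/sql_to_arc/spec/csv-export/`; the feature name here is only an example. Registering the link in `AGENTS.md` remains a manual step.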
diff --git a/.devcontainer/antigravity/devcontainer.json b/.devcontainer/antigravity/devcontainer.json index 23ebb37..ce1bf3e 100644 --- a/.devcontainer/antigravity/devcontainer.json +++ b/.devcontainer/antigravity/devcontainer.json @@ -101,9 +101,7 @@ "github.copilot-chat", "charliermarsh.ruff", "tim-koehler.helm-intellisense", - "vadzimnestsiarenka.helm-template-preview-and-more", - "jebbs.plantuml", - "systemticks.c4-dsl-extension" + "vadzimnestsiarenka.helm-template-preview-and-more" ] } }, diff --git a/.devcontainer/vscode/devcontainer.json b/.devcontainer/vscode/devcontainer.json index 4a00e67..61b28b7 100644 --- a/.devcontainer/vscode/devcontainer.json +++ b/.devcontainer/vscode/devcontainer.json @@ -107,9 +107,7 @@ "github.copilot-chat", "charliermarsh.ruff", "tim-koehler.helm-intellisense", - "vadzimnestsiarenka.helm-template-preview-and-more", - "jebbs.plantuml", - "systemticks.c4-dsl-extension" + "vadzimnestsiarenka.helm-template-preview-and-more" ] } }, diff --git a/.github/agents/spec-to-code.agent.md b/.github/agents/spec-to-code.agent.md new file mode 100644 index 0000000..2896943 --- /dev/null +++ b/.github/agents/spec-to-code.agent.md @@ -0,0 +1,157 @@ +--- +name: spec-to-code +description: > + Implement source code changes driven by updates to Specifica spec files + (spec.md / design.md). Reads the changed spec, identifies what requirements + or key decisions changed, finds the affected source code, applies the + changes, and validates with formatter and tests. +tools: + - search + - read + - edit/editFiles + - execute/runInTerminal + - execute/getTerminalOutput + - execute/testFailure +--- + +# spec-to-code Agent + +You are an implementation agent for the FAIRagro SQL-to-ARC Converter. +Your job: translate Specifica spec changes into matching source code. + +## The two input modes + +`spec.md` and `design.md` have different roles: + +- **`spec.md`** is written by the developer/user first. It says *what* the + feature must do. 
The user is the author. +- **`design.md`** is primarily produced *during* implementation. It documents + the architecture that emerged — *how* it was built and *why*. You write it + as a by-product of implementation. + +This leads to two distinct triggers: + +### Mode A — `spec.md` changed (user added/changed requirements) + +The user has decided *what* to build. Your job is to implement it and then +document the architecture you chose in `design.md`. + +1. Implement the requirements (Steps 1–5 below). +2. After the code is working, **update `design.md`** to reflect: + - Any new or changed module responsibilities. + - Any new Key Decision introduced by this implementation (with `—` reasoning). + - Remove or update decisions that are no longer accurate. + +### Mode B — `design.md` changed (user is steering architecture) + +The user has made an explicit architectural decision and written it into +`design.md`. Your job is to refactor the code to match it. + +1. Read the changed `design.md` carefully — identify which Key Decision changed. +2. Find the code that implements the old decision. +3. Refactor it to match the new decision. +4. Run tests to verify nothing else broke. +5. Do **not** rewrite `design.md` — the user already wrote it. + +If you receive both files changed at once, handle Mode A first (implement +spec), then reconcile with the design constraints from Mode B. + +--- + +## Inputs + +The user will tell you which file changed, or paste its new content. +If a file path is given, read it. If a diff is given, parse it yourself. +Ask the user to clarify if the change is ambiguous before writing any code. + +## Step 1 — Load project context + +Read [`AGENTS.md`](../../AGENTS.md) to get the project's tech stack, +commands, and code quality standards. Do this once per session. 
+ +## Step 2 — Understand the change + +**Mode A (spec.md changed):** +- Identify exactly what was added, removed, or reworded: + - New `- [ ]` requirement checkboxes → new behaviour to implement. + - Removed checkboxes → remove or disable that behaviour. + - Edited checkboxes → adjust existing implementation. + - Changed Edge Case → update guard clauses or error handling. + +**Mode B (design.md changed):** +- Identify which Key Decision changed and what the new decision requires. +- Do not infer intent — if the reasoning clause (`—`) is unclear, ask. + +## Step 3 — Find the affected code + +Use `search` to locate: +- The source module(s) responsible for the feature described in the spec. +- Existing tests that cover that feature. + +The feature-to-module mapping for `middleware/sql_to_arc`: + +| Feature spec | Primary source file(s) | +| ------------ | ---------------------- | +| `arc-building/` | `src/middleware/sql_to_arc/builder.py`, `mapper.py` | +| `database-access/` | `src/middleware/sql_to_arc/database.py`, `models.py` | +| `sql-to-arc-conversion/` | `src/middleware/sql_to_arc/processor.py`, `main.py` | +| `api-upload/` | `src/middleware/sql_to_arc/processor.py` | +| `spec/configuration/` | `src/middleware/sql_to_arc/config.py` | + +For project-level specs (`spec/`) follow links in `AGENTS.md` to the +affected component. + +## Step 4 — Implement the changes + +Apply all required source changes. Follow these rules without exception: + +- **Typed**: all public functions and methods must have full type annotations. +- **No `os.environ`**: all config comes from `Config`. +- **No SQL outside `Database`**: DB queries live only in `database.py`. +- **Worker IPC via JSON string only**: do not pass ARC objects across process + boundaries. +- **`SecretStr`**: use `.get_secret_value()` only at the point of use. +- **Do not add `# noqa`, `# type: ignore`, or `# pylint: disable` comments** + unless a real fix is technically impossible. Explain why if you must. 
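The IPC rule above can be illustrated with a worker that hands back a string instead of an object. This is a sketch under assumptions, not the converter's actual code: `build_arc_json` and `convert_all` are hypothetical names, and a plain dict stands in for the ARC so the example runs without arctrl.

```python
import json
from concurrent.futures import ProcessPoolExecutor
from typing import Any, Dict, List


def build_arc_json(identifier: str) -> str:
    """Worker: build one result and return it as a JSON string.

    Stand-in body: a real worker would build an ARC and return
    arc.ToROCrateJsonString(); a plain dict keeps this sketch runnable.
    """
    payload = {"@graph": [{"@id": identifier, "identifier": identifier}]}
    return json.dumps(payload)


def convert_all(identifiers: List[str], max_workers: int = 2) -> List[Dict[str, Any]]:
    """Parent: fan out to worker processes and receive only strings back."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        json_strings = list(pool.map(build_arc_json, identifiers))
    # Parsing happens in the parent, so no ARC object crosses the boundary.
    return [json.loads(text) for text in json_strings]
```

On platforms that spawn worker processes, call `convert_all` from under an `if __name__ == "__main__":` guard.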
+ +## Step 4b — Update `design.md` (Mode A only) + +After the code is working, update `design.md` for the affected feature: + +- Revise module responsibility descriptions if they changed. +- Add a numbered Key Decision for every non-obvious choice you made, + with a mandatory `—` reasoning clause. +- Remove Key Decisions that no longer hold. +- Do **not** add decisions for obvious or trivial implementation choices. + +If `design.md` does not yet exist for this feature, create it following +the template in `.agents/skills/create-specifica-feature/SKILL.md`. + +## Step 5 — Update or add tests + +- Add a unit test for every new requirement. +- Update or remove tests for removed/changed requirements. +- Unit tests live in `middleware/sql_to_arc/tests/unit/`. +- Integration tests live in `middleware/sql_to_arc/tests/integration/`. +- Instantiate `Config` directly in unit tests; mock at the wrapper boundary + in integration tests. + +## Step 6 — Validate + +Run these commands in sequence: + +```bash +uv run ruff format . +uv run pytest middleware/sql_to_arc/tests/ -v +``` + +Then check the VS Code **Problems** tab for any remaining Pylance / Mypy / +Ruff diagnostics. Fix all reported issues before declaring done. + +## Done + +Report: +- Which spec requirements were implemented (list the checkbox text). +- Which files were changed. +- Test results (pass/fail count). +- Any open questions or decisions that the user should review. 
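The checkbox comparison in Step 2 (Mode A) can be sketched as a pair of small text helpers. `requirements` and `diff_requirements` are hypothetical names, and the regex assumes the `- [ ]` / `- [x]` checkbox syntax used in `spec.md`.

```python
import re
from typing import Dict, List

# Matches "- [ ] text" and "- [x] text", capturing the requirement text.
CHECKBOX = re.compile(r"^\s*- \[[ xX]\] (.+?)\s*$")


def requirements(spec_text: str) -> List[str]:
    """Collect checkbox texts from a spec.md body, in document order."""
    found = []
    for line in spec_text.splitlines():
        match = CHECKBOX.match(line)
        if match:
            found.append(match.group(1))
    return found


def diff_requirements(old_spec: str, new_spec: str) -> Dict[str, List[str]]:
    """Split requirement changes into added and removed checkbox texts."""
    old, new = requirements(old_spec), requirements(new_spec)
    return {
        "added": [text for text in new if text not in old],
        "removed": [text for text in old if text not in new],
    }
```

An edited checkbox shows up as one removed and one added entry, so the caller still has to pair them up to decide whether existing behaviour should be adjusted rather than replaced.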
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 12ed184..a7a4f52 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -41,7 +41,7 @@ repos: name: ruff format entry: uv run ruff format language: system - types_or: [python, markdown] + types: [python] args: [--check] # mypy - Type checking diff --git a/.vscode/copilot-instructions.md b/.vscode/copilot-instructions.md deleted file mode 100644 index f431b04..0000000 --- a/.vscode/copilot-instructions.md +++ /dev/null @@ -1,94 +0,0 @@ -# GitHub Copilot Instructions - -This file provides context and instructions for GitHub Copilot in this workspace. - -## 🚨 Critical Rules - ALWAYS Follow - -### Python Package Manager -- **ALWAYS use `uv` for Python commands** - Never use `pip` -- Example: `uv run pytest ...` instead of `python -m pytest` -- Exception: System packages via `apt-get` are fine - -### Configuration System -- YAML-based with environment variable overrides -- ConfigWrapper supports: `str | int | float | bool | None` -- Type parsing: `"true"` → `True`, `"123"` → `123`, `"3.14"` → `3.14` -- Empty env strings become `None` - -### Client Certificates (OPTIONAL) -- **Client certificates in ApiClient are OPTIONAL** (`Path | None`) -- Both must be provided together or both be `None` -- Validate only if `cert_path is not None` - -### Git LFS Setup -- SQL files (`*.sql`) tracked automatically by Git LFS -- Install via `scripts/load-env.sh`, never `git lfs install` -- Version-controlled hooks in `scripts/git-hooks/` - -## 📋 Tech Stack - -- Python 3.12.12 (REQUIRED) -- FastAPI -- Pydantic V2 -- PostgreSQL 15.15 -- Docker + Docker Compose -- Git LFS 3.3.0+ - -## 📁 Key Directories - -``` -middleware/ - ├── shared/ ConfigWrapper (24 tests, 86.53% coverage) - ├── api/ FastAPI REST API - ├── api_client/ HTTP Client (26 tests) - └── sql_to_arc/ SQL to ARC Converter - -scripts/ - ├── load-env.sh Main entry point (sets up hooks) - └── setup-git-lfs.sh -``` - -## 🔧 Essential Commands (with 
`uv`) - -```bash -# Tests -uv run pytest middleware/shared/tests/unit/ -v -uv run pytest middleware/api_client/tests/unit/ -v - -# Quality -uv run ruff check . -uv run mypy middleware/ - -# Setup -source scripts/load-env.sh - -# Docker -cd dev_environment && ./start-dev.sh --build -``` - -## ⚠️ Common Patterns - -### When Editing Files -1. Check current state with `read_file` -2. Use `replace_string_in_file` with 3-5 lines context -3. Never modify `.git/` directly -4. Run tests after changes: `uv run pytest` - -### Configuration Validation -- Client certs: Optional, check `if cert_path is not None` -- ConfigWrapper: Supports nested dicts and lists with primitives -- ApiClient: Works without certificates (no mTLS required) - -## 📞 Questions Before Making Changes - -- Python command? → Always `uv` -- Client certificates required? → No, optional -- Modify git hooks directly? → No, use scripts -- Python version? → 3.12.12 -- Run tests? → `uv run pytest` - ---- - -**Last Updated**: 2025-12-10 -**Branch**: feature/introduce_sql_to_arc -**For more details**: See AGENTS.md in project root diff --git a/.vscode/extensions.json b/.vscode/extensions.json index b845a5b..93272e3 100644 --- a/.vscode/extensions.json +++ b/.vscode/extensions.json @@ -21,8 +21,6 @@ "charliermarsh.ruff", "tim-koehler.helm-intellisense", "vadzimnestsiarenka.helm-template-preview-and-more", - "ms-vscode-remote.remote-containers", - "jebbs.plantuml", - "systemticks.c4-dsl-extension" + "ms-vscode-remote.remote-containers" ], } diff --git a/.vscode/settings.json b/.vscode/settings.json index 38bccfc..d0493af 100644 --- a/.vscode/settings.json +++ b/.vscode/settings.json @@ -21,10 +21,7 @@ "source.fixAll.ruff": "explicit" } }, - "[markdown]": { - "editor.defaultFormatter": "charliermarsh.ruff", - "editor.formatOnSave": true - }, + "python.defaultInterpreterPath": "${workspaceFolder}/.venv/bin/python", "sops-edit.onlyUseButtons": false, @@ -37,10 +34,11 @@ "${workspaceFolder}/pyproject.toml" ], - // Ruff 
Extension Settings - ensure consistency with script + // Ruff Extension Settings - ensure consistency with pyproject.toml "ruff.configuration": "./pyproject.toml", "ruff.path": ["uv", "run", "ruff"], "ruff.importStrategy": "fromEnvironment", + "ruff.lint.preview": true, // Pylint Extension Settings "pylint.args": [ diff --git a/AGENTS.md b/AGENTS.md index 0d91437..1d71269 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -16,8 +16,32 @@ This file contains critical context about the FAIRagro SQL-to-ARC Converter proj ## 📁 Project Structure ```text +.agents/ +└── skills/ # Agent Skills (agentskills.io standard) + ├── arctrl/ # arctrl Python library reference + ├── config-wrapper/ # ConfigWrapper / ConfigBase pattern + └── create-specifica-feature/ # How to create a new Specifica feature + +.github/ +└── agents/ # VS Code custom agents + └── spec-to-code.agent.md # Implements code changes from spec updates + +docs/ +├── ai_workflow.md # AI agent workflow documentation +└── sql_to_arc_database_views.md # Authoritative DB view / schema contract + +spec/ # Project-level architecture & design +├── principles.md # Project principles and foundation contract +├── configuration/ # Config loading, env overrides, secrets +└── demo-environment/ # Local demo / deployment setup + middleware/ └── sql_to_arc/ # SQL to ARC converter (Core logic) + ├── spec/ # Component-level architecture & design + │ ├── sql-to-arc-conversion/ # Top-level workflow feature + │ ├── arc-building/ # ARC object construction + │ ├── database-access/ # DB queries and row models + │ └── api-upload/ # ARC upload to the Middleware API ├── src/middleware/sql_to_arc/ │ ├── main.py # Entry point │ ├── mapper.py # Database to ARC mapping logic @@ -38,8 +62,10 @@ scripts/ └── post-merge dev_environment/ -├── start-dev.sh # Start Docker Compose (Postgres + Converter) -├── compose.dev.yaml # Docker services definition +├── start-demo.sh # Start full local demo (DB + Converter + Mock API) +├── start-dev.sh # Start with local 
DB, external API (needs sops) +├── compose.demo.yaml # Docker services for demo +├── compose.dev.yaml # Docker services for dev └── config.dev.yaml # Development configuration for the converter ``` @@ -51,8 +77,12 @@ dev_environment/ # Run tests for the converter uv run pytest middleware/sql_to_arc/tests/ -v -# Run quality checks -./scripts/quality-check.sh +# Run individual quality tools (never run quality-check.sh — it runs everything and is too slow) +uv run ruff check . +uv run ruff format . +uv run mypy middleware/sql_to_arc/ +uv run pylint middleware/sql_to_arc/ +uv run bandit -r middleware/sql_to_arc/src/ # Install all dependencies (including external shared/api_client via git) uv sync --dev --all-packages @@ -76,38 +106,32 @@ docker compose logs -f docker compose down ``` -## Architecture Rules +## Architecture & Design + +Before generating or modifying code, read the relevant spec folders. + +**Project-level** (`spec/`) — cross-cutting concerns: -Before generating or modifying code, read **[docs/ARCHITECTURE_RULES.md](docs/ARCHITECTURE_RULES.md)**. +- **[`spec/principles.md`](spec/principles.md)** — Project principles and foundation contract (start here). +- **[`spec/configuration/`](spec/configuration/)** — Config loading, env overrides, secrets, extension rules. +- **[`spec/demo-environment/`](spec/demo-environment/)** — Local demo / deployment setup. +- **[`spec/tooling-consistency/`](spec/tooling-consistency/)** — VS Code, pre-commit, and CI must report identical results from a shared config. -It defines binding constraints that MUST be followed: +**Component-level** (`middleware/sql_to_arc/spec/`) — sql_to_arc internals: -- **Module Dependency Graph**: Which module may import from which (no circular imports). -- **Extension Points**: How to add new DB entities, mapper functions, or config values. -- **Concurrency Rules**: IPC contract for worker processes, Semaphore scope. 
-- **Error Handling**: Per-investigation failure isolation, stats update pattern. -- **Config**: NEVER use `os.environ` directly — always extend `Config` in `config.py`. -- **Database Access**: All SQL goes through `Database`; always use server-side cursors and bulk fetches. +- **[`middleware/sql_to_arc/spec/sql-to-arc-conversion/`](middleware/sql_to_arc/spec/sql-to-arc-conversion/)** — Top-level workflow: workers, stats, CLI. +- **[`middleware/sql_to_arc/spec/arc-building/`](middleware/sql_to_arc/spec/arc-building/)** — ARC object construction (`mapper.py` + `builder.py`). +- **[`middleware/sql_to_arc/spec/database-access/`](middleware/sql_to_arc/spec/database-access/)** — DB access patterns, row models, SQL views. +- **[`middleware/sql_to_arc/spec/api-upload/`](middleware/sql_to_arc/spec/api-upload/)** — Upload to the Middleware API. --- -## �📝 Key Implementation Details +## 📝 Key Implementation Details ### External Dependencies This project depends on `shared` and `api_client` libraries, which are hosted in a separate repository (`m4.2_advanced_middleware_api`). They are included via `uv` workspace sources pointing to Git. -### SQL-to-ARC Mapping (`middleware/sql_to_arc/src/middleware/sql_to_arc/mapper.py`) - -**Purpose**: Transforms relational database rows into standardized Annotated Research Context (ARC) objects using the `arctrl` library. - -**Features**: - -- Mapping of Persons (Contacts) with JSON-encoded roles. -- Mapping of Publications. -- Metadata extraction for ISA (Investigation, Study, Assay) structures. -- CLI support: `--version` provides the current package version (via `importlib.metadata`). - ### Git LFS Integration **Setup Process**: @@ -118,38 +142,7 @@ This project depends on `shared` and `api_client` libraries, which are hosted in **Files Tracked by LFS**: `*.sql` (configured in `.gitattributes`). 
-## 🐳 Docker Compose Services - -```yaml -services: - postgres: # PostgreSQL database serving Edaphobase data - db-init: # Downloads and imports the Edaphobase SQL dump - sql_to_arc: # The converter component (this repo) -``` - -**Configuration**: `dev_environment/config.dev.yaml` - -- Connects to `postgres` service on port 5432. -- Uses `api_url` pointing to an external Middleware API if needed. - -## 🧪 Testing Strategy - -### Test Locations - -- `middleware/sql_to_arc/tests/unit/` - Isolated logic tests. -- `middleware/sql_to_arc/tests/integration/` - End-to-end workflow tests. - -### Running Tests with uv - -```bash -# Run all tests -uv run pytest middleware/sql_to_arc/ - -# Run with coverage -uv run pytest --cov=middleware/sql_to_arc middleware/sql_to_arc/tests/ -``` - -## 🔐 Security Notes +## Security Notes - DB passwords and API secrets should be managed via environment variables or `.env`. - `client.key` is dynamically handled in container secrets (`tmpfs`). @@ -168,12 +161,12 @@ Agents are expected to maintain high code quality by addressing issues reported When editing files: 1. **Always check current state** - Use `read_file` to see current content. -2. **Review for quality** - Run `./scripts/quality-check.sh` before committing. +2. **Review for quality** - Check the VS Code **Problems** tab (Pylance, Mypy, Ruff run continuously in the background). Only run individual tools (`uv run ruff check .`, `uv run mypy ...`) if the Problems tab is not available. Never run `./scripts/quality-check.sh` — it is too slow. 3. **Never modify `.git/` directly** - Use scripts instead. -4. **Test after changes** - Always run `uv run pytest`. +4. **Format and test after changes** - Run `uv run ruff format .` to auto-format, then `uv run pytest` to verify. --- -**Last Updated**: 2026-02-03 +**Last Updated**: 2026-04-13 **Current Branch**: feature/workflow_fixes **Maintainer Notes**: This repository is now decoupled from the main Middleware API. 
High-level architecture involves converting SQL views into ARC files. diff --git a/docs/ARCHITECTURAL_DESIGN.md b/docs/ARCHITECTURAL_DESIGN.md deleted file mode 100644 index 1b08433..0000000 --- a/docs/ARCHITECTURAL_DESIGN.md +++ /dev/null @@ -1,142 +0,0 @@ -# Architectural Documentation: SQL-to-ARC Middleware - -## 1. Overview - -The SQL-to-ARC Middleware is responsible for converting metadata from a relational SQL database into the **ARC (Annotated Research Context)** format. The architecture is designed for **high throughput**, **memory-efficient processing**, and **stability** when handling large volumes of data. - -## 2. Core Components - -The middleware consists of three main layers: - -1. **Async IO Loop (Controller):** Orchestrates the data flow, manages database streams, and handles API uploads. -2. **Process Pool Executor (Worker):** Parallelizes CPU-intensive ARC calculations in separate operating system processes. -3. **Streaming Generator (Data Layer):** Reads data in chunks from the database to keep RAM consumption constant. - ---- - -## 3. Detailed Architectural Concepts - -### 3.1 Parallelization & CPU Offloading - -Since generating ARCs (via `arctrl`) is computationally intensive and Python is limited by the Global Interpreter Lock (GIL), the middleware utilizes a `ProcessPoolExecutor`. - -* **Advantage:** Each ARC calculation runs on its own CPU core. -* **Implementation:** `loop.run_in_executor(executor, build_single_arc_task, ...)` -* **Multiprocessing Support:** Calling `multiprocessing.freeze_support()` ensures the middleware correctly starts new processes even in "frozen" environments (such as PyInstaller binaries on Windows). On Linux, this is primarily a best practice for cross-platform compatibility. - -### 3.2 Concurrency & Flow Control (The Semaphore) - -In addition to the process pool, an `asyncio.Semaphore` is used. This addresses two critical issues that a pure process pool cannot solve: - -1. 
**Memory Protection:** Without a semaphore, Python would start asynchronous tasks for all (e.g., 10,000) datasets simultaneously and keep the database data in RAM. The semaphore limits the number of *concurrently active* workflows. -2. **Network/IO Backpressure:** The semaphore also limits the number of simultaneous HTTP connections to the API to avoid timeouts and rate-limiting. - -**Discussion Point:** *Why not simply limit the size of the process pool?* -The process pool only limits CPU usage. The semaphore limits the **entire lifecycle** (data preparation -> build -> upload). It prevents the memory from overflowing with "waiting" data before it is even handed over to the pool. - -### 3.3 Memory-Efficient Data Streaming - -The middleware implements a **lazy-loading** approach for database entities: - -* **Chunking:** Using the `stream_investigations` generator (in the `Database` class), investigations are loaded in batches. -* **Relational Batching:** Associated studies, assays, contacts, and publications are fetched in bulk for each batch using specialized queries (e.g., `WHERE investigation_ref = ANY(...)`). -* **Effect:** We avoid the "N+1 Query" problem (extremely slow) while also avoiding a "Full Table Load" (extremely memory-intensive). - ---- - -## 4. Memory Management & Performance Optimization - -When processing thousands of investigations (ARC containers), RAM consumption can become critical. The middleware implements three strategies for this: - -### 4.1 Backlog Flow Control (Producer Pause) - -The asynchronous database stream produces data faster than the process pool can convert it. - -* **Problem:** Thousands of `asyncio.Tasks` would wait in RAM simultaneously for execution, including all associated database rows. -* **Solution:** Throttling in the main loop managed by the semaphore and task set management. The stream pauses until capacity becomes available. This limits the number of datasets residing in memory at once. 
- -### 4.2 Worker-Side Serialization & GC - -ARC objects in the `arctrl` library are complex and consume both Python and .NET-bridge memory. - -* **Strategy:** Conversion to a JSON-LD string is performed directly within the worker process. -* **Memory Cleanup:** After serialization, ARC objects in the worker are explicitly deleted (`del`) and the garbage collector (`gc.collect()`) is called before the process returns the result to the main process. This prevents worker processes from "swelling." - -### 4.3 JSON vs. Object Transfer - -In the current implementation, large ARC objects are not transferred between the main process and workers. Instead, primitive Python types (like dicts) are used as input, and serialized JSON-LD strings are returned as output. This minimizes Inter-Process Communication (IPC) overhead. - -### 4.4 Decoupling I/O and CPU (Workload Balancing) - -To maximize CPU utilization, the number of concurrently active tasks (`max_concurrent_tasks`) is controlled independently of the number of CPU workers (`max_concurrent_arc_builds`). - -* **Principle:** While some tasks wait for the API's network response (I/O), CPU workers can already process the next ARC build from the queue. -* **Configuration:** By default, task capacity is four times larger than the number of CPU workers (configurable via `max_concurrent_tasks`) to bridge latencies without overstretching RAM. - ---- - -## 5. Data Flow (Step-by-Step) - -1. **Producer:** The main process starts the streaming generator. -2. **Throttle:** The loop waits on the `Semaphore` for an available slot. -3. **Data Fetch:** Investigation data and related entities (Studies, Assays, etc.) are fetched from the database. -4. **Build (CPU):** The dataset is sent to the `ProcessPoolExecutor`. The main loop remains free for other tasks in the meantime. -5. **Upload (I/O):** The result (JSON) is sent asynchronously via HTTP to the Middleware API using `ApiClient`. -6. 
**Release:** The semaphore is released, and the next dataset flows in. - ---- - -## 6. Error Handling & Monitoring - -* **Targeted Exception Handling:** Errors during upload or build do not cause the entire run to abort. -* **ProcessingStats:** Every success and failure is recorded by ID and output as a JSON-LD report at the end. -* **Tracing:** The entire chain is instrumented with **OpenTelemetry** (tracing) to identify performance bottlenecks in the process pool or network. -* **Pre-flight Schema Validation:** The middleware verifies that all required database views and columns exist before starting the process. - ---- - -## 7. Summary of Design Decisions - -| Problem | Solution | Reason | -| :--- | :--- | :--- | -| GIL / CPU Limit | `ProcessPoolExecutor` | True parallelism across multiple cores. | -| Low CPU Utilization | I/O-CPU Decoupling | `max_concurrent_tasks` allows API uploads in parallel with new ARC builds. | -| Memory Overflow (Backlog) | Producer Throttling | Prevents too many datasets from "waiting" in RAM simultaneously. | -| Memory Leak (Worker) | `gc.collect()` + JSON Return | Frees memory in the worker immediately after conversion. | -| Database Load | Server-side Cursors + `ANY()` | Optimal balance between number of queries and memory load. | -| Scalability | Single ARC Processing | Earlier success/error feedback per investigation instead of per batch only. | - ---- - -## 8. Performance Tuning Guide - -To optimally adapt the middleware to existing hardware and database structures, the following parameters in the configuration file (`config.yaml`) can be optimized: - -### 8.1 CPU & Parallelization - -* **`max_concurrent_arc_builds`**: Determines the number of worker processes in the `ProcessPoolExecutor`. - * **Recommendation**: Set this value to the number of available CPU cores minus 1 (to leave reserves for the main process and the operating system). - * **Effect**: Higher CPU load, but faster execution of ARC generation. 
- -### 8.2 Throughput & I/O Balancing - -* **`max_concurrent_tasks`**: Limits the number of concurrently active asynchronous workflows (data fetch + build + upload). - * **Rule of Thumb**: `4 * max_concurrent_arc_builds`. - * **Why?**: While 4 cores are calculating ARCs, other tasks can wait for the API's network response (I/O). A value that is too high leads to increased RAM consumption; a value that is too low causes the CPU to run dry ("Stop-and-Go"). - -### 8.3 Database Efficiency - -* **`db_batch_size`**: Number of investigations loaded per database chunk. - * **Default**: 100. - * **Tuning**: Increase this value if you have many small investigations (few studies/assays) to reduce SQL roundtrips. Decrease it if individual investigations are extremely large to limit the RAM consumption of the main process. - -### 8.4 Stability & Timeouts - -* **`arc_generation_timeout_minutes`**: Maximum time for a single `build_single_arc_task` call in the worker. - * **Tuning**: Increase this value if you see "Timeout" errors in the log for very large datasets (e.g., thousands of assays). - -### 8.5 Summary: Finding the Optimal Setup - -1. **Find CPU Limit:** Increase `max_concurrent_arc_builds` until CPU cores are saturated. -2. **Fill I/O Gaps:** Increase `max_concurrent_tasks` if CPU load drops to 0% between builds (an indication of waiting for API uploads). -3. **RAM Check:** Monitor memory consumption. RAM requirements increase linearly with `max_concurrent_tasks` and the size of investigations in the batch. diff --git a/docs/ARCHITECTURE_RULES.md b/docs/ARCHITECTURE_RULES.md deleted file mode 100644 index 6fb8025..0000000 --- a/docs/ARCHITECTURE_RULES.md +++ /dev/null @@ -1,135 +0,0 @@ -# Architecture Rules: SQL-to-ARC Middleware - -This document defines **binding rules** for the `middleware/sql_to_arc` package. -These constraints exist to preserve correctness, prevent circular imports, and enforce design patterns. 
-An AI assistant or developer modifying this codebase MUST follow these rules. - ---- - -## 1. Module Dependency Graph - -`sql_to_arc` is a single, self-contained component. There is no complex inter-package dependency policy to enforce within it. - -The one cross-package rule is: - -> `middleware.shared` and `middleware.api_client` are **read-only dependencies** of `sql_to_arc`. -> They must NEVER import from `middleware.sql_to_arc`. This is naturally enforced since both packages live in a separate repository. - -If intra-package layering rules become necessary in the future (e.g., forbidden imports between specific modules), document them here. - ---- - -## 2. Extension Points - -### 2.1 Adding a New Database Entity - -When adding a new entity type (e.g. `SampleRow`), ALL of the following steps are mandatory: - -1. **Define the model** in `models.py` by subclassing `BaseRow`: - - ```python - class SampleRow(BaseRow): - __view_name__: ClassVar[str] = "vSample" - identifier: str = spec_field(required=True) - ... - ``` - - - Use `spec_field()` (not `Field()` directly) for all ARC-spec-relevant fields. - - Set `__view_name__` to the exact database view name. - -2. **Add a streaming method** to `Database` in `database.py`: - - ```python - async def stream_samples(self, investigation_ids: list[str]) -> AsyncGenerator[SampleRow, None]: - async for r in self._stream_by_investigation(SampleRow, investigation_ids, "sample"): - yield r - ``` - -3. **Register for schema validation** in `Database.validate_schema()`: - - ```python - models = [..., SampleRow] - ``` - -4. **Add a mapper function** in `mapper.py`. - -5. **Link into the data bundle** `ArcBuildData` in `context.py` and populate it in `_fetch_and_group_related_data()` in `processor.py` using `group_stream()`. - -6. **Call the mapper** inside `builder.py` in `build_single_arc_task()`. - -### 2.2 Adding New Mapper Functions - -- Mapper functions live exclusively in `mapper.py`. 
-- They accept a single `*Row` Pydantic model as input and return an `arctrl` type. -- They MUST NOT perform I/O, logging, or access the database. - -### 2.3 Adding New Configuration Values - -- All configuration values MUST be added as typed, annotated fields in the `Config` class in `config.py` or in other config classes that are referenced by `Config`. -- MUST use `Annotated[..., Field(description="...")]` with a meaningful description. -- Provide a sensible default whenever possible. -- **NEVER** access `os.environ` directly in any module. The `Config` object is the single source of truth for all settings. -- **NEVER** introduce new environment variables outside of `Config`. - ---- - -## 3. Concurrency & IPC Rules - -### 3.1 Process Pool Entry Point - -- `build_single_arc_task()` in `builder.py` is the **only function** executed inside worker processes. -- It MUST be a plain, top-level function (not a method or lambda) because it is pickled for IPC. -- Its argument MUST be the frozen dataclass `ArcBuildData` (picklable, no locks, no sockets). -- Its return value MUST be a `str` (JSON-LD string) or `None`. Returning complex objects (e.g., `ARC`) is forbidden — they are not reliably picklable across process boundaries and waste IPC bandwidth. - -### 3.2 Memory Management in Workers - -- After serializing the ARC to JSON, `del arc` MUST be called, followed by `gc.collect()`. -- This prevents worker processes from accumulating memory across repeated calls. - -### 3.3 Semaphore Usage - -- The `asyncio.Semaphore` (from `config.max_concurrent_tasks`) limits the **full lifecycle** of each investigation: data bundling → CPU build → API upload. -- It is acquired inside `_build_and_upload_single_arc()`. -- NEVER acquire the semaphore in a different scope (e.g. before spawning a task). -- Do NOT use `asyncio.Semaphore` as a substitute for the process pool limit. Both controls serve different purposes: the semaphore manages memory/IO, the `ProcessPoolExecutor` manages CPU. 
- ---- - -## 4. Error Handling Rules - -- A failure for one investigation MUST NOT abort the entire run. -- Catch expected errors at the point closest to the failure (`_upload_and_update_stats`, `_build_and_upload_single_arc`). -- On failure: increment `stats.failed_datasets` and append the identifier to `stats.failed_ids`. -- Re-raise unexpected errors (i.e., programming errors) so they are visible immediately. -- NEVER use bare `except Exception` as the final catch — only use it in `process_investigations`'s batch loop where it is immediately re-raised after logging. - ---- - -## 5. Configuration & Secrets - -- Configuration is loaded once in `main.py` via `ConfigWrapper` from `middleware.shared`. -- The resulting `Config` object is passed explicitly to functions that need it (dependency injection). -- Secrets (e.g. `connection_string`, API keys) use `pydantic.SecretStr`. Never log them with `str()` directly; use `.get_secret_value()` only at the point of use (e.g., engine creation). - ---- - -## 6. Logging Conventions - -- Every module defines: `logger = logging.getLogger(__name__)`. -- Do NOT use `print()` for any diagnostic output. -- Log messages that occur inside concurrent tasks MUST include a traceability prefix. Use the pattern `"%s: message", inv_info` (see `inv_info` in `processor.py`) so parallel log lines are distinguishable. -- Log levels: - - `DEBUG`: internal state, loop iterations. - - `INFO`: successful milestones (fetch, build, upload). - - `WARNING`: recoverable issues (missing optional column, assay without study). - - `ERROR`: per-item failures (build failed, upload failed). Do not use for fatal errors. - ---- - -## 7. Database Access Rules - -- All database reads go through the `Database` class in `database.py`. No other module is allowed to instantiate `AsyncEngine` or execute SQL directly. -- Use `stream_results=True` on all large queries to enable server-side cursors and avoid loading full tables into RAM. 
-- Use `literal_column("*")` with `select()` rather than ORM field mappings to generate clean `SELECT *` SQL. -- Related data (studies, assays, etc.) is ALWAYS fetched in bulk per batch using `WHERE investigation_ref IN (...)`. Never fetch related data row-by-row in a loop. diff --git a/docs/ai_workflow.md b/docs/ai_workflow.md new file mode 100644 index 0000000..0d2001a --- /dev/null +++ b/docs/ai_workflow.md @@ -0,0 +1,270 @@ +# AI Agent Workflow + +This document describes how AI coding agents (GitHub Copilot, Claude Code, etc.) +are integrated into this project and how the supporting artifacts are structured. + +--- + +## Overview + +The workflow is built on three open standards: + +| Standard | Purpose | URL | +| -------- | ------- | --- | +| **agents.md** | Central entry point — gives agents project context at startup | | +| **Specifica** | Spec-driven development — machine- and human-readable feature specs | | +| **Agent Skills** | On-demand procedural knowledge — loaded by agents when relevant | | + +--- + +## VS Code Integration + +GitHub Copilot in VS Code natively supports the artifacts described in this +document. Use **Chat: Open Customizations** (Command Palette `Ctrl+Shift+P`) +to explore and edit all active customization files in one place. + +| Artifact | VS Code mechanism | +| -------- | ----------------- | +| `AGENTS.md` | Loaded automatically as an *instructions file* by GitHub Copilot. Shown in **Chat: Open Customizations** under "Instructions". | +| `.agents/skills/*/SKILL.md` | Skill files are listed in **Chat: Open Customizations** under "Skills". The agent sees the frontmatter `description` at startup and loads the full file on demand. | +| `.github/agents/*.agent.md` | Custom agents are listed in the agent picker dropdown. Select an agent to activate its persona, tool set, and instructions. | +| `spec/**/*.md` | Not loaded automatically — agents follow links from `AGENTS.md` and read spec files with file-read tools as needed. 
| + +To verify which files are active, open the Copilot Chat panel, click the +settings icon, and select **Open Customizations**. All discovered instructions +and skill files are listed there. + +--- + +## Custom Agents: `.github/agents/` + +Custom agents (see [Custom agents in VS Code](https://code.visualstudio.com/docs/copilot/customization/custom-agents)) +combine a fixed persona, a curated tool set, and pre-loaded instructions into a +single, selectable configuration. Unlike skills — which are loaded on demand by +any agent — a custom agent *is* the active agent for the whole conversation. + +### `spec-to-code` — Spec-driven implementation + +[`.github/agents/spec-to-code.agent.md`](../.github/agents/spec-to-code.agent.md) + +This agent's job is to translate Specifica spec changes into matching source +code. Switch to it whenever a `spec.md` or `design.md` was updated and the code +needs to catch up. + +**How to use it:** + +1. Open Copilot Chat and select **spec-to-code** from the agent dropdown + (or type `@spec-to-code`). +2. Tell it what changed: + > The `arc-building/spec.md` now requires that empty annotation tables are + > skipped silently instead of raising a warning. Please implement this. +3. The agent reads the spec, finds the affected code in `builder.py` and its + tests, applies the change, runs `ruff format` and `pytest`, and reports + which requirements were implemented. + +**Tool set:** file read/write, codebase search, terminal (for formatter and +tests), Problems tab — no browser, no git push. + +**When to use it vs plain Agent mode:** + +| Situation | Use | +| --------- | --- | +| Spec changed, code needs to follow | `spec-to-code` | +| Exploratory coding, no spec context needed | default Agent mode | +| Writing a new spec from scratch | default Agent mode + `create-specifica-feature` skill | + +--- + +## Entry Point: `AGENTS.md` + +[`AGENTS.md`](../AGENTS.md) at the repository root is the single entry point for +all AI agents. 
It is automatically loaded by compatible agents (GitHub Copilot, +Claude Code, and others) at the start of every session. + +It contains only what every agent needs for every task: + +- Tech stack and key versions +- Project structure (with links to `spec/` and component specs) +- Essential commands (`uv`, `ruff`, `pytest`) +- Architecture & Design section — two-level spec index +- Code quality standards and file modification workflow + +**Principle:** `AGENTS.md` links to specs instead of duplicating their content. +It stays short and current. + +--- + +## Spec-Driven Development: `spec/` and `middleware/*/spec/` + +Specs follow the [Specifica](https://specifica.org) convention: each feature +lives in its own folder with a `spec.md` (what it does) and optionally a +`design.md` (key decisions and rationale). + +### Two-Level Layout + +```text +spec/ ← Project-level (cross-cutting concerns) +├── principles.md # Foundation contract, project values +├── configuration/ # ConfigWrapper pattern, env overrides, secrets +└── demo-environment/ # Local deployment setup + +middleware/ +└── sql_to_arc/ + └── spec/ ← Component-level (sql_to_arc internals) + ├── sql-to-arc-conversion/ + ├── arc-building/ + ├── database-access/ + └── api-upload/ +``` + +**Project-level specs** cover concerns that cut across components or that don't +belong to any single component (deployment, shared patterns, principles). + +**Component-level specs** live next to the code they describe +(`middleware//spec/`). Each future component gets its own `spec/` +folder. This makes specs portable and keeps context close to the code. + +### spec.md vs design.md + +- **`spec.md`** — requirements: what the feature must do, acceptance criteria, + interface contracts. Written before implementation. +- **`design.md`** — decisions: *why* it was built this way, key trade-offs, + alternatives rejected. Written alongside or after implementation. 
+ +--- + +## Agent Skills: `.agents/skills/` + +Skills follow the [Agent Skills](https://agentskills.io/) open standard. Each +skill is a folder containing a `SKILL.md` file with YAML frontmatter and +Markdown instructions. + +```text +.agents/ +└── skills/ + ├── arctrl/ + │ └── SKILL.md # How to use the arctrl Python library + ├── config-wrapper/ + │ └── SKILL.md # How to use ConfigWrapper / ConfigBase + └── create-specifica-feature/ + └── SKILL.md # How to create a new Specifica feature folder +``` + +Skills are **project-neutral** — they document a library or pattern in general +terms. Project-specific usage (concrete prefixes, mock paths, accepted +trade-offs) lives in the corresponding feature spec, not in the skill. + +### How Agents Use Skills + +1. **Discovery**: At startup, agents see only the `name` and `description` from + each skill's frontmatter — just enough to know when a skill might apply. +2. **Activation**: When a task matches a skill's description, the agent loads + the full `SKILL.md` into context. +3. **Execution**: The agent follows the instructions, optionally loading + referenced files or scripts. + +Skills are activated on demand, keeping the agent's context window lean. + +--- + +## Workflow in Practice + +When an agent starts a task it: + +1. Loads `AGENTS.md` → gets project context, commands, and spec links. +2. If the task touches a feature → reads the relevant `spec.md` / `design.md`. +3. If the task requires library knowledge → loads the matching skill. +4. After editing → runs `uv run ruff format .` and `uv run pytest`, checks the + VS Code **Problems** tab for Pylance / Mypy / Ruff diagnostics. + +### Example: Adding a New Config Field + +1. `AGENTS.md` links to `spec/configuration/`. +2. Agent reads `spec/configuration/spec.md` → learns the constraints + (no `os.environ`, add to `Config`, use `SecretStr` for secrets). +3. Agent loads the `config-wrapper` skill → learns the exact Pydantic pattern + and how to write the test. +4. 
Agent edits `config.py`, formats, and runs the tests. + +### Example: Fixing an ARC Serialization Bug + +1. `AGENTS.md` links to `middleware/sql_to_arc/spec/arc-building/`. +2. Agent reads `arc-building/design.md` → understands key decisions (no + `OntologySourceReference`, 7-tuple column key, explicit GC). +3. Agent loads the `arctrl` skill → gets the correct API surface. +4. Agent edits `builder.py` or `mapper.py`, formats, runs tests. + +--- + +## Adding New Skills or Specs + +### New Skill with VS Code and Copilot Chat + +VS Code has built-in support for creating and managing skills +(see [Use Agent Skills in VS Code](https://code.visualstudio.com/docs/copilot/customization/agent-skills)). + +#### Option A — AI-generated skill (recommended) + +Type `/create-skill` in the Copilot Chat input and describe what you need: + +> `/create-skill` a skill for the `httpx` library covering async requests, +> timeout handling, and authentication headers + +Copilot asks clarifying questions and writes the complete +`.agents/skills/httpx/SKILL.md` with valid frontmatter and instructions. + +#### Option B — Manual creation via the Skills menu + +Type `/skills` in the Chat input to open the **Configure Skills** menu directly. +Select **New Skill (Workspace)**, choose a location, and enter a name. +VS Code creates the folder and an empty `SKILL.md` scaffold to fill in. + +Alternatively, open **Chat: Open Customizations** from the Command Palette +(`Ctrl+Shift+P`), select the **Skills** tab, and choose **New Skill** from +the dropdown. + +**Verify** the new skill appears under the Skills tab in +**Chat: Open Customizations** after saving the file. + +Rules that apply regardless of creation method: + +- `name` must match the folder name; lowercase letters and hyphens only. +- `description` must say both *what* the skill does and *when to use it*. +- Keep the skill **project-neutral** — no FAIRagro-specific paths or prefixes. 
+ Project-specific constraints belong in the corresponding feature spec. +- Reference the skill from the relevant feature spec so agents know to load it. + +--- + +### New Feature Spec with `create-specifica-feature` + +The `create-specifica-feature` skill guides Copilot through the full process. + +**Example prompt** (Copilot Chat, Agent mode): + +> Use the `create-specifica-feature` skill to create a new component-level +> spec for a "result-export" feature in `middleware/sql_to_arc`. The feature +> writes ARC RO-Crate files to a local output directory as a fallback when +> the API is unreachable. + +Copilot will: + +1. Load the `create-specifica-feature` skill. +2. Choose the right location: `middleware/sql_to_arc/spec/result-export/` + (component-level, not project-level — affects only this component). +3. Create `spec.md` with a one-sentence purpose, `## Requirements` as + `- [ ]` checkboxes, and `## Edge Cases` as scenario → outcome pairs. +4. Create `design.md` with a `## Key Decisions` section, each decision + preceded by a `—` reasoning clause. +5. Add a link to `AGENTS.md` under **Architecture & Design**. + +The finished folder will look like: + +```text +middleware/sql_to_arc/spec/result-export/ +├── spec.md ← what it must do +└── design.md ← how it works and why +``` + +For detailed formatting rules, see +[`.agents/skills/create-specifica-feature/SKILL.md`](../.agents/skills/create-specifica-feature/SKILL.md). diff --git a/docs/workspace.dsl b/docs/workspace.dsl deleted file mode 100644 index 24e8e3d..0000000 --- a/docs/workspace.dsl +++ /dev/null @@ -1,89 +0,0 @@ -workspace "SQL-to-ARC Middleware" "Middleware component to convert SQL views into Annotated Research Context (ARC) objects." { - - model { - user = person "RDI Data Manager" "Responsible for managing and providing metadata from a Research Data Infrastructure." - - group "FAIRagro Ecosystem" { - fairAgroApi = softwareSystem "FAIRagro Middleware API" "Receives RO-Crate JSON-LD payloads." 
"External" - } - - sqlToArc = softwareSystem "SQL-to-ARC Converter" "Central component that maps relational data to ARC objects and sends them to the API." { - database = container "RDI SQL Database" "PostgreSQL database serving standardized metadata views." "PostgreSQL" "Database" - - converter = container "Converter Service" "Core logic for database extraction, mapping, and API transmission." "Python" { - main = component "Main Entry Point" "CLI interface and orchestrator." "Python" - - group "Async IO Loop (Controller)" { - orchestrator = component "Workflow Orchestrator" "Coordinates the data flow, manages concurrent tasks via Semaphores." "Python/Asyncio" - stats = component "Processing Stats" "Collects success/failure metrics and generates final reports." "Python" - } - - group "Process Pool Executor (Worker)" { - mapper = component "ARC Mapper" "Transforms relational rows into ARC structures using arctrl. Runs in separate OS processes to bypass GIL." "Python/arctrl" - serializer = component "JSON-LD Serializer" "Converts ARC objects to JSON strings directly in the worker process." "Python" - } - - group "Streaming Generator (Data Layer)" { - db_client = component "Database Client" "Implements lazy-loading and relational batching via SQLAlchemy streaming cursors." "Python/SQLAlchemy" - } - - api_client = component "API Client" "Handles mTLS secured async HTTP uploads to the Middleware API." "Python/httpx" - } - - demo_api = container "Mock API" "Simulates the FAIRagro API for local testing and CI." 
"FastAPI" "Development" - } - - # Relationships - user -> main "Configures and starts" - - main -> orchestrator "Orchestrates through" - orchestrator -> db_client "Streams investigations from" - orchestrator -> mapper "Submits tasks to Process Pool" - orchestrator -> api_client "Enqueues uploads to" - orchestrator -> stats "Updates metrics in" - - mapper -> serializer "Serializes to JSON-LD via" - db_client -> database "Queries views (vInvestigation, vStudy, etc.)" - api_client -> fairAgroApi "Sends RO-Crate JSON-LD (mTLS)" - api_client -> demo_api "Sends data during local demo" - } - - views { - systemContext sqlToArc "SystemContext" { - include * - autoLayout - } - - container sqlToArc "Containers" { - include * - autoLayout - } - - component converter "Components" { - include * - autoLayout - } - - styles { - element "Software System" { - background #1168bd - color #ffffff - } - element "Container" { - background #438dd5 - color #ffffff - } - element "Database" { - shape Cylinder - } - element "External" { - background #999999 - color #ffffff - } - element "Group" { - color #666666 - border Dotted - } - } - } -} diff --git a/middleware/sql_to_arc/README.md b/middleware/sql_to_arc/README.md index 33211e4..969a49f 100644 --- a/middleware/sql_to_arc/README.md +++ b/middleware/sql_to_arc/README.md @@ -154,9 +154,24 @@ docker run --rm \ --- +## Performance Tuning + +1. **Find the CPU ceiling** — increase `max_concurrent_arc_builds` until + CPU cores are saturated (≈ cores − 1 is the practical maximum). +2. **Fill I/O gaps** — if CPU drops to 0 % between builds (network latency + during API uploads), increase `max_concurrent_tasks`; + rule of thumb: 4 × `max_concurrent_arc_builds`. +3. **Watch RAM** — memory scales linearly with `max_concurrent_tasks` × + average investigation size. Reduce `db_batch_size` for very large + investigations if the main process grows too large. +4. 
**Timeout errors** — increase `arc_generation_timeout_minutes` only if + logs show timeouts on legitimately large datasets (e.g. thousands of + assays). + +--- + ## Documentation Links -- [Architectural Design](../../docs/ARCHITECTURAL_DESIGN.md) - [Database View Specification](../../docs/sql_to_arc_database_views.md) - [ARCtrl Documentation](https://nfdi4plants.org/ARCtrl/) - [Middleware API Client](https://github.com/fairagro/m4.2_advanced_middleware_api/tree/main/middleware/api_client) diff --git a/middleware/sql_to_arc/spec/api-upload/design.md b/middleware/sql_to_arc/spec/api-upload/design.md new file mode 100644 index 0000000..46eeaa3 --- /dev/null +++ b/middleware/sql_to_arc/spec/api-upload/design.md @@ -0,0 +1,40 @@ +# API Upload — Design + +## API Contract + +The converter calls `ApiClient.create_or_update_arc(rdi, arc_dict)` from +`processor.py`. The exact HTTP endpoint, request/response shape, and +authentication are fully encapsulated in the `middleware.api_client` shared +library and are **not a concern of this component**. + +## Lifecycle in the Converter + +```text +_upload_and_update_stats() + ├── json.loads(arc_json) → arc_dict (re-parse for API client) + ├── ctx.client.create_or_update_arc(rdi, arc_dict) + └── on success: log INFO + on error: stats.failed_datasets += 1 + stats.failed_ids.append(investigation_id) +``` + +## Key Decisions + +1. **JSON string → dict round-trip** + — The worker returns a JSON string (to keep IPC clean). The main process + parses it back to a dict for the API client. The overhead is negligible + and keeps the worker/main interface unambiguous. + +2. **Single `ApiClient` for the entire run** + — `ApiClient` is used as an async context manager in `main.py`. + Connection pooling amortises TLS handshake cost across all uploads. + +3. **OpenTelemetry span per upload** + — Each upload is wrapped in a `tracer.start_as_current_span("upload_arc")` + span with `rdi`, `worker_id`, and `investigation_id` attributes. 
This + makes per-investigation latency visible in any OTel-compatible backend. + +4. **Error scope: `(ConnectionError, TimeoutError, ApiClientError)`** + — Only network-level and API-level errors are caught here. Programming + errors (e.g. bad JSON) propagate upward so they are visible in the run + report as unexpected failures. diff --git a/middleware/sql_to_arc/spec/api-upload/spec.md b/middleware/sql_to_arc/spec/api-upload/spec.md new file mode 100644 index 0000000..fae8749 --- /dev/null +++ b/middleware/sql_to_arc/spec/api-upload/spec.md @@ -0,0 +1,32 @@ +# API Upload + +Publish each finished ARC RO-Crate JSON-LD document to the FAIRagro +Middleware API. The upload is the final step of the per-investigation +lifecycle and is the only I/O operation that reaches outside the local +machine at runtime. + +## Requirements + +- [ ] For each successfully built ARC, call + `ApiClient.create_or_update_arc(rdi, arc)` with the RO-Crate dict +- [ ] On `ConnectionError`, `TimeoutError`, or `ApiClientError` → count as + `failed_datasets`, add `investigation_id` to `failed_ids`, continue +- [ ] Log success or failure per investigation after each call +- [ ] Reuse the same `ApiClient` instance across all uploads within a run + +## Configuration + +The converter passes an `ApiClientConfig` (from `config.api_client`) to the +`ApiClient` constructor, which includes base URL, timeout, and credentials. +The converter does not interpret these values itself. + +## Edge Cases + +`ApiClientError` (any non-success response) → mark investigation failed, +continue. + +API is unreachable at startup → first upload attempt fails; the converter +does not pre-check connectivity. + +`arc_json` is `None` (build returned nothing) → log error, mark failed, +skip upload entirely. 
diff --git a/middleware/sql_to_arc/spec/arc-building/design.md b/middleware/sql_to_arc/spec/arc-building/design.md new file mode 100644 index 0000000..8eda912 --- /dev/null +++ b/middleware/sql_to_arc/spec/arc-building/design.md @@ -0,0 +1,81 @@ +# ARC Building — Design + +## Module Responsibilities + +```text +mapper.py — Pure row-to-ARCTRL-object functions (no logic, no branching on + table structure). One public function per entity type. + +builder.py — Orchestration: assembles the ARC from mapper output, handles + relational linking, builds ArcTable objects from flat rows. + Entry point: build_single_arc_task(ArcBuildData) → str +``` + +## Call Graph + +```text +build_single_arc_task(data) + ├── map_investigation(row) → ArcInvestigation + ├── ARC.from_arc_investigation() + ├── _add_studies_to_arc() + │ └── map_study(row) → ArcStudy + ├── _add_assays_to_arc() + │ ├── map_assay(row) → ArcAssay + │ └── _link_assay_to_studies() + ├── _add_contacts_to_arc() + │ └── map_contact(row) → Person + ├── _add_publications_to_arc() + │ └── map_publication(row) → Publication + ├── _process_annotation_tables() + │ ├── _build_arc_table() + │ │ ├── _get_column_key() + │ │ ├── _build_header() → CompositeHeader + │ │ └── _build_single_cell() → CompositeCell + │ └── target.AddTable(table) + └── arc.ToROCrateJsonString() → str (immediately freed after) +``` + +## Key Decisions + +1. **`mapper.py` has no conditional logic on DB structure** + — Each mapper function takes a single typed row and returns a single + ARCTRL object. All relational wiring (linking assays to studies, routing + contacts) is the responsibility of `builder.py`. This keeps mappers + unit-testable without a full ARC context. + +2. **Two-pass grouping for annotation tables** + — `vAnnotationTable` delivers one row per cell (see database-access design). 
+ `builder.py` first groups by `(target_type, target_ref, table_name)` to + identify each table, then by column key to reconstruct columns before + calling `ArcTable.AddColumn()`. + +3. **Column key is a 7-tuple derived from metadata columns** + — `(column_type, column_io_type, column_value, column_annotation_term, + column_annotation_uri, column_annotation_version, column_name)`. + Stable across row iterations; used as a dict key to build per-column + cell lists without a second pass. + +4. **Worker process: explicit GC after serialization** + — `arctrl` objects hold .NET interop memory that the Python GC may not + collect promptly. `del arc` + `gc.collect()` immediately after + `ToROCrateJsonString()` prevents worker processes from accumulating + memory across many investigations. + +5. **No `OntologySourceReference` objects are created** + — `xxx_version` from the DB views belongs to `OntologySourceReference.version` + in ARCtrl, not to `OntologyAnnotation`. Populating it correctly requires + registering one `OntologySourceReference` per ontology source on the + investigation — complexity not justified by the benefit. `tsr` is always `""` + and `_version` is silently dropped. ARCs serialize with + `"ontologySourceReferences": []` — valid JSON-LD, but ontology version + provenance is lost. + +## OntologyAnnotation Mapping Convention + +```text +DB field → OntologyAnnotation argument +xxx_term → name +xxx_uri → tan (TermAccessionNumber) +xxx_version → (ignored, see Key Decision 5) + tsr (TermSourceREF) is always "" +``` diff --git a/middleware/sql_to_arc/spec/arc-building/spec.md b/middleware/sql_to_arc/spec/arc-building/spec.md new file mode 100644 index 0000000..6d6b55b --- /dev/null +++ b/middleware/sql_to_arc/spec/arc-building/spec.md @@ -0,0 +1,40 @@ +# ARC Building + +Transform pre-fetched database rows into a valid ARC RO-Crate JSON-LD +document for a single investigation. Runs in an isolated worker process; +must be stateless and side-effect-free. 
+ +## Requirements + +- [ ] Accept a self-contained `ArcBuildData` bundle (investigation + + related studies, assays, contacts, publications, annotations) +- [ ] Map `InvestigationRow` → `ArcInvestigation` and wrap in `ARC` +- [ ] Map each `StudyRow` → `ArcStudy`; register in the ARC +- [ ] Map each `AssayRow` → `ArcAssay`; register in the ARC; link to + studies via `study_ref` (supports single ID or JSON array) +- [ ] Map each `ContactRow` → `Person`; attach to investigation, study, + or assay depending on `target_type` +- [ ] Map each `PublicationRow` → `Publication`; attach to investigation + or study depending on `target_type` +- [ ] Build `ArcTable` objects from flat annotation rows; attach to the + correct study or assay +- [ ] Serialize the finished ARC to a JSON-LD string via + `arc.ToROCrateJsonString()` +- [ ] Explicitly free ARC objects and call `gc.collect()` before returning +- [ ] Never import `database`, `processor`, or `config`; inputs arrive + as pure Pydantic data + +## Edge Cases + +`study_ref` is a JSON array string → parse and register the assay with +every referenced study. + +Investigation has assays but no studies → log warning. + +Unknown column type → skip that column, log warning. + +Annotation table targets a study/assay identifier that does not exist in +the current investigation's data → skip the table, log warning. + +Contact or publication has an unknown `target_type` → not attached +anywhere; log warning. 
diff --git a/middleware/sql_to_arc/spec/database-access/design.md b/middleware/sql_to_arc/spec/database-access/design.md new file mode 100644 index 0000000..88ccd89 --- /dev/null +++ b/middleware/sql_to_arc/spec/database-access/design.md @@ -0,0 +1,67 @@ +# Database Access — Design + +## Class Structure + +```text +Database + ├── engine: AsyncEngine (SQLAlchemy async engine) + ├── validator: SchemaValidator (pre-flight checks) + ├── stream_investigations() (server-side cursor, yields InvestigationRow) + ├── stream_studies() (bulk fetch by investigation IDs) + ├── stream_assays() + ├── stream_contacts() + ├── stream_publications() + └── stream_annotation_tables() + +SchemaValidator + ├── validate_models() (iterates all registered models) + ├── _validate_model() (columns + NULL checks per model) + ├── _get_db_columns() (SQLAlchemy inspect) + ├── _check_column_presence() (required vs optional field distinction) + └── _check_null_values() (SELECT COUNT WHERE col IS NULL) +``` + +## Key Decisions + +1. **Server-side cursor for investigations** + — `conn.stream(stmt)` with `stream_results=True` keeps the result set + on the DB server. The engine fetches rows in small batches rather than + pulling the whole table into Python RAM. + +2. **`SELECT *` via `literal_column("*")`** + — Using `sqlalchemy.literal_column("*")` generates `SELECT *` without + quoting the view name into `"vInvestigation"."*"`, which breaks some + dialects. This is intentional, not an oversight. + +3. **`WHERE investigation_ref = ANY(:ids)` for related entities** + — One round-trip per entity type per batch. Avoids the N+1 problem + (one query per investigation) while keeping memory bounded (no full + table load). + +4. **`spec_required` / `spec_override` field metadata** + — Standard Pydantic `is_required()` is not sufficient: some fields have + a default value but must still be present in the view. 
Custom + `json_schema_extra` flags let the validator express this distinction + without modifying the Python type. + +5. **Connection string normalisation in `__init__`** + — Legacy `postgresql://` and similar prefixes are rewritten to async + driver schemes before the engine is created. This lets operators reuse + existing connection strings without changing config. + +6. **`_validate_and_map` centralises row parsing** + — All DB-to-model transitions go through a single method; validation + errors are logged uniformly and the caller decides whether to skip or + raise. + +## Schema Validation Flow + +```text +validate_schema() + └── for each model: + _get_db_columns() → inspect view columns + _check_column_presence() → missing required → raise + → missing optional → warn + _check_null_values() → NULLs in required field → raise + → NULLs + spec_override → warn +``` diff --git a/middleware/sql_to_arc/spec/database-access/spec.md b/middleware/sql_to_arc/spec/database-access/spec.md new file mode 100644 index 0000000..ec289e5 --- /dev/null +++ b/middleware/sql_to_arc/spec/database-access/spec.md @@ -0,0 +1,71 @@ +# Database Access + +Provide a typed, async, memory-safe interface to the SQL views. +All SQL in the project lives here; no other module may query the database +directly. 
+ +## Requirements + +- [ ] Connect to any SQLAlchemy-supported async dialect via a connection + string (PostgreSQL, MySQL, MSSQL, Oracle); normalise scheme prefixes + automatically +- [ ] Validate that all required views exist and have the expected columns + before the main processing loop starts +- [ ] Warn (not fail) when optional columns are missing; use model defaults +- [ ] Fail fast with `MissingRequiredColumnsError` when required columns + are absent +- [ ] Fail fast with `RequiredColumnsNullError` when required columns + contain NULL values (unless `spec_override=True` is set on the field) +- [ ] Stream investigations using a server-side cursor; never load the full + table into memory +- [ ] Fetch related entities (studies, assays, contacts, publications, + annotations) in bulk for a list of investigation IDs using a single + `WHERE investigation_ref = ANY(...)` query per entity type +- [ ] Validate each row against its Pydantic model; skip invalid rows with + a warning and increment `failed_datasets` +- [ ] For `vAnnotationTable` rows, also validate cross-field constraints + (e.g. `column_io_type` required when `column_type` is `input` or + `output`); log a warning but do not skip the row on constraint + violations + +## Views (Contract) + +The authoritative column-level specification for all views — including +required/optional fields, data types, and cross-dialect type mappings — is +maintained in [docs/sql_to_arc_database_views.md](../../../../docs/sql_to_arc_database_views.md). + +Every view has a corresponding `BaseRow` subclass in `models.py`; no raw +dicts cross module boundaries. 
+ +Summary of views used by this module: + +| View | Pydantic Model | Purpose | +| --- | --- | --- | +| `vInvestigation` | `InvestigationRow` | Top-level metadata | +| `vStudy` | `StudyRow` | Study metadata linked to investigation | +| `vAssay` | `AssayRow` | Assay metadata linked to study | +| `vContact` | `ContactRow` | Person/contact linked to investigation, study, or assay | +| `vPublication` | `PublicationRow` | Publication linked to investigation or study | +| `vAnnotationTable` | `AnnotationTableRow` | One cell row, carrying table/column/cell metadata | + +## Edge Cases + +View does not exist → `validate_schema()` logs a warning and skips that +view (not fatal for optional views; fatal for required ones). + +Row fails Pydantic validation → skip row, log warning with field errors, +increment `failed_datasets`. + +Connection strings without an explicit async driver suffix → automatically +rewritten: + +| Standard prefix | Rewritten to async driver prefix | +| --- | --- | +| `postgresql://` | `postgresql+psycopg://` | +| `mysql://` | `mysql+aiomysql://` | +| `mariadb://` | `mysql+aiomysql://` | +| `oracle://` | `oracle+oracledb://` | +| `mssql://` | `mssql+aioodbc://` | + +Empty investigation list passed to `_stream_by_investigation` → returns +immediately without a query. 
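The connection string rewriting listed in the table above could look roughly like the following. This is a hedged sketch: `normalise_connection_string` and `_ASYNC_SCHEMES` are hypothetical names, and the actual logic in `Database.__init__` may differ.

```python
# Mapping copied from the prefix table; names are illustrative only.
_ASYNC_SCHEMES: dict[str, str] = {
    "postgresql://": "postgresql+psycopg://",
    "mysql://": "mysql+aiomysql://",
    "mariadb://": "mysql+aiomysql://",
    "oracle://": "oracle+oracledb://",
    "mssql://": "mssql+aioodbc://",
}


def normalise_connection_string(url: str) -> str:
    """Rewrite a driverless scheme prefix to its async driver equivalent."""
    for prefix, replacement in _ASYNC_SCHEMES.items():
        if url.startswith(prefix):
            return replacement + url[len(prefix):]
    # Strings that already name a driver (e.g. "postgresql+asyncpg://")
    # match no prefix and are returned unchanged.
    return url
```

Because each key ends in `://`, a string that already carries a `+driver` suffix is left untouched.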
diff --git a/middleware/sql_to_arc/spec/sql-to-arc-conversion/design.md b/middleware/sql_to_arc/spec/sql-to-arc-conversion/design.md new file mode 100644 index 0000000..5df7896 --- /dev/null +++ b/middleware/sql_to_arc/spec/sql-to-arc-conversion/design.md @@ -0,0 +1,74 @@ +# SQL-to-ARC Conversion — Design + +## Architecture + +Three concurrency layers cooperate to keep the pipeline fast and +memory-bounded: + +```text +Main process (async event loop) + │ + ├─ DB stream (AsyncGenerator, server-side cursor, chunked) + │ + ├─ asyncio.Semaphore (flow control: caps active tasks) + │ + ├─ asyncio.Task set (one Task per investigation) + │ │ + │ └─ ProcessPoolExecutor (CPU-bound ARC build in forked process) + │ + └─ httpx (async HTTP upload to Middleware API) +``` + +## Data Flow + +1. `main.py` parses config and starts `process_investigations()`. +2. DB stream yields `InvestigationRow` objects one batch at a time + (`db_batch_size`, default 100). +3. For each batch, related data (studies, assays, contacts, publications, + annotations) is fetched in a single bulk query per entity type using + `WHERE investigation_ref = ANY(...)`. +4. The semaphore gates entry into the per-investigation task. It limits + peak concurrent active lifecycles to `max_concurrent_tasks` (default + 4 × `max_concurrent_arc_builds`). +5. Inside the task, an `ArcBuildData` bundle (plain Pydantic models) is + handed to `loop.run_in_executor()` which runs `build_single_arc_task` + in a worker process. +6. The worker builds the ARC, serializes it to a JSON-LD string, calls + `gc.collect()`, and returns the string. +7. The main process uploads the JSON string to the API; no ARC object + crosses the process boundary. +8. `ProcessingStats` is updated atomically per investigation. + +## Key Decisions + +1. **ProcessPoolExecutor, not ThreadPoolExecutor** + — `arctrl` is CPU-bound and holds .NET bridge state; the GIL prevents + true parallelism with threads. 
Separate OS processes give each worker + a dedicated core. + +2. **Semaphore scope wraps the full lifecycle (data → build → upload)** + — A narrower scope (e.g. only around the CPU step) would let the event + loop queue thousands of "waiting" tasks, each holding its DB rows in RAM. + The semaphore prevents the backlog from growing unboundedly. + +3. **IPC via JSON string, not pickled ARC object** + — ARC objects carry .NET interop state that does not survive pickling + cleanly and is large. Returning a string minimises IPC overhead and + avoids worker memory leaks. + +4. **Batch fetch of related data, not per-investigation queries** + — Querying DB once per investigation would be O(N) round-trips. + One bulk `ANY()` query per entity type per batch keeps DB load constant. + +5. **`max_concurrent_tasks` defaults to 4 × `max_concurrent_arc_builds`** + — While `k` workers build ARCs, the remaining slots keep the HTTP + upload pipeline busy without overloading RAM. + +6. **Schema validation before the loop starts** + — Fail fast with a clear diagnostic if the DB schema doesn't match + the expected views. Better than partial output with silent column gaps. + +7. **OpenTelemetry tracing across the full pipeline** + — `processor.py` and `main.py` instrument each investigation span and + the overall run span. This allows identifying bottlenecks in the process + pool (CPU-bound) versus the API upload (I/O-bound) in production. diff --git a/middleware/sql_to_arc/spec/sql-to-arc-conversion/spec.md b/middleware/sql_to_arc/spec/sql-to-arc-conversion/spec.md new file mode 100644 index 0000000..0ce2fc5 --- /dev/null +++ b/middleware/sql_to_arc/spec/sql-to-arc-conversion/spec.md @@ -0,0 +1,31 @@ +# SQL-to-ARC Conversion Pipeline + +Orchestrate the end-to-end batch run: validate the database, process all +investigations, build ARCs, upload them, and report results. This spec +covers only the glue between the other features; the details live there. 
+ +## Requirements + +- [ ] Validate that all required database views and columns exist before + starting the main loop — see [database-access/spec.md](../database-access/spec.md) +- [ ] Stream investigations one at a time and fetch related entities in + bulk per batch — see [database-access/spec.md](../database-access/spec.md) +- [ ] For each investigation: build the ARC in an isolated worker process — + see [arc-building/spec.md](../arc-building/spec.md) +- [ ] Upload each successfully built ARC to the Middleware API — + see [api-upload/spec.md](../api-upload/spec.md) +- [ ] Record success and failure per investigation by ID; print a JSON + provenance report to stdout when the run completes +- [ ] Exit with code 0 if processing succeeded (even with partial failures); + non-zero on fatal errors (schema mismatch, DB unreachable, etc.) +- [ ] Worker process timeout (default 30 min) → investigation counted as + failed, loop continues +- [ ] Respect `debug_limit` config to cap the number of investigations + processed (for testing) +- [ ] Support `--version` CLI flag; support `--config` to specify config file + +## Scope + +Covers orchestration only: entry point, process lifecycle, stats +aggregation, exit codes, and CLI flags. Per-feature behaviour (DB queries, +ARC construction, API calls) is out of scope here. 
diff --git a/middleware/sql_to_arc/src/middleware/sql_to_arc/builder.py b/middleware/sql_to_arc/src/middleware/sql_to_arc/builder.py index de4c278..29c9e34 100644 --- a/middleware/sql_to_arc/src/middleware/sql_to_arc/builder.py +++ b/middleware/sql_to_arc/src/middleware/sql_to_arc/builder.py @@ -4,7 +4,7 @@ import json import logging from collections import defaultdict -from typing import Any, cast +from typing import Any from arctrl import ( ARC, @@ -16,6 +16,7 @@ IOType, OntologyAnnotation, ) +from arctrl.py.Core.Table.composite_cell import Data from middleware.sql_to_arc.context import ArcBuildData from middleware.sql_to_arc.mapper import ( @@ -26,6 +27,7 @@ map_study, ) from middleware.sql_to_arc.models import ( + AnnotationTableRow, AssayRow, ContactRow, PublicationRow, @@ -48,6 +50,11 @@ def _add_studies_to_arc(arc: ARC, study_rows: list[StudyRow]) -> dict[str, ArcSt def _add_assays_to_arc(arc: ARC, assay_rows: list[AssayRow], study_map: dict[str, ArcStudy]) -> dict[str, ArcAssay]: """Add assays to ARC, link to studies, and return assay map.""" assay_map: dict[str, ArcAssay] = {} + if assay_rows and not study_map: + logger.warning( + "Investigation has %d assay(s) but no studies — assays will not be linked to any study.", + len(assay_rows), + ) for a_row in assay_rows: assay = map_assay(a_row) arc.AddAssay(assay) @@ -150,16 +157,26 @@ def _add_publications_to_arc( study.Publications.append(map_publication(p_row)) -def _get_column_key(r: dict[str, Any]) -> tuple[Any, ...]: +# Maps DB schema column_io_type values (snake_case DB contract) to the canonical +# strings recognised by IOType.of_string() (ARCitect display names). 
+_IO_TYPE_MAP: dict[str, str] = { + "source_name": "Source Name", + "sample_name": "Sample Name", + "data": "Data", + "material_name": "Material", +} + + +def _get_column_key(r: AnnotationTableRow) -> tuple[Any, ...]: """Extract a unique key for a column definition.""" return ( - r.get("column_type"), - r.get("column_io_type"), - r.get("column_value"), - r.get("column_annotation_term"), - r.get("column_annotation_uri"), - r.get("column_annotation_version"), - r.get("column_name"), # Fallback for simple tests + r.column_type, + r.column_io_type, + r.column_value, + r.column_annotation_term, + r.column_annotation_uri, + r.column_annotation_version, + r.column_name, # Fallback for simple tests ) @@ -169,10 +186,22 @@ def _build_header(key: tuple[Any, ...]) -> CompositeHeader | None: try: oa = OntologyAnnotation(c_ann_term or "", c_ann_uri or "", c_ann_ver or "") + if c_type in {"input", "output"} and not c_io: + default_io = "Source Name" if c_type == "input" else "Sample Name" + logger.warning( + "column_io_type missing for column_type '%s'; defaulting to '%s'", + c_type, + default_io, + ) + # Dispatch table for different header types handlers = { - "input": lambda: CompositeHeader.input(IOType.of_string(c_io or "source_name")), - "output": lambda: CompositeHeader.output(IOType.of_string(c_io or "sample_name")), + "input": lambda: CompositeHeader.input( + IOType.of_string(_IO_TYPE_MAP.get(c_io or "", c_io or "Source Name")) + ), + "output": lambda: CompositeHeader.output( + IOType.of_string(_IO_TYPE_MAP.get(c_io or "", c_io or "Sample Name")) + ), "characteristic": lambda: CompositeHeader.characteristic(oa), "factor": lambda: CompositeHeader.factor(oa), "parameter": lambda: CompositeHeader.parameter(oa), @@ -193,13 +222,12 @@ def _build_header(key: tuple[Any, ...]) -> CompositeHeader | None: return None -def _build_single_cell(cell_row: dict[str, Any], header: CompositeHeader) -> CompositeCell: +def _build_single_cell(cell_row: AnnotationTableRow, header: 
CompositeHeader) -> CompositeCell: """Build a single CompositeCell from a database row.""" - cv = cell_row.get("cell_value") - cat = cell_row.get("cell_annotation_term") - cau = cell_row.get("cell_annotation_uri") or "" - cav = cell_row.get("cell_annotation_version") or "" - v = cell_row.get("value") # Fallback for old/simple tests + cv = cell_row.cell_value + cat = cell_row.cell_annotation_term + cau = cell_row.cell_annotation_uri or "" + cav = cell_row.cell_annotation_version or "" # Unitized cell (value + ontology term) if cv is not None and cat is not None: @@ -209,8 +237,12 @@ def _build_single_cell(cell_row: dict[str, Any], header: CompositeHeader) -> Com if cat is not None: return CompositeCell.term(OntologyAnnotation(cat, cau, cav)) + # Data cell (file path) — required when header is a Data-type IO column + if header.IsDataColumn: + return CompositeCell.data(Data(name=str(cv)) if cv is not None else Data()) + # Text value? (either from new schema 'cell_value' or fallback 'value') - val_to_use = cv if cv is not None else v + val_to_use = cv if val_to_use is not None: if header.IsTermColumn: # If the column expects a term, wrap the text in an annotation @@ -221,7 +253,7 @@ def _build_single_cell(cell_row: dict[str, Any], header: CompositeHeader) -> Com def _build_column_cells( - rows_map: dict[int, dict[str, Any]], max_row_idx: int, header: CompositeHeader + rows_map: dict[int, AnnotationTableRow], max_row_idx: int, header: CompositeHeader ) -> list[CompositeCell]: """Build a list of CompositeCell objects for a column.""" col_cells = [] @@ -234,7 +266,7 @@ def _build_column_cells( return col_cells -def _build_arc_table(t_name: str, rows: list[dict[str, Any]]) -> ArcTable | None: +def _build_arc_table(t_name: str, rows: list[AnnotationTableRow]) -> ArcTable | None: """Build an ArcTable from flat database rows.""" if not rows: return None @@ -242,20 +274,20 @@ def _build_arc_table(t_name: str, rows: list[dict[str, Any]]) -> ArcTable | None table = 
ArcTable.init(t_name) # Determine max row index - max_row_idx = max((cast(int, r.get("row_index", 0)) for r in rows), default=-1) + max_row_idx = max((r.row_index for r in rows), default=-1) if max_row_idx < 0: return None col_keys: list[tuple[Any, ...]] = [] seen_keys = set() - col_to_rows: dict[tuple[Any, ...], dict[int, dict[str, Any]]] = defaultdict(dict) + col_to_rows: dict[tuple[Any, ...], dict[int, AnnotationTableRow]] = defaultdict(dict) for r in rows: key = _get_column_key(r) if key not in seen_keys: col_keys.append(key) seen_keys.add(key) - col_to_rows[key][cast(int, r.get("row_index", 0))] = r + col_to_rows[key][r.row_index] = r for key in col_keys: header = _build_header(key) @@ -270,13 +302,13 @@ def _build_arc_table(t_name: str, rows: list[dict[str, Any]]) -> ArcTable | None def _process_annotation_tables( - inv_id: str, annotations: list[dict[str, Any]], study_map: dict[str, Any], assay_map: dict[str, Any] + inv_id: str, annotations: list[AnnotationTableRow], study_map: dict[str, Any], assay_map: dict[str, Any] ) -> None: """Process and add annotation tables.""" - tables_groups = defaultdict(list) + tables_groups: dict[tuple[Any, ...], list[AnnotationTableRow]] = defaultdict(list) for ann in annotations: - if ann.get("investigation_ref") == inv_id: - key = (ann.get("target_type"), ann.get("target_ref"), ann.get("table_name")) + if ann.investigation_ref == inv_id: + key = (ann.target_type, ann.target_ref, ann.table_name) tables_groups[key].append(ann) for (t_type, t_ref, t_name), rows in tables_groups.items(): @@ -293,6 +325,13 @@ def _process_annotation_tables( table = _build_arc_table(t_name, rows) if table: target.AddTable(table) + else: + logger.warning( + "Annotation table '%s' targets %s '%s' which does not exist in this investigation; skipping.", + t_name, + t_type, + t_ref, + ) def build_single_arc_task(data: ArcBuildData) -> str: diff --git a/middleware/sql_to_arc/src/middleware/sql_to_arc/context.py 
b/middleware/sql_to_arc/src/middleware/sql_to_arc/context.py index 3f52b48..289cfa2 100644 --- a/middleware/sql_to_arc/src/middleware/sql_to_arc/context.py +++ b/middleware/sql_to_arc/src/middleware/sql_to_arc/context.py @@ -2,10 +2,10 @@ import concurrent.futures from dataclasses import dataclass -from typing import Any from middleware.api_client import ApiClient from middleware.sql_to_arc.models import ( + AnnotationTableRow, AssayRow, ContactRow, InvestigationRow, @@ -23,7 +23,7 @@ class ArcBuildData: assays: list[AssayRow] contacts: list[ContactRow] publications: list[PublicationRow] - annotations: list[dict[str, Any]] + annotations: list[AnnotationTableRow] @dataclass(frozen=True, slots=True) @@ -36,7 +36,7 @@ class WorkerContext: assays_by_inv: dict[str, list[AssayRow]] contacts_by_inv: dict[str, list[ContactRow]] pubs_by_inv: dict[str, list[PublicationRow]] - anns_by_inv: dict[str, list[dict[str, Any]]] + anns_by_inv: dict[str, list[AnnotationTableRow]] worker_id: int total_workers: int executor: concurrent.futures.Executor @@ -51,6 +51,6 @@ class RelatedDataBatch: assays_by_inv: dict[str, list[AssayRow]] contacts_by_inv: dict[str, list[ContactRow]] pubs_by_inv: dict[str, list[PublicationRow]] - anns_by_inv: dict[str, list[dict[str, Any]]] + anns_by_inv: dict[str, list[AnnotationTableRow]] study_count: int assay_count: int diff --git a/middleware/sql_to_arc/src/middleware/sql_to_arc/database.py b/middleware/sql_to_arc/src/middleware/sql_to_arc/database.py index a40ee87..ab68d25 100644 --- a/middleware/sql_to_arc/src/middleware/sql_to_arc/database.py +++ b/middleware/sql_to_arc/src/middleware/sql_to_arc/database.py @@ -18,6 +18,7 @@ from sqlalchemy.ext.asyncio import AsyncConnection, AsyncEngine, create_async_engine from middleware.sql_to_arc.models import ( + AnnotationTableRow, AssayRow, BaseRow, ContactRow, @@ -173,6 +174,7 @@ async def validate_schema(self) -> None: AssayRow, ContactRow, PublicationRow, + AnnotationTableRow, ] # Cast to satisfying the 
Iterable[type[BaseRow]] requirement await self.validator.validate_models(cast(Iterable[type[BaseRow]], models)) @@ -284,30 +286,10 @@ async def stream_publications(self, investigation_ids: list[str]) -> AsyncGenera async for r in self._stream_by_investigation(PublicationRow, investigation_ids, "publication"): yield r - async def stream_annotation_tables(self, investigation_ids: list[str]) -> AsyncGenerator[dict[str, Any], None]: - """Stream annotation tables for given investigations.""" - if not investigation_ids: - return - view_name = "vAnnotationTable" - try: - async with self.engine.connect() as conn: - # Use literal_column("*") to select all columns - c_inv_ref: sqlalchemy.ColumnElement[Any] = column("investigation_ref") - stmt: sqlalchemy.Select[Any] = ( - select(sqlalchemy.literal_column("*")) - .select_from(table(view_name)) - .where(c_inv_ref.in_(investigation_ids)) - .execution_options(stream_results=True) - ) - - result = await conn.stream(stmt) - async for row in result.mappings(): - yield dict(row) - except ProgrammingError as e: - if f'relation "{view_name.lower()}" does not exist' in str(e).lower(): - logger.warning('Table or view "%s" does not exist. 
Treating as empty.', view_name) - else: - raise + async def stream_annotation_tables(self, investigation_ids: list[str]) -> AsyncGenerator[AnnotationTableRow, None]: + """Stream annotation table rows for given investigations.""" + async for r in self._stream_by_investigation(AnnotationTableRow, investigation_ids, "annotation_table"): + yield r @asynccontextmanager async def connect(self) -> AsyncGenerator[AsyncConnection, None]: diff --git a/middleware/sql_to_arc/src/middleware/sql_to_arc/mapper.py b/middleware/sql_to_arc/src/middleware/sql_to_arc/mapper.py index 662baa7..cb9e33d 100644 --- a/middleware/sql_to_arc/src/middleware/sql_to_arc/mapper.py +++ b/middleware/sql_to_arc/src/middleware/sql_to_arc/mapper.py @@ -20,16 +20,14 @@ StudyRow, ) -# name=term, tan=uri (TermAccessionNumber), tsr="" (TermSourceREF - we don't have it, maybe version?) -# Spec says version is used. If we don't have TSR, we can leave it empty. - def _make_oa(term: str | None, uri: str | None, _version: str | None) -> OntologyAnnotation: + # _version is deliberately ignored: the DB version field belongs to + # OntologySourceReference.Version, not OntologyAnnotation. Mapping it + # correctly would require registering OntologySourceReference objects on + # the investigation and is out of scope. See arc-building/design.md. if not term: return OntologyAnnotation() - - # name=term, tan=uri (TermAccessionNumber), tsr="" (TermSourceREF - we don't have it, maybe version?) - # Spec says version is used. If we don't have TSR, we can leave it empty. 
return OntologyAnnotation(name=term, tan=uri or "", tsr="") @@ -133,8 +131,3 @@ def map_contact(row: ContactRow) -> Person: affiliation=row.affiliation, roles=roles, ) - - -def map_annotation(row: dict[str, Any]) -> dict[str, Any]: - """Return raw dict for annotation processing.""" - return row diff --git a/middleware/sql_to_arc/src/middleware/sql_to_arc/models.py b/middleware/sql_to_arc/src/middleware/sql_to_arc/models.py index fab4d65..7be465f 100644 --- a/middleware/sql_to_arc/src/middleware/sql_to_arc/models.py +++ b/middleware/sql_to_arc/src/middleware/sql_to_arc/models.py @@ -1,14 +1,11 @@ """Data models for the SQL-to-ARC conversion process.""" -import logging from datetime import datetime -from typing import Any, ClassVar +from typing import Any, ClassVar, Literal from pydantic import BaseModel, ConfigDict, Field, Json, model_validator from pydantic_core import PydanticUndefined -logger = logging.getLogger(__name__) - # JSON types representing the expected structure after parsing type JsonList = list[Any] @@ -135,7 +132,7 @@ class PublicationRow(BaseRow): __view_name__: ClassVar[str] = "vPublication" investigation_ref: str = spec_field() - target_type: str = spec_field() + target_type: Literal["investigation", "study"] = spec_field() pubmed_id: str | None = spec_field(default=None) doi: str | None = spec_field(default=None) authors: str | None = spec_field(default=None) @@ -152,7 +149,7 @@ class ContactRow(BaseRow): __view_name__: ClassVar[str] = "vContact" investigation_ref: str = spec_field() - target_type: str = spec_field() + target_type: Literal["investigation", "study", "assay"] = spec_field() last_name: str | None = spec_field(default=None) first_name: str | None = spec_field(default=None) mid_initials: str | None = spec_field(default=None) @@ -163,3 +160,29 @@ class ContactRow(BaseRow): affiliation: str | None = spec_field(default=None) roles: Json[JsonList] | None = spec_field(default=None) target_ref: str | None = spec_field(default=None) + + 
+class AnnotationTableRow(BaseRow): + """Pydantic model for annotation table cell rows (vAnnotationTable). + + Each row represents a single cell, together with the column and table it belongs to. + """ + + __view_name__: ClassVar[str] = "vAnnotationTable" + + investigation_ref: str = spec_field() + target_type: Literal["study", "assay"] = spec_field() + target_ref: str = spec_field() + table_name: str = spec_field() + column_type: str = spec_field() + row_index: int = spec_field() + column_io_type: Literal["data", "material_name", "sample_name", "source_name"] | None = spec_field(default=None) + column_value: str | None = spec_field(default=None) + column_annotation_term: str | None = spec_field(default=None) + column_annotation_uri: str | None = spec_field(default=None) + column_annotation_version: str | None = spec_field(default=None) + column_name: str | None = spec_field(default=None) + cell_value: str | None = spec_field(default=None) + cell_annotation_term: str | None = spec_field(default=None) + cell_annotation_uri: str | None = spec_field(default=None) + cell_annotation_version: str | None = spec_field(default=None) diff --git a/middleware/sql_to_arc/src/middleware/sql_to_arc/processor.py b/middleware/sql_to_arc/src/middleware/sql_to_arc/processor.py index 90d4f29..19c3981 100644 --- a/middleware/sql_to_arc/src/middleware/sql_to_arc/processor.py +++ b/middleware/sql_to_arc/src/middleware/sql_to_arc/processor.py @@ -178,9 +178,7 @@ async def group_stream( m = defaultdict(list) count = 0 async for r in gen: - # All models and the annotation dict have investigation_ref. - # We handle both Pydantic models (obj.field) and raw dicts (obj[field]). 
- inv_ref = r["investigation_ref"] if isinstance(r, dict) else r.investigation_ref + inv_ref = r.investigation_ref m[str(inv_ref)].append(r) count += 1 return dict(m), count diff --git a/middleware/sql_to_arc/tests/integration/test_workflow.py b/middleware/sql_to_arc/tests/integration/test_workflow.py index 0cb1151..4f87771 100644 --- a/middleware/sql_to_arc/tests/integration/test_workflow.py +++ b/middleware/sql_to_arc/tests/integration/test_workflow.py @@ -17,6 +17,7 @@ from middleware.sql_to_arc.context import WorkerContext from middleware.sql_to_arc.main import main from middleware.sql_to_arc.models import ( + AnnotationTableRow, AssayRow, ContactRow, InvestigationRow, @@ -215,7 +216,7 @@ def _prepare_data(data: list[dict[str, Any]] | None, target_cls: type[Any] | Non ) self.db.stream_annotation_tables = MagicMock( side_effect=lambda *args, **kwargs: self._as_gen( # noqa: ARG005 - annotations or [] + _prepare_data(annotations, AnnotationTableRow), AnnotationTableRow ) ) @@ -654,36 +655,40 @@ async def test_assay_with_annotations(workflow_tester: WorkflowTester) -> None: "target_type": "assay", "target_ref": assay_id, "table_name": "Sample Metadata", + "column_type": "input", + "column_io_type": "source_name", "row_index": 0, - "column_name": "Source Name", - "value": "Sample 1", + "cell_value": "Sample 1", }, { "investigation_ref": inv_id, "target_type": "assay", "target_ref": assay_id, "table_name": "Sample Metadata", + "column_type": "characteristic", + "column_annotation_term": "Species", "row_index": 0, - "column_name": "Characteristics [Species]", - "value": "Homo sapiens", + "cell_value": "Homo sapiens", }, { "investigation_ref": inv_id, "target_type": "assay", "target_ref": assay_id, "table_name": "Sample Metadata", + "column_type": "input", + "column_io_type": "source_name", "row_index": 1, - "column_name": "Source Name", - "value": "Sample 2", + "cell_value": "Sample 2", }, { "investigation_ref": inv_id, "target_type": "assay", "target_ref": assay_id, 
"table_name": "Sample Metadata", + "column_type": "characteristic", + "column_annotation_term": "Species", "row_index": 1, - "column_name": "Characteristics [Species]", - "value": "Mus musculus", + "cell_value": "Mus musculus", }, ] @@ -693,8 +698,6 @@ async def test_assay_with_annotations(workflow_tester: WorkflowTester) -> None: annotations=annotations, ) - # Currently, _process_annotation_tables in main.py is a placeholder. - # The test verifies that the pipeline handles the data gracefully. arcs = await workflow_tester.run() assert len(arcs) == 1 diff --git a/middleware/sql_to_arc/tests/unit/test_builder.py b/middleware/sql_to_arc/tests/unit/test_builder.py index 85b6f7b..34356ae 100644 --- a/middleware/sql_to_arc/tests/unit/test_builder.py +++ b/middleware/sql_to_arc/tests/unit/test_builder.py @@ -4,10 +4,12 @@ from typing import Any import pytest +from arctrl import CompositeHeader, IOType -from middleware.sql_to_arc.builder import build_single_arc_task +from middleware.sql_to_arc.builder import _IO_TYPE_MAP, _build_header, _build_single_cell, build_single_arc_task from middleware.sql_to_arc.context import ArcBuildData from middleware.sql_to_arc.models import ( + AnnotationTableRow, AssayRow, ContactRow, InvestigationRow, @@ -197,3 +199,155 @@ def test_build_ignores_irrelevant_data(sample_investigation: dict[str, Any]) -> # Check that styX is NOT in the graph sty_x = next((item for item in graph if item.get("@id") == "styX" or item.get("identifier") == "styX"), None) assert sty_x is None + + +# --------------------------------------------------------------------------- +# Helpers to build minimal AnnotationTableRow dicts +# --------------------------------------------------------------------------- + + +def _ann_row(**overrides: Any) -> dict[str, Any]: + """Return a minimal AnnotationTableRow dict, optionally overriding any field.""" + base: dict[str, Any] = { + "investigation_ref": "inv1", + "target_type": "study", + "target_ref": "sty1", + "table_name": 
"T", + "column_type": "input", + "row_index": 0, + "column_io_type": None, + "cell_value": None, + "cell_annotation_term": None, + "cell_annotation_uri": None, + "cell_annotation_version": None, + "column_annotation_term": None, + "column_annotation_uri": None, + "column_annotation_version": None, + "column_value": None, + } + base.update(overrides) + return base + + +def _row(data: dict[str, Any]) -> AnnotationTableRow: + """Validate a dict into an AnnotationTableRow.""" + return AnnotationTableRow.model_validate(data) + + +# --------------------------------------------------------------------------- +# IOType mapping tests +# --------------------------------------------------------------------------- + + +class TestIOTypeMapping: + """_IO_TYPE_MAP translates snake_case DB values to canonical ARCitect strings.""" + + @staticmethod + @pytest.mark.parametrize( + ("db_value", "canonical"), + [ + ("source_name", "Source Name"), + ("sample_name", "Sample Name"), + ("data", "Data"), + ("material_name", "Material"), + ], + ) + def test_map_covers_all_db_values(db_value: str, canonical: str) -> None: + """Each DB snake_case value maps to the expected canonical ARCitect string.""" + assert _IO_TYPE_MAP[db_value] == canonical + + @staticmethod + @pytest.mark.parametrize( + ("db_value", "expected_tag"), + [ + ("source_name", 0), # IOType.Source + ("sample_name", 1), # IOType.Sample + ("data", 2), # IOType.Data + ("material_name", 3), # IOType.Material + ], + ) + def test_build_header_input_uses_named_iotype(db_value: str, expected_tag: int) -> None: + """DB values must produce a named IOType (tag 0–3), never FreeType (tag 4).""" + key = ("input", db_value, None, None, None, None, None) + header = _build_header(key) + assert header is not None + assert header.is_input + assert header.fields[0].tag == expected_tag + + @staticmethod + @pytest.mark.parametrize( + ("db_value", "expected_tag"), + [ + ("sample_name", 1), + ("data", 2), + ("material_name", 3), + ], + ) + def 
test_build_header_output_uses_named_iotype(db_value: str, expected_tag: int) -> None: + """DB output values must also produce a named IOType, never FreeType.""" + key = ("output", db_value, None, None, None, None, None) + header = _build_header(key) + assert header is not None + assert header.is_output + assert header.fields[0].tag == expected_tag + + @staticmethod + def test_missing_io_type_defaults_to_source_name_for_input() -> None: + """Missing column_io_type falls back to 'Source Name' (tag 0) for input.""" + key = ("input", None, None, None, None, None, None) + header = _build_header(key) + assert header is not None + assert header.is_input + assert header.fields[0].tag == 0 + + @staticmethod + def test_missing_io_type_defaults_to_sample_name_for_output() -> None: + """Missing column_io_type falls back to 'Sample Name' (tag 1) for output.""" + key = ("output", None, None, None, None, None, None) + header = _build_header(key) + assert header is not None + assert header.is_output + assert header.fields[0].tag == 1 + + +# --------------------------------------------------------------------------- +# Data cell tests +# --------------------------------------------------------------------------- + + +class TestDataCellBuilding: + """_build_single_cell must emit CompositeCell.data() for data-typed IO columns.""" + + @staticmethod + def test_data_cell_has_correct_file_path() -> None: + """A data-typed output column must produce a DataCell with the file path set.""" + header = CompositeHeader.output(IOType.of_string("Data")) + row = _row(_ann_row(column_type="output", column_io_type="data", cell_value="raw.fastq.gz")) + cell = _build_single_cell(row, header) + assert cell.is_data + assert cell.AsData.FilePath == "raw.fastq.gz" + + @staticmethod + def test_data_cell_empty_when_no_cell_value() -> None: + """A data-typed column with no cell_value must produce an empty DataCell, not a crash.""" + header = CompositeHeader.output(IOType.of_string("Data")) + row = 
_row(_ann_row(column_type="output", column_io_type="data", cell_value=None)) + cell = _build_single_cell(row, header) + assert cell.is_data + assert cell.AsData.FilePath is None + + @staticmethod + def test_source_name_column_emits_free_text() -> None: + """A source_name input column must produce a free-text cell, not a DataCell.""" + header = CompositeHeader.input(IOType.of_string("Source Name")) + row = _row(_ann_row(column_type="input", column_io_type="source_name", cell_value="SourceA")) + cell = _build_single_cell(row, header) + assert cell.is_free_text + + @staticmethod + def test_sample_name_column_emits_free_text() -> None: + """A sample_name output column must produce a free-text cell, not a DataCell.""" + header = CompositeHeader.output(IOType.of_string("Sample Name")) + row = _row(_ann_row(column_type="output", column_io_type="sample_name", cell_value="SampleB")) + cell = _build_single_cell(row, header) + assert cell.is_free_text diff --git a/middleware/sql_to_arc/tests/unit/test_mapper.py b/middleware/sql_to_arc/tests/unit/test_mapper.py index e8aab7a..1f4bdc8 100644 --- a/middleware/sql_to_arc/tests/unit/test_mapper.py +++ b/middleware/sql_to_arc/tests/unit/test_mapper.py @@ -13,7 +13,6 @@ from pydantic import ValidationError from middleware.sql_to_arc.mapper import ( - map_annotation, map_assay, map_contact, map_investigation, @@ -197,9 +196,3 @@ def test_map_contact_invalid_roles() -> None: last_name="Smith", roles="{invalid-json}", # type: ignore[arg-type] ) - - -def test_map_annotation() -> None: - """Test the map_annotation helper function.""" - row = {"data": "test_value"} - assert map_annotation(row) == row diff --git a/pyproject.toml b/pyproject.toml index e5ce1b6..15ceb7c 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -131,16 +131,10 @@ exclude = [ ".ruff_cache", ] -# [[tool.mypy.overrides]] -# module = [ -# "middleware.api_client.*", -# "middleware.shared.*", -# ] -# ignore_missing_imports = true - -# arctrl has no type stubs and no 
py.typed marker — suppress the resulting noise +# arctrl has no type stubs and no py.typed marker — suppress the resulting noise. +# The Fable-transpiled internals live under arctrl.py.* so both patterns are needed. [[tool.mypy.overrides]] -module = ["arctrl"] +module = ["arctrl", "arctrl.*"] ignore_missing_imports = true # Docstring-Regeln aktivieren, damit es wie pylint C0114/15/16 meckert diff --git a/spec/configuration/spec.md b/spec/configuration/spec.md new file mode 100644 index 0000000..3a5bf19 --- /dev/null +++ b/spec/configuration/spec.md @@ -0,0 +1,50 @@ +# Configuration — Spec + +## Purpose + +Define how the converter reads, validates, and exposes configuration so that +all code has a single, typed source of truth for runtime settings. + +## Requirements + +- [ ] Load configuration from a YAML file at startup (path via CLI `-c`). +- [ ] Allow any field to be overridden by an environment variable or Docker + secret, using a consistent naming convention (`{PREFIX}_{FIELD_PATH}`). +- [ ] Validate and type-coerce all values via Pydantic before the application + starts. +- [ ] Expose the resulting `Config` object through explicit dependency injection + — no module reads environment variables or files after startup. +- [ ] All new settings MUST be added as typed, annotated fields in `Config` (or + a sub-model referenced by `Config`). No ad-hoc env reads, no global + variables. +- [ ] Secrets (`connection_string`, TLS keys) use `pydantic.SecretStr`. Access + via `.get_secret_value()` only at the point of use; never pass to `str()` + or log them. +- [ ] Configuration is loaded **once** in `main.py` and passed down via + function arguments. Never re-loaded during a run. 
+ +## Key Files + +| File | Role | +| ---- | ---- | +| `middleware/sql_to_arc/src/middleware/sql_to_arc/config.py` | Project `Config` class (extends `ConfigBase`) | +| `middleware/sql_to_arc/config.example.yaml` | Example configuration with all fields documented | +| `middleware/shared/config/config_wrapper.py` | `ConfigWrapper` — YAML + env/secret override engine (external) | +| `middleware/shared/config/config_base.py` | `ConfigBase` — shared base class with `log_level` and `otel` fields (external) | + +## Project Usage + +- Loaded once in `main.py`: `ConfigWrapper.from_yaml_file(args.config, prefix="SQL_TO_ARC")` +- Prefix `SQL_TO_ARC` applies to all env/secret overrides, e.g.: + - `SQL_TO_ARC_CONNECTION_STRING` + - `SQL_TO_ARC_API_CLIENT_API_URL` + - `/run/secrets/sql_to_arc_connection_string` +- `client.key` uses `tmpfs` in Docker Compose — never written to disk. +- In integration tests: mock at `middleware.sql_to_arc.main.ConfigWrapper.from_yaml_file` + and `middleware.sql_to_arc.main.Config.from_config_wrapper`. + +## Override Resolution, Type Coercion & Extension Rules + +See the `config-wrapper` skill (`.agents/skills/config-wrapper/SKILL.md`) for +the full override resolution order, type coercion rules, how to extend +`ConfigBase`, and general testing patterns. 
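The override naming convention from the configuration spec (`{PREFIX}_{FIELD_PATH}`, with the three `SQL_TO_ARC` examples listed under Project Usage) can be illustrated with a minimal sketch. This is not the `ConfigWrapper` implementation: `env_var_name` and `secret_path` are hypothetical helpers, and the sketch assumes that nested fields are joined with underscores and that Docker secret files use the lower-cased variable name under `/run/secrets`, as the examples suggest.

```python
from pathlib import PurePosixPath


def env_var_name(prefix: str, field_path: tuple[str, ...]) -> str:
    """Join the prefix and the (possibly nested) field names into an upper-case name."""
    return "_".join((prefix, *field_path)).upper()


def secret_path(prefix: str, field_path: tuple[str, ...], secrets_dir: str = "/run/secrets") -> str:
    """Assumed Docker secret location: the lower-cased variable name under secrets_dir."""
    return str(PurePosixPath(secrets_dir) / env_var_name(prefix, field_path).lower())


# The three overrides named in the spec, derived from their field paths.
print(env_var_name("SQL_TO_ARC", ("connection_string",)))
print(env_var_name("SQL_TO_ARC", ("api_client", "api_url")))
print(secret_path("SQL_TO_ARC", ("connection_string",)))
```

The real resolution order between YAML values, environment variables, and secrets is documented in the `config-wrapper` skill and is deliberately not modeled here.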
diff --git a/spec/demo-environment/design.md b/spec/demo-environment/design.md new file mode 100644 index 0000000..19fcef3 --- /dev/null +++ b/spec/demo-environment/design.md @@ -0,0 +1,70 @@ +# Demo Environment — Design + +## Service Topology + +```text +compose.demo.yaml + ├── postgres (postgres:15) + │ └── /docker-entrypoint-initdb.d/01-demo.sql ← bind-mounted + │ healthcheck: pg_isready -d rdi + │ + ├── middleware-api (python:3.12-slim) + │ ├── installs fastapi + uvicorn + arctrl at startup + │ ├── mounts demo_api_main.py read-only + │ ├── mounts demo_output/ read-write + │ └── healthcheck: python urllib.request → /live + │ + └── sql_to_arc (sql_to_arc:latest, built from repo) + depends_on: + postgres (service_healthy) + middleware-api (service_healthy) + env: SQL_TO_ARC_CONNECTION_STRING = postgresql+psycopg://…/rdi + config: /etc/sql_to_arc/config.yaml ← config.demo.yaml +``` + +## Key Decisions + +1. **DB init via `/docker-entrypoint-initdb.d/`, not a `db-init` sidecar** + — The postgres image runs SQL files in that directory during its own + startup, *before* the healthcheck passes. This eliminates the race + condition where `db-init` exiting (with code 0) caused compose to stop + postgres before `sql_to_arc` could connect. + +2. **`POSTGRES_DB: rdi` on the postgres service** + — `demo.sql` populates the `rdi` database directly. Setting `POSTGRES_DB` + makes postgres create it on first start so the init script can import + immediately without a `CREATE DATABASE` step. + +3. **Healthcheck uses Python's stdlib `urllib.request` instead of `curl`** + — `python:3.12-slim` does not include `curl`. Using Python avoids an + extra `apt-get install` step in the image. + +4. **`--exit-code-from sql_to_arc` (no `--abort-on-container-exit`)** + — Compose waits for `sql_to_arc` to exit, then propagates its exit code. + Without `--abort-on-container-exit`, long-running services (postgres, + middleware-api) are not killed prematurely while the converter is running. + +5. 
**Hardcoded `postgres/postgres` credentials** + — The demo environment has no security requirements and no access to + sops-encrypted secrets. Hardcoded defaults remove the need for any + `.env` file. + +6. **Path safety in `demo_api_main.py`** + — ARC identifiers come from user-controlled JSON. The `_derive_safe_arc_id` + function uses `os.path.realpath + startswith(base + os.sep)` (the + CodeQL-recommended pattern) before writing any files. Falls back to a + random hex ID when the identifier is unsafe or non-conforming. + +## Mock API (`demo_api_main.py`) + +```text +POST /v3/arcs?rdi={rdi} + → _derive_safe_arc_id() (sanitize identifier) + → write {arc_id}.payload.json (raw RO-Crate for inspection) + → ARC.from_rocrate_json_string() + → arc.WriteAsync(arc_dir) (arctrl writes ISA files) + → _chown_tree(arc_dir) (fix ownership for host user) + → return { arc_id, status, metadata } + +GET /live → { "status": "ok" } (healthcheck endpoint) +``` diff --git a/spec/demo-environment/spec.md b/spec/demo-environment/spec.md new file mode 100644 index 0000000..93cac21 --- /dev/null +++ b/spec/demo-environment/spec.md @@ -0,0 +1,38 @@ +# Demo Environment + +Provide a one-command, self-contained local environment that demonstrates +the full SQL-to-ARC pipeline end-to-end without requiring production +credentials, mTLS certificates, or network access to external services. 
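The path-safety decision recorded in the demo design for `demo_api_main.py` (resolve with `os.path.realpath`, then require the `base + os.sep` prefix, otherwise fall back to a random hex ID) can be sketched as follows. `safe_arc_dir` is a hypothetical stand-in for illustration, not the actual `_derive_safe_arc_id`; any additional check on the identifier format is omitted, and the 16-byte random ID length is an assumption.

```python
import os
import secrets
import tempfile


def safe_arc_dir(base_dir: str, identifier: str) -> str:
    """Return a directory path for the ARC that is guaranteed to lie inside base_dir."""
    base = os.path.realpath(base_dir)
    # Resolve "..", symlinks, and absolute identifiers before comparing prefixes.
    candidate = os.path.realpath(os.path.join(base, identifier))
    if candidate.startswith(base + os.sep):
        return candidate
    # Unsafe or empty identifier: fall back to a random hex ID under base.
    return os.path.join(base, secrets.token_hex(16))


with tempfile.TemporaryDirectory() as out:
    print(safe_arc_dir(out, "arc-1"))             # stays under the output directory
    print(safe_arc_dir(out, "../../etc/passwd"))  # replaced by a random ID under it
```

Comparing against `base + os.sep` rather than `base` alone avoids accepting sibling directories that merely share the prefix, such as `demo_output_evil` next to `demo_output`.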
+ +## Requirements + +- [ ] Start with a single command: `docker compose -f compose.demo.yaml up --build` +- [ ] Spin up PostgreSQL and import a small demo dataset (10 investigations) + without any manual steps +- [ ] Run a mock Middleware API (`middleware-api`) that accepts ARC + RO-Crate uploads and writes them to a local `demo_output/` directory +- [ ] Run the `sql_to_arc` converter against the demo DB and mock API +- [ ] Converter exits 0 when all 10 investigations are processed; compose + exits with the converter's exit code (`--exit-code-from sql_to_arc`) +- [ ] Written ARC files are accessible on the host via a bind-mounted + `demo_output/` volume +- [ ] File ownership of output files matches the host user (via + `LOCAL_UID`/`LOCAL_GID` environment variables) +- [ ] No secrets, encrypted files, or external network calls required + +## Out of Scope + +Production credentials, sops-encrypted secrets, mTLS, and Edaphobase +full-dump downloads are the responsibility of the dev environment +(`compose.dev.yaml`), not this demo. + +## Edge Cases + +`demo.sql` is missing → `postgres` init fails; compose exits non-zero +with a clear log message. + +ARC identifier in payload is unsafe (path traversal attempt) → mock API +falls back to a random ID, logs to console, does not write outside +`demo_output/`. + +`demo_output/` doesn't exist → mock API creates it on first request. diff --git a/spec/principles.md b/spec/principles.md new file mode 100644 index 0000000..77ef2e3 --- /dev/null +++ b/spec/principles.md @@ -0,0 +1,80 @@ +# FAIRagro SQL-to-ARC — Principles + +## Foundation Contract + +The authoritative schema contract for this project is +[docs/sql_to_arc_database_views.md](../docs/sql_to_arc_database_views.md). +It defines every database view, its columns, data types, required/optional +semantics, and cross-field constraints. 
**All features assume this document +as given.** Feature specs do not restate view definitions; they reference +this document when they need to cite a column or constraint. + +The converter never queries raw tables — only the views defined there. + +## Purpose + +Convert metadata from a relational SQL database into the +Annotated Research Context (ARC) format and publish the result to the +FAIRagro Middleware API. The converter runs as a one-shot batch process, +not as a long-running service. + +## Values + +**Correctness over speed** — valid ARC output matters more than throughput. +If a dataset cannot be mapped cleanly it must fail with a clear error, not +produce silent garbage. + +**Memory-safe by design** — the dataset is large (tens of thousands of +investigations). Every architectural decision must keep peak RAM bounded and +predictable. Assume the host has limited memory. + +**Failure isolation** — one bad investigation must not abort the entire run. +Stats and error IDs are collected and reported at the end. + +**Stateless batch process** — the converter stores no state between runs. +No cache, no lock files, no database writes. The only persistent output is +what the Middleware API receives. + +**Security by default** — inputs from external sources (database, API, +config) are treated as untrusted. Follow OWASP best practices: validate +before use, fail closed, apply least privilege. + +## Constraints + +- Python 3.12. No type-unsafe workarounds; all public APIs are fully typed. +- `uv` for dependency management. Never call `pip` directly in production code. +- `os.environ` must never be accessed directly; use `Config` / `ConfigWrapper`. +- All SQL lives inside the `Database` class. Views are the contract; the + converter never queries raw tables. +- Worker processes communicate via JSON strings only (no shared objects, no + pickle of domain objects across the IPC boundary). 
+- Code quality gates: Ruff (lint + format), mypy, pylint, bandit, pytest — + all must pass before merge. Every new feature requires matching tests. +- No `noqa`/`type: ignore` suppressions unless technically unavoidable. +- Validation belongs in the Pydantic model where possible. Use `Literal` types or + `@field_validator` to enforce valid values — a `ValidationError` triggers the + standard skip-with-warning path in `database.py`. Only write custom warning code + outside Pydantic when a spec violation should log a warning but NOT skip the row + (rescue scenario). + +## Module Dependency Graph + +```text +main → processor → builder → mapper + ↘ database + ↘ api_client (shared lib) +config ←── all modules (read-only) +stats ←── processor, database (write) +``` + +Circular imports are forbidden. `mapper` and `builder` must not import +`database` or `processor`. + +## Extension Points + +| Need | Where to change | +| --- | --- | +| New DB entity | Add view, model in `models.py`, stream method in `database.py`, mapper in `mapper.py` | +| New config value | Extend `Config` in `config.py` with Pydantic field | +| New mapper function | Add to `mapper.py`, re-export from `builder.py` | +| New ARC structure | Extend `builder.py` helper functions | diff --git a/spec/tooling-consistency/spec.md b/spec/tooling-consistency/spec.md new file mode 100644 index 0000000..90c7910 --- /dev/null +++ b/spec/tooling-consistency/spec.md @@ -0,0 +1,41 @@ +# Tooling Consistency + +The VS Code editor tools, the pre-commit hooks, and the GitHub CI workflows +must all report identical results for the same code. All three execution +environments must draw their configuration from a single source of truth +(`pyproject.toml` or the respective config file at the repo root) so that a +passing local commit never fails in CI, and the editor never silently hides an +issue that the hook or workflow would catch. 
+ +## Requirements + +- [ ] Every quality tool (Ruff, mypy, pylint, bandit) is invoked via + `uv run <tool>` in all three environments (VS Code extension, pre-commit + hook, CI workflow step), ensuring the same installed version is used. +- [ ] Every tool reads its configuration exclusively from `pyproject.toml` (or + the repo-root config file for tools that do not support `pyproject.toml`, + e.g. `.bandit`); no per-environment config overrides are permitted. +- [ ] VS Code extensions are configured to use `importStrategy: + fromEnvironment` (or equivalent) so they pick up the same binary and + version as the `uv run` invocations in hooks and workflows. +- [ ] VS Code extension settings that reference a config file pass the same + path that the hook and workflow use (e.g. `--config-file pyproject.toml`). +- [ ] If a third-party library has no type stubs and no `py.typed` marker, the + suppression is declared once — in `pyproject.toml` + `[[tool.mypy.overrides]]` — not scattered across individual `# type: + ignore` comments. This ensures the suppression applies equally in the + editor, the hook, and CI. +- [ ] Adding a new quality tool to any one environment requires adding it to + all three in the same commit. + +## Edge Cases + +A library subpackage has a different dotted path from the top-level package +(e.g. `arctrl.py.Core.*` vs. `arctrl`) → the mypy override must use a glob +that covers all submodules (`["arctrl", "arctrl.*"]`), otherwise the VS Code +daemon silences the error while the pre-commit hook still fails. + +A VS Code extension runs its tool outside the `uv` virtual environment → +`importStrategy: fromEnvironment` combined with `python.defaultInterpreterPath` +pointing at the `.venv` interpreter resolves this; the extension then uses the +same binary and config as `uv run`.