feat(elt-pipelines): Add initial project with example pipeline (#368)

WHTaylor · web-flow · commit da6cf38bdb47 · 2026-06-29T16:00:42.000+01:00
ref #321

Creates an `elt-pipelines` project with a `statusdisplay` pipeline, which uses `elt-common` to ingest data from the [ISIS cycles endpoint](https://status.isis.stfc.ac.uk/api/cycles). The pipeline can be run using the instructions from the README.

- `elt-common` is currently included as a dependency in `elt-pipelines` using a relative path pointing at the package in the parent folder. This makes it easy to work on both locally, but means anything wanting to run pipelines needs both packages in its working directory; don't think it's a big deal, but maybe not ideal? Are we aiming to publish `elt-common` to PyPI so we can use it as a 'normal' dependency?
- The new pipeline is ingesting into an `elt_cycles` table for testing purposes. Once we want to migrate to this pipeline in production, it should probably start ingesting into `cycles` instead, but because it's using a different schema from the current DLT pipeline we'll need to replace the table entirely, so there will be a bit of extra work needed at the time.
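
For anyone reviewing without running the pipeline, here is a minimal standard-library sketch of the cleaning and serialisation steps in `statusdisplay.py`. The sample payload is hypothetical: only the `phases`, `start` and `end` keys come from the code in this PR, and the `name` field and the date values are assumptions made for illustration. The real pipeline fetches the payload with `requests` and passes the newline-delimited JSON to `pyarrow.json.read_json`; neither step is reproduced here.

```python
from datetime import datetime
import json


def reformat(date_string):
    """Convert an ISO 8601 date string into 'YYYY-MM-DD HH:MM:SS'."""
    return datetime.fromisoformat(date_string).strftime("%Y-%m-%d %H:%M:%S")


def clean(data):
    """Rewrite the 'start' and 'end' of every phase in place, skipping missing keys."""
    for cycle in data:
        for phase in cycle["phases"]:
            for key in ("start", "end"):
                if key in phase:
                    phase[key] = reformat(phase[key])
    return data


# Hypothetical payload: 'name' and the date values are assumptions,
# not taken from the real endpoint.
sample = [
    {
        "name": "example-cycle",
        "phases": [
            {"start": "2026-01-05T08:30:00", "end": "2026-02-13T08:30:00"},
            {"start": "2026-02-13T08:30:00"},  # a phase with no 'end' yet
        ],
    }
]

cleaned = clean(sample)

# One JSON document per line, which is the input shape the pipeline
# hands to pyarrow.json.read_json.
newline_delimited = "\n".join(json.dumps(row) for row in cleaned)
print(newline_delimited)
```

Note that `clean` mutates its input, as the version in the diff does; that seems fine here because the fetched payload is not reused elsewhere.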
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -12,10 +12,10 @@ repos:
     hooks:
       # Run the linter.
       - id: ruff-check
-        files: (elt-common|warehouses)
+        files: (elt-common|elt-pipelines|warehouses)
       # Run the formatter.
       - id: ruff-format
-        files: (elt-common|warehouses)
+        files: (elt-common|elt-pipelines|warehouses)
   - repo: https://github.com/shellcheck-py/shellcheck-py
     rev: v0.11.0.1
     hooks:
diff --git a/elt-pipelines/.gitignore b/elt-pipelines/.gitignore
@@ -0,0 +1,7 @@
+# ignore basic python artifacts
+.env
+**/__pycache__/
+**/*.py[cod]
+**/*$py.class
+**/build/
+**/*.egg-info/
diff --git a/elt-pipelines/README.md b/elt-pipelines/README.md
@@ -0,0 +1,52 @@
+# elt-pipelines
+
+Pipelines for ingesting data from various sources into Iceberg catalogs using elt-common.
+
+**Under construction** - this project is being developed as a replacement for the existing DLT pipelines.
+[See here](https://github.com/ISISNeutronMuon/analytics-data-platform/issues/321) for details.
+
+## Development setup
+
+Development requires the following tools:
+
+- [uv](https://docs.astral.sh/uv/): Used to manage both Python installations and dependencies
+
+### Setting up a Python virtual environment
+
+Once `uv` is installed, create an environment, activate it, and install dependencies with:
+
+```bash
+> uv venv
+> source .venv/bin/activate
+> uv sync
+```
+
+Pipelines can declare optional dependencies in `pyproject.toml` - for example, `statusdisplay` uses `requests` for
+fetching data. To install any additional dependencies for that specific pipeline, use:
+
+```bash
+> uv sync --extra statusdisplay
+```
+
+## Running a pipeline
+
+Pipelines are run using the `elt` CLI tool. As an example, with the package as the current working directory,
+`elt run facility_ops statusdisplay` will run the statusdisplay pipeline. See `elt -h` for full usage.
+
+## Directory structure
+
+The project uses the following directory structure:
+
+```txt
+elt-pipelines/
+|-- <target warehouse>/
+|    |-- ingest/
+|    |    |-- <domain>/
+|    |    |    |-- <job name>/
+|    |    |    |    |-- <job name>.py
+```
+
+- Each 'target warehouse' is the name of an Iceberg warehouse. The data ingested by the pipelines inside that directory ends up in that warehouse.
+- The directory structure from `ingest` down is what is required for `elt-common` to be able to run 'ingest' pipelines.
+- Data from ingest pipelines is considered 'raw' data, and is loaded into a warehouse suffixed with `_landing`.
+- Under construction: Each warehouse will also have a `transform` subdirectory containing pipelines for converting the raw data into its final state in the target warehouse.
diff --git a/elt-pipelines/facility_ops/ingest/accelerator/statusdisplay/statusdisplay.py b/elt-pipelines/facility_ops/ingest/accelerator/statusdisplay/statusdisplay.py
@@ -0,0 +1,57 @@
+from datetime import datetime
+import io
+import json
+import pyarrow.json
+import requests
+
+from elt_common.extract import BaseExtract, ResourceProperties, ResourceWriteProperties
+
+CYCLES_URL = "https://status.isis.stfc.ac.uk/api/cycles"
+
+
+class Extract(BaseExtract):
+    def extract_resource_properties(self):
+        yield (
+            "elt_cycles",
+            ResourceProperties(
+                extractor=extract_cycles,
+                write_properties=ResourceWriteProperties(write_mode="replace"),
+                watermark_column=None,
+            ),
+        )
+
+
+def extract_cycles(_):
+    data = clean(fetch())
+    newline_delimited = "\n".join(json.dumps(row) for row in data)
+
+    with io.BytesIO(newline_delimited.encode()) as f:
+        yield pyarrow.json.read_json(f)
+
+
+def fetch():
+    try:
+        response = requests.get(CYCLES_URL, timeout=20)
+    except requests.Timeout as ex:
+        raise RuntimeError("Timed out when fetching cycles") from ex
+
+    if not response.ok:
+        raise RuntimeError(f"Failed to fetch cycles - {response.reason}")
+
+    return response.json()
+
+
+def reformat(date_string):
+    """Convert a date from ISO format into one that pyarrow will convert into a timestamp"""
+    return datetime.fromisoformat(date_string).strftime("%Y-%m-%d %H:%M:%S")
+
+
+def clean(data):
+    for cycle in data:
+        for phase in cycle["phases"]:
+            if "start" in phase:
+                phase["start"] = reformat(phase["start"])
+            if "end" in phase:
+                phase["end"] = reformat(phase["end"])
+
+    return data
diff --git a/elt-pipelines/pyproject.toml b/elt-pipelines/pyproject.toml
@@ -0,0 +1,24 @@
+[project]
+name = "elt-pipelines"
+version = "0.1.0"
+description = "Pipelines for ingesting data into Iceberg catalogs"
+readme = "README.md"
+requires-python = ">=3.13"
+dependencies = [
+    "elt-common",
+    "pydantic-settings>=2.14.2",
+]
+
+[project.optional-dependencies]
+statusdisplay = [
+    "pyarrow>=24.0.0",
+    "requests>=2.34.2",
+]
+
+[tool.uv.sources]
+elt-common = { path = "../elt-common", editable = true }
+
+[dependency-groups]
+dev = [
+    "prek>=0.4.5",
+]
diff --git a/elt-pipelines/uv.lock b/elt-pipelines/uv.lock