Resume from stale batches (#155)

adammcdonagh · web-flow · commit 25a76c863d4a · 2026-04-20T00:51:48.000+01:00
* Add stale batch retry logic

* Bump black

* bump version v26.10.0 -> v26.15.0
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -32,7 +32,7 @@ repos:
     hooks:
       - id: prettier
   - repo: https://github.com/psf/black
-    rev: 26.3.0
+    rev: 26.3.1
     hooks:
       - id: black
         args:
@@ -41,6 +41,10 @@ repos:
     rev: v2.2.2
     hooks:
       - id: codespell
+        exclude: >
+          (?x)^(
+            .*/pgp\.py
+          )$
         args:
           - --ignore-words-list=
           - --skip="./.*,*.csv,*.json"
diff --git a/AGENTS.md b/AGENTS.md
@@ -0,0 +1,202 @@
+# AGENTS.md — Open Task Framework (OTF)
+
+This file is written for autonomous agents and maintainers who will modify, test, and extend the Open Task Framework codebase; it provides focused, actionable context so automated systems can make safe, verifiable changes.
+
+## Table of contents
+
+- [AGENTS.md — Open Task Framework (OTF)](#agentsmd--open-task-framework-otf)
+  - [Table of contents](#table-of-contents)
+  - [Quick scan](#quick-scan)
+  - [High-level summary](#high-level-summary)
+  - [Architecture and main components](#architecture-and-main-components)
+  - [Contracts and data shapes — precise](#contracts-and-data-shapes--precise)
+  - [Variable resolution \& templates](#variable-resolution--templates)
+  - [Developer and agent workflow — run / test / iterate](#developer-and-agent-workflow--run--test--iterate)
+  - [Concrete examples (copy-and-paste)](#concrete-examples-copy-and-paste)
+  - [Tests, debugging, and logs](#tests-debugging-and-logs)
+  - [CI and release pointers](#ci-and-release-pointers)
+  - [Best practices for automated agents (rules)](#best-practices-for-automated-agents-rules)
+  - [Where to look for related code](#where-to-look-for-related-code)
+  - [Change summary and contact](#change-summary-and-contact)
+
+## Quick scan
+
+- Entry point(s): `src/opentaskpy/taskrun.py`, `src/opentaskpy/taskhandlers/taskhandler.py`
+- Schemas: `src/opentaskpy/config/schemas/` (validation source of truth)
+- Remote handlers: `src/opentaskpy/remotehandlers/` (SSH/SFTP/local/email/dummy)
+- Plugins: `src/opentaskpy/plugins/` (lookup family)
+- Run unit tests: `pytest tests/ -q`
+
+## High-level summary
+
+Open Task Framework is a modular Python framework that validates and runs tasks defined as JSON documents. Tasks describe either an execution (run a command) or a transfer (move files). The framework uses pluggable remote handlers (execution and transfer) to support protocols like SSH, SFTP, WinRM, and cloud storage services.
+
+Key responsibilities:
+
+- Validate task payloads against JSON schemas.
+- Orchestrate execution and transfer flows via `taskhandler` components.
+- Provide well-encapsulated protocol handlers: concrete classes implement a consistent handler interface so the taskhandler layer can be protocol-agnostic.
+- Provide test fixtures (unit and integration) to verify both logic and environment interactions.
+
+## Architecture and main components
+
+1. Core package: `src/opentaskpy`
+
+   - `taskhandler` — central orchestration logic: accepts a validated task, decides whether to call execution or transfer workflow, orchestrates staging and cleanup, and returns standardized results.
+   - `remotehandlers` — contains abstract base classes and concrete implementations. Expect classes following the naming convention: `*Transfer` and `*Execution`.
+   - `config/schemas` — JSON schemas that the code uses to validate task payloads. Schemas are authoritative; runtime assumes inputs match them.
+   - `otflogging` — logging helpers used across the project for consistent log formatting and task-scoped contexts.
+
+2. Tests and fixtures:
+
+   - `tests/` — pytest test suite with unit tests (fast) and integration tests (may require docker-compose fixtures).
+   - `test/` — helper scripts and docker-compose configurations used to stand up test services (sshd, mock services). Look for `createTestFiles.sh`, `createTestDirectories.sh`, and `setupSSHKeys.sh`.
+
+3. Addons: repository-level addons live in sibling repos. Each addon should follow the same shape: `remotehandlers` implementations, config schemas, tests, and an optional `AGENTS.md` describing the addon details (example: winrm addon).
+
+## Contracts and data shapes — precise
+
+Task manifest (canonical fields):
+
+- `id` (string): unique task identifier
+- `type` (string): one of `transfer`, `execution`, `batch`
+- `source` / `destination` (objects): for transfers, each contains `hostname`, `directory`, `fileRegex`, and `protocol`
+- `hostname`, `directory`, `command` (for execution tasks)
+- `protocol` (object): minimally `{ "name": "<python-class-path>", "credentials": {...}, ... }`
+
+Protocol object details:
+
+- `name` (string): importable Python class path implementing a Transfer or Execution handler
+- `credentials` (object): fields are protocol-specific (e.g., `username`/`password`, `cert_pem`, `transport`)
+- `server_cert_validation` / `port` / `transport` are optional common fields used by multiple handlers
+
+Handler interface expectations (implementations MUST):
+
+- Transfer handlers expose: `list_files(spec)`, `pull_file(spec, dest)`, `push_file(spec, src)`, `move_file(spec)`, `delete_file(spec)`, plus bulk helpers like `pull_files_to_worker()` and `push_files_from_worker()`
+- Execution handlers expose: `execute(spec)` returning a controlled stream or result object, plus `kill(pid)` to request termination. Results must include `exit_code`, `stdout`, `stderr`. If a PID token is emitted by the remote, include `pid` in results.
+
+Error model:
+
+- Handlers should raise specific exceptions for common error classes (validation error, auth error, network I/O error). Taskhandler should catch these and translate them into standardized result objects for callers and tests.
+
+If you change these shapes, update the JSON schemas in `src/opentaskpy/config/schemas/` and add/update tests in `tests/`.
+
+## Variable resolution & templates
+
+- File types: configuration and task payloads are JSON-based only. Files are either plain `.json` or Jinja2 templates with a `.json.j2` extension. YAML is not used for task/config payloads.
+- Pipeline: when a task/template file is loaded the system performs this pipeline:
+  1. Read the `.json` or `.json.j2` file.
+
+2. If it is a Jinja2 template (`.json.j2`), render it with the available context (variables, plugin helpers, and environment values).
+3. Parse the rendered output as JSON.
+4. Validate the parsed JSON against the appropriate schema in `src/opentaskpy/config/schemas/`.
+
+- Template context and helpers: lookup plugins (see `src/opentaskpy/plugins/lookup`) and other small helpers/filters are available to templates to compute values at render time. Templates must render to valid JSON — agents should always validate rendered output before runtime.
+
+- Guidance for agents:
+  - When editing templates, ensure the rendered output is syntactically valid JSON (use a local render step in tests).
+  - Do not introduce template constructs that rely on secrets stored in-repo; use environment variables or test fixtures for secret injection.
+  - If new helpers/plugins are required by templates, add them under `src/opentaskpy/plugins/` and include unit tests that exercise rendering.
+
+## Developer and agent workflow — run / test / iterate
+
+Local dev quickstart (recommended):
+
+1. Create and activate a virtual environment
+
+```bash
+python -m venv .venv
+source .venv/bin/activate
+pip install -e .[test]
+```
+
+2. Run a focused unit test
+
+```bash
+pytest tests/test_file_helper.py::test_some_helper -q
+```
+
+3. Run full unit test suite
+
+```bash
+pytest tests/ -q
+```
+
+Integration tests (requires docker):
+
+```bash
+cd test
+./createTestDirectories.sh && ./createTestFiles.sh
+docker-compose up -d
+./setupSSHKeys.sh
+cd ..
+pytest tests/ -q
+```
+
+CI notes:
+
+- The project uses `pyproject.toml` for packaging and `pytest.ini` for test config. CI should install dependencies with `pip install -e .[test]` and run `pytest -q`.
+- Integration tests that depend on docker-compose should be gated behind a separate job that runs `cd test && docker-compose up -d` first.
+
+## Concrete examples (copy-and-paste)
+
+Example task manifest — execution
+
+```json
+{
+  "id": "task-123",
+  "type": "execution",
+  "hostname": "127.0.0.1",
+  "directory": "/tmp",
+  "command": "echo hello",
+  "protocol": {
+    "name": "ssh",
+    "credentials": { "username": "test", "keyFile": "path/to/key" }
+  }
+}
+```
+
+Example transfer protocol snippet (schema-driven)
+
+```json
+{
+  "name": "sftp",
+  "credentials": { "username": "user", "keyFile": "path/to/key" }
+}
+```
+
+## Tests, debugging, and logs
+
+- Test fixtures live in `tests/fixtures` or are defined in `tests/conftest.py`. Reuse existing fixtures whenever possible.
+- Integration test logs and artifacts created by `test/` helper scripts are placed under `test/testLogs/` for easy inspection.
+- Logging format: use `otflogging` helpers to include `task_id` and `hostname` in logs. New code should add context to loggers so tests can assert on log markers if needed.
+
+Common debugging steps:
+
+- Reproduce failing test locally with `-k <test_name>` and `-s` to see stdout/stderr streaming.
+- Inspect `test/testLogs/` for integration failures.
+- For networking/auth issues, replicate the protocol flow manually in a small script that uses the same handler class to connect and run a simple command.
+
+## CI and release pointers
+
+- Ensure `pyproject.toml` and `MANIFEST.in` contain any new package data files you add.
+- Bump versions according to semantic versioning and update `CHANGELOG.md` when releasing.
+- Unit tests should be quick; heavy integration tests should run in separate CI jobs that provision docker services.
+
+## Best practices for automated agents (rules)
+
+1. Run the unit tests that exercise your changed files before creating a PR. If you cannot reproduce remote integration locally, add/modify only unit tests or mock the remote layer.
+2. Never add secrets to the repo. Use environment variables or test fixtures that generate ephemeral keys.
+3. If changing a JSON schema, update the schema file and add at least one positive and one negative test case.
+4. Limit scope of edits in a single PR: small, focused changes are easier to review and test.
+
+## Where to look for related code
+
+- `src/opentaskpy/taskhandlers/taskhandler.py` and `src/opentaskpy/taskhandlers/`
+- `src/opentaskpy/remotehandlers/` (SSH, SFTP, WinRM addons live in separate repos but follow the same interface)
+- `src/opentaskpy/config/schemas/` — JSON schemas (canonical)
+- `tests/`, `tests/conftest.py` and `test/` helper scripts
+
+## Change summary and contact
+
+This file was created to give automated agents a reliable starting point for code navigation, safe edits, and test execution. If you're an external maintainer, open issues or PRs in this repository; include failing test output and the `-k` test used to reproduce locally.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,18 @@
 # Changelog
 
+# v26.15.0
+
+- Add `OTF_STALE_RUNNING_LOG_SECONDS` environment variable to allow resuming of batches from a `_running` log file that is older than the specified number of seconds.
+- Update the GPG key used for testing to one with a 10-year expiry to stop tests failing.
+- Bump minimum dependency versions to:
+  - `jsonpath-ng >= 1.8`
+  - `jsonschema >= 4.26`
+  - `paramiko >= 4.0`
+  - `requests >= 2.33`
+  - `referencing >= 0.37`
+  - `tenacity >= 9.1`
+  - `python-gnupg >= 0.5.6`
+
 # No release
 
 - Update docs
diff --git a/README.md b/README.md
@@ -182,6 +182,7 @@ These are some environment variables that can be used to customise the behaviour
 - `OTF_VARIABLES_FILE` - Override the default variables file. This is useful when you want to use the same job definitions, but point at a different environment with different variables, for example. Multiple files can be specified comma-separated. If variables appear in more than one file, they will be resolved from the last entry found.
 - `OTF_PARAMIKO_ULTRA_DEBUG` - Enables the hidden `ultra_debug` option for Paramiko. This will log all SSH communications to the console, and can be very verbose, so be careful when using this. Set to `1` to enable (This is for SFTP only)
 - `OTF_LAZY_LOAD_VARIABLES` - Enables lazy loading of variables. This will only load variables that are used by the task definition. This can be useful if you have a large number of variables, and you only need a few of them.
+- `OTF_STALE_RUNNING_LOG_SECONDS` - When resuming a batch, this variable defines how long a `_running` log file must be inactive for before it's considered stale, and the batch will attempt to resume from it. This is to prevent resuming from a log file that is still being written to by a currently running batch. Default behaviour ignores all `_running` log files, and only resumes using `_failed` log files. Set to `0` to disable this check, and allow resuming from any log file regardless of last modification time, or something like `300` to resume a crashed batch that's at least 5 minutes old.
 
 ## Logging
 
@@ -493,6 +494,8 @@ Each task in a batch has an `order_id`, this is a unique ID for each task, and i
 
 As a batch task runs, it writes out the status of each sub task to its log file. If a failure occurs, and the batch is rerun with the same arguments, it will attempt to resume from the point of failure. To determine the previous state, the batch handler will look only at logs that are from the current date. This is to ensure that if something failed at 1am yesterday, but hasn't been rerun, we won't try to recover from the point of failure. Sometimes you might want to recover regardless; this can be done by passing in the date of the log files that you want to recover from, using the environment variable `OTF_BATCH_RESUME_LOG_DATE` in the format `YYYYMMDD`. This will instruct the batch handler to look at logs with that date instead.
 
+By default a batch will only ever resume from a `_failed` run. If you want to resume from a `_running` log file (perhaps the previous run crashed), you can set the environment variable `OTF_STALE_RUNNING_LOG_SECONDS` to the number of seconds a `_running` log file must be inactive for before it's considered stale. This will then cause the resume logic to read the `_running` log file if it's at least as old as the number of seconds specified.
+
 # Development
 
 This repo has been primarily configured to work with GitHub Codespaces devcontainers, though it can obviously be used directly on your machine too.
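Reviewer sketch (not part of the commit): the three cases the README describes for `OTF_STALE_RUNNING_LOG_SECONDS` can be summarised as below. The function name `resume_mode` and the returned labels are hypothetical; only the variable name and the unset / `0` / positive semantics come from the README text and the `otflogging.py` change.

```python
def resume_mode(environ: dict[str, str]) -> str:
    """Describe which log files a resume would consider (illustrative only)."""
    raw = environ.get("OTF_STALE_RUNNING_LOG_SECONDS")
    if raw is None:
        # Default: _running logs are ignored entirely.
        return "failed-only"
    secs = int(raw)
    if secs < 0:
        # Negative values behave like the unset default in the diff.
        return "failed-only"
    if secs == 0:
        # Zero disables the age check: any _running log is eligible.
        return "failed-or-any-running"
    # Positive: a _running log must have been idle for at least `secs` seconds.
    return f"failed-or-running-idle-{secs}s"
```

For example, `resume_mode({"OTF_STALE_RUNNING_LOG_SECONDS": "300"})` corresponds to the README's "at least 5 minutes old" case.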
diff --git a/pyproject.toml b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "opentaskpy"
-version = "v26.10.0"
+version = "v26.15.0"
 authors = [{ name = "Adam McDonagh", email = "adam@elitemonkey.net" }]
 license-files = [ "LICENSE" ]
 
@@ -25,13 +25,13 @@ keywords = [
 ]
 dependencies = [
     "jinja2 >= 3.1",
-    "jsonpath-ng >= 1.5",
-    "jsonschema >= 4.17",
-    "paramiko >= 3.0",
-    "requests >= 2.28",
-    "referencing >= 0.29.1",
-    "tenacity >= 8.2.3",
-    "python-gnupg >= 0.5.2",
+    "jsonpath-ng >= 1.8",
+    "jsonschema >= 4.26",
+    "paramiko >= 4.0",
+    "requests >= 2.33",
+    "referencing >= 0.37",
+    "tenacity >= 9.1",
+    "python-gnupg >= 0.5.6",
     "omegaconf >= 2.3.0",
 ]
 description = "A framework for automation execution of commands and transferring files between hosts"
@@ -42,14 +42,14 @@ requires-python = ">=3.11"
 dev = [
     "types-requests >=2.28",
     "types-paramiko >=3.0",
-    "black == 26.3.0",
+    "black == 26.3.1",
     "isort",
     "pytest",
     "bumpver",
     "pytest-shell",
     "lovely-pytest-docker",
     "pre-commit",
-    "pylint >= 3.2.2",
+    "pylint >= 4.0",
     "pydantic",
     "mypy",
     "ruff",
@@ -71,7 +71,7 @@ otf-batch-validator = "opentaskpy.cli.batch_validator:main"
 profile = 'black'
 
 [tool.bumpver]
-current_version = "v26.10.0"
+current_version = "v26.15.0"
 version_pattern = "vYY.WW.PATCH[-TAG]"
 commit_message = "bump version {old_version} -> {new_version}"
 commit = true
@@ -385,8 +385,6 @@ ignore = [
     "D407", # Section name underlining
     "E501", # line too long
     "E731", # do not assign a lambda expression, use a def
-    # Ignored due to performance: https://github.com/charliermarsh/ruff/issues/2923
-    "UP038", # Use `X | Y` in `isinstance` call instead of `(X, Y)`
 ]
 
 
diff --git a/src/opentaskpy/otflogging.py b/src/opentaskpy/otflogging.py
@@ -5,6 +5,7 @@
 import os
 import re
 import threading
+import time
 from datetime import datetime
 
 OTF_LOG_FORMAT = (
@@ -287,12 +288,18 @@ def get_latest_log_file(task_id: str, task_type: str) -> str | None:
         logs
     """
     log_file_name = _define_log_file_name(task_id, task_type)
+
+    stale_running_log_secs = int(os.environ.get("OTF_STALE_RUNNING_LOG_SECONDS", -1))
+
     # Obviously the date/time in the filename needs to be replaced with the latest
     # log file
     # Replace the prefix with a regex wildcard
     log_file_name = log_file_name.replace(os.environ["OTF_LOG_RUN_PREFIX"], ".*")
-    # Also, we don't want to limit to running jobs, only failed or successful ones
-    log_file_name = log_file_name.replace("_running", "(_failed)*")
+    # Also, we don't want to limit to running jobs, only failed or successful ones, unless we're looking for stale logs
+    log_file_name = log_file_name.replace(
+        "_running",
+        "(_failed)*" if stale_running_log_secs < 0 else "(_failed|_running)*",
+    )
 
     logger.debug(f"Looking for log file: {log_file_name}")
     if not os.path.exists(os.path.dirname(log_file_name)):
@@ -334,14 +341,35 @@ def get_latest_log_file(task_id: str, task_type: str) -> str | None:
     # Sort the list by the date/time in the filename
     log_files.sort(key=lambda x: datetime.strptime(x.split("_")[0], "%Y%m%d-%H%M%S.%f"))
     logger.debug(f"Log files after sorting: {log_files}")
+
+    # If we're looking for stale logs, then we want to filter out any _running logs that are NOT yet stale
+    # (i.e. newer than stale_running_log_secs). We need to use the mtime of the file, not the timestamp
+    # in the filename, since a live process keeps updating mtime whereas a crashed process stops.
+    if stale_running_log_secs >= 0:
+        log_files = [
+            f
+            for f in log_files
+            if "_running" not in f
+            or os.path.getmtime(os.path.join(os.path.dirname(log_file_name), f))
+            <= time.time() - stale_running_log_secs
+        ]
+        logger.debug(
+            f"Log files after filtering out non-stale running logs: {log_files}"
+        )
+
     # Get the latest log file
     if log_files:
         log_file_name = f"{os.path.dirname(log_file_name)}/{log_files[-1]}"
         logger.info(f"Latest log file: {log_file_name}")
-        # If the last log was a failure, return that, otherwise we just start from scratch, so return nothing
+        # If the last log was a failure, return that
         if "_failed" in log_file_name:
             return log_file_name
+        # If the last log is a stale running log, return it so the batch can resume from it
+        if "_running" in log_file_name and stale_running_log_secs >= 0:
+            logger.info("Stale running log file found. Resuming batch from it.")
+            return log_file_name
 
+        # Otherwise we just start from scratch, so return nothing
         logger.info("No failed log file found. Starting from scratch.")
 
     return None
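Reviewer sketch (not part of the commit): the mtime-based staleness filter added above can be exercised in isolation like this. `filter_stale` and `demo` are hypothetical helpers, and the log file names are made up for illustration; the comparison `mtime <= now - stale_secs` mirrors the list comprehension in the diff. In the real code a negative threshold excludes `_running` files earlier, via the filename regex, which the first branch here only approximates.

```python
import os
import tempfile
import time


def filter_stale(log_dir: str, log_files: list[str], stale_secs: int) -> list[str]:
    """Keep non-running logs, plus _running logs idle for at least stale_secs."""
    if stale_secs < 0:
        # Approximates the default: _running logs are never candidates.
        return [f for f in log_files if "_running" not in f]
    cutoff = time.time() - stale_secs
    return [
        f
        for f in log_files
        if "_running" not in f
        or os.path.getmtime(os.path.join(log_dir, f)) <= cutoff
    ]


def demo() -> list[str]:
    """Create one old and one fresh _running log, then filter with a 300s threshold."""
    with tempfile.TemporaryDirectory() as log_dir:
        names = [
            "20260420-010000.000000_batch_running",  # hypothetical file name
            "20260420-020000.000000_batch_running",  # hypothetical file name
        ]
        for name in names:
            open(os.path.join(log_dir, name), "w").close()
        # Backdate the first file by 10 minutes so it counts as stale.
        old = time.time() - 600
        os.utime(os.path.join(log_dir, names[0]), (old, old))
        return filter_stale(log_dir, names, 300)
```

Using the file's mtime rather than the timestamp in its name matters because a live batch keeps writing to its log, so only a crashed run's log stops being modified.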
diff --git a/tests/fixtures/pgp.py b/tests/fixtures/pgp.py
diff --git a/tests/fixtures/ssh_clients.py b/tests/fixtures/ssh_clients.py
diff --git a/tests/test_logging.py b/tests/test_logging.py