ADR: Process error: section
#7233
Open: bentsherman wants to merge 1 commit into master from process-error-section
+176
−0
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,176 @@ | ||
| # Process error section | ||
|
|
||
| - Authors: Ben Sherman | ||
| - Status: proposed | ||
| - Deciders: Ben Sherman, Paolo Di Tommaso | ||
| - Date: 2026-06-15 | ||
| - Tags: lang, static-types, processes, error-handling | ||
|
|
||
| ## Summary | ||
|
|
||
| Add an `error:` section to typed processes, mirroring the `output:` section, that lets a process emit a *domain error* as a value instead of aborting the run. | ||
|
|
||
| ## Problem Statement | ||
|
|
||
| When a task fails, Nextflow handles it using the `errorStrategy` directive, which provides a few standard behaviors such as retrying the task, ignoring the error, or terminating the pipeline. In the case of termination, Nextflow reports the error with a standard format that accounts for all of the various ways in which a task can fail. | ||
|
|
||
| However, a task can fail for two fundamentally different reasons: | ||
|
|
||
| - **Execution errors** -- the infrastructure failed: out of memory, node lost, spot reclaim, submit failure. These are transient or environmental and are typically handled by retries. | ||
|
|
||
| - **Domain errors** -- the tool ran correctly but legitimately failed on the *data*: an input sample was too low quality to align, no variants were called, a file was malformed. These are an expected outcome for some inputs, not a fault of the pipeline or the infrastructure. | ||
|
|
||
| There are currently two ways to handle domain errors manually: | ||
|
|
||
| - **Use the `ignore` error strategy.** This allows the pipeline to complete, but it does not provide an easy way to manage failed inputs. The developer must join the input channel to the output channel and filter out values for which the output is missing, and there is no way for a failed task to return anything (e.g. an error log). | ||
|
|
||
| - **Catch domain errors in the process script.** The developer can add logic in the process script to catch certain error conditions, so that they can still output something. This requires clobbering the process outputs to handle both success and error conditions (e.g. making all outputs optional, providing "fake" outputs on failure) and filtering output channels to separate successful and failed tasks. | ||
|
|
||
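To make the cost of the first workaround concrete, here is a minimal Python sketch of the "join inputs to outputs and filter for missing values" step. It models channels as plain lists of dicts; `split_by_missing_output` and the sample data are hypothetical and are not Nextflow APIs.

```python
def split_by_missing_output(inputs, outputs):
    """Emulate joining inputs to outputs by id and filtering on missing results."""
    completed_ids = {out["id"] for out in outputs}
    succeeded = [s for s in inputs if s["id"] in completed_ids]
    # A failed input is only known by its absence; nothing from the task survives.
    failed = [s for s in inputs if s["id"] not in completed_ids]
    return succeeded, failed


samples = [{"id": "s1"}, {"id": "s2"}, {"id": "s3"}]
aligned = [{"id": "s1", "bam": "s1.bam"}, {"id": "s3", "bam": "s3.bam"}]

succeeded, failed = split_by_missing_output(samples, aligned)
print([s["id"] for s in failed])  # prints ['s2']
```

Note that the failed entry carries only the original input: with `ignore`, there is no error log or other artifact to attach to it.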
| The first approach is a quick fix, not a real error handling solution. The second approach is nearly a complete solution, but lacks a key ingredient -- the ability to emit a separate "error" output for domain errors. | ||
|
|
||
| ## Goals | ||
|
|
||
| - Allow a process to emit a domain error as a value, so the workflow can handle it with normal dataflow logic. | ||
|
|
||
| - Maintain backwards compatibility with existing code -- domain errors should be a per-process, opt-in feature. | ||
|
|
||
| - Continue to use `errorStrategy` for execution errors. | ||
|
|
||
| ## Non-goals | ||
|
|
||
| - Support for legacy processes. The `error:` section is introduced for typed processes only; legacy support may follow later. | ||
|
|
||
| - New syntax for *classifying* errors (no exit-code lists, no guard expressions). | ||
|
|
||
| ## Solution | ||
|
|
||
| Introduce an **`error:` section** for typed processes, with the same syntax as the `output:` section. A domain error is detected structurally -- **a task that succeeds (exit 0) but does not fulfill its declared `output:`** -- and is emitted as an error value instead of triggering the error strategy. | ||
|
|
||
| ## Core Capabilities | ||
|
|
||
| ### Domain errors vs execution errors | ||
|
|
||
| When a task completes, the following rules distinguish domain errors from execution errors: | ||
|
|
||
| 1. **Exit code != 0** → execution error → `errorStrategy`. | ||
| 2. **Exit 0, `output:` fulfilled** → emit `output:` (normal success). The output path wins even if error artifacts also happen to be present. | ||
| 3. **Exit 0, `output:` not fulfilled, `error:` fulfilled** → domain error → emit `error:`, the workflow continues, `errorStrategy` is *not* triggered. | ||
| 4. **Exit 0, `output:` not fulfilled, `error:` not fulfilled** → execution error → `errorStrategy`. | ||
|
|
||
| The output section is considered **not fulfilled** if any required output files (`file()`, `files()`) or environment variables (`env()`) are missing. Other errors such as a missing variable, missing `stdout()`, or missing `eval()` are not treated as domain errors because they usually indicate a malformed pipeline. | ||
|
|
||
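The four rules above can be modelled as a small decision function. This is a Python sketch of the proposed semantics only, not Nextflow's implementation; `TaskResult` and `classify` are illustrative names, and the two boolean flags are assumed to have been computed from the declared `output:` and `error:` sections.

```python
from enum import Enum


class TaskResult(Enum):
    SUCCESS = "emit output:"
    DOMAIN_ERROR = "emit error:"
    EXECUTION_ERROR = "apply errorStrategy"


def classify(exit_code: int, output_fulfilled: bool, error_fulfilled: bool) -> TaskResult:
    """Classify a completed task according to rules 1-4 (hypothetical helper)."""
    if exit_code != 0:
        return TaskResult.EXECUTION_ERROR  # rule 1: non-zero exit
    if output_fulfilled:
        return TaskResult.SUCCESS          # rule 2: output wins over error artifacts
    if error_fulfilled:
        return TaskResult.DOMAIN_ERROR     # rule 3: only the error section is fulfilled
    return TaskResult.EXECUTION_ERROR      # rule 4: neither section is fulfilled


# A task that exits 0 with only its error artifacts present is a domain error.
print(classify(0, output_fulfilled=False, error_fulfilled=True).name)  # prints DOMAIN_ERROR
```

A non-zero exit code always reaches `errorStrategy` in this model, regardless of which files exist, which is why the process script has to exit 0 to signal a domain error.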
| Therefore, the only way to trigger a domain error is to exit 0, ensure that the normal outputs are missing, and ensure that the error outputs are present. The pipeline developer is responsible for writing the process in this way, for example: | ||
|
|
||
| ```groovy | ||
| nextflow.enable.types = true | ||
|
|
||
| process ALIGN { | ||
| input: | ||
| record(id: String, reads: Path) | ||
| index: Path | ||
|
|
||
| output: | ||
| record(id: id, bam: file('aligned.bam')) | ||
|
|
||
| error: | ||
| record(id: id, log: file('aligner.log')) | ||
|
|
||
| script: | ||
| """ | ||
| aligner ${reads} ${index} > aligned.bam 2> aligner.log || { | ||
| rm -f aligned.bam | ||
| exit 0 | ||
| } | ||
|
Review comment on lines +80 to +83 (Member, Author): TODO: wrap exit code wrangling in an
|
||
| """ | ||
| } | ||
| ``` | ||
|
|
||
| The `error:` section is optional. However, the `output:` section is required when `error:` is defined. | ||
|
|
||
| ### Process call semantics | ||
|
|
||
| A process that declares both `output:` and `error:` has return type `Tuple<V, E>`, where `V` is the output type and `E` is the error type. The caller should destructure the tuple to access output and error separately: | ||
|
|
||
| ```groovy | ||
| ch_samples = channel.of( /* ... */ ) | ||
| (ch_aligned, ch_failed) = ALIGN(ch_samples) | ||
|
|
||
| ch_aligned.view { r -> "Aligned sample ${r.id}: ${r.bam}" } | ||
| ch_failed.view { r -> "Failed to align sample ${r.id}: ${r.log}" } | ||
| ``` | ||
|
|
||
| When a process with both `output:` and `error:` returns a dataflow channel, it emits each task result to either the output channel or error channel depending on whether the result is a domain error. Thus, the total number of output values and error values is always equal to the number of inputs. | ||
|
|
||
| When a process returns a dataflow value, the output and error values are each bound to either the task result or `null`, depending on whether the result is a domain error. Thus, the dataflow values for output and error are always bound to a value (no "empty" dataflow value). | ||
|
|
||
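The two call modes described above can be illustrated with a short Python sketch. It assumes each task result has already been classified as a domain error or not; `route_channel` and `route_value` are hypothetical helpers that model channels as lists and dataflow values as plain Python values.

```python
from typing import Any, List, Optional, Tuple


def route_channel(results: List[Tuple[Any, bool]]) -> Tuple[List[Any], List[Any]]:
    """Channel case: each result goes to exactly one of the two channels."""
    outputs: List[Any] = []
    errors: List[Any] = []
    for value, is_domain_error in results:
        (errors if is_domain_error else outputs).append(value)
    return outputs, errors


def route_value(value: Any, is_domain_error: bool) -> Tuple[Optional[Any], Optional[Any]]:
    """Dataflow value case: both sides are always bound, one of them to None."""
    return (None, value) if is_domain_error else (value, None)


results = [("s1.bam", False), ("s2.log", True), ("s3.bam", False)]
ch_aligned, ch_failed = route_channel(results)

# Every task result lands in exactly one channel.
assert len(ch_aligned) + len(ch_failed) == len(results)
```

In the channel case a result is never duplicated or dropped; in the value case the caller can test either side against `None` to see which branch was taken.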
| ### Execution, caching, and lineage | ||
|
|
||
| A task that emits an `error:` is treated as a **successful, cacheable task** (exit 0). On a resumed run, the domain error is cached and the task is not re-executed. The cache entry does not need to be modified, since the domain error is re-derived from the task directory (outputs missing, error outputs present). | ||
|
|
||
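As a rough illustration of why the cache entry needs no extra state, the sketch below re-derives the result of a cached task (assumed to have exited 0) purely from the files in its task directory. The function name, the return labels, and the flat list-of-file-names representation of the `output:` and `error:` sections are all assumptions made for this example.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def rederive_result(task_dir: Path, output_files: Iterable[str], error_files: Iterable[str]) -> str:
    """Decide which section a cached, exit-0 task fulfils (hypothetical helper)."""
    if all((task_dir / name).exists() for name in output_files):
        return "output"       # normal success; output wins
    if all((task_dir / name).exists() for name in error_files):
        return "error"        # domain error
    return "unfulfilled"      # neither section fulfilled; not a valid cached result


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    # Simulate the ALIGN task after a domain error: only the log is left behind.
    (work / "aligner.log").write_text("too few reads\n")
    status = rederive_result(work, ["aligned.bam"], ["aligner.log"])

print(status)  # prints error
```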
| The `TaskOutput` lineage record should be extended with an `error` field that mirrors the existing `output` field. These fields should be mutually exclusive. | ||
|
|
||
| When a domain error occurs, `topic:` emissions and `publishDir` are skipped. | ||
|
|
||
| ## Alternatives | ||
|
|
||
| ### Triggering domain errors with exit codes | ||
|
|
||
| One alternative is to trigger domain errors by returning certain exit codes. Many command-line tools use exit codes for this very purpose, e.g. to distinguish an invalid input from an out-of-memory error. In fact, earlier versions of Nextflow provided a `validExitStatus` process directive along these lines. | ||
|
|
||
| However, this approach does not work in general: | ||
|
|
||
| - Different tools use different exit code conventions. | ||
| - There is no way to know which command in a script returned the exit code. | ||
|
|
||
| The `validExitStatus` directive was ultimately removed for these same reasons. While it seems intuitive to simply rely on exit codes, this interface is not rich enough to classify domain vs. execution errors. | ||
|
|
||
| Instead, pipeline developers must write process scripts in a way that triggers domain errors when desired. This approach is more verbose, but it seems to be the only one that works across all possible tools and environments. | ||
|
|
||
| ### Triggering domain errors via `emit` error strategy | ||
|
|
||
| Sometimes, it is useful to treat execution errors as domain errors for practical reasons. For example, given a task that runs out of memory even after multiple attempts with additional memory, the user might want to treat this task as a "lost cause" so that the rest of the pipeline can proceed. | ||
|
|
||
| This can be achieved by adding an `emit` error strategy which simply emits the task failure as a domain error using the `error:` section: | ||
|
|
||
| ```groovy | ||
| nextflow.enable.types = true | ||
|
|
||
| process ALIGN { | ||
| memory { 8.GB * task.attempt } | ||
| errorStrategy { task.attempt < 3 ? 'retry' : 'emit' } | ||
|
|
||
| input: | ||
| record(id: String, reads: Path) | ||
| index: Path | ||
|
|
||
| output: | ||
| record(id: id, bam: file('aligned.bam')) | ||
|
|
||
| error: | ||
| record(id: id, log: file('aligner.log')) | ||
|
|
||
| script: | ||
| """ | ||
| aligner ${reads} ${index} > aligned.bam 2> aligner.log || { | ||
| rm -f aligned.bam | ||
| exit 0 | ||
| } | ||
| """ | ||
| } | ||
| ``` | ||
|
|
||
| However, the `error:` section is not reliable in the event of an execution error, since the task could have failed before the error outputs were created. In that case, the error strategy would have to fall back to a default strategy, likely `terminate` or `finish`. | ||
|
|
||
| As a result, it is unclear whether an `emit` strategy would actually be useful. It remains a possibility for future investigation. | ||
|
|
||
| ### Handling domain errors with try/catch/throw | ||
|
|
||
| Many languages, including Java and Python, provide the ability to *throw* or *raise* an error up the call stack. Any upstream caller can *catch* this error and handle it; otherwise it is handled by the runtime. This approach is flexible because it allows errors to be propagated through multiple levels of indirection with minimal ceremony. | ||
|
|
||
| Nextflow inherits the try-catch-throw syntax from Java and Groovy, mainly for compatibility with existing Nextflow code and Java/Groovy libraries that can throw exceptions. However, Nextflow's dataflow programming model fits much better with errors-as-values because it provides a clear flow of data that works across any level of concurrency. | ||
|
|
||
| ## Links | ||
|
|
||
| - Community issues: [#725](https://github.com/nextflow-io/nextflow/issues/725), [#903](https://github.com/nextflow-io/nextflow/issues/903) | ||
| - Related: [Typed processes](20251017-typed-processes.md) | ||
| - Related: [Typed workflows](20260310-typed-workflows.md) | ||
I'm not sure this is sufficient for separating the domain vs infrastructure errors.
See https://nfcore.slack.com/archives/C043FMKUNLB/p1782301366504629 for a relevant discussion.
Reply: Left some comments there. That discussion seems to be more about distinguishing between different kinds of execution errors.