Clearer error when a job hits a full disk (ENOSPC) (#1899) #2293

Open
arimu1 wants to merge 1 commit into common-workflow-language:main from arimu1:fix/1899-disk-full-error
arimu1 commented Jun 19, 2026

Problem

Fixes #1899.

When the user's temporary drive (or output directory) fills up while a job is running, the underlying OSError carries errno 28 (ENOSPC, "No space left on device"). In JobBase._execute this fell through to the catch-all branch:

else:
    _logger.exception("Exception while running job: %s", str(e), ...)
processStatus = "permanentFail"

…which dumps a stack trace and gives the user no hint that the actual problem is a full disk (the symptom reported on the CWL forum thread linked from the issue: a tool silently produces empty outputs).

Fix

Add a dedicated ENOSPC branch to the existing OSError handler in cwltool/job.py, right next to the ENOENT ("command not found") case that's already special-cased there. It logs a concise, actionable message instead of a traceback:

[job foo] No space left on device. The temporary directory (/tmp/…) and/or
output directory (/…/out) may be full; free up space, or point --tmpdir-prefix
and --outdir at a location with more capacity.

The full traceback is still available with --debug (the branch passes exc_info=runtimeContext.debug, matching the sibling cases). I also replaced the existing magic-number check e.errno == 2 with the named constant errno.ENOENT for readability, now that errno is imported.
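
For readers who want the shape of the branch without opening the diff, here is a minimal, self-contained sketch of the errno dispatch described above. It is not the code in cwltool/job.py: the names run_and_classify, disk_full_message, and the logger name are hypothetical, and only the message wording and the exc_info=debug idea are taken from this description.

```python
import errno
import logging
import os
from typing import Callable

# Hypothetical logger name, used only for this sketch.
_logger = logging.getLogger("enospc_sketch")


def disk_full_message(name: str, tmpdir: str, outdir: str) -> str:
    """Build the concise hint (wording follows the PR description)."""
    return (
        f"[job {name}] No space left on device. The temporary directory "
        f"({tmpdir}) and/or output directory ({outdir}) may be full; free up "
        "space, or point --tmpdir-prefix and --outdir at a location with "
        "more capacity."
    )


def run_and_classify(
    action: Callable[[], None],
    name: str,
    tmpdir: str,
    outdir: str,
    debug: bool = False,
) -> str:
    """Hypothetical stand-in for the OSError handler; returns a status string."""
    try:
        action()
    except OSError as e:
        if e.errno == errno.ENOENT:
            # Named constant instead of the magic number 2.
            _logger.error("[job %s] command not found: %s", name, e, exc_info=debug)
        elif e.errno == errno.ENOSPC:
            # Concise hint; the traceback is attached only when debug is set.
            _logger.error(
                "%s", disk_full_message(name, tmpdir, outdir), exc_info=debug
            )
        else:
            # Catch-all: always logs the traceback.
            _logger.exception("Exception while running job: %s", e)
        return "permanentFail"
    return "success"


def _simulate_full_disk() -> None:
    """Raise the same kind of error a full disk would produce."""
    raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))


status = run_and_classify(_simulate_full_disk, "foo", "/tmp/example", "/data/out")
```

The paths passed in the last line are placeholders. Keeping the ENOSPC case in the same if/elif chain as ENOENT means every OSError still ends in the same failure status; only the log output differs.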

Tests

tests/test_examples.py::test_disk_full_error_message injects an OSError(errno.ENOSPC, …) at the job's output-handling stage (monkeypatching cwltool.job.bytes2str_in_dicts) during a normal echo run, and asserts:

  • the message contains "No space left on device" and the --tmpdir-prefix guidance,
  • no raw Traceback (most recent call last) is printed,
  • exit code is 1 (permanentFail).
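
The fault-injection idea behind the test can be sketched in isolation as follows. This is an illustration only, not the test in tests/test_examples.py: it uses unittest.mock from the standard library rather than pytest's monkeypatch, and stage, handle_outputs, run_job, and check_disk_full_message are hypothetical stand-ins for the real job code and the patched cwltool.job.bytes2str_in_dicts.

```python
import errno
import io
import logging
import types
from unittest import mock

# Hypothetical stand-in for a module exposing an output-handling stage.
stage = types.SimpleNamespace(handle_outputs=lambda: None)


def run_job(logger: logging.Logger, debug: bool = False) -> int:
    """Toy job runner: returns 0 on success, 1 on permanent failure."""
    try:
        stage.handle_outputs()
    except OSError as e:
        if e.errno == errno.ENOSPC:
            logger.error(
                "No space left on device; consider --tmpdir-prefix",
                exc_info=debug,
            )
        else:
            logger.exception("Exception while running job: %s", e)
        return 1
    return 0


def check_disk_full_message() -> str:
    """Inject ENOSPC at the output stage and return the captured log text."""
    buf = io.StringIO()
    logger = logging.getLogger("enospc_test_sketch")
    handler = logging.StreamHandler(buf)
    logger.addHandler(handler)
    logger.propagate = False  # keep the captured output in buf only
    try:
        boom = OSError(errno.ENOSPC, "No space left on device")
        # Replace the stage with a mock that raises the injected error.
        with mock.patch.object(stage, "handle_outputs", side_effect=boom):
            exit_code = run_job(logger)
    finally:
        logger.removeHandler(handler)
    assert exit_code == 1
    return buf.getvalue()


captured = check_disk_full_message()
```

The three assertions listed above then map onto checks against captured and the exit code: the hint text is present, no "Traceback (most recent call last)" appears, and the run fails with 1.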

I verified that this test fails against the unpatched code (it hits the old generic-exception/traceback path), so it does guard the new behavior. black, flake8, and isort are clean on the changed files.

This addresses the "provide a better error message" ask in #1899. Actually probing free space ahead of time is a larger, platform-specific change and isn't attempted here.


This change was produced with the assistance of Claude Code (model: Claude Opus). The diff, root-cause analysis, and test were reviewed by a human before submission.

…anguage#1899)

When the temporary or output directory fills up mid-job, the OSError
(errno 28, ENOSPC) fell through to the generic 'Exception while running job'
handler and dumped a traceback, giving the user no idea the real problem was
a full disk. Add a dedicated ENOSPC branch that logs an actionable message
naming the tmpdir/outdir and pointing at --tmpdir-prefix / --outdir. Also
switch the existing errno == 2 check to the named errno.ENOENT for clarity.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
mr-c commented Jun 19, 2026

Member

@arimu1 thanks for using cwltool and contributing your fixes. Let's get the existing PRs resolved before opening more.

Perhaps you can fix the Codecov upload issue. Maybe the version needs bumping? If that doesn't work, I would accept temporarily disabling Codecov, but then reviews will take longer as I will have to personally check the code coverage locally.

Development

Successfully merging this pull request may close these issues.

if the user's tmp drive is full, provide a better error message

2 participants