Compute worker issues #1205

@Didayolo

Hopefully #2223 solves several of these points.

1. Containers not removed

Image

EDIT: still happening with MOT20 challenge:

Image

2. Wrong log when storage is full

When docker pull fails because storage is full, the logs do not make the cause clear.
See:

The submission then gets stuck in the Running state. Solved by #2223.

3. Progress bar

Related: show_progress and the progress bar add to the mess:

2026-02-28 02:38:37.854 | ERROR    | compute_worker:show_progress:137 - There was an error showing the progress bar
2026-02-28 02:38:37.854 | ERROR    | compute_worker:show_progress:138 - 6
2026-02-28 02:38:37.955 | ERROR    | compute_worker:show_progress:137 - There was an error showing the progress bar
2026-02-28 02:38:37.955 | ERROR    | compute_worker:show_progress:138 - 1

4. Logs

Image

5. No space left

How should we manage disk space? Should we limit Docker image size?

We could run a prune when docker pull hits the storage limit:
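A minimal sketch of the prune-on-failure idea, not the worker's actual code. The function name pull_with_prune and the injectable run parameter are hypothetical, and matching on the "no space left on device" phrase in stderr is an assumption about how Docker reports a full disk.

```python
import subprocess


def pull_with_prune(image, run=subprocess.run):
    """Pull `image`; on a disk-full failure, prune unused images and retry once.

    `run` is injectable so the logic can be exercised without a Docker daemon.
    Returns True if the image was pulled, False otherwise.
    """

    def pull():
        return run(["docker", "pull", image], capture_output=True, text=True)

    result = pull()
    if result.returncode == 0:
        return True

    # Assumption: a full disk shows up as this phrase in the pull's stderr.
    if "no space left on device" in (result.stderr or "").lower():
        # Remove all unused images (not only dangling ones), without prompting.
        run(
            ["docker", "image", "prune", "--all", "--force"],
            capture_output=True,
            text=True,
        )
        result = pull()

    return result.returncode == 0
```

Retrying only once keeps the worker from looping forever when pruning frees nothing; in that case the caller still has to report the failure.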

6. Option for container shared memory

  • Add shm-size as a compute_worker .env setting
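One possible shape for this setting, as a sketch only. The variable name SHM_SIZE and the helper shm_size_args are hypothetical; --shm-size itself is an existing docker run option.

```python
import os


def shm_size_args(env=None):
    """Return extra `docker run` arguments for the container's shared memory.

    Reads the hypothetical SHM_SIZE setting (for example "2g"); when it is
    unset or empty, no flag is added and Docker's default applies.
    """
    env = os.environ if env is None else env
    size = env.get("SHM_SIZE", "").strip()
    return [f"--shm-size={size}"] if size else []
```

The returned list could be spliced into the command the worker already builds, for example ["docker", "run", *shm_size_args(), image].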

More details here:

7. Submissions not marked as Failed

Submissions get stuck in the "Running" or "Scoring" status.

Related issues:

Example failure during "Preparing":

[2025-09-18 11:25:05,234: ERROR/ForkPoolWorker-2] Task compute_worker_run[fd956bf5-3e2d-4168-ab48-f0896dc80993] raised unexpected: OSError(28, 'No space left on device')
Traceback (most recent call last):
[...]
OSError: [Errno 28] No space left on device
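A sketch of one way to make sure an unexpected error such as the one above still ends in a "Failed" status. Both run_submission and update_status are hypothetical stand-ins for whatever the worker uses to run a job and report its state.

```python
def run_and_report(run_submission, update_status):
    """Run a submission and report "Failed" if anything goes wrong.

    run_submission: hypothetical callable that does the actual work.
    update_status:  hypothetical callable taking (status, details).
    """
    try:
        return run_submission()
    except Exception as exc:
        # Report before re-raising, so the submission does not stay in
        # "Preparing", "Running" or "Scoring" after the task has died.
        update_status("Failed", f"{type(exc).__name__}: {exc}")
        raise
```

Re-raising keeps the traceback in the worker logs while the status update tells the user what happened.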

8. Duplication of submission files

9. Scoring and ingestion only works without directory structure

Classic CodaLab and Codabench bug: if the scoring program or ingestion program is inside a folder in the zip, the submission fails.

We need either to:

  • Add clear logs when it is the case
  • Make it robust so it does not fail
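A sketch of the second option, under two assumptions: the zip has already been extracted, and the program is recognised by a metadata file at its root. The file names in METADATA_NAMES and the helper find_program_root are illustrative, not taken from the worker.

```python
from pathlib import Path

# Assumption: the program root is the folder holding one of these files.
METADATA_NAMES = ("metadata.yaml", "metadata")


def find_program_root(extracted_dir):
    """Return the folder containing the metadata file, or None.

    Descends through wrapper folders as long as each level holds exactly
    one visible directory and nothing else.
    """
    root = Path(extracted_dir)
    while True:
        if any((root / name).is_file() for name in METADATA_NAMES):
            return root
        # Ignore hidden entries and the __MACOSX folder some zip tools add.
        entries = [
            p for p in root.iterdir()
            if not p.name.startswith((".", "__MACOSX"))
        ]
        if len(entries) == 1 and entries[0].is_dir():
            root = entries[0]
            continue
        return None
```

A None result is also the natural place for the first option: a clear log saying the program was not found at the top level of the zip.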

Related issues:

A tentative fix, #1905, was reverted by #1946.

10. To check: log level

The log level is defined in this way in compute_worker.py:

configure_logging(
    os.environ.get("LOG_LEVEL", "INFO"), os.environ.get("SERIALIZED", "false")
)

Generally we want as much logging as possible, so we may want to default to the "DEBUG" log level.
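If the default were changed, one way to do it is sketched below. The helper resolve_log_level is hypothetical, and the list of valid levels is an assumption about what configure_logging accepts.

```python
import os

# Assumption: these are the level names configure_logging understands.
VALID_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def resolve_log_level(env=None, default="DEBUG"):
    """Read LOG_LEVEL from the environment, falling back to `default`.

    The value is upper-cased, and unknown names fall back to the default
    instead of being passed on to the logging setup.
    """
    env = os.environ if env is None else env
    level = env.get("LOG_LEVEL", default).upper()
    return level if level in VALID_LEVELS else default
```

Operators who find "DEBUG" too verbose could still set LOG_LEVEL=INFO in the worker's .env file.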

Related:

11. Docker pull failing

  • Docker pull failing
Pull for image: codalab/codalab-legacy:py39 returned a non-zero exit code! Check if the docker image exists on docker hub.

Related issues:

Solution:

12. Logs at the wrong place

13. No hostname in server status when status is "Preparing"

https://www.codabench.org/server_status
