queue start: workers crash on Windows #11035

@Tauwasser

Bug Report

Description

I'm on Windows 11 using Python 3.14.4 and installed dvc as recommended using pipx. After queuing experiments, starting workers with dvc queue start appears to do nothing.

Reproduce

$ git init && dvc init
$ echo -e "1\n2\n3" > input.txt
$ echo "open('output.txt', 'w').write('\n'.join(str(int(line) + 1) for line in open('input.txt')))" > example.py
$ dvc stage add -n example -d example.py -d input.txt -o output.txt python example.py
$ git add -A && git commit -m "initial commit"
$ dvc exp run --queue
$ dvc queue start
Started '1' new experiments task queue worker.
$ dvc queue status
Task     Name        Created    Status
d899386  bardy-doll  07:51 PM   Queued

Worker status: 0 active, 0 idle

The worker was seemingly spawned okay, but just never picks up the queued job.

Expected

I expected the worker to spawn, pick up the job ("Running" -> "Success"), then clean up after itself.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.67.1 (pip)
-------------------------
Platform: Python 3.14.4 on Windows-11-10.0.26200-SP0
Subprojects:
        dvc_data = 3.18.3
        dvc_objects = 5.2.0
        dvc_render = 1.0.2
        dvc_task = 0.40.2
        scmrepo = 3.6.2
Supports:
        http (aiohttp = 3.13.5, aiohttp-retry = 2.9.1),
        https (aiohttp = 3.13.5, aiohttp-retry = 2.9.1)
Config:
        Global: C:\Users\<USERNAME>\AppData\Local\iterative\dvc
        System: C:\ProgramData\iterative\dvc
Cache types: hardlink
Cache directory: NTFS on D:\
Caches: local
Remotes: None
Workspace directory: NTFS on D:\
Repo: dvc, git
Repo.site_cache_dir: C:\ProgramData\iterative\dvc\Cache\repo\d482bc26ad57e57c35ce8a849aad758b

Additional Information (if any):

The problem is that the worker actually spawns, but crashes instantly. I set environment variable DVC_DAEMON_LOGFILE and used dvc queue start -v to get a better idea of what was going on.

This is the log output:

Traceback (most recent call last):
  File "C:\Users\<USERNAME>\pipx\venvs\dvc\Lib\site-packages\dvc\__main__.py", line 5, in <module>
    from dvc.cli import main
  File "C:\Users\<USERNAME>\pipx\venvs\dvc\Lib\site-packages\dvc\__init__.py", line 7, in <module>
    import dvc.logger
  File "C:\Users\<USERNAME>\pipx\venvs\dvc\Lib\site-packages\dvc\logger.py", line 3, in <module>
    import logging
  File "C:\Users\<USERNAME>\AppData\Local\Python\pythoncore-3.14-64\Lib\logging\__init__.py", line 26, in <module>
    import sys, os, time, io, re, traceback, warnings, weakref, collections.abc
  File "C:\Users\<USERNAME>\AppData\Local\Python\pythoncore-3.14-64\Lib\re\__init__.py", line 125, in <module>
    import enum
  File "C:\Users\<USERNAME>\AppData\Local\Python\pythoncore-3.14-64\Lib\enum.py", line 3, in <module>
    from types import MappingProxyType, DynamicClassAttribute
  File "C:\Users\<USERNAME>\pipx\venvs\dvc\Lib\site-packages\dvc\types.py", line 1, in <module>
    from typing import TYPE_CHECKING, Any, AnyStr, Union
  File "C:\Users\<USERNAME>\AppData\Local\Python\pythoncore-3.14-64\Lib\typing.py", line 26, in <module>
    import functools
  File "C:\Users\<USERNAME>\AppData\Local\Python\pythoncore-3.14-64\Lib\functools.py", line 22, in <module>
    from types import GenericAlias, MethodType, MappingProxyType, UnionType
ImportError: cannot import name 'GenericAlias' from 'types' (consider renaming 'C:\\Users\\<USERNAME>\\pipx\\venvs\\dvc\\Lib\\site-packages\\dvc\\types.py' since it has the same name as the standard library module named 'types' and prevents importing that standard library module)

The dvc.types module shadows the built-in types module, which causes the worker to crash with the stacktrace shown. However, the code that spawns workers isn't sophisticated enough to notice the non-zero exit code.

The problem is in dvc/daemon.py:59:

dvc/dvc/daemon.py

Lines 59 to 65 in 06ff81c

def _get_dvc_args() -> list[str]:
    args = [sys.executable]
    if not is_binary():
        root_dir = os.path.abspath(os.path.dirname(__file__))
        main_entrypoint = os.path.join(root_dir, "__main__.py")
        args.append(main_entrypoint)
    return args

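Basically, when executing through Python, the workers are spawned by calling python <path-to-site-packages>/dvc/__main__.py directly, which means the dvc package directory is prepended to sys.path as its first entry. Every module inside that directory, including dvc/types.py, then takes precedence over the standard library module of the same name.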
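To illustrate the mechanism outside of DVC, here is a minimal sketch. It builds a throwaway package (demo_pkg, a hypothetical name) containing an empty colorsys.py, which stands in for dvc/types.py as a harmless file that shares its name with a standard library module. It then launches the package both ways and reports which file import colorsys resolves to. It assumes safe-path mode (PYTHONSAFEPATH or -P) is not in effect.

```python
import os
import subprocess
import sys
import tempfile


def resolve_colorsys(tmp_root: str) -> tuple[str, str]:
    """Return the file `import colorsys` resolves to when demo_pkg is run
    as a script path and when it is run with `-m`."""
    pkg_dir = os.path.join(tmp_root, "demo_pkg")  # hypothetical package
    os.makedirs(pkg_dir)
    files = {
        "__init__.py": "",
        # Stand-in for dvc/types.py: same name as a stdlib module.
        "colorsys.py": "",
        "__main__.py": "import colorsys\nprint(colorsys.__file__)\n",
    }
    for name, body in files.items():
        with open(os.path.join(pkg_dir, name), "w", encoding="utf-8") as f:
            f.write(body)

    env = dict(os.environ)
    # Safe-path mode would suppress the sys.path[0] entry being demonstrated.
    env.pop("PYTHONSAFEPATH", None)
    env["PYTHONDONTWRITEBYTECODE"] = "1"

    def run(args: list[str]) -> str:
        result = subprocess.run(
            [sys.executable, *args],
            cwd=tmp_root,
            env=env,
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()

    # Script path: the script's own directory becomes sys.path[0].
    as_script = run([os.path.join(pkg_dir, "__main__.py")])
    # -m: the current working directory becomes sys.path[0] instead.
    as_module = run(["-m", "demo_pkg"])
    return as_script, as_module


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        script_hit, module_hit = resolve_colorsys(root)
        print("run as script path:", script_hit)
        print("run with -m       :", module_hit)
```

Run as a script path, the import should land on the package's own colorsys.py; run with -m, it should land on the standard library copy. That is the same difference that makes dvc/types.py shadow types in the traceback above.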

Now, I don't know enough about dvc or the intention behind calling it that way (and the blame proved not too helpful, either), but I believe simply changing _get_dvc_args as follows would solve the issue. At least it did solve this issue for me.

def _get_dvc_args() -> list[str]:
    args = [sys.executable]
    if not is_binary():
        args.append('-m')
        args.append('dvc')
    return args

This should be safe, because

  1. the rest of the code assumes the dvc module is available anyway (import dvc.<xyz> etc.),
  2. there is code at dvc/daemon.py:177 that puts the site-packages directory in PYTHONPATH, and
  3. the dvc.daemon.daemonize function is used exactly once, exclusively in the Windows-only part of dvc.repo.experiments.queue.celery.

However, maybe there is currently a refactoring going on to switch to dvc_task.proc.process.ManagedProcess that I may not be aware of? Of course, this change doesn't address the fact that _detached_subprocess never notices when the workers it spawned die before doing any work. Fixing that would seem to be somewhat more involved.
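As a rough idea of what noticing an early crash could look like, here is a generic sketch built only on the standard subprocess module. It is not DVC's code and does not reproduce the detached-process flags that _detached_subprocess uses; spawn_and_check and its grace parameter are hypothetical names chosen for illustration.

```python
import subprocess
import sys


def spawn_and_check(cmd: list[str], grace: float = 1.0):
    """Start cmd and wait up to `grace` seconds for it to exit.

    Returns (process, exit_code). exit_code is None if the process is
    still running after the grace period, otherwise its return code.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        exit_code = proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        # Still alive after the grace period: treat it as started.
        return proc, None
    return proc, exit_code


if __name__ == "__main__":
    # A child that fails immediately, like the worker in the traceback.
    _, code = spawn_and_check([sys.executable, "-c", "import sys; sys.exit(3)"])
    if code:
        print(f"worker exited early with code {code}")
```

A short grace period only catches crashes during startup, such as the import error above. It would not detect a worker that fails later, and a real implementation would also have to decide how to report the failure to the user.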

A similar issue might have occurred within _posix_detached_subprocess in issue #10829, which seemingly involved a snap-installed dvc on Windows Subsystem for Linux, although I didn't look at it more closely.
