pytest --dist=loadgroup hangs if a crashed worker is restarted - variant 2 #1327

@radoering

This issue is similar to #1323 in its externally observable symptom. However, it is triggered by a slightly different setup, has a different root cause, and requires a different fix, so I created a new issue.

When running the following test file

def test_1():
    import time
    time.sleep(5)
    assert True

def test_2():
    assert True

with

pytest -n1 --dist=loadgroup -o faulthandler_timeout=1 -o faulthandler_exit_on_timeout=true testing/test_timeout.py

test execution hangs with

replacing crashed worker gw0
2 workers [2 items]

The issue is that only one item is assigned to the new worker in

if self.collection is not None:
    for node in self.nodes:
        self._reschedule(node)
    return

This item is removed from the queue in

self.nextitem_index = self.torun.get()

Then, at a second call site with the same statement,

self.nextitem_index = self.torun.get()

the worker waits for a second item.
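
The blocking behaviour described above can be reproduced outside of pytest-xdist with a plain queue. This is only a minimal sketch, not the actual worker code: `worker_loop`, `run_with`, the `SHUTDOWN` sentinel, and the timeout are hypothetical names introduced for illustration, and the timeout stands in for the indefinite hang.

```python
import queue

SHUTDOWN = object()  # hypothetical sentinel marking the end of the work


def worker_loop(torun, timeout=0.2):
    """Run items from ``torun``, fetching the next item before each run.

    Returns ``(ran, hung)``: the items that were run, and whether the loop
    gave up because no follow-up entry arrived (a stand-in for hanging).
    """
    ran = []
    # First get: fetch the first item to run.
    nextitem = torun.get()
    while nextitem is not SHUTDOWN:
        item = nextitem
        try:
            # Second get: the following entry must arrive before ``item`` runs.
            nextitem = torun.get(timeout=timeout)
        except queue.Empty:
            return ran, True
        ran.append(item)
    return ran, False


def run_with(entries):
    """Fill a fresh queue with ``entries`` and run the loop on it."""
    torun = queue.Queue()
    for entry in entries:
        torun.put(entry)
    return worker_loop(torun)
```

With a single queued entry, `run_with(["test_2"])` reports a hang without running anything. As soon as a second entry is present (another item or the sentinel), the same item runs normally.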

In contrast, at first start (when self.collection is still None), two items are added to the queue via

# Assign initial workload
for node in self.nodes:
    self._assign_work_unit(node)
# Ensure nodes start with at least two work units if possible (#277)
for node in self.nodes:
    self._reschedule(node)
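
The difference between the two code paths can be illustrated with a toy model that counts how many entries each node receives. This is a sketch under stated assumptions, not the pytest-xdist implementation: `ToyScheduler`, `sent`, and `SHUTDOWN` are hypothetical, and the body of `_reschedule` assumes that a reschedule hands out at most one further unit, or a shutdown signal when no work is left.

```python
from collections import deque

SHUTDOWN = "shutdown"  # hypothetical marker, not an xdist constant


class ToyScheduler:
    """Toy model; method names mirror the snippets above, the logic is assumed."""

    def __init__(self, units, nodes):
        self.workqueue = deque(units)
        self.nodes = list(nodes)
        self.sent = {node: [] for node in self.nodes}  # entries sent per node
        self.collection = None  # None until the first distribution happened

    def _assign_work_unit(self, node):
        if self.workqueue:
            self.sent[node].append(self.workqueue.popleft())

    def _reschedule(self, node):
        # Assumption: with no work left the node is told to shut down;
        # otherwise it receives exactly one more unit.
        if not self.workqueue:
            self.sent[node].append(SHUTDOWN)
            return
        self._assign_work_unit(node)

    def schedule(self):
        if self.collection is not None:
            # Restart path: a single _reschedule per node.
            for node in self.nodes:
                self._reschedule(node)
            return
        self.collection = True
        # First-start path: one initial unit plus one _reschedule per node.
        for node in self.nodes:
            self._assign_work_unit(node)
        for node in self.nodes:
            self._reschedule(node)


# First start: the node receives two entries.
first = ToyScheduler(["test_1", "test_2"], ["gw0"])
first.schedule()

# Restart after a crash: one unit is left and the node receives only that one.
restart = ToyScheduler(["test_2"], ["gw1"])
restart.collection = True
restart.schedule()
```

In this model the replacement node ends up with a single entry, which matches the situation in which the worker keeps waiting for a second one.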
