Close qmemman client with correct handler#796
Conversation
| if qmemman_client: | ||
| qmemman_client.close() | ||
| if qmemman_task: | ||
| qmemman_task.close() |
There was a problem hiding this comment.
You sure about this? This is just a task that would return qmemman client when completed. Maybe qmemman_task.result().close() would work, but only if the task was completed already. If it was cancelled, I'm not sure how to get the client reference in this code shape...
There was a problem hiding this comment.
In the upstream docs, it uses I was reading the create_task() and then task.cancel().break_task...
About result(), I think it works for asyncio.CancelledError, but for other exceptions that didn't cancel the task, it will raise an exception.
I am going to test these possibilities to be sure, by delaying the qmemman response with a ten seconds sleep and making a call to use the preloaded disposable, which would cancel the call. Then on another run, I can make the call fail by reporting that it did fail to free memory from qube.
There was a problem hiding this comment.
Maybe
qmemman_task.result().close()would work, but only if the task was completed already. If it was cancelled, I'm not sure how to get the client reference in this code shape...
qmeman_task.cancel() is already being use if preload_request_event completes/is set, so I don't think I need to use .result().
About cancelling a task twice,, as asyncio.CancelledError was already raised, yes, that seems weird.
There was a problem hiding this comment.
Hm. Maybe I left the qmemman_client.close() inside asyncio.CancelledError because the cancelled task might be the breask_task, so in every failed scenario, I want the qmemman_task to be cancelled.
There was a problem hiding this comment.
Traceback (most recent call last):
File "/usr/lib/python3.13/site-packages/qubes/vm/dispvm.py", line 623, in on_domain_pre_paused
await earliest_task
File "/usr/lib64/python3.13/asyncio/locks.py", line 213, in wait
await fut
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.13/site-packages/qubes/events.py", line 243, in fire_event_async
effect = task.result()
File "/usr/lib/python3.13/site-packages/qubes/vm/mix/dvmtemplate.py", line 535, in on_domain_preload_dispvm_used
await asyncio.gather(
...<4 lines>...
)
File "/usr/lib/python3.13/site-packages/qubes/vm/dispvm.py", line 777, in from_appvm
dispvm = await cls.gen_disposable(appvm, preload=preload, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.13/site-packages/qubes/vm/dispvm.py", line 822, in gen_disposable
await dispvm.start()
File "/usr/lib/python3.13/site-packages/qubes/vm/dispvm.py", line 985, in start
await super().start(**kwargs)
File "/usr/lib/python3.13/site-packages/qubes/vm/qubesvm.py", line 1553, in start
await self.fire_event_async(
"domain-start", start_guid=start_guid
)
File "/usr/lib/python3.13/site-packages/qubes/events.py", line 243, in fire_event_async
effect = task.result()
File "/usr/lib/python3.13/site-packages/qubes/vm/dispvm.py", line 586, in on_domain_started_dispvm
await self.pause()
File "/usr/lib/python3.13/site-packages/qubes/vm/qubesvm.py", line 1754, in pause
await self.fire_event_async("domain-pre-paused", pre_event=True)
File "/usr/lib/python3.13/site-packages/qubes/events.py", line 243, in fire_event_async
effect = task.result()
File "/usr/lib/python3.13/site-packages/qubes/vm/dispvm.py", line 631, in on_domain_pre_paused
qmemman_task.close()
^^^^^^^^^^^^^^^^^^
AttributeError: '_asyncio.Future' object has no attribute 'close'
This is inside asyncio.CancelledError.
There was a problem hiding this comment.
To be clear: I mean QMemmanClient.close() needs to be called at some point, not just the asyncio task (of calling QMemmanClient.set_mem()) cancelled.
There was a problem hiding this comment.
You sure about this? This is just a task that would return qmemman client when completed. Maybe
qmemman_task.result().close()would work, but only if the task was completed already. If it was cancelled, I'm not sure how to get the client reference in this code shape...
Hm... qmemman_task.result().close() works inside asyncio.CancelledError when cancelling the break_task and when qmemman_task was already cancelled (by completing the break_task). I will add some debugging steps to qmemman to see if it is fine, but I think it is.
There was a problem hiding this comment.
I ended up using the recommended .result().done(). I tried to break in some ways by not completing the task (any other exception), by having already cancelled the task and trying to cancel it, but it appears to work (stops the memory request and releases qmemman lock).
There was a problem hiding this comment.
To be clear: I mean
QMemmanClient.close()needs to be called at some point, not just the asyncio task (of callingQMemmanClient.set_mem()) cancelled.
The disconnection also happens when length of received data is 0, this is why it wasn't hanging on most cases before this PR and with this PR.
a533474 to
f95e7b3
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #796 +/- ##
==========================================
+ Coverage 70.12% 70.13% +0.01%
==========================================
Files 61 61
Lines 14001 13997 -4
==========================================
- Hits 9818 9817 -1
+ Misses 4183 4180 -3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
OpenQA test summaryComplete test suite and dependencies: https://openqa.qubes-os.org/tests/overview?distri=qubesos&version=4.3&build=2026042316-4.3&flavor=pull-requests Test run included the following:
New failures, excluding unstableCompared to: https://openqa.qubes-os.org/tests/overview?distri=qubesos&version=4.3&build=2026032404-devel&flavor=update
Failed tests43 failures
Fixed failuresCompared to: https://openqa.qubes-os.org/tests/170766#dependencies 32 fixed
Unstable testsDetails
Performance TestsPerformance degradation:11 performance degradations
Remaining performance tests:100 tests
|
| if qmemman_client: | ||
| qmemman_client.close() | ||
| if qmemman_task: | ||
| qmemman_task.result().close() |
There was a problem hiding this comment.
So, if earliest_task == break_task, this doesn't work, because qmemman_task is cancelled. With added try/except around this line I get:
Apr 12 16:26:13 dom0 qubesd[10498]: Traceback (most recent call last):
Apr 12 16:26:13 dom0 qubesd[10498]: File "/usr/lib/python3.13/site-packages/qubes/vm/dispvm.py", line 634, in on_domain_pre_paused
Apr 12 16:26:13 dom0 qubesd[10498]: await earliest_task
Apr 12 16:26:13 dom0 qubesd[10498]: asyncio.exceptions.CancelledError
Apr 12 16:26:13 dom0 qubesd[10498]: During handling of the above exception, another exception occurred:
Apr 12 16:26:13 dom0 qubesd[10498]: Traceback (most recent call last):
Apr 12 16:26:13 dom0 qubesd[10498]: File "/usr/lib/python3.13/site-packages/qubes/vm/dispvm.py", line 643, in on_domain_pre_paused
Apr 12 16:26:13 dom0 qubesd[10498]: qmemman_task.result().close()
Apr 12 16:26:13 dom0 qubesd[10498]: ~~~~~~~~~~~~~~~~~~~^^
Apr 12 16:26:13 dom0 qubesd[10498]: asyncio.exceptions.CancelledError
There was a problem hiding this comment.
So, if
earliest_task == break_task, this doesn't work, because qmemman_task is cancelled. With added try/except around this line I get:
On this comment: #796 (comment), I expected it to fail also, but it succeeded.
So what do you mean with added try/except? Is this:
except asyncio.CancelledeError:
if qmemman_task:
try:
qmemman_task.result().close()
except:
raiseThere was a problem hiding this comment.
Well, not just raise , log, exception here seems to get lost.
|
Looking at it closer, I think there is an issue with how this is called in the non-requested case. set_mem() will instruct qmemman to reduce memory of the qube. So far so good. But as soon as the client disconnects, qmemman will start balancing memory again, likely giving just reduced qube more memory again. IIUC that disconnect happens currently only implicitly, when garbage collector releases qmemman_client object, which is IMO quite fragile. Theoretically, the qmemman socket should be kept open until after the qube is paused. In practice, the qube is paused fast enough so it doesn't change much. See how similar qmemman handling is done during qube startup: first qubesd request to free some memory for the new qube, then starts it and only then close the qmemman socket - otherwise qmemman would reuse that freed memory and there wouldn't be space for starting that new qube anymore. But here, it isn't a single function, it would be two separate event handlers - |
|
But the flip side of the "store qmemman_client reference in an attribute" is that, if something goes wrong and close() isn't called, garbage collector won't save you anymore. So, it wouldn't be "delayed for 20min" anymore, but qmemman would stop doing anything forever (which includes breaking starting further qubes). Or at least until responsible qube is removed (or maybe until qubesd or qmemman restart). |
I remember you voicing your concern about qmemman trying to give memory back to the domain before it being paused, when developing this feature last year, so, uh, I was expecting this topic to be revived at some point. I do agree that we are relying on implicit (as it just works) behavior rather than something guaranteed, which would be nicer. I think attribute works well enough for this case.
The connection can seize when the client sends empty reply: qubes-core-admin/qubes/tools/qmemmand.py Line 197 in 2488f2b |
|
Set qmemman log level to info. Letting the preload be be paused (not requesting it before pause)
Requesting the preload mid domain-pre-pause, mid qmemman pref memIt just has |
I don't think the current implementation of qmemman checks if client disconnected mid-action. Not ideal, but also not a huge problem. |
I moved the instantiation of I didn't find a nice solution, I saw that something can maybe be done with
So it seems that the client cannot easily forcefully close the connection nor hold it before ballooning again. I am testing using this script: import asyncio
import qubes
import qubes.qmemman.client
async def main():
app = qubes.Qubes()
domains = app.domains
disp = domains["disp3668"]
break_task = asyncio.create_task(asyncio.sleep(0.0003))
qmemman_client = qubes.qmemman.client.QMemmanClient()
#qmemman_task = asyncio.get_running_loop().run_in_executor(None, disp.set_mem)
qmemman_task = asyncio.get_running_loop().run_in_executor(None, qmemman_client.set_mem, {disp.xid: 0})
tasks = [break_task, qmemman_task]
try:
async for earliest_task in asyncio.as_completed(tasks):
await earliest_task
if earliest_task == break_task:
print("Finished break, cancelling qmemman")
qmemman_task.cancel()
else:
print("Finished qmemman, cancelling break")
break_task.cancel()
except asyncio.CancelledError:
pass
finally:
print("Final")
if qmemman_client.sock:
qmemman_client.shutdown()
result = None
cancelled = False
try:
#result = await qmemman_task
result = qmemman_task.result()
except asyncio.CancelledError:
cancelled = True
finally:
#qmemman_client.close()
if not result or cancelled:
disp.log.warning("Failed to set memory")
if __name__ == "__main__":
asyncio.run(main())Where QMemmanClient.shutdown is shutdown+close(). |
Well, closing the socket does not interrupt the function on the other end of it. qmemman will notice only when it will interact with the socket next time. For interrupting the call, qmemman would need some monitoring of the socket state during |
Right, it ended up being bigger changes that what I expected and I got lost trying to fix it all at once. Will focus just on closing the socket properly. I cannot use |
What about writing to a file or modify DomainState somehow, so that |
fed104c to
ec3965e
Compare
|
Haven't tested latest commit. For another day. |
We used to do file-based trigger, it's messy. For example if two DispVMs would preload in parallel. |
ec3965e to
a56b9a8
Compare
What about a file that contains the domid, to remove it from balancing/ballooning? Tested now and it's working. Currently using 2 files for different purposes:
|
I just gave one example out of many what could go wrong... No, I don't want file based interface for controlling stuff. If you want more examples:
We've been there. I don't want to keep finding new failure modes for the next several months again... |
a56b9a8 to
b2d305f
Compare
QMemmanClient is instantiated from the same method that is calling it as a task, so it guarantees access to the close() method, else, it may not be able to close the connection properly if the task is cancelled, as the result() will not contain the instance, but CancelledError. For: QubesOS/qubes-issues#1512
b2d305f to
3930269
Compare
|
Dropped the file based condition. Will leave for another future future PR. Opened separate issues: |
For: QubesOS/qubes-issues#1512
#757 (comment)