amdgpu rendernode somewhat working but no fork support yet by tursulin · Pull Request #3016 · checkpoint-restore/criu

tursulin · 2026-05-05T12:55:35Z

This is a RFC style PR with 5 interesting patches on top of the cleanups I've sent as separate PRs. The interesting patches are these:

66ec3c611f82 plugins/amdgpu: Re-open the DRM device when saving buffer object content
80e839f67084 plugins/amdgpu: Move GPU VA map to after content restore
afe8e2b2fd23 plugins/amdgpu: Handle DRM render node mmaps
62cd972b4315 plugins/amdgpu: Allow forcing restore on a single GPU system
5d05115056e6 plugins/amdgpu: Enable restore with only the drm node open

With this I can get all but the last the test from https://cgit.freedesktop.org/~tursulin/intel-gpu-tools/commit/?h=amd-criu&id=24487611e08c753e73b3fc650ca85793bf053f4d to pass. The last one includes forking and for that problem I will kick off a separate discussion.

But the most complicated test case from the above:

sudo ~/build-holo/tests/amdgpu/amd_criu --r map-and-mmap-one

Can be checkpointed and restored:

$ sudo /usr/local/sbin/criu dump -t `pgrep amd_criu | head -1` -L /usr/local/lib/criu/ -vvv -o criu.log -j --link-remap --tcp-established --file-locks --ext-unix-sk
$ sudo /usr/local/sbin/criu restore  -L /usr/local/lib/criu/ -vvv -o restore.log --shell-job --link-remap --tcp-established --file-locks --ext-unix-sk

Subtest map-and-mmap-one: SUCCESS (13.989s)

It would be good to test this series with some kfd and kfd+amdgpu combined workload to check I created no regressions.

avagin · 2026-05-19T23:44:05Z

2026-05-05T13:47:45.8749530Z clang -g -Wall -Werror -D _GNU_SOURCE -shared -nostartfiles -fPIC -DCONFIG_AARCH64 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_GNU_SOURCE -DCONFIG_HAS_LIBBSD -DCONFIG_HAS_SELINUX -DCONFIG_GNUTLS -I ../../compel/include/uapi amdgpu_plugin.c amdgpu_plugin_drm.c amdgpu_plugin_dmabuf.c amdgpu_plugin_topology.c amdgpu_plugin_util.c criu-amdgpu.pb-c.c amdgpu_socket_utils.c -o amdgpu_plugin.so -iquote../../include -iquote../../criu/include -iquote../../criu/arch/aarch64/include/ -iquote../../ -lpthread -lrt -ldrm -ldrm_amdgpu -I/usr/include/libdrm
2026-05-05T13:47:45.9906860Z amdgpu_plugin.c:1923:7: error: variable 'fd' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
2026-05-05T13:47:45.9907426Z  1923 |                 if (ret) {
2026-05-05T13:47:45.9907619Z       |                     ^~~
2026-05-05T13:47:45.9908673Z amdgpu_plugin.c:2006:6: note: uninitialized use occurs here
2026-05-05T13:47:45.9908955Z  2006 |         if (fd < 0)
2026-05-05T13:47:45.9909123Z       |             ^~
2026-05-05T13:47:45.9958433Z amdgpu_plugin.c:1923:3: note: remove the 'if' if its condition is always false
2026-05-05T13:47:45.9958799Z  1923 |                 if (ret) {
2026-05-05T13:47:45.9958996Z       |                 ^~~~~~~~~~
2026-05-05T13:47:45.9959289Z  1924 |                         pr_err("Failed to parse local system topology %d\n",
2026-05-05T13:47:45.9959646Z       |                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2026-05-05T13:47:45.9959928Z  1925 |                                ret);
2026-05-05T13:47:45.9960143Z       |                                ~~~~~
2026-05-05T13:47:45.9960358Z  1926 |                         goto fail;
2026-05-05T13:47:45.9960566Z       |                         ~~~~~~~~~~
2026-05-05T13:47:45.9960769Z  1927 |                 }
2026-05-05T13:47:45.9960937Z       |                 ~
2026-05-05T13:47:45.9961407Z amdgpu_plugin.c:1913:7: error: variable 'fd' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
2026-05-05T13:47:45.9961917Z  1913 |                 if (ret < 0) {
2026-05-05T13:47:45.9962122Z       |                     ^~~~~~~
2026-05-05T13:47:45.9962712Z amdgpu_plugin.c:2006:6: note: uninitialized use occurs here
2026-05-05T13:47:45.9963007Z  2006 |         if (fd < 0)
2026-05-05T13:47:45.9963183Z       |             ^~
2026-05-05T13:47:45.9963459Z amdgpu_plugin.c:1913:3: note: remove the 'if' if its condition is always false
2026-05-05T13:47:45.9963791Z  1913 |                 if (ret < 0) {
2026-05-05T13:47:45.9964381Z       |                 ^~~~~~~~~~~~~~
2026-05-05T13:47:45.9964653Z  1914 |                         pr_err("Failed to find unused fd (fd:%d)\n", ret);
2026-05-05T13:47:45.9964980Z       |                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2026-05-05T13:47:45.9965439Z  1915 |                         goto fail;
2026-05-05T13:47:45.9965653Z       |                         ~~~~~~~~~~
2026-05-05T13:47:45.9965850Z  1916 |                 }
2026-05-05T13:47:45.9966015Z       |                 ~
2026-05-05T13:47:45.9966305Z amdgpu_plugin.c:1876:8: note: initialize the variable 'fd' to silence this warning
2026-05-05T13:47:45.9966644Z  1876 |         int fd, ret;
2026-05-05T13:47:45.9966901Z       |               ^
2026-05-05T13:47:45.9967067Z       |                = 0
2026-05-05T13:47:46.0005574Z 2 errors generated.
2026-05-05T13:47:46.4358865Z make[3]: *** [Makefile:31: amdgpu_plugin.so] Error 1
2026-05-05T13:47:46.4359476Z make[2]: *** [Makefile:352: amdgpu_plugin] Error 2
2026-05-05T13:47:46.4359835Z make[2]: Leaving directory '/home/runner/work/criu/criu'
2026-05-05T13:47:46.4363379Z make[1]: *** [Makefile:4: run] Error 2
2026-05-05T13:47:46.4363734Z make[1]: Leaving directory '/home/runner/work/criu/criu/test/others/make'
2026-05-05T13:47:46.4365710Z + cleanup_cgroup

If the protobuf image allocation fails we can simply return immediately since there aren't any other things to clean up at this point. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

The DRM minor local variable is shadowed between two blocks inside amdgpu_plugin_drm_dump_file(). If at the top level we access the minor via the copy stored in the protobuf image, we can simply drop this copy and so avoid the buffer object dump loop shadowing it. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

… loop We can save a few lines of code and end up with a potentially more readable function if we replace the open-coded try-retry of getting the list of buffer objects with a loop. While refactoring we also remove the needless copy of the entry as the dump loop iterates the handles, and also stop returning a false success with a made up zero buffer objects saved in cases when the kernel does not support the DRM_IOCTL_AMDGPU_GEM_LIST_HANDLES ioctl. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

…mage name amdgpu_plugin_drm_dump_file() already defines a large temporary stack buffer for the purpose of generating the file level protobuf image name. Instead of the buffer object dumping loop defining a separate temporary stack buffer for the buffer content protobuf images we can simply use the top level one. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

When saving buffer object content the current code constantly allocates and frees aligned memory where the GPU will copy each buffer one by one. We can optimise this path by keeping the buffer around and only grow it if is too small for the current object. While doing this we also fix a theoretically incorrect freeing of memory allocated via posix_memalign with xfree. An alternative solution could be to probe for the maximum size before hand by adding a pre-scan loop. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Instead of initializing and de-initializing libdrm for every buffer object we dump (each of which consists of a few system calls), we can simply initialize it once and by doing so simplify a bunch of the error handling blocks. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Simplify a bunch of error handling blocks by freeing the vm_info_entries array as soon as we are done accessing it. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

There is no need to use the fd libdrm duplicated from the original fd. Just use the original one. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

It is currently not possible to restore a process which did not have a KFD device open. We can work around that by assuming the restore is being done on the same system with the same GPU id. This can fail over reboots when the GPU id changes, but that also could be worked around with perhaps something like a "--ignore-gpu-id" command line switch. In the long term we probably need to think of adding the relevant GPU discovery APIs to the amdgpu kernel driver, which would enable this to work more smoothly. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Add a new boolean environment variable AMDGPU_IGNORE_SINGLE_GPUID which can be used to force restoring of a single GPU image on a single GPU machine. This is useful for images which had no kfd open and if the gpu id had changed for example after re-booting to a different kernel version. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Outside the KFD world the CPU virtual memory address of a mmapp-ed buffer object is not guaranteed to be same as the GPU virtual address. Therefore the address check in VMA restore will prevent any restore of a render node process. For now we relax the criteria in case of render node to match only on the offset. In the future this will need to be made more robust so strange partial mmaps can be handled as well. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Move GPU VA mapping operation to after the buffer object content is restored in order to avoid libdrm VA range manager conflicting with the already mapped buffers and so avoid failing the restore process. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Re-open the DRM device when dumping buffer object content in order to avoid the libdrm VA range manager conflicting with the client being check-pointed. Otherwise buffer content cannot be saved and the process aborts with a failure. The current approach of initializing libdrm of the current fd is insufficient because libdrm, despite libdrm "opening" a different fd it does that by dup() which does not create a new DRM client, so no new GPU VA address space gets created. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

tursulin · 2026-05-20T09:59:32Z

2026-05-05T13:47:45.8749530Z clang -g -Wall -Werror -D _GNU_SOURCE -shared -nostartfiles -fPIC -DCONFIG_AARCH64 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_GNU_SOURCE -DCONFIG_HAS_LIBBSD -DCONFIG_HAS_SELINUX -DCONFIG_GNUTLS -I ../../compel/include/uapi amdgpu_plugin.c amdgpu_plugin_drm.c amdgpu_plugin_dmabuf.c amdgpu_plugin_topology.c amdgpu_plugin_util.c criu-amdgpu.pb-c.c amdgpu_socket_utils.c -o amdgpu_plugin.so -iquote../../include -iquote../../criu/include -iquote../../criu/arch/aarch64/include/ -iquote../../ -lpthread -lrt -ldrm -ldrm_amdgpu -I/usr/include/libdrm
2026-05-05T13:47:45.9906860Z amdgpu_plugin.c:1923:7: error: variable 'fd' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
2026-05-05T13:47:45.9907426Z  1923 |                 if (ret) {
2026-05-05T13:47:45.9907619Z       |                     ^~~
2026-05-05T13:47:45.9908673Z amdgpu_plugin.c:2006:6: note: uninitialized use occurs here
2026-05-05T13:47:45.9908955Z  2006 |         if (fd < 0)
2026-05-05T13:47:45.9909123Z       |             ^~
2026-05-05T13:47:45.9958433Z amdgpu_plugin.c:1923:3: note: remove the 'if' if its condition is always false
2026-05-05T13:47:45.9958799Z  1923 |                 if (ret) {
2026-05-05T13:47:45.9958996Z       |                 ^~~~~~~~~~
2026-05-05T13:47:45.9959289Z  1924 |                         pr_err("Failed to parse local system topology %d\n",
2026-05-05T13:47:45.9959646Z       |                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2026-05-05T13:47:45.9959928Z  1925 |                                ret);
2026-05-05T13:47:45.9960143Z       |                                ~~~~~
2026-05-05T13:47:45.9960358Z  1926 |                         goto fail;
2026-05-05T13:47:45.9960566Z       |                         ~~~~~~~~~~
2026-05-05T13:47:45.9960769Z  1927 |                 }
2026-05-05T13:47:45.9960937Z       |                 ~
2026-05-05T13:47:45.9961407Z amdgpu_plugin.c:1913:7: error: variable 'fd' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
2026-05-05T13:47:45.9961917Z  1913 |                 if (ret < 0) {
2026-05-05T13:47:45.9962122Z       |                     ^~~~~~~
2026-05-05T13:47:45.9962712Z amdgpu_plugin.c:2006:6: note: uninitialized use occurs here
2026-05-05T13:47:45.9963007Z  2006 |         if (fd < 0)
2026-05-05T13:47:45.9963183Z       |             ^~
2026-05-05T13:47:45.9963459Z amdgpu_plugin.c:1913:3: note: remove the 'if' if its condition is always false
2026-05-05T13:47:45.9963791Z  1913 |                 if (ret < 0) {
2026-05-05T13:47:45.9964381Z       |                 ^~~~~~~~~~~~~~
2026-05-05T13:47:45.9964653Z  1914 |                         pr_err("Failed to find unused fd (fd:%d)\n", ret);
2026-05-05T13:47:45.9964980Z       |                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2026-05-05T13:47:45.9965439Z  1915 |                         goto fail;
2026-05-05T13:47:45.9965653Z       |                         ~~~~~~~~~~
2026-05-05T13:47:45.9965850Z  1916 |                 }
2026-05-05T13:47:45.9966015Z       |                 ~
2026-05-05T13:47:45.9966305Z amdgpu_plugin.c:1876:8: note: initialize the variable 'fd' to silence this warning
2026-05-05T13:47:45.9966644Z  1876 |         int fd, ret;
2026-05-05T13:47:45.9966901Z       |               ^
2026-05-05T13:47:45.9967067Z       |                = 0
2026-05-05T13:47:46.0005574Z 2 errors generated.
2026-05-05T13:47:46.4358865Z make[3]: *** [Makefile:31: amdgpu_plugin.so] Error 1
2026-05-05T13:47:46.4359476Z make[2]: *** [Makefile:352: amdgpu_plugin] Error 2
2026-05-05T13:47:46.4359835Z make[2]: Leaving directory '/home/runner/work/criu/criu'
2026-05-05T13:47:46.4363379Z make[1]: *** [Makefile:4: run] Error 2
2026-05-05T13:47:46.4363734Z make[1]: Leaving directory '/home/runner/work/criu/criu/test/others/make'
2026-05-05T13:47:46.4365710Z + cleanup_cgroup

Thanks for spotting this! I have uploaded a fixed version.

One thing to note however is that I am in the process of cleaning up some further patches on top of this series. Those implement more missing functionality like userptr support, context support, reserved vm id and vm always valid objects support. With that work in progress I can checkpoint and restore a client in a busy submit loop. Once I am reasonably happy with that I will extend the RFC to include all that. FYI @fdavid-amd

One more thing is that for the above I added some kernel side UAPI too so will need to coordinate posting of IGT + Kernel + CRIU branches for full picture.

codecov-commenter · 2026-05-20T11:02:14Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.00%. Comparing base (e40e6f7) to head (4f0299a).
⚠️ Report is 2 commits behind head on criu-dev.

Additional details and impacted files

@@             Coverage Diff              @@
##           criu-dev    #3016      +/-   ##
============================================
- Coverage     57.01%   57.00%   -0.02%     
============================================
  Files           154      154              
  Lines         40452    40451       -1     
  Branches       8866     8866              
============================================
- Hits          23064    23059       -5     
- Misses        17034    17038       +4     
  Partials        354      354

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

avagin requested a review from fdavid-amd May 5, 2026 23:58

avagin added the gpu/amd label May 5, 2026

Tvrtko Ursulin added 13 commits May 20, 2026 10:51

plugins/amdgpu: Simplify DRM file dump early exit

7110c16

If the protobuf image allocation fails we can simply return immediately since there aren't any other things to clean up at this point. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

plugins/amdgpu: Free vm_info_entries as soon as done with them

1f3f6b4

Simplify a bunch of error handling blocks by freeing the vm_info_entries array as soon as we are done accessing it. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

plugins/amdgpu: Use original fd to export dmabuf

6054070

There is no need to use the fd libdrm duplicated from the original fd. Just use the original one. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

tursulin force-pushed the rendernode-somewhat-working-but-no-fork-support-yet branch from 66ec3c6 to 4f0299a Compare May 20, 2026 09:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

amdgpu rendernode somewhat working but no fork support yet#3016

amdgpu rendernode somewhat working but no fork support yet#3016
tursulin wants to merge 13 commits into
checkpoint-restore:criu-devfrom
tursulin:rendernode-somewhat-working-but-no-fork-support-yet

tursulin commented May 5, 2026

Uh oh!

avagin commented May 19, 2026

Uh oh!

tursulin commented May 20, 2026

Uh oh!

codecov-commenter commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

tursulin commented May 5, 2026

Uh oh!

avagin commented May 19, 2026

Uh oh!

tursulin commented May 20, 2026

Uh oh!

codecov-commenter commented May 20, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants