Skip to content

amdgpu rendernode somewhat working but no fork support yet#3016

Open
tursulin wants to merge 13 commits into
checkpoint-restore:criu-devfrom
tursulin:rendernode-somewhat-working-but-no-fork-support-yet
Open

amdgpu rendernode somewhat working but no fork support yet#3016
tursulin wants to merge 13 commits into
checkpoint-restore:criu-devfrom
tursulin:rendernode-somewhat-working-but-no-fork-support-yet

Conversation

@tursulin
Copy link
Copy Markdown

@tursulin tursulin commented May 5, 2026

cc @fdavid-amd

This is a RFC style PR with 5 interesting patches on top of the cleanups I've sent as separate PRs. The interesting patches are these:

66ec3c611f82 plugins/amdgpu: Re-open the DRM device when saving buffer object content
80e839f67084 plugins/amdgpu: Move GPU VA map to after content restore
afe8e2b2fd23 plugins/amdgpu: Handle DRM render node mmaps
62cd972b4315 plugins/amdgpu: Allow forcing restore on a single GPU system
5d05115056e6 plugins/amdgpu: Enable restore with only the drm node open

With this I can get all but the last the test from https://cgit.freedesktop.org/~tursulin/intel-gpu-tools/commit/?h=amd-criu&id=24487611e08c753e73b3fc650ca85793bf053f4d to pass. The last one includes forking and for that problem I will kick off a separate discussion.

But the most complicated test case from the above:

sudo ~/build-holo/tests/amdgpu/amd_criu --r map-and-mmap-one

Can be checkpointed and restored:

$ sudo /usr/local/sbin/criu dump -t `pgrep amd_criu | head -1` -L /usr/local/lib/criu/ -vvv -o criu.log -j --link-remap --tcp-established --file-locks --ext-unix-sk
$ sudo /usr/local/sbin/criu restore  -L /usr/local/lib/criu/ -vvv -o restore.log --shell-job --link-remap --tcp-established --file-locks --ext-unix-sk

Subtest map-and-mmap-one: SUCCESS (13.989s)

It would be good to test this series with some kfd and kfd+amdgpu combined workload to check I created no regressions.

@avagin avagin requested a review from fdavid-amd May 5, 2026 23:58
@avagin avagin added the gpu/amd label May 5, 2026
@avagin
Copy link
Copy Markdown
Member

avagin commented May 19, 2026

2026-05-05T13:47:45.8749530Z clang -g -Wall -Werror -D _GNU_SOURCE -shared -nostartfiles -fPIC -DCONFIG_AARCH64 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_GNU_SOURCE -DCONFIG_HAS_LIBBSD -DCONFIG_HAS_SELINUX -DCONFIG_GNUTLS -I ../../compel/include/uapi amdgpu_plugin.c amdgpu_plugin_drm.c amdgpu_plugin_dmabuf.c amdgpu_plugin_topology.c amdgpu_plugin_util.c criu-amdgpu.pb-c.c amdgpu_socket_utils.c -o amdgpu_plugin.so -iquote../../include -iquote../../criu/include -iquote../../criu/arch/aarch64/include/ -iquote../../ -lpthread -lrt -ldrm -ldrm_amdgpu -I/usr/include/libdrm
2026-05-05T13:47:45.9906860Z amdgpu_plugin.c:1923:7: error: variable 'fd' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
2026-05-05T13:47:45.9907426Z  1923 |                 if (ret) {
2026-05-05T13:47:45.9907619Z       |                     ^~~
2026-05-05T13:47:45.9908673Z amdgpu_plugin.c:2006:6: note: uninitialized use occurs here
2026-05-05T13:47:45.9908955Z  2006 |         if (fd < 0)
2026-05-05T13:47:45.9909123Z       |             ^~
2026-05-05T13:47:45.9958433Z amdgpu_plugin.c:1923:3: note: remove the 'if' if its condition is always false
2026-05-05T13:47:45.9958799Z  1923 |                 if (ret) {
2026-05-05T13:47:45.9958996Z       |                 ^~~~~~~~~~
2026-05-05T13:47:45.9959289Z  1924 |                         pr_err("Failed to parse local system topology %d\n",
2026-05-05T13:47:45.9959646Z       |                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2026-05-05T13:47:45.9959928Z  1925 |                                ret);
2026-05-05T13:47:45.9960143Z       |                                ~~~~~
2026-05-05T13:47:45.9960358Z  1926 |                         goto fail;
2026-05-05T13:47:45.9960566Z       |                         ~~~~~~~~~~
2026-05-05T13:47:45.9960769Z  1927 |                 }
2026-05-05T13:47:45.9960937Z       |                 ~
2026-05-05T13:47:45.9961407Z amdgpu_plugin.c:1913:7: error: variable 'fd' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
2026-05-05T13:47:45.9961917Z  1913 |                 if (ret < 0) {
2026-05-05T13:47:45.9962122Z       |                     ^~~~~~~
2026-05-05T13:47:45.9962712Z amdgpu_plugin.c:2006:6: note: uninitialized use occurs here
2026-05-05T13:47:45.9963007Z  2006 |         if (fd < 0)
2026-05-05T13:47:45.9963183Z       |             ^~
2026-05-05T13:47:45.9963459Z amdgpu_plugin.c:1913:3: note: remove the 'if' if its condition is always false
2026-05-05T13:47:45.9963791Z  1913 |                 if (ret < 0) {
2026-05-05T13:47:45.9964381Z       |                 ^~~~~~~~~~~~~~
2026-05-05T13:47:45.9964653Z  1914 |                         pr_err("Failed to find unused fd (fd:%d)\n", ret);
2026-05-05T13:47:45.9964980Z       |                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2026-05-05T13:47:45.9965439Z  1915 |                         goto fail;
2026-05-05T13:47:45.9965653Z       |                         ~~~~~~~~~~
2026-05-05T13:47:45.9965850Z  1916 |                 }
2026-05-05T13:47:45.9966015Z       |                 ~
2026-05-05T13:47:45.9966305Z amdgpu_plugin.c:1876:8: note: initialize the variable 'fd' to silence this warning
2026-05-05T13:47:45.9966644Z  1876 |         int fd, ret;
2026-05-05T13:47:45.9966901Z       |               ^
2026-05-05T13:47:45.9967067Z       |                = 0
2026-05-05T13:47:46.0005574Z 2 errors generated.
2026-05-05T13:47:46.4358865Z make[3]: *** [Makefile:31: amdgpu_plugin.so] Error 1
2026-05-05T13:47:46.4359476Z make[2]: *** [Makefile:352: amdgpu_plugin] Error 2
2026-05-05T13:47:46.4359835Z make[2]: Leaving directory '/home/runner/work/criu/criu'
2026-05-05T13:47:46.4363379Z make[1]: *** [Makefile:4: run] Error 2
2026-05-05T13:47:46.4363734Z make[1]: Leaving directory '/home/runner/work/criu/criu/test/others/make'
2026-05-05T13:47:46.4365710Z + cleanup_cgroup

Tvrtko Ursulin added 13 commits May 20, 2026 10:51
If the protobuf image allocation fails we can simply return immediately
since there aren't any other things to clean up at this point.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
The DRM minor local variable is shadowed between two blocks inside
amdgpu_plugin_drm_dump_file(). If at the top level we access the minor via
the copy stored in the protobuf image, we can simply drop this copy and
so avoid the buffer object dump loop shadowing it.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
… loop

We can save a few lines of code and end up with a potentially more
readable function if we replace the open-coded try-retry of getting the
list of buffer objects with a loop.

While refactoring we also remove the needless copy of the entry as the
dump loop iterates the handles, and also stop returning a false success
with a made up zero buffer objects saved in cases when the kernel does not
support the DRM_IOCTL_AMDGPU_GEM_LIST_HANDLES ioctl.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
…mage name

amdgpu_plugin_drm_dump_file() already defines a large temporary stack
buffer for the purpose of generating the file level protobuf image name.

Instead of the buffer object dumping loop defining a separate temporary
stack buffer for the buffer content protobuf images we can simply use the
top level one.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
When saving buffer object content the current code constantly allocates
and frees aligned memory where the GPU will copy each buffer one by one.
We can optimise this path by keeping the buffer around and only grow it if
is too small for the current object.

While doing this we also fix a theoretically incorrect freeing of memory
allocated via posix_memalign with xfree.

An alternative solution could be to probe for the maximum size before hand
by adding a pre-scan loop.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Instead of initializing and de-initializing libdrm for every buffer object
we dump (each of which consists of a few system calls), we can simply
initialize it once and by doing so simplify a bunch of the error handling
blocks.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Simplify a bunch of error handling blocks by freeing the vm_info_entries
array as soon as we are done accessing it.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
There is no need to use the fd libdrm duplicated from the original fd.
Just use the original one.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
It is currently not possible to restore a process which did not have a KFD
device open.

We can work around that by assuming the restore is being done on the same
system with the same GPU id. This can fail over reboots when the GPU id
changes, but that also could be worked around with perhaps something like
a "--ignore-gpu-id" command line switch.

In the long term we probably need to think of adding the relevant GPU
discovery APIs to the amdgpu kernel driver, which would enable this to
work more smoothly.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Add a new boolean environment variable AMDGPU_IGNORE_SINGLE_GPUID which
can be used to force restoring of a single GPU image on a single GPU
machine. This is useful for images which had no kfd open and if the gpu id
had changed for example after re-booting to a different kernel version.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Outside the KFD world the CPU virtual memory address of a mmapp-ed buffer
object is not guaranteed to be same as the GPU virtual address. Therefore
the address check in VMA restore will prevent any restore of a render node
process.

For now we relax the criteria in case of render node to match only on the
offset. In the future this will need to be made more robust so strange
partial mmaps can be handled as well.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Move GPU VA mapping operation to after the buffer object content is
restored in order to avoid libdrm VA range manager conflicting with the
already mapped buffers and so avoid failing the restore process.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Re-open the DRM device when dumping buffer object content in order to
avoid the libdrm VA range manager conflicting with the client being
check-pointed. Otherwise buffer content cannot be saved and the process
aborts with a failure.

The current approach of initializing libdrm of the current fd is
insufficient because libdrm, despite libdrm "opening" a different fd it
does that by dup() which does not create a new DRM client, so no new
GPU VA address space gets created.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
@tursulin tursulin force-pushed the rendernode-somewhat-working-but-no-fork-support-yet branch from 66ec3c6 to 4f0299a Compare May 20, 2026 09:53
@tursulin
Copy link
Copy Markdown
Author

2026-05-05T13:47:45.8749530Z clang -g -Wall -Werror -D _GNU_SOURCE -shared -nostartfiles -fPIC -DCONFIG_AARCH64 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_GNU_SOURCE -DCONFIG_HAS_LIBBSD -DCONFIG_HAS_SELINUX -DCONFIG_GNUTLS -I ../../compel/include/uapi amdgpu_plugin.c amdgpu_plugin_drm.c amdgpu_plugin_dmabuf.c amdgpu_plugin_topology.c amdgpu_plugin_util.c criu-amdgpu.pb-c.c amdgpu_socket_utils.c -o amdgpu_plugin.so -iquote../../include -iquote../../criu/include -iquote../../criu/arch/aarch64/include/ -iquote../../ -lpthread -lrt -ldrm -ldrm_amdgpu -I/usr/include/libdrm
2026-05-05T13:47:45.9906860Z amdgpu_plugin.c:1923:7: error: variable 'fd' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
2026-05-05T13:47:45.9907426Z  1923 |                 if (ret) {
2026-05-05T13:47:45.9907619Z       |                     ^~~
2026-05-05T13:47:45.9908673Z amdgpu_plugin.c:2006:6: note: uninitialized use occurs here
2026-05-05T13:47:45.9908955Z  2006 |         if (fd < 0)
2026-05-05T13:47:45.9909123Z       |             ^~
2026-05-05T13:47:45.9958433Z amdgpu_plugin.c:1923:3: note: remove the 'if' if its condition is always false
2026-05-05T13:47:45.9958799Z  1923 |                 if (ret) {
2026-05-05T13:47:45.9958996Z       |                 ^~~~~~~~~~
2026-05-05T13:47:45.9959289Z  1924 |                         pr_err("Failed to parse local system topology %d\n",
2026-05-05T13:47:45.9959646Z       |                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2026-05-05T13:47:45.9959928Z  1925 |                                ret);
2026-05-05T13:47:45.9960143Z       |                                ~~~~~
2026-05-05T13:47:45.9960358Z  1926 |                         goto fail;
2026-05-05T13:47:45.9960566Z       |                         ~~~~~~~~~~
2026-05-05T13:47:45.9960769Z  1927 |                 }
2026-05-05T13:47:45.9960937Z       |                 ~
2026-05-05T13:47:45.9961407Z amdgpu_plugin.c:1913:7: error: variable 'fd' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
2026-05-05T13:47:45.9961917Z  1913 |                 if (ret < 0) {
2026-05-05T13:47:45.9962122Z       |                     ^~~~~~~
2026-05-05T13:47:45.9962712Z amdgpu_plugin.c:2006:6: note: uninitialized use occurs here
2026-05-05T13:47:45.9963007Z  2006 |         if (fd < 0)
2026-05-05T13:47:45.9963183Z       |             ^~
2026-05-05T13:47:45.9963459Z amdgpu_plugin.c:1913:3: note: remove the 'if' if its condition is always false
2026-05-05T13:47:45.9963791Z  1913 |                 if (ret < 0) {
2026-05-05T13:47:45.9964381Z       |                 ^~~~~~~~~~~~~~
2026-05-05T13:47:45.9964653Z  1914 |                         pr_err("Failed to find unused fd (fd:%d)\n", ret);
2026-05-05T13:47:45.9964980Z       |                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2026-05-05T13:47:45.9965439Z  1915 |                         goto fail;
2026-05-05T13:47:45.9965653Z       |                         ~~~~~~~~~~
2026-05-05T13:47:45.9965850Z  1916 |                 }
2026-05-05T13:47:45.9966015Z       |                 ~
2026-05-05T13:47:45.9966305Z amdgpu_plugin.c:1876:8: note: initialize the variable 'fd' to silence this warning
2026-05-05T13:47:45.9966644Z  1876 |         int fd, ret;
2026-05-05T13:47:45.9966901Z       |               ^
2026-05-05T13:47:45.9967067Z       |                = 0
2026-05-05T13:47:46.0005574Z 2 errors generated.
2026-05-05T13:47:46.4358865Z make[3]: *** [Makefile:31: amdgpu_plugin.so] Error 1
2026-05-05T13:47:46.4359476Z make[2]: *** [Makefile:352: amdgpu_plugin] Error 2
2026-05-05T13:47:46.4359835Z make[2]: Leaving directory '/home/runner/work/criu/criu'
2026-05-05T13:47:46.4363379Z make[1]: *** [Makefile:4: run] Error 2
2026-05-05T13:47:46.4363734Z make[1]: Leaving directory '/home/runner/work/criu/criu/test/others/make'
2026-05-05T13:47:46.4365710Z + cleanup_cgroup

Thanks for spotting this! I have uploaded a fixed version.

One thing to note however is that I am in the process of cleaning up some further patches on top of this series. Those implement more missing functionality like userptr support, context support, reserved vm id and vm always valid objects support. With that work in progress I can checkpoint and restore a client in a busy submit loop. Once I am reasonably happy with that I will extend the RFC to include all that. FYI @fdavid-amd

One more thing is that for the above I added some kernel side UAPI too so will need to coordinate posting of IGT + Kernel + CRIU branches for full picture.

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.00%. Comparing base (e40e6f7) to head (4f0299a).
⚠️ Report is 2 commits behind head on criu-dev.

Additional details and impacted files
@@             Coverage Diff              @@
##           criu-dev    #3016      +/-   ##
============================================
- Coverage     57.01%   57.00%   -0.02%     
============================================
  Files           154      154              
  Lines         40452    40451       -1     
  Branches       8866     8866              
============================================
- Hits          23064    23059       -5     
- Misses        17034    17038       +4     
  Partials        354      354              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants