fix(healthcheck): cleanup transient units on container exit and start #4935
Open
haytok wants to merge 2 commits into
Conversation
Member
Author
The tests related to this fix are failing, so I will look into them.
Suppose a container with healthcheck enabled has exited.

```bash
> sudo nerdctl ps -a --filter "name=hoge"
CONTAINER ID    IMAGE                              COMMAND      CREATED               STATUS                            PORTS    NAMES
dd94022f7dd0    docker.io/library/alpine:latest    "sleep 1"    About a minute ago    Exited (0) About a minute ago              hoge
```

When we try to run `nerdctl start` on that container, the following error occurs, and the container cannot be started.

```bash
> sudo nerdctl start hoge
FATA[0000] 1 errors: failed to create healthcheck timer: systemd-run failed: exit status 1 output: Failed to start transient timer unit: Unit dd94022f7dd0f8b60cb49e803c52de3a7a49c2b39276883ed7c606e13ca1a918.timer was already loaded or has a fragment file.
```

The cause of the failure is the presence of the systemd transient timer unit used when executing health checks.

When checking the output of `systemctl status`, the status of the transient timer unit is `active`, but an error has occurred in the transient service unit that executes the healthcheck command.

In nerdctl, container health check is performed by running the `systemd-run` command to periodically execute the `exec` command on the target container via a transient service unit and a transient timer unit, and executing the command specified with the `--health-cmd` option.

However, the current implementation does not account for the case where the container has exited.

Therefore, this commit will ensure that transient units are deleted when a container with a health check enabled exits. It will also ensure that the system checks for the presence of transient units when restarting a stopped container with a health check enabled.

The specific approach is as follows:

- Use the `--collect` option of the `systemd-run` command so that the transient service unit can be garbage-collected even when it is in a failed state.
- Delete the transient timer unit when the process exits and the container is in a stopped state.
- Before creating a new transient timer unit in CreateTimer, check whether a transient timer unit with the same name already exists and remove it if so.

Note that if the `--collect` option is specified when executing the `systemd-run` command, deleting the transient timer unit will cause it to be unloaded by systemd's garbage collection.

References:

- https://www.freedesktop.org/software/systemd/man/latest/systemd-run.html#-G
- https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html#CollectMode=

Signed-off-by: Hayato Kiwata <dev@haytok.jp>
6e3f5ba to
08b99f8
Compare
Member
Author
I have resolved the CI failures mentioned above.
Details are described in the commit message of adfb7d4.
The findings from my investigation into this issue are as follows:
The message `Active: failed (Result: exit-code)` indicates that the healthcheck for the container, which has entered a stopped state, has failed.