Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 7 additions & 1 deletion bin/setup-policy-routes.sh
Original file line number Diff line number Diff line change
Expand Up @@ -50,12 +50,18 @@ refresh)
start)
register_networkd_reloader
counter=0
max_wait=3000 # 5 minute timeout to avoid infinite loop if sysfs node never appears
while [ ! -e "/sys/class/net/${iface}" ]; do
if ((counter % 1000 == 0)); then
debug "Waiting for sysfs node to exist for ${iface} (iteration $counter)"
fi
sleep 0.1
((counter++))
((counter++)) || true
if ((counter >= max_wait)); then
error "Timed out waiting for sysfs node for ${iface} after $((counter / 10)) seconds"
/usr/bin/systemctl disable --now refresh-policy-routes@${iface}.timer 2>/dev/null || true
exit 2
fi
done
debug "Starting configuration for $iface"
debug /lib/systemd/systemd-networkd-wait-online -i "$iface"
Expand Down
14 changes: 13 additions & 1 deletion lib/lib.sh
Original file line number Diff line number Diff line change
Expand Up @@ -627,10 +627,22 @@ maybe_reload_networkd() {

register_networkd_reloader() {
local -i registered=1 cnt=0
local -i max=10000
local -i max=3000 # 300s (3000 × 0.1s); matches sysfs wait timeout in setup-policy-routes.sh
local -r lockfile="${lockdir}/${iface}"
local old_opts=$-

# If the existing lock owner is no longer alive, remove the stale lockfile
# so subsequent invocations don't spin for up to 1000 seconds waiting on a
# process that will never release it.
if [ -f "${lockfile}" ]; then
local existing_pid
existing_pid=$(cat "${lockfile}" 2>/dev/null)
if [ -n "$existing_pid" ] && ! kill -0 "$existing_pid" 2>/dev/null; then
debug "Removing stale lock from dead process $existing_pid for ${iface}"
rm -f "${lockfile}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could there be a race condtion when two PIDs clash and a lock file is removed by accident?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good catch. If the PID is reassigned to a another setup-policy-routes process which acquires the lock after the ! kill -0 "$existing_pid" check, this code would then delete a valid lockfile. This is very unlikely, but let's consider it.

I dont think an atomic operation is possible purely with shell code.

What if we also add a check on the lockfile age? We can use the same value that we set for the sysfs wait timeout (original is 300s) as a stale threshold? Only if the lockfile is older than that timeout, can we consider it stale.

Something like:

        local lock_age=$(( $(date +%s) - $(stat -c %Y "${lockfile}" 2>/dev/null || echo 0) ))
        if [ "$lock_age" -gt 300 ]; then
            debug "Removing stale lock from dead process $existing_pid for ${iface}"
            rm -f "${lockfile}"
        fi

Note: the threshold should stay in sync with max_wait * 0.1 from the sysfs wait timeout

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this approach but I am also okay with not adding more complication since the PID space is quite large.

fi
fi

# Disable -o errexit in the following block so we can capture
# nonzero exit codes from a redirect without considering them
# fatal errors
Expand Down
1 change: 1 addition & 0 deletions systemd/system/policy-routes@.service
Original file line number Diff line number Diff line change
Expand Up @@ -17,4 +17,5 @@ User=root
ExecStart=/usr/bin/setup-policy-routes %i start
Restart=on-failure
RestartSec=1
RestartPreventExitStatus=2
KillMode=process
Loading