The current implementation of tsi_connect roughly goes like (ref)
static int tsi_connect(struct socket *sock, struct sockaddr *addr, int addr_len, int flags)
{
// ...
if (isocket) {
err = isocket->ops->connect(isocket, addr, addr_len,
flags & ~O_NONBLOCK);
if (err == 0 || err == -EALREADY) {
tsk->status = S_INET;
goto release;
} else if (err == -EINPROGRESS) {
tsk->status = S_INET;
goto release;
}
}
}
However this creates a problem with SOCK_DGRAM, upon connect with a DGRAM socket, isocket only checks if the target destination IP is routed but does not check if it is actually reachable (contrary to SOCK_STREAM where TCP handshake is performed thus confirming reachability).
This creates a problem for an edge case where if the guest configured its routing table with an IP that is reachable for the host, the guest kernel will always try to send UDP packets to that IP on the guest side.
Particularly with libkrun setting a dummy interface with IP 10.0.0.1/8 in past versions, and with k8s and podman using default pod address from this same range, this can make krun pods in these environment not able to route to other pods, and potentially problems with DNS as well.
Although this was fixed in this commit by replacing the dummy IP range with a less-used one. IMO This is still just a workaround not a proper fix. Under exotic network setups and edge cases, this might still trigger with overlapping network blocks.
I think the proper fix is to make DGRAM connects stay in S_HYBRID, copy the address to tsk->sendto_addr and defer the personality decision to sendmsg time, like this:
static int tsi_connect(struct socket *sock, struct sockaddr *addr, int addr_len, int flags)
{
// ...
if (isocket) {
if (sock->type == SOCK_DGRAM) {
// Store the address for later use in sendmsg
if (!tsk->sendto_addr) {
tsk->sendto_addr = kmalloc(addr_len, GFP_KERNEL);
tsk->sendto_addr_len = addr_len;
}
memcpy(tsk->sendto_addr, addr, addr_len);
goto release; // stay in current state (S_HYBRID)
}
err = isocket->ops->connect(isocket, addr, addr_len,
flags & ~O_NONBLOCK);
// normal commit to S_INET for SOCK_STREAM ...
}
}
I went for an issue because I am not sure how to contribute to the code in this repo since they are just a patchset and directly modifying a patch feels wrong... If there is a kernel source tree branch or mailing list you can point me to I would be more than happy to submit a patch for this there.
The current implementation of
tsi_connectroughly goes like (ref)However this creates a problem with
SOCK_DGRAM, uponconnectwith aDGRAMsocket,isocketonly checks if the target destination IP is routed but does not check if it is actually reachable (contrary toSOCK_STREAMwhere TCP handshake is performed thus confirming reachability).This creates a problem for an edge case where if the guest configured its routing table with an IP that is reachable for the host, the guest kernel will always try to send UDP packets to that IP on the guest side.
Particularly with libkrun setting a dummy interface with IP
10.0.0.1/8in past versions, and with k8s and podman using default pod address from this same range, this can make krun pods in these environment not able to route to other pods, and potentially problems with DNS as well.Although this was fixed in this commit by replacing the dummy IP range with a less-used one. IMO This is still just a workaround not a proper fix. Under exotic network setups and edge cases, this might still trigger with overlapping network blocks.
I think the proper fix is to make
DGRAMconnects stay inS_HYBRID, copy the address totsk->sendto_addrand defer the personality decision tosendmsgtime, like this:I went for an issue because I am not sure how to contribute to the code in this repo since they are just a patchset and directly modifying a patch feels wrong... If there is a kernel source tree branch or mailing list you can point me to I would be more than happy to submit a patch for this there.