Within nexus/reconfigurator/execution/src/dns.rs:
- The blueprint contains a DNS version, in
{internal,external}_dns_version. (Foreshadowing: This is logged, but otherwise unused during the execution phase!)
deploy_dns reads the current DNS config from the database, and passes it to deploy_dns_one as dns_config_current
- The diff between the database (
dns_config_current) and the blueprint is calculated in dns_compute_update
- The update of database DNS state happens in
dns_update_from_version, where an old_version is passed.
The transaction reads version from and bails out if version.version != old_version. That old_version is the value we read from the DB earlier, before we calculcated the diff.
This guards against "the DNS data changing since when we started deploy_dns" - which is great, but it is not guarded against the DNS data being from an old blueprint.
How this could manifest as a bug:
- Nexus N1 is executing blueprint B1. It is the current target, so it starts executing, but stalls.
- Nexus N2 creates and executes blueprint B2. It is the current target, and dns data from B2 gets written to the database.
- Later, Nexus N1 resumes execution of B1. B1's DNS data is out-of-date. However, nothing in deploy DNS stops it from proceeding.
- N1 would read the "latest DNS configs" from the DB (the ones written from B2)
- N1 would compute an "update" that causes a regression back to B1 (this is bad! B1 is old!)
- N1 calls "dns_update_from_version" - which would check that the database state is unchanged (it is unchanged -- it's from B2) and will proceed to overwrite it with the DNS state from B1.
There is a comment in deploy_dns, which partially acknowledges this issue:
// We could check here that the DNS version we found isn't newer than when
// the blueprint was generated. But we have to check later when we try to
// update the database anyway. And we're not wasting much effort allowing
// this proceed for now. This way, we have only one code path for this and
// we know it's being hit when we exercise this condition.
However, we never check that "the DNS version found isn't newer than the blueprint!" If this comment was true, we wouldn't have an issue. Unfortunately, the blueprint generation is fully ignored.
To fix:
- We should actually use
{internal,external}_dns_version in our checks, before deciding to write the DNS data to the database.
Within
nexus/reconfigurator/execution/src/dns.rs:{internal,external}_dns_version. (Foreshadowing: This is logged, but otherwise unused during the execution phase!)deploy_dnsreads the current DNS config from the database, and passes it todeploy_dns_oneasdns_config_currentdns_config_current) and the blueprint is calculated indns_compute_updatedns_update_from_version, where anold_versionis passed.The transaction reads
versionfrom and bails outif version.version != old_version. Thatold_versionis the value we read from the DB earlier, before we calculcated the diff.This guards against "the DNS data changing since when we started
deploy_dns" - which is great, but it is not guarded against the DNS data being from an old blueprint.How this could manifest as a bug:
There is a comment in
deploy_dns, which partially acknowledges this issue:However, we never check that "the DNS version found isn't newer than the blueprint!" If this comment was true, we wouldn't have an issue. Unfortunately, the blueprint generation is fully ignored.
To fix:
{internal,external}_dns_versionin our checks, before deciding to write the DNS data to the database.