Skip to content

Reconfigurator DNS execution may overwrite fresh data with stale data #10251

@smklein

Description

@smklein

Within nexus/reconfigurator/execution/src/dns.rs:

  • The blueprint contains a DNS version, in {internal,external}_dns_version. (Foreshadowing: This is logged, but otherwise unused during the execution phase!)
  • deploy_dns reads the current DNS config from the database, and passes it to deploy_dns_one as dns_config_current
  • The diff between the database (dns_config_current) and the blueprint is calculated in dns_compute_update
  • The update of database DNS state happens in dns_update_from_version, where an old_version is passed.

The transaction reads version from and bails out if version.version != old_version. That old_version is the value we read from the DB earlier, before we calculcated the diff.

This guards against "the DNS data changing since when we started deploy_dns" - which is great, but it is not guarded against the DNS data being from an old blueprint.

How this could manifest as a bug:

  • Nexus N1 is executing blueprint B1. It is the current target, so it starts executing, but stalls.
  • Nexus N2 creates and executes blueprint B2. It is the current target, and dns data from B2 gets written to the database.
  • Later, Nexus N1 resumes execution of B1. B1's DNS data is out-of-date. However, nothing in deploy DNS stops it from proceeding.
    • N1 would read the "latest DNS configs" from the DB (the ones written from B2)
    • N1 would compute an "update" that causes a regression back to B1 (this is bad! B1 is old!)
    • N1 calls "dns_update_from_version" - which would check that the database state is unchanged (it is unchanged -- it's from B2) and will proceed to overwrite it with the DNS state from B1.

There is a comment in deploy_dns, which partially acknowledges this issue:

    // We could check here that the DNS version we found isn't newer than when
    // the blueprint was generated.  But we have to check later when we try to
    // update the database anyway.  And we're not wasting much effort allowing
    // this proceed for now.  This way, we have only one code path for this and
    // we know it's being hit when we exercise this condition.

However, we never check that "the DNS version found isn't newer than the blueprint!" If this comment was true, we wouldn't have an issue. Unfortunately, the blueprint generation is fully ignored.

To fix:

  • We should actually use {internal,external}_dns_version in our checks, before deciding to write the DNS data to the database.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething that isn't working.databaseRelated to database accessnexusRelated to nexus

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions