
More robust handling of missing uptime events during farmerbot wakeups #34

@scottyeager

In the current minting flow, a node managed by the farmerbot must submit at least one uptime report during each wake up for the minting algorithm to detect and register that wake up. The problem is that nodes sometimes fail to submit an uptime report even though they have booted as requested and have even set their power state to Up. When this happens, the state that minting tracks for the node diverges from reality, and an otherwise normal sequence of events is eventually interpreted as a violation.

Here's an example of such a sequence of events:

# Node is entering standby normally
power_managed: None power_manage_boot: None
PowerStateChanged(state='Down', timestamp=1721033508)
power_managed: 1721033508 power_manage_boot: None
NodeUptimeReported(uptime=2304, timestamp=1721033514)
power_managed: 1721033508 power_manage_boot: None

# Wake up is triggered
PowerTargetChanged(target='Up', timestamp=1721054940)
power_managed: 1721033508 power_manage_boot: 1721054940
# Here's where the uptime event is missing
PowerStateChanged(state='Up', timestamp=1721055780)
power_managed: 1721033508 power_manage_boot: 1721054940
PowerTargetChanged(target='Down', timestamp=1721057664)
power_managed: 1721033508 power_manage_boot: 1721054940
PowerStateChanged(state='Down', timestamp=1721057670)
power_managed: 1721033508 power_manage_boot: 1721054940
NodeUptimeReported(uptime=2166, timestamp=1721057676)
power_managed: None power_manage_boot: None
# Node is now standby, but power_managed is not set

# Next wake up
PowerTargetChanged(target='Up', timestamp=1721117088)
power_managed: None power_manage_boot: 1721117088
# Uptime event missing again
PowerStateChanged(state='Up', timestamp=1721118108)
power_managed: None power_manage_boot: 1721117088
PowerTargetChanged(target='Down', timestamp=1721120118)
power_managed: None power_manage_boot: 1721117088
PowerStateChanged(state='Down', timestamp=1721120124)
power_managed: 1721120124 power_manage_boot: 1721117088
NodeUptimeReported(uptime=2469, timestamp=1721120130)
power_managed: 1721120124 power_manage_boot: 1721117088
# Boot request was not cleared, so it looks like node is still booting up when actually it's shutting down
# This node boots again sometime after 30 minutes from the current power_manage_boot and is assigned a violation
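
The uptime reports in the trace can be checked with a little arithmetic. The sketch below assumes, based only on how `power_managed` changes in the trace and not on the minting source, that an uptime report counts as a wake up when the implied boot time (`timestamp - uptime`) falls after `power_managed`. The helper names and `BOOT_GRACE_SECONDS` are hypothetical; the 30 minute figure comes from the comment above.

```python
# (uptime, timestamp, power_managed just before the report), copied from the trace.
REPORTS = [
    (2304, 1721033514, 1721033508),  # report right after first entering standby
    (2166, 1721057676, 1721033508),  # late report after the first wake up
    (2469, 1721120130, 1721120124),  # late report after the second wake up
]

# Assumption: 30 minutes allowed for booting, per the comment in the trace.
BOOT_GRACE_SECONDS = 30 * 60


def implied_boot_time(uptime: int, timestamp: int) -> int:
    """Time at which the node must have booted to report this uptime."""
    return timestamp - uptime


def looks_like_wakeup(uptime: int, timestamp: int, power_managed: int) -> bool:
    """Assumed rule: the node booted after it was recorded as entering standby."""
    return implied_boot_time(uptime, timestamp) > power_managed


def boot_deadline(power_manage_boot: int) -> int:
    """Latest boot time that would still fall within the grace period."""
    return power_manage_boot + BOOT_GRACE_SECONDS


if __name__ == "__main__":
    for uptime, timestamp, power_managed in REPORTS:
        print(
            implied_boot_time(uptime, timestamp),
            looks_like_wakeup(uptime, timestamp, power_managed),
        )
    # Deadline implied by the stale boot request left over at the end of the trace.
    print(boot_deadline(1721117088))
```

Under this reading, only the second report is treated as a wake up. The third report arrives after `power_managed` has already been set again by the Down event, so the stale `power_manage_boot` of 1721117088 is never cleared.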

Of course Zos should be responsible for generating the proper sequence of events here, but the fact is that sometimes it doesn't. To summarize that issue: new health checks implemented in Zos delayed the sending of uptime reports when farmerbot-controlled nodes woke up and, more importantly, the implementation caused many nodes to generate the invalid sequence of events shown in the example above.

Beyond that, transactions submitted to tfchain are sometimes simply absent from the finalized chain. That is actually how I first observed the sequence of events presented above, but we are now seeing many more examples due to the new issue in Zos.

So my thinking here is that we can make minting more robust in cases like this by also registering a wake up when the node changes its power state to Up.
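
To illustrate the idea, here is a minimal sketch that replays the events from the example with and without the proposed rule. It is not the minting implementation: the `Tracker` class, its `wake_on_state_up` switch, and the transition rules are assumptions inferred from the trace above (in particular, that an uptime report registers a wake up when `timestamp - uptime` is later than `power_managed`).

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Events copied from the example trace: (kind, value, timestamp).
EVENTS = [
    ("PowerStateChanged", "Down", 1721033508),
    ("NodeUptimeReported", 2304, 1721033514),
    ("PowerTargetChanged", "Up", 1721054940),
    ("PowerStateChanged", "Up", 1721055780),
    ("PowerTargetChanged", "Down", 1721057664),
    ("PowerStateChanged", "Down", 1721057670),
    ("NodeUptimeReported", 2166, 1721057676),
    ("PowerTargetChanged", "Up", 1721117088),
    ("PowerStateChanged", "Up", 1721118108),
    ("PowerTargetChanged", "Down", 1721120118),
    ("PowerStateChanged", "Down", 1721120124),
    ("NodeUptimeReported", 2469, 1721120130),
]


@dataclass
class Tracker:
    """Hypothetical model of the per-node state that minting tracks."""

    wake_on_state_up: bool = False  # True enables the proposed rule
    power_managed: Optional[int] = None
    power_manage_boot: Optional[int] = None

    def _register_wakeup(self) -> None:
        self.power_managed = None
        self.power_manage_boot = None

    def handle(self, kind: str, value, timestamp: int) -> None:
        if kind == "PowerStateChanged":
            if value == "Down" and self.power_managed is None:
                self.power_managed = timestamp
            elif value == "Up" and self.wake_on_state_up:
                # Proposed: a power state of Up also counts as a wake up.
                self._register_wakeup()
        elif kind == "PowerTargetChanged":
            if value == "Up" and self.power_manage_boot is None:
                self.power_manage_boot = timestamp
        elif kind == "NodeUptimeReported":
            # Assumed rule: the node booted after it entered standby.
            boot_time = timestamp - value
            if self.power_managed is not None and boot_time > self.power_managed:
                self._register_wakeup()


def replay(wake_on_state_up: bool) -> Tuple[Optional[int], Optional[int]]:
    """Replay the trace and return (power_managed, power_manage_boot)."""
    tracker = Tracker(wake_on_state_up=wake_on_state_up)
    for kind, value, timestamp in EVENTS:
        tracker.handle(kind, value, timestamp)
    return tracker.power_managed, tracker.power_manage_boot


if __name__ == "__main__":
    print("current: ", replay(False))
    print("proposed:", replay(True))
```

With the switch off, the replay ends in the same state as the trace, with a stale `power_manage_boot`. With it on, each `PowerStateChanged(state='Up')` clears the boot request, so the node ends in standby with `power_managed` set and no pending boot. How the real algorithm should treat an uptime report that arrives after such a wake up is left open here.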
