feat: poll testnet status and open infra recovery issues #742
vivekgsharma merged 4 commits into v1.0-dev from
Conversation
📝 Walkthrough

This pull request introduces a complete testnet status monitoring and recovery automation system. It adds a scheduled GitHub Actions workflow, Ansible provisioning for monitoring infrastructure, a Node.js module that detects unhealthy testnet nodes and creates recovery issues, supporting scripts, and a Nix development environment.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GHA as GitHub Actions
    participant Status as Testnet Status API
    participant Inventory as Node Inventory
    participant GHIssues as GitHub Issues API
    GHA->>Inventory: Read expected node names
    Inventory-->>GHA: Node list (masternodes, hp_masternodes)
    GHA->>Status: Poll status API
    Status-->>GHA: Current status for all nodes
    GHA->>GHA: Compare expected vs actual<br/>Detect recovery incidents
    GHA->>GHIssues: Query existing open issues
    GHIssues-->>GHA: List of open recovery issues
    GHA->>GHA: Filter incidents<br/>without existing issues
    loop For each new incident
        GHA->>GHIssues: Create recovery issue
        GHIssues-->>GHA: Issue created with URL
    end
```

```mermaid
sequenceDiagram
    participant Ansible as Ansible Provisioning
    participant Host as Target Host
    participant Sudoers as /etc/sudoers.d
    Ansible->>Host: Create dashmon user account
    Ansible->>Host: Configure SSH directory<br/>with forced-command
    Ansible->>Host: Deploy dashmon-check.sh<br/>to /usr/local/bin/
    Ansible->>Sudoers: Install sudoers rules
    Sudoers-->>Sudoers: Validate with visudo
    Note over Host: SSH connection forces<br/>execution of dashmon-check.sh
    Note over Sudoers: dashmon can run dashmate<br/>or dash-cli without password
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🧹 Nitpick comments (10)
ansible/roles/status_dashboard/tasks/main.yml (1)
15-17: Consider adding a preflight key-exists check before copy.

This will fail earlier with a clearer error if `status_dashboard_ssh_private_key_path` is misconfigured on the control node.

Proposed refactor
```diff
+- name: Validate SSH monitoring key exists on control node
+  ansible.builtin.stat:
+    path: "{{ status_dashboard_ssh_private_key_path }}"
+  register: status_dashboard_ssh_key_stat
+  delegate_to: localhost
+  become: false
+
+- name: Fail if SSH monitoring key is missing
+  ansible.builtin.assert:
+    that:
+      - status_dashboard_ssh_key_stat.stat.exists
+      - status_dashboard_ssh_key_stat.stat.isreg
+    fail_msg: >-
+      Missing SSH key at {{ status_dashboard_ssh_private_key_path }}.
+      Set STATUS_DASHBOARD_SSH_KEY_PATH or override status_dashboard_ssh_private_key_path.
+
 - name: Copy SSH monitoring key for status dashboard
   ansible.builtin.copy:
     src: "{{ status_dashboard_ssh_private_key_path }}"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ansible/roles/status_dashboard/tasks/main.yml` around lines 15 - 17, Add a preflight check to ensure the control-node file referenced by status_dashboard_ssh_private_key_path exists before running the ansible.builtin.copy task; use ansible.builtin.stat to check the path (register a variable like ssh_key_stat) and then fail fast with ansible.builtin.fail if ssh_key_stat.stat.exists is false, keeping the existing copy task (named "Copy SSH monitoring key for status dashboard") conditional on the check so the playbook errors with a clear message when the key is misconfigured.

lib/testnetStatus/pollTestnetStatus.js (3)
112-126: Consider adding a fetch timeout.

The `fetch` call has no timeout configured. If the status API is unresponsive, the workflow could hang until GitHub's job timeout (10 minutes). Adding an AbortController timeout provides faster failure detection:

♻️ Suggested improvement
```diff
-async function fetchStatusNodes(statusApiUrl = DEFAULT_STATUS_API_URL, fetchImpl = fetch) {
-  const response = await fetchImpl(statusApiUrl);
+async function fetchStatusNodes(
+  statusApiUrl = DEFAULT_STATUS_API_URL,
+  fetchImpl = fetch,
+  timeoutMs = 30000,
+) {
+  const controller = new AbortController();
+  const timeoutId = setTimeout(() => controller.abort(), timeoutMs);
+
+  try {
+    const response = await fetchImpl(statusApiUrl, { signal: controller.signal });
+    // ... rest of function
+  } finally {
+    clearTimeout(timeoutId);
+  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/testnetStatus/pollTestnetStatus.js` around lines 112 - 126, fetchStatusNodes currently calls fetchImpl(statusApiUrl) with no timeout which can hang; modify fetchStatusNodes to create an AbortController, start a timer (e.g., configurable or fixed like 10s) that calls controller.abort(), pass controller.signal to fetchImpl, clear the timer after response, and throw a clear timeout error when aborted; update the function signature to accept or use the AbortController and ensure any fetch errors due to abort are handled and rethrown with a descriptive message.
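As a standalone sketch of the suggested pattern (the wrapper name, defaults, and error message below are illustrative assumptions, not code from the PR):

```javascript
// Hypothetical timeout wrapper around an injectable fetch implementation.
// Names (fetchWithTimeout, timeoutMs) are illustrative, not from the PR.
async function fetchWithTimeout(url, fetchImpl = fetch, timeoutMs = 30000) {
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // The pending request is cancelled once the controller aborts.
    return await fetchImpl(url, { signal: controller.signal });
  } catch (error) {
    if (error.name === 'AbortError') {
      throw new Error(`Status API did not respond within ${timeoutMs}ms: ${url}`);
    }
    throw error;
  } finally {
    clearTimeout(timeoutId);
  }
}
```

Injecting `fetchImpl` keeps the wrapper testable with a stub, matching the module's existing dependency-injection style.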
128-146: The 500-issue limit could cause missed duplicates.

If `dashpay/infra` accumulates more than 500 open issues, some recovery issues might be missed in the deduplication check, potentially creating duplicates. Consider adding `--search "in:title [Testnet Recovery]"` to filter to only recovery issues, or paginating if needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/testnetStatus/pollTestnetStatus.js` around lines 128 - 146, listOpenRecoveryIssues currently fetches up to 500 open issues which can miss recovery issues and cause duplicate creation; update it to either (A) filter server-side by adding the search flag to the gh command (e.g. include '--search', 'in:title [Testnet Recovery]') so only recovery issues are returned, or (B) implement pagination by repeatedly calling execFileImpl('gh', [..., '--limit', '100', '--page', String(page)]) and aggregating parsed JSON until an empty page is returned; modify the call sites in listOpenRecoveryIssues (and use the DEFAULT_REPOSITORY and execFileImpl parameters) to return the full filtered/aggregated array.
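One way to sketch the pagination option without committing to specific `gh` flags (the `fetchPage` callback below is an assumed abstraction over however one page of open issues is actually retrieved):

```javascript
// Hypothetical paginated listing. fetchPage(page, pageSize) stands in for
// whatever command actually retrieves one page of open issues.
async function listAllOpenRecoveryIssues(fetchPage, pageSize = 100) {
  const issues = [];
  for (let page = 1; ; page += 1) {
    const batch = await fetchPage(page, pageSize);
    issues.push(...batch);
    if (batch.length < pageSize) break; // last (possibly empty) page reached
  }
  // Keep only recovery issues so unrelated open issues never mask duplicates.
  return issues.filter((issue) => issue.title.startsWith('[Testnet Recovery]'));
}
```

Filtering after aggregation means the dedup check sees every open recovery issue even when the repository holds far more than 500 open issues in total.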
198-200: Minor efficiency improvement possible for skipped incidents calculation.

Line 200 uses `incidentsToCreate.includes(incident)`, which is O(n) per incident. With a small number of incidents this is fine, but for consistency with the Set usage elsewhere:

♻️ Suggested improvement
```diff
   const openIssues = await listOpenRecoveryIssues(repository, execFileImpl);
   const incidentsToCreate = filterIncidentsWithoutOpenIssues(incidents, openIssues);
-  const skippedIncidents = incidents.filter((incident) => !incidentsToCreate.includes(incident));
+  const incidentsToCreateSet = new Set(incidentsToCreate);
+  const skippedIncidents = incidents.filter((incident) => !incidentsToCreateSet.has(incident));
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/testnetStatus/pollTestnetStatus.js` around lines 198 - 200, The skippedIncidents calculation is currently using incidentsToCreate.includes(incident) which is O(n) per check; create a Set from incidentsToCreate (e.g., a new Set(incidentsToCreate)) and then compute skippedIncidents by filtering incidents with Set.has to get O(1) membership checks; update the logic near where listOpenRecoveryIssues, filterIncidentsWithoutOpenIssues, incidentsToCreate, and skippedIncidents are defined to use the Set for membership testing.

flake.nix (2)
10-11: Single architecture limits portability.

The flake only targets `x86_64-linux`. If any team members use ARM-based systems (`aarch64-linux`) or macOS, they won't be able to use this environment. Consider using `flake-utils` or `nixpkgs.lib.genAttrs` to support multiple systems if needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flake.nix` around lines 10 - 11, The flake currently hardcodes system = "x86_64-linux" and pkgs = nixpkgs.legacyPackages.${system}, which restricts portability; change the flake to support multiple systems by switching to a multi-system layout (e.g., use flake-utils' eachDefaultSystem or a systems list and map over it) so that pkgs is selected per-system rather than a fixed "system" string; update references to system and pkgs in the flake outputs to use the per-system keys (e.g., using inputs.flake-utils and the eachSystem lambda) so aarch64-linux and darwin can be included.
73-82: Hardcoded working directory may break for different developer setups.

The `runScript` assumes the repo is located at `$HOME/code/dash-network-deploy`. Developers who clone elsewhere will encounter an immediate exit. Consider using the flake's source directory or making the path configurable:

♻️ Suggested improvement
```diff
 runScript = pkgs.writeShellScript "dash-network-deploy-run" ''
-  cd "$HOME/code/dash-network-deploy" || exit 1
+  REPO_DIR="''${DASH_NETWORK_DEPLOY_DIR:-$HOME/code/dash-network-deploy}"
+  cd "$REPO_DIR" || { echo "Cannot cd to $REPO_DIR. Set DASH_NETWORK_DEPLOY_DIR if needed."; exit 1; }
   if [ $# -eq 0 ]; then
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flake.nix` around lines 73 - 82, The runScript currently hardcodes cd "$HOME/code/dash-network-deploy" which will fail for developers who store the repo elsewhere; update the pkgs.writeShellScript "dash-network-deploy-run" script so it derives the repo path dynamically or from a configurable env var (e.g., use a DEPLOY_DIR env fallback to the flake source or current working directory). Concretely, change the cd logic inside runScript to first check a configurable variable (DEPLOY_DIR) and fall back to the flake's source directory or $PWD, and document that variable so callers can override it; adjust the script that creates the shell (the writeShellScript block named "dash-network-deploy-run") accordingly.

test/pollTestnetStatus.spec.js (1)
74-122: Consider adding edge case tests.

The `findRecoveryIncidents` test covers the happy path well. For additional robustness, consider adding tests for:

- Empty `expectedNodeNames` array
- Empty `statusNodes` array
- Nodes with partial/missing `status` objects

These edge cases would help ensure the code handles unexpected API responses gracefully.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/pollTestnetStatus.spec.js` around lines 74 - 122, Add additional unit tests for findRecoveryIncidents to cover the suggested edge cases: create one test where expectedNodeNames is an empty array and assert it returns an empty incidents array; one test where statusNodes is an empty array and assert every expected node is reported as "missing from status API"; and one test where some nodes have missing or partial status objects (e.g., null status, missing health or status fields) and assert the observedState reflects the available information or reports "missing from status API" accordingly. Use the existing test structure and expectations style (expect(...).to.deep.equal(...)) and reference the findRecoveryIncidents function to locate where to add these tests.

bin/poll-testnet-status.js (2)
1-5: Missing shebang for direct script execution.

The script lacks a shebang line (`#!/usr/bin/env node`), which is needed if someone tries to run it directly (e.g., `./bin/poll-testnet-status.js`). While the workflow invokes it via `node bin/poll-testnet-status.js`, adding the shebang improves usability.

♻️ Suggested fix
```diff
+#!/usr/bin/env node
 /* eslint-disable no-console */
 const {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/poll-testnet-status.js` around lines 1 - 5, Add a proper shebang line as the very first line of bin/poll-testnet-status.js (#!/usr/bin/env node) so the script can be executed directly, keep the existing require and pollTestnetStatus import intact, and ensure the file is marked executable (chmod +x) in the repo or CI so running ./bin/poll-testnet-status.js works as intended.
29-32: Consider logging the full error for debugging.

Currently only `error.message` is logged, which loses the stack trace. For scheduled CI jobs, having the full error context aids debugging.

♻️ Suggested improvement
```diff
 main().catch((error) => {
-  console.error(error.message);
+  console.error(error);
   process.exit(1);
 });
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/poll-testnet-status.js` around lines 29 - 32, The current catch handler only logs error.message which drops the stack; update the catch block for main() to log the full error (e.g., console.error(error) or console.error(error.stack || error)) so the stack trace and full context are preserved before calling process.exit(1).
8-10: `issues: write` permission is unnecessary here.

The workflow creates issues in `dashpay/infra` (a different repository) using the PAT stored in `INFRA_ISSUES_TOKEN`. The `issues: write` permission grants access only to the current repository's issues, which this workflow doesn't use. You can safely remove it:

♻️ Suggested fix
```diff
 permissions:
   contents: read
-  issues: write
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/poll-testnet-status.yml around lines 8 - 10, Remove the unnecessary "issues: write" permission from the workflow's permissions block so it only specifies "contents: read"; edit the permissions section (the permissions: block containing "contents: read" and "issues: write") and delete the "issues: write" line, leaving the rest unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4ef96c25-4f65-43ec-90e7-52efe7c002b1
⛔ Files ignored due to path filters (2)
- `ansible/roles/status_monitoring/files/dashmon-testnet.pub` is excluded by `!**/*.pub`
- `flake.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (12)
- `.github/workflows/poll-testnet-status.yml`
- `ansible/roles/status_dashboard/defaults/main.yml`
- `ansible/roles/status_dashboard/tasks/main.yml`
- `ansible/roles/status_dashboard/templates/docker-compose.yml.j2`
- `ansible/roles/status_monitoring/defaults/main.yml`
- `ansible/roles/status_monitoring/files/dashmon-check.sh`
- `ansible/roles/status_monitoring/files/dashmon-sudoers`
- `ansible/roles/status_monitoring/tasks/main.yml`
- `bin/poll-testnet-status.js`
- `flake.nix`
- `lib/testnetStatus/pollTestnetStatus.js`
- `test/pollTestnetStatus.spec.js`
```
dashmon ALL=(dashmate) NOPASSWD: /usr/bin/dashmate status

# Regular masternodes: dash-cli commands as ubuntu user
dashmon ALL=(ubuntu) NOPASSWD: /usr/local/bin/dash-cli getblockchaininfo
dashmon ALL=(ubuntu) NOPASSWD: /usr/local/bin/dash-cli masternode status
```
Hardcoded sudoers username breaks status_monitoring_user overrides.
Lines 6, 9, and 10 pin permissions to `dashmon`, but provisioning uses `{{ status_monitoring_user }}`. If that variable is overridden, the actual monitoring account won't have the required sudo access.
🔧 Proposed fix (template sudoers with role variable)

```diff
-# ansible/roles/status_monitoring/files/dashmon-sudoers
+# ansible/roles/status_monitoring/templates/dashmon-sudoers.j2
-dashmon ALL=(dashmate) NOPASSWD: /usr/bin/dashmate status
+{{ status_monitoring_user }} ALL=(dashmate) NOPASSWD: /usr/bin/dashmate status
-dashmon ALL=(ubuntu) NOPASSWD: /usr/local/bin/dash-cli getblockchaininfo
-dashmon ALL=(ubuntu) NOPASSWD: /usr/local/bin/dash-cli masternode status
+{{ status_monitoring_user }} ALL=(ubuntu) NOPASSWD: /usr/local/bin/dash-cli getblockchaininfo
+{{ status_monitoring_user }} ALL=(ubuntu) NOPASSWD: /usr/local/bin/dash-cli masternode status
```

```diff
 # ansible/roles/status_monitoring/tasks/main.yml
 - name: Install dashmon sudoers rules
-  ansible.builtin.copy:
-    src: dashmon-sudoers
+  ansible.builtin.template:
+    src: dashmon-sudoers.j2
     dest: /etc/sudoers.d/dashmon
     mode: "0440"
     owner: root
     group: root
     validate: /usr/sbin/visudo -cf %s
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ansible/roles/status_monitoring/files/dashmon-sudoers` around lines 6 - 10,
Replace hardcoded "dashmon" entries in the sudoers template with the Ansible
variable status_monitoring_user so overrides work; update the three rules that
mention "dashmon ALL=(dashmate) NOPASSWD: /usr/bin/dashmate status" and the two
"dashmon ALL=(ubuntu) NOPASSWD: /usr/local/bin/dash-cli getblockchaininfo" and
"dashmon ALL=(ubuntu) NOPASSWD: /usr/local/bin/dash-cli masternode status" to
use the templated user (status_monitoring_user) instead, keeping the target
run-as users (dashmate, ubuntu) and command paths unchanged.
Issue being fixed or feature implemented
Implements dashpay/infra#28 by adding automation to poll the testnet status API and open deduped recovery issues for affected testnet masternodes and HP masternodes.
This also folds in the status monitoring access work from PR #741 so the status dashboard and monitor path are aligned with the recovery-issue automation.
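The dedup behavior described here ("deduped recovery issues") can be sketched as a pure title comparison. The function name `filterIncidentsWithoutOpenIssues` appears in the review comments, but the body and the `nodeName`/`title` field names below are assumptions, not the module's verified implementation:

```javascript
// Hypothetical sketch: skip incidents whose recovery issue is already open,
// matching on the "[Testnet Recovery] <node-name>" title convention.
function filterIncidentsWithoutOpenIssues(incidents, openIssues) {
  const openTitles = new Set(openIssues.map((issue) => issue.title));
  return incidents.filter(
    (incident) => !openTitles.has(`[Testnet Recovery] ${incident.nodeName}`),
  );
}
```

Because matching is by exact title, reruns of the scheduled workflow leave existing open issues untouched and only file issues for newly unhealthy nodes.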
What was done?
- Polls https://status.testnet.networks.dash.org/api/nodes for current node status
- Reads expected `masternode-*` and `hp-masternode-*` hosts from `networks/testnet.inventory`
- Detects `POSE_BANNED` and non-`up` service states
- Opens deduped `[Testnet Recovery] <node-name>` issues in `dashpay/infra`
- Verified against `hp-masternode-1` to confirm the `dashmon` access path works end-to-end

How Has This Been Tested?
ran unit tests locally:
ran lint locally:
ran a dry poll locally against the real status API and `dashpay/infra` target

fetched and merged PR "ansible: codify dashmon status monitoring access" #741 locally, then ran a targeted provisioning deploy for monitor access on `hp-masternode-1`:

`./bin/deploy -p --tags=status_dashboard -a="--limit hp-masternode-1" testnet`

verified on `hp-masternode-1` that:

- the `dashmon` user exists
- the `dashmon` sudoers rules are present
- `sudo -u dashmon /usr/local/bin/dashmon-check` returns healthy node, Tenderdash, and system metrics output

Breaking Changes
None.
Checklist:
Summary by CodeRabbit
New Features
Tests
Chores