
feat: poll testnet status and open infra recovery issues #742

Merged
vivekgsharma merged 4 commits into v1.0-dev from vivek/poll-testnet-status on Apr 15, 2026

Conversation

@vivekgsharma vivekgsharma commented Apr 15, 2026

Issue being fixed or feature implemented

Implements dashpay/infra#28 by adding automation to poll the testnet status API and open deduped recovery issues for affected testnet masternodes and HP masternodes.

This also folds in the status monitoring access work from PR #741 so the status dashboard and monitor path are aligned with the recovery-issue automation.

What was done?

  • added a scheduled GitHub Actions workflow to poll testnet node status every 2 hours and on manual dispatch
  • added a Node.js poller that:
    • fetches https://status.testnet.networks.dash.org/api/nodes
    • reads expected masternode-* and hp-masternode-* hosts from networks/testnet.inventory
    • detects nodes that are:
      • missing from the API
      • unhealthy
      • POSE_BANNED
      • reporting non-up service states
    • dedupes incidents against open issues using deterministic titles of the form [Testnet Recovery] <node-name> (see the sketch after this list)
    • creates new issues in dashpay/infra
    • assigns created issues to dashinfraclaw
  • added unit coverage for:
    • inventory parsing
    • incident detection
    • issue body/title generation
    • dedupe behavior
  • merged the status monitoring branch from PR #741 (ansible: codify dashmon status monitoring access) into this branch
  • verified and deployed the status monitoring role specifically to hp-masternode-1 to confirm the dashmon access path works end-to-end
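
A minimal sketch of the detection and dedupe flow described above (the helper names and node-status fields below are illustrative assumptions, not the exact exports of lib/testnetStatus/pollTestnetStatus.js):

    // Deterministic title so reruns can match already-open issues.
    const RECOVERY_TITLE_PREFIX = '[Testnet Recovery]';

    function buildIssueTitle(nodeName) {
      return `${RECOVERY_TITLE_PREFIX} ${nodeName}`;
    }

    // Compare the expected inventory against the status API response.
    // The name/health/poseBanned/serviceState fields are assumed shapes,
    // chosen to mirror the detection cases listed above.
    function findIncidents(expectedNodeNames, statusNodes) {
      const byName = new Map(statusNodes.map((node) => [node.name, node]));

      return expectedNodeNames.flatMap((name) => {
        const node = byName.get(name);
        if (!node) {
          return [{ nodeName: name, observedState: 'missing from status API' }];
        }
        if (node.poseBanned) {
          return [{ nodeName: name, observedState: 'POSE_BANNED' }];
        }
        if (node.health !== 'healthy' || node.serviceState !== 'up') {
          return [{ nodeName: name, observedState: `${node.health}/${node.serviceState}` }];
        }
        return [];
      });
    }

    // Dedupe: skip incidents whose deterministic title is already open.
    function filterNewIncidents(incidents, openIssues) {
      const openTitles = new Set(openIssues.map((issue) => issue.title));
      return incidents.filter((i) => !openTitles.has(buildIssueTitle(i.nodeName)));
    }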

How Has This Been Tested?

  • ran unit tests locally:

    npx mocha --no-config test/pollTestnetStatus.spec.js
  • ran lint locally:

    npx eslint bin/poll-testnet-status.js lib/testnetStatus/pollTestnetStatus.js test/pollTestnetStatus.spec.js
  • ran a dry poll locally against the real status API and dashpay/infra target:

    DRY_RUN=1 TESTNET_RECOVERY_ISSUE_REPOSITORY=dashpay/infra node bin/poll-testnet-status.js
  • fetched and merged PR #741 (ansible: codify dashmon status monitoring access) locally, then ran a targeted provisioning deploy for monitor access on hp-masternode-1:

    ./bin/deploy -p --tags=status_dashboard -a="--limit hp-masternode-1" testnet
  • verified on hp-masternode-1 that:

    • dashmon user exists
    • forced-command authorized key is installed
    • dashmon sudoers rules are present
    • sudo -u dashmon /usr/local/bin/dashmon-check returns healthy node, Tenderdash, and system metrics output

Breaking Changes

None.

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have made corresponding changes to the documentation

Summary by CodeRabbit

  • New Features

    • Automated testnet status monitoring that runs every 2 hours and automatically creates recovery issues for detected incidents on masternodes
  • Tests

    • Added test suite for testnet status polling functionality
  • Chores

    • Added infrastructure provisioning and development environment configuration

@vivekgsharma vivekgsharma requested a review from ktechmidas April 15, 2026 12:35

coderabbitai Bot commented Apr 15, 2026

📝 Walkthrough

This pull request introduces a complete testnet status monitoring and recovery automation system. It adds a scheduled GitHub Actions workflow, Ansible provisioning for monitoring infrastructure, a Node.js module that detects unhealthy testnet nodes and creates recovery issues, supporting scripts, and a Nix development environment.

Changes

• GitHub Actions Workflow (.github/workflows/poll-testnet-status.yml)
  New scheduled workflow (2-hour intervals) that runs the testnet status polling script with appropriate permissions and environment configuration.
• Ansible Status Dashboard Configuration (ansible/roles/status_dashboard/defaults/main.yml, ansible/roles/status_dashboard/tasks/main.yml, ansible/roles/status_dashboard/templates/docker-compose.yml.j2)
  Adds SSH key path and user variables; refactors the SSH key copying task; templatizes the SSH_USER environment variable in docker-compose for dynamic configuration.
• Ansible Status Monitoring Role (ansible/roles/status_monitoring/defaults/main.yml, ansible/roles/status_monitoring/tasks/main.yml, ansible/roles/status_monitoring/files/dashmon-check.sh, ansible/roles/status_monitoring/files/dashmon-sudoers)
  Establishes a dedicated monitoring user account, SSH forced-command configuration, and sudoers rules for passwordless command execution; extends the dashmon-check script with Tenderdash RPC probing for HP masternodes.
• Testnet Status Polling Implementation (lib/testnetStatus/pollTestnetStatus.js, bin/poll-testnet-status.js)
  New module that polls the testnet status API, reads the expected node inventory, detects recovery incidents (banned/down/unhealthy nodes), and creates deduplicated GitHub issues with a configurable assignee.
• Polling Tests (test/pollTestnetStatus.spec.js)
  Test suite validating inventory parsing, incident detection, observed-state formatting, issue title/body generation, and deduplication logic.
• Nix Development Environment (flake.nix)
  Defines an FHS-based development environment with Node.js 22, Ansible, Terraform, Docker, Python, AWS CLI, and required build libraries for project dependencies.

Sequence Diagram(s)

sequenceDiagram
    participant GHA as GitHub Actions
    participant Status as Testnet Status API
    participant Inventory as Node Inventory
    participant GHIssues as GitHub Issues API
    
    GHA->>Inventory: Read expected node names
    Inventory-->>GHA: Node list (masternodes, hp_masternodes)
    
    GHA->>Status: Poll status API
    Status-->>GHA: Current status for all nodes
    
    GHA->>GHA: Compare expected vs actual<br/>Detect recovery incidents
    
    GHA->>GHIssues: Query existing open issues
    GHIssues-->>GHA: List of open recovery issues
    
    GHA->>GHA: Filter incidents<br/>without existing issues
    
    loop For each new incident
        GHA->>GHIssues: Create recovery issue
        GHIssues-->>GHA: Issue created with URL
    end

sequenceDiagram
    participant Ansible as Ansible Provisioning
    participant Host as Target Host
    participant Sudoers as /etc/sudoers.d
    
    Ansible->>Host: Create dashmon user account
    Ansible->>Host: Configure SSH directory<br/>with forced-command
    Ansible->>Host: Deploy dashmon-check.sh<br/>to /usr/local/bin/
    Ansible->>Sudoers: Install sudoers rules
    Sudoers-->>Sudoers: Validate with visudo
    
    Note over Host: SSH connection forces<br/>execution of dashmon-check.sh
    Note over Sudoers: dashmon can run dashmate<br/>or dash-cli without password

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Hop, hop, through the testnet we go,
Checking nodes both high and low,
When masternodes stumble and fall,
Recovery issues spring forth to call!
With monitoring scripts and Ansible's care,
The network stays healthy everywhere! 🌟

🚥 Pre-merge checks: 2 passed, 1 warning

❌ Failed checks (1 warning)

• Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

• Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
• Title Check ✅ Passed: The title 'feat: poll testnet status and open infra recovery issues' accurately summarizes the main changes: adding automation to poll testnet status and create recovery issues.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (10)
ansible/roles/status_dashboard/tasks/main.yml (1)

15-17: Consider adding a preflight key-exists check before copy.

This will fail earlier with a clearer error if status_dashboard_ssh_private_key_path is misconfigured on the control node.

Proposed refactor
+- name: Validate SSH monitoring key exists on control node
+  ansible.builtin.stat:
+    path: "{{ status_dashboard_ssh_private_key_path }}"
+  register: status_dashboard_ssh_key_stat
+  delegate_to: localhost
+  become: false
+
+- name: Fail if SSH monitoring key is missing
+  ansible.builtin.assert:
+    that:
+      - status_dashboard_ssh_key_stat.stat.exists
+      - status_dashboard_ssh_key_stat.stat.isreg
+    fail_msg: >-
+      Missing SSH key at {{ status_dashboard_ssh_private_key_path }}.
+      Set STATUS_DASHBOARD_SSH_KEY_PATH or override status_dashboard_ssh_private_key_path.
+
 - name: Copy SSH monitoring key for status dashboard
   ansible.builtin.copy:
     src: "{{ status_dashboard_ssh_private_key_path }}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/status_dashboard/tasks/main.yml` around lines 15 - 17, Add a
preflight check to ensure the control-node file referenced by
status_dashboard_ssh_private_key_path exists before running the
ansible.builtin.copy task; use ansible.builtin.stat to check the path (register
a variable like ssh_key_stat) and then fail fast with ansible.builtin.fail if
ssh_key_stat.stat.exists is false, keeping the existing copy task (named "Copy
SSH monitoring key for status dashboard") conditional on the check so the
playbook errors with a clear message when the key is misconfigured.
lib/testnetStatus/pollTestnetStatus.js (3)

112-126: Consider adding a fetch timeout.

The fetch call has no timeout configured. If the status API is unresponsive, the workflow could hang until GitHub's job timeout (10 minutes). Adding an AbortController timeout provides faster failure detection:

♻️ Suggested improvement
-async function fetchStatusNodes(statusApiUrl = DEFAULT_STATUS_API_URL, fetchImpl = fetch) {
-  const response = await fetchImpl(statusApiUrl);
+async function fetchStatusNodes(
+  statusApiUrl = DEFAULT_STATUS_API_URL,
+  fetchImpl = fetch,
+  timeoutMs = 30000,
+) {
+  const controller = new AbortController();
+  const timeoutId = setTimeout(() => controller.abort(), timeoutMs);
+
+  try {
+    const response = await fetchImpl(statusApiUrl, { signal: controller.signal });
+    // ... rest of function
+  } finally {
+    clearTimeout(timeoutId);
+  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/testnetStatus/pollTestnetStatus.js` around lines 112 - 126,
fetchStatusNodes currently calls fetchImpl(statusApiUrl) with no timeout which
can hang; modify fetchStatusNodes to create an AbortController, start a timer
(e.g., configurable or fixed like 10s) that calls controller.abort(), pass
controller.signal to fetchImpl, clear the timer after response, and throw a
clear timeout error when aborted; update the function signature to accept or use
the AbortController and ensure any fetch errors due to abort are handled and
rethrown with a descriptive message.

128-146: The 500-issue limit could cause missed duplicates.

If dashpay/infra accumulates more than 500 open issues, some recovery issues might be missed in the deduplication check, potentially creating duplicates. Consider adding --search "in:title [Testnet Recovery]" to filter to only recovery issues, or paginating if needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/testnetStatus/pollTestnetStatus.js` around lines 128 - 146,
listOpenRecoveryIssues currently fetches up to 500 open issues which can miss
recovery issues and cause duplicate creation; update it to either (A) filter
server-side by adding the search flag to the gh command (e.g. include
'--search', 'in:title [Testnet Recovery]') so only recovery issues are returned,
or (B) implement pagination by repeatedly calling execFileImpl('gh', [...,
'--limit', '100', '--page', String(page)]) and aggregating parsed JSON until an
empty page is returned; modify the call sites in listOpenRecoveryIssues (and use
the DEFAULT_REPOSITORY and execFileImpl parameters) to return the full
filtered/aggregated array.
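
For option (A), a minimal sketch of the server-side filter, using only gh flags that exist today (--repo, --state, --search, --json, --limit); the function and parameter names mirror the prompt above rather than the module's verified exports:

const { execFile } = require('node:child_process');
const { promisify } = require('node:util');

const execFileAsync = promisify(execFile);

// Option (A): filter server-side so only [Testnet Recovery] issues count
// against the --limit cap, rather than every open issue in the repository.
async function listOpenRecoveryIssues(repository = 'dashpay/infra') {
  const { stdout } = await execFileAsync('gh', [
    'issue', 'list',
    '--repo', repository,
    '--state', 'open',
    '--search', 'in:title [Testnet Recovery]',
    '--json', 'title,url',
    '--limit', '500',
  ]);

  return JSON.parse(stdout);
}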

198-200: Minor efficiency improvement possible for skipped incidents calculation.

Line 200 uses incidentsToCreate.includes(incident) which is O(n) per incident. With a small number of incidents this is fine, but for consistency with the Set usage elsewhere:

♻️ Suggested improvement
   const openIssues = await listOpenRecoveryIssues(repository, execFileImpl);
   const incidentsToCreate = filterIncidentsWithoutOpenIssues(incidents, openIssues);
-  const skippedIncidents = incidents.filter((incident) => !incidentsToCreate.includes(incident));
+  const incidentsToCreateSet = new Set(incidentsToCreate);
+  const skippedIncidents = incidents.filter((incident) => !incidentsToCreateSet.has(incident));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/testnetStatus/pollTestnetStatus.js` around lines 198 - 200, The
skippedIncidents calculation is currently using
incidentsToCreate.includes(incident) which is O(n) per check; create a Set from
incidentsToCreate (e.g., a new Set(incidentsToCreate)) and then compute
skippedIncidents by filtering incidents with Set.has to get O(1) membership
checks; update the logic near where listOpenRecoveryIssues,
filterIncidentsWithoutOpenIssues, incidentsToCreate, and skippedIncidents are
defined to use the Set for membership testing.
flake.nix (2)

10-11: Single architecture limits portability.

The flake only targets x86_64-linux. If any team members use ARM-based systems (aarch64-linux) or macOS, they won't be able to use this environment. Consider using flake-utils or nixpkgs.lib.genAttrs to support multiple systems if needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flake.nix` around lines 10 - 11, The flake currently hardcodes system =
"x86_64-linux" and pkgs = nixpkgs.legacyPackages.${system}, which restricts
portability; change the flake to support multiple systems by switching to a
multi-system layout (e.g., use flake-utils' eachDefaultSystem or a systems list
and map over it) so that pkgs is selected per-system rather than a fixed
"system" string; update references to system and pkgs in the flake outputs to
use the per-system keys (e.g., using inputs.flake-utils and the eachSystem
lambda) so aarch64-linux and darwin can be included.

73-82: Hardcoded working directory may break for different developer setups.

The runScript assumes the repo is located at $HOME/code/dash-network-deploy. Developers who clone elsewhere will encounter an immediate exit.

Consider using the flake's source directory or making the path configurable:

♻️ Suggested improvement
       runScript = pkgs.writeShellScript "dash-network-deploy-run" ''
-          cd "$HOME/code/dash-network-deploy" || exit 1
+          REPO_DIR="''${DASH_NETWORK_DEPLOY_DIR:-$HOME/code/dash-network-deploy}"
+          cd "$REPO_DIR" || { echo "Cannot cd to $REPO_DIR. Set DASH_NETWORK_DEPLOY_DIR if needed."; exit 1; }
           if [ $# -eq 0 ]; then
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flake.nix` around lines 73 - 82, The runScript currently hardcodes cd
"$HOME/code/dash-network-deploy" which will fail for developers who store the
repo elsewhere; update the pkgs.writeShellScript "dash-network-deploy-run"
script so it derives the repo path dynamically or from a configurable env var
(e.g., use a DEPLOY_DIR env fallback to the flake source or current working
directory). Concretely, change the cd logic inside runScript to first check a
configurable variable (DEPLOY_DIR) and fall back to the flake's source directory
or $PWD, and document that variable so callers can override it; adjust the
script that creates the shell (the writeShellScript block named
"dash-network-deploy-run") accordingly.
test/pollTestnetStatus.spec.js (1)

74-122: Consider adding edge case tests.

The findRecoveryIncidents test covers the happy path well. For additional robustness, consider adding tests for:

  • Empty expectedNodeNames array
  • Empty statusNodes array
  • Nodes with partial/missing status objects

These edge cases would help ensure the code handles unexpected API responses gracefully.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/pollTestnetStatus.spec.js` around lines 74 - 122, Add additional unit
tests for findRecoveryIncidents to cover the suggested edge cases: create one
test where expectedNodeNames is an empty array and assert it returns an empty
incidents array; one test where statusNodes is an empty array and assert every
expected node is reported as "missing from status API"; and one test where some
nodes have missing or partial status objects (e.g., null status, missing health
or status fields) and assert the observedState reflects the available
information or reports "missing from status API" accordingly. Use the existing
test structure and expectations style (expect(...).to.deep.equal(...)) and
reference the findRecoveryIncidents function to locate where to add these tests.
bin/poll-testnet-status.js (2)

1-5: Missing shebang for direct script execution.

The script lacks a shebang line (#!/usr/bin/env node), which is needed if someone tries to run it directly (e.g., ./bin/poll-testnet-status.js). While the workflow invokes it via node bin/poll-testnet-status.js, adding the shebang improves usability.

♻️ Suggested fix
+#!/usr/bin/env node
 /* eslint-disable no-console */

 const {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/poll-testnet-status.js` around lines 1 - 5, Add a proper shebang line as
the very first line of bin/poll-testnet-status.js (#!/usr/bin/env node) so the
script can be executed directly, keep the existing require and pollTestnetStatus
import intact, and ensure the file is marked executable (chmod +x) in the repo
or CI so running ./bin/poll-testnet-status.js works as intended.

29-32: Consider logging the full error for debugging.

Currently only error.message is logged, which loses the stack trace. For scheduled CI jobs, having the full error context aids debugging.

♻️ Suggested improvement
 main().catch((error) => {
-  console.error(error.message);
+  console.error(error);
   process.exit(1);
 });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/poll-testnet-status.js` around lines 29 - 32, The current catch handler
only logs error.message which drops the stack; update the catch block for main()
to log the full error (e.g., console.error(error) or console.error(error.stack
|| error)) so the stack trace and full context are preserved before calling
process.exit(1).
.github/workflows/poll-testnet-status.yml (1)

8-10: issues: write permission is unnecessary here.

The workflow creates issues in dashpay/infra (a different repository) using the PAT stored in INFRA_ISSUES_TOKEN. The issues: write permission grants access only to the current repository's issues, which this workflow doesn't use. You can safely remove it:

♻️ Suggested fix
 permissions:
   contents: read
-  issues: write
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/poll-testnet-status.yml around lines 8 - 10, Remove the
unnecessary "issues: write" permission from the workflow's permissions block so
it only specifies "contents: read"; edit the permissions section (the
permissions: block containing "contents: read" and "issues: write") and delete
the "issues: write" line, leaving the rest unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4ef96c25-4f65-43ec-90e7-52efe7c002b1

📥 Commits

Reviewing files that changed from the base of the PR and between 555e290 and 4bdd09a.

⛔ Files ignored due to path filters (2)
  • ansible/roles/status_monitoring/files/dashmon-testnet.pub is excluded by !**/*.pub
  • flake.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
  • .github/workflows/poll-testnet-status.yml
  • ansible/roles/status_dashboard/defaults/main.yml
  • ansible/roles/status_dashboard/tasks/main.yml
  • ansible/roles/status_dashboard/templates/docker-compose.yml.j2
  • ansible/roles/status_monitoring/defaults/main.yml
  • ansible/roles/status_monitoring/files/dashmon-check.sh
  • ansible/roles/status_monitoring/files/dashmon-sudoers
  • ansible/roles/status_monitoring/tasks/main.yml
  • bin/poll-testnet-status.js
  • flake.nix
  • lib/testnetStatus/pollTestnetStatus.js
  • test/pollTestnetStatus.spec.js

Comment on lines +6 to +10
dashmon ALL=(dashmate) NOPASSWD: /usr/bin/dashmate status

# Regular masternodes: dash-cli commands as ubuntu user
dashmon ALL=(ubuntu) NOPASSWD: /usr/local/bin/dash-cli getblockchaininfo
dashmon ALL=(ubuntu) NOPASSWD: /usr/local/bin/dash-cli masternode status

⚠️ Potential issue | 🟠 Major

Hardcoded sudoers username breaks status_monitoring_user overrides.

Line 6, Line 9, and Line 10 pin permissions to dashmon, but provisioning uses {{ status_monitoring_user }}. If that variable is overridden, the actual monitoring account won’t have required sudo access.

🔧 Proposed fix (template sudoers with role variable)
-# ansible/roles/status_monitoring/files/dashmon-sudoers
+## ansible/roles/status_monitoring/templates/dashmon-sudoers.j2
- dashmon ALL=(dashmate) NOPASSWD: /usr/bin/dashmate status
+ {{ status_monitoring_user }} ALL=(dashmate) NOPASSWD: /usr/bin/dashmate status

- dashmon ALL=(ubuntu) NOPASSWD: /usr/local/bin/dash-cli getblockchaininfo
- dashmon ALL=(ubuntu) NOPASSWD: /usr/local/bin/dash-cli masternode status
+ {{ status_monitoring_user }} ALL=(ubuntu) NOPASSWD: /usr/local/bin/dash-cli getblockchaininfo
+ {{ status_monitoring_user }} ALL=(ubuntu) NOPASSWD: /usr/local/bin/dash-cli masternode status
-# ansible/roles/status_monitoring/tasks/main.yml
-- name: Install dashmon sudoers rules
-  ansible.builtin.copy:
-    src: dashmon-sudoers
+ - name: Install dashmon sudoers rules
+  ansible.builtin.template:
+    src: dashmon-sudoers.j2
     dest: /etc/sudoers.d/dashmon
     mode: "0440"
     owner: root
     group: root
     validate: /usr/sbin/visudo -cf %s
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/roles/status_monitoring/files/dashmon-sudoers` around lines 6 - 10,
replace the hardcoded "dashmon" entries in the sudoers template with the Ansible
variable status_monitoring_user so overrides work; update the three rules
"dashmon ALL=(dashmate) NOPASSWD: /usr/bin/dashmate status",
"dashmon ALL=(ubuntu) NOPASSWD: /usr/local/bin/dash-cli getblockchaininfo", and
"dashmon ALL=(ubuntu) NOPASSWD: /usr/local/bin/dash-cli masternode status" to
use the templated user (status_monitoring_user) instead, keeping the target
run-as users (dashmate, ubuntu) and command paths unchanged.

@vivekgsharma vivekgsharma merged commit b5d57e2 into v1.0-dev Apr 15, 2026
1 check passed