agent: add liveness probe component #11577

Merged
kylewanginchina merged 1 commit into main from codex-liveness-probe-20260402 on Apr 10, 2026

Conversation

rvql (Contributor) commented Apr 2, 2026

This PR is for:

  • Agent

rvql (Contributor, Author) commented Apr 2, 2026

@claude review code

claude bot (Contributor) commented Apr 2, 2026

Claude finished @rvql's task in 13m 45s.


Todo

  • Post initial review checklist
  • Inspect the PR diff and relevant files
  • Analyze the implementation for correctness and edge cases
  • Post review feedback in the PR comment

Review

I found 2 correctness issues that should be fixed before merge.

  1. Guard liveness can false-fail on valid configs
    @agent/src/trident.rs:146 registers the guard with a fixed 30_000ms timeout, but the guard thread only heartbeats once per loop in @agent/src/utils/guard.rs:562 and then waits config.guard_interval in @agent/src/utils/guard.rs:787. That interval is populated from global.tunning.resource_monitoring_interval in @agent/src/config/handler.rs:2023, and I couldn't find validation that keeps it under 30s. If an operator sets a larger monitoring interval, /livez will report the agent unhealthy even though the guard thread is behaving correctly.

  2. Synchronizer liveness has the same false-fail problem
    @agent/src/trident.rs:147 hard-codes a 90_000ms timeout for the synchronizer components, but both synchronizer loops sleep for the full sync interval between heartbeats: @agent/src/rpc/synchronizer.rs:1823-1875 and @agent/src/rpc/synchronizer.rs:1898-2023. That interval comes from global.communication.proactive_request_interval, which @agent/src/config/config.rs:3288-3293 explicitly allows up to 1 hour. Any configured interval above 90s will make healthy agents fail /livez.
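One possible way to address both findings, sketched under assumptions: derive each component's liveness timeout from its configured loop interval instead of hard-coding it. The helper name `liveness_timeout`, the 2x multiplier, and the 10-second grace period are illustrative only and do not come from this PR.

```rust
use std::time::Duration;

/// Hypothetical helper (not part of the PR): compute a liveness timeout
/// from the interval a component sleeps between heartbeats.
fn liveness_timeout(interval: Duration, floor: Duration) -> Duration {
    // ASSUMPTION: allow two full intervals plus a fixed grace period, so
    // one slow iteration does not make a healthy component look dead.
    let grace = Duration::from_secs(10);
    let derived = interval.saturating_mul(2).saturating_add(grace);
    // Never go below the floor (for example, the previously fixed value).
    derived.max(floor)
}
```

With a 1-hour sync interval and a 90-second floor this yields 7210 seconds, so the synchronizer would no longer fail /livez merely because its configured interval exceeds the fixed timeout.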

Summary

Static review only. I did not make code changes in this pass.

rvql force-pushed the codex-liveness-probe-20260402 branch from 7909dbc to 8594b0c on April 9, 2026 08:23

rvql (Contributor, Author) commented Apr 9, 2026

@claude review this code

rvql requested a review from kylewanginchina on April 9, 2026 08:55

claude bot (Contributor) commented Apr 9, 2026

Claude finished @rvql's task in 4m 37s.


Reviewing this PR.

Todo

  • Post initial review checklist
  • Inspect the PR diff and relevant files
  • Analyze the implementation for correctness and edge cases
  • Post review feedback in the PR comment

rvql (Contributor, Author) commented Apr 9, 2026

Example output:

[root@earth-108 ~]# curl 10.50.1.108:39090/livez | jq .
{
  "status": "ok",
  "version": {
    "name": "deepflow-agent-ee",
    "branch": "feature-liveness-probe",
    "commit_id": "578eeb33097716ebd4e100e168c80d476c38612a",
    "rev_count": "6015",
    "compiler": "rustc 1.83.0 (90b35a623 2024-11-26)",
    "compile_time": "2026-04-09 08:50:08",
    "revision": "feature-liveness-probe 6015-578eeb33097716ebd4e100e168c80d476c38612a"
  },
  "uptime_ms": 95200,
  "failed_components": [],
  "components": [
    {
      "module": "dispatcher",
      "id": 0,
      "display_name": "dispatcher local",
      "running": true,
      "required": true,
      "timeout_ms": 60000,
      "last_heartbeat_ago_ms": 855,
      "last_event": "Heartbeat",
      "last_location": {
        "file": "/usr/local/cargo/git/checkouts/deepflow-0e37f216e1982654/8594b0c/agent/src/dispatcher/local_mode_dispatcher.rs",
        "line": 272
      }
    },
    {
      "module": "guard",
      "id": 0,
      "display_name": "guard",
      "running": true,
      "required": true,
      "timeout_ms": 20000,
      "last_heartbeat_ago_ms": 4843,
      "last_event": "Heartbeat",
      "last_location": {
        "file": "/usr/local/cargo/git/checkouts/deepflow-0e37f216e1982654/8594b0c/agent/src/utils/guard.rs",
        "line": 824
      }
    },
    {
      "module": "main-loop",
      "id": 0,
      "display_name": "main loop",
      "running": true,
      "required": true,
      "timeout_ms": 60000,
      "last_heartbeat_ago_ms": 6524,
      "last_event": "Heartbeat",
      "last_location": {
        "file": "/usr/local/cargo/git/checkouts/deepflow-0e37f216e1982654/8594b0c/agent/src/trident.rs",
        "line": 963
      }
    },
    {
      "module": "synchronizer",
      "id": 0,
      "display_name": "synchronizer sync",
      "running": true,
      "required": false,
      "timeout_ms": 90000,
      "last_heartbeat_ago_ms": 35000,
      "last_event": "Heartbeat",
      "last_location": {
        "file": "/usr/local/cargo/git/checkouts/deepflow-0e37f216e1982654/8594b0c/agent/src/rpc/synchronizer.rs",
        "line": 1989
      }
    },
    {
      "module": "synchronizer",
      "id": 1,
      "display_name": "synchronizer triggered",
      "running": true,
      "required": false,
      "timeout_ms": 90000,
      "last_heartbeat_ago_ms": 17031,
      "last_event": "Heartbeat",
      "last_location": {
        "file": "/usr/local/cargo/git/checkouts/deepflow-0e37f216e1982654/8594b0c/agent/src/rpc/synchronizer.rs",
        "line": 1383
      }
    }
  ]
}
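As a reading aid for the fields above, here is a minimal consumer-side sketch of how a monitoring script might cross-check the reported values. The staleness rule and the treatment of `required` are assumptions inferred from the field names; the PR conversation does not state how the agent itself computes `status` or `failed_components`, and the function names are hypothetical.

```rust
/// Subset of the per-component fields shown in the /livez output.
struct Component {
    display_name: &'static str,
    running: bool,
    required: bool,
    timeout_ms: u64,
    last_heartbeat_ago_ms: u64,
}

/// ASSUMPTION: a component is stale when it is running and its last
/// heartbeat is older than its own timeout.
fn is_stale(c: &Component) -> bool {
    c.running && c.last_heartbeat_ago_ms > c.timeout_ms
}

/// ASSUMPTION: only stale components marked `required` flip the overall
/// status to "fail"; optional ones are reported but tolerated.
fn overall_status(components: &[Component]) -> &'static str {
    if components.iter().any(|c| c.required && is_stale(c)) {
        "fail"
    } else {
        "ok"
    }
}

/// Names of every stale component, required or not.
fn stale_names(components: &[Component]) -> Vec<&'static str> {
    components
        .iter()
        .filter(|c| is_stale(c))
        .map(|c| c.display_name)
        .collect()
}
```

Applied to the sample above, every `last_heartbeat_ago_ms` is below its `timeout_ms`, so this check agrees with the reported "status": "ok" and empty "failed_components".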

rvql (Contributor, Author) commented Apr 9, 2026

On failure:

[root@earth-108 ~]# curl -v 10.50.1.108:39090/livez 
* About to connect() to 10.50.1.108 port 39090 (#0)
*   Trying 10.50.1.108...
* Connected to 10.50.1.108 (10.50.1.108) port 39090 (#0)
> GET /livez HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.50.1.108:39090
> Accept: */*
> 
< HTTP/1.1 500 Internal Server Error
< content-type: application/json
< content-length: 2213
< date: Thu, 09 Apr 2026 09:47:47 GMT
< 
{"status":"fail",

rvql (Contributor, Author) commented Apr 9, 2026

@claude review code

claude bot (Contributor) commented Apr 9, 2026

Claude finished @rvql's task in 2m 14s.


Reviewing this PR.

Todo

  • Update the PR comment with a review checklist
  • Inspect the PR diff and relevant Rust files
  • Analyze the liveness implementation for correctness and edge cases
  • Post review feedback in the PR comment

yuanchaoa (Contributor) left a comment

x

rvql (Contributor, Author) commented Apr 10, 2026

@claude review code

claude bot (Contributor) commented Apr 10, 2026

Reviewing this PR.

Todo

  • Update the PR comment with a review checklist
  • Inspect the PR diff and relevant Rust files
  • Analyze the liveness implementation for correctness and edge cases
  • Post review feedback in the PR comment

kylewanginchina force-pushed the codex-liveness-probe-20260402 branch from 6c22880 to 2d024e5 on April 10, 2026 07:25
kylewanginchina enabled auto-merge (rebase) on April 10, 2026 07:25
kylewanginchina merged commit dc3f931 into main on Apr 10, 2026
8 checks passed
kylewanginchina deleted the codex-liveness-probe-20260402 branch on April 10, 2026 07:34

3 participants