fix: add backend startup heartbeat liveness probe#114
zouyonghe merged 13 commits into AstrBotDevs:main from
Conversation
Hey - I've found 2 issues
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="src-tauri/src/backend/readiness.rs" line_range="206-214" />
<code_context>
+    Some(UNIX_EPOCH + Duration::from_millis(heartbeat.updated_at_ms))
+}
+
+fn startup_heartbeat_timestamp_is_fresh(
+    updated_at: Option<SystemTime>,
+    now: SystemTime,
+    max_age: Duration,
+) -> bool {
+    updated_at
+        .map(|updated_at| now.duration_since(updated_at).unwrap_or(Duration::ZERO))
+        .is_some_and(|age| age <= max_age)
+}
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Avoid treating future heartbeat timestamps as implicitly fresh.
Here, a future heartbeat timestamp (due to clock skew or a malformed file) makes `duration_since` return an error, which you coerce to `Duration::ZERO` and therefore treat as freshest. That inverts the intended semantics. Consider treating the `Err` case as stale/invalid (e.g., returning `false` when `duration_since` fails or skipping the heartbeat) instead of mapping it to zero.
```suggestion
fn startup_heartbeat_timestamp_is_fresh(
    updated_at: Option<SystemTime>,
    now: SystemTime,
    max_age: Duration,
) -> bool {
    updated_at
        .and_then(|updated_at| now.duration_since(updated_at).ok())
        .is_some_and(|age| age <= max_age)
}
```
</issue_to_address>
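The fixed predicate can be exercised standalone. This sketch copies the function from the suggestion verbatim and checks that `.ok()` makes a future timestamp (clock skew or a malformed file) read as stale rather than maximally fresh:

```rust
use std::time::{Duration, SystemTime};

// Suggested predicate: a heartbeat is fresh only when its timestamp is in the
// past (duration_since succeeds) and the resulting age is within max_age.
fn startup_heartbeat_timestamp_is_fresh(
    updated_at: Option<SystemTime>,
    now: SystemTime,
    max_age: Duration,
) -> bool {
    updated_at
        .and_then(|updated_at| now.duration_since(updated_at).ok())
        .is_some_and(|age| age <= max_age)
}

fn main() {
    let now = SystemTime::now();
    let max_age = Duration::from_secs(5);
    // Recent past heartbeat: fresh.
    assert!(startup_heartbeat_timestamp_is_fresh(Some(now - Duration::from_secs(1)), now, max_age));
    // Future heartbeat (clock skew / malformed file): duration_since errs, so stale.
    assert!(!startup_heartbeat_timestamp_is_fresh(Some(now + Duration::from_secs(60)), now, max_age));
    // Old heartbeat: stale.
    assert!(!startup_heartbeat_timestamp_is_fresh(Some(now - Duration::from_secs(10)), now, max_age));
    // Missing heartbeat: stale.
    assert!(!startup_heartbeat_timestamp_is_fresh(None, now, max_age));
    println!("ok");
}
```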
### Comment 2
<location path="src-tauri/src/backend/readiness.rs" line_range="85-92" />
<code_context>
+        };
+
+        if let Some(heartbeat_path) = readiness.startup_heartbeat_path.as_deref() {
+            if let Some(updated_at) =
+                read_startup_heartbeat_updated_at(heartbeat_path, child_pid)
+            {
+                if last_startup_heartbeat_at
+                    .map(|last_seen| updated_at > last_seen)
+                    .unwrap_or(true)
+                {
+                    last_startup_heartbeat_at = Some(updated_at);
+                }
+            }
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Consider clearing the last-seen heartbeat when the file no longer matches the current process.
If `read_startup_heartbeat_updated_at` returns `None` (PID mismatch or non-`starting` state), `last_startup_heartbeat_at` keeps its previous value. That allows an old heartbeat from a previous process to be treated as stale for the current process and trigger an idle-timeout error. To avoid this, consider setting `last_startup_heartbeat_at` to `None` whenever the read returns `None`, so staleness is only checked against heartbeats from the current PID.
</issue_to_address>
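A minimal sketch of this suggestion, with a hypothetical `update_last_heartbeat` helper standing in for the loop body above (names other than `last_startup_heartbeat_at`'s role are illustrative, not the PR's actual code):

```rust
use std::time::{Duration, SystemTime};

// Track the last heartbeat seen for the *current* PID, and drop it entirely
// whenever the read no longer matches (PID mismatch, non-"starting" state,
// or unreadable file), so a previous process's heartbeat cannot trigger the
// idle-timeout error for this one.
fn update_last_heartbeat(
    last_seen: Option<SystemTime>,
    current_read: Option<SystemTime>,
) -> Option<SystemTime> {
    match current_read {
        // Keep the newest timestamp observed for this process.
        Some(updated_at) => match last_seen {
            Some(prev) if prev >= updated_at => Some(prev),
            _ => Some(updated_at),
        },
        // Read no longer matches the current process: forget the old value.
        None => None,
    }
}

fn main() {
    let t = SystemTime::now();
    let later = t + Duration::from_secs(1);
    // A newer heartbeat for the current PID replaces the old one.
    assert_eq!(update_last_heartbeat(Some(t), Some(later)), Some(later));
    // A non-matching read clears the previous value instead of keeping it.
    assert_eq!(update_last_heartbeat(Some(t), None), None);
    assert_eq!(update_last_heartbeat(None, Some(t)), Some(t));
    println!("ok");
}
```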
@sourcery-ai review
Code Review
This pull request introduces a heartbeat mechanism to improve the reliability of the backend startup monitoring process. The backend now periodically writes a JSON heartbeat file containing its PID and status, while the Tauri frontend monitors this file during the readiness check to distinguish between a slow-starting process and a hung one. Changes include new environment variables for heartbeat paths and idle timeouts, a Python implementation of the heartbeat loop in the launcher script, and updated Rust logic to validate heartbeat freshness during the HTTP readiness probe. I have no feedback to provide as there are no review comments to assess.
Hey - I've left some high level feedback:
- The startup heartbeat path resolution is duplicated in both `runtime::backend_readiness_config` and `BackendState::spawn_backend`; consider computing the path once and reusing it for both the env var and the readiness config to avoid divergence if the resolution rules change later.
- In `StartupHeartbeatFile`, using a `String` for `state` makes it easy for typos or unexpected values to slip through; consider switching this to an enum with `serde` tags (e.g. `Starting`/`Stopping`) so only valid states are accepted at deserialization time.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The startup heartbeat path resolution is duplicated in both `runtime::backend_readiness_config` and `BackendState::spawn_backend`; consider computing the path once and reusing it for both the env var and the readiness config to avoid divergence if the resolution rules change later.
- In `StartupHeartbeatFile`, using a `String` for `state` makes it easy for typos or unexpected values to slip through; consider switching this to an enum with `serde` tags (e.g. `Starting`/`Stopping`) so only valid states are accepted at deserialization time.
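A dependency-free sketch of the enum suggestion: the real implementation would derive `serde::Deserialize` on the enum, but a hand-rolled `FromStr` is used here only so the example runs standalone; the effect is the same, in that only known states parse.

```rust
use std::str::FromStr;

// Suggested state enum; with serde available this would derive Deserialize
// (e.g. with #[serde(rename_all = "lowercase")]) instead of implementing
// FromStr by hand.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StartupHeartbeatState {
    Starting,
    Stopping,
}

impl FromStr for StartupHeartbeatState {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "starting" => Ok(Self::Starting),
            "stopping" => Ok(Self::Stopping),
            other => Err(format!("unknown heartbeat state: {other:?}")),
        }
    }
}

fn main() {
    // Valid states parse; typos are rejected instead of slipping through.
    assert_eq!("starting".parse::<StartupHeartbeatState>(), Ok(StartupHeartbeatState::Starting));
    assert!("strating".parse::<StartupHeartbeatState>().is_err());
    println!("ok");
}
```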
@sourcery-ai review
SourceryAI left a comment
Hey - I've found 1 issue, and left some high level feedback:
- Both `read_startup_heartbeat_updated_at` and the Python `write_startup_heartbeat` swallow all errors silently, which could make misconfiguration or file permission issues very hard to diagnose; consider logging at least a single warning (throttled) when JSON parsing or file writes repeatedly fail.
- In `startup_heartbeat_timestamp_is_fresh` you recompute `SystemTime::now()` at each call site (and separately when logging timeout details); if you care about precise age calculations and consistent diagnostics, consider threading a single `now` value through the readiness loop and into the log helper instead of calling `SystemTime::now()` multiple times.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Both `read_startup_heartbeat_updated_at` and the Python `write_startup_heartbeat` swallow all errors silently, which could make misconfiguration or file permission issues very hard to diagnose; consider logging at least a single warning (throttled) when JSON parsing or file writes repeatedly fail.
- In `startup_heartbeat_timestamp_is_fresh` you recompute `SystemTime::now()` at each call site (and separately when logging timeout details); if you care about precise age calculations and consistent diagnostics, consider threading a single `now` value through the readiness loop and into the log helper instead of calling `SystemTime::now()` multiple times.
## Individual Comments
### Comment 1
<location path="src-tauri/src/backend/readiness.rs" line_range="198-204" />
<code_context>
+    Stopping,
+}
+
+fn read_startup_heartbeat_updated_at(path: &Path, expected_pid: u32) -> Option<SystemTime> {
+    let payload = fs::read_to_string(path).ok()?;
+    let heartbeat: StartupHeartbeatFile = serde_json::from_str(&payload).ok()?;
+    if heartbeat.pid != expected_pid || heartbeat.state != StartupHeartbeatState::Starting {
+        return None;
+    }
+    Some(UNIX_EPOCH + Duration::from_millis(heartbeat.updated_at_ms))
+}
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Guard against overflow when converting `updated_at_ms` into `SystemTime`.
On some platforms, `UNIX_EPOCH + Duration::from_millis(...)` will panic if `updated_at_ms` is large enough that the resulting `SystemTime` is out of range (e.g., due to corruption or manual edits). Using `UNIX_EPOCH.checked_add(Duration::from_millis(...))` and returning `None` on overflow would avoid a crash and treat a bad heartbeat as "no heartbeat" instead.
</issue_to_address>
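A quick standalone check of the `checked_add` guard suggested above (the helper name `heartbeat_updated_at` is hypothetical, not the PR's):

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Guarded conversion: a corrupt or adversarial updated_at_ms yields None
// instead of panicking when the sum would leave the platform's representable
// SystemTime range.
fn heartbeat_updated_at(updated_at_ms: u64) -> Option<SystemTime> {
    UNIX_EPOCH.checked_add(Duration::from_millis(updated_at_ms))
}

fn main() {
    // A plausible millisecond timestamp converts on every platform.
    assert!(heartbeat_updated_at(1_700_000_000_000).is_some());
    // Whether u64::MAX milliseconds overflows depends on the platform's
    // SystemTime representation; either way checked_add reports overflow
    // as None instead of panicking.
    let _ = heartbeat_updated_at(u64::MAX);
    // An addition guaranteed to exceed any platform's range comes back None.
    assert!(UNIX_EPOCH.checked_add(Duration::MAX).is_none());
    println!("ok");
}
```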
Hey - I've found 1 issue, and left some high level feedback:
- In the startup readiness loop, `next_startup_heartbeat_at` clears the previous timestamp whenever the current read returns `None`, which means a disappeared/invalid heartbeat file won't trigger the idle-timeout fast-fail and will instead only be caught by the global timeout; consider treating a transition from `Some` to `None` as stale (or at least logging it) to surface regressions in the heartbeat writer more clearly.
- The Python `write_startup_heartbeat` helper fully swallows all exceptions, including on the initial write; since the Rust side silently ignores missing/invalid heartbeats, a one-time warning to stderr or a debug log when the first write fails would make diagnosing misconfigured `ASTRBOT_BACKEND_STARTUP_HEARTBEAT_PATH` or permission issues much easier without impacting normal operation.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In the startup readiness loop, `next_startup_heartbeat_at` clears the previous timestamp whenever the current read returns `None`, which means a disappeared/invalid heartbeat file won't trigger the idle-timeout fast-fail and will instead only be caught by the global timeout—consider treating a transition from `Some` to `None` as stale (or at least logging it) to surface regressions in the heartbeat writer more clearly.
- The Python `write_startup_heartbeat` helper fully swallows all exceptions, including on the initial write; since the Rust side silently ignores missing/invalid heartbeats, a one-time warning to stderr or a debug log when the first write fails would make diagnosing misconfigured `ASTRBOT_BACKEND_STARTUP_HEARTBEAT_PATH` or permission issues much easier without impacting normal operation.
## Individual Comments
### Comment 1
<location path="scripts/backend/templates/launch_backend.py" line_range="129-138" />
<code_context>
+    return Path(raw)
+
+
+def write_startup_heartbeat(path: Path, state: str) -> None:
+    try:
+        path.parent.mkdir(parents=True, exist_ok=True)
+        payload = {
+            "pid": os.getpid(),
+            "state": state,
+            "updated_at_ms": int(time.time() * 1000),
+        }
+        temp_path = path.with_name(f"{path.name}.tmp")
+        temp_path.write_text(
+            json.dumps(payload, separators=(",", ":")),
+            encoding="utf-8",
+        )
+        temp_path.replace(path)
+    except Exception:
+        return
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Swallowing all exceptions on heartbeat writes can make diagnosing misconfiguration difficult.
This unconditional `except Exception` prevents visibility into issues like permission errors, invalid paths, or serialization failures. Please log at least a brief message (e.g., to stderr, optionally behind an env flag or limited to expected exception types) so that heartbeat write failures are diagnosable when readiness based on this file doesn’t work.
Suggested implementation:
```python
def resolve_startup_heartbeat_path() -> Path | None:
    raw = os.environ.get(STARTUP_HEARTBEAT_ENV, "").strip()
    if not raw:
        return None
    return Path(raw)


def write_startup_heartbeat(path: Path, state: str) -> None:
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        payload = {
            "pid": os.getpid(),
            "state": state,
            "updated_at_ms": int(time.time() * 1000),
        }
        temp_path = path.with_name(f"{path.name}.tmp")
        temp_path.write_text(
            json.dumps(payload, separators=(",", ":")),
            encoding="utf-8",
        )
        temp_path.replace(path)
    except Exception as exc:
        if os.environ.get("STARTUP_HEARTBEAT_DEBUG"):
            print(
                f"[startup-heartbeat] failed to write heartbeat to {path!s}: {exc.__class__.__name__}: {exc}",
                file=sys.stderr,
            )
        return


import atexit
```
If you prefer logging only specific, expected exceptions (e.g., `OSError`, `json.JSONDecodeError`) instead of all exceptions, you can narrow the `except` clause accordingly. Also, if you want the debug env var name to be a constant alongside `STARTUP_HEARTBEAT_ENV`, introduce something like `STARTUP_HEARTBEAT_DEBUG_ENV = "STARTUP_HEARTBEAT_DEBUG"` near its definition and replace the hard-coded string in `os.environ.get(...)` with that constant for consistency.
</issue_to_address>
@sourcery-ai review
Hey - I've found 2 issues, and left some high level feedback:
- In the readiness loop you call `SystemTime::now()` multiple times per iteration (for heartbeat freshness and for timeout logging); consider capturing `now` once at the top of the loop and passing it through to keep timing consistent and avoid redundant syscalls.
- For `StartupHeartbeatFile` you might want to add `#[serde(deny_unknown_fields)]` to fail fast on malformed heartbeat files with unexpected fields rather than silently ignoring potential format drift or operator mistakes.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In the readiness loop you call `SystemTime::now()` multiple times per iteration (for heartbeat freshness and for timeout logging); consider capturing `now` once at the top of the loop and passing it through to keep timing consistent and avoid redundant syscalls.
- For `StartupHeartbeatFile` you might want to add `#[serde(deny_unknown_fields)]` to fail fast on malformed heartbeat files with unexpected fields rather than silently ignoring potential format drift or operator mistakes.
## Individual Comments
### Comment 1
<location path="src-tauri/src/backend/readiness.rs" line_range="226-232" />
<code_context>
+    Invalidated(SystemTime),
+}
+
+fn read_startup_heartbeat_updated_at(path: &Path, expected_pid: u32) -> Option<SystemTime> {
+    let payload = fs::read_to_string(path).ok()?;
+    let heartbeat: StartupHeartbeatFile = serde_json::from_str(&payload).ok()?;
+    if heartbeat.pid != expected_pid || heartbeat.state != StartupHeartbeatState::Starting {
+        return None;
+    }
+    Some(UNIX_EPOCH + Duration::from_millis(heartbeat.updated_at_ms))
+}
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Guard against potential overflow when converting `updated_at_ms` into a `SystemTime`.
`updated_at_ms` is a `u64` from an external file and is added to `UNIX_EPOCH` via `Duration::from_millis`. For very large values (e.g., near `u64::MAX`), `UNIX_EPOCH + Duration` can panic. Since this function already returns `Option`, consider using `SystemTime::UNIX_EPOCH.checked_add(Duration::from_millis(heartbeat.updated_at_ms))?` or clamping to a reasonable maximum, so malformed/adversarial heartbeat files don’t cause a panic.
</issue_to_address>
### Comment 2
<location path="scripts/backend/templates/launch_backend.py" line_range="155" />
<code_context>
+    return False
+
+
+def start_startup_heartbeat() -> None:
+    heartbeat_path = resolve_startup_heartbeat_path()
+    if heartbeat_path is None:
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring `start_startup_heartbeat` to use a small helper class that encapsulates the shared state and related behavior instead of nested functions with `nonlocal`.
You can reduce the inline complexity in `start_startup_heartbeat` by encapsulating the shared state (`warning_emitted`, `stop_event`, `heartbeat_path`) and behavior into a small helper class instead of nested functions with `nonlocal`.
This keeps all behavior the same (including the first write, warning semantics, atexit behavior, and periodic loop), but makes the control flow easier to follow and test.
```python
class StartupHeartbeat:
    def __init__(self, path: Path, interval: float) -> None:
        self._path = path
        self._interval = interval
        self._stop_event = threading.Event()
        self._warning_emitted = False

    def _write(self, state: str, *, warn_on_error: bool) -> bool:
        ok = write_startup_heartbeat(
            self._path,
            state,
            warn_on_error=warn_on_error,
        )
        if not ok:
            self._warning_emitted = True
        return ok

    def start(self) -> None:
        # initial write: always warn on error
        self._write("starting", warn_on_error=True)
        atexit.register(self.stop)
        threading.Thread(
            target=self._loop,
            name="astrbot-startup-heartbeat",
            daemon=True,
        ).start()

    def stop(self) -> None:
        self._stop_event.set()
        # respect prior failure: only warn if we haven't already
        self._write("stopping", warn_on_error=not self._warning_emitted)

    def _loop(self) -> None:
        while not self._stop_event.wait(self._interval):
            self._write("starting", warn_on_error=not self._warning_emitted)
```
`start_startup_heartbeat` then becomes a thin, declarative wrapper:
```python
def start_startup_heartbeat() -> None:
    heartbeat_path = resolve_startup_heartbeat_path()
    if heartbeat_path is None:
        return
    StartupHeartbeat(
        heartbeat_path,
        STARTUP_HEARTBEAT_INTERVAL_SECONDS,
    ).start()
```
This removes nested closures and `nonlocal`, keeps lifetime management in one cohesive unit, and makes the module-level startup sequence (`configure_*`, `preload_*`, `start_startup_heartbeat`) straightforward to skim.
</issue_to_address>
@sourcery-ai review
Hey - I've found 1 issue
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="scripts/backend/templates/launch_backend.py" line_range="155-164" />
<code_context>
+class StartupHeartbeat:
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Reconsider when to suppress warnings on heartbeat write failures
Because `_warning_emitted` is set to `True` on the first write failure, all later writes (including the final `"stopping"` write in `stop`) use `warn_on_error=False`. A transient early failure can therefore silence more critical stop‑time failures. If you’re trying to avoid log spam, consider only suppressing repeated failures after at least one successful write (or rate‑limiting logs), while still ensuring a warning is emitted if the stop‑time write fails.
Suggested implementation:
```python
class StartupHeartbeat:
    def __init__(self, path: Path, interval_seconds: float) -> None:
        self._path = path
        self._interval_seconds = interval_seconds
        self._stop_event = threading.Event()
        # Tracks whether we have ever successfully written a heartbeat
        self._had_successful_write = False
        # Tracks whether we've already emitted a warning for the *current* failure sequence
        self._warning_emitted = False

    def _write(self, state: str, *, warn_on_error: bool) -> bool:
        # Only suppress warnings for repeated failures that occur *after* at least one
        # successful write. Before any success, every failure (including stop-time)
        # will still warn.
        effective_warn_on_error = warn_on_error and (
            not self._warning_emitted or not self._had_successful_write
        )
        ok = write_startup_heartbeat(
            self._path,
            state,
            warn_on_error=effective_warn_on_error,
        )
        if ok:
            # Any success resets the failure sequence and re-enables warning for
            # the next failure.
            self._had_successful_write = True
            self._warning_emitted = False
        elif effective_warn_on_error:
            # Mark that we've already warned for this failure sequence so
            # subsequent failures (without an intervening success) can be
            # suppressed to avoid log spam.
            self._warning_emitted = True
        return ok
```
These changes assume the previous implementation of `_write` matched the `SEARCH` block exactly. If it differs in your repo, adapt the `SEARCH` section to fully cover the old `_write` definition so it gets replaced cleanly.
Also, this implementation preserves the existing call sites’ semantics: callers still pass `warn_on_error` as before. The new logic ensures:
- Before any successful write, every failure (including the stop-time write) will log a warning.
- After at least one successful write, the first failure will log a warning, and subsequent failures in the same failure sequence will be suppressed until the next success.
</issue_to_address>
@sourcery-ai review
Hey - I've found 2 issues
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="src-tauri/src/backend/readiness.rs" line_range="85" />
<code_context>
}
+        };
+
+        if let Some(heartbeat_path) = readiness.startup_heartbeat_path.as_deref() {
+            match next_startup_heartbeat_at(
+                last_startup_heartbeat_at,
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting the startup heartbeat handling into a dedicated stateful helper function so the main readiness loop only coordinates high-level steps instead of inlining the state machine logic.
You can shrink the cognitive load of the main loop by pushing the heartbeat state machine into a dedicated helper and hiding `StartupHeartbeatObservation` and the timestamp freshness logic behind it. That keeps all behavior but reduces branching in the loop and makes the heartbeat logic more locally testable.
For example, introduce a small state struct and helper:
```rust
#[derive(Debug, Clone, Copy)]
struct StartupHeartbeatState {
    last_seen_at: Option<SystemTime>,
    logged_fresh: bool,
}

impl StartupHeartbeatState {
    fn new() -> Self {
        Self {
            last_seen_at: None,
            logged_fresh: false,
        }
    }
}

enum StartupHeartbeatStep {
    Continue(StartupHeartbeatState),
    Failed(String),
}
```
Then encapsulate all the heartbeat-related logic, including `next_startup_heartbeat_at` and `startup_heartbeat_timestamp_is_fresh`, into a single function:
```rust
fn step_startup_heartbeat(
    heartbeat_path: &Path,
    child_pid: u32,
    now: SystemTime,
    idle_timeout: Duration,
    mut state: StartupHeartbeatState,
) -> StartupHeartbeatStep {
    match next_startup_heartbeat_at(
        state.last_seen_at,
        read_startup_heartbeat_updated_at(heartbeat_path, child_pid),
    ) {
        StartupHeartbeatObservation::Missing => {
            state.last_seen_at = None;
            StartupHeartbeatStep::Continue(state)
        }
        StartupHeartbeatObservation::Observed(updated_at) => {
            state.last_seen_at = Some(updated_at);
            if startup_heartbeat_timestamp_is_fresh(state.last_seen_at, now, idle_timeout) {
                if !state.logged_fresh {
                    append_desktop_log(
                        "backend startup heartbeat is fresh while HTTP dashboard is not ready yet; waiting",
                    );
                    state.logged_fresh = true;
                }
                StartupHeartbeatStep::Continue(state)
            } else {
                append_desktop_log(
                    "backend startup heartbeat went stale before HTTP dashboard became ready",
                );
                StartupHeartbeatStep::Failed(format!(
                    "Backend startup heartbeat went stale after {}ms without HTTP readiness.",
                    idle_timeout.as_millis()
                ))
            }
        }
        StartupHeartbeatObservation::Invalidated(previous) => {
            let heartbeat_age_ms = now
                .duration_since(previous)
                .ok()
                .map(|age| age.as_millis().to_string())
                .unwrap_or_else(|| "unknown".to_string());
            append_desktop_log(&format!(
                "backend startup heartbeat disappeared or became invalid before HTTP dashboard became ready: last_valid_age_ms={heartbeat_age_ms}",
            ));
            StartupHeartbeatStep::Failed(
                "Backend startup heartbeat disappeared or became invalid before HTTP readiness."
                    .to_string(),
            )
        }
    }
}
```
Now the main loop only coordinates, instead of open-coding the state machine:
```rust
let mut heartbeat_state = StartupHeartbeatState::new();
loop {
    let (http_status, tcp_reachable) =
        self.probe_backend_readiness(&readiness.path, readiness.probe_timeout_ms);
    if matches!(http_status, Some(status_code) if (200..400).contains(&status_code)) {
        return Ok(());
    }
    let now = SystemTime::now();
    let child_pid = /* existing child lock / try_wait logic */;
    if let Some(heartbeat_path) = readiness.startup_heartbeat_path.as_deref() {
        match step_startup_heartbeat(
            heartbeat_path,
            child_pid,
            now,
            startup_idle_timeout,
            heartbeat_state,
        ) {
            StartupHeartbeatStep::Continue(next_state) => {
                heartbeat_state = next_state;
            }
            StartupHeartbeatStep::Failed(reason) => {
                return Err(reason);
            }
        }
    }
    // existing tcp_reachable + timeout + sleep logic
}
```
This keeps all functionality (including use of `StartupHeartbeatObservation`, `startup_heartbeat_timestamp_is_fresh`, and timeout snapshotting) but:
- centralizes heartbeat behavior and logging in one place
- removes multiple flags (`startup_heartbeat_logged`, `last_startup_heartbeat_at`) from the loop body
- makes it easier to unit test the heartbeat behavior with different sequences of observations.
</issue_to_address>
### Comment 2
<location path="scripts/backend/templates/launch_backend.py" line_range="155" />
<code_context>
+    return False
+
+
+class StartupHeartbeat:
+    def __init__(self, path: Path, interval_seconds: float) -> None:
+        self._path = path
</code_context>
<issue_to_address>
**issue (complexity):** Consider replacing the StartupHeartbeat class with a simple heartbeat loop function and local flags to make the lifecycle and warning logic more explicit and easier to follow.
The `StartupHeartbeat` class adds indirection and scattered state for behavior that can be expressed with a single loop function and a couple of local flags, while keeping all semantics (first-failure warnings, reset after success, always warn on stop, fire‑and‑forget thread).
You can collapse the class into a functional pattern and keep the warning behavior explicit:
```python
def heartbeat_loop(path: Path, interval: float, stop_event: threading.Event) -> None:
    had_success = False
    warned_on_failure = False
    # initial write
    ok = write_startup_heartbeat(path, "starting", warn_on_error=True)
    if ok:
        had_success = True
        warned_on_failure = False
    else:
        warned_on_failure = True
    while not stop_event.wait(interval):
        # warn on first failure, or before any success
        warn_now = (not warned_on_failure) or (not had_success)
        ok = write_startup_heartbeat(path, "starting", warn_on_error=warn_now)
        if ok:
            had_success = True
            warned_on_failure = False
        elif warn_now:
            warned_on_failure = True
```
Then wire it up in `start_startup_heartbeat` without the class:
```python
def start_startup_heartbeat() -> None:
    heartbeat_path = resolve_startup_heartbeat_path()
    if heartbeat_path is None:
        return
    stop_event = threading.Event()

    def on_exit() -> None:
        stop_event.set()
        # always warn on stop failure
        write_startup_heartbeat(heartbeat_path, "stopping", warn_on_error=True)

    atexit.register(on_exit)
    threading.Thread(
        target=heartbeat_loop,
        args=(heartbeat_path, STARTUP_HEARTBEAT_INTERVAL_SECONDS, stop_event),
        name="astrbot-startup-heartbeat",
        daemon=True,
    ).start()
```
This:
- Removes the `StartupHeartbeat` object and private state fields.
- Keeps all existing behavior (initial write, periodic writes, warning throttling, final “stopping” write with warnings).
- Makes the lifecycle and warning logic visible in one place, reducing cognitive load and indirection while preserving the new feature.
</issue_to_address>
@sourcery-ai review
Hey - I've found 1 issue
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="src-tauri/src/backend/readiness.rs" line_range="176" />
<code_context>
}
}
+
+#[derive(serde::Deserialize)]
+#[serde(deny_unknown_fields)]
+struct StartupHeartbeatFile {
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying the new heartbeat and timeout logic by collapsing helper types, letting the tracker be mutated directly, and extracting child PID handling so the main readiness loop reads more linearly.
You can reduce the new complexity without changing behavior by collapsing some of the extra types/indirection and making the loop more linear.
### 1. Simplify `step_startup_heartbeat` API and remove `StartupHeartbeatStep`
`StartupHeartbeatStep` is a bespoke `Result` and the tracker is small. Let `step_startup_heartbeat` mutate the tracker via `&mut` and return `Result<(), String>`:
```rust
#[derive(Debug, Clone, Copy)]
struct StartupHeartbeatTracker {
    last_seen_at: Option<SystemTime>,
    logged_fresh: bool,
}

impl StartupHeartbeatTracker {
    fn new() -> Self {
        Self {
            last_seen_at: None,
            logged_fresh: false,
        }
    }
}

fn step_startup_heartbeat(
    heartbeat_path: &Path,
    child_pid: u32,
    now: SystemTime,
    idle_timeout: Duration,
    state: &mut StartupHeartbeatTracker,
) -> Result<(), String> {
    let prev = state.last_seen_at;
    let current = read_startup_heartbeat_updated_at(heartbeat_path, child_pid);
    match (prev, current) {
        (Some(prev), None) => {
            let heartbeat_age_ms = now
                .duration_since(prev)
                .ok()
                .map(|age| age.as_millis().to_string())
                .unwrap_or_else(|| "unknown".to_string());
            append_desktop_log(&format!(
                "backend startup heartbeat disappeared or became invalid before HTTP dashboard became ready: last_valid_age_ms={heartbeat_age_ms}",
            ));
            Err("Backend startup heartbeat disappeared or became invalid before HTTP readiness.".to_string())
        }
        (None, None) => {
            state.last_seen_at = None;
            Ok(())
        }
        (_, Some(current)) => {
            let updated_at = match prev {
                Some(prev) if current <= prev => prev,
                _ => current,
            };
            state.last_seen_at = Some(updated_at);
            if startup_heartbeat_timestamp_is_fresh(state.last_seen_at, now, idle_timeout) {
                if !state.logged_fresh {
                    append_desktop_log(
                        "backend startup heartbeat is fresh while HTTP dashboard is not ready yet; waiting",
                    );
                    state.logged_fresh = true;
                }
                Ok(())
            } else {
                append_desktop_log(
                    "backend startup heartbeat went stale before HTTP dashboard became ready",
                );
                Err(format!(
                    "Backend startup heartbeat went stale after {}ms without HTTP readiness.",
                    idle_timeout.as_millis()
                ))
            }
        }
    }
}
```
Then the call site becomes simpler and more idiomatic:
```rust
if let Some(heartbeat_path) = readiness.startup_heartbeat_path.as_deref() {
    step_startup_heartbeat(
        heartbeat_path,
        child_pid,
        now,
        startup_idle_timeout,
        &mut startup_heartbeat_state,
    )?;
}
```
This removes `StartupHeartbeatStep` and `StartupHeartbeatObservation` while keeping all behavior.
### 2. Inline `heartbeat_updated_at_ms_to_system_time`
This helper is only used once and just forwards to `checked_add`. You can inline it into `read_startup_heartbeat_updated_at` and drop the function:
```rust
fn read_startup_heartbeat_updated_at(path: &Path, expected_pid: u32) -> Option<SystemTime> {
    let payload = fs::read_to_string(path).ok()?;
    let heartbeat: StartupHeartbeatFile = serde_json::from_str(&payload).ok()?;
    if heartbeat.pid != expected_pid || heartbeat.state != StartupHeartbeatState::Starting {
        return None;
    }
    UNIX_EPOCH.checked_add(Duration::from_millis(heartbeat.updated_at_ms))
}
```
Tests for `heartbeat_updated_at_ms_to_system_time` can be rewritten to assert against this behavior directly if still needed.
### 3. Remove `ReadinessTimeoutSnapshot`
`ReadinessTimeoutSnapshot` is just a tuple wrapper used in one place. You can pass the fields directly and compute the heartbeat age inside `log_backend_readiness_timeout`:
```rust
fn log_backend_readiness_timeout(
    &self,
    timeout: Duration,
    ready_http_path: &str,
    probe_timeout_ms: u64,
    now: SystemTime,
    last_http_status: Option<u16>,
    tcp_reachable: bool,
    last_startup_heartbeat_at: Option<SystemTime>,
) {
    let last_http_status_text = last_http_status
        .map(|status| status.to_string())
        .unwrap_or_else(|| "none".to_string());
    let startup_heartbeat_age_ms = last_startup_heartbeat_at
        .and_then(|updated_at| now.duration_since(updated_at).ok())
        .map(|age| age.as_millis().to_string())
        .unwrap_or_else(|| "none".to_string());
    append_desktop_log(&format!(
        "backend HTTP readiness check timed out after {}ms: backend_url={}, path={}, probe_timeout_ms={}, tcp_reachable={}, last_http_status={}, startup_heartbeat_age_ms={}",
        timeout.as_millis(),
        self.backend_url,
        ready_http_path,
        probe_timeout_ms,
        tcp_reachable,
        last_http_status_text,
        startup_heartbeat_age_ms,
    ));
}
```
Call site:
```rust
self.log_backend_readiness_timeout(
    limit,
    &readiness.path,
    readiness.probe_timeout_ms,
    now,
    http_status,
    ever_tcp_reachable,
    startup_heartbeat_state.last_seen_at,
);
```
This removes one struct and a layer of indirection without changing logging.
### 4. Extract child PID / status polling into a helper
The inner block in the loop can be pulled into a small helper to flatten the loop:
```rust
fn child_pid_or_error(&self) -> Result<u32, String> {
    let mut guard = self
        .child
        .lock()
        .map_err(|_| "Backend process lock poisoned.".to_string())?;
    if let Some(child) = guard.as_mut() {
        let pid = child.id();
        match child.try_wait() {
            Ok(Some(status)) => {
                *guard = None;
                Err(format!("Backend process exited before becoming reachable: {status}"))
            }
            Ok(None) => Ok(pid),
            Err(error) => Err(format!("Failed to poll backend process status: {error}")),
        }
    } else {
        Err("Backend process is not running.".to_string())
    }
}
```
Loop body then reads more linearly:
```rust
let now = SystemTime::now();
let child_pid = self.child_pid_or_error()?;
if let Some(heartbeat_path) = readiness.startup_heartbeat_path.as_deref() {
step_startup_heartbeat(
heartbeat_path,
child_pid,
now,
startup_idle_timeout,
&mut startup_heartbeat_state,
)?;
}
```
These changes keep your new behavior (heartbeat tracking, logging, timeout info) but reduce the number of custom types and split logic, making the readiness loop and heartbeat handling easier to follow.
</issue_to_address>
@sourcery-ai review
Hey - I've found 2 issues
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="src-tauri/src/backend/readiness.rs" line_range="115" />
<code_context>
(http_status, tcp_reachable)
}
+ fn child_pid_or_error(&self) -> Result<u32, String> {
+ let mut guard = self
+ .child
</code_context>
<issue_to_address>
**issue (complexity):** Consider splitting child PID retrieval from liveness checks and factoring out repeated time-delta logic to make the readiness and heartbeat flow clearer and easier to read.
You can simplify a couple of the new pieces without changing behavior.
### 1. Separate PID lookup from liveness checking
`child_pid_or_error` currently does three things: acquires the lock, checks liveness, and returns the PID. That makes every PID lookup implicitly mutate state and pay a `try_wait` cost.
You can split this into two helpers so the responsibilities are clearer, while keeping behavior identical in the caller:
```rust
fn child_pid(&self) -> Result<u32, String> {
let guard = self
.child
.lock()
.map_err(|_| "Backend process lock poisoned.".to_string())?;
guard
.as_ref()
.map(|child| child.id())
.ok_or_else(|| "Backend process is not running.".to_string())
}
fn ensure_child_alive(&self) -> Result<(), String> {
let mut guard = self
.child
.lock()
.map_err(|_| "Backend process lock poisoned.".to_string())?;
if let Some(child) = guard.as_mut() {
match child.try_wait() {
Ok(Some(status)) => {
*guard = None;
Err(format!(
"Backend process exited before becoming reachable: {status}"
))
}
Ok(None) => Ok(()),
Err(error) => Err(format!("Failed to poll backend process status: {error}")),
}
} else {
Err("Backend process is not running.".to_string())
}
}
```
Then in the loop you can make the lifecycle handling explicit:
```rust
let now = SystemTime::now();
self.ensure_child_alive()?;
let child_pid = self.child_pid()?;
if let Some(heartbeat_path) = readiness.startup_heartbeat_path.as_deref() {
step_startup_heartbeat(
heartbeat_path,
child_pid,
now,
startup_idle_timeout,
&mut startup_heartbeat_state,
)?;
}
```
This keeps the same failure modes but makes “is the process alive?” a visible step instead of hidden inside “get me the PID.”
### 2. Centralize the `duration_since(...).ok()` / age math
You currently repeat the `duration_since(...).ok()` pattern both in `startup_heartbeat_timestamp_is_fresh` and in `log_backend_readiness_timeout`. A tiny helper reduces duplication and makes the freshness logic easier to scan:
```rust
fn ms_since(earlier: SystemTime, now: SystemTime) -> Option<u128> {
now.duration_since(earlier).ok().map(|d| d.as_millis())
}
```
Use it in the tracker:
```rust
fn startup_heartbeat_timestamp_is_fresh(
updated_at: Option<SystemTime>,
now: SystemTime,
max_age: Duration,
) -> bool {
updated_at
.and_then(|updated_at| ms_since(updated_at, now))
.is_some_and(|age_ms| age_ms <= max_age.as_millis())
}
```
And in the timeout logging:
```rust
let startup_heartbeat_age_ms = last_startup_heartbeat_at
.and_then(|updated_at| ms_since(updated_at, now))
.map(|age| age.to_string())
.unwrap_or_else(|| "none".to_string());
```
This also makes it easy to reuse age computation elsewhere if the heartbeat behavior evolves.
### 3. Minor cleanup in `step_startup_heartbeat`
The `(None, None)` branch explicitly sets `state.last_seen_at = None`, which is already known to be `None` and can be dropped for less noise:
```rust
(None, None) => Ok(()),
```
This doesn’t change behavior but makes the match a bit easier to read.
</issue_to_address>
### Comment 2
<location path="scripts/backend/templates/launch_backend.py" line_range="155" />
<code_context>
+ return False
+
+
+def heartbeat_loop(
+ path: Path, interval_seconds: float, stop_event: threading.Event
+) -> None:
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring the heartbeat logic and heartbeat file writing into smaller, well‑named helpers to make the state and responsibilities easier to understand without changing behavior.
You can keep the current behavior but reduce the mental overhead a bit with some small refactors.
### 1. Make `heartbeat_loop` state easier to follow
The behavior is “log every failure until a first success, then log only the first failure in each new error streak.” The current flags encode this but are hard to read.
Renaming and encapsulating the condition helps without changing behavior:
```python
def heartbeat_loop(
path: Path, interval_seconds: float, stop_event: threading.Event
) -> None:
# At least one successful write has happened.
had_successful_write = False
# A warning has already been emitted since the last successful write.
warning_emitted_since_last_success = False
def should_warn() -> bool:
# Before first success: always warn on failure.
# After first success: only warn on the first failure per error streak.
return (not had_successful_write) or (not warning_emitted_since_last_success)
ok = write_startup_heartbeat(path, "starting", warn_on_error=True)
if ok:
had_successful_write = True
else:
warning_emitted_since_last_success = True
while not stop_event.wait(interval_seconds):
warn_now = should_warn()
ok = write_startup_heartbeat(path, "starting", warn_on_error=warn_now)
if ok:
had_successful_write = True
warning_emitted_since_last_success = False
elif warn_now:
warning_emitted_since_last_success = True
```
This keeps the same semantics but removes the ad‑hoc expression inside the loop and makes the states self-documenting.
### 2. Split `write_startup_heartbeat` into focused helpers
You can separate “what we write” from “how we write it” to shrink the responsibility of `write_startup_heartbeat`, while preserving external behavior:
```python
def build_heartbeat_payload(state: str) -> dict[str, object]:
return {
"pid": os.getpid(),
"state": state,
"updated_at_ms": int(time.time() * 1000),
}
def atomic_write_json(path: Path, payload: dict[str, object]) -> None:
temp_path = path.with_name(f"{path.name}.tmp")
temp_path.write_text(
json.dumps(payload, separators=(",", ":")),
encoding="utf-8",
)
temp_path.replace(path)
def write_startup_heartbeat(
path: Path, state: str, *, warn_on_error: bool = False
) -> bool:
try:
path.parent.mkdir(parents=True, exist_ok=True)
payload = build_heartbeat_payload(state)
atomic_write_json(path, payload)
return True
except Exception as exc:
if warn_on_error:
print(
f"[startup-heartbeat] failed to write heartbeat to {path}: "
f"{exc.__class__.__name__}: {exc}",
file=sys.stderr,
)
return False
```
Call sites stay the same, but reasoning about each piece gets easier: the payload shape, the atomic write pattern, and the error/reporting logic are now isolated.
</issue_to_address>
@sourcery-ai review
Hey - I've found 1 issue, and left some high level feedback:
- The `ASTRBOT_BACKEND_STARTUP_HEARTBEAT_PATH` env var name is duplicated between Rust (`BACKEND_STARTUP_HEARTBEAT_PATH_ENV`) and `launch_backend.py` (`STARTUP_HEARTBEAT_ENV`); consider adding an explicit comment or a small shared-generation mechanism to avoid future drift if the name changes.
- In `resolve_backend_startup_heartbeat_path`, the precedence between `root_dir` and `ASTRBOT_ROOT` is subtle; adding a brief comment about why `root_dir` wins over the env var and how packaged vs dev roots are chosen would make future changes safer.
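One way to make the "small shared-generation mechanism" concrete is a drift-guard test. Below is a minimal Python sketch: the env-var literal is the one named in this PR, but the file paths and function names are assumptions for illustration, not code from the repository.

```python
from pathlib import Path

# The shared env-var literal; both the Rust and Python sides must use it.
HEARTBEAT_ENV_VAR = "ASTRBOT_BACKEND_STARTUP_HEARTBEAT_PATH"


def sources_agree_on_env_var(
    rust_source: str, python_source: str, name: str = HEARTBEAT_ENV_VAR
) -> bool:
    """Return True when both source texts mention the exact literal.

    Renaming the variable on one side only makes this check fail in CI
    before it can cause a runtime mismatch between launcher and probe.
    """
    return name in rust_source and name in python_source


def check_repo(repo_root: Path) -> bool:
    # File locations are assumptions based on the files discussed in this review.
    rust = (repo_root / "src-tauri/src/backend/readiness.rs").read_text(encoding="utf-8")
    py = (repo_root / "scripts/backend/templates/launch_backend.py").read_text(encoding="utf-8")
    return sources_agree_on_env_var(rust, py)
```

A test like this is cheap to run in CI and catches a rename on either side without requiring any shared build tooling.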
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `ASTRBOT_BACKEND_STARTUP_HEARTBEAT_PATH` env var name is duplicated between Rust (`BACKEND_STARTUP_HEARTBEAT_PATH_ENV`) and `launch_backend.py` (`STARTUP_HEARTBEAT_ENV`); consider adding an explicit comment or a small shared-generation mechanism to avoid future drift if the name changes.
- In `resolve_backend_startup_heartbeat_path`, the precedence between `root_dir` and `ASTRBOT_ROOT` is subtle; adding a brief comment about why `root_dir` wins over the env var and how packaged vs dev roots are chosen would make future changes safer.
## Individual Comments
### Comment 1
<location path="scripts/backend/templates/launch_backend.py" line_range="162" />
<code_context>
+ return False
+
+
+def heartbeat_loop(
+ path: Path, interval_seconds: float, stop_event: threading.Event
+) -> None:
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying the `heartbeat_loop` warning logic to use a single flag that tracks consecutive failures instead of multiple booleans and a helper function.
You can simplify `heartbeat_loop`’s state machine without changing behavior by tracking only “have we already warned for this run of consecutive failures?” instead of two booleans plus `should_warn()`.
Current behavior:
- Always warn on the first failure.
- After a success, warn on the first subsequent failure.
- Suppress repeated warnings for continuous failures.
That can be expressed with a single flag:
```python
def heartbeat_loop(
path: Path, interval_seconds: float, stop_event: threading.Event
) -> None:
# Have we already emitted a warning for the current run of consecutive failures?
warned_for_current_failure_run = False
# Initial write
ok = write_startup_heartbeat(path, "starting", warn_on_error=True)
warned_for_current_failure_run = not ok
while not stop_event.wait(interval_seconds):
warn_now = not warned_for_current_failure_run
ok = write_startup_heartbeat(path, "starting", warn_on_error=warn_now)
if ok:
# Reset on success so the next failure warns again
warned_for_current_failure_run = False
elif warn_now:
warned_for_current_failure_run = True
```
This keeps the “warn-on-first-failure-after-success” semantics, but removes the extra `had_successful_write` flag and nested `should_warn()` helper, making the control flow easier to follow and test.
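The warn/suppress sequence can also be checked in isolation. This is a hypothetical pure-function sketch of the same single-flag scheme (the helper name and list-based interface are illustrative, not from the PR):

```python
def warn_decisions(results: list[bool]) -> list[bool]:
    """Given a sequence of write results (True = success), return which
    iterations would emit a warning under the single-flag scheme:
    warn on the first failure of each consecutive-failure run only."""
    warned_for_current_failure_run = False
    decisions = []
    for ok in results:
        warn_now = not warned_for_current_failure_run
        decisions.append(warn_now and not ok)
        if ok:
            # Reset on success so the next failure warns again.
            warned_for_current_failure_run = False
        elif warn_now:
            warned_for_current_failure_run = True
    return decisions
```

For example, `[fail, fail, ok, fail, fail]` should warn only on the first failure of each run.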
</issue_to_address>
@sourcery-ai review
Hey - I've found 2 issues, and left some high level feedback:
- In `BackendState::wait_for_ready`, you call both `ensure_child_alive()` and `child_pid()`, each taking the child lock separately on every loop iteration; consider refactoring to fetch and check the child process once per iteration (e.g., return the PID from `ensure_child_alive`) to avoid repeated locking and reduce contention.
- In `resolve_backend_startup_heartbeat_path`, the `ASTRBOT_ROOT` environment variable name is hard-coded; it may be safer to reuse a shared constant (similar to other env-var constants in `app_constants.rs`) to avoid drift if the name ever changes.
- In `start_startup_heartbeat`, the `atexit` handler writes a final `stopping` heartbeat without coordinating with the heartbeat thread beyond setting `stop_event`; if you want a stronger guarantee on the final state, consider optionally joining the thread or adding a small wait to reduce the race with in-flight writes.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `BackendState::wait_for_ready`, you call both `ensure_child_alive()` and `child_pid()`, each taking the child lock separately on every loop iteration; consider refactoring to fetch and check the child process once per iteration (e.g., return the PID from `ensure_child_alive`) to avoid repeated locking and reduce contention.
- In `resolve_backend_startup_heartbeat_path`, the `ASTRBOT_ROOT` environment variable name is hard-coded; it may be safer to reuse a shared constant (similar to other env-var constants in `app_constants.rs`) to avoid drift if the name ever changes.
- In `start_startup_heartbeat`, the `atexit` handler writes a final `stopping` heartbeat without coordinating with the heartbeat thread beyond setting `stop_event`; if you want a stronger guarantee on the final state, consider optionally joining the thread or adding a small wait to reduce the race with in-flight writes.
## Individual Comments
### Comment 1
<location path="src-tauri/src/backend/readiness.rs" line_range="162-165" />
<code_context>
let last_http_status_text = last_http_status
.map(|status| status.to_string())
.unwrap_or_else(|| "none".to_string());
+ let startup_heartbeat_age_ms = last_startup_heartbeat_at
+ .and_then(|updated_at| ms_since(updated_at, now))
+ .map(|age| age.to_string())
+ .unwrap_or_else(|| "none".to_string());
append_desktop_log(&format!(
- "backend HTTP readiness check timed out after {}ms: backend_url={}, path={}, probe_timeout_ms={}, tcp_reachable={}, last_http_status={}",
</code_context>
<issue_to_address>
**suggestion:** Timeout logging conflates "no heartbeat" with clock issues by reporting `"none"` for all `ms_since` failures.
If `last_startup_heartbeat_at` is `Some` but `ms_since` returns `None` (e.g. timestamp is in the future), we log `startup_heartbeat_age_ms=none`, which is indistinguishable from "no heartbeat seen". It would help diagnostics to distinguish these cases (e.g. log a specific value like "future" or include the raw timestamp when `ms_since` fails).
```suggestion
let startup_heartbeat_age_ms = match last_startup_heartbeat_at {
Some(updated_at) => match ms_since(updated_at, now) {
Some(age) => age.to_string(),
None => format!("future ({:?})", updated_at),
},
None => "none".to_string(),
};
```
</issue_to_address>
### Comment 2
<location path="scripts/backend/templates/launch_backend.py" line_range="130" />
<code_context>
+ return Path(raw)
+
+
+def build_heartbeat_payload(state: str) -> dict[str, object]:
+ return {
+ "pid": os.getpid(),
</code_context>
<issue_to_address>
**issue (complexity):** Consider inlining the heartbeat JSON construction and atomic write logic directly into write_startup_heartbeat to avoid unnecessary helper indirection.
You can reduce indirection without changing behavior by inlining the tightly-coupled helpers into `write_startup_heartbeat`.
Right now:
```python
def build_heartbeat_payload(state: str) -> dict[str, object]:
return {
"pid": os.getpid(),
"state": state,
"updated_at_ms": int(time.time() * 1000),
}
def atomic_write_json(path: Path, payload: dict[str, object]) -> None:
temp_path = path.with_name(f"{path.name}.tmp")
temp_path.write_text(
json.dumps(payload, separators=(",", ":")),
encoding="utf-8",
)
temp_path.replace(path)
def write_startup_heartbeat(
path: Path, state: str, *, warn_on_error: bool = False
) -> bool:
try:
path.parent.mkdir(parents=True, exist_ok=True)
atomic_write_json(path, build_heartbeat_payload(state))
return True
except Exception as exc:
...
```
These helpers are only used together in a single call path and don’t provide reuse elsewhere, so they add mental hops without buying much.
You can keep the atomic write semantics and JSON format but collapse this into a single function:
```python
def write_startup_heartbeat(
path: Path, state: str, *, warn_on_error: bool = False
) -> bool:
try:
path.parent.mkdir(parents=True, exist_ok=True)
temp_path = path.with_name(f"{path.name}.tmp")
payload = {
"pid": os.getpid(),
"state": state,
"updated_at_ms": int(time.time() * 1000),
}
temp_path.write_text(
json.dumps(payload, separators=(",", ":")),
encoding="utf-8",
)
temp_path.replace(path)
return True
except Exception as exc:
if warn_on_error:
print(
f"[startup-heartbeat] failed to write heartbeat to {path}: "
f"{exc.__class__.__name__}: {exc}",
file=sys.stderr,
)
return False
```
Then you can remove `build_heartbeat_payload` and `atomic_write_json` entirely. This makes the heartbeat write path self-contained and easier to scan, while preserving all existing behavior (atomic replace, payload shape, logging).
</issue_to_address>
@sourcery-ai review
Hey - I've left some high level feedback:
- The startup heartbeat freshness logic is based on SystemTime (including ms_since and describe_heartbeat_age), which makes it sensitive to wall‑clock adjustments; consider basing freshness and idle timeout decisions on a monotonic clock (Instant + durations) and only using SystemTime for logging.
- step_startup_heartbeat treats any transition from Some(last_seen_at) to None (e.g., a single failed read/parse) as a hard error, which could make startup brittle in the face of transient filesystem/IO issues; you might want to require a grace period or consecutive failures before treating the heartbeat as permanently disappeared.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The startup heartbeat freshness logic is based on SystemTime (including ms_since and describe_heartbeat_age), which makes it sensitive to wall‑clock adjustments; consider basing freshness and idle timeout decisions on a monotonic clock (Instant + durations) and only using SystemTime for logging.
- step_startup_heartbeat treats any transition from Some(last_seen_at) to None (e.g., a single failed read/parse) as a hard error, which could make startup brittle in the face of transient filesystem/IO issues; you might want to require a grace period or consecutive failures before treating the heartbeat as permanently disappeared.
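For illustration, the monotonic-clock suggestion can be sketched in Python with `time.monotonic()`, the analogue of Rust's `Instant`. The names here are illustrative, not from the PR:

```python
import time


def make_progress_tracker(max_age_seconds: float):
    """Track heartbeat progress on a monotonic clock.

    Wall-clock (SystemTime) jumps cannot make a live backend look stale,
    because time.monotonic() never goes backwards or jumps with NTP/user
    clock adjustments.
    """
    last_progress = None  # monotonic timestamp of the last observed progress

    def record_progress() -> None:
        nonlocal last_progress
        last_progress = time.monotonic()

    def is_fresh() -> bool:
        if last_progress is None:
            # No progress recorded yet: treat as fresh until the first beat.
            return True
        return (time.monotonic() - last_progress) <= max_age_seconds

    return record_progress, is_fresh
```

The wall-clock timestamp from the heartbeat file would then be used only for logging, while freshness and idle-timeout decisions run entirely on the monotonic values.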
@sourcery-ai review
Hey - I've found 3 issues, and left some high level feedback:
- The Rust tests `startup_heartbeat_progress_is_fresh_for_recent_instant` and `startup_heartbeat_progress_is_not_fresh_when_stale` create a `TempDir` and write a heartbeat file that is never read, which adds noise without exercising the heartbeat file parsing; consider removing the unused file I/O or adjusting the tests to cover `read_startup_heartbeat_updated_at` explicitly.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The Rust tests `startup_heartbeat_progress_is_fresh_for_recent_instant` and `startup_heartbeat_progress_is_not_fresh_when_stale` create a `TempDir` and write a heartbeat file that is never read, which adds noise without exercising the heartbeat file parsing; consider removing the unused file I/O or adjusting the tests to cover `read_startup_heartbeat_updated_at` explicitly.
## Individual Comments
### Comment 1
<location path="scripts/backend/templates/launch_backend.py" line_range="139-145" />
<code_context>
+ }
+
+
+def atomic_write_json(path: Path, payload: dict[str, object]) -> None:
+ temp_path = path.with_name(f"{path.name}.tmp")
+ temp_path.write_text(
+ json.dumps(payload, separators=(",", ":")),
+ encoding="utf-8",
+ )
+ temp_path.replace(path)
+
+
</code_context>
<issue_to_address>
**suggestion:** Handle failures of `replace` in `atomic_write_json` to avoid leaving stale temp files silently.
If `temp_path.replace(path)` fails (e.g., permissions, transient FS issues, AV interference), the temp file is left behind and the error isn’t distinguished from the write step. Consider wrapping `replace` in a `try` and surfacing its failure in the heartbeat warning, and/or explicitly `unlink`ing the temp file on failure to avoid accumulating stale `.tmp` files.
Suggested implementation:
```python
def build_heartbeat_payload(state: str) -> dict[str, object]:
return {
"pid": os.getpid(),
"state": state,
"updated_at_ms": int(time.time() * 1000),
}
def atomic_write_json(path: Path, payload: dict[str, object]) -> None:
temp_path = path.with_name(f"{path.name}.tmp")
temp_path.write_text(
json.dumps(payload, separators=(",", ":")),
encoding="utf-8",
)
try:
temp_path.replace(path)
except Exception:
# Best-effort cleanup to avoid accumulating stale temp files.
try:
temp_path.unlink(missing_ok=True)
except Exception:
# Ignore cleanup failures; original error is more important.
pass
raise
import atexit
import ctypes
import json
import os
import runpy
import sys
import threading
import time
from pathlib import Path
```
1. Wherever `atomic_write_json` is called for the heartbeat file, consider catching exceptions and including explicit context in any warnings/logs (e.g., distinguish between write vs. replace failures if you propagate more specific exception types or messages later).
2. If the project uses a logging framework in this file, you may want to log the replace failure inside the `except` block before re-raising, following existing logging conventions.
</issue_to_address>
### Comment 2
<location path="scripts/backend/templates/launch_backend.py" line_range="199-213" />
<code_context>
+ daemon=True,
+ )
+
+ def on_exit() -> None:
+ stop_event.set()
+ thread.join(timeout=STARTUP_HEARTBEAT_STOP_JOIN_TIMEOUT_SECONDS)
+ write_startup_heartbeat(heartbeat_path, "stopping", warn_on_error=True)
+
+ atexit.register(on_exit)
</code_context>
<issue_to_address>
**suggestion:** Guard `on_exit` against rare cases where the heartbeat thread never started.
If `thread.start()` fails (for example due to resource exhaustion) after `atexit.register(on_exit)` is called, `on_exit` will call `thread.join(...)` on a thread that never ran; in CPython, joining a never-started thread raises `RuntimeError`, which could obscure the original startup failure. Consider either registering the `atexit` handler only after a successful `thread.start()`, or tracking a flag so `on_exit` skips `join` when the thread wasn't started.
```suggestion
stop_event = threading.Event()
thread = threading.Thread(
target=heartbeat_loop,
args=(heartbeat_path, STARTUP_HEARTBEAT_INTERVAL_SECONDS, stop_event),
name="astrbot-startup-heartbeat",
daemon=True,
)
thread_started = False
def on_exit() -> None:
stop_event.set()
if thread_started:
thread.join(timeout=STARTUP_HEARTBEAT_STOP_JOIN_TIMEOUT_SECONDS)
write_startup_heartbeat(heartbeat_path, "stopping", warn_on_error=True)
atexit.register(on_exit)
thread.start()
thread_started = True
```
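As a minimal standalone check of the behavior that motivates this guard (not code from the PR): CPython refuses to join a `threading.Thread` that was never started, even with a timeout.

```python
import threading

t = threading.Thread(target=lambda: None)
try:
    # Joining before start() raises RuntimeError
    # ("cannot join thread before it is started"), timeout or not.
    t.join(timeout=0.1)
    joined_unstarted = True
except RuntimeError:
    joined_unstarted = False

# Once started, join works normally.
t.start()
t.join()
```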
</issue_to_address>
### Comment 3
<location path="scripts/backend/templates/launch_backend.py" line_range="164" />
<code_context>
+ return False
+
+
+def heartbeat_loop(
+ path: Path, interval_seconds: float, stop_event: threading.Event
+) -> None:
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying the heartbeat warning logic and inlining the payload builder to make the heartbeat flow easier to follow and maintain.
You can reduce complexity in two small, targeted ways without changing the overall behavior (periodic atomic JSON writes + “stopping” heartbeat).
### 1. Simplify heartbeat loop warning logic
The current `heartbeat_loop` has two flags plus a nested `should_warn` helper. You can replace that small state machine with a simple time-based throttling scheme, which is easier to read and reason about.
For example, throttle warnings to at most once every 10 seconds:
```python
WARNING_THROTTLE_SECONDS = 10.0
def heartbeat_loop(
path: Path, interval_seconds: float, stop_event: threading.Event
) -> None:
last_warning_time: float | None = None
while not stop_event.wait(interval_seconds):
now = time.time()
warn_now = (
last_warning_time is None
or (now - last_warning_time) >= WARNING_THROTTLE_SECONDS
)
ok = write_startup_heartbeat(path, "starting", warn_on_error=warn_now)
if warn_now and not ok:
last_warning_time = now
```
This keeps log volume bounded while removing `had_successful_write`, `warning_emitted_since_last_success`, and the nested function.
### 2. Inline `build_heartbeat_payload` into `write_startup_heartbeat`
`build_heartbeat_payload` only wraps a small dict and is only used from `write_startup_heartbeat`. Inlining it removes one layer in the call stack (`build_heartbeat_payload → atomic_write_json → write_startup_heartbeat → heartbeat_loop`), making the flow easier to follow.
```python
def write_startup_heartbeat(
path: Path, state: str, *, warn_on_error: bool = False
) -> bool:
payload = {
"pid": os.getpid(),
"state": state,
"updated_at_ms": int(time.time() * 1000),
}
try:
path.parent.mkdir(parents=True, exist_ok=True)
atomic_write_json(path, payload)
return True
except Exception as exc:
if warn_on_error:
print(
f"[startup-heartbeat] failed to write heartbeat to {path}: "
f"{exc.__class__.__name__}: {exc}",
file=sys.stderr,
)
return False
```
`atomic_write_json` remains a focused helper for the critical “write temp → replace” behavior, but the heartbeat logic is now more localized and straightforward.
</issue_to_address>
@sourcery-ai review
Hey - I've found 2 issues
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="src-tauri/src/backend/readiness.rs" line_range="185" />
<code_context>
+struct StartupHeartbeatTracker {
+ last_seen_at: Option<SystemTime>,
+ last_progress_at: Option<Instant>,
+ consecutive_invalid_reads: u8,
+ logged_fresh: bool,
+}
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying the `StartupHeartbeatTracker` state machine by replacing the invalid-read counter with a boolean flag and inlining the freshness helper logic to make the control flow easier to follow.
You can keep all behavior but trim the state machine in `StartupHeartbeatTracker` by replacing the `u8` counter and helper around `Option<Instant>` with simpler fields.
### 1. Replace `consecutive_invalid_reads: u8` with a boolean
You only ever compare the counter to a small constant and reset it to `0`, so a boolean flag expresses the intent (“we already saw one invalid/missing read”) more directly:
```rust
#[derive(Debug, Clone, Copy)]
struct StartupHeartbeatTracker {
last_seen_at: Option<SystemTime>,
last_progress_at: Option<Instant>,
had_invalid_read: bool,
logged_fresh: bool,
}
```
Then in `step_startup_heartbeat`:
```rust
match (previous, current) {
(Some(previous), None) => {
if !state.had_invalid_read {
state.had_invalid_read = true;
return Ok(());
}
let heartbeat_age_ms = describe_heartbeat_age(Some(previous), wall_now);
append_desktop_log(&format!(
"backend startup heartbeat disappeared or became invalid before HTTP dashboard became ready: last_valid_age_ms={heartbeat_age_ms}"
));
return Err(
"Backend startup heartbeat disappeared or became invalid before HTTP readiness."
.to_string(),
);
}
(None, None) => {
state.had_invalid_read = false;
Ok(())
}
(_, Some(current)) => {
state.had_invalid_read = false;
// existing logic...
}
}
```
This preserves the “one grace failure” behavior of `STARTUP_HEARTBEAT_INVALID_READ_THRESHOLD == 2` while making the disappearance handling easier to follow.
### 2. Inline `startup_heartbeat_progress_is_fresh` and drop the `Option` parameter
`startup_heartbeat_progress_is_fresh` is currently only used with `state.last_progress_at`, and immediately short-circuits on `None`. You can move that logic into `step_startup_heartbeat` and keep `last_progress_at` as an `Option<Instant>` without a separate helper:
```rust
// inside the (_, Some(current)) arm after updating state.last_progress_at:
let is_fresh = match state.last_progress_at {
Some(last) => monotonic_now.duration_since(last) <= idle_timeout,
None => true, // treat as fresh until first heartbeat progress is recorded
};
if is_fresh {
if !state.logged_fresh {
append_desktop_log(
"backend startup heartbeat is fresh while HTTP dashboard is not ready yet; waiting",
);
state.logged_fresh = true;
}
Ok(())
} else {
append_desktop_log(
"backend startup heartbeat went stale before HTTP dashboard became ready",
);
Err(format!(
"Backend startup heartbeat went stale after {}ms without HTTP readiness.",
idle_timeout.as_millis()
))
}
```
With this change you can delete:
```rust
fn startup_heartbeat_progress_is_fresh(
last_progress_at: Option<Instant>,
now: Instant,
max_age: Duration,
) -> bool { ... }
```
These two adjustments keep all existing behavior (including the grace-miss semantics and stale detection) but make the state and control flow less intricate.
</issue_to_address>
### Comment 2
<location path="scripts/backend/templates/launch_backend.py" line_range="131" />
<code_context>
-main_file = APP_DIR / "main.py"
-if not main_file.is_file():
- raise FileNotFoundError(f"Backend entrypoint not found: {main_file}")
+def build_heartbeat_payload(state: str) -> dict[str, object]:
+ return {
+ "pid": os.getpid(),
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying the heartbeat implementation by inlining helpers, reducing state in the loop, and centralizing logging policy to make the code easier to follow.
You can simplify the heartbeat implementation without changing behavior in any meaningful way for this feature.
### 1. Flatten `build_heartbeat_payload` and `atomic_write_json` into `write_startup_heartbeat`
For this internal heartbeat file, the extra helper layers don’t add much clarity, but they do add mental overhead. You can keep the atomic-write behavior while reducing the helper depth:
```py
def write_startup_heartbeat(
path: Path, state: str, *, warn_on_error: bool = False
) -> bool:
payload = {
"pid": os.getpid(),
"state": state,
"updated_at_ms": int(time.time() * 1000),
}
temp_path = path.with_name(f"{path.name}.tmp")
try:
path.parent.mkdir(parents=True, exist_ok=True)
temp_path.write_text(
json.dumps(payload, separators=(",", ":")),
encoding="utf-8",
)
temp_path.replace(path)
return True
except Exception as exc:
# Best-effort cleanup; ignore errors.
try:
temp_path.unlink(missing_ok=True)
except Exception:
pass
if warn_on_error:
print(
f"[startup-heartbeat] failed to write heartbeat to {path}: "
f"{exc.__class__.__name__}: {exc}",
file=sys.stderr,
)
return False
```
This removes two helpers (`build_heartbeat_payload`, `atomic_write_json`) while preserving the atomic update behavior and logging.
### 2. Simplify `heartbeat_loop` logging policy
The dual-boolean state (`had_successful_write`, `warning_emitted_since_last_success`) plus `should_warn` is a lot of logic for a logging policy. A time-based throttle keeps visibility and is easier to follow.
For example, log at most once every N seconds:
```py
LOG_THROTTLE_SECONDS = 30.0 # or whatever is reasonable
def heartbeat_loop(
path: Path, interval_seconds: float, stop_event: threading.Event
) -> None:
last_log_time = 0.0
while not stop_event.wait(interval_seconds):
now = time.time()
warn_now = (now - last_log_time) >= LOG_THROTTLE_SECONDS
ok = write_startup_heartbeat(path, "starting", warn_on_error=warn_now)
if not ok and warn_now:
last_log_time = now
```
If you still want “every failure before first success, then throttled”, you can keep a single `had_successful_write` flag instead of two booleans plus a closure:
```py
def heartbeat_loop(
path: Path, interval_seconds: float, stop_event: threading.Event
) -> None:
had_successful_write = False
last_log_time = 0.0
LOG_THROTTLE_SECONDS = 30.0
while not stop_event.wait(interval_seconds):
now = time.time()
warn_now = (not had_successful_write) or (
(now - last_log_time) >= LOG_THROTTLE_SECONDS
)
ok = write_startup_heartbeat(path, "starting", warn_on_error=warn_now)
if ok:
had_successful_write = True
elif warn_now:
last_log_time = now
```
This preserves the “pre-first-success: log every failure; post-first-success: throttled” semantics with much simpler state.
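The warn-or-not decision in that loop can also be pulled out into a pure function, which makes the policy trivially testable. This is a sketch of the policy described above, not code from the PR:

```python
def should_warn(had_successful_write: bool, now: float, last_log_time: float,
                throttle_seconds: float = 30.0) -> bool:
    """Warn on every failure before the first successful write,
    then at most once per throttle window afterwards."""
    return (not had_successful_write) or (now - last_log_time) >= throttle_seconds
```

The loop then only has to track `had_successful_write` and `last_log_time`, and the throttling rule lives in one place.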
### 3. Consider removing `warn_on_error` + `bool` return if logging is unified
If you adopt a simple throttle inside the loop, you can also simplify `write_startup_heartbeat` to only raise or only log, and keep the logging policy in one place. For example:
```py
def write_startup_heartbeat(path: Path, state: str) -> None:
# same body as above but raise instead of returning bool / warn_on_error
...
```
Then in `heartbeat_loop`:
```py
try:
write_startup_heartbeat(path, "starting")
except Exception as exc:
if warn_now:
print(
f"[startup-heartbeat] failed to write heartbeat to {path}: "
f"{exc.__class__.__name__}: {exc}",
file=sys.stderr,
)
```
This removes the `warn_on_error` flag and boolean return, and concentrates the logging policy in the loop.
</issue_to_address>
Summary
The desktop shell currently waits for backend readiness by polling `/api/stat/start-time` until a fixed overall timeout expires. That breaks down when the packaged backend spends a long time installing dependencies before the HTTP server starts listening: the process is still doing useful startup work, but the shell sees only repeated HTTP failures and eventually kills it as if startup had hung.

This change adds a launcher-owned startup heartbeat so the shell can distinguish between "not ready yet" and "no longer making progress". The Python launcher now refreshes a heartbeat file while startup is still in progress, and the Rust readiness loop treats a fresh heartbeat from the current child PID as evidence that startup is alive even when HTTP is not ready yet. If the heartbeat goes stale before readiness arrives, startup still fails fast. The packaged-mode hard timeout is also raised from 5 minutes to 15 minutes so long-running dependency installation has more headroom, while an idle-timeout guard prevents endless waits.
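The consumer-side check can be sketched as follows. Field names (`pid`, `state`, `updated_at_ms`) follow the heartbeat payload shown in the review comments; the validation rules in the shipped Rust code may differ in detail, and a future timestamp is treated as stale here per the review feedback on clock skew:

```python
import json
import time
from pathlib import Path

def heartbeat_is_fresh(path: Path, child_pid: int, max_age_ms: int) -> bool:
    """Return True only if the heartbeat file names the current child PID,
    is still in the 'starting' state, and was updated recently enough.
    A timestamp in the future (clock skew, malformed file) counts as stale."""
    try:
        heartbeat = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return False  # missing or unparsable file counts as no heartbeat
    if heartbeat.get("pid") != child_pid or heartbeat.get("state") != "starting":
        return False  # stale file from a previous run or another instance
    updated_at_ms = heartbeat.get("updated_at_ms")
    if not isinstance(updated_at_ms, int):
        return False
    age_ms = int(time.time() * 1000) - updated_at_ms
    return 0 <= age_ms <= max_age_ms
```

Scoping the check to the current child PID and the `"starting"` state is what prevents a leftover file from a previous run from masquerading as live progress.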
The heartbeat is scoped to the current child PID and requires `state == "starting"` to avoid stale files from previous runs or other instances being treated as current progress. The new timeout configuration and heartbeat path are documented for operators.

Testing

- `cargo test backend::`
- `python3 -m py_compile scripts/backend/templates/launch_backend.py`

Summary by Sourcery
Introduce a backend startup heartbeat mechanism so the desktop can distinguish active startup from hung startup while waiting for HTTP readiness.
New Features:
Bug Fixes:
Enhancements:
Documentation:
Tests: