Skip to content

feat(grpc): enforce a maximum number of reconnection attempts#902

Open
Molter73 wants to merge 2 commits into
mainfrom
mauro/feat/crash-on-reconnect-fail
Open

feat(grpc): enforce a maximum number of reconnection attempts#902
Molter73 wants to merge 2 commits into
mainfrom
mauro/feat/crash-on-reconnect-fail

Conversation

@Molter73

@Molter73 Molter73 commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Description

If fact fails to connect to the gRPC server in the specified number of attemtps it will crash, which can be used as a clear signal in k8s that something is not working as expected. If the number of retries is set to 0, there will be no limit to the amount of times fact attempts to reconnect.

In order to allow the reconnection failure to trigger an application wide crash, the main output task monitors the result of the grpc task's handle propagating the error further up. This method should allow for other output components to be added in the future and follow this same pattern without having to change anything outside the output module. The stdout component is not included in this logic because it has no condition that could merit an application wide crash.

Checklist

  • Patch has a change log entry OR does not need one.
  • Investigated and inspected CI test results
  • Updated documentation accordingly

Automated testing

  • Added unit tests
  • Added integration tests
  • Added regression tests

If any of these don't apply, please comment below.

Testing Performed

Pointed fact to connect to a non-existant gRPC server and got it to crash as expected.

fact output on crash
[INFO  2026-06-24T16:55:17Z] FactConfig {
    paths: Some(
        [
            "/etc/sensitive-files/**/*",
            "/etc/sensitive-files",
        ],
    ),
    grpc: GrpcConfig {
        url: Some(
            "http://localhost:9999",
        ),
        certs: None,
        backoff: BackoffConfig {
            initial: None,
            max: Some(
                1s,
            ),
            jitter: Some(
                false,
            ),
            multiplier: None,
            retries_max: Some(
                3,
            ),
        },
    },
    endpoint: EndpointConfig {
        address: None,
        expose_metrics: Some(
            true,
        ),
        health_check: None,
    },
    bpf: BpfConfig {
        ringbuf_size: None,
        inodes_max: None,
    },
    skip_pre_flight: None,
    json: None,
    hotreload: None,
    scan_interval: None,
    rate_limit: None,
}
[INFO  2026-06-24T16:55:17Z] fact version: 0.3.x-104-g669f5be002
[INFO  2026-06-24T16:55:17Z] OS: Fedora Linux 44 (Workstation Edition)
[INFO  2026-06-24T16:55:17Z] Kernel version: 7.0.12-201.fc44.x86_64
[INFO  2026-06-24T16:55:17Z] Architecture: x86_64
[INFO  2026-06-24T16:55:17Z] Hostname: mmoltras-thinkpadp1gen5.remote.csb
[INFO  2026-06-24T16:55:17Z] Starting BPF worker...
[WARN  2026-06-24T16:55:17Z] Invalid initial value: 1 >= 1
[INFO  2026-06-24T16:55:17Z] Attempting to connect to gRPC server...
[INFO  2026-06-24T16:55:17Z] Starting host scanner...
[WARN  2026-06-24T16:55:17Z] Using unencrypted gRPC channel
[INFO  2026-06-24T16:55:17Z] Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111), retrying in 1s
[INFO  2026-06-24T16:55:18Z] Attempting to connect to gRPC server...
[WARN  2026-06-24T16:55:18Z] Using unencrypted gRPC channel
[INFO  2026-06-24T16:55:18Z] Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111), retrying in 1s
[INFO  2026-06-24T16:55:19Z] Attempting to connect to gRPC server...
[WARN  2026-06-24T16:55:19Z] Using unencrypted gRPC channel
[INFO  2026-06-24T16:55:19Z] Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111), retrying in 1s
[INFO  2026-06-24T16:55:20Z] Attempting to connect to gRPC server...
[WARN  2026-06-24T16:55:20Z] Using unencrypted gRPC channel
[INFO  2026-06-24T16:55:20Z] Exiting...
[INFO  2026-06-24T16:55:20Z] Stopping config reloader...
[INFO  2026-06-24T16:55:20Z] Stopping endpoints...
Error: Output worker errored out: grpc worker errored out: gRPC error: Failed to connect to server: reconnection attempts exhausted

Summary by CodeRabbit

  • New Features

    • Added a configurable limit for gRPC reconnect retries via config, environment variable, or CLI.
    • Default retry behavior is now consistent when the setting is not provided.
  • Bug Fixes

    • Improved gRPC reconnect handling so it stops retrying after the configured limit instead of looping indefinitely.
    • Made shutdown and failure handling more reliable, with clearer exits when background work fails.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: ad019753-0a7b-4454-9cb6-b3d4acd503b0

📥 Commits

Reviewing files that changed from the base of the PR and between 669f5be and 8acc9dc.

📒 Files selected for processing (4)
  • fact/src/config/mod.rs
  • fact/src/config/tests.rs
  • fact/src/output/grpc.rs
  • fact/src/output/mod.rs
🚧 Files skipped from review as they are similar to previous changes (4)
  • fact/src/config/mod.rs
  • fact/src/output/mod.rs
  • fact/src/config/tests.rs
  • fact/src/output/grpc.rs

📝 Walkthrough

Walkthrough

The PR threads a configurable gRPC retry limit through YAML, CLI, and tests, changes gRPC backoff to stop after the retry budget is exhausted, and updates output and run control flow to propagate task and worker errors through shutdown.

Changes

Finite gRPC Retry Budget

Layer / File(s) Summary
Retry budget configuration and coverage
fact/src/config/mod.rs, fact/src/config/tests.rs
BackoffConfig gains retries_max, YAML and CLI inputs map into it, retries() uses the configured value or default, and config tests cover parsing, errors, update, defaults, and env vars.
Backoff exhaustion and reconnect termination
fact/src/output/grpc.rs
Backoff tracks retry budget state, next() returns Option<Duration> to stop after exhaustion, reconnect attempts bail when no delay is available, and the backoff tests are updated.
JoinHandle-based worker error propagation
fact/src/output/grpc.rs, fact/src/output/mod.rs, fact/src/lib.rs
Client::start returns a join handle, output::start supervises the gRPC task while forwarding events, and run flattens join and worker errors into the shutdown result.

Sequence Diagram(s)

sequenceDiagram
  participant runLoop as run
  participant outputStart as output::start
  participant clientStart as Client::start
  participant clientRun as Client::run
  participant Backoff
  runLoop->>outputStart: start output task
  outputStart->>clientStart: spawn gRPC task
  clientStart->>clientRun: run()
  clientRun->>Backoff: next()
  Backoff-->>clientRun: None
  clientRun-->>outputStart: Err("reconnection attempts exhausted")
  outputStart-->>runLoop: task resolves Err
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • stackrox/fact#789: Extends the same gRPC backoff path with a configurable retry budget and Option<Duration> reconnect behavior.

Suggested reviewers

  • rhacs-bot
  • vladbologa
  • Stringy

Poem

A bunny hopped through retry mist,
and backoff gave a final twist.
The handle spun, the task said “oh,”
then shutdown hummed a tidy glow.
🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding a maximum retry limit for gRPC reconnections.
Description check ✅ Passed The description follows the template with a summary, checklist, automated testing, and testing performed sections.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch mauro/feat/crash-on-reconnect-fail

Comment @coderabbitai help to get the list of available commands.

@codecov-commenter

codecov-commenter commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 65.80645% with 53 lines in your changes missing coverage. Please review.
✅ Project coverage is 34.35%. Comparing base (5dd2a85) to head (8acc9dc).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
fact/src/lib.rs 0.00% 23 Missing ⚠️
fact/src/output/mod.rs 0.00% 20 Missing ⚠️
fact/src/output/grpc.rs 90.09% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #902      +/-   ##
==========================================
+ Coverage   32.72%   34.35%   +1.62%     
==========================================
  Files          21       21              
  Lines        2695     2771      +76     
  Branches     2695     2771      +76     
==========================================
+ Hits          882      952      +70     
- Misses       1810     1816       +6     
  Partials        3        3              

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

If fact fails to connect to the gRPC server in the specified number of
attemtps it will crash, which can be used as a clear signal in k8s that
something is not working as expected. If the number of retries is set to
0, there will be no limit to the amount of times fact attempts to
reconnect.

In order to allow the reconnection failure to trigger an application
wide crash, the main output task monitors the result of the grpc task's
handle propagating the error further up. This method should allow for
other output components to be added in the future and follow this same
pattern without having to change anything outside the output module. The
stdout component is not included in this logic because it has no
condition that could merit an application wide crash.

We also now propagate the error of worker tasks all the way up to the
termination of the application.
@Molter73 Molter73 force-pushed the mauro/feat/crash-on-reconnect-fail branch from a6f8c0e to 669f5be Compare June 24, 2026 16:54

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
fact/src/output/mod.rs (1)

47-74: 📐 Maintainability & Code Quality | 🔵 Trivial | 🏗️ Heavy lift

Add focused coverage for gRPC supervision failures.

This loop is the new contract that converts Ok(Err(_)) and join failures into output-level errors, but the PR coverage report shows this file has no patch coverage. A small async test or extracted helper around the match res path would protect the shutdown behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fact/src/output/mod.rs` around lines 47 - 74, Add focused test coverage for
the gRPC supervision path in the output loop so shutdown/error handling is
exercised. The `tokio::spawn` block in `output::mod` currently maps
`grpc_handle.borrow_mut()` results into output-level failures, so add a small
async test or extract a helper around the `match res` branch to verify both
`Ok(Err(_))` and join/task errors are handled as expected.
fact/src/config/tests.rs (1)

567-582: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add the negative retries parse-error case.

Once the parser rejects negative retry counts, cover retries: -1 here so YAML validation matches the new retry-budget contract.

Proposed test case
         (
             r#"
             grpc:
               backoff:
                 retries: true
             "#,
             "invalid grpc.backoff.retries: Boolean(true)",
         ),
+        (
+            r#"
+            grpc:
+              backoff:
+                retries: -1
+            "#,
+            "invalid grpc.backoff.retries: Integer(-1)",
+        ),

As per coding guidelines, fact/src/config/mod.rs: "Add unit tests in fact/src/config/tests.rs for configuration schema changes in fact/src/config/mod.rs."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fact/src/config/tests.rs` around lines 567 - 582, The config parsing tests in
tests.rs are missing coverage for the new negative retry validation in
grpc.backoff.retries. Add a unit test case alongside the existing retries
parse-error assertions in the config tests to verify that retries: -1 is
rejected with the appropriate invalid grpc.backoff.retries error, using the same
pattern as the current retries validation cases.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@fact/src/config/mod.rs`:
- Around line 416-420: Reject negative YAML retry counts in the retries parsing
branch of the config loader, since converting a signed value to u64 can wrap and
bypass the intended retry limit. Update the logic in the backoff/retries
handling inside mod.rs to accept 0 and positive values only, and explicitly bail
out when v.as_i64() returns a value below zero before assigning to
backoff.retries_max. Also add unit coverage in fact/src/config/tests.rs for the
retries configuration path to verify negative values fail and zero remains
valid.

In `@fact/src/output/grpc.rs`:
- Around line 217-220: The reconnect exhaustion path in gRPC output handling
drops the last connection failure, so update the backoff termination branch in
the connection retry loop to preserve the final error details when reconnection
attempts are exhausted. In the logic around the backoff handling in the gRPC
connection code, make the terminal bail/error include the original connection
error value from the retry loop so callers can see the actual
DNS/TLS/refused-connection cause instead of only “attempts exhausted”.

---

Nitpick comments:
In `@fact/src/config/tests.rs`:
- Around line 567-582: The config parsing tests in tests.rs are missing coverage
for the new negative retry validation in grpc.backoff.retries. Add a unit test
case alongside the existing retries parse-error assertions in the config tests
to verify that retries: -1 is rejected with the appropriate invalid
grpc.backoff.retries error, using the same pattern as the current retries
validation cases.

In `@fact/src/output/mod.rs`:
- Around line 47-74: Add focused test coverage for the gRPC supervision path in
the output loop so shutdown/error handling is exercised. The `tokio::spawn`
block in `output::mod` currently maps `grpc_handle.borrow_mut()` results into
output-level failures, so add a small async test or extract a helper around the
`match res` branch to verify both `Ok(Err(_))` and join/task errors are handled
as expected.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 81043491-08dc-4da7-a651-5e8569ee9b26

📥 Commits

Reviewing files that changed from the base of the PR and between 818c659 and 669f5be.

📒 Files selected for processing (5)
  • fact/src/config/mod.rs
  • fact/src/config/tests.rs
  • fact/src/lib.rs
  • fact/src/output/grpc.rs
  • fact/src/output/mod.rs

Comment thread fact/src/config/mod.rs
Comment thread fact/src/output/grpc.rs Outdated
@Molter73 Molter73 marked this pull request as ready for review June 24, 2026 17:20
@Molter73 Molter73 requested a review from a team as a code owner June 24, 2026 17:20
Comment thread fact/src/config/tests.rs
},
..Default::default()
},
),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some tests with negative retries.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Some minor log improvements are also thrown in there.
@Molter73 Molter73 requested a review from JoukoVirtanen June 25, 2026 09:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants