Commit ec644b4

litianning-datadog, claude, and shreyamalpani authored
fix(logs): SLES-2843 bound retries on non-success HTTP status (#1220)
## Overview Fixes [SLES-2843](https://datadoghq.atlassian.net/browse/SLES-2843). `Flusher::send` (in `bottlecap/src/logs/flusher.rs`) only enforced `FLUSH_RETRY_COUNT` on transport errors. Non-success HTTP responses (4xx/5xx) on the `Ok` arm fell through and looped without bound, so a degraded log endpoint would be hit thousands of times per flush. A customer's Observability Pipelines Worker became overloaded during a 2× load test (April 27–30) and started returning HTTP errors. The unbounded retry loop produced ~10,000 attempts per failed flush against an OPW endpoint reached via cross-VPC `EKS → TGW → ELB`, contributing to ~$34K of unexpected AWS Transit Gateway / ELB egress charges over 3 days. Customer log: ``` DD_EXTENSION | ERROR | LOGS | Failed to send request after 2 ms and 10000 attempts: reqwest::Error { kind: Request, ..., source: hyper_util::client::legacy::Error(SendRequest, hyper::Error(Io, Kind(ConnectionReset))) } ``` ### Fix Mirror the `traces/proxy_flusher.rs:163` pattern: cap retries on `Ok(non-success)` too, returning a `FailedRequestError` so the next flush cycle can redrive the request — same recovery semantics as the existing `Err`-branch cap. Also captures and logs the response body for diagnostics (the previous code discarded it via `_ = resp.text().await`). The fix preserves all existing behavior: - 2xx → `Ok` immediately. - 403 → `Ok` early-exit (auth failures are permanent; don't retry). - 4xx / 5xx → retry up to `FLUSH_RETRY_COUNT` (3), then surface as `FailedRequestError` for the next cycle to redrive. - Transport errors (timeout/reset) → unchanged behavior, still capped at `FLUSH_RETRY_COUNT`. ## Testing Tested manually that this fixes the unbounded retry case. 
Before:

```
DD_EXTENSION | ERROR | LOGS | Failed to send request after 5005 ms and 132 attempts: reqwest::Error {..}
```

After:

```
DD_EXTENSION | ERROR | LOGS | Failed to send request after 1 ms and 3 attempts: status 500 Internal Server Error
```

Adds 7 unit tests in `bottlecap/src/logs/flusher.rs` covering every exit path of `Flusher::send`:

| Test | Asserts |
| --- | --- |
| `send_returns_ok_on_success` | 200 → `Ok`, 1 hit |
| `send_returns_ok_on_forbidden_without_retry` | 403 → `Ok`, 1 hit (no retries on permanent auth failures) |
| `send_bounds_retries_on_non_success_status` | Persistent 500 → `Err` after 3 hits (the SLES-2843 regression) |
| `send_bounds_retries_on_4xx_status` | Persistent 404 (e.g. misrouted OPW URL) → `Err` after 3 hits |
| `send_bounds_retries_on_transport_error` | Persistent timeout → `Err` after 3 hits via the `Err` branch |
| `send_succeeds_on_retry_after_transient_failure` | 500 once → 200 → `Ok` (verifies retries actually retry) |
| `failed_request_error_carries_replayable_request` | After exhaustion, the returned error downcasts to `FailedRequestError`, its `request` is cloneable, and the cloned request actually reaches the server with the original body — proves the redrive contract is intact |

The eventual-success test uses a small axum-based stateful server because httpmock 0.7's matcher signature (`fn`, not `Fn`) cannot capture per-request state.

- [x] `cargo test --lib logs::flusher::tests::` — 7/7 pass in 0.75s
- [x] `cargo test --lib logs::` — 57/57 pass (no regressions)
- [x] `cargo test --test logs_integration_test` — passes
- [x] `cargo clippy --lib --tests --features default -- -D warnings` — clean
- [x] `cargo fmt --check` — clean

## Follow-ups (not in this PR)

These are improvements rather than blockers; tracked separately:

1. Add per-attempt backoff/jitter (currently the inner retry loop has no sleep between iterations).
2. Cycle-level circuit breaker — repeated flush cycles can still amplify load on a degraded endpoint via the redrive queue.
3. Fix the misleading `"after X ms and Y attempts"` log message to report cumulative time, not just the last iteration.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

[SLES-2843]: https://datadoghq.atlassian.net/browse/SLES-2843?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: shreyamalpani <shreya.malpani@datadoghq.com>
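Follow-up 1 above (per-attempt backoff/jitter) could look roughly like the stdlib-only sketch below. This is an illustration, not code from this PR: `backoff_with_jitter` and `BASE_BACKOFF_MS` are hypothetical names, and the clock-based jitter is a stand-in for a proper RNG such as the `rand` crate.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Mirrors the extension's retry cap; the backoff base is an assumption.
const FLUSH_RETRY_COUNT: u32 = 3;
const BASE_BACKOFF_MS: u64 = 100;

/// Exponential backoff with cheap pseudo-jitter taken from the clock's
/// sub-second nanoseconds (stdlib-only sketch).
fn backoff_with_jitter(attempt: u32) -> Duration {
    // 100 ms, 200 ms, 400 ms for attempts 1, 2, 3.
    let base_ms = BASE_BACKOFF_MS * 2u64.pow(attempt.saturating_sub(1));
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| u64::from(d.subsec_nanos()))
        .unwrap_or(0);
    // Add up to +50% jitter so many retrying clients don't synchronize.
    let jitter_ms = nanos % (base_ms / 2 + 1);
    Duration::from_millis(base_ms + jitter_ms)
}

fn main() {
    for attempt in 1..=FLUSH_RETRY_COUNT {
        let delay = backoff_with_jitter(attempt);
        println!("attempt {attempt}: would sleep {} ms", delay.as_millis());
        // In the flusher this would be a `tokio::time::sleep(delay).await`
        // between iterations of the inner retry loop.
    }
}
```

In the async flusher the sleep would be `tokio::time::sleep(delay).await` rather than a blocking `std::thread::sleep`, so other in-flight flush tasks keep making progress.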
1 parent 1cbae34 commit ec644b4

1 file changed: `bottlecap/src/logs/flusher.rs` (297 additions, 9 deletions)
```diff
@@ -65,17 +65,14 @@ impl Flusher {
         }

         let mut failed_requests = Vec::new();
+        // `JoinSet::join_all` returns `Vec<T>` directly (not `Vec<Result<T, JoinError>>`),
+        // so each `result` is the `Result<(), Box<dyn Error + Send>>` returned by `send`.
+        // A non-`FailedRequestError` here means the task completed but reported an error
+        // it doesn't want redriven (e.g. could-not-clone); a `FailedRequestError` is a
+        // bounded retry exhaustion and the request must be re-queued for the next cycle.
         for result in set.join_all().await {
-            if let Err(e) = result {
-                debug!("LOGS | Failed to join task: {}", e);
-                continue;
-            }
-
-            // At this point we know the task completed successfully,
-            // but the send operation itself may have failed
             if let Err(e) = result {
                 if let Some(failed_req_err) = e.downcast_ref::<FailedRequestError>() {
-                    // Clone the request from our custom error
                     failed_requests.push(
                         failed_req_err
                             .request
@@ -120,7 +117,10 @@ impl Flusher {
             match resp {
                 Ok(resp) => {
                     let status = resp.status();
-                    _ = resp.text().await;
+                    // Drain the body to allow connection reuse, but do not capture
+                    // or log it: log intakes can echo back the submitted log payload,
+                    // which may contain sensitive customer data.
+                    let _ = resp.text().await;
                     if status == StatusCode::FORBIDDEN {
                         // Access denied. Stop retrying.
                         error!(
@@ -131,6 +131,22 @@ impl Flusher {
                     if status.is_success() {
                         return Ok(());
                     }
+                    if attempts >= FLUSH_RETRY_COUNT {
+                        // Non-success HTTP response (e.g. 4xx/5xx from an OPW or
+                        // intake under load) — surface the request for later retry
+                        // instead of looping unbounded against a degraded endpoint.
+                        error!(
+                            "LOGS | Failed to send request after {} ms and {} attempts: status {status}",
+                            elapsed.as_millis(),
+                            attempts,
+                        );
+                        return Err(Box::new(FailedRequestError {
+                            request: req,
+                            message: format!(
+                                "LOGS | Failed after {attempts} attempts: status {status}"
+                            ),
+                        }));
+                    }
                 }
                 Err(e) => {
                     if attempts >= FLUSH_RETRY_COUNT {
@@ -309,3 +325,275 @@ impl LogsFlusher {
         encoder.finish().map_err(|e| Box::new(e) as Box<dyn Error>)
     }
 }
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use httpmock::prelude::*;
+    use std::sync::atomic::Ordering;
+    use std::time::Duration;
+
+    fn build_request(server: &MockServer, timeout: Duration) -> reqwest::RequestBuilder {
+        reqwest::Client::new()
+            .post(server.url("/api/v2/logs"))
+            .timeout(timeout)
+            .body("test")
+    }
+
+    #[tokio::test]
+    async fn send_returns_ok_on_success() {
+        let server = MockServer::start();
+        let mock = server.mock(|when, then| {
+            when.method(POST).path("/api/v2/logs");
+            then.status(200);
+        });
+
+        let result = Flusher::send(build_request(&server, Duration::from_secs(2))).await;
+
+        assert!(result.is_ok(), "2xx response should return Ok immediately");
+        mock.assert_hits(1);
+    }
+
+    #[tokio::test]
+    async fn send_returns_ok_on_forbidden_without_retry() {
+        let server = MockServer::start();
+        let mock = server.mock(|when, then| {
+            when.method(POST).path("/api/v2/logs");
+            then.status(403);
+        });
+
+        let result = Flusher::send(build_request(&server, Duration::from_secs(2))).await;
+
+        assert!(
+            result.is_ok(),
+            "403 is permanent (bad API key) — drop, do not retry"
+        );
+        mock.assert_hits(1);
+    }
+
+    /// Regression test for SLES-2843: a persistent non-success, non-403 status
+    /// (e.g. 500 from an OPW under load) must respect `FLUSH_RETRY_COUNT`
+    /// instead of looping unbounded and hammering the endpoint.
+    #[tokio::test]
+    async fn send_bounds_retries_on_non_success_status() {
+        let server = MockServer::start();
+        let mock = server.mock(|when, then| {
+            when.method(POST).path("/api/v2/logs");
+            then.status(500);
+        });
+
+        // Bound the test so the buggy (unbounded) implementation fails fast
+        // instead of hanging the suite.
+        let result = tokio::time::timeout(
+            Duration::from_secs(3),
+            Flusher::send(build_request(&server, Duration::from_secs(2))),
+        )
+        .await
+        .expect("send must respect FLUSH_RETRY_COUNT and not loop forever on non-success");
+
+        assert!(
+            result.is_err(),
+            "send should return Err after exhausting retries on persistent 5xx"
+        );
+        mock.assert_hits(FLUSH_RETRY_COUNT);
+    }
+
+    #[tokio::test]
+    async fn send_bounds_retries_on_4xx_status() {
+        let server = MockServer::start();
+        let mock = server.mock(|when, then| {
+            when.method(POST).path("/api/v2/logs");
+            then.status(404);
+        });
+
+        let result = tokio::time::timeout(
+            Duration::from_secs(3),
+            Flusher::send(build_request(&server, Duration::from_secs(2))),
+        )
+        .await
+        .expect("send must respect FLUSH_RETRY_COUNT on 4xx (e.g. misrouted OPW URL)");
+
+        assert!(
+            result.is_err(),
+            "persistent 4xx should bound at FLUSH_RETRY_COUNT"
+        );
+        mock.assert_hits(FLUSH_RETRY_COUNT);
+    }
+
+    #[tokio::test]
+    async fn send_bounds_retries_on_transport_error() {
+        let server = MockServer::start();
+        // Server holds the response longer than the client timeout so every
+        // attempt resolves to a reqwest timeout error (the `Err` branch).
+        let mock = server.mock(|when, then| {
+            when.method(POST).path("/api/v2/logs");
+            then.status(200).delay(Duration::from_secs(5));
+        });
+
+        let result = tokio::time::timeout(
+            Duration::from_secs(3),
+            Flusher::send(build_request(&server, Duration::from_millis(50))),
+        )
+        .await
+        .expect("send must respect FLUSH_RETRY_COUNT on transport errors");
+
+        assert!(
+            result.is_err(),
+            "send should return Err after exhausting retries on transport timeout"
+        );
+        mock.assert_hits(FLUSH_RETRY_COUNT);
+    }
+
+    /// Gap #2: when a transient failure clears, the loop must actually retry
+    /// and succeed — not treat the first non-2xx as a permanent failure.
+    /// Without this test, a refactor that early-returns on the first failed
+    /// status would still pass the bounded-retry tests above.
+    ///
+    /// Uses an axum-based stateful mock because httpmock 0.7 matchers cannot
+    /// capture state (their predicate type is `fn`, not `Fn`).
+    #[tokio::test]
+    async fn send_succeeds_on_retry_after_transient_failure() {
+        use axum::{Router, http::StatusCode, routing::post};
+        use std::sync::atomic::AtomicUsize;
+
+        let counter = Arc::new(AtomicUsize::new(0));
+        let app = Router::new().route(
+            "/api/v2/logs",
+            post({
+                let counter = Arc::clone(&counter);
+                move || {
+                    let counter = Arc::clone(&counter);
+                    async move {
+                        // First attempt → 500 (transient), all later attempts → 200.
+                        let n = counter.fetch_add(1, Ordering::SeqCst);
+                        if n == 0 {
+                            StatusCode::INTERNAL_SERVER_ERROR
+                        } else {
+                            StatusCode::OK
+                        }
+                    }
+                }
+            }),
+        );
+
+        let listener = tokio::net::TcpListener::bind("127.0.0.1:0")
+            .await
+            .expect("test should be able to bind a local port");
+        let addr = listener
+            .local_addr()
+            .expect("bound listener should have a local addr");
+        tokio::spawn(async move {
+            axum::serve(listener, app)
+                .await
+                .expect("test server should run cleanly");
+        });
+
+        let req = reqwest::Client::new()
+            .post(format!("http://{addr}/api/v2/logs"))
+            .timeout(Duration::from_secs(2))
+            .body("test");
+
+        let result = Flusher::send(req).await;
+
+        assert!(
+            result.is_ok(),
+            "send must retry past a transient failure and succeed on a later attempt"
+        );
+        assert_eq!(
+            counter.load(Ordering::SeqCst),
+            2,
+            "expected exactly one failed attempt + one successful retry"
+        );
+    }
+
+    /// Gap #1: after retry exhaustion, the returned error must carry a
+    /// re-issuable request so `Flusher::flush` can re-queue it for the next
+    /// flush cycle. A refactor that loses or corrupts the request would
+    /// silently turn a transient failure into permanent data loss.
+    ///
+    /// The mock matches on the body too, so any corruption of the stashed
+    /// request's payload would cause the re-issue to miss the mock and fail
+    /// the status assertion below.
+    #[tokio::test]
+    async fn failed_request_error_carries_replayable_request() {
+        let server = MockServer::start();
+        let mock = server.mock(|when, then| {
+            when.method(POST).path("/api/v2/logs").body_contains("test");
+            then.status(500);
+        });
+
+        let err = tokio::time::timeout(
+            Duration::from_secs(3),
+            Flusher::send(build_request(&server, Duration::from_secs(2))),
+        )
+        .await
+        .expect("send must terminate")
+        .expect_err("expected Err after retry exhaustion");
+
+        let failed = err
+            .downcast_ref::<FailedRequestError>()
+            .expect("error must be downcastable to FailedRequestError so flush() can re-queue it");
+
+        let cloned = failed
+            .request
+            .try_clone()
+            .expect("FailedRequestError.request must be cloneable for redrive");
+
+        // Re-issue the stashed request and confirm it actually reaches the
+        // server — proves the request is intact (URL, method, body), not a
+        // corrupted shell. If the body were lost, the body_contains matcher
+        // would miss and the response status would be 404 (httpmock default).
+        let response = cloned
+            .send()
+            .await
+            .expect("re-issued request should be sendable");
+        assert_eq!(
+            response.status().as_u16(),
+            500,
+            "re-issued request must hit the same mock — proves URL, method, and body are intact"
+        );
+
+        // FLUSH_RETRY_COUNT attempts during send + 1 from the re-issue above.
+        mock.assert_hits(FLUSH_RETRY_COUNT + 1);
+    }
+
+    /// Codex P1 (PR #1220): verifies that `Flusher::flush` actually re-queues
+    /// failed requests for the next flush cycle. The earlier dead-code path
+    /// in `flush` swallowed every `FailedRequestError` so the bounded retries
+    /// in `send` would silently drop data on persistent endpoint failures.
+    #[tokio::test]
+    async fn flush_redrives_failed_requests_after_retry_exhaustion() {
+        use crate::config::Config;
+
+        let server = MockServer::start();
+        let mock = server.mock(|when, then| {
+            when.method(POST).path("/api/v2/logs");
+            then.status(500);
+        });
+
+        let api_key_factory = Arc::new(ApiKeyFactory::new("test-key"));
+        let config = Arc::new(Config {
+            logs_config_use_compression: false,
+            ..Config::default()
+        });
+        let flusher = Flusher::new(
+            api_key_factory,
+            server.url(""),
+            config,
+            reqwest::Client::new(),
+        );
+
+        let batches = Some(Arc::new(vec![b"test-batch".to_vec()]));
+
+        let failed = tokio::time::timeout(Duration::from_secs(5), flusher.flush(batches))
+            .await
+            .expect("flush must terminate within bounded retries");
+
+        assert_eq!(
+            failed.len(),
+            1,
+            "after retry exhaustion, the failed request must be returned for redrive on the next flush cycle"
+        );
+        mock.assert_hits(FLUSH_RETRY_COUNT);
+    }
+}
```
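For readers skimming the diff, the fixed control flow of `Flusher::send` can be condensed into a small stdlib-only simulation. This is an illustration only: `send_sim` and `Outcome` are hypothetical names, and scripted `Result<u16, &str>` values stand in for real HTTP responses and transport errors.

```rust
const FLUSH_RETRY_COUNT: usize = 3;

#[derive(Debug, PartialEq)]
enum Outcome {
    Ok,
    DroppedForbidden,
    RetriesExhausted(usize), // attempts made before giving up
}

/// Simulate the fixed control flow over a scripted sequence of per-attempt
/// results: `Ok(status)` models an HTTP response, `Err(_)` a transport error.
/// `results` must be non-empty; past its end the last entry repeats,
/// modeling a persistent failure.
fn send_sim(results: &[Result<u16, &str>]) -> Outcome {
    let mut attempts = 0;
    loop {
        attempts += 1;
        let r = results.get(attempts - 1).or_else(|| results.last()).unwrap();
        match r {
            Ok(status) => {
                if *status == 403 {
                    // Permanent auth failure: drop, never retry.
                    return Outcome::DroppedForbidden;
                }
                if (200..300).contains(status) {
                    return Outcome::Ok;
                }
                // The SLES-2843 fix: non-success statuses are now bounded too.
                if attempts >= FLUSH_RETRY_COUNT {
                    return Outcome::RetriesExhausted(attempts);
                }
            }
            Err(_) => {
                // Transport errors were already bounded before the fix.
                if attempts >= FLUSH_RETRY_COUNT {
                    return Outcome::RetriesExhausted(attempts);
                }
            }
        }
    }
}

fn main() {
    assert_eq!(send_sim(&[Ok(200)]), Outcome::Ok);
    assert_eq!(send_sim(&[Ok(403)]), Outcome::DroppedForbidden);
    assert_eq!(send_sim(&[Ok(500)]), Outcome::RetriesExhausted(3));
    assert_eq!(send_sim(&[Ok(500), Ok(200)]), Outcome::Ok);
    assert_eq!(send_sim(&[Err("timeout")]), Outcome::RetriesExhausted(3));
    println!("all simulated exit paths are bounded");
}
```

In the real code the exhausted cases return a `FailedRequestError` carrying the request so the next flush cycle can redrive it; the simulation only models the attempt counting.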
