Skip to content

feat(retry): migrate to Rust-majority#91

Open
madhu-mohan-jaishankar wants to merge 6 commits into
mainfrom
feat/rust_retry_with_backoff
Open

feat(retry): migrate to Rust-majority#91
madhu-mohan-jaishankar wants to merge 6 commits into
mainfrom
feat/rust_retry_with_backoff

Conversation

@madhu-mohan-jaishankar
Copy link
Copy Markdown
Collaborator

@madhu-mohan-jaishankar madhu-mohan-jaishankar commented May 11, 2026

Migrate retry_with_backoff to Rust-Majority Architecture (v0.3.1 → v0.3.2)

Related Issues

Closes #50

Overview

This PR completes the migration of the retry_with_backoff plugin from a hybrid Python/Rust architecture to a Rust-majority architecture (95% Rust / 5% Python), following the established patterns from pii_filter and secrets_detection plugins.


Motivation

The previous hybrid architecture split retry logic across Python and Rust, creating maintenance overhead and limiting performance. This migration:

  • Consolidates all retry logic in Rust (single source of truth)
  • Eliminates Python fallback code (no dual maintenance)
  • Improves performance (3–5x faster expected)
  • Provides type-safe configuration validation
  • Aligns with the project's Rust-majority plugin pattern

Changes

New Rust Components

src/config.rs (332 lines)

  • Complete configuration validation in Rust
  • Serde-based deserialization from PyDict
  • Tool-specific override merging
  • Default value functions
  • Validation logic (max_retries ≤ 10, backoff_base_ms > 0, etc.)

src/plugin.rs (356 lines)

  • RetryWithBackoffPluginCore struct with PyO3 bindings
  • Complete tool_post_invoke() implementation
  • Complete resource_post_fetch() implementation
  • Failure detection (structured content + text content parsing)
  • Exponential backoff calculation with jitter
  • State management with TTL (300s)
  • Metadata generation

Updated Python Components

retry_with_backoff.py (276 lines → 35 lines)

  • Pure delegation pattern (matches pii_filter)
  • No Python logic or fallback
  • Simple wrapper around Rust core

Removed

Item Reason
test_compat.py Python compatibility layer no longer needed
RetryConfig Pydantic model Replaced by Rust config
_ToolRetryState dataclass Replaced by Rust state management
_is_failure() function Replaced by Rust failure detection
_compute_delay_ms() function Replaced by Rust backoff calculation
_get_state() / _del_state() functions Replaced by Rust state management
_cfg_for() function Replaced by Rust config merging

Test Suite Cleanup

Removed Obsolete Tests (24 tests)

Test Class Count Reason
TestRustNativePolicy 5 Tested non-existent to_rust_native_policy() method
TestPluginInit 3 Tested removed get_settings() and gateway ceiling clamping
TestRustFallback 3 Tested old Python fallback architecture
TestGetState.test_ttl_eviction 1 Tested Python _ToolRetryState class
TestCfgFor 4 Tested Python config merging
Obsolete TestIsFailure tests 8 Rewrote to use Rust

Rewrote Tests (10 tests)

Test Class Count Change
TestComputeDelayMs 4 Now uses RetryStateManager.compute_delay()
TestIsFailure 6 Now uses RetryStateManager.check_failure()

Results

Before After
Total tests 43 19
Passing 31 19
Failing 12 0

Architecture Comparison

Before (Hybrid)

Python Layer (276 lines)
├── RetryConfig (Pydantic)
├── _is_failure()          — failure detection
├── _compute_delay_ms()    — backoff calculation
├── _get_state/_del_state  — state management
└── tool_post_invoke()     — orchestration
    ↓
Rust Layer (454 lines)
└── RetryStateManager      — state tracking utility

After (Rust-Majority)

Python Shim (35 lines)
└── Pure delegation to Rust core
    ↓
Rust Core (688 lines)
├── config.rs  — Configuration validation
└── plugin.rs  — Complete plugin implementation
    ├── Failure detection
    ├── State management
    ├── Backoff calculation
    ├── Metadata generation
    └── Hook implementations

Breaking Changes

None — full backward compatibility maintained:

  • Same configuration schema
  • Same behavior
  • Same API surface
  • All hook signatures unchanged

Dependencies

Added:

  • serde (with derive feature)
  • serde_json
  • cpex_framework_bridge

Removed:

  • pydantic (no longer needed)

Performance

Expected improvements over previous implementations:

  • 3–5x faster than Python-only implementation
  • ~50% improvement over hybrid architecture
  • Minimal memory overhead with TTL-based cleanup

Checklist

  • All tests passing (19/19)
  • No clippy warnings
  • Code formatted (cargo fmt)
  • Type stubs updated
  • Version bumped (0.2.0 → 0.3.0)
  • Plugin manifest updated
  • Documentation updated
  • DCO sign-off included

Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
Copilot AI review requested due to automatic review settings May 11, 2026 13:52
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
Copilot AI review requested due to automatic review settings May 11, 2026 15:39
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

…t tests

Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
@madhu-mohan-jaishankar madhu-mohan-jaishankar force-pushed the feat/rust_retry_with_backoff branch from bf4a968 to d8fd06b Compare May 11, 2026 17:15
…bridge

Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
Copilot AI review requested due to automatic review settings May 11, 2026 21:47
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 11 changed files in this pull request and generated 13 comments.

Comment on lines +8 to +33
try:
from mcpgateway.plugins.framework import (
Plugin,
PluginConfig,
PluginContext,
ResourcePostFetchPayload,
ResourcePostFetchResult,
ToolPostInvokePayload,
ToolPostInvokeResult,
)
except ModuleNotFoundError:
# Fallback for testing without mcpgateway
class Plugin: # type: ignore[no-redef]
def __init__(self, config) -> None:
self.config = config

class PluginConfig: # type: ignore[no-redef]
def __init__(self, **kwargs) -> None:
for k, v in kwargs.items():
setattr(self, k, v)

PluginContext = object # type: ignore[misc,assignment]
ToolPostInvokePayload = object # type: ignore[misc,assignment]
ToolPostInvokeResult = object # type: ignore[misc,assignment]
ResourcePostFetchPayload = object # type: ignore[misc,assignment]
ResourcePostFetchResult = object # type: ignore[misc,assignment]
Comment on lines +64 to +94
pub fn from_py_dict(dict: &Bound<'_, PyDict>) -> PyResult<Self> {
// Extract fields manually to handle Python types
let max_retries = dict
.get_item("max_retries")?
.and_then(|v| v.extract::<u32>().ok())
.unwrap_or_else(default_max_retries);

let backoff_base_ms = dict
.get_item("backoff_base_ms")?
.and_then(|v| v.extract::<u64>().ok())
.unwrap_or_else(default_backoff_base_ms);

let max_backoff_ms = dict
.get_item("max_backoff_ms")?
.and_then(|v| v.extract::<u64>().ok())
.unwrap_or_else(default_max_backoff_ms);

let retry_on_status = dict
.get_item("retry_on_status")?
.and_then(|v| v.extract::<Vec<i32>>().ok())
.unwrap_or_else(default_retry_on_status);

let jitter = dict
.get_item("jitter")?
.and_then(|v| v.extract::<bool>().ok())
.unwrap_or_else(default_jitter);

let check_text_content = dict
.get_item("check_text_content")?
.and_then(|v| v.extract::<bool>().ok())
.unwrap_or(false);
Comment on lines +139 to +153
/// Get configuration for a specific tool, applying overrides if present
pub fn get_tool_config(&self, tool_name: &str) -> Self {
if let Some(override_cfg) = self.tool_overrides.get(tool_name) {
Self {
max_retries: override_cfg.max_retries.unwrap_or(self.max_retries),
backoff_base_ms: override_cfg.backoff_base_ms.unwrap_or(self.backoff_base_ms),
max_backoff_ms: override_cfg.max_backoff_ms.unwrap_or(self.max_backoff_ms),
retry_on_status: override_cfg
.retry_on_status
.clone()
.unwrap_or_else(|| self.retry_on_status.clone()),
jitter: override_cfg.jitter.unwrap_or(self.jitter),
check_text_content: self.check_text_content,
tool_overrides: HashMap::new(), // Don't nest overrides
}
pub backoff_base_ms: Option<u64>,
pub max_backoff_ms: Option<u64>,
pub retry_on_status: Option<Vec<i32>>,
pub jitter: Option<bool>,
Comment on lines +198 to +199
let retry_status_set = config.retry_on_status_set();

Comment on lines +1 to 4
description: "High-performance retry policy engine with exponential backoff, jitter, per-tool overrides, and retry metadata for transient tool and resource failures - Rust-majority architecture (95% Rust / 5% Python)"
author: "ContextForge Contributors"
version: "0.3.1"
version: "0.3.0"
kind: "cpex_retry_with_backoff.retry_with_backoff.RetryWithBackoffPlugin"
Comment on lines 1 to 4
[package]
name = "retry_with_backoff"
version = "0.3.1"
version = "0.3.0"
edition.workspace = true
Comment on lines +6 to +8
from cpex_retry_with_backoff.retry_with_backoff import RetryWithBackoffPlugin

def __getattr__(name: str):
if name in {"RetryConfig", "RetryWithBackoffPlugin"}:
from cpex_retry_with_backoff.retry_with_backoff import RetryConfig, RetryWithBackoffPlugin

exports = {
"RetryConfig": RetryConfig,
"RetryWithBackoffPlugin": RetryWithBackoffPlugin,
}
return exports[name]
raise AttributeError(f"module {__name__!r} has no attribute {name!r}")


__all__ = ["RetryConfig", "RetryWithBackoffPlugin"]
__all__ = ["RetryWithBackoffPlugin"] No newline at end of file
Comment on lines 5 to 9
import logging
import time
import uuid
from dataclasses import dataclass
from pathlib import Path
from unittest.mock import MagicMock, patch

Comment on lines +96 to +103
let tool_overrides = dict
.get_item("tool_overrides")?
.and_then(|v| {
v.cast::<PyDict>()
.ok()
.and_then(|d| parse_tool_overrides(d).ok())
})
.unwrap_or_default();
Copy link
Copy Markdown
Collaborator

@lucarlig lucarlig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@madhu-mohan-jaishankar mcpgateway.plugins.framework should not exists anywhere in your code we are on versions relying solely on cpex. version should be bumped up not down.

Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
@madhu-mohan-jaishankar
Copy link
Copy Markdown
Collaborator Author

@lucarlig addressed review comments.

Framework import (mcpgatewaycpex)
Replaced the mcpgateway.plugins.framework try/except block (with fallback stubs) with a direct cpex.framework import, matching the pattern used by pii_filter and secrets_detection.

Version
Restored monotonic versioning — base on main was already 0.3.1 (bumped by #83), so this PR lands at 0.3.2.

config.rs

  • from_py_dict now uses strict extract()? with a named ValueError per field — wrong types are rejected instead of silently falling back to defaults
  • tool_overrides parsing propagates errors (no more silent unwrap_or_default)
  • Same strict extraction applied inside parse_tool_overrides for each override field
  • ToolOverride gains check_text_content: Option<bool>; get_tool_config now applies it and calls validate() on the merged config before returning
  • Removed retry_on_status_set() — replaced with Vec::contains to avoid per-call HashSet allocation

plugin.rs

  • check_text_content branch is now gated on structuredContent being absent, preserving backward-compatible failure detection semantics
  • Content item type check now explicitly skips items without type == "text" (items with no type field are skipped, not processed)
  • #[mutants::skip] replaced with #[cfg_attr(feature = "mutants", mutants::skip)]

Cargo.toml
mutants moved to an optional dependency behind a mutants feature flag — no longer pulled into the production build graph.

init.py
Restored the lazy __getattr__ import pattern for consistency with other plugins.

test_integration.py
Removed unused imports (logging, time, MagicMock, patch).

…ction

Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
Copilot AI review requested due to automatic review settings May 12, 2026 11:32
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 11 changed files in this pull request and generated 7 comments.

Comment on lines +213 to +215
// Check structuredContent; track presence to gate text content check
let has_structured_content = result_dict.get_item("structuredContent")?.is_some();
if let Some(structured) = result_dict.get_item("structuredContent")?

for (key, value) in dict.iter() {
let tool_name = key.extract::<String>()?;
let override_dict = value.cast::<PyDict>()?;
Comment on lines +298 to +306
fn compute_delay_ms(attempt: u32, base_ms: u64, max_ms: u64, jitter: bool) -> u64 {
let ceiling = base_ms
.saturating_mul(2u64.saturating_pow(attempt))
.min(max_ms);
if jitter {
rand::thread_rng().gen_range(0..=ceiling)
} else {
ceiling
}
Comment on lines +21 to +27
mutants = "0"
pyo3.workspace = true
pyo3-log.workspace = true
pyo3-stub-gen.workspace = true
rand.workspace = true
serde = { workspace = true, features = ["derive"] }
serde_json.workspace = true
Comment on lines +1 to +3
description: "High-performance retry policy engine with exponential backoff, jitter, per-tool overrides, and retry metadata for transient tool and resource failures - Rust-majority architecture (95% Rust / 5% Python)"
author: "ContextForge Contributors"
version: "0.3.1"
version: "0.3.2"
Comment on lines 26 to +29
def __init__(self, config: PluginConfig) -> None:
super().__init__(config)
raw_cfg = RetryConfig(**(config.config or {}))

ceiling = getattr(get_settings(), "max_tool_retries", raw_cfg.max_retries)
if raw_cfg.max_retries > ceiling:
log.warning(
"retry_with_backoff: max_retries=%d exceeds gateway ceiling=%d, clamping",
raw_cfg.max_retries,
ceiling,
)
raw_cfg = raw_cfg.model_copy(update={"max_retries": ceiling})

for tool_name, overrides in raw_cfg.tool_overrides.items():
if overrides.get("max_retries", 0) > ceiling:
log.warning(
"retry_with_backoff: tool_overrides[%s].max_retries=%d exceeds ceiling=%d, clamping",
tool_name,
overrides["max_retries"],
ceiling,
)
overrides["max_retries"] = ceiling

self._cfg = raw_cfg
self._rust = RetryStateManager(
self._cfg.max_retries,
self._cfg.backoff_base_ms,
self._cfg.max_backoff_ms,
self._cfg.jitter,
self._cfg.retry_on_status,
)
self._rust_overrides = {
tool_name: RetryStateManager(
overrides.get("max_retries", self._cfg.max_retries),
overrides.get("backoff_base_ms", self._cfg.backoff_base_ms),
overrides.get("max_backoff_ms", self._cfg.max_backoff_ms),
overrides.get("jitter", self._cfg.jitter),
overrides.get("retry_on_status", self._cfg.retry_on_status),
)
for tool_name, overrides in self._cfg.tool_overrides.items()
}

def to_rust_native_policy(self, tool_name: str, ceiling: int) -> Optional[dict[str, Any]]:
raw_cfg = RetryConfig(**(self.config.config or {}))
cfg = _cfg_for(raw_cfg, tool_name)
if cfg.max_retries > ceiling:
cfg = cfg.model_copy(update={"max_retries": ceiling})

if cfg.check_text_content:
return None

return {
"kind": "retry_with_backoff",
"maxRetries": int(cfg.max_retries),
"backoffBaseMs": int(cfg.backoff_base_ms),
"maxBackoffMs": int(cfg.max_backoff_ms),
"retryOnStatus": list(cfg.retry_on_status),
"jitter": bool(cfg.jitter),
}
self._core = RetryWithBackoffPluginCore(config.config or {})
log.info("retry_with_backoff: Initialized with Rust core (v0.3.0)")
Comment on lines +44 to +48
"""Delegate to Rust core for resource post-fetch processing."""
return self._core.resource_post_fetch(payload, context)


__all__ = [
"RetryConfig",
"RetryWithBackoffPlugin",
"RetryStateManager",
"_STATE",
"_STATE_TTL_SECONDS",
"_ToolRetryState",
"_cfg_for",
"_compute_delay_ms",
"_del_state",
"_get_state",
"_is_failure",
]
__all__ = ["RetryWithBackoffPlugin"] No newline at end of file
Copy link
Copy Markdown
Collaborator

@lucarlig lucarlig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a few minor issues need fixing before merge:

  1. Gateway retry ceiling is no longer enforced.
    RetryWithBackoffPlugin.__init__ now passes config straight into Rust, while Rust only caps max_retries <= 10. Deployments with a lower gateway max_tool_retries can now retry more than policy allows. Please restore ceiling clamping for both global config and tool overrides.

  2. structuredContent: None now skips text-content retry detection.
    The new Rust path treats key presence as structured content, so check_text_content=True will not parse retryable JSON text when structuredContent is explicitly None. Previous behavior parsed text in that case. Please treat missing and Python None the same, and only suppress text parsing for non-null structured content.

  3. Retry logic now exists in two Rust cores.
    RetryStateManager in src/lib.rs and RetryWithBackoffPluginCore in src/plugin.rs now both own state/TTL/delay/failure-classification logic. That creates likely divergence for future fixes. Please consolidate shared retry policy logic or make one path call the other.

  4. Stale eviction scans all retry state on every failed invocation.
    Each failure calls evict_stale, which retains over the full state map. Under many failed in-flight requests this makes failure handling O(n) per invocation. Please switch to periodic/bounded eviction or another approach that does not scan the full map on every failure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Migrate retry_with_backoff to Rust-Majority Architecture

3 participants