Skip to content

[AI-8th] Support Configurable Retry Exceptions#1560

Open
pmupkin wants to merge 1 commit into
sofastack:masterfrom
pmupkin:custom-exception-retry
Open

[AI-8th] Support Configurable Retry Exceptions#1560
pmupkin wants to merge 1 commit into
sofastack:masterfrom
pmupkin:custom-exception-retry

Conversation

@pmupkin
Copy link
Copy Markdown

@pmupkin pmupkin commented Apr 24, 2026

This PR(Related issue #1362) adds support for configurable retryable exceptions in the SOFARPC failover retry flow.

Changes include:

  • Added consumer-level and method-level retryExceptions configuration.
  • Preserved the existing default retry behavior for client timeout and server busy errors.
  • Added retry matching for configured exception classes, including superclass matching and wrapped causes.
  • Added support for retrying business exceptions returned in SofaResponse.
  • Added HTTP response metadata to preserve the original business exception class name, so custom retry rules can work for HTTP responses where the Throwable object itself is not serialized.
  • Added unit tests for retry exception configuration, failover retry behavior, and remote exception-class matching.

Copy link
Copy Markdown
Collaborator

@EvenLjj EvenLjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review 总结

方向正确、测试也写得不错,但默认行为变更每次调用都反射加载类配置非法直接打断调用这三类问题必须修,否则会带来线上风险。下面按严重程度分级列出。


🔴 Blocker(必须修)

B1. resolveRetryExceptions 在每次 RPC 调用都做 Class.forName — 性能问题

FailoverCluster.doInvoke 入口 + RetryableExceptionHelper.java:56-70

// FailoverCluster.doInvoke 每次调用都执行:
List<Class<? extends Throwable>> retryableExceptions =
    RetryableExceptionHelper.resolveRetryExceptions(
        consumerConfig.getMethodRetryExceptions(methodName));

// helper 内部:
for (String className : classNames) {
    exceptionClass = ClassUtils.forName(className);  // ← 反射 + split 字符串
    ...
}

每个 RPC 请求都会拆字符串 + 反射加载 N 个类。RPC 是高频路径,这是显著的额外开销。

建议:在 ConsumerConfig#setRetryExceptions(...) / MethodConfig#setRetryExceptions(...) 时一次性解析并缓存 List<Class<? extends Throwable>>,或在 RetryableExceptionHelper 内用 ConcurrentHashMap<String, List<Class<? extends Throwable>>> 缓存。前者更优,可以在配置阶段 fail-fast。

B2. 配置非法时抛 IllegalArgumentException 直接打断每次 RPC 调用

RetryableExceptionHelper.java:62-69

try {
    exceptionClass = ClassUtils.forName(className);
} catch (RuntimeException e) {
    throw new IllegalArgumentException("Failed to load retry exception class: " + className, e);
}
if (!Throwable.class.isAssignableFrom(exceptionClass)) {
    throw new IllegalArgumentException("Retry exception class must extend Throwable: " + className);
}
  • 如果用户拼错了类名 / 类不在 classpath,每次 RPC 调用都抛异常,整个接口直接不可用。
  • 这个 IAE 会进到 FailoverClustercatch (Exception e),又被包装成 SofaRpcException(CLIENT_UNDECLARED_ERROR),再次走 shouldRetry —— 解析仍然会失败,日志会被刷爆。

建议

  1. 与 B1 配套:把校验前移到 setRetryExceptions(...) 配置阶段,运行期不会再抛。
  2. 即便保留运行期解析,也应该 WARN + 跳过该非法类名,而不是中断整个调用。

B3. 默认行为破坏:业务返回 Throwable 不再"原样返回"

FailoverCluster.java:73-82(diff 修改后)

原行为:只要 response != nullreturn response,业务异常直接交给上层处理。
PR 行为:先 RetryableExceptionHelper.shouldRetry(response, ...) 判断;如果 appResponse instanceof Throwable 且匹配重试列表,就重试

副作用:

  1. 隐式的 SERVER_BIZ 重试:现在 RPC 服务端业务异常会通过 HEAD_RESPONSE_EXCEPTION 暴露真实类名给客户端,客户端只要配置了对应类(哪怕是 RuntimeException 这种泛型父类),就会盲目重试非幂等的业务异常,可能造成订单/支付双写。
  2. 配置 RuntimeException 这种宽匹配直接命中所有业务异常,文档里必须明确警告,且建议加默认黑名单(如 IllegalArgumentException、参数校验类异常默认不应重试)。

建议

  • ConsumerConfig#setRetryExceptions 的 javadoc 和 README/文档里明确警告幂等性要求。
  • 考虑增加一个独立开关 retryOnBusinessException,默认关闭,避免存量用户升级后行为静默改变。

🟠 Major(强烈建议修)

M1. time 计数器语义被悄悄改变

FailoverCluster.java:73-130

原代码 time 表示"已失败次数",只在 catch 里 time++。PR 在"response 非空但被判 retryable"分支里也 time++,于是 time 变成"已尝试次数"。整体可以收敛,但:

  • 没有任何注释说明这层语义变化;
  • 三个 time++ 散落在不同分支,未来维护者很容易漏改;
  • do { ... } while (time <= retries) 的边界并未在测试里覆盖 retries=0 + response 需重试 的极端场景。

建议:抽 private boolean shouldContinue(int time) 之类的方法,并补一个注释说明 time 语义。同时补一条测试:retries=0 + 第一次返回 appResponse instanceof CustomRetryException → 应直接返回(不重试),断言 invokeTimes == 1

M2. matchesRemoteExceptionClass.forName 失败时仅做字符串全等回退,与本地匹配语义不一致

RetryableExceptionHelper.java:112-138

} catch (RuntimeException e) {
    return matchesClassName((String) exceptionClassName, retryableExceptions);
}
// matchesClassName 仅比 retryableException.getName().equals(exceptionClassName)

本地匹配支持父类 + cause 链,远端类加载失败时退化为"全限定名相等",用户配 RuntimeException 时永远命中不了远端的 MyBizException

建议:要么明确文档说明远端匹配仅支持精确类名;要么至少加 WARN 日志告知"远端类名 xxx 在本地无法加载,已退化为字符串匹配"。

M3. HTTP / Bolt 写出 HEAD_RESPONSE_EXCEPTION — 信息泄漏 + 跨版本兼容

Http1ServerTaskHttp2ServerTaskSofaRpcSerialization

headers.set(RemotingConstants.HEAD_RESPONSE_EXCEPTION, throwable.getClass().getName());
  • 把服务端内部包结构无差别暴露到客户端,HTTP 场景常见跨组织调用,存在信息泄漏。
  • 老版本客户端拿到这个 prop 会忽略,没问题;但反过来新客户端 + 老服务端 → 永远拿不到这个 prop,远端匹配能力失效,文档要写清这个限制
  • 是否所有写出业务异常的协议路径都加了?建议作者在 PR 描述里补一份"已覆盖路径清单"。

建议:增加配置开关 exposeRemoteExceptionClass(服务端侧),按团队策略决定是否开启;README/文档说明此能力依赖双端版本升级。

M4. cause 链匹配未防护循环引用

RetryableExceptionHelper.java:97-107

while (current != null) {
    ...
    current = current.getCause();
}

JDK 的 Throwable.initCause 有自防护,但反序列化构造的 Throwable 可绕过,理论上存在死循环风险。

建议:用 IdentityHashMap 记录已访问,或限制最大深度(如 16)。


🟡 Minor(建议)

  • FailoverCluster.doInvoke 圈复杂度 ~16,建议抽 handleResponse / handleSofaRpcException / handleOtherException 三段。
  • 变量命名:exceptionResponselastRetryableResponse 更准确。
  • 菱形运算符:new ArrayList<Class<? extends Throwable>>(...) 可改 new ArrayList<>(...)
  • 缺关键注释:throwableexceptionResponse 互斥置 null 的设计意图需要文字说明。
  • FailoverClusterTest 没有断言 invokedProviderInfos 累积情况,建议补 assertEquals(2, invokedProviderInfos.size()) 验证 select 不会重复。

✅ 优点

  1. 职责拆分清晰:RetryableExceptionHelper 工具类单一职责,便于单测。
  2. 保留默认行为:isDefaultRetryable 完整保留 SERVER_BUSY / CLIENT_TIMEOUT 自动重试。
  3. 测试覆盖度比较好(约 85%):覆盖了 method override、cause 链、远端类名、配置非法、retries 用尽等关键场景。
  4. HTTP 场景的远端类名传递思路正确,解决了协议层 Throwable 不可序列化的痛点。

建议补充的测试

// 1. 直接的父类匹配
@Test public void testParentClassMatch() // 配 RuntimeException,抛 IAE
// 2. retries=0 但 response 需重试
@Test public void testNoRetryButReturnRetryableResponse()
// 3. provider 累积
Assert.assertEquals(2, cluster.getInvokedProviderInfos().size());
// 4. 配置非法不应在每次调用时再次抛(修完 B2 后)
@Test public void testInvalidConfigDoesNotBreakRuntime()
// 5. cause 链自引用防护(修完 M4 后)
@Test public void testCauseLoopProtection()

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds configurable retryable-exception support to SOFARPC failover retry logic, including matching configured exception classes (and causes) and enabling HTTP transports to propagate original business-exception class metadata for retry decisions.

Changes:

  • Introduces retryExceptions at consumer and method config levels, plus a new config key constant.
  • Extends failover retry logic to retry on configured exception types and business-exception responses, with helper utilities and tests.
  • Propagates HTTP response metadata (HEAD_RESPONSE_EXCEPTION) so clients can apply retry rules even when the original Throwable isn’t serialized.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
remoting/remoting-http/src/main/java/com/alipay/sofa/rpc/transport/http/AbstractHttpClientHandler.java Reads and forwards business exception class metadata from HTTP headers into SofaResponse props.
remoting/remoting-http/src/main/java/com/alipay/sofa/rpc/server/http/Http2ServerTask.java Emits business exception class name into HTTP/2 headers for app errors.
remoting/remoting-http/src/main/java/com/alipay/sofa/rpc/server/http/Http1ServerTask.java Emits business exception class name into HTTP/1.1 headers for app errors.
remoting/remoting-http/src/main/java/com/alipay/sofa/rpc/server/http/AbstractHttpServerTask.java Plumbs the thrown business exception into sendAppError(...) so transports can emit metadata.
core/api/src/test/java/com/alipay/sofa/rpc/config/MethodConfigTest.java Adds unit test coverage for MethodConfig.retryExceptions.
core/api/src/test/java/com/alipay/sofa/rpc/config/ConsumerConfigTest.java Adds unit test coverage for consumer vs method override of retryExceptions.
core/api/src/main/java/com/alipay/sofa/rpc/config/MethodConfig.java Adds method-level retryExceptions configuration property.
core/api/src/main/java/com/alipay/sofa/rpc/config/ConsumerConfig.java Adds consumer-level retryExceptions and method-level resolution helper.
core/api/src/main/java/com/alipay/sofa/rpc/common/RpcConstants.java Adds CONFIG_KEY_RETRY_EXCEPTIONS.
core/api/src/main/java/com/alipay/sofa/rpc/common/RemotingConstants.java Adds HEAD_RESPONSE_EXCEPTION header constant.
core-impl/client/src/test/java/com/alipay/sofa/rpc/client/RetryableExceptionHelperTest.java Tests helper parsing and retry matching (including remote exception class name).
core-impl/client/src/test/java/com/alipay/sofa/rpc/client/FailoverClusterTest.java Tests failover retry behaviors with configured retry exceptions and response-based exceptions.
core-impl/client/src/main/java/com/alipay/sofa/rpc/client/RetryableExceptionHelper.java New helper for parsing configured exceptions and matching local/remote failures.
core-impl/client/src/main/java/com/alipay/sofa/rpc/client/FailoverCluster.java Integrates configurable exception retry logic into failover flow.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Throwable throwable = new SofaRpcException(RpcErrorType.SERVER_BIZ, errorMsg);
response.setAppResponse(throwable);
String exceptionClass = headers.get(RemotingConstants.HEAD_RESPONSE_EXCEPTION);
if (exceptionClass != null) {
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When propagating HEAD_RESPONSE_EXCEPTION, consider ignoring blank/empty header values (not just null) to avoid storing meaningless response props (e.g., use a blank check before addResponseProp).

Suggested change
if (exceptionClass != null) {
if (StringUtils.isNotBlank(exceptionClass)) {

Copilot uses AI. Check for mistakes.
Comment on lines +463 to +472
/**
* Sets retry exceptions.
*
* @param retryExceptions the retry exceptions
* @return the retry exceptions
*/
public ConsumerConfig<T> setRetryExceptions(String retryExceptions) {
this.retryExceptions = retryExceptions;
return this;
}
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Javadoc mismatch: this setter returns ConsumerConfig<T> for fluent chaining, but the @return tag says "the retry exceptions". Update the return description to reflect the actual return type/value (e.g., "this consumer config").

Copilot uses AI. Check for mistakes.
Comment on lines +63 to +73
Class<?> exceptionClass;
try {
exceptionClass = ClassUtils.forName(className);
} catch (RuntimeException e) {
throw new IllegalArgumentException("Failed to load retry exception class: " + className, e);
}
if (!Throwable.class.isAssignableFrom(exceptionClass)) {
throw new IllegalArgumentException("Retry exception class must extend Throwable: " + className);
}
result.add(castThrowableClass(exceptionClass));
}
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolveRetryExceptions() uses ClassUtils.forName(className) which initializes classes (initialize=true). For user-provided exception class names this can trigger static initializers and adds avoidable overhead; prefer loading without initialization (e.g., ClassUtils.forName(className, false)), then validate Throwable assignability.

Copilot uses AI. Check for mistakes.
Comment on lines +111 to +120
Object exceptionClassName = response.getResponseProp(RemotingConstants.HEAD_RESPONSE_EXCEPTION);
if (!(exceptionClassName instanceof String) || StringUtils.isBlank((String) exceptionClassName)) {
return false;
}
Class<?> exceptionClass;
try {
exceptionClass = ClassUtils.forName((String) exceptionClassName);
} catch (RuntimeException e) {
return matchesClassName((String) exceptionClassName, retryableExceptions);
}
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

matchesRemoteException() attempts to ClassUtils.forName() the exception class name provided by the remote peer. Because ClassUtils.forName() initializes classes, this can execute static initializers based on untrusted input. Consider loading with initialization disabled (or avoiding class loading entirely and matching only against configured class names) to prevent side effects from remote-controlled headers.

Copilot uses AI. Check for mistakes.
Comment on lines +61 to +62
List<Class<? extends Throwable>> retryableExceptions = RetryableExceptionHelper
.resolveRetryExceptions(consumerConfig.getMethodRetryExceptions(methodName));
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolveRetryExceptions(...) performs class loading and list building on every invocation of doInvoke(). Since retry exception config is method-scoped and typically static, consider resolving/caching the parsed List<Class<? extends Throwable>> in the consumer config value cache (or a per-method cache in the cluster) to avoid repeated class-loading work on hot paths.

Copilot uses AI. Check for mistakes.
@pmupkin
Copy link
Copy Markdown
Author

pmupkin commented Apr 28, 2026

Review 总结

方向正确、测试也写得不错,但默认行为变更每次调用都反射加载类配置非法直接打断调用这三类问题必须修,否则会带来线上风险。下面按严重程度分级列出。

🔴 Blocker(必须修)

B1. resolveRetryExceptions 在每次 RPC 调用都做 Class.forName — 性能问题

FailoverCluster.doInvoke 入口 + RetryableExceptionHelper.java:56-70

// FailoverCluster.doInvoke 每次调用都执行:
List<Class<? extends Throwable>> retryableExceptions =
    RetryableExceptionHelper.resolveRetryExceptions(
        consumerConfig.getMethodRetryExceptions(methodName));

// helper 内部:
for (String className : classNames) {
    exceptionClass = ClassUtils.forName(className);  // ← 反射 + split 字符串
    ...
}

每个 RPC 请求都会拆字符串 + 反射加载 N 个类。RPC 是高频路径,这是显著的额外开销。

建议:在 ConsumerConfig#setRetryExceptions(...) / MethodConfig#setRetryExceptions(...) 时一次性解析并缓存 List<Class<? extends Throwable>>,或在 RetryableExceptionHelper 内用 ConcurrentHashMap<String, List<Class<? extends Throwable>>> 缓存。前者更优,可以在配置阶段 fail-fast。

B2. 配置非法时抛 IllegalArgumentException 直接打断每次 RPC 调用

RetryableExceptionHelper.java:62-69

try {
    exceptionClass = ClassUtils.forName(className);
} catch (RuntimeException e) {
    throw new IllegalArgumentException("Failed to load retry exception class: " + className, e);
}
if (!Throwable.class.isAssignableFrom(exceptionClass)) {
    throw new IllegalArgumentException("Retry exception class must extend Throwable: " + className);
}
  • 如果用户拼错了类名 / 类不在 classpath,每次 RPC 调用都抛异常,整个接口直接不可用。
  • 这个 IAE 会进到 FailoverClustercatch (Exception e),又被包装成 SofaRpcException(CLIENT_UNDECLARED_ERROR),再次走 shouldRetry —— 解析仍然会失败,日志会被刷爆。

建议

  1. 与 B1 配套:把校验前移到 setRetryExceptions(...) 配置阶段,运行期不会再抛。
  2. 即便保留运行期解析,也应该 WARN + 跳过该非法类名,而不是中断整个调用。

B3. 默认行为破坏:业务返回 Throwable 不再"原样返回"

FailoverCluster.java:73-82(diff 修改后)

原行为:只要 response != nullreturn response,业务异常直接交给上层处理。 PR 行为:先 RetryableExceptionHelper.shouldRetry(response, ...) 判断;如果 appResponse instanceof Throwable 且匹配重试列表,就重试

副作用:

  1. 隐式的 SERVER_BIZ 重试:现在 RPC 服务端业务异常会通过 HEAD_RESPONSE_EXCEPTION 暴露真实类名给客户端,客户端只要配置了对应类(哪怕是 RuntimeException 这种泛型父类),就会盲目重试非幂等的业务异常,可能造成订单/支付双写。
  2. 配置 RuntimeException 这种宽匹配直接命中所有业务异常,文档里必须明确警告,且建议加默认黑名单(如 IllegalArgumentException、参数校验类异常默认不应重试)。

建议

  • ConsumerConfig#setRetryExceptions 的 javadoc 和 README/文档里明确警告幂等性要求。
  • 考虑增加一个独立开关 retryOnBusinessException,默认关闭,避免存量用户升级后行为静默改变。

🟠 Major(强烈建议修)

M1. time 计数器语义被悄悄改变

FailoverCluster.java:73-130

原代码 time 表示"已失败次数",只在 catch 里 time++。PR 在"response 非空但被判 retryable"分支里也 time++,于是 time 变成"已尝试次数"。整体可以收敛,但:

  • 没有任何注释说明这层语义变化;
  • 三个 time++ 散落在不同分支,未来维护者很容易漏改;
  • do { ... } while (time <= retries) 的边界并未在测试里覆盖 retries=0 + response 需重试 的极端场景。

建议:抽 private boolean shouldContinue(int time) 之类的方法,并补一个注释说明 time 语义。同时补一条测试:retries=0 + 第一次返回 appResponse instanceof CustomRetryException → 应直接返回(不重试),断言 invokeTimes == 1

M2. matchesRemoteExceptionClass.forName 失败时仅做字符串全等回退,与本地匹配语义不一致

RetryableExceptionHelper.java:112-138

} catch (RuntimeException e) {
    return matchesClassName((String) exceptionClassName, retryableExceptions);
}
// matchesClassName 仅比 retryableException.getName().equals(exceptionClassName)

本地匹配支持父类 + cause 链,远端类加载失败时退化为"全限定名相等",用户配 RuntimeException 时永远命中不了远端的 MyBizException

建议:要么明确文档说明远端匹配仅支持精确类名;要么至少加 WARN 日志告知"远端类名 xxx 在本地无法加载,已退化为字符串匹配"。

M3. HTTP / Bolt 写出 HEAD_RESPONSE_EXCEPTION — 信息泄漏 + 跨版本兼容

Http1ServerTaskHttp2ServerTaskSofaRpcSerialization

headers.set(RemotingConstants.HEAD_RESPONSE_EXCEPTION, throwable.getClass().getName());
  • 把服务端内部包结构无差别暴露到客户端,HTTP 场景常见跨组织调用,存在信息泄漏。
  • 老版本客户端拿到这个 prop 会忽略,没问题;但反过来新客户端 + 老服务端 → 永远拿不到这个 prop,远端匹配能力失效,文档要写清这个限制
  • 是否所有写出业务异常的协议路径都加了?建议作者在 PR 描述里补一份"已覆盖路径清单"。

建议:增加配置开关 exposeRemoteExceptionClass(服务端侧),按团队策略决定是否开启;README/文档说明此能力依赖双端版本升级。

M4. cause 链匹配未防护循环引用

RetryableExceptionHelper.java:97-107

while (current != null) {
    ...
    current = current.getCause();
}

JDK 的 Throwable.initCause 有自防护,但反序列化构造的 Throwable 可绕过,理论上存在死循环风险。

建议:用 IdentityHashMap 记录已访问,或限制最大深度(如 16)。

🟡 Minor(建议)

  • FailoverCluster.doInvoke 圈复杂度 ~16,建议抽 handleResponse / handleSofaRpcException / handleOtherException 三段。
  • 变量命名:exceptionResponselastRetryableResponse 更准确。
  • 菱形运算符:new ArrayList<Class<? extends Throwable>>(...) 可改 new ArrayList<>(...)
  • 缺关键注释:throwableexceptionResponse 互斥置 null 的设计意图需要文字说明。
  • FailoverClusterTest 没有断言 invokedProviderInfos 累积情况,建议补 assertEquals(2, invokedProviderInfos.size()) 验证 select 不会重复。

✅ 优点

  1. 职责拆分清晰:RetryableExceptionHelper 工具类单一职责,便于单测。
  2. 保留默认行为:isDefaultRetryable 完整保留 SERVER_BUSY / CLIENT_TIMEOUT 自动重试。
  3. 测试覆盖度比较好(约 85%):覆盖了 method override、cause 链、远端类名、配置非法、retries 用尽等关键场景。
  4. HTTP 场景的远端类名传递思路正确,解决了协议层 Throwable 不可序列化的痛点。

建议补充的测试

// 1. 直接的父类匹配
@Test public void testParentClassMatch() // 配 RuntimeException,抛 IAE
// 2. retries=0 但 response 需重试
@Test public void testNoRetryButReturnRetryableResponse()
// 3. provider 累积
Assert.assertEquals(2, cluster.getInvokedProviderInfos().size());
// 4. 配置非法不应在每次调用时再次抛(修完 B2 后)
@Test public void testInvalidConfigDoesNotBreakRuntime()
// 5. cause 链自引用防护(修完 M4 后)
@Test public void testCauseLoopProtection()

好的,我会继续往建议的方向提交

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla:yes CLA is ok First-time contributor First-time contributor size/XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants