Skip to content

Commit 6ddefce

Browse files
hjwsm1989huangjun
andauthored
Roll back LocalityAwareLoadBalancer to gettimeofday_us to match callers (#3283)
PR #3268 ("Use monotonic time instead of wall time") switched LocalityAwareLoadBalancer::Weight::Update's end_time_us and LocalityAwareLoadBalancer::Describe's now to butil::cpuwide_time_us(), but every caller that supplies CallInfo::begin_time_us still uses butil::gettimeofday_us(): - Channel::CallMethod (channel.cpp:451) -> Controller::IssueRPC -> Controller::Call::begin_time_us -> SelectIn::begin_time_us -> CallInfo::begin_time_us - Controller::OnVersionedRPCReturned retry sites (controller.cpp:672, 715) call IssueRPC(gettimeofday_us()) on backup-request and regular retries The mismatched time domains make latency = end_time_us - ci.begin_time_us = cpuwide_now - wallclock_begin ~= -1.7e15 us trigger the `if (latency <= 0) { /* time skews, ignore */ return 0; }` short-circuit on every call. _time_q never accumulates samples, _avg_latency stays at 0, and locality-aware weight feedback is silently disabled. Visible downstream symptom: cold-start `list://` channels with `lb=la` and 2 backends occasionally fail RPCs with EHOSTDOWN ("Fail to select server from list://...") on retry even when one backend is healthy. Bisected reproduction in xsky/brpc fork: - 51 commit range c41e838..604dad0c (1.16.1 .. 1.17.0-rc2) - master code + LA-driven multipath probe at 2 backends, max_retry=1, repeat 500x: * commit 771de31 (one before #3268): 0/500 fail * commit 12fb539 (#3268): 25/500 fail * commit 12fb539 + revert only Weight::Update::end_time_us to gettimeofday_us: 0/500 fail This commit reverts the LA-side of #3268's clock change so the LB lines up with its existing callers again. Channel::CallMethod and the retry paths in Controller stay on butil::gettimeofday_us(), which preserves the wall-clock semantics of Controller::_begin_time_us / Controller::latency_us() that public users rely on. Adds test/brpc_load_balancer_unittest.cpp::la_records_latency_with_consistent_time_source which drives a series of SelectServer + Feedback cycles against LocalityAwareLoadBalancer (no Server / Channel needed) and asserts that _avg_latency reflects the elapsed time, rather than being stuck at 0 because of a time-source mismatch. Co-authored-by: huangjun <huangjun@xsky.com>
1 parent 457e22a commit 6ddefce

2 files changed

Lines changed: 80 additions & 3 deletions

File tree

src/brpc/policy/locality_aware_load_balancer.cpp

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@
1818

1919
#include <limits> // numeric_limits
2020
#include <gflags/gflags.h>
21-
#include "butil/time.h" // cpuwide_time_us
21+
#include "butil/time.h" // gettimeofday_us
2222
#include "butil/fast_rand.h"
2323
#include "brpc/log.h"
2424
#include "brpc/socket.h"
@@ -376,7 +376,7 @@ void LocalityAwareLoadBalancer::Feedback(const CallInfo& info) {
376376

377377
int64_t LocalityAwareLoadBalancer::Weight::Update(
378378
const CallInfo& ci, size_t index) {
379-
const int64_t end_time_us = butil::cpuwide_time_us();
379+
const int64_t end_time_us = butil::gettimeofday_us();
380380
const int64_t latency = end_time_us - ci.begin_time_us;
381381
BAIDU_SCOPED_LOCK(_mutex);
382382
if (Disabled()) {
@@ -524,7 +524,7 @@ void LocalityAwareLoadBalancer::Describe(
524524
if (_db_servers.Read(&s) != 0) {
525525
os << "fail to read _db_servers";
526526
} else {
527-
const int64_t now = butil::cpuwide_time_us();
527+
const int64_t now = butil::gettimeofday_us();
528528
const size_t n = s->weight_tree.size();
529529
os << '[';
530530
for (size_t i = 0; i < n; ++i) {

test/brpc_load_balancer_unittest.cpp

Lines changed: 77 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1303,6 +1303,83 @@ TEST_F(LoadBalancerTest, revived_from_all_failed_intergrated) {
13031303
}
13041304
#endif // BUTIL_USE_ASAN
13051305

1306+
// Regression for #3268's incomplete migration of LocalityAwareLoadBalancer.
1307+
//
1308+
// #3268 switched `LocalityAwareLoadBalancer::Weight::Update::end_time_us` and
1309+
// `LocalityAwareLoadBalancer::Describe::now` to `butil::cpuwide_time_us()`
1310+
// while every caller that supplies `CallInfo::begin_time_us` (the RPC entry
1311+
// in `Channel::CallMethod` and the retry sites in
1312+
// `Controller::OnVersionedRPCReturned`) still uses `butil::gettimeofday_us()`.
1313+
// The resulting time-source mismatch makes
1314+
//
1315+
// latency = end_time_us - ci.begin_time_us
1316+
// = cpuwide_now - wallclock_begin
1317+
// ~= -1.7e15 us (huge negative)
1318+
//
1319+
// trigger the
1320+
//
1321+
// if (latency <= 0) { /* time skews, ignore the sample */ return 0; }
1322+
//
1323+
// short-circuit on every call. `_time_q` never accumulates samples,
1324+
// `_avg_latency` stays at 0, and locality-aware weight feedback is silently
1325+
// disabled. Visible downstream symptom: cold-start `list://` channels with
1326+
// `lb=la` and 2 backends occasionally fail RPCs with `EHOSTDOWN`
1327+
// ("Fail to select server") on retry even when one backend is healthy.
1328+
//
1329+
// This commit reverts the LA side of #3268, so `Weight::Update` and
1330+
// `Describe` once again use `butil::gettimeofday_us()` to match every
1331+
// existing caller of `CallInfo::begin_time_us`.
1332+
//
1333+
// The test below runs entirely against `LocalityAwareLoadBalancer` (no
1334+
// Server / Channel is involved), so it is hermetic. It supplies a
1335+
// gettimeofday-based `begin_time_us` (matching what `Channel::CallMethod`
1336+
// passes today) and asserts that the LB records a positive `_avg_latency`,
1337+
// rather than tripping the time-skew short-circuit.
1338+
TEST_F(LoadBalancerTest, la_records_latency_with_consistent_time_source) {
1339+
LALB lalb;
1340+
char addr[] = "192.168.1.1:8080";
1341+
butil::EndPoint dummy;
1342+
ASSERT_EQ(0, str2endpoint(addr, &dummy));
1343+
brpc::ServerId id(8888);
1344+
brpc::SocketOptions options;
1345+
options.remote_side = dummy;
1346+
ASSERT_EQ(0, brpc::Socket::Create(options, &id.id));
1347+
ASSERT_TRUE(lalb.AddServer(id));
1348+
1349+
auto avg_latency = [&]() -> int64_t {
1350+
std::ostringstream os;
1351+
brpc::DescribeOptions opts;
1352+
opts.verbose = true;
1353+
lalb.Describe(os, opts);
1354+
const std::string s = os.str();
1355+
const size_t p = s.find("avg_latency=");
1356+
if (p == std::string::npos) return -1;
1357+
return strtoll(s.c_str() + p + strlen("avg_latency="), NULL, 10);
1358+
};
1359+
1360+
// Drive a few "RPCs": pick a server, sleep ~2ms, feed back. begin_time_us
1361+
// comes from gettimeofday_us(), matching what Channel::CallMethod and the
1362+
// retry sites in Controller::OnVersionedRPCReturned pass on every RPC.
1363+
for (int i = 0; i < 8; ++i) {
1364+
const int64_t begin_us = butil::gettimeofday_us();
1365+
brpc::SocketUniquePtr ptr;
1366+
brpc::LoadBalancer::SelectIn in = { begin_us, true, false, 0u, NULL };
1367+
brpc::LoadBalancer::SelectOut out(&ptr);
1368+
ASSERT_EQ(0, lalb.SelectServer(in, &out));
1369+
bthread_usleep(2000);
1370+
brpc::LoadBalancer::CallInfo ci = { begin_us, id.id, 0, NULL };
1371+
lalb.Feedback(ci);
1372+
}
1373+
1374+
// _avg_latency must reflect actual elapsed time. If this is 0, either
1375+
// Weight::Update::end_time_us was changed away from gettimeofday_us
1376+
// again (re-introducing the time-source mismatch) or some caller of
1377+
// CallInfo::begin_time_us drifted to a different clock domain.
1378+
EXPECT_GT(avg_latency(), 0);
1379+
1380+
ASSERT_EQ(0, brpc::Socket::SetFailed(id.id));
1381+
}
1382+
13061383
TEST_F(LoadBalancerTest, la_selection_too_long) {
13071384
brpc::GlobalInitializeOrDie();
13081385
brpc::LoadBalancerWithNaming lb;

0 commit comments

Comments
 (0)