[ISSUE #10373] Fix quarantined flaky tests and remove CI rerun workflow #10378

Closed
lizhimins wants to merge 1666 commits into apache:main from lizhimins:fix/flaky-tests-10373

Conversation

@lizhimins
Member

Summary

Changes

| Test | Root Cause | Fix |
|---|---|---|
| BloomFilterTest#testCheckFalseHit | new Random(System.nanoTime()) per-character produced duplicate strings | Use single seeded Random(42) instance |
| TransactionalMessageServiceImplTest#testDeletePrepareMessage_maxSize | verify(bridge, timeout(50)) too short for slow CI | Increase to timeout(3000) |
| DefaultMQConsumerWithTraceTest#testPullMessage_WithTrace_Success | Async PullMessageService race condition | Call pullMessage directly |
| DefaultMQLitePullConsumerWithTraceTest (2 methods) | Instance-level RebalanceService.waitInterval race | Set as static field in @Before |
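The BloomFilterTest fix can be pictured with a small sketch. This is not the code from the test: the class `SeededStringsSketch`, the helper names, and the alphabet are hypothetical. It only shows the general idea of drawing every character from one `Random` that is seeded once, so the generated strings are reproducible from run to run.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration; not the actual BloomFilterTest code.
final class SeededStringsSketch {
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789";

    // Draws every character from the one Random passed in, so consecutive
    // calls continue the same pseudo-random sequence instead of restarting it.
    static String randomString(Random random, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    // Generates `count` strings from a single Random that is seeded once.
    static Set<String> generate(long seed, int count, int length) {
        Random random = new Random(seed);
        Set<String> result = new HashSet<>();
        for (int i = 0; i < count; i++) {
            result.add(randomString(random, length));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> first = generate(42L, 1000, 16);
        Set<String> second = generate(42L, 1000, 16);
        System.out.println("same across runs: " + first.equals(second));
        System.out.println("distinct strings: " + first.size());
    }
}
```

With a fixed seed the input set is identical on every run, so a failure, if one occurs, is reproducible rather than intermittent.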

Test plan

  • BloomFilterTest: 478 of a planned 1000 runs completed with zero failures before the run was stopped manually
  • TransactionalMessageServiceImplTest: 1000/1000 passed
  • DefaultMQConsumerWithTraceTest: 5000x verification in progress
  • DefaultMQLitePullConsumerWithTraceTest: ~900/1000 passed before manual stop

yx9o and others added 30 commits July 4, 2025 10:02
* [ISSUE apache#9501] correcting mismatched comments

* Update
…leDuration encounters flow limit

 Return origin handle to consumer when changeInvisibleDuration encounters flow limit
…very

Fix combineCQ extra search commitLog files for recovery
…ResetOffset (apache#9310)

* feat: support clients to reset lmq consumption offset

* fix

* fix

* fix

* fix: clean pull offset in #removeOffset

* fix: clean pull offset in #removeOffset

* rerun test

---------

Co-authored-by: hqbfzwang <hqbfzwang@tencent.com>
…apache#9554)

* [ISSUE apache#9553] Improve performance by avoiding repeated get(key)

* Update
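The commit title for apache#9553 only names the technique, "avoiding repeated get(key)"; the changed code is not shown here. As a generic sketch of that technique, with a hypothetical class and method names, the second variant below fetches the value once and reuses the local reference instead of hashing the key twice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the general technique, not the actual change.
final class SingleLookupSketch {
    // Two hash lookups for the same key.
    static int lengthWithRepeatedGet(Map<String, String> map, String key) {
        if (map.get(key) != null) {
            return map.get(key).length();
        }
        return -1;
    }

    // One hash lookup: read once, then work with the local variable.
    static int lengthWithSingleGet(Map<String, String> map, String key) {
        String value = map.get(key);
        return value != null ? value.length() : -1;
    }

    public static void main(String[] args) {
        Map<String, String> topics = new HashMap<>();
        topics.put("TopicA", "queue-0");
        System.out.println(lengthWithRepeatedGet(topics, "TopicA"));
        System.out.println(lengthWithSingleGet(topics, "TopicA"));
        System.out.println(lengthWithSingleGet(topics, "missing"));
    }
}
```

On a concurrent map the single read also avoids seeing two different values for the same key between the check and the use.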
* Limit group length to 120 because the maximum length of a pop retry topic is 255.
* Add unit test for validating group.
* Fix unit test for validating gRPC group, limit length to 120
…sumeQueueStore to adapt to CombineConsumeQueueStore (apache#9566)
…ecord flush (apache#9627)

* [ISSUE apache#9626] Prevent premature offset commit before consumer record flush
…n tieredMessageStore (apache#9649)

* [ISSUE apache#9648] Fix getOffsetInQueueByTime missing in tieredMessageStore

* Update TieredMessageStoreTest.java

* Delete inappropriate UT

* Remove unused import
…cessor (apache#7838)

* Fix start and shutdown process of DefaultMessagingProcessor

* minimal changes
…>= 3.5.0) (apache#9665)

Co-authored-by: cvictory <shengli.caosl@alibaba-inc.com>
* [ISSUE apache#9650] Unified FAQ related URLs

* Update

* Update
f1amingo and others added 28 commits April 3, 2026 11:23
…uspend for LiteTopic (apache#10204)

- Add wildcard (*) subscription support for liteTopic
- Implement consume suspend mechanism with invalid scan count threshold
- Refactor subscriber query interface with SubscriberWrapper for flexible retrieval
- Add wildcard client cache with 30s TTL for performance optimization
- Update related components and enhance test coverage

Change-Id: I4ecaceec7daa2f4364d911437007df98dc49d542
… gRPC export failures (apache#10239)

* Add BatchSplittingMetricExporter to prevent OTLP gRPC export failures

When high-cardinality metrics (consumer_group x topic) produce OTLP export
payloads exceeding the gRPC 32MB limit or SLS per-RPC processing limit,
all metrics fail to export. This adds a MetricExporter decorator that:

- Splits large batches of MetricData objects into smaller sub-batches
- Splits single oversized MetricData objects by their internal data points
  into multiple smaller MetricData objects (supports all 7 MetricDataType)
- Configurable via BrokerConfig.metricsExportBatchMaxDataPoints (default 1000)
- Fast path with zero overhead when data points are within threshold
- Logs failed batch details for debugging

* fix(metrics): snapshot MetricData points before export to prevent AIOOBE

The OTel SDK's NumberDataPointMarshaler.createRepeated allocates an
array based on points.size() then iterates. If callback threads
concurrently add data points between size() and iteration, an
ArrayIndexOutOfBoundsException occurs. This adds a defensive snapshot
of all data point collections at the start of export(), ensuring
the delegate exporter always receives immutable point collections.

* test(metrics): add unit tests for snapshot defensive copy

- testSnapshotCreatesNewMetricData: verify delegate receives
  snapshotted MetricData, not the original reference
- testSnapshotFallsBackToOriginal: verify catch block falls
  back to original when snapshot fails (e.g., mock without type)
- testSnapshotPointsAreIndependentCopy: verify the snapshotted
  points collection is a separate instance from the original
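The snapshot-then-split idea described in the commit message above can be sketched in plain Java. This is a simplified stand-in, not BatchSplittingMetricExporter: it works on a generic collection instead of OpenTelemetry `MetricData`, and the class and method names are hypothetical. It copies the input once, returns a single batch when the size is within the threshold, and otherwise cuts the copy into sub-batches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified stand-in for the snapshot and split steps.
final class BatchSplitSketch {
    // Copies the input once, then cuts the copy into read-only sub-batches
    // holding at most maxPerBatch elements each.
    static <T> List<List<T>> snapshotAndSplit(Collection<T> points, int maxPerBatch) {
        if (maxPerBatch <= 0) {
            throw new IllegalArgumentException("maxPerBatch must be positive");
        }
        // Defensive copy: later changes to `points` do not affect the batches.
        // Assumption: the source collection tolerates being copied at this moment.
        List<T> snapshot = new ArrayList<>(points);
        if (snapshot.size() <= maxPerBatch) {
            // Fast path: nothing to split.
            return Collections.singletonList(Collections.unmodifiableList(snapshot));
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < snapshot.size(); from += maxPerBatch) {
            int to = Math.min(from + maxPerBatch, snapshot.size());
            batches.add(Collections.unmodifiableList(new ArrayList<>(snapshot.subList(from, to))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> points = new ArrayList<>(Arrays.asList(1, 2, 3, 4, 5));
        List<List<Integer>> batches = snapshotAndSplit(points, 2);
        points.add(6); // mutating the original after the snapshot
        System.out.println(batches);
    }
}
```

Because each sub-batch is built from the copy, adding to the original list afterwards leaves the batches unchanged, which is the property the snapshot tests listed above check for.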
…ve and fix delete co… (apache#9778)

* feat: use data version from master while sync slave and fix delete config while sync

Change-Id: I42b2e7b1acc6836d3c90973801c9defba5f1325c

* fix: assign new version using master while sync slave

Change-Id: I7ec20607a84499fe5a6607763013c59d726aedc3

* feat: allow set dataVersion directly for topic/group config sync

Change-Id: Ic845794350e8bdaa847bdd0ae4b3e40ab1ad6311

* feat: set data version directly while sync from master

Change-Id: I39e78477a5223b578a4ede3e5cb76f04368d1ca3

* test: adjust slave sync test for version

Change-Id: I9e835568912928ddf6e81816095ee3ed8f93afc0
…StoreService.queryAsync (apache#10269)

Co-authored-by: lizhimins <lizhimins@users.noreply.github.com>
…ricExporter pool race (apache#10267)

OpenTelemetry Java 1.44.0 ~ 1.46.x ships OtlpGrpcMetricExporter with
MemoryMode.REUSABLE_DATA by default. The underlying
MetricReusableDataMarshaler.marshalerPool is a non-thread-safe
ArrayDeque accessed concurrently by the reader thread (poll) and the
OkHttp callback thread (add, via whenComplete). With
BatchSplittingMetricExporter issuing N concurrent sub-batch exports
per cycle, the pool races and leaks marshalers (~132 KiB each) until
OOM. Fixed upstream in 1.47.0 via open-telemetry/opentelemetry-java#7041
(ArrayDeque -> ConcurrentLinkedDeque).

- Bump OpenTelemetry to 1.47.0 in pom.xml so the upstream race fix is
  in effect.
- Default OtlpGrpcMetricExporter to MemoryMode.IMMUTABLE_DATA to
  preserve the pre-1.44 default behavior; exposed via
  brokerConfig.metricsExportOtelMemoryMode ("IMMUTABLE_DATA" /
  "REUSABLE_DATA", case-insensitive). Operators may opt in to
  REUSABLE_DATA when running on OTel >= 1.47.
- Cap concurrent in-flight sub-batches in BatchSplittingMetricExporter
  with a Semaphore controlled by
  brokerConfig.metricsExportBatchMaxConcurrent (default 4; set to 1
  to serialize and match pre-batch behavior; 0 or Integer.MAX_VALUE
  means unlimited).
- Add brokerConfig.metricsExportBatchSplitEnabled (default true) as
  an escape hatch to bypass BatchSplittingMetricExporter entirely,
  restoring the raw OtlpGrpcMetricExporter wiring.
- Defensively snapshot MetricData points before export to avoid
  ArrayIndexOutOfBoundsException in NumberDataPointMarshaler when
  async instrument callbacks mutate point collections during export.
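The Semaphore cap on in-flight sub-batches can be illustrated with a standalone sketch. It does not use the OpenTelemetry exporter; the simulated export is just a short sleep, and the class name, thread-pool size, and method names are hypothetical. The special values mentioned above (0 or Integer.MAX_VALUE meaning unlimited) are not modeled.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of bounding concurrent work with a Semaphore.
final class BoundedInFlightSketch {
    // Submits `tasks` simulated exports while holding at most maxConcurrent
    // permits, and returns the highest number observed running at once.
    static int runAndMeasurePeak(int tasks, int maxConcurrent) {
        Semaphore permits = new Semaphore(maxConcurrent);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            for (int i = 0; i < tasks; i++) {
                // Blocks the submitter once maxConcurrent exports are in flight.
                permits.acquire();
                pool.execute(() -> {
                    try {
                        int now = inFlight.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(5); // stands in for one export call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.decrementAndGet();
                        permits.release(); // frees a slot for the next sub-batch
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak in flight: " + runAndMeasurePeak(20, 4));
    }
}
```

A permit count of 1 serializes the work, which mirrors the "set to 1 to serialize" behavior the commit message describes for metricsExportBatchMaxConcurrent.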
…a MessageStoreConfig (apache#10271)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…orrect parsing during LiteTopic wildcard unregistration (apache#10254)
…n losing CK record when visibilityTimeout collision (apache#10277)
… test (apache#10287)

Co-authored-by: wangtao_ <wangtao684@huawei.com>
* [ISSUE apache#10110] Plain request process success and response fail when tlsMode=enforcing

* fix queryMessage default indexType

* fix
…10349)

Update rocketmq-client dependency version in Example_Simple.md documentation
from the outdated 4.3.0 to the latest stable release 5.5.0

Co-authored-by: H145608 <1404499274@qq.com>
Rename Troubleshoopting.md -> Troubleshooting.md
The original filename had a spelling error ('shoopting' instead of 'shooting').

Co-authored-by: H145608 <1404499274@qq.com>
…cs (apache#10374)

* [ISSUE apache#10373] Quarantine flaky tests and add detection plan docs

Ran all RocketMQ module tests 100x across 10 ECS nodes to identify
non-deterministic failures. Quarantined methods with @Ignore across
broker, client, filter, and tieredstore modules.

Flaky tests quarantined:
- broker: LiteLifecycleManagerTest#testCleanByParentTopic (2%)
- broker: ConsumerOrderInfoManagerLockFreeNotifyTest#testRecover (2%)
- broker: TransactionalMessageServiceImplTest#testDeletePrepareMessage_maxSize (1%)
- client: DefaultMQConsumerWithTraceTest#testPullMessage_WithTrace_Success (1%)
- client: DefaultMQLitePullConsumerWithTraceTest#testSubscribe_PollMessageSuccess_WithCustomizedTraceTopic (5%)
- client: DefaultMQLitePullConsumerWithTraceTest#testSubscribe_PollMessageSuccess_WithDefaultTraceTopic (6%)
- filter: BloomFilterTest#testCheckFalseHit (1%)
- tieredstore: IndexStoreServiceTest#queryCrossFileBoundaryTest (35%)
- tieredstore: IndexStoreServiceTest#concurrentGetTest (1.5%)

Additional changes:
- LiteLifecycleManagerTest: Switch to MockitoJUnitRunner.Silent
- Add flaky test detection plan docs (CN + EN)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [ISSUE apache#10373] Quarantine flaky PopPriorityIT and fix test cases

- Quarantine PopPriorityIT at class level (multiple methods fail
  intermittently with 'expected:<8> but was:<2>' due to async race)
- Fix ConsumerOrderInfoManagerLockFreeNotifyTest
- Fix IndexStoreServiceTest

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [ISSUE apache#10373] Fix flaky test detection plan docs path and naming

Move English doc from docs/cn/ to docs/en/ and rename both files
to match existing docs naming convention (underscore + PascalCase).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…ConsumeQueueService by removing getLifeCycle indirection (apache#10376)
…workflow

Fix root causes of flaky tests quarantined in apache#10374:

- BloomFilterTest#testCheckFalseHit: use single seeded Random instance
  instead of per-character Random(System.nanoTime()) which produced
  duplicate strings in tight loops
- TransactionalMessageServiceImplTest#testDeletePrepareMessage_maxSize:
  increase verify timeout from 50ms to 3000ms to accommodate slow
  thread scheduling
- DefaultMQConsumerWithTraceTest#testPullMessage_WithTrace_Success:
  call pullMessage directly instead of async PullMessageService to
  eliminate race condition
- DefaultMQLitePullConsumerWithTraceTest: set RebalanceService.waitInterval
  as static field in @Before to avoid instance-level race condition

Also remove rerun-workflow.yml to stop masking flaky tests with
automatic CI retries.
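Why raising the verify timeout from 50ms to 3000ms should not slow down passing runs can be shown with a plain-Java analogue. This sketch does not use Mockito; a `CountDownLatch` stands in for the asynchronous interaction being verified, and the class and method names are hypothetical. As far as I know, Mockito's `timeout(...)` mode behaves similarly in that it stops waiting once the interaction has been observed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stdlib analogue of waiting for an async interaction.
final class TimeoutBudgetSketch {
    // Starts a background action that finishes after delayMillis, then waits
    // up to timeoutMillis for it. Returns true if it was observed in time.
    static boolean completesWithin(long delayMillis, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(delayMillis); // simulates slow thread scheduling
                done.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();
        try {
            // Returns as soon as the latch opens; the timeout is only an upper bound.
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        boolean tight = completesWithin(200, 50);
        boolean generous = completesWithin(200, 3000);
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("tight budget met: " + tight);
        System.out.println("generous budget met: " + generous);
        System.out.println("total elapsed ms: " + elapsedMillis);
    }
}
```

The generous budget only costs extra time when the action never happens, which is the failing case anyway.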
@lizhimins lizhimins closed this May 26, 2026
@codecov-commenter

Codecov Report

❌ Patch coverage is 54.34842% with 1685 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.89%. Comparing base (0a32a54) to head (b268d06).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...on/rocksdb/TransactionalMessageRocksDBService.java | 0.00% | 148 Missing ⚠️ |
| ...ache/rocketmq/broker/lite/LiteEventDispatcher.java | 55.39% | 92 Missing and 32 partials ⚠️ |
| ...etmq/broker/processor/PopLiteMessageProcessor.java | 52.43% | 100 Missing and 17 partials ⚠️ |
| ...q/broker/metrics/BatchSplittingMetricExporter.java | 56.80% | 88 Missing and 4 partials ⚠️ |
| .../broker/longpolling/PopLiteLongPollingService.java | 42.46% | 72 Missing and 12 partials ⚠️ |
| .../rocketmq/broker/metrics/BrokerMetricsManager.java | 55.35% | 63 Missing and 12 partials ⚠️ |
| ...ocketmq/broker/processor/AdminBrokerProcessor.java | 23.95% | 64 Missing and 9 partials ⚠️ |
| ...cketmq/broker/processor/NotificationProcessor.java | 0.00% | 65 Missing ⚠️ |
| ...rocketmq/broker/processor/AckMessageProcessor.java | 4.68% | 58 Missing and 3 partials ⚠️ |
| ...a/org/apache/rocketmq/broker/BrokerController.java | 50.94% | 26 Missing and 26 partials ⚠️ |

... and 59 more
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #10378      +/-   ##
============================================
+ Coverage     47.94%   48.89%   +0.94%     
- Complexity    12011    13446    +1435     
============================================
  Files          1312     1376      +64     
  Lines         92659   100527    +7868     
  Branches      11849    12983    +1134     
============================================
+ Hits          44427    49149    +4722     
- Misses        42713    45356    +2643     
- Partials       5519     6022     +503     
