[ISSUE #10373] Fix quarantined flaky tests and remove CI rerun workflow #10378
Closed
lizhimins wants to merge 1666 commits into
* [ISSUE apache#9501] correcting mismatched comments
* Update
…leDuration encounters flow limit Return origin handle to consumer when changeInvisibleDuration encounters flow limit
…very Fix combineCQ extra search commitLog files for recovery
…ResetOffset (apache#9310)
* feat: support clients to reset lmq consumption offset
* fix
* fix
* fix
* fix: clean pull offset in #removeOffset
* fix: clean pull offset in #removeOffset
* rerun test
---------
Co-authored-by: hqbfzwang <hqbfzwang@tencent.com>
…e when enable split registration (apache#9521)
…ionDataSet is null
…apache#9554)
* [ISSUE apache#9553] Improve performance by avoiding repeated get(key)
* Update

* limit group length to 120 for max length for pop retry topic is 255.
* Add unit test for validating group.
* Fix unit test for validating gRPC group, limit length to 120
…sumeQueueStore to adapt to CombineConsumeQueueStore (apache#9566)
…ineConsumeQueueStore
…pache#9608) Co-authored-by: hqbfzwang <hqbfzwang@tencent.com>
…ecord flush (apache#9627)
* [ISSUE apache#9626] Prevent premature offset commit before consumer record flush

…n tieredMessageStore (apache#9649)
* [ISSUE apache#9648] Fix getOffsetInQueueByTime missing in tieredMessageStore
* Update TieredMessageStoreTest.java
* Delete inappropriate UT
* Remove unused import

…cessor (apache#7838)
* Fix start and shutdown process of DefaultMessagingProcessor
* minimal changes
…>= 3.5.0) (apache#9665) Co-authored-by: cvictory <shengli.caosl@alibaba-inc.com>
* [ISSUE apache#9650] Unified FAQ related URLs
* Update
* Update

…uspend for LiteTopic (apache#10204)
- Add wildcard (*) subscription support for liteTopic
- Implement consume suspend mechanism with invalid scan count threshold
- Refactor subscriber query interface with SubscriberWrapper for flexible retrieval
- Add wildcard client cache with 30s TTL for performance optimization
- Update related components and enhance test coverage

Change-Id: I4ecaceec7daa2f4364d911437007df98dc49d542
…migration to RocksDB CQ (apache#10174)
… gRPC export failures (apache#10239)
* Add BatchSplittingMetricExporter to prevent OTLP gRPC export failures

  When high-cardinality metrics (consumer_group x topic) produce OTLP export payloads exceeding the gRPC 32MB limit or SLS per-RPC processing limit, all metrics fail to export. This adds a MetricExporter decorator that:
  - Splits large batches of MetricData objects into smaller sub-batches
  - Splits single oversized MetricData objects by their internal data points into multiple smaller MetricData objects (supports all 7 MetricDataType)
  - Configurable via BrokerConfig.metricsExportBatchMaxDataPoints (default 1000)
  - Fast path with zero overhead when data points are within threshold
  - Logs failed batch details for debugging
* fix(metrics): snapshot MetricData points before export to prevent AIOOBE

  The OTel SDK's NumberDataPointMarshaler.createRepeated allocates an array based on points.size() then iterates. If callback threads concurrently add data points between size() and iteration, an ArrayIndexOutOfBoundsException occurs. This adds a defensive snapshot of all data point collections at the start of export(), ensuring the delegate exporter always receives immutable point collections.
* test(metrics): add unit tests for snapshot defensive copy
  - testSnapshotCreatesNewMetricData: verify delegate receives snapshotted MetricData, not the original reference
  - testSnapshotFallsBackToOriginal: verify catch block falls back to original when snapshot fails (e.g., mock without type)
  - testSnapshotPointsAreIndependentCopy: verify the snapshotted points collection is a separate instance from the original
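The splitting behaviour described in this commit message can be illustrated with a minimal sketch. It is not the code of `BatchSplittingMetricExporter`: `PointBatcher` and `split` are hypothetical names, and each metric is reduced to its data-point count so that the example needs only the JDK. The greedy packing policy is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of RocketMQ or OpenTelemetry.
final class PointBatcher {
    private PointBatcher() {
    }

    /**
     * Greedily packs metrics, represented only by their data-point counts, into
     * sub-batches whose total never exceeds maxPoints. A metric larger than
     * maxPoints is cut into pieces of at most maxPoints first. Metrics with a
     * count of zero contribute nothing and are skipped.
     */
    static List<List<Integer>> split(List<Integer> pointCounts, int maxPoints) {
        if (maxPoints <= 0) {
            throw new IllegalArgumentException("maxPoints must be positive");
        }
        List<List<Integer>> batches = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        int used = 0;
        for (int count : pointCounts) {
            int remaining = count;
            while (remaining > 0) {
                int piece = Math.min(remaining, maxPoints);
                if (used + piece > maxPoints) {
                    // The current sub-batch is full: close it and start a new one.
                    batches.add(current);
                    current = new ArrayList<>();
                    used = 0;
                }
                current.add(piece);
                used += piece;
                remaining -= piece;
            }
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

With a limit of 1000, the counts 400, 700 and 2500 end up as five sub-batches (400, 700, 1000, 1000, 500), while 100 and 200 stay together in a single batch, which corresponds to the fast path mentioned in the commit message.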
…ve and fix delete co… (apache#9778)
* feat: use data version from master while sync slave and fix delete config while sync

  Change-Id: I42b2e7b1acc6836d3c90973801c9defba5f1325c
* fix: assign new version using master while sync slave

  Change-Id: I7ec20607a84499fe5a6607763013c59d726aedc3
* feat: allow set dataVersion directly for topic/group config sync

  Change-Id: Ic845794350e8bdaa847bdd0ae4b3e40ab1ad6311
* feat: set data version directly while sync from master

  Change-Id: I39e78477a5223b578a4ede3e5cb76f04368d1ca3
* test: adjust slave sync test for version

  Change-Id: I9e835568912928ddf6e81816095ee3ed8f93afc0
…StoreService.queryAsync (apache#10269) Co-authored-by: lizhimins <lizhimins@users.noreply.github.com>
…ricExporter pool race (apache#10267)

OpenTelemetry Java 1.44.0 ~ 1.46.x ships OtlpGrpcMetricExporter with MemoryMode.REUSABLE_DATA by default. The underlying MetricReusableDataMarshaler.marshalerPool is a non-thread-safe ArrayDeque accessed concurrently by the reader thread (poll) and the OkHttp callback thread (add, via whenComplete). With BatchSplittingMetricExporter issuing N concurrent sub-batch exports per cycle, the pool races and leaks marshalers (~132 KiB each) until OOM. Fixed upstream in 1.47.0 via open-telemetry/opentelemetry-java#7041 (ArrayDeque -> ConcurrentLinkedDeque).

- Bump OpenTelemetry to 1.47.0 in pom.xml so the upstream race fix is in effect.
- Default OtlpGrpcMetricExporter to MemoryMode.IMMUTABLE_DATA to preserve the pre-1.44 default behavior; exposed via brokerConfig.metricsExportOtelMemoryMode ("IMMUTABLE_DATA" / "REUSABLE_DATA", case-insensitive). Operators may opt in to REUSABLE_DATA when running on OTel >= 1.47.
- Cap concurrent in-flight sub-batches in BatchSplittingMetricExporter with a Semaphore controlled by brokerConfig.metricsExportBatchMaxConcurrent (default 4; set to 1 to serialize and match pre-batch behavior; 0 or Integer.MAX_VALUE means unlimited).
- Add brokerConfig.metricsExportBatchSplitEnabled (default true) as an escape hatch to bypass BatchSplittingMetricExporter entirely, restoring the raw OtlpGrpcMetricExporter wiring.
- Defensively snapshot MetricData points before export to avoid ArrayIndexOutOfBoundsException in NumberDataPointMarshaler when async instrument callbacks mutate point collections during export.
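The Semaphore cap on in-flight sub-batches can be sketched with JDK classes alone. This is an illustration under assumptions, not the exporter's code: `BoundedExportSketch` and `runAll` are hypothetical names, plain `Runnable` tasks stand in for sub-batch exports, and the "0 or Integer.MAX_VALUE means unlimited" convention is not modelled (the sketch requires a positive limit).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; the Runnable tasks stand in for sub-batch exports.
final class BoundedExportSketch {
    private BoundedExportSketch() {
    }

    /**
     * Runs every task on the pool while keeping at most maxConcurrent of them
     * in flight. Returns the highest concurrency that was actually observed.
     */
    static int runAll(List<Runnable> tasks, int maxConcurrent, ExecutorService pool) {
        if (maxConcurrent <= 0) {
            throw new IllegalArgumentException("maxConcurrent must be positive");
        }
        Semaphore permits = new Semaphore(maxConcurrent);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<CompletableFuture<Void>> futures = new ArrayList<>();
        for (Runnable task : tasks) {
            // Blocks the submitting thread until one of the slots is free.
            permits.acquireUninterruptibly();
            CompletableFuture<Void> future = CompletableFuture.runAsync(() -> {
                int now = inFlight.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    task.run();
                } finally {
                    inFlight.decrementAndGet();
                }
            }, pool).whenComplete((ignored, error) -> permits.release()); // release on success or failure
            futures.add(future);
        }
        // join() rethrows a task failure wrapped in a CompletionException.
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        return peak.get();
    }
}
```

Passing 1 as the limit serializes the tasks, which mirrors the "set to 1 to serialize" option described above. The permit is released in `whenComplete` so that a failed task cannot leak a slot.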
…a MessageStoreConfig (apache#10271) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…orrect parsing during LiteTopic wildcard unregistration (apache#10254)
…n losing CK record when visibilityTimeout collision (apache#10277)
… test (apache#10287) Co-authored-by: wangtao_ <wangtao684@huawei.com>
* [ISSUE apache#10110] Plain request process success and response fail when tlsMode=enforcing
* fix queryMessage default indexType
* fix
…al rate limiting (apache#10342)
…10349)
Update rocketmq-client dependency version in Example_Simple.md documentation from the outdated 4.3.0 to the latest stable release 5.5.0
Co-authored-by: H145608 <1404499274@qq.com>
…ed without rocksdbjni dependency (apache#10371)
Rename Troubleshoopting.md -> Troubleshooting.md
The original filename had a spelling error ('shoopting' instead of 'shooting').
Co-authored-by: H145608 <1404499274@qq.com>
…cs (apache#10374)
* [ISSUE apache#10373] Quarantine flaky tests and add detection plan docs

  Ran all RocketMQ module tests 100x across 10 ECS nodes to identify non-deterministic failures. Quarantined methods with @Ignore across broker, client, filter, and tieredstore modules.

  Flaky tests quarantined:
  - broker: LiteLifecycleManagerTest#testCleanByParentTopic (2%)
  - broker: ConsumerOrderInfoManagerLockFreeNotifyTest#testRecover (2%)
  - broker: TransactionalMessageServiceImplTest#testDeletePrepareMessage_maxSize (1%)
  - client: DefaultMQConsumerWithTraceTest#testPullMessage_WithTrace_Success (1%)
  - client: DefaultMQLitePullConsumerWithTraceTest#testSubscribe_PollMessageSuccess_WithCustomizedTraceTopic (5%)
  - client: DefaultMQLitePullConsumerWithTraceTest#testSubscribe_PollMessageSuccess_WithDefaultTraceTopic (6%)
  - filter: BloomFilterTest#testCheckFalseHit (1%)
  - tieredstore: IndexStoreServiceTest#queryCrossFileBoundaryTest (35%)
  - tieredstore: IndexStoreServiceTest#concurrentGetTest (1.5%)

  Additional changes:
  - LiteLifecycleManagerTest: Switch to MockitoJUnitRunner.Silent
  - Add flaky test detection plan docs (CN + EN)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [ISSUE apache#10373] Quarantine flaky PopPriorityIT and fix test cases
  - Quarantine PopPriorityIT at class level (multiple methods fail intermittently with 'expected:<8> but was:<2>' due to async race)
  - Fix ConsumerOrderInfoManagerLockFreeNotifyTest
  - Fix IndexStoreServiceTest

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [ISSUE apache#10373] Fix flaky test detection plan docs path and naming

  Move English doc from docs/cn/ to docs/en/ and rename both files to match existing docs naming convention (underscore + PascalCase).

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…ConsumeQueueService by removing getLifeCycle indirection (apache#10376)
…workflow

Fix root causes of flaky tests quarantined in apache#10374:
- BloomFilterTest#testCheckFalseHit: use a single seeded Random instance instead of a per-character Random(System.nanoTime()), which produced duplicate strings in tight loops
- TransactionalMessageServiceImplTest#testDeletePrepareMessage_maxSize: increase verify timeout from 50ms to 3000ms to accommodate slow thread scheduling
- DefaultMQConsumerWithTraceTest#testPullMessage_WithTrace_Success: call pullMessage directly instead of going through the async PullMessageService, to eliminate the race condition
- DefaultMQLitePullConsumerWithTraceTest: set RebalanceService.waitInterval as a static field in @Before to avoid an instance-level race condition

Also remove rerun-workflow.yml to stop masking flaky tests with automatic CI retries.
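The BloomFilterTest fix relies on drawing all test inputs from one seeded `Random`. The sketch below shows that pattern with the JDK only. `SeededStrings`, `randomString`, `generate` and the alphabet are hypothetical and are not taken from the test itself. A fixed seed does not rule out duplicates in principle, but it makes every run see the same inputs, so a failure becomes reproducible rather than intermittent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; not the code of BloomFilterTest.
final class SeededStrings {
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789";

    private SeededStrings() {
    }

    /** Draws every character from the single Random supplied by the caller. */
    static String randomString(Random random, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    /**
     * Generates count strings from one Random that is seeded exactly once.
     * Creating a new Random(System.nanoTime()) per character, by contrast, can
     * reuse a seed when the clock does not advance between two calls.
     */
    static List<String> generate(long seed, int count, int length) {
        Random random = new Random(seed);
        List<String> result = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            result.add(randomString(random, length));
        }
        return result;
    }
}
```

Calling `SeededStrings.generate(42L, 100, 16)` twice returns equal lists, because `java.util.Random` produces the same sequence for the same seed.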
Summary
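- Fix the root causes of the flaky tests quarantined in apache#10374 and remove their `@Ignore` annotations
- Remove `rerun-workflow.yml` to stop masking flaky tests with automatic CI retries

Changes

| Test | Root cause | Fix |
| --- | --- | --- |
| `BloomFilterTest#testCheckFalseHit` | `new Random(System.nanoTime())` per-character produced duplicate strings | Single seeded `Random(42)` instance |
| `TransactionalMessageServiceImplTest#testDeletePrepareMessage_maxSize` | `verify(bridge, timeout(50))` too short for slow CI | Increase to `timeout(3000)` |
| `DefaultMQConsumerWithTraceTest#testPullMessage_WithTrace_Success` | Async `PullMessageService` race condition | Call `pullMessage` directly |
| `DefaultMQLitePullConsumerWithTraceTest` (2 methods) | Instance-level `RebalanceService.waitInterval` race | Set as a static field in `@Before` |

Test plan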