[improve][broker] Improve the topic loading observability and performance#24577
Closed
BewareMyPower wants to merge 14 commits into
Closed
[improve][broker] Improve the topic loading observability and performance#24577BewareMyPower wants to merge 14 commits into
BewareMyPower wants to merge 14 commits into
Conversation
0214bd2 to
4ffe9e8
Compare
Contributor
Author
|
I will write a PIP for the topic initialization optimization later, close this PR |
Contributor
Author
|
Regarding the duplicated ownership check method, I will use split it into a smaller PR. |
Contributor
Author
|
I'm splitting this PR into multiple small PRs, the 1st one is #24780 |
Contributor
Author
|
2nd PR: #24785 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
I observed a topic has been loaded for more than 30 seconds in production environment recently after its namespace bundle ownership was transferred. From the logs and heap dump, the conclusion is that the topic has been stuck before calling
asyncOpenincreatePersistentTopic0.After revisiting the long code path of topic loading, I found many places are not efficient, and the existing
pulsar_topic_load_timesmetric is incorrect. This metric only counts the time from the beginning ofcreatePersistentTopic0. However, there are some other time-consuming tasks before it's called.pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java
Line 1171 in c96f27a
It waits for topic policies of this topic are available before inserting a topic future to
BrokerService#topics. In extreme cases, there could be many concurrentTopicPoliciesService#getTopicPoliciesAsynccall.pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java
Line 1670 in c96f27a
In
loadOrCreatePersistentTopic, it will check the ownership viacheckTopicNsOwnershipbefore acquiring the topic load semaphore here:pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java
Line 1674 in c96f27a
This violates the semantics of the
maxConcurrentTopicLoadRequestconfig.After acquiring the semaphore, it calls
checkOwnershipAndCreatePersistentTopic, which validates the ownership again viaNamespaceService#isServiceUnitActiveAsync:pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java
Line 1739 in c96f27a
Actually, the implementation of
isServiceUnitActiveAsyncis exactly the same withcheckTopicNsOwnership, where the only difference is that the previous one returns a boolean future, while the latter one returns a failed future if it's false.Even after that, it could fetch the topic policies before
createPersistentTopic0:pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java
Line 1745 in c96f27a
Modifications
Major changes:
getTopicPoliciesBypassSystemTopiccall before inserting a topic future. This method is only used ingetManagedLedgerConfig.BrokerService#getTopicwill succeed even when the topic's bundle is not owned. These tests could pass just because the system topic reader creation will trigger acquiring the bundles in the same namespace.getManagedLedgerConfigandfetchPartitionedTopicMetadataAsyncrepeatedly by executing these tasks before other tasks that depend on them.checkMaxTopicsPerNamespace,checkTopicAlreadyMigrated,validateTopicConsistency.Though many tasks are metadata store access with
MetadataCacheused, executing them concurrently is still more efficient.For observability,
pulsar_topic_load_timesmetric.hasMessageAvailableandreadNextloop could have poor performance when CPU pressure is high. This read loop is also too heavy.Other changes and refactoring:
TopicNameinstance across the whole flow to avoid unnecessary conversions betweenTopicNameandString.isTransactionInternalNameat the beginningisServiceUnitActiveAsyncwithcheckTopicNsOwnershipand remove this method to avoid users using this method, which makes code hard to read.failTopicFutureto invalidate the topic cache for failures during topic loadingDocumentation
docdoc-requireddoc-not-neededdoc-completeMatching PR in forked repository
PR in forked repository: BewareMyPower#50