[fix](editlog) Fix BDBEnvironment.close() ConcurrentModificationException#62061
[fix](editlog) Fix BDBEnvironment.close() ConcurrentModificationException#62061dataroaring wants to merge 1 commit intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Pull request overview
Fixes a concurrency bug in BDBEnvironment.close() where iterating openedDatabases without synchronization could throw ConcurrentModificationException during concurrent openDatabase() calls, potentially killing the replayer thread.
Changes:
- Wrap
BDBEnvironment.close()withlock.writeLock()to serialize closes vs. opens/removes. - Ensure
openedDatabases,epochDB, andreplicatedEnvironmentare mutated under the same write lock used elsewhere.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| lock.writeLock().lock(); | ||
| try { | ||
| for (Database db : openedDatabases) { | ||
| try { | ||
| db.close(); | ||
| } catch (DatabaseException exception) { | ||
| LOG.error("Error closing db {} will exit", db.getDatabaseName(), exception); | ||
| } | ||
| } | ||
| } | ||
| openedDatabases.clear(); | ||
| openedDatabases.clear(); | ||
|
|
||
| if (epochDB != null) { | ||
| try { | ||
| epochDB.close(); | ||
| epochDB = null; | ||
| } catch (DatabaseException exception) { | ||
| LOG.error("Error closing db {} will exit", epochDB.getDatabaseName(), exception); | ||
| if (epochDB != null) { | ||
| try { | ||
| epochDB.close(); | ||
| epochDB = null; | ||
| } catch (DatabaseException exception) { | ||
| LOG.error("Error closing db {} will exit", epochDB.getDatabaseName(), exception); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| if (replicatedEnvironment != null) { | ||
| try { | ||
| // Finally, close the store and environment. | ||
| replicatedEnvironment.close(); | ||
| replicatedEnvironment = null; | ||
| } catch (DatabaseException exception) { | ||
| LOG.error("Error closing replicatedEnvironment", exception); | ||
| if (replicatedEnvironment != null) { | ||
| try { | ||
| // Finally, close the store and environment. | ||
| replicatedEnvironment.close(); | ||
| replicatedEnvironment = null; | ||
| } catch (DatabaseException exception) { | ||
| LOG.error("Error closing replicatedEnvironment", exception); | ||
| } | ||
| } | ||
| } finally { | ||
| lock.writeLock().unlock(); | ||
| } |
There was a problem hiding this comment.
Holding the class write-lock while calling into external resource cleanup (db.close(), epochDB.close(), replicatedEnvironment.close()) can block unrelated threads for a long time and increases deadlock risk if those close paths indirectly contend on the same lock or other shared locks. A safer pattern is a two-phase close: under the write-lock, atomically “detach” resources (copy/swap openedDatabases to a local list, set fields like epochDB/replicatedEnvironment to null, optionally set a closing/closed flag to reject new opens), then release the lock and perform the actual close() calls outside the lock.
There was a problem hiding this comment.
Done. Adopted two-phase close: detach resources under writeLock, then close outside the lock. Also fixed the misleading log messages.
| try { | ||
| db.close(); | ||
| } catch (DatabaseException exception) { | ||
| LOG.error("Error closing db {} will exit", db.getDatabaseName(), exception); |
There was a problem hiding this comment.
The message fragment “will exit” is misleading here because close() does not exit the process (and the code continues closing other resources). Consider updating the log text to reflect the actual behavior (e.g., “Error closing db {}” / “Error closing epoch db {}”) to avoid confusion during incident triage.
| epochDB.close(); | ||
| epochDB = null; | ||
| } catch (DatabaseException exception) { | ||
| LOG.error("Error closing db {} will exit", epochDB.getDatabaseName(), exception); |
There was a problem hiding this comment.
The message fragment “will exit” is misleading here because close() does not exit the process (and the code continues closing other resources). Consider updating the log text to reflect the actual behavior (e.g., “Error closing db {}” / “Error closing epoch db {}”) to avoid confusion during incident triage.
…tion with two-phase close BDBEnvironment.close() iterates openedDatabases without acquiring lock.writeLock(), while openDatabase() concurrently modifies the same ArrayList under the write lock. This causes ConcurrentModificationException when a RollbackException triggers close() on the replayer thread while HTTP/heartbeat/RPC handler threads call openDatabase() via getMaxJournalId(). Fix with two-phase close pattern: - Phase 1 (under writeLock): detach all resources — copy openedDatabases to a local list, clear the original, and null out epochDB and replicatedEnvironment fields. This atomically makes them invisible to concurrent openDatabase() callers. - Phase 2 (outside lock): perform the actual db.close() and env.close() calls without holding the lock, avoiding blocking unrelated threads during potentially slow close operations. Also fix misleading "will exit" in log messages — close() does not exit.
a9df648 to
33f8fac
Compare
|
run buildall |
FE Regression Coverage ReportIncrement line coverage |
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
2 similar comments
FE Regression Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
Summary
BDBEnvironment.close()iteratesopenedDatabaseswithout acquiringlock.writeLock()openDatabase()concurrently modifies the sameArrayListunder the write lockRollbackExceptionand callsclose(), while HTTP/heartbeat/RPC handler threads callopenDatabase()viagetMaxJournalId(),ConcurrentModificationExceptionis thrownFix: two-phase close pattern
openedDatabasesto local list, clear original, null outepochDBandreplicatedEnvironmentdb.close()/env.close()without holding the lock, avoiding blocking unrelated threads during potentially slow close operationsAlso fixes misleading "will exit" in log messages —
close()does not exit the process.Test plan
SHOW FRONTENDSworks concurrently with journal operations🤖 Generated with Claude Code