Skip to content

Commit f4418bb

Browse files
committed
feat: zombie instances auto-close during index status update
Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
1 parent b424a8f commit f4418bb

File tree

7 files changed

+83
-3
lines changed

7 files changed

+83
-3
lines changed

.github/workflows/ci-release.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -114,7 +114,7 @@ jobs:
114114
# Thus, we add `continue-on-error: true` here, but we should remove it
115115
# when either the issue is fixed (see: https://github.com/aquasecurity/trivy-action/issues/389)
116116
# or we self-host trivy database.
117-
uses: aquasecurity/trivy-action@0.32.0
117+
uses: aquasecurity/trivy-action@v0.35.0
118118
continue-on-error: true
119119
with:
120120
image-ref: 'ghcr.io/janusgraph/janusgraph:${{ env.JG_VER }}${{ matrix.tag_suffix }}'

docs/changelog.md

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -105,6 +105,23 @@ For more information on features and bug fixes in 1.2.0, see the GitHub mileston
105105

106106
Starting from version 1.2.0 JanusGraph supports ElasticSearch 9.
107107

108+
##### Zombie instances auto-close during index status update operations
109+
110+
Starting from version 1.2.0 JanusGraph can automatically force-close JanusGraph instances that are unreachable
111+
during index status update operations.
112+
113+
To enable this behavior, users can set the following configuration:
114+
```
115+
graph.management-auto-close-stale-instances=true
116+
```
117+
118+
By default, an instance is considered stale if it is not reachable for more than 2 minutes. However, this
119+
threshold can be adjusted using the following configuration:
120+
```
121+
graph.management-ack-timeout=240000 ms
122+
```
123+
This is a breaking change for users who use the `JanusGraphIndexStatusUpdate` interface.
124+
108125
### Version 1.1.0 (Release Date: November 7, 2024)
109126

110127
/// tab | Maven

docs/configs/janusgraph-cfg.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -52,6 +52,8 @@ General configuration options
5252
| graph.allow-upgrade | Setting this to true will allow certain fixed values to be updated such as storage-version. This should only be used for upgrading. | Boolean | false | MASKABLE |
5353
| graph.assign-timestamp | Whether to use JanusGraph generated client-side timestamp in mutations if the backend supports it. When enabled, JanusGraph assigns one timestamp to all insertions and another slightly earlier timestamp to all deletions in the same batch. When this is disabled, mutation behavior depends on the backend. Some might use server-side timestamp (e.g. HBase) while others might use client-side timestamp generated by driver (CQL). | Boolean | true | LOCAL |
5454
| graph.graphname | This config option is an optional configuration setting that you may supply when opening a graph. The String value you provide will be the name of your graph. If you use the ConfigurationManagement APIs, then you will be able to access your graph by this String representation using the ConfiguredGraphFactory APIs. | String | (no default value) | LOCAL |
55+
| graph.management-ack-timeout | Maximum time to wait for an instance to acknowledge a schema change before treating it as stale. Only effective when graph.management-auto-close-stale-instances is enabled. | Duration | 120000 ms | MASKABLE |
56+
| graph.management-auto-close-stale-instances | Whether to automatically force-close instances that fail to acknowledge schema changes within the configured timeout (graph.management-ack-timeout). When enabled, unresponsive instances are removed from the registration store so they no longer block schema status transitions. | Boolean | false | MASKABLE |
5557
| graph.replace-instance-if-exists | If a JanusGraph instance with the same instance identifier already exists, the usage of this configuration option results in the opening of this graph anyway. | Boolean | false | LOCAL |
5658
| graph.set-vertex-id | Whether user provided vertex ids should be enabled and JanusGraph's automatic vertex id allocation be disabled. Useful when operating JanusGraph in concert with another storage system that assigns long ids but disables some of JanusGraph's advanced features which can lead to inconsistent data. For example, users must ensure the vertex ids are unique to avoid duplication. Must use `graph.getIDManager().toVertexId(long)` to convert your id first. Once this is enabled, you have to provide vertex id when creating new vertices. EXPERT FEATURE - USE WITH GREAT CARE. | Boolean | false | GLOBAL_OFFLINE |
5759
| graph.storage-version | The version of JanusGraph storage schema with which this database was created. Automatically set on first start of graph. Should only ever be changed if upgrading to a new major release version of JanusGraph that contains schema changes | String | (no default value) | FIXED |

janusgraph-core/src/main/java/org/janusgraph/graphdb/configuration/GraphDatabaseConfiguration.java

100644100755
Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1124,6 +1124,19 @@ public boolean apply(@Nullable String s) {
11241124

11251125
public static final ConfigNamespace LOG_NS = new ConfigNamespace(GraphDatabaseConfiguration.ROOT_NS,"log","Configuration options for JanusGraph's logging system",true);
11261126

1127+
public static final ConfigOption<Boolean> MANAGEMENT_AUTO_CLOSE_STALE_INSTANCES = new ConfigOption<>(GRAPH_NS,
1128+
"management-auto-close-stale-instances",
1129+
"Whether to automatically force-close instances that fail to acknowledge schema changes within the " +
1130+
"configured timeout (graph.management-ack-timeout). When enabled, unresponsive instances are removed " +
1131+
"from the registration store so they no longer block schema status transitions.",
1132+
ConfigOption.Type.MASKABLE, false);
1133+
1134+
public static final ConfigOption<Duration> MANAGEMENT_ACK_TIMEOUT = new ConfigOption<>(GRAPH_NS,
1135+
"management-ack-timeout",
1136+
"Maximum time to wait for an instance to acknowledge a schema change before treating it as stale. " +
1137+
"Only effective when graph.management-auto-close-stale-instances is enabled.",
1138+
ConfigOption.Type.MASKABLE, Duration.ofSeconds(120));
1139+
11271140
public static final String MANAGEMENT_LOG = "janusgraph";
11281141
public static final String TRANSACTION_LOG = "tx";
11291142
public static final String USER_LOG = "user";

janusgraph-core/src/main/java/org/janusgraph/graphdb/database/StandardJanusGraph.java

100644100755
Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -115,6 +115,7 @@
115115

116116
import java.io.Closeable;
117117
import java.io.IOException;
118+
import java.time.Duration;
118119
import java.time.Instant;
119120
import java.util.ArrayList;
120121
import java.util.Collection;
@@ -134,6 +135,8 @@
134135
import javax.script.Bindings;
135136
import javax.script.ScriptException;
136137

138+
import static org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration.MANAGEMENT_ACK_TIMEOUT;
139+
import static org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration.MANAGEMENT_AUTO_CLOSE_STALE_INSTANCES;
137140
import static org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration.REGISTRATION_TIME;
138141
import static org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration.REPLACE_INSTANCE_IF_EXISTS;
139142
import static org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration.SCRIPT_EVAL_ENABLED;
@@ -254,7 +257,9 @@ public StandardJanusGraph(GraphDatabaseConfiguration configuration) {
254257
globalConfig.set(REGISTRATION_TIME, times.getTime(), uniqueInstanceId);
255258

256259
Log managementLog = backend.getSystemMgmtLog();
257-
managementLogger = new ManagementLogger(this, managementLog, schemaCache, this.times);
260+
Duration ackTimeout = configuration.getConfiguration().get(MANAGEMENT_ACK_TIMEOUT);
261+
boolean autoCloseStaleInstances = configuration.getConfiguration().get(MANAGEMENT_AUTO_CLOSE_STALE_INSTANCES);
262+
managementLogger = new ManagementLogger(this, managementLog, schemaCache, this.times, ackTimeout, autoCloseStaleInstances);
258263
managementLog.registerReader(ReadMarker.fromNow(), managementLogger);
259264

260265
shutdownHook = new ShutdownThread(this);

janusgraph-core/src/main/java/org/janusgraph/graphdb/database/management/ManagementLogger.java

100644100755
Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -37,6 +37,7 @@
3737
import org.slf4j.LoggerFactory;
3838

3939
import java.time.Duration;
40+
import java.util.HashSet;
4041
import java.util.Iterator;
4142
import java.util.List;
4243
import java.util.Set;
@@ -71,11 +72,21 @@ public class ManagementLogger implements MessageReader {
7172
private final AtomicInteger evictionTriggerCounter = new AtomicInteger(0);
7273
private final ConcurrentMap<Long,EvictionTrigger> evictionTriggerMap = new ConcurrentHashMap<>();
7374

75+
private final Duration ackTimeout;
76+
private final boolean autoCloseStaleInstances;
77+
7478
public ManagementLogger(StandardJanusGraph graph, Log sysLog, SchemaCache schemaCache, TimestampProvider times) {
79+
this(graph, sysLog, schemaCache, times, Duration.ofSeconds(120), true);
80+
}
81+
82+
public ManagementLogger(StandardJanusGraph graph, Log sysLog, SchemaCache schemaCache, TimestampProvider times,
83+
Duration ackTimeout, boolean autoCloseStaleInstances) {
7584
this.graph = graph;
7685
this.schemaCache = schemaCache;
7786
this.sysLog = sysLog;
7887
this.times = times;
88+
this.ackTimeout = ackTimeout;
89+
this.autoCloseStaleInstances = autoCloseStaleInstances;
7990
Preconditions.checkNotNull(times);
8091
}
8192

@@ -163,11 +174,13 @@ private class EvictionTrigger {
163174
final List<Callable<Boolean>> updatedTypeTriggers;
164175
final StandardJanusGraph graph;
165176
final Set<String> instancesToBeAcknowledged;
177+
final long createdAtMillis;
166178

167179
private EvictionTrigger(long evictionId, List<Callable<Boolean>> updatedTypeTriggers, StandardJanusGraph graph) {
168180
this.graph = graph;
169181
this.evictionId = evictionId;
170182
this.updatedTypeTriggers = updatedTypeTriggers;
183+
this.createdAtMillis = System.currentTimeMillis();
171184
final JanusGraphManagement mgmt = graph.openManagement();
172185
this.instancesToBeAcknowledged = ConcurrentHashMap.newKeySet();
173186
instancesToBeAcknowledged.addAll(((ManagementSystem) mgmt).getOpenInstancesInternal());
@@ -204,8 +217,34 @@ int removeDroppedInstances() {
204217
final String instanceRemovedMsg = "Instance [{}] was removed list of open instances and therefore dropped from list of instances to be acknowledged.";
205218
instancesToBeAcknowledged.stream().filter(it -> !updatedInstances.contains(it)).filter(instancesToBeAcknowledged::remove).forEach(it -> log.debug(instanceRemovedMsg, it));
206219
mgmt.rollback();
220+
221+
if (autoCloseStaleInstances && !instancesToBeAcknowledged.isEmpty()
222+
&& (System.currentTimeMillis() - createdAtMillis) > ackTimeout.toMillis()) {
223+
log.warn("ACK timeout ({}s) exceeded for eviction [{}]. Force-closing unresponsive instances: {}",
224+
ackTimeout.getSeconds(), evictionId, instancesToBeAcknowledged);
225+
forceCloseStaleInstances(new HashSet<>(instancesToBeAcknowledged));
226+
instancesToBeAcknowledged.clear();
227+
}
228+
207229
return instancesToBeAcknowledged.size();
208230
}
231+
232+
private void forceCloseStaleInstances(Set<String> staleInstances) {
233+
try {
234+
final JanusGraphManagement mgmt = graph.openManagement();
235+
for (String instanceId : staleInstances) {
236+
try {
237+
mgmt.forceCloseInstance(instanceId);
238+
log.warn("Force-closed stale instance [{}]", instanceId);
239+
} catch (IllegalArgumentException e) {
240+
log.debug("Could not force-close instance [{}]: {}", instanceId, e.getMessage());
241+
}
242+
}
243+
mgmt.commit();
244+
} catch (Exception e) {
245+
log.error("Failed to force-close stale instances", e);
246+
}
247+
}
209248
}
210249

211250
private class SendAckOnTxClose implements Runnable {

janusgraph-core/src/main/java/org/janusgraph/graphdb/olap/job/IndexRepairJob.java

100644100755
Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -229,7 +229,11 @@ public void process(JanusGraphVertex vertex, ScanMetrics metrics) {
229229
@Override
230230
public void workerIterationEnd(final ScanMetrics metrics) {
231231
try {
232-
if (index instanceof JanusGraphIndex) {
232+
// If writeTx is `null` it means that `workerIterationStart` failed previously and write transaction was
233+
// never assigned. In this case, instead of raising NPE here we skip this block which will
234+
// properly propagate root cause exception instead of NPE exception.
235+
// See the first `catch` block in `StandardScannerExecutor.run`.
236+
if (index instanceof JanusGraphIndex && writeTx != null) {
233237
BackendTransaction mutator = writeTx.getTxHandle();
234238
IndexType indexType = managementSystem.getSchemaVertex(index).asIndexType();
235239
if (indexType.isMixedIndex() && documentsPerStore.size() > 0) {

0 commit comments

Comments
 (0)