Commit 20fb231

drt: run roachtest-operations as a systemd unit
Previously, roachtest-operations ran directly in the foreground, piped through tee. This made it difficult to manage as a service: stopping it required finding the process, and SIGINT handling via the tee pipe was fragile.

Now the script uses systemd-run to create a transient systemd unit (roachtest-operations-<cluster>), following the same self-reinvocation pattern as roachprod's start.sh. The script detects re-invocation via the ROACHTEST_OPS_SYSTEMD env var and uses exec to replace bash, so roachtest-operations directly receives SIGTERM from systemctl stop for graceful shutdown. TimeoutStopSec=150 ensures the full 2-minute cleanup window is respected. Logs are available via journalctl.

Epic: none
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
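The role of exec described above can be illustrated with a small standalone sketch. It is not code from this commit, and the function names are hypothetical. It only shows that a process started with exec keeps the PID of the shell it replaces, which is why a signal sent to that PID reaches the final program rather than an intermediate bash.

```shell
# Illustrative sketch, not part of the commit. Function names are hypothetical.

# With exec: the outer bash prints its PID, then is replaced by a second
# shell, which prints its own PID. Both lines show the same number.
exec_keeps_pid() {
  bash -c 'echo "$$"; exec sh -c "echo \$\$"'
}

# Without exec: the second shell runs as a child process, so it has a
# different PID. The trailing `true` keeps bash from implicitly exec'ing
# the last command.
without_exec_new_pid() {
  bash -c 'echo "$$"; sh -c "echo \$\$"; true'
}
```

Calling `exec_keeps_pid` should print the same PID twice, while `without_exec_new_pid` should print two different PIDs.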
1 parent 542a5b0 commit 20fb231

1 file changed

Lines changed: 30 additions & 12 deletions

pkg/cmd/drt/scripts/roachtest_operations_run.sh

@@ -29,8 +29,11 @@ cd /home/ubuntu

 export ROACHPROD_GCE_DEFAULT_PROJECT=cockroach-drt
 export ROACHPROD_DNS="drt.crdb.io"
-./drtprod sync
-sleep 20
+
+if [[ "${ROACHTEST_OPS_SYSTEMD:-}" != "1" ]]; then
+  ./drtprod sync
+  sleep 20
+fi

 # Fetch secrets from cloud provider at runtime (not stored in any file)
 # AWS secrets are stored in us-east-1 region for consistency across all clusters
@@ -81,15 +84,30 @@ if [ -z "${DD_API_KEY}" ]; then
   exit 1
 fi

-# Ignore SIGINT in the shell so that tee (which inherits the signal
-# disposition) stays alive when Ctrl+C is pressed. Only
-# roachtest-operations receives SIGINT and handles cleanup gracefully.
-trap '' INT
+UNIT_NAME="roachtest-operations-${CLUSTER}"
+
+if [[ "${ROACHTEST_OPS_SYSTEMD:-}" == "1" ]]; then
+  exec ./roachtest-operations run-operation "${CLUSTER}" ".*" \
+    --datadog-api-key "${DD_API_KEY}" \
+    --datadog-app-key "unused" \
+    --datadog-tags "env:development,cluster:${CLUSTER},workload:${WORKLOAD_CLUSTER},team:drt,service:drt-cockroachdb" \
+    --certs-dir ./certs --prom-port 2115 \
+    --cloud "${CLOUD}" --workload-cluster "${WORKLOAD_CLUSTER}" --run-forever
+fi
+
+# Set up systemd unit and start it, which will recursively
+# invoke this script but hit the above conditional.
+
+if systemctl is-active -q "${UNIT_NAME}"; then
+  echo "${UNIT_NAME} service already active"
+  echo "To get more information: systemctl status ${UNIT_NAME}"
+  exit 1
+fi

-./roachtest-operations run-operation "${CLUSTER}" ".*" \
-  --datadog-api-key "${DD_API_KEY}" \
-  --datadog-app-key "unused" \
-  --datadog-tags "env:development,cluster:${CLUSTER},workload:${WORKLOAD_CLUSTER},team:drt,service:drt-cockroachdb" \
-  --certs-dir ./certs --prom-port 2115 \
-  --cloud "${CLOUD}" --workload-cluster "${WORKLOAD_CLUSTER}" --run-forever 2>&1 | tee -a roachtest_ops.log
+sudo systemctl reset-failed "${UNIT_NAME}" 2>/dev/null || true

+sudo systemd-run --unit "${UNIT_NAME}" \
+  --same-dir --uid "$(id -u)" --gid "$(id -g)" \
+  -p TimeoutStopSec=150 \
+  --setenv ROACHTEST_OPS_SYSTEMD=1 \
+  bash "${0}" "${CLUSTER}" "${WORKLOAD_CLUSTER}" "${CLOUD}"
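
Once the transient unit is running, it would typically be managed with standard systemd tooling. The helper below is a hypothetical sketch, not part of this commit. It derives the unit name the same way the script does and prints the commands an operator might use for status, graceful stop, and log tailing. The cluster name passed in is an assumption supplied by the caller.

```shell
# Hypothetical helpers, not part of the commit.

# Mirrors UNIT_NAME="roachtest-operations-${CLUSTER}" from the diff.
unit_name() {
  printf 'roachtest-operations-%s\n' "$1"
}

# Prints (does not run) the usual commands for the unit of a given cluster.
print_unit_commands() {
  local unit
  unit="$(unit_name "$1")"
  # status: check whether the unit is active
  echo "systemctl status ${unit}"
  # stop: systemd sends SIGTERM by default, then waits up to TimeoutStopSec
  echo "sudo systemctl stop ${unit}"
  # logs: follow the unit's journal output
  echo "journalctl -u ${unit} -f"
}
```

For example, `print_unit_commands my-cluster` lists the three commands for a unit named `roachtest-operations-my-cluster`, where `my-cluster` is a placeholder.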
