Commit 19ef309

fix(backup): switch to binary SQLite snapshot, eliminate cross-table data smearing
The previous dump-conf-db produced gzipped text SQL via `sqlite3 .dump` to a shared `tmp.dmp` with no locking. Concurrent cron invocations interleaved their stdout into the same file, and the resulting archive parsed as valid (but semantically wrong) SQL: values from one table smeared into INSERTs of unrelated tables on restore. Symptom reported from the field: rows from m_CustomFiles surfacing inside m_ConferenceRooms and garbage in m_OutgoingRoutingTable after restoration.

dump-conf-db
- Use the SQLite online backup API (.backup) for a page-consistent binary snapshot instead of textual .dump. Eliminates the SQL parseability that made interleaving silently dangerous.
- Atomic mkdir-lock with PID published via tmp+rename inside the owned lock directory. Stale detection combines kill -0 with a /proc/<pid>/comm + /proc/<pid>/cmdline two-stage identity check to defeat PID reuse on long-uptime appliances.
- Stale-lock reclaim uses `mv lockDir lockDir.dead.$$` as the atomic arbiter; racing reclaimers can no longer both win and double-rm.
- Per-PID temporary artefacts; .publish.*.gz staged on the same FS as the final target and atomically renamed.
- WAL checkpointed before .backup; PRAGMA quick_check gates publish.
- Content dedup via md5 of logical .dump (binary pages differ each run due to internal counters). Dedup is bypassed when no archive exists on disk to prevent a poisoned .last_hash from permanently disabling backups.
- Defensive sweep of orphaned .publish.*.gz older than 60 minutes.
- Single timestamp captured once to avoid hourly/daily date drift.

SystemConfiguration::tryRestoreConf
- Magic-byte detection (SQLite format 3\0) on the gunzipped payload: binary snapshots are restored via atomic file copy, legacy SQL dumps via sqlite3 replay (backward compat for pre-fix archives).
- pickFreshestBackup prefers the newest BINARY archive across all mounted storage; legacy archives are chosen only when no binary is available. This prevents regression to the smearing path on boxes with mixed-age archive directories.
- isBinaryArchive probes via popen+fread with a full pipe drain before pclose to avoid SIGPIPE-induced false "probe failed" warnings on every successful binary archive.
- Legacy SQL replay path emits LOG_WARNING so operators see the fallback in syslog and can audit data after boot.
- purgeConfDb removes the live DB and -wal/-shm/-journal sidecars via PHP glob+unlink (no shell glob, no rm).
- failRestore funnel: every failure path falls back to DEFAULT_CONFIG_DB with an explicit syslog reason; if the fallback copy itself fails it surfaces LOG_CRIT.
- Reboot loop guard via /var/run/conf-restored marker; @touch return checked and reboot is refused if the marker cannot be created.
- foreign_key_check runs as a WARN-only diagnostic post-restore.
- Hard prerequisite check on Util::which('gzip')/('sqlite3'); the guard uses str_contains('/') because Util::which falls back to the bare command name rather than an empty string when nothing is found.
- Operator breadcrumb (RESULT_SKIPPED + LOG_NOTICE) when no backup is found on any mounted storage.

Verified functionally in Docker (mikopbx/mikopbx-arm64:2026.1.21):
- Binary snapshot creation, dedup, atomic publish, lock contention (5 parallel runs), stale-lock reclaim (synthetic PID), and rotation all behave correctly.
- restoreFromArchive replays a real legacy SQL archive from boffart.miko.ru (2026-05-07_mikopbx.db.gz, 39 tables) and produces a quick_check-clean DB.
- pickFreshestBackup correctly picks an older binary archive over a newer legacy archive, which confirms prefer-binary protection against re-introducing the smearing path on mixed shelves.
- Zero "gzip probe failed" entries in syslog after the drain-loop fix; LEGACY SQL warning emitted as expected on the legacy path.

Also sets executable bit on dump-conf-db (mode 100755 in index).
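
The restore-side rules in the message (magic-byte detection and prefer-binary selection) live in PHP and are not part of the shell script in this diff. As a rough illustration only, the same two rules can be sketched in POSIX shell. The function names `is_binary_archive` and `pick_freshest_backup` are hypothetical stand-ins, the sketch scans a single directory rather than all mounted storage, and it assumes archive names contain no whitespace.

```shell
#!/bin/sh
# Illustrative sketch only; not the PHP code from SystemConfiguration.

# is_binary_archive FILE
# Returns 0 when the gunzipped payload starts with "SQLite format 3".
# The real header is 16 bytes ("SQLite format 3" plus a NUL); a shell
# variable cannot hold the NUL, so only the 15 printable bytes are compared.
is_binary_archive() {
    magic="$(gzip -dc "$1" 2>/dev/null | head -c 15)"
    [ "$magic" = "SQLite format 3" ]
}

# pick_freshest_backup DIR
# Prints the newest binary archive in DIR; falls back to the newest
# legacy (SQL text) archive only when no binary archive exists.
pick_freshest_backup() {
    dir="$1"
    newest_legacy=""
    # ls -1t lists newest first by modification time.
    for name in $(cd "$dir" 2>/dev/null && ls -1t | grep '_mikopbx\.db\.gz$'); do
        if is_binary_archive "$dir/$name"; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
        # Remember only the first (newest) legacy candidate.
        [ -z "$newest_legacy" ] && newest_legacy="$dir/$name"
    done
    [ -n "$newest_legacy" ] && printf '%s\n' "$newest_legacy"
}
```

Calling `pick_freshest_backup /some/backup/dir` on a directory that holds an older binary snapshot and a newer legacy dump prints the binary one, which is the behaviour the verification notes describe.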
1 parent 89d9616 commit 19ef309

2 files changed

Lines changed: 611 additions & 66 deletions

Lines changed: 246 additions & 41 deletions
@@ -1,7 +1,7 @@
 #!/bin/sh
 #
 # MikoPBX - free phone system for small business
-# Copyright © 2017-2022 Alexey Portnov and Nikolay Beketov
+# Copyright © 2017-2026 Alexey Portnov and Nikolay Beketov
 #
 # This program is free software: you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -16,61 +16,266 @@
 # You should have received a copy of the GNU General Public License along with this program.
 # If not, see <https://www.gnu.org/licenses/>.
 #
+# ----------------------------------------------------------------------------
+# Creates an online binary backup of the MikoPBX configuration database via
+# the SQLite ".backup" API. Produces a page-consistent snapshot that does
+# not block concurrent readers/writers and cannot smear data across tables
+# (unlike a textual ".dump" stream which is parseable as SQL and therefore
+# vulnerable to interleaving under race conditions).
+#
+# Invariants:
+#   1. Mutual exclusion across concurrent invocations: atomic mkdir-lock
+#      with stale-detection that distinguishes a live dump-conf-db from
+#      an unrelated process that has inherited the same PID after reuse.
+#   2. The PID inside the lock directory is published atomically (tmp+mv)
+#      so a racing reader can never observe an empty pid file.
+#   3. Temporary artefacts are PID-suffixed so a crashed previous run
+#      cannot poison the next invocation.
+#   4. The gzipped archive is staged as a .part file and atomically
+#      renamed into place — readers never observe a half-written gz.
+#   5. WAL is checkpointed before .backup so the snapshot reflects all
+#      committed transactions.
+#   6. Snapshots are validated (PRAGMA quick_check) before being kept.
+#   7. Content deduplication uses md5 of the LOGICAL .dump (binary
+#      pages always differ due to internal counters). If on-disk
+#      archives are absent, dedup is bypassed — this prevents a
+#      poisoned .last_hash from permanently disabling backups.
+# ----------------------------------------------------------------------------
 
-# Extract the media mount point from the configuration file
-confBackupDir="$(/bin/busybox grep "confBackupDir" < /etc/inc/mikopbx-settings.json | /bin/busybox cut -f 4 -d '"')";
+# === Configuration ===
 
-# Extract the path to the database from the configuration file
-dbPath="$(/bin/busybox grep "mikopbx.db" < /etc/inc/mikopbx-settings.json | /bin/busybox cut -f 4 -d '"')";
+confBackupDir="$(/bin/busybox grep 'confBackupDir' < /etc/inc/mikopbx-settings.json | /bin/busybox cut -f 4 -d '"')"
+dbPath="$(/bin/busybox grep 'mikopbx.db' < /etc/inc/mikopbx-settings.json | /bin/busybox cut -f 4 -d '"')"
 
-# Define the backup directory path
 backupDir="${confBackupDir}"
-
-# Catalog for daily backups
 backupDirDay="${confBackupDir}/day"
 
-# Ensure the backup directory exists
-mkdir -p "$backupDirDay";
+if ! /bin/busybox mkdir -p "$backupDirDay" 2>/dev/null; then
+    /bin/busybox logger -t dump-conf-db "ERROR: cannot create $backupDirDay (storage not mounted?)"
+    exit 1
+fi
 
-# Restrict backup file permissions to root only (prevent nginx/www from reading)
-umask 077;
+# Restrict backup readability — these files may contain credential hashes.
+umask 077
 
-# A file for temporary storage of a backup copy
-tmpBackup="$backupDir/tmp.dmp";
-/usr/bin/sqlite3 "$dbPath" .dump > "$tmpBackup";
+# === Atomic mkdir-lock with hardened stale detection ===
+#
+# Race-safe lock acquisition:
+#   1. mkdir is the single arbiter — if it fails, somebody else holds the lock.
+#   2. PID is published via tmp + mv so a reader never observes an empty file.
+#   3. Stale detection checks both liveness (kill -0) AND identity
+#      (/proc/<pid>/comm) to defeat PID reuse on long-uptime appliances.
 
-# Calculate the MD5 checksum of the database dump
-md5="$(/bin/busybox cat "$tmpBackup" | /bin/gzip -c | /usr/bin/md5sum | cut -f 1 -d ' ')" ;
+lockDir="$backupDir/.dump-conf-db.lock"
 
-# Check if a backup with the same MD5 checksum already exists
-/usr/bin/md5sum "$backupDir"/*_mikopbx.db.gz | /bin/busybox grep "$md5"> /dev/null 2> /dev/null
-result="$?"
+acquire_lock() {
+    /bin/busybox mkdir "$lockDir" 2>/dev/null || return 1
+    # Publish PID atomically inside the directory we already own.
+    echo "$$" > "$lockDir/pid.tmp"
+    /bin/busybox mv "$lockDir/pid.tmp" "$lockDir/pid"
+    return 0
+}
 
-# If no matching backup is found, create a new backup
-if [ "$result" = "1" ];then
-    /bin/busybox cat "$tmpBackup" | /bin/gzip -c > "$backupDir/$(date '+%Y-%m-%d_h%H_m%M_s%S')_mikopbx.db.gz"
-fi;
+# Returns 0 if the held lock is stale (safe to break), 1 if it is live.
+#
+# Two-stage identity check defeats PID reuse on long-uptime appliances:
+#   1) /proc/<pid>/comm matches our binary basename exactly → live.
+#   2) For generic shell names (sh/ash/bash/busybox) — which are too common
+#      to be conclusive — additionally inspect /proc/<pid>/cmdline for the
+#      "dump-conf-db" substring. Only then accept as live.
+lock_is_stale() {
+    pidFile="$lockDir/pid"
 
-# Remove the oldest backups, keeping only the 5 most recent ones
-(
-    cd "$backupDir" || exit;
-    /bin/busybox ls -1tr | grep 'db.gz' | /bin/busybox head -n-5 | /bin/xargs /bin/busybox rm 2> /dev/null
-)
+    # No pid file yet → very fresh lock held by a racing acquirer; respect it.
+    [ ! -f "$pidFile" ] && return 1
+
+    oldPid="$(/bin/busybox cat "$pidFile" 2>/dev/null)"
+    [ -z "$oldPid" ] && return 1
+
+    # Process is gone → certainly stale.
+    if ! /bin/busybox kill -0 "$oldPid" 2>/dev/null; then
+        return 0
+    fi
+
+    comm="$(/bin/busybox cat "/proc/$oldPid/comm" 2>/dev/null)"
+    case "$comm" in
+        dump-conf-db)
+            return 1
+            ;;
+        sh|ash|bash|busybox)
+            # Generic shell name — verify the actual command line.
+            cmdline="$(/bin/busybox cat "/proc/$oldPid/cmdline" 2>/dev/null \
+                | /bin/busybox tr '\0' ' ')"
+            case "$cmdline" in
+                *dump-conf-db*) return 1 ;; # live (script under a shell)
+                *) return 0 ;; # PID reused by an unrelated shell
+            esac
+            ;;
+        *)
+            return 0 # PID reused by unrelated process
+            ;;
+    esac
+}
+
+# Reclaim a stale lock atomically.
+#
+# Naive `rm -rf "$lockDir"; mkdir "$lockDir"` is racy: two reclaimers can
+# both observe "stale", both rm, the first one's fresh mkdir gets blown
+# away by the second's rm. mv is atomic at the FS level — only one
+# reclaimer succeeds, the other observes ENOENT and defers.
+reclaim_stale_lock() {
+    deadDir="$lockDir.dead.$$"
+    if /bin/busybox mv "$lockDir" "$deadDir" 2>/dev/null; then
+        /bin/busybox rm -rf "$deadDir"
+        return 0
+    fi
+    return 1
+}
+
+if ! acquire_lock; then
+    if lock_is_stale; then
+        if reclaim_stale_lock; then
+            /bin/busybox logger -t dump-conf-db \
+                "WARN: reclaimed stale lock (pid=$oldPid comm=$comm)"
+            if ! acquire_lock; then
+                /bin/busybox logger -t dump-conf-db \
+                    "INFO: lost acquire race after reclaim, exiting"
+                exit 0
+            fi
+        else
+            # Another reclaimer won the atomic rename — defer to them.
+            /bin/busybox logger -t dump-conf-db \
+                "INFO: stale-reclaim lost to a concurrent reclaimer, exiting"
+            exit 0
+        fi
+    else
+        # A genuine concurrent run owns the lock; skip this tick quietly.
+        exit 0
+    fi
+fi
+
+# Per-PID artefacts — uniquely owned by THIS invocation regardless of races.
+tmpBackup="$backupDir/.tmp.$$.db"
+tmpCompressed="$backupDir/.tmp.$$.db.gz"
+hourlyPart="$backupDir/.publish.$$.hourly.gz"
+dailyPart="$backupDirDay/.publish.$$.daily.gz"
+
+cleanup() {
+    /bin/busybox rm -f "$tmpBackup" "$tmpCompressed" "$hourlyPart" "$dailyPart"
+    /bin/busybox rm -rf "$lockDir"
+}
+trap cleanup EXIT INT TERM HUP
+
+# === Flush WAL into the main file so the .backup snapshot is current ===
+
+/usr/bin/sqlite3 "$dbPath" "PRAGMA wal_checkpoint(TRUNCATE);" >/dev/null 2>&1
+
+# === Create a consistent binary snapshot, capture stderr for diagnostics ===
+
+if ! backupErr="$(/usr/bin/sqlite3 "$dbPath" ".backup '$tmpBackup'" 2>&1 >/dev/null)"; then
+    /bin/busybox logger -t dump-conf-db "ERROR: sqlite3 .backup failed: $backupErr"
+    exit 1
+fi
 
-# Daily backup
-# Check if a backup with the same MD5 checksum already exists
-/usr/bin/md5sum "$backupDirDay"/*_mikopbx.db.gz | /bin/busybox grep "$md5"> /dev/null 2> /dev/null
-result="$?"
+# === Validate snapshot — quick_check is sufficient pre-publish and bounded ===
 
-# If no matching backup is found, create a new backup
-if [ "$result" = "1" ];then
-    /bin/busybox cat "$tmpBackup" | /bin/gzip -c > "$backupDirDay/$(date '+%Y-%m-%d')_mikopbx.db.gz"
-fi;
+quickCheck="$(/usr/bin/sqlite3 "$tmpBackup" "PRAGMA quick_check;" 2>/dev/null | /bin/busybox head -n 1)"
+if [ "$quickCheck" != "ok" ]; then
+    /bin/busybox logger -t dump-conf-db "ERROR: quick_check failed: $quickCheck"
+    exit 1
+fi
+
+# === Content-based deduplication with poisoned-hash recovery ===
+#
+# The hash is computed over the LOGICAL .dump of the snapshot (which is
+# isolated from concurrent writers, so the smearing risk on the live DB
+# is irrelevant here). If, for any reason, every on-disk archive has
+# been removed but .last_hash remains, treat the hash as void and
+# force a backup — otherwise the file would silently disable us forever.
+
+contentHash="$(/usr/bin/sqlite3 "$tmpBackup" .dump | /usr/bin/md5sum | /bin/busybox cut -f 1 -d ' ')"
+if [ -z "$contentHash" ]; then
+    /bin/busybox logger -t dump-conf-db \
+        "WARN: empty content hash — sqlite3 .dump or md5sum returned nothing"
+fi
+lastHashFile="$backupDir/.last_hash"
+prevHash=""
+[ -r "$lastHashFile" ] && prevHash="$(/bin/busybox cat "$lastHashFile" 2>/dev/null)"
+
+archivesPresent=0
+for existing in "$backupDir"/*_mikopbx.db.gz "$backupDirDay"/*_mikopbx.db.gz; do
+    if [ -f "$existing" ]; then
+        archivesPresent=1
+        break
+    fi
+done
+
+if [ "$archivesPresent" = "1" ] \
+    && [ -n "$contentHash" ] \
+    && [ "$contentHash" = "$prevHash" ]; then
+    # Identical content AND we already have at least one archive on disk.
+    exit 0
+fi
+
+# === Compress the snapshot once ===
+
+if ! /bin/gzip -c "$tmpBackup" > "$tmpCompressed"; then
+    /bin/busybox logger -t dump-conf-db "ERROR: gzip failed"
+    exit 1
+fi
+
+# === Atomic publish: stage in same FS as target, then rename ===
+
+# Capture the timestamp once so an unlucky second-boundary crossing
+# can't make the hourly and daily filenames disagree on the date.
+ts_full="$(/bin/busybox date '+%Y-%m-%d_h%H_m%M_s%S')"
+ts_day="${ts_full%%_h*}"
+
+hourlyFile="$backupDir/${ts_full}_mikopbx.db.gz"
+dailyFile="$backupDirDay/${ts_day}_mikopbx.db.gz"
+
+if ! /bin/busybox cp "$tmpCompressed" "$hourlyPart"; then
+    /bin/busybox logger -t dump-conf-db "ERROR: cp $tmpCompressed -> $hourlyPart failed"
+    exit 1
+fi
+if ! /bin/busybox mv "$hourlyPart" "$hourlyFile"; then
+    /bin/busybox logger -t dump-conf-db "ERROR: mv hourly publish failed"
+    exit 1
+fi
+
+if ! /bin/busybox cp "$tmpCompressed" "$dailyPart"; then
+    /bin/busybox logger -t dump-conf-db "ERROR: cp $tmpCompressed -> $dailyPart failed"
+    exit 1
+fi
+if ! /bin/busybox mv "$dailyPart" "$dailyFile"; then
+    /bin/busybox logger -t dump-conf-db "ERROR: mv daily publish failed"
+    exit 1
+fi
+
+# Persist content hash only after both artefacts are visible.
+echo "$contentHash" > "$lastHashFile"
+
+# === Rotation: keep last 5 hourly, last 7 daily ===
 
-# Remove the oldest backups, keeping only the 5 most recent ones
 (
-    cd "$backupDirDay" || exit;
-    /bin/busybox ls -1tr | /bin/busybox head -n-7 | /bin/xargs /bin/busybox rm 2> /dev/null
+    cd "$backupDir" 2>/dev/null || exit 0
+    /bin/busybox ls -1tr 2>/dev/null \
+        | /bin/busybox grep '_mikopbx\.db\.gz$' \
+        | /bin/busybox head -n -5 \
+        | /bin/busybox xargs /bin/busybox rm -f 2>/dev/null
 )
 
-rm -rf "$tmpBackup";
+(
+    cd "$backupDirDay" 2>/dev/null || exit 0
+    /bin/busybox ls -1tr 2>/dev/null \
+        | /bin/busybox grep '_mikopbx\.db\.gz$' \
+        | /bin/busybox head -n -7 \
+        | /bin/busybox xargs /bin/busybox rm -f 2>/dev/null
+)
+
+# Defensive sweep — stale .publish.*.gz left by crashes (own cleanup() handles
+# OUR crash, but a hard kill -9 can still leave files behind).
+/bin/busybox find "$backupDir" "$backupDirDay" -maxdepth 1 -type f \
+    -name '.publish.*.gz' -mmin +60 -delete 2>/dev/null
+
+exit 0
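
The two arbitration claims made by the script (mkdir lets exactly one contender in, and rename lets exactly one reclaimer break a stale lock) can be checked in isolation. The following is a minimal sketch under stated assumptions, not the test harness used for the commit: `try_lock`, `reclaim_lock` and `count_lock_winners` are hypothetical names, plain `mkdir`/`mv` are used instead of `/bin/busybox`, and the lock lives in a throwaway `mktemp -d` directory.

```shell
#!/bin/sh
# Illustrative sketch only; helper names are not part of dump-conf-db.

# try_lock DIR: mkdir is the arbiter, so only one caller can succeed
# while DIR exists. The PID is published via tmp + rename.
try_lock() {
    mkdir "$1" 2>/dev/null || return 1
    echo "$$" > "$1/pid.tmp" && mv "$1/pid.tmp" "$1/pid"
}

# reclaim_lock DIR TAG: rename is the arbiter. A second reclaimer finds
# DIR already gone and fails instead of deleting a freshly created lock.
reclaim_lock() {
    dead="$1.dead.$2"
    mv "$1" "$dead" 2>/dev/null || return 1
    rm -rf "$dead"
}

# count_lock_winners N: start N parallel contenders that never release
# the lock, then print how many of them acquired it.
count_lock_winners() {
    work="$(mktemp -d)" || return 1
    i=0
    while [ "$i" -lt "$1" ]; do
        ( try_lock "$work/lock" && echo won >> "$work/winners" ) &
        i=$((i + 1))
    done
    wait
    winners=0
    [ -f "$work/winners" ] && winners="$(wc -l < "$work/winners")"
    rm -rf "$work"
    # Arithmetic expansion strips any padding that wc may emit.
    echo $((winners))
}
```

Running `count_lock_winners 5` mirrors the "5 parallel runs" contention check from the commit message: because no contender releases the lock, at most one `mkdir` can succeed.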