This document outlines why this utility is structured the way it is and how the different components collaborate to provide a failsafe environment.
In a standard daisy-chain cascading backup (1TB -> 2TB -> 4TB), using rsync --delete creates an identical mirror. This introduces two severe flaws:
- Silent Overwrites: If a file becomes silently corrupted on the source drive (e.g., bit-rot),
rsyncdetects a difference and faithfully overwrites the healthy backup downstream. This destroys the only good copy of the data. - Nested Redundancy: If you cascade A into B, and then sync B into C, you end up with C holding copies of both B and awkwardly nested copies of A (
/mnt/4tb/2tb_backup/1tb_backup).
To prevent redundant nesting, we utilize a Hub & Spoke mapping model configured in config.env. The 1TB backups reside independently on both the 2TB and the 4TB. To enforce this, the auditor_config.json natively excludes the 1tb_backup/ directory when syncing data from the 2TB to the 4TB, ensuring flat, independent archives.
Instead of directly overwriting or deleting files on the destination drive, core_sync.sh utilizes rsync's built-in --backup and --backup-dir flags.
How it works:
- Let's say
/mnt/1tb/photo.jpgis corrupted. core_syncdetects a difference between/mnt/1tb/photo.jpgand/mnt/2tb/1tb_backup/photo.jpg.- Before overwriting the 2TB version,
rsyncmoves the healthy 2TB version into a timestamped directory:/mnt/2tb/1tb_backup/Archive/2026-02-.../photo.jpg. - It then copies the corrupted 1TB version to the main pool on the 2TB drive.
This means you always retain the historical state of files prior to modification or deletion, safely tucked away in the Archive folder.
To solve the silent corruption issue before it gets synced downstream, we've implemented an intelligent SQLite hashing auditor.
Hashing 4TB of data takes hours. We need it to be fast.
- Unchanged Files: The auditor checks the file's Size and Modified Time (
mtime) against the SQLite DB. If they remain the same, it assumes the file hasn't changed and skips the expensive hashing process. - True Corruption (Bit-Rot): If the Size and
mtimeare identical to the Database, BUT a random or forced re-hash reveals theSHA-256signature changed, the auditor flags this as CRITICALLY CORRUPTED. - Intentional Edits: If the Size or
mtimechanged, the auditor recalculates the hash and updates the DB, assuming you intentionally edited the file. - Renames/Moves: If a "new" file appears, the auditor hashes it and checks the DB. If it finds the exact same hash and size under an older, missing filename, it intelligently updates the path without creating duplicate noise.
The Python auditor utilizes a lightweight SQLite3 database to map the state of your drive efficiently.
| Column Name | Type | Description |
|---|---|---|
id |
INTEGER | Primary Auto-Increment Key. |
file_path |
TEXT | Absolute path to the file on the drive (UNIQUE). |
file_name |
TEXT | The basename of the file. |
file_size |
INTEGER | Size of the file in bytes. |
mtime |
REAL | Modification timestamp of the file. |
sha256_hash |
TEXT | The calculated SHA-256 integrity hash of the file. |
last_seen |
REAL | Timestamp of the last time the auditor verified this file existed. |
status |
TEXT | Tracks file state. Values: active, missing, corrupted. |
- Single-Threaded Writing: SQLite locks the database file during writes. While read queries are fast, the current python script executes single-threaded directory walks.
- Absolute vs Relative Paths: The database tracks absolute paths (
/mnt/1TB/images/...). If you mount a drive to a different mount point (e.g.,/media/user/1TB), the database will treat all files as missing and new. Always use themain.shwrapper to ensure drives are mounted to consistent paths as defined in yourconfig.env. - Mass Deletions via
last_seen: The auditor doesn't actively track folder deletions in real-time. Instead, it sweeps the database at the end of every run. Any active file that belongs to the current target directory but has alast_seentime predating the start of the audit is marked asmissing.