diff --git a/documentation/operations/backup.md b/documentation/operations/backup.md index 08bfe029f..d634b2ad2 100644 --- a/documentation/operations/backup.md +++ b/documentation/operations/backup.md @@ -119,58 +119,6 @@ specifies a directory for atomic write operations during backup. | `backup.enable.partition.hashes` | Compute BLAKE3 hashes during backup | `false` | | `backup.verify.partition.hashes` | Verify hashes during restore | `false` | -### Run a backup - -Once configured, you can run a backup at any time using the following command: - -```questdb-sql title="Backup database" -BACKUP DATABASE; -``` - -Example output: - -| backup_timestamp | -| ----------------------------- | -| 2024-08-24T12:34:56.789123Z | - -The backup captures the committed database state at the moment the command -executes. In-flight transactions are not included. - -### Monitor and abort - -You can monitor backup progress and history using the `backups()` table function: - -```questdb-sql title="Backup history" -SELECT * FROM backups(); -``` - -Example output: - -| status | progress_percent | start_ts | end_ts | backup_error | cleanup_error | -|---------------------|------------------|-----------------------------|-----------------------------|------------------|---------------| -| backup complete | 100 | 2025-07-30T12:49:30.554262Z | 2025-07-30T16:19:48.554262Z | | | -| backup complete | 100 | 2025-08-06T14:15:22.882130Z | 2025-08-06T17:09:57.882130Z | | | -| backup failed | 35 | 2025-08-20T11:58:03.675219Z | 2025-08-20T12:14:07.675219Z | connection error | | -| backup in progress | 10 | 2025-08-27T15:42:18.281907Z | | | | -| cleanup in progress | 100 | 2025-08-13T13:37:41.103729Z | 2025-08-13T16:44:25.103729Z | | | - -Status values: - -| Status | Meaning | Action | -|-----------------------|----------------------------------|---------------------------------| -| `backup in progress` | Backup is currently running | Wait or run `BACKUP ABORT` | -| `backup complete` | Backup finished 
successfully | None required | -| `backup failed` | Backup encountered an error | Check `backup_error` column | -| `cleanup in progress` | Old backup data is being removed | Wait for completion | -| `cleanup complete` | Cleanup finished successfully | None required | -| `cleanup failed` | Cleanup encountered an error | Check `cleanup_error` column | - -To abort a running backup: - -```questdb-sql title="Abort backup" -BACKUP ABORT; -``` - ### Scheduled backups You can configure automatic scheduled backups using cron syntax. The example @@ -231,157 +179,58 @@ SELECT reload_config(); You can also use this to enable and disable the schedule by adding or commenting out the `backup.schedule.cron` config setting. +### Run a backup -### Backup instance name - -Each QuestDB instance has a backup instance name (three random words like -`gentle-forest-orchid`). This name is generated on the first backup and -organizes backups in the object store under `backup//`. - -To find your instance name, run: +Once configured, you can run a backup at any time using the following command: -```questdb-sql -SELECT backup_instance_name; +```questdb-sql title="Backup database" +BACKUP DATABASE; ``` -Returns `null` if no backup has been run yet. - -### Replication WAL cleanup integration - -When replication is enabled, the -[WAL cleaner](/docs/high-availability/wal-cleanup/) uses backup manifests to -determine which replicated WAL data in object storage can be safely deleted. -By default, the cleaner retains replication data for as many backups as your -[`backup.cleanup.keep.latest.n`](#backup-retention) setting (default 5) and -deletes everything older. No additional configuration is required — enabling -backups on a replicated instance is sufficient. - -### Performance characteristics - -Backup is designed to prioritize database availability over backup speed. 
Key -characteristics: - -- **Pressure-sensitive**: Backup automatically throttles itself to avoid - overwhelming the database instance during normal operations -- **Batch uploads**: Data uploads in batches rather than continuously - you may - see surges of activity followed by quieter periods in logs -- **Compressed**: Data is compressed before upload to reduce transfer time and - storage costs -- **Multi-threaded**: Backup uses multiple threads but is deliberately - throttled to maintain instance reliability - -Backup duration depends on data size. Large databases (1TB+) may take several -hours for a full initial backup. Subsequent incremental backups are faster as -only changed data is uploaded. - -### Estimate backup storage - -A safe estimate for total backup storage is **2× your uncompressed database -size on disk**. This provides headroom for the full backup plus incremental -history and edge cases. - -#### How storage accumulates - -| Backup type | What's uploaded | Estimated size | -|-------------|-----------------|----------------| -| Initial (full) | Entire database | DB size ÷ 4 (default compression) | -| Incremental | Changed partitions only | Changed data ÷ 4 | - -Total storage = full backup + (average incremental × retention count) - -The default compression level (5) achieves approximately 4× reduction. Higher -`backup.compression.level` values (up to 22) improve compression at the cost -of CPU time. - -#### Partition-level granularity - -Partitions are the smallest backup unit. Any modification to a partition—even -a single row or column update—causes the entire partition to be re-uploaded in -the next incremental backup. - -This means: - -- **Append-only workloads** (typical time-series): Very efficient. Only the - latest partition changes between backups. -- **Cross-partition updates**: Less efficient. An `UPDATE` without a - constraining `WHERE` clause touches all partitions, causing them all to be - re-uploaded. 
-- **Schema changes**: Column type changes cause affected partitions to be - re-uploaded. - -#### Example calculation - -A 500 GB database with daily backups, 7-day retention, and ~5% daily change: - -| Component | Calculation | Size | -|-----------|-------------|------| -| Full backup | 500 GB ÷ 4 | 125 GB | -| Daily incremental | 25 GB ÷ 4 | ~6 GB | -| 7 incrementals | 6 GB × 7 | ~42 GB | -| **Total** | | **~170 GB** | - -In this example, actual usage (~170 GB) is well under the 2× planning estimate -(1 TB). The 2× rule is intentionally conservative—use it for initial capacity -planning before you know your actual change patterns, then refine based on -observed usage. - -#### Check actual usage +Example output: -To verify your estimates against actual storage, browse your backup data in -the object store. Backups are stored under `backup//`. +| backup_timestamp | +| ----------------------------- | +| 2024-08-24T12:34:56.789123Z | -To find your instance name, see [Backup instance name](#backup-instance-name). +The backup captures the committed database state at the moment the command +executes. In-flight transactions are not included. -### Limitations +### Monitor and abort -- **Database-wide only**: Backup captures the entire database. You cannot - exclude tables or backup selected tables individually. Every backup includes - all user tables, materialized views, and metadata. -- **One backup at a time**: Only one backup can run at any given time. Starting - a new backup while one is running will return an error. -- **Primary and replica backups are separate**: Each QuestDB instance has its - own [`backup_instance_name`](#backup-instance-name), so backing up both - a primary and its replica creates two separate backup sets in the object - store. Typically, backing up the primary is sufficient since replicas sync - from the same data. 
-- **Same backup object store for all nodes**: When using replication, all - nodes in the cluster should use the same `backup.object.store` connection - string. The [WAL cleaner](/docs/high-availability/wal-cleanup/) reads - backup manifests from every node to determine what replication data can be - safely deleted. If nodes back up to different object stores, the cleaner - cannot see all manifests and will not trigger correctly. +You can monitor backup progress and history using the `backups()` table function: -### Backup validation +```questdb-sql title="Backup history" +SELECT * FROM backups(); +``` -Backup integrity is verified during restore, not as a standalone operation. +Example output: -#### Verification during restore +| status | progress_percent | start_ts | end_ts | backup_error | cleanup_error | +|---------------------|------------------|-----------------------------|-----------------------------|------------------|---------------| +| backup complete | 100 | 2025-07-30T12:49:30.554262Z | 2025-07-30T16:19:48.554262Z | | | +| backup complete | 100 | 2025-08-06T14:15:22.882130Z | 2025-08-06T17:09:57.882130Z | | | +| backup failed | 35 | 2025-08-20T11:58:03.675219Z | 2025-08-20T12:14:07.675219Z | connection error | | +| backup in progress | 10 | 2025-08-27T15:42:18.281907Z | | | | +| cleanup in progress | 100 | 2025-08-13T13:37:41.103729Z | 2025-08-13T16:44:25.103729Z | | | -QuestDB performs the following checks when restoring: +Status values: -- **Transaction log verification**: Header, hash, and size validation of - transaction log entries (always enabled) -- **Partition hash verification**: Optional BLAKE3 hash comparison for each - file in every partition -- **Manifest validation**: Version compatibility and path safety checks +| Status | Meaning | Action | +|-----------------------|----------------------------------|---------------------------------| +| `backup in progress` | Backup is currently running | Wait or run `BACKUP ABORT` | +| `backup 
complete` | Backup finished successfully | None required | +| `backup failed` | Backup encountered an error | Check `backup_error` column | +| `cleanup in progress` | Old backup data is being removed | Wait for completion | +| `cleanup complete` | Cleanup finished successfully | None required | +| `cleanup failed` | Cleanup encountered an error | Check `cleanup_error` column | -To enable partition hash verification, set these properties in `server.conf`: +To abort a running backup: -```conf -backup.enable.partition.hashes=true # Compute hashes during backup -backup.verify.partition.hashes=true # Verify hashes during restore +```questdb-sql title="Abort backup" +BACKUP ABORT; ``` -If verification fails, restore stops immediately with an error such as: -`hash mismatch [path=col1.d, expected=..., actual=...]` - -#### What's not available - -- No standalone `VALIDATE BACKUP` command -- No dry-run restore option -- Object store integrity relies on the storage provider (e.g., S3's built-in - checksums) - ### Restore Restore is fast—approximately 1.8 TiB can be restored in under 20 minutes, @@ -440,8 +289,8 @@ restore operations. ::: -The QuestDB version performing the restore must have the same major version as -the version that created the backup (e.g., 8.1.0 and 8.1.1 are compatible). +The QuestDB version performing the restore must be the same as or newer than +the version that created the backup. Restart QuestDB. If restore succeeds, `_backup_restore` is removed automatically. @@ -464,6 +313,31 @@ To recover from a failed restore: If you see the error "Failed restore directory found", a previous restore attempt failed. Remove the artifacts listed above before restarting. +#### Backup validation + +Backup integrity is verified during restore, not as a standalone operation. 
+QuestDB performs the following checks when restoring:
+
+- **Transaction log verification**: Header, hash, and size validation of
+  transaction log entries (always enabled)
+- **Partition hash verification**: Optional BLAKE3 hash comparison for each
+  file in every partition
+- **Manifest validation**: Version compatibility and path safety checks
+
+To enable partition hash verification, set these properties in `server.conf`:
+
+```conf
+backup.enable.partition.hashes=true # Compute hashes during backup
+backup.verify.partition.hashes=true # Verify hashes during restore
+```
+
+If verification fails, restore stops immediately with an error such as:
+`hash mismatch [path=col1.d, expected=..., actual=...]`
+
+There is no standalone `VALIDATE BACKUP` command or dry-run restore option.
+Object store integrity relies on the storage provider (e.g., S3's built-in
+checksums).
+
 ### Create a replica from a backup
 
 You can use a backup to bootstrap a new replica instance instead of relying
@@ -497,6 +371,147 @@ more recent than the oldest available WAL data.
 
 For more details on replication setup, see the
 [replication guide](/docs/high-availability/setup/).
 
+### Limitations
+
+- **Database-wide only**: Backup captures the entire database. You cannot
+  exclude tables or back up selected tables individually. Every backup includes
+  all user tables, materialized views, and metadata.
+- **One backup at a time**: Only one backup can run at any given time. Starting
+  a new backup while one is running will return an error.
+- **Primary and replica backups are separate**: Each QuestDB instance has its
+  own [`backup_instance_name`](#backup-instance-name), so backing up both
+  a primary and its replica creates two separate backup sets in the object
+  store. Typically, backing up the primary is sufficient since replicas sync
+  from the same data.
+- **Same backup object store for all nodes**: When using replication, all
+  nodes in the cluster should use the same `backup.object.store` connection
+  string. The [WAL cleaner](/docs/high-availability/wal-cleanup/) reads
+  backup manifests from every node to determine what replication data can be
+  safely deleted. If nodes back up to different object stores, the cleaner
+  cannot see all manifests and will not trigger correctly.
+
+### Performance characteristics
+
+Backup is designed to prioritize database availability over backup speed. Key
+characteristics:
+
+- **Pressure-sensitive**: Backup automatically throttles itself to avoid
+  overwhelming the database instance during normal operations
+- **Batch uploads**: Data uploads in batches rather than continuously; you may
+  see surges of activity followed by quieter periods in the logs
+- **Compressed**: Data is compressed before upload to reduce transfer time and
+  storage costs
+- **Multi-threaded**: Backup uses multiple threads but is deliberately
+  throttled to maintain instance reliability
+
+Backup duration depends on data size. Large databases (1 TB+) may take several
+hours for a full initial backup. Subsequent incremental backups are faster as
+only changed data is uploaded.
+
+### Estimate backup storage
+
+A safe estimate for total backup storage is **2× your uncompressed database
+size on disk**. This provides headroom for the full backup plus incremental
+history and edge cases.
+
+#### How storage accumulates
+
+| Backup type | What's uploaded | Estimated size |
+|-------------|-----------------|----------------|
+| Initial (full) | Entire database | DB size ÷ 4 (default compression) |
+| Incremental | Changed partitions only | Changed data ÷ 4 |
+
+Total storage = full backup + (average incremental × retention count)
+
+The default compression level (5) achieves approximately 4× reduction. Higher
+`backup.compression.level` values (up to 22) improve compression at the cost
+of CPU time.
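+
+As a back-of-envelope check, the storage formula above can be sketched in
+Python. The inputs below are illustrative assumptions, not measured values;
+substitute your own database size, change rate, and retention setting:
+
+```python
+# Hypothetical workload -- adjust to your own database and change patterns.
+db_size_gb = 500          # uncompressed database size on disk
+compression_ratio = 4     # default compression level (5) gives ~4x reduction
+daily_change_rate = 0.05  # fraction of data modified per day
+retention = 7             # backup.cleanup.keep.latest.n
+
+full_backup_gb = db_size_gb / compression_ratio
+incremental_gb = db_size_gb * daily_change_rate / compression_ratio
+total_gb = full_backup_gb + incremental_gb * retention
+# 125 + 6.25 * 7 = 168.75 GB, matching the worked example later in this
+# section and sitting well under the 2x planning estimate (1 TB).
+```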
+ +#### Partition-level granularity + +Partitions are the smallest backup unit. Any modification to a partition—even +a single row or column update—causes the entire partition to be re-uploaded in +the next incremental backup. + +This means: + +- **Append-only workloads** (typical time-series): Very efficient. Only the + latest partition changes between backups. +- **Cross-partition updates**: Less efficient. An `UPDATE` without a + constraining `WHERE` clause touches all partitions, causing them all to be + re-uploaded. +- **Schema changes**: Column type changes cause affected partitions to be + re-uploaded. + +#### Example calculation + +A 500 GB database with daily backups, 7-day retention, and ~5% daily change: + +| Component | Calculation | Size | +|-----------|-------------|------| +| Full backup | 500 GB ÷ 4 | 125 GB | +| Daily incremental | 25 GB ÷ 4 | ~6 GB | +| 7 incrementals | 6 GB × 7 | ~42 GB | +| **Total** | | **~170 GB** | + +In this example, actual usage (~170 GB) is well under the 2× planning estimate +(1 TB). The 2× rule is intentionally conservative—use it for initial capacity +planning before you know your actual change patterns, then refine based on +observed usage. + +#### Check actual usage + +To verify your estimates against actual storage, browse your backup data in +the object store. Backups are stored under `backup//`. + +To find your instance name, see [Backup instance name](#backup-instance-name). + +### How incremental backup works + +Each backup uploads only partitions that changed since the previous backup, +along with a **manifest** that lists every partition the backup consists of. +The manifest is a complete description of the database state at that point in +time. + +There is no separate "full" or "base" backup. The first backup uploads all +partitions because every partition counts as "changed" relative to the +non-existent previous backup. 
Subsequent backups upload only the partitions +that actually changed, reusing previously uploaded partition data for +everything else. + +**Restore** reads the manifest from the selected backup timestamp and +downloads only the partitions listed in it. Each backup is independently +restorable without replaying a chain of increments. + +**Cleanup** runs after a backup completes and removes partition data that is +no longer referenced by any backup within the retention window (controlled by +[`backup.cleanup.keep.latest.n`](#backup-retention)). Partitions still +referenced by at least one retained backup are preserved. + +### Backup instance name + +Each QuestDB instance has a backup instance name (three random words like +`gentle-forest-orchid`). This name is generated on the first backup and +organizes backups in the object store under `backup//`. + +To find your instance name, run: + +```questdb-sql +SELECT backup_instance_name; +``` + +Returns `null` if no backup has been run yet. + +### Replication WAL cleanup integration + +When replication is enabled, the +[WAL cleaner](/docs/high-availability/wal-cleanup/) uses backup manifests to +determine which replicated WAL data in object storage can be safely deleted. +By default, the cleaner retains replication data for as many backups as your +[`backup.cleanup.keep.latest.n`](#backup-retention) setting (default 5) and +deletes everything older. No additional configuration is required — enabling +backups on a replicated instance is sufficient. + ### Troubleshooting If you encounter errors during backup or restore: @@ -680,9 +695,8 @@ Follow these steps: #### Database versions -Restoring data is only possible if the backup and restore QuestDB versions have -the same major version number, for example: `8.1.0` and `8.1.1` are compatible. -`8.1.0` and `7.5.1` are not compatible. +The QuestDB version performing the restore must be the same as or newer than +the version that created the backup. 
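+
+To compare the two versions, you can query the server performing the restore
+for its build information (assuming your QuestDB version provides the
+`build()` meta function):
+
+```questdb-sql title="Check server version"
+SELECT build();
+```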
#### Restore the root directory @@ -741,4 +755,4 @@ possible outcomes: ## Further reading - [`BACKUP` SQL reference](/docs/query/sql/backup/) - Enterprise backup command syntax -- [`CHECKPOINT` SQL reference](/docs/query/sql/checkpoint/) - OSS checkpoint command syntax +- [`CHECKPOINT` SQL reference](/docs/query/sql/checkpoint/) - OSS checkpoint command syntax \ No newline at end of file