The BulkLoad feature works in conjunction with :doc:`BulkDump <bulkdump>` to provide a complete data migration solution. BulkLoad takes manifest files and SST files generated by :doc:`BulkDump <bulkdump>` and efficiently loads them into a target FoundationDB cluster.
When a user wants to start a bulkload job, the user provides:
- Job ID: The unique identifier from the corresponding BulkDump job
- Key Range: The range of keys to load (must be within the dumped range)
- Source Path: Either a local directory or blobstore URL containing the dump files
Required Configuration: BulkLoad requires the following server knobs to be enabled:

- ``--knob_shard_encode_location_metadata=1``: Enables shard-aware location metadata
- ``--knob_enable_read_lock_on_range=1``: Enables exclusive range locking during load operations

Input File Structure: BulkLoad expects the input files to be organized as produced by :doc:`BulkDump <bulkdump>`.
Currently, FDBCLI tools and low-level ManagementAPIs are provided to submit or clear a job. Both operations issue transactions that update the bulkload metadata and take or release an exclusive lock on the target range. Submitting a job involves validating the input parameters, taking an exclusive read lock on the target range, and writing job metadata. On submission, the API checks for an ongoing bulkload job or conflicting locks; if either exists, the job is rejected, otherwise it is accepted. Clearing a job releases the range lock and marks the job as cancelled in the metadata.
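
The submit and clear semantics described above can be illustrated with a small in-memory model. This is only a sketch: ``ToyCluster`` and its fields are hypothetical names, not part of FoundationDB, and lock conflicts are simplified to "any lock is held".

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

KeyRange = Tuple[bytes, bytes]  # half-open [begin, end)


@dataclass
class ToyCluster:
    """Hypothetical stand-in for the bulkload metadata and the range lock."""
    active_job: Optional[str] = None
    locked_range: Optional[KeyRange] = None
    job_phase: Dict[str, str] = field(default_factory=dict)

    def submit(self, job_id: str, key_range: KeyRange) -> bool:
        begin, end = key_range
        # Step 1: validate the input parameters.
        if not job_id or begin >= end:
            raise ValueError("invalid job parameters")
        # Step 2: reject if a job is running or a lock is already held.
        if self.active_job is not None or self.locked_range is not None:
            return False
        # Step 3: take the exclusive read lock, then write job metadata.
        self.locked_range = key_range
        self.active_job = job_id
        self.job_phase[job_id] = "submitted"
        return True

    def cancel(self, job_id: str) -> None:
        if self.active_job != job_id:
            return  # nothing to clear
        self.locked_range = None              # release the range lock
        self.active_job = None
        self.job_phase[job_id] = "cancelled"  # mark the job as cancelled
```

In this model a second ``submit`` returns ``False`` until the first job is cancelled, which mirrors the accept/reject behaviour in the paragraph above.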
FDBCLI provides the following commands for these operations:

- Submit a job: ``bulkload load (JobID) (BeginKey) (EndKey) (RootFolder)``, where JobID comes from the BulkDump job and RootFolder is a local directory or blobstore URL
- Clear a job: ``bulkload cancel (JobID)``
- Enable or disable the feature: ``bulkload mode on | off``; the ``bulkload mode`` command prints the current value (on or off)
- Check status: ``bulkload status`` shows information about the currently running job
- View history: ``bulkload history`` shows completed job history

For detailed usage examples and a quickstart guide, see :doc:`bulkload-user`.
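
As an illustration, the command strings listed above could be assembled by a small helper before being pasted into an ``fdbcli`` session. The helper name and the ordering (enable the mode first, then load, then check status) are assumptions made for this sketch; only the command syntax comes from the list above.

```python
from typing import List


def bulkload_cli_commands(job_id: str, begin_key: str, end_key: str,
                          root_folder: str) -> List[str]:
    """Return fdbcli command lines for one load (hypothetical helper)."""
    return [
        "bulkload mode on",  # enable the feature
        f"bulkload load {job_id} {begin_key} {end_key} {root_folder}",
        "bulkload status",   # inspect the running job
    ]


if __name__ == "__main__":
    # Placeholder arguments; substitute the JobID reported by BulkDump.
    for line in bulkload_cli_commands("JOBID", "a", "b", "/tmp/dump"):
        print(line)
```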
ManagementAPI provides the following interfaces for these operations:

- Submit a job: ``submitBulkLoadJob(BulkLoadJobState jobState)``
- Clear a job: ``cancelBulkLoadJob(UID jobId)``
- Enable the feature: ``setBulkLoadMode(int mode)``; set mode = 1 to enable and mode = 0 to disable
- Get job status: ``getBulkLoadJobStatus(Database cx)``

A BulkLoad job proceeds through the following steps:

- BulkLoad job metadata is generated by ``createBulkLoadJob()``
- Users submit a BulkLoad job via ``submitBulkLoadJob()``, specifying the source JobID, target range, and data location
- The API validates the job parameters and checks for conflicting BulkLoad/BulkDump jobs
- An exclusive read lock is taken on the entire target range using ``takeExclusiveReadLockOnRange()``
- Job metadata is persisted to the bulkload job space (``\xff/bulkLoadJob/`` prefix) and the task space is initialized
- DD's ``bulkLoadJobManager()`` detects the new job and downloads the global ``job-manifest.txt`` file
- DD parses the job manifest to build a map of manifest entries by key range
- DD creates BulkLoad tasks by grouping manifest entries (up to ``MANIFEST_COUNT_MAX_PER_BULKLOAD_TASK`` per task)
- Each task is persisted to the task metadata space (``\xff/bulkLoadTask/`` prefix) and triggers data movement
- DD's ``doBulkLoadTask()`` coordinates with the data movement system to load SST files into target shards
- Storage servers receive data movement requests containing BulkLoad task information
- Storage servers download SST files, validate integrity, and apply data using storage engine ingestion
- Tasks complete and are marked as ``BulkLoadPhase::Complete`` or ``BulkLoadPhase::Error``
- When all tasks finish, the job is finalized and moved to job history, and the range lock is released
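
The task-creation step, in which manifest entries are grouped with an upper bound per task, can be sketched as simple chunking over manifest ranges sorted by begin key. The function name, the dictionary layout, and the value ``3`` for the knob are placeholders chosen for illustration; the actual knob value and the DD data structures are not described here.

```python
from typing import Dict, List, Tuple

KeyRange = Tuple[bytes, bytes]  # half-open [begin, end)

# Placeholder value for illustration only; not the real knob default.
MANIFEST_COUNT_MAX_PER_BULKLOAD_TASK = 3


def group_manifests_into_tasks(
    manifest_ranges: List[KeyRange],
    max_per_task: int = MANIFEST_COUNT_MAX_PER_BULKLOAD_TASK,
) -> List[Dict[str, object]]:
    """Chunk sorted manifest ranges into tasks of at most max_per_task."""
    ordered = sorted(manifest_ranges)
    tasks: List[Dict[str, object]] = []
    for i in range(0, len(ordered), max_per_task):
        chunk = ordered[i:i + max_per_task]
        # The task range spans from the first begin key to the last end key.
        tasks.append({"range": (chunk[0][0], chunk[-1][1]),
                      "manifests": chunk})
    return tasks
```

With four contiguous manifests and a limit of three, this yields two tasks: one covering the first three manifests and one covering the last.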
BulkLoad uses FoundationDB's range locking mechanism to ensure data consistency:
- ``registerRangeLockOwner()`` registers the BulkLoad system as a lock owner with name ``"BulkLoad"``
- ``takeExclusiveReadLockOnRange()`` takes an exclusive read lock on the entire job range during ``submitBulkLoadJob()``
- This prevents any concurrent transactions from modifying data in the target range
- Lock-aware transactions can still read from the range during the load process
- The lock is automatically released via ``releaseExclusiveReadLockOnRange()`` when the job completes, is cancelled, or errors
- Range locks are managed through the ``\xff/rangeLock/`` keyspace with owner information in ``\xff/rangeLockOwner/``
- BulkLoad jobs will fail with ``range_lock_reject`` if the target range is already locked by another operation
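
The locking rules above can be modelled with a toy lock table: owners must be registered first, a lock request is rejected if it overlaps an existing lock, and writes inside a locked range are refused. ``ToyRangeLocks`` and ``RangeLockRejected`` are hypothetical names for this sketch and do not reflect how FoundationDB stores lock state.

```python
from typing import List, Set, Tuple


class RangeLockRejected(Exception):
    """Toy analogue of the range_lock_reject error."""


class ToyRangeLocks:
    def __init__(self) -> None:
        self.owners: Set[str] = set()
        self.locks: List[Tuple[bytes, bytes, str]] = []  # (begin, end, owner)

    def register_owner(self, name: str) -> None:
        self.owners.add(name)

    def take_exclusive_read_lock(self, begin: bytes, end: bytes,
                                 owner: str) -> None:
        if owner not in self.owners:
            raise KeyError(f"unregistered lock owner: {owner}")
        for b, e, _ in self.locks:
            if begin < e and b < end:  # half-open ranges overlap
                raise RangeLockRejected("range already locked")
        self.locks.append((begin, end, owner))

    def release(self, begin: bytes, end: bytes, owner: str) -> None:
        self.locks.remove((begin, end, owner))

    def write_allowed(self, key: bytes) -> bool:
        # Writes are refused for any key inside a locked range.
        return not any(b <= key < e for b, e, _ in self.locks)
```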

Jobs and tasks are managed as follows:

- At any time, an FDB cluster accepts at most one bulkload job. ``submitBulkLoadJob()`` checks for existing BulkLoad or BulkDump jobs and rejects with ``bulkload_task_failed()`` if conflicts exist
- DD partitions jobs into tasks where each task contains up to ``MANIFEST_COUNT_MAX_PER_BULKLOAD_TASK`` manifest entries
- Task ranges are determined by the union of all manifest ranges within the job range
- Tasks are assigned to data movement operations that target the appropriate storage servers for each shard
- BulkLoad tasks are tracked in ``BulkLoadTaskCollection`` to coordinate with data movement and prevent shard boundary changes
- Each task validates source data integrity through manifest checksums and range validation
- Tasks complete atomically - either all manifests in a task succeed or the entire task is marked as error
- The job range remains exclusively locked throughout the entire operation until completion or cancellation
- Task metadata persists through DD restarts - incomplete tasks are automatically resumed
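
The all-or-nothing task outcome and the resumption of incomplete tasks can be sketched as follows. The function names and the dictionary-based task store are assumptions for illustration, and the sketch models only how phases are recorded, not how partially ingested data is handled.

```python
from typing import Callable, Dict, List

TERMINAL_PHASES = ("Complete", "Error")


def run_task(manifests: List[str],
             load_manifest: Callable[[str], bool]) -> str:
    """Return "Complete" only if every manifest loads, else "Error"."""
    for manifest in manifests:
        if not load_manifest(manifest):
            return "Error"
    return "Complete"


def resume_incomplete(task_phases: Dict[str, str],
                      tasks: Dict[str, List[str]],
                      load_manifest: Callable[[str], bool]) -> Dict[str, str]:
    """Re-run every task that has no terminal phase recorded."""
    for task_id, manifests in tasks.items():
        if task_phases.get(task_id) not in TERMINAL_PHASES:
            task_phases[task_id] = run_task(manifests, load_manifest)
    return task_phases
```

Because phases are kept in a persistent map (here a plain dictionary), a restarted coordinator can call ``resume_incomplete`` and skip tasks that already reached a terminal phase.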

BulkLoad validates data and preserves consistency as follows:

- Manifest Validation: Task ranges are validated against source manifest files using ``getBulkLoadManifestMetadataFromEntry()``
- Job Coverage Validation: The job range must be entirely covered by the source dataset or the job fails with ``bulkload_dataset_not_cover_required_range()``
- Task Atomicity: Each task either completes entirely or fails; partial task completion is not supported
- SST File Integrity: Storage engines validate SST file integrity during ingestion
- Range Alignment: Task ranges are aligned with shard boundaries and manifest boundaries
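
The job coverage check can be illustrated with a sweep over manifest ranges sorted by begin key: the job range is covered only if there is no gap between its begin and end keys. This is a minimal sketch assuming half-open ``[begin, end)`` byte ranges and a non-empty job range; the function name is hypothetical and the real check may differ.

```python
from typing import List, Tuple

KeyRange = Tuple[bytes, bytes]  # half-open [begin, end)


def dataset_covers_range(manifest_ranges: List[KeyRange],
                         job_range: KeyRange) -> bool:
    """Return True if the manifests leave no gap inside job_range."""
    job_begin, job_end = job_range
    covered_to = job_begin
    for begin, end in sorted(manifest_ranges):
        if begin > covered_to:
            return False  # gap between covered_to and this manifest
        covered_to = max(covered_to, end)
        if covered_to >= job_end:
            return True
    return covered_to >= job_end
```

A job would be rejected in the situation where this function returns ``False``, which corresponds to the ``bulkload_dataset_not_cover_required_range()`` failure described above.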

Failures and conflicts are handled as follows:

- DD Restart: Tasks persist through DD restarts via ``\xff/bulkLoadTask/`` metadata and are automatically resumed
- Task Retry: Failed tasks are retried automatically by the BulkLoad engine up to configured limits
- Job Cancellation: ``cancelBulkLoadJob()`` clears all metadata and releases range locks immediately
- Data Movement Conflicts: Tasks coordinate with the data movement system through ``BulkLoadTaskCollection`` to handle shard reassignments
- Lock Conflicts: Jobs fail immediately with ``range_lock_reject`` if the target range is already locked
- Manifest Download Failures: Network/S3 failures during manifest download cause the job to error and move to history
- Task Error Handling: Individual task failures are marked as ``BulkLoadPhase::Error`` and can be acknowledged by users
- Range Coverage Failures: Jobs fail with ``bulkload_dataset_not_cover_required_range()`` if source data doesn't cover the requested range
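
Bounded retry of a failed task can be sketched as a loop that gives up after a fixed number of extra attempts and then records the error phase. The ``max_retries`` parameter and the function name are placeholders; the actual retry limits are configured elsewhere and are not specified in this document.

```python
from typing import Callable


def run_with_retries(attempt_task: Callable[[], bool],
                     max_retries: int) -> str:
    """Try once, then retry up to max_retries times before giving up."""
    for _ in range(max_retries + 1):
        if attempt_task():
            return "Complete"
    return "Error"  # retries exhausted; the task stays in the error phase
```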

Performance considerations:

- Parallelism: Controlled by the ``DD_BULKLOAD_PARALLELISM`` knob for DD-level parallelism
- Storage Server Load: Each storage server handles one bulkload task at a time
- Network Bandwidth: Large SST files may saturate network bandwidth during downloads
- Storage Engine Impact: Direct SST ingestion bypasses normal write paths for better performance
- Memory Usage: SST files are typically loaded into memory for validation before application
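
The effect of a parallelism limit can be illustrated with a bounded worker pool that never runs more than a fixed number of tasks at once. This is a generic sketch using Python threads, not the DD scheduling logic; the value ``2`` is a placeholder and not the real default of ``DD_BULKLOAD_PARALLELISM``.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

DD_BULKLOAD_PARALLELISM = 2  # placeholder value for illustration only


def run_tasks_with_limit(
    tasks: List[Callable[[], int]],
    parallelism: int = DD_BULKLOAD_PARALLELISM,
) -> Tuple[List[int], int]:
    """Run tasks on a bounded pool; return results and peak concurrency."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def wrapped(task: Callable[[], int]) -> int:
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        try:
            return task()
        finally:
            with lock:
                running -= 1

    # The pool size caps how many tasks execute concurrently.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = list(pool.map(wrapped, tasks))
    return results, peak
```

The returned peak never exceeds ``parallelism``, which is the property a parallelism knob is meant to enforce.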