Table Properties - Bulk Import

The following table properties relate to bulk import, i.e. ingesting data using Spark jobs running on EMR or EKS.

Property Name	Description	Default Value
sleeper.table.bulk.import.emr.instance.architecture	(Non-persistent EMR mode only) The architecture to use for EC2 instance types in the EMR cluster. Must be one of "x86_64", "arm64" or "x86_64,arm64". For more information, see the Bulk import using EMR - Instance types section in docs/usage/bulk-import.md	arm64
sleeper.table.bulk.import.emr.master.x86.instance.types	(Non-persistent EMR mode only) The EC2 x86_64 instance types and weights to be used for the master node of the EMR cluster. For more information, see the Bulk import using EMR - Instance types section in docs/usage/bulk-import.md	m7i.xlarge
sleeper.table.bulk.import.emr.executor.x86.instance.types	(Non-persistent EMR mode only) The EC2 x86_64 instance types and weights to be used for the executor nodes of the EMR cluster. For more information, see the Bulk import using EMR - Instance types section in docs/usage/bulk-import.md	m7i.4xlarge
sleeper.table.bulk.import.emr.master.arm.instance.types	(Non-persistent EMR mode only) The EC2 ARM64 instance types and weights to be used for the master node of the EMR cluster. For more information, see the Bulk import using EMR - Instance types section in docs/usage/bulk-import.md	m7g.xlarge
sleeper.table.bulk.import.emr.executor.arm.instance.types	(Non-persistent EMR mode only) The EC2 ARM64 instance types and weights to be used for the executor nodes of the EMR cluster. For more information, see the Bulk import using EMR - Instance types section in docs/usage/bulk-import.md	m7g.4xlarge
sleeper.table.bulk.import.emr.executor.market.type	(Non-persistent EMR mode only) The purchasing option to be used for the executor nodes of the EMR cluster. Valid values are ON_DEMAND or SPOT.	SPOT
sleeper.table.bulk.import.emr.executor.initial.capacity	(Non-persistent EMR mode only) The initial number of capacity units to provision as EC2 instances for executors in the EMR cluster. This is measured in instance fleet capacity units. These are declared alongside the requested instance types, as each type will count for a certain number of units. By default the units are the number of instances. This value overrides the default value in the instance properties. It can be overridden by a value in the bulk import job specification.	2
sleeper.table.bulk.import.emr.executor.max.capacity	(Non-persistent EMR mode only) The maximum number of capacity units to provision as EC2 instances for executors in the EMR cluster. This is measured in instance fleet capacity units. These are declared alongside the requested instance types, as each type will count for a certain number of units. By default the units are the number of instances. This value overrides the default value in the instance properties. It can be overridden by a value in the bulk import job specification.	10
sleeper.table.bulk.import.emr.release.label	(Non-persistent EMR mode only) The EMR release label to be used when creating an EMR cluster for bulk importing data using Spark running on EMR. This value overrides the default value in the instance properties. It can be overridden by a value in the bulk import job specification.	emr-7.12.0
sleeper.table.bulk.import.min.leaf.partitions	Specifies the minimum number of leaf partitions that are needed to run a bulk import job. If this minimum has not been reached, bulk import jobs will refuse to start.	256
sleeper.table.bulk.import.partition.splitting.attempts	Specifies the number of times bulk import tries to create leaf partitions to meet the minimum number of leaf partitions. An attempt is retried if another process splits the same partitions at the same time.	3
sleeper.table.bulk.import.job.files.commit.async	If true, bulk import will add files via requests sent to the state store committer lambda asynchronously. If false, bulk import will commit new files at the end of the job synchronously. This is only applied if async commits are enabled for the table. The default value is set in an instance property.	true
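
Several descriptions above state that a table property overrides the default in the instance properties and can itself be overridden in the bulk import job specification. The sketch below illustrates that precedence in Python, together with a check of the values listed for the instance architecture property. It is not Sleeper's implementation: the function names are hypothetical, and modelling the job specification overrides as a plain dictionary keyed by the table property name is an assumption made for illustration.

```python
from typing import Mapping, Optional

# Property name copied from the table above.
INITIAL_CAPACITY = "sleeper.table.bulk.import.emr.executor.initial.capacity"

# Values the table lists as valid for
# sleeper.table.bulk.import.emr.instance.architecture.
VALID_ARCHITECTURES = {"x86_64", "arm64", "x86_64,arm64"}


def resolve_setting(
    name: str,
    job_spec_overrides: Mapping[str, str],
    table_properties: Mapping[str, str],
    instance_default: Optional[str],
) -> Optional[str]:
    """Pick the effective value for one setting (hypothetical helper).

    Precedence, as described in the table: job specification first,
    then the table property, then the instance property default.
    """
    if name in job_spec_overrides:
        return job_spec_overrides[name]
    if name in table_properties:
        return table_properties[name]
    return instance_default


def validate_architecture(value: str) -> str:
    """Return the value unchanged if it is one of the listed architectures."""
    if value not in VALID_ARCHITECTURES:
        raise ValueError(
            f"Unsupported architecture {value!r}; "
            f"expected one of {sorted(VALID_ARCHITECTURES)}"
        )
    return value


if __name__ == "__main__":
    # Assumed inputs: the table sets 4 units, the job asks for 8.
    table_props = {INITIAL_CAPACITY: "4"}
    job_overrides = {INITIAL_CAPACITY: "8"}
    print(resolve_setting(INITIAL_CAPACITY, job_overrides, table_props, "2"))
    print(validate_architecture("arm64"))
```

With these assumed inputs the job specification value is chosen. Removing the entry from `job_overrides` falls back to the table property, and removing it from `table_props` as well falls back to the instance default passed as the last argument.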