Commit 62e282c

Enhance readme with DeltaLake optimization steps
Added detailed instructions for optimizing data lakes using DeltaLake functions, including configuration settings, optimization commands, and table detail checks.
1 parent 09eee4c commit 62e282c

1 file changed

Lines changed: 19 additions & 3 deletions

data-platform/open-source-data-platforms/oci-data-flow/code-examples/DeltaLake_Optimize/readme.md

@@ -17,41 +17,57 @@ OCI Data Flow supports Delta Lake by default when your Applications run Spark 3.
 How to optimize data lake using DeltaLake functions:
 Configure your preferences (please check DeltaLake doc):

+```
 spark.conf.set('spark.databricks.delta.retentionDurationCheck.enabled', 'False')
 spark.conf.set('spark.databricks.delta.optimize.repartition.enabled','True')
 spark.conf.set('spark.databricks.delta.optimize.preserveInsertionOrder', 'False')

-Preserve vacuum history:
+# Preserve vacuum history:
 spark.conf.set('spark.databricks.delta.vacuum.logging.enabled','True')

-Set retention time for optimized files (ready to delete):
+# Set retention time for optimized files (ready to delete):
 spark.conf.set("spark.databricks.delta.deletedFileRetentionDuration","0")
+```
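
Review note: the five `spark.conf.set` calls in this hunk could also be kept in one mapping and applied in a loop, which makes the settings easier to audit. The following is a minimal sketch, not part of the commit. The names `DELTA_OPTIMIZE_SETTINGS` and `apply_delta_settings` are hypothetical, and the helper assumes `spark` is an active SparkSession (anything exposing `conf.set(key, value)` works).

```python
# Keys and values copied from the readme snippet in this hunk.
# Values stay strings, matching how the readme passes them to spark.conf.set.
DELTA_OPTIMIZE_SETTINGS = {
    "spark.databricks.delta.retentionDurationCheck.enabled": "False",
    "spark.databricks.delta.optimize.repartition.enabled": "True",
    "spark.databricks.delta.optimize.preserveInsertionOrder": "False",
    "spark.databricks.delta.vacuum.logging.enabled": "True",
    "spark.databricks.delta.deletedFileRetentionDuration": "0",
}


def apply_delta_settings(spark, settings=DELTA_OPTIMIZE_SETTINGS):
    """Apply every key/value pair via spark.conf.set and return the keys applied.

    Hypothetical helper: `spark` is assumed to be an active SparkSession.
    """
    applied = []
    for key, value in settings.items():
        spark.conf.set(key, value)
        applied.append(key)
    return applied
```

In a Data Flow application or notebook this would be called once as `apply_delta_settings(spark)` before running the optimization steps.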


 Check existing table details (look for numFiles and sizeInBytes):
+```
 spark.sql("describe detail atm").show(truncate=False)
+```
+```
 +------+------------------------------------+--------------------------------+-----------+----------------------------------------------------+-----------------------+-------------------+------------------+--------+-----------+----------+----------------+----------------+------------------------+
 |format|id |name |description|location |createdAt |lastModified |partitionColumns |numFiles|sizeInBytes|properties|minReaderVersion|minWriterVersion|tableFeatures |
 +------+------------------------------------+--------------------------------+-----------+----------------------------------------------------+-----------------------+-------------------+------------------+--------+-----------+----------+----------------+----------------+------------------------+
 |delta |c15ad4ca-8c0f-4747-b064-1492d7b4b3c4|spark_catalog.default.hsl_trains|NULL |oci://dataflow_app@fro8fl9kuqli/hsl_trains_data_part|2024-09-05 10:19:10.057|2024-09-06 08:45:01|[year, month, day]|2024 |16333676 |{} |1 |2 |[appendOnly, invariants]|
 +------+------------------------------------+--------------------------------+-----------+----------------------------------------------------+-----------------------+-------------------+------------------+--------+-----------+----------+----------------+----------------+------------------------+
+```

 Run optimization:
+```
 spark.sql("OPTIMIZE atm").show(truncate=False)
+```

 Check files you can delete:
+```
 spark.sql("vacuum atm RETAIN 0 HOURS DRY RUN")
+```

 Delete optimized and consolidated files:
+```
 spark.sql("vacuum atm RETAIN 0 HOURS")
+```
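
Review note: the dry run and the real vacuum differ only by the trailing `DRY RUN` clause, so the two statements can be generated from one place. The sketch below is an illustration under assumptions, not code from the commit: `build_vacuum_sql` and `vacuum_with_preview` are hypothetical names, and `spark` is assumed to be an active SparkSession with Delta Lake enabled. Keep in mind that with the retention check disabled, `RETAIN 0 HOURS` removes every data file that the latest table version no longer references, so time travel to earlier versions stops working afterwards.

```python
import re

# Accepts plain or dot-qualified identifiers such as "atm" or "default.atm".
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$")


def build_vacuum_sql(table, retain_hours=0, dry_run=True):
    """Build a VACUUM statement; dry_run=True only lists the files to delete."""
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unexpected table identifier: {table!r}")
    if retain_hours < 0:
        raise ValueError("retain_hours must not be negative")
    statement = f"VACUUM {table} RETAIN {retain_hours} HOURS"
    if dry_run:
        statement += " DRY RUN"
    return statement


def vacuum_with_preview(spark, table, retain_hours=0):
    """Show the dry-run listing first, then run the real vacuum.

    Hypothetical helper: `spark` is assumed to be an active SparkSession.
    """
    spark.sql(build_vacuum_sql(table, retain_hours, dry_run=True)).show(truncate=False)
    spark.sql(build_vacuum_sql(table, retain_hours, dry_run=False))
```

Defaulting `dry_run` to `True` means a deletion only happens when it is requested explicitly.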

 and check details of your table:
+```
 spark.sql("describe detail atm").show(truncate=False)
+```
+```
 +------+------------------------------------+--------------------------------+-----------+----------------------------------------------------+-----------------------+-------------------+------------------+--------+-----------+----------+----------------+----------------+------------------------+
 |format|id |name |description|location |createdAt |lastModified |partitionColumns |numFiles|sizeInBytes|properties|minReaderVersion|minWriterVersion|tableFeatures |
 +------+------------------------------------+--------------------------------+-----------+----------------------------------------------------+-----------------------+-------------------+------------------+--------+-----------+----------+----------------+----------------+------------------------+
 |delta |c15ad4ca-8c0f-4747-b064-1492d7b4b3c4|spark_catalog.default.hsl_trains|NULL |oci://dataflow_app@fro8fl9kuqli/hsl_trains_data_part|2024-09-05 10:19:10.057|2024-09-06 08:47:48|[year, month, day]|7 |1583521 |{} |1 |2 |[appendOnly, invariants]|
-+------+------------------------------------+--------------------------------+-----------+----------------------------------------------------+-----------------------+-------------------+------------------+--------+-----------+----------+----------------+----------------+------------------------+
++------+------------------------------------+--------------------------------+-----------+----------------------------------------------------+-----------------------+-------------------+------------------+--------+-----------+----------+----------------+----------------+------------------------+
+```

 Enjoy increased performance of your queries!
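
Review note: the two `describe detail` outputs in this hunk report `numFiles` going from 2024 to 7 and `sizeInBytes` from 16333676 to 1583521. A small helper can turn such a before/after pair into a summary. This is a sketch only: `compaction_summary` is a hypothetical name, and it works on plain mappings. In a Spark session the same two fields could presumably be read from the first row returned by `describe detail`.

```python
def compaction_summary(before, after):
    """Compare two DESCRIBE DETAIL snapshots.

    Both arguments are mappings that contain numFiles and sizeInBytes.
    """
    return {
        "files_removed": before["numFiles"] - after["numFiles"],
        "file_count_ratio": before["numFiles"] / after["numFiles"],
        "bytes_saved_fraction": 1 - after["sizeInBytes"] / before["sizeInBytes"],
        "avg_file_bytes_before": before["sizeInBytes"] / before["numFiles"],
        "avg_file_bytes_after": after["sizeInBytes"] / after["numFiles"],
    }


# Values taken from the two table outputs shown in the readme diff.
before = {"numFiles": 2024, "sizeInBytes": 16333676}
after = {"numFiles": 7, "sizeInBytes": 1583521}

summary = compaction_summary(before, after)
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")
```

For these numbers the table ends up with 2017 fewer files and roughly 90% fewer bytes, with a much larger average file size.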
