
Commit 5e788ba

added blog about delta lake storage
1 parent 6af7000 commit 5e788ba

4 files changed

Lines changed: 263 additions & 6 deletions

Lines changed: 246 additions & 0 deletions
@@ -0,0 +1,246 @@
---
title: "Delta Lake: An Introduction to Trustworthy Data Storage"
authors: [Aditya-Singh-Rathore]
sidebar_label: "Delta Lake Storage"
tags: [deltalake, storage, Big Data, cloud, Data Engineering, fabric]
date: 2026-05-01

description: Delta Lake is an open-source storage framework that enables building a format-agnostic Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, Hive, Snowflake, Google BigQuery, Athena, Redshift, Databricks, and Microsoft Fabric, with APIs for Scala, Java, Rust, and Python. With Delta Universal Format (UniForm), you can now read Delta tables with Iceberg and Hudi clients.

draft: false
canonical_url:
---
<!-- truncate -->

## There Is Something Wrong With Your Data Lake

Imagine this: your firm receives hundreds of records per hour, be it users signing up for an account, making purchases, or using your mobile application. You store all these records in a data lake hosted on the cloud. So far, so good.

Now imagine something goes wrong. Two pipelines write to the same table simultaneously and overwrite each other, and half of your data is gone. No one notices until the weekly report looks off.

This kind of failure is common with traditional data lakes. They were created to solve a different problem: storing information cheaply at scale, not guaranteeing its reliability.

And that's exactly what **Delta Lake** is designed to solve.

![delta-lake](./Images/delta-lakepng.png)

## What is Delta Lake, in Plain English?

Think of a traditional data lake as a folder in Google Drive where anyone can edit or delete anything, with no audit trail and no version history.

Now imagine that same folder was:

1. Version-controlled, so it could be rolled back to any previous state
2. Guaranteed to have a clean, consistent schema
3. Structured so that bad data can't get stored in the first place
4. Safe against race conditions when used by multiple writers

That folder would be a Delta Lake. It sits on top of the storage your organization already uses and makes all of those promises without asking you to move to new storage infrastructure.

## The Four Unique Features of Delta Lake

### 1. ACID Transactions: Corruption-Free Data!

ACID stands for `Atomicity`, `Consistency`, `Isolation`, and `Durability`. You don't have to memorize the terminology, but it is worth understanding what it buys you.

Delta Lake guarantees that when two processes try to modify the same dataset at the same time, neither one overwrites the other's changes. Each commit either goes through cleanly or waits its turn, like a queue at the cashier, so the data stays consistent.

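To make that concrete, here is a minimal, hypothetical sketch (the path and data are made up, and it assumes the Delta-enabled SparkSession configured later in this post) of two pipelines appending to the same Delta table. Each append is committed as its own transaction through the log, so neither write silently clobbers the other.

```python
# Two pipelines append to the same Delta table; Delta's optimistic
# concurrency control commits each append as a separate transaction
# in _delta_log, so neither write is lost.
signups = spark.createDataFrame([(1, "signup")], ["user_id", "event"])
purchases = spark.createDataFrame([(2, "purchase")], ["user_id", "event"])

signups.write.format("delta").mode("append").save("/data/events")
purchases.write.format("delta").mode("append").save("/data/events")

# Both rows are present, each as its own versioned commit
spark.read.format("delta").load("/data/events").show()
```
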
### 2. Time Travel: The "Undo" Feature

Every operation on a Delta table is versioned. Accidentally deleted a record? Ran a bad update? With time travel, you can revert changes or query the data as it existed at any point in the table's history.

### 3. Schema Enforcement: Bad Data Rejection

Suppose your schema requires a certain field to contain only numerical values, and a client tries to send a record with a string in that field. Delta Lake rejects the write, so the bad row never lands in your dataset.

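As a quick sketch of what that rejection looks like (the table path and data are hypothetical, reusing the employees table created later in this post), appending a DataFrame whose column types don't match the table's schema fails with an `AnalysisException` instead of silently corrupting the table:

```python
from pyspark.sql.utils import AnalysisException

# Hypothetical example: the employees table stores `salary` as a number,
# but this incoming batch carries it as a string.
bad_rows = spark.createDataFrame(
    [(6, "Jonas Weber", "Finance", "not-a-number")],
    ["id", "name", "department", "salary"],
)

try:
    bad_rows.write.format("delta").mode("append").save("/data/employees")
except AnalysisException as err:
    # Delta refuses the write because the incoming schema doesn't match
    print("Write rejected:", err)
```
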
### 4. Schema Evolution: Evolving Without Breaking Anything

As your product matures, so does your data. Want to add an extra column? Delta Lake makes schema evolution easy: existing data remains untouched and your workflows keep running uninterrupted.

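A minimal sketch of opting in to that evolution, again with hypothetical data, using the `mergeSchema` write option: appending rows that carry a new column extends the table's schema, while older rows simply read back with `null` for it.

```python
# Hypothetical example: new rows arrive with an extra `location` column.
new_rows = spark.createDataFrame(
    [(7, "Mei Lin", "Engineering", 88000, "Singapore")],
    ["id", "name", "department", "salary", "location"],
)

# mergeSchema asks Delta to add the new column to the table's schema;
# existing rows will show location = null when read back.
new_rows.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/data/employees")
```
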
## And How Exactly Does That Work?

All of the magic above comes from a mechanism called the transaction log, kept in a folder named `_delta_log` inside the table itself.

Every individual action, be it inserting, deleting, or updating records, is logged as a JSON entry in that folder. Delta Lake relies on this transaction log to determine the current state of the table and to know which older files can be safely removed.

## Here's how your table appears on disk

```text
my_table/
├── _delta_log/
│   ├── 00000000000000000000.json   ← "Table was created"
│   ├── 00000000000000000001.json   ← "10 rows were added"
│   └── 00000000000000000002.json   ← "Salary column was updated"
├── part-00001.parquet
├── part-00002.parquet
└── part-00003.parquet
```

The real data is stored in Parquet files, which are highly efficient to query. The transaction log is the brain; the Parquet files are the data store.

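If you're curious, you can peek at the log directly. A rough sketch, assuming the hypothetical `my_table/` layout above sits on the local filesystem:

```python
import json
from pathlib import Path

# Read the very first commit of the hypothetical table above.
# Each line of a commit file is one JSON action
# (protocol, metaData, add, commitInfo, ...).
log_file = Path("my_table/_delta_log/00000000000000000000.json")
for line in log_file.read_text().splitlines():
    action = json.loads(line)
    print(list(action.keys()))
```
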
## Let's Write Some Code

### Setting Up

```python
# Install the packages first:  pip install delta-spark pyspark
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = SparkSession.builder \
    .appName("MyFirstDeltaTable") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()
```

### Creating a Delta Table

```python
# Let's create a simple employee dataset
employees = [
    (1, "Priya Sharma", "Engineering", 82000),
    (2, "Liam O'Brien", "Marketing", 67000),
    (3, "Yuki Tanaka", "Engineering", 91000),
    (4, "Carlos Mendez", "Sales", 74000),
]
columns = ["id", "name", "department", "salary"]

df = spark.createDataFrame(employees, columns)

# Save it as a Delta table
df.write.format("delta").mode("overwrite").save("/data/employees")
```

That's it. You now have a Delta table with a transaction log, version history, and all the reliability features built in automatically.

### Reading It Back

```python
df = spark.read.format("delta").load("/data/employees")
df.show()
```

```text
+---+-------------+------------+------+
| id|         name|  department|salary|
+---+-------------+------------+------+
|  1| Priya Sharma| Engineering| 82000|
|  2| Liam O'Brien|   Marketing| 67000|
|  3|  Yuki Tanaka| Engineering| 91000|
|  4|Carlos Mendez|       Sales| 74000|
+---+-------------+------------+------+
```

### Using Time Travel

Let's say you update some salaries, then realize the update was wrong:

```python
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/data/employees")

# Give everyone in Engineering a raise
delta_table.update(
    condition="department = 'Engineering'",
    set={"salary": "salary + 5000"}
)
```

Oops! Turns out that update was wrong. No panic. Just travel back to version 0:

```python
# Check the history first
delta_table.history().show()

# Read the original data before the update
original_df = spark.read \
    .format("delta") \
    .option("versionAsOf", 0) \
    .load("/data/employees")

original_df.show()
```

You get your original data back, untouched. You can restore it, compare it, or just use it to figure out what went wrong, as in the sketch below.

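If you want to actually roll the live table back rather than just read the old version, `DeltaTable` exposes a restore API. A minimal sketch:

```python
# Roll the table back to version 0. The restore is itself recorded
# as a new commit in the transaction log, so even the undo is undoable.
delta_table.restoreToVersion(0)

spark.read.format("delta").load("/data/employees").show()
```
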
### Inserting and Updating at the Same Time (MERGE)

One of the most useful everyday operations is `MERGE`, often called an upsert. It means: update the record if it exists, insert it if it doesn't.

```python
# Some incoming data -- one update, one brand new employee
incoming = [
    (2, "Liam O'Brien", "Marketing", 71000),  # salary updated
    (5, "Amara Osei", "HR", 69000),           # new employee
]

incoming_df = spark.createDataFrame(incoming, columns)

delta_table.alias("existing").merge(
    incoming_df.alias("new"),
    "existing.id = new.id"
).whenMatchedUpdate(set={
    "salary": "new.salary"
}).whenNotMatchedInsert(values={
    "id": "new.id",
    "name": "new.name",
    "department": "new.department",
    "salary": "new.salary"
}).execute()
```

One operation. No duplicates. No manual checking. Clean results every time.

### Keeping Your Table Healthy

Over time, Delta Lake accumulates old data files to support time travel. You'll want to periodically clean those up:

```python
# Remove files older than 7 days (168 hours, the default retention period)
spark.sql("VACUUM delta.`/data/employees` RETAIN 168 HOURS")
```

And if your table collects many small files over time (which slows down queries), compact them:

```python
# Compact small files into larger, more efficient ones
spark.sql("OPTIMIZE delta.`/data/employees`")
```

Think of `VACUUM` as taking out the trash and `OPTIMIZE` as reorganizing your desk. Both are good habits to run on a schedule.

## When Should You Use Delta Lake?

Delta Lake is a great fit when:

1. Several pipelines or multiple teams write to the same dataset.
2. You need an audit history of every change.
3. The schema of your data can change over time.
4. You want bad data caught before it lands in your tables.
5. You are combining real-time streams with batch historical data (see the streaming sketch below).

If you have static files that will never change, plain Parquet is sufficient. But the second your data becomes dynamic, Delta Lake is worth its weight in gold.

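As a rough illustration of that last point (the paths and checkpoint location here are hypothetical), the same Delta table can be written to by a streaming job and read by batch queries, because every micro-batch is committed through the same transaction log:

```python
# Hypothetical example: a streaming job appends to a Delta table that
# batch jobs can query at the same time. The "rate" source is a built-in
# test source that just generates rows.
events_stream = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 5)
    .load()
)

query = (
    events_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/data/checkpoints/events")
    .outputMode("append")
    .start("/data/events_stream")
)

# Meanwhile, a batch read of the same table stays consistent,
# because every micro-batch lands as a commit in the transaction log.
spark.read.format("delta").load("/data/events_stream").count()
```
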

## Conclusion

In essence, Delta Lake takes the idea of a data lake (low-cost, scalable, flexible storage) and makes it reliable. ACID transactions eliminate silent corruption, time travel lets you recover from mistakes, schema enforcement keeps bad data out of your system, and schema evolution lets your data model grow without breaking anything.

And at the heart of it all sits nothing more than a transaction log: a simple, audit-ready record of every change made to your data.

When you are building data pipelines where data quality really matters (and sooner or later it always does), Delta Lake is a natural foundation for your stack. Best of all, it is easy to adopt.

<GiscusComments/>

src/database/blogs/index.tsx

Lines changed: 17 additions & 6 deletions
@@ -10,9 +10,9 @@ interface Blog {
 }
 
 const blogs: Blog[] = [
-
+
   {
-    id: 2,
+    id: 1,
     title: "Land a Job in UI/UX Design",
     image: "/img/blogs/04-ux-job-design.png",
     description:
@@ -24,7 +24,7 @@ const blogs: Blog[] = [
   },
 
   {
-    id: 6,
+    id: 2,
     title: "What is GitHub Copilot",
     image: "/img/blogs/06-github-agent.png",
     description:
@@ -35,7 +35,7 @@ const blogs: Blog[] = [
     tags: ["GitHub", "AI", "Coding", "Tools"],
   },
   {
-    id: 7,
+    id: 3,
     title: "Apache Spark Architecture Explained",
     image: "img/blogs/07-spark-blog-banner.png",
     description:
@@ -46,7 +46,7 @@ const blogs: Blog[] = [
     tags: ["Apache Spark", "Big Data", "Data Engineering", "Architecture"],
   },
   {
-    id: 8,
+    id: 4,
     title: "N8N: The Future of Workflow Automation",
     image: "/img/blogs/n8n-logo.png",
     description:
@@ -57,7 +57,7 @@ const blogs: Blog[] = [
     tags: ["Automation", "Workflow", "N8N", "Tools"],
   },
   {
-    id: 9,
+    id: 5,
     title: "OpenAI AgentKit: Building AI Agents Without the Complexity",
     image: "/img/blogs/Agent_Builder.png",
     description:
@@ -67,6 +67,17 @@ const blogs: Blog[] = [
     category: "AI & Tech",
     tags: ["AI", "OpenAI", "Development", "Agents"],
   },
+  {
+    id: 6,
+    title: "Delta Lake: An Introduction to Trustworthy Data Storage",
+    image: "/img/blogs/delta-lake-logo.png",
+    description:
+      "Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.",
+    slug: "deltalake-data-storage",
+    authors: ["Aditya-Singh-Rathore"],
+    category: "data engineering",
+    tags: ["Delta Lake", "Big Data", "Data Engineering", "Storage"],
+  },
 ];
 
 export default blogs;