There's no schema enforcement, no type checking. You put a file in, you get it back.

Depending on how your data is written, you'll use one of three blob types:
- **Block Blob:** Upload a file all at once. This covers 95% of data engineering use cases: your CSVs, Parquet files, and JSON exports all go here.
- **Append Blob:** Add data continuously without modifying what's already there. Perfect for log files that grow over time.
- **Page Blob:** Optimised for random read/write operations. Used mainly for VM disks. You'll rarely touch this directly.
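
The behavioural difference between the two common types can be sketched with a toy in-memory store. This is a simulation of the write semantics only, not the Azure SDK; all names here are illustrative:

```python
# Toy simulation of block-blob vs append-blob write semantics (not the Azure SDK).
class ToyBlobStore:
    def __init__(self):
        self.blobs = {}  # key -> bytes

    def upload_block_blob(self, key: str, data: bytes) -> None:
        # Block blob: an upload replaces the blob's content wholesale.
        self.blobs[key] = data

    def append_blob(self, key: str, data: bytes) -> None:
        # Append blob: new data is only ever added at the end.
        self.blobs[key] = self.blobs.get(key, b"") + data

store = ToyBlobStore()
store.upload_block_blob("raw/2024/jan/sales.csv", b"id,amount\n1,100\n")
store.append_blob("logs/app.log", b"2024-01-01 started\n")
store.append_blob("logs/app.log", b"2024-01-01 finished\n")
```

In the real service the choice is fixed at creation time: a blob created as a block blob can't later be appended to in place.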

Plain Blob Storage works perfectly for general file storage. But for big data analytics at scale, it has a fundamental limitation.

## The Problem with Plain Blob Storage at Scale
Here's something I found out the hard way six months into working with Azure pipelines.
I had a container full of raw sales data: about 40,000 Parquet files organised under a path that looked like `raw/2024/`. My team decided to rename it to `bronze/2024/` to match our Medallion Architecture convention. Simple enough, right?
It took **47 minutes**.
Not because Azure was slow, but because what looked like a folder called `raw/` was never actually a folder. In plain Blob Storage, everything lives at the same flat level: the slashes in a path like `raw/2024/jan/file.parquet` are just characters in a key name, the same way a filename on your desktop could technically be called `raw-2024-jan-file.parquet` with dashes instead.
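
You can see this flat-namespace behaviour in any blob listing: a "folder" is nothing more than a prefix filter over key strings. A minimal sketch in plain Python (the key names are illustrative):

```python
# Every blob lives at the same flat level; "folders" are just shared key prefixes.
blobs = [
    "raw/2024/jan/sales.parquet",
    "raw/2024/feb/sales.parquet",
    "raw/2024/mar/sales.parquet",
    "reference/products.csv",
]

# "Listing the raw/2024/ folder" is really just a string-prefix filter.
in_raw_2024 = [k for k in blobs if k.startswith("raw/2024/")]
print(in_raw_2024)  # the three sales files; no directory object is involved
```

This is exactly how the Blob list API behaves: you pass a prefix, and the service scans key names that start with it.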
There is no directory underneath. So renaming means Azure copies each file to the new key name and deletes the old one, one file at a time, 40,000 times in a row.
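
The cost is easy to count. With no directory object, every key under the old prefix needs its own copy-plus-delete. A sketch of the mechanics (a simulation, not the Azure API):

```python
def rename_prefix(blobs: dict, old: str, new: str) -> int:
    """Rename a 'folder' in a flat key space: one copy + one delete per blob."""
    ops = 0
    for key in [k for k in blobs if k.startswith(old)]:
        blobs[new + key[len(old):]] = blobs.pop(key)  # copy to new key, drop old
        ops += 2  # one copy operation + one delete operation
    return ops

# 40,000 files under raw/ means 80,000 individual storage operations.
store = {f"raw/2024/file_{i}.parquet": b"" for i in range(40_000)}
ops = rename_prefix(store, "raw/", "bronze/")
print(ops)  # 80000
```

The work scales linearly with the number of blobs, which is why the wall-clock time balloons as a container grows.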
At big data scale, where you're managing millions of files across Bronze, Silver, and Gold layers, that's not a minor inconvenience. It's a pipeline blocker.
This is the exact problem **Azure Data Lake Storage Gen2 (ADLS Gen2)** was built to fix.
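
With a hierarchical namespace, a directory is a real object, so renaming it re-points one node in the tree instead of rewriting every key. A toy contrast with the flat sketch above (this illustrates the concept, not the ADLS Gen2 API):

```python
# With a real directory tree, a rename moves one subtree node,
# no matter how many files sit underneath it.
tree = {"raw": {"2024": {f"file_{i}.parquet": b"" for i in range(40_000)}}}

# One metadata operation: re-point the subtree under a new name.
tree["bronze"] = tree.pop("raw")

print(len(tree["bronze"]["2024"]))  # 40000
```

That single metadata operation is why a directory rename in ADLS Gen2 completes in roughly constant time rather than scaling with the file count.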