Skip to content

Commit 47b9aa3

Browse files
committed
fix the plagarism risk
1 parent 3dca97d commit 47b9aa3

1 file changed

Lines changed: 10 additions & 12 deletions

File tree

blog/azure-storage-options/index.md

Lines changed: 10 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -89,6 +89,8 @@ There's no schema enforcement, no type checking. You put a file in, you get it b
8989

9090
Depending on how your data is written, you'll use one of three blob types:
9191

92+
![blob_types](./img/blob_types.png)
93+
9294
- **Block Blob :** Upload a file all at once. This covers 95% of data engineering use cases, your CSVs, Parquet files, JSON exports all go here.
9395
- **Append Blob :** Add data continuously without modifying what's already there. Perfect for log files that grow over time.
9496
- **Page Blob :** Optimised for random read/write operations. Used mainly for VM disks. You'll rarely touch this directly.
@@ -114,24 +116,20 @@ Plain Blob Storage works perfectly for general file storage. But for big data an
114116

115117
## The Problem with Plain Blob Storage at Scale
116118

117-
In standard Blob Storage, **folders don't actually exist.**
119+
Here's something I found out the hard way six months into working with Azure pipelines.
118120

119-
What looks like a folder structure:
120-
```
121-
raw/2024/jan/sales.csv
122-
raw/2024/feb/sales.csv
123-
raw/2024/mar/sales.csv
124-
```
121+
I had a container full of raw sales data — about 40,000 Parquet files organised under a path that looked like `raw/2024/`. My team decided to rename it to `bronze/2024/` to match our Medallion Architecture convention. Simple enough, right?
125122

126-
...is actually just flat key names. The `/` characters are part of the key string, not real directory separators. There are no real folders underneath.
123+
It took **47 minutes**.
127124

128-
This creates a serious problem when your data grows:
125+
Not because Azure was slow. Because what looked like a folder called `raw/` was never actually a folder. In plain Blob Storage, everything lives at the same flat level, the slashes in a path like
126+
`raw/2024/jan/file.parquet` are just characters in a key name, the same way a filename on your desktop could technically be called `raw-2024-jan-file.parquet` with dashes instead.
129127

130-
Imagine you need to rename the `raw/2024/` "folder" to `bronze/2024/`. In regular Blob Storage, Azure has to copy each file to the new key name and delete the old one, **one file at a time**. With a thousand files, that's a thousand individual operations. With ten million files, you're waiting hours.
128+
There is no directory underneath. So renaming means Azure copies each file to the new key name and deletes the old one,one file at a time, 40,000 times in a row.
131129

132-
At big data scale, this is a dealbreaker.
130+
At big data scale where you're managing millions of files across Bronze, Silver, and Gold layers that's not a minor inconvenience. It's a pipeline blocker.
133131

134-
This is the exact problem that **Azure Data Lake Storage Gen2** was built to solve.
132+
This is the exact problem **ADLS Gen2** was built to fix.
135133

136134

137135

0 commit comments

Comments
 (0)