-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy path2_dim_date_table_creation.py
More file actions
169 lines (155 loc) · 5.91 KB
/
2_dim_date_table_creation.py
File metadata and controls
169 lines (155 loc) · 5.91 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
# Databricks notebook source
from pyspark.sql import functions as F
# COMMAND ----------
start_date = '2024-01-01'
end_date = '2025-12-01'
# COMMAND ----------
# Generate 1 row per month start between start_date and end_date
df = (
spark.sql(f"""
SELECT explode(
sequence(
to_date('{start_date}'),
to_date('{end_date}'),
interval 1 month
)
) AS month_start_date
""")
)
# add useful analytics column
df = (
df
.withColumn("date_key", F.date_format("month_start_date", "yyyyMM").cast("int"))
.withColumn("year", F.year("month_start_date"))
.withColumn("month_name", F.date_format("month_start_date", "MMMM"))
.withColumn("month_short_name", F.date_format("month_start_date", "MMM"))
.withColumn("quarter", F.concat(F.lit("Q"), F.lit(F.quarter("month_start_date"))))
.withColumn("year_quarter", F.concat(F.col("year"), F.lit("-Q"), F.quarter("month_start_date")))
)
# COMMAND ----------
display(df)
# COMMAND ----------
# now save the dataframe as table
df.write \
.mode("overwrite") \
.format("delta") \
.saveAsTable("fmcg.gold.dim_date")
# COMMAND ----------
# MAGIC %md
# MAGIC # Here is what is going on above
# MAGIC
# MAGIC Sure! 😊 Let’s go through this **PySpark** code step by step — in **simple terms** and with an **example** so it’s easy to understand.
# MAGIC
# MAGIC ---
# MAGIC
# MAGIC ### 🧠 What this code does
# MAGIC
# MAGIC It **creates a DataFrame** with **one row per month** between two given dates (e.g., from `2023-01-01` to `2023-12-01`)
# MAGIC and then adds some **useful date-related columns** for analytics.
# MAGIC
# MAGIC ---
# MAGIC
# MAGIC ### 💡 Step-by-step explanation
# MAGIC
# MAGIC #### 1. Create monthly dates between `start_date` and `end_date`
# MAGIC
# MAGIC ```python
# MAGIC df = (
# MAGIC spark.sql(f"""
# MAGIC SELECT explode(
# MAGIC sequence(
# MAGIC to_date('{start_date}'),
# MAGIC to_date('{end_date}'),
# MAGIC interval 1 month
# MAGIC )
# MAGIC ) AS month_start_date
# MAGIC """)
# MAGIC )
# MAGIC ```
# MAGIC
# MAGIC Let’s assume:
# MAGIC
# MAGIC ```python
# MAGIC start_date = "2023-01-01"
# MAGIC end_date = "2023-06-01"
# MAGIC ```
# MAGIC
# MAGIC ##### 👉 What happens here:
# MAGIC
# MAGIC * `sequence(to_date('2023-01-01'), to_date('2023-06-01'), interval 1 month)`
# MAGIC → creates a list of dates spaced by **1 month**:
# MAGIC
# MAGIC ```
# MAGIC [2023-01-01, 2023-02-01, 2023-03-01, 2023-04-01, 2023-05-01, 2023-06-01]
# MAGIC ```
# MAGIC
# MAGIC * `explode(...)` turns that list into **multiple rows**, one per date.
# MAGIC
# MAGIC So the DataFrame (`df`) now looks like this:
# MAGIC
# MAGIC | month_start_date |
# MAGIC | ---------------- |
# MAGIC | 2023-01-01 |
# MAGIC | 2023-02-01 |
# MAGIC | 2023-03-01 |
# MAGIC | 2023-04-01 |
# MAGIC | 2023-05-01 |
# MAGIC | 2023-06-01 |
# MAGIC
# MAGIC ---
# MAGIC
# MAGIC #### 2. Add extra columns for analytics
# MAGIC
# MAGIC ```python
# MAGIC df = (
# MAGIC df
# MAGIC .withColumn("date_key", F.date_format("month_start_date", "yyyyMM").cast("int"))
# MAGIC .withColumn("year", F.year("month_start_date"))
# MAGIC .withColumn("month_name", F.date_format("month_start_date", "MMMM"))
# MAGIC .withColumn("month_short_name", F.date_format("month_start_date", "MMM"))
# MAGIC .withColumn("quarter", F.concat(F.lit("Q"), F.lit(F.quarter("month_start_date"))))
# MAGIC .withColumn("year_quarter", F.concat(F.col("year"), F.lit("-Q"), F.quarter("month_start_date")))
# MAGIC )
# MAGIC ```
# MAGIC
# MAGIC Let’s break down each column:
# MAGIC
# MAGIC | Column | Meaning | Example (for 2023-03-01) |
# MAGIC | -------------------- | ---------------------------------- | ------------------------ |
# MAGIC | **date_key** | A numeric key (YYYYMM) | `202303` |
# MAGIC | **year** | Year number | `2023` |
# MAGIC | **month_name** | Full month name | `March` |
# MAGIC | **month_short_name** | Short month name | `Mar` |
# MAGIC | **quarter** | Quarter of the year (Q1, Q2, etc.) | `Q1` |
# MAGIC | **year_quarter** | Year and quarter combined | `2023-Q1` |
# MAGIC
# MAGIC ---
# MAGIC
# MAGIC ### 🧾 Final Output Example
# MAGIC
# MAGIC | month_start_date | date_key | year | month_name | month_short_name | quarter | year_quarter |
# MAGIC | ---------------- | -------- | ---- | ---------- | ---------------- | ------- | ------------ |
# MAGIC | 2023-01-01 | 202301 | 2023 | January | Jan | Q1 | 2023-Q1 |
# MAGIC | 2023-02-01 | 202302 | 2023 | February | Feb | Q1 | 2023-Q1 |
# MAGIC | 2023-03-01 | 202303 | 2023 | March | Mar | Q1 | 2023-Q1 |
# MAGIC | 2023-04-01 | 202304 | 2023 | April | Apr | Q2 | 2023-Q2 |
# MAGIC | 2023-05-01 | 202305 | 2023 | May | May | Q2 | 2023-Q2 |
# MAGIC | 2023-06-01 | 202306 | 2023 | June | Jun | Q2 | 2023-Q2 |
# MAGIC
# MAGIC ---
# MAGIC
# MAGIC ### ✅ Summary
# MAGIC
# MAGIC In plain English:
# MAGIC
# MAGIC > This code makes a list of all the first days of each month between two dates,
# MAGIC > then adds columns like month name, year, quarter, and a numeric key (YYYYMM)
# MAGIC > to help with reporting and analytics.
# MAGIC
# MAGIC ---
# MAGIC
# MAGIC Would you like me to show how to do the same thing in **pure Python (Pandas)** instead of Spark, for easier testing locally?
# MAGIC
# COMMAND ----------
# MAGIC %sql
# MAGIC select count('id') from fmcg.gold.dim_date