## Contributing to this FAQ

Contribute to this FAQ by transforming the [existing FAQ from Google Docs](https://docs.google.com/document/d/19bnYs80DwuUimHM65UV3sylsCn2j1vziPOwzBwQrebw/edit?tab=t.0) into the required format.

### Contributing FAQ Best Practices

Please include your FAQ in the following format:

**Q: Your Question**

`Exact terminal error` if it exists

## How is my capstone project going to be evaluated?

* Each submitted project will be evaluated by 3 (three) randomly assigned students that have also submitted the project.

* You will also be responsible for grading the projects of 3 fellow students yourself. Please be aware that not complying with this rule also means failing to achieve the Certificate at the end of the course.

* The final grade you get will be the median of the scores given by your peer reviewers.

* And of course, the peer review criteria for evaluating or being evaluated must follow the guidelines defined [**here**](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_7_project#peer-review-criteria).

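As a quick illustration of how the final grade works, the median of three peer-review scores can be computed with Python's standard library (the scores below are made-up numbers):

```python
import statistics

# Three peer reviewers award these scores (illustrative values only).
peer_scores = [7, 9, 8]

# The final grade is the median -- the middle value once the scores are sorted.
final_grade = statistics.median(peer_scores)
print(final_grade)  # → 8
```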
## Can I collaborate with others on the capstone project?
Collaboration is not allowed for the capstone submission. However, you can discuss ideas and get feedback from peers in the forums or Slack channels.
## Project 1 & Project 2
There is only ONE project for this Zoomcamp. You do not need to submit or create two projects.
There are simply TWO chances to pass the course. You can use the second attempt if you a) failed the first attempt, or b) did not have the time, due to other engagements such as holidays or sickness, to enter your project into the first attempt.
## Project evaluation - Reproducibility
The question is: sometimes, even if you put plenty of effort into documenting every single step, you cannot be sure the person doing the peer review will be able to follow it. So how will this criterion be evaluated?
Alex clarifies: “Ideally yes, you should try to re-run everything. But I understand that not everyone has time to do it, so if you check the code by looking at it and try to spot errors, places with missing instructions and so on - then it's already great”
## Certificates: how do I get one?
A: [See the certificate.mdx file](#certificate---generating,-receiving-after-projects-graded)
## Does anyone know nice and relatively large datasets?
See a list of datasets here: [https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/projects/datasets.md](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/projects/datasets.md)
## How to run Python as a startup script?
You need to redefine the Python environment variable so that it points to your user account's Python installation.
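One common interpretation (an assumption, since the note above is terse) is that the startup script resolves the wrong interpreter via `PATH`, so you prepend your user-level install directory:

```shell
# Assumption: the "Python environment variable" here is PATH. Prepend the
# user-level bin directory so the startup script resolves *your* python.
export PATH="$HOME/.local/bin:$PATH"

# Verify which interpreter will now be picked up (no-op if python3 is absent).
command -v python3 || true
```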
## Spark Streaming - How do I read from multiple topics in the same Spark Session
Initiate a Spark session, then define one streaming query per topic:

```python
from pyspark.sql import SparkSession

# app_name and master are defined elsewhere in your script.
spark = (SparkSession
    .builder
    .appName(app_name)
    .master(master)
    .getOrCreate())

spark.streams.resetTerminated()

# One streaming query per topic; the elided parts (…) configure the Kafka
# source, the topic, and the rest of each query's pipeline.
query1 = (spark
    .readStream
    …
    .load())

query2 = (spark
    .readStream
    …
    .load())

query3 = (spark
    .readStream
    …
    .load())

query1.start()
query2.start()
query3.start()

# Waits until any one of the queries receives a kill signal or fails.
# In contrast, query3.start().awaitTermination() is a blocking call on a
# single query, which works well when reading from only one topic.
spark.streams.awaitAnyTermination()
```

## Data Transformation from Databricks to Azure SQL DB
Transformed data can be moved into Azure Blob Storage first and then into Azure SQL DB, instead of moving it directly from Databricks to Azure SQL DB.
## Orchestrating dbt with Airflow
The trial dbt account provides access to the dbt API. The job will still need to be added manually. Airflow will run the job using a Python operator calling the API. You will need to provide the API key, job ID, etc. (be careful not to commit them to GitHub).
Source code example here: [https://github.com/sungchun12/airflow-toolkit/blob/95d40ac76122de337e1b1cdc8eed35ba1c3051ed/dags/examples/dbt_cloud_example.py](https://github.com/sungchun12/airflow-toolkit/blob/95d40ac76122de337e1b1cdc8eed35ba1c3051ed/dags/examples/dbt_cloud_example.py)
## Running Spark jobs on Dataproc from Airflow

The service account used by Airflow needs, among other roles:

- Service Account User (explanation [here](https://stackoverflow.com/questions/63941429/user-not-authorized-to-act-as-service-account-when-using-workload-identity))

Use DataprocSubmitPySparkJobOperator, DataprocDeleteClusterOperator and DataprocCreateClusterOperator.
When using DataprocSubmitPySparkJobOperator, do not forget to add the BigQuery connector jar, because Dataproc does not already include the BigQuery Connector.
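A sketch of the operator arguments involved — the jar path, bucket, cluster, and region below are illustrative assumptions, not values from the course materials:

```python
# Dataproc clusters do not bundle the BigQuery connector, so the jar must be
# passed explicitly. The GCS path below is the connector's commonly used
# public location (an assumption -- verify against your Scala/Spark version).
BQ_CONNECTOR_JAR = "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"

# Keyword arguments you might pass to DataprocSubmitPySparkJobOperator;
# main script, cluster name, and region are placeholders.
pyspark_job_kwargs = {
    "task_id": "submit_pyspark_job",
    "main": "gs://your-bucket/jobs/spark_job.py",
    "cluster_name": "your-cluster",
    "region": "europe-west1",
    "dataproc_jars": [BQ_CONNECTOR_JAR],  # the argument people forget
}

# In the DAG: DataprocSubmitPySparkJobOperator(**pyspark_job_kwargs)
```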
## Orchestrating dbt cloud with Mage
You can trigger your dbt job from a Mage pipeline. To do this, get your dbt Cloud API key under Settings → API tokens → Personal tokens, and add it safely to your `.env` file.
For example:
```
dbt_api_trigger=dbt_**
```

Navigate to the job page and find the API trigger link.

Then create a custom Mage Python block with a simple HTTP request like [here](https://github.com/Nogromi/ukraine-vaccinations/blob/master/2_mage/vaccination/custom/trigger_dbt_cloud.py)
Voila! You have triggered a dbt job from your Mage pipeline.
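A minimal sketch of such a block using only the standard library — the account and job IDs are placeholders, and the endpoint shape follows the dbt Cloud v2 API:

```python
import json
import os
import urllib.request

DBT_CLOUD_API = "https://cloud.getdbt.com/api/v2"

def job_run_url(account_id: int, job_id: int) -> str:
    """Build the dbt Cloud 'trigger job run' endpoint URL."""
    return f"{DBT_CLOUD_API}/accounts/{account_id}/jobs/{job_id}/run/"

def trigger_dbt_job(account_id: int, job_id: int, api_key: str) -> dict:
    """POST to dbt Cloud to kick off a job run; returns the JSON response."""
    req = urllib.request.Request(
        job_run_url(account_id, job_id),
        data=json.dumps({"cause": "Triggered from Mage"}).encode(),
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# In the Mage block, read the key from the environment you populated via .env:
# trigger_dbt_job(12345, 67890, os.environ["dbt_api_trigger"])
```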
## Key Vault in Azure cloud stack
The key vault in Azure cloud is used to store credentials, passwords, or secrets for the different tech stacks used in Azure. For example, if you do not want to expose the password of a SQL database, you can save the password under a given name and reference it from other Azure services.
## How to connect Pyspark with BigQuery?
The following line should be included in pyspark configuration
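The exact configuration line did not survive in this copy; a hedged sketch of the usual approach is to ship the spark-bigquery connector via `spark.jars.packages` (the Maven coordinates and version below are assumptions — check the connector's release page for your Spark/Scala version):

```python
# Maven coordinates of the public spark-bigquery connector (version assumed).
BQ_PACKAGE = "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1"

# The configuration entry to add when building the Spark session.
spark_conf = {"spark.jars.packages": BQ_PACKAGE}

# With pyspark installed, this would look like:
# spark = (SparkSession.builder
#          .appName("bq-demo")
#          .config("spark.jars.packages", BQ_PACKAGE)
#          .getOrCreate())
# df = spark.read.format("bigquery").option("table", "dataset.table").load()
```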
## How to run a dbt-core project as an Airflow Task Group on Google Cloud Composer using a service account JSON key
1. [Install](https://cloud.google.com/composer/docs/composer-2/install-python-dependencies#install_custom_packages_in_a_environment) the [***astronomer-cosmos***](https://github.com/astronomer/astronomer-cosmos) package as a dependency (see Terraform [example](https://github.com/wndrlxx/ca-trademarks-data-pipeline/blob/4e6a0e757495a99e01ff6c8b981a23d6dc421046/terraform/main.tf#L100)).
2. Make a new folder, **dbt/**, inside the **dags/** folder of your Composer GCP bucket and copy your dbt-core project there (see [example](https://github.com/wndrlxx/ca-trademarks-data-pipeline/tree/4e6a0e757495a99e01ff6c8b981a23d6dc421046/dags/dbt/ca_trademarks_dp)).
3. Ensure your *profiles.yml* is configured to authenticate with a service account key (see BigQuery [example](https://docs.getdbt.com/docs/core/connect-data-platform/bigquery-setup#service-account-file)).
4. Create a new DAG using the **DbtTaskGroup** class and a **ProfileConfig** specifying a *profiles_yml_filepath* that points to the location of your JSON key file (see [example](https://github.com/wndrlxx/ca-trademarks-data-pipeline/blob/4e6a0e757495a99e01ff6c8b981a23d6dc421046/dags/6_dbt_cosmos_task_group.py#L47)).
5. Your dbt lineage graph should now appear as tasks inside a task group.

## How can I run UV in Kestra without installing it on every flow execution?
To avoid reinstalling uv on each flow run, you can create a custom Docker image based on the official Kestra image with uv pre-installed. Here's how:
- Create a `Dockerfile` with the following content:

```dockerfile
FROM kestra/kestra:latest
USER root
RUN pip install uv
CMD ["server", "standalone"]
```

- Update your docker-compose.yml to build this custom image instead of pulling the default one:
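A minimal sketch of that docker-compose change — the service name, image tag, and file layout below are placeholders; keep the rest of your existing Kestra service configuration as-is:

```yaml
services:
  kestra:
    # Build the custom image from the Dockerfile above
    # instead of pulling kestra/kestra directly.
    build:
      context: .
      dockerfile: Dockerfile
    image: kestra-with-uv:latest
    # ...rest of your existing Kestra service configuration...
```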
This approach ensures that uv is available in the container at runtime without requiring installation during each flow execution.
## Is it possible to create external tables in BigQuery using URLs, such as those from the NY Taxi data website?
Answer: Not really, only Bigtable, Cloud Storage, and Google Drive are supported data stores.
## Is it ok to use NY_Taxi data for the project?
No
## How to use dbt-core with Athena?
If you don’t have access to dbt Cloud, which already natively supports Athena on AWS (refer here: [1](https://aws.amazon.com/blogs/big-data/from-data-lakes-to-insights-dbt-adapter-for-amazon-athena-now-supported-in-dbt-cloud/), [2](https://youtu.be/JEizJAaaBkg?si=niTYdWoeiyC_w3h7), [3](https://docs.getdbt.com/guides/athena?step=1), & [4](https://docs.getdbt.com/docs/core/connect-data-platform/athena-setup)), you can use the community-built [dbt-Athena Adapter](https://dbt-athena.github.io/) for dbt Core.
### Key Features
* Enables dbt to work with AWS Athena using dbt Core
* Allows data transformation using CREATE TABLE AS or CREATE VIEW SQL queries
* Not yet supported features:
  1. Python models
  2. Persisting documentation for views

This adapter can be a valuable resource for those who need to work with Athena using dbt Core, and I hope this entry helps others discover it.
## Solving dbt-Athena library conflicts
When working on a dbt-Athena project, do not install dbt-athena-adapter. Instead, always use the dbt-athena-community package, ensuring it matches the version of dbt-core to avoid compatibility conflicts.
### Best Practice
* Always pin the versions of dbt-core and dbt-athena-community to the same version.
* Example:

```
dbt-core~=1.9.3
dbt-athena-community~=1.9.3
```

### Why?
* dbt-athena-adapter is outdated and no longer maintained.
* dbt-athena-community is the actively maintained package and is compatible with the latest versions of dbt-core.

### Steps to Avoid Conflicts
1. Always check the compatibility matrix in the [dbt-athena-community](https://github.com/dbt-labs/dbt-adapters/tree/main/dbt-athena-community) GitHub repository.
2. Update requirements.txt to use the latest compatible versions of dbt-core and dbt-athena-community.
3. Avoid mixing dbt-athena-adapter with dbt-athena-community in the same environment.
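Step 2 can be sketched as follows — the versions mirror the example earlier in this section; check the compatibility matrix before pinning your own:

```shell
# Pin dbt-core and the community Athena adapter to the same minor release.
cat > requirements.txt <<'EOF'
dbt-core~=1.9.3
dbt-athena-community~=1.9.3
EOF

# Then install with: pip install -r requirements.txt
cat requirements.txt
```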
By following this practice, you can avoid the conflicts we faced previously and ensure a smooth development experience.