Commit 3cd4d98: Updated FAQs

1 parent 2452e86

7 files changed

Lines changed: 75 additions & 4236 deletions

File tree

  • docs/data-engineering-zoomcamp

docs/data-engineering-zoomcamp/final-project/faq.md

Lines changed: 10 additions & 233 deletions
@@ -6,244 +6,21 @@ nav_order: 2
# Module 6 FAQ

## Table of contents
{: .no_toc .text-delta }

Coming soon
{: .label .label-yellow }

1. TOC
{:toc}

## Contribute

## How is my capstone project going to be evaluated?

Contribute to this FAQ by transforming the [existing FAQ from Google Docs](https://docs.google.com/document/d/19bnYs80DwuUimHM65UV3sylsCn2j1vziPOwzBwQrebw/edit?tab=t.0) to the required format.

### Contributing FAQ Best Practices

* Each submitted project will be evaluated by three randomly assigned students who have also submitted a project.

* You will also be responsible for grading the projects of three fellow students yourself. Be aware that failing to comply with this rule means you will not receive the Certificate at the end of the course.

* Your final grade will be the median of the scores you receive from the peer reviewers.

* The peer review criteria for evaluating and being evaluated must follow the guidelines defined [**here**](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_7_project#peer-review-criteria).
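The median rule above works out like this in practice (the scores here are hypothetical):

```python
from statistics import median

# Hypothetical scores from the three peer reviewers
peer_scores = [78, 92, 85]

final_grade = median(peer_scores)
print(final_grade)  # → 85
```

Note that with three reviewers the median is simply the middle score, so a single unusually low or high review does not decide your grade.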
## Can I collaborate with others on the capstone project?

Collaboration is not allowed for the capstone submission. However, you can discuss ideas and get feedback from peers in the forums or Slack channels.

## Project 1 & Project 2

There is only ONE project for this Zoomcamp. You do not need to submit or create two projects.

There are simply TWO chances to pass the course. You can use the second attempt if you (a) fail the first attempt, or (b) do not have time to submit a project for the first attempt due to other engagements such as holidays or sickness.
## Project evaluation - Reproducibility

Even if you take great care to document every single step, you cannot be sure that the person doing the peer review will be able to follow it, so how will this criterion be evaluated?

Alex clarifies: "Ideally yes, you should try to re-run everything. But I understand that not everyone has time to do it, so if you check the code by looking at it and try to spot errors, places with missing instructions and so on - then it's already great."

## Certificates: how do I get it?

A: [See the certificate.mdx file](#certificate---generating,-receiving-after-projects-graded)
## Does anyone know nice and relatively large datasets?

See a list of datasets here: [https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/projects/datasets.md](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/projects/datasets.md)

## How to run Python as a startup script?

You need to redefine the Python environment variable so that it points to the Python installation of your user account.
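A minimal sketch of what that can look like in a login shell on Linux; the `~/.local/bin` path is an assumption, so adjust it to wherever your user-level Python actually lives:

```shell
# Put the user-level Python ahead of the system one for this account
# (typically added to ~/.bashrc or ~/.profile so it applies at startup)
export PATH="$HOME/.local/bin:$PATH"

# Verify which interpreter now resolves first
command -v python3 || echo "python3 not found"
```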
## Spark Streaming - How do I read from multiple topics in the same Spark Session

Initiate a Spark session, build one streaming DataFrame per topic, start one query per stream, and then wait on the session's stream manager. The source options were left blank in the original snippet; the Kafka source with a `subscribe` option (and a console sink) is shown below as the usual choice:

```python
spark = (SparkSession
    .builder
    .appName(app_name)
    .master(master)
    .getOrCreate())

spark.streams.resetTerminated()

# One streaming DataFrame per topic (the source options are assumptions;
# the original snippet left them blank)
stream1 = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_servers)
    .option("subscribe", "topic1")
    .load())

stream2 = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_servers)
    .option("subscribe", "topic2")
    .load())

stream3 = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_servers)
    .option("subscribe", "topic3")
    .load())

# Start one query per stream (console sink chosen for illustration)
query1 = stream1.writeStream.format("console").start()
query2 = stream2.writeStream.format("console").start()
query3 = stream3.writeStream.format("console").start()

# Blocks until any one of the queries terminates
# (kill signal or error failure)
spark.streams.awaitAnyTermination()

# By contrast, query3.awaitTermination() blocks on that single query only -
# it works well when we are reading from just one topic.
```
## Data Transformation from Databricks to Azure SQL DB

Instead of moving transformed data directly from Databricks to Azure SQL DB, it can first be moved into Azure Blob Storage and then loaded into Azure SQL DB.
## Orchestrating dbt with Airflow

The trial dbt account provides access to the dbt API. The job still needs to be added manually. Airflow then runs the job using a Python operator that calls the API. You will need to provide the API key, job id, etc. (be careful not to commit them to GitHub).

Detailed explanation here: [https://docs.getdbt.com/blog/dbt-airflow-spiritual-alignment](https://docs.getdbt.com/blog/dbt-airflow-spiritual-alignment)

Source code example here: [https://github.com/sungchun12/airflow-toolkit/blob/95d40ac76122de337e1b1cdc8eed35ba1c3051ed/dags/examples/dbt_cloud_example.py](https://github.com/sungchun12/airflow-toolkit/blob/95d40ac76122de337e1b1cdc8eed35ba1c3051ed/dags/examples/dbt_cloud_example.py)
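A minimal sketch of the callable such a Python operator could run. The account id, job id, and environment-variable name are placeholders of my choosing; the URL follows the dbt Cloud v2 "trigger job run" endpoint shape used elsewhere in this FAQ:

```python
import os

import requests


def build_run_url(account_id: int, job_id: int) -> str:
    # dbt Cloud v2 endpoint for triggering a job run
    return f"https://cloud.getdbt.com/api/v2/accounts/{account_id}/jobs/{job_id}/run/"


def trigger_dbt_job(account_id: int, job_id: int) -> dict:
    # Read the token from the environment so it never lands in Git
    token = os.environ["DBT_CLOUD_API_TOKEN"]
    response = requests.post(
        build_run_url(account_id, job_id),
        headers={"Authorization": f"Token {token}",
                 "Content-Type": "application/json"},
        json={"cause": "Triggered by Airflow"},
    )
    response.raise_for_status()
    return response.json()
```

Inside a DAG this would be wrapped in a `PythonOperator` (or a TaskFlow `@task`), with the account id and job id pulled from an Airflow Connection or Variable rather than hard-coded.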
## Orchestrating DataProc with Airflow

[https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/operators/dataproc/index.html](https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/operators/dataproc/index.html)

[https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_modules/airflow/providers/google/cloud/operators/dataproc.html](https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_modules/airflow/providers/google/cloud/operators/dataproc.html)

Give the following roles to your service account:

- DataProc Administrator

- Service Account User (explanation [here](https://stackoverflow.com/questions/63941429/user-not-authorized-to-act-as-service-account-when-using-workload-identity))

Use DataprocSubmitPySparkJobOperator, DataprocDeleteClusterOperator and DataprocCreateClusterOperator.

When using DataprocSubmitPySparkJobOperator, do not forget to add:

`dataproc_jars = ["gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.24.0.jar"]`

because DataProc does not ship with the BigQuery connector.
## Orchestrating dbt cloud with Mage

You can trigger your dbt job from a Mage pipeline. To do this, get your dbt Cloud API key under Settings / API tokens / Personal tokens. Add it safely to your .env file, for example:

```
dbt_api_trigger=dbt_**
```

Navigate to the job page and find the API trigger link.

Then create a custom Mage Python block with a simple HTTP request like [here](https://github.com/Nogromi/ukraine-vaccinations/blob/master/2_mage/vaccination/custom/trigger_dbt_cloud.py):

```python
import os
import requests
from dotenv import load_dotenv
from pathlib import Path

dotenv_path = Path('/home/src/.env')
load_dotenv(dotenv_path=dotenv_path)
dbt_api_trigger = os.getenv('dbt_api_trigger')

# Fill in your dbt Cloud account id and job id
url = f"https://cloud.getdbt.com/api/v2/accounts/{dbt_account_id}/jobs/<job_id>/run/"

headers = {
    "Authorization": f"Token {dbt_api_trigger}",
    "Content-Type": "application/json"
}

body = {
    "cause": "Triggered via API"
}

response = requests.post(url, headers=headers, json=body)
```

Voila! You have triggered a dbt job from your Mage pipeline.

Please include your FAQ in the following format:
## Key Vault in the Azure cloud stack

Key Vault in Azure is used to store the credentials, passwords, or secrets of the different services used in Azure. For example, if you do not want to expose the password of a SQL database, you can save the password under a given name and reference it from other Azure services.
## How to connect PySpark with BigQuery?

The following configuration should be included when building the SparkSession:

```python
# Example initialization of the SparkSession
spark = (SparkSession.builder
    .master(...)
    .appName(...)
    # Add the following configuration to pull in the BigQuery connector
    .config("spark.jars.packages", "com.google.cloud.spark:spark-3.5-bigquery:0.37.0")
    .getOrCreate()
)
```
## How to run a dbt-core project as an Airflow Task Group on Google Cloud Composer using a service account JSON key

1. [Install](https://cloud.google.com/composer/docs/composer-2/install-python-dependencies#install_custom_packages_in_a_environment) the [***astronomer-cosmos***](https://github.com/astronomer/astronomer-cosmos) package as a dependency (see Terraform [example](https://github.com/wndrlxx/ca-trademarks-data-pipeline/blob/4e6a0e757495a99e01ff6c8b981a23d6dc421046/terraform/main.tf#L100)).
2. Make a new folder, **dbt/**, inside the **dags/** folder of your Composer GCP bucket and copy your dbt-core project there (see [example](https://github.com/wndrlxx/ca-trademarks-data-pipeline/tree/4e6a0e757495a99e01ff6c8b981a23d6dc421046/dags/dbt/ca_trademarks_dp)).
3. Ensure your *profiles.yml* is configured to authenticate with a service account key (see BigQuery [example](https://docs.getdbt.com/docs/core/connect-data-platform/bigquery-setup#service-account-file)).
4. Create a new DAG using the **DbtTaskGroup** class and a **ProfileConfig** specifying a *profiles_yml_filepath* that points to the location of your JSON key file (see [example](https://github.com/wndrlxx/ca-trademarks-data-pipeline/blob/4e6a0e757495a99e01ff6c8b981a23d6dc421046/dags/6_dbt_cosmos_task_group.py#L47)).
5. Your dbt lineage graph should now appear as tasks inside a task group.
## How can I run UV in Kestra without installing it on every flow execution?

To avoid reinstalling uv on each flow run, you can create a custom Docker image based on the official Kestra image with uv pre-installed. Here's how:

- Create a Dockerfile with the following content:

```dockerfile
FROM kestra/kestra:latest
USER root
RUN pip install uv
CMD ["server", "standalone"]
```

- Update your docker-compose.yml to build this custom image instead of pulling the default one:

```yaml
# image: kestra/kestra:latest
build:
  context: .
  dockerfile: Dockerfile
```

This approach ensures that uv is available in the container at runtime without requiring installation during each flow execution.
## Is it possible to create external tables in BigQuery using URLs, such as those from the NY Taxi data website?

Answer: Not really; only Bigtable, Cloud Storage, and Google Drive are supported data stores.
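So the files have to land in a supported store first. Once they are in a Cloud Storage bucket, an external table can be defined over them; a sketch in BigQuery DDL, with hypothetical project, dataset, and bucket names:

```sql
-- Hypothetical names; the files must already exist in the GCS bucket
CREATE OR REPLACE EXTERNAL TABLE `my-project.trips_data.external_yellow_tripdata`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/yellow/yellow_tripdata_2021-*.parquet']
);
```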
## Is it OK to use NY\_Taxi data for the project?

No.
## How to use dbt-core with Athena?

If you don't have access to dbt Cloud, which already supports Athena natively on AWS (see [1](https://aws.amazon.com/blogs/big-data/from-data-lakes-to-insights-dbt-adapter-for-amazon-athena-now-supported-in-dbt-cloud/), [2](https://youtu.be/JEizJAaaBkg?si=niTYdWoeiyC_w3h7), [3](https://docs.getdbt.com/guides/athena?step=1), and [4](https://docs.getdbt.com/docs/core/connect-data-platform/athena-setup)), you can use the community-built [dbt-Athena adapter](https://dbt-athena.github.io/) with dbt Core.

### Key Features

* Enables dbt to work with AWS Athena using dbt Core
* Allows data transformation using CREATE TABLE AS or CREATE VIEW SQL queries
* Not yet supported features:
  1. Python models
  2. Persisting documentation for views

This adapter can be a valuable resource for those who need to work with Athena using dbt Core, and I hope this entry helps others discover it.
## Solving dbt-Athena library conflicts

When working on a dbt-Athena project, do not install dbt-athena-adapter. Instead, always use the dbt-athena-community package, ensuring it matches the version of dbt-core to avoid compatibility conflicts.

### Best Practice

* Always pin dbt-core and dbt-athena-community to the same version.

* Example:

  dbt-core~=1.9.3

  dbt-athena-community~=1.9.3

### Why?

* dbt-athena-adapter is outdated and no longer maintained.

* dbt-athena-community is the actively maintained package and is compatible with the latest versions of dbt-core.

### Steps to Avoid Conflicts

1. Always check the compatibility matrix in the [dbt-athena-community](https://github.com/dbt-labs/dbt-adapters/tree/main/dbt-athena-community) GitHub repository.
2. Update requirements.txt to use the latest compatible versions of dbt-core and dbt-athena-community.
3. Avoid mixing dbt-athena-adapter with dbt-athena-community in the same environment.

By following this practice, you can avoid the conflicts we faced previously and ensure a smooth development experience.

```
**Q: Your Question**

`Exact terminal error` if it exists

A: answer
```
