Commit 52649f4

Major cleanup of DE Zoomcamp FAQ
- Reorganize sections: split general into general+environment, split module-1 (was 145 files) into module-1-{data,docker,postgres,gcp,terraform}, rename module-7 from "Streaming with Kafka" to just "Streaming"
- Move ~25 misplaced files to correct sections (Kestra→m2, dbt→m4, Spark→m6, Bruin→m5, GCP setup→general, etc.)
- Merge ~25 clusters of duplicate FAQs (biggest: 11 dbt+BigQuery region-mismatch FAQs into one; 10 Postgres connection failures into one; 9 Postgres data-folder permission errors into one)
- Rewrite outdated content: redirect cohort/deadline/playlist questions to the course repo instead of hardcoding years; drop references to Mage, Prefect, Faust, RisingWave (no longer in syllabus); generalize Terraform 1.1.3 download to current
- Drop 22 module-6 FAQs tied to the old Spark 3.x manual install (findspark, winutils, SPARK_HOME, PYTHONPATH/py4j zip, Spark Standalone, Hadoop on Windows). The course now uses Spark 4.x bundled via PySpark with uv, JDK 17/21, so those workarounds aren't needed
- Drop Anaconda-specific FAQs and replace with a short "use uv" pointer; strip incidental conda mentions throughout
- Renumber sort_order sequentially within every section; fix files with non-hex IDs and truncated filenames
- Total: 522 → 393 files (25% smaller)

de-faq-cleanup-plan.md at the repo root tracks every individual change made, organized by stage.
1 parent 81eb85c commit 52649f4

849 files changed

Lines changed: 9128 additions & 11906 deletions

_questions/data-engineering-zoomcamp/_metadata.yaml

Lines changed: 21 additions & 4 deletions
@@ -3,8 +3,25 @@ course_name: "Data Engineering Zoomcamp"
 sections:
   - id: general
     name: "General Course-Related Questions"
-  - id: module-1
-    name: "Module 1: Docker and Terraform"
+    comment: "Course logistics: cohort schedule, certificate, deadlines, leaderboard, project rules, contributing"
+  - id: environment
+    name: "Environment & Setup"
+    comment: "Where/how to run the course: local vs Codespaces vs GCP VM, Python version, Windows/WSL/Mac, troubleshooting workflow, IDE/editor tips"
+  - id: module-1-data
+    name: "Module 1: Taxi Data (download & handling)"
+    comment: "How to download / unzip / handle the NY taxi datasets used throughout the course (CSV.GZ, Parquet, wget/curl issues)"
+  - id: module-1-docker
+    name: "Module 1: Docker"
+    comment: "Docker engine, Docker Compose, volumes, networking, WSL: anything containerization-related in module 1"
+  - id: module-1-postgres
+    name: "Module 1: Postgres, pgAdmin & Python ingestion"
+    comment: "Postgres, pgcli, pgAdmin, Python ingestion (pandas, SQLAlchemy, psycopg), and SQL"
+  - id: module-1-gcp
+    name: "Module 1: GCP setup & VM"
+    comment: "GCP account, billing, free trial, SDK install, service accounts, VM (Compute Engine) setup, SSH"
+  - id: module-1-terraform
+    name: "Module 1: Terraform"
+    comment: "Terraform IaC for GCP: provider, credentials, errors, state, teardown"
   - id: module-2
     name: "Module 2: Workflow Orchestration"
     comment: "Questions about Kestra and workflow orchestration go here"
@@ -21,8 +38,8 @@ sections:
     name: "Module 6: Spark"
     comment: "Questions about Apache Spark, PySpark, and Dataproc go here"
   - id: module-7
-    name: "Module 7: Streaming with Kafka"
-    comment: "Questions about Kafka, streaming, and real-time data processing go here"
+    name: "Module 7: Streaming"
+    comment: "Questions about streaming and real-time data processing go here. Do not name specific tools (Kafka, PyFlink, Spark Streaming, Redpanda, etc.) in the section name, because the course's streaming stack changes between cohorts."
   - id: project
     name: "Project"
   - id: workshop-1-dlthub
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+---
+id: 29e58c5c37
+question: 'Environment: which Python version should I use?'
+sort_order: 1
+---
+
+Python 3.10 or 3.11 is a safe default: it works with the libraries used across the course (pandas, SQLAlchemy, dbt, dlt, PySpark with recent Spark releases, etc.).
+
+If you're following older recorded videos that use Python 3.9, that still works for everything except the very latest library versions; troubleshooting against the videos is easier on the version they use.
+
+If a specific module uses a stricter requirement, the course repo's module README will say so.
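As a rough illustration of the version guidance in this FAQ, the sketch below classifies a Python version string. The function name `classify_python` and the labels it prints are hypothetical and are not part of the course repo; the thresholds only mirror the text above (3.10/3.11 as the default, 3.9 for older videos).

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a Python version string onto the FAQ's guidance.
classify_python() {
  local version="$1"
  local major="${version%%.*}"   # text before the first dot
  local rest="${version#*.}"     # text after the first dot
  local minor="${rest%%.*}"      # text before the next dot
  if [ "$major" != "3" ]; then
    echo "unsupported"
    return
  fi
  case "$minor" in
    10|11) echo "recommended" ;;
    9)     echo "ok-for-older-videos" ;;
    *)     echo "check-module-readme" ;;
  esac
}

# Report on the interpreter found on PATH, if there is one.
if command -v python3 >/dev/null 2>&1; then
  detected="$(python3 -c 'import platform; print(platform.python_version())')"
  echo "Python ${detected}: $(classify_python "$detected")"
fi
```

Anything outside the listed versions is reported as "check-module-readme" rather than rejected, since the FAQ defers to each module's README for stricter requirements.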
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+---
+id: 4f1fe161b1
+question: 'Environment: which OS / cloud / dev setup should I use? (local vs GCP VM
+  vs Codespaces, AWS alternative, OS support)'
+sort_order: 2
+---
+
+## OS support
+
+Linux is the smoothest, but the course works on macOS and Windows too. Students in the most recent cohorts have completed it on all three. Windows users typically need WSL2 to avoid friction with shell scripts in later modules.
+
+## Where to run the course
+
+You have three good options. Pick whichever suits you:
+
+1. Local machine (laptop / PC). Easiest if you're already comfortable with Docker locally. Windows users should use WSL2 from the start.
+2. GitHub Codespaces. A free Linux dev environment with Docker, Python, and many CLI tools pre-installed. Useful if your laptop is underpowered, or if you switch between home and office machines. Ports for things like Kestra/pgAdmin are exposed via Codespaces' forwarded URL, not `http://localhost`.
+3. Google Cloud VM. The course videos demonstrate this setup. Useful if you want a persistent remote environment to SSH into, especially while staying logged in across machines.
+
+You don't need both Codespaces and a GCP VM; pick one. You will need a GCP account regardless because the course uses BigQuery (in Module 3 and the project), but GCP for compute is optional.
+
+## Can I use AWS / Snowflake / Azure / a different stack?
+
+Yes. The capstone project is graded on creating a data pipeline and producing a visualization; it doesn't mandate any specific cloud. Considerations:
+
+- The lessons are recorded against GCP, so you'll need to translate steps yourself.
+- You may need to explain your choice during peer review.
+- Fewer fellow students will be using AWS/Azure, so help in Slack may be slower.
+
+If you only want to run the course locally without any cloud, you can do that for everything except Module 3's BigQuery homework, which requires GCP.
+
+## Is the course Windows / macOS / Linux friendly?
+
+All three work. Linux is best by default. On Windows, install WSL2 and run everything inside a WSL distro, because Git Bash and MINGW64 are not always sufficient for shell scripts later in the course.
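The Codespaces note in this FAQ (forwarded URL instead of `http://localhost`) can be sketched as follows. This is a minimal sketch that assumes the `CODESPACE_NAME` and `GITHUB_CODESPACES_PORT_FORWARDING_DOMAIN` environment variables are set inside the codespace. The helper name `service_url` is hypothetical, and port 8080 is only an assumed example, not a value taken from the course.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the URL at which a forwarded port is reachable.
# Inside a codespace it builds the forwarded URL; elsewhere it uses localhost.
service_url() {
  local port="$1"
  if [ -n "${CODESPACE_NAME:-}" ] && [ -n "${GITHUB_CODESPACES_PORT_FORWARDING_DOMAIN:-}" ]; then
    echo "https://${CODESPACE_NAME}-${port}.${GITHUB_CODESPACES_PORT_FORWARDING_DOMAIN}"
  else
    echo "http://localhost:${port}"
  fi
}

# 8080 is an assumed example port; use whichever port your service publishes.
service_url 8080
```

The port still has to be forwarded (Codespaces usually detects listening ports automatically), so treat the printed URL as the place to look rather than a guarantee that the service is up.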
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+---
+id: b92da7c113
+question: 'Environment - Could not establish connection to "MyServerName": Got bad
+  result from install script'
+sort_order: 3
+---
+
+This issue occurs when attempting to connect to a GCP VM using VS Code on a Windows machine. You can resolve it by changing a registry value in the Registry Editor.
+
+Open the Run command window:
+- Use the shortcut keys `Windows + R`, or
+- Right-click "Start" and click "Run".
+
+Open the Registry Editor:
+- Type `regedit` in the Run command window, then press Enter.
+
+Change the registry value:
+- Navigate to `HKEY_CURRENT_USER\Software\Microsoft\Command Processor`.
+- Clear the "Autorun" value (it may contain "if exists") so that it is blank.
+
+Alternatively, you can delete the saved fingerprint within the known_hosts file:
+
+In Windows, locate the file at `C:\Users\<your_user_name>\.ssh\known_hosts` and remove the entry for the server.
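For readers who prefer a shell (for example inside WSL) over editing the file by hand, the known_hosts cleanup described above can be sketched like this. OpenSSH's own `ssh-keygen -R <host>` does the same job and also handles hashed entries. The function below is a hypothetical illustration that assumes plain, unhashed entries and takes the file path as an argument.

```bash
#!/usr/bin/env bash
# Hypothetical helper: drop every known_hosts line whose host field lists
# the given host. Assumes unhashed entries ("host[,host...] keytype key");
# hashed entries (starting with "|1|") are left untouched.
remove_known_host() {
  local host="$1" file="$2"
  cp "$file" "$file.bak"   # keep a backup before rewriting the file
  awk -v h="$host" '
    {
      n = split($1, names, ",")
      drop = 0
      for (i = 1; i <= n; i++) if (names[i] == h) drop = 1
      if (!drop) print
    }' "$file.bak" > "$file"
}

# Example (placeholder IP): remove_known_host "34.1.2.3" "$HOME/.ssh/known_hosts"
```

After the entry is removed, the next connection attempt prompts you to accept the server's current fingerprint again.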
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+---
+id: a8219681ec
+question: 'GCP for the course: free trial vs sandbox, paying, country restrictions,
+  why GCP'
+sort_order: 4
+---
+
+## Why GCP and not AWS / Azure?
+
+For uniformity across the cohort. The course uses BigQuery, which is GCP-only, and most students already have a Google account that works for sign-up. The concepts (data warehouse, object storage, IaC) translate to AWS/Azure, but the lessons are recorded against GCP. You can use a different cloud; see [the environment FAQ](#4f1fe161b1) for tradeoffs.
+
+## Do I have to pay?
+
+No. GCP offers a free trial with $300 in credits for new accounts. The course materials fit comfortably within that budget if you destroy unused resources (VMs, datasets, buckets) after each module. Check your billing dashboard daily, especially after spinning up Compute Engine VMs.
+
+To sign up for the free trial you need a valid credit/debit card; GCP uses it to verify identity but doesn't charge it without your consent.
+
+## Free Trial vs Sandbox: which one?
+
+GCP has two free options. They are not equivalent for this course:
+
+- Free Trial ($300 credit, 90 days). Required for the course: gives you VMs, GCS buckets, and full BigQuery functionality.
+- Sandbox (free, no credit card). Limited services. It does not include VMs or GCS, and BigQuery features are restricted, so you cannot complete the course on Sandbox alone.
+
+Use the Free Trial.
+
+## My country isn't supported / my card isn't accepted
+
+GCP isn't available in some countries, and some cards are rejected even where it is. Workarounds students have used:
+
+- Try a different card. Cards from some banks (e.g. Kazakhstan-based Kaspi) sometimes don't work; cards from other banks/countries (e.g. TBC in Georgia) do.
+- Pyypl and similar virtual cards have worked for some.
+- If you can't get a GCP account at all, you can still complete most of the course locally; only Module 3's homework strictly requires BigQuery. See the environment FAQ for which parts have local alternatives.
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+---
+id: 8b0214d089
+question: 'Environment: shell scripts (*.sh) don''t work for Windows users without WSL'
+sort_order: 5
+---
+
+Several modules use shell scripts (`*.sh`) for setup or runtime tasks. Most Windows users running them outside WSL will hit issues, because Git Bash and MINGW64 are not always sufficient. Set up a WSL environment from the start to avoid getting blocked partway through the course.
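One way a script could warn Windows users who are running it outside WSL is sketched below. It relies on two common conventions rather than anything from the course repo: Git Bash/MSYS report a `uname -s` starting with `MINGW` or `MSYS`, and WSL kernels include "microsoft" in `uname -r`. The function name `classify_env` and its labels are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify the shell environment from the output of
# `uname -s` (kernel name) and `uname -r` (kernel release).
classify_env() {
  local kernel_name="$1" kernel_release="$2"
  case "$kernel_name" in
    MINGW*|MSYS*|CYGWIN*) echo "windows-bash" ;;
    Linux)
      case "$kernel_release" in
        *[Mm]icrosoft*) echo "wsl" ;;
        *)              echo "linux" ;;
      esac ;;
    Darwin) echo "macos" ;;
    *)      echo "unknown" ;;
  esac
}

env_kind="$(classify_env "$(uname -s)" "$(uname -r)")"
if [ "$env_kind" = "windows-bash" ]; then
  echo "Warning: running outside WSL; some course scripts may fail here." >&2
fi
```

Passing the `uname` output in as arguments keeps the classification logic easy to check on any machine.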
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+---
+id: a83f047f52
+question: How to troubleshoot issues and ask good questions
+sort_order: 6
+---
+
+## Try to solve it yourself first
+
+- Read the error message carefully; it usually includes a line number, a stack trace, and a description of what went wrong.
+- Search the message: copy the most specific part of the error (not the whole stack trace) into Google. The format `<tool> <error message>` works well, e.g. `pgcli error column c.relhasoids does not exist`.
+- Check the official documentation of the tool you're using.
+- Use Ctrl+F in this FAQ and in Slack channel pinned messages.
+- Restart the process / container / shell / VM and try once more; many transient errors resolve this way.
+- If you suspect the install is broken, uninstall first, then reinstall. Reinstalling on top of a broken install rarely helps.
+
+## Asking for help in Slack / forums
+
+When the troubleshooting steps don't help and you need another pair of eyes, include enough info that someone can actually help without going back and forth:
+
+- Operating system and version (e.g. Windows 11 + WSL Ubuntu 24.04, Mac M2, Linux Ubuntu 22.04).
+- Which lesson / video you're following, and which command failed.
+- The exact command and the exact error; paste both as text inside triple-backtick code blocks. Don't paste screenshots of text.
+- What you've already tried. If you skip this, helpers' first suggestions will be the things you already tried.
+- Stay in one thread. Reply to your own question; don't open a new post for a follow-up.
+
+If the same problem recurs, post in the same thread with what changed in your environment since last time.
+
+## Help others by contributing back
+
+If your problem isn't yet covered in this FAQ, consider [opening a PR](https://github.com/DataTalksClub/faq) so the next student doesn't have to debug it from scratch.

_questions/data-engineering-zoomcamp/module-1/007_1ba19ed6a0_git-bash-backslash-as-an-escape-character-in-git-b.md renamed to _questions/data-engineering-zoomcamp/environment/007_1ba19ed6a0_git-bash-backslash-as-an-escape-character-in-git-b.md

File renamed without changes.
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+---
+id: e5dc51eac9
+question: 'VS Code: Tab using spaces'
+sort_order: 8
+---
+
+
+
+Error:
+
+```
+Makefile:2: *** missing separator. Stop.
+```
+
+Solution:
+
+Makefile recipe lines must start with a tab character, not spaces. Configure VS Code to indent this file with tabs and convert the existing spaces; see [this Stack Overflow answer](https://stackoverflow.com/questions/36814642/visual-studio-code-convert-spaces-to-tabs).
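To find which Makefile lines are affected before changing editor settings, a quick check like the one below may help. It is a minimal sketch: the function name `find_space_indented` is hypothetical, and it flags every line that begins with a space. Not every such line is necessarily a recipe line, so treat the output as a list of suspects.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the numbers of lines that start with a space.
# In a Makefile these are the likely causes of "missing separator".
find_space_indented() {
  awk '/^ /{ print NR }' "$1"
}

# Example: find_space_indented Makefile
```

If the reported line numbers match the one in the error message (line 2 in the example above), replacing the leading spaces on those lines with a single tab should resolve it.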
Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
1+
---
2+
id: 5b54567e89
3+
question: Opening an HTML file with a Windows browser from Linux running on WSL
4+
sort_order: 9
5+
---
6+
7+
If you’re running Linux on Windows Subsystem for Linux (WSL) 2, you can open HTML files from the guest (Linux) with any Internet Browser installed on the host (Windows). Just install [wslu](https://wslutiliti.es/wslu/install.html) and open the page using `wslview`:
8+
9+
```bash
10+
wslview index.html
11+
```
12+
13+
You can customize which browser to use by setting the `BROWSER` environment variable first. For example:
14+
15+
```bash
16+
export BROWSER='/mnt/c/Program Files/Firefox/firefox.exe'
17+
```
