feat: Allow configuring dataset location in BigQueryCache #689

Merged
Aaron ("AJ") Steers (aaronsteers) merged 4 commits into airbytehq:main from honzabilek4:main
Jun 20, 2025

Conversation

@honzabilek4 Jan Bílek (honzabilek4) commented Jun 3, 2025

Contributor

Resolves #688.

Summary by CodeRabbit

  • New Features
    • Added support for specifying the geographic location of BigQuery datasets through a new configuration option. The dataset location now defaults to "US" but can be customized as needed.

@coderabbitai

coderabbitai Bot commented Jun 3, 2025

Contributor
Walkthrough

A new optional dataset_location field was added to the BigQueryConfig class, defaulting to "US". The code that constructs the DestinationBigquery object now uses the dataset_location from the cache, wrapped in a DatasetLocation object, instead of the previously hardcoded value.
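
As a rough illustration of the field described above, here is a minimal sketch using a plain dataclass. `BigQueryConfigSketch` is a hypothetical stand-in, not the actual `BigQueryConfig` class; only the field name `dataset_location` and its `"US"` default come from this PR, and the project and dataset names are placeholders.

```python
from dataclasses import dataclass


@dataclass
class BigQueryConfigSketch:
    """Hypothetical stand-in for BigQueryConfig (illustration only)."""

    project_name: str
    dataset_name: str
    # Field name and default are taken from the PR; everything else is assumed.
    dataset_location: str = "US"


# Omitting the field keeps the previous behavior (datasets in "US").
default_cfg = BigQueryConfigSketch(project_name="my-project", dataset_name="my_dataset")

# Setting it explicitly targets another location, e.g. for EU data residency.
eu_cfg = BigQueryConfigSketch(
    project_name="my-project",
    dataset_name="my_dataset",
    dataset_location="EU",
)
```

Because the default matches the previously hardcoded value, callers that never set the field should see no change.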

Changes

| File(s) | Change Summary |
|---|---|
| airbyte/_processors/sql/bigquery.py | Added dataset_location: str = "US" to BigQueryConfig with docstring and default value; updated client creation and schema creation to use location. |
| airbyte/destinations/_translate_cache_to_dest.py | Imported DatasetLocation and changed dataset_location in DestinationBigquery to use cache's location via DatasetLocation. |
| airbyte/destinations/_translate_dest_to_cache.py | Added dataset_location argument to BigQueryCache constructor from destination_configuration. |

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant BigQueryCache
    participant DestinationBigquery

    Caller->>BigQueryCache: Access project_name, dataset_name, dataset_location, credentials_path
    Caller->>DestinationBigquery: Create with project_id, dataset_id, dataset_location, credentials_json, loading_method

Assessment against linked issues

Objective Addressed Explanation
Make dataset_location configurable in BigQuery cache and use it in destination configuration (#688)

Assessment against linked issues: Out-of-scope changes

No out-of-scope changes detected.

Would you like me to help you draft a quick test or documentation update for this new dataset_location field as well? Wdyt?


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5558a0a and c56b440.

📒 Files selected for processing (1)
  • airbyte/_processors/sql/bigquery.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte/_processors/sql/bigquery.py
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: Pytest (No Creds)
  • GitHub Check: Pytest (Fast)
@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (1)
airbyte/_processors/sql/bigquery.py (1)

50-52: Nice addition! Just a small formatting fix needed.

The new dataset_location field looks great - good default value, proper typing, and helpful documentation linking to the official BigQuery locations. One small thing though - there's trailing whitespace on line 51 that's causing the linter to complain. Could you remove it? wdyt?

    dataset_location: str = "US"
-    """The geographic location of the BigQuery dataset (e.g., 'US', 'EU', etc.). 
+    """The geographic location of the BigQuery dataset (e.g., 'US', 'EU', etc.).
    Defaults to 'US'. See: https://cloud.google.com/bigquery/docs/locations"""
🧰 Tools
🪛 Ruff (0.11.9)

51-51: Trailing whitespace

Remove trailing whitespace

(W291)

🪛 GitHub Actions: Run Linters

[warning] 51-51: Ruff W291: Trailing whitespace detected. Remove trailing whitespace.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 796200b and 00dada2.

📒 Files selected for processing (2)
  • airbyte/_processors/sql/bigquery.py (1 hunks)
  • airbyte/destinations/_translate_cache_to_dest.py (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
airbyte/destinations/_translate_cache_to_dest.py (1)
airbyte/_processors/sql/bigquery.py (2)
  • project_name (55-57)
  • dataset_name (60-62)
🪛 GitHub Actions: Run Linters
airbyte/destinations/_translate_cache_to_dest.py

[warning] 4-22: Ruff I001: Import block is un-sorted or un-formatted. Organize imports.

🔇 Additional comments (1)
airbyte/destinations/_translate_cache_to_dest.py (1)

130-130: Great integration of the configurable dataset location!

I love how you've replaced the hardcoded "US" with DatasetLocation(cache.dataset_location) - this properly uses the new configurable field from the BigQueryConfig. The DatasetLocation wrapper appears to be the correct type expected by the API, and since cache.dataset_location has a default value of "US" in the config, this should be safe. Nice work!
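
The wrapping step can be sketched with a standard-library `Enum`. `DatasetLocationSketch` below is a hypothetical stand-in for the real `DatasetLocation` type, and its two members are assumed for illustration; the real type may define different values. Looking an enum up by value either returns the matching member or raises, which is the behavior this sketch assumes for `DatasetLocation(cache.dataset_location)`.

```python
from enum import Enum


class DatasetLocationSketch(Enum):
    """Hypothetical stand-in for DatasetLocation; members are assumed."""

    US = "US"
    EU = "EU"


def to_destination_location(dataset_location: str) -> DatasetLocationSketch:
    # Enum lookup by value: returns the matching member or raises ValueError.
    return DatasetLocationSketch(dataset_location)


us = to_destination_location("US")  # the config default
eu = to_destination_location("EU")

try:
    to_destination_location("MARS")
    rejected = False
except ValueError:
    rejected = True
```

If the real type behaves like this sketch, a location string it does not list would fail at translation time rather than when the cache is constructed.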

Comment thread airbyte/destinations/_translate_cache_to_dest.py Outdated
@honzabilek4

Contributor Author

Hi Aaron ("AJ") Steers (@aaronsteers),
I’d love to get this through. In the EU, we’re required to store user data within the European Union, and since BigQuery doesn’t support cross-region queries, this makes it impossible to work with datasets in the US.

@aaronsteers

Aaron ("AJ") Steers (aaronsteers) commented Jun 4, 2025

Member

/test-pr

PR test job started... Check job output.

✅ Tests passed.

@aaronsteers

Member

Jan Bílek (@honzabilek4) - On first review, this looks great! Have you been able to test manually with any success?

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0a68ca3 and 5558a0a.

📒 Files selected for processing (2)
  • airbyte/_processors/sql/bigquery.py (3 hunks)
  • airbyte/destinations/_translate_dest_to_cache.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • airbyte/destinations/_translate_dest_to_cache.py
🧰 Additional context used
🪛 Ruff (0.11.9)
airbyte/_processors/sql/bigquery.py

211-211: Line too long (106 > 100)

(E501)


212-212: Change outer quotes to avoid escaping inner quotes

Change outer quotes to avoid escaping inner quotes

(Q003)

🪛 GitHub Actions: Run Linters
airbyte/_processors/sql/bigquery.py

[error] 211-211: E501 Line too long (106 > 100)


[error] 212-212: Q003 Change outer quotes to avoid escaping inner quotes

⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (No Creds)
🔇 Additional comments (2)
airbyte/_processors/sql/bigquery.py (2)

50-53: Great addition for EU compliance requirements!

The new dataset_location field perfectly addresses the data residency requirements mentioned in the PR discussion. The default value maintains backward compatibility while the documentation clearly explains the purpose. wdyt?


97-97: Nice integration with BigQuery client initialization!

Passing the dataset_location to the BigQuery client ensures consistency across operations. This looks spot on for the feature requirements.

Comment thread airbyte/_processors/sql/bigquery.py Outdated
@honzabilek4

Contributor Author

Aaron ("AJ") Steers (@aaronsteers) – I’ve tested this locally and made a few additional changes to fully resolve the feature. When setting the region, it must be used consistently across all steps. If not, some jobs can fail due to region mismatches. It’s now working correctly.
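
One way to picture "used consistently" is to derive every location-sensitive piece from the same config value. The sketch below builds a schema-creation statement that way. The helper name `create_schema_sql` is hypothetical and not taken from the PR, and the `OPTIONS(location=...)` clause reflects our reading of BigQuery's `CREATE SCHEMA` DDL, so treat it as an assumption to check against the official docs.

```python
def create_schema_sql(project_name: str, dataset_name: str, dataset_location: str) -> str:
    """Build a CREATE SCHEMA statement pinned to one location (hypothetical helper)."""
    # Single outer quotes avoid escaping the inner double quotes (cf. Ruff Q003).
    return (
        f"CREATE SCHEMA IF NOT EXISTS `{project_name}.{dataset_name}` "
        f'OPTIONS(location="{dataset_location}")'
    )


# The same value that configures the cache is reused for the DDL statement.
sql = create_schema_sql("my-project", "my_dataset", "EU")
```

The same value would also go to the client; `google.cloud.bigquery.Client` accepts a `location` argument for this purpose.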

@aaronsteers

Aaron ("AJ") Steers (aaronsteers) commented Jun 5, 2025

Member

/test-pr

PR test job started... Check job output.

❌ Tests failed.

Member

This is a fantastic contribution. Thank you!!

@honzabilek4

Contributor Author

One test failed, but it looks unrelated. It’s testing the MotherDuck connection.

@honzabilek4

Contributor Author

Aaron ("AJ") Steers (@aaronsteers) Any idea what's happening here with the test? Is it flaky?

@aaronsteers

Member

> Aaron ("AJ") Steers (@aaronsteers) Any idea what's happening here with the test? Is it flaky?

Jan Bílek (@honzabilek4) - Apologies for the delay. This test is not specifically known to be flaky, but it is most likely an interim issue that is now resolved. I have some time today to rerun and review more closely. Will merge and release if all looks good.

@aaronsteers Aaron ("AJ") Steers (aaronsteers) merged commit 345d648 into airbytehq:main Jun 20, 2025
14 of 15 checks passed
Development

Successfully merging this pull request may close these issues.

Allow Configuring Dataset Location in BigQuery Cache

2 participants