feat(datasets): support Google Sheets URLs in dataset loader #290

Merged
msoedov merged 3 commits into msoedov:main from ykd007:feat/google-sheets-dataset-support
May 14, 2026
Conversation

@ykd007
Contributor

@ykd007 ykd007 commented May 14, 2026

Closes #86

What

Adds transparent Google Sheets URL normalization to fetch_csv_content.

When a public Google Sheets share/edit link is passed as a dataset URL, it is automatically rewritten to the /export?format=csv form before fetching — no change required from callers.

How

  • _normalize_google_sheets_url(url): pure regex transform that rewrites /edit#gid=N and query-param gid links, and passes through URLs that are already in export format
  • fetch_csv_content calls the normalizer before httpx.get, with follow_redirects=True added for robustness
  • import re moved to module level
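
A minimal sketch of what such a normalizer could look like, for readers who want to see the transform in code. This is not the code from the PR: the function name normalize_sheets_url_sketch, both regex patterns, and the choice to append the gid as &gid=N are assumptions made for illustration.

```python
import re

# Assumed pattern (not taken from the PR): captures the spreadsheet ID
# from https://docs.google.com/spreadsheets/d/<ID>/...
_SHEETS_ID_RE = re.compile(
    r"^https?://docs\.google\.com/spreadsheets/d/([A-Za-z0-9_-]+)"
)
# The gid can arrive in the fragment (#gid=N) or as a query parameter.
_GID_RE = re.compile(r"[#?&]gid=(\d+)")


def normalize_sheets_url_sketch(url: str) -> str:
    """Rewrite a Sheets share/edit link to a CSV export link (illustrative)."""
    match = _SHEETS_ID_RE.match(url)
    if match is None:
        # Not a Google Sheets URL: return it untouched.
        return url
    if "/export" in url or "output=csv" in url:
        # Already an export or published-CSV link: return it untouched.
        return url
    sheet_id = match.group(1)
    export_url = (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"
    )
    gid_match = _GID_RE.search(url)
    if gid_match:
        # Keep the tab selection by carrying the gid over to the export URL.
        export_url += f"&gid={gid_match.group(1)}"
    return export_url
```

Per the description above, the normalized URL would then be handed to httpx.get with follow_redirects=True inside fetch_csv_content.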

Tests

5 unit tests added to test_data.py covering: passthrough (non-Sheets URL), edit+gid conversion, edit-no-gid conversion, already-export passthrough, pub-output-csv passthrough.
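
The five cases listed here can be written down as a small table of (label, input, expected) rows. The sketch below is an illustration rather than the contents of test_data.py: the sample URLs, the expected &gid=N form, and the helper run_cases are all hypothetical, and the helper accepts any normalizer callable so it does not depend on the PR's private function.

```python
from typing import Callable, List, Tuple

# Hypothetical sample URLs; the spreadsheet IDs are placeholders.
BASE = "https://docs.google.com/spreadsheets/d/abc123"
PUB = "https://docs.google.com/spreadsheets/d/e/2PACX-xyz/pub?output=csv"

# (label, input URL, expected URL after normalization)
CASES: List[Tuple[str, str, str]] = [
    ("non-Sheets passthrough", "https://example.com/data.csv",
     "https://example.com/data.csv"),
    ("edit + gid", BASE + "/edit#gid=42", BASE + "/export?format=csv&gid=42"),
    ("edit, no gid", BASE + "/edit", BASE + "/export?format=csv"),
    ("already export", BASE + "/export?format=csv", BASE + "/export?format=csv"),
    ("pub output=csv", PUB, PUB),
]


def run_cases(normalizer: Callable[[str], str]) -> List[Tuple[str, str, str]]:
    """Return (label, expected, actual) for every case the normalizer gets wrong."""
    failures = []
    for label, url, expected in CASES:
        actual = normalizer(url)
        if actual != expected:
            failures.append((label, expected, actual))
    return failures
```

Calling run_cases with the function under test and asserting that the result is an empty list gives one failure report that names each broken case.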


Copilot AI left a comment


Pull request overview

Adds transparent Google Sheets URL normalization to fetch_csv_content, so public Sheets share/edit links are automatically rewritten to the /export?format=csv form before fetching. This resolves issue #86 by letting datasets configured with Sheets URLs be loaded without manual conversion.

Changes:

  • New _normalize_google_sheets_url helper that regex-matches Sheets URLs, preserves already-exported forms, and appends gid when present.
  • fetch_csv_content now normalizes the URL and uses follow_redirects=True for robustness.
  • 5 unit tests covering passthrough and conversion cases.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| agentic_security/probe_data/data.py | Adds _normalize_google_sheets_url and integrates it into fetch_csv_content; module-level re import. |
| agentic_security/probe_data/test_data.py | Adds TestNormalizeGoogleSheetsUrl covering passthrough, edit→export with/without gid, and export/pub passthrough. |


@ykd007
Contributor Author

ykd007 commented May 14, 2026

The 6 test failures in CI are pre-existing ModuleNotFoundError issues for anthropic and openai — unrelated to this PR. They appear on main as well (the maintainer has a fix(ci): commit in flight right now). Our TestNormalizeGoogleSheetsUrl tests all pass within the 346 passing tests. Pre-Commit checks are now green ✅

@msoedov msoedov merged commit e38365c into msoedov:main May 14, 2026
1 of 3 checks passed
@msoedov
Owner

msoedov commented May 14, 2026

@ykd007 thank you for the patch!



Development

Successfully merging this pull request may close these issues.

Enable support for Google Sheets-based datasets

3 participants