The table below lists all SQL functions exposed by Gaggle.
| # | Function | Return Type | Description |
|---|----------|-------------|-------------|
| 1 | `gaggle_set_credentials(username VARCHAR, key VARCHAR)` | BOOLEAN | Sets Kaggle API credentials from SQL (alternatively, use environment variables or `~/.kaggle/kaggle.json`). Returns true on success. |
| 2 | `gaggle_download(dataset_path VARCHAR)` | VARCHAR | Downloads a Kaggle dataset to the local cache directory and returns the local dataset path. This function is idempotent. |
| 3 | `gaggle_search(query VARCHAR, page INTEGER, page_size INTEGER)` | VARCHAR (JSON) | Searches Kaggle datasets and returns a JSON array. Constraints: page >= 1, 1 <= page_size <= 100. |
| 4 | `gaggle_info(dataset_path VARCHAR)` | VARCHAR (JSON) | Returns metadata for a dataset as JSON (for example: title, url, last_updated). |
| 5 | `gaggle_version()` | VARCHAR | Returns the extension version string (for example: "0.1.0"). |
| 6 | `gaggle_clear_cache()` | BOOLEAN | Clears the dataset cache directory. Returns true on success. |
| 7 | `gaggle_cache_info()` | VARCHAR (JSON) | Returns cache info as JSON with path, size_mb, limit_mb, usage_percent, is_soft_limit, and type fields. |
| 8 | `gaggle_enforce_cache_limit()` | BOOLEAN | Manually enforces the cache size limit using LRU eviction. Returns true on success. (Automatic with a soft limit by default.) |
| 9 | `gaggle_is_current(dataset_path VARCHAR)` | BOOLEAN | Checks whether the cached dataset is the latest version on Kaggle. Returns false if the dataset is not cached or is outdated. |
| 10 | `gaggle_update_dataset(dataset_path VARCHAR)` | VARCHAR | Forces an update to the latest version (ignores the cache). Returns the local path to the freshly downloaded dataset. |
| 11 | `gaggle_version_info(dataset_path VARCHAR)` | VARCHAR (JSON) | Returns version info as JSON: cached_version, latest_version, is_current, is_cached. |
| 12 | `gaggle_json_each(json VARCHAR)` | VARCHAR | Expands a JSON object into newline-delimited JSON rows with fields: key, value, type, path. Intended for internal use; most users will not need it. |
| 13 | `gaggle_file_path(dataset_path VARCHAR, filename VARCHAR)` | VARCHAR | Resolves a specific file's local path inside a downloaded dataset. |
| 14 | `gaggle_ls(dataset_path VARCHAR[, recursive BOOLEAN])` | TABLE(name VARCHAR, size BIGINT, path VARCHAR) | Lists files in the dataset's local directory; non-recursive by default. When recursive=true, subdirectories are walked. path values are returned as owner/dataset/\<relative-path\> (not absolute filesystem paths); size is in MB. |
Note

- The `gaggle_file_path` function will retrieve and cache the file if it is not already downloaded; set `GAGGLE_STRICT_ONDEMAND=1` to prevent fallback to a full dataset download on failures.
- Dataset paths must be in the form `owner/dataset`, where `owner` is the username and `dataset` is the dataset name on Kaggle. For example: `habedi/flickr-8k-dataset-clean`. You can also read files directly using the replacement scan with the `kaggle:` scheme. For example: `'kaggle:habedi/flickr-8k-dataset-clean/flickr8k.parquet'`.
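The environment variables mentioned above can be set in the shell before starting DuckDB. A minimal sketch, assuming a POSIX shell (the variable names are the ones this document mentions; see CONFIGURATION.md for their exact semantics):

```shell
# Toggles mentioned in this document; set before launching DuckDB.
export GAGGLE_STRICT_ONDEMAND=1   # on-demand file fetch only: no fallback to a full dataset download
export GAGGLE_OFFLINE=1           # offline mode: disables network-dependent functions such as gaggle_search
```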
To use most of the examples below, you need a valid Kaggle username and API key. Check out the Kaggle API documentation for how to obtain them.
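If you prefer the credentials-file route over calling `gaggle_set_credentials`, the sketch below creates `~/.kaggle/kaggle.json`. The file format is Kaggle's standard one; the username and key values are placeholders to replace with your own.

```shell
# Create the Kaggle credentials file (standard Kaggle API format).
mkdir -p "$HOME/.kaggle"
cat > "$HOME/.kaggle/kaggle.json" <<'EOF'
{"username": "your-username", "key": "your-api-key"}
EOF
# Kaggle tooling expects the file to be readable only by its owner.
chmod 600 "$HOME/.kaggle/kaggle.json"
```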
```sql
-- Load the Gaggle extension
load 'build/release/extension/gaggle/gaggle.duckdb_extension';

-- Set Kaggle credentials (or read from environment variables or from the `~/.kaggle/kaggle.json` file)
select gaggle_set_credentials('your-username', 'your-api-key');

-- Check version
select gaggle_version();

-- Search datasets (returns a JSON array)
-- (This function is disabled in offline mode, that is, when GAGGLE_OFFLINE=1)
select gaggle_search('iris', 1, 5);

-- Download a dataset and get its local path
select gaggle_download('uciml/iris') as local_path;

-- Get dataset metadata (as a JSON object)
-- (This function is disabled in offline mode, that is, when GAGGLE_OFFLINE=1)
select gaggle_info('uciml/iris') as dataset_metadata;
```

```sql
-- List files as a table (non-recursive)
select *
from gaggle_ls('uciml/iris') limit 5;

-- List files as a table (recursive)
select *
from gaggle_ls('suganthidevasagayam/social-media-post-of-postpartum-depression', true) limit 10;

-- List files as a JSON array
select to_json(list(struct_pack(name := name, size := size, path := path))) as files_json
from gaggle_ls('uciml/iris');

-- Note: returned `path` values are in the form 'owner/dataset/...',
-- which works for replacement scans or as an identifier inside the cache;
-- to get an absolute filesystem path, use `gaggle_file_path(owner_dataset, relative_path)`.

-- Resolve a file path and read it via a prepared statement
prepare rp as select * from read_parquet(?) limit 10;
execute rp(gaggle_file_path('owner/dataset', 'file.parquet'));
```

```sql
-- Replacement scan: read a single Parquet file via the `kaggle:` URL scheme
select count(*)
from 'kaggle:owner/dataset/file.parquet';

-- Replacement scan: glob Parquet files in a dataset directory
select count(*)
from 'kaggle:owner/dataset/*.parquet';
```

```sql
-- Check if the cached dataset is the latest version
select gaggle_is_current('owner/dataset') as is_current;

-- Get detailed version information
select gaggle_version_info('owner/dataset') as version_info;
-- Returns: {"cached_version": "3", "latest_version": "5", "is_current": false, "is_cached": true}

-- Force an update to the latest version (ignores the cache)
select gaggle_update_dataset('owner/dataset') as updated_path;

-- Download a specific version (version pinning)
select gaggle_download('owner/dataset@v2');     -- Version 2
select gaggle_download('owner/dataset@5');      -- Version 5 (without the 'v' prefix)
select gaggle_download('owner/dataset@latest'); -- Explicit latest

-- Use versioned datasets in queries
select *
from 'kaggle:owner/dataset@v2/file.csv';
select *
from 'kaggle:owner/dataset@v5/*.parquet';

-- Smart download: update only if outdated
select case
           when gaggle_is_current('owner/dataset') then gaggle_download('owner/dataset')
           else gaggle_update_dataset('owner/dataset')
           end as path;
```

```sql
-- Purge the cache and show cache info
select gaggle_clear_cache();
select gaggle_cache_info();

-- Manually enforce the cache size limit
-- (Automatic enforcement uses a soft limit by default; older files are removed first)
select gaggle_enforce_cache_limit();

-- Expand JSON into newline-delimited rows
select gaggle_json_each('{"a":1,"b":[true,{"c":"x"}]}') as rows;

-- Parse search results as JSON and extract a couple of fields
select json_extract_string(value, '$.ref') as ref,
       json_extract_string(value, '$.title') as title
from json_each(gaggle_search('iris', 1, 10)) limit 5;
```

To build Gaggle from source, you need GNU Make, CMake, a modern C++ compiler (like GCC or Clang), Rust, and Cargo.
- Clone the repository:

  ```shell
  git clone --recursive https://github.com/CogitatorTech/gaggle.git
  cd gaggle
  ```

- Build the extension:

  ```shell
  make release
  ```

  This will create a `duckdb` executable inside `build/release/` and a loadable extension at `build/release/extension/gaggle/gaggle.duckdb_extension`.

- Run the custom DuckDB shell:

  ```shell
  ./build/release/duckdb
  ```

  You can load the extension with:

  ```sql
  load 'build/release/extension/gaggle/gaggle.duckdb_extension';
  ```
Gaggle is made up of two main components:

- Rust Core (`gaggle/src/`) that handles:
  - Credentials management
  - HTTP client with timeouts and exponential backoff
  - Dataset downloads with safe ZIP extraction and file resolution
  - Search and metadata requests
  - A few C-compatible FFI functions for use by DuckDB
- C++ DuckDB Bindings (`gaggle/bindings/`) that:
  - Define the custom SQL functions (for example: `gaggle_ls`, `gaggle_file_path`, and `gaggle_search`)
  - Integrate with DuckDB's extension system and replacement scans (`'kaggle:...'`)
  - Marshal values between DuckDB vectors and the Rust FFI
- `ERROR_CODES.md`: information about the error codes returned by Gaggle.
- `CONFIGURATION.md`: details about the environment variables that can be used to configure Gaggle's behavior.
- `examples/`: example SQL scripts that showcase various Gaggle features.