Downloading the ~100 MB parquet index files adds significantly to startup cost, which is particularly annoying when using idc-index in a skill.
We could query the remote parquet files directly, or offer that as a mechanism to try before (or instead of) downloading the parquet indices. However, as I discovered today, there are HTTP response constraints on how much data can be received, and working around them requires a setting that downloads the entire parquet file:
import duckdb

# Connect to an in-memory DuckDB instance
con = duckdb.connect()
# Required for some servers that send more data than the HTTP content-length
# header indicates (e.g. Google Cloud Storage), which confuses DuckDB's
# default streaming reader. force_download fetches the whole file first.
con.execute("SET force_download=true")
Yet another alternative would be to revamp the IDC REST API and rely on that. From what I understand, though, it is unsafe to let clients pass arbitrary SQL queries to an API, so it would not be as flexible as the parquet-based approach. Maybe we can devise an API that covers basic needs and fall back to downloading parquet only when needed, for more complex questions?
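To make the "basic API first, parquet only if needed" idea concrete, here is a minimal sketch of the fallback logic. Everything in it is hypothetical: `run_with_fallback`, `QueryNotSupported`, and the two stub backends are illustrative names, not part of idc-index or the IDC REST API, and the stubs return placeholder values instead of touching the network.

```python
from typing import Any, Callable, List, Sequence, Tuple


class QueryNotSupported(Exception):
    """Raised by a backend that cannot answer the given request."""


# A backend takes a request string and returns rows, or raises QueryNotSupported.
Backend = Callable[[str], List[Any]]


def run_with_fallback(
    request: str, backends: Sequence[Tuple[str, Backend]]
) -> Tuple[str, List[Any]]:
    """Try each backend in order; return (backend name, rows) from the first that answers."""
    errors = []
    for name, backend in backends:
        try:
            return name, backend(request)
        except QueryNotSupported as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no backend could answer the request: " + "; ".join(errors))


def rest_api_backend(request: str) -> List[Any]:
    # Hypothetical stand-in for a restricted REST API: it only answers
    # predefined lookups and never accepts raw SQL.
    predefined = {"list_collections": ["placeholder_collection"]}
    if request not in predefined:
        raise QueryNotSupported("not a predefined lookup")
    return predefined[request]


def local_parquet_backend(request: str) -> List[Any]:
    # Hypothetical stand-in for the expensive path: download the parquet
    # index once, then run arbitrary SQL against the local copy.
    return [f"placeholder result for {request!r}"]


# Cheapest backend first, most flexible (and most expensive) last.
BACKENDS = [
    ("rest_api", rest_api_backend),
    ("local_parquet", local_parquet_backend),
]
```

With this ordering, `run_with_fallback("list_collections", BACKENDS)` is answered by the cheap backend, while a free-form SQL string falls through to the parquet download. A remote-parquet backend could be slotted in between the two if the HTTP constraints turn out to be manageable.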