Investigate if/how we could support search without downloading parquet files #254

@fedorov

Downloading the ~100 MB parquet files adds significantly to the startup cost, which is particularly annoying when using idc-index in a skill.

We could query the remote parquet files directly, or offer this as an alternative mechanism to try before (or instead of) downloading the parquet indices. However, as I discovered today, there are HTTP response constraints on how much data can be received back. To work around those, one would need to set a parameter that makes DuckDB download the entire parquet file:

    import duckdb

    # Connect to an in-memory DuckDB instance
    con = duckdb.connect()

    # Required for some servers that send more data than the HTTP content-length
    # header indicates (e.g. Google Cloud Storage), which confuses DuckDB's
    # default streaming reader. force_download fetches the whole file first.
    con.execute("SET force_download=true")
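
The "try the remote query first, download only if needed" idea could look roughly like the sketch below. This is only an illustration, not how idc-index works today: `search`, `make_duckdb_remote_runner`, and the view name `idx` are hypothetical names, and the remote runner assumes DuckDB's `httpfs` extension can read the parquet URL in place.

```python
from typing import Any, Callable, List, Tuple

Rows = List[Tuple[Any, ...]]


def search(sql: str,
           run_remote: Callable[[str], Rows],
           run_local: Callable[[str], Rows]) -> Tuple[Rows, str]:
    """Try the remote query first; fall back to the downloaded index on failure.

    Returns the rows plus a tag naming the path that produced them.
    """
    try:
        return run_remote(sql), "remote"
    except Exception:
        # Any remote failure (network error, HTTP response limits, ...)
        # triggers the slower local path that needs the full download.
        return run_local(sql), "local"


def make_duckdb_remote_runner(url: str) -> Callable[[str], Rows]:
    """Build a runner that queries a remote parquet file through DuckDB.

    The SQL passed to the runner refers to the file as the view `idx`
    (a hypothetical name chosen for this sketch).
    """
    def run(sql: str) -> Rows:
        import duckdb  # lazy import: only needed when the remote path is used

        con = duckdb.connect()
        con.execute("INSTALL httpfs")
        con.execute("LOAD httpfs")
        safe_url = url.replace("'", "''")  # escape quotes inside the SQL literal
        con.execute(
            f"CREATE VIEW idx AS SELECT * FROM read_parquet('{safe_url}')"
        )
        return con.execute(sql).fetchall()

    return run
```

The local runner would be whatever already executes SQL against the downloaded indices. Returning the path tag makes it easy to log how often the fallback is actually hit.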

Yet another alternative would be to revamp the IDC REST API and rely on that. From what I understand, though, it is unsafe to allow passing arbitrary SQL queries to an API, so it would not be as flexible as the parquet-based approach. Maybe we can devise an API that addresses the basic needs, and fall back to downloading the parquet files only when needed for more complex questions?

Labels: enhancement