Investigate if/how we could support search without downloading parquet files

Downloading the ~100 MB parquet index files adds significantly to startup cost, which is particularly annoying when using idc-index in a skill.

We could query the parquet files directly over HTTP, or offer this as an alternative to try before (or instead of) downloading the parquet indices. As I discovered today, though, there are HTTP response constraints on how much data can be received back, and working around them requires setting a parameter that downloads the entire parquet file:

```python
import duckdb

# Connect to an in-memory DuckDB instance
con = duckdb.connect()

# Required for some servers that send more data than the HTTP content-length
# header indicates (e.g. Google Cloud Storage), which confuses DuckDB's
# default streaming reader. force_download fetches the whole file first.
con.execute("SET force_download=true")
```
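To make the "try remote first" idea concrete, here is a rough sketch of what it might look like. The function names (`remote_parquet_sql`, `query_remote_index`, `query_with_fallback`) and the example URL are hypothetical and not part of idc-index. It assumes DuckDB's `httpfs` extension can reach the file, and it only falls back to `force_download` if the plain remote read raises an error.

```python
from typing import Optional


def remote_parquet_sql(url: str, where: Optional[str] = None, limit: int = 100) -> str:
    """Build a SELECT that reads a remote parquet file in place (hypothetical helper)."""
    # Escape single quotes so the URL is a valid SQL string literal.
    quoted_url = "'" + url.replace("'", "''") + "'"
    sql = f"SELECT * FROM read_parquet({quoted_url})"
    if where:
        # `where` is raw SQL written by the local caller, not untrusted input.
        sql += f" WHERE {where}"
    return sql + f" LIMIT {int(limit)}"


def query_remote_index(url, where=None, limit=100, force_download=False):
    """Run the query with DuckDB and return a list of row tuples."""
    import duckdb  # imported lazily so importing this module stays cheap

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    if force_download:
        # Workaround from the snippet above: fetch the whole file first.
        con.execute("SET force_download=true")
    return con.execute(remote_parquet_sql(url, where, limit)).fetchall()


def query_with_fallback(url, where=None, limit=100):
    """Try a plain remote read; retry with force_download if DuckDB errors."""
    import duckdb

    try:
        return query_remote_index(url, where, limit)
    except duckdb.Error:
        return query_remote_index(url, where, limit, force_download=True)
```

For example, `query_with_fallback("https://example.org/idc_index.parquet", "Modality = 'CT'", limit=5)` would only pay the full download cost when the remote read fails. Whether the plain remote read is actually faster for typical idc-index queries is exactly what would need to be measured.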

Yet another alternative would be to revamp the IDC REST API and rely on that. From what I understand, though, it is unsafe to let callers pass arbitrary SQL queries to an API, so it would not be as flexible as the parquet-based approach. Maybe we can devise an API that covers the basic needs and fall back to downloading parquet only when needed, for more complex questions?
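One possible shape for such a "basic needs" API, sketched here as an assumption rather than a proposal for the actual IDC REST API: accept a structured filter instead of SQL, translate it into a parameterized query over a whitelist of columns, and signal the client to fall back to the parquet index for anything else. The column whitelist, the table name `idc_index`, and the `NeedsFullIndex` exception are all illustrative.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical whitelist; a real API would derive it from the index schema.
ALLOWED_COLUMNS = {"collection_id", "Modality", "PatientID"}
TABLE_NAME = "idc_index"  # illustrative table name


class NeedsFullIndex(Exception):
    """Raised when a request cannot be served by the restricted API."""


def filters_to_sql(filters: Dict[str, Any]) -> Tuple[str, List[Any]]:
    """Translate {column: value or list of values} into parameterized SQL."""
    clauses: List[str] = []
    params: List[Any] = []
    for column, value in filters.items():
        # Column names never come from the caller unchecked.
        if column not in ALLOWED_COLUMNS:
            raise NeedsFullIndex(f"unsupported column: {column}")
        if isinstance(value, (list, tuple)):
            if not value:
                raise ValueError(f"empty value list for column: {column}")
            placeholders = ", ".join("?" for _ in value)
            clauses.append(f'"{column}" IN ({placeholders})')
            params.extend(value)
        else:
            clauses.append(f'"{column}" = ?')
            params.append(value)
    sql = f"SELECT * FROM {TABLE_NAME}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    # Values travel separately as parameters, never spliced into the SQL text.
    return sql, params
```

A client could catch `NeedsFullIndex` and only then download the parquet files, so simple lookups stay cheap while complex questions keep the full flexibility of SQL over the local index.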

Investigate if/how we could support search without downloading parquet files #254
