Skip to content

use a shadow table for faster rowgroup pruning #36

@cldellow

Description

@cldellow

See discussion in #34.

ATM we prune row groups by doing a linear scan over the row group statistics. For ~10-20K row groups, this takes ~20ms. Where possible, it'd be nice to do this pruning via a faster mechanism. This could make some parquet-backed tables almost competitive with SQLite itself for needle in a haystack queries. eg I think this is possible:

  • sqlite: 0.1ms
  • parquet w/fast rowgroup pruning: 1-2ms
  • parquet w/linear rowgroup pruning: 20ms

Option 1: SQLite shadow table

Perhaps on table connect we could populate a shadow table with the statistics, then use that at query time to generate an array of row group IDs to traverse.

I'm not as sure that this is a slamdunk. SQLite doesn't support using more than 1 index in a query, so it will often end up doing a table scan.

Option 2: our own indexing

We could do the indexing ourself in C. I'm not enthused about this.

I guess we could sort the ranges by their lower bound into some array arr.

Then binary search arr for lower bound < our lowest value (order by lo desc limit 1) and lower bound > our highest value (order by low asc limit 1), which gives us the bounds to search.

Actually, I guess we could translate that into a series of SQLite calls.

Option 3: constrain the problem

We could support it only for monotonically increasing values -- then we could do 2 binary searches to determine the lower and upper bounds of the row groups to search. We could either do the search in C, or delegate it to SQLite.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions