Scan Iceberg table sorted on partition key without sort order #966

@BTheunissen

Description

Feature Request / Improvement

I am very excited by the recent addition of PyArrow RecordBatchReader support (#786), which makes it possible to read very large Iceberg tables without loading them fully into memory.

I currently have a use case where I want to read a large Iceberg table (1B+ rows) on which a sort order cannot be defined, because the upstream sink processor does not support it. The table is partitioned by modified_date_hour, which cannot serve as a strict ordering key.

Example table schema:

user(
  1: op: optional string,
  4: lsn: optional long,
  5: __deleted: optional string,
  6: usr_created_date: optional long,
  7: usr_modified_date: optional long,
  8: modified_date: optional timestamptz,
  9: usr_id: optional string,
  12: table: optional string,
  13: ts_ms: optional long
),
partition by: [modified_date_hour],
sort order: [],
snapshot: Operation.APPEND: id=4107982468622810123, parent_id=4969834436785268669, schema_id=0

Is it currently possible in 0.7.0rc1, or would it be possible with some changes, to read through a PyArrow RecordBatchReader that fetches batches partition by partition and then sorts each partition's batches in memory by another table column supplied by the client (in this case usr_modified_date), returning a stream of sorted PyArrow batches? The idea is to be able to efficiently fetch and resume the stream via a connector, so that records can be extracted incrementally.
