Minimum Viable Product (MVP) Roadmap #2

@wgtmac

Below is the planned roadmap for the MVP, as discussed in various places (e.g. the dev mailing list, GitHub issues and PRs, the Slack channel). Note that it covers only the native C++ implementation. For the Rust C++ binding effort, please refer to https://lists.apache.org/thread/hotlcdw86nrmt7cf5o5o7kq6gwo98758.

Convention

Goal

  • Implement the read path for parsing metadata files of Iceberg v1 & v2. Reading data files is a nice-to-have feature, depending on the bandwidth of contributors.
  • Provide a lightweight, I/O-less iceberg library with minimal dependencies (like apache/nanoarrow and nlohmann/json) that mainly deals with Iceberg metadata. Downstream projects are required to provide their own implementations of I/O, Parquet, and Avro support, and to write the adaptation code.
  • Provide a batteries-included iceberg-bundle library backed by the Apache Arrow C++ and Apache Avro C++ libraries.

Work items

(Disclaimer: this is not an exhaustive list and is subject to change as development goes on.)

API of metadata and building blocks

  • Add Schema (including data types)
  • Add DataFile
    - [ ] Add DeleteFile
  • Add ManifestFile
  • Add ManifestEntry
  • Add Snapshot
  • Add PartitionSpec
  • Add SortOrder
  • Add ManifestList
  • Add TableMetadata

Catalog

  • Define Catalog interface.
  • Implement an in-memory catalog.
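
To make the two items above more concrete, here is a minimal sketch of what a catalog interface and an in-memory implementation could look like. It assumes a catalog only needs to map a table identifier to the location of its current metadata file. All names (`TableIdentifier`, `Catalog`, `RegisterTable`, `LoadTable`, `DropTable`, `ListTables`, `InMemoryCatalog`) are hypothetical and are not taken from the project's actual API.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical table identifier: a namespace plus a table name.
struct TableIdentifier {
  std::string ns;
  std::string name;
  // Ordering so the identifier can be used as a std::map key.
  bool operator<(const TableIdentifier& other) const {
    return std::tie(ns, name) < std::tie(other.ns, other.name);
  }
};

// Hypothetical catalog interface; the project's real interface may differ.
class Catalog {
 public:
  virtual ~Catalog() = default;
  // Returns false if the identifier is already registered.
  virtual bool RegisterTable(const TableIdentifier& id,
                             const std::string& metadata_location) = 0;
  // Returns the metadata file location, or std::nullopt if unknown.
  virtual std::optional<std::string> LoadTable(
      const TableIdentifier& id) const = 0;
  // Returns true if a table was removed.
  virtual bool DropTable(const TableIdentifier& id) = 0;
  // Lists the tables registered under one namespace.
  virtual std::vector<TableIdentifier> ListTables(
      const std::string& ns) const = 0;
};

// Keeps everything in a std::map; nothing is persisted.
class InMemoryCatalog : public Catalog {
 public:
  bool RegisterTable(const TableIdentifier& id,
                     const std::string& metadata_location) override {
    return tables_.emplace(id, metadata_location).second;
  }
  std::optional<std::string> LoadTable(
      const TableIdentifier& id) const override {
    auto it = tables_.find(id);
    if (it == tables_.end()) return std::nullopt;
    return it->second;
  }
  bool DropTable(const TableIdentifier& id) override {
    return tables_.erase(id) > 0;
  }
  std::vector<TableIdentifier> ListTables(
      const std::string& ns) const override {
    std::vector<TableIdentifier> result;
    for (const auto& [id, location] : tables_) {
      if (id.ns == ns) result.push_back(id);
    }
    return result;
  }

 private:
  std::map<TableIdentifier, std::string> tables_;
};
```

An in-memory catalog of this shape is mostly useful for unit tests and examples, since it needs no external service.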

IO

  • Define FileIO interface with minimal operations.
  • Provide default FileIO implementation backed by arrow::FileSystem for different storage providers.
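
As a rough illustration of what "minimal operations" might mean, the sketch below assumes a FileIO that only reads, writes, and deletes whole files. The method names (`ReadFile`, `WriteFile`, `DeleteFile`) and the `InMemoryFileIO` class are hypothetical and are not the project's actual interface. The in-memory implementation stands in for a real backend such as one built on arrow::FileSystem, so the sketch compiles without extra dependencies.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical minimal I/O interface: whole-file operations only.
class FileIO {
 public:
  virtual ~FileIO() = default;
  // Returns the file content, or std::nullopt if the location does not exist.
  virtual std::optional<std::string> ReadFile(const std::string& location) = 0;
  // Creates or overwrites the file at the given location.
  virtual void WriteFile(const std::string& location, std::string content) = 0;
  // Returns true if a file was removed.
  virtual bool DeleteFile(const std::string& location) = 0;
};

// Test-only implementation that stores file contents in a std::map.
class InMemoryFileIO : public FileIO {
 public:
  std::optional<std::string> ReadFile(const std::string& location) override {
    auto it = files_.find(location);
    if (it == files_.end()) return std::nullopt;
    return it->second;
  }
  void WriteFile(const std::string& location, std::string content) override {
    files_[location] = std::move(content);
  }
  bool DeleteFile(const std::string& location) override {
    return files_.erase(location) > 0;
  }

 private:
  std::map<std::string, std::string> files_;
};
```

Keeping the interface this small is one way to let downstream projects plug in their own storage layer with little adaptation code.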

Table

  • Define Table interface.
  • Provide a basic Table implementation to have access to its metadata. @lishuxu
  • Implement Table::NewScan function and TableScan class to support planning files for reading a specific snapshot. @gty404
  • Implement partition pruning and data file pruning in the TableScan. (Postponed to 0.2.0)
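
The file-planning item above can be pictured with a simplified sketch. It assumes the snapshot, its manifests, and their entries are already loaded into memory, whereas a real implementation would read the manifest list and manifest files through the file readers. The struct and function names here are hypothetical. The status values follow the Iceberg table spec (0 = existing, 1 = added, 2 = deleted). Partition and data file pruning are not shown.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Entry status codes as defined by the Iceberg table spec.
enum class ManifestStatus { kExisting = 0, kAdded = 1, kDeleted = 2 };

// Hypothetical, heavily simplified metadata structs.
struct ManifestEntry {
  ManifestStatus status;
  std::string data_file_path;
};

struct ManifestFile {
  std::string path;
  std::vector<ManifestEntry> entries;  // Normally read from the manifest file.
};

struct Snapshot {
  int64_t snapshot_id;
  std::vector<ManifestFile> manifests;  // Normally read from the manifest list.
};

// Returns the paths of all live data files in the requested snapshot.
// Entries marked as deleted are skipped; an unknown snapshot id yields
// an empty plan.
std::vector<std::string> PlanFiles(const std::vector<Snapshot>& snapshots,
                                   int64_t snapshot_id) {
  std::vector<std::string> files;
  for (const auto& snapshot : snapshots) {
    if (snapshot.snapshot_id != snapshot_id) continue;
    for (const auto& manifest : snapshot.manifests) {
      for (const auto& entry : manifest.entries) {
        if (entry.status == ManifestStatus::kDeleted) continue;
        files.push_back(entry.data_file_path);
      }
    }
  }
  return files;
}
```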

JSON Serialization

  • SortOrder
  • PartitionSpec
  • Schema
  • Snapshot
  • TableMetadata
  • NameMapping

Metadata File Reader

  • JSON file reader.
  • JSON parser for metadata objects.
  • Reading gzip-compressed metadata json file.

File Format Reader

  • Define format-agnostic FileReader interface with Arrow C Data as the contract.
  • Implement manifest list file reader. @dongxiao1198
  • Implement manifest file reader.
  • Provide default Avro reader implementation in the iceberg-bundle library.
  • Provide default Parquet reader implementation in the iceberg-bundle library.

Schema/Data conversion

  • Bi-directional conversion to Arrow C schema.
  • ArrowArray -> avro::GenericDatum
  • avro::GenericDatum -> ArrowArray
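
For the schema half of this work, one possible starting point is a lookup table between primitive types and Arrow C Data Interface format strings. The sketch below covers only a handful of primitive types. The `PrimitiveType` enum and both function names are hypothetical. The format strings themselves ("b", "i", "l", "f", "g", "tdD", "u", "z") come from the Arrow C Data Interface specification. Nested, decimal, and timestamp types are left out.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical subset of Iceberg primitive types.
enum class PrimitiveType {
  kBoolean, kInt, kLong, kFloat, kDouble, kDate, kString, kBinary
};

struct FormatMapping {
  PrimitiveType type;
  const char* arrow_format;  // Arrow C Data Interface format string.
};

// One table drives both directions, so the two conversions cannot drift apart.
constexpr FormatMapping kFormatMappings[] = {
    {PrimitiveType::kBoolean, "b"},   // boolean
    {PrimitiveType::kInt, "i"},       // int32
    {PrimitiveType::kLong, "l"},      // int64
    {PrimitiveType::kFloat, "f"},     // float32
    {PrimitiveType::kDouble, "g"},    // float64
    {PrimitiveType::kDate, "tdD"},    // date32 (days since epoch)
    {PrimitiveType::kString, "u"},    // utf-8 string
    {PrimitiveType::kBinary, "z"},    // binary
};

// Primitive type -> Arrow format string.
std::string ToArrowFormat(PrimitiveType type) {
  for (const auto& mapping : kFormatMappings) {
    if (mapping.type == type) return mapping.arrow_format;
  }
  throw std::invalid_argument("unmapped primitive type");
}

// Arrow format string -> primitive type; std::nullopt for formats
// this sketch does not cover.
std::optional<PrimitiveType> FromArrowFormat(const std::string& format) {
  for (const auto& mapping : kFormatMappings) {
    if (format == mapping.arrow_format) return mapping.type;
  }
  return std::nullopt;
}
```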

Expression

Third-party library

  • Add nanoarrow to libiceberg
  • Add nlohmann/json to libiceberg @yingcai-cy
  • Add avro-cpp to libiceberg-bundle
  • Add arrow-cpp to libiceberg-bundle

First release

  • Check licenses.
  • Check documentation.
  • Add release script.
