Python CLI

PyIceberg comes with a CLI that is available after installing the pyiceberg package.

You can pass the catalog connection details using --uri and --credential. For REST catalogs, you can also set --warehouse to request a specific warehouse from the catalog service. Setting up a ~/.pyiceberg.yaml config, as described in the Catalog section, is still the recommended approach.

➜  pyiceberg --help
Usage: pyiceberg [OPTIONS] COMMAND [ARGS]...

Options:
  --catalog TEXT
  --verbose BOOLEAN
  --output [text|json]
  --ugi TEXT
  --uri TEXT
  --credential TEXT
  --warehouse TEXT
  --help                Show this message and exit.

Commands:
  create            Operation to create a namespace.
  describe          Describe a namespace or a table.
  drop              Operations to drop a namespace or table.
  expire-snapshots  Expire snapshots from a table by ID or age.
  files             List all the files of the table.
  list              List tables or namespaces.
  list-refs         List all the refs in the provided table.
  location          Return the location of the table.
  properties        Properties on tables/namespaces.
  rename            Rename a table.
  schema            Get the schema of the table.
  spec              Return the partition spec of the table.
  uuid              Return the UUID of the table.
  version           Print pyiceberg version.
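
When the CLI is driven from a script, the global options listed above can be assembled as an argument list. The following is a minimal sketch, not part of PyIceberg: the helper name `global_options` and the example URI and warehouse values are hypothetical. It uses only flags shown in the help output and places them before the subcommand, following the usage line `pyiceberg [OPTIONS] COMMAND [ARGS]...`.

```python
from typing import List, Optional


def global_options(
    uri: Optional[str] = None,
    credential: Optional[str] = None,
    warehouse: Optional[str] = None,
    output: str = "text",
) -> List[str]:
    """Build the part of the command line that precedes the subcommand.

    Hypothetical helper: it only arranges flags from the help output above.
    """
    if output not in ("text", "json"):
        raise ValueError("output must be 'text' or 'json'")
    args = ["pyiceberg", "--output", output]
    for flag, value in (
        ("--uri", uri),
        ("--credential", credential),
        ("--warehouse", warehouse),
    ):
        # Only emit the flags that were actually provided.
        if value is not None:
            args += [flag, value]
    return args


# Hypothetical connection values, for illustration only.
cmd = global_options(uri="http://localhost:8181", warehouse="my_warehouse") + ["list"]
print(cmd)
```

Passing the resulting list to `subprocess.run(cmd, check=True)` avoids shell quoting issues with credentials that contain special characters.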

This example assumes that you have a default catalog set. To load a different catalog, for example the rest catalog from the example above, set --catalog rest.

➜  pyiceberg list
default
nyc
➜  pyiceberg list nyc
nyc.taxis
➜  pyiceberg describe nyc.taxis
Table format version  1
Metadata location     file:/.../nyc.db/taxis/metadata/00000-aa3a3eac-ea08-4255-b890-383a64a94e42.metadata.json
Table UUID            6cdfda33-bfa3-48a7-a09e-7abb462e3460
Last Updated          1661783158061
Partition spec        []
Sort order            []
Current schema        Schema, id=0
├── 1: VendorID: optional long
├── 2: tpep_pickup_datetime: optional timestamptz
├── 3: tpep_dropoff_datetime: optional timestamptz
├── 4: passenger_count: optional double
├── 5: trip_distance: optional double
├── 6: RatecodeID: optional double
├── 7: store_and_fwd_flag: optional string
├── 8: PULocationID: optional long
├── 9: DOLocationID: optional long
├── 10: payment_type: optional long
├── 11: fare_amount: optional double
├── 12: extra: optional double
├── 13: mta_tax: optional double
├── 14: tip_amount: optional double
├── 15: tolls_amount: optional double
├── 16: improvement_surcharge: optional double
├── 17: total_amount: optional double
├── 18: congestion_surcharge: optional double
└── 19: airport_fee: optional double
Current snapshot      Operation.APPEND: id=5937117119577207079, schema_id=0
Snapshots             Snapshots
└── Snapshot 5937117119577207079, schema 0: file:/.../nyc.db/taxis/metadata/snap-5937117119577207079-1-94656c4f-4c66-4600-a4ca-f30377300527.avro
Properties            owner                 root
                      write.format.default  parquet

Or output in JSON for automation:

➜  pyiceberg --output json describe nyc.taxis | jq
{
  "identifier": [
    "nyc",
    "taxis"
  ],
  "metadata_location": "file:/.../nyc.db/taxis/metadata/00000-aa3a3eac-ea08-4255-b890-383a64a94e42.metadata.json",
  "metadata": {
    "location": "file:/.../nyc.db/taxis",
    "table-uuid": "6cdfda33-bfa3-48a7-a09e-7abb462e3460",
    "last-updated-ms": 1661783158061,
    "last-column-id": 19,
    "schemas": [
      {
        "type": "struct",
        "fields": [
          {
            "id": 1,
            "name": "VendorID",
            "type": "long",
            "required": false
          },
...
          {
            "id": 19,
            "name": "airport_fee",
            "type": "double",
            "required": false
          }
        ],
        "schema-id": 0,
        "identifier-field-ids": []
      }
    ],
    "current-schema-id": 0,
    "partition-specs": [
      {
        "spec-id": 0,
        "fields": []
      }
    ],
    "default-spec-id": 0,
    "last-partition-id": 999,
    "properties": {
      "owner": "root",
      "write.format.default": "parquet"
    },
    "current-snapshot-id": 5937117119577207000,
    "snapshots": [
      {
        "snapshot-id": 5937117119577207000,
        "timestamp-ms": 1661783158061,
        "manifest-list": "file:/.../nyc.db/taxis/metadata/snap-5937117119577207079-1-94656c4f-4c66-4600-a4ca-f30377300527.avro",
        "summary": {
          "operation": "append",
          "spark.app.id": "local-1661783139151",
          "added-data-files": "1",
          "added-records": "2979431",
          "added-files-size": "46600777",
          "changed-partition-count": "1",
          "total-records": "2979431",
          "total-files-size": "46600777",
          "total-data-files": "1",
          "total-delete-files": "0",
          "total-position-deletes": "0",
          "total-equality-deletes": "0"
        },
        "schema-id": 0
      }
    ],
    "snapshot-log": [
      {
        "snapshot-id": "5937117119577207079",
        "timestamp-ms": 1661783158061
      }
    ],
    "metadata-log": [],
    "sort-orders": [
      {
        "order-id": 0,
        "fields": []
      }
    ],
    "default-sort-order-id": 0,
    "refs": {
      "main": {
        "snapshot-id": 5937117119577207000,
        "type": "branch"
      }
    },
    "format-version": 1,
    "schema": {
      "type": "struct",
      "fields": [
        {
          "id": 1,
          "name": "VendorID",
          "type": "long",
          "required": false
        },
...
        {
          "id": 19,
          "name": "airport_fee",
          "type": "double",
          "required": false
        }
      ],
      "schema-id": 0,
      "identifier-field-ids": []
    },
    "partition-spec": []
  }
}
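
Note that the numeric snapshot IDs in the jq-formatted output above end in ...207000, while the text output of describe and the snapshot-log entry show ...207079. That pattern is consistent with a JSON tool reading large integers as 64-bit floats; this is an inference from the output, not something the source states. The following is a minimal sketch of reading such an ID in Python, whose json module keeps integers exact. The `raw` string is a hand-written fragment assumed to resemble the unformatted CLI output, not captured output.

```python
import json

# Hand-written fragment (an assumption) modelled on the metadata shown above.
raw = '{"metadata": {"current-snapshot-id": 5937117119577207079, "format-version": 1}}'

doc = json.loads(raw)
snapshot_id = doc["metadata"]["current-snapshot-id"]

# Python ints have arbitrary precision, so the ID survives parsing unchanged.
print(snapshot_id)

# Forcing the value through a 64-bit float drops the low-order digits,
# which is why a float-based tool can print a different ID.
lossy = int(float(snapshot_id))
print(lossy == snapshot_id)
```

If an ID taken from formatted JSON is later passed to a command such as expire-snapshots, it may be worth comparing it against the text output of describe first.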

You can also add, update or remove properties on tables or namespaces:

➜  pyiceberg properties set table nyc.taxis write.metadata.delete-after-commit.enabled true
Set write.metadata.delete-after-commit.enabled=true on nyc.taxis

➜  pyiceberg properties get table nyc.taxis
write.metadata.delete-after-commit.enabled  true

➜  pyiceberg properties remove table nyc.taxis write.metadata.delete-after-commit.enabled
Property write.metadata.delete-after-commit.enabled removed from nyc.taxis

➜  pyiceberg properties get table nyc.taxis write.metadata.delete-after-commit.enabled
Could not find property write.metadata.delete-after-commit.enabled on nyc.taxis

Expire snapshots

expire-snapshots removes snapshots from a table. Snapshots that are the HEAD of a branch or that are referenced by a tag are protected and will be skipped.

Pass --snapshot-id one or more times to expire snapshots by ID, and/or --older-than <ISO datetime> to expire all unprotected snapshots older than the given timestamp. At least one of the two options is required.

➜  pyiceberg expire-snapshots nyc.taxis --snapshot-id 5937117119577207079
Expired snapshots on nyc.taxis
➜  pyiceberg expire-snapshots nyc.taxis \
    --snapshot-id 5937117119577207079 \
    --snapshot-id 4123987645210000000
Expired snapshots on nyc.taxis
➜  pyiceberg expire-snapshots nyc.taxis --older-than 2024-01-01T00:00:00
Expired snapshots on nyc.taxis
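
The option rules above (repeatable --snapshot-id, optional --older-than, at least one of the two) can be sketched as a small argument builder for scripted use. The helper name `expire_snapshots_args` is hypothetical and not part of PyIceberg. It assumes the CLI accepts a timestamp in the same form as the example above (`2024-01-01T00:00:00`), so it is written for naive datetime values; whether timezone offsets are accepted is not covered here.

```python
from datetime import datetime
from typing import Iterable, List, Optional


def expire_snapshots_args(
    table: str,
    snapshot_ids: Iterable[int] = (),
    older_than: Optional[datetime] = None,
) -> List[str]:
    """Build an argument list for `pyiceberg expire-snapshots` (hypothetical helper)."""
    ids = list(snapshot_ids)
    if not ids and older_than is None:
        # Mirrors the rule above: at least one of the two options is required.
        raise ValueError("pass at least one snapshot ID or an older_than timestamp")
    args = ["pyiceberg", "expire-snapshots", table]
    for snapshot_id in ids:
        # --snapshot-id is repeated once per ID.
        args += ["--snapshot-id", str(snapshot_id)]
    if older_than is not None:
        # Format the timestamp at second resolution in ISO 8601 form.
        args += ["--older-than", older_than.isoformat(timespec="seconds")]
    return args


cmd = expire_snapshots_args(
    "nyc.taxis",
    snapshot_ids=[5937117119577207079],
    older_than=datetime(2024, 1, 1),
)
print(" ".join(cmd))
```

Keeping snapshot IDs as Python integers until they are converted to strings here avoids the rounding that can occur when IDs pass through float-based tools.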