4 changes: 4 additions & 0 deletions docs/hub/_toctree.yml
@@ -283,6 +283,10 @@
title: Performing data transformations
- local: datasets-polars-optimizations
title: Performance optimizations
- local: datasets-pyarrow
title: PyArrow
- local: datasets-pyiceberg
title: PyIceberg
- local: datasets-spark
title: Spark
- local: datasets-webdataset
202 changes: 171 additions & 31 deletions docs/hub/datasets-duckdb.md
@@ -1,11 +1,7 @@
# DuckDB

[DuckDB](https://github.com/duckdb/duckdb) is an in-process SQL [OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing) database management system.
You can use the Hugging Face paths (`hf://`) to access data on the Hub, or an Iceberg Datasets Catalog.

The [DuckDB CLI](https://duckdb.org/docs/api/cli/overview.html) (Command Line Interface) is a single, dependency-free executable.
There are also other APIs available for running DuckDB, including Python, C++, Go, Java, Rust, and more. For additional details, visit their [clients](https://duckdb.org/docs/api/overview.html) page.
@@ -20,55 +16,66 @@ Starting from version `v0.10.3`, the DuckDB CLI includes native support for accessing datasets on the Hugging Face Hub. With it, you can:
- Combine datasets and export them to different formats
- Conduct vector similarity search on embedding datasets
- Implement full-text search on datasets
- Use an Iceberg Datasets Catalog

For a complete list of DuckDB features, visit the DuckDB [documentation](https://duckdb.org/docs/).

## Authentication

To access gated and private datasets, log in to Hugging Face with:

```bash
hf auth login
```

Then, in DuckDB, create the `hf_token` secret with this command:

```sql
CREATE SECRET hf_token (TYPE HUGGINGFACE, PROVIDER credential_chain);
```
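
If you prefer not to rely on the CLI credential cache, a secret can also be created with an explicit token — a minimal sketch, assuming you replace the placeholder with your own token:

```sql
-- Pass a token directly instead of using the credential chain
CREATE SECRET hf_token (TYPE HUGGINGFACE, TOKEN 'hf_xxx_your_token_here');
```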

See more details on authentication in the [DuckDB authentication documentation for Hugging Face](./datasets-duckdb-auth).

## Querying files on Hugging Face

To access Hugging Face datasets, use the following URL format:

```plaintext
hf://datasets/{my-username}/{my-dataset}/{path_to_file}
```

- **my-username**, the user or organization of the dataset, e.g. `stanfordnlp`
- **my-dataset**, the dataset name, e.g. `imdb`
- **path_to_file**, the file path, which supports glob patterns, e.g. `**/*.parquet` to query all Parquet files
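
For instance, here is a sketch using DuckDB's `read_parquet` function over this URL format — the glob pattern is an assumption about the dataset's file layout:

```sql
-- Count rows across all train Parquet files (glob layout assumed)
SELECT count(*) AS n_rows
FROM read_parquet('hf://datasets/stanfordnlp/imdb/**/train-*.parquet');
```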


For example, to query the train split of the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset:

```sql
SELECT * FROM 'hf://datasets/stanfordnlp/imdb/**/train-*.parquet' LIMIT 10;
```

Which returns:

```
┌──────────────────────────────────────────────────────────────────────┬───────┐
│ text │ label │
│ varchar │ int64 │
├──────────────────────────────────────────────────────────────────────┼───────┤
│ I rented I AM CURIOUS-YELLOW from my video store because of all th… │ 0 │
│ "I Am Curious: Yellow" is a risible and pretentious steaming pile.… │ 0 │
│ If only to avoid making this type of film in the future. This film… │ 0 │
│ This film was probably inspired by Godard's Masculin, féminin and … │ 0 │
│ Oh, brother...after hearing about this ridiculous film for umpteen… │ 0 │
│ I would put this at the top of my list of films in the category of… │ 0 │
│ Whoever wrote the screenplay for this movie obviously never consul… │ 0 │
│ When I first saw a glimpse of this movie, I quickly noticed the ac… │ 0 │
│ Who are these "They"- the actors? the filmmakers? Certainly couldn… │ 0 │
│ This is said to be a personal film for Peter Bogdonavitch. He base… │ 0 │
├──────────────────────────────────────────────────────────────────────┴───────┤
│ 10 rows 2 columns │
└──────────────────────────────────────────────────────────────────────────────┘
```
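
The same query can be run from the DuckDB Python client — a minimal sketch, assuming `duckdb` and `pandas` are installed:

```python
import duckdb

# Query the train split of stanfordnlp/imdb directly over hf:// paths
df = duckdb.sql(
    "SELECT * FROM 'hf://datasets/stanfordnlp/imdb/**/train-*.parquet' LIMIT 10"
).df()
print(df.head())
```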

> [!TIP]
> **Querying Storage Buckets**: When using the DuckDB Python client, you can query data stored in [Storage Buckets](./storage-buckets) by registering the Hugging Face filesystem:
>
> ```python
> import duckdb
> from huggingface_hub import HfFileSystem
>
> # Register the Hugging Face filesystem so DuckDB can resolve hf:// paths
> duckdb.register_filesystem(HfFileSystem())
> duckdb.sql("SELECT * FROM 'hf://buckets/username/my-bucket/data.parquet' LIMIT 10")
> ```
> Native `hf://buckets/` support in DuckDB is expected in a future release.

## Query an Iceberg Datasets Catalog

Use `faceberg`, a PyIceberg-based library, to deploy an Iceberg catalog (see the next section) that you can use to query datasets on Hugging Face with a simple syntax.

In particular, you can query datasets as `faceberg.namespace.dataset_name` instead of passing a file pattern, and the catalog automatically adds a `split` column to differentiate between the train/test/validation splits.

For example, here is the syntax to query the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset:

```sql
SELECT * FROM faceberg.stanfordnlp.imdb LIMIT 10;
```

```
┌─────────┬────────────────────────────────────────────────────────────┬───────┐
│ split │ text │ label │
│ varchar │ varchar │ int64 │
├─────────┼────────────────────────────────────────────────────────────┼───────┤
│ train │ I rented I AM CURIOUS-YELLOW from my video store because… │ 0 │
│ train │ "I Am Curious: Yellow" is a risible and pretentious stea… │ 0 │
│ train │ If only to avoid making this type of film in the future.… │ 0 │
│ train │ This film was probably inspired by Godard's Masculin, fé… │ 0 │
│ train │ Oh, brother...after hearing about this ridiculous film f… │ 0 │
│ train │ I would put this at the top of my list of films in the c… │ 0 │
│ train │ Whoever wrote the screenplay for this movie obviously ne… │ 0 │
│ train │ When I first saw a glimpse of this movie, I quickly noti… │ 0 │
│ train │ Who are these "They"- the actors? the filmmakers? Certai… │ 0 │
│ train │ This is said to be a personal film for Peter Bogdonavitc… │ 0 │
├─────────┴────────────────────────────────────────────────────────────┴───────┤
│ 10 rows 3 columns │
└──────────────────────────────────────────────────────────────────────────────┘
```

And you can simply filter by split like this:

```sql
SELECT * FROM faceberg.stanfordnlp.imdb WHERE split = 'test' LIMIT 10;
```

```
┌─────────┬────────────────────────────────────────────────────────────┬───────┐
│ split │ text │ label │
│ varchar │ varchar │ int64 │
├─────────┼────────────────────────────────────────────────────────────┼───────┤
│ test │ I love sci-fi and am willing to put up with a lot. Sci-f… │ 0 │
│ test │ Worth the entertainment value of a rental, especially if… │ 0 │
│ test │ its a totally average film with a few semi-alright actio… │ 0 │
│ test │ STAR RATING: ***** Saturday Night **** Friday Night *** … │ 0 │
│ test │ First off let me say, If you haven't enjoyed a Van Damme… │ 0 │
│ test │ I had high hopes for this one until they changed the nam… │ 0 │
│ test │ Isaac Florentine has made some of the best western Marti… │ 0 │
│ test │ It actually pains me to say it, but this movie was horri… │ 0 │
│ test │ Technically I'am a Van Damme Fan, or I was. this movie i… │ 0 │
│ test │ Honestly awful film, bad editing, awful lighting, dire d… │ 0 │
├─────────┴────────────────────────────────────────────────────────────┴───────┤
│ 10 rows 3 columns │
└──────────────────────────────────────────────────────────────────────────────┘
```
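
Because the catalog exposes datasets as ordinary tables, regular SQL applies. For instance, a sketch of a per-split aggregate, assuming the `label` column is the 0/1 sentiment flag shown above:

```sql
-- Row counts and positive-label rate per split (label assumed to be 0/1)
SELECT split, count(*) AS n_rows, avg(label) AS positive_rate
FROM faceberg.stanfordnlp.imdb
GROUP BY split;
```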

### Deploy a catalog on the Hugging Face Hub

To deploy an Iceberg Datasets Catalog, install the library with `pip install faceberg`, then run this command, using your own Hugging Face username instead of "user":

```bash
faceberg user/mycatalog init
```

### Add datasets

Once your catalog is ready, add datasets using the following command:

```bash
faceberg user/mycatalog add stanfordnlp/imdb
faceberg user/mycatalog add openai/gsm8k --config main
```

### Query with interactive DuckDB shell

`faceberg` comes with a built-in DuckDB shell you can run like this:

```bash
faceberg user/mycatalog quack
```

```sql
SELECT label, substr(text, 1, 100) AS preview
FROM faceberg.stanfordnlp.imdb
LIMIT 10;
```

Alternatively, the DuckDB shell is also available in the catalog's web interface:

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-faceberg-space-imdb-sql-min.png"/>
</div>

### More information

Find more information on `faceberg` and the PyIceberg integration with the Hugging Face Hub in the [documentation](./datasets-pyiceberg).


## Auto-converted Parquet files

You can query auto-converted Parquet files using the `@~parquet` branch, which corresponds to the `refs/convert/parquet` revision. For more details, refer to the [dataset viewer documentation](https://huggingface.co/docs/datasets-server/en/parquet#conversion-to-parquet):

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/duckdb_hf_url.png"/>
</div>

To reference the `refs/convert/parquet` revision of a dataset, use the following syntax:

```plaintext
hf://datasets/{my-username}/{my-dataset}@~parquet/{path_to_file}
```

Here is a sample URL following the above syntax, pointing to a file in the [Parquet branch](https://huggingface.co/datasets/fka/prompts.chat/tree/refs%2Fconvert%2Fparquet) of the [fka/prompts.chat](https://huggingface.co/datasets/fka/prompts.chat) dataset:

```plaintext
hf://datasets/fka/prompts.chat@~parquet/default/train/0000.parquet
```
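
As a quick sketch, the earlier query patterns work unchanged against this revision:

```sql
-- Query the auto-converted Parquet file from the refs/convert/parquet revision
SELECT * FROM 'hf://datasets/fka/prompts.chat@~parquet/default/train/0000.parquet' LIMIT 5;
```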

In the following sections, we will cover more complex operations you can perform with DuckDB on Hugging Face datasets.

## Use-cases and examples

Find more use-cases and examples with Hugging Face Datasets here:

* [Query datasets](./datasets-duckdb-select)
* [Perform SQL operations](./datasets-duckdb-sql)
* [Combine datasets and export](./datasets-duckdb-combine-and-export)
* [Perform vector similarity search](./datasets-duckdb-vector-similarity-search)
1 change: 1 addition & 0 deletions docs/hub/datasets-libraries.md
@@ -24,6 +24,7 @@ The table below summarizes the supported libraries and their level of integration
| [Pandas](./datasets-pandas) | Python data analysis toolkit. | ✅ | ❌ | ✅ | ❌ | ✅* |
| [Polars](./datasets-polars) | A DataFrame library on top of an OLAP query engine. | ✅ | ✅ | ✅ | ❌ | ❌ |
| [PyArrow](./datasets-pyarrow) | Apache Arrow is a columnar format and a toolbox for fast data interchange and in-memory analytics. | ✅ | ✅ | ✅ | ❌ | ✅* |
| [PyIceberg](./datasets-pyiceberg) | Apache Iceberg is a high-performance open-source format for large analytic tables. | ✅ | ✅ | ❌ | ❌ | ❌ |
| [Spark](./datasets-spark) | Real-time, large-scale data processing tool in a distributed environment. | ✅ | ✅ | ✅ | ✅ | ✅ |
| [WebDataset](./datasets-webdataset) | Library to write I/O pipelines for large datasets. | ✅ | ✅ | ❌ | ❌ | ❌ |

Expand Down