add sample_by="document" support for jsonl files by cfahlgren1 · Pull Request #8111 · huggingface/datasets

cfahlgren1 · 2026-03-31T16:23:57Z

With more agentic traces (ie Claude Code / Codex / Pi etc) being published to the Hub as JSONL files, it would be nice to easily support loading / viewing them in Dataset Viewer.

Datasets like https://huggingface.co/datasets/cfahlgren1/agent-sessions contain one JSONL file per session. Without sample_by, each line becomes a separate row, losing the file-level grouping. With sample_by="document", each file is preserved as a single row:

  ds = load_dataset("cfahlgren1/agent-sessions", sample_by="document")
  # Each row is one session file

Also configurable via YAML metadata:

  ---
  configs:
    - config_name: default
      data_files: "*.jsonl"
      sample_by: "document"
  ---

HuggingFaceDocBuilderDev · 2026-03-31T16:26:37Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lhoestq

I think we should decode the JSON data instead of yielding them as text. Also we could name the column "documents" instead of "text".

Moreover since I expect agent traces to have a fixed schema we need to use the main code path based on paj.read_json which handles mixed types. We could simply yield Key(..., pa.concat_tables(tables)) once per file instead of at every batch

julien-c · 2026-03-31T18:53:32Z

would the auto-detection of agent harness (claude vs. codex vs. opencode etc) and other metadata, live in datasets in your opinion, @lhoestq? or in dataset-server, or in the frontend?

cfahlgren1 · 2026-03-31T19:17:33Z

would the auto-detection of agent harness (claude vs. codex vs. opencode etc) and other metadata, live in datasets in your opinion, @lhoestq? or in dataset-server, or in the frontend?

would be nice personally if we could have it live in datasets-server so we can do less parsing in frontend and add some sort of detection of trace datasets

cfahlgren1 · 2026-04-01T16:56:59Z

closing this in favor of #8113

add sample_by="document" support to json builder

dccd8e9

lhoestq reviewed Mar 31, 2026

View reviewed changes

cfahlgren1 closed this Apr 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add sample_by="document" support for jsonl files#8111

add sample_by="document" support for jsonl files#8111
cfahlgren1 wants to merge 1 commit intomainfrom
cfahlgren1/add-jsonl-sample-by-document

cfahlgren1 commented Mar 31, 2026

Uh oh!

HuggingFaceDocBuilderDev commented Mar 31, 2026

Uh oh!

lhoestq left a comment

Uh oh!

julien-c commented Mar 31, 2026

Uh oh!

cfahlgren1 commented Mar 31, 2026

Uh oh!

cfahlgren1 commented Apr 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

cfahlgren1 commented Mar 31, 2026

Uh oh!

HuggingFaceDocBuilderDev commented Mar 31, 2026

Uh oh!

lhoestq left a comment

Choose a reason for hiding this comment

Uh oh!

julien-c commented Mar 31, 2026

Uh oh!

cfahlgren1 commented Mar 31, 2026

Uh oh!

cfahlgren1 commented Apr 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants