Skip to content

add sample_by="document" support for jsonl files#8111

Closed
cfahlgren1 wants to merge 1 commit intomainfrom
cfahlgren1/add-jsonl-sample-by-document
Closed

add sample_by="document" support for jsonl files#8111
cfahlgren1 wants to merge 1 commit intomainfrom
cfahlgren1/add-jsonl-sample-by-document

Conversation

@cfahlgren1
Copy link
Copy Markdown
Contributor

With more agentic traces (ie Claude Code / Codex / Pi etc) being published to the Hub as JSONL files, it would be nice to easily support loading / viewing them in Dataset Viewer.

Datasets like https://huggingface.co/datasets/cfahlgren1/agent-sessions contain one JSONL file per session. Without sample_by, each line becomes a separate row, losing the file-level grouping. With sample_by="document", each file is preserved as a single row:

  ds = load_dataset("cfahlgren1/agent-sessions", sample_by="document")
  # Each row is one session file

Also configurable via YAML metadata:

  ---
  configs:
    - config_name: default
      data_files: "*.jsonl"
      sample_by: "document"
  ---

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Copy Markdown
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should decode the JSON data instead of yielding them as text. Also we could name the column "documents" instead of "text".

Moreover since I expect agent traces to have a fixed schema we need to use the main code path based on paj.read_json which handles mixed types. We could simply yield Key(..., pa.concat_tables(tables)) once per file instead of at every batch

@julien-c
Copy link
Copy Markdown
Member

would the auto-detection of agent harness (claude vs. codex vs. opencode etc) and other metadata, live in datasets in your opinion, @lhoestq? or in dataset-server, or in the frontend?

@cfahlgren1
Copy link
Copy Markdown
Contributor Author

would the auto-detection of agent harness (claude vs. codex vs. opencode etc) and other metadata, live in datasets in your opinion, @lhoestq? or in dataset-server, or in the frontend?

would be nice personally if we could have it live in datasets-server so we can do less parsing in frontend and add some sort of detection of trace datasets

image

@cfahlgren1
Copy link
Copy Markdown
Contributor Author

closing this in favor of #8113

@cfahlgren1 cfahlgren1 closed this Apr 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants