add sample_by="document" support for jsonl files#8111
add sample_by="document" support for jsonl files#8111cfahlgren1 wants to merge 1 commit intomainfrom
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
lhoestq
left a comment
There was a problem hiding this comment.
I think we should decode the JSON data instead of yielding them as text. Also we could name the column "documents" instead of "text".
Moreover since I expect agent traces to have a fixed schema we need to use the main code path based on paj.read_json which handles mixed types. We could simply yield Key(..., pa.concat_tables(tables)) once per file instead of at every batch
|
would the auto-detection of agent harness (claude vs. codex vs. opencode etc) and other metadata, live in datasets in your opinion, @lhoestq? or in dataset-server, or in the frontend? |
would be nice personally if we could have it live in datasets-server so we can do less parsing in frontend and add some sort of detection of trace datasets
|
|
closing this in favor of #8113 |

With more agentic traces (ie Claude Code / Codex / Pi etc) being published to the Hub as JSONL files, it would be nice to easily support loading / viewing them in Dataset Viewer.
Datasets like https://huggingface.co/datasets/cfahlgren1/agent-sessions contain one JSONL file per session. Without sample_by, each line becomes a separate row, losing the file-level grouping. With sample_by="document", each file is preserved as a single row:
Also configurable via YAML metadata: