Truncate Excess Columns Instead of Failing When Columns > columns_max_number by ArjunJagdale · Pull Request #3255 · huggingface/dataset-viewer

ArjunJagdale · 2025-11-16T18:24:02Z

This PR implements graceful truncation behavior for datasets with extremely wide schemas (i.e., thousands of columns), addressing #1172 and related discussions on improving the viewer’s robustness for modern AI-scale tabular datasets.

Previously, when the number of columns exceeded columns_max_number (default: 1000), several viewer steps—such as first-rows and opt-in/out URL scan—would raise TooManyColumnsError.
This made the viewer unusable for many large-scale datasets, even when a partial preview would have been perfectly acceptable.

Instead of failing, we now gracefully truncate the schema to the first columns_max_number columns and continue processing normally.

Implemented in:

libs/libcommon/src/libcommon/viewer_utils/rows.py

1] Replaces the hard error with truncation

2] Adds response["truncated_columns"] (list of dropped columns)

3] Marks response["truncated"] = True when applicable
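As a rough illustration of the three points above, here is a minimal sketch of truncating a wide schema instead of raising. It is not the code from rows.py: the helper name `truncate_features` and the shape of the feature items are assumptions made for the example.

```python
from typing import Any


def truncate_features(
    features: list[dict[str, Any]], columns_max_number: int
) -> tuple[list[dict[str, Any]], list[str]]:
    """Return the first `columns_max_number` features and the names of the dropped ones."""
    kept = features[:columns_max_number]
    dropped = [feature["name"] for feature in features[columns_max_number:]]
    return kept, dropped


# Hypothetical wide schema: 1200 columns against a limit of 1000.
features = [{"feature_idx": i, "name": f"col_{i}"} for i in range(1200)]
kept, dropped = truncate_features(features, columns_max_number=1000)

# Build the response instead of raising an error, flagging the truncation.
response: dict[str, Any] = {"features": kept}
if dropped:
    response["truncated_columns"] = dropped
    response["truncated"] = True
```

Slicing keeps the original column order, so the preview always shows the leading columns of the dataset.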

services/worker/src/worker/job_runners/split/opt_in_out_urls_scan_from_streaming.py

1] Truncates image_url_columns instead of raising TooManyColumnsError

2] Emits a warning

3] Propagates truncation info to get_rows_or_raise

Log a warning and truncate image URL columns if they exceed the maximum allowed number.
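The warn-and-truncate behavior for the URL scan could look roughly like the sketch below. The function name `limit_url_columns` and the log message are hypothetical; only the standard `logging` module is used.

```python
import logging

logger = logging.getLogger(__name__)


def limit_url_columns(
    image_url_columns: list[str], columns_max_number: int
) -> tuple[list[str], bool]:
    """Keep at most `columns_max_number` URL columns and report whether any were dropped."""
    truncated = len(image_url_columns) > columns_max_number
    if truncated:
        # Warn instead of raising, so the scan continues on the kept columns.
        logger.warning(
            "Too many image URL columns (%d > %d); scanning only the first %d.",
            len(image_url_columns),
            columns_max_number,
            columns_max_number,
        )
    return image_url_columns[:columns_max_number], truncated


columns, truncated = limit_url_columns(["url_a", "url_b", "url_c"], columns_max_number=2)
```

The returned boolean is what would be passed along to the caller so the response can record that some columns were skipped.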

ArjunJagdale · 2025-11-16T18:36:06Z

@severo would like your thoughts on this :)

severo

Good idea.

Can you also add the tests for these cases?

severo · 2025-11-17T08:36:44Z

+    )
+
+    if columns_were_truncated:
+        response["truncated_columns"] = truncated_columns


I think we don't need the list of missing columns in the response. Just a boolean, I guess.

severo · 2025-11-17T08:38:19Z

    response = response_features_only
    response["rows"] = row_items
-    response["truncated"] = (not rows_content.all_fetched) or truncated
+    response["truncated"] = (


I think we should keep this field for truncated rows (we could have named it truncated_rows to be more explicit; maybe we can add that field and deprecate truncated at some point?), and add another field for truncated columns (let's call it truncated_columns).

Also, we need to apply this truncation to first_rows, I guess. And we should update the docs (the openapi spec in particular)

severo · 2025-11-17T08:40:20Z

        num_scanned_rows=num_scanned_rows,
        has_urls_columns=True,
        full_scan=rows_content.all_fetched,
+        truncated_columns=truncated,


better here, a boolean, not a list of column names

ArjunJagdale · 2025-11-17T10:09:12Z

@severo Also, regarding first_rows.py - since it calls create_first_rows_response(), will it automatically get these new fields, or does something need to be changed there as well?

In opt_in_out_urls_scan_from_streaming.py, the truncated_columns=truncated is already passing a boolean value as you suggested.

Also, in rows.py:

response["truncated"] = (
    (not rows_content.all_fetched)
    or truncated
    or columns_were_truncated
)

response["truncated_rows"] = (not rows_content.all_fetched) or truncated
response["truncated_columns"] = columns_were_truncated

Do I also need to update the type definition for SplitFirstRowsResponse to include the new truncated_rows and truncated_columns fields? If so, which file should I modify?

severo · 2025-11-17T10:29:09Z

We should keep "truncated" as it was before (only for truncated rows), otherwise we would report incorrectly in the dataset viewer for previously computed datasets.

SplitFirstRowsResponse: yes, in https://github.com/huggingface/dataset-viewer/blob/main/libs/libcommon/src/libcommon/dtos.py. And also update https://github.com/huggingface/dataset-viewer/blob/main/docs/source/openapi.json

> @severo Also, regarding first_rows.py - since it calls create_first_rows_response(), will it automatically get these new fields, or does something need to be changed there as well?

indeed
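To make the outcome of this exchange concrete, here is a hedged sketch of a response type where "truncated" keeps its original row-only meaning and "truncated_columns" is a separate boolean. The fields other than the two flags, and the helper `build_response`, are assumptions for illustration; the real definition lives in libcommon/dtos.py.

```python
from typing import Any, TypedDict


class SplitFirstRowsResponse(TypedDict):
    # Fields other than the two flags are assumed for this sketch.
    dataset: str
    config: str
    split: str
    features: list[dict[str, Any]]
    rows: list[dict[str, Any]]
    truncated: bool  # rows only, same meaning as before
    truncated_columns: bool  # new: True when columns were dropped


def build_response(
    dataset: str,
    config: str,
    split: str,
    features: list[dict[str, Any]],
    rows: list[dict[str, Any]],
    all_fetched: bool,
    rows_truncated: bool,
    columns_were_truncated: bool,
) -> SplitFirstRowsResponse:
    """Assemble the response, keeping row and column truncation in separate flags."""
    return {
        "dataset": dataset,
        "config": config,
        "split": split,
        "features": features,
        "rows": rows,
        "truncated": (not all_fetched) or rows_truncated,
        "truncated_columns": columns_were_truncated,
    }


example = build_response(
    "user/dataset", "default", "train", [], [],
    all_fetched=True, rows_truncated=False, columns_were_truncated=True,
)
```

Keeping the flags independent means responses cached before this change, which only carry "truncated", are still interpreted correctly.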

ArjunJagdale · 2025-11-17T15:27:07Z

I will make the changes and let you know!

ArjunJagdale · 2025-11-17T17:38:58Z

@severo the changes are now applied in all four affected files.
The logic in libs/libcommon/viewer_utils/rows.py seems consistent with the new behavior, but let me know if you see anything else that should be adjusted.

severo · 2025-11-21T16:03:58Z

It's in good shape! As I mentioned before, can you add unit tests for the changes?

HuggingFaceDocBuilderDev · 2025-11-21T16:04:16Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArjunJagdale added 3 commits November 16, 2025 23:18

Handle too many image URL columns with truncation

c632c64

Log a warning and truncate image URL columns if they exceed the maximum allowed number.

Update rows.py

f5865c2

Update rows.py

2fcad18

ArjunJagdale changed the title ~~Process part~~ Truncate Excess Columns Instead of Failing When Columns > columns_max_number Nov 16, 2025

severo reviewed Nov 17, 2025

ArjunJagdale added 2 commits November 17, 2025 15:16

Update rows.py

e8769b6

Update rows.py

9af008c

ArjunJagdale added 2 commits November 17, 2025 20:14

Simplify truncated response logic in rows.py

f1c6ad7

Update dtos.py

ba4b214

ArjunJagdale and others added 2 commits November 17, 2025 21:08

Update openapi.json

f82b6f2

Apply Updates

5d903d9

ArjunJagdale requested a review from severo November 20, 2025 18:54

ArjunJagdale wants to merge 9 commits into huggingface:main from ArjunJagdale:process_part
