Skip to content

Truncate Excess Columns Instead of Failing When Columns > columns_max_number#3255

Open
ArjunJagdale wants to merge 9 commits into
huggingface:mainfrom
ArjunJagdale:process_part
Open

Truncate Excess Columns Instead of Failing When Columns > columns_max_number#3255
ArjunJagdale wants to merge 9 commits into
huggingface:mainfrom
ArjunJagdale:process_part

Conversation

@ArjunJagdale

@ArjunJagdale ArjunJagdale commented Nov 16, 2025

Copy link
Copy Markdown
Contributor

This PR implements graceful truncation behavior for datasets with extremely wide schemas (i.e., thousands of columns), addressing #1172 and related discussions on improving the viewer’s robustness for modern AI-scale tabular datasets.

Previously, when the number of columns exceeded columns_max_number (default: 1000), several viewer steps—such as first-rows and opt-in/out URL scan—would raise TooManyColumnsError.
This made the viewer unusable for many large-scale datasets, even when a partial preview would have been perfectly acceptable.

Instead of failing, we now gracefully truncate the schema to the first columns_max_number columns and continue processing normally.

Implemented in:

libs/libcommon/src/libcommon/viewer_utils/rows.py

1] Replaces the hard error with truncation

2] Adds response["truncated_columns"] (list of dropped columns)

3] Marks response["truncated"] = True when applicable

services/worker/src/worker/job_runners/split/opt_in_out_urls_scan_from_streaming.py

1] Truncates image_url_columns instead of raising TooManyColumnsError

2] Emits a warning

3] Propagates truncation info to get_rows_or_raise

Log a warning and truncate image URL columns if they exceed the maximum allowed number.
@ArjunJagdale

Copy link
Copy Markdown
Contributor Author

@severo would like your thoughts on this :)

@ArjunJagdale ArjunJagdale changed the title Process part Truncate Excess Columns Instead of Failing When Columns > columns_max_number Nov 16, 2025

@severo severo left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.

Can you also add the tests for these cases?

)

if columns_were_truncated:
response["truncated_columns"] = truncated_columns

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need the list of missing columns in the response. Just a boolean, I guess.

response = response_features_only
response["rows"] = row_items
response["truncated"] = (not rows_content.all_fetched) or truncated
response["truncated"] = (

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thinl we should keep this field for truncated rows (we could have named it truncated_rows to be more explicit--maybe we can add that field, and deprecate truncated at some point?), and have another field for truncated columns (let's call it truncated_columns).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we need to apply this truncation to first_rows, I guess. And we should update the docs (the openapi spec in particular)

num_scanned_rows=num_scanned_rows,
has_urls_columns=True,
full_scan=rows_content.all_fetched,
truncated_columns=truncated,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better here, a boolean, not a list of column names

@ArjunJagdale

ArjunJagdale commented Nov 17, 2025

Copy link
Copy Markdown
Contributor Author

@severo Also, regarding first_rows.py - since it calls create_first_rows_response(), will it automatically get these new fields, or does something need to be changed there as well?

In opt_in_out_urls_scan_from_streaming.py, the truncated_columns=truncated is already passing a boolean value as you suggested.

also in rows.py -

response["truncated"] = (
    (not rows_content.all_fetched)
    or truncated
    or columns_were_truncated
)

response["truncated_rows"] = (not rows_content.all_fetched) or truncated
response["truncated_columns"] = columns_were_truncated

Do I also need to update the type definition for SplitFirstRowsResponse to include the new truncated_rows and truncated_columns fields? If so, which file should I modify?

@severo

severo commented Nov 17, 2025

Copy link
Copy Markdown
Collaborator

We should keep "truncated" as it was before (only for truncated rows), otherwise we would report incorrectly in the dataset viewer for previously computed datasets.

SplitFirstRowsResponse: yes, in https://github.com/huggingface/dataset-viewer/blob/main/libs/libcommon/src/libcommon/dtos.py. And also update https://github.com/huggingface/dataset-viewer/blob/main/docs/source/openapi.json

@severo Also, regarding first_rows.py - since it calls create_first_rows_response(), will it automatically get these new fields, or does something need to be changed there as well?

indeed

@ArjunJagdale

Copy link
Copy Markdown
Contributor Author

i will do the changes, and will let you know!

@ArjunJagdale

ArjunJagdale commented Nov 17, 2025

Copy link
Copy Markdown
Contributor Author

@severo the changes are now applied in all four affected files.
The logic in libs/libcommon/viewer_utils/rows.py seems consistent with the new behavior, but let me know if you see anything else that should be adjusted.

@ArjunJagdale ArjunJagdale requested a review from severo November 20, 2025 18:54
@severo

severo commented Nov 21, 2025

Copy link
Copy Markdown
Collaborator

It's in good shape! As I mentioned before, can you add unit tests for the changes?

@HuggingFaceDocBuilderDev

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants