Truncate Excess Columns Instead of Failing When Columns > columns_max_number#3255
Truncate Excess Columns Instead of Failing When Columns > columns_max_number#3255ArjunJagdale wants to merge 9 commits into
Conversation
Log a warning and truncate image URL columns if they exceed the maximum allowed number.
|
@severo would like your thoughts on this :) |
severo
left a comment
There was a problem hiding this comment.
Good idea.
Can you also add the tests for these cases?
| ) | ||
|
|
||
| if columns_were_truncated: | ||
| response["truncated_columns"] = truncated_columns |
There was a problem hiding this comment.
I think we don't need the list of missing columns in the response. Just a boolean, I guess.
| response = response_features_only | ||
| response["rows"] = row_items | ||
| response["truncated"] = (not rows_content.all_fetched) or truncated | ||
| response["truncated"] = ( |
There was a problem hiding this comment.
I thinl we should keep this field for truncated rows (we could have named it truncated_rows to be more explicit--maybe we can add that field, and deprecate truncated at some point?), and have another field for truncated columns (let's call it truncated_columns).
There was a problem hiding this comment.
Also, we need to apply this truncation to first_rows, I guess. And we should update the docs (the openapi spec in particular)
| num_scanned_rows=num_scanned_rows, | ||
| has_urls_columns=True, | ||
| full_scan=rows_content.all_fetched, | ||
| truncated_columns=truncated, |
There was a problem hiding this comment.
better here, a boolean, not a list of column names
|
@severo Also, regarding first_rows.py - since it calls create_first_rows_response(), will it automatically get these new fields, or does something need to be changed there as well? In opt_in_out_urls_scan_from_streaming.py, the truncated_columns=truncated is already passing a boolean value as you suggested. also in rows.py - Do I also need to update the type definition for SplitFirstRowsResponse to include the new truncated_rows and truncated_columns fields? If so, which file should I modify? |
|
We should keep "truncated" as it was before (only for truncated rows), otherwise we would report incorrectly in the dataset viewer for previously computed datasets. SplitFirstRowsResponse: yes, in https://github.com/huggingface/dataset-viewer/blob/main/libs/libcommon/src/libcommon/dtos.py. And also update https://github.com/huggingface/dataset-viewer/blob/main/docs/source/openapi.json
indeed |
|
i will do the changes, and will let you know! |
|
@severo the changes are now applied in all four affected files. |
|
It's in good shape! As I mentioned before, can you add unit tests for the changes? |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
This PR implements graceful truncation behavior for datasets with extremely wide schemas (i.e., thousands of columns), addressing #1172 and related discussions on improving the viewer’s robustness for modern AI-scale tabular datasets.
Previously, when the number of columns exceeded columns_max_number (default: 1000), several viewer steps—such as first-rows and opt-in/out URL scan—would raise TooManyColumnsError.
This made the viewer unusable for many large-scale datasets, even when a partial preview would have been perfectly acceptable.
Instead of failing, we now gracefully truncate the schema to the first columns_max_number columns and continue processing normally.
Implemented in:
libs/libcommon/src/libcommon/viewer_utils/rows.py1] Replaces the hard error with truncation
2] Adds response["truncated_columns"] (list of dropped columns)
3] Marks response["truncated"] = True when applicable
services/worker/src/worker/job_runners/split/opt_in_out_urls_scan_from_streaming.py1] Truncates image_url_columns instead of raising TooManyColumnsError
2] Emits a warning
3] Propagates truncation info to get_rows_or_raise