
FIX: [DEV-15236] improvements to scheduler and table model predict speed.#856

Open
benleetownsend wants to merge 4 commits into development from fix/table_model_scheduler_and_performance

Conversation

@benleetownsend
Contributor

@benleetownsend benleetownsend commented Apr 24, 2026

Note

Medium Risk
Changes chunking/span-token accounting logic that influences how table text is split into model inputs; performance-focused but could subtly alter chunk boundaries and downstream predictions.

Overview
Improves table-model preprocessing/chunking performance by avoiding repeated token counting and by accelerating token-overlap calculation when token spans are monotonic.

Also makes get_axis_spans build axis span buckets in a single pass (skipping negative indices), which can slightly change how spans are grouped before chunking in edge cases.
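The first half of that speedup, avoiding repeated token counting, can be sketched as simple memoization. This is an illustrative sketch only: `tokenizer_encode` and `num_tokens` are stand-in names, not the actual finetune API.

```python
from functools import lru_cache

# Stand-in for whatever tokenizer finetune actually uses.
def tokenizer_encode(text):
    return text.split()  # toy whitespace "tokenizer"

@lru_cache(maxsize=None)
def num_tokens(text):
    # Cache the count so each unique cell text is tokenized once,
    # rather than once per chunk-boundary check that touches it.
    return len(tokenizer_encode(text))

print(num_tokens("first second third"))  # → 3
```

With caching in place, repeated boundary checks over the same table cells hit the cache instead of re-running the tokenizer.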

Reviewed by Cursor Bugbot for commit fc41b70. Bugbot is set up for automated code reviews on this repo.

-row_spans = [
-    [
+max_row = max(r[context_key] for r in context)
+row_spans = [[] for _ in range(max_row + 1)]
Contributor Author

This bucketing change is part of the chunker speedup. On the synthetic 150 x 20 table chunking benchmark, the chunker dropped from 55.443s to 0.263s (~99.5% faster) after the get_axis_spans / _make_chunks / combine_row_spans optimization pass.
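A minimal sketch of that single-pass bucketing, assuming spans are dicts carrying an axis index under `context_key`; this is hypothetical and the real `get_axis_spans` in finetune may differ in detail.

```python
def get_axis_spans(context, context_key):
    # Single-pass bucketing: size the bucket list up front, then drop
    # each span into its axis bucket, skipping negative indices.
    max_row = max(r[context_key] for r in context)
    row_spans = [[] for _ in range(max_row + 1)]
    for span in context:
        idx = span[context_key]
        if idx < 0:
            continue  # negative index: span not on this axis
        row_spans[idx].append(span)
    return row_spans

context = [
    {"row": 0, "text": "a"},
    {"row": 2, "text": "b"},
    {"row": -1, "text": "off-axis"},
    {"row": 2, "text": "c"},
]
buckets = get_axis_spans(context, "row")
print([len(b) for b in buckets])  # → [1, 0, 2]
```

The point of the change is that grouping is O(number of spans) rather than one pass over the context per axis index.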

Comment thread: finetune/util/table_labeler.py (Outdated)
-    [mark_token(t) for t in token_spans if overlaps_token(row_span, t)]
-)
+num_tokens = 0
+token_idx = bisect.bisect_left(token_ends, row_span["start"])
Contributor Author

This bisect-bounded token scan is another part of the same chunker win. The synthetic 150 x 20 chunking benchmark went from 55.443s to 0.263s (~99.5% faster) after the chunker changes, while preserving output digests.
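The bisect-bounded scan can be sketched as follows, assuming monotonic half-open `[start, end)` token spans and a precomputed sorted list of their end offsets; the names are illustrative, not the actual finetune helpers.

```python
import bisect

def count_overlapping_tokens(row_span, token_spans, token_ends):
    num_tokens = 0
    # Every token ending at or before row_span["start"] cannot overlap,
    # so binary-search straight past them instead of scanning from 0.
    token_idx = bisect.bisect_left(token_ends, row_span["start"])
    for t in token_spans[token_idx:]:
        if t["start"] >= row_span["end"]:
            break  # monotonic spans: nothing later can overlap either
        if t["end"] > row_span["start"]:
            num_tokens += 1
    return num_tokens

token_spans = [{"start": s, "end": s + 2} for s in range(0, 8, 2)]
token_ends = [t["end"] for t in token_spans]
print(count_overlapping_tokens({"start": 3, "end": 7}, token_spans, token_ends))  # → 3
```

Because both the entry point (via `bisect_left`) and the exit point (the early `break`) are bounded, each row span touches only the tokens near it instead of the whole token list.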

@@ -1,6 +1,7 @@
"""
Contributor Author

This file's changes are probably worth pulling in. Big improvements relative to the risk on larger tables.

Contributor Author

The other changes we could maybe drop as not being valuable enough for the risk at this point.

@madisonmay madisonmay self-requested a review April 30, 2026 12:55
"""
Finetune-style interface for running a pipeline of table and non-table models.
"""
import bisect
Contributor

I didn't know this was part of stdlib!
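It is: `bisect` has been in the standard library for a long time. A quick demo of `bisect_left`, the call used in this diff:

```python
import bisect

token_ends = [2, 4, 6, 8]  # sorted token end offsets
# bisect_left returns the index of the first element >= the probe,
# i.e. the insertion point that keeps the list sorted.
print(bisect.bisect_left(token_ends, 5))  # → 2
print(bisect.bisect_left(token_ends, 4))  # → 1
```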

@benleetownsend benleetownsend requested a review from madisonmay May 1, 2026 17:08