FIX: [DEV-15236] improvements to scheduler and table model predict speed. #856
benleetownsend wants to merge 4 commits into development
| max_row = max(r[context_key] for r in context) | ||
| row_spans = [ | ||
| [ | ||
| row_spans = [[] for _ in range(max_row + 1)] |
This bucketing change is part of the chunker speedup. On the synthetic 150 x 20 table chunking benchmark, the chunker dropped from 55.443s to 0.263s (~99.5% faster) after the get_axis_spans / _make_chunks / combine_row_spans optimization pass.
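For readers skimming the thread, the bucketing idea can be sketched roughly as below. This is a minimal illustration rather than the code in this PR: the record shape (dicts with a "row" key) and the helper names bucket_by_scan and bucket_single_pass are assumptions, and all indices are assumed to be non-negative.

```python
def bucket_by_scan(context, context_key):
    """Baseline: rescan every record once per row, O(rows * records)."""
    max_row = max(r[context_key] for r in context)
    return [
        [r for r in context if r[context_key] == row]
        for row in range(max_row + 1)
    ]


def bucket_single_pass(context, context_key):
    """Pre-allocate one bucket per row, then append each record once."""
    max_row = max(r[context_key] for r in context)
    row_spans = [[] for _ in range(max_row + 1)]
    for r in context:
        row_spans[r[context_key]].append(r)
    return row_spans


# Hypothetical span records; the real ones likely carry more fields.
example_context = [
    {"row": 0, "start": 0, "end": 5},
    {"row": 2, "start": 12, "end": 18},
    {"row": 0, "start": 6, "end": 11},
]
example_buckets = bucket_single_pass(example_context, "row")
```

Both helpers keep records in their original order within each bucket, so the single-pass version should be a drop-in replacement for the scan whenever indices are non-negative; rows with no records simply get an empty list.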
| [mark_token(t) for t in token_spans if overlaps_token(row_span, t)] | ||
| ) | ||
| num_tokens = 0 | ||
| token_idx = bisect.bisect_left(token_ends, row_span["start"]) |
This bisect-bounded token scan is another part of the same chunker win. The synthetic 150 x 20 chunking benchmark went from 55.443s to 0.263s (~99.5% faster) after the chunker changes, while preserving output digests.
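A rough sketch of what a bisect-bounded scan looks like, for anyone unfamiliar with the pattern. It is not the PR's implementation: the half-open overlap convention, the dict shape of the spans, and the helper names are all assumptions, and it only holds when token starts and ends are both monotonically increasing.

```python
import bisect


def overlaps(span, token):
    """Half-open interval overlap test (an assumed convention)."""
    return token["start"] < span["end"] and token["end"] > span["start"]


def count_tokens_linear(row_span, token_spans):
    """Baseline: test every token against the row span."""
    return sum(1 for t in token_spans if overlaps(row_span, t))


def count_tokens_bisect(row_span, token_spans, token_ends):
    """Binary-search to the first candidate token, then scan forward.

    Assumes token starts and ends are both sorted ascending, so that
    token_ends (the list of t["end"] values) can be bisected.
    """
    num_tokens = 0
    # First token whose end is >= the row start; earlier tokens cannot overlap.
    token_idx = bisect.bisect_left(token_ends, row_span["start"])
    while token_idx < len(token_spans):
        token = token_spans[token_idx]
        if token["start"] >= row_span["end"]:
            break  # every later token starts even further right
        if overlaps(row_span, token):
            num_tokens += 1
        token_idx += 1
    return num_tokens


# Hypothetical tokens: ten 4-character tokens covering offsets 0 to 40.
tokens = [{"start": i, "end": i + 4} for i in range(0, 40, 4)]
ends = [t["end"] for t in tokens]
row = {"start": 10, "end": 21}
```

The per-row cost drops from a full pass over all tokens to a binary search plus a scan over only the tokens near the row, which is where most of the saving on large tables would come from. If token spans are not monotonic, the early break is no longer safe and the linear version is the one to trust.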
| @@ -1,6 +1,7 @@ | |||
| """ | |||
This file's changes are probably worth pulling in: big improvements on larger tables relative to the risk.
We could maybe drop the other changes, as they're not valuable enough for the risk at this point.
| """ | ||
| Finetune-style interface for running a pipeline of table and non-table models. | ||
| """ | ||
| import bisect |
I didn't know this was part of stdlib!
Note
Medium Risk
Changes chunking/span-token accounting logic that influences how table text is split into model inputs; performance-focused but could subtly alter chunk boundaries and downstream predictions.
Overview
Improves table-model preprocessing/chunking performance by avoiding repeated token counting and by accelerating token-overlap calculation when token spans are monotonic.
Also makes get_axis_spans build axis span buckets in a single pass (skipping negative indices), which can slightly change how spans are grouped before chunking in edge cases.
Reviewed by Cursor Bugbot for commit fc41b70.
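The negative-index point in the overview is worth a small illustration, since Python list indexing wraps negative values around to the end of the list. The sketch below is hypothetical: the record shape, the helper names, and the use of -1 to mean "no column assigned" are assumptions, and it does not claim to reproduce how get_axis_spans behaved before this change.

```python
def bucket_unguarded(records, key):
    """Single-pass bucketing with no check on the index sign."""
    buckets = [[] for _ in range(max(r[key] for r in records) + 1)]
    for r in records:
        buckets[r[key]].append(r["text"])  # -1 wraps to the last bucket
    return buckets


def bucket_guarded(records, key):
    """Single-pass bucketing that skips records with a negative index."""
    valid = [r for r in records if r[key] >= 0]
    if not valid:
        return []
    buckets = [[] for _ in range(max(r[key] for r in valid) + 1)]
    for r in valid:
        buckets[r[key]].append(r["text"])
    return buckets


# Hypothetical records; -1 stands in for "no column assigned".
records = [
    {"col": 0, "text": "a"},
    {"col": -1, "text": "orphan"},
    {"col": 1, "text": "b"},
]
```

With these records the unguarded version files "orphan" into the last column, while the guarded version drops it. Edge cases of this kind are one way the grouping could shift, so a regression test on a table with unassigned spans may be worth adding.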