PERF: speed up to_offset string parsing #65395
Merged
The dict cache inside _get_offset was added in 2012 to avoid re-running offset construction. Construction is now fast enough (~0.3 us) that the cache saves less than a microsecond per call, and to_offset itself never returned the cached object anyway because the trailing `offset * stride` step always produces a fresh instance.
Construct Tick subclasses directly for integer-stride tick names (h/min/s/ms/us/ns/D), avoiding the Timedelta + delta_to_tick + Tick.__mul__(float) chain whose float multiplication invoked np.isclose.

Drop np.fabs in the non-tick branch in favor of explicit sign-aware multiplication, convert _warn_about_deprecated_aliases and _validate_to_offset_alias to cdef + cache .upper() on the alias, precompute c_PERIOD_AND_OFFSET_DEPR_FREQSTR.values() as a frozenset, and walk the regex split by index instead of via zip + slice triples.

Tick offsets like "h", "5min", "3s" go from ~10us to ~0.8-1.0us; compound expressions like "1D1h" go from ~21us to ~1.6us; non-tick names like "ME", "BMS", "YS-MAR" go from ~2us to ~1.0-1.5us.

Also adds asv benchmarks for to_offset itself, which had none.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mroeschke
approved these changes
May 7, 2026
- Performance improvement in :func:`read_sas` for SAS7BDAT files with full-precision (8-byte) numeric columns, with up to ~2x speedup on bulk reads (:issue:`47339`)
- Performance improvement in :func:`read_sas` for compressed SAS7BDAT files by reusing the decompression buffer instead of allocating per row (:issue:`47339`)
- Performance improvement in :func:`read_sas` when decoding strings (:issue:`47339`)
- Performance improvement in :func:`tseries.frequencies.to_offset` parsing of frequency strings, especially for tick-resolution offsets (e.g. ``"h"``, ``"5min"``, ``"3s"``) and compound expressions (e.g. ``"1D1h"``) (:issue:`XXXXX`)
Member
Suggested change
- Performance improvement in :func:`tseries.frequencies.to_offset` parsing of frequency strings, especially for tick-resolution offsets (e.g. ``"h"``, ``"5min"``, ``"3s"``) and compound expressions (e.g. ``"1D1h"``) (:issue:`XXXXX`)
- Performance improvement in :func:`tseries.frequencies.to_offset` parsing of frequency strings, especially for tick-resolution offsets (e.g. ``"h"``, ``"5min"``, ``"3s"``) and compound expressions (e.g. ``"1D1h"``) (:issue:`65395`)
Comment on lines +7393 to +7395
# split has 4*N + 1 elements where N is the number of segments;
# walking by index avoids three list-slice copies + zip overhead
# vs ``zip(split[0::4], split[1::4], split[2::4])``.
Member
Suggested change
# split has 4*N + 1 elements where N is the number of segments;
# walking by index avoids three list-slice copies + zip overhead
# vs ``zip(split[0::4], split[1::4], split[2::4])``.
# split has 4*N + 1 elements where N is the number of segments
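The `4*N + 1` layout in that comment follows from how `re.split` behaves when the pattern has capture groups: each match contributes the text before it plus one entry per group. The sketch below checks this on a few inputs and compares the index walk with the zip-over-slices form. The pattern is a simplified, hypothetical stand-in with three groups (stride, name, optional suffix), not the pattern pandas uses, and the helper names are made up for illustration.

```python
import re

# Hypothetical, simplified pattern with three capture groups:
# (stride)(name(optional suffix)). This is NOT the pandas pattern.
PATTERN = re.compile(r"([+\-]?\d*\.?\d*)\s*([A-Za-z]+(-[A-Za-z]+)?)")


def segments_zip(split):
    # Builds three sliced copies of the list, then zips them together.
    return list(zip(split[0::4], split[1::4], split[2::4]))


def segments_index(split):
    # Walks the same list in steps of 4 without making sliced copies.
    out = []
    for base in range(0, len(split) - 1, 4):
        out.append((split[base], split[base + 1], split[base + 2]))
    return out


for freq in ["h", "5min", "1D1h", "5h30min", "-3D", "YS-MAR"]:
    parts = PATTERN.split(freq)
    # separator + 3 groups per segment, plus the trailing remainder
    assert len(parts) % 4 == 1
    assert segments_index(parts) == segments_zip(parts)
```

Both helpers yield `(separator, stride, name)` triples; the index walk simply skips the intermediate lists.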
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Member
Thanks @jbrockmendel
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Summary

Speeds up `pandas.tseries.frequencies.to_offset` string parsing. Stacked on top of GH-65390 (remove `_offset_map` cache); review/merge that first.

The big win is on tick-resolution offsets ("h", "min", "s", "ms", "us", "ns", "D"). The previous implementation built a `Timedelta(1, unit=name)`, called `delta_to_tick` to wrap it as a Tick, then multiplied by `float(stride)`. That multiplication goes through `Tick.__mul__(float)`, which calls `np.isclose` on every invocation, and that last step alone dominated profiles. Now we look the prefix up in a small `{name: (TickKlass, factor)}` dict and construct the Tick subclass directly for integer strides; fractional strides like "2.5min" still fall through to the old path so unit-promotion semantics are preserved.

A few smaller wins on the non-tick path:

- Drop `int(np.fabs(stride) * stride_sign)` (numpy scalar dance) in favor of explicit sign-aware Python int multiplication, mirroring the tick path.
- Convert `_warn_about_deprecated_aliases` and `_validate_to_offset_alias` to `cdef`. Both are only called from `to_offset`, so no public-API impact.
- Cache `alias.upper()` in `_validate_to_offset_alias` (was being called up to 3 times).
- Precompute `c_PERIOD_AND_OFFSET_DEPR_FREQSTR.values()` as a frozenset (was an O(n) values-view lookup per call).
- Walk the regex split by index instead of `zip(split[0::4], split[1::4], split[2::4])`, which drops three list-slice copies and the zip object on every call.

Perf

Microbench, per-call timings (Python 3.13, M-series mac), measured on these inputs:

"h", "5min", "D", "3s", "3D", "-3D", "1D1h", "5h30min", "ME", "BMS", "YS-MAR", "B", "2.5min"

`BaseOffset` passthrough, `None`, and `timedelta` paths are unchanged.

Also adds asv benchmarks (`ToOffset`, `ToOffsetPassthrough`) since none existed for `to_offset`.

Test plan

- `pandas/tests/tslibs/test_to_offset.py`: passes
- `pandas/tests/tseries/offsets/`: passes
- `pandas/tests/tseries/frequencies/`: passes
- `pandas/tests/tslibs/`: passes
- `pandas/tests/indexes/period/`: passes
- `pandas/tests/indexes/datetimes/test_date_range.py`: passes
- `pandas/tests/indexes/timedeltas/`: passes
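For readers who want a picture of the tick fast path described in the Summary, here is a minimal pure-Python sketch of the dispatch idea: look the name up in a small table, build the class directly for integer strides, and hand everything else to the general path. The classes, the table contents, the `factor` values (all 1 here, as placeholders), and the function names are hypothetical stand-ins; the actual change lives in Cython and may differ in every detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tick:
    """Stand-in for a fixed-duration offset; NOT the pandas class."""
    n: int = 1


class Hour(Tick):
    pass


class Minute(Tick):
    pass


class Second(Tick):
    pass


# Hypothetical table in the {name: (TickKlass, factor)} shape.
# The factor values are placeholders, not taken from the PR.
TICK_TABLE = {"h": (Hour, 1), "min": (Minute, 1), "s": (Second, 1)}


def slow_path(name, stride_text, sign):
    # Placeholder for the general path (fractional strides, non-tick names).
    return ("slow path", name, sign * float(stride_text or 1))


def build_offset(name, stride_text="", sign=1):
    entry = TICK_TABLE.get(name)
    # Fast path: known tick name with an empty or all-digit stride.
    if entry is not None and (stride_text == "" or stride_text.isdecimal()):
        klass, factor = entry
        stride = int(stride_text) if stride_text else 1
        # Plain Python int arithmetic; no float multiply, no numpy call.
        return klass(sign * stride * factor)
    return slow_path(name, stride_text, sign)
```

In this sketch `build_offset("h", "5")` returns `Hour(5)` without any float arithmetic, while `build_offset("min", "2.5")` falls through to `slow_path`, mirroring how fractional strides keep the old behavior.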