perf(create_disrnn_dataset): group by session once instead of per-session df.query by hanhou · Pull Request #29 · AllenNeuralDynamics/aind_disrnn_utils

hanhou · 2026-06-21T15:59:02Z

Summary

create_disrnn_dataset looped over df_trials["ses_idx"].unique() twice (once for xs, once for ys), calling df_trials.query("ses_idx == @ses_idx") for each session. df.query re-parses the expression string and re-scans the whole frame on every call, so on cohorts with many sessions this dominated dataset construction (about 107 s, or ~24%, of a profiled 441 s multi-subject load()).

Change

Replace both loops with a single groupby("ses_idx", sort=False) pass. sort=False preserves the first-appearance session order that defines the column index (matching df_trials["ses_idx"].unique()), so xs/ys are identical to the previous output.
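As a rough illustration of the single-pass pattern described above, here is a minimal sketch, not the actual create_disrnn_dataset code. The helper name build_xs_ys and the columns prev_choice, prev_reward, and choice are hypothetical; only ses_idx and the groupby("ses_idx", sort=False) call come from this PR.

```python
import pandas as pd


def build_xs_ys(df_trials, x_cols, y_cols):
    """Collect per-session inputs and targets in one groupby pass.

    Hypothetical helper for illustration; the real function also has to
    assemble these per-session arrays into the final dataset.
    """
    session_ids, xs, ys = [], [], []
    # sort=False yields groups in order of first appearance in the frame,
    # which is the same order df_trials["ses_idx"].unique() returns.
    for ses_idx, ses in df_trials.groupby("ses_idx", sort=False):
        session_ids.append(ses_idx)
        xs.append(ses[x_cols].to_numpy())
        ys.append(ses[y_cols].to_numpy())
    return session_ids, xs, ys


# Toy frame with made-up columns; session ids are deliberately not sorted.
df_trials = pd.DataFrame(
    {
        "ses_idx": [2, 2, 0, 0, 0, 1],
        "prev_choice": [0, 1, 0, 1, 1, 0],
        "prev_reward": [0, 1, 0, 0, 1, 0],
        "choice": [1, 0, 1, 1, 0, 0],
    }
)

session_ids, xs, ys = build_xs_ys(
    df_trials, x_cols=["prev_choice", "prev_reward"], y_cols=["choice"]
)
```

Each session's rows are selected once and reused for both xs and ys, rather than being re-selected by a separate df.query call in each of two loops.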

Measured performance (real cohort: 878 subjects, 23,569 sessions, 11.3M trials)

Per-subject xs/ys build time, summed over all subjects (on-prem H200 node, CPU):

| Implementation | Time |
|---|---|
| OLD (per-session `df.query`) | 63.2 s |
| NEW (`groupby`, `sort=False`) | 8.5 s |
| Speedup | 7.4× faster, ~55 s saved |

Verification

Real-cohort benchmark above: xs/ys bit-identical to the old query-loop output for all 878 subjects (identical=True).
Standalone equivalence test (interleaved, non-sorted sessions): identical. The same test fails under the default sort=True, which is why sort=False is required.
End-to-end: create_disrnn_dataset returns a valid DatasetRNN with correct shapes.
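The standalone equivalence check could look something like the sketch below. This is an assumed reconstruction on a toy frame, not the test that was actually run; the choice column and the helper names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame (made up): session ids are interleaved and not sorted.
df_trials = pd.DataFrame(
    {
        "ses_idx": [3, 1, 3, 2, 1, 3, 2],
        "choice": [0, 1, 1, 0, 0, 1, 1],
    }
)


def sessions_by_query(df):
    """Reference: the old per-session df.query loop."""
    out = []
    for ses_idx in df["ses_idx"].unique():
        out.append(df.query("ses_idx == @ses_idx")["choice"].to_numpy())
    return out


def sessions_by_groupby(df, sort):
    """Candidate: one groupby pass, with a configurable sort flag."""
    return [g["choice"].to_numpy() for _, g in df.groupby("ses_idx", sort=sort)]


def same(a, b):
    """True if both lists hold equal arrays in the same order."""
    return len(a) == len(b) and all(np.array_equal(x, y) for x, y in zip(a, b))


reference = sessions_by_query(df_trials)
matches_unsorted = same(reference, sessions_by_groupby(df_trials, sort=False))
matches_sorted = same(reference, sessions_by_groupby(df_trials, sort=True))
```

On this toy frame, matches_unsorted is True and matches_sorted is False: sorting reorders the sessions to 1, 2, 3, while unique() and sort=False both give 3, 1, 2.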

🤖 Generated with Claude Code

…sion df.query

create_disrnn_dataset looped df_trials["ses_idx"].unique() twice (xs then ys), calling df.query("ses_idx == @ses_idx") per session. df.query re-parses the expression string and re-scans the frame on every call, so for cohorts with many sessions this dominated dataset construction (~107s of a profiled 441s multisubject load).

Replace both loops with a single groupby("ses_idx", sort=False) pass; sort=False preserves the first-appearance session order that defines the column index, matching unique(). xs/ys output is identical (verified).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

hanhou wants to merge 1 commit into main from perf/vectorize-session-load