perf(create_disrnn_dataset): group by session once instead of per-session df.query#29

Open
hanhou wants to merge 1 commit into main from perf/vectorize-session-load

hanhou (Collaborator) commented Jun 21, 2026

Summary

create_disrnn_dataset looped over df_trials["ses_idx"].unique() twice (once for xs, once for ys), calling df_trials.query("ses_idx == @ses_idx") for each session. df.query re-parses the expression string and re-scans the whole frame on every call, so the cost grows with the number of sessions times the frame size. On cohorts with many sessions this dominated dataset construction: ~107 s, about 24% of a profiled 441 s multisubject load().

Change

Replace both loops with a single groupby("ses_idx", sort=False) pass. sort=False preserves the first-appearance session order that defines the column index (matching df_trials["ses_idx"].unique()), so xs/ys are identical.
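
For readers unfamiliar with the pattern, here is a minimal sketch of a single-pass build, not the actual diff. The helper name `stack_sessions`, the feature columns `choice` and `reward`, the padding value, and the (trials, sessions, features) layout are all assumptions made for illustration; only `ses_idx` and `groupby("ses_idx", sort=False)` come from this PR.

```python
import numpy as np
import pandas as pd


def stack_sessions(df_trials, feature_cols, pad_value=-1.0):
    """Hypothetical helper: stack sessions into (max_trials, n_sessions, n_features)."""
    # One pass over the frame. sort=False yields groups in first-appearance
    # order, the same order df_trials["ses_idx"].unique() would give.
    sessions = [
        group[feature_cols].to_numpy(dtype=float)
        for _, group in df_trials.groupby("ses_idx", sort=False)
    ]
    max_trials = max(len(s) for s in sessions)
    out = np.full((max_trials, len(sessions), len(feature_cols)), pad_value)
    for col, s in enumerate(sessions):
        # Session `col` fills its own column; shorter sessions stay padded.
        out[: len(s), col, :] = s
    return out


# Toy frame (made-up data): sessions appear in the order 3, 1, 2.
df_trials = pd.DataFrame(
    {
        "ses_idx": [3, 3, 1, 1, 1, 2],
        "choice": [0, 1, 1, 0, 1, 0],
        "reward": [1, 0, 1, 1, 0, 0],
    }
)
xs = stack_sessions(df_trials, ["choice", "reward"])
```

In this toy frame, column 0 of `xs` holds session 3 because it appears first, which is the ordering property the change relies on.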

Measured performance (real cohort: 878 subjects, 23,569 sessions, 11.3M trials)

Per-subject xs/ys build, summed over all subjects (onprem H200 node, CPU):

| Variant | Time |
|---|---|
| OLD (per-session df.query) | 63.2 s |
| NEW (groupby, sort=False) | 8.5 s |

7.4× faster, ~55 s saved

Verification

  • Real-cohort benchmark above: xs/ys bit-identical to the old query-loop output for all 878 subjects (identical=True).
  • Standalone equivalence test (interleaved, non-sorted sessions): identical. The same test fails under the default sort=True, which is why sort=False is required.
  • End-to-end: create_disrnn_dataset returns a valid DatasetRNN with correct shapes.
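
The standalone equivalence check described above could look roughly like the following. This is a sketch under assumptions, not the test that was actually run: the helper names, the `choice` column, and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def sessions_by_query(df, cols):
    """Old pattern: one df.query call per session, in unique() order."""
    out = []
    for ses_idx in df["ses_idx"].unique():
        out.append(df.query("ses_idx == @ses_idx")[cols].to_numpy())
    return out


def sessions_by_groupby(df, cols, sort):
    """New pattern: a single groupby pass over the frame."""
    return [g[cols].to_numpy() for _, g in df.groupby("ses_idx", sort=sort)]


def same(a, b):
    """True if both lists hold equal arrays in the same order."""
    return len(a) == len(b) and all(np.array_equal(x, y) for x, y in zip(a, b))


# Interleaved, non-sorted sessions (made-up data): first appearance is 3, 1, 2.
df = pd.DataFrame({"ses_idx": [3, 1, 3, 2, 1], "choice": [0, 1, 1, 0, 1]})

old = sessions_by_query(df, ["choice"])
matches_unsorted = same(old, sessions_by_groupby(df, ["choice"], sort=False))
matches_sorted = same(old, sessions_by_groupby(df, ["choice"], sort=True))
```

With session IDs that are not already in ascending order, `matches_unsorted` is True and `matches_sorted` is False, because sort=True reorders the groups by key and so changes which session lands in which column.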

🤖 Generated with Claude Code

…sion df.query

create_disrnn_dataset looped df_trials["ses_idx"].unique() twice (xs then ys),
calling df.query("ses_idx == @ses_idx") per session. df.query re-parses the
expression string and re-scans the frame on every call, so for cohorts with many
sessions this dominated dataset construction (~107s of a profiled 441s
multisubject load). Replace both loops with a single groupby("ses_idx",
sort=False) pass; sort=False preserves the first-appearance session order that
defines the column index, matching unique(). xs/ys output is identical (verified).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>