Skip to content

Commit 6d4f5ea

Browse files
author
miranov25
committed
Phase 13.58.ADF: single-tree lazy time-series loading & lazy drawing (D1-D4)
Implements use case 1 of the lazy time-series work: a single TTree read lazily, drawn through the full gallery surface, loading only the branches each draw needs. All changes are additive; the eager path is byte-identical by default. Deliverables (ratified proposal v1.6 + joint test plan v2.0) ------------------------------------------------------------ D1 Dependency-resolver consolidation. get_required_branches() now routes the inline `expr` through the AST analyzer _analyze_expression() instead of the old top-level string split, via a new bracket-aware _split_top_level_colon(). _analyze_expression() queries the runtime _registered_functions registry at parse time, so a call like corr(xM, driftM) treats `corr` as a function, not a column. Behaviour-preserving for every existing expression; the inline registered-function and bracket-vector forms (e.g. "[corr(x,y), x]:z" -> {x,y,z}) now resolve correctly. The alias path was already AST-based and is unchanged. D2 Draw-surface branch-scan gap closed. get_required_branches() gains facet_by / weights / weights_vector / selection_vector and they are threaded at all three call sites: draw(), draw_batch(), draw_figures(). A branch referenced ONLY via one of these kwargs now pre-loads in lazy mode (the silent-empty-figure class). Integer count kwargs (*_bins / *_quantiles) are excluded structurally. D3 LazyTreeReader.estimate_memory() implemented using the real per-branch dtype item size from the TTree interpretation (not a float32 constant), so it matches the eager sum(df[col].nbytes) exactly. Removes the single-tree-lazy AttributeError (ADF.estimate_memory delegated to a method that only existed on the chain reader). LazyChainReader.estimate_memory docstring corrected (it documented 3 keys but returns 5). D4 Gallery integration. build_adf() gains an additive lazy= / tree_name= parameter that swaps only the constructor (eager root_to_adf vs read_tree_lazy) and shares every post-read step; default lazy=False is byte-identical. sampled-lazy is rejected with a clear error (out of scope). validate_lazy_vs_eager() harness added for the alma2 gallery double-run. Tests (synthetic, ROOT/PyROOT-free; uproot mktree TTrees, seed 42) ------------------------------------------------------------------ tests/test_phase1358_lazy_timeseries.py (8): loader mechanism + resolver/estimator gates - only-needed-load, registered-function alias, facet/weights preload, estimate_memory tolerance 0, resolver regression + colon-grammar, lazy==eager data, and the bracket-vector + composition resolver lock. tests/test_phase1358_lazy_draw_invariance.py (21): real adf.draw() / draw_batch() / draw_figures() lazy vs eager across hist/profile/scatter x selection/group_by/ facet_by/weights/color, registered-function alias (expr- and selection-side), weights_vector and selection_vector lists, nested alias, multi-draw no-reload, non-existent-branch negative control, and the D2 loud-raise negative control (D2 disabled -> ValueError, raw branch). Every load assertion is exact (loaded == need) with >=3 decoy branches in the fixture. tests/test_phase1358_gallery_lazy.py (1, env-gated): TIME-SERIES gallery double-run (validate_lazy_vs_eager); skips without the time-series file / dfdraw. Not calibITS. tests/test_phase1358_lazy_calibITS.py (4): real-data lazy invariance on the committed calibITS.root fixture (main tree 'AlignITS5', 28,956 rows, two ADF subframes). A real profile / group_by / hist draw each loads EXACTLY its branches; lazy stats == eager. Uses the file's actual schema (not the time-series build_adf). Skips without the fixture. tests/data/calibITS.root: ~4 MB distributable fixture for the test above. Note: ADF selection is numexpr-style (& / |), not C-style (&& / ||); the logical- selection test uses & (&& raises in numexpr). Governance / taxonomy --------------------- tests/feature_taxonomy.py: + LAZY.timeseries_draw (mapping the three test files); total 50 -> 51. tests/test_phase_13_56_adf_post_audit.py: TG4 count-lock bumped 50 -> 51 in the same commit (per the plan). Test baseline (full run_tests.sh, alma2, 12 workers) ---------------------------------------------------- 1731 passed / 8 failed / 1 error / 9 skipped after the TG4 fix. The 8 failures and 1 error are all PRE-EXISTING (identical to the run before this work) and outside the D1-D4 surface: K1_3_draw_batch_forwards_batch_kwargs, K2_3_production_reproducer_mirror (the separate vector-kwarg change on this branch, not D1-D4), AliasDataFrameRDF: missing_keys_in_friend, collision_from_friend_tree, composite_index_friend, invariance_backend I2_6 (numba vs numpy chained subframe), invariance_compression I4_2, I4_3, ERROR test_schema_serialization.py. This change adds +21 passing tests and zero net failures. Not included (closure-pending) ------------------------------ - Real-data lazy gate: SATISFIED by tests/test_phase1358_lazy_calibITS.py on the committed calibITS.root fixture (4 passing). The time-series gallery double-run (test_phase1358_gallery_lazy.py) remains env-gated on the large time-series file and is optional, not the closure gate. - T11 (subframe-column lazy draw, skipif-gated): calibITS HAS subframes (R, AlignDzITS5), so a subframe-column draw test can be added against the committed fixture (follow-up). - docs/CAPABILITY_MATRIX.md regeneration (deferred to the closure commit, not this one). Refs: PHASE_13_58_ADF proposal v1.6; joint test plan v2.0; panel summaries v1.0/v2.0.
1 parent f37208c commit 6d4f5ea

10 files changed

Lines changed: 907 additions & 16 deletions

UTILS/dfextensions/AliasDataFrame/AliasDataFrame.py

Lines changed: 105 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -3803,6 +3803,13 @@ def _analyze_expression(self, expr):
38033803
known_funcs = set(ArrowComputeMapper.FUNC_MAP.keys()) if ArrowComputeMapper else set()
38043804
known_funcs.update(['np', 'numpy', 'math', 'abs', 'int', 'float', 'round',
38053805
'min', 'max', 'sum', 'len', 'range', 'True', 'False', 'None'])
3806+
# Phase 13.58.ADF (D1): registered functions are added at runtime via
3807+
# register_function(); query the live registry at each parse so a call such as
3808+
# corr(xM, driftM) treats `corr` as a function, not a column. Querying the registry
3809+
# (rather than extending a static literal) is what makes runtime-registered names
3810+
# resolve correctly.
3811+
if hasattr(self, '_registered_functions'):
3812+
known_funcs.update(self._registered_functions.keys())
38063813

38073814
for node in ast.walk(tree):
38083815
if isinstance(node, ast.Name):
@@ -6834,11 +6841,43 @@ def _resolve_to_base_branches(self, columns: set, _visited: set = None) -> set:
68346841

68356842
return base_branches
68366843

6844+
@staticmethod
6845+
def _split_top_level_colon(expr):
6846+
"""Split a draw expression on top-level ':' separators (dfdraw 'y:x' grammar),
6847+
ignoring any ':' inside (), [], or {}.
6848+
6849+
Phase 13.58.ADF (D1): each part is fed individually to _analyze_expression, which
6850+
parses it as a Python expression; a raw 'y:x' is not valid Python, so the colon
6851+
split must happen first. Bracket-depth tracking keeps a stray ':' inside a call or
6852+
slice from splitting the expression.
6853+
"""
6854+
parts = []
6855+
depth = 0
6856+
current = []
6857+
for ch in expr:
6858+
if ch in '([{':
6859+
depth += 1
6860+
current.append(ch)
6861+
elif ch in ')]}':
6862+
depth = max(0, depth - 1)
6863+
current.append(ch)
6864+
elif ch == ':' and depth == 0:
6865+
parts.append(''.join(current))
6866+
current = []
6867+
else:
6868+
current.append(ch)
6869+
parts.append(''.join(current))
6870+
return parts
6871+
68376872
def get_required_branches(self,
68386873
expr: str = None,
68396874
selection: str = None,
68406875
group_by: str = None,
68416876
color: str = None,
6877+
facet_by=None,
6878+
weights=None,
6879+
weights_vector=None,
6880+
selection_vector=None,
68426881
aliases: list = None,
68436882
validate: bool = False) -> set:
68446883
"""
@@ -6883,11 +6922,29 @@ def get_required_branches(self,
68836922
"""
68846923
all_columns = set()
68856924

6886-
# 1. Parse main expression (e.g., 'dEdx:p' → {'dEdx', 'p'})
6887-
# Reuse logic from _parse_expr_aliases but get ALL columns (GPT tweak #1)
6925+
# 1. Parse main expression into column references.
6926+
# Phase 13.58.ADF (D1): route expr parsing through the AST analyzer instead of a
6927+
# raw ':'-split, so function calls and compound math resolve to their real column
6928+
# dependencies (e.g. 'corr(xM, driftM):c' -> {xM, driftM, c}) and registered
6929+
# function names are not mistaken for columns. The dfdraw 'y:x' form is split on
6930+
# the top-level ':' first (each side is its own Python expression). Falls back to
6931+
# the literal token when a part is not parseable or yields no refs, preserving the
6932+
# prior behaviour for bare names and aliases (AC-4 regression set).
68886933
if expr:
6889-
parts = expr.replace(' ', '').split(':')
6890-
all_columns.update(parts)
6934+
for part in self._split_top_level_colon(expr):
6935+
part = part.strip()
6936+
if not part:
6937+
continue
6938+
analysis = self._analyze_expression(part)
6939+
refs = set(analysis.get('column_refs', set()))
6940+
for sf_name, sf_col in analysis.get('subframe_refs', []):
6941+
refs.add(f"{sf_name}.{sf_col}")
6942+
if refs:
6943+
all_columns.update(refs)
6944+
else:
6945+
# Not parseable as a Python expression, or a bare literal: keep the
6946+
# token so downstream alias/branch resolution still sees it.
6947+
all_columns.add(part.replace(' ', ''))
68916948

68926949
# 2. Parse selection string
68936950
if selection:
@@ -6899,7 +6956,35 @@ def get_required_branches(self,
68996956
all_columns.add(group_by)
69006957
if color and isinstance(color, str):
69016958
all_columns.add(color)
6902-
6959+
6960+
# 3b. Phase 13.58.ADF (D2): column-name-bearing draw kwargs. Any kwarg whose string
6961+
# value is interpreted as a column name must contribute to the required-branch
6962+
# set, so a branch referenced ONLY via facet_by/weights/weights_vector/
6963+
# selection_vector pre-loads in lazy mode (the silent-empty-figure class). Each
6964+
# may be a string or a per-Y list of strings. Integer count kwargs
6965+
# (facet_by_bins/_quantiles, group_by_bins/_quantiles) are deliberately NOT
6966+
# included — they are bin counts, not column names.
6967+
def _add_colname_kwarg(value, as_selection=False):
6968+
if value is None:
6969+
return
6970+
items = value if isinstance(value, (list, tuple)) else [value]
6971+
for item in items:
6972+
if not isinstance(item, str) or not item:
6973+
continue
6974+
if as_selection:
6975+
all_columns.update(self._parse_selection_columns(item))
6976+
continue
6977+
analysis = self._analyze_expression(item)
6978+
refs = set(analysis.get('column_refs', set()))
6979+
for sf_name, sf_col in analysis.get('subframe_refs', []):
6980+
refs.add(f"{sf_name}.{sf_col}")
6981+
all_columns.update(refs if refs else {item})
6982+
6983+
_add_colname_kwarg(facet_by)
6984+
_add_colname_kwarg(weights)
6985+
_add_colname_kwarg(weights_vector)
6986+
_add_colname_kwarg(selection_vector, as_selection=True)
6987+
69036988
# 4. Add explicit aliases
69046989
if aliases:
69056990
all_columns.update(aliases)
@@ -11339,7 +11424,11 @@ def draw(self,
1133911424
expr=expr,
1134011425
selection=kwargs.get('selection'),
1134111426
group_by=kwargs.get('group_by'),
11342-
color=kwargs.get('color')
11427+
color=kwargs.get('color'),
11428+
facet_by=kwargs.get('facet_by'),
11429+
weights=kwargs.get('weights'),
11430+
weights_vector=kwargs.get('weights_vector'),
11431+
selection_vector=kwargs.get('selection_vector')
1134311432
)
1134411433
# Load any branches not already loaded
1134511434
branches_to_load = required_branches - self._lazy_reader.loaded_branches
@@ -12419,7 +12508,11 @@ def draw_batch(self,
1241912508
expr=merged_spec.get('expr', name),
1242012509
selection=merged_spec.get('selection'),
1242112510
group_by=merged_spec.get('group_by'),
12422-
color=merged_spec.get('color')
12511+
color=merged_spec.get('color'),
12512+
facet_by=merged_spec.get('facet_by'),
12513+
weights=merged_spec.get('weights'),
12514+
weights_vector=merged_spec.get('weights_vector'),
12515+
selection_vector=merged_spec.get('selection_vector')
1242312516
)
1242412517
all_required.update(required)
1242512518

@@ -12754,7 +12847,11 @@ def draw_figures(
1275412847
expr=merged_plot.get('expr', ''),
1275512848
selection=merged_plot.get('selection'),
1275612849
group_by=merged_plot.get('group_by'),
12757-
color=merged_plot.get('color')
12850+
color=merged_plot.get('color'),
12851+
facet_by=merged_plot.get('facet_by'),
12852+
weights=merged_plot.get('weights'),
12853+
weights_vector=merged_plot.get('weights_vector'),
12854+
selection_vector=merged_plot.get('selection_vector')
1275812855
)
1275912856
all_required.update(required)
1276012857

UTILS/dfextensions/AliasDataFrame/LazyChainReader.py

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -316,7 +316,8 @@ def estimate_memory(self, branches: List[str] = None) -> dict:
316316
Returns
317317
-------
318318
dict
319-
'bytes': int, 'human': str, 'warning': str or None
319+
'bytes': int, 'human': str, 'branches': int, 'entries': int,
320+
'warning': str or None
320321
"""
321322
if branches is None:
322323
branches = list(self._available_branches)

UTILS/dfextensions/AliasDataFrame/LazyTreeReader.py

Lines changed: 68 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@
88

99
import uproot
1010
import pandas as pd
11+
import numpy as np
1112
from typing import List, Set, Optional
1213

1314

@@ -201,7 +202,73 @@ def get_branch_dtype(self, name: str) -> str:
201202
raise ValueError(f"Branch '{name}' not in TTree")
202203
self._open_file()
203204
return str(self._tree[name].interpretation)
204-
205+
206+
def estimate_memory(self, branches: List[str] = None) -> dict:
207+
"""
208+
Estimate memory for loading branches (single-tree lazy).
209+
210+
Phase 13.58.ADF (D3): mirrors LazyChainReader.estimate_memory but uses the real
211+
per-branch dtype item size from the TTree interpretation (not a float32 constant),
212+
so the estimate matches the eager `sum(df[col].nbytes)` exactly (AC-5, tolerance 0)
213+
for the same branches on deterministic data. Removes the single-tree-lazy
214+
AttributeError (ADF.estimate_memory previously delegated to a method that only
215+
existed on the chain reader).
216+
217+
Parameters
218+
----------
219+
branches : List[str], optional
220+
Branches to estimate. None = all available.
221+
222+
Returns
223+
-------
224+
dict
225+
'bytes': int, 'human': str, 'branches': int, 'entries': int,
226+
'warning': str or None
227+
"""
228+
self._open_file()
229+
if branches is None:
230+
branches = list(self.available_branches)
231+
total = 0
232+
for b in branches:
233+
if b not in self.available_branches:
234+
continue
235+
total += self._branch_itemsize(b) * self.num_entries
236+
237+
warning = None
238+
if total > 32 * 1024**3:
239+
warning = "Estimated memory exceeds 32 GB - consider chunked loading"
240+
elif total > 16 * 1024**3:
241+
warning = "Estimated memory exceeds 16 GB"
242+
243+
return {
244+
'bytes': total,
245+
'human': self._format_bytes(total),
246+
'branches': len(branches),
247+
'entries': self.num_entries,
248+
'warning': warning,
249+
}
250+
251+
def _branch_itemsize(self, branch: str) -> int:
252+
"""Bytes-per-entry for a branch, from the real TTree interpretation dtype.
253+
254+
Falls back to 4 bytes only if the dtype cannot be determined (e.g. an unusual
255+
interpretation); flat numeric branches resolve exactly.
256+
"""
257+
try:
258+
dt = np.dtype(self._tree[branch].interpretation.numpy_dtype)
259+
return dt.base.itemsize if dt.subdtype is not None else dt.itemsize
260+
except Exception:
261+
return 4
262+
263+
@staticmethod
264+
def _format_bytes(n: int) -> str:
265+
"""Format bytes as a human-readable string."""
266+
for unit in ['B', 'KB', 'MB', 'GB', 'TB']:
267+
if abs(n) < 1024:
268+
return f"{n:.1f} {unit}"
269+
n /= 1024
270+
return f"{n:.1f} PB"
271+
205272
def close(self):
206273
"""Close file handle."""
207274
if self._file is not None:

UTILS/dfextensions/AliasDataFrame/examples/time_series/time_series_draw.py

Lines changed: 86 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -43,20 +43,102 @@
4343
ITS_SEL = f"{BASE_SEL}&(hasITSTPC)"
4444

4545

46-
def build_adf(root_path, sample=None):
47-
"""Load ADF. build_adf(path, 0.2) gives ~1 min dev run (20% sample)."""
48-
adf = root_to_adf(root_path)
46+
def build_adf(root_path, sample=None, lazy=False, tree_name="tree"):
47+
"""Load ADF. build_adf(path, 0.2) gives ~1 min dev run (20% sample).
48+
49+
Phase 13.58.ADF (D4): the additive ``lazy=`` parameter swaps ONLY the constructor --
50+
eager ``root_to_adf()`` (default, behaviour unchanged) or lazy ``read_tree_lazy()`` --
51+
and shares every post-read step below, so the eager and lazy galleries differ only in
52+
data loading (no plotting logic is forked or replicated). ``lazy=True`` requires
53+
``sample=None``: a lazy read makes an N-row frame, ``sample(frac)`` shrinks it, then a
54+
branch load returns full-N rows and crashes in ``_merge_loaded_data`` -- sampled-lazy is
55+
out of scope (Phase 13.58 §2). ``tree_name`` is used only by the lazy path (the eager
56+
default resolution is untouched); pass the gallery file's tree name if it is not
57+
``"tree"``.
58+
"""
59+
if lazy:
60+
if sample is not None:
61+
raise ValueError(
62+
"build_adf(lazy=True) does not support sampling (sample must be None): "
63+
"sampled-lazy crashes in _merge_loaded_data and is out of scope for "
64+
"Phase 13.58.ADF. Run the lazy gallery unsampled."
65+
)
66+
adf = AliasDataFrame.read_tree_lazy(root_path, tree_name)
67+
else:
68+
adf = root_to_adf(root_path)
4969
adf.draw_lazy = True
5070
apply_meta(adf, df_TimeSeriesAliases)
5171
apply_meta(adf, df_TimeSeriesMeta)
5272
addTimeQuantiles(adf, varname="timeMS", step=10000, step2=100)
5373
if sample is not None:
5474
adf.df = adf.df.sample(frac=sample, random_state=42).reset_index(drop=True)
5575
adf.materialize_aliases(names=["sector", "time_s"])
56-
print(f"ADF ready: {len(adf.df):,} tracks" + (f" ({int(sample*100)}% sample)" if sample else ""))
76+
print(f"ADF ready: {len(adf.df):,} tracks"
77+
+ (f" ({int(sample*100)}% sample)" if sample else "")
78+
+ (" [lazy]" if lazy else ""))
5779
return adf
5880

5981

82+
def validate_lazy_vs_eager(root_path, tree_name="tree"):
83+
"""Phase 13.58.ADF D4 / AC-1 / AC-2 -- gallery lazy-vs-eager double-run (SERVER gate).
84+
85+
Requires ROOT (eager root_to_adf) + dfdraw; run on the server, unsampled. Asserts:
86+
* AC-1 (clean, genuinely lazy): the lazy ADF uses read_tree_lazy, branches load on
87+
demand (a figure-only branch like 'ncl' is NOT force-materialized by setup, and
88+
IS loaded after the figure that needs it), and every gallery figure renders with
89+
no error (AD-TS-DRAW-001).
90+
* AC-2 (PP-5 identity): a representative set of draws produce identical stats lazy vs
91+
eager at the data/stats level (NOT pixel). The comparison is defensive -- it equates
92+
whatever numeric stats both runs return -- so it cannot false-green on a key name.
93+
Sandbox cannot run this (no ROOT/dfdraw); it is the alma2 secondary integration gate.
94+
The primary gate is the dedicated synthetic test (tests/test_phase1358_lazy_timeseries.py).
95+
"""
96+
import numpy as np
97+
import matplotlib.pyplot as plt
98+
99+
eager = build_adf(root_path, lazy=False, tree_name=tree_name)
100+
lazy = build_adf(root_path, lazy=True, tree_name=tree_name)
101+
assert lazy._lazy_reader is not None, "lazy build must use read_tree_lazy (not eager-in-disguise)"
102+
103+
# provably lazy: a figure-only branch must not be force-materialized by setup
104+
forced = set(lazy._lazy_reader.loaded_branches)
105+
assert "ncl" not in forced, "ncl must not be force-loaded by build_adf setup"
106+
107+
# AC-2 stats identity on representative draws (return_data=True; defensive key match)
108+
checks = [
109+
dict(expr="ncl", type="hist", bins=50, selection="ncl>30"),
110+
dict(expr="dcar_tpc_vertex:tgl", type="profile", bins=50, selection=BASE_SEL),
111+
]
112+
for kw in checks:
113+
re_ = eager.draw(return_data=True, **kw)
114+
rl_ = lazy.draw(return_data=True, **kw)
115+
se = re_[2] if isinstance(re_, tuple) and len(re_) > 2 and isinstance(re_[2], dict) else {}
116+
sl = rl_[2] if isinstance(rl_, tuple) and len(rl_) > 2 and isinstance(rl_[2], dict) else {}
117+
compared = 0
118+
for key in (set(se) & set(sl)):
119+
try:
120+
a = np.asarray(se[key], dtype=float)
121+
b = np.asarray(sl[key], dtype=float)
122+
except (TypeError, ValueError):
123+
continue
124+
if a.shape == b.shape and a.size:
125+
assert np.allclose(a, b, equal_nan=True), f"lazy != eager stats for {kw}, key '{key}'"
126+
compared += 1
127+
assert compared > 0, f"no comparable numeric stats produced for {kw}"
128+
129+
assert "ncl" in lazy._lazy_reader.loaded_branches, "ncl should load lazily after its draw"
130+
131+
# AC-1 clean run: every gallery figure renders lazily with no error
132+
fig_funcs = [v for k, v in sorted(globals().items())
133+
if k.startswith("fig") and callable(v)]
134+
for fn in fig_funcs:
135+
fig = fn(lazy)
136+
if fig is not None:
137+
plt.close(fig)
138+
print(f"validate_lazy_vs_eager: OK -- {len(fig_funcs)} figures rendered lazily; "
139+
f"lazy==eager stats on {len(checks)} representative draws")
140+
141+
60142
def _add(pdf, fig, title):
61143
"""Save figure to PDF with suptitle + PDF bookmark."""
62144
fig.suptitle(title, fontsize=8, color="gray", y=1.0, ha="left", x=0.01)

0 commit comments

Comments
 (0)