Add Python example for thread-safe Define using rdfslot_ #21798
gayatripadalia wants to merge 2 commits into
Conversation
Adds a tutorial demonstrating a thread-safe pattern in Python using rdfslot_, as a workaround for DefineSlot which is not available in Python. Closes #20839
vepadulano
left a comment
Dear @gayatripadalia ,
Thanks for attempting to increase the coverage of our suite of tutorials! The need to have more thread-safe examples is understood. Nonetheless, this PR does not contribute enough in that direction. Your example does not use the rdfslot_ implicit column in any meaningful manner, and it does not include examples of DefineSlot with comments motivating why that operation would be needed.
Improve rdfslot_ tutorial with thread-safe use cases.
Dear @vepadulano,
Dear @vepadulano and @couet
vepadulano
left a comment
Dear @gayatripadalia ,
I appreciate your effort, but I am not sure I see the direction you want to take. I have two major concerns about the tutorial you are proposing:
- Have you actually tried running this code? From a cursory look, most of it is wrong and will probably not work as you expect. Please send back a full reproducer showing that the code works, including the exact steps to run it and the platform you ran it on (with screenshots showing the correct output).
- How does this tutorial improve on, or differ from, the one proposed in #20898? That tutorial already shows the usage of thread-safe RNGs with DefineSlot.
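For the platform part of such a reproducer, a small script along these lines could be attached next to the run recipe. This is only a sketch: `environment_report` is a hypothetical helper name, and the ROOT version lookup assumes `ROOT.gROOT.GetVersion()` is reachable from the Python session. It falls back to a plain message when ROOT cannot be imported.

```python
import platform
import sys


def environment_report() -> str:
    """Collect the platform details a reviewer typically asks for (hypothetical helper)."""
    lines = [
        f"Platform: {platform.platform()}",
        f"Python: {sys.version.split()[0]}",
    ]
    try:
        import ROOT  # only needed for the version string

        lines.append(f"ROOT: {ROOT.gROOT.GetVersion()}")
    except ImportError:
        lines.append("ROOT: not importable in this environment")
    return "\n".join(lines)


report = environment_report()
print(report)
```

Pasting this output as text, together with the exact command used to run the tutorial, lets the reviewer retrace the steps on a comparable setup.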
| ## random-number generator or one histogram per thread — eliminating data races | ||
| ## without requiring a mutex. | ||
| ## | ||
| ## These APIs are currently not available in PyROOT. |
| ## In short: | ||
| ## Shared resource (unsafe): one RNG / histogram shared across threads | ||
| ## Per-slot resource (safe): one RNG / histogram per slot via rdfslot_ |
| # Helper: print a section banner so the terminal output is easy to follow | ||
| def banner(title: str) -> None: | ||
| width = 72 | ||
| print("\n" + "=" * width) | ||
| print(f" {title}") | ||
| print("=" * width) |
This helper is redundant; in tutorials we favour fewer lines of code to ease the reader's experience.
| # Background: DefineSlot in C++ vs the Python workaround |
| # In C++, you would write the following to safely smear values per-thread: | ||
| # | ||
| # // One RNG per slot — constructed before the event loop | ||
| # unsigned int nSlots = df.GetNSlots(); | ||
| # std::vector<TRandom3> rngs(nSlots); | ||
| # for (unsigned int i = 0; i < nSlots; ++i) rngs[i].SetSeed(i + 1); | ||
| # | ||
| # auto df_smeared = df.DefineSlot( | ||
| # "smeared_pt", | ||
| # [&rngs](unsigned int slot, double pt) { | ||
| # // `slot` is guaranteed unique per thread — no data race | ||
| # return pt + rngs[slot].Gaus(0.0, 0.01 * pt); | ||
| # }, | ||
| # {"true_pt"} | ||
| # ); |
Too many details; if anything, there should be a full C++ tutorial to link to as an example. Also, beware of false sharing.
| rng_init = np.random.default_rng(seed=42) | ||
| true_pt = rng_init.normal(loc=50.0, scale=10.0, size=N_EVENTS).astype(np.float64) | ||
| true_eta = rng_init.normal(loc=0.0, scale=2.0, size=N_EVENTS).astype(np.float64) | ||
| # Wrap numpy arrays as ROOT RVecs so RDataFrame can consume them directly |
Redundant details
Suggested change
| - # Wrap numpy arrays as ROOT RVecs so RDataFrame can consume them directly |
| + # Read the numpy arrays directly with RDataFrame |
| banner(f"Implicit MT enabled | slots used = {n_slots}") | ||
| print(f"\n Dataset: {N_EVENTS:,} events | columns: true_pt, true_eta") |
Note the double use of banner followed by print; it doesn't really help the tutorial.
| # Wrong approach (DO NOT DO THIS): | ||
| # shared_rng = np.random.default_rng(seed=0) # one RNG for all threads | ||
| # df.Define("smeared_pt", | ||
| # lambda pt: pt + shared_rng.normal(0, 0.01*pt), # RACE CONDITION | ||
| # ["true_pt"]) |
Do not show wrong approaches in tutorials
| banner("Summary") | ||
| print(""" | ||
| The implicit column rdfslot_ is the Python equivalent of the `slot` | ||
| argument provided by C++ DefineSlot / RedefineSlot. By listing | ||
| "rdfslot_" first in the column list and receiving it as the first | ||
| parameter of any callable (Define or Foreach), you can: | ||
| • Index into per-slot RNG instances → lock-free random smearing | ||
| • Fill per-slot histograms → lock-free histogram filling | ||
| • Accumulate per-slot partial sums → lock-free custom aggregations | ||
| In every case the pattern is: | ||
| per_slot_resource = [Resource(seed=s) for s in range(n_slots)] | ||
| def my_func(slot, *columns): | ||
| # per_slot_resource[slot] is owned by exactly one thread | ||
| return per_slot_resource[slot].compute(*columns) | ||
| df.Define("result", my_func, ["rdfslot_", "col_a", "col_b", ...]) | ||
| This pattern serves as a practical replacement for DefineSlot until that API | ||
| becomes available in PyROOT. | ||
| """) |
| display_a = df_smeared.Define( | ||
| "slot_id", "rdfslot_" # expose slot index as a named column | ||
| ).Display(["slot_id", "true_pt", "smeared_pt"], nRows=8) | ||
| display_a.Print() |
This is the wrong way of using RDataFrame in general: you're triggering the computation graph twice when requesting the Display.
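A minimal sketch of the usual alternative: book every lazy action first, then access the results, so that a single event loop fills all of them. The dataset, the column names `x` and `slot_id`, the histogram binning and the helper name `book_then_trigger` are made up for illustration. The sketch assumes a ROOT build with PyROOT whose RDataFrame provides `GetNRuns()`, and it uses string expressions rather than Python callables.

```python
try:
    import ROOT
except ImportError:  # lets the sketch be loaded on machines without ROOT
    ROOT = None


def book_then_trigger(n_events: int = 100):
    """Book Display and Histo1D on the same graph, then run the loop once."""
    df = (
        ROOT.RDataFrame(n_events)
        .Define("x", "rdfentry_ * 0.5")   # toy column built from the entry number
        .Define("slot_id", "rdfslot_")    # expose the implicit slot column by name
    )

    # Both calls below are lazy: nothing is computed yet.
    display = df.Display(["slot_id", "x"], 8)
    hist = df.Histo1D(("h_x", "x;x;entries", 10, 0.0, 50.0), "x")

    # The first access to any booked result runs the event loop,
    # which fills every result booked so far.
    display.Print()
    hist.GetEntries()  # already computed by the loop above, no second run expected

    return df.GetNRuns()


n_runs = book_then_trigger() if ROOT is not None else None
print("event loops run:", n_runs)
```

Booking the Display only after other results have already been accessed would require another pass over the data, which is the double trigger pointed out above.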
Dear @vepadulano,
Summary
This PR adds a Python tutorial demonstrating thread-safe patterns in ROOT RDataFrame using rdfslot_.
In C++, ROOT provides DefineSlot and RedefineSlot for thread-safe operations by exposing the slot index to user-defined callables. These APIs are currently not available in PyROOT.
This tutorial shows how to reproduce the same behavior in Python by explicitly forwarding rdfslot_, enabling safe and lock-free access to per-slot resources in multi-threaded workflows.
Motivation
Issue #20839 highlights the lack of Python examples for thread-safe operations similar to DefineSlot.
This contribution provides a practical Python example of slot-based computation and demonstrates safe usage of mutable state in multi-threaded RDataFrame workflows, reflecting the design and intent of DefineSlot using existing PyROOT features.
Related Issue
Closes #20839
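The per-slot idea described in the summary can be illustrated without ROOT at all. The sketch below is a plain-Python stand-in, not the tutorial code and not a PyROOT API: worker threads play the role of RDataFrame slots, and `smear_all`, `smear_chunk`, `N_SLOTS` and the toy `true_pt` values are hypothetical names chosen for the example. Each slot index is handed to exactly one task, so each `random.Random` instance is only ever touched by one thread and no lock is needed.

```python
import random
from concurrent.futures import ThreadPoolExecutor

N_SLOTS = 4  # stand-in for the number of processing slots


def smear_all(values, n_slots=N_SLOTS):
    """Smear values by 1% using one independently seeded RNG per slot."""
    # One RNG per slot, created before any work starts.
    rngs = [random.Random(slot + 1) for slot in range(n_slots)]

    def smear_chunk(slot, chunk):
        rng = rngs[slot]  # only the task owning this slot uses this RNG
        return [v + rng.gauss(0.0, 0.01 * v) for v in chunk]

    # Split the input so that every slot gets its own share of the values.
    chunks = [values[slot::n_slots] for slot in range(n_slots)]
    with ThreadPoolExecutor(max_workers=n_slots) as pool:
        return list(pool.map(smear_chunk, range(n_slots), chunks))


true_pt = [50.0 + 0.01 * i for i in range(1000)]
smeared = smear_all(true_pt)
print(len(smeared), "slots,", sum(len(chunk) for chunk in smeared), "values")
```

Because no RNG is shared, repeated runs give identical results regardless of thread scheduling. Whether the same holds for Python callables passed to RDataFrame together with rdfslot_ is exactly what the requested reproducer would need to demonstrate.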