
[RDF] RDF.RSnapshotOptions with a distributed RDF #21756

@toicca


Check duplicate issues.

  • Checked for duplicates

Description

With a distributed RDF, I would like to use ROOT.RDF.RSnapshotOptions() for writing a TTree into an existing file (and for other options, e.g. lazy evaluation), but I get the error shown at the end of this issue. When using the "UPDATE" option with Snapshot and a distributed executor, there is naturally the question of multiple workers writing to the same file and how to manage that.

The issue is not critical, since the workaround is easy: either not using the distributed RDF, or not using the RSnapshotOptions and instead e.g. hadd-ing the files after distributed processing.

The traceback of the error is also confusing, since it says TypeError: takes at most 4 arguments (5 given), although only 3 arguments are clearly being passed.

This is somewhat related to issue #17136.

Error:

Traceback (most recent call last):
  File "/eos/home-n/ntoikka/rdf_dask_reproducer/reproducer.py", line 25, in <module>
    try:
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_109_swan/x86_64-el9-gcc13-opt/lib/DistRDF/Proxy.py", line 273, in _create_new_op
    return get_proxy_for(op, newnode)
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.13.11-41c2c/x86_64-el9-gcc13-opt/lib/python3.13/functools.py", line 934, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_109_swan/x86_64-el9-gcc13-opt/lib/DistRDF/Proxy.py", line 301, in _
    return get_proxy_for.dispatch(Operation.InstantAction)(operation, node)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_109_swan/x86_64-el9-gcc13-opt/lib/DistRDF/Proxy.py", line 290, in _
    execute_graph(node)
    ~~~~~~~~~~~~~^^^^^^
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_109_swan/x86_64-el9-gcc13-opt/lib/DistRDF/Proxy.py", line 58, in execute_graph
    node.get_head().execute_graph()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_109_swan/x86_64-el9-gcc13-opt/lib/DistRDF/HeadNode.py", line 255, in execute_graph
    returned_values = self._execute_and_retrieve_results(mapper, local_nodes)
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_109_swan/x86_64-el9-gcc13-opt/lib/DistRDF/HeadNode.py", line 205, in _execute_and_retrieve_results
    return self.backend.ProcessAndMerge(self._build_ranges(), mapper, distrdf_reducer)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_109_swan/x86_64-el9-gcc13-opt/lib/DistRDF/Backends/Dask/Backend.py", line 215, in ProcessAndMerge
    return final_results.compute()
           ~~~~~~~~~~~~~~~~~~~~~^^
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_109_swan/x86_64-el9-gcc13-opt/lib/python3.13/site-packages/dask/base.py", line 374, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_109_swan/x86_64-el9-gcc13-opt/lib/python3.13/site-packages/dask/base.py", line 662, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_109_swan/x86_64-el9-gcc13-opt/lib/DistRDF/Backends/Dask/Backend.py", line 161, in dask_mapper
    return mapper(current_range)
    ^^^^^^^^^^^^^^^
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_109_swan/x86_64-el9-gcc13-opt/lib/DistRDF/Backends/Base.py", line 105, in distrdf_mapper
    mergeables = get_mergeable_values(rdf_plus.rdf, current_range.id, computation_graph_callable,
    ^^^^^^^^^^^
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_109_swan/x86_64-el9-gcc13-opt/lib/DistRDF/Backends/Base.py", line 62, in get_mergeable_values
    actions = computation_graph_callable(starting_node, range_id, exec_id)
      ^^^^^^^^^^^^^^^^^
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_109_swan/x86_64-el9-gcc13-opt/lib/DistRDF/ComputationGraphGenerator.py", line 220, in trigger_computation_graph
    _ACTIONS_REGISTER[exec_id] = generate_computation_graph(graph, starting_node, range_id)
    ^^^^^^^^^^^^^^^
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_109_swan/x86_64-el9-gcc13-opt/lib/DistRDF/ComputationGraphGenerator.py", line 187, in generate_computation_graph
    rdf_node, in_task_op = _call_rdf_operation(node.operation, graph[node.parent_id].rdf_node, range_id)
    ^^^^^^^^^^^^^^^
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.13.11-41c2c/x86_64-el9-gcc13-opt/lib/python3.13/functools.py", line 934, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
    ^^^^^^^^^^^^^^^
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_109_swan/x86_64-el9-gcc13-opt/lib/DistRDF/ComputationGraphGenerator.py", line 135, in _call_rdf_operation
    rdf_node = rdf_operation(*in_task_op.args, **in_task_op.kwargs)
      ^^^^^^^^^^^^^^^^^
TypeError: Template method resolution failed:
  none of the 3 overloaded methods succeeded. Full details:
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Snapshot(string_view treename, string_view filename, initializer_list<string> columnList, const ROOT::RDF::RSnapshotOptions& options = RSnapshotOptions()) =>
    TypeError: takes at most 4 arguments (5 given)
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Snapshot(string_view treename, string_view filename, const ROOT::RDF::ColumnNames_t& columnList, const ROOT::RDF::RSnapshotOptions& options = RSnapshotOptions()) =>
    TypeError: takes at most 4 arguments (5 given)
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Snapshot(string_view treename, string_view filename, string_view columnNameRegexp = "", const ROOT::RDF::RSnapshotOptions& options = RSnapshotOptions()) =>
    TypeError: takes at most 4 arguments (5 given)
  none of the 3 overloaded methods succeeded. Full details:
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Snapshot(string_view treename, string_view filename, initializer_list<string> columnList, const ROOT::RDF::RSnapshotOptions& options = RSnapshotOptions()) =>
    TypeError: takes at most 4 arguments (5 given)
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Snapshot(string_view treename, string_view filename, const ROOT::RDF::ColumnNames_t& columnList, const ROOT::RDF::RSnapshotOptions& options = RSnapshotOptions()) =>
    TypeError: takes at most 4 arguments (5 given)
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Snapshot(string_view treename, string_view filename, string_view columnNameRegexp = "", const ROOT::RDF::RSnapshotOptions& options = RSnapshotOptions()) =>
    TypeError: takes at most 4 arguments (5 given)
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Snapshot(string_view treename, string_view filename, string_view columnNameRegexp = "", const ROOT::RDF::RSnapshotOptions& options = RSnapshotOptions()) =>
    TypeError: takes at most 4 arguments (5 given)
  Failed to instantiate "Snapshot(std::string,std::string,std::string,::ROOT::RDF::RSnapshotOptions*)"
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Snapshot(string_view treename, string_view filename, string_view columnNameRegexp = "", const ROOT::RDF::RSnapshotOptions& options = RSnapshotOptions()) =>
    TypeError: takes at most 4 arguments (5 given)

Reproducer

from dask.distributed import LocalCluster, Client
import ROOT
 
 
def create_connection():
    cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=True, memory_limit="2GiB")
    client = Client(cluster)
    return client
 
if __name__ == "__main__":
    connection = create_connection()
    df0 = ROOT.RDataFrame(1000, executor=connection)
    df1 = ROOT.RDataFrame(1000, executor=connection)
 
    ROOT.gRandom.SetSeed(1)
    df0 = df0.Define("gaus", "gRandom->Gaus(10, 1)").Define("exponential", "gRandom->Exp(10)")
    df1 = df1.Define("gaus", "gRandom->Gaus(10, 1)").Define("exponential", "gRandom->Exp(10)")
 
    # Create a snapshot without options
    df0.Snapshot("df0", "df0.root")

    # Create a snapshot to update the previous file(s)
    options = ROOT.RDF.RSnapshotOptions()
    options.fMode = "UPDATE"
    try:
        df1.Snapshot("df1", "df0.root", options=options)  # this is a bit misleading, since the file df0.root doesn't exist because of the _0, _1 split that distribution causes
    except Exception as e:
        print(f"Error: {e}")
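For context on the comment in the reproducer: with a distributed executor, each worker writes its own range to a separate file, so a requested output name like "df0.root" ends up split into per-partition files. A minimal, hypothetical sketch of that filename splitting (the helper name is my own, not DistRDF's API):

```python
import os

def partition_filenames(filename: str, npartitions: int) -> list[str]:
    """Illustrative only: build per-partition output names of the kind
    a distributed Snapshot produces (one file per worker range)."""
    base, ext = os.path.splitext(filename)
    return [f"{base}_{i}{ext}" for i in range(npartitions)]

# With 2 workers, "df0.root" is effectively written as two files:
print(partition_filenames("df0.root", 2))  # ['df0_0.root', 'df0_1.root']
```

This is why opening "df0.root" in UPDATE mode is misleading here: that exact file never exists after the distributed Snapshot, only the suffixed partition files do.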

ROOT version

ROOT Version: 6.38.00
Built for linuxx8664gcc on Feb 03 2026, 21:53:45
From tags/6-38-00@6-38-00

Installation method

LCG 109

Operating system

EL9

Additional context

No response
