
Commit 2edffe2

Author: miranov25
Add RDataFrameDSL (Phases 1-5): IR, Reflection, and C++ backend
Python DSL for ROOT RDataFrame enabling transparent C++ code generation from Python expressions with compile-time validation.

Included components:
- IR Core: type system, node hierarchy, error model with recovery modes
- Type Inference: ROOT TTree reflection, schema support, vector/RVec handling
- IR Builder: Python AST → IR with type promotion and rank broadcasting
- Class Reflection: TClass-first lookup with schema fallback, fuzzy matching
- C++ Backend: scalar code generation, gInterpreter compilation, RDataFrame integration

Key design decisions:
- Two-phase validation: compile helpers before RDataFrame execution
- Standalone helper functions (not lambdas) for debuggability
- TClass primary, schema fallback for reflection
- Alphabetical input ordering for deterministic signatures

Tests: 313 passing (78 IR + 41 type + 76 builder + 41 reflection + 77 backend)
Includes end-to-end RDataFrame Define/Filter tests.

Reviewed by: GPT, Gemini, Claude
Specification: RDataFrameDSL_Specification_Final.md

Prepares infrastructure for Phase 6 (RVec/vector operations).
1 parent 7f9f645 commit 2edffe2
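
The commit message lists "standalone helper functions (not lambdas)" and "alphabetical input ordering for deterministic signatures" as design decisions. The sketch below illustrates that idea in isolation. It is not the backend's actual code: `generate_helper`, `define_call`, and the `dsl_pt` name are hypothetical, and the step that would hand the generated string to the interpreter for compilation is deliberately left out.

```python
from typing import Dict


def generate_helper(name: str, expr_cpp: str, inputs: Dict[str, str],
                    return_type: str = "double") -> str:
    """Emit a standalone C++ helper whose parameters are sorted by column name.

    `inputs` maps column name -> C++ type. Sorting the items makes the
    signature independent of the order in which columns were discovered.
    """
    params = ", ".join(f"{cpp_type} {col}" for col, cpp_type in sorted(inputs.items()))
    return f"{return_type} {name}({params}) {{ return {expr_cpp}; }}"


def define_call(name: str, inputs: Dict[str, str]) -> str:
    """Build the call expression with arguments in the same sorted order."""
    return f"{name}({', '.join(sorted(inputs))})"


# The same expression with inputs discovered in two different orders.
helper_a = generate_helper("dsl_pt", "std::sqrt(px*px + py*py)",
                           {"py": "float", "px": "float"})
helper_b = generate_helper("dsl_pt", "std::sqrt(px*px + py*py)",
                           {"px": "float", "py": "float"})
call = define_call("dsl_pt", {"py": "float", "px": "float"})
```

Because both the definition and the call site sort by column name, the generated text is identical across runs, which makes the helpers easy to diff and cache. A named function also shows up by name in compiler diagnostics, which is presumably the debuggability benefit the commit message refers to.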

15 files changed

Lines changed: 9094 additions & 0 deletions
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
# Phase 1: IR Core

## Files Added
- `RDataFrameDSL/__init__.py` - Package exports
- `RDataFrameDSL/ir_types.py` - Type system (IRType, IRTypeKind, promote_types)
- `RDataFrameDSL/ir_nodes.py` - IR node classes (ConstantNode, VariableNode, BinaryOpNode, etc.)
- `RDataFrameDSL/ir_errors.py` - Error handling (IRError, SourceLocation, ErrorCollector)
- `tests/__init__.py` - Test package
- `tests/test_ir_core.py` - 78 test cases for Phase 1

## Files Modified
- (none - initial implementation)

## Key Implementation Notes

### Type System (ir_types.py)
- IRTypeKind enum with 9 types: Float32, Float64, Int32, Int64, UInt32, UInt64, Bool, Object, Unknown
- IRType dataclass with helper methods: is_numeric(), is_float(), is_int(), is_object(), to_cpp()
- Type promotion rules following C++ semantics (float wins, wider wins)
- Complete C++ type mapping including ROOT typedefs (Float_t, Int_t, Long64_t, etc.)
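
The "float wins, wider wins" promotion rule can be sketched as follows. This is an illustration under assumptions, not the contents of `ir_types.py`: the `Kind` enum, the width table, and the tie-break for equal-width signed/unsigned pairs (unsigned wins, as in the C++ usual arithmetic conversions) are all hypothetical, and Bool, Object, and Unknown are left out for brevity.

```python
from enum import Enum


class Kind(Enum):
    """Hypothetical stand-in for the numeric members of IRTypeKind."""
    INT32 = "int"
    UINT32 = "unsigned int"
    INT64 = "long long"
    UINT64 = "unsigned long long"
    FLOAT32 = "float"
    FLOAT64 = "double"


_FLOATS = {Kind.FLOAT32, Kind.FLOAT64}
_UNSIGNED = {Kind.UINT32, Kind.UINT64}
_WIDTH = {
    Kind.INT32: 32, Kind.UINT32: 32, Kind.FLOAT32: 32,
    Kind.INT64: 64, Kind.UINT64: 64, Kind.FLOAT64: 64,
}


def promote(a: Kind, b: Kind) -> Kind:
    """Result type of a binary arithmetic operation on operands a and b."""
    # Float wins: if any operand is floating point, pick the widest float present.
    floats = [k for k in (a, b) if k in _FLOATS]
    if floats:
        return max(floats, key=_WIDTH.get)
    # Both integers: the wider type wins.
    if _WIDTH[a] != _WIDTH[b]:
        return a if _WIDTH[a] > _WIDTH[b] else b
    # Same width: assume unsigned wins over signed (C++-style tie-break).
    return a if a in _UNSIGNED else b
```

Note that under this rule `promote(Kind.INT64, Kind.FLOAT32)` yields `Kind.FLOAT32`, matching C++, where a floating-point operand wins even over a wider integer.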

### IR Nodes (ir_nodes.py)
- Base IRNode with dtype, rank, is_jagged, and tree traversal methods
- Leaf nodes: ConstantNode (auto-infers type), VariableNode (with namespace for subframes)
- Operators: UnaryOpNode, BinaryOpNode (18 ops), TernaryOpNode
- Functions: CallNode (with namespace), MethodCallNode, PropertyAccessNode
- Indexing: SliceNode, SubscriptNode, CollectionIndexNode (with safe_mode)
- Factory functions for convenient node creation
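
A reduced picture of the node layout described above, as a sketch only. The class names (`Node`, `Constant`, `Variable`, `Binary`), the string dtypes, and the mapping of Python `int`/`float` to `int64`/`float64` are assumptions made for illustration; the real classes in `ir_nodes.py` carry more fields (for example `is_jagged`) and may infer constant types differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical base node: every node carries a dtype and a rank."""
    dtype: str = "unknown"
    rank: int = 0  # 0 = scalar, 1 = per-entry collection

    def children(self) -> List["Node"]:
        return []


@dataclass
class Constant(Node):
    value: object = None

    def __post_init__(self) -> None:
        # bool must be tested before int: bool is a subclass of int in Python.
        if isinstance(self.value, bool):
            self.dtype = "bool"
        elif isinstance(self.value, int):
            self.dtype = "int64"
        elif isinstance(self.value, float):
            self.dtype = "float64"


@dataclass
class Variable(Node):
    name: str = ""
    namespace: Optional[str] = None  # e.g. the subframe a column belongs to


@dataclass
class Binary(Node):
    op: str = "+"
    left: Optional[Node] = None
    right: Optional[Node] = None

    def children(self) -> List[Node]:
        return [n for n in (self.left, self.right) if n is not None]


def make_constant(value: object) -> Constant:
    """Factory: the dtype is inferred from the Python value."""
    return Constant(value=value)


def make_variable(name: str, dtype: str, namespace: Optional[str] = None) -> Variable:
    """Factory: a column reference with a known dtype."""
    return Variable(dtype=dtype, name=name, namespace=namespace)


# pt * 2, built through the factories
expr = Binary(op="*", left=make_variable("pt", "float32"), right=make_constant(2))
```

Exposing children through a single method on the base class is what lets generic passes (traversal, type inference, code generation) walk any expression without knowing the concrete node types.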

### Error Handling (ir_errors.py)
- IRErrorKind enum with 9 categories including MISSING_DICT
- SourceLocation for precise error positioning
- IRError with suggestions support ("Did you mean X?")
- ErrorCollector with 3 recovery modes: FAIL_ALL, SKIP_CONTINUE, FAIL_CHAIN
- Helper functions for common error patterns
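
The suggestion and collection mechanics can be sketched with the standard library alone. Everything here is an assumption for illustration: the names `Mode`, `Err`, `Collector`, and `unknown_variable` are hypothetical, `difflib` is a stand-in for whatever matching the package uses, and the mode semantics are guessed (FAIL_ALL raises on the first error, SKIP_CONTINUE records it and carries on). FAIL_CHAIN is listed but not modeled because the notes do not describe its behavior.

```python
import difflib
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional, Sequence


class Mode(Enum):
    FAIL_ALL = auto()       # assumed: abort on the first reported error
    SKIP_CONTINUE = auto()  # assumed: record the error and keep going
    FAIL_CHAIN = auto()     # named in the notes; semantics not modeled here


@dataclass
class Err:
    message: str
    suggestion: Optional[str] = None

    def __str__(self) -> str:
        hint = f' Did you mean "{self.suggestion}"?' if self.suggestion else ""
        return self.message + hint


class Collector:
    """Accumulates errors; whether report() raises depends on the mode."""

    def __init__(self, mode: Mode) -> None:
        self.mode = mode
        self.errors: List[Err] = []

    def report(self, err: Err) -> None:
        self.errors.append(err)
        if self.mode is Mode.FAIL_ALL:
            raise ValueError(str(err))


def unknown_variable(name: str, known: Sequence[str]) -> Err:
    """Build an 'unknown variable' error with the closest known name, if any."""
    close = difflib.get_close_matches(name, known, n=1)
    return Err(f'Unknown variable "{name}".', close[0] if close else None)


collector = Collector(Mode.SKIP_CONTINUE)
collector.report(unknown_variable("trakPt", ["trackPt", "trackEta"]))
first_message = str(collector.errors[0])
```

Collecting rather than raising lets a single validation pass report every misspelled column at once, instead of forcing one fix-and-rerun cycle per error.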

## Design Decisions
1. SliceNode inherits from IRNode for uniform tree traversal
2. Added walk_postorder() for bottom-up type inference
3. Added factory functions (make_constant, make_variable, etc.) for convenience
4. BroadcastInfo is defined here but only populated later, during type inference (Phase 2)
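
Why a post-order walk suits bottom-up type inference: every child is visited before its parent, so by the time an operator node is reached its operand types are already known. The sketch below uses a hypothetical minimal node class `N` and a deliberately crude inference rule (any `float64` operand makes the result `float64`, otherwise `int64`); it is not the package's `walk_postorder()` or its inference logic.

```python
from typing import Iterator, List, Optional


class N:
    """Hypothetical minimal tree node used only for this illustration."""

    def __init__(self, label: str, children: Optional[List["N"]] = None,
                 dtype: Optional[str] = None) -> None:
        self.label = label
        self.children = children or []
        self.dtype = dtype  # None means "not yet inferred"


def walk_postorder(node: N) -> Iterator[N]:
    """Yield all descendants of a node before the node itself."""
    for child in node.children:
        yield from walk_postorder(child)
    yield node


def infer_types(root: N) -> None:
    """Fill in missing dtypes bottom-up using a simplified 'float wins' rule."""
    for node in walk_postorder(root):
        if node.dtype is None:
            # Children were visited earlier in the walk, so their dtypes are set.
            is_float = any(c.dtype == "float64" for c in node.children)
            node.dtype = "float64" if is_float else "int64"


# (x * 2) + y, with x: int64, 2: int64, y: float64
mul = N("*", [N("x", dtype="int64"), N("2", dtype="int64")])
root = N("+", [mul, N("y", dtype="float64")])
visit_order = [n.label for n in walk_postorder(root)]
infer_types(root)
```

Having slice nodes derive from the same base class (decision 1) means a walk like this reaches them without special cases.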

## Known Limitations
1. SubscriptNode.is_fancy detection deferred to Phase 3 (IRBuilder)
2. BroadcastInfo is not yet computed (deferred to Phase 2)

## Test Results
- 78 tests passing
- Coverage: types, nodes, errors, tree traversal, integration
Lines changed: 195 additions & 0 deletions
@@ -0,0 +1,195 @@
"""
RDataFrameDSL - Python DSL for ROOT RDataFrame

This package provides a domain-specific language for constructing
ROOT RDataFrame analysis workflows with:

- C++ object navigation via Python syntax (track.getX())
- Automatic type inference from ROOT tree reflection
- N-key composite indices for calibration table joins
- NumPy-style slicing (1D and 2D)
- Two-phase validation (compile before RDF execution)

Basic Usage (intended API; RDFBuilder is not exported from this package in this commit):
    from RDataFrameDSL import RDFBuilder

    builder = RDFBuilder.from_tree("data.root", "tree")
    builder.add_alias("pt", "sqrt(px**2 + py**2)")
    rdf, handle = builder.build()
    rdf.Histo1D("pt").Draw()

Phase 1 Exports (IR Core):
    - Types: IRType, IRTypeKind
    - Nodes: ConstantNode, VariableNode, BinaryOpNode, etc.
    - Errors: IRError, IRErrorKind, ErrorCollector
"""

__version__ = "0.1.0"

# IR Types
from .ir_types import (
    IRTypeKind,
    IRType,
    promote_types,
    comparison_result_type,
    cpp_type_to_ir,
    CPP_TO_IR_TYPE,
    IR_TO_CPP_TYPE,
)

# IR Nodes
from .ir_nodes import (
    # Enums
    UnaryOp,
    BinaryOp,
    # Base
    IRNode,
    # Leaf nodes
    ConstantNode,
    VariableNode,
    # Operator nodes
    UnaryOpNode,
    BinaryOpNode,
    TernaryOpNode,
    # Function/method nodes
    CallNode,
    MethodCallNode,
    PropertyAccessNode,
    # Indexing nodes
    SliceNode,
    SubscriptNode,
    CollectionIndexNode,
    # Helpers
    BroadcastInfo,
    # Factory functions
    make_constant,
    make_variable,
    make_binary_op,
    make_unary_op,
    make_call,
    make_method_call,
    make_subscript,
)

# Class Reflection
from .reflection import (
    ReflectionCache,
    MethodInfo,
    PropertyInfo,
)

# IR Builder
from .ir_builder import (
    IRBuilder,
    BuildContext,
)

# Type Inference
from .type_inferrer import (
    TypeInferrer,
    VariableInfo,
    extract_inner_type,
    is_vector_type,
    is_rvec_type,
)

# IR Errors
from .ir_errors import (
    IRErrorKind,
    SourceLocation,
    IRError,
    ErrorRecoveryMode,
    ErrorCollector,
    # Helper functions
    type_mismatch_error,
    unknown_variable_error,
    method_not_found_error,
    property_not_found_error,
    missing_dictionary_error,
    rank_mismatch_error,
    unsupported_operation_error,
    compile_error,
)

# C++ Code Generation
from .backend_cpp import (
    CppCodeGenerator,
    GeneratedFunction,
    FunctionLibrary,
    FUNCTION_HEADERS,
)

__all__ = [
    # Version
    '__version__',

    # Types
    'IRTypeKind',
    'IRType',
    'promote_types',
    'comparison_result_type',
    'cpp_type_to_ir',
    'CPP_TO_IR_TYPE',
    'IR_TO_CPP_TYPE',

    # Type Inference
    'TypeInferrer',
    'VariableInfo',
    'extract_inner_type',
    'is_vector_type',
    'is_rvec_type',

    # IR Builder
    'IRBuilder',
    'BuildContext',

    # Class Reflection
    'ReflectionCache',
    'MethodInfo',
    'PropertyInfo',

    # Nodes
    'UnaryOp',
    'BinaryOp',
    'IRNode',
    'ConstantNode',
    'VariableNode',
    'UnaryOpNode',
    'BinaryOpNode',
    'TernaryOpNode',
    'CallNode',
    'MethodCallNode',
    'PropertyAccessNode',
    'SliceNode',
    'SubscriptNode',
    'CollectionIndexNode',
    'BroadcastInfo',
    'make_constant',
    'make_variable',
    'make_binary_op',
    'make_unary_op',
    'make_call',
    'make_method_call',
    'make_subscript',

    # Errors
    'IRErrorKind',
    'SourceLocation',
    'IRError',
    'ErrorRecoveryMode',
    'ErrorCollector',
    'type_mismatch_error',
    'unknown_variable_error',
    'method_not_found_error',
    'property_not_found_error',
    'missing_dictionary_error',
    'rank_mismatch_error',
    'unsupported_operation_error',
    'compile_error',

    # C++ Code Generation
    'CppCodeGenerator',
    'GeneratedFunction',
    'FunctionLibrary',
    'FUNCTION_HEADERS',
]
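
Since the `__all__` list above is maintained by hand alongside the import blocks, a small consistency check can catch a name that is listed but never imported, or listed twice. The helpers below are a hypothetical sketch, not part of the package or its test suite; they are demonstrated on a throwaway module object so the example runs without ROOT or the package installed.

```python
from types import ModuleType
from typing import List


def missing_exports(module: ModuleType) -> List[str]:
    """Names listed in module.__all__ that the module does not define."""
    return [name for name in getattr(module, "__all__", [])
            if not hasattr(module, name)]


def duplicate_exports(module: ModuleType) -> List[str]:
    """Names that appear more than once in module.__all__."""
    seen, dups = set(), []
    for name in getattr(module, "__all__", []):
        if name in seen:
            dups.append(name)
        seen.add(name)
    return dups


# Throwaway module standing in for a real package.
demo = ModuleType("demo")
demo.a = 1
demo.__all__ = ["a", "b", "a"]

missing = missing_exports(demo)
dups = duplicate_exports(demo)
```

Applied to the real package, one would pass the imported module object instead of `demo` and expect both lists to be empty.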
