- The format of the source data, e.g. ``table``, ``parquet``, ``csv`` or ``json``. All formats supported by Spark can be used; see the `PySpark Data Sources API <https://spark.apache.org/docs/3.5.3/sql-data-sources.html>`_.
* - **path**
- ``string``
-
- The location to load the source data from. This can be a table name or a path to a a file or directory with multiple snapshots. A placeholder ``{version}`` can be used in this path which will be substituted with the version value in run time.
+
- The location to load the source data from. This can be a table name or a path to a file or directory with multiple snapshots. The path supports three pattern styles for version extraction: the ``{version}`` placeholder (simple single-segment match), the ``{fragment}`` placeholder (for multi-file snapshots), and regex named capture groups (for complex partitioning). See :ref:`file-path-patterns` for details and examples.
* - **versionType**
- ``string``
- The type of versioning to use. Can be either ``int`` or ``datetime``.
- (*optional*) A list of select expressions to apply to the source data.
* - **filter**
- ``string``
-
- (*optional*) A filter expression to apply to the source data. This filter is applied to the dataframe as a WHERE clause when the source is read. A placeholder ``{version}`` can be used in this filter expression which will be substituted with the version value in run time.
+
- (*optional*) A filter expression to apply to the source data. This filter is applied to the dataframe as a WHERE clause when the source is read. The placeholder ``{version}`` can be used in this filter expression and will be substituted with the version value at run time (e.g. ``"year = '{version}'"``). Not applicable when using regex named capture groups in ``path``.
* - **recursiveFileLookup**
- ``boolean``
- (*optional*) When set to ``true``, enables recursive directory traversal to find snapshot files. Use this when snapshots are stored in a nested directory structure, such as Hive-style partitioning (e.g. ``/data/{version}/file.parquet``). When set to ``false``, only files in the immediate directory are searched. Default: ``false``.
.. note::
-
If ``recursiveFileLookup`` is set to ``true``, ensure that the ``path`` parameter is specified in a way that is compatible with recursive directory traversal. I.e. the ``{version}`` placeholder is used in the path and not the filename.
+
If ``recursiveFileLookup`` is set to ``true``, ensure that the ``path`` parameter is compatible with recursive directory traversal. When using the ``{version}`` placeholder, place it in the directory portion of the path rather than the filename (e.g. ``/data/{version}/file.parquet``). When using regex named capture groups, the pattern spans the full relative path from the first dynamic segment, so ``recursiveFileLookup`` must be ``true`` if the version spans multiple directory levels.
+
+
.. _file-path-patterns:
+
+
File Path Patterns
+
^^^^^^^^^^^^^^^^^^
+
+
The ``path`` field supports three styles for expressing where the version (and optional fragment) appears in the file path. All styles can be combined with a static base path prefix that is resolved at run time (e.g. ``{sample_file_location}``).
+
+
.. list-table::
+
:header-rows: 1
+
:widths: 20 35 45
+
+
* - Style
+
- Syntax
+
- When to Use
+
* - ``{version}`` placeholder
+
- ``{version}``
+
- Version is contained in a single path segment or filename component. Simple and readable for flat or single-level partitioned layouts.
+
* - ``{fragment}`` placeholder
+
- ``{fragment}``
+
- Snapshot data for a single version is split across multiple files. Use alongside ``{version}`` to group files sharing the same version together.
+
* - Regex named capture groups
+
- ``(?P<version_<name>>.+)``
+
- Version is spread across multiple path segments or interleaved with other text. Supports complex partitioning schemes (e.g. Hive-style ``YEAR=.../MONTH=.../DAY=...``) where the version cannot be expressed as a single placeholder.
+
+
**``{version}`` — single-segment version**
+
+
The ``{version}`` placeholder matches one path segment or filename component. It is internally converted to a regex named capture group ``(?P<version_main>.+)``.
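As a rough illustration of that conversion, the sketch below turns a path template containing ``{version}`` into a regex and extracts the version from a file name. The helper ``template_to_regex`` and the sample file names are hypothetical; only the ``(?P<version_main>.+)`` group comes from the description above, and the framework's real implementation may differ.

```python
import re


def template_to_regex(template: str) -> re.Pattern:
    """Hypothetical helper: escape the literal parts of the template and
    replace ``{version}`` with the named capture group described above."""
    literal_parts = template.split("{version}")
    escaped = [re.escape(part) for part in literal_parts]
    return re.compile("(?P<version_main>.+)".join(escaped) + "$")


# Hypothetical template and file name, used only for illustration.
pattern = template_to_regex("customer_{version}.csv")
match = pattern.match("customer_20240115.csv")
version = match.group("version_main") if match else None
print(version)  # 20240115
```

The extracted string would then be interpreted according to ``versionType``, for example as an integer or via ``datetimeFormat``.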
Use ``{fragment}`` alongside ``{version}`` when a single snapshot version is split across multiple files. All files sharing the same version are read and unioned together before CDC processing.
Files matched and grouped by version: ``customer_2024_01_01_split_1.csv``, ``customer_2024_01_01_split_2.csv`` → both ingested as version ``2024-01-01``.
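The grouping step can be pictured with the sketch below. It assumes a hypothetical template ``customer_{version}_split_{fragment}.csv`` and assumes that ``{fragment}`` becomes a named group called ``fragment`` (only the ``version_main`` group is documented here). It collects file names per version string; the read-and-union step performed by the framework is not shown.

```python
import re
from collections import defaultdict

# Assumed regex equivalent of the hypothetical template
# "customer_{version}_split_{fragment}.csv".
PATTERN = re.compile(
    r"customer_(?P<version_main>.+)_split_(?P<fragment>.+)\.csv$"
)


def group_by_version(file_names):
    """Map each raw version string to the sorted list of files carrying it."""
    groups = defaultdict(list)
    for name in file_names:
        match = PATTERN.match(name)
        if match:  # files that do not fit the pattern are ignored
            groups[match.group("version_main")].append(name)
    return {version: sorted(names) for version, names in groups.items()}


files = [
    "customer_2024_01_01_split_2.csv",
    "customer_2024_01_01_split_1.csv",
    "customer_2024_01_02_split_1.csv",
]
grouped = group_by_version(files)
for version, names in sorted(grouped.items()):
    print(version, names)
```

Here the raw version string is ``2024_01_01``; turning it into the date ``2024-01-01`` would depend on a matching ``datetimeFormat`` such as ``%Y_%m_%d``, which is an assumption in this sketch.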
+
+
**Regex named capture groups — multi-segment versions**
+
+
For cases where the version is distributed across multiple directory levels or interleaved with fixed text, use Python regex named capture groups with the prefix ``version_``. All groups whose names start with ``version_`` are extracted and concatenated **in the order they appear in the pattern** (left to right) to form the final version string, which is then parsed according to ``datetimeFormat`` or treated as an integer.
+
+
Group naming convention: ``(?P<version_<name>>.+)``. The ``<name>`` suffix is arbitrary but must be unique within the pattern. The concatenation order is determined by the position of each group in the path expression, not the name.
For the file ``2024/01/data/customer_15.csv``, the groups are captured left-to-right: ``version_year=2024``, ``version_month=01``, ``version_day=15``. These are concatenated in pattern order to produce ``"20240115"``, which is then parsed with ``datetimeFormat: "%Y%m%d"``.
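The concatenation rule can be checked with plain Python. The sketch below uses a hypothetical pattern written to fit the file ``2024/01/data/customer_15.csv``; the pattern, the constant ``PATTERN`` and the function ``extract_version`` are illustrative assumptions and not the framework's own code.

```python
import re
from datetime import datetime

# Hypothetical pattern for files such as "2024/01/data/customer_15.csv".
PATTERN = re.compile(
    r"(?P<version_year>\d{4})/(?P<version_month>\d{2})/data/"
    r"customer_(?P<version_day>\d{2})\.csv$"
)


def extract_version(relative_path: str, datetime_format: str) -> datetime:
    """Concatenate all ``version_*`` groups in pattern order, then parse."""
    match = PATTERN.match(relative_path)
    if match is None:
        raise ValueError(f"path does not match pattern: {relative_path}")
    # groupindex maps group name -> group number; numbers follow the
    # left-to-right position of each group in the pattern.
    ordered = sorted(PATTERN.groupindex.items(), key=lambda item: item[1])
    version = "".join(
        match.group(name) for name, _ in ordered if name.startswith("version_")
    )
    return datetime.strptime(version, datetime_format)


parsed = extract_version("2024/01/data/customer_15.csv", "%Y%m%d")
print(parsed.date())  # 2024-01-15
```

Sorting by group number rather than by name is what makes the result depend on position in the pattern, which matches the rule stated above.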
+
+
.. tip::
+
+
Arrange your ``(?P<version_...>)`` groups in the path from left to right in the same order that their values should be concatenated to match your ``datetimeFormat``. The group names themselves only need to be unique — their order in the pattern controls concatenation.
+
+
See ``samples/bronze_sample/src/dataflows/feature_samples/dataflowspec/historical_snapshot_files_datetime_recursive_and_partitioned_regex_main.json`` for a complete working example.
The ``source`` object contains the following properties for ``table``-based sources: