docs/ingestion/data-formats.md
70 additions & 1 deletion
@@ -73,6 +73,74 @@ Besides text formats, Druid also supports binary formats such as [Orc](#orc) and
Druid supports custom text data formats and can use the Regex input format to parse them. However, be aware that parsing data this way
is less efficient than writing a native Java `InputFormat` extension or using an external stream processor. We welcome contributions of new input formats.

## Regex engine configuration

The `regex` input format supports configurable regex engines using the runtime property:
enables the RE2/J regex engine for ingestion task `regex` input formats.

RE2/J helps protect against catastrophic backtracking and Regular Expression Denial of Service (ReDoS) attacks by guaranteeing linear-time regex evaluation.

### Compatibility differences

RE2/J does not support all Java regex features.

Unsupported or partially supported features include:

- backreferences
- lookbehind assertions
- some advanced backtracking behavior

Patterns using unsupported constructs will fail during regex compilation.
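
Because unsupported constructs only surface when a pattern is compiled, it can help to screen patterns before switching engines. The following is a minimal, heuristic sketch in plain Java. `Re2jCompatibilityCheck` and `findUnsupportedConstructs` are hypothetical names, not part of Druid or RE2/J, and the scan only looks for the backreference and lookbehind constructs listed above. It does not track character classes or `\Q...\E` quoting, so it can report false positives; compiling the pattern with the RE2/J engine remains the authoritative check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Druid or RE2/J API.
final class Re2jCompatibilityCheck {

    /**
     * Heuristic scan for backreferences and lookbehind assertions.
     * This is not a regex parser: character classes and \Q...\E quoting
     * are not tracked, so false positives are possible.
     */
    static List<String> findUnsupportedConstructs(String pattern) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\\' && i + 1 < pattern.length()) {
                char next = pattern.charAt(i + 1);
                if (next >= '1' && next <= '9') {
                    // Numbered backreference such as \1
                    found.add("backreference at index " + i);
                } else if (next == 'k' && i + 2 < pattern.length()
                        && pattern.charAt(i + 2) == '<') {
                    // Named backreference such as \k<name>
                    found.add("named backreference at index " + i);
                }
                i++; // skip the escaped character so an escaped backslash is not misread
            } else if (pattern.startsWith("(?<=", i) || pattern.startsWith("(?<!", i)) {
                // Positive or negative lookbehind
                found.add("lookbehind at index " + i);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findUnsupportedConstructs("(\\w+) \\1"));
        System.out.println(findUnsupportedConstructs("(?<=id=)\\d+"));
        System.out.println(findUnsupportedConstructs("^(\\d+),(\\w+)$"));
    }
}
```

An empty list only means that none of the screened constructs were spotted, not that the pattern is guaranteed to compile under RE2/J.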

### Example of catastrophic backtracking

The following Java regex may cause catastrophic backtracking:

```regex
^(.*a){20}$
```

when evaluated against input such as:

```text
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaX
```

Using `RE2J` avoids this issue because it evaluates patterns in time linear in the input length.
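
To observe the growth in work without hanging a JVM, the sketch below counts how many characters `java.util.regex` reads while it attempts a match. It assumes a reduced repeat count (`{10}` instead of `{20}`) and shorter inputs so that it terminates quickly. `BacktrackingCost`, `CountingCharSequence`, `countCharReads`, and `failingInput` are illustrative names, not Druid APIs.

```java
import java.util.regex.Pattern;

// Illustrative sketch: character reads serve as a deterministic stand-in
// for wall-clock time spent by the backtracking matcher.
final class BacktrackingCost {

    /** CharSequence wrapper that counts the charAt calls made by the matcher. */
    static final class CountingCharSequence implements CharSequence {
        private final String text;
        long reads = 0;

        CountingCharSequence(String text) {
            this.text = text;
        }

        @Override public char charAt(int index) {
            reads++;
            return text.charAt(index);
        }

        @Override public int length() {
            return text.length();
        }

        @Override public CharSequence subSequence(int start, int end) {
            return text.subSequence(start, end);
        }

        @Override public String toString() {
            return text;
        }
    }

    /** Returns the number of character reads made while deciding a full match. */
    static long countCharReads(Pattern pattern, String input) {
        CountingCharSequence counted = new CountingCharSequence(input);
        pattern.matcher(counted).matches();
        return counted.reads;
    }

    /** Builds n copies of 'a' followed by a trailing 'X' that forces a mismatch. */
    static String failingInput(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('a');
        }
        return sb.append('X').toString();
    }

    public static void main(String[] args) {
        // Assumption: {10} rather than {20}, so the demo finishes quickly.
        Pattern pattern = Pattern.compile("^(.*a){10}$");
        for (int n = 12; n <= 18; n += 2) {
            long reads = countCharReads(pattern, failingInput(n));
            System.out.println(n + " a's: " + reads + " character reads");
        }
    }
}
```

On a backtracking engine, the reported count should grow much faster than the input length as more `a` characters are added. That is the behavior a linear-time engine such as RE2/J is designed to avoid.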

### Performance considerations

- `JAVA` supports more advanced regex syntax and behavior.
- `RE2J` provides safer and more predictable runtime characteristics.
- For trusted internal ingestion specs, `JAVA` may be preferred for compatibility.
- For externally supplied regex patterns, `RE2J` is recommended.

## Input format

You can use the `inputFormat` field to specify the data format for your input data.
@@ -897,7 +965,8 @@ This query returns:
|---------------------|-----------------|
|`1680795276351`|`partition-1`|

## FlattenSpec

You can use the `flattenSpec` object to flatten nested data, as an alternative to the Druid [nested columns](../querying/nested-columns.md) feature, and for nested input formats unsupported by the feature. It is an object within the `inputFormat` object.