Storage Service Connect, path specification #27408
Conversation
✅ TypeScript Types Auto-Updated: The generated TypeScript types have been automatically updated based on JSON schema changes in this PR.

The Python checkstyle failed. Please run `…`. You can install the pre-commit hooks with `…`.
Pull request overview
This PR introduces an inline “manifest entries” configuration for Storage ingestion to enable glob-based auto-discovery (with optional partition detection), and starts deprecating legacy manifest-file approaches.
Changes:
- Add `manifest` to the Storage ingestion pipeline schema, plus a new `ManifestEntry` schema definition.
- Implement path-pattern utilities (glob → regex, table-root grouping, Hive partition detection) and wire inline manifest auto-discovery into the S3 connector.
- Add unit and integration tests covering pattern matching, auto-discovery, and S3 method behavior.
Reviewed changes
Copilot reviewed 9 out of 14 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| openmetadata-spec/src/main/resources/json/schema/metadataIngestion/storageServiceMetadataPipeline.json | Deprecates storageMetadataConfigSource in UI/schema and adds new manifest array configuration. |
| openmetadata-spec/src/main/resources/json/schema/metadataIngestion/storage/manifestEntry.json | New schema defining inline manifest entry fields (patterns, excludes, partition config, unstructured toggle). |
| ingestion/src/metadata/utils/path_pattern.py | New helper utilities for glob matching, table grouping, format inference, and partition detection. |
| ingestion/src/metadata/ingestion/source/storage/storage_service.py | Adds shared auto-discovery routine driven by manifest entries + deprecation warning for global manifest. |
| ingestion/src/metadata/ingestion/source/storage/s3/metadata.py | Integrates inline manifest auto-discovery into S3 ingestion and adds an S3 implementation of list_keys. |
| ingestion/tests/unit/test_path_pattern.py | Unit tests for the new path pattern utilities. |
| ingestion/tests/unit/topology/storage/test_s3_auto_discovery.py | Unit tests for manifest-entry-driven auto-discovery behavior. |
| ingestion/tests/unit/topology/storage/test_s3_methods.py | Unit tests for S3Source helper methods (listing, URL building, sampling, metrics, etc.). |
| ingestion/tests/integration/s3/test_s3_path_specs.py | Integration tests validating inline-manifest auto-discovery against MinIO test data and migration behavior. |
```python
# Replace placeholder with recursive match (zero or more path segments)
# Strip adjacent / to avoid double slashes: /DOUBLESTAR/ -> match
escaped = re.sub(
    re.escape("/") + re.escape(placeholder) + re.escape("/"),
    "(?:/|/.+/)",
    escaped,
)
escaped = escaped.replace(re.escape(placeholder), ".*")
```
pattern_to_regex doesn’t correctly implement the documented ** semantics (“zero or more path segments”) when ** appears at the start of the pattern (e.g. **/*.parquet). The generated regex requires a /, so it won’t match root-level files like file.parquet. Consider special-casing a leading **/ (and possibly trailing /**) so that ** can match an empty segment list.
Suggested change:

```diff
-# Replace placeholder with recursive match (zero or more path segments)
-# Strip adjacent / to avoid double slashes: /DOUBLESTAR/ -> match
-escaped = re.sub(
-    re.escape("/") + re.escape(placeholder) + re.escape("/"),
-    "(?:/|/.+/)",
-    escaped,
-)
-escaped = escaped.replace(re.escape(placeholder), ".*")
+placeholder_escaped = re.escape(placeholder)
+slash_escaped = re.escape("/")
+# Replace placeholder with recursive match (zero or more path segments).
+# Special-case leading `**/` and trailing `/**` so `**` can match an
+# empty segment list without forcing an extra slash.
+escaped = re.sub(
+    "^" + placeholder_escaped + slash_escaped,
+    "(?:[^/]+/)*",
+    escaped,
+)
+escaped = re.sub(
+    slash_escaped + placeholder_escaped + slash_escaped,
+    "/(?:[^/]+/)*",
+    escaped,
+)
+escaped = re.sub(
+    slash_escaped + placeholder_escaped + "$",
+    "(?:/(?:[^/]+(?:/[^/]+)*)?)?",
+    escaped,
+)
+escaped = escaped.replace(placeholder_escaped, ".*")
```
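To make the failure mode concrete, here is a minimal, self-contained sketch of the leading-`**/` special case (the `glob_to_regex` helper below is a reduced stand-in for the PR's `pattern_to_regex`, not the actual implementation):

```python
import re

def glob_to_regex(pattern: str) -> str:
    # Reduced stand-in for pattern_to_regex: only the leading-`**/`
    # special case from the suggestion above is shown.
    placeholder = "\x00DOUBLESTAR\x00"
    pattern = pattern.replace("**", placeholder)
    escaped = "".join(
        "[^/]+" if ch == "*" else "[^/]" if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    # Leading `**/` must be able to match zero segments, so the whole
    # `segment/` group is optional rather than requiring a slash.
    escaped = re.sub("^" + re.escape(placeholder) + "/", "(?:[^/]+/)*", escaped)
    escaped = escaped.replace(re.escape(placeholder), ".*")
    return f"^{escaped}$"

# Root-level files now match a leading `**/` pattern:
assert re.match(glob_to_regex("**/*.parquet"), "file.parquet")
assert re.match(glob_to_regex("**/*.parquet"), "a/b/file.parquet")
```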
```python
for key in keys:
    if not key.startswith(root_prefix):
        continue

    relative = key[len(root_prefix) :]
    parts = relative.split("/")

    # Collect partition segments (key=value) between root and file
    current_partitions = []
    for part in parts[:-1]:  # Exclude the filename
        match = HIVE_PARTITION_PATTERN.match(part)
        if match:
            col_name = match.group(1)
            col_value = match.group(2)
            current_partitions.append(col_name)
            partition_values.setdefault(col_name, []).append(col_value)
        elif current_partitions:
            # Non-partition segment after partition segments — inconsistent
            break

    if current_partitions:
        partition_structures.append(current_partitions)
```
detect_hive_partitions claims to return partition columns only if all files share the same partition structure, but the current implementation ignores non-partitioned files (it only appends partition_structures when current_partitions is non-empty). This can incorrectly mark a table as partitioned when the group contains a mix of partitioned and flat files. Consider tracking whether any file under table_root has zero Hive partitions and treating that as inconsistent (return None).
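A sketch of the fix, assuming the structure of the excerpt above (simplified names; `hive_pattern` stands in for `HIVE_PARTITION_PATTERN`). Every file contributes a structure, including the empty one, so mixed groups come back as `None`:

```python
import re
from typing import List, Optional

def shared_partition_columns(
    keys: List[str], root_prefix: str, hive_pattern: re.Pattern
) -> Optional[List[str]]:
    # Sketch of the suggested consistency check: record a structure for
    # EVERY file under the root, including the empty list for flat files,
    # so a mix of partitioned and flat files is treated as inconsistent.
    structures = []
    for key in keys:
        if not key.startswith(root_prefix):
            continue
        parts = key[len(root_prefix) :].split("/")
        current = []
        for part in parts[:-1]:  # exclude the filename
            match = hive_pattern.match(part)
            if match:
                current.append(match.group(1))
            elif current:
                break  # non-partition segment after partition segments
        structures.append(current)
    if not structures or any(s != structures[0] for s in structures):
        return None  # empty group or inconsistent layouts
    return structures[0] or None  # all-flat groups are "not partitioned"

pattern = re.compile(r"^([^=/]+)=([^/]+)$")
assert shared_partition_columns(
    ["t/year=2024/a.parquet", "t/b.parquet"], "t/", pattern
) is None  # mixed partitioned + flat files -> inconsistent
```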
```python
    Handles compound extensions like .csv.gz and .parquet.snappy.
    """
    lower_key = key.lower()
    # Check compound extensions first (e.g., .csv.gz, .json.gz)
    if lower_key.endswith(".gz") or lower_key.endswith(".zip"):
        base = lower_key.rsplit(".", 1)[0]
        for ext, fmt in EXTENSION_TO_FORMAT.items():
            if base.endswith(ext):
                return fmt
```
infer_structure_format docstring says it handles compound extensions like .parquet.snappy, but the implementation only special-cases .gz and .zip. Either extend the logic to handle .snappy (and/or other common parquet suffixes) or update the docstring so it matches actual behavior.
Suggested change:

```diff
-    Handles compound extensions like .csv.gz and .parquet.snappy.
-    """
-    lower_key = key.lower()
-    # Check compound extensions first (e.g., .csv.gz, .json.gz)
-    if lower_key.endswith(".gz") or lower_key.endswith(".zip"):
-        base = lower_key.rsplit(".", 1)[0]
-        for ext, fmt in EXTENSION_TO_FORMAT.items():
-            if base.endswith(ext):
-                return fmt
+    Handles common compound extensions like .csv.gz, .json.zip, and
+    .parquet.snappy.
+    """
+    lower_key = key.lower()
+    compound_suffixes = (".gz", ".zip", ".snappy")
+    # Check compound extensions first by removing one trailing
+    # compression/archive suffix (e.g., .csv.gz, .json.zip, .parquet.snappy).
+    for suffix in compound_suffixes:
+        if lower_key.endswith(suffix):
+            base = lower_key[: -len(suffix)]
+            for ext, fmt in EXTENSION_TO_FORMAT.items():
+                if base.endswith(ext):
+                    return fmt
+            break
```
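For illustration, a reduced, runnable sketch of the amended branch (the `EXTENSION_TO_FORMAT` table here is a small stand-in for the real mapping):

```python
EXTENSION_TO_FORMAT = {".csv": "csv", ".json": "json", ".parquet": "parquet"}

def infer_format(key: str):
    # Simplified sketch of infer_structure_format after the suggestion
    # above: strip one trailing compression/archive suffix, then match
    # the remaining extension.
    lower_key = key.lower()
    for suffix in (".gz", ".zip", ".snappy"):
        if lower_key.endswith(suffix):
            lower_key = lower_key[: -len(suffix)]
            break
    for ext, fmt in EXTENSION_TO_FORMAT.items():
        if lower_key.endswith(ext):
            return fmt
    return None

assert infer_format("events.csv.gz") == "csv"
assert infer_format("part-0001.parquet.snappy") == "parquet"
assert infer_format("raw.json") == "json"
```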
```python
    * -> matches any single path segment (no /)
    ** -> matches zero or more path segments (including /)
    ? -> matches any single character (not /)

    Examples:
        "data/*/events/*.parquet" matches "data/warehouse/events/file.parquet"
        "data/**/*.json" matches "data/a/b/c/file.json"
    """
    # Replace ** with a placeholder, then handle * and ?
    # ** must be handled first since * is a substring of **
    placeholder = "\x00DOUBLESTAR\x00"
    pattern = pattern.replace("**", placeholder)

    escaped = ""
    for char in pattern:
        if char == "*":
            escaped += "[^/]+"
        elif char == "?":
            escaped += "[^/]"
        else:
            escaped += re.escape(char)
```
extract_static_prefix treats `[` as a wildcard indicator (see tests for patterns like `data/[abc]/*.parquet`), but pattern_to_regex currently escapes `[` and `]` and does not implement glob character classes. As a result, patterns using bracket wildcards won't match any keys. Either implement `[...]` glob support in pattern_to_regex or remove bracket-wildcard handling/tests to avoid implying support.
Suggested change:

```diff
-    * -> matches any single path segment (no /)
-    ** -> matches zero or more path segments (including /)
-    ? -> matches any single character (not /)
-
-    Examples:
-        "data/*/events/*.parquet" matches "data/warehouse/events/file.parquet"
-        "data/**/*.json" matches "data/a/b/c/file.json"
-    """
-    # Replace ** with a placeholder, then handle * and ?
-    # ** must be handled first since * is a substring of **
-    placeholder = "\x00DOUBLESTAR\x00"
-    pattern = pattern.replace("**", placeholder)
-
-    escaped = ""
-    for char in pattern:
-        if char == "*":
-            escaped += "[^/]+"
-        elif char == "?":
-            escaped += "[^/]"
-        else:
-            escaped += re.escape(char)
+    * -> matches any single path segment (no /)
+    ** -> matches zero or more path segments (including /)
+    ? -> matches any single character (not /)
+    [abc] -> matches a single character from the set
+    [!abc] -> matches a single character not in the set
+
+    Examples:
+        "data/*/events/*.parquet" matches "data/warehouse/events/file.parquet"
+        "data/**/*.json" matches "data/a/b/c/file.json"
+        "data/[ab]/*.parquet" matches "data/a/file.parquet"
+    """
+    # Replace ** with a placeholder, then handle *, ?, and character classes.
+    # ** must be handled first since * is a substring of **
+    placeholder = "\x00DOUBLESTAR\x00"
+    pattern = pattern.replace("**", placeholder)
+
+    escaped = ""
+    i = 0
+    while i < len(pattern):
+        char = pattern[i]
+        if char == "*":
+            escaped += "[^/]+"
+        elif char == "?":
+            escaped += "[^/]"
+        elif char == "[":
+            end = i + 1
+            if end < len(pattern) and pattern[end] in ("!", "]"):
+                end += 1
+            while end < len(pattern) and pattern[end] != "]":
+                end += 1
+            if end >= len(pattern):
+                escaped += re.escape(char)
+            else:
+                content = pattern[i + 1 : end]
+                if not content:
+                    escaped += re.escape(char)
+                else:
+                    is_negated = content[0] in ("!", "^")
+                    class_content = content[1:] if is_negated else content
+                    if not class_content:
+                        escaped += re.escape(char)
+                    else:
+                        regex_class = ""
+                        for idx, class_char in enumerate(class_content):
+                            if class_char == "\\":
+                                regex_class += "\\\\"
+                            elif class_char in ("[", "]"):
+                                regex_class += "\\" + class_char
+                            elif class_char == "^" and idx == 0:
+                                regex_class += "\\^"
+                            else:
+                                regex_class += class_char
+                        escaped += f"[{'^' if is_negated else ''}{regex_class}]"
+                    i = end
+        else:
+            escaped += re.escape(char)
+        i += 1
```
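If bracket classes land as suggested, these are the kinds of cases worth pinning down in `test_path_pattern.py`. A sketch, with `pattern_to_regex` passed in so the snippet does not assume an import path; the expectations mirror fnmatch-style class semantics:

```python
import re

# Expected fnmatch-style character-class behavior for the suggestion above.
CASES = [
    ("data/[ab]/*.parquet", "data/a/file.parquet", True),
    ("data/[ab]/*.parquet", "data/c/file.parquet", False),
    ("data/[!ab]/*.parquet", "data/c/file.parquet", True),
    ("logs/file[0-9].json", "logs/file7.json", True),
    ("logs/file[0-9].json", "logs/fileX.json", False),
]

def check_character_classes(pattern_to_regex):
    for pattern, key, expected in CASES:
        regex = pattern_to_regex(pattern)
        assert bool(re.fullmatch(regex, key)) == expected, (pattern, key)
```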
|
```typescript
   */
  tags?: TagLabel[];
  name: string;
  [property: string]: any;
```
💡 Quality: PartitionColumn allows arbitrary extra properties via index signature
The generated PartitionColumn interface now includes [property: string]: any; because the inline schema definition in containerMetadataConfig.json and manifestMetadataConfig.json omits "additionalProperties": false. This weakens TypeScript type-checking — any typo in a property name will silently compile. The related partitionColumnDetails in table.json correctly sets additionalProperties: false; the same should be applied here.
Suggested fix:
Add `"additionalProperties": false` to the inline PartitionColumn object definition in both `containerMetadataConfig.json` and `manifestMetadataConfig.json`, then regenerate the TypeScript types. This will remove the `[property: string]: any;` index signature and restore strict typing.
```typescript
    onFocus?.(props.id, props.value);
  }, [onFocus, props.id, props.value]);
```
onFocusHandler calls onFocus?.(props.id, props.value), but props.value is always undefined here because value was destructured out of the props. This means consumers receive the wrong value on focus. Use the destructured value (or effectiveValue if that's what should be surfaced) and update the dependency array accordingly.
Suggested change:

```diff
-    onFocus?.(props.id, props.value);
-  }, [onFocus, props.id, props.value]);
+    onFocus?.(props.id, value);
+  }, [onFocus, props.id, value]);
```
```typescript
  const hasUserValue = typeof value === 'string' && value.trim().length > 0;
  const effectiveValue = hasUserValue ? value : SAMPLE_MANIFEST_JSON;
```
The editor uses SAMPLE_MANIFEST_JSON as the displayed value whenever the actual form value is empty. This can mislead users (the UI appears populated/valid, but the submitted defaultManifest remains empty unless the user edits). Consider treating empty as truly empty for validation/status, and provide the sample via an explicit “Insert sample” action or separate template block.
```typescript
      return {
        status: 'error',
        message: `Invalid JSON: ${
          err instanceof Error ? err.message : String(err)
        }`,
```
validateManifestJson produces hardcoded English error text for JSON parse failures, which is then rendered directly in the UI. Please localize this via i18n (e.g., use the new manifest-invalid-json message key with the error detail interpolated).
```typescript
      case 'string[]':
        if (!Array.isArray(value)) {
          return 'expected an array of strings';
        }
```
checkType returns hardcoded English validation strings (e.g., "expected an array of strings"), which are user-facing via the widget error <Alert>. Please route these through i18n (use the new manifest-entry-type-error / related keys) instead of returning literal English strings.
```typescript
      const col = columns[i];
      if (typeof col !== 'object' || col === null) {
        return `Entry ${
          entryIndex + 1
        }: partitionColumns[${i}] must be an object.`;
      }
```
validatePartitionColumns returns hardcoded English errors (e.g., "partitionColumns[...] must be an object"), which are displayed directly to users. Please localize these messages via i18n (e.g., manifest-partition-column-must-be-object, etc.) with the relevant indexes/fields as interpolation params.
Gitar Code Review: 👍 Approved with suggestions (6 resolved / 7 findings)

Refactored partition logic resolves path matching errors for glob patterns and corrects legacy docstrings. Remove the index signature in `PartitionColumn` to restrict arbitrary properties.

💡 Quality: PartitionColumn allows arbitrary extra properties via index signature

- 📄 openmetadata-ui/src/main/resources/ui/src/generated/metadataIngestion/storage/containerMetadataConfig.ts:102
- 📄 openmetadata-ui/src/main/resources/ui/src/generated/metadataIngestion/storage/manifestMetadataConfig.ts:103
- 📄 openmetadata-spec/src/main/resources/json/schema/metadataIngestion/storage/containerMetadataConfig.json:86-100
- 📄 openmetadata-spec/src/main/resources/json/schema/metadataIngestion/storage/manifestMetadataConfig.json:84-98

The generated `PartitionColumn` interface accepts arbitrary extra properties because the inline schema definitions omit `"additionalProperties": false`; see the suggested fix above.

✅ 6 resolved:
- ✅ Bug: Missing space in deprecation warning produces "thepipeline"
- ✅ Bug: pattern_to_regex: …
```typescript
    onFocus?.(props.id, props.value);
  }, [onFocus, props.id, props.value]);
```
onFocusHandler calls onFocus?.(props.id, props.value), but value is destructured from the component props so it is not available on the props rest object. This will always pass undefined as the focused value, which can break RJSF focus handling (e.g. contextual help). Use the destructured value (or effectiveValue) instead of props.value when invoking onFocus.
Suggested change:

```diff
-    onFocus?.(props.id, props.value);
-  }, [onFocus, props.id, props.value]);
+    onFocus?.(props.id, value);
+  }, [onFocus, props.id, value]);
```
```typescript
  // Display the sample JSON as a placeholder when the field is empty so
  // users have a ready template. We purposely do NOT write it into form
  // state on mount — the field may be populated asynchronously after a
  // saved pipeline config loads, and writing our sample into form data
  // would overwrite the real value. We also skip when disabled.
  const hasUserValue = typeof value === 'string' && value.trim().length > 0;
  const effectiveValue = hasUserValue ? value : SAMPLE_MANIFEST_JSON;

  const handleChange = useCallback(
    (next: string) => {
      if (disabled) {
        return;
      }
      onChange(next);
    },
    [disabled, onChange]
  );

  const validation = useMemo(
    () => validateManifestJson(effectiveValue),
    [effectiveValue]
  );

  return (
    <div className="manifest-json-widget">
      <div className="manifest-json-widget-resize-wrapper">
        <SchemaEditor
          className="manifest-json-widget-editor"
          mode={JSON_EDITOR_MODE}
          readOnly={disabled}
          showCopyButton={false}
          value={effectiveValue}
          onChange={handleChange}
          onFocus={onFocusHandler}
        />
      </div>
      <span className="manifest-json-widget-resize-hint">
        {t('message.drag-bottom-right-corner-to-resize')}
      </span>
      {validation.status === 'ok' && (
        <Alert
          showIcon
          className="m-t-xs"
          message={
            <Text>
              {t('label.valid-manifest-entry-count', {
                count: validation.entryCount,
              })}
            </Text>
          }
          type="success"
        />
```
When the actual form value is empty, the widget renders SAMPLE_MANIFEST_JSON as the editor value and validates that sample (validateManifestJson(effectiveValue)), which will show a green “Valid manifest — N entries” banner even though defaultManifest is still unset and ingestion will ignore it. This can mislead users and also makes it impossible to visually distinguish a real saved manifest from the placeholder sample. Consider treating the sample as a true placeholder (not the effective value): validate/display status based on the real value, and only insert the sample into form state when the user explicitly chooses to (or on first edit).
|
|
Storage Connector: Wildcards in Manifest + Config-Side Default Manifest
Context
Today, the OpenMetadata storage connector (S3/GCS/Azure) uses a bucket-level
`openmetadata.json` manifest file to declare which data paths to ingest. Two gaps prompted this work:

- Each entry's `dataPath` is matched literally — users asked for glob patterns so a single entry can cover a dataset that grows over time.
- Entries can only live in the bucket itself; there was no way to declare them centrally in the pipeline config (addressed below by `defaultManifest`).

Design
Two principles drove the design:
- The bucket manifest `openmetadata.json` and the new config-side `defaultManifest` use the same schema, so users learn the format once and reuse the same JSON everywhere.
- Existing behavior stays untouched: literal `dataPath` entries work exactly as before, and all new fields are optional.

Precedence (bucket-by-bucket)
The first source that yields entries for a given bucket is used; the others are skipped for that bucket.
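In code form, the rule is a first-match-wins loop. A minimal sketch, where the source order shown (bucket `openmetadata.json`, then config-side `defaultManifest`, then the legacy global manifest) is my reading of the compatibility notes below, and the loader callables are hypothetical:

```python
from typing import Callable, List, Optional

Loader = Callable[[str], Optional[list]]

def resolve_entries(bucket_name: str, loaders: List[Loader]) -> list:
    # The first source that yields entries for this bucket wins; the
    # remaining sources are skipped for this bucket only.
    for load in loaders:
        entries = load(bucket_name)
        if entries:
            return entries
    return []

# Assumed order: bucket manifest file, config-side defaultManifest,
# legacy global manifest (storageMetadataConfigSource):
# resolve_entries("my-bucket", [read_bucket_manifest, read_default_manifest, read_global_manifest])
```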
Schema changes
Only three files changed under
`openmetadata-spec/`.

`manifestMetadataConfig.json` and `containerMetadataConfig.json`

Existing fields kept (no breaking change). Added optional fields that only take effect when `dataPath` contains glob characters:

| Field | Type | Default | Notes |
|---|---|---|---|
| `autoPartitionDetection` | boolean | `false` | `key=value` partition inference from matched paths |
| `excludePaths` | `string[]` | `null` (falls back to `_delta_log`, `_temporary`, `_spark_metadata`, `.tmp`, `_SUCCESS`) | Path segments to skip during expansion |
| `excludePatterns` | `string[]` | `null` | Glob patterns to skip during expansion |
| `unstructuredData` | boolean | `false` | Treat each matched file as its own unstructured container |
partitionColumnswas slimmed from the full recursiveColumnreference to a lightweight inline shape (name,dataType,dataTypeDisplay?,description?). This unblocks RJSF form rendering and keeps the JSON payload small; the ingestion code converts to fullColumnobjects when buildingContainerDataModel.dataPathsemantics are broadened (no schema change, description updated):"data/events"— same behavior as before.*— one path segment (no/)**— any depth?— single characterstorageServiceMetadataPipeline.jsonAdded one field:
Stored as a JSON string rather than a nested object for two reasons: (1) it renders as a single code editor in the UI without fighting RJSF's nested-array rendering, and (2) the "paste the same JSON you'd put in `openmetadata.json`" mental model is preserved exactly.

The backend parses the string to `ManifestMetadataConfig` at runtime (see `_parsed_default_manifest` below).

Python implementation
All changes are in
`ingestion/src/metadata/ingestion/source/storage/`.

New helper: `expand_entry(bucket_name, entry)`
Added to the `StorageServiceSource` base class. For each manifest entry:

- Literal `dataPath` → passes through unchanged. Zero behavior change vs. today.
- Glob `dataPath` → expanded into one or more concrete `MetadataEntry` objects; candidate keys come from `list_keys(bucket, prefix)` (provider-specific, already implemented for S3).

The expanded entries flow through the existing `_generate_structured_containers` / `_generate_unstructured_containers` pipeline — so container FQNs, sample-file sampling, column extraction, delete detection, and tag ingestion all work unchanged. A condensed sketch of the expansion follows.
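The helper names below are the ones listed under `utils/path_pattern.py` later in this description, but their exact signatures, return shapes, and the `MetadataEntry` construction (reduced to dicts here) are assumptions:

```python
import re

# Helpers from this PR (module path per the file list in this description);
# signatures and return shapes are assumed for illustration.
from metadata.utils.path_pattern import (
    extract_static_prefix,
    group_files_by_table,
    pattern_to_regex,
)

def expand_entry_sketch(source, bucket_name: str, entry: dict) -> list:
    data_path = entry["dataPath"]
    if not set("*?[") & set(data_path):
        return [entry]  # literal path: pass through unchanged
    # Narrow the provider listing with the static prefix, then filter.
    prefix = extract_static_prefix(data_path)
    regex = re.compile(pattern_to_regex(data_path))
    keys = [k for k in source.list_keys(bucket_name, prefix) if regex.match(k)]
    # One concrete entry per detected table root (assumed return shape:
    # an iterable of table-root prefixes).
    return [
        {**entry, "dataPath": root} for root in group_files_by_table(keys)
    ]
```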
New helper: `_resolve_manifest_entries(bucket_name)`

Encapsulates the precedence rules above. Returns a single `List[MetadataEntry]` that `get_containers()` then passes through `expand_entries()`.
s3/metadata.py::get_containers— one function call now covers all three sources.JSON parsing for
defaultManifestSince
defaultManifestarrives as a string from the UI:Invalid JSON is logged and dropped (ingestion continues); bucket manifest files and global manifest remain active.
Files touched (Python)
- `storage_service.py`: added `has_glob`, `expand_entry`, `expand_entries`, `_resolve_manifest_entries`, `_parsed_default_manifest`, `_partition_columns_to_table_columns`. Updated `_manifest_entries_to_metadata_entries_by_container` to pass the new fields through.
- `s3/metadata.py`: `get_containers` now goes through `_resolve_manifest_entries` + `expand_entries`. `list_keys` was already present.
- `utils/path_pattern.py`: `extract_static_prefix`, `pattern_to_regex`, `group_files_by_table`, `detect_hive_partitions`, `infer_structure_format` all reused.

UI
Form widget
A dedicated
`ManifestJsonWidget` renders `defaultManifest`:
{ "entries": [ { "containerName": "my-bucket", "dataPath": "data/*/events/*.parquet", "structureFormat": "parquet", "autoPartitionDetection": true }, { "containerName": "my-bucket", "dataPath": "logs/**/*.json", "structureFormat": "json" } ] }Resizable via native
resize: vertical— drag the bottom-right corner (min 180px, max 80vh).Live validation with specific error messages:
entries)"structuredFormat" — did you mean "structureFormat"?)containerName,dataPath)"autoPartitionDetection" expected true or false, got string)excludePaths/excludePatternsmust be arrays of strings)partitionColumnsitem shape (each needsname+dataType)Sidebar docs
One section added to
Storage/workflows/metadata.md(Default Manifest) with a worked JSON example. All the previous sections added for the inline-manifest exploration were reverted — we're not introducing form-field-level docs.Backwards compatibility
Short version: existing
`openmetadata.json` files and existing pipeline configs work unchanged.

| Scenario | Behavior |
|---|---|
| Existing `openmetadata.json` with literal `dataPath` | Passes through `expand_entry` untouched. |
| Existing `openmetadata.json`, new `autoPartitionDetection: true` added on an existing literal-path entry | … |
| `storageMetadataConfigSource` (legacy global manifest) | Still honored; now deprecated with a warning. |
| `defaultManifest`, no global manifest, and bucket has no file | `defaultManifest` entries are used for that bucket. |
| `defaultManifest` while buckets still have their own files | Bucket files win; `defaultManifest` only kicks in for buckets without a manifest file. |
| Malformed JSON in `defaultManifest` | Logged and dropped; ingestion continues. |
partitionColumnsschema shape changed from the fullColumnreference to a lightweightPartitionColumn. This required touching two existing tests (test_s3_storage.py,test_gcs_storage.py) to constructPartitionColumn(name=..., dataType=...)instead ofColumn(...). Any external consumer parsing a manifest programmatically and populatingpartitionColumnsneeds the same minor update.Migration paths
Migration paths

No forced migration. Two optional patterns users may adopt:
1. Condense a bulky manifest — replace many literal entries with a single glob. Same FQNs, fewer lines.
2. Bootstrap without touching buckets — an admin adds `defaultManifest` in the pipeline config; ingestion starts finding data. Later, bucket owners upload their own `openmetadata.json` to customize. The moment a bucket file appears, the `defaultManifest` is ignored for that bucket.

Verification
- `pytest ingestion/tests/unit/topology/storage/ ingestion/tests/unit/test_path_pattern.py` → 174 pass (19 new in `test_manifest_wildcards.py` covering literal passthrough, glob expansion, partition detection, explicit vs auto partition priority, excludes, unstructured mode, JSON parsing errors, precedence).
- Integration suite (`ingestion/tests/integration/s3/test_s3_manifest_wildcards.py`, runs against a MinIO testcontainer — 16 tests):
  - A glob `dataPath` in a bucket manifest resolves to multiple containers; `excludePaths` filters correctly; `autoPartitionDetection` surfaces Hive columns; `defaultManifest` on pipeline config is used when the bucket has no `openmetadata.json`.
  - A bucket `openmetadata.json` wins over `defaultManifest` when both are present.
  - The legacy flow (`test_s3_storage` fixture) still produces the same containers with no `defaultManifest` set.
  - Malformed JSON in `defaultManifest` → bucket container still created; no crash.
  - A bucket file with `entries: []` → no containers created from the bucket file; falls back to `defaultManifest`.
  - Missing `structureFormat` → skipped with warning, no crash.
  - `unstructuredData: true` → each matched file becomes its own leaf container with no `dataModel`.
  - Expansion failures surface as `status.warning` with the entry's `dataPath` and the exception type/message.
  - A `dataPath` containing regex-special characters (`+`, `[`, parens) is matched verbatim, not as a regex pattern.
- `make generate` + `json2ts-generate-all.sh` + `yarn parse-schema` all produce clean output with no orphan `manifestEntry` files.
- Local docker (`./docker/run_local_docker.sh -m ui -d mysql`) comes up healthy; an S3 storage service can be configured with a `defaultManifest` via the UI and deployed.

What's remaining