
Storage Service Connect, path specification #27408

Merged
ulixius9 merged 33 commits into main from storage_wildcard
Apr 21, 2026
Conversation

@harshach
Collaborator

@harshach harshach commented Apr 15, 2026

Storage Connector: Wildcards in Manifest + Config-Side Default Manifest

Context

Today, the OpenMetadata storage connector (S3/GCS/Azure) uses a bucket-level openmetadata.json manifest file to declare which data paths to ingest. Two gaps prompted this work:

  1. Every new table requires a manifest update. The manifest must list each dataPath literally — users asked for glob patterns so a single entry can cover a dataset that grows over time.
  2. Quick setup for new buckets is painful. Admins want to seed a working configuration centrally; bucket owners want to self-serve by uploading their own manifest later. Both flows should converge on the same format.

Design

Two principles drove the design:

  1. One manifest format, two delivery channels. The bucket's openmetadata.json and the new config-side defaultManifest use the same schema, so users learn the format once and reuse the same JSON everywhere.
  2. Bucket file always wins. Clear precedence so self-service from bucket owners always trumps admin defaults.

Precedence (bucket-by-bucket)

1. Global manifest (storageMetadataConfigSource)  — matched by containerName
2. Bucket's own openmetadata.json                 — wins over defaultManifest
3. Pipeline config defaultManifest (fallback)     — matched by containerName

The first source that yields entries for a given bucket is used; the others are skipped for that bucket.
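The resolution loop itself is simple. A minimal sketch of the idea, with illustrative function and source names (not the connector's actual API):

```python
from typing import Callable, List


def resolve_manifest_entries(
    bucket_name: str,
    sources: List[Callable[[str], List[dict]]],
) -> List[dict]:
    """Return entries from the first source that yields any for this bucket.

    `sources` is ordered by precedence: global manifest, then the bucket's
    own openmetadata.json, then the pipeline defaultManifest.
    """
    for source in sources:
        entries = source(bucket_name)
        if entries:
            return entries
    return []


# Stand-ins for the three channels: no global manifest entry for this
# bucket, but both a bucket file and a defaultManifest entry exist.
global_src = lambda bucket: []
bucket_src = lambda bucket: [{"dataPath": "data/events"}]
default_src = lambda bucket: [{"dataPath": "data/*"}]

entries = resolve_manifest_entries("my-bucket", [global_src, bucket_src, default_src])
# The bucket file wins; the defaultManifest entry is skipped for this bucket.
```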

Schema changes

Only three files changed under openmetadata-spec/.

manifestMetadataConfig.json and containerMetadataConfig.json

Existing fields kept (no breaking change). Added optional fields that only take effect when dataPath contains glob characters:

| Field | Type | Default | Purpose |
| --- | --- | --- | --- |
| autoPartitionDetection | boolean | false | Hive-style key=value partition inference from matched paths |
| excludePaths | string[] | null (falls back to _delta_log, _temporary, _spark_metadata, .tmp, _SUCCESS) | Segments to skip during glob listing |
| excludePatterns | string[] | null | Glob patterns to skip |
| unstructuredData | boolean | false | Each matched file becomes its own container; no schema extraction |

Also partitionColumns was slimmed from the full recursive Column reference to a lightweight inline shape (name, dataType, dataTypeDisplay?, description?). This unblocks RJSF form rendering and keeps the JSON payload small; the ingestion code converts to full Column objects when building ContainerDataModel.
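The conversion back to full columns is mechanical. A sketch using plain dicts in place of the real Pydantic models (the function name and the dataTypeDisplay default are illustrative assumptions, not the shipped behavior):

```python
def partition_columns_to_table_columns(partition_columns: list) -> list:
    """Expand the lightweight inline shape into full column dicts.

    Only name/dataType/dataTypeDisplay/description travel in the manifest;
    anything else a full Column needs is defaulted at ingestion time.
    """
    columns = []
    for pc in partition_columns:
        columns.append(
            {
                "name": pc["name"],
                "dataType": pc["dataType"],
                # Assumed default: lowercase the dataType when no display is given
                "dataTypeDisplay": pc.get("dataTypeDisplay") or pc["dataType"].lower(),
                "description": pc.get("description"),
            }
        )
    return columns
```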

dataPath semantics are broadened (no schema change, description updated):

  • Literal path, e.g. "data/events" — same behavior as before.
  • Glob pattern:
    • * — one path segment (no /)
    • ** — any depth
    • ? — single character
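These semantics translate to a small glob-to-regex compiler. A simplified sketch under the stated semantics, not the connector's pattern_to_regex:

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Compile the manifest glob dialect to an anchored regex.

    `*` matches one path segment, `?` one non-slash character, and `**`
    any depth (including zero intermediate segments).
    """
    placeholder = "\x00"  # protect ** before handling single *
    pattern = pattern.replace("**", placeholder)
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[^/]+")
        elif ch == "?":
            parts.append("[^/]")
        elif ch == placeholder:
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    regex = "".join(parts)
    # Let `a/**/b` also match `a/b` (zero intermediate segments)
    regex = regex.replace("/.*/", "(?:/|/.*/)")
    return re.compile(f"^{regex}$")
```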

storageServiceMetadataPipeline.json

Added one field:

"defaultManifest": {
  "title": "Default Manifest (JSON)",
  "type": "string",
  "uiFieldType": "code",
  "format": "json"
}

Stored as a JSON string rather than a nested object for two reasons: (1) it renders as a single code editor in the UI without fighting RJSF's nested-array rendering, and (2) the "paste the same JSON you'd put in openmetadata.json" mental model is preserved exactly.

Backend parses the string to ManifestMetadataConfig at runtime (see _parsed_default_manifest below).

Python implementation

All changes are in ingestion/src/metadata/ingestion/source/storage/.

New helper: expand_entry(bucket_name, entry)

Added to the StorageServiceSource base class. For each manifest entry:

  • Literal dataPath → passes through unchanged. Zero behavior change vs. today.
  • Glob dataPath → expanded into one or more concrete MetadataEntry objects:
    1. Extract longest static prefix (for optimized listing).
    2. Compile glob → regex.
    3. list_keys(bucket, prefix) (provider-specific, already implemented for S3).
    4. Filter out excluded segments and patterns.
    5. Unstructured mode: yield one entry per file.
    6. Structured mode: group files by logical table root, optionally detect Hive partitions, yield one entry per table.
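Step 1 matters for performance: listing an entire bucket to evaluate a glob would be slow, so expansion lists only under the longest literal prefix. A sketch of the idea (the shipped helper is extract_static_prefix in utils/path_pattern.py; this version is illustrative):

```python
def longest_static_prefix(pattern: str) -> str:
    """Literal prefix before the first glob character, cut at a segment
    boundary so it can be passed directly as a provider listing prefix."""
    for i, ch in enumerate(pattern):
        if ch in "*?[":
            # Back up to the last '/' before the glob character
            return pattern[: pattern.rfind("/", 0, i) + 1]
    return pattern  # no glob characters: the whole path is the prefix
```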

The expanded entries flow through the existing _generate_structured_containers / _generate_unstructured_containers pipeline — so container FQNs, sample-file sampling, column extraction, delete detection, and tag ingestion all work unchanged.
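For step 6, partition detection only fires when every file under a table root agrees on the same key=value directory structure. A simplified sketch of that invariant (the real helper is detect_hive_partitions in utils/path_pattern.py; names here are illustrative):

```python
import re

HIVE_SEGMENT = re.compile(r"^([^/=]+)=([^/=]+)$")


def hive_partition_columns(table_root: str, keys: list) -> list:
    """Return partition column names if all files under table_root share
    one key=value directory structure; otherwise an empty list."""
    structures = set()
    for key in keys:
        if not key.startswith(table_root):
            continue
        segments = key[len(table_root):].strip("/").split("/")[:-1]  # drop filename
        cols = []
        for segment in segments:
            match = HIVE_SEGMENT.match(segment)
            if not match:
                return []  # a non-partition directory: treat as unpartitioned
            cols.append(match.group(1))
        structures.add(tuple(cols))
    # Mixed structures (including flat files next to partitioned ones)
    # disqualify the table.
    return list(structures.pop()) if len(structures) == 1 else []
```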

New helper: _resolve_manifest_entries(bucket_name)

Encapsulates the precedence rules above. Returns a single List[MetadataEntry] that get_containers() then passes through expand_entries().

This replaces the duplicated if/else manifest branches in s3/metadata.py::get_containers — one function call now covers all three sources.

JSON parsing for defaultManifest

Since defaultManifest arrives as a string from the UI:

```python
def _parsed_default_manifest(self) -> Optional[ManifestMetadataConfig]:
    raw = getattr(self.source_config, "defaultManifest", None)
    if not raw:
        return None
    try:
        return ManifestMetadataConfig.model_validate(json.loads(raw))
    except (json.JSONDecodeError, ValueError) as exc:
        logger.warning(f"Could not parse defaultManifest as JSON — ignoring. Error: {exc}")
        return None
```

Invalid JSON is logged and dropped (ingestion continues); bucket manifest files and global manifest remain active.

Files touched (Python)

| File | What changed |
| --- | --- |
| storage_service.py | Added has_glob, expand_entry, expand_entries, _resolve_manifest_entries, _parsed_default_manifest, _partition_columns_to_table_columns. Updated _manifest_entries_to_metadata_entries_by_container to pass the new fields through. |
| s3/metadata.py | Replaced 3 if/else manifest branches with one call to _resolve_manifest_entries + expand_entries. list_keys was already present. |
| utils/path_pattern.py | Unchanged — extract_static_prefix, pattern_to_regex, group_files_by_table, detect_hive_partitions, infer_structure_format all reused. |

UI

Form widget

A dedicated ManifestJsonWidget renders defaultManifest:

  • Pre-populated sample JSON on an empty field (editable, not read-only):

    ```json
    {
      "entries": [
        {
          "containerName": "my-bucket",
          "dataPath": "data/*/events/*.parquet",
          "structureFormat": "parquet",
          "autoPartitionDetection": true
        },
        {
          "containerName": "my-bucket",
          "dataPath": "logs/**/*.json",
          "structureFormat": "json"
        }
      ]
    }
    ```
  • Resizable via native resize: vertical — drag the bottom-right corner (min 180px, max 80vh).

  • Live validation with specific error messages:

    • JSON parse errors with the underlying message
    • Unknown top-level fields (anything other than entries)
    • Unknown entry fields with typo suggestions (e.g. "structuredFormat" — did you mean "structureFormat"?)
    • Missing required fields (containerName, dataPath)
    • Type mismatches (e.g. "autoPartitionDetection" expected true or false, got string)
    • Array field element types (excludePaths / excludePatterns must be arrays of strings)
    • partitionColumns item shape (each needs name + dataType)
    • Green "Valid manifest — N entries." banner when all checks pass
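The widget itself is TypeScript, but the typo-suggestion check is easy to sketch in Python with difflib (the field list below is an illustrative subset of the entry schema, and the cutoff is an assumption):

```python
import difflib

KNOWN_ENTRY_FIELDS = [
    "containerName",
    "dataPath",
    "structureFormat",
    "autoPartitionDetection",
    "excludePaths",
    "excludePatterns",
    "unstructuredData",
    "partitionColumns",
]


def suggest_field(unknown: str):
    """Closest known field name for an unknown key, or None."""
    matches = difflib.get_close_matches(unknown, KNOWN_ENTRY_FIELDS, n=1, cutoff=0.8)
    return matches[0] if matches else None
```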

Sidebar docs

One section added to Storage/workflows/metadata.md (Default Manifest) with a worked JSON example. All the previous sections added for the inline-manifest exploration were reverted — we're not introducing form-field-level docs.

Backwards compatibility

Short version: existing openmetadata.json files and existing pipeline configs work unchanged.

| Scenario | Behavior |
| --- | --- |
| Bucket has openmetadata.json with literal dataPath | Identical to today — literal path passes through expand_entry untouched. |
| Bucket has openmetadata.json, new autoPartitionDetection: true added on an existing literal-path entry | Flag is ignored for literal paths (defined in code + docs). No regression risk. |
| Pipeline configures storageMetadataConfigSource (legacy global manifest) | Still works as before; takes highest precedence. |
| Pipeline has no defaultManifest, no global manifest, and bucket has no file | Same as before — only the bucket itself is cataloged (no nested containers). |
| Pipeline adds defaultManifest while buckets still have their own files | Bucket files win per-bucket; defaultManifest only kicks in for buckets without a manifest file. |
| Literal-path entry produces container name X today; tomorrow switched to a glob that resolves to X | Same FQN — the same entity is updated, not duplicated. Lineage/tags/descriptions preserved. |
| Invalid JSON pasted into defaultManifest | Logged as warning, ignored. Ingestion proceeds with bucket file / global manifest / empty. |

The partitionColumns schema shape changed from the full Column reference to a lightweight PartitionColumn. This required touching two existing tests (test_s3_storage.py, test_gcs_storage.py) to construct PartitionColumn(name=..., dataType=...) instead of Column(...). Any external consumer parsing a manifest programmatically and populating partitionColumns needs the same minor update.

Migration paths

No forced migration. Two optional patterns users may adopt:

  1. Condense a bulky manifest — replace many literal entries with a single glob. Same FQNs, fewer lines.

    ```jsonc
    // before
    [
      {"dataPath": "data/users"},
      {"dataPath": "data/orders"},
      {"dataPath": "data/events"}
    ]
    // after
    [
      {"dataPath": "data/*", "structureFormat": "parquet"}
    ]
    ```
  2. Bootstrap without touching buckets — admin adds defaultManifest in the pipeline config; ingestion starts finding data. Later, bucket owners upload their own openmetadata.json to customize. The moment a bucket file appears, the defaultManifest is ignored for that bucket.

Verification

  • Python unit tests: pytest ingestion/tests/unit/topology/storage/ ingestion/tests/unit/test_path_pattern.py → 174 pass (19 new in test_manifest_wildcards.py covering literal passthrough, glob expansion, partition detection, explicit vs auto partition priority, excludes, unstructured mode, JSON parsing errors, precedence).
  • Python integration tests (ingestion/tests/integration/s3/test_s3_manifest_wildcards.py, runs against MinIO testcontainer — 16 tests):
    • Happy paths: glob dataPath in a bucket manifest resolves to multiple containers; excludePaths filters correctly; autoPartitionDetection surfaces Hive columns; defaultManifest on pipeline config is used when the bucket has no openmetadata.json.
    • Precedence: bucket's openmetadata.json wins over defaultManifest when both are present.
    • Backwards compat: legacy literal-path manifest (the full test_s3_storage fixture) still produces the same containers.
    • Malformed bucket manifest:
      • Invalid JSON → logged with line/column, reported on workflow status, falls back to defaultManifest.
      • Invalid JSON with no defaultManifest → bucket container still created; no crash.
      • Schema violation (missing required field) → logged with Pydantic per-field message, falls back to defaultManifest.
      • entries: [] → no containers created from bucket file; falls back to defaultManifest.
    • Malformed defaultManifest:
      • Invalid JSON → logged and ignored; bucket manifest (if any) still applies.
      • Schema violation → logged and ignored; no crash.
    • File-read edge cases:
      • Corrupt parquet in one table does not block other tables matched by the same glob.
      • Glob matches zero files → bucket container still created, no nested containers.
      • Unknown file extension + no structureFormat → skipped with warning, no crash.
      • unstructuredData: true → each matched file becomes its own leaf container with no dataModel.
    • Re-ingestion / migration guarantees:
      • Same config run twice → same entity ID (idempotency).
      • Literal-path manifest → glob manifest that resolves to the same path → same entity ID (lineage / tags / descriptions preserved during migration).
    • Per-entry resilience:
      • One entry triggers a runtime exception mid-expansion (e.g. simulated S3 AccessDenied) → other entries in the same manifest still process and produce their containers. The failing entry is logged + reported via status.warning with the entry's dataPath and the exception type/message.
      • Literal dataPath containing regex-special characters (+, [, parens) is matched verbatim, not as a regex pattern.
  • Schema regeneration: make generate + json2ts-generate-all.sh + yarn parse-schema all produce clean output with no orphan manifestEntry files.
  • End-to-end: Docker stack (./docker/run_local_docker.sh -m ui -d mysql) comes up healthy; S3 storage service can be configured with a defaultManifest via the UI and deployed.

What's remaining

  • **Real S3 service test**

----
## Summary by Gitar

- **UI localization:**
  - Added new i18n keys for `ManifestJsonWidget` validation errors and type mismatches across all language files in `openmetadata-ui/src/main/resources/ui/src/locale/languages/`.
- **Widget improvements:**
  - Refactored `ManifestJsonWidget` to use structured validation logic, replacing hardcoded strings with localized error codes.


@github-actions
Contributor

✅ TypeScript Types Auto-Updated

The generated TypeScript types have been automatically updated based on JSON schema changes in this PR.

@github-actions github-actions Bot requested a review from a team as a code owner April 15, 2026 20:44
@github-actions
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

Contributor

Copilot AI left a comment


Pull request overview

This PR introduces an inline “manifest entries” configuration for Storage ingestion to enable glob-based auto-discovery (with optional partition detection), and starts deprecating legacy manifest-file approaches.

Changes:

  • Add manifest to the Storage ingestion pipeline schema, plus a new ManifestEntry schema definition.
  • Implement path-pattern utilities (glob → regex, table-root grouping, Hive partition detection) and wire inline manifest auto-discovery into the S3 connector.
  • Add unit and integration tests covering pattern matching, auto-discovery, and S3 method behavior.

Reviewed changes

Copilot reviewed 9 out of 14 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| openmetadata-spec/src/main/resources/json/schema/metadataIngestion/storageServiceMetadataPipeline.json | Deprecates storageMetadataConfigSource in UI/schema and adds new manifest array configuration. |
| openmetadata-spec/src/main/resources/json/schema/metadataIngestion/storage/manifestEntry.json | New schema defining inline manifest entry fields (patterns, excludes, partition config, unstructured toggle). |
| ingestion/src/metadata/utils/path_pattern.py | New helper utilities for glob matching, table grouping, format inference, and partition detection. |
| ingestion/src/metadata/ingestion/source/storage/storage_service.py | Adds shared auto-discovery routine driven by manifest entries + deprecation warning for global manifest. |
| ingestion/src/metadata/ingestion/source/storage/s3/metadata.py | Integrates inline manifest auto-discovery into S3 ingestion and adds an S3 implementation of list_keys. |
| ingestion/tests/unit/test_path_pattern.py | Unit tests for the new path pattern utilities. |
| ingestion/tests/unit/topology/storage/test_s3_auto_discovery.py | Unit tests for manifest-entry-driven auto-discovery behavior. |
| ingestion/tests/unit/topology/storage/test_s3_methods.py | Unit tests for S3Source helper methods (listing, URL building, sampling, metrics, etc.). |
| ingestion/tests/integration/s3/test_s3_path_specs.py | Integration tests validating inline-manifest auto-discovery against MinIO test data and migration behavior. |

Comment thread ingestion/src/metadata/ingestion/source/storage/storage_service.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/storage/s3/metadata.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/storage/storage_service.py Outdated
Comment on lines +123 to +130
```python
# Replace placeholder with recursive match (zero or more path segments)
# Strip adjacent / to avoid double slashes: /DOUBLESTAR/ -> match
escaped = re.sub(
    re.escape("/") + re.escape(placeholder) + re.escape("/"),
    "(?:/|/.+/)",
    escaped,
)
escaped = escaped.replace(re.escape(placeholder), ".*")
```

Copilot AI Apr 15, 2026


pattern_to_regex doesn’t correctly implement the documented ** semantics (“zero or more path segments”) when ** appears at the start of the pattern (e.g. **/*.parquet). The generated regex requires a /, so it won’t match root-level files like file.parquet. Consider special-casing a leading **/ (and possibly trailing /**) so that ** can match an empty segment list.

Suggested change
```python
# --- original ---
# Replace placeholder with recursive match (zero or more path segments)
# Strip adjacent / to avoid double slashes: /DOUBLESTAR/ -> match
escaped = re.sub(
    re.escape("/") + re.escape(placeholder) + re.escape("/"),
    "(?:/|/.+/)",
    escaped,
)
escaped = escaped.replace(re.escape(placeholder), ".*")

# --- suggested ---
placeholder_escaped = re.escape(placeholder)
slash_escaped = re.escape("/")
# Replace placeholder with recursive match (zero or more path segments).
# Special-case leading `**/` and trailing `/**` so `**` can match an
# empty segment list without forcing an extra slash.
escaped = re.sub(
    "^" + placeholder_escaped + slash_escaped,
    "(?:[^/]+/)*",
    escaped,
)
escaped = re.sub(
    slash_escaped + placeholder_escaped + slash_escaped,
    "/(?:[^/]+/)*",
    escaped,
)
escaped = re.sub(
    slash_escaped + placeholder_escaped + "$",
    "(?:/(?:[^/]+(?:/[^/]+)*)?)?",
    escaped,
)
escaped = escaped.replace(placeholder_escaped, ".*")
```

Comment on lines +207 to +229
```python
for key in keys:
    if not key.startswith(root_prefix):
        continue

    relative = key[len(root_prefix) :]
    parts = relative.split("/")

    # Collect partition segments (key=value) between root and file
    current_partitions = []
    for part in parts[:-1]:  # Exclude the filename
        match = HIVE_PARTITION_PATTERN.match(part)
        if match:
            col_name = match.group(1)
            col_value = match.group(2)
            current_partitions.append(col_name)
            partition_values.setdefault(col_name, []).append(col_value)
        elif current_partitions:
            # Non-partition segment after partition segments — inconsistent
            break

    if current_partitions:
        partition_structures.append(current_partitions)
```


Copilot AI Apr 15, 2026


detect_hive_partitions claims to return partition columns only if all files share the same partition structure, but the current implementation ignores non-partitioned files (it only appends partition_structures when current_partitions is non-empty). This can incorrectly mark a table as partitioned when the group contains a mix of partitioned and flat files. Consider tracking whether any file under table_root has zero Hive partitions and treating that as inconsistent (return None).

Comment on lines +51 to +60
```python
    Handles compound extensions like .csv.gz and .parquet.snappy.
    """
    lower_key = key.lower()
    # Check compound extensions first (e.g., .csv.gz, .json.gz)
    if lower_key.endswith(".gz") or lower_key.endswith(".zip"):
        base = lower_key.rsplit(".", 1)[0]
        for ext, fmt in EXTENSION_TO_FORMAT.items():
            if base.endswith(ext):
                return fmt
```


Copilot AI Apr 15, 2026


infer_structure_format docstring says it handles compound extensions like .parquet.snappy, but the implementation only special-cases .gz and .zip. Either extend the logic to handle .snappy (and/or other common parquet suffixes) or update the docstring so it matches actual behavior.

Suggested change
```python
# --- original ---
    Handles compound extensions like .csv.gz and .parquet.snappy.
    """
    lower_key = key.lower()
    # Check compound extensions first (e.g., .csv.gz, .json.gz)
    if lower_key.endswith(".gz") or lower_key.endswith(".zip"):
        base = lower_key.rsplit(".", 1)[0]
        for ext, fmt in EXTENSION_TO_FORMAT.items():
            if base.endswith(ext):
                return fmt

# --- suggested ---
    Handles common compound extensions like .csv.gz, .json.zip, and
    .parquet.snappy.
    """
    lower_key = key.lower()
    compound_suffixes = (".gz", ".zip", ".snappy")
    # Check compound extensions first by removing one trailing
    # compression/archive suffix (e.g., .csv.gz, .json.zip, .parquet.snappy).
    for suffix in compound_suffixes:
        if lower_key.endswith(suffix):
            base = lower_key[: -len(suffix)]
            for ext, fmt in EXTENSION_TO_FORMAT.items():
                if base.endswith(ext):
                    return fmt
            break
```

Comment on lines +101 to +121
```python
    * -> matches any single path segment (no /)
    ** -> matches zero or more path segments (including /)
    ? -> matches any single character (not /)

    Examples:
        "data/*/events/*.parquet" matches "data/warehouse/events/file.parquet"
        "data/**/*.json" matches "data/a/b/c/file.json"
    """
    # Replace ** with a placeholder, then handle * and ?
    # ** must be handled first since * is a substring of **
    placeholder = "\x00DOUBLESTAR\x00"
    pattern = pattern.replace("**", placeholder)

    escaped = ""
    for char in pattern:
        if char == "*":
            escaped += "[^/]+"
        elif char == "?":
            escaped += "[^/]"
        else:
            escaped += re.escape(char)
```

Copilot AI Apr 15, 2026


extract_static_prefix treats [ as a wildcard indicator (see tests for patterns like data/[abc]/*.parquet), but pattern_to_regex currently escapes [/] and does not implement glob character classes. As a result, patterns using bracket wildcards won’t match any keys. Either implement [...] glob support in pattern_to_regex or remove bracket-wildcard handling/tests to avoid implying support.

Suggested change
```python
# --- original ---
    * -> matches any single path segment (no /)
    ** -> matches zero or more path segments (including /)
    ? -> matches any single character (not /)

    Examples:
        "data/*/events/*.parquet" matches "data/warehouse/events/file.parquet"
        "data/**/*.json" matches "data/a/b/c/file.json"
    """
    # Replace ** with a placeholder, then handle * and ?
    # ** must be handled first since * is a substring of **
    placeholder = "\x00DOUBLESTAR\x00"
    pattern = pattern.replace("**", placeholder)

    escaped = ""
    for char in pattern:
        if char == "*":
            escaped += "[^/]+"
        elif char == "?":
            escaped += "[^/]"
        else:
            escaped += re.escape(char)

# --- suggested ---
    * -> matches any single path segment (no /)
    ** -> matches zero or more path segments (including /)
    ? -> matches any single character (not /)
    [abc] -> matches a single character from the set
    [!abc] -> matches a single character not in the set

    Examples:
        "data/*/events/*.parquet" matches "data/warehouse/events/file.parquet"
        "data/**/*.json" matches "data/a/b/c/file.json"
        "data/[ab]/*.parquet" matches "data/a/file.parquet"
    """
    # Replace ** with a placeholder, then handle *, ?, and character classes.
    # ** must be handled first since * is a substring of **
    placeholder = "\x00DOUBLESTAR\x00"
    pattern = pattern.replace("**", placeholder)

    escaped = ""
    i = 0
    while i < len(pattern):
        char = pattern[i]
        if char == "*":
            escaped += "[^/]+"
        elif char == "?":
            escaped += "[^/]"
        elif char == "[":
            end = i + 1
            if end < len(pattern) and pattern[end] in ("!", "]"):
                end += 1
            while end < len(pattern) and pattern[end] != "]":
                end += 1
            if end >= len(pattern):
                escaped += re.escape(char)
            else:
                content = pattern[i + 1 : end]
                if not content:
                    escaped += re.escape(char)
                else:
                    is_negated = content[0] in ("!", "^")
                    class_content = content[1:] if is_negated else content
                    if not class_content:
                        escaped += re.escape(char)
                    else:
                        regex_class = ""
                        for idx, class_char in enumerate(class_content):
                            if class_char == "\\":
                                regex_class += "\\\\"
                            elif class_char in ("[", "]"):
                                regex_class += "\\" + class_char
                            elif class_char == "^" and idx == 0:
                                regex_class += "\\^"
                            else:
                                regex_class += class_char
                        escaped += f"[{'^' if is_negated else ''}{regex_class}]"
                    i = end
        else:
            escaped += re.escape(char)
        i += 1
```




Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 14 changed files in this pull request and generated 5 comments.




Context (generated PartitionColumn interface):

```typescript
  */
  tags?: TagLabel[];
  name: string;
  [property: string]: any;
```


💡 Quality: PartitionColumn allows arbitrary extra properties via index signature

The generated PartitionColumn interface now includes [property: string]: any; because the inline schema definition in containerMetadataConfig.json and manifestMetadataConfig.json omits "additionalProperties": false. This weakens TypeScript type-checking — any typo in a property name will silently compile. The related partitionColumnDetails in table.json correctly sets additionalProperties: false; the same should be applied here.

Suggested fix:

Add `"additionalProperties": false` to the inline PartitionColumn object definition in both `containerMetadataConfig.json` and `manifestMetadataConfig.json`, then regenerate the TypeScript types. This will remove the `[property: string]: any;` index signature and restore strict typing.



Copilot AI left a comment


Pull request overview

Copilot reviewed 40 out of 46 changed files in this pull request and generated 5 comments.

Comment on lines +331 to +332
```typescript
onFocus?.(props.id, props.value);
}, [onFocus, props.id, props.value]);
```

Copilot AI Apr 21, 2026


onFocusHandler calls onFocus?.(props.id, props.value), but props.value is always undefined here because value was destructured out of the props. This means consumers receive the wrong value on focus. Use the destructured value (or effectiveValue if that's what should be surfaced) and update the dependency array accordingly.

Suggested change
```typescript
// --- original ---
onFocus?.(props.id, props.value);
}, [onFocus, props.id, props.value]);
// --- suggested ---
onFocus?.(props.id, value);
}, [onFocus, props.id, value]);
```

Comment on lines +339 to +341
```typescript
const hasUserValue = typeof value === 'string' && value.trim().length > 0;
const effectiveValue = hasUserValue ? value : SAMPLE_MANIFEST_JSON;
```


Copilot AI Apr 21, 2026


The editor uses SAMPLE_MANIFEST_JSON as the displayed value whenever the actual form value is empty. This can mislead users (the UI appears populated/valid, but the submitted defaultManifest remains empty unless the user edits). Consider treating empty as truly empty for validation/status, and provide the sample via an explicit “Insert sample” action or separate template block.

Comment on lines +223 to +227
```typescript
return {
  status: 'error',
  message: `Invalid JSON: ${
    err instanceof Error ? err.message : String(err)
  }`,
```

Copilot AI Apr 21, 2026


validateManifestJson produces hardcoded English error text for JSON parse failures, which is then rendered directly in the UI. Please localize this via i18n (e.g., use the new manifest-invalid-json message key with the error detail interpolated).

Comment on lines +148 to +152
```typescript
case 'string[]':
  if (!Array.isArray(value)) {
    return 'expected an array of strings';
  }
```


Copilot AI Apr 21, 2026


checkType returns hardcoded English validation strings (e.g., "expected an array of strings"), which are user-facing via the widget error <Alert>. Please route these through i18n (use the new manifest-entry-type-error / related keys) instead of returning literal English strings.

Comment on lines +179 to +184
```typescript
const col = columns[i];
if (typeof col !== 'object' || col === null) {
  return `Entry ${
    entryIndex + 1
  }: partitionColumns[${i}] must be an object.`;
}
```
Copilot AI Apr 21, 2026

validatePartitionColumns returns hardcoded English errors (e.g., "partitionColumns[...] must be an object"), which are displayed directly to users. Please localize these messages via i18n (e.g., manifest-partition-column-must-be-object, etc.) with the relevant indexes/fields as interpolation params.
ulixius9 previously approved these changes Apr 21, 2026

Comment thread openmetadata-ui/src/main/resources/ui/src/locale/languages/ar-sa.json Outdated

gitar-bot Bot commented Apr 21, 2026
Code Review 👍 Approved with suggestions 6 resolved / 7 findings

Refactored partition logic resolves path matching errors for glob patterns and corrects legacy docstrings. Remove the index signature in PartitionColumn to restrict arbitrary properties.

💡 Quality: PartitionColumn allows arbitrary extra properties via index signature

📄 openmetadata-ui/src/main/resources/ui/src/generated/metadataIngestion/storage/containerMetadataConfig.ts:102
📄 openmetadata-ui/src/main/resources/ui/src/generated/metadataIngestion/storage/manifestMetadataConfig.ts:103
📄 openmetadata-spec/src/main/resources/json/schema/metadataIngestion/storage/containerMetadataConfig.json:86-100
📄 openmetadata-spec/src/main/resources/json/schema/metadataIngestion/storage/manifestMetadataConfig.json:84-98

The generated PartitionColumn interface now includes [property: string]: any; because the inline schema definition in containerMetadataConfig.json and manifestMetadataConfig.json omits "additionalProperties": false. This weakens TypeScript type-checking — any typo in a property name will silently compile. The related partitionColumnDetails in table.json correctly sets additionalProperties: false; the same should be applied here.

Suggested fix
Add `"additionalProperties": false` to the inline PartitionColumn object definition in both `containerMetadataConfig.json` and `manifestMetadataConfig.json`, then regenerate the TypeScript types. This will remove the `[property: string]: any;` index signature and restore strict typing.
✅ 6 resolved
Bug: Missing space in deprecation warning produces "thepipeline"

📄 ingestion/src/metadata/ingestion/source/storage/s3/metadata.py:911-912
The f-string at lines 911-912 concatenates "...Use 'manifest' in the" with "pipeline config instead..." without a trailing space, so the logged message reads "in thepipeline config instead of" rather than "in the pipeline config instead of".

Bug: pattern_to_regex: * requires 1+ chars, should allow 0+

📄 ingestion/src/metadata/utils/path_pattern.py:117
In pattern_to_regex, a single * is translated to [^/]+ (one or more non-slash characters). Standard glob semantics define * as zero or more characters. While empty path segments are rare in object storage, this also means patterns like prefix*suffix won't match when the wildcard portion is empty (e.g., data*.parquet won't match data.parquet). Use [^/]* instead.

Edge Case: ** at start of pattern fails to match zero-depth paths

📄 ingestion/src/metadata/utils/path_pattern.py:125-130
The ** handling in pattern_to_regex only replaces the /DOUBLESTAR/ form (surrounded by slashes) with (?:/|/.+/). When ** appears at the start of a pattern (e.g., **/*.parquet), the placeholder is not surrounded by / on the left, so it falls through to the generic .* replacement. The resulting regex ^.*/[^/]*\.parquet$ requires at least one /, meaning a root-level file like file.parquet won't match. This pattern is used in tests (test_s3_auto_discovery.py:88) and integration configs, so users are likely to use it. Consider handling DOUBLESTAR/ at the start and /DOUBLESTAR at the end as additional replacement cases.
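The two wildcard fixes above (`*` as zero-or-more within a segment, and a leading `**/` that also matches zero directory levels) can be sketched together. This is a hypothetical reimplementation under stated assumptions — the `DSTOK` placeholder token, the helper name, and the replacement order are illustrative choices, not the shipped `path_pattern.py` code:

```python
import re

# Hypothetical placeholder token used to protect "**" during escaping;
# the real module may use a different mechanism.
DSTOK = "DSTOK"


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    escaped = re.escape(pattern.replace("**", DSTOK))
    # "/**/" matches a lone "/" or any run of whole path segments.
    escaped = escaped.replace(f"/{DSTOK}/", "(?:/|/.+/)")
    # A leading "**/" must also match zero directory levels, so that
    # "**/*.parquet" matches a root-level "file.parquet".
    escaped = escaped.replace(f"{DSTOK}/", "(?:|.+/)")
    # A trailing "/**" matches the bare prefix or anything below it.
    escaped = escaped.replace(f"/{DSTOK}", "(?:|/.*)")
    escaped = escaped.replace(DSTOK, ".*")
    # "*" stays inside one path segment and may match zero characters,
    # so "data*.parquet" matches "data.parquet".
    escaped = escaped.replace(r"\*", "[^/]*")
    return re.compile(f"^{escaped}$")
```

With these replacements, `pattern_to_regex("**/*.parquet")` accepts both `file.parquet` and `a/b/file.parquet`, and `data*.parquet` accepts `data.parquet` — the two cases the findings above call out.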

Bug: has_subdirs fails to detect single-level non-Hive partitions

📄 ingestion/src/metadata/ingestion/source/storage/storage_service.py:494-501
The has_subdirs check is intended to detect non-Hive partitions like 20230412/file.parquet, but the logic requires two or more directory levels below table_root to return True.

Trace for table_root = "data/sales/", key = "data/sales/20230412/file.parquet":

  1. key[len(table_root):] → "20230412/file.parquet"
  2. .lstrip("/") → "20230412/file.parquet"
  3. .rsplit("/", 1)[0] → "20230412" (strips the filename)
  4. "/" in "20230412" → False

So a single partition directory (the most common case for date-prefix partitions) is never detected. Only two-level structures like 20230412/us-east/file.parquet would match.

The .rsplit(KEY_SEPARATOR, 1)[0] strips the filename but also collapses the single subdirectory case. Remove it and instead count separators to check for at least one directory level.
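A minimal sketch of the suggested fix, assuming the relative-path shape from the trace above (the helper name and `KEY_SEPARATOR` constant mirror the finding, not the shipped `storage_service.py`):

```python
KEY_SEPARATOR = "/"


def has_subdirs(table_root: str, key: str) -> bool:
    # Relative path of the object below the table root, e.g.
    # "20230412/file.parquet" for key "data/sales/20230412/file.parquet".
    relative = key[len(table_root):].lstrip(KEY_SEPARATOR)
    # One remaining separator already means the file sits at least one
    # directory level below the table root, so single-level date-prefix
    # partitions like "20230412/file.parquet" are detected correctly.
    return KEY_SEPARATOR in relative
```

Under this version, `has_subdirs("data/sales/", "data/sales/20230412/file.parquet")` is True, while a file directly under the table root yields False.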

Quality: Stale docstring still says "Yields dicts" after dataclass refactor

📄 ingestion/src/metadata/ingestion/source/storage/storage_service.py:430
The docstring at line 430 says "Yields dicts with keys: name, prefix, metadata_entry, sample_key, files" but the return type was changed to Iterable[DiscoveredContainer]. The docstring should reflect the new typed return.

...and 1 more resolved from earlier reviews


Contributor

Copilot AI left a comment
Pull request overview

Copilot reviewed 40 out of 46 changed files in this pull request and generated 2 comments.

Comment on lines +498 to +499
onFocus?.(props.id, props.value);
}, [onFocus, props.id, props.value]);
Copilot AI Apr 21, 2026

onFocusHandler calls onFocus?.(props.id, props.value), but value is destructured from the component props so it is not available on the props rest object. This will always pass undefined as the focused value, which can break RJSF focus handling (e.g. contextual help). Use the destructured value (or effectiveValue) instead of props.value when invoking onFocus.

Suggested change
-      onFocus?.(props.id, props.value);
-    }, [onFocus, props.id, props.value]);
+      onFocus?.(props.id, value);
+    }, [onFocus, props.id, value]);

Comment on lines +501 to +552
// Display the sample JSON as a placeholder when the field is empty so
// users have a ready template. We purposely do NOT write it into form
// state on mount — the field may be populated asynchronously after a
// saved pipeline config loads, and writing our sample into form data
// would overwrite the real value. We also skip when disabled.
const hasUserValue = typeof value === 'string' && value.trim().length > 0;
const effectiveValue = hasUserValue ? value : SAMPLE_MANIFEST_JSON;

const handleChange = useCallback(
(next: string) => {
if (disabled) {
return;
}
onChange(next);
},
[disabled, onChange]
);

const validation = useMemo(
() => validateManifestJson(effectiveValue),
[effectiveValue]
);

return (
<div className="manifest-json-widget">
<div className="manifest-json-widget-resize-wrapper">
<SchemaEditor
className="manifest-json-widget-editor"
mode={JSON_EDITOR_MODE}
readOnly={disabled}
showCopyButton={false}
value={effectiveValue}
onChange={handleChange}
onFocus={onFocusHandler}
/>
</div>
<span className="manifest-json-widget-resize-hint">
{t('message.drag-bottom-right-corner-to-resize')}
</span>
{validation.status === 'ok' && (
<Alert
showIcon
className="m-t-xs"
message={
<Text>
{t('label.valid-manifest-entry-count', {
count: validation.entryCount,
})}
</Text>
}
type="success"
/>
Copilot AI Apr 21, 2026

When the actual form value is empty, the widget renders SAMPLE_MANIFEST_JSON as the editor value and validates that sample (validateManifestJson(effectiveValue)), which will show a green “Valid manifest — N entries” banner even though defaultManifest is still unset and ingestion will ignore it. This can mislead users and also makes it impossible to visually distinguish a real saved manifest from the placeholder sample. Consider treating the sample as a true placeholder (not the effective value): validate/display status based on the real value, and only insert the sample into form state when the user explicitly chooses to (or on first edit).


Labels

backend · safe to test (Add this label to run secure Github workflows on PRs)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants