Skip to content

Fixes #24798: Prevent avro parser from leaking file content in warning logs#27498

Open
rajattiwari-stack wants to merge 1 commit intoopen-metadata:mainfrom
rajattiwari-stack:fix/avro-parser-sensitive-data-logging-24798
Open

Fixes #24798: Prevent avro parser from leaking file content in warning logs#27498
rajattiwari-stack wants to merge 1 commit intoopen-metadata:mainfrom
rajattiwari-stack:fix/avro-parser-sensitive-data-logging-24798

Conversation

@rajattiwari-stack
Copy link
Copy Markdown

@rajattiwari-stack rajattiwari-stack commented Apr 17, 2026

Describe your changes:

Fixes #24798

When the avro schema parser fails to parse a file during S3 ingestion, the exception message includes the actual file content. This was being logged at WARNING level via logger.warning(f"Unable to parse the avro schema: {exc}"), which can leak sensitive data into log storage.

Changed warning logs to only include the exception type name instead of the full exception message. Full exception details remain available at DEBUG level through the existing traceback.format_exc() call.

Before:

WARNING - Unable to parse the avro schema: No "type" property: {"secret_key": "super_secret_value_12345"}

After:

WARNING - Unable to parse the avro schema: SchemaParseException

Added a unit test to verify warning-level logs do not contain any of the sensitive file content.

Type of change:

  • Bug fix

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have added a test that covers the exact scenario we are fixing.

@rajattiwari-stack rajattiwari-stack requested a review from a team as a code owner April 17, 2026 21:10
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment thread ingestion/tests/integration/s3/test_avro_schema_logging.py Outdated
@rajattiwari-stack rajattiwari-stack force-pushed the fix/avro-parser-sensitive-data-logging-24798 branch from 6b5d865 to 3fbe124 Compare April 17, 2026 21:18
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@rajattiwari-stack
Copy link
Copy Markdown
Author

Ready for review. Addressed the gitar-bot feedback — moved the test to unit tests.

except Exception as exc: # pylint: disable=broad-except
logger.debug(traceback.format_exc())
logger.warning(f"Unable to parse the avro schema: {exc}")
logger.warning(f"Unable to parse the avro schema: {type(exc).__name__}")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rajattiwari-stack instead of doing this can you do a special exception handling for type SchemaParseException

because even if you hide the message here, in the debug log the same content will be exposed

@rajattiwari-stack rajattiwari-stack force-pushed the fix/avro-parser-sensitive-data-logging-24798 branch from 3fbe124 to 80635c7 Compare April 22, 2026 18:40
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment on lines +264 to +265
except avro.errors.SchemaParseException:
logger.warning("Unable to parse the avro schema: SchemaParseException")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Bug: SchemaParseException handler drops debug traceback

The new except avro.errors.SchemaParseException blocks don't call logger.debug(traceback.format_exc()), unlike the generic except Exception blocks that follow them. This means when the most common parse failure occurs (SchemaParseException), no traceback is logged even at DEBUG level, making it harder to diagnose parsing issues.

The PR description states "Full exception details remain available at DEBUG level through the existing traceback.format_exc() call", but this is only true for non-SchemaParseException exceptions.

Suggested fix:

except avro.errors.SchemaParseException:
    logger.debug(traceback.format_exc())
    logger.warning("Unable to parse the avro schema: SchemaParseException")

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 22, 2026

Code Review ⚠️ Changes requested 1 resolved / 2 findings

Prevents file content leakage in avro parser warning logs and removes the unnecessary MinIO dependency in integration tests. The current SchemaParseException handler still drops critical debug tracebacks.

⚠️ Bug: SchemaParseException handler drops debug traceback

📄 ingestion/src/metadata/parsers/avro_parser.py:264-265 📄 ingestion/src/metadata/parsers/avro_parser.py:309-310

The new except avro.errors.SchemaParseException blocks don't call logger.debug(traceback.format_exc()), unlike the generic except Exception blocks that follow them. This means when the most common parse failure occurs (SchemaParseException), no traceback is logged even at DEBUG level, making it harder to diagnose parsing issues.

The PR description states "Full exception details remain available at DEBUG level through the existing traceback.format_exc() call", but this is only true for non-SchemaParseException exceptions.

Suggested fix
except avro.errors.SchemaParseException:
    logger.debug(traceback.format_exc())
    logger.warning("Unable to parse the avro schema: SchemaParseException")
✅ 1 resolved
Quality: Integration test unnecessarily requires MinIO for a pure function

📄 ingestion/tests/integration/s3/test_avro_schema_logging.py:22-36
The test test_avro_parse_failure_does_not_leak_file_content requires the minio and bucket_name fixtures (which spin up a MinIO container), but parse_avro_schema() is a pure function that takes a string argument. The MinIO upload/download round-trip on lines 26-36 is completely unnecessary — the same string (INVALID_AVRO_CONTENT) is passed in and read back out unchanged.

This adds:

  • A heavy Docker container dependency for a trivial unit test
  • Slower CI feedback (~seconds to start MinIO vs milliseconds for a unit test)
  • A flaky surface (network, Docker availability)

This should be a simple unit test under tests/unit/ instead.

🤖 Prompt for agents
Code Review: Prevents file content leakage in avro parser warning logs and removes the unnecessary MinIO dependency in integration tests. The current SchemaParseException handler still drops critical debug tracebacks.

1. ⚠️ Bug: SchemaParseException handler drops debug traceback
   Files: ingestion/src/metadata/parsers/avro_parser.py:264-265, ingestion/src/metadata/parsers/avro_parser.py:309-310

   The new `except avro.errors.SchemaParseException` blocks don't call `logger.debug(traceback.format_exc())`, unlike the generic `except Exception` blocks that follow them. This means when the most common parse failure occurs (`SchemaParseException`), no traceback is logged even at DEBUG level, making it harder to diagnose parsing issues.
   
   The PR description states "Full exception details remain available at DEBUG level through the existing `traceback.format_exc()` call", but this is only true for non-`SchemaParseException` exceptions.

   Suggested fix:
   except avro.errors.SchemaParseException:
       logger.debug(traceback.format_exc())
       logger.warning("Unable to parse the avro schema: SchemaParseException")

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@rajattiwari-stack
Copy link
Copy Markdown
Author

@ulixius9 Updated — added specific except avro.errors.SchemaParseException handling as suggested. Traceback is intentionally omitted for this exception since the avro library embeds file content in the exception message, which would re-introduce the leak at DEBUG level. Other exceptions still get debug tracebacks. Test updated to verify no leak at any log level.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Avro parser logs actual file content when it fails to parse the file in S3 ingestion

2 participants