Report source lineage from HadoopFormatIO#37265
Report source lineage from HadoopFormatIO#37265shnapz wants to merge 7 commits intoapache:masterfrom
Conversation
Summary of ChangesHello @shnapz, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances Apache Beam's data lineage capabilities by extending source lineage reporting to Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request adds source lineage reporting to HadoopFormatIO for file-based sources, which is a valuable addition. The implementation correctly identifies file-based splits and extracts their paths for lineage tracking.
The refactoring of the common lineage reporting logic into a shared utility method FileSystems.reportSourceLineage(List<ResourceId>) is a great improvement, reducing code duplication and centralizing the logic.
I have a couple of suggestions to improve maintainability and readability:
- In
FileSystems.java, I've suggested introducing a constant for the magic number100to make the code clearer. - In
HadoopFormatIO.java, I've proposed refactoring the newreportSourceLineagemethod to use Java Streams for a more concise and modern implementation.
Overall, this is a well-structured and useful contribution.
1938428 to
23e762c
Compare
|
Assigning reviewers: R: @ahmedabu98 for label java. Note: If you would like to opt out of this review, comment Available commands:
The PR bot will only process comments in the main thread (not review comments). |
|
Reminder, please take a look at this pr: @ahmedabu98 |
|
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @Abacn for label java. Available commands:
|
|
R: @ahmedabu98 Could you please take a look? |
|
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment |
|
This change alone won't be sufficient to make lineage report effective, as HadoopFileSystem doesn't override the no-op reportLineage method in base FileSystem: Also it would be nice to unit test the change to make sure its effectiveness and does not break existing use cases |
32e8683 to
fc948dd
Compare
|
@Abacn I have added lineage reporting to HadoopFileSystem as well |
|
btw CI checks are OK, the only thing that failed is: |
| FileSystems.reportSourceLineage(uniqueDir); | ||
| } | ||
| } | ||
| List<ResourceId> resourceIds = |
There was a problem hiding this comment.
Just make this public and add "internal use only...." comments. Not necessarily to move code around
There was a problem hiding this comment.
Is the concern whether to commit to a new public API? I took a second look and wanted to propose an alternative before reverting the move:
I've marked the method @internal with a "for internal use only by Beam-provided file-based connectors; not a stable public API" note in the javadoc. FileSystems already uses this same pattern for setDefaultPipelineOptions and registerFileSystemsOnce, so there's precedent for @internal public static helpers living here
The reason I prefer FileSystems over making it public on FileBasedSource: HadoopFormatIO isn't a FileBasedSource (it's a generic InputFormat wrapper that only sometimes produces file splits) so reaching into FileBasedSource as a utility namespace feels off. FileSystems is already the shared home for lineage reporting, it has three other reportSourceLineage overloads and this logic is applicable to any file-producing connector, not just subclasses of FileBasedSource.
Iam happy to revert to the FileBasedSource approach if you still prefer it. Just wanted tofloat this as an option first
There was a problem hiding this comment.
A generic guide is specific component change ( HadoopIO/HadoopFormat) should have minimum touch on core (FileSystems) code.
It sounds weird because in current approach HadoopFormat lineage report uses internal method belongs to FileBasedSource. Moving it to FileSystems public API further diverge the encapsulations.
Another approach is to implement HadoopFormat's reportLineage with existing FileSystems.reportSourceLineage or it's own methods
| } | ||
|
|
||
| @Override | ||
| protected void reportLineage(HadoopResourceId resourceId, Lineage lineage) { |
There was a problem hiding this comment.
This duplicates base classes' default impl. We only need to override reportLineage(HadoopResourceId resourceId, Lineage lineage, LineageLevel level)
fc948dd to
a9e88ef
Compare
|
@Abacn would you mind taking another look? |
|
waiting on author |
|
latest comment was in #37265 (comment) |
This PR adds source lineage reporting to
HadoopFormatIOfor file-based sources.Background
While
FileBasedSourcealready reports source lineage metrics,HadoopFormatIOdid not.HadoopFormatIOis a more generic IO transform that supports various data sources including databases, NoSQL stores, and file systems through Hadoop'sInputFormat/OutputFormatinterfaces.Changes
InputSplitis an instance ofFileSplit)FileSystems.reportSourceLineage(List<ResourceId>)to reduce code duplicationImplementation Details
The implementation reports lineage only for file-based sources detected via
FileSplitinstances. Non-file sources (e.g., database reads viaDBInputFormat) are intentionally excluded as they don't represent file resources.Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>instead.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.