Tutorials: Add interleaved getting-started tutorial #1774
Merged
VibhuJawa merged 15 commits into NVIDIA-NeMo:main on Apr 24, 2026
- Add tutorials/interleaved/getting-started/ with:
  - interleaved_pipeline.py: CLI pipeline (WebDataset -> filter -> Parquet/WebDataset)
  - interleaved_data_quickstart.ipynb: end-to-end interactive notebook
  - README.md: setup, sample data download, usage examples
- Add tutorials/interleaved/README.md: top-level interleaved tutorial index
- Update tutorials/README.md: add Interleaved row
- Remove tutorials/multimodal/interleaved_data_quickstart.ipynb and
  mint1t_mvp_pipeline.py (superseded by the new tutorial)

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
@claude review
- Merge fragmented sentence into one in tutorials/interleaved/README.md
- Remove duplicate 'Interleaved Data Model' link that resolved to the
  same URL as 'Core Concepts'

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Required by interleaved_data_quickstart.ipynb for visualisation.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
arhamm1 reviewed Apr 9, 2026
abhinavg4 reviewed Apr 20, 2026
abhinavg4 approved these changes Apr 20, 2026
abhinavg4 (Contributor) left a comment:
A bunch of NITs; please see the path especially.
- Rename "Nemotron-Parse PDF" to "PDF Extraction Pipeline (Nemotron-Parse)"
  in tutorials/README.md, tutorials/interleaved/README.md, and update the
  heading in tutorials/interleaved/nemotron_parse_pdf/README.md.
- Add a "Next Steps" section to getting-started/README.md pointing users
  with PDF input to the PDF extraction pipeline.
- Quickstart notebook:
  - Rewrite intro to lead with "Each MINT-1T PDF is a sequence of ...".
  - Split the constants cell into two: pipeline filter thresholds and
    rendering/display settings.
  - Fix stale path in the Next Steps table
    (tutorials/interleaved/interleaved_pipeline.py ->
    tutorials/interleaved/getting-started/interleaved_pipeline.py).

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
This pull request introduces a new set of hands-on tutorials and a robust pipeline script for curating interleaved multimodal data (text + images) using NeMo Curator. Additionally, it removes an older, less flexible MVP pipeline in favor of the new, more configurable approach.
New Interleaved Multimodal Data Curation Tutorials and Pipeline:
- Updated tutorials/README.md, linking to dedicated tutorials and tools.
- Added tutorials/interleaved/README.md, providing an overview, available tutorials, quick start instructions, and links to relevant documentation for interleaved data workflows.
- Added tutorials/interleaved/getting-started/README.md, including setup, sample data download, usage examples, and schema/output explanations for both notebook and pipeline script workflows.

Pipeline Implementation and Enhancements:
- Added interleaved_pipeline.py, a fully-featured, command-line pipeline script for curating interleaved multimodal data. It supports configurable image and text filtering, flexible output formats (Parquet/WebDataset), schema overrides, and robust error handling.

Cleanup and Deprecation:
- Removed the mint1t_mvp_pipeline.py MVP pipeline, consolidating all interleaved multimodal data curation into the new, more flexible pipeline and documentation.
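To make the filtering step described above concrete, here is a minimal sketch in plain Python. It does not use NeMo Curator and is not the code in interleaved_pipeline.py: the `Item` schema, the `filter_sample` helper, the threshold semantics, and every CLI flag name are assumptions chosen for illustration. It only shows the general shape of the idea, namely per-modality thresholds applied to an ordered sequence of text and image items, with the output format selected on the command line.

```python
import argparse
from dataclasses import dataclass


@dataclass
class Item:
    """One element of an interleaved sample (hypothetical schema)."""
    modality: str      # "text" or "image"
    text: str = ""     # populated for text items
    width: int = 0     # populated for image items
    height: int = 0    # populated for image items


def filter_sample(items, min_image_side, min_text_chars):
    """Keep items that pass their modality's threshold, preserving order."""
    kept = []
    for item in items:
        if item.modality == "image":
            # Drop images whose shorter side is below the threshold.
            if min(item.width, item.height) >= min_image_side:
                kept.append(item)
        elif item.modality == "text":
            # Drop text items that are too short after trimming whitespace.
            if len(item.text.strip()) >= min_text_chars:
                kept.append(item)
    return kept


def build_parser():
    """Hypothetical CLI; these flag names are illustrative, not the script's."""
    parser = argparse.ArgumentParser(description="Interleaved filter sketch")
    parser.add_argument("--min-image-side", type=int, default=64)
    parser.add_argument("--min-text-chars", type=int, default=10)
    parser.add_argument(
        "--output-format", choices=["parquet", "webdataset"], default="parquet"
    )
    return parser


# Parse an explicit argument list so the sketch runs the same way everywhere.
args = build_parser().parse_args(["--output-format", "webdataset"])

sample = [
    Item("text", text="A caption long enough"),
    Item("image", width=32, height=200),
    Item("image", width=128, height=128),
    Item("text", text="hi"),
]
kept = filter_sample(sample, args.min_image_side, args.min_text_chars)
print([item.modality for item in kept], "->", args.output_format)
```

Preserving item order matters for interleaved data, because the position of an image relative to the surrounding text is part of the sample. For the actual flags and schema, see tutorials/interleaved/getting-started/README.md.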