Tutorials: Add interleaved getting-started tutorial #1774

Merged
VibhuJawa merged 15 commits into NVIDIA-NeMo:main from VibhuJawa:feat/align_iterleaved_tutorials
Apr 24, 2026

@VibhuJawa
Contributor

This pull request introduces a new set of hands-on tutorials and a robust pipeline script for curating interleaved multimodal data (text + images) using NeMo Curator. Additionally, it removes an older, less flexible MVP pipeline in favor of the new, more configurable approach.

New Interleaved Multimodal Data Curation Tutorials and Pipeline:

  • Added a new section for "Interleaved" (multimodal text + image) data curation to the main tutorials index in tutorials/README.md, linking to dedicated tutorials and tools.
  • Introduced tutorials/interleaved/README.md, providing an overview, available tutorials, quick start instructions, and links to relevant documentation for interleaved data workflows.
  • Added a detailed "Getting Started" guide in tutorials/interleaved/getting-started/README.md, including setup, sample data download, usage examples, and schema/output explanations for both notebook and pipeline script workflows.

Pipeline Implementation and Enhancements:

  • Implemented interleaved_pipeline.py, a full-featured command-line pipeline script for curating interleaved multimodal data. It supports configurable image and text filtering, two output formats (Parquet or WebDataset), schema overrides, and error handling.
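
For readers who have not opened the script, the shape of such a command-line interface can be sketched with the standard library's argparse. This is an illustration only: every flag name and default below is hypothetical and is not taken from interleaved_pipeline.py, whose actual options are documented in the getting-started README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser. All flag names and defaults are hypothetical."""
    parser = argparse.ArgumentParser(
        description="Illustrative CLI for an interleaved (text + image) curation run."
    )
    parser.add_argument("--input-path", required=True, help="Directory of input shards.")
    parser.add_argument("--output-path", required=True, help="Directory for curated output.")
    parser.add_argument(
        "--output-format",
        choices=["parquet", "webdataset"],
        default="parquet",
        help="Container format for the curated output.",
    )
    # Hypothetical filter knobs: one for images, one for text.
    parser.add_argument("--min-image-side", type=int, default=64,
                        help="Drop images whose shorter side is below this many pixels.")
    parser.add_argument("--min-text-chars", type=int, default=20,
                        help="Drop text segments shorter than this many characters.")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--input-path", "shards/", "--output-path", "curated/"])
print(args.output_format, args.min_image_side, args.min_text_chars)
```

Restricting --output-format with choices makes an unsupported format fail at parse time rather than partway through a run.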

Cleanup and Deprecation:

  • Removed the obsolete mint1t_mvp_pipeline.py MVP pipeline, consolidating all interleaved multimodal data curation into the new, more flexible pipeline and documentation.

- Add tutorials/interleaved/getting-started/ with:
  - interleaved_pipeline.py: CLI pipeline (WebDataset -> filter -> Parquet/WebDataset)
  - interleaved_data_quickstart.ipynb: end-to-end interactive notebook
  - README.md: setup, sample data download, usage examples
- Add tutorials/interleaved/README.md: top-level interleaved tutorial index
- Update tutorials/README.md: add Interleaved row
- Remove tutorials/multimodal/interleaved_data_quickstart.ipynb and
  mint1t_mvp_pipeline.py (superseded by the new tutorial)

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
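
The "WebDataset -> filter" half of the flow described above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the code in this PR: it assumes the common WebDataset convention that files in a tar shard sharing the same name up to the first dot belong to one sample, and the filter rule, extensions, and threshold are hypothetical. Writing the surviving samples to Parquet is left out because it needs a third-party library.

```python
import io
import tarfile
from collections import defaultdict


def read_samples(tar_bytes: bytes) -> dict[str, dict[str, bytes]]:
    """Group tar members into samples keyed by the name before the first dot.

    Simplification: assumes directory components contain no dots.
    """
    samples: dict[str, dict[str, bytes]] = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            samples[key][ext] = tar.extractfile(member).read()
    return dict(samples)


def keep_sample(sample: dict[str, bytes], min_text_chars: int = 20) -> bool:
    """Hypothetical filter: require at least one image and enough text."""
    text = sample.get("txt", b"").decode("utf-8", errors="replace")
    has_image = any(ext in sample for ext in ("jpg", "png"))
    return has_image and len(text.strip()) >= min_text_chars


def make_shard(files: dict[str, bytes]) -> bytes:
    """Build an in-memory tar shard so the example needs no files on disk."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


shard = make_shard({
    "a.txt": b"x" * 30, "a.jpg": b"fake-image-bytes",   # image + long text
    "b.txt": b"short", "b.jpg": b"fake-image-bytes",    # text too short
    "c.txt": b"y" * 40,                                 # no image
})
kept = sorted(key for key, s in read_samples(shard).items() if keep_sample(s))
print(kept)
```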
@copy-pr-bot

copy-pr-bot Bot commented Apr 8, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@VibhuJawa
Contributor Author

@claude review

@VibhuJawa VibhuJawa marked this pull request as ready for review April 8, 2026 20:57
@VibhuJawa VibhuJawa requested a review from a team as a code owner April 8, 2026 20:57
@VibhuJawa VibhuJawa requested review from abhinavg4, arhamm1 and meatybobby and removed request for a team April 8, 2026 20:57
@greptile-apps
Contributor

greptile-apps Bot commented Apr 8, 2026

Tip:

Greploop — Automatically fix all review issues by running /greploops in Claude Code. It iterates: fix, push, re-review, repeat until 5/5 confidence.

Use the Greptile plugin for Claude Code to query reviews, search comments, and manage custom context directly from your terminal.

- Merge fragmented sentence into one in tutorials/interleaved/README.md
- Remove duplicate 'Interleaved Data Model' link that resolved to the
  same URL as 'Core Concepts'

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Required by interleaved_data_quickstart.ipynb for visualisation.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Comment thread tutorials/README.md Outdated
Comment thread tutorials/interleaved/README.md Outdated
Comment thread tutorials/interleaved/getting-started/README.md
Comment thread tutorials/interleaved/getting-started/interleaved_data_quickstart.ipynb Outdated
Contributor

@abhinavg4 abhinavg4 left a comment

A bunch of nits; please check the path in particular.

Comment thread tutorials/interleaved/getting-started/interleaved_data_quickstart.ipynb Outdated
Comment thread tutorials/interleaved/getting-started/interleaved_data_quickstart.ipynb Outdated
abhinavg4 and others added 7 commits April 20, 2026 09:37
- Rename "Nemotron-Parse PDF" to "PDF Extraction Pipeline (Nemotron-Parse)"
  in tutorials/README.md, tutorials/interleaved/README.md, and update the
  heading in tutorials/interleaved/nemotron_parse_pdf/README.md.
- Add a "Next Steps" section to getting-started/README.md pointing users
  with PDF input to the PDF extraction pipeline.
- Quickstart notebook:
  - Rewrite intro to lead with "Each MINT-1T PDF is a sequence of ...".
  - Split the constants cell into two: pipeline filter thresholds and
    rendering/display settings.
  - Fix stale path in the Next Steps table
    (tutorials/interleaved/interleaved_pipeline.py ->
     tutorials/interleaved/getting-started/interleaved_pipeline.py).

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
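
The split of the constants cell described in this commit can be pictured as two separate groups of settings. The sketch below is hypothetical: the class names, field names, and values are invented for illustration and are not the notebook's actual constants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterThresholds:
    """Values that change which samples survive curation (hypothetical names)."""
    min_image_side_px: int = 64
    min_text_chars: int = 20


@dataclass(frozen=True)
class DisplaySettings:
    """Values that only affect how results are rendered (hypothetical names)."""
    max_samples_shown: int = 5
    thumbnail_width_px: int = 256


# Two separate objects, mirroring the two notebook cells.
FILTERS = FilterThresholds()
DISPLAY = DisplaySettings()
print(FILTERS)
print(DISPLAY)
```

Keeping the two groups apart means a reader can change how results are displayed without touching anything that alters the curated output.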
3 participants