Added SFT Pre-Processing for Grain Input Pipeline #3437
Merged
copybara-service[bot] merged 10 commits into main on Apr 1, 2026
vlad-karp
reviewed
Mar 19, 2026
Collaborator
vlad-karp
left a comment
It would also be great to test not only with the general MaxText SFT pipeline but with the distillation SFT pipeline as well.
aireenmei
reviewed
Mar 20, 2026
vlad-karp
reviewed
Mar 23, 2026
vlad-karp
approved these changes
Mar 24, 2026
JamesDeng42
reviewed
Mar 25, 2026
JamesDeng42
approved these changes
Mar 25, 2026
Collaborator
JamesDeng42
left a comment
The change is clean; left one comment, and LGTM.
4f4c45c to
c7bd12f
Compare
Description
This PR introduces SFT support to the Grain input pipeline by adding a separate sft_preprocessing_pipeline function. Rather than cluttering the existing pretrain code, it uses simple conditionals inside the train and eval iterators to route to this new SFT logic. I followed the existing Hugging Face SFT implementation and adapted its logic to be compatible with Grain's element-wise datasets.
Tests
I added a unit test that verifies end to end that the Grain SFT pipeline formats the data and produces the expected output. I ran the unit test with this command:
pytest tests/unit/grain_data_processing_test.py::GrainSFTParquetProcessingTest -v
This is the output of the test: Test Passed Output
I also ran the training pipeline in MaxText with SFT enabled on a Grain dataset, using this command:
python3 -m maxtext.trainers.post_train.sft.train_sft src/maxtext/configs/post_train/sft.yml run_name=test_grain_sft dataset_type=grain grain_file_type=parquet grain_train_files=gs://maxtext-dataset/hf/ultrachat_200k/train_sft-*.parquet steps=10 tokenizer_type=huggingface tokenizer_path=HuggingFaceH4/zephyr-7b-beta
This verified that the SFT processing changes worked and that training completed successfully: Logs
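For readers skimming the Description, the routing idea (a conditional in the iterator that picks either the pretrain path or a separate SFT path, with SFT applied one element at a time) can be sketched roughly as below. This is not the code from this PR: it uses plain Python generators instead of Grain, and every name other than sft_preprocessing_pipeline (make_train_iterator, pretrain_preprocessing_pipeline, toy_tokenize, the "prompt"/"completion" keys, the use_sft flag) is hypothetical. Masking prompt tokens out of the loss is a common SFT convention and is assumed here, not confirmed by the PR text.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical stand-ins for illustration only; not taken from the MaxText code.
Example = Dict[str, object]
Tokenizer = Callable[[str], List[str]]


def toy_tokenize(text: str) -> List[str]:
    """Whitespace tokenizer standing in for a real Hugging Face tokenizer."""
    return text.split()


def pretrain_preprocessing_pipeline(examples: Iterable[Example]) -> Iterator[Example]:
    """Placeholder for the existing pretrain path: pass the text field through."""
    for ex in examples:
        yield {"text": ex["text"]}


def sft_preprocessing_pipeline(
    examples: Iterable[Example], tokenize: Tokenizer
) -> Iterator[Example]:
    """Element-wise SFT formatting (assumed convention).

    Each record is tokenized on its own, and the prompt positions get a
    loss mask of 0 so only completion tokens would contribute to the loss.
    """
    for ex in examples:
        prompt_ids = tokenize(str(ex["prompt"]))
        completion_ids = tokenize(str(ex["completion"]))
        yield {
            "inputs": prompt_ids + completion_ids,
            "loss_mask": [0] * len(prompt_ids) + [1] * len(completion_ids),
        }


def make_train_iterator(
    examples: Iterable[Example], use_sft: bool, tokenize: Tokenizer = toy_tokenize
) -> Iterator[Example]:
    """Route to the SFT path with a simple conditional, leaving pretrain untouched."""
    if use_sft:
        return sft_preprocessing_pipeline(examples, tokenize)
    return pretrain_preprocessing_pipeline(examples)


if __name__ == "__main__":
    rows = [{"prompt": "hi there", "completion": "hello"}]
    for item in make_train_iterator(rows, use_sft=True):
        print(item)
```

Keeping the SFT logic in its own function means the pretrain branch is unchanged when the flag is off, which matches the motivation given in the Description.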
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.