feat: add online data mixing plugin #612

dushyantbehl merged 41 commits into foundation-model-stack:main
Conversation
Thanks for making a pull request! 😃
| def _process_dataset_configs(
|     self, dataset_configs: List[DataSetConfig]
|     self, dataset_configs: List[DataSetConfig], odm_config=None
| ) -> Union[Dataset, IterableDataset]:
Do you want to update the annotation to include DatasetDict or dict?
train_datasets_dict will be a dict

@romitjain May I know which annotation you are talking about? The return type of _process_dataset_configs? In our case it will be IterableDataset, isn't it?

For _process_dataset_configs, if odm_config is not None, then we return from _process_datasets_for_odm, which returns tuple[dict, dict].

Fixed @romitjain, the types have turned out to be more complex than I thought.
NOTE: the format/lint error should go away once the fms_acceleration_odm package is available.
Force-pushed f86e1e6 to 4729c6f
Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
dushyantbehl left a comment

Requesting minor revision of the documentation and restructuring of the code.
| Note: If a user specifies data sampling, they can expect the datasets to be mixed and individual samples in the dataset not to be broken, unless the max_seq_len argument is smaller than the length of individual samples in the dataset |
| ### Online Data Mixing |

Can we make ODM a separate document so it's easy for users to find?

Done, I have made it into a new doc and changed the references accordingly.
| - name: dataset_1
|   split:
|     train: 0.8
|     validation: 0.2

Can you add a comment after the ratio noting that this will be overloaded as a reward dataset too in ODM?

If this is not explained in the documentation, we should do that too.

Comments are added and the same is explained in the docs as well.
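A sketch of how the requested comment might look in the data config (field names follow the quoted snippet; the exact schema is the repo's):

```yaml
datasets:
  - name: dataset_1
    split:
      train: 0.8
      # In ODM, the validation split is overloaded as the reward dataset
      # (used for reward computation when reward_type is validation_loss).
      validation: 0.2
```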
| @pytest.mark.skipif(
|     not is_fms_accelerate_available(plugins="odm"),

Are we enabling installing the odm plugin by default in the tox.ini file? If not, I am assuming these tests are run separately.
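If the plugin is not installed by default, the skip guard in the quoted snippet keeps those tests from running. A minimal stand-in for the availability check (the real is_fms_accelerate_available helper lives in the repo and is richer; the module name pattern here is an assumption):

```python
import importlib.util


def is_plugin_available(plugin: str) -> bool:
    # Probe for a plugin package such as fms_acceleration_odm without
    # importing it; find_spec returns None when the package is absent.
    return importlib.util.find_spec(f"fms_acceleration_{plugin}") is not None
```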
| train_datasets_dict[d.name] = raw[train_split]
| if eval_split in raw:
|     eval_datasets_dict[d.name] = raw[eval_split]
| return train_datasets_dict, eval_datasets_dict

Are we just looking to filter train and test splits from the dataset here? Do you need the returned values to be a raw dict instead of a container?

Are we just looking to filter train and test splits from the dataset here?

Yes.

Do you need the returned values to be a raw dict instead of a container?

A dict would be good over a DatasetDict, but it should work both ways.
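The split-filtering discussed above can be sketched as follows; plain dicts stand in for datasets.DatasetDict (which also supports `in` and item access), and all names are illustrative:

```python
def collect_split_dicts(raw_by_name, train_split="train", eval_split="validation"):
    # Collect the train and eval splits of each named dataset into two dicts
    # keyed by dataset name, skipping splits that are absent.
    train_datasets_dict, eval_datasets_dict = {}, {}
    for name, raw in raw_by_name.items():
        if train_split in raw:
            train_datasets_dict[name] = raw[train_split]
        if eval_split in raw:
            eval_datasets_dict[name] = raw[eval_split]
    return train_datasets_dict, eval_datasets_dict
```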
| if data_args.do_dataprocessing_only:
|     if odm_config:
|         raise ValueError(
|             "data processing with online data mixing is not currently supported"

Any reason we do not support data-processing-only mode for an ODM config? We can just dump the processed datasets and return, right? ODM then has to be applied while training too, but is there a fundamental problem in compatibility?

True, I agree; I have removed this restriction.
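With the restriction lifted, the processing-only path can presumably dump each processed dataset and return early regardless of the ODM config. A sketch with illustrative names:

```python
def maybe_dump_and_exit(do_dataprocessing_only, datasets, dump):
    # In processing-only mode, dump every processed dataset and signal the
    # caller to return before training starts; otherwise continue as usual.
    if do_dataprocessing_only:
        for name, ds in datasets.items():
            dump(name, ds)
        return True
    return False
```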
| is_tokenized_dataset = is_pretokenized_dataset(train_dataset or eval_dataset)
| data_collator = None
| if odm_config:

Is it possible to wrap all of the ODM stuff in a separate function and call it once inside?

Done, wrapped them here in the function process_dataargs_odm.
| if odm_config:
|     # Third Party
|     # pylint: disable=import-outside-toplevel
|     if not is_fms_accelerate_available(plugins="odm"):

Shouldn't we do this check at the top, as the first thing?

Can you mention where exactly?
| reward_type=odm_config.odm.reward_type,
| )
| dataset_kwargs["skip_prepare_dataset"] = True
| train_args.accelerator_config = {"split_batches": True}

Is split_batches needed for ODM?

Yes, by design it's needed.
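The wiring in the quoted diff can be illustrated with plain dicts standing in for the real trainer arguments; the rationale comments are my reading of the discussion, not the plugin's documentation:

```python
# The ODM dataloader prepares its own batches, so the trainer's built-in
# dataset preparation is skipped.
dataset_kwargs = {"skip_prepare_dataset": True}

# split_batches makes the accelerator split each globally assembled batch
# across ranks, so the mixture chosen by the ODM sampler is preserved.
accelerator_config = {"split_batches": True}
```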
| )
| odm_config = None
| if data_args.data_config_path:

Can we please do this inside the process_dataargs function? We want to keep sft_trainer clean of any data-related functionality.

I have thought about this, but we need to prepare the odm_config variable for the fms-acceleration plugin preparation step as well, which happens before the process_dataargs function. So it is hard to keep this piece of code within process_dataargs, which happens at a later point in time.

@kmehant can process_dataargs not initialize the ODM framework and then return anything if needed? As far as I see, the only thing we return is a dataset of type ODM, so why can't we do the ODM framework initialization inside process_dataargs and return train and eval datasets as usual, just ODM this time?
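The ordering constraint argued above can be sketched as a call sequence; every name here is a stand-in, and the point is only that odm_config must exist before the acceleration framework is prepared, which itself runs before process_dataargs:

```python
calls = []


def load_odm_config(path):
    # Step 1: read the ODM section from the data config (stub).
    calls.append("load_odm_config")
    return {"path": path}


def prepare_acceleration_framework(odm_config):
    # Step 2: fms-acceleration plugin preparation needs odm_config already.
    calls.append("prepare_framework")


def process_dataargs(odm_config=None):
    # Step 3: dataset processing happens only after framework preparation.
    calls.append("process_dataargs")


odm_config = load_odm_config("data_config.yaml")
prepare_acceleration_framework(odm_config)
process_dataargs(odm_config=odm_config)
```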
| is_padding_free=is_padding_free,
| processor=processor,
| is_multipack=is_multipack,
| odm_config=odm_config,

You can keep the data config load inside process_dataargs and initialize ODM inside process_dataargs, possibly after this.
Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
dushyantbehl left a comment

Thanks for fixing the earlier comments; requesting some more clarifications.
| - name: dataset_2
|   split:
|     train: 0.6
|     validation: 0.2 # validation set is also used in reward computation when reward_type is validation_loss.

Suggested change:

| validation: 0.2 # validation set is also used in ODM reward computation when reward_type is validation_loss.

| - name: dataset_1
|   split:
|     train: 0.8
|     validation: 0.2 # validation set is also used in reward computation when reward_type is validation_loss.

Suggested change:

| validation: 0.2 # validation set is also used in ODM reward computation when reward_type is validation_loss.

There were a couple more places, so I modified all the files.
| - `--multipack`: technique for *multi-gpu training* to balance out the number of tokens processed on each device, to minimize waiting time.
| - [fast_moe_config](./tuning/config/acceleration_configs/fast_moe.py) (experimental):
|   - `--fast_moe`: trains MoE models in parallel with [Scatter MoE kernels](https://github.com/foundation-model-stack/fms-acceleration/tree/main/plugins/accelerated-moe#fms-acceleration-for-mixture-of-experts), increasing throughput and decreasing memory usage.
| - [odm_config](./tuning/config/acceleration_configs/odm.py) (experimental): See [advanced data preprocessing](./advanced-data-preprocessing.md#online-data-mixing) for usage with data_config. This plugin allows dynamically mixing datasets online during training, adapting to training signals.

Do you want to link the PyTorch poster here?
| # distributed under the License is distributed on an "AS IS" BASIS,
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
| # See the License for the specific language governing permissions and
| # See the License for the specificm language governing permissions and

Can you please fix this typo?

`specificm` is the typo, isn't it? What do you want me to fix?
| )
| # pylint: disable=import-outside-toplevel
| if not is_fms_accelerate_available(plugins="odm"):

Can we move this to line 508, the top of this function?
| processed_datasets.append((d, raw_datasets))
| if odm_config:
|     return self._process_datasets_for_odm(processed_datasets)

I missed this last time, but are sampling and concatenation of datasets not compatible with ODM? Can we let the normal data processing perform its function, i.e. give a processed dataset, and then wrap ODM on top? The way I see it, you don't need to modify code in data_processors.py: the function _process_datasets_for_odm gets called after the datasets are returned by _process_dataset_configs, and then you apply ODM processing by calling this function inside setup_dataprocessor.py.

Sampling and concatenation should not be done with ODM, since that would be handled by the ODM dataloader.
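The branch debated above can be sketched as follows (names are illustrative, and lists stand in for datasets): with an ODM config the per-dataset splits are returned as-is so the ODM dataloader can mix them online, while the default path statically concatenates them (sampling omitted for brevity):

```python
def finalize_datasets(processed_datasets, odm_config=None):
    # processed_datasets: list of (name, dataset) pairs from processing.
    if odm_config:
        # ODM path: keep datasets separate; mixing happens in the dataloader.
        return dict(processed_datasets)
    # Default path: static concatenation of all processed datasets.
    merged = []
    for _, ds in processed_datasets:
        merged.extend(ds)
    return merged
```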
| raise RuntimeError(f"Failed to dump dataset due to error {e}") from e
| def process_dataargs_odm(

nit: could we call this setup_train_dataset_for_odm, because you are not touching eval_dataset?
| processor: AutoProcessor = None,
| odm_config: ODMConfig = None,
| train_dataset: Dict = None,
| eval_dataset: Dict = None,

Can we rename this to reward_dataset and pass eval_dataset here, after adding a comment in the code?
| )
| if data_args.data_config_path:
|     train_dataset, eval_dataset, dataset_text_field = process_dataconfig_file(

We can potentially return odm_config from this function, or you can return the original data_config from this function and convert it to odm_config inside it here. We can initialize the acceleration framework ODM here too, I think; this would save us loading the data config twice.
Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
@dushyantbehl All the above comments are addressed and pushed. Please take a pass!



Details provided in foundation-model-stack/fms-acceleration#152