fix: refactor

kmehant · kmehant · commit e206f2344b2f · 2025-10-07T15:54:15.000+05:30
Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
diff --git a/README.md b/README.md
@@ -8,6 +8,7 @@
   - [Advanced Data Processing](./docs/advanced-data-preprocessing.md#data-config)
   - [Guidelines on supported data formats](./docs/advanced-data-preprocessing.md#use-cases-supported-via-command-line-argument-training_data_path)
   - [Offline data processing](#offline-data-preprocessing)
+  - [Online data mixing](./docs/online-data-mixing.md)
 - [Additional Frameworks](#additional-frameworks)
   - [Inference](#inference)
   - [Validation](#validation)
diff --git a/docs/advanced-data-preprocessing.md b/docs/advanced-data-preprocessing.md
@@ -4,7 +4,6 @@ Our library also supports a powerful data processing backend which can be used b
 1. Creating custom data processing pipeline for the datasets.  
 1. Combining multiple datasets into one, even if they have different formats.  
 1. Mixing datasets as required and sampling each dataset with different weights.
-1. Dynamically mixing datasets online based on training signals through fms_acceleration_odm plugin.
 
 These things are supported via what we call a [`data_config`](#data-config) which can be passed as an argument to sft trainer.
 
@@ -137,6 +136,14 @@ Users can create a data config file in any of YAML or JSON format they choose (w
  - `chat_template` (optional, str): pass `chat_template` via data_config for multi-turn data, replaces existing default chat template.
  - `odm` (optional): if `type` is odm, this field is required to be specific to provide configuration for online data mixing.
 
+Data handlers are customizable components within the data config that allow users to preprocess or manipulate individual datasets. We use the [Hugging Face Map API](https://huggingface.co/docs/datasets/en/process#map) to apply these routines.
+These functions can process the dataset in any way users require, and the `list` of data handlers specified for each dataset is applied in order.
+Each data handler has:
+- `name`: The handler's unique identifier.
+- `arguments`: A dictionary of parameters specific to the handler.
+
+#### Online data mixing section
+
 `odm` config has the following fields and is required when `datapreprocessor` `type` is `odm`.
 
 `odm`:
@@ -154,11 +161,6 @@ Users can create a data config file in any of YAML or JSON format they choose (w
     - `split` (optional, dict[str: float]): Defines how to split the dataset into training and validation sets. Requires both `train` and `validation` keys.
     - `data_handlers` (optional, list): A list of data handler configurations which preprocess the dataset.
 
-Data handlers are customizable components within the data config that allow users to preprocess or manipulate individual datasets. We use [Hugging Face Map API](https://huggingface.co/docs/datasets/en/process#map) to apply these routines.
-These functions can process the dataset in any way users require and the `list` of data handlers specified for each dataset are applied in order.
-Each data handler has:
-- `name`: The handler's unique identifier.
-- `arguments`: A dictionary of parameters specific to the handler.
 
 We do provide some sample `data_configs` here, [predefined_data_configs](../tests/artifacts/predefined_data_configs/).
 
@@ -203,58 +205,6 @@ We also allow users to pass a [`seed`](https://huggingface.co/docs/datasets/v3.2
 
 Note: If a user specifies data sampling they can expect the datasets to be mixed and individual samples in the dataset to not be broken unless the max_seq_len argument is smaller than the length of individual samples in the dataset
 
-### Online Data Mixing
-Dataset mixing can be dynamic in nature that adapts online during the training based on the training signals. We provide this feature through fms_acceleration_odm plugin and more details can be found [here](https://github.com/foundation-model-stack/fms-acceleration/tree/main/plugins/online-data-mixing).
-
-#### How to Use
-
-`dataprocessor` `type` has to be set to `odm` and then `odm` config should be provided in the `odm` section of the data config file. An example is shown below:
-
-```yaml
-dataprocessor:
-    type: odm
-    odm:
-      update_interval: 1 # update every step
-      sampling_interval: 1 # sample category for every sample
-      reward_type: validation_loss # uses eval loss of each dataset as reward
-      gamma: 0.1 # MAB hyper-parameter
-      eta: 0.2 # MAB hyper-parameter
-```
-
-Here `update_interval` is set to `1` which is to update MAB on every step with validation loss as reward across the datasets. `sampling_interval` is set to `1` which is to choose a dataset to sample for every sample. `reward_type` is set to `validation_loss` to use validation loss across datasets as a training signal to reward MAB decisions during training. Example `datasets` section can look like below:
-
-```yaml
-datasets:
-  - name: dataset_1
-    split:
-      train: 0.8
-      validation: 0.2
-    data_paths:
-      - "FILE_PATH"
-    data_handlers:
-      - name: tokenize_and_apply_input_masking
-        arguments:
-          remove_columns: all
-          batched: false
-          fn_kwargs:
-            input_column_name: input
-            output_column_name: output
-  - name: dataset_2
-    split:
-      train: 0.9
-      validation: 0.1
-    data_paths:
-      - "FILE_PATH"
-    data_handlers:
-      - name: tokenize_and_apply_input_masking
-        arguments:
-          remove_columns: all
-          batched: false
-          fn_kwargs:
-            input_column_name: input
-            output_column_name: output
-```
-As you notice, `validation` under `split` is provided for each of the datasets and is necessary to be provided since the `reward_type` is `validation_loss` which requires validation datasets to be available. Same applies to the following rewards: `validation_loss`, `entropy`, `entropy3_varent1`, and `entropy_last_token`. While reward_types `train_loss` and `gradnorm` do not require validation split.
 
 ### Dataset Splitting
 
diff --git a/tests/artifacts/predefined_data_configs/multiple_datasets_with_odm.yaml b/tests/artifacts/predefined_data_configs/multiple_datasets_with_odm.yaml
@@ -12,7 +12,7 @@ datasets:
   - name: dataset_1
     split:
       train: 0.8
-      validation: 0.2
+      validation: 0.2 # validation set is also used in reward computation when reward_type is validation_loss.
     sampling: 0.3 # ignored
     data_paths:
       - "FILE_PATH"
@@ -27,7 +27,7 @@ datasets:
   - name: dataset_2
     split:
       train: 0.6
-      validation: 0.2
+      validation: 0.2 # validation set is also used in reward computation when reward_type is validation_loss.
     sampling: 0.4 # ignored
     data_paths:
       - "FILE_PATH"
@@ -42,7 +42,7 @@ datasets:
   - name: dataset_3
     split:
       train: 0.4
-      validation: 0.1
+      validation: 0.1 # validation set is also used in reward computation when reward_type is validation_loss.
     sampling: 0.3  # ignored
     data_paths:
       - "FILE_PATH"
@@ -57,7 +57,7 @@ datasets:
   - name: dataset_4
     split:
       train: 0.0
-      validation: 0.3  # ignored
+      validation: 0.3 # validation set is also used in reward computation when reward_type is validation_loss.
     data_paths:
       - "FILE_PATH"
     data_handlers:
@@ -67,4 +67,4 @@ datasets:
           batched: false
           fn_kwargs:
             input_column_name: input
-            output_column_name: output
+            output_column_name: output
diff --git a/tuning/data/setup_dataprocessor.py b/tuning/data/setup_dataprocessor.py
@@ -9,7 +9,7 @@
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
+# See the License for the specific language governing permissions and
 # limitations under the License.
 
 # Standard
@@ -495,6 +495,72 @@ def dump_dataset(
         raise RuntimeError(f"Failed to dump dataset due to error {e}") from e
 
 
+def process_dataargs_odm(
+    data_args: DataArguments,
+    tokenizer: AutoTokenizer,
+    train_args: TrainingArguments,
+    is_padding_free: bool = False,
+    processor: AutoProcessor = None,
+    odm_config: ODMConfig = None,
+    train_dataset: Dict = None,
+    eval_dataset: Dict = None,
+    max_seq_length: int = None,
+):
+    collators = {}
+    eval_collators = {}
+    for k, v in train_dataset.items():
+        is_tokenized_dataset = is_pretokenized_dataset(v)
+        collators[k] = get_data_collator(
+            train_args.packing,
+            data_args.response_template,
+            tokenizer,
+            is_tokenized_dataset,
+            max_seq_length,
+            data_args.instruction_template,
+            is_padding_free=is_padding_free,
+            processor=processor,
+        )
+        data_collator = collators[k]
+    for k, v in eval_dataset.items():
+        is_tokenized_dataset = is_pretokenized_dataset(v)
+        eval_collators[k] = get_data_collator(
+            train_args.packing,
+            data_args.response_template,
+            tokenizer,
+            is_tokenized_dataset,
+            max_seq_length,
+            data_args.instruction_template,
+            is_padding_free=is_padding_free,
+            processor=processor,
+        )
+
+    # pylint: disable=import-outside-toplevel
+    if not is_fms_accelerate_available(plugins="odm"):
+        raise ImportError(
+            "use of odm data config feature requires "
+            "installation of fms_acceleration_odm package"
+        )
+    # Third Party
+    # pylint: disable=import-error
+    from fms_acceleration_odm import OnlineMixingDataset
+
+    train_dataset = OnlineMixingDataset(
+        train_dataset,
+        collators,
+        eval_dataset,
+        eval_collators,
+        None,
+        gamma=odm_config.odm.gamma,
+        eta=odm_config.odm.eta,
+        output_dir=train_args.output_dir,
+        sampling_interval=odm_config.odm.sampling_interval,
+        eval_batch_size=train_args.per_device_eval_batch_size,
+        reward_type=odm_config.odm.reward_type,
+    )
+    train_args.accelerator_config = {"split_batches": True}
+    return (True, train_dataset, True, data_collator)
+
+
 # If a data config file is provided, load it to get the training dataset.
 # - Assumes only the training dataset is specified in the config file.
 # - Expects a complete and valid data config file from the user.
@@ -595,10 +661,6 @@ def process_dataargs(
             "Check your data config or ensure split sizes are valid."
         )
     if data_args.do_dataprocessing_only:
-        if odm_config:
-            raise ValueError(
-                "data processing with online data mixing is not currently supported"
-            )
         dump_dir = Path(train_args.output_dir)
         if not dump_dir.is_absolute():
             dump_dir = dump_dir.absolute()
@@ -621,44 +683,31 @@ def process_dataargs(
         )
         return (train_dataset, eval_dataset, None, None, None, None)
 
-    # Note: This check should not be removed.
-    #       Its important to recompute this post handling to
-    #       check if we already tokenized the dataset or not.
+    dataset_kwargs = {}
+    data_collator = None
     if odm_config:
         is_tokenized_dataset = True
+        (
+            dataset_kwargs["skip_prepare_dataset"],
+            train_dataset,
+            _,  # third return value is a bool; binding it to dataset_kwargs would replace the dict
+            data_collator,
+        ) = process_dataargs_odm(
+            data_args,
+            tokenizer,
+            train_args,
+            is_padding_free,
+            processor,
+            odm_config,
+            train_dataset,
+            eval_dataset,
+            max_seq_length,
+        )
     else:
+        # Note: This check should not be removed.
+        #       Its important to recompute this post handling to
+        #       check if we already tokenized the dataset or not.
         is_tokenized_dataset = is_pretokenized_dataset(train_dataset or eval_dataset)
-
-    data_collator = None
-    if odm_config:
-        collators = {}
-        eval_collators = {}
-        for k, v in train_dataset.items():
-            is_tokenized_dataset = is_pretokenized_dataset(v)
-            collators[k] = get_data_collator(
-                train_args.packing,
-                data_args.response_template,
-                tokenizer,
-                is_tokenized_dataset,
-                max_seq_length,
-                data_args.instruction_template,
-                is_padding_free=is_padding_free,
-                processor=processor,
-            )
-            data_collator = collators[k]
-        for k, v in eval_dataset.items():
-            is_tokenized_dataset = is_pretokenized_dataset(v)
-            eval_collators[k] = get_data_collator(
-                train_args.packing,
-                data_args.response_template,
-                tokenizer,
-                is_tokenized_dataset,
-                max_seq_length,
-                data_args.instruction_template,
-                is_padding_free=is_padding_free,
-                processor=processor,
-            )
-    else:
         data_collator = get_data_collator(
             train_args.packing,
             data_args.response_template,
@@ -669,34 +718,6 @@ def process_dataargs(
             is_padding_free=is_padding_free,
             processor=processor,
         )
-    dataset_kwargs = {}
-    if odm_config:
-        # Third Party
-        # pylint: disable=import-outside-toplevel
-        if not is_fms_accelerate_available(plugins="odm"):
-            raise ImportError(
-                "use of odm data config feature requires"
-                "installation of fms_acceleration_odm package"
-            )
-        # Third Party
-        # pylint: disable=import-error
-        from fms_acceleration_odm import OnlineMixingDataset
-
-        train_dataset = OnlineMixingDataset(
-            train_dataset,
-            collators,
-            eval_dataset,
-            eval_collators,
-            None,
-            gamma=odm_config.odm.gamma,
-            eta=odm_config.odm.eta,
-            output_dir=train_args.output_dir,
-            sampling_interval=odm_config.odm.sampling_interval,
-            eval_batch_size=train_args.per_device_eval_batch_size,
-            reward_type=odm_config.odm.reward_type,
-        )
-        dataset_kwargs["skip_prepare_dataset"] = True
-        train_args.accelerator_config = {"split_batches": True}
 
     # For vision model tuning prepare_dataset is skipped.
     if processor is not None: