This guide explains the structure and fields used in YAML configuration files for datasets within Eval Protocol. These configurations are typically located in `conf/dataset/` or within an example's `conf/dataset/` directory (e.g., `examples/math_example/conf/dataset/`). They are processed by the `eval_protocol.datasets.loader` module (`loader.py`) using Hydra.
There are two main types of dataset configurations: Base Datasets and Derived Datasets.
A base dataset configuration defines the connection to a raw data source and performs initial processing like column mapping.
Example files: `conf/dataset/base_dataset.yaml` (schema), `examples/math_example/conf/dataset/gsm8k.yaml` (concrete example)
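For orientation, the fields described below can be combined into a base configuration along the following lines. This is a sketch assembled from the per-field example values in this guide, not a verbatim copy of `gsm8k.yaml`; the actual file may set additional or different fields.

```yaml
# Sketch of a base dataset config for GSM8K, assembled from the per-field
# examples in this guide (not copied from the repository file).
_target_: eval_protocol.datasets.loader.load_and_process_dataset
source_type: huggingface
path_or_name: "gsm8k"
config_name: "main"        # corresponds to `name` in Hugging Face's load_dataset
split: "test"
max_samples: null          # null (or 0) loads all samples
column_mapping:
  query: "question"        # standard name: original column name
  ground_truth: "answer"
description: "GSM8K (Grade School Math 8K) dataset."
```

Only `_target_`, `source_type`, and `path_or_name` are marked as required; the remaining keys fall back to their documented defaults when omitted.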
- `_target_` (Required)
  - Description: Specifies the Python function to instantiate for loading this dataset.
  - Typical Value: `eval_protocol.datasets.loader.load_and_process_dataset`
  - Example: `_target_: eval_protocol.datasets.loader.load_and_process_dataset`
- `source_type` (Required)
  - Description: Defines the type of the data source.
  - Supported Values:
    - `"huggingface"`: For datasets hosted on the Hugging Face Hub.
    - `"jsonl"`: For local datasets in JSON Lines format.
    - `"fireworks"`: (Not yet implemented) For datasets hosted on Fireworks AI.
  - Example: `source_type: huggingface`
- `path_or_name` (Required)
  - Description: Identifier for the dataset.
    - For `huggingface`: The Hugging Face dataset name (e.g., `"gsm8k"`, `"cais/mmlu"`).
    - For `jsonl`: Path to the `.jsonl` file (e.g., `"data/my_data.jsonl"`).
  - Example: `path_or_name: "gsm8k"`
- `split` (Optional)
  - Description: Specifies the dataset split to load (e.g., `"train"`, `"test"`, `"validation"`). If loading a Hugging Face `DatasetDict` or multiple JSONL files mapped via `data_files`, this selects the split after loading.
  - Default: `"train"`
  - Example: `split: "test"`
- `config_name` (Optional)
  - Description: Selects a configuration for Hugging Face datasets that define more than one (e.g., `"main"` for `gsm8k`, `"all"` for `cais/mmlu`). Corresponds to the `name` parameter in Hugging Face's `load_dataset`.
  - Default: `null`
  - Example: `config_name: "main"` (for `gsm8k`)
- `data_files` (Optional)
  - Description: Used for loading local files (like JSONL, CSV) with Hugging Face's `datasets.load_dataset`. Can be a single file path, a list, or a dictionary mapping split names to file paths.
  - Example: `data_files: {"train": "path/to/train.jsonl", "test": "path/to/test.jsonl"}`
- `max_samples` (Optional)
  - Description: Maximum number of samples to load from the dataset (or from each split if a `DatasetDict` is loaded). If `null` or `0`, all samples are loaded.
  - Default: `null`
  - Example: `max_samples: 100`
- `column_mapping` (Optional)
  - Description: A dictionary to rename columns from the source dataset to a standard internal format. Keys are the new standard names (e.g., `"query"`, `"ground_truth"`), and values are the original column names in the source dataset. This mapping is applied by the `eval_protocol.datasets.loader` module.
  - Default: `{"query": "query", "ground_truth": "ground_truth", "solution": null}`
  - Example (`gsm8k.yaml`): `column_mapping: {query: "question", ground_truth: "answer"}`
- `preprocessing_steps` (Optional)
  - Description: A list of strings, where each string is a Python import path to a preprocessing function (e.g., `"eval_protocol.datasets.loader.transform_codeparrot_apps_sample"`). These functions are applied to the dataset after loading and before column mapping.
  - Default: `[]`
  - Example: `preprocessing_steps: ["my_module.my_preprocessor_func"]`
- `hf_extra_load_params` (Optional)
  - Description: A dictionary of extra parameters to pass directly to Hugging Face's `datasets.load_dataset()` (e.g., `trust_remote_code: True`).
  - Default: `{}`
  - Example: `hf_extra_load_params: {trust_remote_code: True}`
- `description` (Optional, Metadata)
  - Description: A brief description of the dataset configuration for documentation purposes.
  - Example: `description: "GSM8K (Grade School Math 8K) dataset."`
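The same fields also cover a local JSONL source. The following is a minimal sketch using only keys documented above; the file name `my_jsonl.yaml`, the data path, and the source column names `prompt` and `expected` are illustrative assumptions, and `my_module.my_preprocessor_func` is the placeholder import path from the `preprocessing_steps` example.

```yaml
# Hypothetical conf/dataset/my_jsonl.yaml. The file name, data path, and the
# source column names ("prompt", "expected") are assumptions for illustration.
_target_: eval_protocol.datasets.loader.load_and_process_dataset
source_type: jsonl
path_or_name: "data/my_data.jsonl"
max_samples: 100                       # load at most 100 records
preprocessing_steps:
  - "my_module.my_preprocessor_func"   # placeholder; applied before column mapping
column_mapping:
  query: "prompt"                      # standard name: original column name
  ground_truth: "expected"
description: "Local JSONL dataset (illustrative)."
```

Because preprocessing runs before column mapping, a function listed in `preprocessing_steps` would see the original column names (`prompt`, `expected` in this sketch) rather than the standard ones.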
A derived dataset configuration references a base dataset and applies further transformations, such as adding system prompts, changing the output format, or applying different column mappings or sample limits.
Example files: `examples/math_example/conf/dataset/base_derived_dataset.yaml` (schema), `examples/math_example/conf/dataset/gsm8k_math_prompts.yaml` (concrete example)
- `_target_` (Required)
  - Description: Specifies the Python function to instantiate for loading this derived dataset.
  - Typical Value: `eval_protocol.datasets.loader.load_derived_dataset`
  - Example: `_target_: eval_protocol.datasets.loader.load_derived_dataset`
- `base_dataset` (Required)
  - Description: A reference to the base dataset configuration to derive from. This can be the name of another dataset configuration file (e.g., `"gsm8k"`, which would load `conf/dataset/gsm8k.yaml`) or a full inline base dataset configuration object.
  - Example: `base_dataset: "gsm8k"`
- `system_prompt` (Optional)
  - Description: A string that will be used as the system prompt. In the `evaluation_format`, this prompt is added as a `system_prompt` field alongside `user_query`.
  - Default: `null`
  - Example (`gsm8k_math_prompts.yaml`): `"Solve the following math problem. Show your work clearly. Put the final numerical answer between <answer> and </answer> tags."`
- `output_format` (Optional)
  - Description: Specifies the final format for the derived dataset.
  - Supported Values:
    - `"evaluation_format"`: Converts dataset records to include `user_query`, `ground_truth_for_eval`, and optionally `system_prompt` and `id`. This is the standard format for many evaluation scenarios.
    - `"conversation_format"`: (Not yet implemented) Converts to a list of messages.
    - `"jsonl"`: Keeps records in a format suitable for direct JSONL output (typically implies minimal transformation beyond base loading and initial mapping).
  - Default: `"evaluation_format"`
  - Example: `output_format: "evaluation_format"`
- `transformations` (Optional)
  - Description: A list of additional transformation functions to apply after the base dataset is loaded and initial derived processing (like system prompt addition) is done. (Currently not fully implemented in `loader.py`.)
  - Default: `[]`
- `derived_column_mapping` (Optional)
  - Description: A dictionary for column mapping applied after the base dataset is loaded and before the `output_format` conversion. This can override or extend the base dataset's `column_mapping`. Keys are new names, values are names from the loaded base dataset.
  - Default: `{}`
  - Example (`gsm8k_math_prompts.yaml`): `derived_column_mapping: {query: "question", ground_truth: "answer"}`. This maps `question` from the base `gsm8k` dataset to `query`, and `answer` to `ground_truth`.
  - Note: These mapped columns (`query`, `ground_truth`) are then used by `convert_to_evaluation_format` to create `user_query` and `ground_truth_for_eval`.
- `derived_max_samples` (Optional)
  - Description: Maximum number of samples for this derived dataset. If specified, this overrides any `max_samples` from the base dataset configuration for the purpose of this derived dataset.
  - Default: `null`
  - Example: `derived_max_samples: 5`
- `description` (Optional, Metadata)
  - Description: A brief description of this derived dataset configuration.
  - Example: `description: "GSM8K dataset with math-specific system prompt in evaluation format."`
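Put together, a derived configuration might look like the sketch below. It is assembled from the per-field example values in this guide rather than copied from `gsm8k_math_prompts.yaml`, so the real file may differ. Whether `derived_column_mapping` is needed depends on which columns the referenced base configuration has already renamed.

```yaml
# Sketch of a derived dataset config, assembled from the per-field examples in
# this guide (not copied from the repository file).
_target_: eval_protocol.datasets.loader.load_derived_dataset
base_dataset: "gsm8k"                  # per this guide, resolves to conf/dataset/gsm8k.yaml
system_prompt: "Solve the following math problem. Show your work clearly. Put the final numerical answer between <answer> and </answer> tags."
output_format: "evaluation_format"     # records gain user_query and ground_truth_for_eval
derived_column_mapping:
  query: "question"                    # new name: column name in the loaded base dataset
  ground_truth: "answer"
derived_max_samples: 5                 # overrides the base config's max_samples
description: "GSM8K dataset with math-specific system prompt in evaluation format."
```

With `output_format: "evaluation_format"`, each record would carry `user_query`, `ground_truth_for_eval`, and the `system_prompt` string, limited here to 5 samples.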
The `eval_protocol.datasets.loader` module uses Hydra to:
- Compose these YAML configurations.
- Instantiate the appropriate loader function (`load_and_process_dataset` or `load_derived_dataset`) with the parameters defined in the YAML.
- The loader functions then use these parameters to fetch data (e.g., from Hugging Face or local files), apply mappings, execute preprocessing steps, and format the data as requested.
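As an illustration of the composition step, the sketch below assumes the standard Hydra config-group layout, in which files under `conf/dataset/` form a `dataset` group that a top-level config selects through its defaults list. The file name `conf/config.yaml` is hypothetical; this guide does not say which entry-point config Eval Protocol uses.

```yaml
# Hypothetical conf/config.yaml (the file name and location are assumptions).
defaults:
  - dataset: gsm8k_math_prompts   # selects conf/dataset/gsm8k_math_prompts.yaml
  - _self_                        # values in this file override the selected config

dataset:
  derived_max_samples: 5          # example override of a documented field
```

In a layout like this, the composed `dataset` node carries the `_target_` key, so Hydra's instantiation mechanism would call the named loader function with the remaining keys as arguments.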
This structured configuration approach allows for flexible and reproducible dataset management within Eval Protocol.