Possible train/val/test leakage in ConstitutionOptimizer due to row-level splitting instead of group-aware splitting
Motivation
Hi, thanks for open-sourcing this project.
While reading ConstitutionOptimizer.get_train_test_val_data(), I noticed that the reward dataset appears to be split at the individual-row level after random.shuffle(reward_dataset), without grouping by root_id, root_prompt, parent_id, or parent_prompt.
From the current implementation, each reward example is built from exploration rows that retain strong structural dependencies:
- shared root_prompt / root_id
- shared parent_prompt / parent_id
- multiple ACE mutations derived from the same parent prompt
Potential issue
This seems important because the surrogate-classifier evaluation may then place highly correlated examples from the same exploration subtree into both train and val/test splits.
In that case, the reported validation/test loss may overestimate generalization, since the model is not being evaluated on truly independent prompt families.
More concretely:
- exploration_data.csv is loaded
- labeled rows are converted into reward_dataset
- the dataset is shuffled
- then split 80/10/10 at the example level
This could lead to optimistic estimates of surrogate performance, especially if multiple mutations from the same prompt subtree appear across splits, effectively reducing the independence of evaluation samples.
This might be particularly relevant if the goal is to assess generalization to unseen prompts rather than interpolation within a prompt family.
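To make the concern concrete, here is a minimal sketch that reproduces a row-level 80/10/10 split on toy data and counts how many root_id values end up in more than one split. The helper names (row_level_split, shared_groups) and the toy rows are hypothetical and only illustrate the behaviour described above; they are not taken from the repository.

```python
import random


def row_level_split(rows, seed=0):
    """Shuffle rows and cut 80/10/10 by position (the row-level behaviour described above)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 8 // 10
    n_val = len(rows) * 9 // 10
    return rows[:n_train], rows[n_train:n_val], rows[n_val:]


def shared_groups(train, val, test, key="root_id"):
    """Return the group ids that occur in more than one split."""
    a, b, c = ({row[key] for row in split} for split in (train, val, test))
    return (a & b) | (a & c) | (b & c)


# Toy data (assumption): 10 exploration trees with 20 mutations each.
toy_rows = [
    {"root_id": root, "text": f"mutation {m} of root {root}"}
    for root in range(10)
    for m in range(20)
]

train, val, test = row_level_split(toy_rows)
leaked = shared_groups(train, val, test)
print(f"{len(leaked)} of 10 root_ids appear in more than one split")
```

Running the same check against the real reward_dataset (with whatever field actually holds the root or parent identifier) would show how much overlap the current split has in practice.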
Question
I may be missing intended behavior, but I did not see a grouping-aware split.
Would it make sense to switch to a grouped split, for example:
- grouping by root_id to test generalization to unseen root prompts / trees, or
- grouping by parent_id / parent_prompt to ensure sibling mutations of the same prompt do not leak across splits?
Suggestion
A possible approach could be to use a group-aware split (e.g., GroupShuffleSplit / GroupKFold) where the grouping key is:
- root_id (for stronger generalization evaluation), or
- parent_id (for stricter independence among sibling mutations)
I think this would make the constitution-surrogate evaluation more robust and easier to interpret.
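As a rough illustration of what I have in mind, here is a dependency-free sketch of a group-aware 80/10/10 split. It assigns whole groups to a split rather than individual rows, which is the same idea GroupShuffleSplit implements. The function name grouped_split, the key argument, and the toy rows are hypothetical; the real code would use whichever field stores root_id or parent_id.

```python
import random
from collections import defaultdict


def grouped_split(rows, key="root_id", fractions=(0.8, 0.1, 0.1), seed=0):
    """Split rows into train/val/test so that no group spans two splits.

    The fractions apply to the number of groups, so row counts are only
    approximately 80/10/10 when group sizes differ.
    """
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[key]].append(row)

    group_ids = sorted(by_group)  # fixed order before shuffling, for reproducibility
    random.Random(seed).shuffle(group_ids)

    n_train = round(fractions[0] * len(group_ids))
    n_val = round(fractions[1] * len(group_ids))
    cuts = (
        group_ids[:n_train],
        group_ids[n_train:n_train + n_val],
        group_ids[n_train + n_val:],
    )
    # Expand each set of group ids back into its rows.
    return tuple([row for g in ids for row in by_group[g]] for ids in cuts)


# Toy data (assumption): 10 exploration trees with 20 mutations each.
toy_rows = [
    {"root_id": root, "text": f"mutation {m} of root {root}"}
    for root in range(10)
    for m in range(20)
]

g_train, g_val, g_test = grouped_split(toy_rows, key="root_id")
train_ids = {row["root_id"] for row in g_train}
val_ids = {row["root_id"] for row in g_val}
test_ids = {row["root_id"] for row in g_test}
print(len(g_train), len(g_val), len(g_test))
```

Passing key="parent_id" instead would keep sibling mutations together while still allowing different branches of the same tree to land in different splits.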
Closing
Happy to put together a PR if this direction makes sense.