Possible train/val/test leakage in ConstitutionOptimizer due to row-level splitting instead of group-aware splitting

## Motivation

Hi, thanks for open-sourcing this project.

While reading `ConstitutionOptimizer.get_train_test_val_data()`, I noticed that the reward dataset appears to be split at the individual-row level after `random.shuffle(reward_dataset)`, without grouping by `root_id`, `root_prompt`, `parent_id`, or `parent_prompt`.

From the current implementation, each reward example is built from exploration rows that retain strong structural dependencies:

* shared `root_prompt` / `root_id`
* shared `parent_prompt` / `parent_id`
* multiple ACE mutations derived from the same parent prompt

## Potential issue

This matters because a row-level split can place highly correlated examples from the same exploration subtree in both the train and val/test splits.

If that happens, the reported validation/test loss of the surrogate classifier may overestimate generalization, since the model is not being evaluated on independent prompt families.

More concretely:

* `exploration_data.csv` is loaded
* labeled rows are converted into `reward_dataset`
* the dataset is shuffled
* then split 80/10/10 at the example level

If multiple mutations from the same prompt subtree end up in different splits, the held-out examples are no longer independent of the training data, so surrogate performance may look better than it really is.
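To make the concern concrete, here is a minimal sketch on synthetic data, not the repository's actual code. It assumes each reward example is a dict carrying a `root_id` field. The helper names `row_level_split` and `leaked_groups` are hypothetical and only mimic the shuffle-then-slice behaviour described above.

```python
import random

def row_level_split(rows, seed=0):
    """Shuffle rows and cut 80/10/10 by position, ignoring group structure."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(0.8 * len(rows))
    n_val = int(0.1 * len(rows))
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

def leaked_groups(train, held_out, key="root_id"):
    """Return the group ids that occur in both the training and held-out rows."""
    return {r[key] for r in train} & {r[key] for r in held_out}

# Synthetic stand-in for reward_dataset: 20 root prompts, 10 mutations each.
rows = [{"root_id": root, "mutation": m} for root in range(20) for m in range(10)]
train, val, test = row_level_split(rows)
overlap = leaked_groups(train, val + test)
print(f"{len(overlap)} of 20 root_ids appear in both train and val/test")
```

Running the same check on the real `reward_dataset` (with `root_id` or `parent_id` as the key) would show how much overlap the current split actually produces.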

This might be particularly relevant if the goal is to assess generalization to unseen prompts rather than interpolation within a prompt family.

## Question

I may be missing intended behavior, but I did not see a group-aware split anywhere in this path.

Would it make sense to switch to a grouped split, for example:

* grouping by `root_id` to test generalization to unseen root prompts / trees, or
* grouping by `parent_id` / `parent_prompt` to ensure sibling mutations of the same prompt do not leak across splits?

## Suggestion

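A possible approach could be to use a group-aware split (e.g., scikit-learn's `GroupShuffleSplit` or `GroupKFold`), where the grouping key is:

* `root_id` (the coarser option, which keeps each whole exploration tree in one split and so tests generalization to unseen root prompts), or
* `parent_id` (the finer-grained option, which only guarantees that sibling mutations of the same parent stay in the same split)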

I think this would make the constitution-surrogate evaluation more robust and easier to interpret.
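As a rough illustration of what such a split could look like, here is a dependency-free sketch. It assumes dict-shaped rows with a `root_id` field, and the function name `group_split` is hypothetical. scikit-learn's `GroupShuffleSplit` would serve the same purpose. Note that the 80/10/10 fractions apply to groups here, so the row counts per split can drift when group sizes are uneven.

```python
import random
from collections import defaultdict

def group_split(rows, key="root_id", fractions=(0.8, 0.1, 0.1), seed=0):
    """Split rows into train/val/test so that no group spans two splits."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    group_ids = sorted(groups)  # fixed starting order keeps the shuffle reproducible
    random.Random(seed).shuffle(group_ids)

    n_train = int(fractions[0] * len(group_ids))
    n_val = int(fractions[1] * len(group_ids))
    buckets = (
        group_ids[:n_train],
        group_ids[n_train:n_train + n_val],
        group_ids[n_train + n_val:],  # remainder goes to test
    )
    return tuple([row for g in ids for row in groups[g]] for ids in buckets)

# Synthetic stand-in for reward_dataset: 20 root prompts, 10 mutations each.
rows = [{"root_id": root, "mutation": m} for root in range(20) for m in range(10)]
train, val, test = group_split(rows)
ids = [{r["root_id"] for r in part} for part in (train, val, test)]
print([len(part) for part in (train, val, test)], [len(s) for s in ids])
```

Passing `key="parent_id"` would give the finer-grained variant, assuming the rows carry that field.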

## Closing

Happy to put together a PR if this direction makes sense.

