Skip to content

avoid recency bias in prompt construction #104

@AndreasKarasenko

Description

@AndreasKarasenko

Context
According to this paper ChatGPT (and likely other LLMs) suffer from a recency bias. Whatever class comes last has a higher propability of being selected.
Issue
Currently scikit-llm constructs prompts based on the order of the training data.
Since we are recommended to restrict the training data I would usually do something like this:

df = df.groupby(label_col).apply(lambda x: x.sample(n_samples))
df = df.reset_index(drop=True)

Which returns a sorted dataframe by label_col. Even if sort=False is passed to groupby the instances are still clustered by label.

Question/Solution
Should a method be implemented that randomizes the order of samples in the prompt / training data, or should users take care of that themselves?
The most straightforward way would be to simply add this to sampling:

df = df.sample(frac=1)

Which leaves it up to chance to balance it reasonably.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions