Reward Hacking, Shortcut Learning, and Spurious Correlation

Authors: Yihe Deng, Yu Yang

Date: 12/23/2024

Thanks to Fan Yin for providing invaluable feedback on an earlier draft of this blog.

Spurious correlation or shortcut learning (Geirhos et al. 2020) in classification tasks is closely related to reward hacking.
— Reward Hacking in Reinforcement Learning

This note focuses on the connection between reward hacking and spurious correlation. Please see the blog post quoted above for a detailed discussion and a comprehensive reference list.

1. Introduction

Reward Hacking (Classic RL Context)

Definition: Occurs when RL agents exploit flaws in the reward function to produce unintended behaviors (e.g., looping for infinite points, crashing the environment).
Key Point: Misalignment between the reward function and the true goal leads to perverse strategies.

Data-Induced Reward Hacking

In preference-based training (e.g., RLHF, LLM alignment), reward hacking can take the form of spurious correlations: the model learns superficial patterns (e.g., lengthier text = better) rather than genuine quality.
This reflects Goodhart’s Law: once a proxy (here, human preference data) is optimized hard enough, it stops tracking the goal it was meant to measure.

2. Spurious Correlation

A spurious correlation is a superficial feature strongly associated with the target label but not causally related. Classic examples include:

Vision: Background colors, water vs. land, or demographic attributes that overshadow the core features.
Language: Negation words, certain templates, or keywords (e.g., “safe,” “ethical”) correlated with higher or lower labels.

Spurious correlations often lead to shortcut learning, degrading model robustness, especially for minority or out-of-distribution examples.
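One way to make the robustness gap concrete is to report accuracy per group rather than only on average. The sketch below is a minimal illustration with made-up toy data; the group names ("majority", "minority") and the helper functions are hypothetical and not taken from any specific benchmark.

```python
from collections import defaultdict


def group_accuracies(labels, preds, groups):
    """Return {group: accuracy} for parallel lists of labels, predictions, group ids."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p, g in zip(labels, preds, groups):
        total[g] += 1
        correct[g] += int(y == p)
    return {g: correct[g] / total[g] for g in total}


def worst_group_accuracy(labels, preds, groups):
    """Accuracy of the group on which the model performs worst."""
    return min(group_accuracies(labels, preds, groups).values())


# Toy data (assumed): a shortcut works on 16 "majority" examples, where the
# spurious feature agrees with the label, and on only 1 of 4 "minority" examples.
labels = [1] * 16 + [1] * 4
preds = [1] * 16 + [1, 0, 0, 0]
groups = ["majority"] * 16 + ["minority"] * 4

overall = sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)
per_group = group_accuracies(labels, preds, groups)
worst = worst_group_accuracy(labels, preds, groups)
```

In this toy setup the average accuracy looks healthy while the minority group is mostly misclassified, which is the pattern shortcut learning tends to produce.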

3. How Preference Data Can Contain Spurious Correlations

Length as a Proxy for Quality

Longer responses often earn higher human ratings because they appear more thorough or polite.
DPO (Direct Preference Optimization) can inadvertently overfit to “longer = better,” causing the model to produce excessively long or rambling outputs.
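A quick diagnostic for this kind of length bias is to check how often the chosen response in a preference dataset is simply the longer one. The following is a minimal sketch, assuming pairs are given as (chosen, rejected) strings and that whitespace word count is an acceptable stand-in for length; the function name and toy data are hypothetical.

```python
def length_bias_stats(pairs):
    """Summarize how often the chosen response is longer than the rejected one.

    pairs: list of (chosen_text, rejected_text) tuples.
    Length is measured in whitespace-separated words (an assumption).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    longer = shorter = ties = 0
    for chosen, rejected in pairs:
        len_c, len_r = len(chosen.split()), len(rejected.split())
        if len_c > len_r:
            longer += 1
        elif len_c < len_r:
            shorter += 1
        else:
            ties += 1
    n = len(pairs)
    return {
        "chosen_longer_rate": longer / n,
        "chosen_shorter_rate": shorter / n,
        "tie_rate": ties / n,
    }


# Toy preference pairs (made up for illustration).
toy_pairs = [
    ("Paris is the capital of France and its largest city.", "Paris."),
    ("Use a for loop to iterate over the list items.", "Use a loop."),
    ("No.", "I am not sure, but it might possibly be the case."),
    ("The answer is 4 because 2 + 2 = 4.", "It is 4."),
]
stats = length_bias_stats(toy_pairs)
```

A chosen_longer_rate far above 0.5 suggests that length alone already predicts the preference label, so a model trained on the data could plausibly pick it up as a shortcut.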

“Keyword Hacking” in Safety and Alignment

Refusal Tokens (e.g., “I apologize,” “I cannot”) might be overused if preference data consistently labels these as safer or more responsible responses.
The resulting alignment is shallow: it is easy to bypass and does not address the model's deeper generative behavior.
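The same kind of audit applies to keywords: compare how often refusal phrases appear in chosen versus rejected responses. This is a rough sketch; the marker list reuses the two phrases mentioned above, and the helper name and toy responses are hypothetical.

```python
REFUSAL_MARKERS = ("i apologize", "i cannot")  # phrases from the text above


def marker_rate(texts, markers=REFUSAL_MARKERS):
    """Fraction of texts containing at least one marker (case-insensitive)."""
    texts = list(texts)
    if not texts:
        return 0.0
    hits = sum(any(m in t.lower() for m in markers) for t in texts)
    return hits / len(texts)


# Toy safety preference data (made up for illustration).
chosen = [
    "I apologize, but I cannot help with that request.",
    "I cannot provide instructions for that.",
    "Here is a safe way to dispose of old batteries.",
]
rejected = [
    "Sure, here is how to do it.",
    "Step one: gather the materials.",
    "Here you go.",
]

chosen_rate = marker_rate(chosen)
rejected_rate = marker_rate(rejected)
```

A large gap between the two rates indicates that the presence of a refusal phrase is nearly sufficient to predict which response was preferred, regardless of what the rest of the response says.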

Other Potential Correlations

Formatting Bias: Bullet points, special tokens (emojis, exclamation marks), or bold text might inflate user ratings.
Confidence Tone: Overly confident or enthusiastic language can be misconstrued as correctness or helpfulness.
Positivity Bias: Cheerful or agreeable responses may be preferred, even when a neutral or critical stance is more accurate.

These spurious features mirror the phenomenon seen in image or text classification, where superficial cues overshadow the true underlying quality or correctness.

4. Further Questions

Does Negative Data Accelerate Shortcut Learning?

Including clearly dispreferred samples (shorter, impolite, etc.) may reinforce superficial features that separate negative from positive examples.
DPO vs. SFT comparisons reveal t

Will Iterative Self-Training Amplify Spurious Correlations?

Multi-round self-improvement cycles can amplify existing biases if the model keeps learning from its own biased outputs.
Without careful monitoring or correction, spurious signals may escalate in each iteration.
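The amplification argument can be illustrated with a deliberately simplified toy model; it is not a claim about any real training pipeline. Assume the model produces a "long" response with probability p. Each round it samples two responses per prompt, a judge keeps the long one with probability judge_bias whenever the two differ, and the model then imitates the kept responses. All names and numbers below are assumptions made for this sketch.

```python
def simulate_long_fraction(p0, judge_bias, rounds):
    """Track the fraction of 'long' responses across self-training rounds.

    Per round, the kept response is long if both samples are long (p * p),
    or if exactly one is long (2 * p * (1 - p)) and the judge picks it
    (judge_bias). The model is assumed to match the kept distribution exactly.
    """
    history = [p0]
    p = p0
    for _ in range(rounds):
        p = p * p + 2 * p * (1 - p) * judge_bias
        history.append(p)
    return history


unbiased = simulate_long_fraction(p0=0.5, judge_bias=0.5, rounds=20)
biased = simulate_long_fraction(p0=0.5, judge_bias=0.6, rounds=20)
```

Under these assumptions, an unbiased judge (judge_bias = 0.5) leaves p unchanged, while even a mild preference for long responses compounds round after round.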

5. Potential Mitigations

Algorithmic Approaches

R-DPO: Adjusts per-example learning rates to balance out length or other spurious signals.
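SimPO: Uses the length-normalized (average) log-probability of a response as the implicit reward, together with a target reward margin, which reduces the incentive to produce overly long responses.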
Reward Model Ensembles: Combines multiple RMs for a more conservative reward signal, though shared biases can still persist.
ODIN: Separates reward signals into length-correlated and length-independent heads to discourage length-based exploitation.
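To show where length enters these objectives, here is a per-pair sketch of the standard DPO loss, a length-regularized variant in the spirit of R-DPO, and a SimPO-style loss. It operates on scalar sequence log-probabilities rather than tensors. The sign convention of the length term follows from adding a penalty of alpha times the length to the reward before deriving the DPO objective; treat it, and the default hyperparameters, as assumptions to check against the original papers.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen=w, rejected=l) pair."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


def length_regularized_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                                len_w, len_l, beta=0.1, alpha=0.01):
    """DPO with a length term in the margin (R-DPO-style, sign convention assumed).

    When the chosen response is longer, the margin is already larger, so the
    gradient on that pair shrinks: the pair is effectively down-weighted.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    margin += alpha * (len_w - len_l)
    return -log_sigmoid(margin)


def simpo_loss(logp_w, logp_l, len_w, len_l, beta=2.0, gamma=0.5):
    """SimPO-style loss: length-normalized log-probs, no reference model,
    and a target reward margin gamma."""
    margin = beta * (logp_w / len_w - logp_l / len_l) - gamma
    return -log_sigmoid(margin)
```

With alpha = 0 the regularized loss reduces to plain DPO, and in the SimPO-style loss two responses with the same average per-token log-probability receive the same implicit reward regardless of their length.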

Data-Centric Approaches

RRM (Robust Reward Model Training): Augments or balances chosen vs. rejected responses to disrupt spurious correlations.
Group-Based Methods (e.g., GroupDRO, PDE, JTT): Identify spurious attributes (like backgrounds in vision tasks or style tokens in text) and balance them during training.
PDE (Progressive Data Expansion): Start with balanced subsets before adding potentially imbalanced data, preserving early robustness.
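As a simple illustration of data balancing, the sketch below subsamples preference pairs so that "chosen is longer" and "chosen is shorter" occur equally often. This is a generic balancing heuristic written for this note, not the actual RRM, GroupDRO, JTT, or PDE procedure; the function name, word-count length measure, and toy pairs are assumptions.

```python
import random


def word_len(text):
    return len(text.split())


def length_balanced_subsample(pairs, seed=0):
    """Downsample so chosen-longer and chosen-shorter pairs are equally frequent.

    pairs: list of (chosen_text, rejected_text). Ties in length are kept as-is.
    """
    rng = random.Random(seed)
    longer = [p for p in pairs if word_len(p[0]) > word_len(p[1])]
    shorter = [p for p in pairs if word_len(p[0]) < word_len(p[1])]
    ties = [p for p in pairs if word_len(p[0]) == word_len(p[1])]
    k = min(len(longer), len(shorter))
    balanced = rng.sample(longer, k) + rng.sample(shorter, k) + ties
    rng.shuffle(balanced)
    return balanced


# Toy pairs (made up): 5 with a longer chosen response, 2 with a shorter one, 1 tie.
pairs = [
    ("a b c", "a"), ("a b c d", "a b"), ("a b", "a"),
    ("a b c", "a b"), ("a b c d e", "a"),
    ("a", "a b c"), ("a b", "a b c d"),
    ("a b", "c d"),
]
balanced = length_balanced_subsample(pairs)
```

Downsampling discards data, so in practice one might prefer reweighting or augmentation; the point here is only that after balancing, length no longer predicts the preference label.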

Theoretical Insights

Spurious features often dominate because they are easier to learn or highly correlated with the label.
Distributionally robust strategies and balanced sampling can mitigate these shortcuts by limiting reliance on the superficial signals.
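One concrete distributionally robust strategy is the group-weight update used in GroupDRO: groups with higher loss receive exponentially larger weight, and the training loss becomes the weighted sum of per-group losses. The sketch below shows only that weight update on fixed, made-up loss values; the group names and step size are assumptions, and a real implementation would interleave this with model updates.

```python
import math


def update_group_weights(weights, group_losses, step_size=1.0):
    """Exponentiated-gradient step: upweight groups with higher loss, then renormalize."""
    scaled = {g: weights[g] * math.exp(step_size * group_losses[g]) for g in weights}
    total = sum(scaled.values())
    return {g: v / total for g, v in scaled.items()}


def robust_loss(weights, group_losses):
    """Weighted sum of per-group losses under the current group weights."""
    return sum(weights[g] * group_losses[g] for g in weights)


# Toy per-group losses (assumed): the shortcut fails on the minority group.
group_losses = {"majority": 0.2, "minority": 1.5}
weights = {"majority": 0.5, "minority": 0.5}

average_loss = robust_loss(weights, group_losses)
new_weights = update_group_weights(weights, group_losses)
reweighted_loss = robust_loss(new_weights, group_losses)
```

Because the minority group has the higher loss, its weight grows after the update and the reweighted objective puts more pressure on exactly the examples where the shortcut fails.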

6. Conclusion

Reward hacking and spurious correlation are deeply intertwined challenges, underscoring the gap between true objectives (e.g., factual correctness, genuine helpfulness) and over-optimized proxies (length, specific tokens, or superficial style). While recent advancements (e.g., R-DPO, ODIN, RRM) tackle these shortcuts algorithmically and through data manipulation, fully robust alignment requires:

Better understanding of which artifacts are truly spurious.
Iterative monitoring of model outputs and self-training loops.
Data balancing and annotation strategies that reduce reliance on superficial cues.

When Good Writing Hides Bad Content
High-quality formatting and engaging style can mask inaccuracies or shallow reasoning. The real challenge is defining (and enforcing) robust criteria for “good” responses across diverse user needs without encouraging misleading superficial correlations.