
docs: Add normalization best practices and verification script#2752

Open: r69shabh wants to merge 1 commit into cornellius-gp:main from r69shabh:fix/normalization-best-practices-819

r69shabh commented May 4, 2026

Summary

This PR addresses issue #819 by adding documentation and a verification script for data normalization best practices in the GPyTorch examples.

Changes

  1. Added NORMALIZATION_BEST_PRACTICES.md - Documents the correct approach to data normalization
  2. Added check_normalization.py - Automated verification script to check all example notebooks
  3. Verified all 50 example notebooks follow correct normalization practices

Background

Issue #819, opened in 2019, reported that some examples normalized training data using statistics computed from both the train and test sets. After investigating, I found that all current example notebooks compute normalization statistics from the training data only.
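
For readers unfamiliar with the distinction, here is a minimal sketch of train-only normalization next to the leaky variant the issue describes. It is illustrative only: the data values and the `standardize` helper are hypothetical, and plain Python lists stand in for the torch tensors used in the notebooks.

```python
import statistics

# Hypothetical toy labels; the notebooks would use torch tensors instead.
train_y = [2.0, 4.0, 6.0, 8.0]
test_y = [10.0, 12.0]

# Fit the statistics on the training split only.
train_y_mean = statistics.mean(train_y)
train_y_std = statistics.stdev(train_y)  # sample std (n - 1), the same convention as torch.Tensor.std()


def standardize(values, mean, std):
    """Apply an already-fitted mean/std to any split (hypothetical helper)."""
    return [(v - mean) / std for v in values]


# Both splits are transformed with the *training* statistics.
train_y_norm = standardize(train_y, train_y_mean, train_y_std)
test_y_norm = standardize(test_y, train_y_mean, train_y_std)

# Leaky variant: the mean is shifted by test labels the model should never see.
leaky_mean = statistics.mean(train_y + test_y)
```

In this toy case the train-only mean is 5.0, while the leaky mean is 7.0, so the test labels would silently change how the training targets are scaled.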

Testing

Run: python check_normalization.py

Result: All 50 notebooks verified - no data leakage issues found

Closes #819

- Add NORMALIZATION_BEST_PRACTICES.md documenting correct data normalization
- Add check_normalization.py script to verify notebooks follow best practices
- Verified all 50 example notebooks correctly normalize using only training statistics
- Addresses issue cornellius-gp#819 by documenting and verifying correct practices

Closes cornellius-gp#819
Copilot AI review requested due to automatic review settings May 4, 2026 10:34

Copilot AI left a comment

Pull request overview

Adds documentation and an automated check to help prevent data leakage from improper train/test normalization in the GPyTorch example notebooks (addressing issue #819).

Changes:

  • Added a standalone normalization best-practices document with “correct vs incorrect” examples.
  • Added check_normalization.py to scan example notebooks for common leakage patterns.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| NORMALIZATION_BEST_PRACTICES.md | Documents recommended normalization approach and links to the verification script. |
| check_normalization.py | Provides a repository-wide notebook scan intended to detect train/test normalization leakage. |
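
To make concrete what a scan for leakage patterns could look like, here is a rough sketch. It is not the script from this PR: the function name `find_leakage_candidates`, the `window` parameter, and the matching rules are assumptions chosen for illustration. It flags code-cell lines that concatenate train and test data when a `.mean(` or `.std(` call appears on that line or within the next few lines.

```python
import json


def find_leakage_candidates(notebook_json, window=2):
    """Return (cell_index, line_number, line) tuples for suspicious lines.

    Hypothetical heuristic: a line joins train and test data with
    cat(...) or concatenate(...), and a mean/std call follows nearby.
    """
    notebook = json.loads(notebook_json)
    hits = []
    for cell_index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        src = cell.get("source", [])
        # nbformat stores source as a list of lines or as a single string.
        source = "".join(src) if isinstance(src, list) else src
        lines = source.split("\n")
        for i, line in enumerate(lines):
            if "cat(" not in line and "concatenate(" not in line:
                continue
            if "train" not in line or "test" not in line:
                continue
            # Look at this line plus the next `window` lines.
            context = "\n".join(lines[i : i + window + 1])
            if ".mean(" in context or ".std(" in context:
                hits.append((cell_index, i + 1, line.strip()))
    return hits


# A tiny in-memory notebook: one leaky cell, one clean cell.
demo = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Demo"]},
        {"cell_type": "code", "source": [
            "all_y = torch.cat([train_y, test_y])\n",
            "mean = all_y.mean()\n",
        ]},
        {"cell_type": "code", "source": [
            "train_y_mean = train_y.mean()\n",
            "train_y_std = train_y.std()\n",
        ]},
    ]
})
hits = find_leakage_candidates(demo)
```

String matching of this kind is only a heuristic. It can miss leakage expressed differently (for example, statistics computed before a split) and can flag harmless concatenations, so a clean result is weaker evidence than a manual review.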


Comment thread check_normalization.py
Comment on lines +34 to +36
# Check if mean/std is computed on concatenated data
if '.mean()' in context or '.std()' in context:
    issues.append({
Comment thread check_normalization.py
Comment on lines +60 to +66
print("Checking all example notebooks for normalization issues...")
print("=" * 80)

all_notebooks = glob.glob('examples/**/*.ipynb', recursive=True)
problematic_notebooks = []

for notebook_path in sorted(all_notebooks):
Comment thread check_normalization.py
Comment on lines +69 to +72
if error:
    print(f"\n❌ Error reading {notebook_path}: {error}")
    continue

Comment thread check_normalization.py
Comment on lines +21 to +24
continue

source = ''.join(cell.get('source', []))
lines = source.split('\n')

# Normalize labels using ONLY training statistics
train_y_mean = train_y.mean()
train_y_std = train_y.std()
Comment on lines +95 to +98
---

*Last updated: 2024*
*Verified: All 50 example notebooks follow these best practices*

Development

Successfully merging this pull request may close these issues.

[Bug] Some examples normalize training data with test data
