docs: Add normalization best practices and verification script by r69shabh · Pull Request #2752 · cornellius-gp/gpytorch

r69shabh · 2026-05-04T10:34:06Z

Summary

This PR addresses issue #819 by adding comprehensive documentation and verification tools for data normalization best practices in GPyTorch examples.

Changes

Added NORMALIZATION_BEST_PRACTICES.md - Documents the correct approach to data normalization
Added check_normalization.py - Automated verification script to check all example notebooks
Verified all 50 example notebooks follow correct normalization practices

Background

Issue #819 was opened in 2019, reporting that some examples normalized training data using statistics computed from both the train and test sets. After investigating, I found that all current example notebooks compute normalization statistics from the training data only.
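To make the distinction concrete, here is a minimal sketch of train-only normalization next to the leaky variant described in the issue. It is not taken from the PR's files: the helper names `fit_normalizer` and `apply_normalizer` and the toy data are hypothetical, and it uses only the standard library so it runs without torch. With tensors, the same arithmetic would typically be written with `train_y.mean()` and `train_y.std()`.

```python
from statistics import mean, stdev


def fit_normalizer(train_values):
    """Compute (mean, std) from the training split only."""
    mu = mean(train_values)
    sigma = stdev(train_values)  # sample standard deviation (n - 1 in the denominator)
    return mu, sigma


def apply_normalizer(values, mu, sigma):
    """Standardize any split with statistics that were fit elsewhere."""
    return [(v - mu) / sigma for v in values]


# Toy data, invented for illustration.
train_y = [1.0, 2.0, 3.0, 4.0]
test_y = [10.0, 12.0]

# Correct: statistics come from the training split and are reused for the test split.
mu, sigma = fit_normalizer(train_y)
train_y_norm = apply_normalizer(train_y, mu, sigma)
test_y_norm = apply_normalizer(test_y, mu, sigma)

# Leaky, shown only for contrast: the test labels shift the statistics.
leaky_mu = mean(train_y + test_y)

print(f"train-only mean: {mu}, leaky mean: {leaky_mu:.3f}")
```

The point of splitting "fit" from "apply" is that the test split can never influence the statistics, because it is never passed to the fitting step.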

Testing

Run: python check_normalization.py

Result: All 50 notebooks verified - no data leakage issues found

Closes #819

- Add NORMALIZATION_BEST_PRACTICES.md documenting correct data normalization
- Add check_normalization.py script to verify notebooks follow best practices
- Verified all 50 example notebooks correctly normalize using only training statistics
- Addresses issue cornellius-gp#819 by documenting and verifying correct practices

Closes cornellius-gp#819

Copilot

Pull request overview

Adds documentation and an automated check to help prevent data leakage from improper train/test normalization in the GPyTorch example notebooks (addressing issue #819).

Changes:

Added a standalone normalization best-practices document with “correct vs incorrect” examples.
Added check_normalization.py to scan example notebooks for common leakage patterns.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `NORMALIZATION_BEST_PRACTICES.md` | Documents recommended normalization approach and links to the verification script. |
| `check_normalization.py` | Provides a repository-wide notebook scan intended to detect train/test normalization leakage. |
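As a rough illustration of what such a scan might look like, the sketch below flags code cells where train and test data are concatenated and a `.mean()` or `.std()` call appears within a few lines. It is an assumption-laden sketch, not the PR's actual `check_normalization.py`: the function name `find_leakage_candidates`, the regular expressions, and the `window` parameter are all hypothetical, and a text heuristic like this can produce both false positives and false negatives.

```python
import re

# Hypothetical patterns; the PR's actual script may use different heuristics.
CONCAT_PATTERN = re.compile(
    r"\b(?:torch\.cat|torch\.vstack|np\.concatenate|np\.vstack)\s*\("
)
STAT_PATTERN = re.compile(r"\.(?:mean|std)\s*\(")


def cell_source(cell):
    """Notebook cells store source as a string or a list of strings."""
    source = cell.get("source", [])
    return source if isinstance(source, str) else "".join(source)


def find_leakage_candidates(notebook, window=3):
    """Return (cell_index, line_number, line) for suspicious concatenations."""
    findings = []
    for cell_index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        lines = cell_source(cell).split("\n")
        for line_index, line in enumerate(lines):
            if not CONCAT_PATTERN.search(line):
                continue
            # Only care about concatenations that mix both splits.
            if "train" not in line or "test" not in line:
                continue
            # Look at this line and the next few for a statistics call.
            context = "\n".join(lines[line_index:line_index + window + 1])
            if STAT_PATTERN.search(context):
                findings.append((cell_index, line_index + 1, line.strip()))
    return findings


# In-memory stand-in for a parsed .ipynb file (invented for illustration).
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Demo"]},
        {"cell_type": "code", "source": [
            "all_x = torch.cat([train_x, test_x])\n",
            "mean = all_x.mean()\n",
        ]},
        {"cell_type": "code", "source": [
            "mean = train_x.mean()\n",
            "std = train_x.std()\n",
        ]},
    ]
}

findings = find_leakage_candidates(example)
print(findings)
```

A real notebook can be parsed into the same dictionary shape with `json.load`, since `.ipynb` files are JSON documents with a top-level `cells` list.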


+                    # Check if mean/std is computed on concatenated data
+                    if '.mean()' in context or '.std()' in context:
+                        issues.append({


+    print("Checking all example notebooks for normalization issues...")
+    print("=" * 80)
+
+    all_notebooks = glob.glob('examples/**/*.ipynb', recursive=True)
+    problematic_notebooks = []
+
+    for notebook_path in sorted(all_notebooks):


+        if error:
+            print(f"\n❌ Error reading {notebook_path}: {error}")
+            continue
+


+                continue
+
+            source = ''.join(cell.get('source', []))
+            lines = source.split('\n')


+
+# Normalize labels using ONLY training statistics
+train_y_mean = train_y.mean()
+train_y_std = train_y.std()


+---
+
+*Last updated: 2024*
+*Verified: All 50 example notebooks follow these best practices*


Copilot AI review requested due to automatic review settings May 4, 2026 10:34

Copilot started reviewing on behalf of r69shabh May 4, 2026 10:34

Copilot AI reviewed May 4, 2026

r69shabh wants to merge 1 commit into cornellius-gp:main from r69shabh:fix/normalization-best-practices-819
