NMR Mixture Quantification - Issues and Ideas Log

Current Status

  • 14 pure metabolite references validated with Lorentzian fitting (R² > 0.97 for most)
  • Model mixture File 10 quantified with Lorentzian method
  • Attempted LASSO approach but needs refinement

Issues Encountered

Issue 1: LASSO Column Normalization

Problem: LASSO failed with an extreme R² (-2.9e+27) because dictionary columns weren't normalized
Status: Partially fixed (added unit-norm normalization)
Next Step: Test if normalization alone fixes the problem
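A minimal sketch of unit-norm column normalization, assuming the dictionary is a NumPy array `D` with one reference spectrum per column (the name `normalize_columns` and the toy numbers are hypothetical). The point it illustrates is that coefficients fitted against the normalized dictionary have to be divided by the column norms to get back to the original units.

```python
import numpy as np

def normalize_columns(D):
    """Scale each dictionary column to unit L2 norm; return scaled matrix and norms."""
    norms = np.linalg.norm(D, axis=0)
    norms[norms == 0] = 1.0  # guard against all-zero reference columns
    return D / norms, norms

# Toy dictionary: 3 spectral points x 2 references on very different scales
D = np.array([[800.0, 0.0],
              [0.0,   0.5],
              [0.0,   0.0]])
D_unit, norms = normalize_columns(D)

# Coefficients fitted against D_unit are in "normalized" units;
# dividing by the norms converts them back to the scale of D.
coef_unit = np.array([1600.0, 1.0])
coef_original = coef_unit / norms
```

Both coefficient vectors reconstruct the same spectrum, so normalization changes how the regularizer sees each column, not the underlying fit.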

Issue 2: LASSO Sparsity Too Aggressive

Problem: With normalized columns, LASSO only detected Tyrosine (1.38 mM), missing all other metabolites
Possible Causes:

  • Alpha too large (regularization too strong)
  • Columns have very different scales even after normalization
  • Mixture spectrum has different baseline/offset than references

Next Step: Try smaller alpha, or use LassoCV with wider alpha range
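To make the "alpha too large" cause concrete, here is a small sketch on a toy two-column dictionary (all values hypothetical, not taken from File 10). scikit-learn's `Lasso` minimizes `(1 / (2 * n_samples)) * ||y - Xw||² + alpha * ||w||_1`, so for non-overlapping unit columns each coefficient is shrunk by `n_samples * alpha` and clipped at zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy dictionary with two non-overlapping unit-norm "references"
D = np.zeros((4, 2))
D[0, 0] = 1.0   # strong metabolite occupies point 0
D[1, 1] = 1.0   # weak metabolite occupies point 1
mixture = np.array([10.0, 0.1, 0.0, 0.0])   # 100:1 dynamic range

def fit_lasso(alpha):
    # positive=True enforces non-negative concentrations;
    # fit_intercept=False keeps the model a pure linear mixture.
    model = Lasso(alpha=alpha, positive=True, fit_intercept=False)
    return model.fit(D, mixture).coef_

coef_strong_reg = fit_lasso(alpha=0.1)    # weak component is shrunk to zero
coef_weak_reg = fit_lasso(alpha=0.001)    # weak component survives

# A wider, log-spaced grid of candidate alphas for a later search
alpha_grid = np.logspace(-6, 0, 25)
```

A grid like `alpha_grid` could then be passed to `LassoCV` through its `alphas` argument; whether cross-validation picks a small enough value on real spectra is something to test, not assume.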

Issue 3: Dynamic Range Problem

Problem: Lactate (800) vs Tyrosine (0.5) = 1600:1 dynamic range
LASSO Impact: Strong peaks dominate, weak peaks get zeroed out
Potential Solutions:

  • Region-wise fitting (aliphatic vs CH/OH separately)
  • Iterative subtraction (CLEAN-like algorithm)
  • Weighted LASSO (weight by peak uniqueness)
  • Log transform (but breaks linearity - see Issue 4)

Issue 4: Log Transform Breaks Linearity

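Problem: log(A + B) ≠ log(A) + log(B)
Impact: NMR signal intensity is proportional to concentration, so a mixture spectrum is a linear sum of its component spectra (the same kind of additivity the Beer-Lambert law gives for absorbance); a log transform breaks that physics
Verdict: Not recommended for mixture quantification
Alternative: Robust scaling or percentile-based scaling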
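A small numeric check of this point, using two made-up three-point "spectra" (`a`, `b` and the mixing weights are hypothetical). Least squares recovers the mixing weights in the linear domain, but not after a `log1p` transform of both dictionary and mixture.

```python
import numpy as np

a = np.array([0.0, 4.0, 1.0])   # hypothetical spectrum of component A
b = np.array([2.0, 4.0, 0.0])   # hypothetical spectrum of component B
mix = 3.0 * a + 0.5 * b         # linear mixing model

# Linear domain: least squares recovers the mixing weights
D = np.column_stack([a, b])
coef_linear, *_ = np.linalg.lstsq(D, mix, rcond=None)

# Log domain: log1p(mix) is no longer a weighted sum of log1p(a), log1p(b)
D_log = np.log1p(D)
coef_log, *_ = np.linalg.lstsq(D_log, np.log1p(mix), rcond=None)
```

Here `coef_linear` matches the true weights (3.0, 0.5), while `coef_log` lands far from them, which is the linearity problem in miniature.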

Issue 5: Concentration-Dependent Chemical Shifts

Metabolites Affected: Asparagine, Aspartate, Glutamate
Observation: Peaks shift to lower ppm as concentration decreases (pH effect)
Current Solution: Wide integration regions + dynamic peak detection
LASSO Solution: Shift-tolerant dictionary (multiple shifted versions)
Status: Implemented but needs testing
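One way a shift-tolerant dictionary could be built is sketched below. It is not the implementation referred to above: the helper name `shifted_versions`, the shift grid, and the increasing ppm axis are all assumptions made for illustration.

```python
import numpy as np

def shifted_versions(ppm, spectrum, shifts_ppm):
    """Return a matrix whose columns are `spectrum` displaced by each shift.

    A positive shift moves peaks toward higher ppm. `ppm` must be increasing
    (np.interp requirement); values shifted in from outside the axis are zero.
    """
    cols = [np.interp(ppm - s, ppm, spectrum, left=0.0, right=0.0)
            for s in shifts_ppm]
    return np.column_stack(cols)

ppm = np.linspace(2.0, 3.0, 101)                       # increasing axis, 0.01 ppm step
reference = 1.0 / (1.0 + ((ppm - 2.50) / 0.01) ** 2)   # hypothetical peak at 2.50 ppm
atoms = shifted_versions(ppm, reference, shifts_ppm=[-0.02, 0.0, 0.02])
peak_positions = ppm[atoms.argmax(axis=0)]
```

If each metabolite contributes several shifted columns, one plausible convention is to sum the coefficients of its copies to get a single concentration. Note that the shifted copies are strongly correlated with each other, which adds to the collinearity of the dictionary.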


Ideas to Try

Idea 1: Region-Wise LASSO

Approach: Divide spectrum into regions, fit separately

Region 1: 0.5-2.0 ppm (aliphatic CH3/CH2)
Region 2: 2.0-3.5 ppm (aliphatic with heteroatoms)
Region 3: 3.5-5.5 ppm (CH-OH, anomeric)
Region 4: 6.5-8.0 ppm (aromatic)

Pros: Each region has smaller dynamic range, region-specific alpha
Cons: Metabolites spanning regions need stitching
Priority: High
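A sketch of the region split, using the four ppm windows listed above. Plain least squares stands in for LASSO here to keep the example dependency-free; the region key names, `fit_by_region`, and the box-shaped toy references are hypothetical.

```python
import numpy as np

# Region boundaries in ppm, copied from the list above
REGIONS = {
    "aliphatic_CH3_CH2": (0.5, 2.0),
    "aliphatic_heteroatom": (2.0, 3.5),
    "CH_OH_anomeric": (3.5, 5.5),
    "aromatic": (6.5, 8.0),
}

def fit_by_region(ppm, D, mixture):
    """Least-squares fit of `mixture` against dictionary `D` in each region.

    Returns {region: coefficient vector}. References with no signal in a
    region get NaN there, so they can be skipped when stitching.
    """
    results = {}
    for name, (lo, hi) in REGIONS.items():
        mask = (ppm >= lo) & (ppm < hi)
        active = np.abs(D[mask]).sum(axis=0) > 0   # references present in this region
        coef = np.full(D.shape[1], np.nan)
        if active.any():
            sol, *_ = np.linalg.lstsq(D[mask][:, active], mixture[mask], rcond=None)
            coef[active] = sol
        results[name] = coef
    return results

# Toy data: one aliphatic-only and one aromatic-only reference
ppm = np.linspace(0.5, 8.0, 751)
ref_a = ((ppm > 1.2) & (ppm < 1.4)).astype(float)
ref_b = ((ppm > 6.9) & (ppm < 7.1)).astype(float)
D = np.column_stack([ref_a, ref_b])
mixture = 40.0 * ref_a + 2.0 * ref_b
results = fit_by_region(ppm, D, mixture)
```

The stitching problem from the Cons line shows up as soon as a metabolite has non-NaN coefficients in more than one region; how to reconcile them (average, pick the cleanest region, weight by peak uniqueness) is left open.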

Idea 2: Iterative Subtraction (CLEAN Algorithm)

Approach:

1. Find strongest metabolite in mixture
2. Fit concentration
3. Subtract scaled pure spectrum
4. Repeat with residual
5. Stop when the residual reaches the noise level

Pros: Physically interpretable, handles dynamic range naturally
Cons: Computationally slower, order-dependent
Priority: High
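The five steps can be sketched as a greedy loop over dictionary columns (essentially matching pursuit). This is an illustrative version, not the code used for Test 3; `clean_quantify`, `noise_level`, and the toy dictionary are assumptions, and the columns are assumed to be non-zero.

```python
import numpy as np

def clean_quantify(D, mixture, noise_level, max_iter=50):
    """Greedy iterative subtraction over dictionary columns (CLEAN-like sketch)."""
    residual = mixture.astype(float).copy()
    coef = np.zeros(D.shape[1])
    col_energy = np.einsum("ij,ij->j", D, D)       # ||d_j||^2 for each column
    for _ in range(max_iter):
        if np.sqrt(np.mean(residual ** 2)) <= noise_level:
            break                                  # step 5: residual at noise level
        proj = D.T @ residual
        scores = proj / np.sqrt(col_energy)        # step 1: strongest match
        j = int(np.argmax(scores))
        if scores[j] <= 0:
            break                                  # nothing positive left to explain
        step = proj[j] / col_energy[j]             # step 2: 1-D least-squares scale
        coef[j] += step
        residual -= step * D[:, j]                 # steps 3-4: subtract and repeat
    return coef, residual

# Toy dictionary with non-overlapping references and a 1600:1 dynamic range
D = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
mixture = 800.0 * D[:, 0] + 0.5 * D[:, 1]
coef, residual = clean_quantify(D, mixture, noise_level=1e-6)
```

With non-overlapping columns the loop recovers both components exactly. With strongly overlapping references, the first subtraction can absorb signal that belongs to other metabolites, which is the order dependence noted in the Cons.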

Idea 3: Weighted Dictionary (Reference Concentration)

Approach: Weight dictionary columns by reference concentration

weighted_spec = pure_spec * ref_conc / unit_norm
# Now coefficient = relative concentration (0-1 scale)

Pros: All coefficients on same scale (0-1), easier regularization
Cons: Mixture spectrum must be in comparable units
Priority: Medium

Idea 4: Ridge Regression (L2 instead of L1)

Approach: Use Ridge (L2) or ElasticNet (L1+L2) instead of pure LASSO

from sklearn.linear_model import Ridge, ElasticNet

# Ridge: All metabolites get non-zero values
# ElasticNet: Balance between sparsity and smoothness

Pros: Ridge handles correlated features better (overlapping peaks)
Cons: No sparsity (all 14 metabolites will have values)
Priority: Medium
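To show what "handles correlated features better" means in practice, here is the closed-form ridge solution in NumPy on two identical toy references (all values hypothetical). It solves the same objective as `Ridge(alpha=..., fit_intercept=False)`, namely `||Dw - y||² + alpha * ||w||²`.

```python
import numpy as np

def ridge_coefficients(D, y, alpha):
    """Closed-form ridge solution: argmin ||D w - y||^2 + alpha * ||w||^2."""
    n_features = D.shape[1]
    return np.linalg.solve(D.T @ D + alpha * np.eye(n_features), D.T @ y)

# Two perfectly overlapping hypothetical references
d = np.array([1.0, 2.0, 1.0])
D = np.column_stack([d, d])
y = 2.0 * d
w = ridge_coefficients(D, y, alpha=1e-3)
```

Ridge returns a stable answer and splits the weight evenly between the two identical columns. That is numerically well behaved, but it also means ridge cannot say which of two overlapping metabolites is actually present.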

Idea 5: Non-Negative Least Squares (NNLS)

Approach: Pure least squares with non-negativity constraint

from scipy.optimize import nnls

# No regularization, just minimize ||Ax - b||^2 with x >= 0

Pros: Simple, no hyperparameter tuning (no alpha)
Cons: No sparsity, may overfit to noise
Priority: High (as baseline comparison)
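A minimal NNLS call on a toy dictionary, assuming columns are reference spectra scaled to 1 mM (the matrix and concentrations are made up). `scipy.optimize.nnls` returns the solution vector and the residual 2-norm.

```python
import numpy as np
from scipy.optimize import nnls

# Toy dictionary: columns are hypothetical reference spectra at 1 mM
D = np.array([[1.0, 0.0, 0.2],
              [0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [0.0, 0.0, 0.5]])
true_conc = np.array([2.0, 0.0, 3.0])
mixture = D @ true_conc

# Minimizes ||D x - mixture||_2 subject to x >= 0
conc, residual_norm = nnls(D, mixture)
```

On this well-conditioned, noise-free example NNLS recovers the true concentrations, including the exact zero. It says nothing about how the solver behaves when columns are nearly collinear.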

Idea 6: Feature Selection Before LASSO

Approach: Identify which metabolites are present first, then quantify

# Step 1: Use correlation or peak detection to identify present metabolites
# Step 2: Build dictionary with only those metabolites
# Step 3: Use NNLS or small-alpha LASSO for quantification

Pros: Reduces dictionary size, avoids sparsity killing weak signals
Cons: Two-step process, detection errors propagate
Priority: Medium

Idea 7: Cross-Validation for Alpha with Custom Scorer

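Approach: Select alpha with a score that penalizes physically unreasonable concentrations. LassoCV itself picks alpha by cross-validated MSE and takes no custom scorer, so this would run in a manual loop over alphas.

import numpy as np
from sklearn.metrics import mean_squared_error

def custom_scorer(y_true, y_pred, coef):
    """MSE plus a penalty per coefficient above 50 mM (lower is better)."""
    mse = mean_squared_error(y_true, y_pred)
    # Penalize each metabolite whose fitted concentration exceeds 50 mM
    unreasonable = np.sum(coef > 50)
    return mse + 0.1 * unreasonable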

Pros: Encourages physically reasonable solutions
Cons: More complex, may over-constrain
Priority: Low


Test Results Log

Test 1: Basic LASSO with Unit Norm Normalization

Date: 2024-03-24
Result: Only Tyrosine detected (1.38 mM), R² negative
Conclusion: Normalization alone not sufficient

Test 2: NNLS Baseline

Date: 2024-03-24
Result: All signal attributed to Tyrosine (923 mM), R² = -239
Root Cause: Dictionary columns highly correlated (overlapping peaks)
Conclusion: Full-spectrum approach with highly overlapping references doesn't work
Lesson: Need feature selection or region-wise approach

Test 3: Iterative Subtraction (CLEAN)

Date: 2024-03-24
Result: Only Alanine detected (65 mM), stopped after 1 iteration
Root Cause: After subtracting Alanine, correlation to all others = 0
Key Finding: Pure spectra correlation matrix shows 0.9+ correlation between many metabolites
Conclusion: Full-spectrum correlation-based methods fail for NMR


Critical Finding

The full-spectrum approaches tested so far (LASSO, NNLS, iterative subtraction) DID NOT WORK for this NMR mixture quantification because:

  1. High spectral overlap: CH3 peaks (1.3-1.5 ppm) look similar across metabolites
  2. Collinearity: Pure spectra correlation > 0.9 for many pairs
  3. Ill-conditioned problem: Design matrix X is nearly singular
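Points 2 and 3 can be checked directly on any dictionary. The sketch below reports the largest off-diagonal correlation and the condition number; `collinearity_report` and the toy matrices are hypothetical, not the 14-reference dictionary.

```python
import numpy as np

def collinearity_report(D):
    """Return (max off-diagonal |correlation|, condition number) for dictionary D."""
    corr = np.corrcoef(D, rowvar=False)            # columns are the variables
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.max(np.abs(off_diag))), float(np.linalg.cond(D))

# Nearly identical columns vs. non-overlapping columns
a = np.array([1.0, 2.0, 3.0, 4.0])
D_overlap = np.column_stack([a, a + np.array([0.0, 0.0, 0.0, 0.1])])
D_distinct = np.eye(4)[:, :2]

max_corr_overlap, cond_overlap = collinearity_report(D_overlap)
max_corr_distinct, cond_distinct = collinearity_report(D_distinct)
```

A large condition number means small changes in the mixture spectrum (noise, baseline) can produce large swings in the fitted coefficients, which is consistent with one metabolite absorbing all the signal.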

What DOES work (our Lorentzian method):

  • Region-specific fitting (1.48 ppm for Alanine, 1.33 ppm for Lactate)
  • Dynamic peak detection within specific windows
  • Peak-by-peak quantification, not full-spectrum deconvolution
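As a rough illustration of window-based Lorentzian fitting (not the project's actual fitting code), the sketch below fits a single line inside a ppm window with `scipy.optimize.curve_fit`. The function names, window limits, initial width guess, and synthetic peak are all assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def lorentzian(x, amplitude, center, hwhm):
    """Single Lorentzian line; `hwhm` is the half width at half maximum."""
    return amplitude * hwhm ** 2 / ((x - center) ** 2 + hwhm ** 2)

def fit_peak(ppm, spectrum, window, hwhm_guess=0.01):
    """Fit one Lorentzian to the data inside `window` = (lo, hi) in ppm."""
    lo, hi = window
    mask = (ppm >= lo) & (ppm <= hi)
    x, y = ppm[mask], spectrum[mask]
    # Seed the fit from the tallest point in the window (dynamic peak detection)
    p0 = [y.max(), x[np.argmax(y)], hwhm_guess]
    popt, _ = curve_fit(lorentzian, x, y, p0=p0)
    return dict(zip(("amplitude", "center", "hwhm"), popt))

# Synthetic, noise-free peak placed slightly off the nominal 1.48 ppm position
ppm = np.linspace(1.0, 2.0, 2001)
spectrum = lorentzian(ppm, 5.0, 1.482, 0.008)
fit = fit_peak(ppm, spectrum, window=(1.44, 1.52))
```

For this parametrization the peak area is `pi * amplitude * hwhm`, so a fitted amplitude and width can be converted to an integral and compared against the reference at known concentration.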

Revised Strategy

Option A: Hybrid ML + Physics (Recommended)

Use ML for peak detection/preprocessing, Lorentzian for quantification:

1. ML: Detect which peaks are present (classification)
2. ML: Estimate initial peak positions (regression)
3. Physics: Lorentzian fitting with ML-initiated parameters

Option B: Feature-Based ML

Extract features first, then ML:

1. Extract: Peak heights at known chemical shifts
2. Extract: Integrals in specific regions
3. ML: Random Forest/XGBoost on features (not raw spectrum)

Option C: Deep Learning with Physics Constraints

Neural network that outputs Lorentzian parameters directly:

Input: Spectrum
Output: {amplitude_i, center_i, width_i} for each metabolite
Loss: Reconstruction error + physics penalties

Decision

Stick with current Lorentzian method - it's physically sound and working well (R² > 0.97 for most metabolites). ML can help with automation but shouldn't replace the physics.

Test 2: [To be run]


Next Action Items

  1. Run NNLS as baseline (Idea 5) - simplest approach
  2. Try Ridge/ElasticNet (Idea 4) - compare with LASSO
  3. Implement iterative subtraction (Idea 2) - physically motivated
  4. Test region-wise fitting (Idea 1) - if above fail
  5. Consider weighted dictionary (Idea 3) - for fine-tuning

Key Metrics to Track

For each method, record:

  • R² (reconstruction quality)
  • Number of metabolites detected
  • Sum of concentrations (should be ~100 mM for cell culture)
  • Comparison with Lorentzian results
  • Physical reasonableness (all concentrations > 0, none > 200 mM)
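The first three metrics and the reasonableness check could be collected by a small helper like the one below; the comparison with Lorentzian results is left out. The name `method_metrics` and its thresholds are hypothetical, and the check uses `>= 0` rather than `> 0` so that sparse solutions with exact zeros are not flagged.

```python
import numpy as np

def method_metrics(D, mixture, conc, detect_threshold=0.0, max_conc=200.0):
    """Summarize one fit; `conc` holds fitted concentrations in mM."""
    fitted = D @ conc
    ss_res = np.sum((mixture - fitted) ** 2)
    ss_tot = np.sum((mixture - mixture.mean()) ** 2)
    return {
        "r2": 1.0 - ss_res / ss_tot,   # can be hugely negative for a bad fit
        "n_detected": int(np.sum(conc > detect_threshold)),
        "total_mM": float(np.sum(conc)),
        "physically_reasonable": bool(np.all(conc >= 0) and np.all(conc <= max_conc)),
    }

# Toy check: a perfect fit and an obviously unphysical one
D = np.eye(3)
mixture = np.array([1.0, 2.0, 3.0])
good = method_metrics(D, mixture, np.array([1.0, 2.0, 3.0]))
bad = method_metrics(D, mixture, np.array([0.0, 0.0, 300.0]))
```

Because R² is computed as `1 - ss_res / ss_tot`, a reconstruction that overshoots the mixture by orders of magnitude gives a large negative value, which is how results such as R² = -239 arise.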