- 14 pure metabolite references validated with Lorentzian fitting (R² > 0.97 for most)
- Model mixture File 10 quantified with Lorentzian method
- Attempted LASSO approach but needs refinement
Issue 1: Dictionary columns not normalized
Problem: LASSO failed with extreme R² (-2.9e+27) because dictionary columns weren't normalized
Status: Partially fixed (added unit-norm normalization)
Next Step: Test if normalization alone fixes the problem
Issue 2: LASSO detects only one metabolite
Problem: With normalized columns, LASSO only detected Tyrosine (1.38 mM), missing all other metabolites
Possible Causes:
- Alpha too large (regularization too strong)
- Columns have very different scales even after normalization
- Mixture spectrum has different baseline/offset than references
Next Step: Try smaller alpha, or use LassoCV with wider alpha range
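The normalization fix and the "wider alpha range" next step can be sketched together. This is a minimal sketch, not the project's code: `fit_lasso_normalized`, `D`, and `y` are hypothetical names, and the alpha grid (eight decades below the alpha that zeroes all coefficients) is an assumption. `fit_intercept=True` is used to absorb a constant baseline offset, and shuffled folds are used so that no fold loses an entire peak.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

def fit_lasso_normalized(D, y, n_alphas=60, span_decades=8):
    """D: (n_points, n_metabolites) pure spectra; y: (n_points,) mixture."""
    norms = np.linalg.norm(D, axis=0)
    norms[norms == 0] = 1.0           # guard against empty columns
    Dn = D / norms                    # unit-norm columns

    # Alpha that zeroes every coefficient of the centered problem;
    # the grid spans `span_decades` orders of magnitude below it.
    Dc, yc = Dn - Dn.mean(axis=0), y - y.mean()
    alpha_max = np.max(np.abs(Dc.T @ yc)) / len(y)
    alphas = alpha_max * np.logspace(-span_decades, 0, n_alphas)

    model = LassoCV(
        alphas=alphas,
        positive=True,                # concentrations cannot be negative
        fit_intercept=True,           # absorbs a constant baseline offset
        cv=KFold(n_splits=5, shuffle=True, random_state=0),
        tol=1e-8,                     # tight tolerance so weak columns converge
        max_iter=100_000,
    )
    model.fit(Dn, y)
    # Undo the normalization so coefficients refer to the original columns
    return model.coef_ / norms, model.alpha_

# Synthetic check (random columns, not real spectra)
rng = np.random.default_rng(0)
D_demo = rng.random((300, 3)) * np.array([20.0, 1.0, 5.0])
y_demo = D_demo @ np.array([2.0, 0.5, 0.0]) + 0.3
coef_demo, alpha_demo = fit_lasso_normalized(D_demo, y_demo)
```

Dividing the fitted coefficients by the column norms matters: without it the reported values are in normalized units and cannot be compared with reference concentrations.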
Issue 3: Dynamic range
Problem: Lactate (800) vs Tyrosine (0.5) = 1600:1 dynamic range
LASSO Impact: Strong peaks dominate, weak peaks get zeroed out
Potential Solutions:
- Region-wise fitting (aliphatic vs CH/OH separately)
- Iterative subtraction (CLEAN-like algorithm)
- Weighted LASSO (weight by peak uniqueness)
- Log transform (but breaks linearity - see Issue 4)
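Issue 4: Log transform breaks linearity
Problem: log(A + B) ≠ log(A) + log(B)
Impact: NMR signal intensity is linear in concentration, so a mixture spectrum is a weighted sum of the pure spectra; a log transform destroys that additivity
Verdict: Not recommended for mixture quantification
Alternative: Robust scaling or percentile-based scaling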
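A small numeric check makes the point concrete. The intensities below are made-up numbers chosen to mimic the Lactate/Tyrosine range; the only claim is that a log transform breaks additivity while dividing by a single constant (here a percentile of the mixture) does not.

```python
import numpy as np

# Two hypothetical pure-component intensities at two spectral points
strong = np.array([800.0, 10.0])
weak = np.array([0.5, 10.0])
mixture = strong + weak               # linear superposition

# Log transform: the mixture is no longer the sum of its parts
log_of_sum = np.log(mixture)
sum_of_logs = np.log(strong) + np.log(weak)
log_is_additive = np.allclose(log_of_sum, sum_of_logs)

# Scaling by one constant keeps the superposition intact
scale = np.percentile(mixture, 99)
scaling_is_additive = np.allclose(mixture / scale,
                                  strong / scale + weak / scale)
```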
Issue 5: Concentration-dependent chemical shifts
Metabolites Affected: Asparagine, Aspartate, Glutamate
Observation: Peaks shift to lower ppm as concentration decreases (pH effect)
Current Solution: Wide integration regions + dynamic peak detection
LASSO Solution: Shift-tolerant dictionary (multiple shifted versions)
Status: Implemented but needs testing
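One way a shift-tolerant dictionary could be built is sketched below. This is not the implemented version referred to above; `build_shifted_dictionary` and `collapse_by_metabolite` are hypothetical helpers. It assumes an ascending ppm axis and shifts the whole reference spectrum rigidly, which is a simplification since only the pH-sensitive peaks actually move.

```python
import numpy as np

def build_shifted_dictionary(ppm, pure_spectra, shifts_ppm):
    """pure_spectra: dict name -> intensities on `ppm` (ppm must be ascending)."""
    columns, labels = [], []
    for name, spec in pure_spectra.items():
        for s in shifts_ppm:
            # shifted(ppm) = spec(ppm - s): a peak at c moves to c + s
            shifted = np.interp(ppm - s, ppm, spec, left=0.0, right=0.0)
            columns.append(shifted)
            labels.append((name, s))
    return np.column_stack(columns), labels

def collapse_by_metabolite(coefs, labels):
    """Sum the coefficients of all shifted copies of each metabolite."""
    totals = {}
    for c, (name, _) in zip(coefs, labels):
        totals[name] = totals.get(name, 0.0) + float(c)
    return totals

# Synthetic single-peak reference at 2.80 ppm (illustrative only)
ppm = np.linspace(0.0, 10.0, 1001)
reference = 1.0 / (1.0 + ((ppm - 2.80) / 0.01) ** 2)
D_shift, labels = build_shifted_dictionary(
    ppm, {"Aspartate": reference}, shifts_ppm=[-0.02, 0.0, 0.02])
totals = collapse_by_metabolite([0.2, 0.5, 0.0], labels)
```

Shifted copies of the same metabolite are nearly collinear by construction, so they add to the conditioning problems of the dictionary; summing them per metabolite after the fit is what gives a single concentration estimate.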
Idea 1: Region-wise fitting
Approach: Divide spectrum into regions, fit separately
Region 1: 0.5-2.0 ppm (aliphatic CH3/CH2)
Region 2: 2.0-3.5 ppm (aliphatic with heteroatoms)
Region 3: 3.5-5.5 ppm (CH-OH, anomeric)
Region 4: 6.5-8.0 ppm (aromatic)
Pros: Each region has smaller dynamic range, region-specific alpha
Cons: Metabolites spanning regions need stitching
Priority: High
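The region split above can be sketched as follows. This is an illustration under assumptions: it uses NNLS per region rather than a region-specific LASSO alpha, `fit_by_region` is a hypothetical name, and a column only takes part in a region if its largest value there is at least 5% of its global maximum. Stitching per-region estimates for metabolites that span regions is left out.

```python
import numpy as np
from scipy.optimize import nnls

REGIONS = {
    "aliphatic_CH3_CH2": (0.5, 2.0),
    "aliphatic_hetero": (2.0, 3.5),
    "CH_OH_anomeric": (3.5, 5.5),
    "aromatic": (6.5, 8.0),
}

def fit_by_region(ppm, D, y, regions=REGIONS, min_fraction=0.05):
    """Return {region: coefficients}; NaN marks columns not fitted there."""
    col_max = np.abs(D).max(axis=0)
    results = {}
    for name, (lo, hi) in regions.items():
        mask = (ppm >= lo) & (ppm < hi)
        coefs = np.full(D.shape[1], np.nan)
        if mask.any():
            sub = D[mask]
            # Keep only columns with a real peak (not just a tail) in this region
            active = np.flatnonzero(np.abs(sub).max(axis=0) >= min_fraction * col_max)
            if active.size:
                x, _ = nnls(sub[:, active], y[mask])
                coefs[active] = x
        results[name] = coefs
    return results

# Synthetic two-peak check: one aliphatic peak, one aromatic peak
ppm = np.linspace(0.0, 10.0, 2001)
lor = lambda c: 1e-4 / ((ppm - c) ** 2 + 1e-4)
D_reg = np.column_stack([lor(1.33), lor(7.0)])
y_reg = 3.0 * D_reg[:, 0] + 0.5 * D_reg[:, 1]
region_fit = fit_by_region(ppm, D_reg, y_reg)
```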
Idea 2: Iterative subtraction
Approach:
1. Find strongest metabolite in mixture
2. Fit concentration
3. Subtract scaled pure spectrum
4. Repeat with residual
5. Stop when the residual reaches the noise level
Pros: Physically interpretable, handles dynamic range naturally
Cons: Computationally slower, order-dependent
Priority: High
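The five steps above can be sketched as a greedy loop. This version is essentially non-negative matching pursuit, which is one possible reading of "CLEAN-like"; `iterative_subtraction` and the `noise_level` stopping rule (standard deviation of the residual) are assumptions, and the dictionary is assumed to have no all-zero columns.

```python
import numpy as np

def iterative_subtraction(D, y, noise_level, max_iter=50):
    """Greedy fit: repeatedly subtract the best-matching pure spectrum."""
    residual = np.asarray(y, dtype=float).copy()
    conc = np.zeros(D.shape[1])
    norms_sq = np.sum(D ** 2, axis=0)
    for _ in range(max_iter):
        if np.std(residual) <= noise_level:
            break                           # step 5: residual is at noise level
        corr = D.T @ residual
        scales = corr / norms_sq            # step 2: least-squares scale per column
        gain = corr * scales                # squared-error reduction per column
        gain[scales <= 0] = 0.0             # never subtract a negative amount
        k = int(np.argmax(gain))            # step 1: strongest contributor
        if gain[k] == 0.0:
            break
        conc[k] += scales[k]
        residual -= scales[k] * D[:, k]     # step 3: subtract scaled pure spectrum
    return conc, residual

# Synthetic check with two well-separated Gaussian "spectra"
x = np.linspace(0.0, 10.0, 1000)
g = lambda c: np.exp(-((x - c) / 0.05) ** 2)
D_it = np.column_stack([g(2.0), g(7.0)])
y_it = 5.0 * D_it[:, 0] + 1.0 * D_it[:, 1]
conc_it, res_it = iterative_subtraction(D_it, y_it, noise_level=1e-6)
```

With well-separated peaks the loop recovers each coefficient in one pass. With strongly overlapping references the first pick absorbs signal belonging to its neighbours, which is the order dependence listed under Cons.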
Idea 3: Weighted dictionary
Approach: Weight dictionary columns by reference concentration
weighted_spec = pure_spec * ref_conc / unit_norm
# Now coefficient = relative concentration (0-1 scale)
Pros: All coefficients on same scale (0-1), easier regularization
Cons: Mixture spectrum must be in comparable units
Priority: Medium
Idea 4: Ridge / ElasticNet
Approach: Use Ridge (L2) or ElasticNet (L1+L2) instead of pure LASSO
from sklearn.linear_model import Ridge, ElasticNet
# Ridge: All metabolites get non-zero values
# ElasticNet: Balance between sparsity and smoothness
Pros: Ridge handles correlated features better (overlapping peaks)
Cons: No sparsity (all 14 metabolites will have values)
Priority: Medium
Idea 5: NNLS baseline
Approach: Pure least squares with non-negativity constraint
from scipy.optimize import nnls
# No regularization, just minimize ||Ax - b||^2 with x >= 0
Pros: Simple, no hyperparameter tuning (no alpha)
Cons: No sparsity, may overfit to noise
Priority: High (as baseline comparison)
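A minimal NNLS baseline might look like this. `nnls_quantify` and `ref_conc` are hypothetical names; the sketch assumes each dictionary column is a pure spectrum recorded at a known reference concentration, so a coefficient of 1.0 corresponds to that concentration.

```python
import numpy as np
from scipy.optimize import nnls

def nnls_quantify(D, y, ref_conc):
    """Solve min ||D x - y||^2 with x >= 0 and convert x to mM."""
    x, _ = nnls(D, y)
    conc = x * np.asarray(ref_conc, dtype=float)   # coefficient -> concentration
    ss_res = np.sum((y - D @ x) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return conc, 1.0 - ss_res / ss_tot

# Synthetic check (random columns, not real spectra)
rng = np.random.default_rng(1)
D_nn = rng.random((100, 3))
y_nn = D_nn @ np.array([0.5, 0.0, 2.0])
conc_nn, r2_nn = nnls_quantify(D_nn, y_nn, ref_conc=[10.0, 10.0, 10.0])
```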
Idea 6: Two-stage approach
Approach: Identify which metabolites are present first, then quantify
# Step 1: Use correlation or peak detection to identify present metabolites
# Step 2: Build dictionary with only those metabolites
# Step 3: Use NNLS or small-alpha LASSO for quantification
Pros: Reduces dictionary size, avoids sparsity killing weak signals
Cons: Two-step process, detection errors propagate
Priority: Medium
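The three steps can be sketched with correlation screening followed by NNLS. `two_stage_quantify` and the 0.3 correlation threshold are assumptions for illustration. Full-spectrum correlation is a crude detector: a weak metabolite next to a dominant one can fall below any fixed threshold, so peak detection in known windows may be the safer first stage.

```python
import numpy as np
from scipy.optimize import nnls

def two_stage_quantify(D, y, corr_threshold=0.3):
    """Stage 1: screen columns by correlation; stage 2: NNLS on the survivors."""
    # Step 1: Pearson correlation of each pure spectrum with the mixture
    Dc = D - D.mean(axis=0)
    yc = y - y.mean()
    corr = (Dc.T @ yc) / (np.linalg.norm(Dc, axis=0) * np.linalg.norm(yc))
    present = np.flatnonzero(corr > corr_threshold)

    # Steps 2 and 3: reduced dictionary, then non-negative least squares
    coefs = np.zeros(D.shape[1])
    if present.size:
        coefs[present], _ = nnls(D[:, present], y)
    return coefs, present

# Synthetic check: three separated Gaussian "spectra", the middle one absent
x = np.linspace(0.0, 10.0, 1000)
g = lambda c: np.exp(-((x - c) / 0.05) ** 2)
D_ts = np.column_stack([g(2.0), g(5.0), g(8.0)])
y_ts = 2.0 * D_ts[:, 0] + 1.0 * D_ts[:, 2]
coefs_ts, present_ts = two_stage_quantify(D_ts, y_ts)
```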
Idea 7: Custom scorer
Approach: Use LassoCV with a custom score that penalizes physically unreasonable concentrations
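import numpy as np
from sklearn.metrics import mean_squared_error

# Lower is better. This is not an sklearn scorer signature, so evaluate it manually per alpha.
def custom_scorer(y_true, y_pred, coef):
    mse = mean_squared_error(y_true, y_pred)
    # Penalize each metabolite whose fitted concentration exceeds 50 mM
    unreasonable = np.sum(coef > 50)
    return mse + 0.1 * unreasonable
Pros: Encourages physically reasonable solutions
Cons: More complex, may over-constrain
Priority: Low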
Date: 2024-03-24
Result: Only Tyrosine detected (1.38 mM), R² negative
Conclusion: Normalization alone not sufficient
Date: 2024-03-24
Result: All signal attributed to Tyrosine (923 mM), R² = -239
Root Cause: Dictionary columns highly correlated (overlapping peaks)
Conclusion: Full-spectrum approach with highly overlapping references doesn't work
Lesson: Need feature selection or region-wise approach
Date: 2024-03-24
Result: Only Alanine detected (65 mM), stopped after 1 iteration
Root Cause: After subtracting Alanine, correlation to all others = 0
Key Finding: Pure spectra correlation matrix shows 0.9+ correlation between many metabolites
Conclusion: Full-spectrum correlation-based methods fail for NMR
Full-spectrum ML approaches (LASSO, NNLS, Iterative) DO NOT WORK for NMR mixture quantification because:
- High spectral overlap: CH3 peaks (1.3-1.5 ppm) look similar across metabolites
- Collinearity: Pure spectra correlation > 0.9 for many pairs
- Ill-conditioned problem: Design matrix X is nearly singular
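The collinearity and conditioning claims can be checked directly on the dictionary. The sketch below is a diagnostic, not part of any fitting method; `collinearity_report` is a hypothetical helper and the 0.9 threshold mirrors the figure quoted above.

```python
import numpy as np

def collinearity_report(D, names, threshold=0.9):
    """List highly correlated column pairs and the dictionary condition number."""
    corr = np.corrcoef(D, rowvar=False)          # column-by-column Pearson r
    n = len(names)
    pairs = [(names[i], names[j], float(corr[i, j]))
             for i in range(n) for j in range(i + 1, n)
             if corr[i, j] > threshold]
    Dn = D / np.linalg.norm(D, axis=0)           # remove pure scale effects
    return pairs, float(np.linalg.cond(Dn))

# Synthetic check: two peaks only 0.01 ppm apart plus one distant peak
x = np.linspace(0.0, 10.0, 5001)
g = lambda c: np.exp(-((x - c) / 0.05) ** 2)
D_col = np.column_stack([g(1.40), g(1.41), g(7.0)])
pairs_col, cond_col = collinearity_report(D_col, ["A", "A2", "B"])
```

A large condition number means small changes in the mixture spectrum (noise, baseline) produce large swings in the fitted coefficients, which is consistent with the unstable single-metabolite solutions recorded in the test results.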
What DOES work (our Lorentzian method):
- Region-specific fitting (1.48 ppm for Alanine, 1.33 ppm for Lactate)
- Dynamic peak detection within specific windows
- Peak-by-peak quantification, not full-spectrum deconvolution
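For comparison, a peak-by-peak fit in a window around an expected shift could be sketched like this. It is not the project's Lorentzian code: `fit_peak`, the ±0.05 ppm window, and the single-Lorentzian-plus-constant-baseline model are assumptions (real multiplets such as the Alanine doublet would need more than one component).

```python
import numpy as np
from scipy.optimize import curve_fit

def lorentzian(x, amplitude, center, hwhm, baseline):
    """Single Lorentzian line on a constant baseline."""
    return amplitude * hwhm ** 2 / ((x - center) ** 2 + hwhm ** 2) + baseline

def fit_peak(ppm, spectrum, expected_ppm, half_window=0.05):
    """Fit one Lorentzian inside expected_ppm +/- half_window."""
    mask = np.abs(ppm - expected_ppm) <= half_window
    x, y = ppm[mask], spectrum[mask]
    # Dynamic peak detection: start from the tallest point in the window
    p0 = [y.max() - y.min(), x[np.argmax(y)], 0.005, y.min()]
    popt, _ = curve_fit(lorentzian, x, y, p0=p0)
    amplitude, center, hwhm, baseline = popt
    hwhm = abs(hwhm)                         # the model depends on hwhm**2 only
    return {"amplitude": amplitude, "center": center, "hwhm": hwhm,
            "area": np.pi * amplitude * hwhm}   # analytic Lorentzian area

# Synthetic check: peak at 1.47 ppm while the expected position is 1.48 ppm
ppm = np.linspace(1.0, 2.0, 2001)
spectrum = lorentzian(ppm, 10.0, 1.47, 0.004, 0.2)
peak = fit_peak(ppm, spectrum, expected_ppm=1.48)
```

Because each fit only sees its own window, a shifted peak is still found and neighbouring metabolites do not compete for the same signal.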
Use ML for peak detection/preprocessing, Lorentzian for quantification:
1. ML: Detect which peaks are present (classification)
2. ML: Estimate initial peak positions (regression)
3. Physics: Lorentzian fitting with ML-initiated parameters
Extract features first, then ML:
1. Extract: Peak heights at known chemical shifts
2. Extract: Integrals in specific regions
3. ML: Random Forest/XGBoost on features (not raw spectrum)
Neural network that outputs Lorentzian parameters directly:
Input: Spectrum
Output: {amplitude_i, center_i, width_i} for each metabolite
Loss: Reconstruction error + physics penalties
Stick with current Lorentzian method - it's physically sound and working well (R² > 0.97 for most metabolites). ML can help with automation but shouldn't replace the physics.
- Run NNLS as baseline (Idea 5) - simplest approach
- Try Ridge/ElasticNet (Idea 4) - compare with LASSO
- Implement iterative subtraction (Idea 2) - physically motivated
- Test region-wise fitting (Idea 1) - if above fail
- Consider weighted dictionary (Idea 3) - for fine-tuning
For each method, record:
- R² (reconstruction quality)
- Number of metabolites detected
- Sum of concentrations (should be ~100 mM for cell culture)
- Comparison with Lorentzian results
- Physical reasonableness (all concentrations > 0, none > 200 mM)
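The record listed above could be produced by one helper so every method is scored the same way. `evaluate_fit` and its argument names are hypothetical; zero concentrations are treated as "not detected" rather than unphysical, and the 200 mM cap is taken from the list.

```python
import numpy as np

def evaluate_fit(D, y, coefs, conc_mM, lorentzian_mM=None,
                 detect_threshold=0.0, max_conc_mM=200.0):
    """Collect the comparison metrics for one quantification method."""
    conc_mM = np.asarray(conc_mM, dtype=float)
    y_hat = D @ coefs
    ss_res = float(np.sum((y - y_hat) ** 2))
    ss_tot = float(np.sum((y - np.mean(y)) ** 2))
    record = {
        "r2": 1.0 - ss_res / ss_tot,     # strongly negative when the fit is worse than the mean
        "n_detected": int(np.sum(conc_mM > detect_threshold)),
        "total_mM": float(np.sum(conc_mM)),
        "physically_reasonable": bool(np.all(conc_mM >= 0)
                                      and np.all(conc_mM <= max_conc_mM)),
    }
    if lorentzian_mM is not None:
        diff = np.abs(conc_mM - np.asarray(lorentzian_mM, dtype=float))
        record["max_abs_diff_vs_lorentzian_mM"] = float(np.max(diff))
    return record

# Synthetic check: a good fit and a badly over-scaled one
rng = np.random.default_rng(2)
D_ev = rng.random((50, 2))
y_ev = D_ev[:, 0].copy()
good = evaluate_fit(D_ev, y_ev, np.array([1.0, 0.0]), [65.0, 0.0],
                    lorentzian_mM=[60.0, 5.0])
bad = evaluate_fit(D_ev, y_ev, np.array([923.0, 0.0]), [923.0, 0.0])
```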