Merge pull request #2933 to add uncertainties in quadrature

rwest · web-flow · commit df834ea5ef24 · 2026-05-07T21:48:43.000+01:00
Uncertainty Part 1.5:  Adding in quadrature.

Some changes to how uncertainties are handled.
Most significant is that they are now added "in quadrature" (ie. square, add, root).
But the assumed error on a thermo Benson group has changed too (derived from actual GA vs Benchmark numbers). And a Jupyter notebook is updatade.
This has been reviewed many times, and I just ran the entire test suite again locally, so I'm merging.
diff --git a/ipython/local_uncertainty.ipynb b/ipython/local_uncertainty.ipynb
@@ -114,11 +114,29 @@
     "\n",
     "$$\\Delta G = \\frac{1}{\\sqrt{12}}(G_{max} - G_{min})$$\n",
     "\n",
-    "Several parameters are used to formulate $\\Delta G$.  These are $\\Delta G_\\mathrm{library}$, $\\Delta G_\\mathrm{QM}$, $\\Delta G_\\mathrm{GAV}$, and $\\Delta _\\mathrm{group}$.\n",
+    "But first, to formulate $G$, we need to compile all the uncertainty sources. These are $ G_\\mathrm{library}$, $ G_\\mathrm{QM}$, $ G_\\mathrm{GAV}$, and $G _\\mathrm{group}$. We treat each as continuous random variable and add them to get the species uncertainty.\n",
     "        \n",
-    "$$\\Delta G = \\delta_\\mathrm{library} \\Delta G_\\mathrm{library} + \\delta_\\mathrm{QM} \\Delta G_\\mathrm{QM} + \\delta_\\mathrm{GAV} \\left( \\Delta G_\\mathrm{GAV} + \\sum_{\\mathrm{group}\\; j} d_{j} \\Delta G_{\\mathrm{group},j} \\right)$$\n",
+    "$$G = \\delta_\\mathrm{library} G_\\mathrm{library} + \\delta_\\mathrm{QM}  G_\\mathrm{QM} + \\delta_\\mathrm{GAV} \\left( G_\\mathrm{GAV} + \\sum_{\\mathrm{group}\\; j} d_{j} G_{\\mathrm{group},j} \\right)$$\n",
+    "\n",
+    "Here $\\delta$ is the Kronecker delta function which equals one if the species thermochemistry parameter contains the particular source type and $d_{j}$ is the degeneracy (number of appearances) of the thermo group used to construct the species thermochemistry in the group additivity method.\n",
+    "\n",
+    "To compute $\\Delta G$, we'll use the property that the variance of the sum of two random variables, $X$ and $Y$, is:\n",
+    "$$Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y)$$\n",
+    "\n",
+    "If we assume that each uncertainty source is uncorrelated with the others, then the covariance term drops out:\n",
+    "\n",
+    "$$Var(X+Y)=Var(X)+Var(Y), \\ \\ \\ \\ \\    X \\neq Y$$\n",
+    "\n",
+    "And we can compute the variance of $G$:\n",
+    "\n",
+    "$$Var(G) = \\delta_\\mathrm{library} Var(G_\\mathrm{library}) + \\delta_\\mathrm{QM} Var(G_\\mathrm{QM})+\\delta_\\mathrm{GAV} \\left(Var(G_\\mathrm{GAV}) +\\sum_{group\\ j} d_j^2 Var(G_\\mathrm{group\\ j}) \\right)$$\n",
+    "\n",
+    "\n",
+    "The standard deviation $\\Delta G$ is then the square root of that (this is an example of adding in quadrature):\n",
+    "$$\\Delta(G) = \\sqrt{\\delta_\\mathrm{library} Var(G_\\mathrm{library}) + \\delta_\\mathrm{QM} Var(G_\\mathrm{QM})+\\delta_\\mathrm{GAV} \\left(Var(G_\\mathrm{GAV}) +\\sum_{group\\ j} d_j^2 Var(G_\\mathrm{group\\ j}) \\right)}$$\n",
+    "\n",
+    "\n",
     "\n",
-    "where $\\delta$ is the Kronecker delta function which equals one if the species thermochemistry parameter contains the particular source type and $d_{j}$ is the degeneracy (number of appearances) of the thermo group used to construct the species thermochemistry in the group additivity method.\n",
     "\n",
     "### Kinetics Uncertainty\n",
     "\n",
@@ -130,13 +148,20 @@
     "\n",
     "$$\\Delta \\ln(k) = \\frac{1}{\\sqrt{12}}(\\ln k_{max} - \\ln k_{min})$$\n",
     "\n",
-    "The parameters used to formulate $\\Delta  \\ln k$ are $\\Delta \\ln k_\\mathrm{library}$, $\\Delta \\ln k_\\mathrm{training}$, $\\Delta \\ln k_\\mathrm{pdep}$, $\\Delta \\ln k_\\mathrm{family}$, $\\Delta \\ln k_\\mathrm{non-exact}$, and $\\Delta \\ln k_\\mathrm{rule}$.\n",
+    "The sources used to formulate $ \\ln k$ are $ \\ln k_\\mathrm{library}$, $ \\ln k_\\mathrm{training}$, $ \\ln k_\\mathrm{pdep}$, $ \\ln k_\\mathrm{family}$, $ \\ln k_\\mathrm{non-exact}$, and $ \\ln k_\\mathrm{rule}$.\n",
+    "\n",
+    "For exact training data matches and library and pdep reactions, the kinetic uncertainty is assigned according to their uncertainty type. For kinetics estimated using RMG's decision trees, the following formula is used to calculate the uncertainty:\n",
     "\n",
-    "For library, training, and pdep reactions, the kinetic uncertainty is assigned according to their uncertainty type.  For kinetics estimated using RMG's rate rules, the following formula is used to calculate the uncertainty:\n",
+    "$$ \\ln k_\\mathrm{rate\\; rules} = \\ln k_\\mathrm{family} + \\log_{10}(N+1) \\left(\\ln k_\\mathrm{non-exact}\\right)  + \\sum_{\\mathrm{rule}\\; i} w_i  \\ln k_{\\mathrm{rule},i}$$\n",
     "\n",
-    "$$\\Delta \\ln k_\\mathrm{rate\\; rules} = \\Delta\\ln k_\\mathrm{family} + \\log_{10}(N+1) \\left(\\Delta\\ln k_\\mathrm{non-exact}\\right)  + \\sum_{\\mathrm{rule}\\; i} w_i \\Delta \\ln k_{\\mathrm{rule},i}$$\n",
+    "where N is the total number of rate rules or training reactions used and $w_{i}$ is the weight of the rate rule or training reaction in the averaging scheme for that kinetics estimate.\n",
     "\n",
-    "where N is the total number of rate rules used and $w_{i}$ is the weight of the rate rule in the averaging scheme for that kinetics estimate. "
+    "Once again we use the property of adding variances with the assumption that sources are uncorrelated to each other. The standard deviation then adds in quadrature.\n",
+    "\n",
+    "$$\\Delta\\ln k_\\mathrm{rate\\; rules} = \\sqrt{(\\Delta \\ln k_\\mathrm{family})^2 + \\left(\\log_{10}(N+1) \\right)^2\\left(\\Delta \\ln k_\\mathrm{non-exact}\\right)^2  + \\sum_{\\mathrm{rule}\\; i} w_i^2  (\\Delta \\ln k_{\\mathrm{rule},i})^2}$$\n",
+    "\n",
+    "\n",
+    "\n"
    ]
   },
   {
@@ -178,9 +203,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "scrolled": true
-   },
+   "metadata": {},
    "outputs": [],
    "source": [
     "result = uncertainty.local_analysis(sensitive_species, correlated=False, number=5, fileformat='.png')\n",
@@ -190,9 +213,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "scrolled": true
-   },
+   "metadata": {},
    "outputs": [],
    "source": [
     "# Show the uncertainty plots\n",
@@ -230,9 +251,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "scrolled": true
-   },
+   "metadata": {},
    "outputs": [],
    "source": [
     "uncertainty.assign_parameter_uncertainties(correlated=True)\n",
@@ -243,9 +262,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "scrolled": true
-   },
+   "metadata": {},
    "outputs": [],
    "source": [
     "# Show the uncertainty plots\n",
diff --git a/rmgpy/data/thermo.py b/rmgpy/data/thermo.py
@@ -2716,11 +2716,13 @@ def extract_source_from_comments(self, species):
         Parses the verbose string of comments from the thermo data of the species object,
         and extracts the thermo sources.
 
-        Returns a dictionary with keys of either 'Library', 'QM', and/or 'GAV'.
+        Returns a dictionary with keys of 'Library', 'QM', 'ADS', and/or 'GAV'.
         Commonly, species thermo are estimated using only one of these sources.
         However, a radical can be estimated with more than one type of source, for 
         instance a saturated library value and a GAV HBI correction, or a QM saturated value
-        and a GAV HBI correction.  
+        and a GAV HBI correction. Adsorbates can be estimated using a single library
+        for the adsorbate or a combination of a gas phase library for the
+        gas phase portion and an adsorption correction.
         
         source = {'Library': String_Name_of_Library_Used,
                   'QM': String_of_Method_Used,
diff --git a/rmgpy/tools/uncertainty.py b/rmgpy/tools/uncertainty.py
@@ -43,7 +43,7 @@ class ThermoParameterUncertainty(object):
     This class is an engine that generates the species uncertainty based on its thermo sources.
     """
 
-    def __init__(self, dG_library=1.5, dG_QM=3.0, dG_GAV=1.5, dG_group=0.1, dG_ADS_correction=6.918, dG_surf_lib=6.918):
+    def __init__(self, dG_library=1.5, dG_QM=3.0, dG_GAV=1.5, dG_group=0.7159, dG_ADS_correction=6.918, dG_surf_lib=6.918):
         """
         Initialize the different uncertainties dG_library, dG_QM, dG_GAV, and dG_other with set values
         in units of kcal/mol.
@@ -62,22 +62,22 @@ def get_uncertainty_value(self, source):
         """
         Retrieve the uncertainty value in kcal/mol when the source of the thermo of a species is given.
         """
-        dG = 0.0
+        varG = 0.0
         if 'Library' in source:
-            dG += self.dG_library
+            varG += self.dG_library * self.dG_library
         if 'Surface_Library' in source:
-            dG += self.dG_surf_lib
+            varG += self.dG_surf_lib * self.dG_surf_lib
         if 'QM' in source:
-            dG += self.dG_QM
+            varG += self.dG_QM * self.dG_QM
         if 'GAV' in source:
-            dG += self.dG_GAV  # Add a fixed uncertainty for the GAV method
+            varG += self.dG_GAV * self.dG_GAV  # Add a fixed uncertainty for the GAV method
             for group_type, group_entries in source['GAV'].items():
                 group_weights = [groupTuple[-1] for groupTuple in group_entries]
-                dG += np.sum([weight * self.dG_group for weight in group_weights])
+                varG += np.sum([weight * weight * self.dG_group * self.dG_group for weight in group_weights])
         if 'ADS' in source:
-            dG += self.dG_ADS_correction  # Add adsorption correction uncertainty
+            varG += self.dG_ADS_correction * self.dG_ADS_correction  # Add adsorption correction uncertainty
 
-        return dG
+        return np.sqrt(varG)
 
     def get_partial_uncertainty_value(self, source, corr_source_type, corr_param=None, corr_group_type=None):
         """
@@ -170,39 +170,39 @@ def get_uncertainty_value(self, source):
         """
         Retrieve the dlnk uncertainty when the source of the reaction kinetics are given
         """
-        dlnk = 0.0
+        varlnk = 0.0
         if 'Library' in source:
             # Should be a single library reaction source
-            dlnk += self.dlnk_library
+            varlnk += self.dlnk_library * self.dlnk_library
         elif 'Surface_Library' in source:
             # Should be a single library reaction source
-            dlnk += self.dlnk_surf_library
+            varlnk += self.dlnk_surf_library * self.dlnk_surf_library
         elif 'PDep' in source:
             # Should be a single pdep reaction source
-            dlnk += self.dlnk_pdep
+            varlnk += self.dlnk_pdep * self.dlnk_pdep
         elif 'Training' in source:
             # Should be a single training reaction
             # Although some training entries may be used in reverse,
-            # We still consider the kinetics to be directly dependent 
+            # We still consider the kinetics to be directly dependent
             if 'surface' in source['Training'][0].lower():
-                dlnk += self.dlnk_surf_training
+                varlnk += self.dlnk_surf_training * self.dlnk_surf_training
             else:
-                dlnk += self.dlnk_training
+                varlnk += self.dlnk_training * self.dlnk_training
         elif 'Rate Rules' in source:
             family_label = source['Rate Rules'][0]
             source_dict = source['Rate Rules'][1]
             exact = source_dict['exact']
             rule_weights = [ruleTuple[-1] for ruleTuple in source_dict['rules']]
             training_weights = [trainingTuple[-1] for trainingTuple in source_dict['training']]
 
-            dlnk += self.dlnk_family
+            varlnk += self.dlnk_family * self.dlnk_family
 
             N = len(rule_weights) + len(training_weights)
             if 'node_std_dev' in source_dict:
                 # Handle autogen BM trees
                 if source_dict['node_std_dev'] < 0:
                     raise ValueError('Invalid value for std dev of kinetics family rule node')
-                dlnk += source_dict['node_std_dev']
+                varlnk += np.float_power(source_dict['node_std_dev'], 2.0)
                 if source_dict['node_n_train'] is None:
                     raise ValueError('Invalid number of training reactions for kinetics family rule node')
                 N = source_dict['node_n_train']
@@ -211,28 +211,28 @@ def get_uncertainty_value(self, source):
                 # every node template has its own fitted rate rule by definition, but here we use the
                 # number of training reactions as an approximation of the node's specificity/generality
                 # and add a penalty for being too general (large # of training reactions)
-                dlnk += np.log10(N + 1) * self.dlnk_nonexact
+                varlnk += (np.log10(N + 1) * self.dlnk_nonexact) ** 2
             else:
                 # Handle hand-made trees
                 if not exact:
                     # nonexactness contribution increases as N increases
-                    dlnk += np.log10(N + 1) * self.dlnk_nonexact
+                    varlnk += (np.log10(N + 1) * self.dlnk_nonexact) ** 2
 
                 if 'surface' in family_label.lower():
-                    dlnk += np.sum([weight * self.dlnk_surf_rule for weight in rule_weights])
-                    dlnk += np.sum([weight * self.dlnk_surf_training for weight in training_weights])
+                    varlnk += np.sum([weight * weight * self.dlnk_surf_rule * self.dlnk_surf_rule for weight in rule_weights])
+                    varlnk += np.sum([weight * weight * self.dlnk_surf_training * self.dlnk_surf_training for weight in training_weights])
                 else:
                     # Add the contributions from rules
-                    dlnk += np.sum([weight * self.dlnk_rule for weight in rule_weights])
+                    varlnk += np.sum([weight * weight * self.dlnk_rule * self.dlnk_rule for weight in rule_weights])
                     # Add the contributions from training
                     # Even though these source from training reactions, we actually
                     # use the uncertainty for rate rules, since these are now approximations
                     # of the original reaction.  We consider these to be independent of original the training
                     # parameters because the rate rules may be reversing the training reactions,
                     # which leads to more complicated dependence
-                    dlnk += np.sum([weight * self.dlnk_rule for weight in training_weights])
+                    varlnk += np.sum([weight * weight * self.dlnk_rule * self.dlnk_rule for weight in training_weights])
 
-        return dlnk
+        return np.sqrt(varlnk)
 
     def get_partial_uncertainty_value(self, source, corr_source_type, corr_param=None, corr_family=None):
         """
@@ -442,10 +442,16 @@ def extract_sources_from_model(self):
         """
         self.species_sources_dict = {}
         self.extra_species = []
+        allowed_source_keys = {'Library', 'QM', 'GAV', 'ADS'}
         for species in self.species_list:
             if species not in self.extra_species:
                 source = self.database.thermo.extract_source_from_comments(species)
-                assert source.keys() <= {'Library', 'QM', 'GAV', 'ADS'}, 'Source of thermo must be either Library, QM, GAV, or ADS'
+                unexpected_source_keys = set(source.keys()) - allowed_source_keys
+                if unexpected_source_keys:
+                    raise ValueError(
+                        f'Source of thermo must be either Library, QM, GAV, or ADS; '
+                        f'got unexpected source keys {unexpected_source_keys} for species {species.label}'
+                    )
 
                 # Now prep the source data
                 # Do not alter the GAV information, but reassign QM and Library sources to the species indices that they came from
@@ -501,7 +507,11 @@ def extract_sources_from_model(self):
                 elif len(source) == 3:
                     # combination of adsorption correction, GAV (radical), and Library/ML
 
-                    assert species.contains_surface_site(), 'only surface species should have 3 sources: adsorption correction, GAV, library/ML'
+                    if not species.contains_surface_site():
+                        raise ValueError(
+                            f'Only surface species should have 3 thermo sources (adsorption correction, GAV, and library/QM); '
+                            f'got species={species.label}, source={source}'
+                        )
 
                     # retrieve the desorbed version of the surface species-- the thing the adsorption correction was applied to during thermo estimation
                     dummy_gas_species = Species()
diff --git a/test/rmgpy/tools/uncertaintyTest.py b/test/rmgpy/tools/uncertaintyTest.py
@@ -169,13 +169,12 @@ def test_uncertainty_assignment(self):
 
         np.testing.assert_allclose(
             thermo_unc,
-            [1.5, 1.5, 2.0, 1.9, 3.1, 1.5, 1.9, 2.0, 2.0, 1.9, 2.2, 1.9, 2.0, 1.5, 3.1, 1.9, 1.5, 2.0, 1.7, 1.8, 1.8, 1.9, 1.8, 1.9, 1.9],
+            [1.5, 1.5, 2.61966, 2.51994, 2.23886, 1.5, 2.30761, 2.41611, 2.61966, 2.51994, 2.61966, 2.51994, 2.61966, 1.5, 2.23886, 2.30761, 1.5, 2.61966, 2.07366, 2.19376, 2.19376, 2.30761, 1.94616, 2.07366, 2.07366],
             rtol=1e-4,
         )
         np.testing.assert_allclose(
             kinetic_unc,
-            # the 7.81 value comes from the SIDT tree node uncertainty: 5.756 + non-exact penalty for N=1: log10(1+1) * 3.5 + family uncertainty: 1.0
-            [0.5, 1.5, 3.169924, 3.169924, 2.553605, 0.5, 2.0, 7.81, 7.81, 0.5],
+            [0.5, 1.118, 1.9783, 1.9783, 1.5363, 0.5, 2.0, 5.9369, 5.9369, 0.5],
             rtol=1e-4
         )
 
@@ -185,13 +184,13 @@ def test_specific_species_uncertainties(self):
         """
 
         expected_results = {  # order is (total_uncertainty, [group_names], [group_counts])
-            'CCCC': (1.9, ['Cs-CsCsHH', 'Cs-CsHHH'], [2, 2]),
-            'CCCCCCCCCC': (2.5, ['Cs-CsCsHH', 'Cs-CsHHH'], [8, 2]),
-            'CC(OO)CC': (2.1, ['O2s-OsCs', 'O2s-OsH', 'Cs-CsCsOsH', 'Cs-CsCsHH', 'Cs-CsHHH'], [1, 1, 1, 1, 2]),
-            'C=NCC': (1.9, ['N3d-CdCs', 'Cs-(N3dCd)CsHH', 'Cs-CsHHH', 'Cd-N3dHH'], [1, 1, 1, 1]),
-            'C=C': (1.7, ['Cds-CdsHH'], [2]),
-            'C*': (10.018, ['CH3'], [1]),  # Gas library + radical + adsorption correction
-            'O=[CH]*': (8.618, ['Cds-OdHH', 'HCdsJO'], [1, 1]),  # GAV + radical + adsorption correction
+            'CCCC': (2.5199409675625337, ['Cs-CsCsHH', 'Cs-CsHHH'], [2, 2]),
+            'CCCCCCCCCC': (6.091048438487417, ['Cs-CsCsHH', 'Cs-CsHHH'], [8, 2]),
+            'CC(OO)CC': (2.5199409675625337, ['O2s-OsCs', 'O2s-OsH', 'Cs-CsCsOsH', 'Cs-CsCsHH', 'Cs-CsHHH'], [1, 1, 1, 1, 2]),
+            'C=NCC': (2.07365649035707, ['N3d-CdCs', 'Cs-(N3dCd)CsHH', 'Cs-CsHHH', 'Cd-N3dHH'], [1, 1, 1, 1]),
+            'C=C': (2.07365649035707, ['Cds-CdsHH'], [2]),
+            'C*': (7.271261019245562, ['CH3'], [1]),  # Gas library + radical + adsorption correction
+            'O=[CH]*': (7.150786643440007, ['Cds-OdHH', 'HCdsJO'], [1, 1]),  # GAV + radical + adsorption correction
         }
 
         uncertainty = rmgpy.tools.uncertainty.Uncertainty()