Skip to content

CCQE spline smoothing: a new workflow#495

Open
kjplows wants to merge 1 commit into
GENIE-MC:masterfrom
kjplows:feature/kplows-smoothing-scripts
Open

CCQE spline smoothing: a new workflow#495
kjplows wants to merge 1 commit into
GENIE-MC:masterfrom
kjplows:feature/kplows-smoothing-scripts

Conversation

@kjplows
Copy link
Copy Markdown
Contributor

@kjplows kjplows commented May 25, 2026

CCQE cross sections for tunes like the "argon tunes" AR23_20i_00_000 and the upcoming AR25_20i_00_000 rely on Monte Carlo integration of the probe + struck nucleon system. The NewQELXSec algorithm handles polling from the nuclear ground state to provide this initial struck nucleon: it is configurable via NumNucleonThrows to perform N polls, and report the average calculated cross section from all N polls as the spline result, per knot.
This has the unintended consequence of adding a statistical uncertainty to the spline knots. For states using fairly "peaked" ground states, such as LFG (AR23_20i_00_000), this uncertainty is small (if still noticeable on the scale of the cross section). For a more realistic spectral function (such as in AR25_20i_00_000), the uncertainty is quite large.

To prove my point: below, a comparison of the result of numu + 40Ar CCQE scattering, with N=500 (current NewQELXSec default) nucleons of LocalFGM_SpectralFunctionLikeWithCorrelation and SpectralFunc ground states (i.e. AR23_20i_00_000 and AR25_20i_00_000). (Note the difference of scale peak, which is caused by a switching off of the RPA correction in the AR25_20i_00_000 tune). The jitter leads to significant instability in the tail of O(1 GeV).
image

In a previous PR #468 , I attemtped an early solution to the problem, by running a Gaussian kernel density estimator to "smooth out" these splines: neighbouring points would hopefully "pull" each knot towards what is assumed to be the "true" underlying cross section.
In this PR, I have implemented a new production workflow (currently optimised for Fermilab grid nodes, but extensible to any node that can run singularity SL7), which instead of running gmkspl once with N nucleons polled, instead runs gmkspl N times with one nucleon polled each time, incrementing the random number seed by 1 every time. The results are collected in a parquet file using python3 (required packages are installed on the worker node on workflow execution), and the parquet plus an xml file using the mean of these throws are returned.
The files can be found in $GENIE/src/contrib/ccqe_spline_smoothing/:

  • run_splines_parquet_FNAL-SL7.sh : the script that generates the executable to be run, and the DAG file to be created to be run via jobsub_submit --dag.
  • example.cmd : An example of how this is run. It is similar to (if somewhat less sophisticated than) @nusense 's production scripts. This example takes an input configuration file pdgs_for_splines.cfg that tells the workflow which neutrino and target PDGs to make splines for, and relies on the sbnd UPS setup to retrieve tagged SBN-wide version genie v3_06_02_sbn3 at qualifier e26:prof and use tune AR25_20i_01_001 (tune is described in New CMCs and small fixes for GENIE #490), and sets N=1000, for 250 knots with emax = 100 GeV.
    The config file reads:
# syntax: nu_pdg tgt_pdg pretty-neutrino pretty-tgt
12 1000060120 nue 12C
-12 1000260560 nuebar 56Fe
14 1000180400 numu 40Ar
-14 1000080160 numubar 16O

so four splines will be made: nue at 12C, nuebar at 56Fe, numu at 40Ar, and numubar at 16O.

A second plot: Demonstration of the error of the mean. This uses the SpectralFunction ground state from AR25_20i_00_000,. Each pink 'x' is one result from an N=500 run, of which 100 runs were performed. The median of these runs is the dashed red line (in this energy range, functionally identical to the mean): the red band is the error on the median (again, almost identical to the error on the mean).
image

Finally, I include a script construct_representative_splines.py (alongside a requirements.txt file that can be used alongside pyenv + Python 3.13.5 to run it on AL9 machines) that takes as input the parquets from the workflow above and runs a Gaussian process regression from sklearn (many more details about Gaussian processes can be seen in the User Guide) and make a smoothed version of the spline.
Grosso modo, the model is as follows: Suppose that the cross sections reported by the N runs we took follow a true latent distribution dσ/dΕν (Εν) with heteroskedastic noise ε overlaid on top. We know at each knot the mean, <σ>, and the error, δσ: from that, a Gaussian process regressor learns the characteristic "length scale" of the latent distribution, and distinguishes between the latent distribution and noise.
This script writes out a summary parquet with the mean, error-on-mean, prediction, and error on the prediction, and a GENIE XML file with one spline containing the predicted cross section on each knot.

For brevity, only a few plots will follow, but more are available if desired.

First, a comparison between taking the mean over N=10 000 nucleons (solid red points) and the Gaussian process regression (orange dashed line and coloured band) for numu on 40Ar, close to the peak: black points with error bars are the result of running this workflow over N = 1000 nucleons and are used as input to the Gaussian process regressor.

image

Next, a comparison between taking the mean over N=1000 nucleons (solid points) and the Gaussian process regression (coloured bands) for four tunes: G18_10a_02_11b, AR23_20i_00_000, AR25_20i_00_000, and AR25_20i_01_001, all for numu on 40Ar, in the region <= 100 GeV (top plot linear in Enu, bottom logaritmic in Enu):
numu_on_40Ar
numu_on_40Ar_semilogx
Jitter is evident in all four tunes (two use LFG, two use SF); the Gaussian process reduces these.

Third, equivalent plots for numubar on 16O, nue on 12C, and nuebar on 56Fe (the three other nuclei for which spectral function implementations exist):
numubar_on_16O_semilogx
nue_on_12C_semilogx
nuebar_on_56Fe_semilogx

Finally, two more scripts are provided:

  • replace_splines.py : Given a "base xml file" (such as a gxspl-NUsmall.xml) and a revision xml file (such as the output of construct_representative_spline.py), it searches the base xml for the spline(s) contained in the revision xml, and produces an output xml. The output is identical to base, apart from the splines contained in the revision xml which are drawn from the revision.
  • reduce_gxspl_CCQES_for_sbn.awk : A convenience awk script (based off of the scripts available in the genie_xsec larsoft.opensciencegrid.org area that @nusense wrote) that keeps only CC-QES (not charm/strange) splines for nu(e, mu)(bar), on only Ar40, O16, C12, Fe56.

A final plot: comparison between unsmoothed and smoothed numu + Ar40 spline knots, where the unsmoothed knots come from the base xml, and the smoothed knots from the output of running replace_splines.py with revision made by python3 construct_representative_splines.py -i AR25_20i_00_000/numu_40Ar.parquet -o AR25_20i_00_000/smoothed_numu_40Ar.xml -t AR25_20i_00_000 -p 14 -n 1000180400:
sbn2-sbn3_comparison_zoomed_semilogx

With my apologies for a verbose PR, and my sincere thanks for discussions and help along the way, I once again ask @nusense and @sjgardiner for review. I would consider #468 superseded by this PR.

@kjplows kjplows self-assigned this May 25, 2026
@kjplows kjplows requested review from nusense and sjgardiner May 25, 2026 14:43
@kjplows kjplows marked this pull request as ready for review May 25, 2026 14:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant