Skip to content

fixed issue with possible zero-sized gemm call in TWFFastDeriv code#6004

Open
kgasperich wants to merge 3 commits into
QMCPACK:developfrom
kgasperich:twf-singledet-novirt-fix
Open

fixed issue with possible zero-sized gemm call in TWFFastDeriv code#6004
kgasperich wants to merge 3 commits into
QMCPACK:developfrom
kgasperich:twf-singledet-novirt-fix

Conversation

@kgasperich

Copy link
Copy Markdown
Contributor

Proposed changes

In the TWFFastDeriv code, there are several arrays/views with an nvirt-sized dimension. For the most part, these are only encountered for multidet WFs, but there is one place where a single det will also encounter them.

In cases where the single det has no virtual orbs, this leads to a gemm call with m==0 (okay) and ldc==0 (not okay). I added a check that will skip that call in these cases.

This was reported to me by a colleague who was seeing Intel oneMKL ERROR: parameter 13 was incorrect on entry to DGEMM in single-det cases with nvirt==0.

I also fixed three other cases that would cause similar issues in corner cases where norb==nelec (i.e. nvirt==0) for one spin species in multidet WFs.

What type(s) of changes does this code introduce?

  • Bugfix

Does this introduce a breaking change?

  • untested

What systems has this change been tested on?

  • untested

Checklist

    • I have read the pull request guidance and develop docs
    • This PR is up to date with the current state of 'develop'
    • Code added or changed in the PR has been clang-formatted
    • This PR adds tests to cover any new code, or to catch a bug that is being fixed
    • Documentation has been added (if appropriate)

@kgasperich

Copy link
Copy Markdown
Contributor Author

Relevant tests to add would be:

  • single-det input with nvirt==0 that uses fast_derivatives
  • multi-det input where one spin species has nvirt==0
    • Something like an oxygen atom in STO-3G with Sz=1 might work here (nelec_up = 5, nelec_dn = 3).
      Spin up orbs would be full occupied in all dets, and spin down orb configs could be (1s,2px) and (2s,2px).
      I guess you could add more down dets, but they'll have the wrong symmetry.
    • I don't remember how exactly dets are ordered internally, so there might also be some place in the code that will complain if none of the dets are HF-like in occupation (i.e. the first nelec_[spin] orbs occupied for spin==up and spin==dn). If this is a limitation, you could ionize the oxygen (or use nitrogen instead) and just have the beta spindets be (1s) and (2s); this way, at least one of them (the 1s) is "HF-like"

@prckent

prckent commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Test this please

@prckent

prckent commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Thanks Kevin. Surprising that this problem remained hidden for so long. Indeed it would be ideal to hit these new cases in tests. These functions are called in CI but not with nvirt==0. A wavefunction could be made "by hand" is needs be.

@ye-luo ye-luo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a test will be much appreciated!
It also helps us scoping the right amount of code for skipping when "nvirt=0".

Minv_B[sid].cols(), 0.0, s_hv.data(), s_hv.cols());
if (nvirt > 0)
BLAS::gemm('n', 'n', nvirt, nocc, nocc, -1.0, Minv_Mv[sid].data(), Minv_Mv[sid].cols(), Minv_B[sid].data(),
Minv_B[sid].cols(), 0.0, s_hv.data(), s_hv.cols());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When nvirt = 0, is max_exc_level defined at line 360 always zero? if so, prefer an early return before s_hv is defined.

// Minv_Mv = Minv[o,e].M[e,v]
BLAS::gemm('n', 'n', nvirt, ptclnum, ptclnum, 1.0, M[id].data() + ptclnum, M[id].cols(), Minv[id].data(),
Minv[id].cols(), 0.0, Minv_Mv[id].data(), Minv_Mv[id].cols());
if (nvirt > 0) // avoid gemm call with LDC == 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you move the definition of nvirt here like if (int nvirt = norb - ptclnum; nvirt > 0).
The code will be more compact and readable.

// X432b[h,v] -= Minv_dM[h,o].X3b[o,v]
BLAS::gemm('n', 'n', nvirt, nocc, nocc, -1.0, X3b_ov.data(), X3b_ov.cols(), Minv_dM[sid].data(),
Minv_dM[sid].cols(), 1.0, X432b_hv.data(), X432b_hv.cols());
if (nvirt > 0)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it cover a larger scope? All the ValueMatrix above with nvirt should be in the scope and hence their consumption.

@prckent prckent changed the title fixed issue with possible zero-sized gemm call fixed issue with possible zero-sized gemm call in TWFFastDeriv code Jun 24, 2026
@prckent

prckent commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Test this please

@prckent

prckent commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

To keep this moving, I suggest we merge the bug fix and leave any refinements for future PRs. @kgasperich please let us know what time you might reasonably have, and thanks again.

@ye-luo

ye-luo commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@kgasperich we really would like to have a test or at least a reproducer.

@kgasperich

Copy link
Copy Markdown
Contributor Author

Sorry, I haven't had a lot of time lately to be more thorough with this.

This is fixing behavior that already isn't caught by any tests, so I guess the ideal verification would be in two parts:

  1. Make sure the bug is real and shows up on more than just one machine and/or with more than one BLAS provider. This would require testing with one or more of the nvirt==0 cases described above, using the current develop branch.
  2. Apply the changes from this PR and verify that these new tests (which should have previously failed at least for some machines/BLAS providers) now pass.

This was just reported to me by a user, and I haven't done any testing to see whether the code actually exits with the error or if it just prints the message and continues (I think that it just keeps running, but I'm not completely sure of this). The problematic calls are pointing to zero-sized chunks of memory anyway, so even if the effect might technically be UB(?), the fact that it's a gemm call with M==0 might make it effectively a no-op in practice, so if the error doesn't cause the code to exit, then a test of just the computed values might still pass.

I've been told (by the user who first reported the issue) that the changes here do eliminate the error messages, but I haven't dug any further into it. He's running on Aurora (I assume with the default oneapi/mkl modules).
I am fairly confident that the few changes I've made here should cover all of the problematic cases of this nature, but it's possible that there are upstream places to intercept some of this earlier or more cleanly (or ways to do a larger refactor that could result in overall cleaner code).

I can try to find time in the next few days to take a closer look and see if there's a cleaner fix, but I won't have time to deal with all of the testing anytime soon.

For testing: there is probably an existing test that could be at least partially reused here. If there's a single-det test with nvirt>0 that uses the fast derivatives, then we could reuse that and just truncate the MOs to only the occupied orbs, but, as explained above, this isn't going to be as simple as just checking the computed values.
Multidet testing would be similar, but it would also require dropping any determinants that would excite into the dropped MOs. Also, I guess it should only do the truncation for one spin species (either up or down); if they both have no virtuals, then we can only make one determinant. I guess maybe we could make a multidet wavefunction with only one determinant, but I don't know whether that is a weird corner case that might trigger other unexpected behavior.

If anyone else is able to assist with the testing implementation, I'll be available to discuss/clarify as needed.

@ye-luo

ye-luo commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

@kgasperich I think the issue is real. Maybe the user who first reported the issue can help providing a QMCPACK input file which already exists? With an input file, we can have some idea how to tweak an existing test for testing this corner case. This should be quite low effort to start with.

@kgasperich

Copy link
Copy Markdown
Contributor Author

I made another branch to avoid a messy commit history (not a big deal with just this single commit)
https://github.com/kgasperich/qmcpack/tree/twf-virtual-error-reproducer

All I did was remove the guard around the gemm call here:
(new branch) commented out conditional:
https://github.com/kgasperich/qmcpack/blob/94f80417d246c90f147f96958141949ead6533a9/src/QMCWaveFunctions/TWFFastDerivWrapper.cpp#L838
(this PR branch) with conditional:
https://github.com/kgasperich/qmcpack/blob/cd48b4fa12a683ed0bccccdcbc5baa059cf13a35/src/QMCWaveFunctions/TWFFastDerivWrapper.cpp#L838

and change the input SPO coef block in one test input to only read the first 4 SPOs (same as nelec):
(new branch) size=4:
https://github.com/kgasperich/qmcpack/blob/94f80417d246c90f147f96958141949ead6533a9/tests/estimator/acforce/vmc.fast.in.xml#L313
(this PR branch and develop) size=14:
https://github.com/kgasperich/qmcpack/blob/cd48b4fa12a683ed0bccccdcbc5baa059cf13a35/tests/estimator/acforce/vmc.fast.in.xml#L313

run test with ctest -V -R acforce-fast
(-V is necessary because no change in behavior occurs other than printing)

I'm building on ubuntu, cpu-only, no MPI:
When I link against libopenblas, the test passes, and I don't see anything notable in the output.
When I link against oneMKL, the test passes, but the output contains:

Intel oneMKL INTERNAL ERROR: Condition -13 detected in function DGEMM .

with MKL_VERBOSE=1:

Intel oneMKL ERROR: Parameter 13 was incorrect on entry to DGEMM .
MKL_VERBOSE DGEMM(n,n,0,4,4,0x7852447822b8,0x7851e00c0320,4,0x7851e00b7fa0,4,0x7852447822c0,(nil),0) 172ns CNR:OFF Dyn:1 FastMM:1

so it looks like C==nullptr and LDC==0 (last two args), and it's complaining about LDC.
I'd guess that it checks input arg constraints, sees that LDC is not at least max(1,M), and calls the error handler. In this case, it doesn't really matter whether it exits early, because gemm with M==0 is a no-op anyway, which is why our data is unaffected. Maybe some implementations will return early when M==0 or N==0 before even checking the rest of the inputs?

oneMKL is whatever version is supplied by intel-oneapi-mkl-devel in the intel apt source (looks like 2026.0.1 in my local mkl_version.h)

When I keep the if (nvirt > 0) around the gemm (as in the current state of this PR branch), then the MKL error messages don't appear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants