fixed issue with possible zero-sized gemm call in TWFFastDeriv code#6004
fixed issue with possible zero-sized gemm call in TWFFastDeriv code#6004kgasperich wants to merge 3 commits into
Conversation
|
Relevant tests to add would be:
|
|
Test this please |
|
Thanks Kevin. Surprising that this problem remained hidden for so long. Indeed it would be ideal to hit these new cases in tests. These functions are called in CI but not with nvirt==0. A wavefunction could be made "by hand" is needs be. |
ye-luo
left a comment
There was a problem hiding this comment.
Having a test will be much appreciated!
It also helps us scoping the right amount of code for skipping when "nvirt=0".
| Minv_B[sid].cols(), 0.0, s_hv.data(), s_hv.cols()); | ||
| if (nvirt > 0) | ||
| BLAS::gemm('n', 'n', nvirt, nocc, nocc, -1.0, Minv_Mv[sid].data(), Minv_Mv[sid].cols(), Minv_B[sid].data(), | ||
| Minv_B[sid].cols(), 0.0, s_hv.data(), s_hv.cols()); |
There was a problem hiding this comment.
When nvirt = 0, is max_exc_level defined at line 360 always zero? if so, prefer an early return before s_hv is defined.
| // Minv_Mv = Minv[o,e].M[e,v] | ||
| BLAS::gemm('n', 'n', nvirt, ptclnum, ptclnum, 1.0, M[id].data() + ptclnum, M[id].cols(), Minv[id].data(), | ||
| Minv[id].cols(), 0.0, Minv_Mv[id].data(), Minv_Mv[id].cols()); | ||
| if (nvirt > 0) // avoid gemm call with LDC == 0 |
There was a problem hiding this comment.
If you move the definition of nvirt here like if (int nvirt = norb - ptclnum; nvirt > 0).
The code will be more compact and readable.
| // X432b[h,v] -= Minv_dM[h,o].X3b[o,v] | ||
| BLAS::gemm('n', 'n', nvirt, nocc, nocc, -1.0, X3b_ov.data(), X3b_ov.cols(), Minv_dM[sid].data(), | ||
| Minv_dM[sid].cols(), 1.0, X432b_hv.data(), X432b_hv.cols()); | ||
| if (nvirt > 0) |
There was a problem hiding this comment.
Can it cover a larger scope? All the ValueMatrix above with nvirt should be in the scope and hence their consumption.
|
Test this please |
|
To keep this moving, I suggest we merge the bug fix and leave any refinements for future PRs. @kgasperich please let us know what time you might reasonably have, and thanks again. |
|
@kgasperich we really would like to have a test or at least a reproducer. |
|
Sorry, I haven't had a lot of time lately to be more thorough with this. This is fixing behavior that already isn't caught by any tests, so I guess the ideal verification would be in two parts:
This was just reported to me by a user, and I haven't done any testing to see whether the code actually exits with the error or if it just prints the message and continues (I think that it just keeps running, but I'm not completely sure of this). The problematic calls are pointing to zero-sized chunks of memory anyway, so even if the effect might technically be UB(?), the fact that it's a gemm call with M==0 might make it effectively a no-op in practice, so if the error doesn't cause the code to exit, then a test of just the computed values might still pass. I've been told (by the user who first reported the issue) that the changes here do eliminate the error messages, but I haven't dug any further into it. He's running on Aurora (I assume with the default oneapi/mkl modules). I can try to find time in the next few days to take a closer look and see if there's a cleaner fix, but I won't have time to deal with all of the testing anytime soon. For testing: there is probably an existing test that could be at least partially reused here. If there's a single-det test with If anyone else is able to assist with the testing implementation, I'll be available to discuss/clarify as needed. |
|
@kgasperich I think the issue is real. Maybe the user who first reported the issue can help providing a QMCPACK input file which already exists? With an input file, we can have some idea how to tweak an existing test for testing this corner case. This should be quite low effort to start with. |
|
I made another branch to avoid a messy commit history (not a big deal with just this single commit) All I did was remove the guard around the gemm call here: and change the input SPO coef block in one test input to only read the first 4 SPOs (same as nelec): run test with I'm building on ubuntu, cpu-only, no MPI: with so it looks like oneMKL is whatever version is supplied by When I keep the |
Proposed changes
In the TWFFastDeriv code, there are several arrays/views with an
nvirt-sized dimension. For the most part, these are only encountered for multidet WFs, but there is one place where a single det will also encounter them.In cases where the single det has no virtual orbs, this leads to a gemm call with m==0 (okay) and ldc==0 (not okay). I added a check that will skip that call in these cases.
This was reported to me by a colleague who was seeing
Intel oneMKL ERROR: parameter 13 was incorrect on entry to DGEMMin single-det cases with nvirt==0.I also fixed three other cases that would cause similar issues in corner cases where norb==nelec (i.e. nvirt==0) for one spin species in multidet WFs.
What type(s) of changes does this code introduce?
Does this introduce a breaking change?
What systems has this change been tested on?
Checklist