There is unexploited parallelism in the polydispersity calculation. This limits the speedup available on high-end graphics cards operating on 1-D datasets, since the current implementation uses only one processor per q value. A card with 5000 separate processors will be mostly idle.
This is particularly important for mcSAS, which needs to evaluate
I(q_j) = sum_{i=1}^m w_i P(q_j, r_i)
where P(q,r) is the sphere form factor, q has length n and the distribution (w, r) has length m, for a total of m x n evaluations. The current sasmodels code does this in parallel over q_j, with the sum over w_i, r_i running in serial. Instead, we could break up the loop into non-intersecting stripes, first computing
I_k(q_j) = sum_{i=0}^{m/p-1} w_{k+p*i} P(q_j, r_{k+p*i}),  k = 1, ..., p
where p is the number of stripes. We can then keep n x p processors busy at the same time, at the cost of n x p intermediate results. With 256 q points, the 5120 processors on an NVIDIA V100 can compute 20 stripes simultaneously, using minimal extra memory (256 x 20 x 8 bytes = 40 KiB). It may be faster to use p=16 so that memory accesses align better.
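To make the striping concrete, here is a minimal pure-Python sketch of the partial sums. It only illustrates the index pattern and is not the sasmodels kernel: `sphere_form`, `serial_iq` and `striped_partial_sums` are hypothetical names, the form factor is the normalized textbook sphere expression without volume or contrast scaling, and indices are 0-based. Each `(k, j)` entry is independent, so on a GPU each one could go to its own processor.

```python
import math

def sphere_form(q, r):
    """Normalized sphere form factor (assumed stand-in for P(q, r))."""
    qr = q * r
    if qr == 0.0:
        return 1.0
    amplitude = 3.0 * (math.sin(qr) - qr * math.cos(qr)) / qr**3
    return amplitude * amplitude

def serial_iq(q, w, r):
    """Current scheme: one serial sum over the whole distribution per q value."""
    return [sum(wi * sphere_form(qj, ri) for wi, ri in zip(w, r)) for qj in q]

def striped_partial_sums(q, w, r, p):
    """Return partial[k][j], the sum over stripe k for q[j].

    Stripe k takes elements k, k+p, k+2p, ... of the distribution, so the
    p stripes do not intersect and together cover every (w_i, r_i) pair.
    """
    return [
        [sum(wi * sphere_form(qj, ri) for wi, ri in zip(w[k::p], r[k::p]))
         for qj in q]
        for k in range(p)
    ]
```

Summing `partial[k][j]` over `k` should reproduce `serial_iq` up to floating point rounding, since the terms are only added in a different order.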
Next, turn the problem on its side and compute
I(q_j) = sum_{k=1}^p I_k(q_j)
with one processor for each q value. We can perhaps do better by summing pairs in parallel, then pairs of pairs, requiring four cycles for p=16 rather than 16, though the overhead of managing this may outweigh any benefit. Looking at the graphs on page 5 of the following:
https://www.cl.cam.ac.uk/teaching/1617/AdvGraph/07_OpenCL.pdf
I'm guessing a reduction over 4k values (256 q points x 16 stripes) is too small to warrant a fast algorithm.
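For reference, the pairs-then-pairs-of-pairs reduction could look roughly like the sketch below. It is plain Python that only counts combining steps. `tree_reduce` and `reduce_stripes` are hypothetical names, and a real kernel would need a barrier between steps, which is the overhead mentioned above.

```python
def tree_reduce(values):
    """Sum values pairwise; return (total, number of combining steps)."""
    values = list(values)
    if not values:
        return 0.0, 0
    steps = 0
    while len(values) > 1:
        if len(values) % 2:
            values.append(0.0)  # pad odd lengths with the additive identity
        half = len(values) // 2
        # Every addition in this step is independent, so all of them
        # could run in parallel on separate processors.
        values = [values[i] + values[i + half] for i in range(half)]
        steps += 1
    return values[0], steps

def reduce_stripes(partial):
    """Combine partial[k][j] over stripes k, giving one total per q value."""
    return [tree_reduce(column)[0] for column in zip(*partial)]

print(tree_reduce(range(1, 17)))  # (136, 4): four steps for p=16
```

The simpler alternative is a serial sum over the p partial results for each q value, which for p=16 is 16 additions on each of the n processors.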
The existing kernel_iq.c would benefit from this, at least for the inner polydispersity loop, if you are willing to tackle it. Generating a specialized kernel for the particular problem of a distribution of spheres in mcSAS will probably be easier.
Migrated from http://trac.sasview.org/ticket/1230
{
"status": "new",
"changetime": "2019-02-22T16:28:36",
"_ts": "2019-02-22 16:28:36.578150+00:00",
"description": "There is unexploited parallelism in the polydiserpsity calculation. This limits the speedup available on high end graphics cards which are operating on 1-D dataset, since the current implementation is limited to only using one processor per q value. A card with 5000 separate processors will be mostly idle.\n\nThis is particularly important for mcSAS, which needs to evaluate\n{{{\nI(q_j) = sum_{i=1}^m w_i P(q_j, r_i)\n}}}\nwhere P(q,r) is the sphere form, q is length n and the distribution w,r is length m, for a total of m x n total evaluations. The current sasmodels code does this in parallel over q_j, with the sum over w_i, r_i running in serial. Instead, we could break up the loop into non-intersecting stripes, first computing\n{{{\nI_k(q_j) = sum_{i=1}^{m/p} w_{k+p*i} P(q_j, r_{k+p*i})\n}}}\nwith the number of stripes p, then we can keep n x p processors busy at the same time, and the cost of n x p intermediate results. With 256 q points, the 5120 processors on an nvidia V100 can compute 20 batches in parallel simultaneously, using minimal extra memory (256 x 20 x 8 bytes). May be faster to use p=16 so that memory accesses align better.\n\nNext turn the problem on its side, compute the following:\n{{{\nI(q_j) = sum_k I_k(q_j)\n}}}\nwith one processor for each q value. Can perhaps do better by computing pairs in parallel, then pairs of pairs, requiring four cycles for p=16 rather than 16, though the overhead of managing this may outweigh any benefit. Looking at the graphs on page 5 of the following:\n\n https://www.cl.cam.ac.uk/teaching/1617/AdvGraph/07_OpenCL.pdf\n\nI'm guessing the 4k reductions is too small to warrant a fast algorithm.\n\nThe existing kernel_iq.c would benefit from this, at least for the inner polydispersity loop, if you are willing to tackle it. Generating a specialized kernel for the particular problem of a distribution of spheres in mcSAS will probably be easier.\n",
"reporter": "pkienzle",
"cc": "",
"resolution": "",
"workpackage": "McSAS Integration Project",
"time": "2019-02-19T14:23:34",
"component": "SasView",
"summary": "parallelize polydispersity loops",
"priority": "major",
"keywords": "",
"milestone": "SasView 4.3.0",
"owner": "",
"type": "defect"
}