
parallelize polydispersity loops (Trac #1230) #393

@pkienzle

There is unexploited parallelism in the polydispersity calculation. This limits the speedup available on high-end graphics cards operating on 1-D datasets, since the current implementation uses only one processor per q value. A card with 5000 separate processors will be mostly idle.

This is particularly important for mcSAS, which needs to evaluate

I(q_j) = sum_{i=1}^m w_i P(q_j, r_i)

where P(q,r) is the sphere form factor, q has length n, and the distribution (w, r) has length m, for a total of m x n evaluations. The current sasmodels code does this in parallel over q_j, with the sum over w_i, r_i running in serial. Instead, we could break up the loop into non-intersecting stripes, first computing

I_k(q_j) = sum_{i=0}^{m/p-1} w_{k+p*i} P(q_j, r_{k+p*i}),  k = 1, ..., p

where p is the number of stripes. We can then keep n x p processors busy at the same time, at the cost of storing n x p intermediate results. With 256 q points, the 5120 processors on an nvidia V100 can compute 20 stripes in parallel, using minimal extra memory (256 x 20 x 8 bytes = 40 KiB). It may be faster to use p=16 so that memory accesses align better.
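As a rough illustration of the striping idea, here is a minimal pure-Python sketch. It is not the sasmodels kernel: the function names (`sphere_form`, `serial_iq`, `striped_partials`, `combine`) are made up for this example, the sphere form factor is the normalized textbook expression with volume and contrast scaling left out, and indices are 0-based, so stripe k takes elements k, k+p, k+2p, and so on. Each (k, j) entry of the partial-sum table is independent of the others, which is what would let n x p work items run at once on a GPU.

```python
import math


def sphere_form(q, r):
    """Normalized sphere form factor [3 (sin x - x cos x) / x^3]^2, x = q*r.

    Assumption: volume and contrast scaling are omitted for brevity.
    """
    x = q * r
    if x == 0.0:
        return 1.0  # limit of the expression as x -> 0
    amplitude = 3.0 * (math.sin(x) - x * math.cos(x)) / x**3
    return amplitude * amplitude


def serial_iq(q, w, r):
    """Current scheme: one worker per q_j, serial sum over the distribution."""
    return [sum(wi * sphere_form(qj, ri) for wi, ri in zip(w, r)) for qj in q]


def striped_partials(q, w, r, p):
    """Partial sums I_k(q_j) for p stripes; returns a p x n table.

    Stripe k sums distribution elements k, k+p, k+2p, ... (0-based).
    The slices also cope with m not being a multiple of p.
    """
    return [
        [sum(wi * sphere_form(qj, ri) for wi, ri in zip(w[k::p], r[k::p]))
         for qj in q]
        for k in range(p)
    ]


def combine(partials):
    """Second pass: one worker per q_j adds up its p stripe results."""
    return [sum(column) for column in zip(*partials)]
```

Calling `combine(striped_partials(q, w, r, p))` should agree with `serial_iq(q, w, r)` up to floating-point rounding, since the two only differ in the order of summation.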

Next, turn the problem on its side and compute the following:

I(q_j) = sum_k I_k(q_j)

with one processor for each q value. We can perhaps do better by summing pairs in parallel, then pairs of pairs, requiring four cycles for p=16 rather than 16, though the overhead of managing this may outweigh any benefit. Looking at the graphs on page 5 of the following:

https://www.cl.cam.ac.uk/teaching/1617/AdvGraph/07_OpenCL.pdf

I'm guessing that the 4k reduction is too small to warrant a fast algorithm.
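For reference, the "pairs, then pairs of pairs" reduction can be sketched as follows. This is only a serial emulation in plain Python with a hypothetical helper name (`tree_reduce`), and it assumes the number of stripes is a power of two. Within each pass the additions are independent of one another, so on a GPU they could run in parallel; the function also reports how many passes were needed.

```python
def tree_reduce(partials):
    """Pairwise reduction of a p x n table of partial sums.

    Returns (totals, passes). In each pass, row k + stride is added into
    row k for every k < stride, halving the number of live rows.
    Assumption: p is a power of two.
    """
    rows = [list(row) for row in partials]  # copy so the input is untouched
    p = len(rows)
    if p == 0 or p & (p - 1):
        raise ValueError("number of stripes must be a power of two")
    passes = 0
    stride = p // 2
    while stride >= 1:
        for k in range(stride):  # these additions are mutually independent
            rows[k] = [a + b for a, b in zip(rows[k], rows[k + stride])]
        stride //= 2
        passes += 1
    return rows[0], passes
```

With 16 stripes this takes four passes, compared with the 16 sequential additions per q value in the straightforward one-processor-per-q sum.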

The existing kernel_iq.c would benefit from this, at least for the inner polydispersity loop, if you are willing to tackle it. Generating a specialized kernel for the particular problem of a distribution of spheres in mcSAS will probably be easier.

Migrated from http://trac.sasview.org/ticket/1230

{
    "status": "new",
    "changetime": "2019-02-22T16:28:36",
    "_ts": "2019-02-22 16:28:36.578150+00:00",
    "description": "There is unexploited parallelism in the polydiserpsity calculation.  This limits the speedup available on high end graphics cards which are operating on 1-D dataset, since the current implementation is limited to only using one processor per q value.  A card with 5000 separate processors will be mostly idle.\n\nThis is particularly important for mcSAS, which needs to evaluate\n{{{\nI(q_j) = sum_{i=1}^m w_i P(q_j, r_i)\n}}}\nwhere P(q,r) is the sphere form, q is length n and the distribution w,r is length m, for a total of m x n total evaluations. The current sasmodels code does this in parallel over q_j, with the sum over w_i, r_i running in serial. Instead, we could break up the loop into non-intersecting stripes, first computing\n{{{\nI_k(q_j) = sum_{i=1}^{m/p} w_{k+p*i} P(q_j, r_{k+p*i})\n}}}\nwith the number of stripes p, then we can keep n x p processors busy at the same time, and the cost of n x p intermediate results.  With 256 q points, the 5120 processors on an nvidia V100 can compute 20 batches in parallel simultaneously, using minimal extra memory (256 x 20 x 8 bytes).  May be faster to use p=16 so that memory accesses align better.\n\nNext turn the problem on its side, compute the following:\n{{{\nI(q_j) = sum_k I_k(q_j)\n}}}\nwith one processor for each q value.   Can perhaps do better by computing pairs in parallel, then pairs of pairs, requiring four cycles for p=16 rather than 16, though the overhead of managing this may outweigh any benefit.  Looking at the graphs on page 5 of the following:\n\n    https://www.cl.cam.ac.uk/teaching/1617/AdvGraph/07_OpenCL.pdf\n\nI'm guessing the 4k reductions is too small to warrant a fast algorithm.\n\nThe existing kernel_iq.c would benefit from this, at least for the inner polydispersity loop, if you are willing to tackle it. Generating a specialized kernel for the particular problem of a distribution of spheres in mcSAS will probably be easier.\n",
    "reporter": "pkienzle",
    "cc": "",
    "resolution": "",
    "workpackage": "McSAS Integration Project",
    "time": "2019-02-19T14:23:34",
    "component": "SasView",
    "summary": "parallelize polydispersity loops",
    "priority": "major",
    "keywords": "",
    "milestone": "SasView 4.3.0",
    "owner": "",
    "type": "defect"
}
