pyop2/gpu/TODO.org: 6 additions & 7 deletions
--- a/pyop2/gpu/TODO.org
+++ b/pyop2/gpu/TODO.org
@@ -1,8 +1,8 @@
 * Limitations/TODOs
 ** Changes in TSFC so that PyOP2 could have a better understanding of the variable names
-- [[https://github.com/OP2/PyOP2/blob/630e55118013966e84dcc62328c45fc9061196e6/pyop2/gpu/tile.py#L65-L79][Currently]] variable names have been hard coded for CG type FE kernel on
-triangular meshes.
-- Once this has been done it would then be reasonable to tackle other elements
+- [[https://github.com/OP2/PyOP2/blob/f50af5e819e726b97b1997f00b1ad4f66b0b574b/pyop2/gpu/tile.py#L117][Currently]], we go through a phase of metadata inference assuming a homogeneity
+of kernel structure.
+- Once this has been done it would then be reasonable to tackle more elements
 
 *** Information to be fed from TSFC
 - [ ] variable name of the action input
@@ -38,7 +38,7 @@ we are going from GEM representation to loopy kernel.
 
 ** Global reduction kernels. For ex. ~assemble(dot(f,f)*dx)~
 - Currently all the threads write to a single memory location atomically,
-thereby losing concurrency.
+thereby losing some concurrency.
 - Possible solution:
 - Fix the block size, say 256.
 - Map single cell to single thread.
@@ -47,9 +47,8 @@ we are going from GEM representation to loopy kernel.
 - Finally another reduction across the newly created intermediary variable.
 - One starting step would be to map the '+=' to a loopy's sum node.
 
-** Do we need atomic additions of the output DoFs for a DG kernel?
-
-** Tiling transformation logic fails for low orders
+** Atomic scatter reductions for DG elements, necessary?
+** Inner loop parallelization logic fails for low orders
 - The received TSFC kernel has a slightly different representation at low orders
 like P_0, P_1, DG0, DG1, etc. because some loops are unrolled, causing it to
 diverge from the "assumed" template of all the kernel's loop structures.
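
The "Global reduction kernels" item in the diff proposes fixing the block size, mapping one cell to one thread, and finishing with a second reduction across an intermediary variable. The hunks shown skip the steps in between, so the sketch below assumes that each block accumulates its cells into its own slot of that intermediary array. It is a sequential Python model of the idea rather than GPU or PyOP2 code, and the names `two_stage_sum` and `partials` are hypothetical.

```python
import math

BLOCK_SIZE = 256  # fixed block size, the value suggested in the TODO


def two_stage_sum(cell_values, block_size=BLOCK_SIZE):
    """Sum per-cell contributions in two stages.

    Sequential model only: on a GPU each cell would be one thread and
    each group of `block_size` cells one thread block.
    """
    n_blocks = math.ceil(len(cell_values) / block_size)

    # Stage 1 (assumed): every block accumulates into its own slot, so
    # threads of different blocks never contend for the same location.
    partials = [0.0] * n_blocks
    for cell, value in enumerate(cell_values):
        partials[cell // block_size] += value

    # Stage 2: a second, much smaller reduction across the intermediary array.
    return math.fsum(partials), partials


total, partials = two_stage_sum([1.0] * 1000)
print(total, len(partials))
```

With 1000 cells and a block size of 256, the intermediary array has four entries, so the final reduction touches four values instead of having 1000 threads update a single location atomically.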