[core] Align Matrix memory layout with OpenGL standard (column-major) for raylib 6.0#5768
[core] Align Matrix memory layout with OpenGL standard (column-major) for raylib 6.0#5768noinodev wants to merge 3 commits into
Conversation
modified memory order of Matrix type to correctly represent row-major layout. removed implicit transposes and all order-dependent Matrix operations. Matrix API contract is now explicit and consistent
|
@noinodev thanks for the time put into this redesign but I prefer to avoid it. Current implementation has been proved to work correctly for +12 years in multiple platforms and scenarios. |
|
@noinodev So if I aimed for maximum performance and created overloaded DrawMeshInstanced (without alloc/free and MatrixToFloatV) I would only need to overload Matrix struct and some other methods as needed but NOT this line? |
|
@casperbear 'true' transposes the matrix. if you are changing the memory order to be column major, you would then set this to 'false'. if that's what youre asking? |
|
@noinodev I meant, realistically neither this will be merged to master nor it's feasible to apply this in personal forks. But people can add overloaded versions (DrawMeshInstancedColumnMajor and MatrixColumnMajor and Translate/Scale/Rotate functions as needed) based on your code, and I tried to understand whether this line UPD: In fact, that function is only called in relation to bones 'an ideal version of DrawMeshInstanced would maintain the VBO handle between frames' - that's an intriguing thought. Trees/bushes don't change position between frames. I'll try this eventually. Thank you for the PR. |
|
@casperbear ah i see now. i have always used a custom version of DrawMeshInstanced that does both things, but ive always wondered how I can upstream those improvements. the PR was the lowest friction solution i could come up with, anything else is 100x more invasive or doesnt work. |
With the upcoming release of raylib 6.0 (congratulations Ray!) I thought I'd revisit a part of raylib I've been thinking about for a long time, as both a user around the end of 2024, and after venturing into my own rendering engine work.
It is very hard to justify an API change of this nature, so this description will be very long. The PR itself is simple, and I think if the changed memory order is not desired for 6.0, we can discuss instead how to improve correctness around aggregate initializers and serialization, specifically.
The struct Matrix type in raylib.h, rlgl.h, raymath.h and rcamera.h is logically, through its math operations and naming, in a column-major layout. However the memory order, if cast to an array of floats directly, is in a row-major layout, which in itself is confusing, but it also has implications about raylib's usage and maintainability in the future, that may need to be addressed for a major version change. This PR does three things specifically:
DrawMeshInstanced()to demonstrate the positive outcome (5-10x performance gain) of using the OpenGL standard memory order.The API contract
Throughout raylib, there are several functions that convert a Matrix type to an array of floats. These functions also swizzle the memory order to column major, and do an implicit transpose if the memory order of Matrix is not column-major. I made a PR a few months ago to make note of this, as this behaviour is not obvious and may not be expected. The API also does not have a function to reverse this operation, which may be considered in the future for reasons I will further discuss.
The math functions in raylib that use the Matrix type explicitly and exclusively use the named fields of the Matrix m0..m15, and will never manipulate the bytes of a Matrix directly, through memcpy(), memmove(), or otherwise. This usage is fairly consistent across raylib and defines an API contract that raylib expects. However, it is noteworthy that there are 19 instances within raylib where aggregate struct initializers of the syntax
Matrix matrix = {1,2,3,4...};are used. This initialization of Matrix types is not representative of the logical layout of the Matrix, and does not obey the implicit API contract. In the PR I submitted- merged earlier this year- some of these usages were cleaned up in rlgl.h, which suggests this contract may be intentional, and something we can make more definite in raylib.For this issue specifically, we can replace every usage of aggregate struct initializers for the Matrix type with explicit named field assignments, throughout all of raylib, and enforce the contract internally. Now, we can either do this with designated initializers of the form
Matrix matrix = {m0=1,m1=2,m2=3...}, which can give us the same or similar ergonomics to aggregate initializers, or we can call eitherMatrix matrix = MatrixIdentiy()orMatrix matrix = { 0 }, and then set the fields afterwardsmatrix.m0 = 1; matrix.m1 = 2;.The latter appears more useful, as many of the Matrix math operations begin from identity Matrix. This is how the change has been implemented in this PR. Several usages of aggregate initializers for Matrix types were for assigning identity Matrix in various functions, so replacing these with calls to
MatrixIdentity()is an improvement in ergonomics and maintainability. Furthermore, we may need to consider adding a function that can revert theMatrixToFloatV()family of operations. Given the current memory order and implicit transpose, it must be exposed to the user that the shape of the data has changed from some arbitrary order defined by the Matrix type, to the explicit column-major order offloat16andrl_float16used for rlgl functions, and provide then a means to construct one from the other. This function would also significantly improve the ergonomics of many functions in rlgl.h, where aggregate initializers are currently used to assign elements from a float array back into Matrix types. An API for this use-case would make rlgl.h more readable, if nothing else, especially in functions such asrlGetMatrixModelview()andrlGetMatrixProjection(), which both assign OpenGL float array matrices to fields of the Matrix type. This is a common use-case which may alleviate some boilerplate in the future. Something along the lines ofFloatVToMatrix()orMatrixFromFloatV().The memory layout
After enforcing the Matrix API contract, only one line of code must be changed to accommodate the transpose memory order: specifically, in
rlSetUniformMatrices(), the 'transpose' field must change fromtruetofalse. This is specifically because this is the only function in this category that will not useMatrixToFloatV()before passing a Matrix to OpenGL. There is a comment that says// WARNING: WebGL does not support Matrix transpose ("true" parameter), whether this is actually true, transposing the Matrix type at the application layer will make this transpose entirely unnecessary.Changing the memory order in this way lets us upload Matrix types without the implicit transpose behind
MatrixToFloatV(), which makes an enormous difference in performance in one key area.DrawMeshInstanced
This function, as it is, is extraordinarily wasteful of resources. At a high level, it will:
float16types, equivalent to 16x4xN bytes, where N is the number of instances;float16buffer and VBO created within that function.If we want DrawMeshInstanced to be a useful function to developers, and to leverage the hardware properly, we want to optimize this. Ideally we want two things: to remove the allocate+copy from the hot path, and to persist OpenGL buffers between frames. The latter cannot be done without changing the API surface, but if all of the above changes are made, we can eliminate the allocation and copy of instance buffers, and use the argument
transformscontaining the Matrix array for the VBO directly.The elementwise copy+transpose is an enormous waste of time on the CPU. This benchmark was performed on an AMD 3020e laptop, running Xubuntu 24.04 with latest Mesa drivers. The program is a modified version of the raylib Instanced Rendering example, rendering 4,000,000 instanced cubes in a single draw call using DrawMeshInstanced. This is a low-end machine, so I have included a benchmark that skips the draw, and does everything else, to demonstrate the CPU-side time savings. It is noteworthy that, with the copy included, the actual draw makes no difference, suggesting DrawMeshInstanced is entirely CPU bound.
This is a significant performance loss that can be rectified with a small set of changes, but it is important to make note of how this affects existing raylib programs.
Side-effects
As mentioned, this PR represents a total enforcement of the Matrix API and its usage, which has some specific effects on existing programs. I have been prudent to compile and run each and every raylib example and individually examine the effect of the change, of which there are none. All examples are equivalent in behaviour after enforcing the Matrix API contract. I am aware that the examples do not cover the entire API, nor every use-case of raylib, therefore we must consider how this change will affect the existing programs of developers.
For programs that use the Matrix type correctly, by using the named struct fields, and MatrixToFloatV for serialization, there is no effect to this change, and all programs should function equivalently.
For programs that use the Matrix type in unintended ways, such as uploading Matrix types to the GPU as uniforms without first calling
MatrixToFloatV, which is conventional in raylib, this change will break the program.For programs that serialize a Matrix directly, and interpret its bytes, this change will break (existing serialized files in) the program.
For programs that use
memcpy()ormemmove()and rely on the memory order of Matrix to have some certain behaviour, including networked applications where a receiver has a different understanding of the memory order, this change wil break the program.Bindings. I am not sure of the binding development spec, but any API for raylib that treats Matrix as a float buffer, and not a struct, and does not respect the logical memory layout, will break. This is something we can investigate further, as I haven't had the opportunity yet.
In most cases, the solution to these specific failure modes is to correctly use the API for Matrix types, where to enforce a certain memory layout, the
MatrixToFloatV()family of functions is used. This function guarantees an exact memory layout, and being more explicit about this behaviour in documentation will make this class of changes much smoother in the future.Conclusion
This change, given its effect on existing programs, would likely not be appropriate for standard releases. However, given the upcoming major version change, it may be worth considering to leave behind some technical debt. The failure modes of this change are well-defined, and heavily localized to small subset of the codebase.
If nothing else, I want this PR to inspire some thought or discussion about how this flavour of technical debt can be addressed in the future.