[core] Align Matrix memory layout with OpenGL standard (column-major) for raylib 6.0 by noinodev · Pull Request #5768 · raysan5/raylib

noinodev · 2026-04-16T14:16:24Z

With the upcoming release of raylib 6.0 (congratulations Ray!) I thought I'd revisit a part of raylib I've been thinking about for a long time, as both a user around the end of 2024, and after venturing into my own rendering engine work.

It is very hard to justify an API change of this nature, so this description will be very long. The PR itself is simple, and I think if the changed memory order is not desired for 6.0, we can discuss instead how to improve correctness around aggregate initializers and serialization, specifically.

The struct Matrix type in raylib.h, rlgl.h, raymath.h and rcamera.h is logically, through its math operations and naming, in a column-major layout. However the memory order, if cast to an array of floats directly, is in a row-major layout, which in itself is confusing, but it also has implications about raylib's usage and maintainability in the future, that may need to be addressed for a major version change. This PR does three things specifically:

Address and enforce the API contract of Matrix types, universally;
Modify the memory layout of Matrix types to column-major, and,
Modify DrawMeshInstanced() to demonstrate the positive outcome (5-10x performance gain) of using the OpenGL standard memory order.

The API contract

Throughout raylib, there are several functions that convert a Matrix type to an array of floats. These functions also swizzle the memory order to column major, and do an implicit transpose if the memory order of Matrix is not column-major. I made a PR a few months ago to make note of this, as this behaviour is not obvious and may not be expected. The API also does not have a function to reverse this operation, which may be considered in the future for reasons I will further discuss.

The math functions in raylib that use the Matrix type explicitly and exclusively use the named fields of the Matrix m0..m15, and will never manipulate the bytes of a Matrix directly, through memcpy(), memmove(), or otherwise. This usage is fairly consistent across raylib and defines an API contract that raylib expects. However, it is noteworthy that there are 19 instances within raylib where aggregate struct initializers of the syntax Matrix matrix = {1,2,3,4...}; are used. This initialization of Matrix types is not representative of the logical layout of the Matrix, and does not obey the implicit API contract. In the PR I submitted- merged earlier this year- some of these usages were cleaned up in rlgl.h, which suggests this contract may be intentional, and something we can make more definite in raylib.

For this issue specifically, we can replace every usage of aggregate struct initializers for the Matrix type with explicit named field assignments, throughout all of raylib, and enforce the contract internally. Now, we can either do this with designated initializers of the form Matrix matrix = {m0=1,m1=2,m2=3...}, which can give us the same or similar ergonomics to aggregate initializers, or we can call either Matrix matrix = MatrixIdentiy() or Matrix matrix = { 0 }, and then set the fields afterwards matrix.m0 = 1; matrix.m1 = 2;.

The latter appears more useful, as many of the Matrix math operations begin from identity Matrix. This is how the change has been implemented in this PR. Several usages of aggregate initializers for Matrix types were for assigning identity Matrix in various functions, so replacing these with calls to MatrixIdentity() is an improvement in ergonomics and maintainability. Furthermore, we may need to consider adding a function that can revert the MatrixToFloatV() family of operations. Given the current memory order and implicit transpose, it must be exposed to the user that the shape of the data has changed from some arbitrary order defined by the Matrix type, to the explicit column-major order of float16 and rl_float16 used for rlgl functions, and provide then a means to construct one from the other. This function would also significantly improve the ergonomics of many functions in rlgl.h, where aggregate initializers are currently used to assign elements from a float array back into Matrix types. An API for this use-case would make rlgl.h more readable, if nothing else, especially in functions such as rlGetMatrixModelview() and rlGetMatrixProjection(), which both assign OpenGL float array matrices to fields of the Matrix type. This is a common use-case which may alleviate some boilerplate in the future. Something along the lines of FloatVToMatrix() or MatrixFromFloatV().

The memory layout

After enforcing the Matrix API contract, only one line of code must be changed to accommodate the transpose memory order: specifically, in rlSetUniformMatrices(), the 'transpose' field must change from true to false. This is specifically because this is the only function in this category that will not use MatrixToFloatV() before passing a Matrix to OpenGL. There is a comment that says // WARNING: WebGL does not support Matrix transpose ("true" parameter), whether this is actually true, transposing the Matrix type at the application layer will make this transpose entirely unnecessary.

Changing the memory order in this way lets us upload Matrix types without the implicit transpose behind MatrixToFloatV(), which makes an enormous difference in performance in one key area.

DrawMeshInstanced

This function, as it is, is extraordinarily wasteful of resources. At a high level, it will:

Allocate a buffer of float16 types, equivalent to 16x4xN bytes, where N is the number of instances;
Perform an elementwise copy + transpose of each transform in the array, in O(N) time;
Allocate a VBO and copy this new buffer to it; then
Free both the float16 buffer and VBO created within that function.

If we want DrawMeshInstanced to be a useful function to developers, and to leverage the hardware properly, we want to optimize this. Ideally we want two things: to remove the allocate+copy from the hot path, and to persist OpenGL buffers between frames. The latter cannot be done without changing the API surface, but if all of the above changes are made, we can eliminate the allocation and copy of instance buffers, and use the argument transforms containing the Matrix array for the VBO directly.

The elementwise copy+transpose is an enormous waste of time on the CPU. This benchmark was performed on an AMD 3020e laptop, running Xubuntu 24.04 with latest Mesa drivers. The program is a modified version of the raylib Instanced Rendering example, rendering 4,000,000 instanced cubes in a single draw call using DrawMeshInstanced. This is a low-end machine, so I have included a benchmark that skips the draw, and does everything else, to demonstrate the CPU-side time savings. It is noteworthy that, with the copy included, the actual draw makes no difference, suggesting DrawMeshInstanced is entirely CPU bound.

	Copy	No Copy
Draw	Function: 180ms, Total Frame: 181ms	Function: 36ms, Total Frame: 120ms
No Draw	Function: 180ms, Total Frame: 180ms	Function: 26ms, Total Frame, 26ms

This is a significant performance loss that can be rectified with a small set of changes, but it is important to make note of how this affects existing raylib programs.

Side-effects

As mentioned, this PR represents a total enforcement of the Matrix API and its usage, which has some specific effects on existing programs. I have been prudent to compile and run each and every raylib example and individually examine the effect of the change, of which there are none. All examples are equivalent in behaviour after enforcing the Matrix API contract. I am aware that the examples do not cover the entire API, nor every use-case of raylib, therefore we must consider how this change will affect the existing programs of developers.

For programs that use the Matrix type correctly, by using the named struct fields, and MatrixToFloatV for serialization, there is no effect to this change, and all programs should function equivalently.

For programs that use the Matrix type in unintended ways, such as uploading Matrix types to the GPU as uniforms without first calling MatrixToFloatV, which is conventional in raylib, this change will break the program.

For programs that serialize a Matrix directly, and interpret its bytes, this change will break (existing serialized files in) the program.

For programs that use memcpy() or memmove() and rely on the memory order of Matrix to have some certain behaviour, including networked applications where a receiver has a different understanding of the memory order, this change wil break the program.

Bindings. I am not sure of the binding development spec, but any API for raylib that treats Matrix as a float buffer, and not a struct, and does not respect the logical memory layout, will break. This is something we can investigate further, as I haven't had the opportunity yet.

In most cases, the solution to these specific failure modes is to correctly use the API for Matrix types, where to enforce a certain memory layout, the MatrixToFloatV() family of functions is used. This function guarantees an exact memory layout, and being more explicit about this behaviour in documentation will make this class of changes much smoother in the future.

Conclusion

This change, given its effect on existing programs, would likely not be appropriate for standard releases. However, given the upcoming major version change, it may be worth considering to leave behind some technical debt. The failure modes of this change are well-defined, and heavily localized to small subset of the codebase.

If nothing else, I want this PR to inspire some thought or discussion about how this flavour of technical debt can be addressed in the future.

modified memory order of Matrix type to correctly represent row-major layout. removed implicit transposes and all order-dependent Matrix operations. Matrix API contract is now explicit and consistent

raysan5 · 2026-04-16T15:18:28Z

@noinodev thanks for the time put into this redesign but I prefer to avoid it. Current implementation has been proved to work correctly for +12 years in multiple platforms and scenarios.

casperbear · 2026-04-16T23:57:22Z

@noinodev So if I aimed for maximum performance and created overloaded DrawMeshInstanced (without alloc/free and MatrixToFloatV) I would only need to overload Matrix struct and some other methods as needed but NOT this line?
glUniformMatrix4fv(locIndex, count, true, (const float *)matrices);

noinodev · 2026-04-17T06:48:17Z

@casperbear 'true' transposes the matrix. if you are changing the memory order to be column major, you would then set this to 'false'. if that's what youre asking?
an ideal version of DrawMeshInstanced would maintain the VBO handle between frames, so its not reallocating the instance buffer on the GPU each time. as well as removing the alloc+copy. but just removing the copy makes a big difference already.

casperbear · 2026-04-17T07:40:02Z

@noinodev I meant, realistically neither this will be merged to master nor it's feasible to apply this in personal forks. But people can add overloaded versions (DrawMeshInstancedColumnMajor and MatrixColumnMajor and Translate/Scale/Rotate functions as needed) based on your code, and I tried to understand whether this line
glUniformMatrix4fv(locIndex, count, true, (const float *)matrices);
changed to
glUniformMatrix4fv(locIndex, count, false, (const float *)matrices);
will make any difference for DrawMeshInstancedColumnMajor. I think it won't (as long as I continue to use original Matrix for vanilla RayLib functions).

UPD: In fact, that function is only called in relation to bones
rlSetUniformMatrices(mat.shader.locs[SHADER_LOC_MATRIX_BONETRANSFORMS], model.boneMatrices, model.skeleton.boneCount);
DrawMeshInstanced doesn't use it

'an ideal version of DrawMeshInstanced would maintain the VBO handle between frames' - that's an intriguing thought. Trees/bushes don't change position between frames. I'll try this eventually. Thank you for the PR.

noinodev · 2026-04-17T07:58:24Z

@casperbear ah i see now. i have always used a custom version of DrawMeshInstanced that does both things, but ive always wondered how I can upstream those improvements. the PR was the lowest friction solution i could come up with, anything else is 100x more invasive or doesnt work.

noinodev added 3 commits April 16, 2026 21:37

modified memory order of Matrix type

56be391

modified memory order of Matrix type to correctly represent row-major layout. removed implicit transposes and all order-dependent Matrix operations. Matrix API contract is now explicit and consistent

one space

6926c87

line break

01dd4ce

raysan5 closed this Apr 16, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[core] Align Matrix memory layout with OpenGL standard (column-major) for raylib 6.0#5768

[core] Align Matrix memory layout with OpenGL standard (column-major) for raylib 6.0#5768
noinodev wants to merge 3 commits into
raysan5:masterfrom
noinodev:matrix-memory-fix

noinodev commented Apr 16, 2026

Uh oh!

raysan5 commented Apr 16, 2026

Uh oh!

casperbear commented Apr 16, 2026

Uh oh!

noinodev commented Apr 17, 2026

Uh oh!

casperbear commented Apr 17, 2026 •

edited

Loading

Uh oh!

noinodev commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

noinodev commented Apr 16, 2026

The API contract

The memory layout

DrawMeshInstanced

Side-effects

Conclusion

Uh oh!

raysan5 commented Apr 16, 2026

Uh oh!

casperbear commented Apr 16, 2026

Uh oh!

noinodev commented Apr 17, 2026

Uh oh!

casperbear commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

noinodev commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

casperbear commented Apr 17, 2026 •

edited

Loading