Generating C bindings

The C bindings are generated by a separate executable build/mrbind_gen_c. It consumes the parser output JSON, and generates C headers and C++ implementation files, which you are then free to compile with any compiler.

Running the generator

The minimal generation invocation looks like this:

mrbind_gen_c \
    --input parse_result.json \
    --output-header-dir output/include \
    --output-source-dir output/src \
    --helper-name-prefix MyLib_ \
    --map-path path/to/input/headers . \
    --assume-include-dir path/to/input

Let's go over the options. You can find more detailed descriptions in mrbind_gen_c --help.

--input __.json is the input JSON as produced by the parser (mrbind).
--output-header-dir __ --output-source-dir __ are the output directories for the headers and source files respectively.

The default behavior is to error if those aren't empty. Pass --clean-output-dirs to automatically delete all contents before generation.
--helper-name-prefix MyLib_ sets the name prefix for certain generated functions/types/etc. This can be anything, but ideally you should pass your C++ namespace name followed by _.

This isn't used for everything. The parsed names from the input will be prefixed with their C++ namespace names regardless of this flag. This prefix is used primarily for additional helper functions that we sometimes generate.

This prefix is also used for macros by default. If you want a different prefix for macros (e.g. because this one isn't in all caps), pass it as --helper-macro-name-prefix MYLIB_.
--map-path IN OUT controls the output directory hierarchy. Can be passed multiple times.

For each input C++ header, we generate the respective output C header with the same name, and a C++ source file with the same name. (This only happens for headers that didn't have their contents filtered out by the parser's --ignore, so you don't need to worry about the standard library or third-party library headers.)

Every such input C++ header must be located in one of the IN directories (this flag can be used several times to specify several directories). IN can be absolute or relative; it's made absolute internally anyway.

OUT must always be relative. It's treated as relative to --output-header-dir/--output-source-dir.

For example, if your input headers are a/b/c/d/1.hpp, a/b/c/d/2.hpp, and the output directories are --output-header-dir include --output-source-dir src, then:
- --map-path a/b . would generate include/c/d/1.h, include/c/d/2.h and src/c/d/1.cpp, src/c/d/2.cpp.
- --map-path a/b foo/bar would generate include/foo/bar/c/d/1.h, include/foo/bar/c/d/2.h and src/foo/bar/c/d/1.cpp, src/foo/bar/c/d/2.cpp.
- --map-path a/b/c . would generate include/d/1.h, include/d/2.h and src/d/1.cpp, src/d/2.cpp, and so on.
--assume-include-dir IN specifies which directories will be given to the compiler as include directories (via -I...).

It's similar to --map-path in that it can be used multiple times, and every input C++ header must be located in one of the IN directories.

For example, if your input headers are a/b/c/d/1.hpp, a/b/c/d/2.hpp, then --assume-include-dir a/b would cause the generator to use #include <c/d/1.hpp>, #include <c/d/2.hpp> to include those headers.

NOTE: If your input headers use the .h extension, they can end up conflicting with the output .h headers with the same names. To avoid this, don't pass directories that directly contain input headers, and instead pass their parents.

For example, if your input header is named a/b/c/d/1.h, then don't pass --assume-include-dir a/b/c/d, and instead pass a/b/c or a/b or a.

If you pass --assume-include-dir a/b/c/d, then the following will happen. The generated 1.cpp will contain:
```
#include "1.h" // Include the generated C header `1.h`.
#include <1.h> // Include the parsed C++ input header `1.h`.
```
At least one of those will resolve incorrectly. (Since neither of those headers will be in the same directory as 1.cpp.)

And if you instead pass --assume-include-dir a/b/c, then you'll get:
```
#include "1.h"
#include <d/1.h>
```
Which is unambiguous, assuming you correctly set your include paths.
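The path rules for --map-path and --assume-include-dir can be modeled in a few lines of C++. This is only a sketch of the behavior described above, not the generator's actual implementation: the function names are hypothetical, and it assumes that all paths are already normalized, '/'-separated, and that every input header has an extension.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Returns `path` relative to the directory `dir`, or nothing if `path` isn't inside `dir`.
// Assumes both are normalized, '/'-separated, and have no trailing slash.
std::optional<std::string> relative_to(const std::string &path, const std::string &dir)
{
    if (path.size() > dir.size() + 1 && path.compare(0, dir.size(), dir) == 0 && path[dir.size()] == '/')
        return path.substr(dir.size() + 1);
    return std::nullopt;
}

// Models `--map-path IN OUT`: the generated header location, relative to `--output-header-dir`.
std::optional<std::string> mapped_header(const std::string &input, const std::string &in, const std::string &out)
{
    std::optional<std::string> rel = relative_to(input, in);
    if (!rel)
        return std::nullopt; // `input` isn't inside this IN directory.
    std::string header = rel->substr(0, rel->rfind('.')) + ".h"; // Swap the extension for `.h`.
    return out == "." ? header : out + "/" + header;
}

// Models `--assume-include-dir IN`: the spelling used in `#include <...>` for the input header.
std::optional<std::string> include_spelling(const std::string &input, const std::string &in)
{
    return relative_to(input, in);
}

// The examples from the text above.
void check_path_examples()
{
    assert(*mapped_header("a/b/c/d/1.hpp", "a/b", ".") == "c/d/1.h");
    assert(*mapped_header("a/b/c/d/1.hpp", "a/b", "foo/bar") == "foo/bar/c/d/1.h");
    assert(*include_spelling("a/b/c/d/1.hpp", "a/b") == "c/d/1.hpp");
}
```

With an input header a/b/c/d/1.h and --assume-include-dir a/b/c/d, `include_spelling` returns `1.h`, which is exactly the name of the generated header. That is the conflict described in the note above.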

Ignoring problematic parts of the code

If the generation fails because of certain problematic functions/types/etc, exclude them from the bindings.

Completeness

Unlike with the Python bindings, if the C bindings generate and then compile successfully, you can be fairly sure that nothing is missing and that they'll work correctly.

Compiling the generated code

On success, the directories passed to --output-header-dir, --output-source-dir will contain the resulting C/C++ code, which you can compile with your preferred compiler, usually into a shared library.

You need at least the following compiler flags:

Add both --output-header-dir and --output-source-dir to the include search paths.
Add your input C++ headers to the include search paths. Use the same directories you passed to --assume-include-dir.

Tuning the generated bindings

All flags below are for mrbind_gen_c, unless mentioned otherwise.

File name length limit

Some generated headers can end up with long filenames (e.g. if you bind std::tuple with 100 members). To set a limit on filename length, use --max-header-name-length N (e.g. with N=100). Longer filenames will get truncated, and hashes will be appended to those names to make them unique.
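The truncate-and-hash idea can be sketched as follows. This is an illustration only: the hash function (FNV-1a here), the suffix format, and the function names are assumptions, and the generator may do all of this differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// FNV-1a, used here only as a stand-in hash.
std::uint64_t fnv1a(const std::string &s)
{
    std::uint64_t h = 14695981039346656037ull;
    for (unsigned char c : s)
    {
        h ^= c;
        h *= 1099511628211ull;
    }
    return h;
}

// Shortens `name` to exactly `max_len` characters by keeping a prefix and appending
// a hash of the full name. Names that already fit are returned unchanged.
// Assumes `max_len` is larger than the 17-character suffix ('_' plus 16 hex digits).
std::string limit_name_length(const std::string &name, std::size_t max_len)
{
    if (name.size() <= max_len)
        return name;
    char suffix[18];
    std::snprintf(suffix, sizeof suffix, "_%016llx", static_cast<unsigned long long>(fnv1a(name)));
    return name.substr(0, max_len - 17) + suffix;
}

void check_name_limit()
{
    assert(limit_name_length("short", 100) == "short");
    assert(limit_name_length(std::string(200, 'x'), 100).size() == 100);
}
```

Hashing the full name (rather than the truncated prefix) is what keeps two long names with a common prefix from colliding after truncation.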

Move the extra headers

We generate some additional headers, such as exports.h for the shared library export macro, or headers for the standard containers if you use them.

By default all those headers end up directly in the --output-header-dir, which can look bad.

Use --helper-header-dir some/dir to move them to a subdirectory. some/dir is relative to --output-header-dir, and will typically be MylibMisc, or something along those lines.

Inherit class members

Pass --copy-inherited-members to the parser (mrbind) to paste the members from the base classes into the classes derived from them. Otherwise you'll have to manually upcast the pointers to call the base methods.

This applies only to the implicitly inherited members. Members explicitly inherited via using, including constructors, are copied unconditionally.

Exception handling

We try to handle C++ exceptions by default.

There is a global callback that gets called when an exception escapes a C function. You can change it using ..._SetSimpleExceptionHandler(). The default behavior is to print the exception and terminate, but you can change this, e.g. to set your own global variable to indicate the error and continue. Setting the callback to null disables exception catching at runtime, letting exceptions fall out of the C functions.
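Conceptually, the callback mechanism works like the sketch below. All names and the handler signature here are hypothetical stand-ins; the real declarations are in the generated headers, and the generated wrappers may be structured differently.

```cpp
#include <cassert>
#include <cstdio>
#include <exception>
#include <stdexcept>

using ExceptionHandler = void (*)(const char *what); // Hypothetical signature.

// The default behavior described above: print the exception and terminate.
void default_handler(const char *what)
{
    std::fprintf(stderr, "Unhandled C++ exception: %s\n", what);
    std::terminate();
}

ExceptionHandler g_handler = default_handler;

// Stand-in for `..._SetSimpleExceptionHandler()`. Passing null disables catching.
void set_handler(ExceptionHandler handler) { g_handler = handler; }

// Some C++ function that may throw.
int checked_divide(int a, int b)
{
    if (b == 0)
        throw std::invalid_argument("division by zero");
    return a / b;
}

// Stand-in for a generated wrapper around `checked_divide()`.
int wrapped_checked_divide(int a, int b)
{
    if (!g_handler)
        return checked_divide(a, b); // No handler: exceptions fall out of the wrapper.
    try
    {
        return checked_divide(a, b);
    }
    catch (const std::exception &e)
    {
        g_handler(e.what());
        return 0; // Placeholder result, only reached if the handler returns.
    }
}

// A custom handler that records the error in a global flag and lets execution continue.
bool g_error_seen = false;
void flag_error(const char *) { g_error_seen = true; }
```

After `set_handler(flag_error)`, a failing call returns the placeholder value and sets `g_error_seen` instead of terminating the process.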

You can opt out of exception handling entirely using --no-handle-exceptions, then we'll pretend that exceptions don't exist.

Note that --no-handle-exceptions can be set independently from -fno-exceptions:

If you use --no-handle-exceptions but not -fno-exceptions, then exceptions will escape C functions.
If you don't use --no-handle-exceptions but use -fno-exceptions, then the C headers will still include the functions for dealing with exceptions (such as setting custom handlers for them), but they will do nothing.

So the C headers are decoupled from whether your C library is internally compiled with -fno-exceptions or not. C users can check exception support at runtime.

One more thing: currently, using -fno-exceptions makes it impossible to recover from certain errors caused by incorrect usage of the bindings, which would be exceptions if exceptions were enabled. We may or may not fix this eventually.

Expose simple structs as structs

The default behavior is to expose all classes/structs as opaque heap-allocated pointers.

If you have sufficiently simple structs, you can opt into exposing them as plain C structs instead. The prime candidates for this are struct Vec3 {float x,y,z;}; and such.

To do this, pass your struct name to --expose-as-struct .... Use this flag multiple times to expose multiple structs. Invalid names are silently ignored.

The struct must be trivial enough: trivially-copyable, standard-layout, all members are public, no base classes. The members must have built-in types, or must themselves be structs passed to --expose-as-struct (so e.g. struct Mat3 {Vec3 x,y,z;}; can be exposed correctly). If those requirements aren't satisfied, the generator will complain.
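Before passing a struct to --expose-as-struct, you can approximate some of these requirements on the C++ side with standard type traits. This is a rough pre-check under our own assumptions, not what the generator does internally; `is_exposable_candidate` and `Named` are hypothetical names, and the "no base classes" and "members have built-in types" rules still have to be checked by hand.

```cpp
#include <string>
#include <type_traits>

struct Vec3 { float x, y, z; };
struct Mat3 { Vec3 x, y, z; }; // Fine, as long as `Vec3` is exposed as a struct too.

// Not a candidate: it has a private member, and `std::string` isn't trivially copyable.
class Named
{
    std::string name;
  public:
    int id;
};

// Covers only the part of the requirements that standard traits can express.
template <typename T>
constexpr bool is_exposable_candidate =
    std::is_class_v<T> && std::is_trivially_copyable_v<T> && std::is_standard_layout_v<T>;

static_assert(is_exposable_candidate<Vec3>);
static_assert(is_exposable_candidate<Mat3>);
static_assert(!is_exposable_candidate<Named>);
```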

The specified name needs to be fully qualified (with all namespaces), and include all template arguments, e.g. --expose-as-struct 'MyLibrary::Vector3<int>'. (The '...' quotes are not a part of the syntax, use quotes appropriate for your shell if needed.)

You can expose several classes via a regex, e.g. to expose Foo::Bar<T> for any T, you could do --expose-as-struct '/Foo::Bar<.*>/'. (Again, the '...' are your shell's quotes, while /.../ is a part of the syntax, to indicate that this is a regex.)

Make generated headers include each other

The generated headers are shy about including each other, preferring forward declarations when possible. The end result is that the end user will often have to include many ancillary headers (e.g. if a function returns std::string and you want to interact with its return value, you need to manually include the header with the binding for std::string too).

--add-convenience-includes fixes this. It's not enabled by default because it can be too eager with the includes, slowing down the user builds (though it adds a macro that lets users opt out of the extra includes).

Don't generate elementwise constructors for large structs

When a C++ struct is an aggregate (i.e. has no constructors, and in C++ can be initialized with a list of its members in braces), we try to generate a C function for it that acts as a constructor, with a parameter for every member.

This can get out of hand with huge structs. Use --preferred-max-num-aggregate-init-fields N (with e.g. N=20) to not generate those functions for structs with more than N members.

This flag is ignored for non-default-constructible structs, since omitting the function could make them impossible to construct.
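To illustrate what an elementwise constructor means here, the sketch below pairs a C++ aggregate with a C-callable function that takes one parameter per member. The function names and exact signatures are hypothetical; the names the generator actually emits depend on your types and on --helper-name-prefix.

```cpp
#include <cassert>
#include <type_traits>

// An aggregate: no constructors, can be initialized as `Point{1, 2}` in C++.
struct Point
{
    int x;
    int y;
};
static_assert(std::is_aggregate_v<Point>);

// Hypothetical shape of the elementwise constructor: one parameter per member,
// returning a heap-allocated object, matching the default opaque-pointer exposure.
extern "C" Point *Example_Point_ConstructFrom(int x, int y)
{
    return new Point{x, y};
}

// Hypothetical matching destroy function.
extern "C" void Example_Point_Destroy(Point *p)
{
    delete p;
}
```

For a struct with dozens of members such a function needs dozens of parameters, which is the situation --preferred-max-num-aggregate-init-fields is meant to avoid.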

Using fixed-size typedefs

Passing --canonicalize-to-fixed-size-typedefs will use int32_t and other similar standard typedefs instead of all built-in integer types.

This is purely a style choice, and doesn't help portability. For portability, see "Making the bindings cross-platform" below.

Making the bindings cross-platform

If you don't go out of your way, the generated C bindings code will by default only be usable on the platform you generated it on.

But with some care, it's possible to make the resulting code fully consistent and portable across platforms. This section explains how to do it.

After following the steps below, it's your job to test on all platforms you support to make sure that everything works. If you generate the bindings on different platforms, you might want to check that the outputs are exactly the same; if you generate on only one platform and then send the code to other platforms to compile it there, test that it actually compiles. Each of the platforms (Windows, Linux, Mac, Emscripten) has some unique differences that can affect the output if not addressed.

First the simple things:

Underlying types of enums

On Windows, the underlying type of plain enums defaults to int, while on Linux it's either int or unsigned int depending on whether there are negative constants or not. This underlying type shows up in the generated C headers.

This only affects the plain enum, not enum class, which is specified by the C++ standard to always default to the int underlying type.

The fix is to pass --implicit-enum-underlying-type-is-always-int to the parser. This will make it report the default type as int (if the underlying type isn't specified in C++), regardless of what it actually is.

So far this hasn't caused any breakage for us.
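The difference between the enum kinds can be checked directly with `std::underlying_type_t`. The enum names below are made up for the example; the assertions only state what holds on every platform, while `plain_enum_is_unsigned` captures the part that varies.

```cpp
#include <type_traits>

enum Plain { plain_a, plain_b };         // Implicit underlying type: platform-dependent.
enum PlainNegative { neg_a = -1, neg_b }; // Implicit, but must be able to hold -1.
enum class Scoped { a, b };               // Always defaults to `int`.
enum Fixed : unsigned char { fixed_a };   // Explicit underlying type, same everywhere.

static_assert(std::is_same_v<std::underlying_type_t<Scoped>, int>);
static_assert(std::is_same_v<std::underlying_type_t<Fixed>, unsigned char>);
static_assert(std::is_signed_v<std::underlying_type_t<PlainNegative>>);

// Platform-dependent: expected to be false on Windows (`int`),
// and true on Linux (`unsigned int`, since there are no negative constants).
constexpr bool plain_enum_is_unsigned = std::is_unsigned_v<std::underlying_type_t<Plain>>;
```

Only `Plain` and `PlainNegative` are affected by --implicit-enum-underlying-type-is-always-int, since the other two have no implementation-chosen type to begin with.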

`std::expected` vs `tl::expected`

If you switch between std::expected and tl::expected on different platforms, depending on what's available, this can help.

Pass --merge-std-and-tl-expected to the generator to remove the namespaces from both, renaming both to just expected.

This only renames things in C headers, but the full namespace will still be hardcoded in the generated .cpp files.

`size_t` and other standard typedefs

This is the biggest source of issues and differences between platforms.

The problem is that the MRBind parser expands the typedefs. So among other things, it's going to expand std::size_t, std::int64_t, etc. It means that in the C bindings, they will appear as long, long long, etc. (Can't we access the original spelling? Not in any non-trivial cases, more on that below.)

This isn't just a style issue. This causes code generated on different OSes to be different, and causes compilation errors when generating the code on one platform and then compiling it on another.

We can solve this by limiting what types you can use in your interfaces, and then applying certain type replacement rules to the parser results, making them consistent across platforms.

There are several different options:

| Option | Can use `long` | Can use `long long` | Can use `[u]int64_t` | Can use `size_t`/`ptrdiff_t` | Resulting code is cross-platform? | Parser must imitate a particular platform? (see below) |
|---|---|---|---|---|---|---|
| A | ❌ | ❌ | ✅ | ✅ | ✅ Yes | ⚠️ Yes |
| B | ❌ | ❌ | ⚠️* | ✅* | ⚠️ 64-bit only | ✅ No |
| A2 | ❌ | ✅ | ❌ | ✅ | ✅ Yes | ⚠️ Yes |
| B2 | ❌ | ❌ | ✅ | ❌ | ⚠️ 64-bit only | ✅ No |
| None | ✅ | ✅ | ❌ | ❌ | ❌ No | ✅ No |

Option A produces the best results, but it adds parser configuration complexity (as indicated in the last column). If that's undesired, then use option B. Options A2 and B2 should probably be avoided.

* In option B, you can't use the standard [u]int64_t in your interface, and instead must use a certain custom typedef, as explained below. Moreover, all uses of size_t/ptrdiff_t in your interface will be rewritten to [u]int64_t in the bindings (which makes the code not portable to 32-bit platforms, where those types are supposed to be 32-bit).

As you can see, every option other than None requires removing [unsigned] long from your interfaces, and every one except A2 also requires removing [unsigned] long long.

Each option requires passing certain flags to MRBind, which are explained below.

Option A

If you're not running the parser on a Mac, then you must pass --target=wasm32-unknown-emscripten to the parser (after --) to make it pretend that your platform is WASM/Emscripten. The resulting bindings will work on any platform; they are not WASM-specific. This flag only affects some predefined macros and include directories of the parser, and importantly it changes the standard typedefs to what is favorable to us. On Macs you don't have to do this, since their standard typedefs are already good.

The reason for this is that only on Emscripten and Mac, size_t/ptrdiff_t and [u]int64_t expand to different types (long and long long), which allows the parser to distinguish them.

You must also pass the following flags to the parser: --canonicalize-long-to-size_t --canonicalize-64-to-fixed-size-typedefs.

Passing --target=wasm32-unknown-emscripten will remove the standard library from the include search path. Normally you're supposed to provide Emscripten headers instead, using --sysroot=..., where ... can be obtained from Emscripten SDK by running echo | em++ -fsyntax-only -v -xc++ - 2>&1 | grep -oP '(?<=^ ).*(?=/include/c\+\+/v1$)' (note that this command generates that directory on the first run). But on Linux (on x64 but not on Arm) you can miraculously use the host headers (i.e. regular linux headers) instead; for that pass -stdlib=libstdc++ and the list of flags printed by running the following command: clang++ -xc++ /dev/null -fsyntax-only -v 2>&1 | awk '/^#include <...> search starts here:$/{x=1; next} !/^ /&&x{exit} x{gsub(/^ /,""); printf " -isystem%s", $0} END{print ""}'. And if it complains about a missing #include <gnu/stubs-32.h> on Ubuntu, install libc6-dev-i386.

The two flags --canonicalize-long-to-size_t and --canonicalize-64-to-fixed-size-typedefs cause the parser to rewrite [unsigned] long (which results from expanding size_t/ptrdiff_t) back to size_t/ptrdiff_t, and also to rewrite [unsigned] long long (which results from expanding [u]int64_t) back to [u]int64_t.

Option A2 is to trade [u]int64_t support for [unsigned] long long support. I don't see how that's useful, but that can be achieved by removing --canonicalize-64-to-fixed-size-typedefs and only keeping --canonicalize-long-to-size_t.

Option B

You must stop using std::[u]int64_t, and instead add the following typedef to your library:

// Note, those typedefs are not compatible with option A. If you want to support both options, you have to add a condition to this `#ifdef` that option A is not being used.
#ifdef __APPLE__
#include <cstddef>
namespace mylib
{
    using Int64 = std::ptrdiff_t;
    using Uint64 = std::size_t;
    static_assert(sizeof(Int64) == 8);
    static_assert(sizeof(Uint64) == 8);
}
#else
#include <cstdint>
namespace mylib
{
    using Int64 = std::int64_t;
    using Uint64 = std::uint64_t;
}
#endif

This #ifdef __APPLE__ is somewhat cosmetic, because on other 64-bit platforms (tested Windows and Linux), std::[u]int64_t expand to the same type as std::size_t and std::ptrdiff_t anyway. It's only there to support 32-bit platforms.

And then you pass --canonicalize-64-to-fixed-size-typedefs --canonicalize-size_t-to-uint64_t to the parser and --reject-long-and-long-long --use-size_t-typedef-for-uint64_t to the generator.

Here --canonicalize-64-to-fixed-size-typedefs causes long or long long to be rewritten back to [u]int64_t. Only one of those two types is rewritten (depending on the platform), and the other is left as is, but then the generator flag --reject-long-and-long-long is used to error if any such non-rewritten type got through (which indicates that you used a type in your interface that you weren't supposed to).

Macs are special in that they use different types for size_t/ptrdiff_t and [u]int64_t (long and long long respectively), which means you can't use both in your interface. Our solution to this is to replace [u]int64_t with a custom typedef that expands to size_t/ptrdiff_t. But then, since spelling it as size_t/ptrdiff_t in the generated code would be stupid, we use --canonicalize-size_t-to-uint64_t, which on Macs causes [unsigned] long to be rewritten as [u]int64_t, instead of rewriting [unsigned] long long that way. This would normally produce broken code, which is why we have --use-size_t-typedef-for-uint64_t that adds a custom 64-bit typedef (similar to the example above) to the generated C code.

Option B2 is a variant of this without the custom typedef and without --canonicalize-size_t-to-uint64_t + --use-size_t-typedef-for-uint64_t. You trade size_t/ptrdiff_t support for [u]int64_t support.

More details about `size_t`

Why are we doing all this, again?

This is a long story.

To recap the problem: all 64-bit standard typedefs expand to different types (long vs long long) on different platforms. In particular:

On Windows, long is 32 bits wide. So size_t, int64_t, and all other 64-bit wide standard typedefs use long long.
On Linux, long is 64 bits wide, so it's used for all those typedefs instead.
On Mac, long is 64 bits wide, but both long and long long are used, for different typedefs (because of course they are!). The typedefs with digits in their names (e.g. int64_t) use long long, while all the other ones (e.g. size_t) use long.
Emscripten works like Mac. Notably, in both 32-bit and 64-bit Emscripten, size_t always expands to long (which is 32-bit and 64-bit wide respectively).

In other words:

| Typedef | Windows x64 | Linux x64 | Mac and Emscripten |
|---|---|---|---|
| `[u]int64_t` | `long long` | `long` | `long long` |
| `size_t` and `ptrdiff_t` | `long long` | `long` | `long` |

This is a problem if any of those typedefs appear in the bindings. They are going to expand to long vs long long on different platforms, making the generated code incompatible between platforms.
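You can check which row of this table applies to your own toolchain with a small probe like the one below. The helper names are ours, and signedness is deliberately ignored so that size_t and ptrdiff_t report the same spelling.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <type_traits>

// Reports which built-in integer type `T` expands to, ignoring signedness.
template <typename T>
constexpr const char *expansion()
{
    using S = std::make_signed_t<T>;
    if constexpr (std::is_same_v<S, int>)
        return "int";
    else if constexpr (std::is_same_v<S, long>)
        return "long";
    else if constexpr (std::is_same_v<S, long long>)
        return "long long";
    else
        return "something else";
}

// `long` and `long long` are always distinct types, even when they have the same width.
static_assert(!std::is_same_v<long, long long>);

// Prints the expansions for the current platform.
void print_expansions()
{
    std::printf("int64_t   -> %s\n", expansion<std::int64_t>());
    std::printf("size_t    -> %s\n", expansion<std::size_t>());
    std::printf("ptrdiff_t -> %s\n", expansion<std::ptrdiff_t>());
}
```

Calling `print_expansions()` on each platform you target shows directly where the generated headers would disagree.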

Can't we have the parser not expand typedefs in the input?

Turns out we can't. Consider the following code:

#include <cstdint>

template <typename T>
struct Vec3
{
    T x, y, z;
};

Vec3<long> a();
Vec3<long long> b();
Vec3<std::int64_t> c();

We want this to generate 3 different C types: Vec3_long, Vec3_long_long, and Vec3_int64_t. Vec3_int64_t can be a typedef for one of the other two, but it still needs its own (appropriately named!) copies of all the member functions. Because to interact with the return value of c(), the clients should be using Vec3_int64_t_Get_x(...), not Vec3_long_Get_x(...), which would be non-portable.

Doing it this way appears to be impossible, or at least very difficult. While I can make Clang substitute long and long long into the Vec3<T> template, I have no idea how to substitute int64_t in a way that produces the third distinct class I can interact with. Even if it can be hacked to do this, this makes no sense from the C++ point of view, this isn't something that a compiler is expected to be able to do.

And we can't just take e.g. Vec3<long> and replace every mention of long inside with int64_t, because what if it originally was a literal long?
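The root of the problem is that a typedef never creates a new type, so there is no third template specialization for the parser to find. The following sketch shows this for the template from the example above; `vec3_int64_is_vec3_long` is our own name.

```cpp
#include <cstdint>
#include <type_traits>

template <typename T>
struct Vec3
{
    T x, y, z;
};

// `long` and `long long` always produce two distinct specializations.
static_assert(!std::is_same_v<Vec3<long>, Vec3<long long>>);

// `std::int64_t` is only an alias, so `Vec3<std::int64_t>` is one of those two, never a third type.
static_assert(std::is_same_v<Vec3<std::int64_t>, Vec3<long>> ||
              std::is_same_v<Vec3<std::int64_t>, Vec3<long long>>);

// Which of the two it is depends on the platform.
constexpr bool vec3_int64_is_vec3_long = std::is_same_v<Vec3<std::int64_t>, Vec3<long>>;
```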

So in the end, a fully generic solution appears to be impossible. So instead we do the next best thing.

Rationale behind approach B: unifying Windows and Linux

The idea is simple. You get rid of all mentions of long and long long in your API, and instead use the standard typedefs, doesn't matter which ones.

Then on Windows, we have the parser rewrite every long long (that can now only come from expanding int64_t or another typedef) back to int64_t (so all typedefs converge to this one), and the generator complain if it sees any long. And on Linux we do the opposite, rewriting any long back to int64_t, and complaining if we see any long long.

The end result is that you lose the ability to use long and long long in your API directly. You still can use the standard 64-bit wide typedefs, but all of them get rewritten to [u]int64_t. This means size_t also gets rewritten as uint64_t, which is a bit sad, but acceptable.

This is achieved with the following flags:

--canonicalize-64-to-fixed-size-typedefs makes the parser replace long long with int64_t on Windows, and long with int64_t on Linux.
--reject-long-and-long-long then makes the generator complain if it sees any long or long long (that wasn't replaced with a typedef).
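The combined effect of these two flags can be modeled as a single function. This is a simplified model of the rules described above, not the tool's code: the names are hypothetical, only the signed types are handled, and Mac is left out because it needs the extra steps explained in the next section.

```cpp
#include <cassert>
#include <optional>
#include <string>

enum class Platform { Windows, Linux };

// Takes a type as it appears after typedef expansion and returns its spelling in the bindings,
// or nothing if the type is rejected.
std::optional<std::string> canonicalize_64(Platform platform, const std::string &expanded)
{
    // The built-in type that the 64-bit typedefs expand to on this platform.
    const std::string typedef_target = platform == Platform::Windows ? "long long" : "long";

    // Parser, `--canonicalize-64-to-fixed-size-typedefs`: rewrite it back to the typedef.
    if (expanded == typedef_target)
        return std::string("int64_t");

    // Generator, `--reject-long-and-long-long`: the other type can only have been spelled literally.
    if (expanded == "long" || expanded == "long long")
        return std::nullopt;

    return expanded; // Everything else passes through unchanged.
}

void check_canonicalize_64()
{
    assert(*canonicalize_64(Platform::Windows, "long long") == "int64_t");
    assert(*canonicalize_64(Platform::Linux, "long") == "int64_t");
    assert(!canonicalize_64(Platform::Windows, "long"));
    assert(!canonicalize_64(Platform::Linux, "long long"));
}
```

Both platforms end up with the same `int64_t` spelling for anything that came from a standard 64-bit typedef, which is what makes the generated headers identical.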

Are we done yet? No.

Rationale behind approach B continued: unifying Mac with Windows/Linux

Mac breaks this beautiful hack, because it has typedefs for both long and long long. There size_t and ptrdiff_t use [unsigned] long, while [u]int64_t use [unsigned] long long. The rule of thumb is that if a typedef has digits in its name, it's going to expand to long long, and otherwise to long.

Because of this, with approach B you can't have both [u]int64_t and size_t in your API at the same time, you must choose one.

If you do nothing, size_t will get rejected, and [u]int64_t will work normally.

Some libraries might want to stop at this, but since size_t seems more valuable than [u]int64_t, we provide a workaround that lets you keep it.

Passing --canonicalize-size_t-to-uint64_t to the parser (which only has effect on Mac, but can be passed everywhere for consistency) makes it replace the other/wrong type with [u]int64_t: instead of long long, it'll replace long.

Normally this would produce broken code, but we counteract it with --use-size_t-typedef-for-uint64_t in the generator, which then rewrites [u]int64_t into a custom typedef MyLib_[u]int64_t, which on Mac is made to expand to std::size_t or std::ptrdiff_t respectively.

And to replace [u]int64_t we suggest using a custom typedef, as shown above.

Rationale behind approach A

This is the alternative to approach B.

The problem with B is that the resulting bindings are not portable across 64-bit vs 32-bit platforms. It's not always possible to generate separate sets of bindings for the two (e.g. if you have C# bindings too, and want them to run on both, and those call into the C bindings internally, so your C bindings must be compatible with both).

Also losing the ability to use [u]int64_t in the interface directly isn't very convenient.

What we can do instead is to embrace the convenient typedef structure of Emscripten and Mac. That is, we only run the parser for those platforms (note that the parser can run in "cross-"compilation mode, so you can run it on any platform, but it'll pretend to target Emscripten; then you can compile the resulting bindings for your actual platform).

Since size_t and ptrdiff_t there uniquely map to [unsigned] long, we can have the parser rewrite [unsigned] long back to size_t/ptrdiff_t. But this requires you to never use [unsigned] long in the API directly to avoid conflicts. This is achieved by passing --canonicalize-long-to-size_t to the parser.

Then, if you never use [u]int64_t in the API, you're done. But you can trade the support of [unsigned] long long for the support of [u]int64_t by also passing --canonicalize-64-to-fixed-size-typedefs to the parser to have it rewrite [unsigned] long long (which is what [u]int64_t expands to on Emscripten) back to [u]int64_t. Then [u]int64_t will work, but [unsigned] long long will stop working, because if you try to use it in your API, you'll get [u]int64_t from the parser, which will then expand to [unsigned] long on Linux (rather than long long), resulting in type mismatches and compilation errors.
