Swapping Bytes, Fast
Recently I was working on a C++ project that needed to read both little- and big-endian data from disk, with the endianness sometimes even changing partway through a given file. The goal was to define two functions, creatively named big and little, that could be used to correctly interpret fields of any size; practically speaking, this means 16-, 32- and 64-bit values. For example:
uint32_t a = 0xf0000000;
uint32_t b = 0x000000f0;
assert(little(a) == big(b));
In this case, when targeting a little-endian machine, big would ideally compile down to a single bswap instruction, with little being a no-op.
This is what I ended up with:
template <typename T, size_t n = sizeof(T)>
struct bswap_functor
{
    // Base template: any size without a matching specialization
    // fails at compile time, since sizeof is never zero.
    static_assert(sizeof(T) == 0, "bswap of unsupported size");
};

template <typename T>
struct bswap_functor<T, 1>
{
    T operator()(T value) const
    {
        return value; // a single byte needs no swapping
    }
};

template <typename T>
struct bswap_functor<T, 2>
{
    T operator()(T value) const
    {
        uint16_t v;
        memcpy(&v, &value, sizeof(v));
        v = (v << 8) | (v >> 8);
        memcpy(&value, &v, sizeof(v));
        return value;
    }
};

template <typename T>
struct bswap_functor<T, 4>
{
    T operator()(T value) const
    {
        uint32_t v;
        memcpy(&v, &value, sizeof(v));
        v = (v << 24) | ((v & 0x0000ff00) << 8) |
            ((v >> 8) & 0x0000ff00) | (v >> 24);
        memcpy(&value, &v, sizeof(v));
        return value;
    }
};

template <typename T>
struct bswap_functor<T, 8>
{
    T operator()(T value) const
    {
        uint64_t v;
        memcpy(&v, &value, sizeof(v));
        v = (v << 56) |
            ((v & 0x000000000000ff00ull) << 40) |
            ((v & 0x0000000000ff0000ull) << 24) |
            ((v & 0x00000000ff000000ull) <<  8) |
            ((v >>  8) & 0x00000000ff000000ull) |
            ((v >> 24) & 0x0000000000ff0000ull) |
            ((v >> 40) & 0x000000000000ff00ull) |
            (v >> 56);
        memcpy(&value, &v, sizeof(v));
        return value;
    }
};

template <typename T>
inline T bswap(T value)
{
    return bswap_functor<T>()(value);
}
Is this overkill? Almost certainly, especially considering that something similar could be implemented with a few overloaded functions. But, because it is specialized by size, we can swap types like double without requiring any additional code. This is achieved through the use of a functor: put simply, a functor is an object that can be called like a function. We're using one because we need to be able to declare a partial specialization, which is not allowed for functions in C++. The functor is ultimately wrapped in a traditional function to avoid the doubled-up bswap_functor()(value) syntax.
Each specialization begins by copying the provided value to an unsigned integer of the appropriate width. This ensures that the bitwise operators behave as expected (i.e. that bit shifts are logical shifts). Then, the bytes are swapped. This step could be replaced with an appropriate intrinsic (either compiler-specific or those defined in x86intrin.h), the network byte order functions, or even inline assembly; I'll leave this as an exercise for the reader. It is actually something worth exploring: while GCC, Clang and ICC all optimise the above code down to a bswap and a mov at -O2 (or even just a movbe with -march=haswell), MSVC fails to optimise the 64-bit case, even with /Ox.
Finally, the swapped bytes are copied back to the typed value and returned. We use memcpy rather than reinterpret_cast because the latter would violate strict aliasing and result in undefined behaviour, although in practice most compilers handle it gracefully.
The base template currently results in a compile-time error, as sizeof will always return a non-zero value. This could instead be extended to truly support arbitrary sizes, but types that require more than 8 bytes are somewhat uncommon. In a pinch, supporting 16-byte types like __int128 could be achieved by leveraging the 8-byte specialization:
template <typename T>
struct bswap_functor<T, 16>
{
    T operator()(T value) const
    {
        uint64_t v[2];
        memcpy(v, &value, sizeof(v));
        // Swap the two halves, byte-swapping each via the 8-byte functor.
        const uint64_t hi = bswap_functor<uint64_t>()(v[0]);
        v[0] = bswap_functor<uint64_t>()(v[1]);
        v[1] = hi;
        memcpy(&value, v, sizeof(v));
        return value;
    }
};
The downside of this approach is that you need to determine the endianness of the target platform at compile time. In C99 this can be achieved via type punning:
inline bool
is_little_endian(void)
{
    const union {
        uint16_t value;
        uint8_t bytes[2];
    } test = { 0x0001 };
    return test.bytes[0] == 1;
}
Each of the big four compilers will optimise this function out when targeting little-endian platforms; even MSVC, which does not fully support C99, is happy with this. Strictly speaking, though, union type punning like this is undefined behaviour in C++. Thankfully, C++20 added std::endian, which can be used to check the native byte order at compile time. Hence our desired functions can be defined like so:
template <typename T>
inline T big(T value)
{
    // bswap is the wrapper around bswap_functor defined earlier
    return std::endian::native == std::endian::big ? value : bswap(value);
}

template <typename T>
inline T little(T value)
{
    return std::endian::native == std::endian::little ? value : bswap(value);
}