Tuesday, November 25, 2014

Portability woes : Endianness and Alignment (Part 2)

In the previous part, endianness and 32/64-bit detection were detailed. Now we'll have a look at more complex memory alignment troubles.

Memory Access Alignment is a lesser-known issue, but its impact can be huge : it will crash your program or result in a disproportionate slowdown.

Let's first summarize the problem. When accessing a single byte in memory, there is no restriction. But when trying to access 2 bytes at a time (a short), or 4 bytes at a time (an int), alignment gets in the way.
The alignment property means that a 2-byte field must always be accessed at an even address (a multiple of 2), a 4-byte field at an address that is a multiple of 4, and so on.
From a performance perspective, accessing many bytes at a time is a win, as it makes better use of the memory bus. But when a multi-byte field is accessed at a non-aligned memory address, all sorts of problems can happen : bus width or addressing limitations, cache line overlap, memory segment border overlap, and so on.
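To make the alignment rule concrete, here is a minimal standalone sketch (the helper name is hypothetical, not from LZ4) checking whether an address is suitable for a 4-byte access :

```c
#include <stdalign.h>
#include <stdint.h>

/* Returns 1 if ptr is suitably aligned for a 4-byte access. */
static int is_aligned4(const void *ptr)
{
    return ((uintptr_t)ptr % alignof(uint32_t)) == 0;
}
```

On platforms where a 32-bit integer requires 4-byte alignment, such a check returns 0 for odd addresses, and it is exactly these addresses which trigger the problems described above.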

All these issues can be solved, and indeed, on the most widely known programming environment, x86/x64, these problems have been solved for a very long time.
But it has a cost : it makes the CPU more complex, and consumes some precious transistor space. As a consequence, several CPU vendors chose to be a bit lazy on these issues, deciding not to address them, leaving the problem in the hands of software developers. In such cases, if an unaligned memory access is nonetheless performed, the CPU raises an exception, typically resulting in a program crash.
To make matters more complex, some CPUs address alignment issues, but in an inefficient manner, resulting in undesirably slow performance. Others address it correctly for short (2-bytes) or int (4-bytes), but not for long long (8-bytes).

The data alignment issue is well described, with many sources throughout the Internet. Unfortunately, finding a proper portable solution for it is not, with many "advisers" simply telling you to avoid unaligned access altogether. Thanks, really.
But the above condition cannot be met in all circumstances. Consider how the compression algorithm works : it looks for similar patterns in past data. Such patterns can appear anywhere, not just at "aligned" addresses.

For portable code, this situation is a nightmare. The "safe" approach would be to always access data byte-by-byte, but then the impact on performance is huge, and for speed-oriented applications such as LZ4, this trade-off is unacceptable.

The way it was handled by LZ4 up to now relied on the compiler. The basic idea is : the compiler should be fully aware of whether its target CPU can, or cannot, handle unaligned memory accesses.
This is achieved through the "pack" instruction, which, in a nutshell, tells the compiler to "not align these data", and therefore generate suitably cautious assembler code when accessing them.

It gives the following result :

#if defined(__GNUC__)  && !defined(LZ4_FORCE_UNALIGNED_ACCESS)
#  define _PACKED __attribute__ ((packed))
#else
#  define _PACKED
#endif

#if !defined(LZ4_FORCE_UNALIGNED_ACCESS) && !defined(__GNUC__)
#  if defined(__IBMC__) || defined(__SUNPRO_C) || defined(__SUNPRO_CC)
#    pragma pack(1)
#  else
#    pragma pack(push, 1)
#  endif
#endif

typedef struct { U16 v; }  _PACKED U16_S;
typedef struct { U32 v; }  _PACKED U32_S;
typedef struct { U64 v; }  _PACKED U64_S;
typedef struct {size_t v;} _PACKED size_t_S;

#if !defined(LZ4_FORCE_UNALIGNED_ACCESS) && !defined(__GNUC__)
#  if defined(__SUNPRO_C) || defined(__SUNPRO_CC)
#    pragma pack(0)
#  else
#    pragma pack(pop)
#  endif
#endif

#define A16(x)   (((U16_S *)(x))->v)
#define A32(x)   (((U32_S *)(x))->v)
#define A64(x)   (((U64_S *)(x))->v)
#define AARCH(x) (((size_t_S *)(x))->v)
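Reduced to its essence, the scheme can be sketched standalone as follows (a sketch assuming GCC, with uint32_t standing in for U32, and a renamed struct to keep it self-contained) :

```c
#include <stdint.h>
#include <string.h>

/* The packed attribute tells GCC the field may sit at any address,
 * so it must generate alignment-safe code when accessing it. */
typedef struct { uint32_t v; } __attribute__((packed)) U32_unaligned;

#define A32(x) (((const U32_unaligned *)(x))->v)
```

Copying a value to an odd address and reading it back through A32() returns it intact, where a plain *(uint32_t*) dereference could fault on strict architectures.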

If it looks a bit complex, that's because it is.
The first issue is that the "pack" instruction must be issued in a variety of ways, depending on the compiler and its version. It translates into this monstrous macro construction, trying to cover all situations reported so far. As you can guess, I regularly receive notice of new situations the macro cannot cope with.
This is about as bad as the previous stories regarding 32/64-bit detection and endianness.

But there is more.
Compilers are not that clever.
In many circumstances, the compiler will issue "safe" but slow code to access unaligned data, even though the target CPU is able to handle this situation efficiently, resulting in a large speed drop. This is especially true for recent ARM CPUs.
To counterbalance this effect, there is a need to manually "turn off" the "pack" instruction, using in the above example the #define LZ4_FORCE_UNALIGNED_ACCESS.
Unfortunately, the manual switch is almost never used. Source code will most of the time be compiled "as is", which is no surprise.

So we have 2 issues : issuing the "pack" instruction is complex, and not future-proof, and compilers don't automatically make the best choice.

To save the day, maybe a new runtime check will help, like for previous issues ?
Alas, there is no portable runtime test available to check for alignment properties.
(Note : of course, one could imagine running an external program/process just for this purpose, but it's outside of the scope of a little single-file library).

So we are stuck, aren't we ?

Well, that's the difficult part. To make some progress on the current situation, I'm going to change the logic : instead of relying on the compiler, take explicit charge of handling unaligned accesses.

The core foundation of the new solution is based on the function below, already used within lz4frame :

static U32 LZ4_readLE32 (const BYTE* srcPtr)
{
    U32 value32 = srcPtr[0];
    value32 += (srcPtr[1]<<8);
    value32 += (srcPtr[2]<<16);
    value32 += (srcPtr[3]<<24);
    return value32;
}

What's good with this function ?
It handles both endianness and alignment in a safe way. The code is portable.
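For completeness, the symmetric little-endian store can be written the same way (a sketch with local typedefs; this write counterpart is not shown in the post) :

```c
#include <stdint.h>

typedef uint8_t  BYTE;
typedef uint32_t U32;

/* Byte-by-byte little-endian store : safe for any endianness and alignment. */
static void LZ4_writeLE32 (BYTE* dstPtr, U32 value32)
{
    dstPtr[0] = (BYTE) value32;
    dstPtr[1] = (BYTE)(value32 >> 8);
    dstPtr[2] = (BYTE)(value32 >> 16);
    dstPtr[3] = (BYTE)(value32 >> 24);
}
```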

What's wrong with it ?
It's the safe approach, and is therefore slower than necessary when the CPU can correctly handle unaligned memory accesses.

So, we will now special-case CPUs which do support efficient unaligned access.

Some of them are easily detectable, thanks to widely supported standard macros : __i386__, __x86_64__ and __ARM_ARCH_7__ are known architectures with good support for unaligned memory accesses. __ARM_ARCH_6__ is also able to handle them, but in a less efficient manner, so it's unclear if it's really faster than the portable version.
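A detection switch based on these macros could look like the following sketch (the macro name LZ4_UNALIGNED_ACCESS is hypothetical) :

```c
/* Architectures known to handle unaligned memory accesses efficiently.
 * The list is expected to grow as contributors report their results. */
#if defined(__i386__) || defined(__x86_64__) || defined(__ARM_ARCH_7__) \
 || defined(__ARM_ARCH_7A__)
#  define LZ4_UNALIGNED_ACCESS 1
#else
#  define LZ4_UNALIGNED_ACCESS 0
#endif
```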

Finding a list of CPU with efficient support of unaligned memory accesses (and their related detection macro) has proven an impossible task so far. One may have in mind that Linux faces a similar challenge, which is supposed to be solved thanks to the macro CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS. Unfortunately, I couldn't find a place where this macro is defined. It seems to be a decentralized methodology, where each architecture tells independently if it's compatible or not. For the Linux kernel, it's likely the correct method. But that also means there is no central repository where this property is listed.

So I'm a bit stuck right now.
My expectation is that external contributors interested in LZ4 performance may benchmark the speed of the new version, tentatively enabling/disabling the prominent switch at the top of lz4.c when they see fit :

 * You can force the code to use unaligned memory access, should you know your CPU can handle it efficiently.
 * If it effectively results in better speed (up to 50% improvement can be expected)
 * please report your configuration to upstream (https://groups.google.com/forum/#!forum/lz4c)
 * so that an automatic detection macro can be added to mainline.

Each time a CPU is known to efficiently handle unaligned memory access, its standard detection macro can be added to the list, in order to automatically use the faster code path.

Is it any better than previous method ?
Well, sort of.

To begin with, there is no longer a need to fiddle with the different "pack" instructions depending on compilers and versions. This is one less complexity to manage, and one less dependency to worry about.
Taking formal charge of unaligned access allows the introduction of dedicated functions. For example, a more adapted "string comparison" function could be introduced.

More importantly, instead of crashing when the detection fails, the library will now run slower, but still run correctly. But it introduces a new risk : many users may simply not notice the slowdown, and just use the library "as is", unaware of the latent performance improvement which could be unleashed. The hope is that, as long as a few contributors can detect and report the performance issue, the situation can be solved for everybody with a similar setup.

Latest version of LZ4 using these new detection routines is accessible in the feature branch "AlignEndian" : https://github.com/Cyan4973/lz4/tree/AlignEndian

It's possible to compare it with the latest "official" release r124. On x86/x64 CPUs, it was benchmarked and proved to provide similar performance. On other CPUs though, it's still worthwhile to check.


  1. Unfortunately if you lie to the compiler about the alignment of your data then you may still encounter alignment faults even on recent ARM CPUs. The support for loading from unaligned addresses applies to simple load and store instructions, not to every memory accessing instruction the CPU has. If the compiler is sufficiently optimising to notice that it could combine multiple memory accesses into a single LDRD/STRD or Neon load/store insn then it will fault if they're not aligned correctly.

    In the long run I think it's better to write portable code and get the compilers fixed if they generate particularly poor code in certain cases.

    Have you investigated how good the codegen is for doing unaligned accesses via memcpy?

    uint16_t r;
    memcpy(&r, unaligned_ptr, sizeof(r));
    return r;

    This isn't as daft as it sounds, because compilers do know how to inline small memcpy and should emit the correct simple unaligned load/store insn.
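Expanded into a self-contained helper (a sketch; the function name is chosen here for illustration) :

```c
#include <stdint.h>
#include <string.h>

/* Unaligned 16-bit read : the compiler is expected to inline the
 * small fixed-size memcpy into a single load where the target allows it. */
static uint16_t read16(const void *unaligned_ptr)
{
    uint16_t r;
    memcpy(&r, unaligned_ptr, sizeof(r));
    return r;
}
```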

    1. Initially (a long time ago), LZ4 used memcpy() for both literal and matches. It quickly proved much less efficient than the internal "wildcopy" on x86/x64 CPU, most probably due to pipelining and branches.

      Conclusion should vary on different target hardware though.

    2. But specifically for doing an unaligned load there should be no problem with pipelining or branches because the compiler will turn "memcpy 4 bytes into local and return that local" into a single load instruction if it can. It should generate exactly the same code as your current approach in the cases where it's safe to do so, I think.

    3. Yes, for CPUs which do not support unaligned memory accesses, this is probably right.

      I intend to update the code to use this method instead for unaligned accesses. The problem is, I can't test the performance impact (only the correctness), since I don't have a relevant platform available around. So any performance evolution is speculative.

    4. I forgot to mention, but this technique is what QEMU actually uses to implement unaligned accesses (we have to be portable to a wide range of host systems), so it's not just a speculative suggestion on my part.

    5. Nice to know :) I'm actually using QEMU for correctness tests for big-endian targets.
      But as you already know, I can only use the test for correctness, not to track a "performance improvement", since the emulator may not reflect real performance impact.

  6. For information : the latest update of the feature branch now uses memcpy() as the backup for unaligned memory accesses :

  2. I presume you saw the Debian bug report on this issue
    (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=757037) but I
    wonder if you saw this thread also?

    It contains the disheartening quotes: "liblzo2 seems to be one of
    these codebases that bases its idea of how C works on portability
    folklore and the assumption that the compiler and standard C library
    are the most naive implementation possible :-(" and "sounds like
    someone thought C was an assembler and not a language with quite tight requirements on what you are allowed to do. "

    Sadly, I fear they are right. The world has changed. You are still
    viewing C as the proverbial "thin layer over assembly", which allows
    you to write portable code that can be compiled to run on processors
    with different instruction sets and capabilities. In this worldview,
    the compiler takes the source code you write and translates the
    operations into equivalent assembly appropriate for the target,
    possibly with some optimizations that do not change the outward
    behaviour of the code.

    Unfortunately, in the minds of current compiler maintainers, this
    is but a quaint fairytale. Instead, modern C as defined by the
    current specs is a typed language with significant restrictions on
    what's legal, and if you ask it to do something the spec defines as
    "undefined", rather than generating the code you asked for (that would work just fine on the processor you know you are compiling for) the compiler will probably just pretend that code wasn't there. And when it does this, you are supposed to be grateful that it was kind enough not to format your hard drive.


  3. Unbelievable as it may seem to someone who views C as a low-level
    systems language that lets you write efficient code that takes
    advantage of the capabilities of modern processors, modern C simply
    does not provide any legal way to read or write potentially unaligned
    multibyte variables except byte-by-byte. Those optimizations which
    double your speed on processors with fast unaligned access? Just a
    temporary oversight on the part of the compiler, a bug waiting to happen, and eventually your comeuppance will be served. Seems vicious, but see Mike Stump's comments as a GCC maintainer in this (really long but insightful) thread: http://compgroups.net/comp.arch/if-it-were-easy/2993157

    The only legal way in C to do what you want is to write something like the byte-oriented code that you have done or to use memcpy() (which is part of the C standard at this point, and thus fair game for compiler optimization/omission). In either case, the 'fast path' macro that has so far given you good performance simply must go. If the compiler gods smile on you, they will optimize your seemingly slower code to run the way you had intended. If not (as has been your experience), presumably you were worshipping the wrong version of the compiler god. Upgrade and try again. Or try a different god. But whatever you do, if you value your hard drive and aren't wearing nose plugs, don't have the hubris to assume you are the one telling the compiler what to do.

    Personally, I'm coming to the conclusion that the only reliable way to obtain the level of performance you are seeking is to write in assembly, distribute compiled code, or both. Write legal C that will provide a variable level of reasonably good performance, and then write optimized versions of functions in assembly that can be used on recognized processors. You don't even have to write the assembly yourself --- it's perfectly fine to find a compliant compiler, try out every optimization option possible, verify that it's generating
    correct performant code, and lock it down to binary or assembly. If binary, distribute for all the major formats or point people to Agner Fog's converter (http://www.agner.org/optimize/#objconv). If assembly source, NASM is probably your most portable bet. One day there might be a viable intermediate code representation you could use instead (like LLVM IR:
    https://idea.popcount.org/2013-07-24-ir-is-better-than-assembly/), but
    I don't think it's there yet.

    1. If it was up to me, I would consider that it should be up to the compiler to convert something like :
      var32 = *(U32*)(ptr);
      into the optimal and nonetheless safe assembler code, depending on target CPU specifics and pointer assumption (the compiler should know if a pointer is guaranteed aligned or not. By the way, clang is already able to properly check this condition, there's no magic.)

      But the compiler doesn't, and just converts the above expression directly into a 32-bit access, *as if* it were a simple assembler translator.
      I don't like it, but that's the way it works today.

      Obliging the application programmer to write a byte-by-byte access method to be safe is already one step too far. It's transferring the burden of a hardware restriction into the application realm. It's tangible proof that C compilers, and sometimes the C specs themselves, don't do a good enough job at shielding the application from hardware specifics.

      But since it's the real situation being experienced today, it's required to cope with it today. Waiting for compilers to make progress is not a good enough proposition.

    2. > You are still viewing C as the proverbial "thin layer over assembly"

      If you look at current LZ4 source code, you'll see it makes a lot of use, and depends on, advanced compiler logic.
      For example, I just wrote in a previous blog post that I'm writing static functions with the expectation that the compiler shall notice their result will always be the same, and actually never generate the associated assembler segment, using their direct result instead.
      LZ4 highly depends for its performance on the compiler to correctly notice that some tests are always true or false, and trigger accordingly the relevant dead-code & branches eliminations.

      The C source code is not expected to be translated "instruction by instruction" : it's too different from the generated assembler code, and actually requires the compiler interpretation and optimization to run to its full potential. So it's no surprise there's a big performance gap between compiling the library with -O3 and compiling it with no or little optimization.

    3. I haven't looked closely at the code, but having seen other examples of what you've written, I presume it's stellar. I'm basing my (rhetorically enhanced) statement based on your statements in this blog: "What's wrong with it ? It's the safe approach, and therefore is slower than necessary when CPU can correctly handle unaligned memory accesses." and "Each time a CPU is known to efficiently handle unaligned memory access, its standard detection macro can be added to the list, in order to automatically use the faster code path."

      This is the out-of-date world view that presupposes that source code which specifies byte-by-byte access will result in assembly code that uses byte-by-byte access, and hence be slower than technically undefined source code that directly specifies an unaligned read on a supported processor. This presumed correlation between source and output is the point of failure. The correct, current, fragile, and problematic view is that your goal as a programmer is to specify the outcome you desire while staying within the spec, and leave it to the compiler to generate efficient code for the processor. Once you start thinking of targeting a specific processor you are juggling knives, for unless your syntax is 100% legal, the compiler is free to ignore your obvious intent. It has no obligation to do the obvious thing that you clearly intend, and at some point will get its revenge.

      If you don't want to trust the compiler to "do the right thing" (and how could you, considering that you don't even know what processor your end user will be targeting, much less which version of which compiler they will use), the only reliable thing you can do is to write your optimizations in a language other than legal C. If you can specify the exact compiler to be used, you might be able to specify a combination of command line flags requesting an unofficial C dialect, but even this is unlikely to be portable across architectures.

      There simply is no way to specify in the general case that you want to the compiler to merely translate your C source to appropriate assembly as written. Rather than meaning "without optimization", "-O0" now means something closer to "Do every operation in the most pedantic way possible, including storing every variable in memory for ease of debugging". While you might be able to achieve your desired result of "do what I said without pessimizing the code" with the right combination of specific optimization-disabling flags for a specific compiler, there is no reliable way of doing so in general.

      Although it saddens and revolts me, if you don't want to trust the end users compiler to optimize memcpy() or byte-by-byte copies to the highly efficient processor-supported single machine instruction that you desire, I think your only option is to write the optimized version in some language other than C. And I say this as a C fanatic who's not sure where else to turn. For some current projects, if I can't figure out a good means to distribute assembly, I'm genuinely considering falling back to Fortran for my inner loops.

  4. I read at the link below that memory alignment is not significant. Does it just depend on the study and testing?


    1. It is really platform dependent.
      As said, Intel CPUs are among the best ones at handling unaligned access, especially for "common" data types, such as uint32_t.
      I ran some tests equivalent to Daniel's, with equivalent results. If you want to make a quick test, xxhsum integrates a benchmark which verifies speed on unaligned access.

      But if you change the CPU, the answer will be different. As stated in the article, some CPUs will simply *crash* your application on unaligned access. So in such cases, it's damn critical.
      Other CPUs handle unaligned accesses, but with weak performance, so it's still important too.

      Therefore, the problem of alignment is mostly not applicable to standard PCs. It's for everything else.

  5. Hi Yann,

    I just stumbled across your very interesting post. I recently found that I could shrink enums to 8 bits only by appending the packed attribute to them, and that it worked for unions as well, but I never thought about using that for unaligned data! Thus I went a bit further based on your idea and used a union instead :

    typedef union {
        unsigned char u8;
        signed char s8;
        unsigned short u16;
        signed short s16;
        unsigned int u32;
        signed int s32;
        unsigned long long u64;
        signed long long s64;
    } __attribute__((packed)) packed;

    That way I can dereference pointers with the type I need.
    Eg: ((packed *)x)->u32

    I verified that the code is correct and identical to a regular load on i386, x86_64, armv7-a, and armv7-a in thumb mode, and produces lots of byte loads on armv5 which doesn't support unaligned accesses. I was surprised to discover that mips32 does support unaligned accesses. The memcpy() equivalent code is more or less optimized on mips32 (adds stupid add/sub on sp), and the memcpy() on armv5 is not optimized at all (does a byte copy to a local variable).
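    A self-contained reduction of this union trick (relying on the GCC/Clang packed attribute; the helper name is chosen for illustration) :

```c
#include <string.h>

/* Packed union : GCC/Clang-specific, lets members sit at any address. */
typedef union {
    unsigned int u32;
    unsigned short u16;
} __attribute__((packed)) packed;

/* Read a 32-bit value from a possibly unaligned address. */
static unsigned int load_u32(const void *x)
{
    return ((const packed *)x)->u32;
}
```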

    What I like with this approach is that it relies on the compiler's knowledge of the target and uses it exactly as it defines the way to access structure members. If it fails on one arch, it will require a bug report. And contrary to another post above, I don't like to start to suspect the compiler of sometimes being wrong and avoiding what it provides "just in case". I'm fine with working around known bugs but not supposed ones.

    You were speaking about performance, I found unaligned accesses to be not really fast on atom in 32-bit mode, accessing a 32-bit unaligned word is as slow as accessing the 4 bytes separately there.

    1. Hi Willy

      I trust you are in an excellent position to test these techniques on real hardware. Your feedback is highly appreciated.

      For the record, since this article was written, I completely changed to the "qemu" method, as described above by pm215. All reads are converted to memcpy(). You can have a look at its latest incarnation within the "dev" branch on github :

      In essence, this method "trusts" the compiler (and its standard libraries) on the best way to access a memory segment (as you suggest). Forcing the read length to a fixed "4" or "8" bytes is just a hint, which is supposed to help trigger a 32-bit/64-bit read on CPUs supporting unaligned accesses. But the compiler is free to decide differently if it believes a better opcode is preferable.

      The difference with the "pack" method is mostly about portability : the pack attribute is non-standard, and varies more or less depending on the compiler, and even on compiler versions. This proved a pain to maintain. memcpy() does not suffer such differences between platforms.

      Nonetheless, it might happen that pack and memcpy() end up being compiled differently, with one method proving better than the other, depending on target, compiler, versions, and maybe even optimizer settings. I have no good way to measure extensively such difference today.