Saturday, January 14, 2012

How does cache line cause false sharing?

 http://drdobbs.com/go-parallel/article/217500206?pgno=1

"In two previous articles I pointed out the performance issue of false sharing (aka cache line ping-ponging), where threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time. [1,2] This causes real but invisible performance contention; whichever thread currently has exclusive ownership so that it can physically perform an update to the cache line will silently throttle other threads that are trying to use different (but, alas, nearby) data that sits on the same line. It's easy to see why the problem arises when multiple cores are writing to different parts of the same cache line, because only one can hold the exclusive hardware lock at a time. In practice, however, it can be even more common to encounter a reader thread using what it thinks is read-only data still getting throttled by a writer thread updating a different but nearby memory location, because the reading thread has to invalidate its copy of the cache line and wait until after the writer has finished to reload it.

A number of readers have asked for more information and examples on where false sharing arises and how to deal with it. I mentioned one concrete example in passing in [3] where Example 4 showed eliminating false sharing as one of the stages of optimizing a queue.

This month, let's consider a concrete example that shows an algorithm in extremis due to false sharing distress, how to use tools to analyze the problem, and the two coding techniques we can use to eliminate false sharing trouble.  "

Thursday, January 12, 2012

Word Alignment

Word alignment is not a particularly difficult concept, but it is fairly important, because it does show up in unusual places.

What you know

We've defined a word to mean 4 bytes. To store a word in byte-addressable memory (i.e. where each element of memory is one byte), you have to break up the 32 bit quantity into 4 bytes. Thus, if the word was 0x01ab23cd, it's broken up into 0x01, 0xab, 0x23, 0xcd. You can store this in two ways. If it's big endian, than the most significant byte (i.e., 0x01) is stored in the smallest of four consective addresses. The data 0xab, 0x23, 0xcx are stored in the following three memory addresses. Thus, if you stored the first byte in address 1000, the remaining bytes are stored in addresses 1001, 1002, and 1003.
For little endian, you store the least significant byte (0xcd) in the smallest address (in our example, this is address 1000), then 0x23, 0xab, and 0x01. Thus, it's stored in reverse order.
Even though it's somewhat inaccurate to say this, we say a word is stored at, say, address 1000. That is, we pick the smallest address, and say that's where the data is located in memory. Thus, if the data has N bytes, then it is stored in address A to A + N - 1, and we say that the data is at address A.

Word alignment

However, there's a second issue. For reasons of making hardware simpler (and sometimes because the ISA defines it this way), words are often stored at word aligned addresses. Word-aligned means the address is stored at an address that's divisible by 4. If you look at an address that's divisible by 4 and written in binary, you see that the last two bits are 0.
Why is this interesting? Whenever you're dealing with word quantities they must appear at word aligned addresses. Consider the following structure (written in C++):
struct Foo {
   char x ;  // 1 byte
   int y ;   // 4 byte, must be word-aligned
   char z ;  // 1 byte
   int w ;   // 4 byte, must be word-aligned
} ;
In C/C++, data is stored in the order declared. Thus, x, y, z, and w appear in that order in memory. In principle, the amount of memory needed by Foo should be 10 bytes (1 byte for each char, 4 bytes for each int variable).
However, due to word aligment, it will probably take more than 10 bytes. In particular, if y and w are both word aligned, and z is in between, there will be 3 unusued bytes. Thus, the structure may be 13 bytes large, with 3 filler bytes, used for padding.
To see this in action, try declaring a structure or class as above, then use the sizeof operator, and see how many bytes it has.
Byte quantities can be stored at any address in memory. Halfword quantities (16 bits) are often stored at half-word aligned addresses (addresses divisible by 2). Doubleword quantities (64 bits) are often stored at double-word aligned addresses (addresses divisible by 8). You see these restrictions most often on a RISC ISA.
CISC ISAs may not necessarily require alignment of words, etc.

Chart

This chart summarizes the characteristics of word-alignment.
Quantity Address divisible by (Binary) address ends in
Byte 1 anything
Halfword (16 bits) 2 0
Word (32 bits) 4 00
Doubleword (64 bits) 8 000

Specific structure packing when using the GNU C Compiler

The GNU C compiler does not support the #pragma directives. In particular it does not support the "#pragma pack" directive. So when using the GNU C compiler, you can ensure structure packing in one of two ways
  • Define the structure appropriately so that it is intrinsically packed. This is hard to do and requires an understanding of how the compiler behaves with respect to alignment on the target machine. Also it is hard to maintain.
  • Use the "packed" attribute against the members of a structure. This attribute mechanism is an extension to the GNU C compiler. An example of how you would do this is below.
    struct test
     {
             unsigned char  field1 __attribute__((__packed__));
             unsigned short field2 __attribute__((__packed__));
             unsigned long  field3 __attribute__((__packed__));
     } var1, var2;
    
    
    Note the use of the keyword "__attribute__" with the attribute "__packed__" within the double brackets (before the terminating semicolon of each member variable declaration).
    An alternate way of doing the above is as below.
    struct test
     {
             unsigned char  field1;
             unsigned short field2;
             unsigned long  field3;
     } __attribute__((__packed__));
     
     typedef struct test test_t;
     
     test_t var1, var2;
    
    This will ensure that all members of the structure are packed. Note that this doesn't seem to work right if you try to combine the typedef and the struct definition or if you combine variable declarations with the structure definition.