Sunday, December 18, 2011

select, poll or epoll

For very small numbers of sockets (varies depending on your hardware, of course, but we're talking about something on the order of 10 or fewer), select can beat epoll in memory usage and runtime speed. Of course, for such small numbers of sockets, both mechanisms are so fast that you don't really care about this difference in the vast majority of cases.
One clarification, though: both select and epoll scale linearly. A big difference, however, is that the userspace-facing APIs have costs that depend on different things. The cost of a select call grows roughly with the value of the highest-numbered file descriptor you pass it. If you select on a single fd whose value is 100, that's roughly twice as expensive as selecting on a single fd whose value is 50. Adding more fds below the highest isn't quite free, so it's a little more complicated than this in practice, but it's a good first approximation for most implementations.
The cost of epoll is closer to the number of file descriptors that actually have events on them. If you're monitoring 200 file descriptors, but only 100 of them have events on them, then you're (very roughly) only paying for those 100 active file descriptors. This is where epoll tends to offer one of its major advantages over select. If you have a thousand clients that are mostly idle, then when you use select you're still paying for all one thousand of them. However, with epoll, it's like you've only got a few - you're only paying for the ones that are active at any given time.
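To make that concrete, here is a minimal sketch of an epoll event loop (error handling omitted; listen_fd is assumed to be a listening socket that was created and bound elsewhere). The key point is that epoll_wait hands back only the descriptors that actually have events, so each iteration costs roughly in proportion to the number of active connections, not the number being monitored:

#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

// Minimal sketch of an epoll loop; listen_fd is assumed to be a listening
// socket set up elsewhere.
void event_loop(int listen_fd)
{
    int epfd = epoll_create1(0);

    epoll_event ev = {};
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    epoll_event ready[64];
    for (;;) {
        // epoll_wait returns only descriptors that actually have events, so
        // the cost of each iteration tracks the number of active connections.
        int n = epoll_wait(epfd, ready, 64, -1);
        for (int i = 0; i < n; ++i) {
            if (ready[i].data.fd == listen_fd) {
                int client = accept(listen_fd, 0, 0);
                epoll_event cev = {};
                cev.events = EPOLLIN;
                cev.data.fd = client;
                epoll_ctl(epfd, EPOLL_CTL_ADD, client, &cev);
            } else {
                char buf[4096];
                ssize_t got = read(ready[i].data.fd, buf, sizeof(buf));
                if (got <= 0) {
                    epoll_ctl(epfd, EPOLL_CTL_DEL, ready[i].data.fd, 0);
                    close(ready[i].data.fd);
                }
                // ... otherwise handle the data in buf ...
            }
        }
    }
}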
All this means that epoll will lead to less CPU usage for most workloads. As far as memory usage goes, it's a bit of a toss-up. select does manage to represent all the necessary information in a highly compact way (one bit per file descriptor), and the FD_SETSIZE (typically 1024) limitation on how many file descriptors you can use with select means that you'll never spend more than 128 bytes for each of the three fd sets you can use with select (read, write, exception). Compared to those 384 bytes max, epoll is sort of a pig: each file descriptor is represented by a multi-byte structure. However, in absolute terms, it's still not going to use much memory. You can represent a huge number of file descriptors in a few dozen kilobytes (roughly 20k per 1000 file descriptors, I think). And you can also throw in the fact that you have to spend all 384 of those bytes with select if you only want to monitor one file descriptor but its value happens to be 1023, whereas with epoll you'd only spend 20 bytes. Still, all these numbers are pretty small, so it doesn't make much difference.
And there's also that other benefit of epoll, which perhaps you're already aware of: it is not limited to FD_SETSIZE file descriptors. You can use it to monitor as many file descriptors as you have. And if you only have one file descriptor, but its value is greater than FD_SETSIZE, epoll works with that too, but select does not.
Randomly, I've also recently discovered one slight drawback to epoll as compared to select or poll. While none of these three APIs supports normal files (i.e., files on a file system), select and poll present this lack of support by reporting such descriptors as always readable and always writable. This makes them unsuitable for any meaningful kind of non-blocking filesystem I/O, but a program which uses select or poll and happens to encounter a file descriptor from the filesystem will at least continue to operate (or, if it fails, it won't be because of select or poll), albeit perhaps not with the best performance.
On the other hand, epoll will fail fast with an error (EPERM, apparently) when asked to monitor such a file descriptor. Strictly speaking, this is hardly incorrect. It's merely signalling its lack of support in an explicit way. Normally I would applaud explicit failure conditions, but this one is undocumented (as far as I can tell) and results in a completely broken application, rather than one which merely operates with potentially degraded performance.
In practice, the only place I've seen this come up is when interacting with stdio. A user might redirect stdin or stdout from/to a normal file. Whereas previously stdin and stdout would have been a pipe -- supported by epoll just fine -- it then becomes a normal file and epoll fails loudly, breaking the application.
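You can see the failure with just a few lines (the path below is only an example): epoll_ctl refuses the regular-file descriptor with EPERM, while select or poll would happily, if uselessly, report it as ready.

#include <sys/epoll.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    // Open an ordinary file on the filesystem (the path is just an example).
    int fd = open("/etc/hostname", O_RDONLY);
    int epfd = epoll_create1(0);

    epoll_event ev = {};
    ev.events = EPOLLIN;
    ev.data.fd = fd;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
        perror("epoll_ctl on a regular file");   // prints: Operation not permitted

    close(epfd);
    close(fd);
    return 0;
}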

Saturday, December 17, 2011

oprofile

Profiling code to find performance bottlenecks is a relatively common operation. My goal here isn’t to give you a detailed understanding of OProfile. I just want to convey to you that it’s incredibly easy to use and extremely powerful. If you don’t know much about profiling, or use only gprof because it’s the only profiler you know, please read on.
OProfile is a system profiler for the Linux platform and has been my absolute favorite tool for profiling code. I have no idea why it isn't more popular (maybe I just don't run in the right circles). OProfile does have somewhat onerous setup requirements, but it has become standard fare for many distributions, which now have excellent support for it. It's become substantially easier to get set up over the past few iterations of my favorite distributions (why am I lying? I only really use Ubuntu… but it's still true!). For many distributions, it's become trivial to set up and use, and it should be a standard part of any Linux developer's toolkit. (Windows and Mac developers have similar tools that work on similar principles: VTune and Shark.)
The most commonly used profiler is gprof. Most programmers tend to know about gprof and can probably profile something given enough googling. In my opinion, given the choice, OProfile is a far superior tool. I’d like to first discuss the primary differences between profilers like gprof and profilers like OProfile. If you happen to be a gprof ninja and notice any mistakes, please let me know. Hopefully, though, I can convince you that OProfile is both technically superior and easier to use.

OProfile vs gprof

The Inner Workings

Both OProfile and gprof work on a statistical sampling method. They periodically poke into your program(s), figure out what code is currently executing, and increment a counter for that symbol. If you let this run long enough, with a high enough sample rate, you'll get a pretty accurate picture of where the time goes. The primary difference between OProfile and gprof is what triggers these samples.
Gprof uses a special system call that will periodically sample the currently running process only. This means gprof will only be aware of things that happen in "user time" for your process. You will not see bottlenecks that occur in external shared libraries (like libc) or the kernel. This can leave gprof's results very skewed for certain types of bottlenecks (page faults, file I/O, memory fragmentation, etc.).
If you are using gprof and the ‘time’ command doesn’t agree with gprof’s cumulative total time, you’ve likely hit this limitation.
OProfile, on the other hand, is a system-wide profiler that triggers on hardware events and will record not only all of the current symbols being executed but also which process they belong to. This means OProfile sees all. If your code is causing a major bottleneck in the kernel or in libc, OProfile will see it and gprof will not. The downside of this kind of omniscience is that OProfile requires special permissions. Typically, this means the sudo password. If you are doing computational optimization, this is usually not a problem.

Call Graphs

Gprof requires that code be built with the -pg flag. This instruments the actual code with information that will help build an accurate call graph. The upside is that you get your call graph and you also get cumulative timings (how much time was spent inside this function and all of its children). The downside is that you have to instrument your code, which means recompiling it with special flags. There are two issues here worth noting. First, the -pg flag can interfere with other flags (like -fomit-frame-pointer). This can make certain bits of code no longer compile or work correctly (for example, inline assembly). Second, adding instrumentation to the code can cause fundamental changes. There is a Heisenberg effect: by trying to measure your code, you affect its true performance.
OProfile, on the other hand, requires no instrumentation of the code. This means OProfile does not need your code to be recompiled with any special flags (so long as the binary in question hasn’t had the symbols stripped). Since OProfile’s only source of information is the hardware-based event sampling, it doesn’t have any real information about call graphs. OProfile is able to build “statistical” call graphs where it can make guesses about which functions are calling which. This typically requires some interpretation on your part to fully decipher. If you need an accurate call graph, OProfile might not be the best tool.

Multithreaded Code

I have no idea if this issue is resolved in gprof yet but gprof does not (or, at least, did not) support multithreaded code. Oprofile does.

Summary

OProfile pros:
  1. Supports hardware based events (more than just CPU clock cycles: cache misses, etc.)
  2. Supports multi-threaded code.
  3. Sees all processes and will find bottlenecks in other places (like kernel and libc).
  4. Does not require any special compilation flags, or even recompiled code.
Cons:
  1. Requires root access
  2. Requires kernel support
  3. Does not provide (precise) call graphs nor cumulative timings
  4. Not portable (Linux only)

Setting Up OProfile

Most popular distributions have an OProfile package that can be easily installed. The only caveat is that some distribution flavors come with kernels that don't have OProfile support built in. This means you'll need a new kernel (they might offer an alternative in their package system, or else you will have to… compile it yourself… gah). I've found over the last few years that many of the most popular distributions ship with kernels that support OProfile. Ubuntu, for example, has for a while now (though the server version, I believe, does not).
Installing the tool is usually fairly trivial (vanilla Ubuntu users need only: sudo apt-get install oprofile). Once installed, the only setup required is:
sudo opcontrol --no-vmlinux
This tells OProfile that you do not have an uncompressed binary of your kernel (your vmlinux file). This means that OProfile will assign all kernel samples to a black box called "no-vmlinux". If you see "no-vmlinux" high on the charts, you will know you have bottlenecks inside the kernel. You can peer into the kernel by telling OProfile where the vmlinux for your kernel can be found. If you build your own kernel, you should already know what to do. By default, however, most distributions ship with only a compressed kernel (vmlinuz). Typically, there is a package that you can download that will also give you an uncompressed version of the kernel. For example, Ubuntu (Hardy Heron) calls it linux-image-debug-{version}. If you install that package, it will put a vmlinux file in /boot/. If you point OProfile at that file, you will be able to get a breakdown of what is happening inside your kernel.
Usually, by default, your system's prebuilt binaries are stripped of their symbols. This means that all of these applications and libraries will be black boxes to OProfile as well. If, for whatever reason, a particular program seems to be bottlenecked in libc, for example, and you want the profiler to break down what's happening in libc, you will need to install the version of libc that contains symbols (look for the -dbg package). Furthermore, if you want to profile some pre-built binary, you will need a version without the symbols stripped out.

Using OProfile

It’s really this simple:
sudo opcontrol --reset
sudo opcontrol --start
./run_my_code
sudo opcontrol --shutdown
opreport -lt1
Those are pretty much all of the commands you need to know to use OProfile. You reset and start the profiler, run your code, and then shut it down. The final step is viewing the results. The opreport command in my example takes two flags. The -l flag shows you inside each process (otherwise you get just an overview of each process and its respective usage of system resources). The -t1 flag sets a threshold at 1% so you don't see every symbol.
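For instance, suppose run_my_code were the following toy program (purely illustrative; the function name waste_time is made up for this example). After the steps above, opreport -lt1 would show waste_time at or near the top of the list.

// Build with something like: g++ -O2 -g waste.cpp -o run_my_code
#include <cstdio>

// Deliberately expensive function so the profiler has something to find.
double waste_time()
{
    double sum = 0.0;
    for (long i = 1; i < 200000000L; ++i)
        sum += 1.0 / static_cast<double>(i);
    return sum;
}

int main()
{
    std::printf("%f\n", waste_time());
    return 0;
}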

Advanced Beginner Usage

Cache Misses

On x86 machines, it's also fairly easy to track cache misses using OProfile. By default, OProfile tracks the CPU_CLK_UNHALTED hardware event, which is essentially a measure of where your code spends its time. You can change the hardware event (sudo opcontrol --list-events shows all the other options). In particular, you can switch to the L2_RQSTS event with a unit mask that includes only 'INVALID' requests. Finding the right mask requires a bit of internet searching, or the Intel optimization manuals. I've gone ahead and looked it up for you, though:
sudo opcontrol --event=L2_RQSTS:1000:0xf1
Events are specified in this way: the first number is the count (how many events must occur before a sample is taken; a lower count gives finer-grained data but adds more overhead). The second number is the unit mask. In this case, the 0xF1 means across all CPUs, but only INVALID L2_RQSTS.

WTF moments

Have you ever built some complicated new piece of code and found you aren't getting the results you think you should? If you are anything like me, your first thought is "Am I even running the new code?". Maybe you open your editor, dig down, and drop in a few printf()s to figure out if you are even calling the new super fantastic function. This obviously isn't the smartest way to do it. OProfile solves this problem trivially by letting you sample any compiled code. You can just fire up the profiler and look at all the symbols that were caught. This is usually the fastest way to figure out whether the code you wanted to be called is actually being called.

Thursday, December 15, 2011

Default constructors? Don't rely on them, there is no free lunch

C++ provides the following four member functions by default if you don't declare them yourself:

  • Default constructor
  • Copy constructor
  • Destructor
  • Assignment operator
Most of the time, we think we can just use these defaults and not bother writing our own. Since they are free, why not?

But sometimes, depending on the compiler, it may surprise you.  For example, for the following class:

struct Foo
{
    enum Enum
    {
        SIZE = 1024
    };

    char m_buf[SIZE];
};


Is it any different from the following implementation, where we provide our own default constructor (and destructor) that do nothing?

struct Foo
{
    Foo()
    {}

    ~Foo()
    {}

    enum Enum
    {
        SIZE = 1024
    };

    char m_buf[SIZE];
};

If you think they are the same, you are wrong. I am not sure whether it is a gcc bug or a C++ feature, but I tried both g++ 3.4.6 and g++ 4.1.2, and the first implementation (the one without the constructor) is much, much slower than the second version.

If you just do a new, the first implementation can take more than 100ns, and can grow to 300ns if SIZE becomes 4096.
For the second implementation, it only takes about 40ns, regardless of the size, whether 2 bytes or 4096 bytes.

From the assembler code, it looks like the compiler-generated path does a memset over the buffer, while the second version does nothing. Really a surprise. The likely explanation (assuming the objects were created with new Foo(), with parentheses) is that this form requests value-initialization: when the class has no user-provided constructor, value-initialization zeroes every member, including the whole buffer, whereas once you provide your own empty constructor, the buffer is simply left uninitialized.
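If you want to check this on your own compiler, here is a rough timing sketch (the class names, sizes, and iteration counts are just for illustration, and heavy optimization may elide the paired new/delete calls entirely):

#include <chrono>
#include <cstdio>

struct FooPlain                       // no user-provided constructor
{
    enum Enum { SIZE = 4096 };
    char m_buf[SIZE];
};

struct FooCtor                        // empty user-provided constructor
{
    FooCtor() {}
    enum Enum { SIZE = 4096 };
    char m_buf[SIZE];
};

// Times 'new T()' (note the parentheses, which request value-initialization)
// and returns the average cost per allocation in nanoseconds.
template <typename T>
double time_new(int iterations)
{
    using clock = std::chrono::steady_clock;
    clock::time_point start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        T* p = new T();
        delete p;
    }
    clock::time_point stop = clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / iterations;
}

int main()
{
    const int n = 1000000;
    std::printf("no ctor:    %.1f ns per new\n", time_new<FooPlain>(n));
    std::printf("empty ctor: %.1f ns per new\n", time_new<FooCtor>(n));
    return 0;
}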

Sunday, December 11, 2011

How to trim a std::string?

When dealing with input, we often want to trim it first. But C++ std::string lacks such a trim function; the following is a pretty decent implementation:


void trim(std::string& str)
{
    std::string::size_type pos = str.find_last_not_of(' ');
    if (pos != std::string::npos) {
        str.erase(pos + 1);
        pos = str.find_first_not_of(' ');
        if (pos != std::string::npos) str.erase(0, pos);
    }
    else str.erase(str.begin(), str.end());
}
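A quick usage example (assuming the trim() above is in scope):

#include <iostream>
#include <string>

int main()
{
    std::string s = "   hello world   ";
    trim(s);                                      // the trim() defined above
    std::cout << "[" << s << "]" << std::endl;    // prints [hello world]
    return 0;
}

Note that this version only strips spaces; if you also want tabs and newlines, pass a character set such as " \t\n\r" to find_last_not_of and find_first_not_of.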

Saturday, December 10, 2011

Implement your own memory pool or just use Hoard?

Have you ever been bothered by the speed of memory allocation, e.g., new/delete or malloc/free? Performance-wise, they are essentially the same thing. To speed this up, many applications implement their own memory-pool-based allocation mechanism and then override operator new/delete to allocate/release memory from the pool. For example:

class MemPool
{
public:
    void* allocate();
    void  release(void* p);
};

class Foo
{
public:
    static void* operator new(size_t size)
    {
        return m_pool.allocate();
    }

    static void operator delete(void* p, size_t size)
    {
        m_pool.release(p);
    }

private:
    static MemPool m_pool;
    // ...
};
This seems to be a standard strategy that many people use. There is nothing wrong with it, but if you have such a memory pool in your project, you should compare its performance against Hoard's; the result may surprise you.


This is exactly what happened to me recently.

If I only create and delete Foo from one thread, my memory pool is much faster than Hoard.

However, if I do this from multiple threads, say four, my pool becomes much slower than Hoard. The amazing thing about Hoard is that its performance is not affected by the number of threads at all.


My pool is a very normal pool: it just pre-allocates enough space and from then on does not allocate memory from the system anymore, so it should be much faster than plain new/delete. However, to manage the free list I used a user-space spin lock, and this turns out to be the main reason for the slowness compared to Hoard.


Hoard, however, adapts to how it is used: it keeps a separate heap for each thread, and hence sees far less lock contention between threads.
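To get a feel for why per-thread heaps help, here is a minimal sketch of the idea (this is not Hoard's implementation, the block and pool sizes are arbitrary, and it ignores cross-thread frees, which Hoard does handle). Each thread carves blocks out of its own free list, so the common allocation path never takes a lock:

#include <cstddef>
#include <vector>

// Minimal sketch of the per-thread-heap idea (not Hoard's implementation).
class LocalPool
{
public:
    LocalPool(std::size_t blockSize, std::size_t blockCount)
        : m_blockSize(blockSize),
          m_storage(blockSize * blockCount)
    {
        for (std::size_t i = 0; i < blockCount; ++i)
            m_free.push_back(&m_storage[i * blockSize]);
    }

    void* allocate()
    {
        if (m_free.empty())
            return ::operator new(m_blockSize);   // fall back to the global allocator
        void* p = m_free.back();
        m_free.pop_back();
        return p;
    }

    void release(void* p)
    {
        // Sketch only: assumes p was allocated by this thread's pool.
        m_free.push_back(static_cast<char*>(p));
    }

private:
    std::size_t        m_blockSize;
    std::vector<char>  m_storage;
    std::vector<char*> m_free;
};

// One pool per thread: allocate()/release() on the hot path never touch a lock.
thread_local LocalPool t_pool(64, 4096);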

Please check http://www.hoard.org/ to learn more about Hoard.

Monday, December 5, 2011

How to add CDT (C++ Development Tools) to Eclipse on Linux Ubuntu

To install CDT support, do the following:
Fire up eclipse:
  1. Help->Install New Software
  2. Click Add
  3. In location add: http://download.eclipse.org/tools/cdt/releases/galileo
You should be able to select this now from the drop down box titled “Work with:”
Follow it through and it will install what you need.

Friday, December 2, 2011

Be practical, how to create a thread-safe singleton?

As a programmer, every design comes down to making choices, and that is hard. Good, experienced programmers make good choices. Here is one example.
In many projects you need to design a singleton, and it has to be thread-safe. How do you make it thread-safe?

class Foo
{
public:
   static Foo* getInstance();

private:
   static Foo* s_foo;

   Foo();
   Foo(const Foo& foo);
   Foo& operator=(const Foo& rhs);
};

Foo* Foo::s_foo = NULL;

Solution 1:  Put everything into one critical section:

Foo* Foo::getInstance()
{
    lock();
    if (!s_foo)
        s_foo = new Foo();
    unlock();
    return s_foo;
}

Is this thread-safe? Yes: it does both the s_foo check and the construction inside the critical section. But you can also easily see that it is not efficient, since every call takes the lock, and there is no way it can be used in a low-latency project.

Solution 2:  Double checked locking

Foo* Foo::getInstance()
{
    if (!s_foo)
    {
        // zone 1
        lock();
        // zone 2
        if (!s_foo)
            s_foo = new Foo();
        unlock();
    }
    return s_foo;
}

Why do we need to check s_foo again inside the lock? s_foo is NULL initially, so when two threads call getInstance() simultaneously, both can get into zone 1. Eventually one thread gets the lock, initializes s_foo, and releases the lock; then the other thread reaches zone 2 as well, and you can see that if we don't check s_foo again here, we will create two Foo objects.

Is it efficient? Yes. But is it really thread-safe? Unfortunately, no. See the following link for the out-of-order memory write issue that comes up on multi-core machines.

http://www.ibm.com/developerworks/java/library/j-dcl/index.html

You can see that to make it perfect, we would have to keep tweaking here and there. But is all this tweaking really necessary? Most of the time, it is not.

Let's see one solution:

Foo& Foo::getInstance()
{
    static Foo f;
    return f;
}

Is this implementation perfect? Of course not. But is it practical? Yes, it is very practical, and it works perfectly in almost all projects. The real beauty of it is how simple it is! Someone might argue that we could potentially end up constructing the static object twice here (depending on how your compiler handles static local initialization), and that is possible. But if you initialize the singleton by calling Foo::getInstance() once before it is ever used concurrently, e.g., right after parsing the program arguments inside main(), that is more than enough to make it thread-safe in practice.
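For example (parse_arguments and start_workers are just placeholders for whatever your program does):

void parse_arguments(int, char**) { /* placeholder */ }
void start_workers()              { /* placeholder: spawns the worker threads */ }

int main(int argc, char** argv)
{
    parse_arguments(argc, argv);   // still single-threaded at this point
    Foo::getInstance();            // construct the singleton before any other thread exists
    start_workers();               // worker threads may now call getInstance() freely
    return 0;
}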

Thursday, December 1, 2011

log4cxx is quite slow

log4cxx is an open-source logging library used quite often in many projects by many firms. However, when trying to measure some latency recently, I noticed this library is too slow to be used in a low-latency project. So what factors contribute to the slowness?


Think about it: if you were about to write a logger which needs to support multiple writers and one reader, what would you do? log4cxx won't be too much different!

First, it needs to create a record of what needs to be logged. Regardless of whether the record comes from a memory pool or from raw memory, this takes time.

Second, it needs to format the log data into some string form and save it in the record created in the first step; again, this step is not trivial, time-wise.

Third, it needs to push the record onto a queue so that the reader thread can write it to a log file. Will this enqueue need some kind of lock? You bet.
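A minimal sketch of that multi-writer/one-reader shape (this is not log4cxx's code, just an illustration of where the three costs show up):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// Minimal sketch, not log4cxx: every writer formats a record and pushes it
// into a shared queue under a lock; a single reader thread drains the queue
// and writes to the log file.
class AsyncLogQueue
{
public:
    void log(const std::string& msg)
    {
        std::string record = "[INFO] " + msg;          // steps 1 and 2: build and format the record
        {
            std::lock_guard<std::mutex> guard(m_lock); // step 3: the enqueue needs a lock
            m_records.push(record);
        }
        m_ready.notify_one();
    }

    // Called in a loop by the single reader thread, which writes to the file.
    std::string next()
    {
        std::unique_lock<std::mutex> guard(m_lock);
        m_ready.wait(guard, [this] { return !m_records.empty(); });
        std::string record = m_records.front();
        m_records.pop();
        return record;
    }

private:
    std::mutex              m_lock;
    std::condition_variable m_ready;
    std::queue<std::string> m_records;
};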

In the next post, we will talk about some fast loggers.

Context Switches

When dealing with low-latency programming, e.g., microseconds or nanoseconds, one thing we want to avoid as much as we can is context switches. Once a context switch happens, the program switches from user space to kernel space, and this is very costly; the latency will increase quite dramatically.

How do we avoid context switches, then?


First, avoid system calls. If the same thing can be done in user space quickly, don't rely on system calls. For example, many programmers nowadays use user-space spin locks quite often; these can be built from a few simple atomic operations (see the sketch at the end of this post). Otherwise, you would fall back to pthread_mutex, which is quite expensive when contended.

Second, avoid frequent memory allocations, e.g., malloc/free or new/delete; these calls eventually make system calls to get memory from the heap (brk). Instead, allocate the memory just once, but allocate enough, and build a memory pool out of it.
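As mentioned above, a user-space spin lock can be built from a single atomic flag. This is only a sketch, not production code: it busy-waits and never yields, so it is only suitable for very short critical sections.

#include <atomic>

// Minimal user-space spin lock: acquiring and releasing it never enters the
// kernel, unlike a contended pthread_mutex.
class SpinLock
{
public:
    SpinLock() { m_flag.clear(); }

    void lock()
    {
        while (m_flag.test_and_set(std::memory_order_acquire))
            ;   // busy-wait entirely in user space
    }

    void unlock()
    {
        m_flag.clear(std::memory_order_release);
    }

private:
    std::atomic_flag m_flag;
};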