Thursday, March 01, 2012

Performance and Scalability on IBM POWER7

I recently had a chance to run some benchmarks of PostgreSQL 9.2devel on an IBM POWER7 machine provided by IBM and hosted by Oregon State University's Open Source Lab.  I'd like to run more tests, but for a first pass I decided to run a SELECT-only pgbench test at various client counts and scale factors.  This basically measures how quickly we can do primary key lookups in a completely read-only environment.  This generated a pretty graph.  Here it is.
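For anyone who wants to try a similar sweep, here is a minimal sketch of how the pgbench invocations could be generated.  It is not the script used for these results: the database name, the run duration, and the exact list of client counts are assumptions for illustration, and only the scale factors come from this post.  The flags are standard pgbench options (-i and -s to initialize at a scale factor, -S for the SELECT-only workload, -c/-j for clients and worker threads, -T for duration, -n to skip vacuuming).

```python
# Sketch only: builds pgbench command lines for a SELECT-only sweep.
# The dbname, duration, and client counts below are illustrative assumptions.

def init_command(scale, dbname="pgbench"):
    """Command that (re)initializes the pgbench tables at a given scale factor."""
    return ["pgbench", "-i", "-s", str(scale), dbname]


def select_only_command(clients, seconds=300, dbname="pgbench"):
    """Command for one SELECT-only run with the given number of clients."""
    return [
        "pgbench",
        "-n",                  # skip vacuum before the run
        "-S",                  # built-in SELECT-only script
        "-c", str(clients),    # concurrent client connections
        "-j", str(clients),    # one pgbench worker thread per client
        "-T", str(seconds),    # run for a fixed duration
        dbname,
    ]


def sweep(scales, client_counts, seconds=300, dbname="pgbench"):
    """Yield an init command per scale factor, then one run per client count."""
    for scale in scales:
        yield init_command(scale, dbname)
        for clients in client_counts:
            yield select_only_command(clients, seconds, dbname)


if __name__ == "__main__":
    # Print the commands rather than running them, so they can be reviewed first.
    for cmd in sweep([100, 300, 1000, 3000], [1, 8, 16, 32, 64]):
        print(" ".join(cmd))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; piping the output to a shell on a test machine would perform the actual sweep.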


There are a couple of interesting things about this graph.  First, this machine has 16 cores, but 64 hardware threads.  A hardware thread, clearly, is not quite as good as a full core.  So, unsurprisingly, all four curves have an inflection point right around 16 clients, where every core has one client to run.  After that, performance keeps going up - in fact, in the scale factor 100 and 300 cases, it increases slightly even beyond the 64-client mark at which all hardware threads are presumably saturated.  pgbench was running on the same machine as the database server, and so competing with it for those hardware threads, which may or may not be related.

Second, the absolute performance is quite good.  Here's an older graph, taken on a 32-core AMD 6128 machine.  That was generated using an older code base, but I don't think too much has changed in the interim, at least not that's relevant to this test.  So we can see that this POWER7 machine is pretty fast - we get significantly more transactions per second, despite having fewer real cores.  Also note that this machine is running Linux 3.2.x, which should mean that the Linux lseek scalability problem I complained about previously is no longer an issue.

Third, it's interesting to note what happens as we increase the size of the data set.  Scale factor 100 generates a database of about 1502 MB, so both that one and the scale factor 300 run are operating on databases that fit entirely inside the database's page cache, shared_buffers, which I had set to 8GB for these tests.  On the other hand, the runs at scale factor 1000 and scale factor 3000 are larger than the database's page cache, so we've got to copy pages in and out from the operating system as they are used.  Those data sets still fit in the machine's RAM, and PostgreSQL doesn't use direct I/O, so we're just copying from the operating system's page cache, not the disk.
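As a back-of-the-envelope check on which tier each run lands in, here is a small sketch.  It assumes the database size grows linearly with the scale factor, extrapolating from the roughly 1502 MB measured at scale factor 100; that linearity is an approximation, not a measurement.  It also simplifies by comparing against the full 64GB of RAM, ignoring the fact that shared_buffers itself takes 8GB of it.

```python
# Rough estimate only: assumes size is proportional to scale factor,
# extrapolated from ~1502 MB at scale factor 100 (figure from the post).
MB_PER_SCALE_UNIT = 1502 / 100
SHARED_BUFFERS_MB = 8 * 1024   # shared_buffers setting used for these tests
RAM_MB = 64 * 1024             # total RAM on the server


def estimated_size_mb(scale):
    """Approximate database size in MB for a pgbench scale factor."""
    return scale * MB_PER_SCALE_UNIT


def where_it_fits(scale):
    """Name the fastest tier that can hold the whole data set."""
    size = estimated_size_mb(scale)
    if size <= SHARED_BUFFERS_MB:
        return "shared_buffers"
    if size <= RAM_MB:
        return "OS page cache"
    return "disk"


if __name__ == "__main__":
    for scale in (100, 300, 1000, 3000, 10000):
        print(scale, round(estimated_size_mb(scale)), "MB ->", where_it_fits(scale))
```

Under these assumptions, scale factors 100 and 300 land in shared_buffers, 1000 and 3000 in the operating system's page cache, and 10000 spills to disk.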

Still, data copying is expensive, so we'd expect some performance degradation.  At the lower client counts, it's not too bad: at 16 clients, scale factor 1000 is just 6% slower than scale factor 100.  As we ramp up the number of clients, though, things get quite a bit worse.  At 32 clients, the regression has increased to 13%, and at 64 clients, it's 41%.  There's a known concurrency problem with buffer allocation (a lock called BufFreelistLock), so these results aren't entirely surprising, but they do illustrate that, at least for this workload, the issue isn't so much raw performance as scalability.  The extra data copying does hurt, but the lock contention hurts more.
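For clarity, the percentages above are relative slowdowns in transactions per second against the scale factor 100 run at the same client count.  A minimal sketch of that arithmetic follows; the TPS values in it are made-up placeholders chosen only to illustrate the formula, not numbers from these runs.

```python
def slowdown_pct(baseline_tps, tps):
    """Percentage by which `tps` falls short of `baseline_tps`."""
    return (baseline_tps - tps) / baseline_tps * 100.0


if __name__ == "__main__":
    # Hypothetical throughput figures, for illustration only.
    baseline = 100_000.0   # e.g. scale factor 100 at some client count
    larger = 94_000.0      # e.g. scale factor 1000 at the same client count
    print(f"{slowdown_pct(baseline, larger):.1f}% slower")
```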

I did one more set of test runs using scale factor 10000.  This data set was so large that it didn't even fit in memory - the server has 64GB of RAM.  Of course, this led to a huge drop-off in performance, so it didn't make sense to put those results on the same graph.  But I made a separate graph with just those results.


I don't think this server has a particularly powerful I/O subsystem, but even if it did, disks are a lot slower than memory, and this benchmark is generating completely random I/O, which is not something disks are very good at.  Nevertheless we seem to do a pretty good job saturating the available I/O capacity.

When I have the opportunity, I'd like to run some read-write tests on this machine as well; I'll post those results when I have them.

8 comments:

  1. On a production server where the clients run on separate machines, is there any benefit to using a CPU with more SMT threads (like SPARC processors)? Or, because PostgreSQL handles each connection in its own process, is it better to have more cores?

  2. Well, in this test, the hardware threads are good, but not as good as full cores. Whether that's true in general, I don't know.

  3. Wonder how things could scale with SSD enterprise drives.

  4. So if I have a machine with multiple cores, should I or should I not use hyperthreading technology to get more logical CPUs?

  5. You should not, since the HT cores won't give you the performance of real cores, and the OS does not know about it...

  6. Modern OSes are much improved with regard to hyperthreading. They now understand that a virtual CPU is not a real CPU. They know which virtual CPUs are on the same die and will not leave a die idle while another one is overscheduled. And operations such as memory accesses are actually quite expensive in terms of CPU cycles, even though we often think of them as 'fast', so any categorical advice not to enable hyperthreading is likely to merit revisiting in the future.

  7. Agreed that enabling hyperthreading is a good idea. Older x86 systems didn't allow an idle core to run as fast with hyperthreading enabled, but that was fixed a few years ago, and modern operating systems know about hyperthreading, so there is little reason to disable it, at least on x86.

    Not sure about Power. Robert's tests show the system using the virtual cores, but it was not tested with hyperthreading disabled.

  8. I think that some of the issues people saw with hyperthreading on earlier versions of PostgreSQL may have been lock contention issues that are resolved in PostgreSQL 9.2. But that's just a guess. I think the only real way to know what's right on your system is to try it both ways.
