In the year since Intel Corp. (NASDAQ: INTC) released the Nehalem-EP quad-core Xeon CPU, all the hubbub surrounding that chip and its new design has proven accurate. Bigger, better, faster, more — a whole lot more than anything that Intel had ever released before. But that was then, this is now, and as they say, what have you done for me lately?
Today, the Nehalem-EP gives way to the Westmere-EP, or X5600-series Xeon CPU. Whereas the Nehalem was a big-time performance bump over the previous generation, the Westmere is a more incremental and predictable improvement, but it’s definitely a better chip. Westmere picks up where Nehalem left off.
The key developments in Westmere-EP are two more cores (six total), the ability to address two DIMMs per channel at 1,333MHz, a 50 percent larger L3 cache, a set of instructions (AES-NI) for accelerating AES encryption, and better CPU power management. Westmere is the equal of Nehalem in single-threaded workloads, but far more scalable thanks to the additional two cores per die. The speed of Westmere’s encryption operations will also turn heads.
In fact, the 400 percent performance increase shown with the AES-NI instructions makes the overhead of whole-disk encryption almost unnoticeable. Previously, encryption required a fairly sizable performance trade-off, but with Westmere’s AES performance jump, it becomes a no-brainer. And that’s just one of many potential use cases for the AES-NI features.
Generation gap
The Westmere is built on the same basic design as the Nehalem (integrated memory controller, shared L3 cache per socket, and QuickPath Interconnect, or QPI), but it’s based on a 32nm process rather than Nehalem’s 45nm. It runs at up to 3.33GHz per core, with two threads per core via Hyper-Threading. That’s 24 logical CPUs in a two-socket system (two sockets times six cores times two threads), all balanced against 6.4GT/s QPI. It’s definitely fast, but not dramatically faster than Nehalem CPUs running at the same clock speed.
Like Nehalem, Westmere implements Turbo Mode to ramp up the clock speed on certain cores depending on load. Turbo Mode benefits single-threaded and lightly threaded applications by increasing the performance of a few cores when needed.
Also, Westmere CPUs sit in the same sockets as Nehalem CPUs. In fact, some Nehalem-based mainboards can support Westmere already, possibly requiring a BIOS update. This isn’t true of all Nehalem systems, however, so do some research first.
In a bid to reduce power consumption, Westmere CPUs can gate off unused cores and shut them down, saving their state in cache. Yes, Nehalems can do this too, but Westmere chips can also gate off the uncore: the region of the CPU that handles not central processing but memory control, L3 cache, bus controllers, and so on. Whereas a Nehalem could power gate each core, the Westmere can power gate everything, which reduces power draw at idle.
Also in the realm of reducing power consumption, the Westmere CPUs can use low-voltage DDR3 RAM running at 1.35 volts as well as standard DDR3 1.5-volt DIMMs. In addition to the relatively small reduction in power draw, low-voltage DIMMs generate less heat, thereby reducing overall cooling requirements, which is especially significant in servers and blades with high RAM counts.
Bench time
I had the opportunity to run a series of benchmarks on two sets of Westmere chips, the X5670s and X5680s. Both are six-core parts: the X5670s run at 2.93GHz per core, while the X5680s run at 3.33GHz per core. The tests were my standard array of real-world workloads rather than mainline benchmarking tools. They consist of LAME MP3 audio conversion tests, gzip and bzip2 compression tests, MD5 calculation tests, and MP4-to-FLV video conversion tests. Each of these tests is a single-threaded process, but they are run concurrently in increasing numbers to measure the performance of the processors under various loads. I start at one process per physical core, then ramp up the ratio significantly.
For these tests, I compared a two-CPU, eight-core 3.20GHz Nehalem W5580 system with 24GB of DDR3 RAM running at 1,333MHz to a two-CPU, 12-core 3.33GHz Westmere X5680 system with 24GB of DDR3 RAM running at 1,333MHz. Aside from the core count and the slight difference in clock speed, these are essentially the same chip, one generation apart. All tests were run from RAM disks to keep disk I/O from interfering with the raw CPU tests, and Hyper-Threading was enabled.
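For readers who want to try a similar ramp on their own hardware, here is a minimal Python sketch of the idea: launch N copies of a single-threaded command at once and time how long the whole batch takes. It is not the harness used for the numbers in this article. The function names `run_concurrent` and `sweep`, the placeholder workload, and the concurrency levels are all assumptions made for illustration.

```python
import subprocess
import sys
import time


def run_concurrent(cmd, copies):
    """Start `copies` instances of `cmd` at once and return the
    wall-clock seconds until the last one exits."""
    start = time.perf_counter()
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        for _ in range(copies)
    ]
    for proc in procs:
        proc.wait()
    return time.perf_counter() - start


def sweep(cmd, levels):
    """Run the workload at each concurrency level; return {level: seconds}."""
    return {n: run_concurrent(cmd, n) for n in levels}


if __name__ == "__main__":
    # Placeholder workload (an assumption): a short CPU-bound Python loop.
    # Swap in a real single-threaded job, such as an MP3 encode or a gzip
    # of a file held on a RAM disk.
    workload = [sys.executable, "-c", "sum(i * i for i in range(200000))"]
    for level, seconds in sweep(workload, [1, 2, 4]).items():
        print(f"{level:3d} processes: {seconds:.2f} s")
```

On a real test box, the levels would start at the number of physical cores and climb well past it, so that the point where the chip becomes oversubscribed shows up as a jump in batch runtime.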
The results are pretty much what you’d expect from a Nehalem CPU with two additional cores. At the lowest concurrency level of eight processes, the processors proved essentially equal, with a slight edge to the Westmere X5680 due to its slightly higher clock speed. In the LAME test, the Nehalem finished in 27 seconds to the Westmere’s 26 seconds.
The next iteration was 12 concurrent processes, and here the Westmere began to pull away, matching its eight-process runtime of 26 seconds. The eight-core Nehalem was oversubscribed at this point and turned in a 37-second time. After that, the Westmere ran away with the test, culminating in a runtime of 149 seconds on the 96-process test, where the Nehalem came in at 234 seconds. Basically, on a per-core basis, the Westmere isn’t much faster than the Nehalem, but there are simply more cores and it scales far better because of that.
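To put those LAME runtimes in perspective, the short Python sketch below turns them into speedup ratios. The runtimes are the ones quoted above; the dictionary layout and the `speedup` helper are just illustrative names.

```python
# LAME runtimes in seconds from the text: {processes: (Nehalem, Westmere)}
lame_seconds = {
    8: (27, 26),
    12: (37, 26),
    96: (234, 149),
}


def speedup(nehalem_s, westmere_s):
    """How many times faster the Westmere system finished the batch."""
    return nehalem_s / westmere_s


if __name__ == "__main__":
    for procs, (nehalem_s, westmere_s) in lame_seconds.items():
        print(f"{procs:2d} processes: Westmere {speedup(nehalem_s, westmere_s):.2f}x faster")

    # Rough expectation if performance scaled only with core count and clock:
    # 12 cores vs. 8 cores, 3.33GHz vs. 3.20GHz.
    print(f"cores x clock ratio: {(12 / 8) * (3.33 / 3.20):.2f}")
```

At 96 processes the measured ratio works out to about 1.57, very close to the roughly 1.56 you would predict from the extra cores and the clock bump alone, which fits the conclusion that the gain comes from core count rather than per-core speed.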
The other tests reflected the same results in slightly less spectacular fashion, except for the video conversion test, in which the Westmere finished the 96-process run a full 102 seconds faster than the Nehalem.
I ran the same test suite on a pair of 2.93GHz Westmere X5670s, and the results showed the same scaling benefit, though with slower times than the 3.33GHz X5680s because of the reduced clock speed. Even so, at a 1:1 process-to-core ratio, the X5670 was roughly on par with the 3.20GHz Nehalem W5580, probably due to the larger L3 cache.
All this points to the fact that if your workloads are single-threaded and single-process, then Westmere CPUs aren’t going to buy you very much over their older counterparts. However, if you run highly threaded applications or many iterations of single-threaded applications at once, Westmere will provide significant benefit. Of course, one of the biggest multi-threaded workloads is virtualization, and that’s where Westmere will likely find a very happy home. Any sufficiently threaded virtualization platform will make great use of the extra cores in Westmere, on the same socket and using the same RAM.
I also ran specific tests on the new AES-NI encryption instructions. The results were very impressive. Using the same OpenSSL build with and without the AES-NI patches, AES encryption test cases showed roughly a 400 percent performance increase. The test was simple: encrypt an 851MB file with the AES-256-CBC cipher. Without the AES-NI instructions in use, a Westmere X5670 CPU consistently completed this task in 13.5 seconds. When the AES-NI engine was used, the same task took only 3 seconds, about 4.5 times faster. That’s huge.
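The arithmetic behind those figures, plus a quick way to see whether a Linux machine advertises AES-NI at all, can be sketched in Python as follows. The file size and timings are the ones reported above. The helper names are made up for this example, and the check assumes a Linux system, where the kernel lists the `aes` flag in `/proc/cpuinfo` on CPUs with AES-NI.

```python
FILE_MB = 851        # size of the test file, from the text
SOFTWARE_S = 13.5    # AES-256-CBC without AES-NI, from the text
AES_NI_S = 3.0       # AES-256-CBC with AES-NI, from the text


def throughput_mb_s(size_mb, seconds):
    """Encryption throughput in megabytes per second."""
    return size_mb / seconds


def has_aes_ni(cpuinfo_text):
    """Return True if a 'flags' line in /proc/cpuinfo-style text lists 'aes'."""
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "flags" and "aes" in value.split():
            return True
    return False


if __name__ == "__main__":
    print(f"software AES: {throughput_mb_s(FILE_MB, SOFTWARE_S):.1f} MB/s")
    print(f"AES-NI:       {throughput_mb_s(FILE_MB, AES_NI_S):.1f} MB/s")
    print(f"speedup:      {SOFTWARE_S / AES_NI_S:.1f}x")
    try:
        with open("/proc/cpuinfo") as handle:
            print("AES-NI flag present:", has_aes_ni(handle.read()))
    except OSError:
        print("No /proc/cpuinfo here; skipping the CPU flag check.")
```

In throughput terms, that is a single core going from roughly 63MB per second to roughly 284MB per second on this one file.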
The X5600 family is large, with 12 different Westmere-EP CPUs offering plenty of options. The X5650, X5670, and X5680 CPUs all boast the full complement of six cores running at 2.66GHz, 2.93GHz, and 3.33GHz respectively at 95 watts. The lower-cost E5620, E5630, and E5640s run four cores at 2.26GHz, 2.4GHz, and 2.53GHz respectively at 80 watts.
Westmere-EP also provides 40-watt options in the L5630 and L5609, whereas the lowest-power Nehalem-EP ran at 60 watts. There are also a few oddities, such as the 95-watt X5657 that runs four cores at 2.93GHz, and the 130-watt X5677 that runs four cores at 3.46GHz. The last two are optimized for lightly threaded workloads that can make use of the higher clock speeds, but don’t need the additional cores.
If virtualization or highly threaded workloads are your game, the six-core models are definitely the way to go. If you’re looking for raw clock speed on single-threaded workloads, the four-core models will give you better bang for your CPU buck. For the power miser, or those wanting to cram lots of lower-power CPUs into a rack, the 40-watt chips may be right up your alley. Essentially, there seems to be a Westmere chip to fit just about every scenario.