STX B+ Tree Speed Test Measurements on Raspberry Pi (Model B)
Posted on 2013-05-06 09:48 by Timo Bingmann at Permlink with 1 Comments. Tags: #stx-btree #frontpage
The Raspberry Pi is maybe one of the most hyped embedded system projects in the last year, and I also got myself one for experiments. People are doing amazing things with this Linux-in-a-box SoC. Doubtlessly, the popularity is due to the standardized platform and a large community forming around it, which makes fixing the many small problems with Linux on ARM systems feasible. For me, the Raspberry Pi is an alternative architecture on which to test my algorithms and libraries, which exhibits somewhat different characteristics than the highly optimized desktop CPUs.
So I decided to run my STX B+ Tree speed test on the Raspberry Pi Model B, because most people use the SoC for multimedia purposes and little other memory performance data is available. The B+ tree speed test gives insight into the platform's overall memory and processing performance, and thus yields a better assessment of how useful the system is for general purpose applications (unlike multimedia decoding). Most benchmarks focus solely on floating point or integer arithmetic, which alone are very poor indicators for overall system performance. The Raspberry Pi forums say it has performance similar to a "Pentium 2 with 300 MHz", but that is for arithmetic.
The Model B sports a 700 MHz (not over-clocked) ARM1176JZF-S CPU with 512 MiB RAM, and a peak power rating of only 3.5 Watts! Apparently, the CPU has 64 KiB L2 cache, however, this cache is shared with the integrated GPU and for me it was unclear how to exclusively allocate it to CPU. Nevertheless, the B+ tree test results show that the cache works as expected. The B+ tree speed test was compiled on the Debian wheezy image from around 2013-03 with g++ 4.6.3-14+rpi1
and CXXFLAGS=-O3 -march=armv6j -mfpu=vfp -mfloat-abi=hard
. The special flag for ARM1176JZF was a bit slower! Due to the limited memory, the speed test was run with only at most 8 Mi inserted items.
The complete speed test data is available for download: raspi-btree-results.tar.bz2 (407 KiB), together with the following result PDF:
I have selected the experiment with largest "data movement" (multimap with a complete insert/find/delete cycle) for the following discussion to gain insight into the platform's memory architecture and speed. The plot above is compared to the results from an Intel i7 920 (2.67 GHz, 256 KiB L2 cache, 8 MiB L3 cache, and 12 GiB of DDR3-800 RAM), which are shown in absolute values at the bottom of this post and are included in the STX B+ tree tarball. The following plot shows the ratio of the speed test results of Raspberry Pi over the Intel i7, individually for each container.
Obviously, the results differ both quantitatively and qualitatively. Of course, the Intel i7 architecture has a lot more sophisticated circuitry built into it than the ARM RISC system in the Raspberry Pi, so this speed comparison is quantitatively pretty unfair: the ARM is about 7-25 times slower than the Intel i7, which is a lot higher than the pure 700 MHz / 2.67 GHz ratio may suggest.
However, if one takes the price ratio of just the ARM CPU (lets assume $19 of the $35 are for the CPU) over the plain Intel i7 CPU package (currently about $380) into account, then the CPU price ratio of 20 is in range of the speed ratio. If one considers the cost of a whole Intel i7-based system over the $50 of Raspberry Pi with case and SDHC card, then the Raspberry wins by a long shot. The same is much more pronounced for energy cost: where an Intel i7 desktop can easily draw 180 Watts, the Raspberry Pi is happy with just 3.5 Watts.
With these ratios in mind, the slowdown is not really that bad, and it is worth-while to look at the qualitative differences. The "bump" in the middle of the ratio plot is obviously due to the difference in L2/L3 cache: the Intel i7 has 128 times more cache. This cache boosts random access speeds on Intel i7 for about seven to eight powers to two, which is most of the plot. The fast random accesses are most notable for the hash table's performance, where the Raspberry Pi lacks far behind for small tables. For larger tables, the slowdown goes down, because Intel i7's cache overflows and RAM comes the bottleneck.
If one ignores the bump, it is still remarkable that generally speaking the container plots lines switch their order at the left and right ends: the B+ tree with fanout <200>
begins with smallest slowdown on Raspberry Pi, but ends with the highest. Likewise, the std::map
red-black tree has highest slowdown of the trees for small item counts, but best slowdown for high item counts. This indicates that the performance penalty of random accesses over sequential ones is higher on more sophisticated circuitry like the Intel i7. Thus on slower platforms like the ARM, the many algorithmic tricks depending on random access are more worth-while.
Summarizing the results: the Raspberry Pi's architecture shows a general slowdown of 8-10 for very small and very large datasets. For medium range datasets the lack of a larger CPU cache becomes very notable, bringing the slowdown up to 16, and yielding exceptionally high slowdowns up to 25 for small hash tables. Thus when programming performance-oriented algorithms for these SoC CPUs, one must reconsider their cache hierarchy and adapt algorithms to very small cache sizes.
F'n cool. Amazing work. Bookmarked!
Drinking a great beer and reading through your site. Gonna smoke a big cigar now, and think about your plots.