In the previous article in this series we looked at the performance of different compilers on Intel’s x86/x86-64 architecture. In this article we will look at the performance of the two available compilers on Sun’s SPARC Niagara platform.
Only two compilers were used in this part of the test, and versions of both of these were included in the previous roundup. They are:
- GNU GCC (4.2.3)
- Sun’s SunCC (12.0, part of the Sun Studio development toolkit)
The server used for testing was a Sun T2000 with a 1.0GHz SPARC T1 processor with 8 cores and 32 threads. The operating system was Solaris 10.
The test application fits curves to arbitrary sampled data. It has 4 innermost loops for fitting 4 curve parameters, and it implements its own partial caching using a technique described in a previous article on this site. The code also implements most of the optimizations described in that article, in order to improve vectorization and the speed of the application.
More information on the application used for testing is available in the previous article that covers compiler performance on the x86 platform.
Selecting the correct compiler switches is important for achieving the best performance from your code. They can also break things and make the code produce results that are quite clearly wrong. The switches used in the tests were the ones found to produce the fastest code without producing wildly inaccurate results (some numeric instability is acceptable, as long as it is limited).
The compiler switches used are separated into 2 parts for each compiler: the common part and the machine-specific part. The common part is the same for the given compiler on all platforms, and the machine-specific part varies according to the processor. The switches are heavily based on what the compiler developers deemed a good, fully optimized combination (based on the compiler’s -fast shortcut switch or equivalent references in the documentation). Some options, however, have been added, removed or changed because they produced poor results, either in accuracy or in speed.
|GCC||Common||-O3 -fno-strict-aliasing -ffast-math -foptimize-register-move -frerun-loop-opt -fexpensive-optimizations -fprefetch-loop-arrays -fomit-frame-pointer -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=5 -fpic -Wall|
|GCC||SPARC T1||-mcpu=niagara -mtune=niagara|
|SunCC||Common||-xO5 -xalias_level=simple -fns=yes -fsimple=2 -xbuiltin=%all -xdepend=yes -xlibmil -xlibmopt -xunroll=4 -xprefetch=auto -xprefetch_level=1 -xipo=2 -xldscope=global -Yl,/usr/bin -Qpath /usr/bin -features=extensions|
|SunCC||SPARC T1||-xarch=v9 -xchip=ultraT1 -xtarget=native -m64 -xcache=native|
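The split into a common part and a machine-specific part can be sketched as a small helper that joins the two into one command line. This is only an illustration: the executable names `gcc` and `cc`, the source file name `fit.c` and the `build_command` helper are assumptions made for the example and do not come from the test setup. The switches themselves are copied from the table.

```python
import shlex

# Switches copied from the table above, one string per part.
GCC_COMMON = (
    "-O3 -fno-strict-aliasing -ffast-math -foptimize-register-move "
    "-frerun-loop-opt -fexpensive-optimizations -fprefetch-loop-arrays "
    "-fomit-frame-pointer -funroll-loops -ftree-vectorize "
    "-ftree-vectorizer-verbose=5 -fpic -Wall"
)
GCC_SPARC_T1 = "-mcpu=niagara -mtune=niagara"

SUNCC_COMMON = (
    "-xO5 -xalias_level=simple -fns=yes -fsimple=2 -xbuiltin=%all "
    "-xdepend=yes -xlibmil -xlibmopt -xunroll=4 -xprefetch=auto "
    "-xprefetch_level=1 -xipo=2 -xldscope=global -Yl,/usr/bin "
    "-Qpath /usr/bin -features=extensions"
)
SUNCC_SPARC_T1 = "-xarch=v9 -xchip=ultraT1 -xtarget=native -m64 -xcache=native"


def build_command(compiler, common, machine, source):
    """Join the common and machine-specific switches into an argument list.

    `compiler` and `source` are hypothetical placeholders for this sketch.
    """
    return [compiler] + shlex.split(common) + shlex.split(machine) + ["-c", source]


gcc_cmd = build_command("gcc", GCC_COMMON, GCC_SPARC_T1, "fit.c")
suncc_cmd = build_command("cc", SUNCC_COMMON, SUNCC_SPARC_T1, "fit.c")

print(" ".join(gcc_cmd))
print(" ".join(suncc_cmd))
```

Keeping the two parts separate means that moving to another processor only requires swapping the machine-specific string, while the common part stays untouched.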
Here are the normalized scores, relative to GCC (GCC = 100%, more is better).
|SunCC||11.74 i/s||12.79%||(with -xvector=lib)|
|SunCC||92.98 i/s||101.30%||(without -xvector)|
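As a quick consistency check on the normalization, the GCC rate implied by each SunCC row can be backed out from its raw rate and percentage. The sketch below uses only the figures from the table; the function and variable names are made up for the example. Both rows imply a GCC baseline of about 91.8 i/s, so the two percentages agree with each other.

```python
# Figures copied from the results table: (rate in i/s, score as % of GCC).
suncc_runs = {
    "with -xvector=lib": (11.74, 12.79),
    "without -xvector": (92.98, 101.30),
}


def implied_baseline(rate, percent_of_gcc):
    """Back out the GCC rate implied by a SunCC rate and its normalized score."""
    return rate / (percent_of_gcc / 100.0)


def normalized(rate, baseline):
    """Score relative to the baseline, where baseline = 100%."""
    return 100.0 * rate / baseline


baselines = {
    label: implied_baseline(rate, pct) for label, (rate, pct) in suncc_runs.items()
}

for label, baseline in baselines.items():
    print(f"{label}: implied GCC rate of about {baseline:.2f} i/s")
```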
Two things became evident during the test. At first we tried using the -xvector=lib parameter on SunCC in order to gain any available advantage from the vectorization features of the CPU. This, however, produced results that were nearly an order of magnitude worse than GCC’s. It wasn’t immediately clear why SunCC’s performance was that much lower, but the cause was quickly pinned down to the -xvector compiler flag. Removing it put the performance of the produced code back on track, and it came out 1.3% ahead of GCC. Presumably, the massive slowdown caused by the -xvector=lib flag is down to a compiler or library bug.
The difference here is similar to the difference between GCC and SunCC on the x86 platform, only on SPARC T1 the results are in favour of SunCC.
Unlike on the Intel platform where using Intel’s compiler produced a massive performance boost, on Sun’s platform the advantage of using Sun’s compiler is fairly minimal.
The other thing the vigilant among the readers may have noticed is that Sun’s SPARC T1 processor with the best compiler available is a staggering 67 times slower, approximately, than the long-obsolete Pentium III used in the x86 compiler roundup (93 i/s T1/SunCC vs. 6231 i/s P3/ICC). In addition, those familiar with the T1 processor will know that even though the T1 has 8 cores, it has only one FPU shared between them. This means that, unlike on the multi-core Core2 CPUs, where performance in this test would scale nearly linearly with multiple parallel processes (only single-thread tests were used in both this and the previous article), on the T1 the advantage of multiple cores/threads would be minimal. But then again, Sun openly admits that the T1 is not a processor to use for heavy floating point operations, which is largely what this test does.
While this poses some interesting questions about the performance of the T1, the purpose of this article was to analyze performance differences between different compilers on the same platform, rather than relative performance differences between different platforms. A separate article focusing on performance differences between the x86 Core2 and SPARC T1 platforms is already in the making to cover this aspect.
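The size of that gap is easy to verify from the two quoted rates. The sketch below uses the unrounded T1 figure from the results table and the Pentium III figure quoted above; the variable names are chosen for the example. It gives a ratio of about 67, which is another way of saying that the T1 delivers roughly 1.5% of the Pentium III’s single-thread throughput in this test.

```python
t1_suncc = 92.98  # i/s, SPARC T1 with SunCC (results table in this article)
p3_icc = 6231.0   # i/s, Pentium III with ICC (figure quoted from the x86 roundup)

slowdown = p3_icc / t1_suncc        # how many times slower the T1 is
share = 100.0 * t1_suncc / p3_icc   # T1 throughput as a percentage of the P3's

print(f"T1 is about {slowdown:.0f}x slower than the P3")
print(f"T1 delivers about {share:.1f}% of the P3's throughput")
```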