Saturday, April 10, 2010

Compiling 64-bit R 2.10.1 with MKL in Linux

The rationale for compiling R using the Intel Math Kernel Library

Recently, there has been a surge in the use of Intel's Math Kernel Library (MKL; http://software.intel.com/en-us/intel-mkl/) among data analysis packages. MKL is a highly optimized set of linear algebra libraries that includes full Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) implementations, as well as fast Fourier transforms and vector math. I think the folk interpretation is that Intel engineers have inside knowledge on how to exploit fully the number crunching powers of Intel CPUs, thereby allowing them to produce a remarkably fast math library. REvolution Computing has developed a version of R that is linked against MKL with impressive speedups in many functions that rely on complex algebraic manipulation. Recently, Enthought Inc. has also begun to provide Python binaries linked against MKL, with similarly improved performance. And although it's not emphasized directly, Matlab links to MKL in most recent releases.

The good news for R users on Linux is that Intel provides a free license for MKL, assuming that it is used for personal, non-commercial purposes. I set out to compile R 2.10.1 from source on 64-bit Gentoo Linux, linking to the latest version of MKL (10.2.4.032). My major goal was to create a super-fast version of R to be used within the StatET plugin for Eclipse, my favorite IDE for R development. Given that the process was bumpy and took the better part of an afternoon, I thought I would post my experiences in hopes that they might be useful to others. The notes below should mostly apply to 32-bit Linux OSs, but I need 64-bit R to process some rather large psychophysiology datasets, so I'll assume you're running 64-bit Linux, too.

Getting ready

The R Installation and Administration guide is the best place to start when learning to compile R. It gives a good listing of prerequisites for installation, configuration options, suggestions for compilation, linking to BLAS and LAPACK libraries, and even good starting points for linking to MKL. I would recommend at least skimming this guide before you try to compile R. Pay particular attention to Appendix A, which details the programs and libraries that need to be present prior to compiling R.

I think Gentoo is an awesome Linux distribution: all packages are compiled from source and are optimized for your processor. Plus, the basic installation is fairly bare-bones and the package management system (emerge) is very smart. Because of Gentoo's preference for compiling packages from source, all of the required tools for compiling R (detailed in the R Installation guide) were already in place on my machine, including gcc (4.3.4), libiconv, and make. Thus, other than downloading MKL, I didn't have to install anything. If you don't have prerequisite packages installed on your Linux distribution, you should be able to track them down easily.

You'll need to get a license for MKL and download the latest version. Extract the archive, then install it using the install.sh script provided by Intel. Read the Install.txt file for details on the MKL installation and licensing process. In the instructions below, I'll assume that MKL has successfully been installed to: /opt/intel/mkl/10.2.4.032. By default, MKL installs to the /opt directory.

Configuring and compiling R with MKL

Download the R 2.10.1 source from CRAN. Extract the archive to a directory of your choice using tar xvzf R-2.10.1.tar.gz.

Before you run the configure script in the R-2.10.1 directory, you'll want to set up the environment variables to ensure that R is compiled with the best code and linking optimizations and that it is linked against MKL. I've adapted the commands below from the R Installation and Administration guide. I would suggest using a bash script to automate this (i.e., paste all of the commands together in a single .sh file to be executed using the source command), but you could also just type in the commands at a bash shell:

export FFLAGS="-march=core2 -O3"
export CFLAGS="-march=core2 -O3"
export CXXFLAGS="-march=core2 -O3"
export FCFLAGS="-march=core2 -O3"
These set the gcc compiler flags to compile for a particular architecture (here, Intel Core 2 processors) and to use the highest level of code optimization (O3, that's an "o" not a "zero"). Note that core2 is a supported option for -march as of gcc 4.3. In gcc 4.2, Core 2 processors were optimized using -march=nocona. If you're using a different processor, look here, or try -march=native, which should detect your setup. Some Linux programs won't compile correctly using -O3, which nominally provides the most optimized code, but R compiled perfectly on my box -- and using O3 may lead to noticeable performance enhancements over O2. So, I recommend that you use it.
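
As a rough sketch of the version rule above (not part of my original build), the snippet below picks the -march value for a Core 2 CPU from the compiler's version string. The function name pick_march is hypothetical, and the sketch assumes that gcc -dumpversion prints a dotted version such as 4.3.4 (or a bare major number on newer compilers).

```sh
# pick_march is a hypothetical helper, not part of gcc or R.
# It prints "core2" for gcc 4.3 or newer and "nocona" for older versions,
# following the rule described above for Core 2 processors.
pick_march() {
    version="$1"
    major="${version%%.*}"          # text before the first dot
    rest="${version#*.}"            # text after the first dot (unchanged if no dot)
    if [ "$rest" = "$version" ]; then
        minor=0                     # no minor component, e.g. "13"
    else
        minor="${rest%%.*}"
    fi
    if [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 3 ]; }; then
        echo "core2"
    else
        echo "nocona"
    fi
}

# Usage: build the flag string from the installed compiler, if there is one.
if command -v gcc >/dev/null 2>&1; then
    OPT_FLAGS="-march=$(pick_march "$(gcc -dumpversion)") -O3"
    echo "Suggested flags: ${OPT_FLAGS}"
fi
```

If you are not on a Core 2 processor, this helper does not apply; -march=native is the simpler route.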

MKL_LIB_PATH=/opt/intel/mkl/10.2.4.032/lib/em64t

export LD_LIBRARY_PATH=$MKL_LIB_PATH
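These lines define the location of the 64-bit MKL libraries (MKL_LIB_PATH) and tell the dynamic loader where to find the MKL shared libraries at run time (LD_LIBRARY_PATH). That matters both for the test programs that configure builds and runs and for R itself once it is installed.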

export LDFLAGS="-L${MKL_LIB_PATH},-Bdirect,--hash-style=both,-Wl,-O1"
This line instructs the linker to look in the MKL_LIB_PATH directory for relevant libraries throughout the compile process and it optimizes the way in which linked libraries are loaded, as discussed here.

export SHLIB_LDFLAGS="-lpthread"
export MAIN_LDFLAGS="-lpthread"
These lines are only relevant if you want to compile R as a shared library. In my case, I want to use R within Eclipse, which relies on the JRI package within rJava. If you want to run R embedded within another program, such as Eclipse, you will want to compile it as a shared library. Otherwise, it's probably better not to compile R as a shared library (see here for details). The SHLIB_LDFLAGS line above requests that the shared library be linked against the pthread library, which supports multithreading (useful for speeding up R through MKL). If you don't have this line but use the configuration below, the compilation will break.

MKL="-L${MKL_LIB_PATH} -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_lapack -lmkl_core -liomp5 -lpthread"
This specifies how to dynamically link MKL to R (i.e., use MKL as the BLAS for R). MKL has numerous linking options. I've adopted the recommendations provided in the R Installation and Administration guide. Intel provides a link advisor tool for MKL here. Interestingly, the link advisor gives a different result than the recommendation above, but I haven't tried compiling R with a different link to MKL.
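
Before running configure, it may be worth confirming that the libraries named on that link line are actually present. The sketch below is only an illustration: check_mkl_libs is a hypothetical helper, and it assumes that each -lNAME option corresponds to a file called libNAME.so in MKL_LIB_PATH (pthread is skipped because it is a system library).

```sh
# check_mkl_libs is a hypothetical helper. For each MKL-side -lNAME on the link
# line it looks for libNAME.so in the given directory and reports the ones it
# cannot find. The return status is the number of missing libraries (0 = all found).
check_mkl_libs() {
    dir="$1"
    missing=0
    for name in mkl_gf_lp64 mkl_gnu_thread mkl_lapack mkl_core iomp5; do
        if [ ! -e "${dir}/lib${name}.so" ]; then
            echo "missing: ${dir}/lib${name}.so"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Usage: check the directory used in this post, if it exists on this machine.
MKL_LIB_PATH="${MKL_LIB_PATH:-/opt/intel/mkl/10.2.4.032/lib/em64t}"
if [ -d "$MKL_LIB_PATH" ]; then
    check_mkl_libs "$MKL_LIB_PATH" && echo "all MKL link-line libraries found"
fi
```

A missing file here usually means the install path or MKL version differs from the one assumed in this post.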

./configure --enable-R-shlib --with-blas="$MKL" --with-lapack

make

make check
The configure line requests that R be compiled as a shared library (in my case, so that I can use it within Eclipse) and that it use MKL for the BLAS, as defined by the $MKL shell variable above. Note that the inclusion of --with-lapack indicates that the specified BLAS (MKL) also contains a LAPACK library.

make compiles the source and make check runs some basic tests of the compiled program to ensure that R is functioning properly. Note that the lapack.R test from make check will differ from the expected output and may be flagged as an error. At least on my machine, the differences result because the MKL-linked R finds a different, but valid, set of solutions to a system of equations, relative to R's internal LAPACK routines, so I'm not worried.

If you've come this far, all that's left is to type

make install
R will be installed in the /usr/local directory by default and the primary R library structure is located in /usr/local/lib64/R. You can now run R by typing: /usr/local/bin/R. Or, if /usr/local is in your PATH, just type R. On Gentoo, you'll want to type env-update && source /etc/profile for the R program to be accessible in your PATH.
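
One way to confirm that the installed R really picked up MKL is to look at the shared library's dependencies with ldd. This is a minimal sketch under two assumptions: that R was built with --enable-R-shlib and installed under the default /usr/local prefix (so libR.so sits in /usr/local/lib64/R/lib), and that MKL was linked dynamically as above. The helper name has_mkl is hypothetical.

```sh
# has_mkl is a hypothetical helper: it reads `ldd` output on standard input and
# succeeds if any line names an MKL shared library (libmkl_*).
has_mkl() {
    grep -q 'libmkl_'
}

# Usage: inspect the installed shared library, if it is where this post put it.
LIBR=/usr/local/lib64/R/lib/libR.so
if [ -e "$LIBR" ]; then
    if ldd "$LIBR" | has_mkl; then
        echo "libR.so is linked against MKL"
    else
        echo "no MKL libraries found in libR.so's dependencies"
    fi
fi
```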

How much does MKL improve R performance relative to the built-in BLAS/LAPACK?

After someone asked in a comment below, I ran a few quick tests to determine how much MKL sped up my particular installation of R. To do this, I compiled a version of R 2.10.1 using the default settings (just ./configure; make). I then ran an established set of R benchmarks (also used in REvolution's calculations) from here: http://r.research.att.com/benchmarks/. The benchmark script was run in a fresh R session each time, and the benchmarks were repeated 15 times for each R distribution. My computer is an Intel Core 2 Quad 9550 (2.83GHz) with 7GB RAM. The results are impressive and very similar to those reported by REvolution. (The means below represent the number of seconds required to run the full R-benchmark-25.R script.)

Default R:
- Mean=64.95s; SD=11.83s

R with MKL:
- Mean=11.84s; SD=0.13s

In other words, the MKL version was around 5.5 times faster than R using the built-in BLAS/LAPACK. Caveat: The speedups may have been due, in part, to the use of -O3 and -march flags, as well as the linker optimizations, but I bet that the vast majority is due to MKL.
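
For anyone who wants to repeat the comparison, here is a sketch of how the runs could be scripted. It is not the exact procedure I used. All three function names are hypothetical, the script assumes R-benchmark-25.R is in the current directory and prints a "Total time for all 15 tests" line ending in the number of seconds, and the SD is computed as a sample standard deviation (n - 1), which is an assumption.

```sh
# All function names here are hypothetical helpers for illustration.

# Run the benchmark script n times, each in a fresh R session, and print the
# total time in seconds from each run, one value per line.
run_benchmarks() {
    r_bin="$1"
    n="$2"
    i=1
    while [ "$i" -le "$n" ]; do
        "$r_bin" --vanilla --slave < R-benchmark-25.R |
            awk -F': ' '/Total time for all/ { print $NF }'
        i=$((i + 1))
    done
}

# Read one number per line and print the mean and sample standard deviation.
mean_sd() {
    awk '{ sum += $1; sumsq += $1 * $1; n++ }
         END {
             if (n == 0) exit 1
             mean = sum / n
             var = (n > 1) ? (sumsq - n * mean * mean) / (n - 1) : 0
             if (var < 0) var = 0    # guard against rounding error
             printf "Mean=%.2fs; SD=%.2fs\n", mean, sqrt(var)
         }'
}

# Ratio of two mean run times, e.g. default R versus MKL-linked R.
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

# Usage (not run here, since it needs R and the benchmark script):
#   run_benchmarks /usr/local/bin/R 15 | mean_sd
speedup 64.95 11.84    # ratio of the two means reported above
```

Running each R build through run_benchmarks and piping the result into mean_sd gives one summary line per build; speedup then turns the two means into a single ratio.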

----
Next up, I'll write a quick post on how to use your MKL-supercharged R installation within Eclipse. I hope that this guide proves useful and stimulates more people to try out MKL for R.

9 comments:

  1. How much faster is it compared to a non-MKL R?

  2. You misunderstood the BLAS interfaces. To use MKL, you do NOT have to rebuild R. You simply replace your reference blas, atlas, ... libraries. We have been doing that transparently on Debian for over five years. Plus, these MKLs were actually included with REvolution R in Ubuntu 9.10.

  3. Hi Dirk, Thanks for pointing out that Debian uses a shared BLAS and Ubuntu packages are now using MKL. I wasn't aware of that, but it's great that Intel has been flexible about distributing these libraries along with R. You're right that one can easily switch out the reference BLAS/LAPACK libraries, but only if R is compiled with BLAS as a shared library (using the --enable-BLAS-shlib configure option). Here, I wasn't interested in that route (I'm not planning to switch away from MKL), but I can see its potential advantages.

  4. Hi Michael,

    Thanks for the great instructions! I was able to get a similar speedup (5-6 times) on SUSE 11.0 with dual core quad, Intel(R) Xeon(R) CPU E5310 @ 1.60GHz.
    Interestingly, the patched version, R-patched_2010-08-17, runs ~10% faster than R-2.11.1.

    /home/vmorozov/tmp/Rbench.2.10.1:Total time for all 15 tests_________________________ (sec): 113.439333333333
    /home/vmorozov/tmp/Rbench.2.11.1:Total time for all 15 tests_________________________ (sec): 22.697
    /home/vmorozov/tmp/Rbench.2.11.1.patch:Total time for all 15 tests_________________________ (sec): 20.435
    /home/vmorozov/tmp/Rbench.2.11.1.patch.NT6:Total time for all 15 tests_________________________ (sec): 20.5283333333333
    /home/vmorozov/tmp/Rbench.2.11.1.patch.NT8:Total time for all 15 tests_________________________ (sec): 21.2033333333333

    "NT" stands for 6 and 8 threads via the "OMP_NUM_THREADS" variable. Apparently it doesn't have an effect. That I don't understand.

    Vlad

  5. Just a comment on my previous post.
    'MKL_NUM_THREADS' and 'OMP_NUM_THREADS' both work. I should have checked with a small number of threads, and looked at the specific tests that are supposed to gain from BLAS/LAPACK optimization. On such a task, the speedup is proportional to the number of threads. The gain plateaus with more than 4 threads. And the patched version is ~40% faster than the official release...
    [rstats:R-2.10.1] grep ^Linear ~/tmp/Rbench.2.1*
    /home/vmorozov/tmp/Rbench.2.10.1:Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 16.6253333333333
    /home/vmorozov/tmp/Rbench.2.11.1:Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 0.845666666666664
    /home/vmorozov/tmp/Rbench.2.11.1.patch:Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 0.495666666666665
    /home/vmorozov/tmp/Rbench.2.11.1.patch.NT1:Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 2.12966666666667
    /home/vmorozov/tmp/Rbench.2.11.1.patch.NT2:Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 1.13433333333334
    /home/vmorozov/tmp/Rbench.2.11.1.patch.NT4:Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 0.708333333333333
    /home/vmorozov/tmp/Rbench.2.11.1.patch.NT4OMP:Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 0.690000000000001

    Vlad

  6. Hey have you done this with 2.13?
    Thanks!

  7. Great post. I'll give this a try.

    You state: "Recently, Enthought Inc. has also begun to provide Python binaries linked against MKL, with similarly improved performance."

    I have the epd bundle and I've experimented with their numpy linked against MKL. Surprisingly, my version of numpy which is linked against my locally compiled and tuned ATLAS libraries outperforms the epd numpy by approximately 15%. By taking the time to tweak ATLAS for your environment you can get pretty good performance -- perhaps as good or better than that provided by precompiled MKL.
