Tuesday, May 4, 2010

Developing a user-friendly regular expression function (easyGregexpr)

In the past few months, I've developed a set of functions for automating model estimation and interpretation using Mplus, an outstanding latent variable modeling program that has unparalleled flexibility for complex models (e.g., factor mixture models). I recently rolled these functions into an R package called MplusAutomation. Because the package focuses on extracting various parameters from text output files, I've learned a lot about regular expressions, particularly Perl-compatible regular expressions (PCRE). R provides a handful of useful regular expression routines that are Perl-compatible (by passing perl=TRUE), and I've made frequent use of grep, regexpr, and gregexpr.

The problem with regexpr and gregexpr, in particular, is that their output is wonky and does not lend itself to easy string manipulations. The rest of the post will focus on gregexpr, which is identical to regexpr, except that it returns all matches for a regular expression, whereas regexpr returns only the first. So, if you're searching for all instances of the letter "a" in the line "abcacdabb", regexpr would only match the first a, whereas gregexpr would find all three a's.
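To see the difference concretely, here is a minimal example at the R prompt:

```r
x <- "abcacdabb"
regexpr("a", x)        # only the first match: position 1
gregexpr("a", x)[[1]]  # all matches: positions 1, 4, 7
```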

Let's take a simple example. We want R to extract all HTML tags from a text file read into a character vector using the scan function. So that it's easy to follow, I've just defined a character vector with a simple HTML example.

> exampleText <- c("<html>< head>", "<title>This is a title.</title>", "</head>", "<body>", "<h1>This is an example header.</h1><p>And here is some basic text.</p>", "A line without any tags.", "</body>", "</html>")

> exampleText
[1] "<html>< head>"
[2] "<title>This is a title.</title>"
[3] "</head>"
[4] "<body>"
[5] "<h1>This is an example header.</h1><p>And here is some basic text.</p>"
[6] "A line without any tags."
[7] "</body>"
[8] "</html>"

Our goal is to locate all of the opening and closing HTML tags in this file with the intention of processing them further in some way. In a real-world example, we might want to compute quantities based on certain numbers extracted from text or to replace certain strings after making some changes. Here is the output from gregexpr using a simple regular expression that matches all HTML tags in the source above.
> gregexpr("<\\s*/*\\s*\\w+\\s*>", exampleText, perl=TRUE)
[[1]]
[1] 1 7
attr(,"match.length")
[1] 6 7

[[2]]
[1] 1 24
attr(,"match.length")
[1] 7 8

[[3]]
[1] 1
attr(,"match.length")
[1] 7

[[4]]
[1] 1
attr(,"match.length")
[1] 6

[[5]]
[1] 1 31 36 67
attr(,"match.length")
[1] 4 5 3 4

[[6]]
[1] -1
attr(,"match.length")
[1] -1

[[7]]
[1] 1
attr(,"match.length")
[1] 7

[[8]]
[1] 1
attr(,"match.length")
[1] 7

As can be seen above, gregexpr returns a list where each element in the character vector is represented as a list element and the starting positions for each match on a line are returned as numeric vectors within the list elements. The length of the match (i.e., the number of characters) is stored as an attribute "match.length".

There are a few things that irk me about the output of gregexpr. First, the matched string itself is not returned (one would need to use substr to obtain it). If nothing else, this makes it very difficult to know whether you've written a regular expression correctly (i.e., debugging). Second, storing vectors within lists makes it harder to extract the lines and positions of matches than a simple data.frame would. I suppose one could use some combination of lapply and sapply to extract values of interest, but it seems unintuitive to me. Third, if one is interested in the matched string, the substr function expects a start and stop position within the string, but gregexpr returns a match length, so one must resort to expressions like

result <- gregexpr("<\\s*/*\\s*\\w+\\s*>", exampleText, perl=TRUE)

matchedString <- substr(exampleText[1], result[[1]][2], result[[1]][2] + attr(result[[1]], "match.length")[2] - 1)

Now that is some painful code!
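For completeness, here is the sort of lapply workaround I have in mind for pulling out every matched string. It works (using the exampleText and pattern from above), but I don't find it intuitive:

```r
result <- gregexpr("<\\s*/*\\s*\\w+\\s*>", exampleText, perl=TRUE)
unlist(lapply(seq_along(result), function(i) {
  starts <- result[[i]]
  if (starts[1] == -1) return(character(0))  # no match on this element
  # substring is vectorized over start/stop, unlike substr
  substring(exampleText[i], starts, starts + attr(starts, "match.length") - 1)
}))
```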

My goal was to develop a simple wrapper for gregexpr that would return a data.frame with the start and stop positions of each match, as well as the matched string itself. The wrapper is useful for parsing character vectors (although it would be easy to extend it to lists or data.frames).

Here's the code:
easyGregexpr <- function(pattern, charvector, ...) {
  require(plyr)

  if (storage.mode(charvector) != "character") stop("easyGregexpr expects charvector to be a character vector.")

  #identify all matches
  regexpMatches <- gregexpr(pattern, charvector, ...)

  convertMatches <- c()
  for (i in seq_along(regexpMatches)) {
    thisLine <- regexpMatches[[i]]
    #only append if there is at least one match on this line
    if (thisLine[1] != -1) {
      convertMatches <- rbind(convertMatches,
          data.frame(element=i, start=thisLine, end=thisLine + attr(thisLine, "match.length") - 1))
    }
  }

  #if no matches exist, return NULL (otherwise, adply will break)
  if (is.null(convertMatches)) return(NULL)

  #we now have a data.frame with the element, start position, and end position of every match
  #add the matched string to the data.frame, using adply to iterate over rows and apply substr
  convertMatches <- adply(convertMatches, 1, function(row) {
    row$match <- substr(charvector[row$element], row$start, row$end)
    return(as.data.frame(row))
  })

  #need to convert from factor to character because adply uses stringsAsFactors=TRUE even when overridden
  convertMatches$match <- as.character(convertMatches$match)
  return(convertMatches)
}

Any option for gregexpr (e.g., perl=TRUE) can be passed identically to easyGregexpr because of the ... parameter. The code relies on Hadley Wickham's useful plyr package, although I wish that ddply would accept an argument of 1 for .variables to allow processing by row (I've seen workarounds like the adply call above or using transform, but these seem less intuitive). I also know that the function's memory management isn't ideal because of the repeated rbind calls to an initially empty variable. That said, I developed a version that preallocated the convertMatches data.frame, starting with the call numMatches <- sum(sapply(regexpMatches, function(x) sum(x > 0))), and it was no faster in my profiling runs while being less readable.
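For reference, the preallocated variant looked roughly like the sketch below (not the exact code I profiled; the names mirror those in easyGregexpr, and regexpMatches is the gregexpr result from inside the function):

```r
#sketch: preallocate the data.frame, then fill it one block of rows at a time
numMatches <- sum(sapply(regexpMatches, function(x) sum(x > 0)))
convertMatches <- data.frame(element = integer(numMatches),
                             start   = integer(numMatches),
                             end     = integer(numMatches))
nextRow <- 1
for (i in seq_along(regexpMatches)) {
  thisLine <- regexpMatches[[i]]
  if (thisLine[1] == -1) next
  n <- length(thisLine)
  convertMatches[nextRow:(nextRow + n - 1), ] <-
    list(element = rep(i, n),
         start   = as.integer(thisLine),
         end     = as.integer(thisLine) + attr(thisLine, "match.length") - 1)
  nextRow <- nextRow + n
}
```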

When we use the easyGregexpr function to parse the exampleText above, here is the output:


> easyGregexpr("<\\s*/*\\s*\\w+\\s*>", exampleText, perl=TRUE)
   element start end    match
1        1     1   6   <html>
2        1     7  13  < head>
3        2     1   7  <title>
4        2    24  31 </title>
5        3     1   7  </head>
6        4     1   6   <body>
7        5     1   4     <h1>
8        5    31  35    </h1>
9        5    36  38      <p>
10       5    67  70     </p>
11       7     1   7  </body>
12       8     1   7  </html>


A friendly, simple, clean data.frame with the elements, positions, and strings matched! I hope that this function proves useful to others. Maybe in future iterations of R, the built-in functions will provide more useful output. For now, I find myself using this wrapper frequently for parsing text.
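As a quick illustration of working with this output (assuming the easyGregexpr and exampleText definitions above), the usual data.frame tools apply directly:

```r
tagMatches <- easyGregexpr("<\\s*/*\\s*\\w+\\s*>", exampleText, perl=TRUE)
table(tagMatches$element)  #how many tags per element
tagMatches$match[grepl("^<\\s*/", tagMatches$match, perl=TRUE)]  #closing tags only
```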

Monday, April 12, 2010

Using MKL-Linked R in Eclipse

Setting up Eclipse to use MKL-Linked R

In my previous post, I showed how to compile R 2.10.1 using Intel's Math Kernel Library for the BLAS/LAPACK interface. Even though it takes a bit of time to setup, I think the noticeably improved calculation speed justifies the effort. Although I'm happy to use R from the command line for basic stuff, I prefer to develop my code in Eclipse and there is one configuration step needed for Eclipse to talk properly to MKL-linked R (assuming you followed the instructions in the last post).

If you haven't used Eclipse before, I would strongly recommend it as an Integrated Development Environment (IDE) for R. It's free, has a large user base, and is regularly updated. The editor is very flexible and allows for multiple tabs and easy arrangement of several windows. The StatET plugin links R with Eclipse through the rJava package, thereby allowing the user to browse objects in the R environment using the object browser. You can also get auto-completion suggestions for functions in the current environment by pressing Ctrl+Space. See Jeremy Anglim's post for some useful links about StatET and Eclipse. Before proceeding, if you haven't done so already, install the rJava package in R using install.packages("rJava").

Because I compiled R as a shared library linked against Intel's MKL, Eclipse struggles to locate the MKL libraries (and the shared R library). To fix this problem, one needs to set the LD_LIBRARY_PATH environment variable to include the MKL library directory, as well as the R shared library directory. By default, Java will not look for libraries in these directories, leading to errors such as these (assuming you've enabled debugging as described on the StatET site):

java.lang.UnsatisfiedLinkError: /usr/local/lib64/R/library/rJava/jri/libjri.so: libmkl_gf_lp64.so: cannot open shared object file: No such file or directory

java.lang.UnsatisfiedLinkError: /usr/local/lib64/R/library/rJava/jri/libjri.so: libR.so: cannot open shared object file: No such file or directory

Although some documentation suggested that I set the java.library.path variable in the VM arguments in the JRE tab of the Eclipse run configuration, this is problematic because that setting only helps to load libraries directly requested by the application, not libraries that reference other libraries (as is the case for MKL). Anyhow, the trick is to set the LD_LIBRARY_PATH environment variable in the Environment tab of the Eclipse run configuration.


If you've installed things to the same place as me, in the Environment tab, you'll want to click "New" to add a new environment variable, call it LD_LIBRARY_PATH, and give it a value of /opt/intel/mkl/10.2.4.032/lib/em64t:/usr/local/lib64/R/lib (note that LD_LIBRARY_PATH entries are colon-separated on Linux).
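If you want to verify things outside Eclipse first, the equivalent shell setting (paths assume the install locations used in these posts) is:

```shell
# LD_LIBRARY_PATH entries are colon-separated on Linux
export LD_LIBRARY_PATH=/opt/intel/mkl/10.2.4.032/lib/em64t:/usr/local/lib64/R/lib
echo "$LD_LIBRARY_PATH"
```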

The only other thing to be done is to verify that you've set up an R environment that points to the MKL-based installation. R environments are set up within the StatET preferences: Windows > Preferences > StatET > Run/Debug > R Environments. In my case, I installed R to /usr/local/lib64/R, so I specified that as the target and named the R installation R 2.10.1 x64 MKL. Then, under the R Config tab in your R run configuration (the same one as above), make sure you select the environment you just created. If you haven't set up StatET for Eclipse before, I would strongly encourage you to read Longhow Lam's guide, which provides more details (see especially the Configuring R section).

That's basically it! The major tweak is to tell Java where to find linked libraries using the LD_LIBRARY_PATH environment variable. Happy coding!

Saturday, April 10, 2010

Compiling 64-bit R 2.10.1 with MKL in Linux

The rationale for compiling R using the Intel Math Kernel Library

Recently, there has been a surge in the use of Intel's Math Kernel Library (MKL; http://software.intel.com/en-us/intel-mkl/) among data analysis packages. MKL is a highly optimized set of linear algebra libraries that includes full Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) implementations, as well as fast Fourier transforms and vector math. I think the folk interpretation is that Intel engineers have inside knowledge on how to exploit fully the number crunching powers of Intel CPUs, thereby allowing them to produce a remarkably fast math library. REvolution Computing has developed a version of R that is linked against MKL with impressive speedups in many functions that rely on complex algebraic manipulation. Recently, Enthought Inc. has also begun to provide Python binaries linked against MKL, with similarly improved performance. And although it's not emphasized directly, Matlab links to MKL in most recent releases.

The good news for R users on Linux is that Intel provides a free license for MKL, assuming that it is used for personal, non-commercial purposes. I set out to compile R 2.10.1 from source on 64-bit Gentoo Linux, linking to the latest version of MKL (10.2.4.032). My major goal was to create a super-fast version of R to be used within the StatET plugin for Eclipse, my favorite IDE for R development. Given that the process was bumpy and took the better part of an afternoon, I thought I would post my experiences in hopes that they might be useful to others. The notes below should mostly apply to 32-bit Linux OSs, but I need 64-bit R to process some rather large psychophysiology datasets, so I'll assume you're running 64-bit Linux, too.

Getting ready

The R Installation and Administration guide is the best place to start when learning to compile R. It gives a good listing of prerequisites for installation, configuration options, suggestions for compilation, linking to BLAS and LAPACK libraries, and even good starting points for linking to MKL. I would recommend at least skimming this guide before you try to compile R. Pay particular attention to Appendix A, which details the programs and libraries that need to be present prior to compiling R.

I think Gentoo is an awesome Linux distribution: all packages are compiled from source and are optimized for your processor. Plus, the basic installation is fairly bare-bones and the package management system (emerge) is very smart. Because of Gentoo's preference for compiling packages from source, all of the required tools for compiling R (detailed in the R Installation guide) were already in place on my machine, including gcc (4.3.4), libiconv, and make. Thus, other than downloading MKL, I didn't have to install anything. If you don't have prerequisite packages installed on your Linux distribution, you should be able to track them down easily.

You'll need to get a license for MKL and download the latest version. Extract the archive, then install it using the install.sh script provided by Intel. Read the Install.txt file for details on the MKL installation and licensing process. In the instructions below, I'll assume that MKL has successfully been installed to: /opt/intel/mkl/10.2.4.032. By default, MKL installs to the /opt directory.

Configuring and compiling R with MKL

Download the R 2.10.1 source from CRAN. Extract the archive to a directory of your choice using tar xvzf R-2.10.1.tar.gz.

Before you run the configure script in the R-2.10.1 directory, you'll want to set up the environment variables to ensure that R is compiled with the best code and linking optimizations and that it is linked against MKL. I've adapted the commands below from the R Installation and Administration guide. I would suggest using a bash script to automate this (i.e., paste all of the commands into a single .sh file to be executed using the source command), but you could also just type the commands at a bash shell:

export FFLAGS="-march=core2 -O3"
export CFLAGS="-march=core2 -O3"
export CXXFLAGS="-march=core2 -O3"
export FCFLAGS="-march=core2 -O3"
These set the gcc compiler flags to compile for a particular architecture (here, Intel Core 2 processors) and to use the highest level of code optimization (-O3; that's an "O", not a zero). Note that core2 is a supported option for -march as of gcc 4.3; in gcc 4.2, Core 2 processors were optimized using -march=nocona. If you're using a different processor, look here, or try -march=native, which should detect your setup. Some Linux programs won't compile correctly using -O3, which nominally provides the most optimized code, but R compiled perfectly on my box -- and -O3 may lead to noticeable performance enhancements over -O2. So, I recommend that you use it.

MKL_LIB_PATH=/opt/intel/mkl/10.2.4.032/lib/em64t

export LD_LIBRARY_PATH=$MKL_LIB_PATH
These lines define the location of the 64-bit MKL libraries (MKL_LIB_PATH) and tell the gcc linker where to look for the MKL libraries when compiling R (LD_LIBRARY_PATH).

export LDFLAGS="-L${MKL_LIB_PATH} -Wl,-Bdirect,--hash-style=both,-O1"
This line instructs the linker to look in the MKL_LIB_PATH directory for relevant libraries throughout the compile process and it optimizes the way in which linked libraries are loaded, as discussed here.

export SHLIB_LDFLAGS="-lpthread"
export MAIN_LDFLAGS="-lpthread"
These lines are only relevant if you want to compile R as a shared library. In my case, I want to use R within Eclipse, which relies on the JRI package within rJava; if you want to run R embedded within another program, such as Eclipse, you will need to compile it as a shared library, but otherwise it's probably better not to (see here for details). The lines above request that the shared library and the main binary be linked against the pthread library, which supports multithreading (useful for speeding up R through MKL). If you omit these lines but use the configuration below, the compilation will break.

MKL="-L${MKL_LIB_PATH} -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_lapack -lmkl_core -liomp5 -lpthread"
This specifies how to dynamically link MKL to R (i.e., use MKL as the BLAS for R). MKL has numerous linking options. I've adopted the recommendations provided in the R Installation and Administration guide. Intel provides a link advisor tool for MKL here. Interestingly, the link advisor gives a different result than the recommendation above, but I haven't tried compiling R with a different link to MKL.

./configure --enable-R-shlib --with-blas="$MKL" --with-lapack

make

make check
The configure line requests that R be compiled as a shared library (in my case, so that I can use it within Eclipse) and that it use MKL for the BLAS, as defined by the $MKL environment variable above. Note that the inclusion of --with-lapack indicates that the specified BLAS (MKL) also contains a LAPACK library.

make compiles the source and make check runs some basic tests of the compiled program to ensure that R is functioning properly. Note that the lapack.R test from make check will differ from the expected output and may be flagged as an error. At least on my machine, the differences result because the MKL-linked R finds a different, but valid, set of solutions to a system of equations, relative to R's internal LAPACK routines, so I'm not worried.

If you've come this far, all that's left is to type

make install
R will be installed in the /usr/local directory by default and the primary R library structure is located in /usr/local/lib64/R. You can now run R by typing: /usr/local/bin/R. Or, if /usr/local is in your PATH, just type R. On Gentoo, you'll want to type env-update && source /etc/profile for the R program to be accessible in your PATH.

How much does MKL improve R performance relative to the built-in BLAS/LAPACK?

After someone asked in a comment below, I ran a few quick tests to determine how much MKL sped up my particular installation of R. To do this, I compiled a version of R 2.10.1 using the default settings (just ./configure; make). I then ran an established set of R benchmarks (also used in REvolution's calculations) from here: http://r.research.att.com/benchmarks/. The benchmark script was run in a fresh R session each time, and the benchmarks were repeated 15 times for each R distribution. My computer is an Intel Core 2 Quad 9550 (2.83GHz) with 7GB RAM. The results are impressive and very similar to those reported by REvolution. (The means below represent the number of seconds required to run the full R-benchmark-25.R script.)

Default R:
- Mean=64.95s; SD=11.83s

R with MKL:
- Mean=11.84s; SD=0.13s

In other words, the MKL version was around 5.5 times faster than R using the built-in BLAS/LAPACK. Caveat: The speedups may have been due, in part, to the use of -O3 and -march flags, as well as the linker optimizations, but I bet that the vast majority is due to MKL.
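If you just want a rough sense of the difference on your own machine without running the full suite, timing a single BLAS-heavy operation in each build is telling (a quick check, not a formal benchmark):

```r
#time a large matrix cross-product, which exercises the BLAS dgemm
#routine that MKL accelerates; compare the elapsed time across builds
set.seed(1)
m <- matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(crossprod(m))
```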

----
Next up, I'll write a quick post on how to use your MKL-supercharged R installation within Eclipse. I hope that this guide proves useful and stimulates more people to try out MKL for R.

Wednesday, April 7, 2010

What this blog is about

I recently decided to start this blog to discuss my experiences as a former computer programmer who is pursuing a career in psychological research. I completed my Ph.D. in clinical psychology in 2009 and am currently a postdoc in the Department of Psychiatry at the University of Pittsburgh. My research focuses on improving the classification of personality disorders using advanced statistical methods, particularly latent variable modeling. I am also interested in the development of personality problems during childhood and adolescence and am starting to pursue developmental neuroimaging training to understand the role of cognitive control in personality dysfunction.

This blog is primarily intended to discuss my experiences programming in R (http://www.r-project.org), which is the main statistical package I use for data analysis. I am passionate about data visualization, so you're likely to see some of my explorations using ggplot2, an awesome R plotting package. I also use SAS, Mplus, and Matlab for some projects. And I use AFNI and FSL to analyze neuroimaging data, so those may float into the mix at times. My work computer dual boots Windows 7 and Gentoo Linux, depending on the task.

I am especially interested to discuss best coding practices in R and am always eager to learn better methods in data analysis. I look forward to your feedback!

Cheers!
Michael