Tuesday, May 4, 2010

Developing a user-friendly regular expression function (easyGregexpr)

In the past few months, I've developed a set of functions for automating model estimation and interpretation using Mplus, an outstanding latent variable modeling program that has unparalleled flexibility for complex models (e.g., factor mixture models). I recently rolled these functions into an R package called MplusAutomation. Because the package focuses on extracting various parameters from text output files, I've learned a lot about regular expressions, particularly Perl-compatible regular expressions using PCRE. R provides a handful of useful regular expression routines that are Perl-compatible (including perl=TRUE as a parameter) and I've made frequent use of grep, regexpr, and gregexpr.

The problem with regexpr and gregexpr, in particular, is that their output is wonky and does not lend itself to easy string manipulations. The rest of the post will focus on gregexpr, which is identical to regexpr, except that it returns all matches for a regular expression, whereas regexpr returns only the first. So, if you're searching for all instances of the letter "a" in the line "abcacdabb", regexpr would only match the first a, whereas gregexpr would find all three a's.

Let's take a simple example. We want R to extract all HTML tags from a text file read into a character vector using the scan function. So that it's easy to follow, I've just defined a character vector with a simple HTML example.

> exampleText <- c("<html>< head>", "<title>This is a title.</title>", "</head>", "<body>", "<h1>This is an example header.</h1><p>And here is some basic text.</p>", "A line without any tags.", "</body>", "</html>")

> exampleText
[1] "<html>< head>"
[2] "<title>This is a title.</title>"
[3] "</head>"
[4] "<body>"
[5] "<h1>This is an example header.</h1><p>And here is some basic text.</p>"
[6] "A line without any tags."
[7] "</body>"
[8] "</html>"

Our goal is to locate all of the opening and closing HTML tags in this file with the intention of processing them further in some way. In a real-world example, we might want to compute quantities based on certain numbers extracted from text or to replace certain strings after making some changes. Here is the output from gregexpr using a simple regular expression that matches all HTML tags in the source above.
> gregexpr("<\\s*/*\\s*\\w+\\s*>", exampleText, perl=TRUE)
[1] 1 7
[1] 6 7

[1] 1 24
[1] 7 8

[1] 1
[1] 7

[1] 1
[1] 6

[1] 1 31 36 67
[1] 4 5 3 4

[1] -1
[1] -1

[1] 1
[1] 7

[1] 1
[1] 7

As can be seen above, gregexpr returns a list where each element in the character vector is represented as a list element and the starting positions for each match on a line are returned as numeric vectors within the list elements. The length of the match (i.e., the number of characters) is stored as an attribute "match.length".

There are a few things that irk me about the output of gregexpr. First, the matched string itself is not returned (one would need to use substr to obtain this). If nothing else, this makes it very difficult to know if you've written a regular expression correctly (i.e., debugging). Second, storing vectors within lists makes it more difficult to extract the lines and positions of matches relative to a simple data.frame. I suppose one could use some combination of lapply and sapply to extract values of interest, but it seems unintuitive to me. Third, if one were interested in the matched string, the substr function expects to receive a start and stop character within the string, but gregexpr returns a match length, so one must resort to expressions like

result <- gregexpr("<\\s*/*\\s*\\w+\\s*>", exampleText, perl=TRUE)

matchedString <- substr(exampleText[1], result[[1]][2], result[[1]][2] + attr(result[[1]], "match.length")[2] - 1)

Now that is some painful code!

My goal was to develop a simple wrapper for gregexpr that would return a data.frame with the starting and stop positions of each match, as well as the matched string itself. The wrapper is useful for parsing character vectors (although it would be easy to extend it to lists or data.frames).

Here's the code:
easyGregexpr <- function(pattern, charvector, ...) {

if (storage.mode(charvector) != "character") stop("easyGregexpr expects charvector to be a character vector.")

#identify all matches
regexpMatches <- gregexpr(pattern, charvector, ...)

convertMatches <- c()
for (i in 1:length(regexpMatches)) {
thisLine <- regexpMatches[[i]]
#only append if there is at least one match on this line
if (thisLine[1] != -1) {
convertMatches <- rbind(convertMatches, data.frame(element=i, start=thisLine, end=thisLine + attr(thisLine, "match.length") - 1))

#if no matches exist, return null (otherwise, will break adply)
if (is.null(convertMatches)) return(NULL)

#We now have a data frame with the line, starting position, and ending position of every match
#Add the matched string to the data.frame
#Use adply to iterate over rows and apply substr func
convertMatches <- adply(convertMatches, 1, function(row) {
row$match <- substr(charvector[row$element], row$start, row$end)

#need to convert from factor to character because adply uses stringsAsFactors=TRUE even when overridden
convertMatches$match <- as.character(convertMatches$match)

Any option for gregexpr (e.g., perl=TRUE) can be passed identically to easyGregexpr (because of the ... parameter). The code relies on Hadley Wickham's useful plyr package, although I wish that ddply would accept an argument of 1 for .variables to allow for processing by row (I've seen workarounds like the adply above or using transform, but these seem less intuitive). I also know that the memory management for the function isn't ideal because of the repeated rbind calls to an initially empty variable. That said, I developed a version that preallocated the convertMatches data.frame starting with the call numMatches <- sum(sapply(regexpMatches, function(x) sum(x > 0))) and it was no faster in my profiling runs and the code was less readable.

When we use the easyGregexpr function to parse the exampleText above, here is the output:

> easyGregexpr("<\\s*/*\\s*\\w+\\s*>", exampleText, perl=TRUE)
   element start end    match
1        1     1   6   <html>
2        1     7  13  < head>
3        2     1   7  <title>
4        2    24  31 </title>
5        3     1   7  </head>
6        4     1   6   <body>
7        5     1   4     <h1>
8        5    31  35    </h1>
9        5    36  38      <p>
10       5    67  70     </p>
11       7     1   7  </body>
12       8     1   7  </html>

A friendly, simple, clean data.frame with the elements, positions, and strings matched! I hope that this function proves useful to others. Maybe in future iterations of R, the built-in functions will provide more useful output. For now, I find myself using this wrapper frequently for parsing text.