Regular expressions in R vs RStudio

R-bloggers 2014-04-14

(This article was first published on Robin Lovelace - R, and kindly contributed to R-bloggers)

The 'regex' family of languages and commands is used for manipulating text strings. More specifically, regular expressions are typically used for finding specific patterns of characters and replacing them with others. These techniques have a range of applications, and R's base installation has powerful regex tools (not to mention add-on packages for string manipulation such as stringr). Yet regex in R is a cause of much confusion, as indicated by the multitude of Stack Overflow questions on the subject, and the documentation is difficult for novices to the world of regex.

In this post we are going to see a few reproducible examples of R's implementation of regular expressions and how it can make your life easier. There is already much useful information on the topic, such as a simple how-to on using regex to load files with specific names, an excellent introduction from Regular-Expressions.info and R's terse documentation on the matter, triggered with ?regex. In an attempt to adhere to the DRY principle, this article will focus on just three topics: R's basic regex commands, search and replace in R, and RStudio's implementation of regex. The final topic has rarely been discussed, yet understanding the differences between regex in RStudio's search panel and in standard R can save much time: RStudio's implementation of regex is confusingly different from R's default regex behavior.

R's basic regex commands

Let us start with an example: illegal column names that describe age categories and attributes, and which we eventually want to prefix with 'm'. Here are the original column names imported from Excel:

x <- c("16_24cat", "25_34cat", "35_44catch", "45_54Cat", "55_104fat")

Imagine we first want to select all elements of this vector containing 'cat'. The basic regex command in R is grep, which simply returns the indices of the matching elements:

grep(pattern = "cat", x = x)
## [1] 1 2 3

Note that 'catch' was included whereas 'Cat' was not: the matching pattern can appear anywhere in the text string but is case sensitive. To make the search ignore case, simply add the ignore.case = TRUE argument. To exclude 'catch', the dollar sign can be used, which anchors the pattern to the end of the string. Together, these can be used to extract the names we are really interested in:

grep("cat$", x, ignore.case = TRUE)
## [1] 1 2 4

grepl is the same as grep, except that it returns a logical (TRUE/FALSE) value for each element:

grepl("cat$", x, ignore.case = TRUE)
## [1]  TRUE  TRUE FALSE  TRUE FALSE
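Positions and logical values are often only a stepping stone to the names themselves. The following is a minimal sketch, reusing the vector x from above, of two standard ways to get them: the value argument of grep, and subsetting with the output of grepl.

```r
x <- c("16_24cat", "25_34cat", "35_44catch", "45_54Cat", "55_104fat")

# value = TRUE returns the matching elements instead of their positions
grep("cat$", x, ignore.case = TRUE, value = TRUE)
## [1] "16_24cat" "25_34cat" "45_54Cat"

# The logical vector from grepl() can be used directly for subsetting
x[grepl("cat$", x, ignore.case = TRUE)]
## [1] "16_24cat" "25_34cat" "45_54Cat"

# Negating it keeps everything that does NOT match
x[!grepl("cat$", x, ignore.case = TRUE)]
## [1] "35_44catch" "55_104fat"
```

The grepl form is the more flexible of the two, because logical vectors can be combined with & and | before subsetting.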

The final regex-related command worth knowing is strsplit. Imagine we want all characters on the right-hand side of the underscore.

strsplit(x, split = "_")
## [[1]]
## [1] "16"    "24cat"
## 
## [[2]]
## [1] "25"    "34cat"
## 
## [[3]]
## [1] "35"      "44catch"
## 
## [[4]]
## [1] "45"    "54Cat"
## 
## [[5]]
## [1] "55"     "104fat"

Strangely, the hardest part of using strsplit is recombining the list output into a useful form. For this, in base R, we need the power of sapply:

sapply(strsplit(x, split = "_"), "[", 2)
## [1] "24cat"   "34cat"   "44catch" "54Cat"   "104fat"

A simpler way to achieve this same result would be to use the str_split_fixed function of the stringr package:

library(stringr)
str_split_fixed(x, "_", 2)[, 2]
## [1] "24cat"   "34cat"   "44catch" "54Cat"   "104fat"
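If the aim is only to keep the text to the right of the underscore, splitting can be skipped altogether. The sketch below is an alternative to the approaches above, not a replacement for them: it deletes the left-hand side with sub, and also shows vapply as a stricter variant of the sapply call.

```r
x <- c("16_24cat", "25_34cat", "35_44catch", "45_54Cat", "55_104fat")

# Delete everything up to and including the first underscore:
# ^      start of the string
# [^_]*  any run of characters that are not underscores
# _      the underscore itself
sub("^[^_]*_", "", x)
## [1] "24cat"   "34cat"   "44catch" "54Cat"   "104fat"

# vapply() works like sapply() but checks that each result is one string
vapply(strsplit(x, split = "_"), function(parts) parts[2], character(1))
## [1] "24cat"   "34cat"   "44catch" "54Cat"   "104fat"
```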

Finding and replacing in R

To search for and replace the first instance of a pattern, use sub. Much more useful is gsub, which replaces all instances. To replace every 'cat' at the end of a name with 'fat', regardless of case, use the following:

gsub(pattern = "cat$", replacement = "fat", x = x, ignore.case = TRUE)
## [1] "16_24fat"   "25_34fat"   "35_44catch" "45_54fat"   "55_104fat"
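Because each name above contains the pattern at most once, the difference between sub and gsub does not show up in that output. A small sketch with a made-up string (y is purely illustrative and not part of the original data) makes it visible:

```r
# Hypothetical string containing the pattern three times
y <- "cat_cat_cat"

# sub() replaces only the first match
sub("cat", "fat", y)
## [1] "fat_cat_cat"

# gsub() replaces every match
gsub("cat", "fat", y)
## [1] "fat_fat_fat"

# With the $ anchor only the match at the end of the string qualifies
gsub("cat$", "fat", y)
## [1] "cat_cat_fat"
```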

Let's try something more complicated. We want all instances of two digits followed by a lowercase letter to have an additional character inserted in front of them, but only if the first of those digits is 3 or less (more precisely, between 1 and 3, with a second digit between 1 and 9):

gsub("([1-3][1-9][a-z])", "m\\1", x, perl = TRUE)
## [1] "16_m24cat"  "25_m34cat"  "35_44catch" "45_54Cat"   "55_104fat"

The above syntax is bizarre, so let's run through it.

- We have specified, with the perl argument, that we want Perl-esque regex. Groups can be stored for later referral under this setting, as they also can with R's default regex engine.
- The curved brackets have no impact on the search result, but are used to store the contents of the group.
- The square brackets match any single character in the range of values indicated.
- The \\1 symbol means "replace this with the value captured in group 1", meaning that the matched digits and letter are retained, even though they were used in the match.
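The points above can be checked with a short sketch. It first repeats the replacement without the perl argument, to show that backreferences also work with R's default engine, and then uses two groups at once. The second example (swapping the two age bounds) is an illustration added here, not something from the original workflow.

```r
x <- c("16_24cat", "25_34cat", "35_44catch", "45_54Cat", "55_104fat")

# Same replacement as above, using the default regex engine
gsub("([1-3][1-9][a-z])", "m\\1", x)
## [1] "16_m24cat"  "25_m34cat"  "35_44catch" "45_54Cat"   "55_104fat"

# Two groups: \\1 is the number before the underscore, \\2 the one after it.
# Writing them back in reverse order swaps the two numbers.
sub("^([0-9]+)_([0-9]+)", "\\2_\\1", x)
## [1] "24_16cat"   "34_25cat"   "44_35catch" "54_45Cat"   "104_55fat"
```

Groups are numbered by the position of their opening bracket, counting from the left.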

Regular expressions in RStudio

The above bullet points may seem like a daunting amount of explanation for only one line of code, and the regex is still relatively simple! More complex operations are possible, especially when the perl = TRUE argument is enabled or when using the more coherent stringr package. However, one of the main uses that R users may have for regular expressions is not on their data per se (R is most useful for quantitative, not character, information after all). Regular expressions can be hugely useful for editing long R scripts to do different jobs.

RStudio's search and replace functionality is well known, but the little "Regex" tick mark is less so.

[Image: the "Regex" tick mark in RStudio's search and replace bar]

Hitting that tick mark brings the whole power of Perl regex to bear on your code editing: time spent learning the functionality of regular expressions can pay dividends in fast code modification and automation of repetitive tasks.

In fact, the example used throughout this article was originally implemented in RStudio to change the name of text strings so they would be valid column names. The text in the previous image means, in English, "search for any instance of a quote mark followed by a number, and then insert the letter m directly after the quote mark, before the number". This altered all the column names in an instant.
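The same edit can be approximated in plain R, which is a handy way to test a pattern before running it over a whole script. The sketch below rests on assumptions: the exact pattern typed into RStudio is not reproduced in the text, and code_line is a hypothetical line of script invented for illustration. Note the \\1 backreference used by R, where RStudio's replace box would use $1.

```r
# Hypothetical line of script, as it might appear in the editor
code_line <- 'names(df) <- c("16_24cat", "25_34cat")'

# Match a double quote followed by a digit, capture the digit,
# and write it back with an "m" between the quote and the digit
cat(gsub('"([0-9])', '"m\\1', code_line), "\n")
## names(df) <- c("m16_24cat", "m25_34cat")

# The same idea applied to the names themselves: prefix names that
# start with a digit, then confirm they are now syntactically valid
x <- c("16_24cat", "25_34cat", "35_44catch", "45_54Cat", "55_104fat")
fixed <- sub("^([0-9])", "m\\1", x)
all(fixed == make.names(fixed))
## [1] TRUE
```

make.names leaves a name unchanged only when it is already valid, which is why comparing its output with the input works as a check.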

Unfortunately, RStudio's website has little in the way of documentation for this little feature (hint hint to anyone on the RStudio team reading this), other than a thread about how it differs from R's implementation, as covered in this article.

Conclusion

Clearly, there is large potential for confusion in R's implementation of regular expressions. To pick one example, the group register $1 in RStudio and pure Perl is different from the \\1 register used in R. Yet it is well worth the effort of learning regex in R, as it opens up huge possibilities for the manipulation and analysis of text data, as well as for automating your own code rewriting.

This article and the reproducible examples should be of use to others. Understanding regular expressions in R and RStudio can make the R programming process more powerful, less prone to human error and faster. It is hoped that this will allow R users more time away from their computers, to enjoy nature and hopefully to return to their computers with a refreshed desire to learn and use their specialist skills for the greater good.
