Regular expressions in R vs RStudio
R-bloggers 2014-04-14
Summary:
The 'regex' family of languages and commands is used for manipulating text strings. More specifically, regular expressions are typically used for finding specific patterns of characters and replacing them with others. These techniques have a range of applications, and R's base installation has powerful regex tools (not to mention add-on packages for string manipulation such as stringr). Yet regex in R is a cause of much confusion, as indicated by the multitude of Stack Overflow questions on the subject, and the documentation is difficult for novices to the world of regex.
In this post we are going to see a few reproducible examples of R's implementation of regular expressions and how it can make your life easier. There is already much useful information on the topic, such as a simple how-to on using regex to load files with specific names, an excellent introduction from Regular-Expressions.info and R's terse documentation on the matter, triggered with ?regex.
In an attempt to adhere to the DRY principle, this article will focus on just three topics: R's basic regex commands, search and replace in R, and RStudio's implementation of regex. The final topic has rarely been discussed, yet understanding the differences between regex in RStudio's search panel and in standard R can save much time. RStudio's implementation of regex is confusingly different from R's default regex behavior.
R's basic regex commands
Let us start with an example: column names that describe age categories and attributes, that are illegal in R because they begin with a digit, and that we eventually want to prefix with 'm'. Here are the original column names imported from Excel:
x <- c("16_24cat", "25_34cat", "35_44catch", "45_54Cat", "55_104fat")
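To see why these names count as illegal, one possible check (a sketch that is not part of the original post) uses base R's make.names, which converts strings into syntactically valid names. A name that comes back unchanged was already valid:

```r
x <- c("16_24cat", "25_34cat", "35_44catch", "45_54Cat", "55_104fat")

# make.names() returns syntactically valid versions of its input;
# a name is already valid if it comes back unchanged
valid <- make.names(x) == x
valid

# the repaired names that R would generate automatically
repaired <- make.names(x)
repaired
```

None of the five names passes the check, because a syntactic name in R cannot start with a digit.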
Imagine we first want to select all items in this vector containing 'cat'. The basic regex command in R is grep, which simply returns the indices of the matching elements:
grep(pattern = "cat", x = x)
## [1] 1 2 3
Note that 'catch' was included whereas 'Cat' was not: the matching pattern can appear anywhere in the text string but is case sensitive. To make the search ignore case, simply add the ignore.case = TRUE argument. To exclude 'catch', the dollar sign, which anchors the pattern to the end of the string, can be used. These arguments can be used to extract the names we are really interested in:
grep("cat$", x, ignore.case = TRUE)
## [1] 1 2 4
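The indices above can be turned into the names themselves in two ways. The following is a minimal sketch using grep's value argument; the variable names matched and same are only illustrative:

```r
x <- c("16_24cat", "25_34cat", "35_44catch", "45_54Cat", "55_104fat")

# value = TRUE returns the matching elements rather than their indices
matched <- grep("cat$", x, ignore.case = TRUE, value = TRUE)
matched

# equivalent: subset x with the index vector returned by grep
same <- x[grep("cat$", x, ignore.case = TRUE)]
identical(matched, same)
```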
grepl is the same as grep, except that it returns a logical value (TRUE or FALSE) for each element:
grepl("cat$", x, ignore.case = TRUE)
## [1] TRUE TRUE FALSE TRUE FALSE
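Because grepl returns one logical value per element, its output is convenient for counting and for negated subsetting. A short sketch, with is_cat, n_cat and others as illustrative names:

```r
x <- c("16_24cat", "25_34cat", "35_44catch", "45_54Cat", "55_104fat")

# one TRUE/FALSE per element of x
is_cat <- grepl("cat$", x, ignore.case = TRUE)

# TRUE counts as 1, so sum() gives the number of matches
n_cat <- sum(is_cat)

# negate the logical vector to keep the names that do NOT end in 'cat'
others <- x[!is_cat]
others
```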
The final regex-related command worth knowing is strsplit. Imagine we want all characters on the right-hand side of the underscore.
strsplit(x, split = "_")
## [[1]]
## [1] "16" "24cat"
##
## [[2]]
## [1] "25" "34cat"
##
## [[3]]
## [1] "35" "44catch"
##
## [[4]]
## [1] "45" "54Cat"
##
## [[5]]
## [1] "55" "104fat"
Strangely, the hardest part of using the strsplit function is re-combining the list output into a useful form. For this, in base R, we need the power of sapply:
sapply(strsplit(x, split = "_"), "[", 2)
## [1] "24cat" "34cat" "44catch" "54Cat" "104fat"
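A slightly stricter base R variant, offered here as a sketch rather than as the post's own method, uses vapply, which states up front that each list element should yield exactly one character string. Setting fixed = TRUE treats the underscore as a literal string rather than as a regular expression:

```r
x <- c("16_24cat", "25_34cat", "35_44catch", "45_54Cat", "55_104fat")

# fixed = TRUE treats "_" as a literal string rather than a regex
parts <- strsplit(x, split = "_", fixed = TRUE)

# vapply declares that each result is a single character string
# and raises an error if an element returns anything else
rhs <- vapply(parts, function(p) p[2], character(1))
rhs
```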
A simpler way to achieve this same result would be to use the str_split_fixed function of the stringr package:
library(stringr)
str_split_fixed(x, "_", 2)[, 2]
## [1] "24cat" "34cat" "44catch" "54Cat" "104fat"
Finding and replacing in R
To search and replace the first instance of a pattern, use sub. Much more useful is gsub, which replaces all instances. To replace all instances of 'cat'