Useful Functions in R for Manipulating Text Data
R-bloggers 2014-02-28
Introduction
In my current job, I study HIV at the genetic and biochemical levels. Thus, I often work with data involving the sequences of nucleotides or amino acids of various patient samples of HIV, and this type of work involves a lot of text manipulation. (Strictly speaking, I analyze sequences of nucleotides from DNA that is reverse-transcribed from HIV’s RNA.) In this post, I describe some common functions in R that I often use for text processing.
Obtaining Basic Information about Character Variables
In R, I often work with text data in the form of character variables. To check if a variable is a character variable, use the is.character() function.
> year = 2014
> is.character(year)
[1] FALSE
If a variable is not a character variable, you can convert it to a character variable using the as.character() function.
> year.char = as.character(year)
> is.character(year.char)
[1] TRUE
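Both functions also work on whole vectors, and as.numeric() reverses the conversion. The following is a minimal sketch; the vector `years` is made up for illustration and does not come from my data.

```r
# Hypothetical vector of years, used only for illustration
years <- c(2012, 2013, 2014)

# as.character() converts every element of the vector
years.char <- as.character(years)
is.character(years.char)   # TRUE

# as.numeric() converts the strings back to numbers
years.num <- as.numeric(years.char)
is.character(years.num)    # FALSE
```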
A basic piece of information about a character variable is the number of characters in the string. Use the nchar() function to obtain this information.
> nchar(year.char)
[1] 4
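Note that nchar() counts characters within each string, whereas length() counts the elements of a vector. A small sketch, assuming a made-up vector `seqs` of short sequences:

```r
# Hypothetical short sequences, used only for illustration
seqs <- c("ATCG", "GGACT", "")

# nchar() returns one count per element; the empty string has 0 characters
seq.lengths <- nchar(seqs)
seq.lengths    # 4 5 0

# length() counts the elements of the vector, not the characters
length(seqs)   # 3
```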
Pattern Matching and Manipulation
I often need to combine several character variables into one string, and the paste() function is useful for that. Notice my use of the “sep =” option to specify that I want to separate the variables with 1 space.
> first = 'The'
> second = 'Chemical'
> third = 'Statistician'
> my.name = paste(first, second, third, sep = ' ')
> my.name
[1] "The Chemical Statistician"
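Two related variations are worth knowing: the “collapse =” argument joins the elements of a single vector, and paste0() joins strings with no separator at all. A brief sketch (the variable names `words`, `joined` and `fragment` are my own choices for this example):

```r
words <- c("The", "Chemical", "Statistician")

# collapse joins the elements of one vector into a single string
joined <- paste(words, collapse = " ")
joined     # "The Chemical Statistician"

# paste0() behaves like paste() with sep = "",
# which is handy for concatenating sequence fragments
fragment <- paste0("ATC", "G")
fragment   # "ATCG"
```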
A common task in my job is determining whether or not a short sequence of nucleotides/amino acids is present in a much longer sequence. Essentially, I want to determine if a pattern of text exists in a character variable. The grepl() function is useful for that; in fact, the pattern of interest can be searched for in multiple character variables simultaneously: just combine the variables into a vector using the c() function!
> x = 'ATCG'
> y = 'GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT'
> z = 'CTATCGGGTAGCT'
> grepl(x, c(y, z))
[1] TRUE TRUE
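By default, grepl() interprets the pattern as a regular expression. For a plain sequence like 'ATCG' this makes no difference, but the “fixed = TRUE” argument forces a literal match. The sketch below adds a made-up sequence `w` that does not contain the pattern, to show what a non-match looks like.

```r
x <- "ATCG"
y <- "GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT"
z <- "CTATCGGGTAGCT"
w <- "GGGGCCCC"   # hypothetical sequence that does not contain x

# fixed = TRUE matches x as a literal string, not a regular expression
hits <- grepl(x, c(y, z, w), fixed = TRUE)
hits   # TRUE TRUE FALSE

# In a regular expression, "." matches any single character,
# so "AT.G" matches ATAG as well as ATCG
wildcard <- grepl("AT.G", "CCATAGCC")                # TRUE
literal  <- grepl("AT.G", "CCATAGCC", fixed = TRUE)  # FALSE: no literal "AT.G"
```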
If you want to determine precisely where “x” is located along “y” and along “z”, use the gregexpr() function.
> gregexpr(x, c(y, z))
[[1]]
[1] 19 25
attr(,"match.length")
[1] 4 4
attr(,"useBytes")
[1] TRUE

[[2]]
[1] 3
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE
The output of gregexpr(x, c(y, z)) is a list of 2 objects.
- The first object contains the positional information about the pattern “x” in the variable “y”.
  - “x” appears twice in the variable “y” – at positions 19 and 25. (Specifically, the “A” in x = ‘ATCG’ appears at positions 19 and 25.)
- The second object contains the positional information about the pattern “x” in the variable “z”.
To extract these positions, you must first slice the list into its 2 objects – use double square brackets, [[ ]], to do this. Then, you can extract the positions from each object – use single square brackets, [ ], to do this. For simplicity, let’s assign the output of gregexpr(x, c(y, z)) to a variable named “pos”.
> pos = gregexpr(x, c(y, z))
> pos[[1]]
[1] 19 25
attr(,"match.length")
[1] 4 4
attr(,"useBytes")
[1] TRUE
> pos[[1]][1]
[1] 19
> pos[[1]][2]
[1] 25
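When there are many sequences, indexing each list element by hand becomes tedious. One possible approach, sketched below, is to loop over the list with lapply() and keep only the start positions. gregexpr() reports -1 when the pattern is not found, so the sketch drops that value; the sequence `w` and the name `starts` are made up for this example.

```r
x <- "ATCG"
y <- "GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT"
z <- "CTATCGGGTAGCT"
w <- "GGGGCCCC"   # hypothetical sequence with no match

pos <- gregexpr(x, c(y, z, w), fixed = TRUE)

# Strip the attributes and drop the -1 that gregexpr() returns for "no match"
starts <- lapply(pos, function(p) {
  p <- as.integer(p)
  p[p > 0]
})

starts[[1]]   # 19 25
starts[[2]]   # 3
starts[[3]]   # integer(0)

# Number of occurrences of x in each sequence
n.matches <- sapply(starts, length)
n.matches     # 2 1 0
```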
If you want to extract a portion of a string, use the substr() function. For example, if I know that the first 3 nucleotides of a particular DNA sequence are junk, I would want to discard them and extract the rest of that sequence only. Let’s use the variable “y” to illustrate this.
> y
[1] "GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT"
> substr(y, 4, nchar(y))
[1] "CTCTAAATCCGTACTATCGTCATCGTTTTTCCT"
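The closely related substring() function accepts vectors of start and stop positions, and its stop position defaults to the end of the string. As a small sketch, the positions 19 and 25 found earlier with gregexpr() can be used to pull out both matches in one call (the names `starts`, `matches` and `trimmed` are my own choices here):

```r
y <- "GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT"

# Extract the 4 characters starting at position 19 (the first match of "ATCG")
first.match <- substr(y, 19, 22)
first.match   # "ATCG"

# substring() accepts vectors of positions,
# so both matches can be extracted at once
starts <- c(19, 25)
matches <- substring(y, starts, starts + 3)
matches       # "ATCG" "ATCG"

# With no stop position, substring() reads to the end of the string
trimmed <- substring(y, 4)
trimmed       # "CTCTAAATCCGTACTATCGTCATCGTTTTTCCT"
```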
Further Information
John Myles White, who co-wrote the excellent “Machine Learning for Hackers” with Drew Conway, has a nice blog entry on some other useful functions for text processing in R. If you have any more suggestions, please share them in the comments!