Build a search engine in 20 minutes or less
R-bloggers 2013-03-27
Summary:
…or your money back.

author = "Ben Ogorek" Twitter = "@baogorek" email = paste0(sub("@", "", Twitter), "@gmail.com") Setup
Pretend this is Big Data:
doc1 <- "Stray cats are running all over the place. I see 10 a day!" doc2 <- "Cats are killers. They kill billions of animals a year." doc3 <- "The best food in Columbus, OH is the North Market." doc4 <- "Brand A is the best tasting cat food around. Your cat will love it." doc5 <- "Buy Brand C cat food for your cat. Brand C makes healthy and happy cats." doc6 <- "The Arnold Classic came to town this weekend. It reminds us to be healthy." doc7 <- "I have nothing to say. In summary, I have told you nothing." and this is the Big File System:
doc.list <- list(doc1, doc2, doc3, doc4, doc5, doc6, doc7)
N.docs <- length(doc.list)
names(doc.list) <- paste0("doc", c(1:N.docs))

You have an information need that is expressed via the following text query:
query <- "Healthy cat food" How will you meet your information need amidst all this unstructured text?
Jokes aside, we're going to use an old method, one that's tried and true, one that goes way back to the 1960s. We're going to implement the vector space model of information retrieval in R. In the process, you'll hopefully learn something about the tm package and about the analysis of unstructured data before it was Big.
Text mining in R
Fundamentals of the tm package
If you have not installed the tm [1][2] and Snowball [3] packages, please do so now.
install.packages("tm") install.packages("Snowball") Load the tm package into memory.
library(tm)

In text mining and related fields, a corpus is a collection of texts, often with extensive manual annotation. Not surprisingly, the Corpus class is a fundamental data structure in tm.
my.docs <- VectorSource(c(doc.list, query))
my.docs$Names <- c(names(doc.list), "query")
my.corpus <- Corpus(my.docs)
my.corpus

## A corpus with 8 text documents

Above we treated the query like any other document. It is, after all, just another string of text. Queries are not typically known a priori, but in the processing steps that follow, we will pretend we knew ours in advance to avoid repeating steps. (A sketch of what that repeated processing would look like follows the cleaning steps below.)
Standardizing
One of the nice things about the Corpus class is the tm_map function, which cleans and standardizes documents within a Corpus object. Below are some of the transformations.
getTransformations()

## [1] "as.PlainTextDocument" "removeNumbers"        "removePunctuation"
## [4] "removeWords"          "stemDocument"         "stripWhitespace"

First, let's get rid of punctuation.
my.corpus <- tm_map(my.corpus, removePunctuation)
my.corpus$doc1

## Stray cats are running all over the place I see 10 a day

Suppose we don't want to count “cats” and “cat” as two separate words. Then we will use the stemDocument transformation to implement the famous Porter Stemmer algorithm. To use this particular transformation, first load the Snowball package.
library(Snowball)
my.corpus <- tm_map(my.corpus, stemDocument)
my.corpus$doc1

## Stray cat are run all over the place I see 10 a day

Finally, remove numbers and any extra white space.
my.corpus <- tm_map(my.corpus, removeNumbers)
my.corpus <- tm_map(my.corpus, tolower)
my.corpus <- tm_map(my.corpus, stripWhitespace)
my.corpus$doc1

## stray cat are run all over the place i see a day

We applied all these standardization techniques without much thought. For instance, we sacrificed inflection in favor of fewer words. But at least the transformations make sense on a heuristic level, much like the similarity concepts to follow.
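As promised above, here is what handling a query that arrives after the corpus has been cleaned would look like. This is only a sketch, assuming the same 2013-era tm API used in this post, and new.query is a made-up example rather than part of the original exercise: the query gets its own one-document corpus and the same sequence of transformations, so its terms line up with the corpus vocabulary.

# Sketch only: repeat the cleaning steps on a late-arriving query
new.query <- "cheap cat food"                      # hypothetical query
query.corpus <- Corpus(VectorSource(new.query))    # a one-document corpus
query.corpus <- tm_map(query.corpus, removePunctuation)
query.corpus <- tm_map(query.corpus, stemDocument)
query.corpus <- tm_map(query.corpus, removeNumbers)
query.corpus <- tm_map(query.corpus, tolower)
query.corpus <- tm_map(query.corpus, stripWhitespace)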
The vector space model
Document similarity
Here's a trick that's been around for a while: represent each document as a vector in \( \mathbb{R}^N \) (with \( N \) as the number of distinct words in the vocabulary) and use the angle \( \theta \) between the vectors as a similarity measure. Rank by the similarity of each document to the query and you have a search engine.
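To make the geometry concrete, here is a small base-R illustration (not from the original post) of the cosine of the angle between two made-up term-count vectors, \( \cos\theta = \frac{d \cdot q}{\lVert d \rVert \, \lVert q \rVert} \). A value near 1 means the two vectors point in nearly the same direction, i.e., the document and query use words in similar proportions.

# Illustration only: cosine similarity between two toy term-count vectors
d <- c(cat = 2, food = 1, healthy = 0, stray = 1)  # a document's term counts
q <- c(cat = 1, food = 1, healthy = 1, stray = 0)  # the query's term counts
cosine.sim <- sum(d * q) / (sqrt(sum(d^2)) * sqrt(sum(q^2)))
cosine.sim

## [1] 0.7071068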
One of the simplest things we can do is to count words within documents. This naturally forms a two-dimensional structure,