RMOA: Massive online data stream classifications with R & MOA
R-bloggers 2014-05-18
Summary:
For those of you who don't know MOA: it stands for Massive On-line Analysis and is an open-source framework for building and running machine learning or data mining experiments on evolving data streams. According to the MOA website (http://moa.cms.waikato.ac.nz), it contains machine learning algorithms for classification, regression, clustering, outlier detection and recommendation engines. For R users who work with a lot of data or run into RAM issues when building models on large datasets, MOA, and data streams in general, have some nice features. Namely:
- It uses a limited amount of memory, so you avoid RAM issues when building models.
- It processes one example at a time and passes over each example only once.
- It works incrementally, so a model is directly ready to be used for prediction.
- Easy to set up data streams on data in RAM (data.frame/matrix), data in files (csv, delimited, flat table) as well as out-of-memory data in an ffdf (ff package).
- Easy to set up a MOA classification model.
- There are 26 classification models available, which range from:
  - Classification Trees (AdaHoeffdingOptionTree, ASHoeffdingTree, DecisionStump, HoeffdingAdaptiveTree, HoeffdingOptionTree, HoeffdingTree, LimAttHoeffdingTree, RandomHoeffdingTree)
  - Bayes Rule (NaiveBayes, NaiveBayesMultinomial)
  - Ensemble learning
    - Bagging (LeveragingBag, OzaBag, OzaBagAdwin, OzaBagASHT)
    - Boosting (OCBoost, OzaBoost, OzaBoostAdwin)
    - Stacking (LimAttClassifier)
    - Other (AccuracyUpdatedEnsemble, AccuracyWeightedEnsemble, ADACC, DACC, OnlineAccuracyUpdatedEnsemble, TemporallyAugmentedClassifier, WeightedMajorityAlgorithm)
  - Active learning (ActiveClassifier)
- Easy interface to train the model on streaming data, using the formula interface R users already know, as in
    trainMOA(model, formula, data, subset, na.action = na.exclude, ...)
- Easy to score new data with the trained model, as in
    predict(object, newdata, type = "response", ...)
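To make the "one example at a time, only once, limited memory" idea from the list above a bit more concrete, here is a minimal base R sketch of an incremental learner. It is not how MOA or RMOA are implemented: it is a toy one-attribute Gaussian classifier that keeps only a running count, mean and sum of squared deviations per class (Welford's update) and assumes equal class priors. All function and variable names (new_stats, update_stats, stats_var, class_stats, predict_one) are made up for this illustration.

```r
# Illustration only: a one-pass, constant-memory learner in base R.
# None of these helpers are part of RMOA or MOA; the names are hypothetical.

# Running statistics for one numeric attribute within one class
new_stats <- function() list(n = 0, mean = 0, m2 = 0)

# Welford's update: fold a single observation into the running statistics
update_stats <- function(s, x) {
  s$n <- s$n + 1
  delta <- x - s$mean
  s$mean <- s$mean + delta / s$n
  s$m2 <- s$m2 + delta * (x - s$mean)
  s
}

# Sample variance recovered from the running statistics
stats_var <- function(s) if (s$n > 1) s$m2 / (s$n - 1) else NA_real_

data(iris)
class_stats <- list()

# Simulate a stream: look at each row exactly once, then forget it
for (i in seq_len(nrow(iris))) {
  label <- as.character(iris$Species[i])
  if (is.null(class_stats[[label]])) class_stats[[label]] <- new_stats()
  class_stats[[label]] <- update_stats(class_stats[[label]], iris$Petal.Length[i])
}

# Predict the class whose fitted Gaussian gives the highest density
# (equal class priors are assumed to keep the sketch short)
predict_one <- function(class_stats, x) {
  dens <- sapply(class_stats, function(s) {
    dnorm(x, mean = s$mean, sd = sqrt(stats_var(s)))
  })
  names(which.max(dens))
}

predict_one(class_stats, 1.4)
predict_one(class_stats, 6.5)
```

The memory used here depends on the number of classes, not on the number of rows seen, and predict_one can be called at any point during the loop. That is the property the MOA classifiers offer at a much larger scale and with far better models.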
##
## Installation from github
##
library(devtools)
install.packages("ff")
install.packages("rJava")
install_github("jwijffels/RMOA", subdir="RMOAjars/pkg")
install_github("jwijffels/RMOA", subdir="RMOA/pkg")

##
## HoeffdingTree example
##
require(RMOA)
hdt <- HoeffdingTree(numericEstimator = "GaussianNumericAttributeClassObserver")
hdt
## Define a stream - e.g. a stream based on a data.frame
data(iris)
iris <- factorise(iris)
irisdatastream <- datastream_dataframe(data=iris)
## Train the HoeffdingTree on the iris dataset
mymodel <- trainMOA(model = hdt, formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length, data = irisdatastream)
## Predict using the HoeffdingTree on the iris dataset
scores <- predict(mymodel, newdata=iris, type="response")
table(scores, iris$Species)
scores <- predict(mymodel, newdata=iris, type="votes")
head(scores)

##
## Boosted set of HoeffdingTrees
##
irisdatastream <- datastream_dataframe(data=iris)
mymodel <- OzaBoost(baseLearner = "trees.HoeffdingTree", ensembleSize = 30)
mymodel <- trainMOA(model = mymodel, formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length, data = irisdatastream)
## Predict
scores <- predict(mymodel, newdata=iris, type="response")
table(scores, iris$Species)
scores <- predict(my