Stone flakes II

R-bloggers 2014-06-16

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)
Continuing from last week, the aim is now to classify the stone flakes based on their various properties. Three methods are used. LDA is an obvious standard. A classification tree is both simple and visually appealing. Random forest as a complex method, where more complex relations can easily be captured. Surprising with these data is that the classification tree is doing better than random forest with respect to predicting the input data.

Data

This is the same as last week. However, I now opted to make the group labels a bit more short. r2 <- read.table('StoneFlakes.txt',header=TRUE,na.strings='?') r1 <- read.table('annotation.txt',header=TRUE,na.strings='?') r12 <- merge(r1,r2) r12$Group <- factor(r12$group,labels=c('Lower Paleolithic',     'Levallois technique',     'Middle Paleolithic',     'Homo Sapiens')) r12c <- r12[complete.cases(r12),]

LDA

LDA is part of the MASS package. The settings are default. The plot function has been adapted, to retain proper labeling and colors. ld1 <- lda(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,     data=r12c) ld1 Call: lda(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,      data = r12c) Prior probabilities of groups:   Lower Paleolithic Levallois technique  Middle Paleolithic        Homo Sapiens           0.15492958          0.32394366          0.47887324          0.04225352  Group means:                          LBI      RTI      WDI      FLA       PSF       FSF Lower Paleolithic   1.105455 33.82727 2.690000 118.8182 31.881818  8.372727 Levallois technique 1.254348 28.74348 2.768261 121.2609 15.021739 14.900000 Middle Paleolithic  1.195000 25.88235 3.384412 113.9706  9.497059 23.294118 Homo Sapiens        1.453333 22.96333 3.150000 117.0000  7.666667 27.900000                         ZDF1    PROZD Lower Paleolithic   15.86364 57.81818 Levallois technique 30.46087 69.17391 Middle Paleolithic  62.31471 84.23529 Homo Sapiens        56.36667 87.00000 Coefficients of linear discriminants:               LD1           LD2         LD3 LBI    2.63864596 -7.3053369000  4.96921267 RTI    0.03735899  0.0683327644  0.03955854 WDI    0.64616126 -0.3086473881  0.06507385 FLA   -0.04798439 -0.0591106950 -0.05229559 PSF   -0.01917272  0.0515112771  0.10178299 FSF    0.02488002 -0.0001579818  0.07080551 ZDF1   0.02896112  0.0396366883 -0.01625858 PROZD  0.03036974 -0.0079845555  0.04335520 Proportion of trace:    LD1    LD2    LD3  0.7026 0.2585 0.0389  There are eight objects incorrect classified. Out of 71 complete cases, that is 10% wrong. mydf <- data.frame(ID=r12c$ID,     Group=r12c$Group,     predict=predict(ld1)$class) mydf[mydf$Group!=mydf$pred,]
    ID               Group             predict 9    c  Middle Paleolithic Levallois technique 10  cl   Lower Paleolithic Levallois technique 37  ms Levallois technique   Lower Paleolithic 38   n  Middle Paleolithic Levallois technique 45 roe  Middle Paleolithic Levallois technique 55  sz Levallois technique  Middle Paleolithic 61  va Levallois technique        Homo Sapiens 68 woe   Lower Paleolithic Levallois technique

LDA Plot

This is a small adaptation from the plot.lda function in MASS. At visual examination, it seems that the groups are not completely separated. Especially, I wonder if they are much better separated than last weeks biplot.
cols <- colorRampPalette(c('violet','gold','seagreen'))(4) 
myldaplot <- function (x, panel = panel.lda, ..., cex = 0.7, dimen, abbrev = FALSE, 
    xlab = "LD1", ylab = "LD2",indata,cols) 
{
  panel.lda <- function(x, y, ...) text(x, y, g, 
        cex = cex, ...)
  if (!is.null(Terms <- x$terms)) {
    data <- model.frame(x)
    X <- model.matrix(delete.response(Terms), data)
    g <- indata$ID
    mr <- cols[as.numeric(model.response(data))]
    xint <- match("(Intercept)", colnames(X), nomatch = 0L)
    if (xint > 0L) 
      X <- X[, -xint, drop = FALSE]
  }
  else {
    xname <- x$call$x
    gname <- x$call[[3L]]
    if (!is.null(sub <- x$call$subset)) {
      X <- eval.parent(parse(text = paste(deparse(xname, 
                      backtick = TRUE), "[", deparse(sub, backtick = TRUE), 
                  ",]")))
      g <- eval.parent(parse(text = paste(deparse(gname, 
                      backtick = TRUE), "[", deparse(sub, backtick = TRUE), 
                  "]")))
    }
    else {
      X <- eval.parent(xname)
      g <- eval.parent(gname)
    }
    if (!is.null(nas <- x$call$na.action)) {
      df <- data.frame(g = g, X = X)
      df <- eval(call(nas, df))
      g <- df$g
      X <- df$X
    }
  }
  if (abbrev) 
    levels(g) <- abbreviate(levels(g), abbrev)
  means <- colMeans(x$means)
  X <- scale(X, center = means, scale = FALSE) %*% x$scaling
  if (!missing(dimen) && dimen < ncol(X)) 
    X <- X[, 1L:dimen, drop = FALSE]
  if (ncol(X) > 2L) {
    pairs(X, panel = panel, ...,col=mr)
  }
  else if (ncol(X) == 2L) {
    eqscplot(X[, 1L:2L], xlab = xlab, ylab = ylab, type = "n", 
        ...)
    panel(X[, 1L], X[, 2L], ...)
  }
  else ldahist(X[, 1L], g, xlab = xlab, ...)
  invisible(NULL)
}
myldaplot(ld1,indata=r12c,cols=cols)

Classification tree

Rpart is my favorite classification tree implementation. The only tuning is setting minsplit to 10, the default of 20 seems a bit large for 79 objects and four categories. The printed output is skipped, since we have the plot. The interpretation is pretty simple. First split on ZDF1, to distinguish old from young (less old?). The young can be split on LBI to Middle paleolithic and homo sapiens. The old by PSF to Levoillas and Lower Paleolithic. A final split between middle and lower paleolithic shows that the differences are not clear cut.
library(rpart)
rp1 <- rpart(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
    data=r12,minsplit=10)
plot(rp1,margin=.1)
text(rp1, use.n = TRUE,cex=.8)
There are ten incorrect predicted objects. Note that it is difficult to compare this with LDA, since objects with missing data (e.g. sk misses FLA, PSF, FSF) are predicted with rpart, while they were removed prior to LDA analysis.
mydf <- data.frame(ID=r12$ID,     Group=r12$Group,     predict=predict(rp1,type='class')) mydf[mydf$Group!=mydf$predict,]
    ID               Group             predict 6  bie Levallois technique  Middle Paleolithic 10   c  Middle Paleolithic Levallois technique 11  cl   Lower Paleolithic Levallois technique 42  ms Levallois technique   Lower Paleolithic 43   n  Middle Paleolithic Levallois technique 59  sk Levallois technique  Middle Paleolithic 62  sz Levallois technique  Middle Paleolithic 65  ta Levallois technique  Middle Paleolithic 76 woe   Lower Paleolithic Levallois technique 77 wol Levallois technique  Middle Paleolithic
Note that cross validation makes for a much worse model assessment; about one third are incorrectly predicted with leave one out.
xmat <- xpred.rpart(rp1,xval=79)
colSums(xmat!=do.call(cbind,lapply(1:5,function(x) as.numeric(r12$Group) )))
0.71428571 0.30304576 0.12371791 0.05832118 0.02182179 
        42         26         23         24         25 

Randomforest

Randomforest seems to predict slightly worse than the more simple methods, with OOB error around 21%. As might be expected, Homo Sapiens, with only 3 rows, is particularly difficult to classify. Similar to the classification tree ZDF1 is an important variable, but FLA and PROZD were not important in the classification tree but are in random forest.
library(randomForest)
rf1 <- randomForest(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
   data=r12c,
   importance=TRUE,
   nodesize=10)
rf1
Call:  randomForest(formula = Group ~ LBI + RTI + WDI + FLA + PSF +      FSF + ZDF1 + PROZD, data = r12c, importance = TRUE, nodesize = 10)                Type of random forest: classification                      Number of trees: 500 No. of variables tried at each split: 2         OOB estimate of  error rate: 23.94% Confusion matrix:                     Lower Paleolithic Levallois technique Middle Paleolithic Lower Paleolithic                   8                   3                  0 Levallois technique                 3                  17                  3 Middle Paleolithic                  1                   4                 29 Homo Sapiens                        0                   0                  3                     Homo Sapiens class.error Lower Paleolithic              0   0.2727273 Levallois technique            0   0.2608696 Middle Paleolithic             0   0.1470588 Homo Sapiens                   0   1.0000000
Wrong predicted:
mydf <- data.frame(ID=r12c$ID,     Group=r12c$Group,     predict=predict(rf1)) mydf[mydf$Group!=mydf$predict,]
    ID               Group             predict 6  bie Levallois technique  Middle Paleolithic 8   bo Levallois technique  Middle Paleolithic 9   by Levallois technique   Lower Paleolithic 10   c  Middle Paleolithic Levallois technique 11  cl   Lower Paleolithic Levallois technique 14  e2  Middle Paleolithic   Lower Paleolithic 21  g5  Middle Paleolithic Levallois technique 28 gue Levallois technique   Lower Paleolithic 40  ml   Lower Paleolithic Levallois technique 42  ms Levallois technique   Lower Paleolithic 43   n  Middle Paleolithic Levallois technique 50 roe  Middle Paleolithic Levallois technique 55 sa1        Homo Sapiens  Middle Paleolithic 56 sa2        Homo Sapiens  Middle Paleolithic 57 sa3        Homo Sapiens  Middle Paleolithic 62  sz Levallois technique  Middle Paleolithic

importanceplot

varImpPlot(rf1)

To leave a comment for the author, please follow the link and comment on his blog: Wiekvoet.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...