Stone flakes II
R-bloggers 2014-06-16
(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)
Continuing from last week, the aim is now to classify the stone flakes based on their various properties. Three methods are used. LDA is an obvious standard. A classification tree is both simple and visually appealing. Random forest as a complex method, where more complex relations can easily be captured. Surprising with these data is that the classification tree is doing better than random forest with respect to predicting the input data. Data
This is the same as last week. However, I now opted to make the group labels a bit more short. r2 <- read.table('StoneFlakes.txt',header=TRUE,na.strings='?') r1 <- read.table('annotation.txt',header=TRUE,na.strings='?') r12 <- merge(r1,r2) r12$Group <- factor(r12$group,labels=c('Lower Paleolithic', 'Levallois technique', 'Middle Paleolithic', 'Homo Sapiens')) r12c <- r12[complete.cases(r12),]LDA
LDA is part of the MASS package. The settings are default. The plot function has been adapted, to retain proper labeling and colors. ld1 <- lda(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD, data=r12c) ld1 Call: lda(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD, data = r12c) Prior probabilities of groups: Lower Paleolithic Levallois technique Middle Paleolithic Homo Sapiens 0.15492958 0.32394366 0.47887324 0.04225352 Group means: LBI RTI WDI FLA PSF FSF Lower Paleolithic 1.105455 33.82727 2.690000 118.8182 31.881818 8.372727 Levallois technique 1.254348 28.74348 2.768261 121.2609 15.021739 14.900000 Middle Paleolithic 1.195000 25.88235 3.384412 113.9706 9.497059 23.294118 Homo Sapiens 1.453333 22.96333 3.150000 117.0000 7.666667 27.900000 ZDF1 PROZD Lower Paleolithic 15.86364 57.81818 Levallois technique 30.46087 69.17391 Middle Paleolithic 62.31471 84.23529 Homo Sapiens 56.36667 87.00000 Coefficients of linear discriminants: LD1 LD2 LD3 LBI 2.63864596 -7.3053369000 4.96921267 RTI 0.03735899 0.0683327644 0.03955854 WDI 0.64616126 -0.3086473881 0.06507385 FLA -0.04798439 -0.0591106950 -0.05229559 PSF -0.01917272 0.0515112771 0.10178299 FSF 0.02488002 -0.0001579818 0.07080551 ZDF1 0.02896112 0.0396366883 -0.01625858 PROZD 0.03036974 -0.0079845555 0.04335520 Proportion of trace: LD1 LD2 LD3 0.7026 0.2585 0.0389 There are eight objects incorrect classified. Out of 71 complete cases, that is 10% wrong. mydf <- data.frame(ID=r12c$ID, Group=r12c$Group, predict=predict(ld1)$class) mydf[mydf$Group!=mydf$pred,]LDA Plot
This is a small adaptation from the plot.lda function in MASS. At visual examination, it seems that the groups are not completely separated. Especially, I wonder if they are much better separated than last weeks biplot.
cols <- colorRampPalette(c('violet','gold','seagreen'))(4)
myldaplot <- function (x, panel = panel.lda, ..., cex = 0.7, dimen, abbrev = FALSE,
xlab = "LD1", ylab = "LD2",indata,cols)
{
panel.lda <- function(x, y, ...) text(x, y, g,
cex = cex, ...)
if (!is.null(Terms <- x$terms)) {
data <- model.frame(x)
X <- model.matrix(delete.response(Terms), data)
g <- indata$ID
mr <- cols[as.numeric(model.response(data))]
xint <- match("(Intercept)", colnames(X), nomatch = 0L)
if (xint > 0L)
X <- X[, -xint, drop = FALSE]
}
else {
xname <- x$call$x
gname <- x$call[[3L]]
if (!is.null(sub <- x$call$subset)) {
X <- eval.parent(parse(text = paste(deparse(xname,
backtick = TRUE), "[", deparse(sub, backtick = TRUE),
",]")))
g <- eval.parent(parse(text = paste(deparse(gname,
backtick = TRUE), "[", deparse(sub, backtick = TRUE),
"]")))
}
else {
X <- eval.parent(xname)
g <- eval.parent(gname)
}
if (!is.null(nas <- x$call$na.action)) {
df <- data.frame(g = g, X = X)
df <- eval(call(nas, df))
g <- df$g
X <- df$X
}
}
if (abbrev)
levels(g) <- abbreviate(levels(g), abbrev)
means <- colMeans(x$means)
X <- scale(X, center = means, scale = FALSE) %*% x$scaling
if (!missing(dimen) && dimen < ncol(X))
X <- X[, 1L:dimen, drop = FALSE]
if (ncol(X) > 2L) {
pairs(X, panel = panel, ...,col=mr)
}
else if (ncol(X) == 2L) {
eqscplot(X[, 1L:2L], xlab = xlab, ylab = ylab, type = "n",
...)
panel(X[, 1L], X[, 2L], ...)
}
else ldahist(X[, 1L], g, xlab = xlab, ...)
invisible(NULL)
}
myldaplot(ld1,indata=r12c,cols=cols)
Classification tree
Rpart is my favorite classification tree implementation. The only tuning is setting minsplit to 10, the default of 20 seems a bit large for 79 objects and four categories. The printed output is skipped, since we have the plot. The interpretation is pretty simple. First split on ZDF1, to distinguish old from young (less old?). The young can be split on LBI to Middle paleolithic and homo sapiens. The old by PSF to Levoillas and Lower Paleolithic. A final split between middle and lower paleolithic shows that the differences are not clear cut.library(rpart)
rp1 <- rpart(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
data=r12,minsplit=10)
plot(rp1,margin=.1)
text(rp1, use.n = TRUE,cex=.8)
mydf <- data.frame(ID=r12$ID, Group=r12$Group, predict=predict(rp1,type='class')) mydf[mydf$Group!=mydf$predict,]
ID Group predict 6 bie Levallois technique Middle Paleolithic 10 c Middle Paleolithic Levallois technique 11 cl Lower Paleolithic Levallois technique 42 ms Levallois technique Lower Paleolithic 43 n Middle Paleolithic Levallois technique 59 sk Levallois technique Middle Paleolithic 62 sz Levallois technique Middle Paleolithic 65 ta Levallois technique Middle Paleolithic 76 woe Lower Paleolithic Levallois technique 77 wol Levallois technique Middle Paleolithic
Note that cross validation makes for a much worse model assessment; about one third are incorrectly predicted with leave one out.xmat <- xpred.rpart(rp1,xval=79)
colSums(xmat!=do.call(cbind,lapply(1:5,function(x) as.numeric(r12$Group) )))
0.71428571 0.30304576 0.12371791 0.05832118 0.02182179
42 26 23 24 25
Randomforest
Randomforest seems to predict slightly worse than the more simple methods, with OOB error around 21%. As might be expected, Homo Sapiens, with only 3 rows, is particularly difficult to classify. Similar to the classification tree ZDF1 is an important variable, but FLA and PROZD were not important in the classification tree but are in random forest.
library(randomForest)
rf1 <- randomForest(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
data=r12c,
importance=TRUE,
nodesize=10)
rf1
Call: randomForest(formula = Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD, data = r12c, importance = TRUE, nodesize = 10) Type of random forest: classification Number of trees: 500 No. of variables tried at each split: 2 OOB estimate of error rate: 23.94% Confusion matrix: Lower Paleolithic Levallois technique Middle Paleolithic Lower Paleolithic 8 3 0 Levallois technique 3 17 3 Middle Paleolithic 1 4 29 Homo Sapiens 0 0 3 Homo Sapiens class.error Lower Paleolithic 0 0.2727273 Levallois technique 0 0.2608696 Middle Paleolithic 0 0.1470588 Homo Sapiens 0 1.0000000
Wrong predicted:
mydf <- data.frame(ID=r12c$ID, Group=r12c$Group, predict=predict(rf1)) mydf[mydf$Group!=mydf$predict,]
ID Group predict 6 bie Levallois technique Middle Paleolithic 8 bo Levallois technique Middle Paleolithic 9 by Levallois technique Lower Paleolithic 10 c Middle Paleolithic Levallois technique 11 cl Lower Paleolithic Levallois technique 14 e2 Middle Paleolithic Lower Paleolithic 21 g5 Middle Paleolithic Levallois technique 28 gue Levallois technique Lower Paleolithic 40 ml Lower Paleolithic Levallois technique 42 ms Levallois technique Lower Paleolithic 43 n Middle Paleolithic Levallois technique 50 roe Middle Paleolithic Levallois technique 55 sa1 Homo Sapiens Middle Paleolithic 56 sa2 Homo Sapiens Middle Paleolithic 57 sa3 Homo Sapiens Middle Paleolithic 62 sz Levallois technique Middle Paleolithic
importanceplot
varImpPlot(rf1)
To leave a comment for the author, please follow the link and comment on his blog: Wiekvoet.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...