Ordinal Data

Wiekvoet, 2013-03-17
I expect to be getting some ordinal data, from 5 or 9 point rating scales, pretty soon, so I am having a look ahead at how to treat them. Often ANOVA is used, even though it is well known not to be ideal from a statistical point of view, so that is the starting point. Next stops are polr (MASS), clm (ordinal) and MCMCoprobit (MCMCpack); vglm (VGAM) is skipped. The viewpoint I am taking is that of somebody who needs to deliver summary results to a project manager or program manager, fully knowing that sales and/or marketing may borrow slides too. Keep It Simple, Stupid. The data used for now are the cheese data (McCullagh & Nelder), but I intend to look for more complex/real data in a future post.

Data

Data are responses to 4 cheeses in 9 response categories; 1 = strong dislike, 9 = excellent taste. To be honest, I had forgotten about these data, and ran into them at https://onlinecourses.science.psu.edu/stat504/node/177. For compactness of the R script the data are imported a bit differently. I also reformatted them to one observation per row. Then a plot, so we can see what the data look like.

cheese <- data.frame(Cheese=rep(LETTERS[1:4],each=9),
    Response=rep(1:9,4),
    N=as.integer(c(0,0,1,7,8,8,19,8,1,
                   6,9,12,11,7,6,1,0,0,
                   1,1,6,8,23,7,5,1,0,
                   0,0,0,1,3,7,14,16,11)))
cheese$FResponse <- ordered(cheese$Response)
# one record (row) for each observation
cheese2 <- cheese[cheese$N>0,]
cheese2 <- lapply(1:nrow(cheese2),function(x)
    do.call(rbind,
        replicate(cheese2$N[x],cheese2[x,c(1:2,4)],
            simplify=FALSE)))
cheese2 <- do.call(rbind,cheese2)
mosaicplot(
    xtabs( ~Cheese + factor(Response,levels=9:1), data=cheese2),
    color=colorRampPalette(c('green','red'))(9),
    main='',
    ylab='Response')
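For what it is worth, the same expansion to one row per observation can be done in one step by repeating row indices; a small sketch (the name cheese2alt is mine), which as far as I can see gives the same data frame:

# repeat each row of cheese N times; rows with N == 0 drop out automatically
cheese2alt <- cheese[rep(seq_len(nrow(cheese)), cheese$N),
    c('Cheese','Response','FResponse')]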

ANOVA

As explained, the first approach is ANOVA. From a statistical point of view not ideal, since it ignores the fact that these are nine ordered categories. However, when explaining what you did, it saves a bit. I would not trust a sales person to understand ANOVA, but the other side of the table is used to it, so they won't ask. Anyway, below a number of variations on the result.

library(multcomp)
Res.aov <- aov( Response ~ Cheese, data=cheese2)
anova(Res.aov)

Analysis of Variance Table

Response: Response
           Df Sum Sq Mean Sq F value    Pr(>F)    
Cheese      3 450.48 150.160  76.296 < 2.2e-16 ***
Residuals 204 401.50   1.968                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

TukeyHSD(Res.aov)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Response ~ Cheese, data = cheese2)

$Cheese
         diff        lwr        upr     p adj
B-A -2.750000 -3.4626832 -2.0373168 0.0000000
C-A -1.384615 -2.0972986 -0.6719321 0.0000063
D-A  1.173077  0.4603937  1.8857602 0.0001793
C-B  1.365385  0.6527014  2.0780679 0.0000087
D-B  3.923077  3.2103937  4.6357602 0.0000000
D-C  2.557692  1.8450091  3.2703755 0.0000000
par(mar=c(5,4,6,2)+.1)
plot(cld(glht(Res.aov,linfct=mcp(Cheese='Tukey'))))
par(mar=c(5,4,4,2)+.1)
mt <- model.tables(Res.aov,type='means',cterms='Cheese')
data.frame(dimnames(mt$tables$Cheese)[1],
    Response=as.numeric(mt$tables$Cheese),
    LSD=cld(glht(Res.aov,linfct=mcp(Cheese='Tukey'))
    )$mcletters$monospacedLetters)

  Cheese Response  LSD
A      A 6.250000   c 
B      B 3.500000 a   
C      C 4.865385  b  
D      D 7.423077    d
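Since top 2 box will come up below, it is worth noting that the observed proportions can be read straight off the raw data, as a sanity check for the models; the tapply line is my own addition:

# observed proportion of scores in categories 8 and 9 per cheese
with(cheese2, tapply(Response >= 8, Cheese, mean))
# from the counts: A 9/52 = 0.173, B 0/52 = 0, C 1/52 = 0.019, D 27/52 = 0.519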

polr

The first statistically correct way to look at the data is polr from the MASS package. Unfortunately I dislike the output a bit. For instance, the base level (cheese A) is left out of some of the output. I am not so sure how to interpret the difference between cheese A and cheese B as -3.35, except for the observation that it is significant and cheese A is better. A thing not obviously documented is the possibility to get proportions for the categories as predictions. So, I can learn that the model puts 17% of the people in the top 2 box for cheese A. I am sure top 2 box is a parameter I won't have to explain to my product manager. No confidence interval for it though.

library(MASS)
Res.polr <- polr( FResponse ~ Cheese, data=cheese2 )
summary(Res.polr)

Call:
polr(formula = FResponse ~ Cheese, data = cheese2)

Coefficients:
         Value Std. Error t value
CheeseB -3.352     0.4287  -7.819
CheeseC -1.710     0.3715  -4.603
CheeseD  1.613     0.3805   4.238

Intercepts:
    Value    Std. Error t value 
1|2  -5.4674   0.5236   -10.4413
2|3  -4.4122   0.4278   -10.3148
3|4  -3.3126   0.3700    -8.9522
4|5  -2.2440   0.3267    -6.8680
5|6  -0.9078   0.2833    -3.2037
6|7   0.0443   0.2646     0.1673
7|8   1.5459   0.3017     5.1244
8|9   3.1058   0.4057     7.6547

Residual Deviance: 711.3479 
AIC: 733.3479 
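One way to make that -3.35 a little more tangible: in the proportional odds model the coefficients are log cumulative odds ratios, so they can be exponentiated. A small sketch, my addition:

# odds ratios relative to cheese A; for cheese B the odds of scoring above
# any fixed category are exp(-3.35), roughly 1/29 of the odds for cheese A
exp(coef(Res.polr))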
confint(Res.polr)
             2.5 %     97.5 %
CheeseB -4.2134700 -2.5304266
CheeseC -2.4508895 -0.9919797
CheeseD  0.8794083  2.3746894
confint(glht(Res.polr,linfct=mcp(Cheese='Tukey')))
	 Simultaneous Confidence Intervals

Multiple Comparisons of Means: Tukey Contrasts

Fit: polr(formula = FResponse ~ Cheese, data = cheese2)

Quantile = 2.5617
95% family-wise confidence level

Linear Hypotheses:
           Estimate lwr     upr    
B - A == 0 -3.3518  -4.4500 -2.2537
C - A == 0 -1.7099  -2.6615 -0.7583
D - A == 0  1.6128   0.6379  2.5876
C - B == 0  1.6419   0.6806  2.6033
D - B == 0  4.9646   3.7434  6.1858
D - C == 0  3.3227   2.2421  4.4033

topred <- data.frame(Cheese=levels(cheese2$Cheese))
predict(Res.polr,topred,type='prob')

             1           2          3          4         5          6
1 0.0042045373 0.007778488 0.02315719 0.06072687 0.1915897 0.22360710
2 0.1075966845 0.149644074 0.25256118 0.24192345 0.1684022 0.04745525
3 0.0228101657 0.040027267 0.10476248 0.20195874 0.3208724 0.16204645
4 0.0008409326 0.001570815 0.00479563 0.01349081 0.0537320 0.09799769
           7           8           9
1 0.31326180 0.132805278 0.042868991
2 0.02500937 0.005841786 0.001566035
3 0.11040482 0.029081216 0.008036485
4 0.31086686 0.333235095 0.183470165
topred$Top2Box.polr <- rowSums(predict(Res.polr,topred,type='prob')[,8:9])
topred

  Cheese Top2Box.polr
1      A  0.175674269
2      B  0.007407821
3      C  0.037117701
4      D  0.516705260
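The missing interval for top 2 box can be approximated by simulating from the asymptotic normal distribution of the polr estimates. A rough sketch under that assumption; mvrnorm comes with MASS, which is already loaded:

set.seed(2013)
# draw parameter vectors (slopes plus cutpoints) from the asymptotic normal
sim <- mvrnorm(10000, mu=c(coef(Res.polr), Res.polr$zeta),
    Sigma=vcov(Res.polr))
eta <- cbind(A=0, B=sim[,'CheeseB'], C=sim[,'CheeseC'], D=sim[,'CheeseD'])
# top 2 box is P(response >= 8) = 1 - P(response <= 7), per simulated draw
t2b <- 1 - plogis(sim[,'7|8'] - eta)
apply(t2b, 2, quantile, probs=c(0.025, 0.975))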

clm

clm is from the ordinal package. As this package is dedicated to ordinal data, it is clearly a bit more advanced than polr. The package has the possibility to use mixed models and multiplicative scale effects. These are things I won't use now, but would like to use or look at once I have panelist data. For now the clm function is enough. I am now able to make an overall statement about product differences (polr did not calculate the equivalent of model Res.clm0), although I might want to use 3 degrees of freedom rather than 1. There is also a confidence interval for the proportion of subjects selecting a category.

library(ordinal)
Res.clm <- clm(FResponse ~Cheese,data=cheese2)
summary(Res.clm)

formula: FResponse ~ Cheese
data:    cheese2
 link  threshold nobs logLik  AIC    niter max.grad cond.H 
 logit flexible  208  -355.67 733.35 6(0)  3.14e-11 8.8e+01
Coefficients:
        Estimate Std. Error z value Pr(>|z|)    
CheeseB  -3.3518     0.4287  -7.819 5.34e-15 ***
CheeseC  -1.7099     0.3715  -4.603 4.16e-06 ***
CheeseD   1.6128     0.3805   4.238 2.25e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
Threshold coefficients:
    Estimate Std. Error z value
1|2 -5.46738    0.52363 -10.441
2|3 -4.41219    0.42776 -10.315
3|4 -3.31262    0.37004  -8.952
4|5 -2.24401    0.32674  -6.868
5|6 -0.90776    0.28335  -3.204
6|7  0.04425    0.26457   0.167
7|8  1.54592    0.30168   5.124
8|9  3.10577    0.40573   7.655
confint(Res.clm)
             2.5 %     97.5 %
CheeseB -4.2134421 -2.5304102
CheeseC -2.4508720 -0.9919723
CheeseD  0.8793992  2.3746663
Res.clm0 <- clm(FResponse ~ .,data=cheese2)
anova(Res.clm0,Res.clm)

Likelihood ratio tests of cumulative link models:

         formula:           link: threshold:
Res.clm  FResponse ~ Cheese logit flexible 
Res.clm0 FResponse ~ .      logit flexible 

         no.par    AIC      logLik LR.stat df Pr(>Chisq)    
Res.clm      11 733.35 -3.5567e+02                          
Res.clm0     12  24.00 -7.8504e-06  711.35  1  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

predict(Res.clm,topred)

$fit
             1           2           3          4         5          6
1 0.0042045475 0.007778711 0.023157362 0.06072674 0.1915909 0.22360309
2 0.1075968958 0.149647575 0.252560377 0.24192109 0.1684021 0.04745442
3 0.0228099720 0.040027964 0.104762084 0.20195678 0.3208734 0.16204462
4 0.0008409256 0.001570844 0.004795616 0.01349065 0.0537319 0.09799495
           7           8           9
1 0.31326168 0.132806881 0.042870069
2 0.02500956 0.005841882 0.001566077
3 0.11040645 0.029081977 0.008036783
4 0.31086254 0.333236858 0.183475714
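A side note: FResponse ~ . also picks up the numeric Response column, which reproduces FResponse exactly; that is, as far as I can tell, where the 1 degree of freedom and the essentially zero log-likelihood come from. The conventional overall test with 3 degrees of freedom would compare against an intercept-only model, roughly:

# intercept-only null model; the comparison gives the overall 3 df test
Res.clm1 <- clm(FResponse ~ 1, data=cheese2)
anova(Res.clm1, Res.clm)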
rowSums(predict(Res.clm,topred)$fit[,8:9])
          1           2           3           4 
0.175676950 0.007407959 0.037118760 0.516712572 
1-predict(Res.clm,topred,type='cum.prob')$cprob2[,8]
          1           2           3           4 
0.175676950 0.007407959 0.037118760 0.516712572 
topred2 <- data.frame(Cheese=levels(cheese2$Cheese),
    FResponse=ordered(8,levels=1:9))
predict(Res.clm,topred2,type='prob',interval=TRUE)

$fit
[1] 0.132806881 0.005841882 0.029081977 0.333236858

$lwr
[1] 0.078727094 0.002502026 0.014310133 0.230268677

$upr
[1] 0.21535183 0.01357929 0.05820190 0.45502986
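Note that this interval is for category 8 alone. One workaround for an interval on top 2 box itself (my own idea, not something from the ordinal documentation) is to merge categories 8 and 9 and refit; the top category of the merged model then is exactly top 2 box, at the price of slightly different estimates:

# collapse categories 8 and 9 into one top category and refit
cheese2$FResponse8 <- ordered(pmin(cheese2$Response,8))
Res.clm8 <- clm(FResponse8 ~ Cheese, data=cheese2)
topred3 <- data.frame(Cheese=levels(cheese2$Cheese),
    FResponse8=ordered(8,levels=1:8))
predict(Res.clm8, topred3, type='prob', interval=TRUE)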

VGAM

I like the VGAM package. It shows great promise for many situations where a vector of observations is measured. This model is but a simple thing for it. On the face of it, however, ordinal seems to have extensions which are more appropriate for my current problem. Also, the help says: 'This package is undergoing continual development and improvement. Until my monograph comes out and this package is released as version 1.0-0 the user should treat everything subject to change'. The data need to be entered a bit differently, and the script is commented out, in case anybody wants to know how it might work.
#library(VGAM)
#nobs <- nrow(cheese2)/4 
#resD <- resC <- resB <- resA <- matrix(0,9,nobs)
#resA[cheese2$Response[cheese2$Cheese=='A']+(0:(nobs-1))*9] <- 1
#resB[cheese2$Response[cheese2$Cheese=='B']+(0:(nobs-1))*9] <- 1
#resC[cheese2$Response[cheese2$Cheese=='C']+(0:(nobs-1))*9] <- 1
#resD[cheese2$Response[cheese2$Cheese=='D']+(0:(nobs-1))*9] <- 1
#cheesevglm <- as.data.frame(t(cbind(resA,resB,resC,resD)))
#cheesevglm$Cheese <- factor(rep(levels(cheese2$Cheese),each=nobs))
#
#Res.vglm <- vglm(cbind(V1,V2,V3,V4,V5,V6,V7,V8,V9) ~ Cheese,cumulative,data=cheesevglm)
#summary(Res.vglm)
#predict(Res.vglm,topred,type='response')

MCMCoprobit

MCMCpack has Bayesian roots. It has many functions; ordinal data is but one of them. It does not rely on JAGS/WinBUGS/OpenBUGS. A difference between MCMCoprobit and the previous functions is the use of probit rather than logit as the link function. I don't think this should stop me from using it. The big disadvantage is that you get raw samples, so additional work is needed to get presentable results. Luckily this allows the calculation of the top 2 box proportions I have been yapping about. A new question is how to interpret the rather wide interval for cheese D.
library(MCMCpack)
library(lattice)
res.MCMCpack <- MCMCoprobit(Response ~ Cheese,data=cheese2)
summary(res.MCMCpack)

Iterations = 1001:11000
Thinning interval = 1 
Number of chains = 1 
Sample size per chain = 10000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

               Mean      SD  Naive SE Time-series SE
(Intercept)  3.2423 0.16115 0.0016115       0.016761
CheeseB     -1.9334 0.22376 0.0022376       0.019444
CheeseC     -0.9782 0.20786 0.0020786       0.008635
CheeseD      0.9382 0.20754 0.0020754       0.005334
gamma2       0.6401 0.13234 0.0013234       0.064949
gamma3       1.3265 0.14265 0.0014265       0.072361
gamma4       1.9697 0.10030 0.0010030       0.036723
gamma5       2.7635 0.09922 0.0009922       0.035434
gamma6       3.3180 0.11913 0.0011913       0.051066
gamma7       4.1606 0.12626 0.0012626       0.057679
gamma8       4.9981 0.11750 0.0011750       0.049354

2. Quantiles for each variable:

               2.5%     25%     50%     75%   97.5%
(Intercept)  2.9288  3.1358  3.2430  3.3497  3.5576
CheeseB     -2.3606 -2.0869 -1.9372 -1.7850 -1.4926
CheeseC     -1.3895 -1.1200 -0.9763 -0.8394 -0.5730
CheeseD      0.5378  0.7979  0.9380  1.0760  1.3434
gamma2       0.4428  0.5227  0.6221  0.7498  0.8878
gamma3       1.1034  1.2156  1.3013  1.4400  1.6292
gamma4       1.7679  1.9017  1.9779  2.0465  2.1424
gamma5       2.5821  2.6867  2.7606  2.8439  2.9328
gamma6       3.1207  3.2393  3.2912  3.3880  3.5808
gamma7       3.8694  4.0840  4.1776  4.2596  4.3487
gamma8       4.8306  4.9177  4.9788  5.0353  5.2884

# top 2 box is P(latent response above the 7|8 cutpoint gamma7); the latent
# mean for a cheese is the intercept plus its coefficient
top2box <- data.frame(probability = c(
        pnorm(res.MCMCpack[,'gamma7']-(res.MCMCpack[,1]),lower.tail=FALSE),
        pnorm(res.MCMCpack[,'gamma7']-(res.MCMCpack[,1]+res.MCMCpack[,'CheeseB']),lower.tail=FALSE),
        pnorm(res.MCMCpack[,'gamma7']-(res.MCMCpack[,1]+res.MCMCpack[,'CheeseC']),lower.tail=FALSE),
        pnorm(res.MCMCpack[,'gamma7']-(res.MCMCpack[,1]+res.MCMCpack[,'CheeseD']),lower.tail=FALSE)),
    Cheese=rep(levels(cheese2$Cheese),each=nrow(res.MCMCpack)))
densityplot(~ probability | Cheese,data=top2box,xlab='Proportion Top 2 Box',
    panel= function(x, ...) {
        panel.densityplot(x,plot.points='rug' , ...)
    })
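For slide-ready numbers next to the density plots, the draws can be condensed into a posterior median and 95% interval per cheese; a minimal sketch, with the quantile choice being mine:

# posterior median and 95% interval of the top 2 box proportion per cheese
aggregate(probability ~ Cheese, data=top2box,
    FUN=function(x) quantile(x, probs=c(0.025, 0.5, 0.975)))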
