Generalized Additive Models and Mixed-Effects in Agriculture

R-bloggers 2017-07-15


Introduction

In the previous post I explored the use of linear models in the forms most commonly used in agricultural research. Clearly, when we are talking about linear models we are implicitly assuming that all relations between the dependent variable y and the predictors x are linear. In fact, in a linear model we can specify different shapes for the relation between y and x, for example by including polynomials (see for example: https://datascienceplus.com/fitting-polynomial-regression-r/). However, we can do that only in cases where we can clearly see a particular shape of the relation, for example quadratic. The problem is that in many cases we can see from a scatterplot that we have a non-linear distribution of the points, but it is difficult to understand its form. Moreover, in a linear model the interpretation of polynomial coefficients becomes more difficult, and this may decrease their usefulness. An alternative approach is provided by Generalized Additive Models (GAMs), which allow us to fit models with non-linear smoothers without specifying a particular shape a priori.

I will not go into much detail about the theory behind GAMs. You can refer to these two books (freely available online) to know more:

Wood, S.N., 2017. Generalized Additive Models: An Introduction with R. CRC Press. http://reseau-mexico.fr/sites/reseau-mexico.fr/files/igam.pdf

Crawley, M.J., 2012. The R Book. John Wiley & Sons. https://www.cs.upc.edu/~robert/teaching/estadistica/TheRBook.pdf

Some Background

As mentioned above, GAMs are more powerful than the other linear models we have seen in previous posts, since they allow us to include non-linear smoothers into the mix. In mathematical terms, a GAM solves the following equation:

g(μ) = β0 + β1·x1 + … + βp·xp + f1(z1) + … + fm(zm)

It may seem like a complex equation, but it is actually pretty simple to understand. The first thing to notice is that with a GAM we are not necessarily estimating the response directly, i.e. we are not modelling y.
In fact, as with GLMs, we have the possibility to use link functions to model non-normal response variables (and thus perform Poisson or logistic regression). Therefore, the term g(μ) is simply the transformation of y needed to "linearize" the model. When we are dealing with a normally distributed response, this term is simply replaced by y.

Now we can explore the second part of the equation, where we have two kinds of terms: the parametric part and the non-parametric part. In a GAM we can include all the parametric terms we can include in lm or glm, for example linear or polynomial terms. The second part is the non-parametric smoother that will be fitted automatically, and it is the key point of GAMs.

To better understand the difference between the two parts of the equation we can explore an example. Let's say we have a normally distributed response variable y and two predictors, x1 and x2. We look at the data and observe a clear linear relation between x1 and y, but a complex curvilinear pattern between x2 and y. Because of this we decide to fit a generalized additive model, which in this particular case will take the following form:

y = β0 + β1·x1 + f(x2)

Since y is normal we do not need the link function g(). We model x1 as a linear term with intercept β0 and coefficient β1. However, since we observed a curvilinear relation between x2 and y, we also include a non-parametric smoothing function f to model x2.

Practical Example

In this tutorial we will work once again with the package agridat, so that we can work directly with real agricultural data. Other packages we will use are ggplot2, moments, pscl and MuMIn:

library(agridat)
library(ggplot2)
library(moments)
library(pscl)
library(MuMIn)

In R there are two packages to fit generalized additive models; here I will talk about the package mgcv.
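Before moving to the real data, a minimal sketch (not from the original post) of how the link function g(μ) enters a GAM in mgcv. The data here are simulated Poisson counts, an assumption chosen purely for illustration; with family = poisson, gam() models log(μ) = β0 + f(x):

```r
library(mgcv)

# Simulated count data: the true mean is curvilinear in x on the log scale
set.seed(1)
n  <- 150
x  <- runif(n, 0, 3)
mu <- exp(0.5 + 0.8 * sin(x))   # log link: g(mu) = log(mu)
y  <- rpois(n, mu)

# family = poisson uses the log link by default,
# so the smoother f(x) acts on log(mu), not on y itself
pois.gam <- gam(y ~ s(x), family = poisson)
pois.gam$family$link   # "log"
```

With a normally distributed response we would simply omit the family argument (gaussian with identity link is the default), and the smoother would act on y directly.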
For an overview of GAMs from the package gam you can refer to this post: https://datascienceplus.com/generalized-additive-models/

The first thing we need to do is install and load the package mgcv:

install.packages("mgcv")
library(mgcv)

Now we can load once again the dataset lasrosas.corn, with measures of yield based on nitrogen treatments, plus topographic position and brightness value (for more info please take a look at my previous post: Linear Models (lm, ANOVA and ANCOVA) in Agriculture). Then we can use the function pairs to plot all variables in scatterplots, colored by topographic position:

dat = lasrosas.corn
attach(dat)
pairs(dat[,4:9], lower.panel = NULL, col=topo)

This produces the following image:

In the previous post we only fitted linear models to these data, and therefore the relations between yield and all other predictors were always modelled as lines. However, if we look at the scatterplot between yield and bv, we can clearly see a pattern that does not really look linear, with some blue dots that deviate from the main cloud. If these blue dots were not present we would be happy to model this relation as linear. In fact, we can prove that by focusing only on this plot and removing the level W from topo:

par(mfrow=c(1,2))
plot(yield ~ bv, pch=20, data=dat, xlim=c(100,220))
plot(yield ~ bv, pch=20, data=dat[dat$topo!="W",], xlim=c(100,220))
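The source text is truncated at this point, before the GAM itself is fitted to the corn data. As a sketch of the modelling step the example above builds toward, here is the two-predictor case from the Background section on simulated data (the variable names x1, x2 and the sine-shaped curvilinear effect are assumptions for illustration, not taken from the post):

```r
library(mgcv)

# Simulated data matching the earlier example: y is normal,
# linear in x1 but curvilinear in x2
set.seed(42)
n  <- 200
x1 <- runif(n)
x2 <- runif(n, 0, 2 * pi)
y  <- 2 + 3 * x1 + sin(x2) + rnorm(n, sd = 0.3)

# A purely linear model misses the curvilinear term...
mod.lm  <- lm(y ~ x1 + x2)

# ...while the GAM keeps x1 parametric and lets s() fit
# a smoother for x2 without specifying its shape a priori
mod.gam <- gam(y ~ x1 + s(x2))

AIC(mod.lm, mod.gam)      # the GAM should have a much lower AIC here
summary(mod.gam)          # parametric and smooth terms reported separately
```

The summary output shows the two parts of the GAM equation side by side: a parametric coefficient table for the intercept and x1, and a separate table with the estimated degrees of freedom of the smooth term s(x2).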

Link:

http://feedproxy.google.com/~r/RBloggers/~3/S33IWsso420/


Authors:

Fabio Veronesi

Date published:

07/15/2017, 08:06