Bayesian regression example


Generate fake data

series.R
set.seed(411)
 
N <- 40
 
## predictor variables
a <- runif(N)
b <- runif(N)
c <- runif(N)
## response: true intercept 0, true coefficients 1, 4, 1
y <- rnorm(N,mean=a+4*b+c,sd=1)
dat <- data.frame(y,a,b,c)

Do a linear model


lm.Rout
> # Global Environment: series.RData
> 
> # Scripts: lm.R
> 
> summary(lm(y~a+b+c,data=dat))
 
Call:
lm(formula = y ~ a + b + c, data = dat)
 
Residuals:
     Min       1Q   Median       3Q      Max 
-2.11619 -0.46816  0.06305  0.56538  2.05915 
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.3336     0.4456  -0.749   0.4590    
a             0.6909     0.6129   1.127   0.2671    
b             5.0782     0.5332   9.523 2.25e-11 ***
c             1.1771     0.5353   2.199   0.0344 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.9981 on 36 degrees of freedom
Multiple R-squared:  0.7574,	Adjusted R-squared:  0.7371 
F-statistic: 37.46 on 3 and 36 DF,  p-value: 3.645e-11
 

Bayesian analysis

With bayesglm


bayesglm.Rout
> # Global Environment: series.RData
> 
> # Scripts: bayesglm.R
> 
> library("arm")
> summary(bayesglm(y~a+b+c,data=dat))
 
Call:
bayesglm(formula = y ~ a + b + c, data = dat)
 
Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.11859  -0.46488   0.06545   0.57159   2.05432  
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.3209     0.4448  -0.721   0.4753    
a             0.6943     0.6104   1.137   0.2629    
b             5.0555     0.5319   9.505 2.37e-11 ***
c             1.1704     0.5337   2.193   0.0348 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
(Dispersion parameter for gaussian family taken to be 0.9962802)
 
    Null deviance: 147.805  on 39  degrees of freedom
Residual deviance:  35.866  on 36  degrees of freedom
AIC: 119.15
 
Number of Fisher Scoring iterations: 9
 
With its default weakly informative priors, bayesglm gives estimates and standard errors very close to the ordinary lm fit above.

With BUGS/JAGS

We need priors for the four regression parameters and for the error variance. The BUGS/JAGS framework does not let us use improper priors, so we have to pick priors that we think are "wide" enough.

For the error variance we also have to pick a prior that allows only positive values; the gamma prior on the precision in the model below does this.

Finally, BUGS/JAGS thinks in terms of "precision", τ = 1/σ², which is confusing at first, but you can get used to it. It's easiest to start by guessing what you think a "large" standard deviation for the parameter would be, and then taking 1/σ² to get the corresponding precision τ. A plausible rule of thumb is to make the prior standard deviation about 10 times the value you expect for the parameter. For example, if you expected the difference between the masses of two species to be about 100 g, you could use a prior with a standard deviation of 1000 g, i.e. a precision of 1/1000² = 10⁻⁶. The normal priors below have low precision, so they have high variance.
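
As a quick sanity check, the conversion is one line of arithmetic in R (no packages needed; the numbers match the mass example above):

prior.sd  <- 1000           ## a "large" prior standard deviation, ~10x the expected effect
prior.tau <- 1/prior.sd^2   ## the corresponding precision
prior.tau                   ## 1e-06, the value you would give to dnorm() in JAGS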

rjags


bayes.bug
model {
	for (i in 1:N){
		y[i] ~ dnorm(pred[i], tau)                     ## likelihood; tau is the precision
		pred[i] <- ma*a[i] + mb*b[i] + mc*c[i] + int   ## linear predictor
	}
	## wide normal priors (precision 1e-4, i.e. standard deviation 100)
	ma ~ dnorm(0, .0001)
	mb ~ dnorm(0, .0001)
	mc ~ dnorm(0, .0001)
	int ~ dnorm(0, .0001)
	## vague gamma prior on the precision
	tau ~ dgamma(.001, .001)
}
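
The rjags driver script itself is not shown on this page; a minimal sketch of running the model above with rjags might look like this (jags.model, update, and coda.samples are standard rjags/coda functions; the object names are our own):

library('rjags')
jm <- jags.model('bayes.bug',
                 data = list('a' = a, 'b' = b, 'c' = c, 'N' = N, 'y' = y),
                 n.chains = 4)
update(jm, 1000)                  ## burn-in
ss <- coda.samples(jm,
                   variable.names = c("ma","mb","mc","int","tau"),
                   n.iter = 5000)
summary(ss)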

R2jags


bayes_r2jags.R
library('R2jags')
jags1 <- jags(model.file='bayes.bug',
              parameters.to.save=c("ma","mb","mc","int"),
              data = list('a' = a, 'b' = b, 'c' = c, 'N'=N, 'y'=y),
              n.chains = 4,
              inits=NULL)

  • You can use inits=NULL to have JAGS pick the initial values randomly from the priors. For more complex models you might want to pick starting values for the chains yourself (see the ?jags documentation, and the sketch below).
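
For example, a hand-rolled inits function might look like this (a sketch; the parameter names match bayes.bug, but the particular starting distributions are arbitrary choices):

jagsinits <- function() {     ## jags() calls this once per chain
    list(ma = rnorm(1), mb = rnorm(1), mc = rnorm(1),
         int = rnorm(1), tau = runif(1, 0.5, 2))
}
## then: jags1 <- jags(..., inits = jagsinits)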

Examine the chains and output


coda2.R
library('R2jags')
library('coda')
library('emdbook')    ## for lump.mcmc.list() and as.mcmc.bugs()
library('coefplot2')
bb <- jags1$BUGSoutput  ## extract the "BUGS output" component
mm <- as.mcmc.bugs(bb)  ## convert it to an "mcmc" object that coda can handle
plot(jags1)             ## large-format graph
## plot(mm)                ## trace + density plots, same as above
xyplot(mm,layout=c(2,3))  ## prettier trace plot
densityplot(mm,layout=c(2,3)) ## prettier density plot
coefplot2(jags1)              ## estimate + credible interval plot

The lump.mcmc.list function from the emdbook package can be useful for converting a set of MCMC chains into a single long chain.
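
Beyond the graphical checks above, coda's numerical convergence diagnostics are worth running too; a minimal sketch, assuming the mm object from coda2.R:

library('coda')
gelman.diag(mm)     ## potential scale reduction factors; values near 1 are good
effectiveSize(mm)   ## effective number of independent samples per parameter
summary(mm)         ## posterior means, SDs, and quantiles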


Further notes

Categorical variables

Implementing models with categorical variables (say, a t-test or an ANOVA) is a little bit more tedious than the multiple regression analysis shown above. There are two basic strategies:

  1. pass the categorical variable as a vector of numeric codes (i.e. pass as.numeric(f) rather than f to JAGS in your data list), and make your parameters a simple vector of the means corresponding to each level: e.g. you could have a vector of parameters catparam and specify the predicted value as pred[i] = catparam[f[i]] (sketched just below this list)
  2. construct a design matrix using the model.matrix function in R and pass the whole model matrix (X) to JAGS. Then, to get the predicted value for each observation, just add up the relevant columns of the model matrix, e.g. pred[i] = beta[1]*X[i,1]+beta[2]*X[i,2]+beta[3]*X[i,3]; you can also use the built-in inprod (inner product) function: pred[i] = inprod(X[i,],beta) (sketched after the next paragraph)
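
In BUGS syntax, the first approach might look like the following sketch (here f is a hypothetical factor passed to JAGS as numeric codes, and nlev is its number of levels):

model {
	for (i in 1:N) {
		y[i] ~ dnorm(pred[i], tau)
		pred[i] <- catparam[f[i]]    ## the mean for this observation's level
	}
	for (j in 1:nlev) {
		catparam[j] ~ dnorm(0, .0001)
	}
	tau ~ dgamma(.001, .001)
}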

The second approach is a little harder to understand, but it generalizes to more complicated situations and gives you answers that will more closely match the analogous analyses in R (e.g. using lm); a sketch follows.
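
A minimal sketch of the design-matrix approach (the file name catmodel.bug, the factor f, and the data frame dat2 are all hypothetical):

model {
	for (i in 1:N) {
		y[i] ~ dnorm(pred[i], tau)
		pred[i] <- inprod(X[i,], beta)   ## sum over the columns of the model matrix
	}
	for (j in 1:p) {
		beta[j] ~ dnorm(0, .0001)
	}
	tau ~ dgamma(.001, .001)
}

and, on the R side:

library('R2jags')
X <- model.matrix(~f, data = dat2)   ## intercept column plus dummy columns
jags2 <- jags(model.file = 'catmodel.bug',
              parameters.to.save = c("beta"),
              data = list('X' = X, 'y' = dat2$y,
                          'N' = nrow(X), 'p' = ncol(X)),
              n.chains = 4, inits = NULL)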

Built-in Bayesian modeling techniques

  • bayesglm from the arm package
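
The bayesglm fit above used the package's default priors; the priors can also be set explicitly. A sketch (prior.mean, prior.scale, and prior.df are genuine bayesglm arguments; the particular values here are arbitrary):

library('arm')
summary(bayesglm(y ~ a + b + c, data = dat,
                 prior.mean = 0,     ## prior mean for each coefficient
                 prior.scale = 10,   ## wider than the default prior scale
                 prior.df = Inf))    ## Inf gives normal rather than Cauchy tails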


coda3.R
library('coda')
library('emdbook')    ## for lump.mcmc.list()
mmL <- lump.mcmc.list(mm)  ## lump the chains from coda2.R into one long chain
colMeans(mmL)   ## which one is biggest? b ...
## estimate the posterior probability that mb really is the biggest
mean((mmL[,"mb"] > mmL[,"ma"]) & (mmL[,"mb"] > mmL[,"mc"]))
## or (alternatively -- this would be good if you had lots of categories)
meanschain <- mmL[,-(1:2)]               ## drop the deviance and int columns, keeping ma, mb, mc
maxval <- apply(meanschain,1,which.max)  ## index of the largest slope in each posterior sample
mean(maxval==2)                          ## proportion of samples in which mb (column 2) is largest