Re: linear regression with errors in both variables
- To: mathgroup at smc.vnet.net
- Subject: [mg96538] Re: linear regression with errors in both variables
- From: Mark Fisher <particlefilter at gmail.com>
- Date: Mon, 16 Feb 2009 06:54:56 -0500 (EST)
- References: <gmrmga$a1k$1@smc.vnet.net>
On Feb 10, 5:56 am, Joerg <scha... at biologie.hu-berlin.de> wrote:

> Hi,
>
> I want to test the hypothesis that my data
> follows a known simple linear relationship,
> y = a + bx. However, I have (known) measurement
> errors in both the y and the x values.
>
> I suppose just a simple linear regression
> does not do here.
>
> Any suggestions on how to test this correctly?
>
> Thanks,
>
> joerg

Hi all,

The classic "errors-in-variables" problem in econometrics has the
following structure:

    (1) yt = a + b*zt + et

The parameters of interest are (a, b). The problem is that we don't
observe zt. Instead we observe xt, which equals zt plus a measurement
error:

    (2) xt = zt + ut

In order to proceed, we need to model the unobserved zt. Let

    (3) zt = m + vt

Assume (et, ut, vt) are iid N(0, V), where V = diag(se^2, su^2, sv^2).
Integrating zt out of (1)-(3) using

    Integrate[
      PDF[NormalDistribution[a + b*zt, se], yt] *
      PDF[NormalDistribution[zt, su], xt] *
      PDF[NormalDistribution[m, sv], zt],
      {zt, -Infinity, Infinity},
      GenerateConditions -> False]

produces (after rearrangement)

    (4) PDF[NormalDistribution[a1 + b1*xt, se1], yt] *
        PDF[NormalDistribution[m, su1], xt]

which can be expressed as

    (5) yt = a1 + b1*xt + e1t
    (6) xt = m + u1t

where

    (7)  a1 = a + b*m*su^2/(su^2 + sv^2)
    (8)  b1 = b*sv^2/(su^2 + sv^2)
    (9)  se1^2 = se^2 + b^2*su^2*sv^2/(su^2 + sv^2)
    (10) su1^2 = su^2 + sv^2

The likelihood for the parameters, theta = (a, b, m, se, su, sv), is
given by

    likelihood = Product[fun[data[[t]]], {t, T}]

where data is a list of the T observed {xt, yt} pairs and
fun[{xt, yt}] is the density in (4) after making the substitutions
(7)-(10).

A Bayesian approach specifies a prior distribution for theta,
p(theta), and then uses an MCMC algorithm (such as the random-walk
Metropolis algorithm) to draw from the posterior distribution,
p(theta | data). [The posterior distribution is proportional to
likelihood * p(theta).] A rough sketch of both steps is in the P.S.
below.

Note that (xt, yt) has a bivariate normal distribution, which is
characterized by 5 parameters. However, there are 6 parameters in
theta; this suggests a lack of identification. Fortunately, this
presents no particular problem for the Bayesian approach, even with a
flat prior for theta.

--Mark
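
P.S. For concreteness, here is a minimal sketch of the likelihood in
Mathematica. The names fun and loglike are just my labels, and I've
added theta as an explicit second argument to fun; the formulas are
(4) and (7)-(10) above. Working on the log scale avoids the underflow
you'd get from multiplying many small densities.

    (* marginal density of one observation {xt, yt}, per (4) *)
    fun[{xt_, yt_}, {a_, b_, m_, se_, su_, sv_}] :=
      Module[{a1, b1, se1, su1},
        a1 = a + b*m*su^2/(su^2 + sv^2);                 (* eq. (7) *)
        b1 = b*sv^2/(su^2 + sv^2);                       (* eq. (8) *)
        se1 = Sqrt[se^2 + b^2*su^2*sv^2/(su^2 + sv^2)];  (* eq. (9) *)
        su1 = Sqrt[su^2 + sv^2];                         (* eq. (10) *)
        PDF[NormalDistribution[a1 + b1*xt, se1], yt] *
          PDF[NormalDistribution[m, su1], xt]]

    (* log-likelihood of the whole data set *)
    loglike[theta_, data_] := Total[Log[fun[#, theta] & /@ data]]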
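And a bare-bones random-walk Metropolis sampler under a flat prior
(restricted to positive standard deviations). Again, metropolis is my
own name; the step size scale and chain length n are tuning choices,
not part of the method.

    metropolis[data_, theta0_, scale_, n_] :=
      Module[{theta = theta0, ll, prop, llProp, draws = {}},
        ll = loglike[theta, data];
        Do[
          (* propose a normal step around the current point *)
          prop = theta +
            scale*RandomReal[NormalDistribution[0, 1], Length[theta]];
          (* the flat prior is zero where se, su, or sv <= 0 *)
          llProp = If[Min[prop[[4 ;; 6]]] <= 0, -Infinity,
            loglike[prop, data]];
          (* accept with probability Min[1, Exp[llProp - ll]] *)
          If[Log[RandomReal[]] < llProp - ll,
            theta = prop; ll = llProp];
          AppendTo[draws, theta],
          {n}];
        draws]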
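To try it out, one can simulate data from (1)-(3) under made-up
parameter values (starting the chain at the truth just to keep the
example short):

    T = 200;
    truth = {1., 2., 0., .5, .3, 1.};  (* {a, b, m, se, su, sv} *)
    z = RandomReal[NormalDistribution[truth[[3]], truth[[6]]], T];
    x = z + RandomReal[NormalDistribution[0, truth[[5]]], T];
    y = truth[[1]] + truth[[2]]*z +
        RandomReal[NormalDistribution[0, truth[[4]]], T];
    draws = metropolis[Transpose[{x, y}], truth, .05, 5000];
    Mean[draws]  (* posterior means, one per parameter *)

Given the identification issue noted above, expect some individual
parameters (e.g., su versus sv) to be much less precisely determined
than the identified combinations (7)-(10).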