Re: linear regression with errors in both variables
- To: mathgroup at smc.vnet.net
- Subject: [mg96538] Re: linear regression with errors in both variables
- From: Mark Fisher <particlefilter at gmail.com>
- Date: Mon, 16 Feb 2009 06:54:56 -0500 (EST)
- References: <gmrmga$a1k$1@smc.vnet.net>
On Feb 10, 5:56 am, Joerg <scha... at biologie.hu-berlin.de> wrote:
> Hi,
>
> I want to test the hypothesis that my data
> follows a known simple linear relationship,
> y = a + bx. However, I have (known) measurement
> errors in both the y and the x values.
>
> I suppose just a simple linear regression
> will not do here.
>
> Any suggestions on how to test this correctly?
>
> Thanks,
>
> joerg
Hi all,
The classic "errors-in-variables" problem in econometrics has the
following structure:
(1) yt = a + b*zt + et
The parameters of interest are (a, b). The problem is that we don't
observe zt. Instead, we observe xt, which equals zt plus a measurement
error:
(2) xt = zt + ut
In order to proceed, we need to model the unobserved zt. Let
(3) zt = m + vt
Assume (et, ut, vt) are iid N(0, V), where V = diag(se^2, su^2, sv^2).
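As a concrete illustration (a sketch of my own; the function name and
the parameter values are arbitrary), data can be simulated from (1-3)
in Mathematica as follows:

simulate[{a_, b_, m_, se_, su_, sv_}, T_] :=
 Module[{zt, ut, et},
  zt = RandomReal[NormalDistribution[m, sv], T]; (* unobserved regressor, eq. (3) *)
  ut = RandomReal[NormalDistribution[0, su], T]; (* measurement error, eq. (2) *)
  et = RandomReal[NormalDistribution[0, se], T]; (* regression error, eq. (1) *)
  Transpose[{zt + ut, a + b*zt + et}]]           (* observed {xt, yt} pairs *)

data = simulate[{1, 2, 0, .5, .3, 1}, 200];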
Integrating zt out of (1-3) using
Integrate[
  PDF[NormalDistribution[a + b*zt, se], yt] * (* density of yt given zt, eq. (1) *)
  PDF[NormalDistribution[zt, su], xt] *       (* density of xt given zt, eq. (2) *)
  PDF[NormalDistribution[m, sv], zt],         (* density of zt, eq. (3) *)
  {zt, -Infinity, Infinity},
  GenerateConditions -> False]
produces (after rearrangement)
(4) PDF[NormalDistribution[a1 + b1*xt, se1], yt] *
    PDF[NormalDistribution[m, su1], xt]
which can be expressed as
(5) yt = a1 + b1*xt + e1t
(6) xt = m + u1t
where
(7) a1 = a + b*m*su^2/(su^2+sv^2)
(8) b1 = b*sv^2/(su^2+sv^2)
(9) se1^2 = se^2 + b^2*su^2*sv^2/(su^2+sv^2)
(10) su1^2 = su^2+sv^2
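As a sanity check (my own addition, with arbitrary parameter and data
values), the closed form (4) with the substitutions (7-10) can be
compared numerically against the direct integral:

With[{a = 1, b = 2, m = 0, se = .5, su = .3, sv = 1, xt = .7, yt = 2.1},
 {NIntegrate[
   PDF[NormalDistribution[a + b*zt, se], yt]*
    PDF[NormalDistribution[zt, su], xt]*
    PDF[NormalDistribution[m, sv], zt],
   {zt, -Infinity, Infinity}],
  PDF[NormalDistribution[
     a + b*m*su^2/(su^2 + sv^2) + b*sv^2/(su^2 + sv^2)*xt,
     Sqrt[se^2 + b^2*su^2*sv^2/(su^2 + sv^2)]], yt]*
   PDF[NormalDistribution[m, Sqrt[su^2 + sv^2]], xt]}]

The two entries of the result should agree.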
The likelihood for the parameters, theta = (a, b, m, se, su, sv), is
given by
likelihood = Product[fun[data[[t]]], {t, T}]
where data is a list of the observed {xt, yt} pairs and fun[{xt, yt}]
is given by (4) using the substitutions (7-10).
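In Mathematica this might be written as follows (a sketch; fun and
logLikelihood are my names, and I work with the log likelihood for
numerical stability):

fun[{xt_, yt_}, {a_, b_, m_, se_, su_, sv_}] :=
 Module[{a1, b1, se1, su1},
  a1 = a + b*m*su^2/(su^2 + sv^2);                (* eq. (7) *)
  b1 = b*sv^2/(su^2 + sv^2);                      (* eq. (8) *)
  se1 = Sqrt[se^2 + b^2*su^2*sv^2/(su^2 + sv^2)]; (* eq. (9) *)
  su1 = Sqrt[su^2 + sv^2];                        (* eq. (10) *)
  PDF[NormalDistribution[a1 + b1*xt, se1], yt]*   (* eq. (4) *)
   PDF[NormalDistribution[m, su1], xt]]

logLikelihood[theta_, data_] := Total[Log[fun[#, theta] & /@ data]]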
A Bayesian approach specifies a prior distribution for theta, p(theta),
and then uses an MCMC algorithm (such as the random-walk Metropolis
algorithm) to draw from the posterior distribution, p(theta | data).
[The posterior distribution is proportional to likelihood * p(theta).]
Note that (xt, yt) has a bivariate normal distribution, which is
characterized by 5 parameters (two means, two variances, and a
covariance). However, there are 6 parameters in theta; this suggests a
lack of identification. Fortunately, this presents no particular
problem for the Bayesian approach, even with a flat prior for theta.
--Mark