MathGroup Archive: June 2008 [00398]

Re: Estimating slope from noisy data

To: mathgroup at smc.vnet.net
Subject: [mg89615] Re: Estimating slope from noisy data
From: Bill Rowe <readnews at sbcglobal.net>
Date: Sun, 15 Jun 2008 06:12:22 -0400 (EDT)

On 6/14/08 at 5:29 AM, andreas.kohlmajer at gmx.de wrote:

>I have difficulties to estimate the correct slope from noisy data.
>This is the code to generate the noisy data:

>Needs["LinearRegression`"]; slope = 1.0; sigma = 0.5; xrange = 1.0;

>SeedRandom[123]; (* initialize random generator *) rnd = {#, #*slope
>+ RandomReal[NormalDistribution[0, sigma]]} &;

>(* generate 2000 data points *) data = Table[
>rnd[RandomReal[NormalDistribution[0, xrange/3.0]]], {2000}];

Here, I would have done this a bit differently, using a
uniform distribution to get random x values. That is, I would
have done:

signal=RandomReal[1,{2000}];
noise=RandomReal[NormalDistribution[0, sigma],2000];
data=Transpose@{signal,signal+noise};

For testing, I find it advantageous to separate the noise from
what is being fitted. That way I can easily do a fit without the
noise and verify my code.
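
As a rough sketch of that check (sigma = 0.5 is taken from your
code; the names clean and noisy are mine, purely for illustration):

```mathematica
sigma = 0.5;
signal = RandomReal[1, {2000}];
noise = RandomReal[NormalDistribution[0, sigma], 2000];

(* same x values, with and without the added noise *)
clean = Transpose@{signal, signal};
noisy = Transpose@{signal, signal + noise};

(* the noise-free fit should give a very close to 1 and b very close to 0 *)
FindFit[clean, a x + b, {a, b}, x]

(* the noisy fit shows how far the noise moves the estimate *)
FindFit[noisy, a x + b, {a, b}, x]
```

If the first fit does not come back as essentially a->1, b->0, the
problem is in the fitting code rather than in the noise.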

However, I don't think this addresses the issue.

>subset = Take[data, 8]; ListPlot[subset, PlotRange -> {{-3, 3}, {-3,
>3}},
>PlotStyle -> PointSize[.025]]
>fit = Regress[subset, x, x, IncludeConstant -> False,
>RegressionReport -> {SummaryReport, ParameterCITable}]

Why fit so few points after generating 2000 points? And why
aren't you using {1,x} and including the constant?

When you use only x and don't include the constant you are not
getting the best estimate of the slope. Forcing the fit through
zero is definitely not the same problem.

Using your data above, for the full data set including the
constant I get

FindFit[data, a x + b, {a, b}, x]

{a->1.03022,b->-0.015233}

and for the reduced data set with constant I get

FindFit[data[[;; 8]], a x + b, {a, b}, x]

{a->1.93674,b->-0.166389}

So, I conclude the high slope value is due to the large noise
value and the few samples being considered.

There is also an additional factor that contributes to the
result obtained. You chose a normal distribution with sigma =
xrange/3, apparently intending the x values to fall between -1
and 1 with high probability. But that choice means a small
sample will generally span a much smaller range. That in turn
significantly increases the uncertainty in the estimated slope
for a given amount of noise.
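
To put a rough number on that, the textbook standard error of the
slope in a straight-line fit with constant is
sigma/Sqrt[Total[(x - Mean[x])^2]]. The sketch below regenerates
your data and evaluates that expression; slopeSE is a name I made
up for illustration, and it assumes the noise sigma is known
rather than estimated from the residuals:

```mathematica
(* data generation copied from the original post *)
slope = 1.0; sigma = 0.5; xrange = 1.0;
SeedRandom[123];
rnd = {#, #*slope + RandomReal[NormalDistribution[0, sigma]]} &;
data = Table[rnd[RandomReal[NormalDistribution[0, xrange/3.0]]], {2000}];

(* hypothetical helper: standard error of the fitted slope,
   using the known noise sigma and the spread of the x values *)
slopeSE[pts_] := With[{xs = pts[[All, 1]]},
   sigma/Sqrt[Total[(xs - Mean[xs])^2]]];

slopeSE[data[[;; 8]]]   (* first 8 points only *)
slopeSE[data]           (* all 2000 points *)
```

The denominator grows with both the number of points and the
spread of the x values, so a small sample over a narrow range is
penalized twice.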

In particular note for the first 8 samples of your data set:

{Min@#, Max@#} &[data[[;; 8, 1]]]
{-0.560437, 0.473236}

and

Subtract @@@ data[[;; 8]]

{-0.166174,0.88855,0.851499,-0.198653,-0.763068,0.0476301,0.236313,0.51059}

That is, the magnitude of the error (noise) you are adding is
comparable to the full range of x in the sample (about 1.03) at
several points. So, it should not be surprising that such a
small sample gets a high slope value.

>The correct slope is exactly 1. As the data is quite noisy, the CI
>of the slope is very big. The estimated slope is far too big (1.947).
>If I use more data points, the estimation gets better; I could also
>use a wider x-range, to get a better estimate for the slope.
>However, I'm quite limited in the x-range, so using a wider x-range
>is no option for me.

>I could check the RSquared for significance (If[Abs[r*Sqrt[n - 2]/
>Sqrt[1 - r^2]] >=
>Quantile[StudentTDistribution[n - 2], 1 - 0.05], r, 0] (*
>significance of 95% *)). In this case, it is significant.

>Is there any other way to get a good estimate for the slope, without
>using too many data points?

The number of data points needed for a good estimate depends on
a combination of the amount of noise in the data, the range of
the data and the true slope. By adjusting these parameters you
can make the estimate of the true slope anything from poor to
good for any given number of data points.
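
One way to see this for your own parameters is a small simulation.
The following is only a sketch: slopeEstimate is a hypothetical
helper, the sample sizes and the 200 repetitions are arbitrary
choices of mine, and the slope is computed from the closed-form
least-squares result Covariance[x, y]/Variance[x] for a line with
constant, rather than by calling FindFit:

```mathematica
sigma = 0.5;       (* noise level from the original post *)
xsd = 1.0/3.0;     (* standard deviation of x, i.e. xrange/3 *)

(* hypothetical helper: least-squares slope from n simulated points
   with true slope 1 *)
slopeEstimate[n_] := Module[{xs, ys},
   xs = RandomReal[NormalDistribution[0, xsd], n];
   ys = xs + RandomReal[NormalDistribution[0, sigma], n];
   Covariance[xs, ys]/Variance[xs]];

(* spread of the slope estimate over 200 trials for several sample sizes *)
Table[{n, StandardDeviation[Table[slopeEstimate[n], {200}]]},
  {n, {8, 50, 500}}]
```

Changing sigma or xsd and rerunning shows how the scatter of the
estimate trades off against the number of points.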
