Re: Estimating slope from noisy data

*To*: mathgroup at smc.vnet.net*Subject*: [mg89615] Re: Estimating slope from noisy data*From*: Bill Rowe <readnews at sbcglobal.net>*Date*: Sun, 15 Jun 2008 06:12:22 -0400 (EDT)

On 6/14/08 at 5:29 AM, andreas.kohlmajer at gmx.de wrote: >I have difficulties to estimate the correct slope from noisy data. >This is the code to generate the noisy data: >Needs["LinearRegression`"]; slope = 1.0; sigma = 0.5; xrange = 1.0; >SeedRandom[123]; (* initialize random generator *) rnd = {#, #*slope >+ RandomReal[NormalDistribution[0, sigma]]} &; >(* generate 2000 data points *) data = Table[ >rnd[RandomReal[NormalDistribution[0, xrange/3.0]]], {2000}]; Here, I would have done this a bit differently by using an uniform distribution to get random x values. That is I would have done: signal=RandomReal[1,{2000}]; noise=RandomReal[NormalDistribution[0, sigma]; data=Transpose@{signal,signal+noise}; For testing, I find it advantageous to separate the noise from what is being fitted. That way I can easily do a fit without the noise and verify my code. However, I don't think this addresses the issue. >subset = Take[data, 8]; ListPlot[subset, PlotRange -> {{-3, 3}, {-3, >3}}, >PlotStyle -> PointSize[.025]] >fit = Regress[subset, x, x, IncludeConstant -> False, >RegressionReport -> {SummaryReport, ParameterCITable}] Why fit such a few points after generating 2000 points? And why aren't you using {1,x} and including the constant? When you use only x and don't include the constant you are not getting the best estimate of the slope. Forcing the fit through zero is definitely not the same problem. Using your data above, for the full data set including the constant I get FindFit[data, a x + b, {a, b}, x] {a->1.03022,b->-0.015233} and for the reduced data set with constant I get FindFit[data[[;; 8]], a x + b, {a, b}, x] {a->1.93674,b->-0.166389} So, I conclude the high slope value is due to the large noise value and the few samples being considered. There is also an additional factor that contributes to the result obtained. You chose a normal distribution and sigma apparently intended to have a high probability of x values between -1 and 1. But that choice means a small sample will generally have a much smaller range. That in turn will significantly increase the uncertainty in the estimated slope for a given amount of noise. In particular note for the first 8 samples of your data set: {Min@#, Max@#} &[data[[;; 8, 1]]] {-0.560437, 0.473236} and Subtract @@@ data[[;; 8]] {-0.166174,0.88855,0.851499,-0.198653,-0.763068,0.0476301,0.236313,0.51059} That is, the amount of error (noise) you are adding exceeds the range of the data sample at several points. So, it should not be surprising such a small sample gets a high slope value. >The correct slope is exactly 1. As the data is quite noisy, the CI >of the slope is very big. The estimated slope is far to big (1.947). >If I use more data points, the estimation gets better; I could also >use a wider x-range, to get a better estimate for the slope. >However, I'm quite limited in the x-range, so using a wider x-range >is no option for me. >I could check the RSquared for significance (If[Abs[r*Sqrt[n - 2]/ >Sqrt[1 - r^2]] >= >Quantile[StudentTDistribution[n - 2], 1 - 0.05], r, 0] (* >significance of 95% *)). I this case, it is significant. >Is there any other way to get a good estimate for the slope, without >using too many data points? The number of data points needed for a good estimate depends on a combination of the amount of noise in the data, range of the data and the true slope. You can adjust these parameters so the estimate of the true slope will be poor to good for any given number of data points.