MathGroup Archive: August 2012 [00144]


Re: Trying to quickly split lists at the point of maximum variance reduction

To: mathgroup at smc.vnet.net
Subject: [mg127623] Re: Trying to quickly split lists at the point of maximum variance reduction
From: Ray Koopman <koopman at sfu.ca>
Date: Wed, 8 Aug 2012 21:34:00 -0400 (EDT)
References: <jvn3om$am6$1@smc.vnet.net>

On Aug 5, 5:39 pm, Earl Mitchell <earl.j.mitch... at gmail.com> wrote:
> Hi all,
>
> I've got a list of 42,000 lists of 233 variables (integers from 1 to 250),
> each with a response in the last position (integer from 0 to 9).  I want to
> find the variable and split point in that variable at which the weighted
> average variance in the response will be minimized.
>
> As a trivial example say the original variance in the response for the
> whole set is 100.  Imagine that there is some variable, say at position
> 220, which has Tally'd values of {{0,40000},{1,2000}} and, if we split the
> corresponding responses along this break point the weighted average
> variance of the two lists is 80, which happens to be the lowest possible
> resulting variance for any single split, on any single variable.
>
> How can I find this point quickly?  This description might be confusing -
> let me know if something is not clear.  Thanks ahead of time for the help!
>
> Mitch
>
> PS.  Currently this job can be done with this code:
>
> FindMaxVarianceReductionSplit[data_List] :=
>  Module[{transdata, splitvarreductionpairs, withoutputs, testsplits, split,
>    endvar, maxvarreduction},
>   transdata = Transpose[data];
>   splitvarreductionpairs = With[{outputs = transdata[[-1]]},
>     ParallelTable[
>      With[{inputs = transdata[[i]], startvar = N@NewVariance[outputs]},
>       withoutputs = Thread[{inputs, outputs}];
>       testsplits = Union[inputs];
>       Table[
>        With[{splitval = testsplits[[j]]},
>         split = GatherBy[withoutputs, #[[1]] > splitval &];
>         endvar =
>          Total[(N@Length[#]*NewVariance[#[[All, -1]]] & /@ split)/
>            Length[withoutputs]];
>         {splitval, startvar - endvar}
>         ]
>        ,
>        {j, Length[testsplits]}]], {i, Length[Most[transdata]]}]
>     ];
>
>   maxvarreduction = Max[Flatten[splitvarreductionpairs, 1][[All, -1]]];
>   Position[splitvarreductionpairs, maxvarreduction]
>
>   ]
>
> ... on my brand spanking new MBP it completes in just under 1,000 seconds,
> even when parallelized.  I need this to run much faster to have any
> practical applications.
>
> Thanks again!

I find your description confusing. Here's how I read it:

You have a table whose dimensions are 42000 x 233, and a corresponding
vector of 42000 responses. All the values in the table are integers in
[1, 250]. All the responses are integers in [0,9]. You want to choose
a column, say k, and a value, say x, such that if you split the
responses into two groups according to whether the k'th element in the
corresponding row of the table is <= x or > x, you minimize the sum of
squared deviations of the responses from their respective group
means.

Please confirm or correct that interpretation.
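
If that reading is right, here is a rough sketch of one way the search
could be done quickly. It is only a sketch under assumptions: the names
bestSplit and bestColumnSplit are mine, not from your code, and I assume
your NewVariance divides by n rather than n-1, so that minimizing the
weighted average variance is the same as minimizing the within-group sum
of squares. The idea is to sort each column once and take running sums of
the responses, so that every candidate split costs O(1). For a split with
left count nL and left sum sL, the within-group sum of squares is
Total[y^2] - (sL^2/nL + sR^2/nR), so it suffices to maximize the term in
parentheses.

```mathematica
(* Best split of responses y on one column x.
   Returns {x0, ss}: split as x <= x0 vs x > x0, ss = within-group SS. *)
bestSplit[x_List, y_List] :=
 Module[{ord, xs, cs, len, tot, cut, sL, gain, best},
  ord = Ordering[x];            (* sort rows by the predictor *)
  xs = x[[ord]];
  cs = Accumulate[y[[ord]]];    (* running sums of the sorted responses *)
  len = Length[x];
  tot = Last[cs];
  (* candidate cuts: positions where the sorted predictor changes value *)
  cut = Pick[Range[len - 1], Unitize[Differences[xs]], 1];
  If[cut === {},
   (* constant column: no split possible, report the total SS *)
   {Missing["NoSplit"], N[y.y - tot^2/len]},
   sL = cs[[cut]];              (* left-group sums; left counts are cut *)
   gain = N[sL^2/cut + (tot - sL)^2/(len - cut)];
   best = First[Ordering[gain, -1]];   (* position of the largest gain *)
   {xs[[cut[[best]]]], y.y - gain[[best]]}]]

(* Scan all predictor columns; the response is the last column.
   Returns {k, x0, ss} for the column k with the smallest within-group SS. *)
bestColumnSplit[data_List] :=
 Module[{cols = Transpose[data], y, res, k},
  y = Last[cols];
  res = bestSplit[#, y] & /@ Most[cols];
  k = First[Ordering[res[[All, 2]], 1]];
  {k, res[[k, 1]], res[[k, 2]]}]
```

As a small check, in

  bestColumnSplit[{{1,1,0},{2,1,0},{1,2,9},{2,2,9},{1,3,9},{2,3,9}}]

every row with second element <= 1 has response 0 and every other row has
response 9, so splitting column 2 at 1 leaves a within-group sum of
squares of zero, and that is the split that should be reported.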
