MathGroup Archive 2012

[Date Index] [Thread Index] [Author Index]

Search the Archive

Trying to quickly split lists at the point of maximum variance reduction

  • To: mathgroup at smc.vnet.net
  • Subject: [mg127593] Trying to quickly split lists at the point of maximum variance reduction
  • From: Earl Mitchell <earl.j.mitchell at gmail.com>
  • Date: Sun, 5 Aug 2012 14:59:54 -0400 (EDT)
  • Delivered-to: l-mathgroup@mail-archive0.wolfram.com
  • Delivered-to: l-mathgroup@wolfram.com
  • Delivered-to: mathgroup-newout@smc.vnet.net
  • Delivered-to: mathgroup-newsend@smc.vnet.net

Hi all,

I've got a list of 42,000 lists of 233 variables (integers from 1 to 250),
each with a response in the last position (integer from 0 to 9).  I want to
find the variable and split point in that variable at which the weighted
average variance in the response will be minimized.

As a trivial example say the original variance in the response for the
whole set is 100.  Imagine that there is some variable, say at position
220, which has Tally'd values of {{0,40,000},{1,2000}} and, if we split the
corresponding responses along this break point the weighted average
variance of the two lists is 80, which happens to be the lowest possible
resulting variance for any single split, on any single variable.

How can I find this point quickly?  This description might be confusing -
let me know if something is not clear.  Thanks ahead of time for the help!

Mitch

PS.  Currently this job can be done in with this code:

FindMaxVarianceReductionSplit[data_List] :=
 Module[{transdata, splitvarreductionpairs, withoutputs, testsplits, split,
   endvar, maxvarreduction},
  transdata = Transpose[data];
  splitvarreductionpairs = With[{outputs = transdata[[-1]]},
    ParallelTable[
     With[{inputs = transdata[[i]], startvar = N@NewVariance[outputs]},
      withoutputs = Thread[{inputs, outputs}];
      testsplits = Union[inputs];
      Table[
       With[{splitval = testsplits[[j]]},
        split = GatherBy[withoutputs, #[[1]] > splitval &];
        endvar =
         Total[(N@Length[#]*NewVariance[#[[All, -1]]] & /@ split)/
           Length[withoutputs]];
        {splitval, startvar - endvar}
        ]
       ,
       {j, Length[testsplits]}]], {i, Length[Most[transdata]]}]
    ];

  maxvarreduction = Max[Flatten[splitvarreductionpairs, 1][[All, -1]]];
  Position[splitvarreductionpairs, maxvarreduction]

  ]


... on my brand spanking new MBP it completes in just under 1,000 seconds
being parallelized.  I need this to run much faster to have any practical
applications.


Thanks again!


  • Prev by Date: Re: Why no OSX SystemDialogInput["RecordSound"]?
  • Next by Date: any news of new edition of The Mathematica GuideBooks?
  • Previous by thread: Re: How to Extract Conditional Expression?
  • Next by thread: Re: Trying to quickly split lists at the point of maximum variance reduction