Trying to quickly split lists at the point of maximum variance reduction
- To: mathgroup at smc.vnet.net
- Subject: [mg127593] Trying to quickly split lists at the point of maximum variance reduction
- From: Earl Mitchell <earl.j.mitchell at gmail.com>
- Date: Sun, 5 Aug 2012 14:59:54 -0400 (EDT)
- Delivered-to: l-mathgroup@mail-archive0.wolfram.com
- Delivered-to: l-mathgroup@wolfram.com
- Delivered-to: mathgroup-newout@smc.vnet.net
- Delivered-to: mathgroup-newsend@smc.vnet.net
Hi all, I've got a list of 42,000 lists of 233 variables (integers from 1 to 250), each with a response in the last position (integer from 0 to 9). I want to find the variable and split point in that variable at which the weighted average variance in the response will be minimized. As a trivial example say the original variance in the response for the whole set is 100. Imagine that there is some variable, say at position 220, which has Tally'd values of {{0,40,000},{1,2000}} and, if we split the corresponding responses along this break point the weighted average variance of the two lists is 80, which happens to be the lowest possible resulting variance for any single split, on any single variable. How can I find this point quickly? This description might be confusing - let me know if something is not clear. Thanks ahead of time for the help! Mitch PS. Currently this job can be done in with this code: FindMaxVarianceReductionSplit[data_List] := Module[{transdata, splitvarreductionpairs, withoutputs, testsplits, split, endvar, maxvarreduction}, transdata = Transpose[data]; splitvarreductionpairs = With[{outputs = transdata[[-1]]}, ParallelTable[ With[{inputs = transdata[[i]], startvar = N@NewVariance[outputs]}, withoutputs = Thread[{inputs, outputs}]; testsplits = Union[inputs]; Table[ With[{splitval = testsplits[[j]]}, split = GatherBy[withoutputs, #[[1]] > splitval &]; endvar = Total[(N@Length[#]*NewVariance[#[[All, -1]]] & /@ split)/ Length[withoutputs]]; {splitval, startvar - endvar} ] , {j, Length[testsplits]}]], {i, Length[Most[transdata]]}] ]; maxvarreduction = Max[Flatten[splitvarreductionpairs, 1][[All, -1]]]; Position[splitvarreductionpairs, maxvarreduction] ] ... on my brand spanking new MBP it completes in just under 1,000 seconds being parallelized. I need this to run much faster to have any practical applications. Thanks again!