Re: Trying to quickly split lists at the point of maximum variance reduction
- To: mathgroup at smc.vnet.net
- Subject: [mg127623] Re: Trying to quickly split lists at the point of maximum variance reduction
- From: Ray Koopman <koopman at sfu.ca>
- Date: Wed, 8 Aug 2012 21:34:00 -0400 (EDT)
- References: <jvn3om$am6$1@smc.vnet.net>
On Aug 5, 5:39 pm, Earl Mitchell <earl.j.mitch... at gmail.com> wrote:
> Hi all,
>
> I've got a list of 42,000 lists of 233 variables (integers from 1 to 250),
> each with a response in the last position (integer from 0 to 9). I want to
> find the variable, and the split point in that variable, at which the
> weighted average variance in the response will be minimized.
>
> As a trivial example, say the original variance in the response for the
> whole set is 100. Imagine that there is some variable, say at position
> 220, which has Tally'd values of {{0, 40000}, {1, 2000}} and, if we split
> the corresponding responses along this break point, the weighted average
> variance of the two lists is 80, which happens to be the lowest possible
> resulting variance for any single split on any single variable.
>
> How can I find this point quickly? This description might be confusing -
> let me know if something is not clear. Thanks ahead of time for the help!
>
> Mitch
>
> PS. Currently this job can be done with this code:
>
> FindMaxVarianceReductionSplit[data_List] :=
>  Module[{transdata, splitvarreductionpairs, withoutputs, testsplits,
>    split, endvar, maxvarreduction},
>   transdata = Transpose[data];
>   splitvarreductionpairs = With[{outputs = transdata[[-1]]},
>     ParallelTable[
>      With[{inputs = transdata[[i]], startvar = N@NewVariance[outputs]},
>       withoutputs = Thread[{inputs, outputs}];
>       testsplits = Union[inputs];
>       Table[
>        With[{splitval = testsplits[[j]]},
>         split = GatherBy[withoutputs, #[[1]] > splitval &];
>         endvar = Total[(N@Length[#]*NewVariance[#[[All, -1]]] & /@ split)/
>            Length[withoutputs]];
>         {splitval, startvar - endvar}],
>        {j, Length[testsplits]}]],
>      {i, Length[Most[transdata]]}]];
>   maxvarreduction = Max[Flatten[splitvarreductionpairs, 1][[All, -1]]];
>   Position[splitvarreductionpairs, maxvarreduction]]
>
> ... on my brand spanking new MBP it completes in just under 1,000 seconds
> when parallelized. I need this to run much faster to have any practical
> application.
>
> Thanks again!

I find your description confusing. Here's how I read it: You have a table whose dimensions are 42000 x 233, and a corresponding vector of 42000 responses. All the values in the table are integers in [1, 250]. All the responses are integers in [0, 9]. You want to choose a column, say k, and a value, say x, such that if you split the responses into two groups according to whether the k'th element in the corresponding row of the table is <= x or > x, you minimize the sum of squared deviations of the responses from their respective group means. Please confirm or correct that interpretation.
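For concreteness, here is a minimal sketch of that reading. The helper name withinSS and the toy data are illustrative only, not from your post:

(* withinSS[table, resp, k, x]: split the responses by whether column k
   of the table is <= x, then total the squared deviations of each group's
   responses from that group's own mean *)
withinSS[table_, resp_, k_, x_] := Module[{groups},
  groups = GatherBy[Transpose[{table[[All, k]], resp}], First[#] <= x &];
  Total[Total[(#[[All, 2]] - Mean[#[[All, 2]]])^2] & /@ groups]]

(* toy data, purely illustrative sizes *)
tbl = RandomInteger[{1, 250}, {100, 5}];
resp = RandomInteger[{0, 9}, 100];

(* brute-force search over every column and every observed threshold;
   returns the best {column, threshold, criterion} triple *)
First@SortBy[
  Flatten[Table[{k, x, withinSS[tbl, resp, k, x]},
     {k, 5}, {x, Union[tbl[[All, k]]]}], 1], Last]

If NewVariance in your code is the population (divide-by-n) variance, then the weighted average variance you describe is just this within-group sum of squares divided by the total number of rows, so minimizing either quantity picks the same column and split point.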