Re: Trying to quickly split lists at the point of

*To*: mathgroup at smc.vnet.net*Subject*: [mg127640] Re: Trying to quickly split lists at the point of*From*: Earl Mitchell <earl.j.mitchell at gmail.com>*Date*: Fri, 10 Aug 2012 02:42:01 -0400 (EDT)*Delivered-to*: l-mathgroup@mail-archive0.wolfram.com*Delivered-to*: l-mathgroup@wolfram.com*Delivered-to*: mathgroup-newout@smc.vnet.net*Delivered-to*: mathgroup-newsend@smc.vnet.net*References*: <CAFAKiUSJexp00Rz4sq0-eUCn9fq6B0OkZSyokX_qaB49x13fLQ@mail.gmail.com>

This is great, thank you so much. I'll have to spend some time integrating it into my current code to see how it impacts the performance and if any other speedups are necessary As an aside, another method I was considering was thus... Sort column k (and the response variables) smallest to largest. Taking the Rest@Union[k] to get the complete list of possible splits, which we can call kS. Pick the median value of kS for the first split, then try kS + 1; if the sum of squared deviations of the responses increases try kS - 1, if it decreases try kS + 2. Continue trying new splits until the variance increases, at which point the last kS is the minimal split. When I started coding up the algorithm for this I realized that I was essentially trying to re-invent the brilliance that is NMinimize[] - and whatever code I wrote for this would be both slower and clumsier than the in-house functionality. Unfortunately when I tried to design a function that took as one of it's arguments kS, and tried to Minimize the resulting sum of squared deviations in the responses by altering the element of kS being evaluated (kS[[i]], where NMinimize was solving for the minimal i), Mathematica would always return i->0 and throw a part error. I know NMinimize wasn't designed with this sort of thing in mind, but I was curious if you had any high level thoughts on the approach I've outlined, and other insights into the workings of NMinimize. Thanks again for your insightful and clear response! Again, I apologize if something above is opaque... Mitch the sorting the list all possible split points On Thu, Aug 9, 2012 at 12:36 PM, Ray Koopman <koopman at sfu.ca> wrote: > I agree, it's not always easy to word these things clearly. After > I posted, I realized that a still clearer statement would be "... > split the responses into two groups according as the corresponding > element of column k in the table is <= x or > x, ...". > > Anyhow, there is a faster solution. First some notation. Let y be the > list of responses, and let yA and yB be the sublists that result from > splitting the responses into an A-set and a B-set. Let nA, nB, and n > be the lengths of yA, yB, and y; and let mA, mB, and m be the means of > yA, yB, and y. Then n = nA + nB, n*m = nA*mA + nB*mB, and > > sum[(y - m)^2] = sum[(yA - mA)^2] + sum[(yB - mB)^2] > + (nA*nB/n)*(mA - mB)^2. > > For any given y, the left hand side is fixed regardless of the split, > so splitting to minimize the sum of the first two terms on the right > is equivalent to splitting to maximize the last term on the right, > which turns out to be faster. > > Here is a compiled function that finds the optimal split of the > responses for any given column. It returns a 2-element list of Reals. > The first value is the optimal cutting score. > The second value is nA*nB*(mA - mB)^2 for that cut. > > Arguments: y is the list of responses; it must be Real, > to avoid problems that arise with n's as large as yours. > x is an Integer list of positive values on which to base the split. > xmax must be >= the actual maximum value in x. > > I leave it to you to either expand the function to loop over all the > columns of the data, or embed calls to the function in such a loop. > > cutter = Compile[{{y,_Real,1},{x,_Integer,1},{xmax,_Integer,0}}, > Module[{f = Table[0,{xmax}], s = Table[0.,{xmax}], n = Length@x, > t = Plus@@y, nA = 0, tA = 0., xi = 0, xmax2, dmax = 0., d}, > Do[xi = x[[i]]; f[[xi]]++; s[[xi]] += y[[i]], {i,n}]; > xmax2 = xmax; While[f[[xmax2]] == 0, xmax2--]; > xmax2--; While[f[[xmax2]] == 0, xmax2--]; > Do[If[f[[i]] == 0, Continue[]]; > nA += f[[i]]; tA += s[[i]]; > d = (nA*t - n*tA)^2 / (nA(n-nA)); > If[d > dmax, dmax = d; xi = i], {i,xmax2}]; > {N@xi, dmax}]] > > ----- Earl Mitchell <earl.j.mitchell at gmail.com> wrote: > > That is spot on. Sorry for the clumsy description. > > > > For some context (since you may know other resources to suggest along > these > > lines) this is part of a larger project to create a suite of data-mining > > and machine learning tools, with the goal being three fold: increase my > > mathematica programming abilities, increase my knowledge of advanced > > statistical methods, and (obviously) generate some algorithms that can be > > used to make predictions. > > > > This particular step is part of the RandomForrest statistical method. > > > > Thanks so much for your help, > > Mitch > > > > On Wed, Aug 8, 2012 at 7:34 PM, Ray Koopman <koopman at sfu.ca> wrote: > > > >> On Aug 5, 5:39 pm, Earl Mitchell <earl.j.mitchell at gmail.com> wrote: > >>> Hi all, > >>> > >>> I've got a list of 42,000 lists of 233 variables (integers from 1 to > 250), > >>> each with a response in the last position (integer from 0 to 9). I > want to > >>> find the variable and split point in that variable at which the > weighted > >>> average variance in the response will be minimized. > >>> > >>> As a trivial example say the original variance in the response for the > >>> whole set is 100. Imagine that there is some variable, say at position > >>> 220, which has Tally'd values of {{0,40,000},{1,2000}} and, if we split > >>> the corresponding responses along this break point the weighted average > >>> variance of the two lists is 80, which happens to be the lowest > possible > >>> resulting variance for any single split, on any single variable. > >>> > >>> How can I find this point quickly? This description might be confusing > - > >>> let me know if something is not clear. Thanks ahead of time for the > help! > >>> > >>> Mitch > >>> > >>> PS. Currently this job can be done in with this code: > >>> > >>> FindMaxVarianceReductionSplit[data_List] := > >>> Module[{transdata, splitvarreductionpairs, withoutputs, testsplits, > split, > >>> endvar, maxvarreduction}, > >>> transdata = Transpose[data]; > >>> splitvarreductionpairs = With[{outputs = transdata[[-1]]}, > >>> ParallelTable[ > >>> With[{inputs = transdata[[i]], startvar = N@NewVariance > [outputs]}, > >>> withoutputs = Thread[{inputs, outputs}]; > >>> testsplits = Union[inputs]; > >>> Table[ > >>> With[{splitval = testsplits[[j]]}, > >>> split = GatherBy[withoutputs, #[[1]] > splitval &]; > >>> endvar = > >>> Total[(N@Length[#]*NewVariance[#[[All, -1]]] & /@ split)/ > >>> Length[withoutputs]]; > >>> {splitval, startvar - endvar} > >>> ] > >>> , > >>> {j, Length[testsplits]}]], {i, Length[Most[transdata]]}] > >>> ]; > >>> > >>> maxvarreduction = Max[Flatten[splitvarreductionpairs, 1][[All, -1]]]; > >>> Position[splitvarreductionpairs, maxvarreduction] > >>> > >>> ] > >>> > >>> ... on my brand spanking new MBP it completes in just under 1,000 > seconds > >>> being parallelized. I need this to run much faster to have any > practical > >>> applications. > >>> > >>> Thanks again! > >> > >> I find your description confusing. Here's how I read it: > >> > >> You have a table whose dimensions are 42000 x 233, and a corresponding > >> vector of 42000 responses. All the values in the table are integers in > >> [1, 250]. All the responses are integers in [0,9]. You want to choose > >> a column, say k, and a value, say x, such that if you split the > >> responses into two groups according as the k'th element in the > >> corresponding row of the table is <= x or > x, you minimize the sum of > >> squared deviations of the responses from their respective group > >> means. > >> > >> Please confirm or correct that interpretation. >