Mathematica 9 is now available
Services & Resources / Wolfram Forums / MathGroup Archive
-----

MathGroup Archive 2012

[Date Index] [Thread Index] [Author Index]

Search the Archive

Re: Trying to quickly split lists at the point of

  • To: mathgroup at smc.vnet.net
  • Subject: [mg127638] Re: Trying to quickly split lists at the point of
  • From: Ray Koopman <koopman at sfu.ca>
  • Date: Fri, 10 Aug 2012 02:41:20 -0400 (EDT)
  • Delivered-to: l-mathgroup@mail-archive0.wolfram.com
  • Delivered-to: l-mathgroup@wolfram.com
  • Delivered-to: mathgroup-newout@smc.vnet.net
  • Delivered-to: mathgroup-newsend@smc.vnet.net

I agree, it's not always easy to word these things clearly. After
I posted, I realized that a still clearer statement would be "...
split the responses into two groups according as the corresponding
element of column k in the table is <= x or > x, ...".

Anyhow, there is a faster solution. First some notation. Let y be the
list of responses, and let yA and yB be the sublists that result from
splitting the responses into an A-set and a B-set. Let nA, nB, and n
be the lengths of yA, yB, and y; and let mA, mB, and m be the means of
yA, yB, and y. Then n = nA + nB, n*m = nA*mA + nB*mB, and
  
sum[(y - m)^2] = sum[(yA - mA)^2] + sum[(yB - mB)^2] 
                 + (nA*nB/n)*(mA - mB)^2.

For any given y, the left hand side is fixed regardless of the split,
so splitting to minimize the sum of the first two terms on the right
is equivalent to splitting to maximize the last term on the right,
which turns out to be faster.

Here is a compiled function that finds the optimal split of the 
responses for any given column. It returns a 2-element list of Reals.
The first value is the optimal cutting score.
The second value is nA*nB*(mA - mB)^2 for that cut.

Arguments: y is the list of responses; it must be Real,
to avoid problems that arise with n's as large as yours.
x is an Integer list of positive values on which to base the split.
xmax must be >= the actual maximum value in x.

I leave it to you to either expand the function to loop over all the
columns of the data, or embed calls to the function in such a loop.

cutter = Compile[{{y,_Real,1},{x,_Integer,1},{xmax,_Integer,0}},
Module[{f = Table[0,{xmax}], s = Table[0.,{xmax}], n = Length@x,
        t = Plus@@y, nA = 0, tA = 0., xi = 0, xmax2, dmax = 0., d},
Do[xi = x[[i]]; f[[xi]]++; s[[xi]] += y[[i]], {i,n}];
xmax2 = xmax; While[f[[xmax2]] == 0, xmax2--];
xmax2--;      While[f[[xmax2]] == 0, xmax2--];
Do[If[f[[i]] == 0, Continue[]]; 
   nA += f[[i]]; tA += s[[i]];
   d = (nA*t - n*tA)^2 / (nA(n-nA));
   If[d > dmax, dmax = d; xi = i], {i,xmax2}];
{N@xi, dmax}]]

----- Earl Mitchell <earl.j.mitchell at gmail.com> wrote:
> That is spot on.  Sorry for the clumsy description.
> 
> For some context (since you may know other resources to suggest along these
> lines) this is part of a larger project to create a suite of data-mining
> and machine learning tools, with the goal being three fold: increase my
> mathematica programming abilities, increase my knowledge of advanced
> statistical methods, and (obviously) generate some algorithms that can be
> used to make predictions.
> 
> This particular step is part of the RandomForrest statistical method.
> 
> Thanks so much for your help,
> Mitch
> 
> On Wed, Aug 8, 2012 at 7:34 PM, Ray Koopman <koopman at sfu.ca> wrote:
> 
>> On Aug 5, 5:39 pm, Earl Mitchell <earl.j.mitchell at gmail.com> wrote:
>>> Hi all,
>>>
>>> I've got a list of 42,000 lists of 233 variables (integers from 1 to 250),
>>> each with a response in the last position (integer from 0 to 9). I want to
>>> find the variable and split point in that variable at which the weighted
>>> average variance in the response will be minimized.
>>>
>>> As a trivial example say the original variance in the response for the
>>> whole set is 100.  Imagine that there is some variable, say at position
>>> 220, which has Tally'd values of {{0,40,000},{1,2000}} and, if we split
>>> the corresponding responses along this break point the weighted average
>>> variance of the two lists is 80, which happens to be the lowest possible
>>> resulting variance for any single split, on any single variable.
>>>
>>> How can I find this point quickly? This description might be confusing -
>>> let me know if something is not clear. Thanks ahead of time for the help!
>>>
>>> Mitch
>>>
>>> PS.  Currently this job can be done in with this code:
>>>
>>> FindMaxVarianceReductionSplit[data_List] :=
>>>  Module[{transdata, splitvarreductionpairs, withoutputs, testsplits, split,
>>>    endvar, maxvarreduction},
>>>   transdata = Transpose[data];
>>>   splitvarreductionpairs = With[{outputs = transdata[[-1]]},
>>>     ParallelTable[
>>>      With[{inputs = transdata[[i]], startvar = N@NewVariance[outputs]},
>>>       withoutputs = Thread[{inputs, outputs}];
>>>       testsplits = Union[inputs];
>>>       Table[
>>>        With[{splitval = testsplits[[j]]},
>>>         split = GatherBy[withoutputs, #[[1]] > splitval &];
>>>         endvar =
>>>          Total[(N@Length[#]*NewVariance[#[[All, -1]]] & /@ split)/
>>>            Length[withoutputs]];
>>>         {splitval, startvar - endvar}
>>>         ]
>>>        ,
>>>        {j, Length[testsplits]}]], {i, Length[Most[transdata]]}]
>>>     ];
>>>
>>>   maxvarreduction = Max[Flatten[splitvarreductionpairs, 1][[All, -1]]];
>>>   Position[splitvarreductionpairs, maxvarreduction]
>>>
>>>   ]
>>>
>>> ... on my brand spanking new MBP it completes in just under 1,000 seconds
>>> being parallelized.  I need this to run much faster to have any practical
>>> applications.
>>>
>>> Thanks again!
>>
>> I find your description confusing. Here's how I read it:
>>
>> You have a table whose dimensions are 42000 x 233, and a corresponding
>> vector of 42000 responses. All the values in the table are integers in
>> [1, 250]. All the responses are integers in [0,9]. You want to choose
>> a column, say k, and a value, say x, such that if you split the
>> responses into two groups according as the k'th element in the
>> corresponding row of the table is <= x or > x, you minimize the sum of
>> squared deviations of the responses from their respective group
>> means.
>>
>> Please confirm or correct that interpretation.



  • Prev by Date: Re: Trying to quickly split lists at the point of
  • Next by Date: problems withn 3d models
  • Previous by thread: Re: Trying to quickly split lists at the point of
  • Next by thread: Re: Trying to quickly split lists at the point of