MathGroup Archive: August 2012 [00164]


Re: Trying to quickly split lists at the point of

To: mathgroup at smc.vnet.net
Subject: [mg127637] Re: Trying to quickly split lists at the point of
From: Earl Mitchell <earl.j.mitchell at gmail.com>
Date: Fri, 10 Aug 2012 02:41:00 -0400 (EDT)
References: <jvn3om$am6$1@smc.vnet.net>

That is spot on.  Sorry for the clumsy description.

For some context (since you may know other resources to suggest along these
lines), this is part of a larger project to create a suite of data-mining
and machine-learning tools. The goal is threefold: improve my Mathematica
programming, deepen my knowledge of advanced statistical methods, and
(obviously) produce some algorithms that can be used to make predictions.

This particular step is part of the random forest method.

Thanks so much for your help,
Mitch

On Wed, Aug 8, 2012 at 7:34 PM, Ray Koopman <koopman at sfu.ca> wrote:

> On Aug 5, 5:39 pm, Earl Mitchell <earl.j.mitch... at gmail.com> wrote:
> > Hi all,
> >
> > I've got a list of 42,000 lists of 233 variables (integers from 1 to 250),
> > each with a response in the last position (integer from 0 to 9).  I want to
> > find the variable, and the split point in that variable, at which the
> > weighted average variance in the response is minimized.
> >
> > As a trivial example, say the original variance in the response for the
> > whole set is 100.  Imagine that there is some variable, say at position
> > 220, which has Tally'd values of {{0, 40000}, {1, 2000}}, and that if we
> > split the corresponding responses at this break point, the weighted average
> > variance of the two lists is 80, which happens to be the lowest possible
> > resulting variance for any single split on any single variable.
> >
> > How can I find this point quickly?  This description might be confusing;
> > let me know if anything is unclear.  Thanks in advance for the help!
> >
> > Mitch
> >
> > PS.  Currently this job can be done with this code:
> >
> > FindMaxVarianceReductionSplit[data_List] :=
> >  Module[{transdata, splitvarreductionpairs, withoutputs, testsplits,
> >    split, endvar, maxvarreduction},
> >   transdata = Transpose[data];
> >   splitvarreductionpairs = With[{outputs = transdata[[-1]]},
> >     ParallelTable[
> >      With[{inputs = transdata[[i]], startvar = N@NewVariance[outputs]},
> >       withoutputs = Thread[{inputs, outputs}];
> >       testsplits = Union[inputs];
> >       Table[
> >        With[{splitval = testsplits[[j]]},
> >         split = GatherBy[withoutputs, #[[1]] > splitval &];
> >         endvar =
> >          Total[(N@Length[#]*NewVariance[#[[All, -1]]] & /@ split)/
> >            Length[withoutputs]];
> >         {splitval, startvar - endvar}
> >         ]
> >        ,
> >        {j, Length[testsplits]}]], {i, Length[Most[transdata]]}]
> >     ];
> >
> >   maxvarreduction = Max[Flatten[splitvarreductionpairs, 1][[All, -1]]];
> >   Position[splitvarreductionpairs, maxvarreduction]
> >
> >   ]
> >
> > ... on my brand spanking new MBP it completes in just under 1,000 seconds,
> > even when parallelized.  I need this to run much faster for it to be of any
> > practical use.
> >
> > Thanks again!
>
> I find your description confusing. Here's how I read it:
>
> You have a table whose dimensions are 42000 x 233, and a corresponding
> vector of 42000 responses. All the values in the table are integers in
> [1, 250]. All the responses are integers in [0,9]. You want to choose
> a column, say k, and a value, say x, such that if you split the
> responses into two groups according as the k'th element in the
> corresponding row of the table is <= x or > x, you minimize the sum of
> squared deviations of the responses from their respective group
> means.
>
> Please confirm or correct that interpretation.
>
>
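
The interpretation quoted above can be sketched in Mathematica as follows. This is only an illustration, not code from the thread: the names bestSplitForColumn and bestSplit are hypothetical, and it assumes the reading that Mitch confirmed (rows go left when the chosen column is <= x, and the quantity to minimize is the summed squared deviation of the responses from their group means). If NewVariance, which the original post does not define, divides by the group size, then the weighted average variance is this sum divided by the number of rows, so both criteria pick the same split. The idea is to sort each column once and use running sums, so that every candidate split is scored without regrouping the data.

```mathematica
(* Hypothetical helper: best split of one predictor column x against
   the responses y. Returns {splitValue, sseAfterSplit}, where rows
   with x <= splitValue form the left group. *)
bestSplitForColumn[x_?VectorQ, y_?VectorQ] :=
 Module[{ord, xs, ys, n, cumS, totS, totQ, cut, nL, sL, sse, k},
  ord = Ordering[x];            (* sort rows by the predictor *)
  xs = x[[ord]];
  ys = N[y[[ord]]];
  n = Length[xs];
  cumS = Accumulate[ys];        (* running sum of sorted responses *)
  totS = Last[cumS];            (* sum of all responses *)
  totQ = ys . ys;               (* sum of squared responses *)
  (* candidate cuts: positions where the sorted predictor changes *)
  cut = Pick[Range[n - 1], Unitize[Differences[xs]], 1];
  If[cut === {},
   (* constant column: no split possible, report the unsplit SSE *)
   {Missing["NoSplit"], totQ - totS^2/n},
   nL = N[cut];                 (* left group sizes *)
   sL = cumS[[cut]];            (* left group response sums *)
   (* SSE of both groups: sum(y^2) - sL^2/nL - sR^2/nR *)
   sse = totQ - sL^2/nL - (totS - sL)^2/(n - nL);
   k = First[Ordering[sse, 1]]; (* index of the smallest SSE *)
   {xs[[cut[[k]]]], sse[[k]]}]]

(* Hypothetical wrapper: scan all columns of the predictor table and
   return {columnIndex, splitValue, sseAfterSplit} for the best one. *)
bestSplit[table_?MatrixQ, y_?VectorQ] :=
 Module[{res, k},
  res = bestSplitForColumn[#, y] & /@ Transpose[table];
  k = First[Ordering[res[[All, 2]], 1]];
  {k, res[[k, 1]], res[[k, 2]]}]

(* Tiny made-up example: column 1 separates the responses perfectly *)
bestSplit[{{1, 5}, {2, 6}, {3, 5}, {4, 6}}, {0, 0, 9, 9}]
(* should give {1, 2, 0.} *)
```

With data laid out as in the original post (response in the last position of each row), the call would be bestSplit[data[[All, ;; -2]], data[[All, -1]]]. Each column then costs one sort plus a few vectorized passes, instead of one GatherBy per candidate split value.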
