Re: Trying to quickly split lists at the point of maximum variance reduction
*To*: mathgroup at smc.vnet.net
*Subject*: [mg127623] Re: Trying to quickly split lists at the point of maximum variance reduction
*From*: Ray Koopman <koopman at sfu.ca>
*Date*: Wed, 8 Aug 2012 21:34:00 -0400 (EDT)
*References*: <jvn3om$am6$1@smc.vnet.net>
On Aug 5, 5:39 pm, Earl Mitchell <earl.j.mitch... at gmail.com> wrote:
> Hi all,
>
> I've got a list of 42,000 lists of 233 variables (integers from 1 to 250),
> each with a response in the last position (integer from 0 to 9). I want to
> find the variable and split point in that variable at which the weighted
> average variance in the response will be minimized.
>
> As a trivial example say the original variance in the response for the
> whole set is 100. Imagine that there is some variable, say at position
> 220, which has Tally'd values of {{0,40000},{1,2000}} and, if we split the
> corresponding responses along this break point the weighted average
> variance of the two lists is 80, which happens to be the lowest possible
> resulting variance for any single split, on any single variable.
>
> How can I find this point quickly? This description might be confusing -
> let me know if something is not clear. Thanks ahead of time for the help!
>
> Mitch
>
> PS. Currently this job can be done with this code:
>
> FindMaxVarianceReductionSplit[data_List] :=
>  Module[{transdata, splitvarreductionpairs, withoutputs, testsplits,
>    split, endvar, maxvarreduction},
>   transdata = Transpose[data];
>   splitvarreductionpairs = With[{outputs = transdata[[-1]]},
>     ParallelTable[
>      With[{inputs = transdata[[i]], startvar = N@NewVariance[outputs]},
>       withoutputs = Thread[{inputs, outputs}];
>       testsplits = Union[inputs];
>       Table[
>        With[{splitval = testsplits[[j]]},
>         split = GatherBy[withoutputs, #[[1]] > splitval &];
>         endvar =
>          Total[(N@Length[#]*NewVariance[#[[All, -1]]] & /@ split)/
>            Length[withoutputs]];
>         {splitval, startvar - endvar}
>         ],
>        {j, Length[testsplits]}]],
>      {i, Length[Most[transdata]]}]
>     ];
>
>   maxvarreduction = Max[Flatten[splitvarreductionpairs, 1][[All, -1]]];
>   Position[splitvarreductionpairs, maxvarreduction]
>
>  ]
>
> ... on my brand spanking new MBP it completes in just under 1,000 seconds
> when parallelized. I need this to run much faster for it to have any
> practical applications.
>
> Thanks again!
I find your description confusing. Here's how I read it:
You have a table whose dimensions are 42000 x 233, and a corresponding
vector of 42000 responses. All the values in the table are integers in
[1, 250]. All the responses are integers in [0,9]. You want to choose
a column, say k, and a value, say x, such that if you split the
responses into two groups according to whether the k'th element in
the corresponding row of the table is <= x or > x, you minimize the
sum of squared deviations of the responses from their respective
group means.
Please confirm or correct that interpretation.
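If that reading is right, the search can be sketched as follows. This is only an illustration of the interpretation above, not code from the thread: the names bestSplitForColumn and bestSplit are hypothetical, and the table and the response vector are assumed to be passed separately. It also assumes that the "variance" in the original post divides by the group size, in which case the size-weighted average variance is the within-group sum of squares divided by the total count, so minimizing one minimizes the other. Sorting each column once and using running sums avoids regrouping the responses for every candidate split value.

```mathematica
(* Hypothetical helper: best split of one column x against responses y.     *)
(* Returns {x0, ss}: the rows with x <= x0 form one group, the rest the     *)
(* other, and ss is the summed squared deviation from the two group means.  *)
bestSplitForColumn[x_List, y_List] :=
 Module[{ord, xs, ys, n, s1, s2, cuts, ss, best},
  ord = Ordering[x];
  xs = x[[ord]]; ys = N[y[[ord]]];
  n = Length[xs];
  s1 = Accumulate[ys];      (* running sum of responses         *)
  s2 = Accumulate[ys^2];    (* running sum of squared responses *)
  (* positions k at which the sorted column value changes after k *)
  cuts = Pick[Range[n - 1], Unitize[Differences[xs]], 1];
  If[cuts === {},
   {Missing["NoSplit"], Infinity},  (* constant column: nothing to split *)
   ss = (s2[[cuts]] - s1[[cuts]]^2/cuts) +
     ((s2[[-1]] - s2[[cuts]]) - (s1[[-1]] - s1[[cuts]])^2/(n - cuts));
   best = First[Ordering[ss, 1]];
   {xs[[cuts[[best]]]], ss[[best]]}]]

(* Hypothetical driver: try every column, return {k, x0, ss}. *)
bestSplit[table_?MatrixQ, y_List] :=
 Module[{res, k},
  res = bestSplitForColumn[#, y] & /@ Transpose[table];
  k = First[Ordering[res[[All, 2]], 1]];
  {k, res[[k, 1]], res[[k, 2]]}]

(* Small made-up example: column 1 separates the responses perfectly. *)
bestSplit[{{1, 5}, {2, 6}, {3, 5}, {4, 6}}, {0, 0, 9, 9}]
(* {1, 2, 0.} *)
```

The work per column is one sort plus a few vectorized passes, so the mapping over columns could presumably be handed to ParallelMap if more speed is needed.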