Re: Best method to break apart a data set

*To*: mathgroup at smc.vnet.net*Subject*: [mg113883] Re: Best method to break apart a data set*From*: Leonid Shifrin <lshifr at gmail.com>*Date*: Wed, 17 Nov 2010 05:28:06 -0500 (EST)

Hi Eli, I suggest the graph-based approach. Basically, the procedure is this: rescale the points so that x and y directions are treated on equal footing. Then compute the mutual Euclidean distances for all the points. Then impose some threshold (I put 0.1, but this can be tuned), so that any distance smaller than it will be converted to 1, and any distance larger than it - to 0 (I do this with a combination of Clip and Unitize, both being very fast built-in functions). Then use the Combinatorica function FromAdjacencyMatrix to construct the Combinatorica graph using the modified distance matrix as an adjacency matrix. Finally, find connected components of this graph with another Combinatorica function - they will give you the positions of points in your original list which belong to the same component. Finally, extract those components and plot them with ListPlot: stuff = ... (your list) Needs["Combinatorica`"] Module[{ threshold = 0.1, xmean = Mean[N@stuff[[All, 1]]], ymean = Mean[N@stuff[[All, 2]]], xmax = Max[Abs[N@stuff[[All, 1]]]], ymax = Max[Abs[N@stuff[[All, 2]]]], normalized, admatrix, graph, components}, normalized = Transpose[(Transpose[N@stuff] - {xmean, ymean})/{xmax, ymax}]; admatrix = Unitize@Clip[ Outer[Norm[#1 - #2] &, normalized, normalized, 1, 1], {0, threshold}, {0, 0}]; graph = FromAdjacencyMatrix[admatrix]; components = ConnectedComponents[graph]; Print["Positions of components: ", components]; Print[Length[stuff], " ", Length[Flatten[components , 1]]]; ListPlot[stuff[[#]]] & /@ components ] (* Output *) Positions of components: {{1,4,7,10,13,16,20,24,28,32,35,38,41,44,47,50,52,54,56,58,60,62,64,65,66},{2,5,8,11,14,17,21,25,29,33,36,39,42,45,48,51,53,55,57,59,61,63},{3,6,9,12,15,18,22,26,30,34,37,40,43,46,49},{19,23,27,31}} 66 66 (* Plots suppressed *) Works for me. For larger lists, you may want to make several optimizations, such as using SparseArray to store the adjacency matrix, and use the version 8 graph functions instead of Combinatorica. Computing the matrix of pair distances still seems necessary in this approach, and is O(n^2) operation where n is the number of points though. But you may use Compile (especially the new M8 Compile) to make it at least as fast as possible. You may also compute it row by row and only keep the positions of non-zero elements - this will decrease the memory consumption during the adjacency matrix computation, perhaps at the expense of slowing it down somewhat. Regards, Leonid On Tue, Nov 16, 2010 at 1:05 PM, EliL <elansey at gmail.com> wrote: > For > stuff = {{0, 3290}, {0, 8576}, {0, 12081}, {4569828, 3336}, {4569828, > 8581}, {4569828, 12109}, {9139656, 3468}, {9139656, > 8600}, {9139656, 12193}, {13709484, 3671}, {13709484, > 8637}, {13709484, 12328}, {18279312, 3924}, {18279312, > 8698}, {18279312, 12513}, {22849141, 4205}, {22849141, > 8791}, {22849141, 12741}, {22849141, 15220}, {27418969, > 4494}, {27418969, 8925}, {27418969, 13009}, {27418969, > 15637}, {31988797, 4774}, {31988797, 9106}, {31988797, > 13312}, {31988797, 15995}, {36558625, 5032}, {36558625, > 9342}, {36558625, 13646}, {36558625, 16320}, {41128453, > 5259}, {41128453, 9633}, {41128453, 14008}, {45698281, > 5453}, {45698281, 9979}, {45698281, 14394}, {50268109, > 5612}, {50268109, 10377}, {50268109, 14802}, {54837937, > 5742}, {54837937, 10819}, {54837937, 15230}, {59407765, > 5846}, {59407765, 11298}, {59407765, 15675}, {63977593, > 5929}, {63977593, 11809}, {63977593, 16135}, {68547422, > 5995}, {68547422, 12345}, {73117250, 6048}, {73117250, > 12902}, {77687078, 6091}, {77687078, 13475}, {82256906, > 6125}, {82256906, 14062}, {86826734, 6153}, {86826734, > 14660}, {91396562, 6176}, {91396562, 15268}, {95966390, > 6195}, {95966390, 15884}, {100536218, 6210}, {105106046, > 6223}, {109675875, 6233}} > > If you ListPlot it you'll see 4 distinct curves. I'd love to have > Mathematica break them apart into four separate sets. I've tried > using FindClusters, but the default doesn't work. I've also tried > using DistanceFunction -> (Norm[#1[[2]] - #2[[2]]]^2 &) to pull only > the closeness in the y-axis. This successfully separates the bottom > curve, but doesn't break the 3 upper curves up in the right way. > > Any other ideas for how to break this up? Or different Distance > Functions to use? > Thanks so much, > Eli. > >