MathGroup Archive 2010

[Date Index] [Thread Index] [Author Index]

Search the Archive

Re: Best method to break apart a data set

  • To: mathgroup at smc.vnet.net
  • Subject: [mg113883] Re: Best method to break apart a data set
  • From: Leonid Shifrin <lshifr at gmail.com>
  • Date: Wed, 17 Nov 2010 05:28:06 -0500 (EST)

Hi Eli,

I suggest the graph-based approach. Basically, the procedure is this:
rescale the points so that x and y directions are treated on equal footing.
Then compute the mutual Euclidean distances for all the points. Then impose
some
threshold (I put 0.1, but this can be tuned), so that any distance smaller
than it will be converted to 1, and any distance larger than it - to 0 (I do
this with a combination of Clip and Unitize, both being very fast built-in
functions). Then use the Combinatorica function FromAdjacencyMatrix to
construct the Combinatorica graph using the modified distance matrix as an
adjacency matrix. Finally, find connected components of this graph with
another Combinatorica function - they will give you the positions of points
in your
original list which belong to the same component. Finally, extract those
components and plot them with ListPlot:

stuff = ... (your list)

Needs["Combinatorica`"]

Module[{
  threshold = 0.1,
  xmean = Mean[N@stuff[[All, 1]]],
  ymean = Mean[N@stuff[[All, 2]]],
  xmax = Max[Abs[N@stuff[[All, 1]]]],
  ymax =  Max[Abs[N@stuff[[All, 2]]]], normalized, admatrix, graph,
  components},
 normalized =
  Transpose[(Transpose[N@stuff] - {xmean, ymean})/{xmax, ymax}];
 admatrix =
  Unitize@Clip[
    Outer[Norm[#1 - #2] &, normalized, normalized, 1, 1], {0,
     threshold}, {0, 0}];
 graph = FromAdjacencyMatrix[admatrix];
 components = ConnectedComponents[graph];
 Print["Positions of components: ", components];
 Print[Length[stuff], "  ", Length[Flatten[components , 1]]];
 ListPlot[stuff[[#]]] & /@ components
 ]

(* Output *)

Positions of components:
{{1,4,7,10,13,16,20,24,28,32,35,38,41,44,47,50,52,54,56,58,60,62,64,65,66},{2,5,8,11,14,17,21,25,29,33,36,39,42,45,48,51,53,55,57,59,61,63},{3,6,9,12,15,18,22,26,30,34,37,40,43,46,49},{19,23,27,31}}

66  66

(* Plots suppressed *)


Works for me.

For larger lists, you may want to make several optimizations, such as using
SparseArray to store
the adjacency matrix, and use the version 8 graph functions instead of
Combinatorica. Computing the
matrix of pair distances still seems necessary in this approach, and is
O(n^2) operation where n
is the number of points though. But you may use Compile (especially the new
M8 Compile) to make
it at least as fast as possible. You may also compute it row by row and only
keep the positions of non-zero
elements - this will decrease the memory consumption during the adjacency
matrix computation, perhaps
at the expense of slowing it down somewhat.

Regards,
Leonid


On Tue, Nov 16, 2010 at 1:05 PM, EliL <elansey at gmail.com> wrote:

> For
> stuff = {{0, 3290}, {0, 8576}, {0, 12081}, {4569828, 3336}, {4569828,
>    8581}, {4569828, 12109}, {9139656, 3468}, {9139656,
>   8600}, {9139656, 12193}, {13709484, 3671}, {13709484,
>   8637}, {13709484, 12328}, {18279312, 3924}, {18279312,
>   8698}, {18279312, 12513}, {22849141, 4205}, {22849141,
>   8791}, {22849141, 12741}, {22849141, 15220}, {27418969,
>   4494}, {27418969, 8925}, {27418969, 13009}, {27418969,
>   15637}, {31988797, 4774}, {31988797, 9106}, {31988797,
>   13312}, {31988797, 15995}, {36558625, 5032}, {36558625,
>   9342}, {36558625, 13646}, {36558625, 16320}, {41128453,
>   5259}, {41128453, 9633}, {41128453, 14008}, {45698281,
>   5453}, {45698281, 9979}, {45698281, 14394}, {50268109,
>   5612}, {50268109, 10377}, {50268109, 14802}, {54837937,
>   5742}, {54837937, 10819}, {54837937, 15230}, {59407765,
>   5846}, {59407765, 11298}, {59407765, 15675}, {63977593,
>   5929}, {63977593, 11809}, {63977593, 16135}, {68547422,
>   5995}, {68547422, 12345}, {73117250, 6048}, {73117250,
>   12902}, {77687078, 6091}, {77687078, 13475}, {82256906,
>   6125}, {82256906, 14062}, {86826734, 6153}, {86826734,
>   14660}, {91396562, 6176}, {91396562, 15268}, {95966390,
>   6195}, {95966390, 15884}, {100536218, 6210}, {105106046,
>   6223}, {109675875, 6233}}
>
> If you ListPlot it you'll see 4 distinct curves. I'd love to have
> Mathematica break them apart into four separate sets.  I've tried
> using FindClusters, but the default doesn't work. I've also tried
> using DistanceFunction -> (Norm[#1[[2]] - #2[[2]]]^2 &) to pull only
> the closeness in the y-axis. This successfully separates the bottom
> curve, but doesn't break the 3 upper curves up in the right way.
>
> Any other ideas for how to break this up? Or different Distance
> Functions to use?
> Thanks so much,
> Eli.
>
>


  • Prev by Date: Re: Mathematica 8 is now available
  • Next by Date: Re: Best method to break apart a data set
  • Previous by thread: Re: Best method to break apart a data set
  • Next by thread: Re: Best method to break apart a data set