MathGroup Archive 2000

[Date Index] [Thread Index] [Author Index]

Search the Archive

Union in Mathematica versions 3 and 4

  • To: mathgroup at smc.vnet.net
  • Subject: [mg23309] Union in Mathematica versions 3 and 4
  • From: Carl Woll <carlw at u.washington.edu>
  • Date: Mon, 1 May 2000 18:00:10 -0400 (EDT)
  • Sender: owner-wri-mathgroup at wolfram.com

Recently, there has been some discussion of a vector union function
where I asserted that Union with a SameTest funtion other than the
default has complexity of O(N^2). This is a change between Mathematica
versions 3 and 4, and is in fact really a bug fix, since the algorithm
used in Mathematica version 3 did not produce a true Union function.
That is, in version 3 it was possible to apply the Union function to a
data set using a SameTest option other than the default with the result
that some of the elements of the data set were not eliminated even
though they should have been removed. This is because the old Union
algorithm always sorted the data set with the standard canonical
ordering first, and then compared adjacent elements using the supplied
SameTest function.

This is a good change, but there are situations where the sort and then
test algorithm will still produce the correct result. In version 3, we
just used Union to execute this algorithm, but in version 4, this
approach will clearly perform very inefficiently on large data sets. In
this post I wanted to discuss how to produce such an algorithm in
version 4. The simplest idea was the following:

SortUnion[data_, OrderFunction->p_,
SameTest->f_]:=First/@Split[Sort[data, p],f]

The first thing about this function I wanted to mention was the use of
First/@. In previous versions of Mathematica I believe this was the
fastest way to generate a list of just the first elements. In
Mathematica 4, however, it is not the most efficient way of getting a
list of the first elements. The most efficient techique now uses the new
All specification in Part. So a better SortUnion function is the
following:

SortUnion[data_, OrderFunction->p_,
SameTest->f_]:=Split[Sort[data,p],f][[All,1]]

By the way, if you wanted just the last elements, you would just replace
1 with -1. The next thing to note is that Sort without the optional
OrderFunction is much much faster than a sort using an optional
OrderFunction. In my tests, this difference may be as much as 2 orders
of magnitude. For example,

In[1]:=
data = Table[Random[], {100000}];

In[5]:=
s1 = Sort[data]; // Timing
s2 = Sort[data, OrderedQ[{#1, #2}] &]; // Timing
s1 === s2

Out[5]=
{0.609 Second, Null}

Out[6]=
{44.672 Second, Null}

Out[7]=
True

This suggests that it may be wise to organize ones data such that the
standard optionless Sort performs the sort you want. On the other hand,
the function Split with and without the optional SameTest function does
not exhibit as pronounced a difference. With this in mind, I believe
that a SortUnion function with a SameTest option, but without an
OrderFunction option would be useful, as in

SortUnion[data_, SameTest->f_]:=Split[Sort[data],f][[All,1]]

I believe that this captures the exact same functionality as the version
3 Union function.

Now for the reason why I looked into the above. There has been some
discussion recently of the somewhat misleadingly named OrderedUnion
function (I now think I should have named it UnsortedUnion). It turns
out that a competing O(N log N) function which I wrote before became
O(N^2) in version 4 because of the above change in Union. In revisiting
this algorithm, I came up with an improvement which is now significantly
better than the above OrderedUnion function. That is, while OrderedUnion
used an O(N) algorithm, it turned out that the crossover point from my
new O(N log N) function to the O(N) function was so high that the O(N)
function froze the kernel. Below I give the definition of this new O(N
log N) function (called UnsortedUnion) to find the union of a set while
preserving the original order.

In[7]:=
SortUnion[data_, SameTest -> f_] := Split[Sort[data], f][[All, 1]]

In[8]:=
UnsortedUnion[li_List] := Module[{len = Length[li], tmp},
    tmp =
      SortUnion[Transpose[{li, Range[len]}],
        SameTest -> (#1[[1]] === #2[[1]] &)];
    Sort[Transpose[RotateLeft[Transpose[tmp]]]][[All, 2]]
    ]

In[9]:=
OrderedUnion[li_List] := Block[{i, Sequence},
    i[n_] := (i[n] = Sequence[]; n);
    i /@ li]

In[10]:=
data = Table[Random[Integer, 40000], {40000}];

In[11]:=
r1 = OrderedUnion[data]; // Timing
r2 = UnsortedUnion[data]; // Timing
r1 === r2

Out[11]=
{5.266 Second, Null}

Out[12]=
{4.031 Second, Null}

Out[13]=
True



  • Prev by Date: Re: more on vector unions
  • Next by Date: RE: [Q] Equation solving?
  • Previous by thread: Re: more on vector unions
  • Next by thread: Re: Union in Mathematica versions 3 and 4