Union in Mathematica versions 3 and 4
- To: mathgroup at smc.vnet.net
- Subject: [mg23309] Union in Mathematica versions 3 and 4
- From: Carl Woll <carlw at u.washington.edu>
- Date: Mon, 1 May 2000 18:00:10 -0400 (EDT)
- Sender: owner-wri-mathgroup at wolfram.com
Recently, there has been some discussion of a vector union function where I asserted that Union with a SameTest funtion other than the default has complexity of O(N^2). This is a change between Mathematica versions 3 and 4, and is in fact really a bug fix, since the algorithm used in Mathematica version 3 did not produce a true Union function. That is, in version 3 it was possible to apply the Union function to a data set using a SameTest option other than the default with the result that some of the elements of the data set were not eliminated even though they should have been removed. This is because the old Union algorithm always sorted the data set with the standard canonical ordering first, and then compared adjacent elements using the supplied SameTest function. This is a good change, but there are situations where the sort and then test algorithm will still produce the correct result. In version 3, we just used Union to execute this algorithm, but in version 4, this approach will clearly perform very inefficiently on large data sets. In this post I wanted to discuss how to produce such an algorithm in version 4. The simplest idea was the following: SortUnion[data_, OrderFunction->p_, SameTest->f_]:=First/@Split[Sort[data, p],f] The first thing about this function I wanted to mention was the use of First/@. In previous versions of Mathematica I believe this was the fastest way to generate a list of just the first elements. In Mathematica 4, however, it is not the most efficient way of getting a list of the first elements. The most efficient techique now uses the new All specification in Part. So a better SortUnion function is the following: SortUnion[data_, OrderFunction->p_, SameTest->f_]:=Split[Sort[data,p],f][[All,1]] By the way, if you wanted just the last elements, you would just replace 1 with -1. The next thing to note is that Sort without the optional OrderFunction is much much faster than a sort using an optional OrderFunction. In my tests, this difference may be as much as 2 orders of magnitude. For example, In[1]:= data = Table[Random[], {100000}]; In[5]:= s1 = Sort[data]; // Timing s2 = Sort[data, OrderedQ[{#1, #2}] &]; // Timing s1 === s2 Out[5]= {0.609 Second, Null} Out[6]= {44.672 Second, Null} Out[7]= True This suggests that it may be wise to organize ones data such that the standard optionless Sort performs the sort you want. On the other hand, the function Split with and without the optional SameTest function does not exhibit as pronounced a difference. With this in mind, I believe that a SortUnion function with a SameTest option, but without an OrderFunction option would be useful, as in SortUnion[data_, SameTest->f_]:=Split[Sort[data],f][[All,1]] I believe that this captures the exact same functionality as the version 3 Union function. Now for the reason why I looked into the above. There has been some discussion recently of the somewhat misleadingly named OrderedUnion function (I now think I should have named it UnsortedUnion). It turns out that a competing O(N log N) function which I wrote before became O(N^2) in version 4 because of the above change in Union. In revisiting this algorithm, I came up with an improvement which is now significantly better than the above OrderedUnion function. That is, while OrderedUnion used an O(N) algorithm, it turned out that the crossover point from my new O(N log N) function to the O(N) function was so high that the O(N) function froze the kernel. Below I give the definition of this new O(N log N) function (called UnsortedUnion) to find the union of a set while preserving the original order. In[7]:= SortUnion[data_, SameTest -> f_] := Split[Sort[data], f][[All, 1]] In[8]:= UnsortedUnion[li_List] := Module[{len = Length[li], tmp}, tmp = SortUnion[Transpose[{li, Range[len]}], SameTest -> (#1[[1]] === #2[[1]] &)]; Sort[Transpose[RotateLeft[Transpose[tmp]]]][[All, 2]] ] In[9]:= OrderedUnion[li_List] := Block[{i, Sequence}, i[n_] := (i[n] = Sequence[]; n); i /@ li] In[10]:= data = Table[Random[Integer, 40000], {40000}]; In[11]:= r1 = OrderedUnion[data]; // Timing r2 = UnsortedUnion[data]; // Timing r1 === r2 Out[11]= {5.266 Second, Null} Out[12]= {4.031 Second, Null} Out[13]= True