MathGroup Archive: November 1999 [00091]

[Date Index] [Thread Index] [Author Index]
Re: Intersection and element counts
To: mathgroup at smc.vnet.net
Subject: [mg20677] Re: [mg20615] Intersection and element counts
From: "Wolf, Hartmut" <hwolf at debis.com>
Date: Sun, 7 Nov 1999 02:10:10 -0500
Organization: debis Systemhaus
References: <199911040713.CAA02652@smc.vnet.net.>
Sender: owner-wri-mathgroup at wolfram.com
Arturas Acus schrieb:
> 
> Dear Group
> 
> I am interesting in intersecion, which takes into
> account the number of the same elements.
> 
> It is I would like
> 
> myIntersection[{a,a,a,b,b,c},{a,a,b,b}]
> 
> to give me {a,a,b,b}.
> 
> Any solutions?
> Do theoretical speed of this function is
> very different of usual intersection algorithm speed?
> 

Dear Arturas,

I tried several ideas, which just came to my mind, and tested them. 
(This is of course not a definitive study, just the result my hacking
yesterday evening, there will be more interesting ideas showing up in
this usegroup - for shure!)

l1 = {a, a, a, b, b, c}; l2 = {a, a, b, b};

As you did in your example, I assumed that both lists of (any
expressions) are in "canonical order"

The Functions considered:
-------------------------
(1) we transform the lists to lists of {expr, count}, 
    e.g. l1 -> {{a, 3}, {b, 2}, {c, 1}};
    Intersect these, by using only the first component in SameTest 
    and calculate the "intersection" count,
    then transform back to the original type of representation

yourIntersection[1][l1_, l2_] := Module[{ul1, ul2, icount},
{ul1, ul2} = Map[{First[#], Length[#]} &, Split[#]] & /@ {l1, l2};
Flatten[Table[#1, {#2}] & @@@ 
        Transpose[{First /@ (icount = {}; 
                Intersection[ul1, ul2, 
                  SameTest -> ((If[First[#1] == First[#2], 
                            AppendTo[icount, Min[Last[#1], Last[#2]]];
True, 
                            False]) &)]), icount}], 1]    ]
Off[List::smtst]
(* why this has to be done I don't know *)

yourIntersection[1][l1, l2]
{a, a, b, b}


(2) we try to use built-in Intersection, by adjoining to each element 
    its position in the repetition, so there ar no more identical
elements,
    e.g. l1->{{{a, 1}, {a, 2}, {a, 3}}, {{b, 1}, {b, 2}}, {{c, 1}}},
    Intersect and strip off the adornments.

yourIntersection[2][l1_, l2_] := 
  Module[{un1, un2}, {un1, un2} = 
      Flatten[Transpose[{ #, Range[Length[#]]}] & /@ #, 1] & /@ 
        Split /@ {l1, l2}; Transpose[Intersection[un1, un2]][[1]]    ]

yourIntersection[2][l1, l2]
{a, a, b, b}


(3) a variation of (2), we use a different method for making the
elements
    singular, the idea is to avoid Split and such have one scan less

yourIntersection[3][l1_, l2_] := 
  Module[{un1, un2, cnt}, 
    With[{f = If[First[#1] === #2, {#2, ++cnt}, {#2, cnt = 1}] &}, {un1, 
          un2} = (cnt = 1; FoldList[f, {First[#], 1}, Rest[#] ] &) /@
{l1, 
            l2};
      Transpose[Intersection[un1, un2]][[1]]    ]]

yourIntersection[3][l1, l2]
{a, a, b, b}


(4) simple procedural merge (since input is in canonical order),
    use the function "Order" for testing

yourIntersection[4][l1_, l2_] := 
  With[{len1 = Length[l1], len2 = Length[l2]}, 
    Module[{t1 = 1, t2 = 1, result = {}, obj}, 
      While[t1 <= len1 && t2 <= len2, 
        Switch[Order[(obj = l1[[t1]]), l2[[t2]]],
          1, t1++,
          0, (AppendTo[result, obj]; t1++; t2++),
          -1, t2++]];
      result]]

yourIntersection[4][l1, l2]
{a, a, b, b}


(5) another idea: get all elements in both lists (Union) and 
    count the elements in both lists including zero count,
    e.g. l2 -> {2, 2, 0}, use simple count function;
    use these lists (minimum for each element) as directive to 
    generate the result from the list of all objects

anotherIntersection[l1_, l2_] := 
  With[{objs = Union[l1, l2]}, 
    With[{c1 = countFunction[objs, l1], c2 = countFunction[objs, l2]}, 
      MapThread[Sequence @@ Table[#1, {Min[#2, #3]}] &, {objs, c1, c2}] 
]]

cntF1 = Function[{objs, inlist}, Count[inlist, #] & /@ objs];

yourIntersection[5][l1_, l2_] := Block[{countFunction = cntF1},
anotherIntersection[l1, l2]]

yourIntersection[5][l1, l2]
{a, a, b, b}

(6) I suspect the count function, write my own

cntF2 = Function[{objs, inlist}, 
      With[{len0 = Length[objs], len1 = Length[inlist]}, 
        Module[{t0 = 1, t1 = 1, result = {}, c = 0}, 
          While[t0 <= 
              len0, (While[t1 <= len1 && inlist[[t1]] === objs[[t0]],
c++; 
                t1++]; AppendTo[result, c]; c = 0; t0++)]; result]]];

yourIntersection[6][l1_, l2_] := Block[{countFunction = cntF2},
anotherIntersection[l1, l2]]

yourIntersection[6][l1, l2]
{a, a, b, b}

(7) yet  another idea: we Split both lists, then Join and Sort them, and 
    look for pairs of list with identical elements, output the short of
these.
    Otherwise we drag along the information for the next comparison step
    by wrapping with "no", finally delete these spurious data

yourIntersection[7][l1_, l2_] := 
  Module[{sjl, no}, 
    With[{ff = 
          If[First[#1] === First[#2], 
              If[Length[#1] <= Length[#2], List @@ #1, #2], no @@ #2]
&}, 
      sjl = Sort[Join[Split[l1], Split[l2]], 
          Order[First[#1], First[#2]] == 1 &]; 
      Flatten[DeleteCases[FoldList[ff, no[Null], sjl], _no], 1]    ]]
  
yourIntersection[7][l1, l2]
{a, a, b, b}


The Tests:
----------

We test several orders of mangitude of the input, first n=100:

{xl1, xl2} = 
  Table[With[{n = 100}, 
        MapThread[
          Table[#1, {#2}] &, {Range[n], 
            Table[Floor[6 Exp[-Random[Real, {0., 3.5}]]], {n}]}]] // 
      Flatten, {2}];

{{1, 1, 1, 1, 2, 2, 2, 2, 3, 4, 7, 7, 7, 8, 8, 8, 8, 8, 10, 10, 10, 10,
13, 
    14, 14, 14, 14, 14, 16, 17, 18, 18, 18, 22, 22, 22, 26, 26, 26, 26,
27, 
    27, 27, 28, 28, 29, 29, 29, 30, 30, 30, 31, 31, 31, 31, 31, 32, 32,
32, 
    33, 33, 33, 34, 34, 34, 34, 35, 35, 35, 35, 36, 38, 38, 38, 38, 39,
41, 
    44, 44, 44, 45, 47, 47, 47, 51, 51, 51, 51, 51, 52, 53, 55, 55, 55,
55, 
    55, 56, 56, 56, 57, 57, 58, 61, 62, 63, 65, 65, 65, 65, 65, 66, 71,
72, 
    72, 72, 72, 72, 75, 77, 78, 79, 81, 81, 81, 81, 82, 82, 82, 82, 82,
83, 
    87, 87, 87, 87, 90, 90, 91, 92, 95, 95, 95, 95, 95, 96, 97, 97, 97}, 
{2, 2, 2, 3, 3, 4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 9, 9, 9, 10, 12, 14, 14,
14, 15,
     15, 19, 19, 24, 24, 24, 24, 24, 27, 29, 29, 29, 29, 34, 34, 36, 37,
37, 
    38, 38, 38, 38, 40, 43, 43, 45, 45, 45, 49, 51, 51, 53, 54, 54, 56,
56, 
    56, 56, 56, 57, 61, 61, 61, 61, 63, 63, 64, 64, 65, 66, 69, 69, 71,
71, 
    76, 76, 77, 79, 79, 80, 80, 80, 81, 81, 81, 83, 85, 85, 85, 85, 85,
87, 
    87, 88, 88, 90, 91, 91, 93, 99, 99, 99, 99, 99}}

(I had to play around with the constants (6 and 3.5) to get something
"pleasant")

Length /@ {xl1, xl2}
{148, 108}

Evaluate[Function[x, Hold[Timing[x[##]] ]] /@ Array[yourIntersection,
7]] &[
    xl1, xl2] // ReleaseHold

{{0.29 Second, -same as below-}, 
 {0.02 Second, -same as below-}, 
 {0.06 Second, -same as below-}, 
 {0.05 Second, -same as below-}, 
 {0.07 Second, -same as below-}, 
 {0.11 Second, -same as below-}, 
 {0.071 Second, {2, 2, 2, 3, 4, 8, 8, 10, 14, 14, 14, 27, 29, 29, 
      29, 34, 34, 36, 38, 38, 38, 38, 45, 51, 51, 53, 56, 56, 56, 57,
61, 63, 
      65, 66, 71, 77, 79, 81, 81, 81, 83, 87, 87, 90, 91}}}

Length[%[[-1, -1]]]
45

to better compare, let's extract the Timing

{0.28 Second, 0.03 Second, 0.04 Second, 0.05 Second, 0.07 Second, 
   0.11 Second, 0.071 Second}

Next is n=1000:

Length /@ ({xxl1, xxl2} = 
      Table[With[{n = 1000}, -code as above-])
{1201, 1217}

Evaluate[Function[x, Hold[Timing[x[##]][[1]] ]] /@ 
          Array[yourIntersection, 7]] &[xxl1, xxl2] // ReleaseHold

{26.949 Second, 0.26 Second, 0.361 Second, 0.581 Second, 0.56 Second, 
  1.483 Second, 0.941 Second}


Finally n=10000:

Length /@ ({xxxl1, xxxl2} = 
      Table[With[{n = 1000}, -code as above-])
{12029, 11629}

We eliminate the first function from the contest

Evaluate[Function[x, Hold[Timing[x[##]][[1]] ]] /@ 
          Drop[Array[yourIntersection, 7], 1]] &[xxxl1, xxxl2] //
ReleaseHold

{-not calculated-, 2.493 Second, 3.465 Second, 12.859 Second, 19.939
Second, 
  46.196 Second, 10.014 Second}


A short discussion:
-------------------

As often, the first idea was not the best one; #1 is worst and seemingly
O[n^2].
But why? One might spot 'AppendTo[icount, Min[Last[#1], Last[#2]]]' as
being the source of the imperformance and O[n^2] behaviour. Hoewever
this is not true! I get quite the same timing if I replace that with
icount={icount, Min[Last[#1], Last[#2]]} and Flatten the result (be
cautious if you have general expressions instead of numbers in my test,
as not to Flatten them too). My experience is, that AppendTo (now, in
Version 4 ??) has become pretty fast if you don't access to the list in
between repeated appends. (My impression is, Wolfram, they (now?) use
delayed evaluation, so some past arguments have to be reconsidered.) I
cannot spot the source for O[n^2] behaviour, Split, Map, Transpose,
Flatten are clearly O[n], Intersection might be O[n log n] but *could*
be O[n] for our input. Of course SameTest slows down Intersection, but
should not switch to another class of complexity.

What is really suspect is, although the correct result is being
produced, I had to switch off spurious error messages:
Off[List::smtst].  Am I doing something wrong? (if you can see that,
please tell me!), or is there some bug in Intersection?

The other algorithms which don't appear to be linear are #5 and #6 (the
status of # 4 is not quite clear from the measurements). I attribute
this to that Union operation, which should behave as O[n log n]. To my
surprise #5 is better that #6, it countains Count[inlist, #] & /@ objs]
as counting function. Well I think Wolfram did a good job for Count and
used a similar idea than I did here, but they are somewhat nearer to the
heart of the machine. (That should make Count of O[n log n] or even O[n]
for our (canonically ordered) input.)

I'm clearly disappointed from #4. I wrote that in order to have a linear
algorithm for certain (ref. to my remarks concerning AppendTo above),
obviously it's too *procedural*, too far off the machine language.

The score of #7 however is better than I expected; the Sort inside slows
it to O[n log n] *or* the internal Sort algorithm makes use of the
partial ordering state after the Join of two ordered list (?).

#2 proved to be the best one (only of *these* of course), but Split,
Transpose, Flatten, Range and "naked" Intersection are of the best, and
this came not very much to a surprise.

To replace in #3 Range by *procedural* code in order to save one
(suspected) scan through the data didn't pay off.

And, of course you may compile one or the other function, but I'm
leaving that for Carl ;-)

Kind regards, Hartmut
References:
- Intersection and element counts
  - From: "Arturas Acus" <acus@itpa.lt>
Prev by Date: Re: TeX -> Mathematica
Next by Date: Re: Solution of this equation
Previous by thread: Intersection and element counts
Next by thread: Re: Intersection and element counts