shuffling a large set in Mathematica
- To: mathgroup@smc.vnet.net
- Subject: [mg11553] shuffling a large set in Mathematica
- From: Daniel Lichtblau <danl@wolfram.com>
- Date: Sat, 14 Mar 1998 13:56:06 -0500
- Organization: Wolfram Research, Inc.
A few weeks ago the question was raised as to how one can randomly permute a large set efficiently. I gave a method that is quite standard and has expected complexity of O(n*log(n)) for a set of size n. This complexity comes about because we must sort a set of n values selected at random from a master set that is effectively infinite, and the best known sorting algorithms for this case have that expected complexity.

Two questions then arise. (1) Can we devise a method to select at random one element at a time? This is of interest if we are dealing blackjack from an extremely large deck. Perhaps we only wish to use, say, a fourth of it (hence we do not want to shuffle the entire deck) because we know that a player is using Mathematica to "count the cards." (2) Can we do better than O(n*log(n)) complexity?

It turns out we can do both with the same algorithm. We can get O(n) complexity for shuffling the entire deck; more generally, we get O(m) if we only want to select m elements at random without replacement. The idea is simple. We pick elements at random. Any time we select one, we record that fact in a hash table. Then if we pick it again we simply do not use it. We continue until we have selected a constant fraction of the entire deck, say 1/2. Then we consolidate the remaining elements, randomly permute them in the same way, and append the result to those already chosen. I will first show the code, then demonstrate that it is O(n) in time complexity, then show why this is so (maybe this is obvious to some of you).

Off[General::spell1]

fshuffle[ll_List, k_] := Module[
  {table, j=0, lenll=Length[ll], len, indx, result={}, rest={}, tmp={}},
  len = Ceiling[lenll*k];
  (* select len distinct elements at random; table acts as a hash table of used indices *)
  While[j<len,
    indx = Random[Integer, {1,lenll}];
    If[table[indx]=!=True,
      table[indx] = True;
      result = {result, ll[[indx]]};  (* built as a nested list, flattened by the caller *)
      j++;
    ];
  ];
  (* consolidate the unused elements and shuffle them recursively *)
  If[len<lenll,
    Do[If[table[i]=!=True, rest = {rest, ll[[i]]}], {i,lenll}];
    tmp = fshuffle[Flatten[rest], k];
  ];
  Remove[table];
  {result, tmp}
]

fastShuffle[n_Integer, k_:1/2] := Flatten[fshuffle[Range[n], k]]

We see experimentally that it is indeed O(n) for speed:

In[20]:= Timing[fs1K = fastShuffle[2^10];]
Out[20]= {0.56 Second, Null}

In[21]:= Timing[fs4K = fastShuffle[2^12];]
Out[21]= {2.17 Second, Null}

In[22]:= Timing[fs16K = fastShuffle[2^14];]
Out[22]= {9.1 Second, Null}

In[23]:= Timing[fs64K = fastShuffle[2^16];]
Out[23]= {36.33 Second, Null}

In[24]:= Timing[fs256K = fastShuffle[2^18];]
Out[24]= {150.86 Second, Null}

Let us see why this has expected complexity O(n). Say the fraction we use is k, where 0<k<1. Then in a given call to our main routine we choose only k*(#elements). Each time through the main loop we select an index at random between one and the length of the set. The probability that this index was already chosen is always less than k. Let E(n,k) denote the expected number of Random[] calls needed to successfully select k*n elements. We denote a failed selection attempt by f and a successful attempt by s. Denote the probability of a failed selection attempt, given that we have already chosen j elements, by P(f,j) (similarly use P(s,j) for a success). Clearly P(f,j)<k, hence P(s,j)>1-k, so the expected number of attempts per successful selection is less than 1/(1-k). Since we need k*n successes, this gives the upper bound E(n,k) < n*k/(1-k). A more refined estimate will be given below, but this will do for purposes of showing the expected run time is O(n).
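As an aside (my illustration, not part of the original analysis), this bound is easy to probe empirically. The sketch below, with the made-up name countAttempts, runs the same selection loop as fshuffle but merely counts the calls to Random:

countAttempts[n_Integer, k_] := Module[
  {table, chosen=0, attempts=0, goal=Ceiling[n*k], indx},
  While[chosen < goal,
    indx = Random[Integer, {1, n}];   (* one selection attempt *)
    attempts++;
    If[table[indx] =!= True,          (* index not yet used: record it *)
      table[indx] = True;
      chosen++
    ]
  ];
  Remove[table];
  attempts
]

For instance, with k = 1/2 the bound n*k/(1-k) is simply n, and countAttempts[2^14, 1/2] should come out well under 2^14 on a typical run.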
Now let us call F(n,k) the amount of "work" we need to do to select n elements, that is, to shuffle the entire deck. It is obviously a function of both n and k. Below we will assume that each of the stated operations, e.g. a call to Random or a hash-table lookup, is weighted equally. For purposes of getting a complexity bound all that matters is that each costs a constant amount, independent of n and k. We see that, up to these omitted constant factors,

F(n,k) = n              [the loop to build the 'rest' list]
       + k*n            [to build up 'result']
       + (1-k)*n        [to Flatten 'rest']
       + E(n,k)         [calls to Random]
       + E(n,k)         [hash table queries]
       + k*n            [to add elements to the hash table]
       + k*n            [to free the hash table]
       + F((1-k)*n,k)   [recursive call to shuffle 'rest']
       = 2*(n + k*n + E(n,k)) + F((1-k)*n,k)

Below we use '->' to denote "approaches for n sufficiently large." Then

F(n,k) < 2*(n + k*n + n*k/(1-k)) + F((1-k)*n,k)
       = 2*n*(1 + k + k/(1-k)) + F((1-k)*n,k)
       = 2*n*(1 + k + k/(1-k)) + 2*((1-k)*n)*(1 + k + k/(1-k)) + F((1-k)^2*n,k)
      -> 2*n*(1 + k + k/(1-k)) * Sum[(1-k)^j, {j,0,Infinity}]
       = 2*n*(1 + k + k/(1-k)) * (1/k)       (summing the geometric series)
       = 2*n*(1/k + 1 + 1/(1-k)).

Thus F(n,k) < n*constant, where the constant depends on the choice of k. Several examples will demonstrate that this algorithm seems to work best for k somewhere on the larger side of 1/2. For example:

In[4]:= Timing[fastShuffle[2^15, 1/4];]
Out[4]= {21.95 Second, Null}

In[5]:= Timing[fastShuffle[2^15, 1/3];]
Out[5]= {19.57 Second, Null}

In[6]:= Timing[fastShuffle[2^15, 2/5];]
Out[6]= {18.43 Second, Null}

In[7]:= Timing[fastShuffle[2^15, 1/2];]
Out[7]= {17.66 Second, Null}

In[8]:= Timing[fastShuffle[2^15, 3/5];]
Out[8]= {17.2 Second, Null}

In[9]:= Timing[fastShuffle[2^15, 2/3];]
Out[9]= {17.27 Second, Null}

In[10]:= Timing[fastShuffle[2^15, 3/4];]
Out[10]= {18.03 Second, Null}

Let us now refine the complexity estimate to learn more about setting the parameter k. This is just for fun, because the timings are so close over a fairly wide range of values that the time saved is hardly worth the bother. But curiosity prevails.

First note that when selecting k*n elements, the probability of a collision on the first attempt is 0, hence the expected number of attempts needed for the first selection is 1; in the previous notation, P(s,0)=1. The probability of a collision the next time is 1/n, hence P(s,1)=1-1/n=(n-1)/n, and so the expected number of attempts to select the second element is n/(n-1). Similarly the expected number of attempts needed to select the third element is n/(n-2). To choose k*n elements we thus have (using the character '~' to denote "approximately")

E(n,k) = 1 + n/(n-1) + n/(n-2) + ... + n/(n-k*n+1)
       = n*(1/n + 1/(n-1) + ... + 1/(n-k*n+1))
       ~ n*Integrate[1/x, {x, n*(1-k), n}]
       = n*(Log[n] - Log[n*(1-k)])
       = n*Log[1/(1-k)]

Hence, similar to the upper bound estimate above, we have the approximation

F(n,k) ~ 2*n*(1 + k + Log[1/(1-k)]) + F((1-k)*n,k)
      -> 2*n*(1 + k + Log[1/(1-k)]) * Sum[(1-k)^j, {j,0,Infinity}]
       = 2*n*(1 + k + Log[1/(1-k)]) * (1/k)
       = 2*n*(1 + 1/k + Log[1/(1-k)]/k)

We have again assumed equal weighting of the basic operations, and this is of course no longer valid for getting a precise optimal choice of k. Actually, the best way to optimize k is as above: run simple tests to see what value performs best for fairly large n. But then we cannot show off the equation-solving capability of the software. So we proceed. Let us find a value of k which minimizes the expression for F. It is clear that such a value must exist in the interval 0<k<1, because as k approaches either endpoint F(n,k) obviously goes to infinity. It is also clear that our approximation for F is a differentiable function of k in that interval, hence we can use basic calculus here.
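Before doing so, a small numerical aside of my own (not part of the original derivation): evaluating the k-dependent factor from the approximation at the values of k timed above already mirrors the shape of those timings. The name costFactor is mine.

costFactor[k_] := 1 + 1/k + Log[1/(1 - k)]/k   (* the factor multiplying 2*n above *)

TableForm[Map[{#, N[costFactor[#]]} &, {1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4}]]

The factor decreases steadily from k = 1/4 toward a shallow minimum between 3/5 and 3/4, much as the timings do.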
Finally, note that we can drop the factor of 2*n for our purposes.

In[12]:= f[k_] := 1 + 1/k + Log[1/(1-k)]/k

In[13]:= deriv = D[f[k],k];

In[19]:= FindRoot[deriv==0, {k,.5}]
Out[19]= {k -> 0.682156}

This would indicate that our assumptions and estimates are not entirely out of line with the observed timings. Actually, with a small amount of work we can get an explicit closed form for k in terms of the ProductLog function.

In[20]:= num = Numerator[Together[deriv]];

In[21]:= InputForm[Solve[num==0, k]]

Solve::dinv: The expression (1/(1 - k))^k involves unknowns in more
     than one argument, so inverse functions cannot be used.

Out[21]//InputForm=
Solve[1 - 2*k + Log[(1 - k)^(-1)] - k*Log[(1 - k)^(-1)] == 0, k]

No joy yet. But it looks as though we might simplify matters if we substitute a variable for k-1 (I confess it probably takes some experience with Solve to realize this path will lead somewhere). So we let j stand for k-1, set newnum = num /. k -> j+1, and solve for j.

In[24]:= soln = Solve[newnum == 0, j];

InverseFunction::ifun: Warning: Inverse functions are being used.
     Values may be lost for multivalued inverses.
Solve::ifun: Inverse functions are being used by Solve, so some
     solutions may not be found.

Now we rewrite the solutions in terms of k.

In[30]:= InputForm[candidates = (k /. k->j+1) /. soln]
Out[30]//InputForm=
{1 + ProductLog[-E^(-2)]^(-1), 1 + ProductLog[-1, -E^(-2)]^(-1)}

Let us select a solution in the correct range.

In[37]:= InputForm[soln = Select[candidates, (Positive[#] && #<1) &]]

$MaxExtraPrecision::meprec: In increasing internal precision while
     attempting to evaluate ProductLog[-E^(-2)], the limit
     $MaxExtraPrecision = 50. was reached. Increasing the value of
     $MaxExtraPrecision may help resolve the uncertainty.
$MaxExtraPrecision::meprec: In increasing internal precision while
     attempting to evaluate ProductLog[-1, -E^(-2)], the limit
     $MaxExtraPrecision = 50. was reached. Increasing the value of
     $MaxExtraPrecision may help resolve the uncertainty.

Out[37]//InputForm= {1 + ProductLog[-1, -E^(-2)]^(-1)}

The warning messages are a minor glitch, one I hope to see fixed in the future. Let us check that our solution agrees with the FindRoot result.

In[38]:= N[soln]
Out[38]= {0.682156}

Success.

The algorithm will always successfully shuffle the deck; its running time is O(n) in expectation but in principle can be worse. Such algorithms are known as Las Vegas algorithms. Actually, the probability that it will take appreciably longer is vanishingly small; I just wanted a cheap excuse to use the phrase "Las Vegas algorithm" in a note about shuffling cards.

For comparison, here is a simple O(n^2) shuffle; the crossover between it and fastShuffle happens at around twenty thousand elements in our development version, and far lower in all previous versions.

sshuffle[ll_List] := Module[
  {deck=ll, result={}, rnd, j, size=Length[ll]},
  For[j=size, j>0, j--,
    rnd = Random[Integer, j-1] + 1;   (* random position among the j remaining cards *)
    result = {result, deck[[rnd]]};
    deck = Drop[deck, {rnd}];         (* this Drop is what makes the routine O(n^2) *)
  ];
  result
]

slowShuffle[n_Integer] := Flatten[sshuffle[Range[n]]]

In[31]:= Timing[ss1K = slowShuffle[2^10];]
Out[31]= {0.21 Second, Null}

In[32]:= Timing[ss4K = slowShuffle[2^12];]
Out[32]= {1.13 Second, Null}

In[33]:= Timing[ss16K = slowShuffle[2^14];]
Out[33]= {7.43 Second, Null}

The timings above are from our development version. They are, in my opinion, quite fast for this algorithm. Indeed, the O(n^2) behavior does not really become apparent until we have around 2^16 elements, because the list manipulation operations are so well optimized for large lists of machine integers.
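(A quick sanity check of my own, not in the original note: both routines do return genuine permutations, and on a tiny deck the results look uniform. The name firsts is just for illustration.)

Sort[fastShuffle[1000]] === Range[1000]   (* should return True *)
Sort[slowShuffle[1000]] === Range[1000]   (* should return True *)

(* which card comes first in a 4-card deck, over 1000 trials? *)
firsts = Table[First[fastShuffle[4]], {1000}];
Map[Count[firsts, #]&, Range[4]]          (* each count should land in the vicinity of 250 *)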
Of course, it is a simple matter to make a hybrid algorithm. We merely add the line

fshuffle[ll_List, k_] /; Length[ll]<=crossover := sshuffle[ll]

where crossover has been tuned by experiment.

Note that while fastShuffle is very good with respect to question (1), in actual practice it is not an improvement over the O(n*log(n)) algorithm for permuting the entire set until we reach extremely large values of n. To see this, we can try for example

mediumShuffle[n_] :=
  Map[Last, Sort[Transpose[{Table[Random[],{n}], Range[n]}]]]  (* sort by random keys: O(n*log(n)) *)

In[26]:= Timing[ms16K = mediumShuffle[2^14];]
Out[26]= {1.02 Second, Null}

In[27]:= Timing[ms64K = mediumShuffle[2^16];]
Out[27]= {4.63 Second, Null}

In[28]:= Timing[ms256K = mediumShuffle[2^18];]
Out[28]= {20.04 Second, Null}

The crossover will happen at sizes in the range of several million: no small deck! But of course fastShuffle can easily be modified to select fewer than n elements, whereas mediumShuffle must permute the entire set.

We used Mathematica to implement an efficient shuffle algorithm and to test it against some other algorithms. We tuned the algorithm experimentally based on these test runs. We then used Mathematica and some basic calculus to approximate the "optimum" value for the algorithm's tuning parameter k. Is this great software or what? (Okay, maybe I have a bias.)

Daniel Lichtblau
Wolfram Research