shuffling a large set in Mathematica
- To: mathgroup@smc.vnet.net
- Subject: [mg11553] shuffling a large set in Mathematica
- From: Daniel Lichtblau <danl@wolfram.com>
- Date: Sat, 14 Mar 1998 13:56:06 -0500
- Organization: Wolfram Research, Inc.
A few weeks ago the question was raised as to how one can randomly permute a large set efficiently. I gave a method that is quite standard and has expected complexity of O(n*log(n)) for a set of size n. This complexity comes about because we must sort a set of n values selected at random from a master set that is effectively infinite, and the best known sorting algorithms for this case have that expected complexity.

Two questions then arise. (1) Can we devise a method to select at random one element at a time? This is of interest if we are dealing blackjack from an extremely large deck. Perhaps we only wish to use, say, a fourth of it (hence we do not want to shuffle the entire deck) because we know that a player is using Mathematica to "count the cards." (2) Can we do better than O(n*log(n)) complexity?

It turns out we can do both with the same algorithm. We can get O(n) complexity for shuffling the entire deck; more generally, we get O(m) if we only want to select m elements at random without replacement. The idea is simple. We pick elements at random. Any time we select one, we record that fact in a hash table. Then if we pick it again we simply do not use it. We continue until we have selected a constant fraction of the entire deck, say 1/2. Then we consolidate the remaining elements, randomly permute them in the same way, and append the result to those already chosen. I will first show the code, then demonstrate that it is O(n) in time complexity, then show why this is so (maybe this is obvious to some of you).

Off[General::spell1]

fshuffle[ll_List, k_] := Module[
  {table, j=0, lenll=Length[ll], len, indx, result={}, rest={}, tmp={}},
  len = Ceiling[lenll*k];
  (* select len distinct elements at random; table acts as a hash table of used indices *)
  While[j<len,
    indx = Random[Integer, {1,lenll}];
    If[table[indx]=!=True,
      table[indx] = True;
      result = {result, ll[[indx]]};  (* built as a nested list, flattened by the caller *)
      j++;
    ];
  ];
  (* consolidate the unused elements and shuffle them recursively *)
  If[len<lenll,
    Do[If[table[i]=!=True, rest = {rest, ll[[i]]}], {i,lenll}];
    tmp = fshuffle[Flatten[rest], k];
  ];
  Remove[table];
  {result, tmp}
]

fastShuffle[n_Integer, k_:1/2] := Flatten[fshuffle[Range[n], k]]

We see experimentally that it is indeed O(n) for speed:

In[20]:= Timing[fs1K = fastShuffle[2^10];]
Out[20]= {0.56 Second, Null}

In[21]:= Timing[fs4K = fastShuffle[2^12];]
Out[21]= {2.17 Second, Null}

In[22]:= Timing[fs16K = fastShuffle[2^14];]
Out[22]= {9.1 Second, Null}

In[23]:= Timing[fs64K = fastShuffle[2^16];]
Out[23]= {36.33 Second, Null}

In[24]:= Timing[fs256K = fastShuffle[2^18];]
Out[24]= {150.86 Second, Null}

Let us see why this has expected complexity O(n). Say the fraction we use is k, where 0<k<1. Then in a given call to our main routine we choose only k*(#elements). Each time through the main loop we select an index at random between one and the length of the set. The probability that this index was already chosen is always less than k. Let E(n,k) denote the expected number of Random[] calls needed to successfully select k*n elements. We denote a failed selection attempt by f and a successful attempt by s. Denote the probability of a failed selection attempt, given that we have already chosen j elements, by P(f,j) (similarly use P(s,j) for a success). Clearly P(f,j)<k, hence P(s,j)>1-k, so the expected number of attempts per successful selection is less than 1/(1-k). Since we need k*n successes, this gives the upper bound E(n,k) < n*k/(1-k). A more refined estimate will be given below, but this will do for purposes of showing the expected run time is O(n).
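As an aside (my illustration, not part of the original analysis), this bound is easy to probe empirically. The sketch below, with the made-up name countAttempts, runs the same selection loop as fshuffle but merely counts the calls to Random:

countAttempts[n_Integer, k_] := Module[
  {table, chosen=0, attempts=0, goal=Ceiling[n*k], indx},
  While[chosen < goal,
    indx = Random[Integer, {1, n}];   (* one selection attempt *)
    attempts++;
    If[table[indx] =!= True,          (* index not yet used: record it *)
      table[indx] = True;
      chosen++
    ]
  ];
  Remove[table];
  attempts
]

For instance, with k = 1/2 the bound n*k/(1-k) is simply n, and countAttempts[2^14, 1/2] should come out well under 2^14 on a typical run.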
Now let us call F(n,k) the amount of "work" we need to do to select n elements, that is, to shuffle the entire deck. It is obviously a function of both n and k. Below we will assume that each of the stated operations, e.g. a call to Random or a hash-table lookup, is weighted equally. For purposes of getting a complexity bound all that matters is that each costs a constant amount, independent of n and k. We see that, up to these omitted constant factors,

F(n,k) = n              [the loop to build the 'rest' list]
       + k*n            [to build up 'result']
       + (1-k)*n        [to Flatten 'rest']
       + E(n,k)         [calls to Random]
       + E(n,k)         [hash table queries]
       + k*n            [to add elements to the hash table]
       + k*n            [to free the hash table]
       + F((1-k)*n,k)   [recursive call to shuffle 'rest']
       = 2*(n + k*n + E(n,k)) + F((1-k)*n,k)

Below we use '->' to denote "approaches for n sufficiently large." Then

F(n,k) < 2*(n + k*n + n*k/(1-k)) + F((1-k)*n,k)
       = 2*n*(1 + k + k/(1-k)) + F((1-k)*n,k)
       = 2*n*(1 + k + k/(1-k)) + 2*((1-k)*n)*(1 + k + k/(1-k)) + F((1-k)^2*n,k)
      -> 2*n*(1 + k + k/(1-k)) * Sum[(1-k)^j, {j,0,Infinity}]
       = 2*n*(1 + k + k/(1-k)) * (1/k)       (summing the geometric series)
       = 2*n*(1/k + 1 + 1/(1-k)).

Thus F(n,k) < n*constant, where the constant depends on the choice of k. Several examples will demonstrate that this algorithm seems to work best for k somewhere on the larger side of 1/2. For example:

In[4]:= Timing[fastShuffle[2^15, 1/4];]
Out[4]= {21.95 Second, Null}

In[5]:= Timing[fastShuffle[2^15, 1/3];]
Out[5]= {19.57 Second, Null}

In[6]:= Timing[fastShuffle[2^15, 2/5];]
Out[6]= {18.43 Second, Null}

In[7]:= Timing[fastShuffle[2^15, 1/2];]
Out[7]= {17.66 Second, Null}

In[8]:= Timing[fastShuffle[2^15, 3/5];]
Out[8]= {17.2 Second, Null}

In[9]:= Timing[fastShuffle[2^15, 2/3];]
Out[9]= {17.27 Second, Null}

In[10]:= Timing[fastShuffle[2^15, 3/4];]
Out[10]= {18.03 Second, Null}

Let us now refine the complexity estimate to learn more about setting the parameter k. This is just for fun, because the timings are so close over a fairly wide range of values that the time saved is hardly worth the bother. But curiosity prevails.

First note that when selecting k*n elements, the probability of a collision on the first attempt is 0, hence the expected number of attempts needed for the first selection is 1; in the previous notation, P(s,0)=1. The probability of a collision the next time is 1/n, hence P(s,1)=1-1/n=(n-1)/n, and so the expected number of attempts to select the second element is n/(n-1). Similarly the expected number of attempts needed to select the third element is n/(n-2). To choose k*n elements we thus have (using the character '~' to denote "approximately")

E(n,k) = 1 + n/(n-1) + n/(n-2) + ... + n/(n-k*n+1)
       = n*(1/n + 1/(n-1) + ... + 1/(n-k*n+1))
       ~ n*Integrate[1/x, {x, n*(1-k), n}]
       = n*(Log[n] - Log[n*(1-k)])
       = n*Log[1/(1-k)]

Hence, similar to the upper bound estimate above, we have the approximation

F(n,k) ~ 2*n*(1 + k + Log[1/(1-k)]) + F((1-k)*n,k)
      -> 2*n*(1 + k + Log[1/(1-k)]) * Sum[(1-k)^j, {j,0,Infinity}]
       = 2*n*(1 + k + Log[1/(1-k)]) * (1/k)
       = 2*n*(1 + 1/k + Log[1/(1-k)]/k)

We have again assumed equal weighting of the basic operations, and this is of course no longer valid for getting a precise optimal choice of k. Actually, the best way to optimize k is as above: run simple tests to see what value performs best for fairly large n. But then we cannot show off the equation-solving capability of the software. So we proceed. Let us find a value of k which minimizes the expression for F. It is clear that such a value must exist in the interval 0<k<1, because as k approaches either endpoint F(n,k) obviously goes to infinity. It is also clear that our approximation for F is a differentiable function of k in that interval, hence we can use basic calculus here.
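Before doing so, a small numerical aside of my own (not part of the original derivation): evaluating the k-dependent factor from the approximation at the values of k timed above already mirrors the shape of those timings. The name costFactor is mine.

costFactor[k_] := 1 + 1/k + Log[1/(1 - k)]/k   (* the factor multiplying 2*n above *)

TableForm[Map[{#, N[costFactor[#]]} &, {1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4}]]

The factor decreases steadily from k = 1/4 toward a shallow minimum between 3/5 and 3/4, much as the timings do.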
Finally, note that we can drop the factor of 2*n for our purposes.

In[12]:= f[k_] := 1 + 1/k + Log[1/(1-k)]/k

In[13]:= deriv = D[f[k],k];

In[19]:= FindRoot[deriv==0, {k,.5}]
Out[19]= {k -> 0.682156}

This would indicate that our assumptions and estimates are not entirely out of line with the observed timings. Actually, with a small amount of work we can get an explicit closed form for k in terms of the ProductLog function.

In[20]:= num = Numerator[Together[deriv]];

In[21]:= InputForm[Solve[num==0, k]]

Solve::dinv: The expression (1/(1 - k))^k involves unknowns in more
     than one argument, so inverse functions cannot be used.

Out[21]//InputForm=
Solve[1 - 2*k + Log[(1 - k)^(-1)] - k*Log[(1 - k)^(-1)] == 0, k]

No joy yet. But it looks as though we might simplify matters if we substitute a variable for k-1 (I confess it probably takes some experience with Solve to realize this path will lead somewhere). So we let j stand for k-1, set newnum = num /. k -> j+1, and solve for j.

In[24]:= soln = Solve[newnum == 0, j];

InverseFunction::ifun: Warning: Inverse functions are being used.
     Values may be lost for multivalued inverses.
Solve::ifun: Inverse functions are being used by Solve, so some
     solutions may not be found.

Now we rewrite the solutions in terms of k.

In[30]:= InputForm[candidates = (k /. k->j+1) /. soln]
Out[30]//InputForm=
{1 + ProductLog[-E^(-2)]^(-1), 1 + ProductLog[-1, -E^(-2)]^(-1)}

Let us select a solution in the correct range.

In[37]:= InputForm[soln = Select[candidates, (Positive[#] && #<1) &]]

$MaxExtraPrecision::meprec: In increasing internal precision while
     attempting to evaluate ProductLog[-E^(-2)], the limit
     $MaxExtraPrecision = 50. was reached. Increasing the value of
     $MaxExtraPrecision may help resolve the uncertainty.
$MaxExtraPrecision::meprec: In increasing internal precision while
     attempting to evaluate ProductLog[-1, -E^(-2)], the limit
     $MaxExtraPrecision = 50. was reached. Increasing the value of
     $MaxExtraPrecision may help resolve the uncertainty.

Out[37]//InputForm= {1 + ProductLog[-1, -E^(-2)]^(-1)}

The warning messages are a minor glitch, one I hope to see fixed in the future. Let us check that our solution agrees with the FindRoot result.

In[38]:= N[soln]
Out[38]= {0.682156}

Success.

The algorithm will always successfully shuffle the deck; its running time is O(n) in expectation but in principle can be worse. Such algorithms are known as Las Vegas algorithms. Actually, the probability that it will take appreciably longer is vanishingly small; I just wanted a cheap excuse to use the phrase "Las Vegas algorithm" in a note about shuffling cards.

For comparison, here is a simple O(n^2) shuffle; the crossover between it and fastShuffle happens at around twenty thousand elements in our development version, and far lower in all previous versions.

sshuffle[ll_List] := Module[
  {deck=ll, result={}, rnd, j, size=Length[ll]},
  For[j=size, j>0, j--,
    rnd = Random[Integer, j-1] + 1;   (* random position among the j remaining cards *)
    result = {result, deck[[rnd]]};
    deck = Drop[deck, {rnd}];         (* this Drop is what makes the routine O(n^2) *)
  ];
  result
]

slowShuffle[n_Integer] := Flatten[sshuffle[Range[n]]]

In[31]:= Timing[ss1K = slowShuffle[2^10];]
Out[31]= {0.21 Second, Null}

In[32]:= Timing[ss4K = slowShuffle[2^12];]
Out[32]= {1.13 Second, Null}

In[33]:= Timing[ss16K = slowShuffle[2^14];]
Out[33]= {7.43 Second, Null}

The timings above are from our development version. They are, in my opinion, quite fast for this algorithm. Indeed, the O(n^2) behavior does not really become apparent until we have around 2^16 elements, because the list manipulation operations are so well optimized for large lists of machine integers.
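(A quick sanity check of my own, not in the original note: both routines do return genuine permutations, and on a tiny deck the results look uniform. The name firsts is just for illustration.)

Sort[fastShuffle[1000]] === Range[1000]   (* should return True *)
Sort[slowShuffle[1000]] === Range[1000]   (* should return True *)

(* which card comes first in a 4-card deck, over 1000 trials? *)
firsts = Table[First[fastShuffle[4]], {1000}];
Map[Count[firsts, #]&, Range[4]]          (* each count should land in the vicinity of 250 *)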
Of course, it is a simple matter to make a hybrid algorithm. We merely add the line

fshuffle[ll_List, k_] /; Length[ll]<=crossover := sshuffle[ll]

where crossover has been tuned by experiment.

Note that while fastShuffle is very good with respect to question (1), in actual practice it is not an improvement over the O(n*log(n)) algorithm for permuting the entire set until we reach extremely large values of n. To see this, we can try for example

mediumShuffle[n_] :=
  Map[Last, Sort[Transpose[{Table[Random[],{n}], Range[n]}]]]  (* sort by random keys: O(n*log(n)) *)

In[26]:= Timing[ms16K = mediumShuffle[2^14];]
Out[26]= {1.02 Second, Null}

In[27]:= Timing[ms64K = mediumShuffle[2^16];]
Out[27]= {4.63 Second, Null}

In[28]:= Timing[ms256K = mediumShuffle[2^18];]
Out[28]= {20.04 Second, Null}

The crossover will happen at sizes in the range of several million: no small deck! But of course fastShuffle can easily be modified to select fewer than n elements, whereas mediumShuffle must permute the entire set.

We used Mathematica to implement an efficient shuffle algorithm and to test it against some other algorithms. We tuned the algorithm experimentally based on these test runs. We then used Mathematica and some basic calculus to approximate the "optimum" value for the algorithm's tuning parameter k. Is this great software or what? (Okay, maybe I have a bias.)

Daniel Lichtblau
Wolfram Research