RE (long): Re: Increase in efficiency with Module

*To*: mathgroup at smc.vnet.net*Subject*: [mg40143] RE (long): [mg40117] Re: Increase in efficiency with Module*From*: "Wolf, Hartmut" <Hartmut.Wolf at t-systems.com>*Date*: Sat, 22 Mar 2003 05:08:59 -0500 (EST)*Sender*: owner-wri-mathgroup at wolfram.com

My dear friends, yet another stroke of genius from Carl Woll. (If you want to have a look of more, just scan the archive!) We earthlings, however, might question ourselves: how only can we arrive at such an idea? Perhaps it's helpful to show, in little, reason-able, steps how to earn 2 cents. (See below; as this thread has become rather long now, to make it more readable, I've edited the line breaks, and snipped off most; I apologize to Bobby Treat.) >-----Original Message----- >From: Carl K. Woll [mailto:carlw at u.washington.edu] To: mathgroup at smc.vnet.net >Sent: Friday, March 21, 2003 8:37 AM >To: mathgroup at smc.vnet.net >Subject: [mg40143] [mg40117] Re: Increase in efficiency with Module > > >Hi, > >Since this thread seems to have transformed itself into a >speed contest, I thought I would throw in my 2 cents. >Here is my attempt: > >h[a_]:=(s+=Tr[Sign[a-First[a]]];Rest[a]) > >carl[t_]:=Block[{s=0}, Nest[h,t,Length[t]]; s] > > >Comparing to ftauc3PP (the fastest so far according to Bobby) >we have: > >test=Table[Random[],{1000}]; > >In[105]:= >ftauc3PP[test]//Timing >carl[test]//Timing >Out[105]= >{0.265 Second, 9362} >Out[106]= >{0.094 Second, 9362} > >So, my uncompiled code is almost 3 times faster than ftauc3PP. > >Carl Woll >Physics Dept >U of Washington > [...] >> >> On Tue, 18 Mar 2003 02:21:27 -0500 (EST), Wolf, Hartmut ><Hartmut.Wolf at t-systems.com> wrote: >> >> >> >>> >> >>>> -----Original Message----- >> >>>> From: Aaron E. Hirsh [mailto:aehirsh at stanford.edu] To: mathgroup at smc.vnet.net >To: mathgroup at smc.vnet.net >> >>> To: mathgroup at smc.vnet.net >> >>>> Sent: Monday, March 17, 2003 9:35 AM >> >>>> To: mathgroup at smc.vnet.net >> >>>> Subject: [mg40143] [mg40117] Increase in efficiency with Module >> >>>> >> >>>> >> >>>> I would be grateful if someone could explain the difference >> >>>> in efficiency between the following two simple programs. >> >>>> ry is a list of length n. >> >>>> For n = 1000, the program using Module takes >> >>>> 1 second on my laptop, whereas the program >> >>>> without Module takes 75 seconds. >> >>>> >> >>>> ftauc = Compile[{{ry, _Real, 1}, {n, _Real, 0}}, >> >>>> Module[{i, j, a}, i = 1; a = 0; >> >>>> Do[ j = i + 1; >> >>>> Do[ >> >>>> If[ry[[i]] < ry[[j]], a++, >> >>>> If[ry[[i]] > ry[[j]], a--]]; >> >>>> j++, {n - i}]; >> >>>> i++, {n - 1}]; a >> >>>> ]] >> >>>> >> >>>> ftauc2 = Compile[{{ry, _Real, 1}, {n, _Real, 0}}, >> >>>> i = 1; >> >>>> a = 0; >> >>>> Do[j = i + 1; >> >>>> Do[ >> >>>> If[ry[[i]] < ry[[j]], a++, >> >>>> If[ry[[i]] > ry[[j]], a--]]; >> >>>> j++, {n - i}]; >> >>>> i++, {n - 1}]; >> >>>> a] >> >>>> >> >>>> thank you, >> >>>> -- >> >>>> >> >>>> Aaron E. Hirsh >> >>>> Center for Computational Genetics and Biological Modeling >> >>>> Department of Biological Sciences >> >>>> Stanford University >> >>>> tel. (650) 723-4952 >> >>>> fax. (650) 725-0180 >> >>>> >> >>> >> >>> Aaron, >> >>> >> >>> if you inspect the compiled code ftauc[[4]] and >> >>> ftauc2[[4]] you see considerable differences >> >>> which are obviously caused by access to >> >>> the global variables i, j and a (in ftauc). >> >>> I cannot go through theses statements >> >>> line by line, merely refer to >> >>> http://library.wolfram.com/database/Conferences/4682/ >> >>> >> >>> Although, this appears to be out of date now, a key >> >>> sentence therein still holds: >> >>> "The main obstacle to achieving a speedup ... is when the >> >>> compiled code has to externally evaluate a function". >> >>> From the compiled code for ftauc2, I suppose, >> >>> that access and updates of the global variables >> >>> i,j,a is done by calling an (external) function. >> >>> >> >>> Further: "In structures such as Module or Block, local >> >>> variables can be declared. The compiled code >> >>> will store value of such a variable in a >> >>> register, which can then be updated whenever the >> >>> variable gets a new value. >> >>> >> >>> >> >>> Such use local variables! >> >>> >> >>> [...] [I had snipped this off, bring it back now:] >> >>> >> >>> Functional code is better though: >> >>> >> >>> Plus @@ Flatten[ >> >>> MapThread[ >> >>> Thread[Unevaluated[Order[#1, #2]]] &, >> >>> {ry, NestList[Rest, ry, Length[ry] - 1]}]] >> >>> [...] In[3]:= test = Table[Random[], {1000}]; To find this code, lets look at the list operations first, in an abstracted manner. Simply take this for a list: In[4]:= r = MapIndexed[(-1)^First[#2] #1 &, Range[5]] Out[4]= {-1, 2, -3, 4, -5} Now what have we to process? We have to compare each element of the list with any element following that list, i.e. with any element of In[5]:= rr = NestList[Rest, r, Length[r] - 1] Out[5]= {{-1, 2, -3, 4, -5}, {2, -3, 4, -5}, {-3, 4, -5}, {4, -5}, {-5}} In[6]:= MapThread[f, {r, rr}] Out[6]= {f[-1, {-1, 2, -3, 4, -5}], f[2, {2, -3, 4, -5}], f[-3, {-3, 4, -5}], f[4, {4, -5}], f[-5, {-5}]} and each of these expressions must be threaded, as e.g. In[7]:= Thread[f[1, {1, 2, 3, 4, 5}]] Out[7]= {f[1, 1], f[1, 2], f[1, 3], f[1, 4], f[1, 5]} This put together: In[8]:= MapThread[Thread[f[#1, #2]] &, {r, rr}] Out[8]= {{f[-1, -1], f[-1, 2], f[-1, -3], f[-1, 4], f[-1, -5]}, {f[2, 2], f[2, -3], f[2, 4], f[2, -5]}, {f[-3, -3], f[-3, 4], f[-3, -5]}, {f[4, 4], f[4, -5]}, {f[-5, -5]}} What we did was not quite correct, all terms f[ j, j] are not wanted, more so this: In[9]:= MapThread[Thread[f[#1, #2]] &, {Drop[r, {-1}], Drop[rr, {1}]}] Out[9]= {{f[-1, 2], f[-1, -3], f[-1, 4], f[-1, -5]}, {f[2, -3], f[2, 4], f[2, -5]}, {f[-3, 4], f[-3, -5]}, {f[4, -5]}} What do we have to put in? Well, what I did, was Order, giving In[10]:= Order[#, 2] & /@ {1, 2, 3} Out[10]= {1, 0, -1} In[11]:= MapThread[Thread[f[#1, #2]] &, {Drop[r, {-1}], Drop[rr, {1}]}] /. f -> Order Out[11]= {{1, -1, 1, -1}, {-1, 1, -1}, {1, -1}, {-1}} We just to sum up all that: In[12]:= Plus @@ Flatten[%] Out[12]= -2 Or perhaps better In[13]:= Apply[Plus, %%, {0, 1}] Out[13]= -2 Now we cannot just simply insert Order for f, as within Thread, e.g. In[14]:= Order[-1, {2, -3, 4, -5}] Out[14]= 1 becomes evaluated prematurely. So we have to prevent this with Unevaluated. In[15]:= MapThread[ Thread[Unevaluated[Order[#1, #2]]] &, {Drop[r, {-1}], Drop[rr, {1}]}] /. f -> Order Out[15]= {{1, -1, 1, -1}, {-1, 1, -1}, {1, -1}, {-1}} Now as Order[j, j] == 0 we can dispense with the Drop-ping In[16]:= MapThread[ Thread[Unevaluated[Order[#1, #2]]] &, {r, NestList[Rest, r, Length[r] - 1]}] Out[16]= {{0, 1, -1, 1, -1}, {0, -1, 1, -1}, {0, 1, -1}, {0, -1}, {0}} ...and such arrive at In[17]:= Plus @@ Flatten[ MapThread[ Thread[Unevaluated[Order[#1, #2]]] &, {test, NestList[Rest, test, Length[test] - 1]}]] // Timing Out[17]= {3.095 Second, 12110} This certainly is not bad. But, what I had missed here, and Bobby Treat did not, is to search for a more specialized function for Order, when the arguments are Reals (or Integers), namely for In[18]:= Sign[2 - #1] & /@ {1, 2, 3} Out[18]= {1, 0, -1} In[19]:= Plus @@ Flatten[ MapThread[ Thread[Unevaluated[Sign[#2 - #1]]] &, {test, NestList[Rest, test, Length[test] - 1]}]] // Timing Out[19]= {1.141 Second, 12110} What a difference! But for Sign, we have In[20]:= Attributes[Sign] Out[20]= {Listable, NumericFunction, Protected} Such we don't nead threading In[21]:= MapThread[Sign[#2 - #1] &, {r, NestList[Rest, r, Length[r] - 1]}] Out[21]= {{0, 1, -1, 1, -1}, {0, -1, 1, -1}, {0, 1, -1}, {0, -1}, {0}} and again threading with MapThread isn't needed In[22]:= Sign[#2 - #1] & @@ {r, NestList[Rest, r, Length[r] - 1]} Out[22]= {{0, 1, -1, 1, -1}, {0, -1, 1, -1}, {0, 1, -1}, {0, -1}, {0}} In[23]:= Plus @@ Flatten@Sign[#2 - #1] & @@ {test, NestList[Rest, test, Length[test] - 1]} // Timing Out[23]= {1.202 Second, 12110} Staring steadily at this expression, we might see, that instead of using the outer Thread (or MapThread), we just might map over the second argument, and retrieve the former first argument to Sign, by taking just the first element of that. In[24]:= Sign[#1 - First[#1]] & /@ NestList[Rest, r, Length[r] - 1] Out[24]= {{0, 1, -1, 1, -1}, {0, -1, 1, -1}, {0, 1, -1}, {0, -1}, {0}} This is preferable as the input structure is simpler, one argument less and map over one list, instead of threading over two lists (at the outer level). Also, as stated above, Flatten isn't needed if we apply Plus to both levels: In[25]:= Apply[Plus, Sign[# - First[#]] & /@ NestList[Rest, test, Length[test] - 1], {0, 1}] // Timing Out[25]= {0.861 Second, 12110} This already is better as best compiled code (and becomes even better as problem size increases). Now there comes something special, which we simply have to know (and this has been reported in this list): Trace on large (linear) lists is much more effective than Apply-ing Plus, see: In[26]:= test2 = Table[Random[], {500000}]; In[27]:= Plus @@ test2 // Timing Out[27]= {1.342 Second, 250348.} In[28]:= Tr[test2] // Timing Out[28]= {0.02 Second, 250348.} In[29]:= Clear[test2] We are eager to exploit this In[30]:= Tr[Tr[Sign[# - First[#]]] & /@ NestList[Rest, test, Length[test] - 1]] // Timing Out[30]= {0.311 Second, 12110} In[31]:= carl[test] // Timing Out[31]= {0.33 Second, 12110} Bingo!! We got him, faster! Deplorably.... In[44]:= test3 = Table[Random[], {10000}]; In[46]:= Tr[Tr[Sign[# - First[#]]] & /@ NestList[Rest, test3, Length[test3] - 1]] // Timing >From In[46]:= No more memory available. Mathematica kernel has shut down. Try quitting other applications and then retry. Carl's program is much more effcient on memory than ours! What can we do better? If we only could put somehow the Sign function into the nesting operation, we would not need NestList to map over, but only Nest. As we finally just sum over the result of map, we could instead already sum up when nesting. (This is a "design pattern".) So what all we have to do, is to drag with the partial sums. A way to do this, is to reserve an element of the nested structure to accumulate, as in: In[32]:= First[ Nest[{#[[1]] + Tr[Sign[#[[2]] - First[#[[2]]]]], Rest[#[[2]]]} &, {0, test}, Length[test] ]] // Timing Out[32]= {0.321 Second, 12110} In[33]:= carl[test] // Timing Out[33]= {0.321 Second, 12110} Now Carl didn't do this, but instead used a local symbol to accumulate In[34]:= Block[{c = 0}, Nest[(c += Tr[Sign[# - First[#]]]; Rest[#]) &, test, Length[test] ]; c] // Timing Out[34]= {0.321 Second, 12110} This is Carl's code. Finis -- Hartmut Wolf