Re: Compilation: Avoiding inlining

*To*: mathgroup at smc.vnet.net
*Subject*: [mg121840] Re: Compilation: Avoiding inlining
*From*: DmitryG <einschlag at gmail.com>
*Date*: Tue, 4 Oct 2011 02:44:31 -0400 (EDT)
*Delivered-to*: l-mathgroup@mail-archive0.wolfram.com
*References*: <201109250234.WAA26203@smc.vnet.net> <j5s88c$pf1$1@smc.vnet.net>

On Sep 27, 6:24 am, Oliver Ruebenkoenig <ruebe... at wolfram.com> wrote:
> On Sat, 24 Sep 2011, DmitryG wrote:
> > On Sep 23, 3:49 am, Oliver Ruebenkoenig <ruebe... at wolfram.com> wrote:
> >> On Thu, 22 Sep 2011, DmitryG wrote:
> >>> On Sep 21, 5:41 am, DmitryG <einsch... at gmail.com> wrote:
> >>>> On Sep 20, 6:08 am, David Bailey <d... at removedbailey.co.uk> wrote:
> >>>>> On 16/09/2011 12:08, Oliver Ruebenkoenig wrote:
> >>>>>> On Fri, 16 Sep 2011, DmitryG wrote:
> >>>>>>> Here is a program with (* Definition of the equations *) made one-step, that performs the same as in my previous post...
> >>>>> I tried pasting your example into Mathematica, but unfortunately there seems to be a variable 'x' which is undefined - presumably some input data. It might be worth posting a complete example, so that people can explore how to get decent performance.
> >>>>>
> >>>>> Error:
> >>>>> Part::partd: "Part specification x[[1]] is longer than depth of object."
> >>>>>
> >>>>> David Bailey
> >>>>> http://www.dbaileyconsultancy.co.uk
> >>>>
> >>>> Hi David,
> >>>>
> >>>> The RK4 procedure here works with the solution vector x, whose initial value is defined after the RK4 procedure and the equations are defined. Mathematica does not know the length of x at the beginning, and this is why it complains. You can ignore these complaints. Of the several codes posted above, the one in Oliver's 19 September post is the best because it is fully compiled.
> >>>> Here is this code with plotting the solution:
> >>>>
> >>>> ***************************************
> >>>> (* Runge-Kutta-4 routine *)
> >>>> ClearAll[makeCompRK]
> >>>> makeCompRK[f_] :=
> >>>>   Compile[{{x0, _Real, 1}, {t0, _Real}, {tMax, _Real}, {n, _Integer}},
> >>>>    Module[{h, K1, K2, K3, K4, SolList, x = x0, t},
> >>>>     h = (tMax - t0)/n;
> >>>>     SolList = Table[x0, {n + 1}];
> >>>>     Do[t = t0 + k h;
> >>>>      K1 = h f[t, x];
> >>>>      K2 = h f[t + (1/2) h, x + (1/2) K1];
> >>>>      K3 = h f[t + (1/2) h, x + (1/2) K2];
> >>>>      K4 = h f[t + h, x + K3];
> >>>>      x = x + (1/6) K1 + (1/3) K2 + (1/3) K3 + (1/6) K4;
> >>>>      SolList[[k + 1]] = x, {k, 1, n}];
> >>>>     SolList] (*, Parallelization -> True*), CompilationTarget -> "C",
> >>>>    CompilationOptions -> {"InlineCompiledFunctions" -> True}]
> >>>>
> >>>> (* Defining equations *)
> >>>> NN = 1000;
> >>>> cRHS = With[{NN = NN}, Compile[{{t, _Real, 0}, {x, _Real, 1}},
> >>>>     Table[-x[[i]]*
> >>>>       Sin[0.1 t]^2/(1 +
> >>>>          100 Sum[x[[i + j]], {j, 1, Min[4, NN - i]}]^2), {i, 1, NN}]
> >>>>     (*, CompilationTarget -> "C"*)
> >>>>     (*, CompilationOptions -> {"InlineExternalDefinitions" -> True}*)]];
> >>>>
> >>>> (* Compilation *)
> >>>> tt0 = AbsoluteTime[];
> >>>> Timing[RK4Comp = makeCompRK[cRHS];]
> >>>> AbsoluteTime[] - tt0
> >>>> (* CompilePrint[RK4Comp2] *)
> >>>>
> >>>> (* Setting parameters and Calculation *)
> >>>> x0 = Table[RandomReal[{0, 1}], {i, 1, NN}]; t0 = 0; tMax = 300; n = 500;
> >>>> tt0 = AbsoluteTime[];
> >>>> Sol = RK4Comp[x0, t0, tMax, n];
> >>>> AbsoluteTime[] - tt0
> >>>>
> >>>> Print["Compilation: ", Developer`PackedArrayQ@Sol]
> >>>>
> >>>> (* Plotting *)
> >>>> tList = Table[1. t0 + (tMax - t0) k/n, {k, 0, n}];
> >>>> x1List = Transpose[{tList, Transpose[Sol][[1]]}];
> >>>> x2List = Transpose[{tList, Transpose[Sol][[2]]}];
> >>>> x3List = Transpose[{tList, Transpose[Sol][[3]]}];
> >>>> ListLinePlot[{x1List, x2List, x3List}, PlotMarkers -> Automatic,
> >>>>  PlotStyle -> {Blue, Green, Red}, PlotRange -> {0, 1}]
> >>>>
> >>>> Best,
> >>>>
> >>>> Dmitry
> >>>
> >>> The execution time of the program above on my laptop today is 1.0 for RK4 compiled in Mathematica and 0.24 for RK4 compiled in C.
> >>
> >> You get some further speed-up if you give the
> >>
> >>   "RuntimeOptions" -> "Speed"
> >>
> >> option to makeCompRK.
> >>
> >>> For other compiled functions, it does not matter whether the compilation target is C or Mathematica (why?).
> >>>
> >>> In my actual research program, I have a lot of external definitions, and all of them have to be compiled. To model this situation, I have rewritten the same program with external definitions as follows:
> >>>
> >>> ............
> >>> (* Defining equations *)
> >>> NN = 1000;
> >>> ff = Compile[{{t}}, Sin[0.1 t]^2];
> >>> su = With[{NN = NN}, Compile[{{i, _Integer}, {x, _Real, 1}},
> >>>     Sum[x[[i + j]], {j, 1, Min[4, NN - i]}]]];
> >>> cRHS = With[{NN = NN}, Compile[{{t}, {x, _Real, 1}},
> >>>     Table[-x[[i]]*ff[t]/(1 + 100 su[i, x]^2), {i, 1, NN}],
> >>>     CompilationOptions -> {"InlineExternalDefinitions" -> True,
> >>>       "InlineCompiledFunctions" -> True}]];
> >>> ...................................................
> >>>
> >>> Now the execution time is 2.2 for compilation in Mathematica and 1.42 for compilation in C. We see there is a considerable slowdown because of the external definitions (actually because of su). I wonder why does it happen?
> >>
> >> If you look at CompilePrint[cRHS] you will see a CopyTensor that is in the loop. That causes the slowdown.
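[For readers following along: CompilePrint is not on the default context path; it lives in CompiledFunctionTools`. A minimal, self-contained sketch of the inspection Oliver describes, using a toy compiled function rather than the cRHS above:]

```mathematica
(* Load the inspection tools and print the compiled instruction stream *)
Needs["CompiledFunctionTools`"]

cf = Compile[{{x, _Real, 1}}, 2 x];
CompilePrint[cf]
(* Scan the printed opcode listing for CopyTensor instructions;
   one inside the Do/Table loop is the slowdown discussed here *)
```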
> >> With some further optimizations, you could write
> >>
> >> su2 = With[{NN = NN},
> >>    Compile[{{x, _Real, 1}},
> >>     Table[Sum[x[[i + j]], {j, 1, Min[4, NN - i]}], {i, 1, NN}]]];
> >>
> >> cRHS2 = With[{NN = NN},
> >>    Compile[{{t, _Real, 0}, {x, _Real, 1}},
> >>     -x*ff[t]/(1 + 100*su2[x]^2)
> >>     , CompilationTarget -> "C"
> >>     , CompilationOptions -> {"InlineExternalDefinitions" -> True,
> >>       "InlineCompiledFunctions" -> True}]];
> >>
> >> Then, the CopyTensor is outside of the loop.
> >>
> >> Why is there a CopyTensor in the first place? Because su could be evil and (e.g. via a call to MainEvaluate to a function that has Attribute HoldAll) change the value of the argument. I have to see if that could be avoided. I'll send an email if I find something.
> >>
> >>> The external definitions are compiled and inlined in the compiled code of RK4Comp; thus, to my understanding, the execution time should be the same. What is wrong here?
> >>
> >> I think there might be another cave canem: the expression optimizer that is called by the compiler may not be able to optimize as much if there are several function calls instead of one.
> >>
> >> Oliver
> >>
> >>> Dmitry
> >
> > Thank you Oliver! Great ideas, as usual!
> >
> > I was able to rewrite my actual research programs in this way and they run faster.
> >
> > A potentially very important question: I have noticed that the program we are discussing, when compiled in C, runs on both cores of my
>
> Only when compiled to C? You could try to set Parallelization -> False, and/or it might be that MKL runs some stuff in parallel.
>
> Try
>
>   SetSystemOptions["MKLThreads" -> 1]
>
> and see if that helps.
>
> > processor. No parallelization options have been set, so what is it? Automatic parallelization by the C compiler (I have Microsoft Visual under Windows 7)? Do you have this effect on your computer?
>
> I cannot test that since I use Linux/gcc.
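[The single-thread experiment Oliver suggests, written out as a sketch. "MKLThreads" is quoted from his reply above; whether the option exists in this exact form may depend on the Mathematica version:]

```mathematica
(* Pin MKL to one thread for a clean timing comparison, then restore *)
oldMKL = SystemOptions["MKLThreads"];    (* remember the current setting *)
SetSystemOptions["MKLThreads" -> 1];

(* ... re-run the timing here, e.g.
   AbsoluteTiming[Sol = RK4Comp[x0, t0, tMax, n];] ... *)

SetSystemOptions[oldMKL];                (* restore the previous setting *)
```

If the C-compiled run no longer loads both cores with one MKL thread, the extra core usage comes from MKL's internal threading rather than from the generated C code.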
> > However, the programs of a different type, such as my research program, still run on one core of the processor. I don't see what makes the compiled programs run in different ways, because they are written similarly.
>
> I understand that you'd want to compare the generated code with the handwritten code on the same number of threads, but I cannot resist pointing out that parallelization of the C++ code is something that needs to be developed, while parallelization via Mathematica comes at almost no additional cost.
>
> On a completely different note, here is another approach that could be taken:
>
> CCodeGenerator/tutorial/CodeGeneration
>
> Oliver

This behavior is not new to me. Calculating matrix exponentials also leads to a 100% processor usage on multiprocessor computers without any parallelization. I have observed it on my Windows 7 laptop and on a Mac Pro at work. The system monitor shows that only one Mathematica kernel is working, but the load of this kernel is much greater than 100%, especially on the Mac Pro, which has 8 cores. My laptop may switch off (because of overheating?) during such calculations, while the Mac is OK. I have also seen such behavior when solving PDEs with NDSolve in some cases.

I wonder what is happening, and I do not know whether this effect is good or bad. Since I cannot control it, I cannot measure whether such extensive processor usage leads to a speed-up. I am going to get Mathematica for Linux and test it there, too.

Best,

Dmitry
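P.S. [A minimal way to reproduce the matrix-exponential observation mentioned above; the matrix size is illustrative:]

```mathematica
(* Watch the system monitor while this runs: a dense MatrixExp of this
   size typically drives several cores, since the underlying linear
   algebra is handled by multithreaded library routines *)
m = RandomReal[{0, 1}, {1000, 1000}];
AbsoluteTiming[MatrixExp[m];]
```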