Re: Compilation: Avoiding inlining
- To: mathgroup at smc.vnet.net
- Subject: [mg121724] Re: Compilation: Avoiding inlining
- From: Oliver Ruebenkoenig <ruebenko at wolfram.com>
- Date: Tue, 27 Sep 2011 06:22:08 -0400 (EDT)
- Delivered-to: l-mathgroup@mail-archive0.wolfram.com
- References: <201109250234.WAA26203@smc.vnet.net>
On Sat, 24 Sep 2011, DmitryG wrote:

> On Sep 23, 3:49 am, Oliver Ruebenkoenig <ruebe... at wolfram.com> wrote:
>> On Thu, 22 Sep 2011, DmitryG wrote:
>>> On Sep 21, 5:41 am, DmitryG <einsch... at gmail.com> wrote:
>>>> On Sep 20, 6:08 am, David Bailey <d... at removedbailey.co.uk> wrote:
>>>>> On 16/09/2011 12:08, Oliver Ruebenkoenig wrote:
>>>>>> On Fri, 16 Sep 2011, DmitryG wrote:
>>>>>>> Here is a program with (* Definition of the equations *) made
>>>>>>> one-step, that performs the same as in my previous post....................
>>>>>
>>>>> I tried pasting your example into Mathematica, but unfortunately there
>>>>> seems to be a variable 'x' which is undefined - presumably some input
>>>>> data. It might be worth posting a complete example, so that people can
>>>>> explore how to get decent performance.
>>>>>
>>>>> Error:
>>>>>
>>>>> Part::partd: "Part specification x[[1]] is longer than depth of object. "
>>>>>
>>>>> David Bailey
>>>>> http://www.dbaileyconsultancy.co.uk
>>>>
>>>> Hi David,
>>>>
>>>> The RK4 procedure here works with the solution vector x, whose initial
>>>> value is defined after the RK4 procedure and the equations are
>>>> defined. Mathematica does not know the length of x at the beginning
>>>> and this is why it complains. You can ignore these complaints. Of the
>>>> several codes posted above, that in Oliver's 19 September post is the
>>>> best because it is fully compiled.
>>>> Here is this code with plotting the solution:
>>>>
>>>> ***************************************
>>>> (* Runge-Kutta-4 routine *)
>>>> ClearAll[makeCompRK]
>>>> makeCompRK[f_] :=
>>>>  Compile[{{x0, _Real, 1}, {t0, _Real}, {tMax, _Real}, {n, _Integer}},
>>>>   Module[{h, K1, K2, K3, K4, SolList, x = x0, t}, h = (tMax - t0)/n;
>>>>    SolList = Table[x0, {n + 1}];
>>>>    Do[t = t0 + k h;
>>>>     K1 = h f[t, x];
>>>>     K2 = h f[t + (1/2) h, x + (1/2) K1];
>>>>     K3 = h f[t + (1/2) h, x + (1/2) K2];
>>>>     K4 = h f[t + h, x + K3];
>>>>     x = x + (1/6) K1 + (1/3) K2 + (1/3) K3 + (1/6) K4;
>>>>     SolList[[k + 1]] = x, {k, 1, n}];
>>>>    SolList](*,Parallelization->True*), CompilationTarget -> "C",
>>>>   CompilationOptions -> {"InlineCompiledFunctions" -> True}]
>>>>
>>>> (* Defining equations *)
>>>> NN = 1000;
>>>> cRHS = With[{NN = NN}, Compile[{{t, _Real, 0}, {x, _Real, 1}},
>>>>     Table[-x[[i]]*
>>>>       Sin[0.1 t]^2/(1 +
>>>>          100 Sum[x[[i + j]], {j, 1, Min[4, NN - i]}]^2), {i, 1, NN}]
>>>>     (*,
>>>>     CompilationTarget->"C"*)(*,
>>>>     CompilationOptions->{"InlineExternalDefinitions"->True}*)]];
>>>>
>>>> (*Compilation*)
>>>> tt0 = AbsoluteTime[];
>>>> Timing[RK4Comp = makeCompRK[cRHS];]
>>>> AbsoluteTime[] - tt0
>>>> (*CompilePrint[RK4Comp2]*)
>>>>
>>>> (*Setting parameters and Calculation*)
>>>> x0 = Table[
>>>>   RandomReal[{0, 1}], {i, 1, NN}]; t0 = 0; tMax = 300; n = 500;
>>>> tt0 = AbsoluteTime[];
>>>> Sol = RK4Comp[x0, t0, tMax, n];
>>>> AbsoluteTime[] - tt0
>>>>
>>>> Print["Compilation: ", Developer`PackedArrayQ@Sol]
>>>>
>>>> (* Plotting *)
>>>> tList = Table[1. t0 + (tMax - t0) k/n, {k, 0, n}];
>>>> x1List = Transpose[{tList, Transpose[Sol][[1]]}];
>>>> x2List = Transpose[{tList, Transpose[Sol][[2]]}];
>>>> x3List = Transpose[{tList, Transpose[Sol][[3]]}];
>>>> ListLinePlot[{x1List, x2List, x3List}, PlotMarkers -> Automatic,
>>>>  PlotStyle -> {Blue, Green, Red}, PlotRange -> {0, 1}]
>>>>
>>>> Best,
>>>>
>>>> Dmitry
>>>
>>> The execution time of the program above on my laptop today is 1.0 for
>>> compilation RK4 in Mathematica and 0.24 for compilation RK4 in C. For
>>
>> You get some further speed up if you give the
>>
>> , "RuntimeOptions" -> "Speed"
>>
>> option to makeCompRK.
>>
>>> other compiled functions, it does not matter if the compilation target
>>> is C or Mathematica (why?).
>>>
>>> In my actual research program, I have a lot of external definitions
>>> and all of them have to be compiled. To model this situation, I have
>>> rewritten the same program with external definitions as follows:
>>>
>>> ............
>>>
>>> (* Defining equations *)
>>> NN = 1000;
>>> ff = Compile[{{t}}, Sin[0.1 t]^2];
>>> su = With[{NN = NN}, Compile[{{i, _Integer}, {x, _Real, 1}},
>>>     Sum[x[[i + j]], {j, 1, Min[4, NN - i]}]]];
>>> cRHS = With[{NN = NN}, Compile[{{t}, {x, _Real, 1}},
>>>     Table[-x[[i]]*ff[t]/(1 + 100 su[i, x]^2), {i, 1, NN}],
>>>     CompilationOptions -> {"InlineExternalDefinitions" -> True,
>>>       "InlineCompiledFunctions" -> True}]];
>>> ...................................................
>>>
>>> Now the execution time is 2.2 for compilation in Mathematica and 1.42
>>> for compilation in C. We see there is a considerable slowdown because
>>> of the external definition (actually because of su). I wonder why this
>>> happens.
>>
>> If you look at CompilePrint[cRHS] you will see a CopyTensor that is in the
>> loop. That causes the slowdown.
>>
>> With some further optimizations, you could write
>>
>> su2 = With[{NN = NN},
>>    Compile[{{x, _Real, 1}},
>>     Table[Sum[x[[i + j]], {j, 1, Min[4, NN - i]}], {i, 1, NN}]]];
>>
>> cRHS2 = With[{NN = NN},
>>    Compile[{{t, _Real, 0}, {x, _Real, 1}}, -x*
>>      ff[t]/(1 + 100*su2[x]^2)
>>     , CompilationTarget -> "C"
>>     , CompilationOptions -> {"InlineExternalDefinitions" -> True,
>>        "InlineCompiledFunctions" -> True}]];
>>
>> Then, the CopyTensor is outside of the loop.
>>
>> Why is there a CopyTensor in the first place? Because su could be evil and
>> (e.g. via a call to MainEvaluate to a function that has Attribute HoldAll)
>> change the value of the argument. I have to see if that could be avoided.
>> I'll send an email if I find something.
>>
>>> The external definitions are compiled and inlined in the
>>> compiled code of RK4Comp, thus, to my understanding the execution time
>>> should be the same. What is wrong here?
>>
>> I think there might be another cave canem: The expression optimizer that is
>> called by the compiler may not be able to optimize as much if there are
>> several function calls instead of one.
>>
>> Oliver
>>
>>> Dmitry
>
> Thank you Oliver! Great ideas, as usual!
>
> I was able to rewrite my actual research programs in this way and they
> run faster.
>
> A potentially very important question: I have noticed that the program
> we are discussing, when compiled in C, runs on both cores of my

Only when compiled to C? You could try to set Parallelization->False
and/or it might be that MKL runs some stuff in parallel.

Try

SetSystemOptions["MKLThreads" -> 1]

and see if that helps.

> processor. No parallelization options have been set, so what is it?
> Automatic parallelization by the C compiler (I have Microsoft visual
> under Windows 7) ? Do you have this effect on your computer?
>

I can not test that since I use Linux/gcc.

> However, the programs of a different type, such as my research
> program, still run on one core of the processor. I don't see what
> makes the compiled program run in different ways, because they are
> written similarly.
>

I understand that you'd want to compare the generated code with the
handwritten code on the same number of threads, but I cannot resist
pointing out that the parallelization of the C++ code is something that
needs to be developed, whereas parallelization via Mathematica comes at
almost no additional cost.

On a completely different note, here is another approach that could be
taken.

CCodeGenerator/tutorial/CodeGeneration

Oliver
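
The CopyTensor check discussed in this thread can be reproduced with a sketch along the following lines. It is a minimal illustration, not a benchmark. It assumes the definitions of ff, su, cRHS, su2 and cRHS2 quoted above, drops CompilationTarget -> "C" so that no C compiler is needed, and gives ff an explicit _Real argument type. The test point 0.3 and the name xTest are hypothetical. What the listings show may differ between Mathematica versions.

```mathematica
(* CompilePrint lives in the CompiledFunctionTools` package *)
Needs["CompiledFunctionTools`"]

NN = 1000;
ff = Compile[{{t, _Real}}, Sin[0.1 t]^2];

(* Element-wise helper: called once per i inside the Table loop *)
su = With[{NN = NN},
   Compile[{{i, _Integer}, {x, _Real, 1}},
    Sum[x[[i + j]], {j, 1, Min[4, NN - i]}]]];

cRHS = With[{NN = NN},
   Compile[{{t, _Real}, {x, _Real, 1}},
    Table[-x[[i]]*ff[t]/(1 + 100 su[i, x]^2), {i, 1, NN}],
    CompilationOptions -> {"InlineExternalDefinitions" -> True,
      "InlineCompiledFunctions" -> True}]];

(* Whole-vector helper: called once, outside any loop in cRHS2 *)
su2 = With[{NN = NN},
   Compile[{{x, _Real, 1}},
    Table[Sum[x[[i + j]], {j, 1, Min[4, NN - i]}], {i, 1, NN}]]];

cRHS2 = With[{NN = NN},
   Compile[{{t, _Real}, {x, _Real, 1}},
    -x*ff[t]/(1 + 100*su2[x]^2),
    CompilationOptions -> {"InlineExternalDefinitions" -> True,
      "InlineCompiledFunctions" -> True}]];

(* Inspect both listings and look for CopyTensor instructions,
   noting whether they sit inside or outside the loop body *)
CompilePrint[cRHS]
CompilePrint[cRHS2]

(* Sanity check: the two formulations should agree up to rounding *)
xTest = RandomReal[{0, 1}, NN];
Max[Abs[cRHS[0.3, xTest] - cRHS2[0.3, xTest]]]
```

Comparing the two listings side by side is usually enough to see whether the copy happens once per call or once per loop iteration.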
- References:
- Re: Compilation: Avoiding inlining
- From: DmitryG <einschlag@gmail.com>
- Re: Compilation: Avoiding inlining