Re: Compilation: Avoiding inlining

• To: mathgroup at smc.vnet.net
• Subject: [mg121840] Re: Compilation: Avoiding inlining
• From: DmitryG <einschlag at gmail.com>
• Date: Tue, 4 Oct 2011 02:44:31 -0400 (EDT)
• Delivered-to: l-mathgroup@mail-archive0.wolfram.com
• References: <201109250234.WAA26203@smc.vnet.net> <j5s88c\$pf1\$1@smc.vnet.net>

```On Sep 27, 6:24 am, Oliver Ruebenkoenig <ruebe... at wolfram.com> wrote:
> On Sat, 24 Sep 2011,DmitryGwrote:
> > On Sep 23, 3:49 am, Oliver Ruebenkoenig <ruebe... at wolfram.com> wrote:
> >> On Thu, 22 Sep 2011,DmitryGwrote:
> >>> On Sep 21, 5:41 am,DmitryG<einsch... at gmail.com> wrote:
> >>>> On Sep 20, 6:08 am, David Bailey <d... at removedbailey.co.uk> wrote:
>
> >>>>> On 16/09/2011 12:08, Oliver Ruebenkoenig wrote:
>
> >>>>>> On Fri, 16 Sep 2011,DmitryGwrote:
>
> >>>>>>> Here is a program with (* Definition of the equations *) made one=
-
> >>>>>>> step, that performs the same as in my previous post..............=
......
>
> >>>>> I tried pasting your example into Mathematica, but unfortunately th=
ere
> >>>>> seems to be a variable 'x' which is undefined - presumably some inp=
ut
> >>>>> data. It might be worth posting a complete example, so that people =
can
> >>>>> explore how to get decent performance.
>
> >>>>> Error:
>
> >>>>> Part::partd: "Part specification x[[1]] is longer than depth of obj=
ect. "
>
> >>>>> David Baileyhttp://www.dbaileyconsultancy.co.uk
>
> >>>> Hi David,
>
> >>>> The RK4 procedure here works with the solution vector x whose initia=
l
> >>>> value is defined after the RK4 procedure and the equations are
> >>>> defined. Mathematica does not know the lenght of x at the beginning
> >>>> and this is why it complains. You can ignore these complaints. Of th=
e
> >>>> several codes posted above, that in Oliver's 19 September post is th=
e
> >>>> best because it is fully compiled. Here is this code with plotting t=
he
> >>>> solution:
>
> >>>> ***************************************
> >>>> (* Runge-Kutta-4 routine *)
> >>>> ClearAll[makeCompRK]
> >>>> makeCompRK[f_] :=
> >>>>  Compile[{{x0, _Real, 1}, {t0, _Real}, {tMax, _Real}, {n, _Integer=
}},
> >>>>   Module[{h, K1, K2, K3, K4, SolList, x = x0, t}, h = (tMax - =
t0)/n;
> >>>>    SolList = Table[x0, {n + 1}];
> >>>>    Do[t = t0 + k h;
> >>>>     K1 = h f[t, x];
> >>>>     K2 = h f[t + (1/2) h, x + (1/2) K1];
> >>>>     K3 = h f[t + (1/2) h, x + (1/2) K2];
> >>>>     K4 = h f[t + h, x + K3];
> >>>>     x = x + (1/6) K1 + (1/3) K2 + (1/3) K3 + (1/6) K4;
> >>>>     SolList[[k + 1]] = x, {k, 1, n}];
> >>>>    SolList](*,Parallelization->True*), CompilationTarget -> "C",
> >>>>   CompilationOptions -> {"InlineCompiledFunctions" -> True}]
>
> >>>> (* Defining equations *)
> >>>> NN = 1000;
> >>>> cRHS = With[{NN = NN}, Compile[{{t, _Real, 0}, {x, _Real, 1}},
> >>>>     Table[-x[[i]]*
> >>>>       Sin[0.1 t]^2/(1 +
> >>>>          100 Sum[x[[i + j]], {j, 1, Min[4, NN - i]}]^2), {=
i,1, NN}]
> >>>> (*,
> >>>>     CompilationTarget->"C"*)(*,
> >>>>     CompilationOptions->{"InlineExternalDefinitions"->True}*)]];
>
> >>>> (*Compilation*)
> >>>> tt0 = AbsoluteTime[];
> >>>> Timing[RK4Comp = makeCompRK[cRHS];]
> >>>> AbsoluteTime[] - tt0
> >>>> (*CompilePrint[RK4Comp2]*)
>
> >>>> (*Setting parameters and Calculation*)
> >>>> x0 = Table[
> >>>>   RandomReal[{0, 1}], {i, 1, NN}]; t0 = 0; tMax = 300; n = 5=
00;
> >>>> tt0 = AbsoluteTime[];
> >>>> Sol = RK4Comp[x0, t0, tMax, n];
> >>>> AbsoluteTime[] - tt0
>
> >>>> Print["Compilation: ", Developer`PackedArrayQ@Sol]
>
> >>>> (* Plotting *)
> >>>> tList = Table[1. t0 + (tMax - t0) k/n, {k, 0, n}];
> >>>> x1List = Transpose[{tList, Transpose[Sol][[1]]}];
> >>>> x2List = Transpose[{tList, Transpose[Sol][[2]]}];
> >>>> x3List = Transpose[{tList, Transpose[Sol][[3]]}];
> >>>> ListLinePlot[{x1List, x2List, x3List}, PlotMarkers -> Automatic,
> >>>>  PlotStyle -> {Blue, Green, Red}, PlotRange -> {0, 1}]
>
> >>>> Best,
>
> >>>> Dmitry
>
> >>> The execution time of the program above on my laptop today is 1.0 for
> >>> compilation RK4 in Mathematica and  0.24 for compilation RK4 in C. =
For
>
> >> You get some further speed up is you give the
>
> >> , "RuntimeOptions" -> "Speed"
>
> >> option to makeCompRK.
>
> >>> other compiled functions, it does not matter if the compilation targe=
t
> >>> is C or Mathematica (why?).
>
> >>> In my actual research program, I have a lot of external definitions
> >>> and all of them have to be compiled. To model this situation, I have
> >>> rewritten the same program with external definitions as follows:
>
> >>> ............
>
> >>> (* Defining equations *)
> >>> NN = 1000;
> >>> ff = Compile[{{t}}, Sin[0.1 t]^2];
> >>> su = With[{NN = NN}, Compile[{{i, _Integer}, {x, _Real, 1}}, Sum[=
x[[i
> >>> + j]], {j, 1, Min[4, NN - i]}]]];
> >>> cRHS = With[{NN = NN}, Compile[{{t}, {x, _Real, 1}},Table[-
> >>> x[[i]]*ff[t]/(1 + 100 su[i, x]^2), {i, 1, NN}, CompilationOptions ->
> >>> {"InlineExternalDefinitions" -> True, "InlineCompiledFunctions" ->
> >>> True}]];
> >>> ...................................................
>
> >>> Now the execution time is 2.2 for compilation in Mathematica and 1.42
> >>> for compilation in C. We see there is a considerable slowdown because
> >>> of the external definition (actually because of su). I wonder why doe=
s
> >>> it happen?
>
> >> If you look at CompilePrint[cRHS] you will see a CopyTensor that is in=
the
> >> loop. That causes the slowdown.
>
> >> With some further optimizations, you could write
>
> >> su2 = With[{NN = NN},
> >>     Compile[{{x, _Real, 1}},
> >>      Table[Sum[x[[i + j]], {j, 1, Min[4, NN - i]}], {i, 1, NN}]]=
];
>
> >> cRHS2 = With[{NN = NN},
> >>     Compile[{{t, _Real, 0}, {x, _Real, 1}}, -x*
> >>       ff[t]/(1 + 100*su2[x]^2)
> >>      , CompilationTarget -> "C"
> >>      , CompilationOptions -> {"InlineExternalDefinitions" -> Tru=
e,
> >>        "InlineCompiledFunctions" -> True}]];
>
> >> Then, the CopyTensor is outside of the loop.
>
> >> Why is there a CopyTensor in the first place? Because su could be evil=
and
> >> (e.g. via a call to MainEvaluate to a function that has Attribute Hold=
All)
> >> change the value of the argument. I have to see if that could be avoid=
ed.
> >> I'll send an email if I find something.
>
> >>   The external definitions are compiled and inlined in the
>
> >>> compiled code of RK4Comp, thus, to my understanding the execution tim=
e
> >>> should be the same. What is wrong here?
>
> >> I think there might be another cave canem: The expression optimizer th=
at is
> >> called by the compiler may not be able to optimize as much if there ar=
e
> >> several function calls instead of one.
>
> >> Oliver
>
> >>> Dmitry
>
> > Thank you Oliver!  Great ideas, as usual!
>
> > I was able to rewrite my actual research programs in this way and they
> > run faster.
>
> > A potentially very important question: I have noticed that the program
> > we are discussing, when compiled in C, runs on both cores of my
>
> Only when compiled to C? You could try to set Parallelization->False
> and/or it might be that MKL runs some stuff in parallel.
>
> Try
>
> SetSystemOptions["MKLThreads" -> 1] and see if that helps.
>
> > processor. No parallelization options have been set, so what is it?
> > Automatic parallelization by the C compiler (I have Microsoft visual
> > under Windows 7) ?  Do you have this effect on your computer?
>
> I can not test that since I use Linux/gcc.
>
> > However, the programs of a different type, such as my research
> > program, still run on one core of the processor. I don't see what
> > makes the compiled program run in different ways, because they are
> > written similarly.
>
> I understand that you'd want to compare the generated code with the
> handwritten code on the same number of threads but I can not resist to
> point out that the parallelization of the C++ code is something that need=
s
> to be developed but that parallelization via Mathematica come at almost
>
> On a completely different note, here is another approach that could be
> taken.
>
> CCodeGenerator/tutorial/CodeGeneration
>
> Oliver

This behavior is not new to me. Calculating matrix exponentials also
leads to a 100% processor usage on multiprocessor computers without
any parallelization. I have observed it on my Windows 7 laptop and on
a Mac Pro at work. The system monitor shows that only one Mathematica
kernel is working but the load of this kernel is much greater than
100%, especially on the Mac Pro that has 8 cores. My laptop may switch
off (because of overheating?) during such calculations while the Mac
is OK.

Also I've seen such a behavior solving PDEs with NDSolve in some
cases.

I wonder what is happening and I do not know whether this effect is
good or bad. As I cannot control it, I cannot measure if such an
extensive processor usage leads to a speed-up.

I am going to get Mathematica for Linux and test it there, too.

Best,

Dmitry

```

• Prev by Date: Fully vectorized system of ODE's - any advantage of C?
• Next by Date: Re: Explicitly specifying the 3d viewing options (pan, rotate, etc.)
• Previous by thread: Re: Fully vectorized system of ODE's - any advantage of C?
• Next by thread: Re: Compilation: Avoiding inlining