MathGroup Archive: September 2011 [00566]

[Date Index] [Thread Index] [Author Index]
Re: Compilation: Avoiding inlining
To: mathgroup at smc.vnet.net
Subject: [mg121724] Re: Compilation: Avoiding inlining
From: Oliver Ruebenkoenig <ruebenko at wolfram.com>
Date: Tue, 27 Sep 2011 06:22:08 -0400 (EDT)
Delivered-to: l-mathgroup@mail-archive0.wolfram.com
References: <201109250234.WAA26203@smc.vnet.net>
On Sat, 24 Sep 2011, DmitryG wrote:

> On Sep 23, 3:49 am, Oliver Ruebenkoenig <ruebe... at wolfram.com> wrote:
>> On Thu, 22 Sep 2011, DmitryG wrote:
>>> On Sep 21, 5:41 am, DmitryG <einsch... at gmail.com> wrote:
>>>> On Sep 20, 6:08 am, David Bailey <d... at removedbailey.co.uk> wrote:
>>
>>>>> On 16/09/2011 12:08, Oliver Ruebenkoenig wrote:
>>
>>>>>> On Fri, 16 Sep 2011, DmitryG wrote:
>>
>>>>>>> Here is a program with (* Definition of the equations *) made one-
>>>>>>> step, that performs the same as in my previous post....................
>>
>>>>> I tried pasting your example into Mathematica, but unfortunately there
>>>>> seems to be a variable 'x' which is undefined - presumably some input
>>>>> data. It might be worth posting a complete example, so that people can
>>>>> explore how to get decent performance.
>>
>>>>> Error:
>>
>>>>> Part::partd: "Part specification x[[1]] is longer than depth of object. "
>>
>>>>> David Baileyhttp://www.dbaileyconsultancy.co.uk
>>
>>>> Hi David,
>>
>>>> The RK4 procedure here works with the solution vector x whose initial
>>>> value is defined after the RK4 procedure and the equations are
>>>> defined. Mathematica does not know the lenght of x at the beginning
>>>> and this is why it complains. You can ignore these complaints. Of the
>>>> several codes posted above, that in Oliver's 19 September post is the
>>>> best because it is fully compiled. Here is this code with plotting the
>>>> solution:
>>
>>>> ***************************************
>>>> (* Runge-Kutta-4 routine *)
>>>> ClearAll[makeCompRK]
>>>> makeCompRK[f_] :=
>>>>  Compile[{{x0, _Real, 1}, {t0, _Real}, {tMax, _Real}, {n, _Integer}},
>>>>   Module[{h, K1, K2, K3, K4, SolList, x = x0, t}, h = (tMax - t0)/n;
>>>>    SolList = Table[x0, {n + 1}];
>>>>    Do[t = t0 + k h;
>>>>     K1 = h f[t, x];
>>>>     K2 = h f[t + (1/2) h, x + (1/2) K1];
>>>>     K3 = h f[t + (1/2) h, x + (1/2) K2];
>>>>     K4 = h f[t + h, x + K3];
>>>>     x = x + (1/6) K1 + (1/3) K2 + (1/3) K3 + (1/6) K4;
>>>>     SolList[[k + 1]] = x, {k, 1, n}];
>>>>    SolList](*,Parallelization->True*), CompilationTarget -> "C",
>>>>   CompilationOptions -> {"InlineCompiledFunctions" -> True}]
>>
>>>> (* Defining equations *)
>>>> NN = 1000;
>>>> cRHS = With[{NN = NN}, Compile[{{t, _Real, 0}, {x, _Real, 1}},
>>>>     Table[-x[[i]]*
>>>>       Sin[0.1 t]^2/(1 +
>>>>          100 Sum[x[[i + j]], {j, 1, Min[4, NN - i]}]^2), {i,1, NN}]
>>>> (*,
>>>>     CompilationTarget->"C"*)(*,
>>>>     CompilationOptions->{"InlineExternalDefinitions"->True}*)]];
>>
>>>> (*Compilation*)
>>>> tt0 = AbsoluteTime[];
>>>> Timing[RK4Comp = makeCompRK[cRHS];]
>>>> AbsoluteTime[] - tt0
>>>> (*CompilePrint[RK4Comp2]*)
>>
>>>> (*Setting parameters and Calculation*)
>>>> x0 = Table[
>>>>   RandomReal[{0, 1}], {i, 1, NN}]; t0 = 0; tMax = 300; n = 500;
>>>> tt0 = AbsoluteTime[];
>>>> Sol = RK4Comp[x0, t0, tMax, n];
>>>> AbsoluteTime[] - tt0
>>
>>>> Print["Compilation: ", Developer`PackedArrayQ@Sol]
>>
>>>> (* Plotting *)
>>>> tList = Table[1. t0 + (tMax - t0) k/n, {k, 0, n}];
>>>> x1List = Transpose[{tList, Transpose[Sol][[1]]}];
>>>> x2List = Transpose[{tList, Transpose[Sol][[2]]}];
>>>> x3List = Transpose[{tList, Transpose[Sol][[3]]}];
>>>> ListLinePlot[{x1List, x2List, x3List}, PlotMarkers -> Automatic,
>>>>  PlotStyle -> {Blue, Green, Red}, PlotRange -> {0, 1}]
>>
>>>> Best,
>>
>>>> Dmitry
>>
>>> The execution time of the program above on my laptop today is 1.0 for
>>> compilation RK4 in Mathematica and  0.24 for compilation RK4 in C. For
>>
>> You get some further speed up is you give the
>>
>> , "RuntimeOptions" -> "Speed"
>>
>> option to makeCompRK.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>> other compiled functions, it does not matter if the compilation target
>>> is C or Mathematica (why?).
>>
>>> In my actual research program, I have a lot of external definitions
>>> and all of them have to be compiled. To model this situation, I have
>>> rewritten the same program with external definitions as follows:
>>
>>> ............
>>
>>> (* Defining equations *)
>>> NN = 1000;
>>> ff = Compile[{{t}}, Sin[0.1 t]^2];
>>> su = With[{NN = NN}, Compile[{{i, _Integer}, {x, _Real, 1}}, Sum[x[[i
>>> + j]], {j, 1, Min[4, NN - i]}]]];
>>> cRHS = With[{NN = NN}, Compile[{{t}, {x, _Real, 1}},Table[-
>>> x[[i]]*ff[t]/(1 + 100 su[i, x]^2), {i, 1, NN}, CompilationOptions ->
>>> {"InlineExternalDefinitions" -> True, "InlineCompiledFunctions" ->
>>> True}]];
>>> ...................................................
>>
>>> Now the execution time is 2.2 for compilation in Mathematica and 1.42
>>> for compilation in C. We see there is a considerable slowdown because
>>> of the external definition (actually because of su). I wonder why does
>>> it happen?
>>
>> If you look at CompilePrint[cRHS] you will see a CopyTensor that is in the
>> loop. That causes the slowdown.
>>
>> With some further optimizations, you could write
>>
>> su2 = With[{NN = NN},
>>     Compile[{{x, _Real, 1}},
>>      Table[Sum[x[[i + j]], {j, 1, Min[4, NN - i]}], {i, 1, NN}]]];
>>
>> cRHS2 = With[{NN = NN},
>>     Compile[{{t, _Real, 0}, {x, _Real, 1}}, -x*
>>       ff[t]/(1 + 100*su2[x]^2)
>>      , CompilationTarget -> "C"
>>      , CompilationOptions -> {"InlineExternalDefinitions" -> True,
>>        "InlineCompiledFunctions" -> True}]];
>>
>> Then, the CopyTensor is outside of the loop.
>>
>> Why is there a CopyTensor in the first place? Because su could be evil and
>> (e.g. via a call to MainEvaluate to a function that has Attribute HoldAll)
>> change the value of the argument. I have to see if that could be avoided.
>> I'll send an email if I find something.
>>
>>   The external definitions are compiled and inlined in the
>>
>>> compiled code of RK4Comp, thus, to my understanding the execution time
>>> should be the same. What is wrong here?
>>
>> I think there might be another cave canem: The expression optimizer that is
>> called by the compiler may not be able to optimize as much if there are
>> several function calls instead of one.
>>
>> Oliver
>>
>>
>>
>>
>>
>>
>>
>>> Dmitry
>
> Thank you Oliver!  Great ideas, as usual!
>
> I was able to rewrite my actual research programs in this way and they
> run faster.
>
> A potentially very important question: I have noticed that the program
> we are discussing, when compiled in C, runs on both cores of my


Only when compiled to C? You could try to set Parallelization->False 
and/or it might be that MKL runs some stuff in parallel.

Try

SetSystemOptions["MKLThreads" -> 1] and see if that helps.

> processor. No parallelization options have been set, so what is it?
> Automatic parallelization by the C compiler (I have Microsoft visual
> under Windows 7) ?  Do you have this effect on your computer?
>

I can not test that since I use Linux/gcc.

> However, the programs of a different type, such as my research
> program, still run on one core of the processor. I don't see what
> makes the compiled program run in different ways, because they are
> written similarly.
>

I understand that you'd want to compare the generated code with the 
handwritten code on the same number of threads but I can not resist to 
point out that the parallelization of the C++ code is something that needs 
to be developed but that parallelization via Mathematica come at almost 
not additional cost.

On a completely different note, here is another approach that could be 
taken.

CCodeGenerator/tutorial/CodeGeneration

Oliver
References:
- Re: Compilation: Avoiding inlining
  - From: DmitryG <einschlag@gmail.com>
Prev by Date: Re: Constrain locator
Next by Date: algebraic simplification
Previous by thread: Re: Compilation: Avoiding inlining
Next by thread: Nonlinearregress with symbolic partial differential equation and integration