Re: Compilation: Avoiding inlining
- To: mathgroup at smc.vnet.net
- Subject: [mg121968] Re: Compilation: Avoiding inlining
- From: Oliver Ruebenkoenig <ruebenko at wolfram.com>
- Date: Fri, 7 Oct 2011 04:49:56 -0400 (EDT)
- Delivered-to: l-mathgroup@mail-archive0.wolfram.com
- References: <201110060820.EAA22516@smc.vnet.net>
On Thu, 6 Oct 2011, DmitryG wrote:

> On Oct 5, 4:15 am, "Oleksandr Rasputinov"
> <oleksandr_rasputi... at hmamail.com> wrote:
>> On Tue, 04 Oct 2011 07:45:30 +0100, DmitryG <einsch... at gmail.com> wrote:
>>> On Sep 27, 6:24 am, Oliver Ruebenkoenig <ruebe... at wolfram.com> wrote:
>>>> On Sat, 24 Sep 2011, DmitryG wrote:
>>>>
>>>>> A potentially very important question: I have noticed that the program
>>>>> we are discussing, when compiled to C, runs on both cores of my
>>>>
>>>> Only when compiled to C? You could try to set Parallelization -> False,
>>>> and/or it might be that MKL runs some stuff in parallel.
>>>>
>>>> Try
>>>>
>>>> SetSystemOptions["MKLThreads" -> 1]
>>>>
>>>> and see if that helps.
>>>>
>>>>> processor. No parallelization options have been set, so what is it?
>>>>> Automatic parallelization by the C compiler (I have Microsoft Visual
>>>>> under Windows 7)? Do you have this effect on your computer?
>>>>
>>>> I cannot test that, since I use Linux/gcc.
>>>>
>>>>> However, programs of a different type, such as my research
>>>>> program, still run on one core of the processor. I don't see what
>>>>> makes the compiled programs run in different ways, because they are
>>>>> written similarly.
>>>>
>>>> I understand that you'd want to compare the generated code with the
>>>> handwritten code on the same number of threads, but I cannot resist
>>>> pointing out that parallelization of the C++ code is something that
>>>> needs to be developed, whereas parallelization via Mathematica comes
>>>> at almost no additional cost.
>>>>
>>>> On a completely different note, here is another approach that could
>>>> be taken:
>>>>
>>>> CCodeGenerator/tutorial/CodeGeneration
>>>>
>>>> Oliver
>>>
>>> This behavior is not new to me. Calculating matrix exponentials also
>>> leads to 100% processor usage on multiprocessor computers without
>>> any parallelization. I have observed it on my Windows 7 laptop and on
>>> a Mac Pro at work. The system monitor shows that only one Mathematica
>>> kernel is working, but the load of this kernel is much greater than
>>> 100%, especially on the Mac Pro, which has 8 cores. My laptop may
>>> switch off (because of overheating?) during such calculations, while
>>> the Mac is OK.
>>>
>>> I have also seen such behavior when solving PDEs with NDSolve in some
>>> cases.
>>>
>>> I wonder what is happening, and I do not know whether this effect is
>>> good or bad. As I cannot control it, I cannot measure whether such
>>> extensive processor usage leads to a speed-up.
>>>
>>> I am going to get Mathematica for Linux and test it there, too.
>>>
>>> Best,
>>>
>>> Dmitry
>>
>> In the case of the matrix exponentials (or really any numerical linear
>> algebra), this behaviour is undoubtedly due to MKL threading and can be
>> controlled by the option Oliver gives above. Obviously it is not good if
>> your laptop switches off due to overheating, but this is not so much a
>> problem of Mathematica as of badly designed cooling in the laptop. MKL's
>> threading is carefully done and scales well for moderate numbers of
>> cores, so you should be seeing considerably increased performance as a
>> result of it on an 8-core machine. In regard to NDSolve, I don't know
>> how it is implemented internally and so can't comment on any
>> parallelization that might exist.
>
> Thank you, Oleksandr!
>
> For some reason, I had overlooked Oliver's suggestion to try
> SetSystemOptions["MKLThreads" -> 1].
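>
> A side note: to be safe, one can save the current setting before
> changing it and restore it afterwards. A minimal sketch, assuming
> SystemOptions["MKLThreads"] returns the current value as a rule:
>
> old = "MKLThreads" /. SystemOptions["MKLThreads"];
> SetSystemOptions["MKLThreads" -> 1];
> (* ... timing experiments go here ... *)
> SetSystemOptions["MKLThreads" -> old] (* restore the previous value *)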
>
> MKL stands for Intel's Math Kernel Library (I have Intel processors on
> both computers), and it seems to be a big thing, if you can use it
> right. It seems the Intel processor can parallelize problems in some
> cases. But in which cases? How can we know? It would be very desirable
> to be able to write code that allows this kind of automatic
> parallelization.
>
> I have made some experiments:
>
> SetSystemOptions["MKLThreads" -> 2];
> NN = 1000;
> AMatr = Table[RandomReal[{0, 1}], {i, 1, NN}, {j, 1, NN}];
> AbsoluteTiming[MatrixExp[AMatr];]
>
> Out[12]= {2.2281247, Null}
>
> while
>
> SetSystemOptions["MKLThreads" -> 1];
> NN = 1000;
> AMatr = Table[RandomReal[{0, 1}], {i, 1, NN}, {j, 1, NN}];
> AbsoluteTiming[MatrixExp[AMatr];]
>
> Out[16]= {3.6412015, Null}
>
> That is, there is a speed-up, and we can control it. The same holds for
> matrix multiplication.
>
> On the other hand, Oliver's code of 23 September above runs on both
> cores, and I cannot control that with SetSystemOptions["MKLThreads" -> 1].
> Here is the complete code:

Try forcing a non-parallel version with Parallelization -> False. The
default is Automatic, which applies a heuristic to decide when code is
run in parallel and when not.
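For example, one could add the option to the makeCompRK definition
quoted below (a sketch; the name makeCompRKserial is mine, and I have
not timed this variant):

makeCompRKserial[f_] :=
 Compile[{{x0, _Real, 1}, {t0}, {tMax}, {n, _Integer}},
  Module[{h, K1, K2, K3, K4, SolList, x = x0, t}, h = (tMax - t0)/n;
   SolList = Table[x0, {n + 1}];
   Do[t = t0 + k h;
    K1 = h f[t, x];
    K2 = h f[t + (1/2) h, x + (1/2) K1];
    K3 = h f[t + (1/2) h, x + (1/2) K2];
    K4 = h f[t + h, x + K3];
    x = x + (1/6) K1 + (1/3) K2 + (1/3) K3 + (1/6) K4;
    SolList[[k + 1]] = x, {k, 1, n}];
   SolList],
  Parallelization -> False, CompilationTarget -> "C",
  CompilationOptions -> {"InlineCompiledFunctions" -> True},
  "RuntimeOptions" -> "Speed"]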
Does this help?

Oliver

> *****************************************************
> (* Runge-Kutta-4 routine *)
> ClearAll[makeCompRK]
> makeCompRK[f_] :=
>  Compile[{{x0, _Real, 1}, {t0}, {tMax}, {n, _Integer}},
>   Module[{h, K1, K2, K3, K4, SolList, x = x0, t}, h = (tMax - t0)/n;
>    SolList = Table[x0, {n + 1}];
>    Do[t = t0 + k h;
>     K1 = h f[t, x];
>     K2 = h f[t + (1/2) h, x + (1/2) K1];
>     K3 = h f[t + (1/2) h, x + (1/2) K2];
>     K4 = h f[t + h, x + K3];
>     x = x + (1/6) K1 + (1/3) K2 + (1/3) K3 + (1/6) K4;
>     SolList[[k + 1]] = x, {k, 1, n}];
>    SolList] (*, Parallelization->True*), CompilationTarget -> "C",
>   CompilationOptions -> {"InlineCompiledFunctions" -> True},
>   "RuntimeOptions" -> "Speed"]
>
> (* Defining equations *)
> NN = 10000;
> ff = Compile[{t},
>    Sin[0.1 t]^2]; (* ff is inserted into cRHS, the way to go *)
>
> su = With[{NN = NN},
>    Compile[{{i, _Integer}, {x, _Real, 1}},
>     Sum[x[[i + j]], {j, 1, Min[4, NN - i]}]]];
> cRHS = With[{NN = NN}, Compile[{{t}, {x, _Real, 1}},
>     Table[-x[[i]]*ff[t]/(1 + 100 su[i, x]^2), {i, 1, NN}] (*,
>      CompilationTarget->"C"*),
>     CompilationOptions -> {"InlineExternalDefinitions" -> True,
>       "InlineCompiledFunctions" -> True}]];
>
> (* With this trick it runs faster *)
> su2 = With[{NN = NN},
>    Compile[{{x, _Real, 1}},
>     Table[Sum[x[[i + j]], {j, 1, Min[4, NN - i]}], {i, 1, NN}]]];
> cRHS2 = With[{NN = NN},
>    Compile[{t, {x, _Real, 1}}, -x*ff[t]/(1 + 100*su2[x]^2),
>     CompilationTarget -> "C",
>     CompilationOptions -> {"InlineExternalDefinitions" -> True,
>       "InlineCompiledFunctions" -> True}]];
>
> (* Compilation *)
> tt0 = AbsoluteTime[];
> Timing[RK4Comp = makeCompRK[cRHS2];]
> AbsoluteTime[] - tt0
> (* CompilePrint[RK4Comp] *)
> (* switch inlining to True/False to see what is happening *)
>
> (* Setting parameters and calculation *)
> x0 = Table[RandomReal[{0, 1}], {i, 1, NN}]; t0 = 0; tMax = 300; n = 500;
> tt0 = AbsoluteTime[];
> Sol = RK4Comp[x0, t0, tMax, n];
> AbsoluteTime[] - tt0
>
> Print["Compilation: ", Developer`PackedArrayQ@Sol]
>
> (* Plotting *)
> tList = Table[1. t0 + (tMax - t0) k/n, {k, 0, n}];
> x1List = Transpose[{tList, Transpose[Sol][[1]]}];
> x2List = Transpose[{tList, Transpose[Sol][[2]]}];
> x3List = Transpose[{tList, Transpose[Sol][[3]]}];
> ListPlot[{x1List, x2List, x3List}, PlotStyle -> {Blue, Green, Red},
>  PlotRange -> All]
>
> ****************************************************************
>
> My initial code uses su and cRHS, while Oliver's faster code uses su2
> and cRHS2. Here, SetSystemOptions["MKLThreads" -> 1] does not affect
> the core usage, so I do not know whether there is automatic
> parallelization via MKL here or a parasitic effect that does not lead
> to a speed-up.
>
> I am quite intrigued now!
>
> Best regards,
>
> Dmitry
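P.S. Regarding the code generation tutorial I mentioned earlier
(CCodeGenerator/tutorial/CodeGeneration), here is a rough sketch of how
one could export a compiled function to standalone C source; the
function cf, the exported name "totalSin", and the file name are just
placeholder examples:

Needs["CCodeGenerator`"]
cf = Compile[{{x, _Real, 1}}, Total[Sin[x]]];
(* write C source for the compiled function to a file *)
CCodeGenerate[cf, "totalSin", "totalSin.c"]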
- References:
  - Re: Compilation: Avoiding inlining
    - From: DmitryG <einschlag@gmail.com>
  - Re: Compilation: Avoiding inlining