MathGroup Archive: October 2011 [00157]

Re: Compilation: Avoiding inlining

To: mathgroup at smc.vnet.net
Subject: [mg121890] Re: Compilation: Avoiding inlining
From: DmitryG <einschlag at gmail.com>
Date: Thu, 6 Oct 2011 04:20:14 -0400 (EDT)
Delivered-to: l-mathgroup@mail-archive0.wolfram.com
References: <201109250234.WAA26203@smc.vnet.net> <j5s88c$pf1$1@smc.vnet.net> <j6h3ml$75d$1@smc.vnet.net>

On Oct 5, 4:15 am, "Oleksandr Rasputinov"
<oleksandr_rasputi... at hmamail.com> wrote:
> On Tue, 04 Oct 2011 07:45:30 +0100, DmitryG <einsch... at gmail.com> wrote:
> > On Sep 27, 6:24 am, Oliver Ruebenkoenig <ruebe... at wolfram.com> wrote:
> >> On Sat, 24 Sep 2011, DmitryG wrote:
>
> >> > A potentially very important question: I have noticed that the program
> >> > we are discussing, when compiled in C, runs on both cores of my
>
> >> Only when compiled to C? You could try to set Parallelization->False
> >> and/or it might be that MKL runs some stuff in parallel.
>
> >> Try
>
> >> SetSystemOptions["MKLThreads" -> 1] and see if that helps.
>
> >> > processor. No parallelization options have been set, so what is it?
> >> > Automatic parallelization by the C compiler (I have Microsoft visual
> >> > under Windows 7) ?  Do you have this effect on your computer?
>
> >> I can not test that since I use Linux/gcc.
>
> >> > However, the programs of a different type, such as my research
> >> > program, still run on one core of the processor. I don't see what
> >> > makes the compiled program run in different ways, because they are
> >> > written similarly.
>
> >> I understand that you'd want to compare the generated code with the
> >> handwritten code on the same number of threads, but I cannot resist
> >> pointing out that parallelization of the C++ code is something that
> >> needs to be developed, whereas parallelization via Mathematica comes
> >> at almost no additional cost.
>
> >> On a completely different note, here is another approach that could be
> >> taken.
>
> >> CCodeGenerator/tutorial/CodeGeneration
>
> >> Oliver
>
> > This behavior is not new to me. Calculating matrix exponentials also
> > leads to a 100% processor usage on multiprocessor computers without
> > any parallelization. I have observed it on my Windows 7 laptop and on
> > a Mac Pro at work. The system monitor shows that only one Mathematica
> > kernel is working but the load of this kernel is much greater than
> > 100%, especially on the Mac Pro that has 8 cores. My laptop may switch
> > off (because of overheating?) during such calculations while the Mac
> > is OK.
>
> > Also I've seen such a behavior solving PDEs with NDSolve in some
> > cases.
>
> > I wonder what is happening and I do not know whether this effect is
> > good or bad. As I cannot control it, I cannot measure if such an
> > extensive processor usage leads to a speed-up.
>
> > I am going to get Mathematica for Linux and test it there, too.
>
> > Best,
>
> > Dmitry
>
> In the case of the matrix exponentials (or really any numerical linear 
> algebra), this behaviour is undoubtedly due to MKL threading and can be 
> controlled by the option Oliver gives above. Obviously it is not good if
> your laptop switches off due to overheating, but this is not so much a
> problem of Mathematica as badly designed cooling in the laptop. MKL's  
> threading is carefully done and scales well for moderate numbers of cores,  
> so you should be seeing considerably increased performance as a result of
> it on an 8-core machine. In regard to NDSolve, I don't know how this is 
> implemented internally and so can't comment on any parallelization that 
> might exist.

Thank you, Oleksandr!

For some reason, I've overlooked Oliver's suggestion to try
SetSystemOptions["MKLThreads" -> 1].

MKL stands for Intel's Math Kernel Library (I have Intel processors in
both computers) and it seems to be a big thing, if you can use it
right. It seems that it can parallelize problems in some cases. But in
which cases? How can we know? It would be very desirable to be able to
write code that allows this kind of automatic parallelization.

I ran some experiments. With two threads,

SetSystemOptions["MKLThreads" -> 2];
NN = 1000;
AMatr = Table[RandomReal[{0, 1}], {i, 1, NN}, {j, 1, NN}];
AbsoluteTiming[ MatrixExp[AMatr]; ]

Out[12]= {2.2281247, Null}

while

SetSystemOptions["MKLThreads" -> 1];
NN = 1000;
AMatr = Table[RandomReal[{0, 1}], {i, 1, NN}, {j, 1, NN}];
AbsoluteTiming[MatrixExp[AMatr];]

Out[16]= {3.6412015, Null}

That is, there is a speed-up (about 3.6 s down to 2.2 s) and we can
control it. The same holds for matrix multiplication.
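
The two timings above each use a freshly generated random matrix. A minimal sketch of a slightly more controlled comparison, reusing one fixed matrix and restoring the previous setting afterwards, could look like this. The names size, mat, oldThreads and timings are illustrative only, and the thread counts {1, 2} are an assumption matching a dual-core machine.

```mathematica
(* Sketch: time MatrixExp on one fixed matrix for several MKL thread counts. *)
(* size, mat, oldThreads and timings are illustrative names. *)
size = 1000;
mat = RandomReal[{0, 1}, {size, size}];

(* Remember the current setting so it can be restored afterwards. *)
oldThreads = "MKLThreads" /. SystemOptions["MKLThreads"];

timings = Table[
   SetSystemOptions["MKLThreads" -> k];
   {k, First[AbsoluteTiming[MatrixExp[mat];]]},
   {k, {1, 2}}];

(* Restore the original thread count. *)
SetSystemOptions["MKLThreads" -> oldThreads];

TableForm[timings, TableHeadings -> {None, {"threads", "seconds"}}]
```

Running each measurement more than once and discarding the first run may give steadier numbers.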

On the other hand, Oliver's code of 23 September above runs on both
cores and I cannot control it by SetSystemOptions["MKLThreads" -> 1].
Here is the complete code:

*****************************************************
(* Runge-Kutta-4 routine *)
ClearAll[makeCompRK]
makeCompRK[f_] :=
 Compile[{{x0, _Real, 1}, {t0}, {tMax}, {n, _Integer}},
  Module[{h, K1, K2, K3, K4, SolList, x = x0, t}, h = (tMax - t0)/n;
   SolList = Table[x0, {n + 1}];
   Do[t = t0 + k h;
    K1 = h f[t, x];
    K2 = h f[t + (1/2) h, x + (1/2) K1];
    K3 = h f[t + (1/2) h, x + (1/2) K2];
    K4 = h f[t + h, x + K3];
    x = x + (1/6) K1 + (1/3) K2 + (1/3) K3 + (1/6) K4;
    SolList[[k + 1]] = x, {k, 1, n}];
   SolList](*,Parallelization->True*), CompilationTarget -> "C",
  CompilationOptions -> {"InlineCompiledFunctions" -> True},
  "RuntimeOptions" -> "Speed"]

(* Defining equations *)
NN = 10000;
ff = Compile[{t},
  Sin[0.1 t]^2];  (* ff is inserted into cRHS, the way to go *)

su = With[{NN = NN},
  Compile[{{i, _Integer}, {x, _Real, 1}},
   Sum[x[[i + j]], {j, 1, Min[4, NN - i]}]]];
cRHS = With[{NN = NN}, Compile[{{t}, {x, _Real, 1}},
    Table[-x[[i]]*ff[t]/(1 + 100 su[i, x]^2), {i, 1, NN}](*,
    CompilationTarget->"C"*),
    CompilationOptions -> {"InlineExternalDefinitions" -> True,
      "InlineCompiledFunctions" -> True}]];

(* With this trick it runs faster *)
su2 = With[{NN = NN},
   Compile[{{x, _Real, 1}},
    Table[Sum[x[[i + j]], {j, 1, Min[4, NN - i]}], {i, 1, NN}]]];
cRHS2 = With[{NN = NN},
   Compile[{t, {x, _Real, 1}}, -x*ff[t]/(1 + 100*su2[x]^2),
    CompilationTarget -> "C",
    CompilationOptions -> {"InlineExternalDefinitions" -> True,
      "InlineCompiledFunctions" -> True}]];

(*Compilation*)
tt0 = AbsoluteTime[];
Timing[RK4Comp = makeCompRK[cRHS2];]
AbsoluteTime[] - tt0
(*Needs["CompiledFunctionTools`"]; CompilePrint[RK4Comp]*)
(*switch inlining to True/False to see what is happening*)

(*Setting parameters and Calculation*)
x0 = Table[
  RandomReal[{0, 1}], {i, 1, NN}]; t0 = 0; tMax = 300; n = 500;
tt0 = AbsoluteTime[];
Sol = RK4Comp[x0, t0, tMax, n];
AbsoluteTime[] - tt0

Print["Compilation: ", Developer`PackedArrayQ@Sol]

(* Plotting *)
tList = Table[1. t0 + (tMax - t0) k/n, {k, 0, n}];
x1List = Transpose[{tList, Transpose[Sol][[1]]}];
x2List = Transpose[{tList, Transpose[Sol][[2]]}];
x3List = Transpose[{tList, Transpose[Sol][[3]]}];
ListPlot[{x1List, x2List, x3List}, PlotStyle -> {Blue, Green, Red},
 PlotRange -> All]

****************************************************************

My initial code uses su and cRHS while Oliver's faster code uses su2
and cRHS2. Here, SetSystemOptions["MKLThreads" -> 1] does not affect
the core usage. Thus I do not know whether there is automatic
parallelization via MKL here or whether it is a parasitic effect that
does not lead to a speed-up.
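
One rough way to probe this is to compare CPU time (Timing) with wall-clock time (AbsoluteTiming) for the same call, and to look at the thread-related system options. The sketch below is only a diagnostic aid under stated assumptions: cpuAndWall is a hypothetical helper, the keys listed under "ParallelOptions" depend on the Mathematica version, and on some operating systems Timing ignores additional threads, so a ratio near 1 does not by itself rule out threading.

```mathematica
(* Sketch: inspect thread-related settings; the output varies by version. *)
$ProcessorCount
SystemOptions["MKLThreads"]
SystemOptions["ParallelOptions"]

(* Hypothetical helper: evaluate expr twice and return {CPU s, wall s}. *)
SetAttributes[cpuAndWall, HoldFirst];
cpuAndWall[expr_] :=
  {First[Timing[expr;]], First[AbsoluteTiming[expr;]]}

(* Example call on a linear-algebra task known to use MKL threads. *)
testMat = RandomReal[{0, 1}, {500, 500}];
result = cpuAndWall[MatrixExp[testMat]]
```

For the compiled integrator, the analogous call would be cpuAndWall[RK4Comp[x0, t0, tMax, n]]. Where Timing does sum over threads, a CPU time well above the wall-clock time suggests that several threads did real work; whether that work shortens the run still has to be checked by timing with different thread settings.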

I am quite intrigued now!

Best regards,

Dmitry

Follow-Ups:
- Re: Compilation: Avoiding inlining
  - From: Oliver Ruebenkoenig <ruebenko@wolfram.com>
