Re: Compilation: Avoiding inlining

- *To*: mathgroup at smc.vnet.net
- *Subject*: [mg121968] Re: Compilation: Avoiding inlining
- *From*: Oliver Ruebenkoenig <ruebenko at wolfram.com>
- *Date*: Fri, 7 Oct 2011 04:49:56 -0400 (EDT)
- *Delivered-to*: l-mathgroup@mail-archive0.wolfram.com
- *References*: <201110060820.EAA22516@smc.vnet.net>

On Thu, 6 Oct 2011, DmitryG wrote:

> On Oct 5, 4:15 am, "Oleksandr Rasputinov"
> <oleksandr_rasputi... at hmamail.com> wrote:
>> On Tue, 04 Oct 2011 07:45:30 +0100, DmitryG <einsch... at gmail.com> wrote:
>>> On Sep 27, 6:24 am, Oliver Ruebenkoenig <ruebe... at wolfram.com> wrote:
>>>> On Sat, 24 Sep 2011, DmitryG wrote:
>>
>>>>> A potentially very important question: I have noticed that the program
>>>>> we are discussing, when compiled in C, runs on both cores of my
>>
>>>> Only when compiled to C? You could try to set Parallelization->False
>>>> and/or it might be that MKL runs some stuff in parallel.
>>
>>>> Try
>>
>>>> SetSystemOptions["MKLThreads" -> 1] and see if that helps.
>>
>>>>> processor. No parallelization options have been set, so what is it?
>>>>> Automatic parallelization by the C compiler (I have Microsoft Visual
>>>>> under Windows 7)? Do you have this effect on your computer?
>>
>>>> I cannot test that since I use Linux/gcc.
>>
>>>>> However, the programs of a different type, such as my research
>>>>> program, still run on one core of the processor. I don't see what
>>>>> makes the compiled program run in different ways, because they are
>>>>> written similarly.
>>
>>>> I understand that you'd want to compare the generated code with the
>>>> handwritten code on the same number of threads, but I cannot resist
>>>> pointing out that the parallelization of the C++ code is something that
>>>> needs to be developed, whereas parallelization via Mathematica comes at
>>>> almost no additional cost.
>>
>>>> On a completely different note, here is another approach that could be
>>>> taken.
>>
>>>> CCodeGenerator/tutorial/CodeGeneration
>>
>>>> Oliver
>>
>>> This behavior is not new to me. Calculating matrix exponentials also
>>> leads to a 100% processor usage on multiprocessor computers without
>>> any parallelization. I have observed it on my Windows 7 laptop and on
>>> a Mac Pro at work. The system monitor shows that only one Mathematica
>>> kernel is working but the load of this kernel is much greater than
>>> 100%, especially on the Mac Pro that has 8 cores. My laptop may switch
>>> off (because of overheating?) during such calculations while the Mac
>>> is OK.
>>
>>> Also I've seen such a behavior solving PDEs with NDSolve in some
>>> cases.
>>
>>> I wonder what is happening and I do not know whether this effect is
>>> good or bad. As I cannot control it, I cannot measure if such an
>>> extensive processor usage leads to a speed-up.
>>
>>> I am going to get Mathematica for Linux and test it there, too.
>>
>>> Best,
>>
>>> Dmitry
>>
>> In the case of the matrix exponentials (or really any numerical linear
>> algebra), this behaviour is undoubtedly due to MKL threading and can be
>> controlled by the option Oliver gives above. Obviously it is not good if
>> your laptop switches off due to overheating, but this is not so much a
>> problem of Mathematica as badly designed cooling in the laptop. MKL's
>> threading is carefully done and scales well for moderate numbers of cores,
>> so you should be seeing considerably increased performance as a result of
>> it on an 8-core machine. In regard to NDSolve, I don't know how this is
>> implemented internally and so can't comment on any parallelization that
>> might exist.
>
> Thank you, Oleksandr!
>
> For some reason, I've overlooked Oliver's suggestion to try
> SetSystemOptions["MKLThreads" -> 1].
>
> MKL stands for Intel's Math Kernel Library (I have Intel on both
> computers) and it seems to be a big thing, if you can use it right. It
> seems, the Intel processor can parallelize problems in some cases. But
> in which cases? How can we know it? It would be very desirable to be
> able to write code that allows this kind of automatic parallelization.
>
> I have made experiments,
>
> SetSystemOptions["MKLThreads" -> 2];
> NN = 1000;
> AMatr = Table[RandomReal[{0, 1}], {i, 1, NN}, {j, 1, NN}];
> AbsoluteTiming[ MatrixExp[AMatr]; ]
>
> Out[12]= {2.2281247, Null}
>
> while
>
> SetSystemOptions["MKLThreads" -> 1];
> NN = 1000;
> AMatr = Table[RandomReal[{0, 1}], {i, 1, NN}, {j, 1, NN}];
> AbsoluteTiming[MatrixExp[AMatr];]
>
> Out[16]= {3.6412015, Null}
>
> That is, there is a speed-up and we can control it. The same for
> matrix multiplication.
>
> On the other hand, Oliver's code of 23 September above runs on both
> cores and I cannot control it by SetSystemOptions["MKLThreads" -> 1].
> Here is the complete code:

Try forcing a non-parallel version with Parallelization->False. The
default is Automatic, which applies a heuristic to decide when code is
run in parallel and when it is not.

Does this help?

Oliver

>
> *****************************************************
> (* Runge-Kutta-4 routine *)
> ClearAll[makeCompRK]
> makeCompRK[f_] :=
>  Compile[{{x0, _Real, 1}, {t0}, {tMax}, {n, _Integer}},
>   Module[{h, K1, K2, K3, K4, SolList, x = x0, t}, h = (tMax - t0)/n;
>    SolList = Table[x0, {n + 1}];
>    Do[t = t0 + k h;
>     K1 = h f[t, x];
>     K2 = h f[t + (1/2) h, x + (1/2) K1];
>     K3 = h f[t + (1/2) h, x + (1/2) K2];
>     K4 = h f[t + h, x + K3];
>     x = x + (1/6) K1 + (1/3) K2 + (1/3) K3 + (1/6) K4;
>     SolList[[k + 1]] = x, {k, 1, n}];
>    SolList](*,Parallelization->True*), CompilationTarget -> "C",
>   CompilationOptions -> {"InlineCompiledFunctions" -> True},
>   "RuntimeOptions" -> "Speed"]
>
> (* Defining equations *)
> NN = 10000;
> ff = Compile[{t},
>    Sin[0.1 t]^2]; (* ff is inserted into cRHS, the way to go *)
>
> su = With[{NN = NN},
>    Compile[{{i, _Integer}, {x, _Real, 1}},
>     Sum[x[[i + j]], {j, 1, Min[4, NN - i]}]]];
> cRHS = With[{NN = NN}, Compile[{{t}, {x, _Real, 1}},
>     Table[-x[[i]]*ff[t]/(1 + 100 su[i, x]^2), {i, 1, NN}](*,
>     CompilationTarget->"C"*),
>     CompilationOptions -> {"InlineExternalDefinitions" -> True,
>       "InlineCompiledFunctions" -> True}]];
>
> (* With this trick it runs faster *)
> su2 = With[{NN = NN},
>    Compile[{{x, _Real, 1}},
>     Table[Sum[x[[i + j]], {j, 1, Min[4, NN - i]}], {i, 1, NN}]]];
> cRHS2 = With[{NN = NN},
>    Compile[{t, {x, _Real, 1}}, -x*ff[t]/(1 + 100*su2[x]^2),
>     CompilationTarget -> "C",
>     CompilationOptions -> {"InlineExternalDefinitions" -> True,
>       "InlineCompiledFunctions" -> True}]];
>
> (*Compilation*)
> tt0 = AbsoluteTime[];
> Timing[RK4Comp = makeCompRK[cRHS2];]
> AbsoluteTime[] - tt0
> (*CompilePrint[RK4Comp2]*)
> (*switch inlining to True/False to see what is happening*)
>
> (*Setting parameters and Calculation*)
> x0 = Table[
>    RandomReal[{0, 1}], {i, 1, NN}]; t0 = 0; tMax = 300; n = 500;
> tt0 = AbsoluteTime[];
> Sol = RK4Comp[x0, t0, tMax, n];
> AbsoluteTime[] - tt0
>
> Print["Compilation: ", Developer`PackedArrayQ@Sol]
>
> (* Plotting *)
> tList = Table[1. t0 + (tMax - t0) k/n, {k, 0, n}];
> x1List = Transpose[{tList, Transpose[Sol][[1]]}];
> x2List = Transpose[{tList, Transpose[Sol][[2]]}];
> x3List = Transpose[{tList, Transpose[Sol][[3]]}];
> ListPlot[{x1List, x2List, x3List}, PlotStyle -> {Blue, Green, Red},
>   PlotRange -> All]
>
> ****************************************************************
>
> My initial code uses su and cRHS while Oliver's faster code uses su2
> and cRHS2. Here, SetSystemOptions["MKLThreads" -> 1] does not affect
> the core usage. Thus I do not know if there is automatic
> parallelization via MKL here or it is a parasitic effect that does not
> lead to speed-up.
>
> I am quite intrigued now!
>
> Best regards,
>
> Dmitry
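
As a rough illustration of the Parallelization->False suggestion, the RK4 generator from the quoted listing could be given the option explicitly in place of the commented-out one. This is only a sketch: the names makeCompRKSerial, toyRHS and rk4Serial are made up here and do not appear in the thread, and the toy right-hand side is assumed purely so the example runs on its own. One deliberate difference from the quoted listing is that the stage time is written as t0 + (k - 1) h, so that the first step starts at t0 (the quoted code uses t0 + k h). Whether the option actually changes the core usage is exactly what remains to be tested.

```mathematica
(* Sketch only: makeCompRKSerial is a hypothetical name, not from the thread. *)
ClearAll[makeCompRKSerial]
makeCompRKSerial[f_] :=
 Compile[{{x0, _Real, 1}, {t0}, {tMax}, {n, _Integer}},
  Module[{h, K1, K2, K3, K4, SolList, x = x0, t},
   h = (tMax - t0)/n;
   SolList = Table[x0, {n + 1}];
   Do[
    t = t0 + (k - 1) h;  (* time at the start of step k *)
    K1 = h f[t, x];
    K2 = h f[t + (1/2) h, x + (1/2) K1];
    K3 = h f[t + (1/2) h, x + (1/2) K2];
    K4 = h f[t + h, x + K3];
    x = x + (1/6) K1 + (1/3) K2 + (1/3) K3 + (1/6) K4;
    SolList[[k + 1]] = x,
    {k, 1, n}];
   SolList],
  Parallelization -> False,  (* the option suggested in the reply *)
  CompilationTarget -> "C",
  CompilationOptions -> {"InlineCompiledFunctions" -> True},
  RuntimeOptions -> "Speed"]

(* Toy right-hand side dx/dt = -x, assumed here for a self-contained run *)
toyRHS = Compile[{t, {x, _Real, 1}}, -x];

rk4Serial = makeCompRKSerial[toyRHS];
sol = rk4Serial[{1., 2.}, 0., 1., 100];
Last[sol]  (* should be close to {1., 2.} Exp[-1.] *)
```

To repeat the experiment from the thread, one would pass cRHS2 instead of toyRHS, use the original x0, t0, tMax and n, and compare both the AbsoluteTiming result and the core usage with and without the option.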

**References**: **Re: Compilation: Avoiding inlining**, *From:* DmitryG <einschlag@gmail.com>
