Re: Compilation: Avoiding inlining

*To*: mathgroup at smc.vnet.net
*Subject*: [mg122008] Re: Compilation: Avoiding inlining
*From*: DmitryG <einschlag at gmail.com>
*Date*: Sun, 9 Oct 2011 03:50:35 -0400 (EDT)
*Delivered-to*: l-mathgroup@mail-archive0.wolfram.com
*References*: <201110060820.EAA22516@smc.vnet.net> <j6mfc6$7k8$1@smc.vnet.net>

On Oct 7, 5:05 am, Oliver Ruebenkoenig <ruebe... at wolfram.com> wrote:
> On Thu, 6 Oct 2011, DmitryG wrote:
> > On Oct 5, 4:15 am, "Oleksandr Rasputinov"
> > <oleksandr_rasputi... at hmamail.com> wrote:
> >> On Tue, 04 Oct 2011 07:45:30 +0100, DmitryG <einsch... at gmail.com> wrote:
> >>> On Sep 27, 6:24 am, Oliver Ruebenkoenig <ruebe... at wolfram.com> wrote:
> >>>> On Sat, 24 Sep 2011, DmitryG wrote:
> >
> >>>>> A potentially very important question: I have noticed that the program
> >>>>> we are discussing, when compiled in C, runs on both cores of my
> >
> >>>> Only when compiled to C? You could try to set Parallelization->False
> >>>> and/or it might be that MKL runs some stuff in parallel.
> >
> >>>> Try
> >
> >>>> SetSystemOptions["MKLThreads" -> 1] and see if that helps.
> >
> >>>>> processor. No parallelization options have been set, so what is it?
> >>>>> Automatic parallelization by the C compiler (I have Microsoft visual
> >>>>> under Windows 7) ? Do you have this effect on your computer?
> >
> >>>> I can not test that since I use Linux/gcc.
> >
> >>>>> However, the programs of a different type, such as my research
> >>>>> program, still run on one core of the processor. I don't see what
> >>>>> makes the compiled program run in different ways, because they are
> >>>>> written similarly.
> >
> >>>> I understand that you'd want to compare the generated code with the
> >>>> handwritten code on the same number of threads but I can not resist to
> >>>> point out that the parallelization of the C++ code is something that
> >>>> needs to be developed but that parallelization via Mathematica come at
> >>>> almost not additional cost.
> >
> >>>> On a completely different note, here is another approach that could be
> >>>> taken.
> >
> >>>> CCodeGenerator/tutorial/CodeGeneration
> >
> >>>> Oliver
> >
> >>> This behavior is not new to me. Calculating matrix exponentials also
> >>> leads to a 100% processor usage on multiprocessor computers without
> >>> any parallelization.
> >>> I have observed it on my Windows 7 laptop and on
> >>> a Mac Pro at work. The system monitor shows that only one Mathematica
> >>> kernel is working but the load of this kernel is much greater than
> >>> 100%, especially on the Mac Pro that has 8 cores. My laptop may switch
> >>> off (because of overheating?) during such calculations while the Mac
> >>> is OK.
> >
> >>> Also I've seen such a behavior solving PDEs with NDSolve in some
> >>> cases.
> >
> >>> I wonder what is happening and I do not know whether this effect is
> >>> good or bad. As I cannot control it, I cannot measure if such an
> >>> extensive processor usage leads to a speed-up.
> >
> >>> I am going to get Mathematica for Linux and test it there, too.
> >
> >>> Best,
> >
> >>> Dmitry
> >
> >> In the case of the matrix exponentials (or really any numerical linear
> >> algebra), this behaviour is undoubtedly due to MKL threading and can be
> >> controlled by the option Oliver gives above. Obviously it is not good if
> >> your laptop switches off due to overheating, but this is not so much a
> >> problem of Mathematica as badly designed cooling in the laptop. MKL's
> >> threading is carefully done and scales well for moderate numbers of cores,
> >> so you should be seeing considerably increased performance as a result of
> >> it on an 8-core machine. In regard to NDSolve, I don't know how this is
> >> implemented internally and so can't comment on any parallelization that
> >> might exist.
> >
> > Thank you, Oleksandr!
> >
> > For some reason, I've overlooked Oliver's suggestion to try
> > SetSystemOptions["MKLThreads" -> 1].
> >
> > MKL stands for Intel's Math Kernel Library (I have Intel on both
> > computers) and it seems to be a big thing, if you can use it right. It
> > seems, the Intel processor can parallelize problems in some cases. But
> > in which cases? How can we know it? It would be very desirable to be
> > able to write codes that allow this kind of automatic parallelization.
> >
> > I have made experiments,
> >
> > SetSystemOptions["MKLThreads" -> 2];
> > NN = 1000;
> > AMatr = Table[RandomReal[{0, 1}], {i, 1, NN}, {j, 1, NN}];
> > AbsoluteTiming[ MatrixExp[AMatr]; ]
> >
> > Out[12]= {2.2281247, Null}
> >
> > while
> >
> > SetSystemOptions["MKLThreads" -> 1];
> > NN = 1000;
> > AMatr = Table[RandomReal[{0, 1}], {i, 1, NN}, {j, 1, NN}];
> > AbsoluteTiming[MatrixExp[AMatr];]
> >
> > Out[16]= {3.6412015, Null}
> >
> > That is, there is a speed-up and we can control it. The same for
> > matrix multiplication.
> >
> > On the other hand, Oliver's code of 23 September above runs on both
> > cores and I cannot control it by SetSystemOptions["MKLThreads" -> 1].
> > Here is the complete code:
>
> Try forcing a non parallel version with Parallelization->False.
>
> The default is Automatic and applies some heuristic when code is run in
> parallel and when not.
>
> Does this help?
>
> Oliver
>
> > *****************************************************
> > (* Runge-Kutta-4 routine *)
> > ClearAll[makeCompRK]
> > makeCompRK[f_] :=
> >  Compile[{{x0, _Real, 1}, {t0}, {tMax}, {n, _Integer}},
> >   Module[{h, K1, K2, K3, K4, SolList, x = x0, t}, h = (tMax - t0)/n;
> >    SolList = Table[x0, {n + 1}];
> >    Do[t = t0 + k h;
> >     K1 = h f[t, x];
> >     K2 = h f[t + (1/2) h, x + (1/2) K1];
> >     K3 = h f[t + (1/2) h, x + (1/2) K2];
> >     K4 = h f[t + h, x + K3];
> >     x = x + (1/6) K1 + (1/3) K2 + (1/3) K3 + (1/6) K4;
> >     SolList[[k + 1]] = x, {k, 1, n}];
> >    SolList](*,Parallelization->True*), CompilationTarget -> "C",
> >   CompilationOptions -> {"InlineCompiledFunctions" -> True},
> >   "RuntimeOptions" -> "Speed"]
> >
> > (* Defining equations *)
> > NN = 10000;
> > ff = Compile[{t},
> >    Sin[0.1 t]^2]; (* ff is inserted into cRHS, the way to go *)
> >
> > su = With[{NN = NN},
> >    Compile[{{i, _Integer}, {x, _Real, 1}},
> >     Sum[x[[i + j]], {j, 1, Min[4, NN - i]}]]];
> > cRHS = With[{NN = NN}, Compile[{{t}, {x, _Real, 1}},
> >     Table[-x[[i]]*ff[t]/(1 + 100 su[i, x]^2), {i, 1, NN}](*,
> >     CompilationTarget->"C"*),
> >     CompilationOptions -> {"InlineExternalDefinitions" -> True,
> >       "InlineCompiledFunctions" -> True}]];
> >
> > (* With this trick it runs faster *)
> > su2 = With[{NN = NN},
> >    Compile[{{x, _Real, 1}},
> >     Table[Sum[x[[i + j]], {j, 1, Min[4, NN - i]}], {i, 1, NN}]]];
> > cRHS2 = With[{NN = NN},
> >    Compile[{t, {x, _Real, 1}}, -x*ff[t]/(1 + 100*su2[x]^2),
> >     CompilationTarget -> "C",
> >     CompilationOptions -> {"InlineExternalDefinitions" -> True,
> >       "InlineCompiledFunctions" -> True}]];
> >
> > (*Compilation*)
> > tt0 = AbsoluteTime[];
> > Timing[RK4Comp = makeCompRK[cRHS2];]
> > AbsoluteTime[] - tt0
> > (*CompilePrint[RK4Comp2]*)
> > (*switch inlining to True/False to see what is happening*)
> >
> > (*Setting parameters and Calculation*)
> > x0 = Table[
> >    RandomReal[{0, 1}], {i, 1, NN}]; t0 = 0; tMax = 300; n = 500;
> > tt0 = AbsoluteTime[];
> > Sol = RK4Comp[x0, t0, tMax, n];
> > AbsoluteTime[] - tt0
> >
> > Print["Compilation: ", Developer`PackedArrayQ@Sol]
> >
> > (* Plotting *)
> > tList = Table[1. t0 + (tMax - t0) k/n, {k, 0, n}];
> > x1List = Transpose[{tList, Transpose[Sol][[1]]}];
> > x2List = Transpose[{tList, Transpose[Sol][[2]]}];
> > x3List = Transpose[{tList, Transpose[Sol][[3]]}];
> > ListPlot[{x1List, x2List, x3List}, PlotStyle -> {Blue, Green, Red},
> >  PlotRange -> All]
> >
> > ****************************************************************
> >
> > My initial code uses su and cRHS while Oliver's faster code uses su2
> > and cRHS2. Here, SetSystemOptions["MKLThreads" -> 1] does not affect
> > the core usage. Thus I do not know if there is automatic
> > parallelization via MKL here or it is a parasite effect that does not
> > lead to speed-up.
> >
> > I am quite intrigued now!
> >
> > Best regards,
> >
> > Dmitry

No, on my computer (Windows 7, Intel Core Duo) these commands do not
change anything and the processor load is always 100% with cRHS2.

Dmitry
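
A comparison like the MatrixExp experiment quoted above is easier to repeat if the thread setting is changed and restored inside one helper. What follows is only a sketch under stated assumptions: timeWithThreads is a hypothetical name that does not appear anywhere in this thread, and the code assumes that SystemOptions["MKLThreads"] returns a single rule whose value can be handed back to SetSystemOptions.

```mathematica
(* timeWithThreads is a hypothetical helper name used for illustration only. *)
ClearAll[timeWithThreads];
SetAttributes[timeWithThreads, HoldFirst];  (* keep expr unevaluated until it is timed *)
timeWithThreads[expr_, nThreads_Integer?Positive] :=
  Module[{old, time},
   (* remember the current setting so it can be restored afterwards *)
   old = "MKLThreads" /. SystemOptions["MKLThreads"];
   SetSystemOptions["MKLThreads" -> nThreads];
   (* wall-clock time in seconds; the result of expr is discarded *)
   time = First[AbsoluteTiming[expr;]];
   SetSystemOptions["MKLThreads" -> old];
   time];

(* Same size of test matrix as in the experiment quoted above *)
NN = 1000;
AMatr = RandomReal[{0, 1}, {NN, NN}];

(* List of {thread count, seconds} pairs for 1 and 2 MKL threads *)
Table[{k, timeWithThreads[MatrixExp[AMatr], k]}, {k, {1, 2}}]
```

The same helper can wrap RK4Comp[x0, t0, tMax, n] from the code above. If the timings and the processor load do not change with the MKL thread count, the threading presumably does not come from MKL. The "ParallelThreadNumber" entry under SystemOptions["ParallelOptions"] may be the relevant setting in that case, though that is an assumption and not something tested in this thread.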

**References**: **Re: Compilation: Avoiding inlining** *From:* DmitryG <einschlag@gmail.com>