Re: Re: ParallelDo and C-compiled routines

*To*: mathgroup at smc.vnet.net
*Subject*: [mg121771] Re: [mg121765] Re: ParallelDo and C-compiled routines
*From*: Oliver Ruebenkoenig <ruebenko at wolfram.com>
*Date*: Sat, 1 Oct 2011 03:08:11 -0400 (EDT)
*Delivered-to*: l-mathgroup@mail-archive0.wolfram.com
*References*: <201109300804.EAA06615@smc.vnet.net>

On Fri, 30 Sep 2011, DmitryG wrote:

> On Sep 29, 2:05 am, "Oleksandr Rasputinov"
> <oleksandr_rasputi... at hmamail.com> wrote:
>> On Wed, 28 Sep 2011 07:49:14 +0100, DmitryG <einsch... at gmail.com> wrote:
>>> Hi All,
>>
>>> I am going to run several instances of a long calculation on different
>>> cores of my computer and then average the results. The program looks
>>> like this:
>>
>>> SetSharedVariable[Res];
>>> ParallelDo[
>>> Res[[iKer]] = LongRoutine;
>>> , {iKer, 1, NKer}]
>>
>>> LongRoutine is compiled. When compiled in C, it is two times faster
>>> than when compiled in Mathematica. In the case of a Do cycle, this
>>> speed difference can be seen, However, in the case of ParallelDo I
>>> have the speed of the Mathematica-compiled routine independently of
>>> the CompilationTarget in LongRoutine, even if I set NKer=1.
>>
>>> What does it mean? Are routines compiled in C unable of parallel
>>> computing? Or there is a magic option to make them work? I tried
>>> Parallelization->True but there is no result, and it seems this option
>>> is for applying the routine to lists.
>>
>>> Here is an example:
>>> ************************************************************
>>> NKer = 1;
>>
>>> (* Subroutine compiled in Mathematica *)
>>> m = Compile[ {{x, _Real}, {n, _Integer}},
>>> Module[ {sum, inc}, sum = 1.0; inc = 1.0;
>>> Do[inc = inc*x/i; sum = sum + inc, {i, n}]; sum]];
>>
>>> (* Subroutine compiled in C *)
>>> c = Compile[ {{x, _Real}, {n, _Integer}},
>>> Module[ {sum, inc}, sum = 1.0; inc = 1.0;
>>> Do[inc = inc*x/i; sum = sum + inc, {i, n}]; sum],
>>> CompilationTarget -> "C"];
>>
>>> (* There is a difference between Mathematica and C *)
>>> Do[
>>> Print[AbsoluteTiming[m[1.5, 10000000]][[1]]];
>>> Print[AbsoluteTiming[c[1.5, 10000000]][[1]]];
>>> , {iKer, 1, NKer}]
>>> Print[];
>>
>>> (* With ParallelDo there is no difference *)
>>> ParallelDo[
>>> Print[AbsoluteTiming[m[1.5, 10000000]][[1]]];
>>> Print[AbsoluteTiming[c[1.5, 10000000]][[1]]];
>>> , {iKer, 1, NKer}]
>>> **************************************************************
>>
>>> Any help?
>>
>>> Best,
>>
>>> Dmitry
>>
>> The behaviour you observe here is quite subtle in several respects.
>> Ordinarily, one might expect to have to call DistributeDefinitions[m, c]
>> to replicate the definitions of m and c in the subkernel(s) before making
>> a parallel evaluation, otherwise the argument sent to each subkernel would
>> be returned unchanged to be evaluated on the master kernel when the result
>> is built. In this case, however, the message printed by the subkernel
>> indicates that the evaluation is not in fact being performed in the master
>> kernel. (If it was, you would have seen a difference in run-time between
>> the two compiled functions.) This caught me a little by surprise at first,
>> since unlike in the Parallel Computing Toolkit, definitions are now
>> distributed automatically by the internal function parallelIterate (on top
>> of which ParallelDo is built).
>> At any rate, although ParallelDo succeeds
>> in defining m and c in the subkernels, it appears that CompiledFunctions
>> referencing LibraryFunctions are stateful, in that they depend on the
>> LibraryFunction containing the compiled code having already been loaded.
>> If not, they fall back to evaluation in the Mathematica virtual machine,
>> which explains why c performs the same as m in this case.
>>
>> To solve the problem, you must either compile c separately in each
>> subkernel by wrapping its definition in ParallelEvaluate, or, preferably,
>> load the LibraryFunction manually and update the definition of c
>> accordingly. To do the latter, you can use
>>
>> ParallelDo[
>> Print[AbsoluteTiming[m[1.5, 10000000]][[1]]];
>> c[[-1]] = LibraryFunctionLoad @@ c[[-1]];
>> Print[AbsoluteTiming[c[1.5, 10000000]][[1]]];
>> , {iKer, 1, NKer} ]
>>
>> in which case you will see the expected difference in run-time.
>
> Thank you a lot, Oliver!

I think the credits should go to Oleksandr and Patrick; I found their
assessment of the issue much better than my own.

> Great advice, as ever.
>
> Yes, wrapping ParallelEvaluation[...] around the definitions of
> functions used inside ParallelDo works for the simple example above.
>
> However, it did not work for my research program where I have several
> compiled functions calling each other. Maybe I did something wrong.
>
> On the other hand, the single line of the type c[[-1]] =
> LibraryFunctionLoad @@ c[[-1]] immediately did the job for my research
> program and now I have a speed-up from compiling in C (a factor of ~3
> on my computer). What does [[-1]] mean here, BTW?
>

Also LibraryFunctionLoad is probably better than Get - but old habits die
hard....

list[[-1]] means get the first expression from the back - a.k.a. get the
last expression; this is handy if you do not know the length of the expr.

Oliver

> Warm regards,
>
> Dmitry
>
>
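
To make the [[-1]] idiom concrete, here is a minimal sketch on a plain list and a made-up expression. The symbols list, expr, f and g (and p, q, r, s) are hypothetical and serve only as illustration; nothing here touches a compiled function.

```mathematica
(* Negative indices in Part count from the end of an expression. *)
list = {p, q, r, s};
last = list[[-1]];         (* the last element, s *)
secondLast = list[[-2]];   (* the element before it, r *)

(* Part works on any expression, not only on lists. *)
expr = f[1, 2, 3];
lastArg = expr[[-1]];      (* the last argument, 3 *)

(* @@ (Apply) replaces the head of an expression, keeping its arguments. *)
applied = g @@ expr;       (* g[1, 2, 3] *)
```

Read this way, c[[-1]] = LibraryFunctionLoad @@ c[[-1]] takes the last part of the compiled function (which, per the explanation quoted above, refers to the LibraryFunction), swaps its head for LibraryFunctionLoad so that the library is actually loaded, and stores the result back in the same position.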
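
For completeness, the other route mentioned in the quoted text (compiling c separately in each subkernel) might look roughly like the sketch below. It assumes a working C compiler is available to every subkernel and that c has no definition on the master kernel; NKer = 2 is an arbitrary choice for the illustration. As reported above, this worked for the simple example but not for a larger program with several compiled functions calling each other.

```mathematica
NKer = 2;  (* arbitrary number of iterations, for illustration only *)

(* Compile c to C inside each subkernel, so that every subkernel loads
   its own copy of the compiled library. The trailing semicolon keeps
   the compiled functions from being sent back to the master kernel. *)
ParallelEvaluate[
  c = Compile[{{x, _Real}, {n, _Integer}},
    Module[{sum, inc}, sum = 1.0; inc = 1.0;
      Do[inc = inc*x/i; sum = sum + inc, {i, n}]; sum],
    CompilationTarget -> "C"];
];

(* Each subkernel now uses its locally compiled c. *)
ParallelDo[
  Print[AbsoluteTiming[c[1.5, 10000000]][[1]]],
  {iKer, 1, NKer}]
```

If c is also defined on the master kernel, ParallelDo may distribute that definition automatically and replace the locally compiled one, in which case the LibraryFunctionLoad line from the quoted reply is the safer choice.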