Re: ParallelDo and C-compiled routines

*To*: mathgroup at smc.vnet.net
*Subject*: [mg121775] Re: ParallelDo and C-compiled routines
*From*: "Oleksandr Rasputinov" <oleksandr_rasputinov at hmamail.com>
*Date*: Sat, 1 Oct 2011 03:08:54 -0400 (EDT)
*Delivered-to*: l-mathgroup@mail-archive0.wolfram.com
*References*: <j5ug1a$7r5$1@smc.vnet.net> <j611s1$lva$1@smc.vnet.net>

On Fri, 30 Sep 2011 09:07:52 +0100, DmitryG <einschlag at gmail.com> wrote:

> On Sep 29, 2:05 am, "Oleksandr Rasputinov"
> <oleksandr_rasputi... at hmamail.com> wrote:
>> On Wed, 28 Sep 2011 07:49:14 +0100, DmitryG <einsch... at gmail.com> wrote:
>>
>> > Hi All,
>>
>> > I am going to run several instances of a long calculation on different
>> > cores of my computer and then average the results. The program looks
>> > like this:
>>
>> > SetSharedVariable[Res];
>> > ParallelDo[
>> >   Res[[iKer]] = LongRoutine;
>> >   , {iKer, 1, NKer}]
>>
>> > LongRoutine is compiled. When compiled in C, it is two times faster
>> > than when compiled in Mathematica. In the case of a Do cycle, this
>> > speed difference can be seen. However, in the case of ParallelDo I
>> > have the speed of the Mathematica-compiled routine independently of
>> > the CompilationTarget in LongRoutine, even if I set NKer=1.
>>
>> > What does it mean? Are routines compiled in C incapable of parallel
>> > computing? Or is there a magic option to make them work? I tried
>> > Parallelization->True but there is no result, and it seems this option
>> > is for applying the routine to lists.
>>
>> > Here is an example:
>> > ************************************************************
>> > NKer = 1;
>>
>> > (* Subroutine compiled in Mathematica *)
>> > m = Compile[ {{x, _Real}, {n, _Integer}},
>> >   Module[ {sum, inc}, sum = 1.0; inc = 1.0;
>> >    Do[inc = inc*x/i; sum = sum + inc, {i, n}]; sum]];
>>
>> > (* Subroutine compiled in C *)
>> > c = Compile[ {{x, _Real}, {n, _Integer}},
>> >   Module[ {sum, inc}, sum = 1.0; inc = 1.0;
>> >    Do[inc = inc*x/i; sum = sum + inc, {i, n}]; sum],
>> >   CompilationTarget -> "C"];
>>
>> > (* There is a difference between Mathematica and C *)
>> > Do[
>> >   Print[AbsoluteTiming[m[1.5, 10000000]][[1]]];
>> >   Print[AbsoluteTiming[c[1.5, 10000000]][[1]]];
>> >   , {iKer, 1, NKer}]
>> > Print[];
>>
>> > (* With ParallelDo there is no difference *)
>> > ParallelDo[
>> >   Print[AbsoluteTiming[m[1.5, 10000000]][[1]]];
>> >   Print[AbsoluteTiming[c[1.5, 10000000]][[1]]];
>> >   , {iKer, 1, NKer}]
>> > **************************************************************
>>
>> > Any help?
>>
>> > Best,
>>
>> > Dmitry
>>
>> The behaviour you observe here is quite subtle in several respects.
>> Ordinarily, one might expect to have to call DistributeDefinitions[m, c]
>> to replicate the definitions of m and c in the subkernel(s) before making
>> a parallel evaluation, otherwise the argument sent to each subkernel would
>> be returned unchanged to be evaluated on the master kernel when the result
>> is built. In this case, however, the message printed by the subkernel
>> indicates that the evaluation is not in fact being performed in the master
>> kernel. (If it was, you would have seen a difference in run-time between
>> the two compiled functions.) This caught me a little by surprise at first,
>> since unlike in the Parallel Computing Toolkit, definitions are now
>> distributed automatically by the internal function parallelIterate (on
>> top of which ParallelDo is built).
>> At any rate, although ParallelDo succeeds
>> in defining m and c in the subkernels, it appears that CompiledFunctions
>> referencing LibraryFunctions are stateful, in that they depend on the
>> LibraryFunction containing the compiled code having already been loaded.
>> If not, they fall back to evaluation in the Mathematica virtual machine,
>> which explains why c performs the same as m in this case.
>>
>> To solve the problem, you must either compile c separately in each
>> subkernel by wrapping its definition in ParallelEvaluate, or, preferably,
>> load the LibraryFunction manually and update the definition of c
>> accordingly. To do the latter, you can use
>>
>> ParallelDo[
>>   Print[AbsoluteTiming[m[1.5, 10000000]][[1]]];
>>   c[[-1]] = LibraryFunctionLoad @@ c[[-1]];
>>   Print[AbsoluteTiming[c[1.5, 10000000]][[1]]];
>>   , {iKer, 1, NKer} ]
>>
>> in which case you will see the expected difference in run-time.
>
> Thank you a lot, Oliver!
>
> Great advice, as ever.
>
> Yes, wrapping ParallelEvaluate[...] around the definitions of
> functions used inside ParallelDo works for the simple example above.
>
> However, it did not work for my research program where I have several
> compiled functions calling each other. Maybe I did something wrong.
>
> On the other hand, the single line of the type c[[-1]] =
> LibraryFunctionLoad @@ c[[-1]] immediately did the job for my research
> program and now I have a speed-up from compiling in C (a factor of ~3
> on my computer). What does [[-1]] mean here, BTW?
>
> Warm regards,
>
> Dmitry
>

Happy to have helped.

Negative indexing in Part signifies counting from the end of an object
rather than the beginning. The last element in a CompiledFunction
referencing a procedure compiled from C is a LibraryFunction, of which the
FullForm is essentially a copy of the arguments used with
LibraryFunctionLoad but with the head LibraryFunction.
Thus we substitute the head using Apply and get back a new, properly
initialised LibraryFunction, with which we replace the last Part of the
CompiledFunction. Clearly, this depends on the file containing the code
being accessible to the subkernels, but this is not a problem when they are
being run on the same machine as the master kernel.

Given that LibraryFunction appears to contain hidden state, one might
expect it to be atomic (i.e. for AtomQ to give True), because a copy of its
FullForm does not achieve a copy of the object. However, this does not seem
to be the case.

Best,

Oleksandr R.
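
P.S. The three steps described above (negative Part index, head
substitution with Apply, Part assignment) can be tried out in isolation.
This is only a minimal sketch using inert placeholder symbols: f, g and h
are hypothetical and merely stand in for CompiledFunction, LibraryFunction
and LibraryFunctionLoad respectively, and the strings are made-up
arguments, so nothing is actually loaded here.

```mathematica
(* f, g, h, a, b are inert placeholders, not real library objects *)
expr = f[a, b, g["lib", "fn", 3]];

(* A negative index in Part counts from the end: this is the last element *)
expr[[-1]]           (* g["lib", "fn", 3] *)

(* Apply (@@) replaces the head and keeps the arguments *)
h @@ expr[[-1]]      (* h["lib", "fn", 3] *)

(* Assigning to a Part of the symbol's value updates it in place *)
expr[[-1]] = h @@ expr[[-1]];
expr                 (* f[a, b, h["lib", "fn", 3]] *)
```

With a real CompiledFunction the head h would be LibraryFunctionLoad, which
evaluates and returns the freshly loaded LibraryFunction, so the last Part
ends up holding an initialised object rather than an inert copy.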