MathGroup Archive: October 2011 [00009]

Re: Re: ParallelDo and C-compiled routines

To: mathgroup at smc.vnet.net
Subject: [mg121771] Re: [mg121765] Re: ParallelDo and C-compiled routines
From: Oliver Ruebenkoenig <ruebenko at wolfram.com>
Date: Sat, 1 Oct 2011 03:08:11 -0400 (EDT)
Delivered-to: l-mathgroup@mail-archive0.wolfram.com
References: <201109300804.EAA06615@smc.vnet.net>

On Fri, 30 Sep 2011, DmitryG wrote:

> On Sep 29, 2:05 am, "Oleksandr Rasputinov"
> <oleksandr_rasputi... at hmamail.com> wrote:
>> On Wed, 28 Sep 2011 07:49:14 +0100, DmitryG <einsch... at gmail.com> wrote:
>>> Hi All,
>>
>>> I am going to run several instances of a long calculation on different
>>> cores of my computer and then average the results. The program looks
>>> like this:
>>
>>> SetSharedVariable[Res];
>>> ParallelDo[
>>>  Res[[iKer]] = LongRoutine;
>>>  , {iKer, 1, NKer}]
>>
>>>  LongRoutine is compiled. When compiled in C, it is two times faster
>>> than when compiled in Mathematica. In the case of a Do cycle, this
>>> speed difference can be seen. However, in the case of ParallelDo I
>>> have the speed of the Mathematica-compiled routine independently of
>>> the CompilationTarget in LongRoutine, even if I set NKer=1.
>>
>>> What does it mean? Are routines compiled in C incapable of parallel
>>> computing? Or is there a magic option to make them work? I tried
>>> Parallelization->True but there is no result, and it seems this option
>>> is for applying the routine to lists.
>>
>>> Here is an example:
>>> ************************************************************
>>> NKer = 1;
>>
>>> (* Subroutine compiled in Mathematica *)
>>> m = Compile[ {{x, _Real}, {n, _Integer}},
>>>            Module[ {sum, inc}, sum = 1.0; inc = 1.0;
>>>     Do[inc = inc*x/i; sum = sum + inc, {i, n}]; sum]];
>>
>>> (* Subroutine compiled in C *)
>>> c = Compile[ {{x, _Real}, {n, _Integer}},
>>>            Module[ {sum, inc}, sum = 1.0; inc = 1.0;
>>>     Do[inc = inc*x/i; sum = sum + inc, {i, n}]; sum],
>>>    CompilationTarget -> "C"];
>>
>>> (* There is a difference between Mathematica and C *)
>>> Do[
>>>  Print[AbsoluteTiming[m[1.5, 10000000]][[1]]];
>>>  Print[AbsoluteTiming[c[1.5, 10000000]][[1]]];
>>>  , {iKer, 1, NKer}]
>>> Print[];
>>
>>> (* With ParallelDo there is no difference *)
>>> ParallelDo[
>>>  Print[AbsoluteTiming[m[1.5, 10000000]][[1]]];
>>>  Print[AbsoluteTiming[c[1.5, 10000000]][[1]]];
>>>  , {iKer, 1, NKer}]
>>> **************************************************************
>>
>>> Any help?
>>
>>> Best,
>>
>>> Dmitry
>>
>> The behaviour you observe here is quite subtle in several respects.
>> Ordinarily, one might expect to have to call DistributeDefinitions[m, c]
>> to replicate the definitions of m and c in the subkernel(s) before making
>> a parallel evaluation, otherwise the argument sent to each subkernel would
>> be returned unchanged to be evaluated on the master kernel when the result
>> is built. In this case, however, the message printed by the subkernel
>> indicates that the evaluation is not in fact being performed in the master
>> kernel. (If it was, you would have seen a difference in run-time between
>> the two compiled functions.) This caught me a little by surprise at first,
>> since unlike in the Parallel Computing Toolkit, definitions are now
>> distributed automatically by the internal function parallelIterate (on top
>> of which ParallelDo is built). At any rate, although ParallelDo succeeds
>> in defining m and c in the subkernels, it appears that CompiledFunctions
>> referencing LibraryFunctions are stateful, in that they depend on the
>> LibraryFunction containing the compiled code having already been loaded.
>> If not, they fall back to evaluation in the Mathematica virtual machine,
>> which explains why c performs the same as m in this case.
>>
>> To solve the problem, you must either compile c separately in each
>> subkernel by wrapping its definition in ParallelEvaluate, or, preferably,
>> load the LibraryFunction manually and update the definition of c
>> accordingly. To do the latter, you can use
>>
>> ParallelDo[
>>   Print[AbsoluteTiming[m[1.5, 10000000]][[1]]];
>>   c[[-1]] = LibraryFunctionLoad @@ c[[-1]];
>>   Print[AbsoluteTiming[c[1.5, 10000000]][[1]]];
>>   , {iKer, 1, NKer} ]
>>
>> in which case you will see the expected difference in run-time.
>
> Thank you a lot, Oliver!

I think the credit should go to Oleksandr and Patrick; I found their
assessment of the issue much better than my own.

>
> Great advice, as ever.
>
> Yes, wrapping ParallelEvaluate[...] around the definitions of
> functions used inside ParallelDo works for the simple example above.
>
> However, it did not work for my research program where I have several
> compiled functions calling each other. Maybe I did something wrong.
>
> On the other hand, the single line of the type  c[[-1]] =
> LibraryFunctionLoad @@ c[[-1]] immediately did the job for my research
> program and now I have a speed-up from compiling in C (a factor of ~3
> on my computer). What does [[-1]] mean here, BTW?
>

Also LibraryFunctionLoad is probably better than Get - but old habits die
hard....
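
For comparison, the other workaround mentioned in the quoted text above
(compiling separately in each subkernel by wrapping the definition in
ParallelEvaluate) might look roughly like the sketch below. The name cSub is
hypothetical and is deliberately left undefined in the master kernel; the
sketch also assumes a working C compiler is available to every subkernel. It
is not the code used in the thread.

```mathematica
NKer = 1;

(* Compile in every subkernel, so each one builds and loads its own library.
   cSub is a hypothetical name with no definition in the master kernel. *)
ParallelEvaluate[
  cSub = Compile[{{x, _Real}, {n, _Integer}},
    Module[{sum, inc}, sum = 1.0; inc = 1.0;
      Do[inc = inc*x/i; sum = sum + inc, {i, n}]; sum],
    CompilationTarget -> "C"];
];

(* Time the C-compiled routine inside ParallelDo, as in the original test *)
ParallelDo[
  Print[AbsoluteTiming[cSub[1.5, 10000000]][[1]]],
  {iKer, 1, NKer}]
```

As reported in the quoted text, this route worked for the simple example but
not for a larger program with several compiled functions calling each other,
where the LibraryFunctionLoad line was the one that did the job.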

list[[-1]] means get the first expression counting from the back - a.k.a.
get the last expression; this is handy if you do not know the length of the
expr.
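
A small illustration of negative indexing with Part (the symbols p, q, r, s
and f are placeholders, not taken from the thread):

```mathematica
list = {p, q, r, s};   (* placeholder symbols *)

list[[-1]]             (* s, the same element that Last[list] returns *)
list[[-2]]             (* r, the second element counting from the back *)
f[p, q, r][[-1]]       (* r: Part works on any expression, not only lists *)
```

In the line c[[-1]] = LibraryFunctionLoad @@ c[[-1]], the same indexing picks
out the last element of the CompiledFunction expression, which according to
the quoted explanation is the LibraryFunction to be reloaded.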

Oliver

> Warm regards,
>
> Dmitry
>
>
