Re: Parallel Toolkit Example

• To: mathgroup at smc.vnet.net
• Subject: [mg50528] Re: Parallel Toolkit Example
• From: Schoeller Porter <sporter at wolfram.com>
• Date: Thu, 9 Sep 2004 05:17:54 -0400 (EDT)
• Sender: owner-wri-mathgroup at wolfram.com

```Geoff,

Your example code will not gain anything in a parallel evaluation,
since the evaluation in serial is less than the what the communication
overhead would be in parallel. On a 500 MHz G4, I get the following
evaluation time in Mathematica 5.0.1:

In[1]:= AbsoluteTiming[Map[FactorInteger, (10^Range[20, 40] )/9];]
Out[1]= {0.020316 Second, Null}

As a rule of thumb, the most efficient parallel algorithms tend to be
those that minimize the communication between the nodes, since each
communication incurs a bit of overhead. You would typically want to
subdivide the domain into only as many segments as you have nodes and
evaluate your function over those segments in parallel. One way to do
this (using your example, but keep in mind that it will not be faster),
is to use Partition to split the input list and Flatten to collapse the
results back into one list:

Flatten[ParallelMap[
(FactorInteger /@ #)&,
Partition[#, Length[#]/Length[\$Slaves]]&[
(10^Range[20, 40])/9
]
]]

In practice, like many other things, the most efficient parallel
algorithm is dependent on the problem that is being solved. The idea
above is useful for problems where the time to evaluate the function
over each individual segment is known to be roughly equal (this
particular implementation also requires the length of the input list to
be divisible by the number slave kernels). But this method is not
terribly efficient for a problem with different characteristics. Using
your example, but with larger integers, on a 4p 1.3GHz Itanium2 we get
the following for the serial case:

In[2]:= AbsoluteTiming[Map[FactorInteger, (10^Range[30, 50])/9+7];]

Out[2]= {15.667588 Second, Null}

and using the blocking described above with three slave kernels we get:

In[3]:= AbsoluteTiming[Flatten[ParallelMap[(FactorInteger /@ #)&,
Partition[#, Length[#]/Length[\$Slaves]]&[(10^Range[30, 50])/9+7]]]]

Out[3]= {13.527589 Second, Null}

for a speedup of 1.15. For this problem, ParallelMap is better out of
the box,

In[4]:= AbsoluteTiming[ParallelMap[FactorInteger, (10^Range[30,
50])/9+7];]

Out[4]= {8.239382 Second, Null}

giving a speedup of 1.9.  A cluster of 1GHz Xserves connected by GigE
with three slave kernels shows similar behavior, with timings or 21.38,
18.06, and 11.81 for the serial case, blocked ParallelMap, and stock
ParalleMap, respectively.

If we look at the timings for each of the individual FactorInteger
evaluations, we can see that there is a large variation in the
evaluation times. Simply using Partition results in a nearly worst case
scenario in which one of the slave kernels ends up with the bulk of the
work:

In[5]:= subtimings = Map[Timing[FactorInteger[#]][[1]] &,
(10^Range[30, 50])/9 + 7]/Second

Out[5]= {0.013672, 0.053712, 0.013672, 0.235351, 0.075195, 0.074219,
0.026367, 0.00293, 0.427735, 0.012695, 1.20605, 0.140625,
0.004882, 0.00293, 3.91309, 0.013671, 1.25684, 0.108399,
0.004883, 6.63379, 1.49707}

In[6]:= Total /@ Partition[subtimings,7]

Out[6]= {0.492187,1.79395,13.4092}

Notice that the total of the sub-times for the last segment is very
close to the blocked ParallelMap evaluation time. Armed with this
knowledge, we can improve things by partitioning the input list to
better distribute the longer-running evaluations:

In[7]:= Total/@Transpose[Partition[subtimings,3]]

Out[7]= {0.3125,9.21875,6.16406}

The runtime for a blocked ParallelMap under the new partitioning scheme
should be about 9.3 seconds (the largest of the sub-time totals):

In[8]:= AbsoluteTiming[Flatten[Transpose[ParallelMap[(FactorInteger /@
#)&,
Transpose[Partition[#, Length[\$Slaves]]] &[
(10^Range[30, 50])/9 + 7]]]];]

Out[8]= {9.370195 Second, Null}

With enough fiddling you could probably beat the out-of-the box time
for ParallelMap, but that requires knowing before-hand which integer
factorizations take the longest.

Regards,

Schoeller Porter
Wolfram Research, Inc.

--
Hi,

I am looking for an efficient example of using the Parallel Toolkit.
Many other people on this forum seem to have noted what I have also
found to be the case, that the ParallelMap and ParallelTable commands
aren't terribly efficient.  The example code in ParEval.nb, for
instance

ParallelMap[FactorInteger, (10^Range[20, 40] )/9]

only seems to run about 20% faster on a six-node cluster vs my laptop.
Does anyone have an example of how to solve this same problem using an
efficient parallel technique?  I am fairly new to Mathematica, so

Thanks,
-geoff

Geoff Hulette