MathGroup Archive: September 2004 [00175]

[Date Index] [Thread Index] [Author Index]

Re: Parallel Toolkit Example

To: mathgroup at smc.vnet.net
Subject: [mg50528] Re: Parallel Toolkit Example
From: Schoeller Porter <sporter at wolfram.com>
Date: Thu, 9 Sep 2004 05:17:54 -0400 (EDT)
Sender: owner-wri-mathgroup at wolfram.com

Geoff,

Your example code will not gain anything in a parallel evaluation, 
since the evaluation in serial is less than the what the communication 
overhead would be in parallel. On a 500 MHz G4, I get the following 
evaluation time in Mathematica 5.0.1:

	In[1]:= AbsoluteTiming[Map[FactorInteger, (10^Range[20, 40] )/9];]
	Out[1]= {0.020316 Second, Null}

As a rule of thumb, the most efficient parallel algorithms tend to be 
those that minimize the communication between the nodes, since each 
communication incurs a bit of overhead. You would typically want to 
subdivide the domain into only as many segments as you have nodes and 
evaluate your function over those segments in parallel. One way to do 
this (using your example, but keep in mind that it will not be faster), 
is to use Partition to split the input list and Flatten to collapse the 
results back into one list:

	Flatten[ParallelMap[
		(FactorInteger /@ #)&,
		Partition[#, Length[#]/Length[$Slaves]]&[
			(10^Range[20, 40])/9
		]
	]]

In practice, like many other things, the most efficient parallel 
algorithm is dependent on the problem that is being solved. The idea 
above is useful for problems where the time to evaluate the function 
over each individual segment is known to be roughly equal (this 
particular implementation also requires the length of the input list to 
be divisible by the number slave kernels). But this method is not 
terribly efficient for a problem with different characteristics. Using 
your example, but with larger integers, on a 4p 1.3GHz Itanium2 we get 
the following for the serial case:

	In[2]:= AbsoluteTiming[Map[FactorInteger, (10^Range[30, 50])/9+7];]

	Out[2]= {15.667588 Second, Null}

and using the blocking described above with three slave kernels we get:

	In[3]:= AbsoluteTiming[Flatten[ParallelMap[(FactorInteger /@ #)&,
		Partition[#, Length[#]/Length[$Slaves]]&[(10^Range[30, 50])/9+7]]]]

	Out[3]= {13.527589 Second, Null}

for a speedup of 1.15. For this problem, ParallelMap is better out of 
the box,

	In[4]:= AbsoluteTiming[ParallelMap[FactorInteger, (10^Range[30, 
50])/9+7];]

	Out[4]= {8.239382 Second, Null}

giving a speedup of 1.9.  A cluster of 1GHz Xserves connected by GigE 
with three slave kernels shows similar behavior, with timings or 21.38, 
18.06, and 11.81 for the serial case, blocked ParallelMap, and stock 
ParalleMap, respectively.

If we look at the timings for each of the individual FactorInteger 
evaluations, we can see that there is a large variation in the 
evaluation times. Simply using Partition results in a nearly worst case 
scenario in which one of the slave kernels ends up with the bulk of the 
work:

	In[5]:= subtimings = Map[Timing[FactorInteger[#]][[1]] &,
		(10^Range[30, 50])/9 + 7]/Second

	Out[5]= {0.013672, 0.053712, 0.013672, 0.235351, 0.075195, 0.074219,
  		0.026367, 0.00293, 0.427735, 0.012695, 1.20605, 0.140625,
		0.004882, 0.00293, 3.91309, 0.013671, 1.25684, 0.108399,
		0.004883, 6.63379, 1.49707}

	In[6]:= Total /@ Partition[subtimings,7]

	Out[6]= {0.492187,1.79395,13.4092}

Notice that the total of the sub-times for the last segment is very 
close to the blocked ParallelMap evaluation time. Armed with this 
knowledge, we can improve things by partitioning the input list to 
better distribute the longer-running evaluations:

	In[7]:= Total/@Transpose[Partition[subtimings,3]]

	Out[7]= {0.3125,9.21875,6.16406}

The runtime for a blocked ParallelMap under the new partitioning scheme 
should be about 9.3 seconds (the largest of the sub-time totals):

	In[8]:= AbsoluteTiming[Flatten[Transpose[ParallelMap[(FactorInteger /@ 
#)&,
		Transpose[Partition[#, Length[$Slaves]]] &[
			(10^Range[30, 50])/9 + 7]]]];]

	Out[8]= {9.370195 Second, Null}

With enough fiddling you could probably beat the out-of-the box time 
for ParallelMap, but that requires knowing before-hand which integer 
factorizations take the longest.

Regards,

Schoeller Porter
Wolfram Research, Inc.


--
Hi,

I am looking for an efficient example of using the Parallel Toolkit.
Many other people on this forum seem to have noted what I have also
found to be the case, that the ParallelMap and ParallelTable commands
aren't terribly efficient.  The example code in ParEval.nb, for
instance

ParallelMap[FactorInteger, (10^Range[20, 40] )/9]

only seems to run about 20% faster on a six-node cluster vs my laptop.
Does anyone have an example of how to solve this same problem using an
efficient parallel technique?  I am fairly new to Mathematica, so
please be explicit.

Thanks,
-geoff

Geoff Hulette
MIT Academic Computing
ghulette at mit.edu

Prev by Date: Histograms do not concur; why?

Next by Date: Cell write protect?

Previous by thread: Re: Parallel Toolkit Example

Next by thread: Re: Publicon problems converting sample document to LaTeX