Re: Parallel Toolkit Example

*To*: mathgroup at smc.vnet.net*Subject*: [mg50528] Re: Parallel Toolkit Example*From*: Schoeller Porter <sporter at wolfram.com>*Date*: Thu, 9 Sep 2004 05:17:54 -0400 (EDT)*Sender*: owner-wri-mathgroup at wolfram.com

Geoff, Your example code will not gain anything in a parallel evaluation, since the evaluation in serial is less than the what the communication overhead would be in parallel. On a 500 MHz G4, I get the following evaluation time in Mathematica 5.0.1: In[1]:= AbsoluteTiming[Map[FactorInteger, (10^Range[20, 40] )/9];] Out[1]= {0.020316 Second, Null} As a rule of thumb, the most efficient parallel algorithms tend to be those that minimize the communication between the nodes, since each communication incurs a bit of overhead. You would typically want to subdivide the domain into only as many segments as you have nodes and evaluate your function over those segments in parallel. One way to do this (using your example, but keep in mind that it will not be faster), is to use Partition to split the input list and Flatten to collapse the results back into one list: Flatten[ParallelMap[ (FactorInteger /@ #)&, Partition[#, Length[#]/Length[$Slaves]]&[ (10^Range[20, 40])/9 ] ]] In practice, like many other things, the most efficient parallel algorithm is dependent on the problem that is being solved. The idea above is useful for problems where the time to evaluate the function over each individual segment is known to be roughly equal (this particular implementation also requires the length of the input list to be divisible by the number slave kernels). But this method is not terribly efficient for a problem with different characteristics. Using your example, but with larger integers, on a 4p 1.3GHz Itanium2 we get the following for the serial case: In[2]:= AbsoluteTiming[Map[FactorInteger, (10^Range[30, 50])/9+7];] Out[2]= {15.667588 Second, Null} and using the blocking described above with three slave kernels we get: In[3]:= AbsoluteTiming[Flatten[ParallelMap[(FactorInteger /@ #)&, Partition[#, Length[#]/Length[$Slaves]]&[(10^Range[30, 50])/9+7]]]] Out[3]= {13.527589 Second, Null} for a speedup of 1.15. For this problem, ParallelMap is better out of the box, In[4]:= AbsoluteTiming[ParallelMap[FactorInteger, (10^Range[30, 50])/9+7];] Out[4]= {8.239382 Second, Null} giving a speedup of 1.9. A cluster of 1GHz Xserves connected by GigE with three slave kernels shows similar behavior, with timings or 21.38, 18.06, and 11.81 for the serial case, blocked ParallelMap, and stock ParalleMap, respectively. If we look at the timings for each of the individual FactorInteger evaluations, we can see that there is a large variation in the evaluation times. Simply using Partition results in a nearly worst case scenario in which one of the slave kernels ends up with the bulk of the work: In[5]:= subtimings = Map[Timing[FactorInteger[#]][[1]] &, (10^Range[30, 50])/9 + 7]/Second Out[5]= {0.013672, 0.053712, 0.013672, 0.235351, 0.075195, 0.074219, 0.026367, 0.00293, 0.427735, 0.012695, 1.20605, 0.140625, 0.004882, 0.00293, 3.91309, 0.013671, 1.25684, 0.108399, 0.004883, 6.63379, 1.49707} In[6]:= Total /@ Partition[subtimings,7] Out[6]= {0.492187,1.79395,13.4092} Notice that the total of the sub-times for the last segment is very close to the blocked ParallelMap evaluation time. Armed with this knowledge, we can improve things by partitioning the input list to better distribute the longer-running evaluations: In[7]:= Total/@Transpose[Partition[subtimings,3]] Out[7]= {0.3125,9.21875,6.16406} The runtime for a blocked ParallelMap under the new partitioning scheme should be about 9.3 seconds (the largest of the sub-time totals): In[8]:= AbsoluteTiming[Flatten[Transpose[ParallelMap[(FactorInteger /@ #)&, Transpose[Partition[#, Length[$Slaves]]] &[ (10^Range[30, 50])/9 + 7]]]];] Out[8]= {9.370195 Second, Null} With enough fiddling you could probably beat the out-of-the box time for ParallelMap, but that requires knowing before-hand which integer factorizations take the longest. Regards, Schoeller Porter Wolfram Research, Inc. -- Hi, I am looking for an efficient example of using the Parallel Toolkit. Many other people on this forum seem to have noted what I have also found to be the case, that the ParallelMap and ParallelTable commands aren't terribly efficient. The example code in ParEval.nb, for instance ParallelMap[FactorInteger, (10^Range[20, 40] )/9] only seems to run about 20% faster on a six-node cluster vs my laptop. Does anyone have an example of how to solve this same problem using an efficient parallel technique? I am fairly new to Mathematica, so please be explicit. Thanks, -geoff Geoff Hulette MIT Academic Computing ghulette at mit.edu