MathGroup Archive 2004


Example Problems

  • To: mathgroup at smc.vnet.net
  • Subject: [mg50565] Example Problems
  • From: "Mark Westwood [EPCC]" <markw at epcc.ed.ac.uk>
  • Date: Fri, 10 Sep 2004 04:06:17 -0400 (EDT)
  • Organization: Edinburgh University
  • References: <200409030735.DAA15550@smc.vnet.net> <41387904.6000904@wolfram.com> <chbleb$rk3$1@smc.vnet.net> <413D8489.2020603@epcc.ed.ac.uk> <chmmjd$9fa$1@smc.vnet.net>
  • Sender: owner-wri-mathgroup at wolfram.com

Geoff

Some problems which satisfy the first of the two criteria:

1) Many image processing tasks: simply chop the image into p equal parts 
(p being the number of processors you are using) and let each processor 
go to work.  You have to be a little careful at the boundaries between 
the parts of the image; generally it is simplest to send slightly 
overlapping segments out to each processor, where the overlap has the 
same dimension as the matrix that defines the image transformation you 
are applying.  These problems decompose very well because they require 
NO communication between processors except at the start, when the image 
is torn into pieces, and at the end, when it is reassembled.
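
For instance, here is a rough, untested sketch in Mathematica (the 
matrix img, the 3x3 mean-filter kernel ker, and p = 4 are all 
illustrative choices; p is assumed to divide the image height):

  ker = ConstantArray[1./9, {3, 3}];   (* 3x3 mean filter *)
  r = (Length[ker] - 1)/2;             (* halo width = kernel radius *)
  img = RandomReal[1, {512, 512}];
  p = 4; n = Length[img];
  (* each strip carries a halo of r extra rows from its neighbours *)
  strips = Table[
    img[[Max[1, 1 + (k - 1) n/p - r] ;; Min[n, k n/p + r]]], {k, p}];
  (* ListConvolve without padding shrinks each strip by the halo,
     so the filtered pieces join up exactly *)
  result = Join @@ ParallelMap[ListConvolve[ker, #] &, strips];
  result == ListConvolve[ker, img]     (* -> True *)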

2) Cellular automata also decompose neatly, but these are slightly 
trickier during the course of a computation.  If you know Conway's Game 
of Life you will know that it occasionally throws up patterns which 
seem to move across the map.  You need to take account of these when 
they reach the boundary of the domain on a processor, 'handing off' to 
the neighbouring processor.
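
The usual trick is to give each processor a 'ghost' copy of its 
neighbours' border cells and refresh it every generation.  A rough 
one-dimensional sketch (rule 30 and the fixed zero boundary at the 
global edges are illustrative assumptions):

  step[chunks_] := Module[{ghosted},
    ghosted = Table[Join[
        {If[k == 1, 0, Last[chunks[[k - 1]]]]},                (* left ghost *)
        chunks[[k]],
        {If[k == Length[chunks], 0, First[chunks[[k + 1]]]]}], (* right ghost *)
      {k, Length[chunks]}];
    (* advance one generation, then strip the ghosts off again *)
    ParallelMap[Most[Rest[Last[CellularAutomaton[30, #, 1]]]] &, ghosted]];
  chunks = Partition[RandomInteger[1, 64], 16];   (* p = 4 chunks *)
  Join @@ Nest[step, chunks, 10]        (* the row after 10 generations *)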

3) Search problems.  For example, if you want to search a string a 
gazillion characters long for every occurrence of a substring, simply 
chop the long string into p pieces and let each processor get to work. 
As in the image case, let the pieces overlap slightly - by one less 
than the length of the substring - so that matches straddling a cut 
are not lost.  This won't save any time if communicating the chopped-up 
string takes longer than searching it, but if you want to get really 
sophisticated you could have each individual processor read its own 
piece of the string from disk (or wherever it is) and work on it 
entirely independently.
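
A rough sketch of the overlapping-pieces idea (the function name and 
variables are illustrative; each worker reports absolute positions, and 
Union removes the duplicates that the overlaps can produce):

  search[str_, sub_, p_] :=
   Module[{n = StringLength[str], m = StringLength[sub], len},
    len = Ceiling[n/p];
    (* each piece carries m - 1 extra characters past the cut *)
    Union @@ ParallelMap[
      ((#[[1]] - 1) + StringPosition[#[[2]], sub][[All, 1]]) &,
      Table[{1 + (k - 1) len,
        StringTake[str, {1 + (k - 1) len, Min[n, k len + m - 1]}]}, {k, p}]]];
  search["abracadabra", "abra", 3]   (* -> {1, 8} *)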

4) Genetic programming: breed a population of programs on each 
processor, let each population evolve for a few generations, then bring 
the most successful program from each processor together and hold a 
grand tournament.  Again, you'll have to experiment with the ratio of 
computation to communication on your cluster to get the numbers right.
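
A toy sketch of that 'island' pattern, with real numbers standing in 
for programs (real GP would evolve expression trees, but the 
communication pattern is the same; all names and parameters here are 
illustrative):

  fitness[x_] := -Abs[x - Pi];                 (* bigger is better *)
  evolve[pop_, gens_] := Nest[
    Function[ps, With[{best = TakeLargestBy[ps, fitness, 5]},
      (* survivors plus mutated copies of random survivors *)
      Join[best, Table[RandomChoice[best] + RandomReal[{-0.1, 0.1}], 15]]]],
    pop, gens];
  DistributeDefinitions[fitness, evolve];      (* the workers need these *)
  islands = Table[RandomReal[{0, 10}, 20], 4]; (* one population per processor *)
  champions =
    ParallelMap[First[TakeLargestBy[evolve[#, 50], fitness, 1]] &, islands];
  First[TakeLargestBy[champions, fitness, 1]]  (* the grand tournament *)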

5) Look in the literature under 'embarrassingly parallel problems' or 
'task farm'.


Satisfying the second criterion is easy - just make the problem big 
enough that the gain in computation rate overwhelms the cost of 
communication.  How big that is depends on the problem, the algorithm, 
and on system characteristics such as interconnect latency and 
transfer rate.
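
One way to find the break-even point on a given cluster is simply to 
time the serial and parallel versions as the work grows, along these 
lines (a rough sketch; the repunit-style inputs and the sizes just 
mimic the example from this thread):

  work[n_] := FactorInteger[(10^n - 1)/9];   (* factor an n-digit repunit *)
  DistributeDefinitions[work];
  Table[{n,
     First[AbsoluteTiming[Map[work, Range[n, n + 7]]]],
     First[AbsoluteTiming[ParallelMap[work, Range[n, n + 7]]]]},
    {n, 20, 50, 15}]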

Hope this helps

Regards
Mark Westwood
Edinburgh Parallel Computing Centre

Geoff Hulette wrote:
> Hmm, that is a very good point.  Well, if factorization is the problem, 
> does anyone have a range of functions that would satisfy Mark's 
> requirements?  That is, they must 1) all compute in roughly the same 
> time and 2) take long enough to compute to make the parallel overhead 
> worthwhile.
> 
> Thanks, this is really helpful to me.
> -geoff
> 
> 
> On Sep 7, 2004, at 5:51 AM, Mark Westwood [EPCC] wrote:
> 
> 
>>Geoff,
>>
>>If I've understood you correctly you are trying to distribute the 
>>factorisation of a number of large integers over a number of 
>>processors.  If that is correct, read on; if I've misunderstood, do 
>>something more productive with your time.
>>
>>Suppose that, on a single-processor computer, your program would spend 
>>80% of its time factoring one of the integers and 20% factoring the 
>>others.  Now send the integers to be factored out to the worker 
>>processors.  The one factorising the problem integer will finish in 
>>approximately 80% of the time that your single processor took; the 
>>other processors will finish more quickly, but that doesn't affect the 
>>total execution time of your program, which is only 20% lower than on 
>>the single-processor machine.
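>>
>>To put numbers on it: with serial time T and p worker processors, the 
>>parallel run still takes about max(0.8T, 0.2T/(p-1)) = 0.8T, a 
>>speedup of only 1.25 no matter how large p gets.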
>>
>>I think Daniel Lichtblau is right.  What perhaps isn't clear is that 
>>ParallelMap does not parallelise the FactorInteger function, in the 
>>sense of having multiple processors working simultaneously on the 
>>factorisation of one integer; it distributes the integers to the 
>>processors, and each works independently on the set it is assigned.  
>>You may actually have been lucky to get a 20% speedup and not to have 
>>had it consumed by the overhead of distributing the computation and 
>>gathering the results.
>>
>>You could run some tests to see how long each of the integers in your 
>>set takes to factorise.  The distribution of times is most unlikely to 
>>be even approximately uniform.  The total execution time is a little 
>>bit more than the execution time for factoring the worst case.
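>>
>>For instance, something like (with 'integers' standing for your set)
>>
>>  times = Map[First[AbsoluteTiming[FactorInteger[#]]] &, integers];
>>
>>will show the spread at a glance.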
>>
>>As to the question of examples of where the Parallel Toolkit is really 
>>useful, I don't have any, since I don't have the toolkit.  But I would 
>>look for problems where:
>>a) the individual function evaluations should all take approximately 
>>the same time, which avoids the overall execution time being dominated 
>>by 'pathological' cases;
>>b) the individual execution times are likely to be long enough to 
>>amortise the distribution overheads; you'll have a better sense of 
>>what those overheads are on your cluster than I could guess at.
>>
>>If (a) is not true for your problem but you have some well-founded 
>>expectation of how the execution time varies with the problem 
>>parameters, you could write your own scheduling algorithm - look under 
>>bin-packing or the knapsack problem in the literature - to distribute 
>>the computational load as you direct.
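>>
>>The usual greedy heuristic ('longest processing time first') is easy 
>>to sketch: sort the jobs by predicted cost, then always hand the next 
>>job to the least-loaded processor.  Roughly (names illustrative):
>>
>>  schedule[costs_, p_] :=
>>   Module[{bins = Table[{}, p], load = Table[0., p], j},
>>    Do[j = First[Ordering[load, 1]];        (* least-loaded bin *)
>>      AppendTo[bins[[j]], job];
>>      load[[j]] += costs[[job]],
>>     {job, Ordering[costs, All, Greater]}]; (* jobs, biggest first *)
>>    bins];
>>  schedule[{5., 3., 8., 1.}, 2]   (* -> {{3, 4}, {1, 2}} *)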
>>
>>Now, if the Parallel Toolkit could really parallelise functions I 
>>might be interested in a copy; 'til then I'll stick to Fortran for 
>>high performance number crunching.
>>
>>Regards
>>Mark Westwood
>>Edinburgh Parallel Computing Centre
>>
>>Geoff Hulette wrote:
>>
>>>On Sep 3, 2004, at 10:00 AM, Daniel Lichtblau wrote:
>>>
>>>>Geoff Hulette wrote:
>>>>
>>>>
>>>>>Hi,
>>>>>I am looking for an efficient example of using the Parallel 
>>>>>Toolkit.  Many other people on this forum seem to have noted what I 
>>>>>have also found to be the case, that the ParallelMap and 
>>>>>ParallelTable commands aren't terribly efficient.  The example code 
>>>>>in ParEval.nb, for instance
>>>>>ParallelMap[FactorInteger, (10^Range[20, 40] )/9]
>>>>>only seems to run about 20% faster on a six-node cluster vs my 
>>>>>laptop.  Does anyone have an example of how to solve this same 
>>>>>problem using an efficient parallel technique?  I am fairly new to 
>>>>>Mathematica, so please be explicit.
>>>>>Thanks,
>>>>>-geoff
>>>>>Geoff Hulette
>>>>>MIT Academic Computing
>>>>>ghulette at mit.edu
>>>>
>>>>Not that I know anything about it, but you are probably looking at a 
>>>>latency issue with this example. Generally speaking, for ParallelMap 
>>>>et al. to be of use, the time to process the individual elements 
>>>>needs to be large compared to the overhead time of shipping back and 
>>>>forth over processors. From what I can tell this is not the case for 
>>>>your example (they all factor quite quickly).
>>>>
>>>>If they did not, you could run into a different problem. For some 
>>>>sets of integers the overall factorization speed will be dominated 
>>>>by the time needed to factor one particular element. In such cases, 
>>>>again, a parallelized version would not be of much help. But for 
>>>>sets where two or more of the inputs are slow to factor it would of 
>>>>course give a considerable boost.
>>>>
>>>>
>>>>Daniel Lichtblau
>>>>Wolfram Research
>>>>
>>>>
>>>
>>>Well, that is certainly true.  But, if I am not mistaken (and I could 
>>>be, since I am new to Mathematica), the problem I am trying out 
>>>factors a large range of integers, and any one element in the set to 
>>>be factored can be computed independently of the others.  So, I 
>>>suspect that the ParallelMap command is inefficient in that it first 
>>>computes the range (21 integers, in this case) and sends each one (21 
>>>problems total) out over the network.  A far more efficient solution 
>>>would be to break the problem up over the number of processors 
>>>available and let each node handle its part of the range.  In this 
>>>case each processor would only have to handle one (somewhat more 
>>>difficult) problem, reducing the latency.  This would also allow the 
>>>process to scale better as the range increases, since each processor 
>>>would always only be tackling one problem (with only one associated 
>>>overhead cost), while the time spent in computation would increase 
>>>according to the complexity of the problem.  Does anybody have a way 
>>>to do this?  Or, as a matter of fact, does anyone have any examples 
>>>at all of efficient Parallel Toolkit routines?
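>>>
>>>Something like this rough, untested sketch is what I have in mind 
>>>($KernelCount standing for however many worker kernels there are):
>>>
>>>  ins = (10^Range[20, 40])/9;
>>>  len = Ceiling[Length[ins]/$KernelCount];
>>>  chunks = Partition[ins, len, len, {1, 1}, {}]; (* one ragged chunk each *)
>>>  Join @@ ParallelMap[Map[FactorInteger, #] &, chunks]
>>>
>>>That way the per-job overhead is paid once per processor rather than 
>>>once per integer.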
>>>Thanks,
>>>geoff
>>
>>
> 

