MathGroup Archive 2012


Re: Speed of Mathematica on AMD machines

  • To: mathgroup at smc.vnet.net
  • Subject: [mg126509] Re: Speed of Mathematica on AMD machines
  • From: einschlag at gmail.com
  • Date: Wed, 16 May 2012 04:21:33 -0400 (EDT)
  • Delivered-to: l-mathgroup@mail-archive0.wolfram.com
  • References: <jog00v$g1d$1@smc.vnet.net> <joi3kq$n3s$1@smc.vnet.net>

Thank you, Oleksandr, for your insightful comments!

It appears that the name "Bulldozer" for the AMD FX processor is misleading: it suggests slow but numerous, well-coordinated cores, whereas in fact the cores are fast but do not cooperate well. My testing suggests that there are effectively 4 cores rather than the advertised 8.

My posted test was indeed imperfect, and I have revised it as shown below:

*******************************************************
DEFINITIONS

(* Test program 1 - efficiency of MKL, that is, automatic threading over the cores *)
NN = 1000;
AMatr = Table[RandomReal[], {i, 1, NN}, {j, 1, NN}];

TestProgram1 := Module[{},
  Do[MatrixExp[AMatr], {10}];
  ]
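
To check whether MKL actually spreads this load over all cores, one can inspect (and, if desired, pin) the MKL thread count. This is a sketch, not part of my original test; it assumes the "ParallelOptions" system option group with its "MKLThreadNumber" suboption, as present in Mathematica 8:

(* Sketch: how many cores Mathematica sees, and how many threads MKL may use *)
$ProcessorCount
SystemOptions["ParallelOptions"]  (* includes "MKLThreadNumber" *)

(* Optionally pin MKL to one thread, to separate raw core speed from
   threading efficiency when re-running TestProgram1 *)
SetSystemOptions["ParallelOptions" -> {"MKLThreadNumber" -> 1}];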

(* Test program 2 - Pure speed of one core  *)
cc = Compile[{{x, _Real}, {n, _Integer}},
   Module[{sum, inc}, sum = 1.0; inc = 1.0;
   
    Do[inc = inc*x/i; sum = sum + inc, {i, n}]; sum],
   CompilationTarget -> "C"];

TestProgram2 := Do[cc[1.6, 10000000], {100}]
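
As a quick sanity check on the single-core test (a sketch, not part of my original timing runs): for a purely serial compiled call, CPU time and wall-clock time should nearly coincide, so Timing and AbsoluteTiming can be compared directly:

(* Sketch: compare CPU time and wall-clock time for the serial compiled call *)
Timing[Do[cc[1.6, 10000000], {10}]]
AbsoluteTiming[Do[cc[1.6, 10000000], {10}]]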

(* Test program 3 - Parallelized operations  *)
cP1 = Compile[{{x}},
   Module[{sum = 1.0, inc = 1.0},
    Do[inc = inc*x/i; sum = sum + inc, {i, 10000000}]; sum],
   RuntimeAttributes -> {Listable}, Parallelization -> True,
   CompilationTarget -> "C"];
arg = Table[ 1.6, {100}];

TestProgram3 := cP1[arg];

Note that the ratio (timing of TestProgram2)/(timing of TestProgram3) gives the effective number of cores, since both evaluate the same compiled sum 100 times, serially and in parallel respectively (the ratios are computed explicitly after the results below).

EXECUTION

1) Execution by Mac Pro (Intel Xeon 2 x 4 core, 2.4 GHz)

TestProgram1 // AbsoluteTiming

{8.600441, Null}  - Best MKL performance in threading

TestProgram2 // AbsoluteTiming

{13.93487, Null} - Slowest single-core speed

TestProgram3; // AbsoluteTiming

{1.229, Null}  - effective number of cores ≈ 11.3, which I do not understand


2) Execution by Lenovo laptop (Intel i7-QM2060, 2.2 GHz, Windows 7 64 bit)

TestProgram1 // AbsoluteTiming

{11.7116698, Null}

In[33]:= TestProgram2 // AbsoluteTiming

Out[33]= {10.5296023, Null}

In[34]:= TestProgram3; // AbsoluteTiming

Out[34]= {3.5562034, Null} - effective number of cores ≈ 3.0, which is plausible for a quad-core processor


3) Execution by iBuyPower PC (AMD FX "Bulldozer", 8 cores, 3.6 GHz, Linux Ubuntu 64 bit)

TestProgram1 // AbsoluteTiming

{14.650569, Null} - Poor MKL performance in threading

TestProgram2 // AbsoluteTiming

{4.049293, Null} - the fastest single core

TestProgram3; // AbsoluteTiming

{0.914659, Null} - effective number of cores ≈ 4.4 rather than the advertised 8
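
Putting the ratio from the note above into numbers (a small worked sketch using the timings reported on the three machines):

effectiveCores[tSerial_, tParallel_] := tSerial/tParallel

effectiveCores[13.93487, 1.229]        (* Mac Pro: ~11.3 *)
effectiveCores[10.5296023, 3.5562034]  (* Lenovo:  ~3.0  *)
effectiveCores[4.049293, 0.914659]     (* AMD FX:  ~4.4  *)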

************************************************************

These results show that the slow Intel Xeon cores of the Mac Pro are the best coordinated, while the AMD machine is just the opposite. Note how close the two machines are in TestProgram3, despite the large price difference ($3500 Mac Pro vs. $850 iBuyPower). For programs that rely on self-written procedures compiled to C, our AMD PC is quite competitive. The Lenovo laptop is a good all-rounder, sitting between the other two machines.

Dmitry 



On Friday, May 11, 2012 12:13:14 AM UTC-4, Oleksandr Rasputinov wrote:
> On Thu, 10 May 2012 09:59:11 +0100, <einschlag at gmail.com> wrote:
>
> > We have recently bought an iBuyPower gaming PC for our research group:
> >
> > AMD FX 8 core, 3.6 GHz, 16 GB RAM
> >
> > MathematicaMark8 Benchmark 0.86 is not bad, considering the price ~$800
> > of this PC but I was expecting much more.
> >
> > Apparently Intel's MKL library used by Mathematica is not optimized for
> > AMD processors.
> >
> > A test program calculating exponentials of large matrices takes 13 s on
> > the AMD PC and only 8 s on my Mac Pro (Mathematica benchmark 0.7) that 
> > has 8 Intel Xeon cores at 2.4 GHz. And on my Lenovo laptop the program 
> > runs 9 s. I blame it on the MKL inadequacy for AMD.
> >
> > TestProgram := Module[{},
> >   NN = 1000;
> >   AMatr = Table[RandomReal[], {i, 1, NN}, {j, 1, NN}];
> >   NExec = 10;
> >   For[i = 1, i < NExec, i++,
> >    MatrixExp[AMatr];
> >    ];
> >   ]
> >
> > Execution by iBuyPower PC (AMD FX 8 core, Linux Ubuntu 64 bit)
> >
> > TestProgram // AbsoluteTiming
> >
> > {13.230105, Null}
> >
> > Execution by Mac Pro (Intel Xeon 2 x 4 core)
> >
> > TestProgram // AbsoluteTiming
> >
> > {8.126944, Null}
> >
> > Execution by Lenovo laptop (Intel i7-QM2060, Windows 7 64 bit)
> >
> > TestProgram // AbsoluteTiming
> >
> > {9.4275392, Null}
> >
> >
> > On the other hand, a program compiling in C from Mathematica's help runs 
> > very fast on the AMD PC:
> >
> > TestProgram2 := Module[{},
> >   c = Compile[ {{x, _Real}, {n, _Integer}},
> >     	Module[ {sum, inc}, sum = 1.0; inc = 1.0;
> >      Do[inc = inc*x/i; sum = sum + inc, {i, n}]; sum],
> >     CompilationTarget -> "C"];
> >   c[1.6, 10000000];
> >   ]
> >
> > Execution by iBuyPower PC (AMD FX 8 core, Linux Ubuntu 64 bit, GCC 
> > compiler)
> >
> > TestProgram2 // AbsoluteTiming
> >
> > {0.114427, Null}
> >
> > Execution by Mac Pro (Intel Xeon 2 x 4 core, GCC compiler)
> >
> > TestProgram2 // AbsoluteTiming
> >
> > {0.212875, Null}
> >
> > Execution by Lenovo laptop (Intel i7-QM2060, Windows 7 64 bit, Microsoft 
> > Visual C++)
> >
> > TestProgram2 // AbsoluteTiming
> >
> > {0.3540203, Null}
> >
> > It seems the second test program is not using MKL and thus AMD becomes 
> > very efficient.
> >
> > I will continue testing.
> >
> > Is there any way to improve Mathematica's performance on AMD machines?
> >
> > Dmitry
> >
>
> In the past, Intel had been known to engage in anticompetitive practices 
> with respect to AMD, and quite rightly was subject to legal penalties for
> this. (Specifically, they encouraged large computer manufacturers such as
> Dell to take up exclusive supply contracts by means of large discounts and 
> availability guarantees.) As a result of this judgment there has been a 
> lot of general hysteria that Intel may still be discriminating against AMD 
> performance-wise in their library and compiler products, which has 
> culminated in legal threats resulting in the large disclaimers posted all
> over Intel's products stating that they are not meant for anything other 
> than Intel processors.
>
> Suspicion and disclaimers are one thing, but actual performance is 
> another. As you may be aware, AMD offers their own math library, ACML. 
> What most people who level this criticism of MKL are not aware of, 
> however, is that MKL actually performs better than ACML, *even on AMD 
> processors*. So, even if it is not optimized as thoroughly as it might be
> for AMD processors (which is more than likely the case; Intel does not 
> have an infinite development budget and there is no financial incentive 
> for them to go to great lengths optimizing for other manufacturers' 
> processors, which have performance characteristics very different to their 
> own), MKL is still better than the alternatives.
>
> Now, how then to explain the poor performance you observe? Unfortunately,
> the latest generation of AMD processors are simply not very good (the 
> Bulldozer processors are actually worse than the previous-generation 
> Phenom II processors in many applications), whereas Intel's products have
> been making dramatic gains lately despite AMD's reduced competitiveness. 
> The end result is that a Bulldozer core is "worth" about half a Sandy 
> Bridge core, clock for clock, especially in floating-point workloads since 
> a single FP unit is shared between two of what AMD calls cores (indeed, 
> many have said that AMD's "8 core" processors are more correctly referred
> to as genuinely having 4 cores due to much shared apparatus, but for 
> marketing reasons, AMD is obviously not buying that argument). In regard 
> to your results from TestProgram2: sorry to say, these are invalid because 
> the time taken to compile to C completely overwhelms the actual runtime, 
> and you include both in the assessment, as well as using AbsoluteTiming 
> which is not appropriate for single-threaded code with short runtimes 
> executing inside the Mathematica kernel. A more valid test is:
>
> c = Compile[{{x, _Real}, {n, _Integer}},
>     Module[{sum, inc}, sum = 1.0; inc = 1.0;
>     Do[inc = inc*x/i; sum = sum + inc, {i, n}]; sum],
>     CompilationTarget -> "C"
>   ];
>
> Do[c[1.6, 10000000], {10}] // Timing
>
> which on my computer (Intel Core 2, 3.2GHz) takes about 0.65 seconds, i.e. 
> 65 ms for a single evaluation of c[1.6, 10000000].
>
> Your matrix exponential test would also be better posed as:
>
> NN = 1000;
> mat = RandomReal[{0, 1}, {NN, NN}];
> Do[MatrixExp[mat], {10}] // AbsoluteTiming
>
> (I get 9.5 seconds.)
>
> However I would be reluctant to draw any firm conclusion from these tests
> if I were you. Far better to look at published benchmarks for real 
> applications, for instance:
>
> http://techreport.com/articles.x/21813/15
>
> or
>
> http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/7
>
> which both show that Bulldozer performance is a very mixed bag in general. 
> While there are a few applications in which it can match or only just 
> outperform Intel's offerings, for the most part it falls behind them 
> considerably.
>
> Best,
>
> O. R.


