MathGroup Archive: November 2010 [00676]

Re: disappointing CUDA speed

To: mathgroup at smc.vnet.net
Subject: [mg114178] Re: disappointing CUDA speed
From: David Koslicki <dmkoslicki at gmail.com>
Date: Fri, 26 Nov 2010 05:28:26 -0500 (EST)
References: <iclfde$l9u$1@smc.vnet.net>

On Nov 25, 5:56 am, Gianluca Gorni <gianluca.go... at fastwebnet.it>
wrote:
> Hi,
>
> I have a 1 year old Apple MacBookPro. I installed
> the cudadriver_3.1.17_macos and then tried the first
> examples in the documentation:
>
> Needs["CUDALink`"]
> CUDAQ[]
>   True
> randM = RandomReal[1, {3000, 3000}];
> AbsoluteTiming[randM.randM;]
>   {2.688389,Null}
>
> AbsoluteTiming[CUDADot[randM, randM];]
>   {7.328353,Null}
>
> Quite a letdown.
> Did I do something wrong?
>
> Gianluca
>
> CUDAInformation[]
> {1 -> {"Name" -> "GeForce 9400M", "Clock Rate" -> 1100000,
>    "Compute Capabilities" -> 1.1`, "GPU Overlap" -> 0,
>    "Maximum Block Dimensions" -> {512, 512, 64},
>    "Maximum Grid Dimensions" -> {65535, 65535, 1},
>    "Maximum Threads Per Block" -> 512,
>    "Maximum Shared Memory Per Block" -> 16384,
>    "Total Constant Memory" -> 65536, "Warp Size" -> 32,
>    "Maximum Pitch" -> 2147483647,
>    "Maximum Registers Per Block" -> 8192, "Texture Alignment" -> 256,
>    "Multiprocessor Count" -> 2,
>    "Core Count" -> 16,
>    "Execution Timeout" -> 1, "Integrated" -> False,
>    "Can Map Host Memory" -> False, "Compute Mode" -> "Default",
>    "Texture1D Width" -> 8192, "Texture2D Width" -> 65536,
>    "Texture2D Height" -> 32768, "Texture3D Width" -> 2048,
>    "Texture3D Height" -> 2048, "Texture3D Depth" -> 2048,
>    "Texture2D Array Width" -> 8192, "Texture2D Array Height" -> 8192,
>    "Texture2D Array Slices" -> 512, "Surface Alignment" -> 256,
>    "Concurrent Kernels" -> False, "ECC Enabled" -> False,
>    "Total Memory" -> 265945088}}

I think the issue is the time spent sending the data to your GPU and
getting the results back. The computation itself is quick compared
with the time it takes to transfer the data.
Instead, load the matrix into GPU memory first, so that the transfer
is kept out of the timing:

randM = CUDAMemoryLoad[RandomReal[1, {3000, 3000}]];
AbsoluteTiming[res = CUDADot[randM, randM];]

You can then retrieve the result with:

output = CUDAMemoryGet[res];

I think you will only see a speed increase when the calculation is
much more intensive than simply multiplying 3000x3000 matrices.
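
To check where the time actually goes on a given machine, each phase can be timed on its own. The following is only a rough sketch, not something taken from the documentation examples: the variable names (host, dev, tUp, tDot, tDown) are made up for illustration, and it assumes that CUDAMemoryLoad only registers the matrix with CUDALink, so CUDAMemoryCopyToDevice is called explicitly to force the upload before timing it. It also assumes that CUDADot has finished by the time it returns; if it has not, part of the compute time will show up in the download figure instead.

```mathematica
Needs["CUDALink`"]

n = 3000;
host = RandomReal[1, {n, n}];

(* Register the matrix with CUDALink; assumed not to copy anything yet *)
dev = CUDAMemoryLoad[host];

(* Phase 1: host -> GPU transfer, forced explicitly so it can be timed *)
tUp = First[AbsoluteTiming[CUDAMemoryCopyToDevice[dev];]];

(* Phase 2: the multiplication, operating on the memory handle *)
{tDot, res} = AbsoluteTiming[CUDADot[dev, dev]];

(* Phase 3: GPU -> host transfer of the result *)
{tDown, out} = AbsoluteTiming[CUDAMemoryGet[res]];

(* Seconds spent on upload, multiplication, and download *)
{tUp, tDot, tDown}

(* Release the GPU memory once the result is back on the host *)
CUDAMemoryUnload[dev];
CUDAMemoryUnload[res];
```

If tDot alone is still larger than the time taken by randM.randM on the CPU, then the transfer is not the bottleneck on that card.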

Hope this helps,

~David
