cuda - Calculating performance of CUFFT

August 15, 2011

I am running CUFFT on chunks (n*n/p) divided among multiple GPUs, and I have a question regarding calculating the performance. First, a bit about how I am doing it:

1. Send n*n/p chunks to each GPU.
2. Run a batched 1-D FFT on each row across the p GPUs.
3. Get the n*n/p chunks back to the host and perform a transpose on the entire dataset.
4. Ditto step 1.
5. Ditto step 2.
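The steps above can be sketched on the host as a row-column decomposition. This is only an illustration, not the asker's code: a naive pure-Python DFT stands in for the batched CUFFT call, the "GPUs" are simulated by slicing rows, n is assumed to be divisible by p, and every function name here is hypothetical. The final transpose is an addition that restores the original layout so the result can be compared with a direct 2-D transform.

```python
import cmath

def dft_1d(row):
    """Naive O(n^2) 1-D DFT; stands in for one transform of a batched FFT call."""
    n = len(row)
    return [sum(row[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def transpose(matrix):
    """Host-side transpose of the full dataset (step 3)."""
    return [list(col) for col in zip(*matrix)]

def split_rows(matrix, p):
    """Split the rows into p contiguous chunks of n/p rows, one per simulated GPU."""
    rows_per_gpu = len(matrix) // p
    return [matrix[g * rows_per_gpu:(g + 1) * rows_per_gpu] for g in range(p)]

def row_pass(matrix, p):
    """Steps 1-2 (and 4-5): hand out chunks, transform every row, gather results."""
    out = []
    for chunk in split_rows(matrix, p):
        out.extend(dft_1d(row) for row in chunk)
    return out

def fft_2d_row_column(matrix, p):
    after_rows = row_pass(matrix, p)        # steps 1-2
    transposed = transpose(after_rows)      # step 3
    after_cols = row_pass(transposed, p)    # steps 4-5
    return transpose(after_cols)            # extra step: restore original layout

data = [[complex(r * 4 + c, 0) for c in range(4)] for r in range(4)]
result = fft_2d_row_column(data, p=2)
```

On real hardware each `row_pass` would involve a host-to-device copy, the batched transform, and a device-to-host copy per GPU, which are the three timings the question adds up.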

Gflops = (1e-9 * 5 * n * n * lg(n*n)) / execution time
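As a quick check of the arithmetic, the formula can be written as a small function. This is a minimal sketch that assumes `lg` means the base-2 logarithm and that the execution time is in seconds; the function name and the sample values are made up for illustration.

```python
import math

def fft2d_gflops(n, execution_time_s):
    """Gflops for an n x n complex 2-D FFT using the 5 * n*n * lg(n*n) operation count."""
    points = n * n
    operation_count = 5 * points * math.log2(points)
    return 1e-9 * operation_count / execution_time_s

# Hypothetical example: a 4096 x 4096 transform that took 50 ms.
example = fft2d_gflops(4096, 0.05)
```

For n = 4096 the operation count is 5 * 2^24 * 24, about 2.01e9, so 50 ms corresponds to roughly 40 Gflops.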

and the execution time is calculated as:

execution time = sum(memcpyHtoD + kernel + memcpyDtoH times for the row and col FFT for each GPU)

Is this the correct way to evaluate CUFFT performance on multiple GPUs? Is there any other way I could represent the performance of the FFT?

Thanks.

If you are doing a complex transform, the operation count is correct (it should be 2.5 n log2(n) for a real valued transform), but the Gflop formula is incorrect. In a parallel, multiprocessor operation, the usual calculation of throughput is

operation count / wall clock time
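A minimal sketch of that measurement follows. It does not touch a GPU: each worker is a hypothetical stand-in that simply sleeps, the durations are invented, and the only point is that the timer wraps the whole concurrent run rather than each piece separately.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor

def fake_gpu_work(seconds):
    """Hypothetical stand-in for one GPU's copy + FFT + copy; it only sleeps."""
    time.sleep(seconds)
    return seconds

def timed_parallel_run(per_gpu_seconds):
    """Run all workers concurrently and return the elapsed wall clock time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(per_gpu_seconds)) as pool:
        list(pool.map(fake_gpu_work, per_gpu_seconds))
    return time.perf_counter() - start

n = 1024
operation_count = 5 * n * n * math.log2(n * n)  # complex transform count
wall_clock = timed_parallel_run([0.02, 0.03, 0.025, 0.03])
throughput_gflops = 1e-9 * operation_count / wall_clock
```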

In your case, presuming the GPUs are operating in parallel, either measure the wall clock time (i.e. how long the whole operation took) and use that as the execution time, or use this:

execution time = max(memcpyHtoD + kernel + memcpyDtoH times for the row and col FFT for each GPU)
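The difference between the two formulas is easy to see with numbers. The timings below are invented for illustration (seconds, row and column passes combined), and the helper names are hypothetical.

```python
def gpu_total(t):
    """Total time one GPU spent on copies and kernels."""
    return t["htod"] + t["kernel"] + t["dtoh"]

def serial_time(per_gpu):
    """Sum over GPUs: the time if the GPUs ran one after another."""
    return sum(gpu_total(t) for t in per_gpu)

def parallel_time(per_gpu):
    """Max over GPUs: the slowest GPU bounds a concurrent run."""
    return max(gpu_total(t) for t in per_gpu)

# Hypothetical per-GPU timings for p = 4 GPUs.
timings = [
    {"htod": 0.010, "kernel": 0.004, "dtoh": 0.010},
    {"htod": 0.011, "kernel": 0.004, "dtoh": 0.010},
    {"htod": 0.010, "kernel": 0.005, "dtoh": 0.012},
    {"htod": 0.010, "kernel": 0.004, "dtoh": 0.011},
]

t_sum = serial_time(timings)    # about 0.101 s
t_max = parallel_time(timings)  # about 0.027 s
```

With these numbers the sum is nearly four times the max, so the reported Gflops would be understated by about that factor. Note that the max-based estimate leaves out the host-side transpose between the two passes, which a wall clock measurement would include.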

As it stands, your calculation represents the serial execution time. Allowing for the overheads of the multi-GPU scheme, I would expect the calculated performance numbers you are getting to be lower than those for the equivalent transform done on a single GPU.
