cuda - Calculating performance of CUFFT


I am running CUFFT on chunks (n*n/p) divided across multiple GPUs, and I have a question regarding calculating the performance. First, a bit about how I am doing it:

  1. Send n*n/p chunks to each GPU
  2. Batched 1-D FFT of each row on the p GPUs
  3. Get the n*n/p chunks back to the host and perform a transpose on the entire dataset
  4. Ditto step 1
  5. Ditto step 2
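
The five steps above amount to a row-column decomposition of the 2-D transform. As a minimal pure-Python sketch (not CUFFT code): a naive DFT stands in for the batched 1-D FFT, and list slices stand in for the per-GPU chunks. The names `dft_1d`, `fft_rows_chunked`, and `row_column_fft` are made up for illustration, and the sketch assumes p divides n.

```python
import cmath

def dft_1d(row):
    """Naive O(n^2) DFT of one row; a stand-in for a 1-D FFT call."""
    n = len(row)
    return [sum(row[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft_rows_chunked(matrix, p):
    """Steps 1-2: split the rows into p chunks (one per hypothetical GPU)
    and transform every row of each chunk."""
    rows_per_chunk = len(matrix) // p  # assumption: p divides n
    out = []
    for g in range(p):
        chunk = matrix[g * rows_per_chunk:(g + 1) * rows_per_chunk]
        out.extend(dft_1d(row) for row in chunk)
    return out

def transpose(matrix):
    """Step 3: transpose of the whole dataset on the host."""
    return [list(col) for col in zip(*matrix)]

def row_column_fft(matrix, p):
    """Steps 1-5: row transforms, transpose, row transforms again."""
    first_pass = fft_rows_chunked(matrix, p)
    return fft_rows_chunked(transpose(first_pass), p)
```

Note that the output of `row_column_fft` is the transpose of the 2-D transform, since the sketch stops after step 5 without transposing back.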

GFLOPS = (1e-9 * 5 * n * n * lg(n*n)) / execution time

and the execution time is calculated as:

execution time = sum(memcpyhtod + kernel + memcpydtoh times for the row and col FFT on each GPU)

Is this the correct way to evaluate CUFFT performance on multiple GPUs? Is there any other way to represent the performance of an FFT?

Thanks.

If you are doing a complex transform, the operation count is correct (it should be 2.5 n log2(n) for a real-valued transform), but the GFLOP formula is incorrect. In a parallel, multiprocessor operation, the usual calculation of throughput is

operation count / wall clock time 
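
As a rough sketch of that ratio, using the operation count from the question, 5 * n * n * lg(n*n). The function names `fft2d_flops` and `gflops` are invented for illustration, and the wall clock time is assumed to be measured elsewhere, in seconds:

```python
import math

def fft2d_flops(n):
    """Nominal operation count for an n x n complex transform:
    5 * n^2 * log2(n^2), as used in the question."""
    return 5.0 * n * n * math.log2(n * n)

def gflops(n, wall_clock_seconds):
    """Throughput = operation count / wall clock time, scaled to GFLOPS."""
    return 1e-9 * fft2d_flops(n) / wall_clock_seconds
```

For example, with n = 1024 the nominal count is 5 * 1048576 * 20 = 104857600 operations, so a hypothetical wall clock time of 0.01 s would give about 10.5 GFLOPS.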

In your case, presuming the GPUs are operating in parallel, either measure the wall clock time (i.e. how long the whole operation took) as the execution time, or use this:

execution time = max(memcpyhtod + kernel + memcpydtoh times for the row and col FFT on each GPU)

As it stands, your calculation represents the serial execution time. Allowing for the overheads of the multi-GPU scheme, I would expect the calculated performance numbers you are getting to be lower than those for the equivalent transform done on a single GPU.
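
To make the difference between the two estimates concrete, here is a small sketch with made-up timings. Each GPU is represented as a list of passes (row FFT, then col FFT), each pass a hypothetical (memcpyhtod, kernel, memcpydtoh) tuple in seconds. The data layout and function names are assumptions for illustration only.

```python
def gpu_total(passes):
    """Total time one GPU spends on its row and col passes."""
    return sum(htod + kernel + dtoh for htod, kernel, dtoh in passes)

def serial_time(per_gpu_passes):
    """Sum over GPUs: the time if the GPUs ran one after another."""
    return sum(gpu_total(passes) for passes in per_gpu_passes)

def parallel_time(per_gpu_passes):
    """Max over GPUs: the slowest GPU bounds the run when the GPUs
    work concurrently."""
    return max(gpu_total(passes) for passes in per_gpu_passes)

# Invented timings for two GPUs, in seconds: (memcpyhtod, kernel, memcpydtoh)
timings = [
    [(0.002, 0.001, 0.002), (0.002, 0.001, 0.002)],  # GPU 0: row pass, col pass
    [(0.003, 0.001, 0.002), (0.002, 0.001, 0.002)],  # GPU 1: row pass, col pass
]
```

With these numbers the sum comes to roughly 0.021 s and the max to roughly 0.011 s, so the sum-based figure would understate the throughput by almost a factor of two. Neither estimate includes the host-side transpose, which a wall clock measurement of the whole operation would capture.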

