cuda - Calculating performance of CUFFT
I am running CUFFT on chunks (n*n/p) divided across multiple GPUs, and I have a question about calculating the performance. First, a bit about how I am doing it:
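- Send the n*n/p chunks to each GPU
- Run a batched 1-D FFT on each row across the p GPUs
- Get the n*n/p chunks back to the host and perform a transpose on the entire dataset
- Ditto step 1 (send the transposed chunks to each GPU)
- Ditto step 2 (batched 1-D FFT, which now transforms the original columns)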
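The steps above amount to the row-column decomposition of a 2-D transform. Here is a minimal sketch in plain Python, not CUFFT code, that checks the idea on a tiny matrix. The helper names (`dft`, `transpose`, `fft2_by_rows`, `dft2_direct`) are made up for illustration, a naive DFT stands in for the batched 1-D FFT, and the final transpose back is only there so the result can be compared with a direct 2-D DFT.

```python
import cmath

def dft(row):
    """Naive O(n^2) DFT of one row; a stand-in for one batched 1-D FFT."""
    n = len(row)
    return [sum(row[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def transpose(m):
    """Transpose a list-of-lists matrix (the host-side transpose step)."""
    return [list(col) for col in zip(*m)]

def fft2_by_rows(matrix):
    # Steps 1-2: transform every row. Rows are independent, so they could be
    # split into p chunks and handled by p devices.
    rows = [dft(r) for r in matrix]
    # Step 3: transpose the whole dataset on the host.
    cols = transpose(rows)
    # Steps 4-5: transform the rows of the transposed data (the original columns).
    cols = [dft(c) for c in cols]
    # Assumption for this sketch: transpose back so the layout matches the input.
    return transpose(cols)

def dft2_direct(matrix):
    """Direct 2-D DFT from the definition, used only as a reference."""
    n = len(matrix)
    return [[sum(matrix[a][b] * cmath.exp(-2j * cmath.pi * (a * u + b * v) / n)
                 for a in range(n) for b in range(n))
             for v in range(n)] for u in range(n)]

n = 4
x = [[complex(r * n + c, r - c) for c in range(n)] for r in range(n)]
got = fft2_by_rows(x)
ref = dft2_direct(x)
# Largest element-wise difference between the two results.
max_err = max(abs(got[u][v] - ref[u][v]) for u in range(n) for v in range(n))
```

If the decomposition is right, `max_err` should be at the level of floating-point round-off.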
GFLOPS = (1e-9 * 5 * n * n * lg(n*n)) / execution time
and the execution time is calculated as:
execution time = sum(memcpyHtoD + kernel + memcpyDtoH times for the row and column FFTs on each GPU)
Is this the correct way to evaluate CUFFT performance on multiple GPUs? Is there any other way I could represent the performance of the FFT?
Thanks.
If you are doing a complex transform, the operation count is correct (for a real-valued transform it should be 2.5 M log2(M), where M = n*n is the total number of points), but the GFLOPS formula is incorrect. In a parallel, multiprocessor operation, the usual calculation of throughput is
operation count / wall clock time
In your case, presuming the GPUs are operating in parallel, either measure the wall clock time (i.e. how long the whole operation took) and use that as the execution time, or use this:
execution time = max(memcpyHtoD + kernel + memcpyDtoH times for the row and column FFTs on each GPU)
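To make the difference between the two formulas concrete, here is a small sketch of the arithmetic. The per-GPU timings are invented numbers, not measurements, and the function names `fft2_flops` and `gflops` are hypothetical. Note that the max-based time leaves out the host-side transpose, which a wall clock measurement would include.

```python
import math

def fft2_flops(n):
    """Nominal operation count for a complex n x n transform: 5 * M * log2(M), M = n*n."""
    m = n * n
    return 5.0 * m * math.log2(m)

def gflops(n, seconds):
    """Throughput in GFLOPS for an n x n complex transform that took `seconds`."""
    return 1e-9 * fft2_flops(n) / seconds

# Hypothetical per-GPU totals in seconds
# (memcpyHtoD + kernel + memcpyDtoH, row and column passes combined).
per_gpu_seconds = [0.021, 0.024, 0.022, 0.023]

serial_time = sum(per_gpu_seconds)    # the question's formula
parallel_time = max(per_gpu_seconds)  # the answer's formula for GPUs running concurrently

n = 4096
gflops_serial = gflops(n, serial_time)
gflops_parallel = gflops(n, parallel_time)
```

With these made-up timings, the sum-based figure understates the throughput by a factor of `serial_time / parallel_time`, which approaches the number of GPUs when the per-GPU times are similar.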
As it stands, your calculation represents the serial execution time. Allowing for the overheads of the multi-GPU scheme, I would expect the calculated performance numbers you are getting to be lower than those for the equivalent transform done on a single GPU.
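A rough way to see why wall clock time is the right denominator is to simulate concurrent workers and time the whole run. This is only a sketch: threads that sleep stand in for GPUs, `fake_gpu_work` and `timed_parallel_run` are hypothetical names, and the durations are invented. In a real multi-GPU program the equivalent would be to start a host timer, launch the work on every device, synchronize all devices, and then stop the timer.

```python
import threading
import time

def fake_gpu_work(seconds):
    """Placeholder for one GPU's copy in, FFT, and copy out; it just sleeps."""
    time.sleep(seconds)

def timed_parallel_run(durations):
    """Run all workers concurrently and return the elapsed wall clock time."""
    threads = [threading.Thread(target=fake_gpu_work, args=(d,)) for d in durations]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every worker, like synchronizing every device
    return time.perf_counter() - start

durations = [0.05, 0.10, 0.08]  # hypothetical per-GPU times in seconds
wall = timed_parallel_run(durations)
serial_estimate = sum(durations)
```

The measured `wall` should land near the longest single duration plus some scheduling overhead, well below `serial_estimate`.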