FIR filter in CUDA (as a 1D convolution)


I'm trying to implement an FIR (finite impulse response) filter in CUDA. My approach is quite simple and looks like this:

#include <cuda.h>

__global__ void filterData(const float *d_data,
                           const float *d_numerator,
                           float *d_filteredData,
                           const int numeratorLength,
                           const int filteredDataLength)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    float sum = 0.0f;

    if (i < filteredDataLength)
    {
        for (int j = 0; j < numeratorLength; j++)
        {
            // The first (numeratorLength - 1) elements contain the filter state
            sum += d_numerator[j] * d_data[i + numeratorLength - j - 1];
        }
        // Write inside the bounds check to avoid an out-of-bounds store
        d_filteredData[i] = sum;
    }
}

int main(void)
{
    // (Skipping error checks to make the code more readable)

    int dataLength = 18042;
    int filteredDataLength = 16384;
    int numeratorLength = 1659;

    // Pointers to the data, filtered data and filter coefficients
    // (skipping how these are read into the arrays)
    float *h_data = new float[dataLength];
    float *h_filteredData = new float[filteredDataLength];
    float *h_filter = new float[numeratorLength];

    // Create device pointers
    float *d_data = nullptr;
    cudaMalloc((void **)&d_data, dataLength * sizeof(float));

    float *d_numerator = nullptr;
    cudaMalloc((void **)&d_numerator, numeratorLength * sizeof(float));

    float *d_filteredData = nullptr;
    cudaMalloc((void **)&d_filteredData, filteredDataLength * sizeof(float));

    // Copy the data to the device
    cudaMemcpy(d_data, h_data, dataLength * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_numerator, h_filter, numeratorLength * sizeof(float), cudaMemcpyHostToDevice);

    // Launch the kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (filteredDataLength + threadsPerBlock - 1) / threadsPerBlock;
    filterData<<<blocksPerGrid, threadsPerBlock>>>(d_data, d_numerator, d_filteredData, numeratorLength, filteredDataLength);

    // Copy the results back to the host
    cudaMemcpy(h_filteredData, d_filteredData, filteredDataLength * sizeof(float), cudaMemcpyDeviceToHost);

    // Clean up
    cudaFree(d_data);
    cudaFree(d_numerator);
    cudaFree(d_filteredData);

    // Do stuff with h_filteredData...

    // Clean up some more
    delete [] h_data;
    delete [] h_filteredData;
    delete [] h_filter;
}

The filter works, but I'm new to CUDA programming and I'm not sure how to optimize it.

A slight problem I see is that dataLength, filteredDataLength, and numeratorLength are not known beforehand in the application I intend to use the filter in. Also, even though dataLength is a multiple of 32 in the code above, that is not guaranteed in the final application.

When I compare the code above with ArrayFire, my code takes about three times longer to execute.

Does anyone have ideas on how to speed things up?

EDIT: I have changed filterLength to numeratorLength.

I can suggest the following to speed up your code:

  1. Use shared memory: it is a tiny cache-like memory that is much faster than the global card memory. You can find out more about it by looking up the __shared__ keyword in the CUDA documentation. For example, you can pre-fetch the filter numerators and big chunks of data into shared memory, which will enhance your performance. You need to pay extra attention to data alignment in this case, since it really matters and can slow down your code.
  2. Think about unrolling the for-loop of the numerator sum. You can check the reduce-vector example in the CUDA documentation.
  3. You can also think about parallelizing the numerator loop itself, by adding a dimension (say 'y') to the thread block. You would need to make sum a shared vector with the dimension of numeratorLength. You can check the reduce-vector example on how to take the sum of such a vector at the end.
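As an illustration of suggestion 1, here is a sketch of a tiled kernel (my own, not from your code) where each block stages the slice of d_data it reads into shared memory. It assumes the padded input length is filteredDataLength + numeratorLength - 1, matching the sizes in your question; with a 1659-tap filter and 256-thread blocks the tile is (256 + 1659 - 1) floats, about 7.5 KB, which fits comfortably in shared memory:

```cuda
__global__ void filterDataShared(const float *d_data,
                                 const float *d_numerator,
                                 float *d_filteredData,
                                 const int numeratorLength,
                                 const int filteredDataLength)
{
    // Sized at launch: (blockDim.x + numeratorLength - 1) floats
    extern __shared__ float s_data[];

    int tid = threadIdx.x;
    int blockStart = blockDim.x * blockIdx.x;
    int tileLength = blockDim.x + numeratorLength - 1;
    int dataLength = filteredDataLength + numeratorLength - 1;

    // Cooperatively load the tile this block reads: d_data[blockStart ...]
    for (int k = tid; k < tileLength; k += blockDim.x)
    {
        int idx = blockStart + k;
        s_data[k] = (idx < dataLength) ? d_data[idx] : 0.0f;
    }
    __syncthreads();

    int i = blockStart + tid;
    if (i < filteredDataLength)
    {
        float sum = 0.0f;
        for (int j = 0; j < numeratorLength; j++)
        {
            // Same indexing as the original kernel, but through shared memory
            sum += d_numerator[j] * s_data[tid + numeratorLength - j - 1];
        }
        d_filteredData[i] = sum;
    }
}
```

Launch it with the dynamic shared-memory size as the third launch parameter, e.g. filterDataShared<<<blocksPerGrid, threadsPerBlock, (threadsPerBlock + numeratorLength - 1) * sizeof(float)>>>(...). Each global element is then loaded once per block instead of once per output sample.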
