FIR filter in CUDA (as a 1D convolution)
I'm trying to implement an FIR (finite impulse response) filter in CUDA. My approach is quite simple and looks like this:
#include <cuda.h>
#include <cuda_runtime.h>

__global__ void filterdata(const float *d_data,
                           const float *d_numerator,
                           float *d_filtereddata,
                           const int numeratorlength,
                           const int filtereddatalength)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < filtereddatalength)
    {
        float sum = 0.0f;
        for (int j = 0; j < numeratorlength; j++)
        {
            // the first (numeratorlength - 1) elements contain the filter state
            sum += d_numerator[j] * d_data[i + numeratorlength - j - 1];
        }
        // write inside the bounds check so surplus threads never store out of range
        d_filtereddata[i] = sum;
    }
}

int main(void)
{
    // (skipping error checks to make the code more readable)
    int datalength = 18042;
    int filtereddatalength = 16384;
    int numeratorlength = 1659;

    // pointers to data, filtered data, and filter coefficients
    // (skipping how these are read into the arrays)
    float *h_data = new float[datalength];
    float *h_filtereddata = new float[filtereddatalength];
    float *h_numerator = new float[numeratorlength];

    // create device pointers
    float *d_data = nullptr;
    cudaMalloc((void **)&d_data, datalength * sizeof(float));
    float *d_numerator = nullptr;
    cudaMalloc((void **)&d_numerator, numeratorlength * sizeof(float));
    float *d_filtereddata = nullptr;
    cudaMalloc((void **)&d_filtereddata, filtereddatalength * sizeof(float));

    // copy data to the device
    cudaMemcpy(d_data, h_data, datalength * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_numerator, h_numerator, numeratorlength * sizeof(float), cudaMemcpyHostToDevice);

    // launch the kernel
    int threadsperblock = 256;
    int blockspergrid = (filtereddatalength + threadsperblock - 1) / threadsperblock;
    filterdata<<<blockspergrid, threadsperblock>>>(d_data, d_numerator, d_filtereddata,
                                                   numeratorlength, filtereddatalength);

    // copy results to the host
    cudaMemcpy(h_filtereddata, d_filtereddata, filtereddatalength * sizeof(float), cudaMemcpyDeviceToHost);

    // clean up
    cudaFree(d_data);
    cudaFree(d_numerator);
    cudaFree(d_filtereddata);

    // do stuff with h_filtereddata...

    // clean up some more
    delete [] h_data;
    delete [] h_filtereddata;
    delete [] h_numerator;
}
The filter works, but I'm new to CUDA programming and I'm not sure how to optimize it.
A slight problem I see is that datalength, filtereddatalength, and numeratorlength are not known beforehand in the application I intend to use the filter in. Also, even though datalength is a multiple of 32 in the above code, that is not guaranteed in the final application.
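One way to cope with lengths that are only known at run time, and that need not be a multiple of the block size, is to round the grid size up and let every surplus thread exit before it reads or writes anything. The following is only a sketch of that idea: the names filterdata_guarded and count_touched_padding are made up for illustration, the test data is synthetic, and it assumes datalength equals filtereddatalength + numeratorlength - 1, as in the numbers above.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Same arithmetic as the kernel in the question, but threads past the end
// of the output return before touching memory.
__global__ void filterdata_guarded(const float *d_data, const float *d_numerator,
                                   float *d_filtereddata, int numeratorlength,
                                   int filtereddatalength)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= filtereddatalength) return;   // surplus thread: nothing to do

    float sum = 0.0f;
    for (int j = 0; j < numeratorlength; j++)
        sum += d_numerator[j] * d_data[i + numeratorlength - j - 1];
    d_filtereddata[i] = sum;
}

// Hypothetical helper: pads the output buffer up to a whole number of blocks,
// fills it with a sentinel, runs the kernel, and counts how many padding
// elements were overwritten.
int count_touched_padding(int numeratorlength, int filtereddatalength)
{
    const int threads = 256;
    const int blocks = (filtereddatalength + threads - 1) / threads;  // round up
    const int padded = blocks * threads;
    const int datalength = filtereddatalength + numeratorlength - 1;  // assumption
    const float sentinel = -12345.0f;

    std::vector<float> h_data(datalength, 1.0f), h_numerator(numeratorlength, 1.0f);
    std::vector<float> h_out(padded, sentinel);

    float *d_data = nullptr, *d_numerator = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_data, datalength * sizeof(float));
    cudaMalloc((void **)&d_numerator, numeratorlength * sizeof(float));
    cudaMalloc((void **)&d_out, padded * sizeof(float));
    cudaMemcpy(d_data, h_data.data(), datalength * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_numerator, h_numerator.data(), numeratorlength * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, h_out.data(), padded * sizeof(float), cudaMemcpyHostToDevice);

    filterdata_guarded<<<blocks, threads>>>(d_data, d_numerator, d_out,
                                            numeratorlength, filtereddatalength);
    cudaMemcpy(h_out.data(), d_out, padded * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_numerator);
    cudaFree(d_out);

    int touched = 0;
    for (int k = filtereddatalength; k < padded; k++)
        if (h_out[k] != sentinel) touched++;
    return touched;
}

int main()
{
    // 1000 outputs with 256 threads per block leaves 24 surplus threads.
    printf("padding elements overwritten: %d\n", count_touched_padding(37, 1000));
    return 0;
}
```

The padded output buffer exists only so the check has something to inspect; in the real application the buffer can stay exactly filtereddatalength long, because the guard keeps the extra threads away from it.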
When I compare the code above to ArrayFire, my code takes about 3 times longer to execute.
Does anyone have ideas on how to speed things up?
Edit: I have changed filterlength to numeratorlength.
I can suggest the following to speed up your code:
- Use shared memory: it is a small, cache-like on-chip memory that is much faster than the card's global memory. You can find out more by looking up the __shared__ keyword in the CUDA documentation. For example, you can pre-fetch the filter numerators and big chunks of the data into shared memory, which should enhance performance. You need to pay attention to data alignment in this case, because it matters and can slow down your code.
- Think about unrolling the for-loop of the numerator sum. You can check the vector reduction example in the CUDA documentation.
- You can also think about parallelizing the numerator loop itself. This can be done by adding another dimension (say 'y') to your thread block. You will need to make sum a shared vector that has a dimension of numeratorlength. You can check the vector reduction example to see how to take the sum of that vector at the end.
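The shared-memory suggestion could look roughly like the sketch below. Each block first copies the coefficients and the slice of input it needs into dynamically sized shared memory, then every thread computes one output from those copies. This is not a tuned implementation: the kernel name filterdata_shared, the extra datalength parameter, the helper filterdata_shared_max_error, and the synthetic test data are all assumptions made for illustration. It also assumes that 2 * numeratorlength + blockDim.x - 1 floats fit into the shared memory available to one block (commonly 48 KB by default), so very long filters would need a different tiling.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

// One output per thread; coefficients and the input tile are staged in shared memory.
__global__ void filterdata_shared(const float *d_data, const float *d_numerator,
                                  float *d_filtereddata, int numeratorlength,
                                  int filtereddatalength, int datalength)
{
    extern __shared__ float smem[];
    float *s_coef = smem;                    // numeratorlength floats
    float *s_data = smem + numeratorlength;  // blockDim.x + numeratorlength - 1 floats

    const int block_start = blockIdx.x * blockDim.x;
    const int tile_len = blockDim.x + numeratorlength - 1;

    // Cooperative, coalesced loads: thread t loads elements t, t + blockDim.x, ...
    for (int k = threadIdx.x; k < numeratorlength; k += blockDim.x)
        s_coef[k] = d_numerator[k];
    for (int k = threadIdx.x; k < tile_len; k += blockDim.x) {
        int g = block_start + k;
        s_data[k] = (g < datalength) ? d_data[g] : 0.0f;  // zero-fill past the end
    }
    __syncthreads();  // all loads must finish before any thread reads the tile

    int i = block_start + threadIdx.x;
    if (i < filtereddatalength) {
        float sum = 0.0f;
        for (int j = 0; j < numeratorlength; j++)
            sum += s_coef[j] * s_data[threadIdx.x + numeratorlength - 1 - j];
        d_filtereddata[i] = sum;
    }
}

// Hypothetical helper: runs the kernel on synthetic data and returns the
// largest absolute difference from a plain CPU loop.
float filterdata_shared_max_error(int numeratorlength, int filtereddatalength)
{
    const int datalength = filtereddatalength + numeratorlength - 1;  // assumption
    std::vector<float> h_data(datalength), h_numerator(numeratorlength);
    std::vector<float> h_out(filtereddatalength, 0.0f);
    for (int k = 0; k < datalength; k++) h_data[k] = std::sin(0.01f * k);
    for (int k = 0; k < numeratorlength; k++) h_numerator[k] = 1.0f / (k + 1);

    float *d_data = nullptr, *d_numerator = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_data, datalength * sizeof(float));
    cudaMalloc((void **)&d_numerator, numeratorlength * sizeof(float));
    cudaMalloc((void **)&d_out, filtereddatalength * sizeof(float));
    cudaMemcpy(d_data, h_data.data(), datalength * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_numerator, h_numerator.data(), numeratorlength * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (filtereddatalength + threads - 1) / threads;
    const size_t smem_bytes = (2 * numeratorlength + threads - 1) * sizeof(float);
    filterdata_shared<<<blocks, threads, smem_bytes>>>(d_data, d_numerator, d_out,
                                                       numeratorlength,
                                                       filtereddatalength, datalength);
    cudaMemcpy(h_out.data(), d_out, filtereddatalength * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_numerator);
    cudaFree(d_out);

    float max_err = 0.0f;
    for (int i = 0; i < filtereddatalength; i++) {
        float ref = 0.0f;
        for (int j = 0; j < numeratorlength; j++)
            ref += h_numerator[j] * h_data[i + numeratorlength - j - 1];
        max_err = std::fmax(max_err, std::fabs(ref - h_out[i]));
    }
    return max_err;
}

int main()
{
    printf("max abs error vs CPU: %g\n", filterdata_shared_max_error(37, 1000));
    return 0;
}
```

With this layout, neighbouring threads read neighbouring shared-memory words from the data tile and all threads read the same coefficient in each iteration, which is the access pattern shared memory handles well. Whether it actually beats the global-memory version for a given filter length is something to measure on the target GPU.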