optimization - Compile-time information in CUDA

January 15, 2014

I'm optimizing a time-critical CUDA kernel. My application accepts a wide range of switches that affect its behavior (for instance, whether to use a 3rd or 5th order derivative). As an approximation, consider a set of 50 switches, where every switch is an integer variable (sometimes a bool or a float, but that case is not relevant to this question).

All these switches are constant during the execution of the application. Some of these switches are run-time; I store them in constant memory to exploit its caching mechanism. Other switches can be compile-time, and the customer is fine with having to re-compile the application if he wants to change the value of a switch. A simple example would be:

    __global__ void mykernel(const float* in, float *out) {
        for ( /* many many times */ )
            if (compile_time_switch)
                do_this(in, out);
            else
                do_that(in, out);
    }

Assume that do_this and do_that are compute-bound and cheap, that I optimize the for loop so that its overhead is negligible, and that I have to place the if inside the iteration. If the compiler recognizes that compile_time_switch is static information, it can optimize out the call to the "wrong" function and create code that is just as optimized as if the if weren't there. Now the real question:

In which ways can I provide the compiler with the static value of this switch? I see two such ways, listed below, but neither of them works for me. What other possibilities remain?

Template parameters

Providing a template parameter enables this static optimization.

    template<int compile_time_switch>
    __global__ void mykernel(const float* in, float *out) {
        for ( /* many many times */ )
            if (compile_time_switch)
                do_this(in, out);
            else
                do_that(in, out);
    }

This simple solution does not work for me, since I don't have direct access to the code that calls the kernel.
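One possible workaround when the launch site cannot be changed, sketched here under the assumption that the kernel's own source file can be edited: keep a non-template entry point with the original signature and forward to a templated implementation, taking the template argument from a constant defined next to the kernel. The names HD, step, run and kCompileTimeSwitch, and the arithmetic standing in for do_this and do_that, are hypothetical. The sketch is written as host-compilable C++ so it can be checked without a GPU; under nvcc the HD macro expands to __host__ __device__.

```cpp
// HD expands to CUDA qualifiers under nvcc and to nothing in plain C++.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical stand-ins for do_this / do_that.
HD constexpr float do_this(float x) { return x * 2.0f; }
HD constexpr float do_that(float x) { return x + 1.0f; }

// Templated implementation: Switch is a constant expression in each
// instantiation, so the compiler can drop the branch that is not taken.
template <bool Switch>
HD constexpr float step(float x) {
    return Switch ? do_this(x) : do_that(x);
}

// Assumed project-wide setting; change it here and recompile.
constexpr bool kCompileTimeSwitch = true;

// Non-template entry point: callers keep using the original signature.
HD constexpr float run(float x) {
    return step<kCompileTimeSwitch>(x);
}
```

In a real kernel, the entry point would presumably be the existing __global__ function and the templated body a __device__ function, so the calling code would not need to change.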

Static members

Consider the following struct:

    struct globalparameters {
        static const bool compile_time_switch = true;
    };

Now globalparameters::compile_time_switch contains the static information as I want it, and the compiler would be able to optimize the kernel. Unfortunately, CUDA does not support such static members.

Edit: the last statement is apparently wrong. The definition of the struct is of course legitimate, and you are able to use the static member globalparameters::compile_time_switch in device code. The compiler inlines the variable, so the final code directly contains the value rather than a run-time variable access, which is the behavior you would expect from an optimizing compiler. So the second option is actually suitable.
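A minimal sketch of why the static member counts as compile-time information: a static const member of integral type with an in-class initializer is a constant expression, so it can appear in a static_assert, and reading its value does not require an out-of-class definition in standard C++. The helper derivative_order, the HD macro, and the mapping of true to 5th order are hypothetical, chosen only to echo the derivative example from the question.

```cpp
// HD expands to CUDA qualifiers under nvcc and to nothing in plain C++.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

struct globalparameters {
    // In-class initializer makes this usable in constant expressions.
    static const bool compile_time_switch = true;
};

// Compiles only because the member is a constant expression.
static_assert(globalparameters::compile_time_switch,
              "switch is expected to be on in this sketch");

// Hypothetical helper: selects the derivative order at compile time.
HD constexpr int derivative_order() {
    return globalparameters::compile_time_switch ? 5 : 3;
}
```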

I consider my problem solved, both due to this fact and due to kronos' answer. However, I'm still looking for alternative methods to provide compile-time information to the compiler.

Your third option is preprocessor definitions:

    #define compile_time_switch 1

    __global__ void mykernel(const float* in, float *out) {
        for ( /* many many times */ )
            if (compile_time_switch)
                do_this(in, out);
            else
                do_that(in, out);
    }

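With a plain if, the preprocessor only substitutes the literal 1; the compiler then sees a constant condition and removes the else branch during dead-code elimination. If you use #if/#else instead, the preprocessor discards the unused branch completely, and the compiler has nothing left to eliminate because there is no dead code.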

Furthermore, you can specify the definition with the -D command-line switch, and (I think) every NVIDIA-supported compiler will accept -D (MSVC may use a different switch).
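A minimal sketch of the -D approach, assuming the macro keeps the name compile_time_switch: the source supplies a default that applies only when the build does not define the macro, and the branch is selected with #if so the preprocessor removes the other one. The apply function, the HD macro, and the arithmetic standing in for do_this and do_that are hypothetical; the code is host-compilable C++ so it can be checked without a GPU.

```cpp
// HD expands to CUDA qualifiers under nvcc and to nothing in plain C++.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Default used when the build does not pass -Dcompile_time_switch=<value>.
#ifndef compile_time_switch
#define compile_time_switch 1
#endif

// Hypothetical stand-ins for do_this / do_that.
HD constexpr float do_this(float x) { return x * 2.0f; }
HD constexpr float do_that(float x) { return x + 1.0f; }

HD constexpr float apply(float x) {
#if compile_time_switch
    return do_this(x);   // kept when the macro is non-zero
#else
    return do_that(x);   // kept when the macro is 0
#endif
}
```

Building with, for example, nvcc -Dcompile_time_switch=0 kernel.cu (the file name is a placeholder) would select the do_that branch without editing the source.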
