CUDA: Is the NVIDIA Kepler shuffle "destructive"?
I'm using an implementation of parallel reduction in CUDA based on the new Kepler shuffle instructions, similar to this: http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
I was searching for the minima of the rows of a given matrix, and at the end of the kernel I had the following code:
my_register = min(my_register, __shfl_down(my_register,8,16));
my_register = min(my_register, __shfl_down(my_register,4,16));
my_register = min(my_register, __shfl_down(my_register,2,16));
my_register = min(my_register, __shfl_down(my_register,1,16));
My blocks are 16*16, so everything worked fine; with that code I was getting the minima of two sub-rows in the same kernel.
Now I also need to return the indices of the smallest elements in every row of the matrix, so I was going to replace "min" with an "if" statement and handle the indices in a similar fashion. I got stuck at this code:
if (my_reg > __shfl_down(my_reg,8,16)){my_reg = __shfl_down(my_reg,8,16);};
if (my_reg > __shfl_down(my_reg,4,16)){my_reg = __shfl_down(my_reg,4,16);};
if (my_reg > __shfl_down(my_reg,2,16)){my_reg = __shfl_down(my_reg,2,16);};
if (my_reg > __shfl_down(my_reg,1,16)){my_reg = __shfl_down(my_reg,1,16);};
There are no CUDA errors whatsoever, but the kernel returns trash now. Nevertheless, I have a fix for that:
myreg_tmp = __shfl_down(myreg,8,16); if (myreg > myreg_tmp){myreg = myreg_tmp;};
myreg_tmp = __shfl_down(myreg,4,16); if (myreg > myreg_tmp){myreg = myreg_tmp;};
myreg_tmp = __shfl_down(myreg,2,16); if (myreg > myreg_tmp){myreg = myreg_tmp;};
myreg_tmp = __shfl_down(myreg,1,16); if (myreg > myreg_tmp){myreg = myreg_tmp;};
So, allocating a new tmp variable to sneak into the neighboring registers saves me. Now the question: are the Kepler shuffle instructions destructive, in the sense that invoking the same instruction twice doesn't give the same result? I haven't assigned anything to the registers by saying "my_reg > __shfl_down(my_reg,8,16)", which adds to my confusion. Can anyone explain to me what the problem is with invoking shuffle twice? I'm pretty much a newbie in CUDA, so a detailed explanation for dummies is welcome.
Warp shuffle is not destructive. The operation, if repeated under the exact same conditions, will return the same result each time. The var value (myreg in your example) is not modified by the warp shuffle function itself.
The problem you are experiencing is due to the fact that the number of participating threads on the second invocation of __shfl_down() in your first method is different from that of the other invocations, in either method.
First, let's remind ourselves of a key point in the documentation:
Threads may only read data from another thread which is actively participating in the __shfl() command. If the target thread is inactive, the retrieved value is undefined.
Now let's take a look at your first "broken" method:
if (my_reg > __shfl_down(my_reg,8,16)){my_reg = __shfl_down(my_reg,8,16);};
The first time you call __shfl_down() above (within the if-clause), all threads are participating. Therefore all values returned by __shfl_down() will be what you expect. However, once the if-clause is complete, only the threads that satisfied the if-clause will participate in the body of the if-statement. Therefore, on the second invocation of __shfl_down() within the if-statement body, only threads whose my_reg value is greater than the my_reg value of the thread 8 lanes above them will participate. This means that some of these assignment statements may not return the value you expect, because the other thread may not be participating. (The participation of the thread 8 lanes above depends on the result of the if comparison done by that thread, which may or may not be true.)
The second method you propose has no such issue, and works correctly according to your statements. All threads participate in each invocation of __shfl_down().
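Since the original goal was to return the index of each row minimum, here is a minimal sketch of how the same "shuffle first, branch afterwards" pattern might carry an index along with the value. The names segmentArgMin16 and rowArgMin, the float element type, and the single 16x16 block over a 16-column matrix are assumptions made for illustration; they are not from the question. The sketch also assumes a CUDA 9 or newer toolkit and therefore uses __shfl_down_sync with a full-warp mask, because the older __shfl_down shown above is deprecated there.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: reduces (value, index) pairs to the minimum within
// each 16-lane segment of a warp. All 32 lanes of the warp must call it.
__device__ void segmentArgMin16(float &val, int &idx)
{
    const unsigned fullMask = 0xffffffffu;  // every lane of the warp takes part
    for (int offset = 8; offset > 0; offset >>= 1) {
        // Both shuffles run unconditionally, outside any divergent branch,
        // so the source lane is always an active participant.
        float otherVal = __shfl_down_sync(fullMask, val, offset, 16);
        int   otherIdx = __shfl_down_sync(fullMask, idx, offset, 16);
        // Only the purely local update is conditional.
        if (otherVal < val) {
            val = otherVal;
            idx = otherIdx;
        }
    }
}

// Hypothetical kernel: one 16x16 block; the 16 threads sharing threadIdx.y
// form one 16-lane segment and reduce one row of a 16-column matrix.
__global__ void rowArgMin(const float *matrix, float *minVal, int *minIdx)
{
    int col = threadIdx.x;
    int row = threadIdx.y;
    float val = matrix[row * 16 + col];
    int idx = col;
    segmentArgMin16(val, idx);
    if (col == 0) {  // the first lane of each segment holds the result
        minVal[row] = val;
        minIdx[row] = idx;
    }
}

int main()
{
    const int N = 16;
    float h_matrix[N * N];
    for (int r = 0; r < N; ++r)
        for (int c = 0; c < N; ++c)
            h_matrix[r * N + c] = (float)((c * 7 + r * 3) % N);  // arbitrary test data

    float *d_matrix, *d_minVal;
    int *d_minIdx;
    cudaMalloc(&d_matrix, sizeof(h_matrix));
    cudaMalloc(&d_minVal, N * sizeof(float));
    cudaMalloc(&d_minIdx, N * sizeof(int));
    cudaMemcpy(d_matrix, h_matrix, sizeof(h_matrix), cudaMemcpyHostToDevice);

    rowArgMin<<<1, dim3(N, N)>>>(d_matrix, d_minVal, d_minIdx);

    float h_minVal[N];
    int h_minIdx[N];
    cudaMemcpy(h_minVal, d_minVal, sizeof(h_minVal), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_minIdx, d_minIdx, sizeof(h_minIdx), cudaMemcpyDeviceToHost);
    for (int r = 0; r < N; ++r)
        printf("row %d: min %.0f at column %d\n", r, h_minVal[r], h_minIdx[r]);

    cudaFree(d_matrix);
    cudaFree(d_minVal);
    cudaFree(d_minIdx);
    return 0;
}
```

The design point is the same as in the working second method: every lane executes both shuffles, and only the local assignment sits inside the if. Error checking on the CUDA runtime calls is omitted to keep the sketch short.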