You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hi, I am researching on ur CUDA of butterfly. Sorry about my limited CUDA experiences. I read ur load_max5 function which includes remaining_input_idx, low_bits, high_bits and etc. It's not clear why U write like this to read data, and would you mind explain it? Thanks much.
` const int t_idx = threadIdx.x;
const int batch_idx = (threadIdx.y + (blockIdx.x >> (log_n - nsteps)) * blockDim.y) * items_per_thread;
const int remaining_input_idx = blockIdx.x & ((1 << (log_n - nsteps)) - 1);
const int low_bits = remaining_input_idx & ((1 << input_idx_start_bit) - 1);
const int high_bits = (remaining_input_idx >> input_idx_start_bit) << (input_idx_start_bit + nsteps);
// All threads with the same t_idx should have the same input_idx
const int input_idx = high_bits | (t_idx << input_idx_start_bit) | low_bits;
const int input_idx_stride = (1 << input_idx_start_bit) * warpSize;
twiddle_reader.load_max5<nsteps, increasing_stride>(s_twiddle, input_idx_start_bit, low_`
Could you explain the details about the load_max5 function? such as why these parameters(low_order_bits,remainder,s_idx) are used for index searching.
template<int nsteps, bool increasing_stride>
__device__ __forceinline__ void load_max5(scalar_t s_twiddle[nsteps][2][1 << nsteps],
int input_idx_start_bit, int low_bits, int high_bits) {
constexpr int span = 1 << nsteps;
const int s = blockIdx.y + gridDim.y * blockIdx.z; // For conv2d butterfly as well
for (int t = threadIdx.x + threadIdx.y * blockDim.x; t < nsteps * (span / 2); t += blockDim.x * blockDim.y) {
const int step = t / (span / 2);
const int s_twiddle_stride = 1 << (increasing_stride ? step : nsteps - 1 - step);
const int remainder = t % (span / 2);
const int low_order_bits = remainder & (s_twiddle_stride - 1);
const int s_idx = 2 * (remainder - low_order_bits) + low_order_bits;
const int idx = (high_bits >> 1) | (remainder << input_idx_start_bit) | low_bits;
s_twiddle[step][0][s_idx] = twiddle_a[s][step][idx][0][0];
s_twiddle[step][1][s_idx] = twiddle_a[s][step][idx][0][1];
s_twiddle[step][1][s_idx + s_twiddle_stride] = twiddle_a[s][step][idx][1][0];
s_twiddle[step][0][s_idx + s_twiddle_stride] = twiddle_a[s][step][idx][1][1];
}
}
The text was updated successfully, but these errors were encountered:
Hi, I am researching on ur CUDA of butterfly. Sorry about my limited CUDA experiences. I read ur load_max5 function which includes remaining_input_idx, low_bits, high_bits and etc. It's not clear why U write like this to read data, and would you mind explain it? Thanks much.
Could you explain the details about the load_max5 function? such as why these parameters(low_order_bits,remainder,s_idx) are used for index searching.
The text was updated successfully, but these errors were encountered: