NVIDIA CUTLASS Epilogues
Introduction
Epilogues are the final stages in a sequence of GPU-accelerated matrix multiplications and/or tensor operations. We typically handle these using NVIDIA CUTLASS (CUDA Templates for Linear Algebra Subroutines). The epilogue phase comes after the core matmul (GEMM) has been performed on the input; epilogues are simply the additional operations applied to the output.
In other words, the epilogue rearranges the result of a matrix product through shared memory to match canonical tensor layouts in global memory. Epilogues also support conversion and reduction operations. Note that the shared memory resource is time-sliced across warps.
Currently in Aphrodite, we support only symmetric quantization for weights, but both symmetric and asymmetric quantization for activations. Weights can be quantized per-tensor or per-channel; activations can be quantized per-tensor or per-token.
In total, we use 4 epilogues:
ScaledEpilogue: symmetric quantization for activations, no bias.
ScaledEpilogueBias: symmetric quantization for activations, supports bias.
ScaledEpilogueAzp: asymmetric per-tensor quantization for activations, supports bias.
ScaledEpilogueAzpPerToken: asymmetric per-token quantization for activations, supports bias.
We don't have epilogues for asymmetric quantization of activations without bias, in order to reduce the final binary size. Instead, if no bias is passed, the epilogue uses 0 as the bias. That induces a redundant addition (and a runtime check), but the performance impact appears to be minor.
Underlying Linear Algebra
If $\widehat X$ is the quantized form of a matrix $X$, then de-quantization is

$X = s_x (\widehat X - z_x J)$

Here, $s_x$ is the scale, $z_x$ is the zero-point, and $J$ is the all-ones matrix with the shape of $X$. For symmetric quantization, $z_x = 0$, so the weights satisfy $B = s_b \widehat B$, while the activations may carry a zero-point: $A = s_a (\widehat A - z_a J_A)$.

Expanding further, we can calculate the GEMM output $D = A B$ in terms of the quantized operands:

$D = s_a (\widehat A - z_a J_A) \cdot s_b \widehat B = s_a s_b (\widehat A \widehat B - z_a J_A \widehat B)$

Now that the zero-point term is isolated, note that every row of $J_A \widehat B$ equals the row-vector of column sums of $\widehat B$, i.e. $J_A \widehat B = \mathbf{1} (\mathbf{1}^T \widehat B)$. The correction therefore depends only on $\widehat B$ and $z_a$, which is what lets the epilogues below precompute it: entirely offline for a per-tensor $z_a$, or as the per-channel vector $\mathbf{1}^T \widehat B$ combined with the per-token $z_a$ at runtime.
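As a quick numerical check of the algebra above, here is a standalone NumPy sketch (not the CUTLASS implementation; shapes and the rounding-free quantization are illustrative so the identity is exact):

```python
import numpy as np

rng = np.random.default_rng(0)

# Float activations and weights, and their quantization parameters.
A = rng.standard_normal((4, 8)).astype(np.float32)
B = rng.standard_normal((8, 3)).astype(np.float32)
s_a, z_a = 0.05, 7   # per-tensor activation scale and zero-point
s_b = 0.02           # per-tensor weight scale (symmetric: z_b = 0)

# Quantize (rounding omitted so the identity below holds exactly).
A_q = A / s_a + z_a  # so that A = s_a * (A_q - z_a * J)
B_q = B / s_b        # so that B = s_b * B_q

# Direct float GEMM vs. the expanded quantized form:
D_ref = A @ B
J = np.ones_like(A)
D_exp = s_a * s_b * (A_q @ B_q - z_a * (J @ B_q))
assert np.allclose(D_ref, D_exp, atol=1e-4)

# Every row of J @ B_q is the same row-vector of column sums of B_q,
# which is why the zero-point correction can be precomputed.
col_sums = B_q.sum(axis=0)
assert np.allclose(J @ B_q, np.ones((A.shape[0], 1)) * col_sums)
```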
Epilogues
ScaledEpilogue
This epilogue handles symmetric quantization of activations without bias, meaning the output is $D = s_a s_b \widehat A \widehat B$.
Epilogue parameters:
scale_a: the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
scale_b: the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
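A NumPy sketch of the computation this epilogue fuses (shapes, dimensions, and the rng-generated data are illustrative, not the kernel's API; the parameter names mirror the list above):

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, N = 4, 8, 3

# Raw int8 GEMM operands and their de-quantization scales.
A_q = rng.integers(-128, 128, size=(M, K)).astype(np.int8)
B_q = rng.integers(-128, 128, size=(K, N)).astype(np.int8)
scale_a = rng.random((M, 1), dtype=np.float32)  # per-token: column-vector
scale_b = rng.random((1, N), dtype=np.float32)  # per-channel: row-vector

# Accumulate in int32, then de-quantize in the epilogue.
# Broadcasting (M,1) * (1,N) applies both scales to every output element.
D_q = A_q.astype(np.int32) @ B_q.astype(np.int32)
out = scale_a * scale_b * D_q.astype(np.float32)
assert out.shape == (M, N)
```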
ScaledEpilogueBias
This epilogue handles symmetric quantization of activations with bias, meaning the output is $D = s_a s_b \widehat A \widehat B + C$, where $C$ is the bias.
Epilogue parameters:
scale_a: the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
scale_b: the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
bias: the bias, is always per-channel (row-vector).
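The bias variant only appends a per-channel add. A small NumPy sketch of the difference (shapes and data are illustrative), which also shows how the same epilogue covers the no-bias case by substituting a zero bias:

```python
import numpy as np

rng = np.random.default_rng(2)
M, K, N = 4, 8, 3
A_q = rng.integers(-128, 128, size=(M, K)).astype(np.int32)
B_q = rng.integers(-128, 128, size=(K, N)).astype(np.int32)
scale_a = rng.random((M, 1), dtype=np.float32)
scale_b = rng.random((1, N), dtype=np.float32)
bias = rng.random(N, dtype=np.float32)  # always per-channel (row-vector)

D_q = A_q @ B_q
out = scale_a * scale_b * D_q.astype(np.float32) + bias

# When no bias is given, the same epilogue runs with bias = 0
# (the redundant addition mentioned in the introduction).
out_nobias = scale_a * scale_b * D_q.astype(np.float32) + np.zeros(N, dtype=np.float32)
assert np.allclose(out - bias, out_nobias, atol=1e-4)
```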
ScaledEpilogueAzp
This epilogue computes the asymmetric per-tensor quantization for activations with bias. The output of the GEMM is:

$D = s_a s_b (\widehat A \widehat B - z_a J_A \widehat B) + C$

Because $z_a$ is a per-tensor scalar, the zero-point term $z_a J_A \widehat B$ has identical rows, so we can precompute $z_a (\mathbf{1}^T \widehat B)$ offline and pass it in as azp_with_adj, a row-vector.
Epilogue parameters:
scale_a: the scale for activations, can be per-tensor (scalar) or per-token (column-vector). Generally this will be per-tensor, as the zero-points are per-tensor.
scale_b: the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
azp_with_adj: the precomputed zero-point term ($z_a \mathbf{1}^T \widehat B$), is per-channel (row-vector).
bias: the bias, is always per-channel (row-vector).
To use these kernels efficiently, users must precompute the azp_with_adj
term offline and pass it to the kernel.
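A NumPy sketch of the offline precomputation and the fused per-tensor computation (names mirror the parameters above; shapes, the rng data, and the rounding are illustrative, not the kernel code):

```python
import numpy as np

rng = np.random.default_rng(3)
M, K, N = 4, 8, 3

A = rng.standard_normal((M, K)).astype(np.float32)
B_q = rng.integers(-128, 128, size=(K, N)).astype(np.int32)
s_a, z_a = 0.05, 7  # per-tensor activation scale and zero-point
scale_b = rng.random((1, N), dtype=np.float32)
bias = rng.random(N, dtype=np.float32)

# Offline: fold the per-tensor zero-point into a per-channel row-vector.
# z_a * (1^T @ B_q) equals any row of z_a * (J_A @ B_q).
azp_with_adj = z_a * B_q.sum(axis=0, keepdims=True)  # shape (1, N)

# Runtime: asymmetric-quantized GEMM plus the de-quantizing epilogue.
A_q = np.round(A / s_a).astype(np.int32) + z_a
D_q = A_q @ B_q
out = s_a * scale_b * (D_q - azp_with_adj).astype(np.float32) + bias

# Reference: de-quantize the activations first, then do a float GEMM.
A_dq = s_a * (A_q - z_a).astype(np.float32)
ref = A_dq @ (scale_b * B_q.astype(np.float32)) + bias
assert np.allclose(out, ref, atol=1e-3)
```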
ScaledEpilogueAzpPerToken
This epilogue computes the asymmetric per-token quantization for activations with bias.
The output of the GEMM is the same as above, but the zero-point $z_a$ is now a per-token column-vector, so the correction term $z_a (\mathbf{1}^T \widehat B)$ is an outer product that cannot be fully precomputed. Only the per-channel part $\mathbf{1}^T \widehat B$ is computed offline; it is combined with the per-token $z_a$ inside the epilogue.
Epilogue parameters:
scale_a: the scale for activations, can be per-tensor (scalar) or per-token (column-vector). Generally this will be per-token, as the zero-points are per-token.
scale_b: the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
azp_adj: the precomputed zero-point adjustment term ($\mathbf{1}^T \widehat B$), is per-channel (row-vector).
azp: the zero-point ($z_a$), is per-token (column-vector).
bias: the bias, is always per-channel (row-vector).
To use these kernels efficiently, users must precompute the azp_adj
term offline and pass it to the kernel.
The epilogue performs the following computation (where Dq
is the raw quantized output of the GEMM):
out = scale_a * scale_b * (Dq - azp_adj * azp) + bias
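Putting the pieces together for the per-token case, here is a NumPy sketch (shapes and rng data are illustrative; azp * azp_adj is the (M,1) x (1,N) outer product of the per-token zero-points with the per-channel adjustment):

```python
import numpy as np

rng = np.random.default_rng(4)
M, K, N = 4, 8, 3

A_q = rng.integers(0, 256, size=(M, K)).astype(np.int32)     # e.g. uint8 activations
B_q = rng.integers(-128, 128, size=(K, N)).astype(np.int32)  # int8 weights
scale_a = rng.random((M, 1), dtype=np.float32)               # per-token
scale_b = rng.random((1, N), dtype=np.float32)               # per-channel
azp = rng.integers(100, 156, size=(M, 1))                    # per-token zero-points
bias = rng.random(N, dtype=np.float32)

# Offline: per-channel adjustment = column sums of quantized weights (1^T @ B_q).
azp_adj = B_q.sum(axis=0, keepdims=True)                     # shape (1, N)

# Runtime epilogue: out = scale_a * scale_b * (D_q - azp * azp_adj) + bias
D_q = A_q @ B_q
out = scale_a * scale_b * (D_q - azp * azp_adj).astype(np.float32) + bias

# Reference: de-quantize the activations first.
A_dq = (A_q - azp).astype(np.float32)
ref = scale_a * scale_b * (A_dq @ B_q.astype(np.float32)) + bias
assert np.allclose(out, ref, atol=1e-3)
```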