
Register Kernels as AutoGrad Ops #91

Open
fabianlim opened this issue Oct 11, 2024 · 1 comment
Labels
future: Will be affected in future versions (e.g., deprecation)
help wanted: Extra attention is needed

Comments

@fabianlim
Contributor

fabianlim commented Oct 11, 2024

We have quite a few custom autograd functions in the FOAK plugin.

We should test torch.compile with these autograd functions, and register them. Note that it is better to avoid what kernel-hyperdrive does, which is to register them as custom_ops, see here:

  • for kernel-hyperdrive it cannot be helped, as there is a stride issue
  • but if it is possible, it is better to use this kind of wrapping (see the sketch below).

If any of these autograd functions need to be changed, the bench needs to be rerun for accuracy and performance checks.
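
As a starting point, here is a minimal sketch of the autograd-function pattern in question together with a fullgraph=True check; the kernel and class names below are hypothetical stand-ins for illustration, not the actual FOAK kernels.

import torch

# Hypothetical stand-ins for a FOAK triton kernel pair; the real ones
# (e.g. _rms_layernorm_forward / _rms_layernorm_backward) launch triton grids.
def _kernel_forward(x):
    return x * 2.0

def _kernel_backward(grad_out):
    return grad_out * 2.0

class ScaleByTwo(torch.autograd.Function):
    # Same shape as the FOAK autograd functions: one kernel in forward,
    # another in backward.
    @staticmethod
    def forward(ctx, x):
        return _kernel_forward(x)

    @staticmethod
    def backward(ctx, grad_out):
        return _kernel_backward(grad_out)

def fn(x):
    return ScaleByTwo.apply(x).sum()

# fullgraph=True makes torch.compile raise on any graph break, which gives a
# quick signal on whether a given autograd function needs explicit registration.
torch.compile(fn, fullgraph=True)(torch.randn(4, requires_grad=True))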

@fabianlim added the future and help wanted labels Nov 4, 2024
@fabianlim changed the title from "Register Kernels as AutoGrad Ops: Torch Deprecation Warning" to "Register Kernels as AutoGrad Ops" Nov 8, 2024
@fabianlim
Contributor Author

Using rms_layer_norm as an example, here is my attempt to list out a set of prescriptive tasks.

  1. Look at all the different kernels that are attached to a model, e.g., llama. Go through them one by one.
  2. For example, start with rms_layer_norm. In the above example, we replace the LlamaRMSNorm with the fast_rms_layernorm.
  3. The implementation of fast_rms_layernorm is found here; it is an autograd function Fast_RMS_Layernorm that has a triton kernel _rms_layernorm_forward in the forward, and _rms_layernorm_backward in the backward.
  4. So to make this compilable, you must follow the pattern and register it as a graph op. One way to do this is custom_op, as is done here (a sketch is given after this list).
  5. Using custom_op can add overhead, so if it is easier we can do this as a first pass, but we need a clean way to disable the custom_op when compile is not enabled.
  6. Finally, the more "standard" way to register ops is the torch.library.define pattern, see this issue for example:
torch.library.define("mylib::cvmm_triton", "(Tensor x, Tensor sel_index, Tensor sel, Tensor keys, ScalarType out_dtype, Tensor out_index) -> Tensor")

@torch.library.impl("mylib::cvmm_triton", "default")
def cvmm_triton(x, sel_index, sel, keys, out_dtype, out_index):
    ...  # launch the triton kernel here and return the output tensor
  7. Lastly, after compile works, you need to run the bench to test it:
tox -e run_benches --  "1 2" "4 8" benchmark_outputs scenarios.yaml full-finetuning
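
To make the registration step concrete for the rms_layer_norm case, here is a minimal, hypothetical sketch using torch.library.custom_op and torch.library.register_autograd (available in PyTorch >= 2.4). The op name foak::rms_layernorm and the plain-PyTorch kernel bodies are stand-ins for illustration; in the plugin the bodies would instead call the real _rms_layernorm_forward / _rms_layernorm_backward triton kernels, whose exact signatures may differ.

import torch

# Plain-PyTorch stand-ins for the triton kernels (signatures assumed).
def _rms_forward(X, W, eps):
    r = torch.rsqrt(X.float().pow(2).mean(-1, keepdim=True) + eps)
    return ((X.float() * r) * W.float()).to(X.dtype), r

def _rms_backward(dY, X, W, r):
    Xf = X.float()
    dYW = dY.float() * W.float()
    dX = r * dYW - Xf * r.pow(3) * (dYW * Xf).mean(-1, keepdim=True)
    dW = (dY.float() * Xf * r).sum(dim=tuple(range(X.dim() - 1)))
    return dX.to(X.dtype), dW.to(W.dtype)

# Register the forward as a custom op so torch.compile treats it as a single
# graph node instead of tracing into the kernel launch.
@torch.library.custom_op("foak::rms_layernorm", mutates_args=())
def rms_layernorm(X: torch.Tensor, W: torch.Tensor, eps: float) -> torch.Tensor:
    Y, _ = _rms_forward(X, W, eps)
    return Y

# Fake (meta) implementation so compile can infer shapes and dtypes.
@rms_layernorm.register_fake
def _(X, W, eps):
    return torch.empty_like(X)

# Attach the backward kernel, i.e. register the kernel as an autograd op.
def _setup_context(ctx, inputs, output):
    X, W, eps = inputs
    ctx.save_for_backward(X, W)
    ctx.eps = eps

def _backward(ctx, dY):
    X, W = ctx.saved_tensors
    r = torch.rsqrt(X.float().pow(2).mean(-1, keepdim=True) + ctx.eps)
    dX, dW = _rms_backward(dY, X, W, r)
    return dX, dW, None  # no gradient for eps

torch.library.register_autograd("foak::rms_layernorm", _backward, setup_context=_setup_context)

# Usage: the registered op composes with autograd and torch.compile.
X, W = torch.randn(2, 8, requires_grad=True), torch.ones(8, requires_grad=True)
torch.compile(lambda x, w: rms_layernorm(x, w, 1e-6).sum(), fullgraph=True)(X, W).backward()

Disabling this when compile is not enabled (point 5 above) could then be a matter of falling back to the plain autograd function instead of the registered op.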

You would add a compiled bench, so that we can measure the speedups that compile gives, in addition to the existing benches.
