address comments and rebase over detailed design
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
fabianlim committed Apr 13, 2024
1 parent 0cc6bcb commit 87b38a9
Showing 2 changed files with 87 additions and 78 deletions.
architecture_records/002-acceleration-framework.md (87 additions, 78 deletions)
# Training Enhancements Framework

**Decider(s)**: Sukriti Sharma ([email protected]), Raghu Ganti ([email protected]), Laura Wynter ([email protected]), Fabian Lim ([email protected]), Aaron Chew ([email protected])
**Date (YYYY-MM-DD)**: 2024-04-11
Currently `sft_trainer.py` can only access those tools already integrated into HF.
2. Prefix tuning from [PEFT](https://github.com/huggingface/peft).
3. FSDP training from [accelerate](https://github.com/huggingface/accelerate).

Below are various reasons for a framework to integrate custom training tools into `sft_trainer.py`.
* Enable quick integrations of open-source techniques that have yet to be integrated into Huggingface.
* Enable integrations of custom techniques developed by IBM researchers that are not planned to be integrated into Huggingface.

Recently, it has been observed that new training techniques are released with an incomplete "preview" version. These "preview" versions tend not to be fully integrated into OSS, so using new techniques typically involves additional work. This framework aims to allow timely integrations of such techniques into `sft_trainer.py`. A short list of powerful training techniques that are currently "preview"-only includes:
- Huggingface integration of [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ).
* 4-bit quantization kernels to reduce memory storage of the base weights.
- [Unsloth](https://github.com/unslothai/unsloth).
* Fused operation kernels
* Kernels for common model architectures (e.g., cross-entropy losses, RoPE embeddings and RMS norms).
- [megablocks](https://github.com/databricks/megablocks).
* acceleration package for distributing mixture-of-experts that improves upon FSDP sharding.

<!--
Why this is a valuable problem to solve? What background information is needed to show how this design addresses the problem?
Which users are affected by the problem? Why is it a problem? What data supports this?
-->

### User Benefit


Users will benefit from powerful training tools integrated into the platform that are not readily accessible from Huggingface. With these tools, users will be able to train models with fewer GPU resources and/or more quickly, resulting in faster turnaround and an improved user experience.

The framework is designed to only modify the model at two integration points in `sft_trainer.py`:
1. an *optional* `model_loader` method for custom model loading (e.g., loading quantized checkpoints).
2. an *optional* `augmentation` method to modify the instantiated model (e.g., installing PEFT adapters).
3. an *optional* `callback` method to install `TrainerCallbacks` (if needed, e.g. custom save logic).

```python
class TuningAccelerationPlugin:

    # if specified, will restrict the plugin to the specified model archs
    # - useful if the method is restricted to certain model architectures, e.g., only used
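    # (hypothetical) class attribute implied by the comment above; the name is an assumption
    restricted_model_archs = None

    # the rest of this class is a minimal sketch of the three *optional* integration
    # points described above; the method names follow the Detailed Design section,
    # but the exact signatures here are assumptions, not part of the original design

    # 1. drop-in replacement for the model loader (e.g. for quantized checkpoints)
    def model_loader(self, model_path: str, **kwargs):
        raise NotImplementedError

    # 2. augment an already-instantiated model (e.g. install PEFT adapters),
    #    possibly consuming some of the modifiable args
    def augmentation(self, model, train_args, modifiable_args):
        raise NotImplementedError

    # 3. TrainerCallbacks to be installed on the trainer (e.g. custom save logic)
    def callbacks(self):
        return []
```
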
Even though they are all optional, at least one out of the three should be implemented.

### Dependency Management

Take note:
- all plugin deps must be enforced to be optional deps in `pyproject.toml`, see [#116](https://github.com/foundation-model-stack/fms-hf-tuning/pull/116). If the dep is not installed but the plugin is enabled, raise an exception.
- any plugin that requires CUDA build tools (e.g. `triton` kernels) will need to be run with [CUDA Toolkit dependencies (see this link for an example of a Debian installation)](https://developer.nvidia.com/cuda-12-2-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Debian&target_version=11&target_type=deb_local).
  * in such cases, both the library (e.g. `triton`) and the CUDA tools need to be checked.
  * whenever CUDA is needed, the framework will check for the CUDA_TOOLS dependency (a sketch of such a check follows this list).
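
A minimal sketch of how such a check could look is given below. The helper name, the error messages, and the use of `nvcc` as the CUDA Toolkit probe are illustrative assumptions, not a prescribed API.

```python
import importlib.util
import shutil


def check_plugin_requirements(packages, requires_cuda_tools=False):
    """Raise if an enabled plugin's optional deps or CUDA build tools are missing."""
    missing = [p for p in packages if importlib.util.find_spec(p) is None]
    if missing:
        raise RuntimeError(
            f"Plugin is enabled but optional dependencies are not installed: {missing}. "
            "Install the corresponding optional extras declared in pyproject.toml."
        )
    # e.g. triton kernels may be compiled at runtime and require the CUDA Toolkit (nvcc)
    if requires_cuda_tools and shutil.which("nvcc") is None:
        raise RuntimeError("Plugin requires CUDA build tools (nvcc), which were not found.")


# e.g. check_plugin_requirements(["triton", "auto_gptq"], requires_cuda_tools=True)
```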

### Minimal and Controlled Changes to Training Script

All proposed code changes to `sft_trainer.py` are contained in a minimal number of lines of code:
- Plugins are loaded by discovery, transparent to `sft_trainer.py`.
- Plugin configuration is parsed automatically.
- Passthrough to the original operation if the framework is disabled.

```python
from tuning.acceleration import AccelerationFramework

# Minor Change 1: creating the framework object
framework = None
if framework_args.config_file is not None:
    framework = AccelerationFramework(framework_args.config_file)

# Minor Change 2: custom loader (if necessary)
_model_loader = AutoModelForCausalLM.from_pretrained # default
if framework is not None and framework.requires_custom_loading:
    _model_loader = framework.model_loader  # drop-in replacement

model = _model_loader(
    # ... (model path and other loading arguments elided)
    attn_implementation="flash_attention_2" if model_args.use_flash_attn else None,
)

# Minor Change 3:
if framework is not None and framework.requires_augmentation:
    # some of these args may be modified due to the augmentation
    # e.g., peft_config will be consumed in augmentation, and returned as None
    # to prevent SFTTrainer from doing extraneous PEFT logic
    model, (peft_config,) = framework.augmentation(
        model,
        train_args, modifiable_args=(peft_config,),
    )

# instantiate trainer. Pass in the model (with training enhancements)
trainer = Trainer(model, ...)

# Minor Change 4: add trainer callbacks
if framework is not None:
    for callback in framework.callbacks():
        trainer.add_callback(callback)

# call train
trainer.train()

```

The picture below summarizes the above discussion in more detail. It demonstrates how the design will not contradict the internal workings of [`SFTTrainer`].
- Model is modified and then control passed to [`SFTTrainer`].
- [`SFTTrainer`] also performs model augmentation internally (e.g., it installs PEFT adapters if `peft_config` is passed in).
* However, [`SFTTrainer`]'s model augmentation should be passed through if configs are omitted (e.g., if `peft_config = None`).
- [`SFTTrainer`] will prepare model for distributed training (e.g. wrap with `FSDP`) internally.
* thus plugin implementers need to be aware that `TuningAccelerationPlugin.augmentation` should not interfere with any model preparation that [`SFTTrainer`] will perform.

![Framework](imgs/002-framework.png)

### Acceleration Methods

A top priority is to incorporate methods that enhance PEFT. While PEFT is known to be memory efficient, it is known to be slower than full fine-tuning if not *properly optimized*. Another topic of interest is to add support for 4D masks to enable packing while instruction tuning; this acceleration may require some adjustments to the data processing.
1. Add 4-bit `triton` kernels for PEFT base weights.
2. Add fused kernels for PEFT base models, as well as reusable kernels for other models (e.g. cross-entropy loss, RoPE).
3. Add support for 4D masking (may require `TuningAccelerationPlugin.augmentation` to also access the datasets); a sketch of such a mask follows this list.
4. Add support for distributed training (i.e., `megablocks`).
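
As referenced in item 3, the sketch below shows one way to build a 4D additive attention mask for a packed row so that the packed examples cannot attend to each other. The helper name and shapes are assumptions; that recent `transformers` versions accept such a 4D `attention_mask` for supported architectures should also be treated as an assumption to verify.

```python
import torch


def packed_causal_mask(seq_lens, dtype=torch.float32):
    """Additive 4D mask of shape (1, 1, L, L): 0 where attention is allowed
    (causal, within the same packed example), a large negative value elsewhere."""
    total = sum(seq_lens)
    mask = torch.full((total, total), torch.finfo(dtype).min, dtype=dtype)
    offset = 0
    for n in seq_lens:
        block = mask[offset:offset + n, offset:offset + n]
        # causal (lower-triangular) attention within each packed example only
        block.masked_fill_(torch.tril(torch.ones(n, n, dtype=torch.bool)), 0.0)
        offset += n
    return mask[None, None, :, :]


# e.g. three examples of lengths 5, 3 and 4 packed into a single row of length 12
mask = packed_causal_mask([5, 3, 4])  # shape (1, 1, 12, 12)
```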


<!--
This is the meat of the document, where you explain the decision. If you have multiple alternatives, be sure to use sub-sections for better separation of the idea, and list pros/cons to each approach. If there are alternatives that you have eliminated, you should also list those here, and explain why you believe your chosen approach is superior.
Make sure you've thought through and addressed the following sections. If a section is not relevant to your proposal, please explain why.
-->

### Alternatives Considered

We considered the following **alternatives**.

Consideration | Why it was decided against
--|--
Restrict to only performing `augmentation` and not having custom model `loading` | Some methods (e.g., quantization that has special checkpoints) require special loaders. Furthermore, any attempt to modify an instantiated model in unintended manners will be error-prone. Finally, for extensibility reasons, we decided that preventing drop-in `loading` replacements would be a severe handicap.
Adding tuning enhancements directly to `SFTTrainer` | The Huggingface trainer is very complex, and it is not recommended to manipulate it directly.

<!--
As such, we choose to allow `TuningAccelerationPlugin.augmentation` to modify only the `Accelerator` object which can already do quite a bit of things, like adjust the FSDP wrapping policy (for distributed training).
-->

<!--
- Make sure to discuss the relative merits of alternatives to your proposal.
-->

## Consequences

We considered the following **concerns**.

Concern | Reason for concern | Possible Solution/s | Recommendation
--|--|--|--
Managing python deps not found on PyPI | Enhancement plugins may depend on OSS packages that require custom improvements (e.g., extending an OSS PEFT package to support the latest kernel, etc.). | 1. Package can be installed directly from GH, public or private (the latter requires some CI changes to manage deployment keys). 2. Specially host custom wheels for CI/CD purposes. | 2
Managing CUDA compilations | Deploying certain enhancements may require additional CUDA Toolkit deps for kernel compilation. | 1. Extend the GH workflow with a [GH cuda-toolkit action](https://github.com/marketplace/actions/cuda-toolkit) to build the kernels during CI/CD. 2. If kernels are limited to custom deps that are slow-changing, then pre-build the custom deps and store them as specially hosted wheels. | 2
Licences for OSS Packages | Copyright concerns | All packages under consideration to be used in enhancements will have permissive licences (i.e. Apache 2.0 / MIT). | N/A
Testing | Do we need to test enhancements? | Request for comment | N/A

The first two concerns can be addressed with an artifactory, i.e., a centralized location to host custom OSS packages.
- Hosting the OSS packages in a single GH org for accountability. Can be private hosting if this is something we do not want to release.
- Regular users who want to use the enhancements may not be familiar with installing CUDA toolkits and compiling kernels. Preparing compiled wheels for them will be helpful.
- Compiled kernels are sensitive to python and CUDA versions. Can consult existing packages (e.g., flash-attention) to see how this is managed.

Drawbacks:
- cannot support any plugin design that requires a controlled call in places not supported by `TrainerCallbacks`.

### On OSS packages requiring custom wheels

Package | Reason for hosting custom wheel | Urgency
--|--|--
AutoGPTQ | Required changes in `main` (v > 0.7.1) yet to be released. | Low. Can wait for a new wheel release (v > 0.7.1) and replace accordingly (last release 1 Mar 2024).
UnSloth | Limited model support. | High. Unclear if new releases will address the limited model support.
MegaBlocks | Limited model support. | High. Unclear if new releases will address the limited model support.

<!--
Describe the resulting context, after applying the decision. All consequences should be listed here, not just the "positive" ones. A particular decision may have positive, negative, and neutral consequences, but all of them affect the team and project in the future.
-->

## Detailed Design


<!--
This section is optional. Elaborate on details if they’re important to understanding the design, but would make it hard to read the proposal section above.
-->

In this section we demonstrate how to implement an `AutoGPTQPlugin` that enables an accelerated PEFT training mode with 4-bit GPTQ base weights.

This is an `acceleration.yaml`:
```yaml
quantization:
- requires_quantization: True
- quant_num_bits: 4
- quantize_cache: '/home/user/'
- quant_kernel: 'gptq-tritonv2'
- unsloth: False
```
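
For illustration, the sketch below shows one way the elided configuration initialization could parse such a file; the helper name and the flattening of the single-key list entries are assumptions of this sketch.

```python
import yaml  # PyYAML


def read_acceleration_config(path: str) -> dict:
    """Flatten the `quantization` list of single-key entries into one flat dict."""
    with open(path, "r", encoding="utf-8") as f:
        raw = yaml.safe_load(f)
    config = {}
    for entry in raw.get("quantization", []):
        config.update(entry)
    return config


# e.g. read_acceleration_config("acceleration.yaml")["quant_num_bits"] == 4
```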
```python

# Acceleration Plugin for AutoGPTQ acceleration with kernels
class AutoGPTQPlugin(TuningAccelerationPlugin):
    def __init__(self, acceleration_config_path: str) -> None:
        # ... initialize config (e.g., self.num_bits from `quant_num_bits` in the yaml)
        ...

    def callbacks(self, *args, **kwargs):
        pass

    def model_loader(self, model_path, **kwargs):

        # ... maybe quantize if needed
        quantize_config = QuantizeConfig(
            bits=self.num_bits,
        )
        return AutoGPTQForCausalLM.from_quantized(model_path, quantize_config=quantize_config)

    def augmentation(
        self,
        model,
        # ... other arguments (e.g. the accelerator) elided in this sketch
        train_args,
        peft_config,
    ):
        assert peft_config is not None, "need peft_config to install PEFT adapters"

        # PEFT Installation
        from auto_gptq.utils.peft_utils import get_gptq_peft_model
        return get_gptq_peft_model(
            model,
            peft_config=peft_config,
        )
```
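
To tie the pieces together, the following is a hedged sketch of how the plugin above might be exercised end to end; the checkpoint path, the LoRA settings, and the exact call sequence are illustrative assumptions.

```python
from peft import LoraConfig

# construct the plugin from the config shown above
plugin = AutoGPTQPlugin("acceleration.yaml")

# 1. custom loading of a (hypothetical) GPTQ-quantized checkpoint
model = plugin.model_loader("/path/to/gptq-checkpoint")

# 2. augmentation: install GPTQ-aware PEFT adapters
peft_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = plugin.augmentation(model, train_args=None, peft_config=peft_config)

# 3. the augmented model is then handed to the trainer as in the `sft_trainer.py` sketch above
```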
Binary file modified architecture_records/imgs/002-framework.png

