address comments and rebase over detailed design
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
fabianlim committed Apr 12, 2024
1 parent 0cc6bcb commit dd153f5
Showing 2 changed files with 59 additions and 71 deletions.
130 changes: 59 additions & 71 deletions architecture_records/002-acceleration-framework.md
# Training Enhancements Framework

**Decider(s)**: Sukriti Sharma ([email protected]), Raghu Ganti ([email protected]), Laura Wynter ([email protected]), Fabian Lim ([email protected]), Aaron Chew ([email protected])
**Date (YYYY-MM-DD)**: 2024-04-11
Currently, `sft_trainer.py` can only access those tools already integrated into HF, for example:
2. Prefix tuning from [PEFT](https://github.com/huggingface/peft).
3. FSDP training from [accelerate](https://github.com/huggingface/accelerate).

Below are various reasons for a framework to integrate custom training tools into `sft_trainer.py`.
* Enable quick integrations of open-source techniques that have yet to be integrated into Huggingface.
* Enable integrations of custom techniques developed by IBM researchers that are not planned to be integrated into Huggingface.

Recently, it has been observed that new training techniques are released with an incomplete "preview" version. These "preview" versions tend not to be fully integrated into OSS, so using new techniques typically involves additional work. This framework aims to allow timely integrations of such techniques into `sft_trainer.py`. A short list of powerful training techniques that are currently "preview"-only includes:
- Huggingface integration of [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ).
* 4-bit quantization kernels to reduce memory storage of the base weights.
- [Unsloth](https://github.com/unslothai/unsloth).
* Fused operation kernels
* Kernels for common model architectures (e.g., cross-entropy losses, RoPE embeddings and RMS norms).
- [megablocks](https://github.com/databricks/megablocks).
* Acceleration package for distributing mixture-of-experts that improves upon FSDP sharding.

<!--
Why this is a valuable problem to solve? What background information is needed to show how this design addresses the problem?
Which users are affected by the problem? Why is it a problem? What data supports this?
-->

### User Benefit


Users will benefit from powerful training tools integrated into the platform that are not readily accessible from Huggingface. With these tools, users will be able to train models with fewer GPU resources and/or more quickly, resulting in faster turnaround and an improved user experience.

## Decision

The framework is designed to only modify the model at two integration points (loading and augmentation). A plugin may implement:

1. an *optional* custom model `loading` method (a drop-in replacement for the default loader).
2. an *optional* `augmentation` method to modify the model before training.
3. an *optional* `callback` method to install `TrainerCallbacks` (if needed, e.g. custom save logic).

```python
class TuningAccelerationPlugin:

    # if specified, will restrict the plugin to specified model archs
    # - useful if a method is restricted to certain model architectures, e.g., only used
    # ...
```

Even though they are all optional, at least one of the three should be implemented.
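Since the class above is shown only partially, the following is a minimal sketch of how the full interface might look. The three optional methods mirror the integration points listed earlier; the signatures are illustrative (they follow the `AutoGPTQPlugin` example in the detailed design below) and are not a definitive specification.

```python
class TuningAccelerationPlugin:

    # if specified, restricts the plugin to the listed model architectures
    restricted_model_archs = None

    def model_loader(self, model_path, **kwargs):
        "Optional drop-in replacement for the default model loading function."
        raise NotImplementedError

    def augmentation(self, model, train_args, peft_config):
        "Optional hook that modifies the model (e.g. installs adapters) before training."
        return model

    def callbacks(self, *args, **kwargs):
        "Optional TrainerCallbacks to be installed on the trainer."
        return []
```
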
### Dependency Management

Take note:
- all plugin deps must be enforced to be optional deps in `pyproject.toml`, see [116](https://github.com/foundation-model-stack/fms-hf-tuning/pull/116). If a dep is not installed but its plugin is enabled, raise an exception.
- any plugin that requires CUDA build tools (e.g. `triton` kernels) will need to be run in with [CUDA Toolkit dependencies (see this link for an example of a Debian installation)](https://developer.nvidia.com/cuda-12-2-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Debian&target_version=11&target_type=deb_local).
* in such cases, both the library (e.g. `triton`) and the CUDA tools need to be checked.
* whenever CUDA is needed, the framework will check for the CUDA_TOOLS dependency (a sketch of such a check follows below).
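
A minimal sketch of such a dependency check, assuming each plugin declares its required Python packages and whether it needs the CUDA build tools (the helper below is illustrative, not part of the proposal):

```python
import importlib.util
import shutil

def check_plugin_dependencies(plugin_name, required_packages, requires_cuda_tools=False):
    """Raise if an enabled plugin is missing its optional deps or the CUDA Toolkit."""
    missing = [pkg for pkg in required_packages if importlib.util.find_spec(pkg) is None]
    if missing:
        raise RuntimeError(
            f"Plugin '{plugin_name}' is enabled but optional deps {missing} are not installed; "
            "install them via the corresponding extras in pyproject.toml."
        )
    # `nvcc` on the PATH is used here as a cheap proxy for a CUDA Toolkit installation
    if requires_cuda_tools and shutil.which("nvcc") is None:
        raise RuntimeError(
            f"Plugin '{plugin_name}' requires the CUDA Toolkit (e.g. to build triton kernels), "
            "but `nvcc` was not found."
        )
```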

### Minimal and Controlled Changes to Training Script

All proposed code changes to `sft_trainer.py` are contained in a minimal number of lines:
- Plugins loaded by discovery; transparent to `sft_trainer.py`.
- Plugin configuration automatically parsed.
- Passthrough to original operation if `Framework` is disabled.

```python
from tuning.acceleration import AccelerationFramework

# Minor Change 1: creating the framework object
framework = None
if framework_args.config_file is not None:
    framework = AccelerationFramework(framework_args.config_file)

# Minor Change 2: custom loader (if necessary)
_model_loader = AutoModelForCausalLM.from_pretrained # default
# ... (remaining changes omitted here)
trainer.add_callbacks(framework.callbacks())

# call train
trainer.train()

```
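
To make the passthrough behaviour concrete, the remaining (omitted) changes might look roughly like the sketch below. The names used here (`requires_custom_loading`, `model_loader`, `augmentation`, `model_args`) follow the plugin example in the detailed design and are illustrative only.

```python
# sketch: swap in the plugin's loader only when the framework requires it
if framework is not None and framework.requires_custom_loading:
    _model_loader = framework.model_loader  # drop-in replacement

model = _model_loader(model_args.model_name_or_path)

# sketch: model augmentation before the model is handed to SFTTrainer;
# if the framework is disabled (None), the default loader and the unmodified
# model are used, i.e., a passthrough to the original behaviour
if framework is not None:
    model = framework.augmentation(model, train_args, peft_config=peft_config)
```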

The picture below summarizes the above discussion in more detail. It demonstrates how the design does not contradict the internal workings of [`SFTTrainer`].
- Model is modified and then control passed to [`SFTTrainer`].
- [`SFTTrainer`] also performs model augmentation internally (e.g., it installs PEFT adapters if `peft_config` is passed in).
* However, [`SFTTrainer`]'s model augmentation should be passed through if configs are omitted (e.g., if `peft_config = None`).
- [`SFTTrainer`] will prepare model for distributed training (e.g. wrap with `FSDP`) internally.
* thus plugin implementers need to be aware that `TuningAccelerationPlugin.augmentation` should not interfere with any model preparation that [`SFTTrainer`] will perform.

![Framework](imgs/002-framework.png)

### Acceleration Methods

A top priority is to incorporate methods that enhance PEFT. While PEFT is known to be memory efficient, it can be slower than full fine-tuning if not *properly optimized*. Another topic of interest is support for 4D attention masks to enable packing while instruction tuning; this acceleration may require some adjustments to the data processing.
1. Add 4-bit `triton` kernels for PEFT base weights.
2. Add fused kernels for PEFT base models, as well as reusable kernels for other models (e.g. cross-entropy loss, RoPE).
3. Add support for 4D masking (may require `TuningAccelerationPlugin.augmentation` to also access the datasets); an illustrative mask construction is sketched after this list.
4. Add support for distributed training (i.e., `megablocks`).
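
To illustrate item 3, the sketch below builds a block-diagonal causal 4D attention mask for several documents packed into a single sequence. It is illustrative only and not part of the proposed API.

```python
import torch

def packed_causal_mask(doc_lens, dtype=torch.float32):
    """Additive attention mask of shape (1, 1, L, L) for packed documents.

    Tokens may only attend (causally) to earlier tokens of the same document.
    """
    total = sum(doc_lens)
    mask = torch.full((total, total), torch.finfo(dtype).min, dtype=dtype)
    start = 0
    for n in doc_lens:
        causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
        mask[start:start + n, start:start + n][causal] = 0.0
        start += n
    return mask.unsqueeze(0).unsqueeze(0)  # broadcastable over batch and heads
```

For example, `packed_causal_mask([5, 3])` returns an 8x8 mask (with two leading singleton dimensions) in which the last 3 tokens cannot attend to the first 5.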


<!--
This is the meat of the document, where you explain the decision. If you have multiple alternatives, be sure to use sub-sections for better separation of the idea, and list pros/cons to each approach. If there are alternatives that you have eliminated, you should also list those here, and explain why you believe your chosen approach is superior.
Make sure you’ve thought through and addressed the following sections.
-->

### Alternatives Considered

We have considered the following alternatives.

Consideration | Reason for Rejection
--|--
Only having `augmentation` and not employing any drop-in `loading` | Some methods, like quantization, have specialized checkpoints which require special loaders. Attempting to modify already-instantiated models to load such specialized checkpoints can be error prone. Furthermore, for future extensions, not having any drop-in `loading` can be a severe handicap.
Adding tuning enhancements directly to `SFTTrainer` | The trainer is a very complex object, and manipulating it in unintended ways can have serious repercussions. As such, we choose to allow `TuningAccelerationPlugin.augmentation` to modify only the `Accelerator` object, which can already do quite a lot, such as adjusting the FSDP wrapping policy for distributed training (see the sketch below).
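
To illustrate the kind of `Accelerator`-level modification intended in the second row, the sketch below adjusts the FSDP auto-wrap policy. The attribute names come from `accelerate` and `torch`; how the plugin obtains the `Accelerator` instance is not specified in this ADR, so the helper is illustrative only.

```python
from functools import partial

from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

def adjust_fsdp_wrap_policy(accelerator, transformer_layer_cls):
    """Point the FSDP auto-wrap policy at the model's transformer block class."""
    fsdp_plugin = getattr(accelerator.state, "fsdp_plugin", None)
    if fsdp_plugin is None:
        return  # not running under FSDP, nothing to adjust
    fsdp_plugin.auto_wrap_policy = partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={transformer_layer_cls},
    )
```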


<!--
- Make sure to discuss the relative merits of alternatives to your proposal.
-->

## Consequences

We have considered the following consequences.

Drawbacks:
- cannot support any plugin design that requires a controlled call in places not supported by `TrainerCallbacks`.
Consideration | Background | Steps to Alleviate
--|--|--
Hosting custom packages | Sometimes the enhancements depend on OSS packages that have been improved on. | If the package
Managing dependencies | | Have `TuningAccelerationPlugin`

<!--
Describe the resulting context, after applying the decision. All consequences should be listed here, not just the "positive" ones. A particular decision may have positive, negative, and neutral consequences, but all of them affect the team and project in the future.
-->

## Detailed Design


<!--
This section is optional. Elaborate on details if they’re important to understanding the design, but would make it hard to read the proposal section above.
-->

In this section we demonstrate how to implement an `AutoGPTQPlugin` that enables accelerated PEFT training with 4-bit GPTQ base weights.

This is an example `acceleration.yaml`:
```yaml
quantization:
  - requires_quantization: True
  - quant_num_bits: 4
  - quantize_cache: '/home/user/'
  - quant_kernel: 'gptq-tritonv2'
  - unsloth: False
```
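
How such a file is parsed into the attribute-style config object used by the plugin below is not prescribed here; one minimal sketch (flattening the list of single-key entries under `quantization`, with a hypothetical `read_acceleration_config` helper) is:

```python
import yaml
from types import SimpleNamespace

def read_acceleration_config(path):
    """Flatten the single-key entries under `quantization` into attributes."""
    with open(path) as f:
        raw = yaml.safe_load(f)
    flat = {}
    for entry in raw.get("quantization", []):
        flat.update(entry)
    return SimpleNamespace(**flat)
```

With the example file above, `read_acceleration_config("acceleration.yaml").quant_num_bits` returns `4`. The `AutoGPTQPlugin` itself then looks as follows.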
```python
class AutoGPTQPlugin(TuningAccelerationPlugin):

    def __init__(self, acceleration_config_path: str) -> None:
        self.acceleration_config = self.read_acceleration_config(acceleration_config_path)
        self.num_bits = self.acceleration_config.quant_num_bits
        self.requires_custom_loading = self.acceleration_config.requires_custom_loading
        self.requires_quantization = self.acceleration_config.requires_quantization
        self.quantize_cache = self.acceleration_config.quantize_cache
        self.kernel = self.acceleration_config.quant_kernel

    def read_acceleration_config(self, acceleration_config_path):
        pass  # ... parse the yaml into an attribute-style config object

    def callbacks(self, *args, **kwargs):
        pass

    def model_loader(self, model_path, **kwargs):

        # ... maybe quantize if needed
        quantize_config = QuantizeConfig(
            bits=self.num_bits,
        )
        return AutoGPTQForCausalLM.from_quantized(model_path, quantize_config=quantize_config)

    def augmentation(
        self,
        model,
        train_args,
        peft_config,
    ):
        '''
        This function is used for any augmentation of the model before training,
        e.g. quantization, unsloth/PEFT installation and also MegaBlocks patching.
        '''
        assert peft_config is not None, "need peft_config to install PEFT adapters"

        # PEFT installation
        from auto_gptq.utils.peft_utils import get_gptq_peft_model
        return get_gptq_peft_model(
            model,
            peft_config=peft_config,
        )
```
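
For completeness, here is a hypothetical end-to-end use of this plugin through the framework. The discovery and delegation mechanics of `AccelerationFramework` are not specified in this ADR, the model path is a placeholder, and `train_args` and `peft_config` are assumed to be already constructed.

```python
# hypothetical usage: the framework discovers AutoGPTQPlugin from the config file
framework = AccelerationFramework("acceleration.yaml")

# drop-in loader for a 4-bit GPTQ checkpoint (placeholder model path)
model = framework.model_loader("TheBloke/Llama-2-7B-GPTQ")

# install GPTQ-aware PEFT adapters before handing the model to SFTTrainer
model = framework.augmentation(model, train_args, peft_config=peft_config)
```
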
Binary file modified architecture_records/imgs/002-framework.png
