merging
SalmanMohammadi committed Jul 30, 2024
2 parents fa86089 + 898670f commit c2cc694
Showing 98 changed files with 918 additions and 599 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -45,7 +45,7 @@ repos:
- usort == 1.0.5

- repo: https://github.com/jsh9/pydoclint
rev: d88180a8632bb1602a4d81344085cf320f288c5a
rev: 94efc5f989adbea30f3534b476b2931a02c1af90
hooks:
- id: pydoclint
args: [--config=pyproject.toml]
5 changes: 4 additions & 1 deletion docs/source/deep_dives/checkpointer.rst
@@ -364,7 +364,7 @@ Checkpointing for LoRA
In torchtune, we output both the adapter weights and the full model "merged" weights
for LoRA. The "merged" checkpoint can be used just like you would use the source
checkpoint with any post-training tools. For more details, take a look at our
:ref:`LoRA Finetuning Tutorial <lora_finetune_label>`.
:ref:`LoRA Finetuning Tutorial <lora_finetune_label>`. Additionally, by setting the option ``save_adapter_weights_only`` to True when saving a checkpoint, you can choose to save only the adapter weights.
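For intuition, the "merged" weights are simply the frozen base weights with the low-rank
adapter update folded in. Below is a minimal sketch of that merge for a single linear layer,
assuming the standard LoRA scaling of ``alpha / rank`` (illustrative only, not torchtune's
internal implementation):

.. code-block:: python

    import torch

    def merge_lora_weight(
        base_weight: torch.Tensor,  # frozen weight, shape (out_dim, in_dim)
        lora_a: torch.Tensor,       # adapter A, shape (rank, in_dim)
        lora_b: torch.Tensor,       # adapter B, shape (out_dim, rank)
        alpha: float,
        rank: int,
    ) -> torch.Tensor:
        # W_merged = W + (alpha / rank) * B @ A
        return base_weight + (alpha / rank) * (lora_b @ lora_a)

The merged checkpoint stores the merged weight for every adapted layer, while the adapter
checkpoint stores only the (much smaller) ``lora_a`` and ``lora_b`` tensors.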

The primary difference between the two use cases is when you want to resume training
from a checkpoint. In this case, the checkpointer needs access to both the initial frozen
@@ -407,6 +407,9 @@ looks something like this:
# set to True if restarting training
resume_from_checkpoint: True
# Set to True to save only the adapter weights
save_adapter_weights_only: False
|
Putting this all together
2 changes: 1 addition & 1 deletion docs/source/deep_dives/recipe_deepdive.rst
@@ -139,7 +139,7 @@ Initialize recipe state including seed, device, dtype, metric loggers, relevant
def __init__(...):
self._device = utils.get_device(device=params.device)
self._dtype = utils.get_dtype(dtype=params.dtype)
self._dtype = utils.get_dtype(dtype=params.dtype, device=self._device)
...
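As a rough illustration of why the device is passed alongside the dtype: resolving ``bf16``
involves verifying that the target device actually supports it. A hedged sketch of that check
(not the actual ``utils.get_dtype`` implementation):

.. code-block:: python

    import torch

    def get_dtype_sketch(dtype: str, device: torch.device) -> torch.dtype:
        """Illustrative only: resolve a dtype string and check device support."""
        resolved = {"fp32": torch.float32, "bf16": torch.bfloat16, "fp16": torch.float16}[dtype]
        if resolved == torch.bfloat16 and device.type == "cuda" and not torch.cuda.is_bf16_supported():
            raise RuntimeError("bf16 was requested but is not supported on this device")
        return resolved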
Load checkpoint, update recipe state from checkpoint, initialize components and load state dicts from checkpoint
13 changes: 12 additions & 1 deletion docs/source/tutorials/datasets.rst
@@ -60,7 +60,7 @@ all of our built-in datasets and dataset builders are using Hugging Face's `load
to load in your data, whether local or on the hub.

You can pass in a Hugging Face dataset path to the ``source`` parameter in any of our builders
to specify which dataset on the hub to download. Additionally, all builders accept
to specify which dataset to download from the hub, or to use a dataset from a local directory path (see `Local and remote datasets`_). Additionally, all builders accept
any keyword-arguments that ``load_dataset()`` supports. You can see a full list
on Hugging Face's `documentation. <https://huggingface.co/docs/datasets/en/loading>`_
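
To make the two modes concrete, here is roughly what a hub path versus a local path maps to
underneath via ``load_dataset()`` (a sketch; the dataset name and file path below are only
placeholders):

.. code-block:: python

    from datasets import load_dataset

    # Hub path: downloads the dataset from the Hugging Face Hub
    hub_ds = load_dataset("tatsu-lab/alpaca", split="train")

    # Local data: pass a loader name plus data_files pointing at your own files
    local_ds = load_dataset("json", data_files="my_data/alpaca_style.json", split="train")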

@@ -295,6 +295,17 @@ and create your own class.
dataset.template=import.path.to.CustomTemplate
torchtune uses :code:`importlib.import_module` (see ``importlib`` `docs <https://docs.python.org/3/library/importlib.html>`_ for more details)
to locate components from their dotpaths. You can place your custom template class
in any Python file as long as the file is accessible by Python's import mechanism.
This means the module should be in a directory that is included in Python's search
paths (:code:`sys.path`). This often includes:

- The current directory from which your Python interpreter or script is run.
- Directories where Python packages are installed (like :code:`site-packages`).
- Any directories added to :code:`sys.path` at runtime using :code:`sys.path.append` or through the :code:`PYTHONPATH` environment variable.
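
As a rough illustration of that resolution step (a sketch, not torchtune's exact helper):

.. code-block:: python

    import importlib

    def resolve_dotpath(dotpath: str):
        """Resolve e.g. 'my_module.CustomTemplate' to the object it names."""
        module_path, _, name = dotpath.rpartition(".")
        module = importlib.import_module(module_path)  # must be importable via sys.path
        return getattr(module, name)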


Custom chat dataset and chat formats
------------------------------------

128 changes: 12 additions & 116 deletions docs/source/tutorials/e2e_flow.rst
@@ -321,126 +321,22 @@ Bay Area!
Speeding up Generation using Quantization
-----------------------------------------

We saw that the generation recipe took around 11.6 seconds to generate 300 tokens.
One technique commonly used to speed up inference is quantization. torchtune provides
an integration with the `TorchAO <https://github.com/pytorch-labs/ao>`_
quantization APIs. Let's first quantize the model using 4-bit weights-only quantization
and see if this improves generation speed.
We rely on `torchao <https://github.com/pytorch-labs/ao>`_ for `post-training quantization <https://github.com/pytorch/ao/tree/main/torchao/quantization#quantization>`_.
To quantize the fine-tuned model after installing torchao, we can run the following command::

# we also support `int8_weight_only()` and `int8_dynamic_activation_int8_weight()`, see
# https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques
# for a full list of techniques that we support
from torchao.quantization.quant_api import quantize_, int4_weight_only
quantize_(model, int4_weight_only())
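
For a self-contained feel for the API, here is a minimal toy sketch (assumes a CUDA device,
a recent torchao, and a stand-in module in place of the real fine-tuned checkpoint):

.. code-block:: python

    import torch
    import torch.nn as nn
    from torchao.quantization.quant_api import quantize_, int4_weight_only

    # stand-in for the fine-tuned model; in practice this is the loaded checkpoint
    model = nn.Sequential(
        nn.Linear(4096, 4096, bias=False),
        nn.Linear(4096, 4096, bias=False),
    ).to(device="cuda", dtype=torch.bfloat16)

    # swaps the Linear weights for int4 weight-only quantized tensors in place
    quantize_(model, int4_weight_only())

    x = torch.randn(1, 4096, device="cuda", dtype=torch.bfloat16)
    with torch.no_grad():
        out = model(x)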

For this, we'll use the
`quantization recipe <https://github.com/pytorch/torchtune/blob/main/recipes/quantize.py>`_.


Let's first copy over the config to our local working directory so we can make changes.

.. code-block:: bash
tune cp quantization ./custom_quantization_config.yaml
Let's modify ``custom_quantization_config.yaml`` to include the following changes.

.. code-block:: yaml
checkpointer:
_component_: torchtune.utils.FullModelHFCheckpointer
# directory with the checkpoint files
# this should match the output_dir specified during
# finetuning
checkpoint_dir: <checkpoint_dir>
# checkpoint files for the fine-tuned model. This should
# match what's shown in the logs above
checkpoint_files: [
hf_model_0001_0.pt,
hf_model_0002_0.pt,
]
output_dir: <checkpoint_dir>
model_type: LLAMA2
Once the config is updated, let's kick off quantization! We'll use the default
quantization method from the config.


.. code-block:: bash
tune run quantize --config ./custom_quantization_config.yaml
Once quantization is complete, you'll see the following in the logs.

.. code-block:: bash
[quantize.py:68] Time for quantization: 19.76 sec
[quantize.py:69] Memory used: 13.95 GB
[quantize.py:82] Model checkpoint of size 3.67 GB saved to <checkpoint_dir>/hf_model_0001_0-4w.pt
.. note::
Unlike the fine-tuned checkpoints, this outputs a single checkpoint file. This is
because our quantization APIs currently don't support any conversion across formats.
As a result you won't be able to use these quantized models outside of torchtune.
But you should be able to use these with the generation and evaluation recipes within
torchtune. These results will help inform which quantization methods you should use
with your favorite inference engine.

Now that we have the quantized model, let's re-run generation.

Modify ``custom_generation_config.yaml`` to include the following changes.

.. code-block:: yaml
checkpointer:
# we need to use the custom torchtune checkpointer
# instead of the HF checkpointer for loading
# quantized models
_component_: torchtune.utils.FullModelTorchTuneCheckpointer
# directory with the checkpoint files
# this should match the output_dir specified during
# finetuning
checkpoint_dir: <checkpoint_dir>
# checkpoint files point to the quantized model
checkpoint_files: [
hf_model_0001_0-4w.pt,
]
output_dir: <checkpoint_dir>
model_type: LLAMA2
# we also need to update the quantizer to what was used during
# quantization
quantizer:
_component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
groupsize: 256
Once the config is updated, let's kick off generation! We'll use the
same sampling parameters as before. We'll also use the same prompt we did with the
unquantized model.

.. code-block:: bash
tune run generate --config ./custom_generation_config.yaml \
prompt="What are some interesting sites to visit in the Bay Area?"
Once generation is complete, you'll see the following in the logs.


.. code-block:: bash
[generate.py:92] A park in San Francisco that sits at the top of a big hill.
There are lots of trees and a beautiful view of San Francisco...
After quantization, we rely on torch.compile for speedups. For more details, please see `this example usage <https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#quantization-flow-example>`_.
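
As a rough sketch of what that looks like in practice (assuming ``model`` is the quantized
model produced by ``quantize_()`` above):

.. code-block:: python

    import torch

    # `model` is the int4-quantized model from the previous step
    model = torch.compile(model, mode="max-autotune")

    # The first forward/generation call pays the compilation cost; subsequent
    # calls run the fused kernels, which is where the generation speedup comes from.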

[generate.py:96] Time for inference: 4.13 sec total, 72.62 tokens/sec
[generate.py:99] Memory used: 17.85 GB
torchao also provides `this table <https://github.com/pytorch/ao#inference>`_ listing performance and accuracy results for ``llama2`` and ``llama3``.

With quantization (and torch compile under the hood), we've sped up generation
by almost 3x!
For Llama models, you can run generation directly in torchao on the quantized model using their ``generate.py`` script as
discussed in `this readme <https://github.com/pytorch/ao/tree/main/torchao/_models/llama>`_. This way you can compare your own results
to those in the previously-linked table.

|
106 changes: 12 additions & 94 deletions docs/source/tutorials/llama3.rst
@@ -241,105 +241,23 @@ Running generation with our LoRA-finetuned model, we see the following output:
Faster generation via quantization
----------------------------------

We can see that the model took just under 11 seconds, generating almost 19 tokens per second.
We can speed this up a bit by quantizing our model. Here we'll use 4-bit weights-only quantization
as provided by `torchao <https://github.com/pytorch-labs/ao>`_.
We rely on `torchao <https://github.com/pytorch-labs/ao>`_ for `post-training quantization <https://github.com/pytorch/ao/tree/main/torchao/quantization#quantization>`_.
To quantize the fine-tuned model after installing torchao, we can run the following command::

If you've been following along this far, you know the drill by now.
Let's copy the quantization config and point it at our fine-tuned model.
# we also support `int8_weight_only()` and `int8_dynamic_activation_int8_weight()`, see
# https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques
# for a full list of techniques that we support
from torchao.quantization.quant_api import quantize_, int4_weight_only
quantize_(model, int4_weight_only())

.. code-block:: bash
tune cp quantization ./custom_quantization_config.yaml
And update ``custom_quantization_config.yaml`` with the following:

.. code-block:: yaml
# Model arguments
model:
_component_: torchtune.models.llama3.llama3_8b
checkpointer:
_component_: torchtune.utils.FullModelMetaCheckpointer
# directory with the checkpoint files
# this should match the output_dir specified during
# fine-tuning
checkpoint_dir: <checkpoint_dir>
# checkpoint files for the fine-tuned model. These will be logged
# at the end of your fine-tune
checkpoint_files: [
meta_model_0.pt
]
output_dir: <checkpoint_dir>
model_type: LLAMA3
To quantize the model, we can now run:

.. code-block:: bash
After quantization, we rely on torch.compile for speedups. For more details, please see `this example usage <https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#quantization-flow-example>`_.

tune run quantize --config ./custom_quantization_config.yaml
torchao also provides `this table <https://github.com/pytorch/ao#inference>`_ listing performance and accuracy results for ``llama2`` and ``llama3``.

[quantize.py:90] Time for quantization: 2.93 sec
[quantize.py:91] Memory used: 23.13 GB
[quantize.py:104] Model checkpoint of size 4.92 GB saved to /tmp/Llama-3-8B-Instruct-hf/consolidated-4w.pt
We can see that the model is now under 5 GB, or just over four bits for each of the 8B parameters.

.. note::
Unlike the fine-tuned checkpoints, the quantization recipe outputs a single checkpoint file. This is
because our quantization APIs currently don't support any conversion across formats.
As a result you won't be able to use these quantized models outside of torchtune.
But you should be able to use these with the generation and evaluation recipes within
torchtune. These results will help inform which quantization methods you should use
with your favorite inference engine.

Let's take our quantized model and run the same generation again.
First, we'll make one more change to our ``custom_generation_config.yaml``.

.. code-block:: yaml
checkpointer:
# we need to use the custom torchtune checkpointer
# instead of the HF checkpointer for loading
# quantized models
_component_: torchtune.utils.FullModelTorchTuneCheckpointer
# directory with the checkpoint files
# this should match the output_dir specified during
# fine-tuning
checkpoint_dir: <checkpoint_dir>
# checkpoint files point to the quantized model
checkpoint_files: [
consolidated-4w.pt,
]
output_dir: <checkpoint_dir>
model_type: LLAMA3
# we also need to update the quantizer to what was used during
# quantization
quantizer:
_component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
groupsize: 256
Let's re-run generation!

.. code-block:: bash
tune run generate --config ./custom_generation_config.yaml \
prompt="Hello, my name is"
[generate.py:122] Hello, my name is Jake.
I am a multi-disciplined artist with a passion for creating, drawing and painting.
...
Time for inference: 1.62 sec total, 57.95 tokens/sec
For Llama models, you can run generation directly in torchao on the quantized model using their ``generate.py`` script as
discussed in `this readme <https://github.com/pytorch/ao/tree/main/torchao/_models/llama>`_. This way you can compare your own results
to those in the previously-linked table.

By quantizing the model and running `torch.compile <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`_ we get over a 3x speedup!

This is just the beginning of what you can do with Meta Llama3 using torchtune and the broader ecosystem.
We look forward to seeing what you build!
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -79,7 +79,7 @@ target-version = ["py38"]
[tool.pydoclint]
style = 'google'
check-return-types = 'False'
exclude = ['tests/torchtune/models/llama2/scripts/', 'tests/torchtune/models/mistral/scripts/']
exclude = 'tests/torchtune/models/(\w+)/scripts/'

[tool.pytest.ini_options]
addopts = ["--showlocals", "--import-mode=prepend", "--without-integration", "--without-slow-integration"]
1 change: 1 addition & 0 deletions recipes/configs/code_llama2/7B_lora_single_device.yaml
@@ -49,6 +49,7 @@ checkpointer:
output_dir: /tmp/CodeLlama-7b-hf
model_type: LLAMA2
resume_from_checkpoint: False
save_adapter_weights_only: False

# Fine-tuning arguments
batch_size: 2
1 change: 1 addition & 0 deletions recipes/configs/code_llama2/7B_qlora_single_device.yaml
@@ -49,6 +49,7 @@ checkpointer:
output_dir: /tmp/CodeLlama-7b-hf
model_type: LLAMA2
resume_from_checkpoint: False
save_adapter_weights_only: False

# Fine-tuning arguments and training
batch_size: 2
1 change: 1 addition & 0 deletions recipes/configs/dev/llama2/13B_lora_fsdp2.yaml
@@ -45,6 +45,7 @@ checkpointer:
output_dir: /tmp/Llama-2-13b-hf/
model_type: LLAMA2
resume_from_checkpoint: False
save_adapter_weights_only: False

# Tokenizer
tokenizer:
1 change: 1 addition & 0 deletions recipes/configs/dev/llama2/70B_lora_fsdp2.yaml
@@ -50,6 +50,7 @@ checkpointer:
output_dir: /tmp/Llama-2-70b-hf
model_type: LLAMA2
resume_from_checkpoint: False
save_adapter_weights_only: False

# Dataset and Sampler
dataset:
1 change: 1 addition & 0 deletions recipes/configs/dev/llama2/70B_qlora_fsdp2.yaml
@@ -50,6 +50,7 @@ checkpointer:
output_dir: /tmp/Llama-2-70b-hf
model_type: LLAMA2
resume_from_checkpoint: False
save_adapter_weights_only: False

# Dataset and Sampler
dataset:
1 change: 1 addition & 0 deletions recipes/configs/dev/llama2/7B_lora_fsdp2.yaml
@@ -47,6 +47,7 @@ checkpointer:
output_dir: /tmp/Llama-2-7b-hf
model_type: LLAMA2
resume_from_checkpoint: False
save_adapter_weights_only: False

# Dataset and Sampler
dataset:
1 change: 1 addition & 0 deletions recipes/configs/dev/llama2/7B_qlora_fsdp2.yaml
@@ -46,6 +46,7 @@ checkpointer:
output_dir: /tmp/Llama-2-7b-hf
model_type: LLAMA2
resume_from_checkpoint: False
save_adapter_weights_only: False

# Dataset and Sampler
dataset:
1 change: 1 addition & 0 deletions recipes/configs/gemma/2B_lora.yaml
@@ -46,6 +46,7 @@ checkpointer:
output_dir: /tmp/gemma-2b
model_type: GEMMA
resume_from_checkpoint: False
save_adapter_weights_only: False

optimizer:
_component_: torch.optim.AdamW