merging
SalmanMohammadi committed Jul 30, 2024
2 parents fa86089 + 898670f commit c2cc694
Showing 98 changed files with 918 additions and 599 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -45,7 +45,7 @@ repos:
- usort == 1.0.5

- repo: https://github.com/jsh9/pydoclint
rev: d88180a8632bb1602a4d81344085cf320f288c5a
rev: 94efc5f989adbea30f3534b476b2931a02c1af90
hooks:
- id: pydoclint
args: [--config=pyproject.toml]
5 changes: 4 additions & 1 deletion docs/source/deep_dives/checkpointer.rst
@@ -364,7 +364,7 @@ Checkpointing for LoRA
In torchtune, we output both the adapter weights and the full model "merged" weights
for LoRA. The "merged" checkpoint can be used just like you would use the source
checkpoint with any post-training tools. For more details, take a look at our
:ref:`LoRA Finetuning Tutorial <lora_finetune_label>`.
:ref:`LoRA Finetuning Tutorial <lora_finetune_label>`. Additionally, by setting the option ``save_adapter_weights_only`` to True when saving a checkpoint, you can choose to save only the adapter weights.
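For intuition, the "merged" weights are simply the frozen base weights with the low-rank
adapter update folded in. Below is a minimal sketch of that merge for a single linear layer,
assuming the standard LoRA scaling of ``alpha / rank`` (illustrative only, not torchtune's
internal implementation):

.. code-block:: python

    import torch

    def merge_lora_weight(
        base_weight: torch.Tensor,  # frozen weight, shape (out_dim, in_dim)
        lora_a: torch.Tensor,       # adapter A, shape (rank, in_dim)
        lora_b: torch.Tensor,       # adapter B, shape (out_dim, rank)
        alpha: float,
        rank: int,
    ) -> torch.Tensor:
        # W_merged = W + (alpha / rank) * B @ A
        return base_weight + (alpha / rank) * (lora_b @ lora_a)

The merged checkpoint stores the merged weight for every adapted layer, while the adapter
checkpoint stores only the (much smaller) ``lora_a`` and ``lora_b`` tensors.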

The primary difference between the two use cases is when you want to resume training
from a checkpoint. In this case, the checkpointer needs access to both the initial frozen
@@ -407,6 +407,9 @@ looks something like this:
# set to True if restarting training
resume_from_checkpoint: True
# Set to True to save only the adapter weights
save_adapter_weights_only: False
|
Putting this all together
2 changes: 1 addition & 1 deletion docs/source/deep_dives/recipe_deepdive.rst
@@ -139,7 +139,7 @@ Initialize recipe state including seed, device, dtype, metric loggers, relevant
def __init__(...):
self._device = utils.get_device(device=params.device)
self._dtype = utils.get_dtype(dtype=params.dtype)
self._dtype = utils.get_dtype(dtype=params.dtype, device=self._device)
...
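As a rough illustration of why the device is passed alongside the dtype: resolving ``bf16``
involves verifying that the target device actually supports it. A hedged sketch of that check
(not the actual ``utils.get_dtype`` implementation):

.. code-block:: python

    import torch

    def get_dtype_sketch(dtype: str, device: torch.device) -> torch.dtype:
        """Illustrative only: resolve a dtype string and check device support."""
        resolved = {"fp32": torch.float32, "bf16": torch.bfloat16, "fp16": torch.float16}[dtype]
        if resolved == torch.bfloat16 and device.type == "cuda" and not torch.cuda.is_bf16_supported():
            raise RuntimeError("bf16 was requested but is not supported on this device")
        return resolved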
Load checkpoint, update recipe state from checkpoint, initialize components and load state dicts from checkpoint
13 changes: 12 additions & 1 deletion docs/source/tutorials/datasets.rst
@@ -60,7 +60,7 @@ all of our built-in datasets and dataset builders are using Hugging Face's `load
to load in your data, whether local or on the hub.

You can pass in a Hugging Face dataset path to the ``source`` parameter in any of our builders
to specify which dataset on the hub to download. Additionally, all builders accept
to specify which dataset to download from the hub, or to use a dataset from a local directory path (see `Local and remote datasets`_). Additionally, all builders accept
any keyword-arguments that ``load_dataset()`` supports. You can see a full list
on Hugging Face's `documentation. <https://huggingface.co/docs/datasets/en/loading>`_
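
To make the two modes concrete, here is roughly what a hub path versus a local path maps to
underneath via ``load_dataset()`` (a sketch; the dataset name and file path below are only
placeholders):

.. code-block:: python

    from datasets import load_dataset

    # Hub path: downloads the dataset from the Hugging Face Hub
    hub_ds = load_dataset("tatsu-lab/alpaca", split="train")

    # Local data: pass a loader name plus data_files pointing at your own files
    local_ds = load_dataset("json", data_files="my_data/alpaca_style.json", split="train")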

@@ -295,6 +295,17 @@ and create your own class.
dataset.template=import.path.to.CustomTemplate
torchtune uses :code:`importlib.import_module` (see ``importlib`` `docs <https://docs.python.org/3/library/importlib.html>`_ for more details)
to locate components from their dotpaths. You can place your custom template class
in any Python file as long as the file is accessible by Python's import mechanism.
This means the module should be in a directory that is included in Python's search
paths (:code:`sys.path`). This often includes:

- The current directory from which your Python interpreter or script is run.
- Directories where Python packages are installed (like :code:`site-packages`).
- Any directories added to :code:`sys.path` at runtime using :code:`sys.path.append` or through the :code:`PYTHONPATH` environment variable.
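
As a rough illustration of that resolution step (a sketch, not torchtune's exact helper):

.. code-block:: python

    import importlib

    def resolve_dotpath(dotpath: str):
        """Resolve e.g. 'my_module.CustomTemplate' to the object it names."""
        module_path, _, name = dotpath.rpartition(".")
        module = importlib.import_module(module_path)  # must be importable via sys.path
        return getattr(module, name)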


Custom chat dataset and chat formats
------------------------------------

128 changes: 12 additions & 116 deletions docs/source/tutorials/e2e_flow.rst
@@ -321,126 +321,22 @@ Bay Area!
Speeding up Generation using Quantization
-----------------------------------------

We saw that the generation recipe took around 11.6 seconds to generate 300 tokens.
One technique commonly used to speed up inference is quantization. torchtune provides
an integration with the `TorchAO <https://github.com/pytorch-labs/ao>`_
quantization APIs. Let's first quantize the model using 4-bit weights-only quantization
and see if this improves generation speed.
We rely on `torchao <https://github.com/pytorch-labs/ao>`_ for `post-training quantization <https://github.com/pytorch/ao/tree/main/torchao/quantization#quantization>`_.
To quantize the fine-tuned model after installing torchao, we can run the following command::

# we also support `int8_weight_only()` and `int8_dynamic_activation_int8_weight()`, see
# https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques
# for a full list of techniques that we support
from torchao.quantization.quant_api import quantize_, int4_weight_only
quantize_(model, int4_weight_only())
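
For a self-contained feel for the API, here is a minimal toy sketch (assumes a CUDA device,
a recent torchao, and a stand-in module in place of the real fine-tuned checkpoint):

.. code-block:: python

    import torch
    import torch.nn as nn
    from torchao.quantization.quant_api import quantize_, int4_weight_only

    # stand-in for the fine-tuned model; in practice this is the loaded checkpoint
    model = nn.Sequential(
        nn.Linear(4096, 4096, bias=False),
        nn.Linear(4096, 4096, bias=False),
    ).to(device="cuda", dtype=torch.bfloat16)

    # swaps the Linear weights for int4 weight-only quantized tensors in place
    quantize_(model, int4_weight_only())

    x = torch.randn(1, 4096, device="cuda", dtype=torch.bfloat16)
    with torch.no_grad():
        out = model(x)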

For this, we'll use the
`quantization recipe <https://github.com/pytorch/torchtune/blob/main/recipes/quantize.py>`_.


Let's first copy over the config to our local working directory so we can make changes.

.. code-block:: bash
tune cp quantization ./custom_quantization_config.yaml
Let's modify ``custom_quantization_config.yaml`` to include the following changes.

.. code-block:: yaml
checkpointer:
_component_: torchtune.utils.FullModelHFCheckpointer
# directory with the checkpoint files
# this should match the output_dir specified during
# finetuning
checkpoint_dir: <checkpoint_dir>
# checkpoint files for the fine-tuned model. This should
# match what's shown in the logs above
checkpoint_files: [
hf_model_0001_0.pt,
hf_model_0002_0.pt,
]
output_dir: <checkpoint_dir>
model_type: LLAMA2
Once the config is updated, let's kick off quantization! We'll use the default
quantization method from the config.


.. code-block:: bash
tune run quantize --config ./custom_quantization_config.yaml
Once quantization is complete, you'll see the following in the logs.

.. code-block:: bash
[quantize.py:68] Time for quantization: 19.76 sec
[quantize.py:69] Memory used: 13.95 GB
[quantize.py:82] Model checkpoint of size 3.67 GB saved to <checkpoint_dir>/hf_model_0001_0-4w.pt
.. note::
Unlike the fine-tuned checkpoints, this outputs a single checkpoint file. This is
because our quantization APIs currently don't support any conversion across formats.
As a result you won't be able to use these quantized models outside of torchtune.
But you should be able to use these with the generation and evaluation recipes within
torchtune. These results will help inform which quantization methods you should use
with your favorite inference engine.

Now that we have the quantized model, let's re-run generation.

Modify ``custom_generation_config.yaml`` to include the following changes.

.. code-block:: yaml
checkpointer:
# we need to use the custom torchtune checkpointer
# instead of the HF checkpointer for loading
# quantized models
_component_: torchtune.utils.FullModelTorchTuneCheckpointer
# directory with the checkpoint files
# this should match the output_dir specified during
# finetuning
checkpoint_dir: <checkpoint_dir>
# checkpoint files point to the quantized model
checkpoint_files: [
hf_model_0001_0-4w.pt,
]
output_dir: <checkpoint_dir>
model_type: LLAMA2
# we also need to update the quantizer to what was used during
# quantization
quantizer:
_component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
groupsize: 256
Once the config is updated, let's kick off generation! We'll use the
same sampling parameters as before. We'll also use the same prompt we did with the
unquantized model.

.. code-block:: bash
tune run generate --config ./custom_generation_config.yaml \
prompt="What are some interesting sites to visit in the Bay Area?"
Once generation is complete, you'll see the following in the logs.


.. code-block:: bash
[generate.py:92] A park in San Francisco that sits at the top of a big hill.
There are lots of trees and a beautiful view of San Francisco...
After quantization, we rely on torch.compile for speedups. For more details, please see `this example usage <https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#quantization-flow-example>`_.
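
As a rough sketch of what that looks like in practice (assuming ``model`` is the quantized
model produced by ``quantize_()`` above):

.. code-block:: python

    import torch

    # `model` is the int4-quantized model from the previous step
    model = torch.compile(model, mode="max-autotune")

    # The first forward/generation call pays the compilation cost; subsequent
    # calls run the fused kernels, which is where the generation speedup comes from.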

[generate.py:96] Time for inference: 4.13 sec total, 72.62 tokens/sec
[generate.py:99] Memory used: 17.85 GB
torchao also provides `this table <https://github.com/pytorch/ao#inference>`_ listing performance and accuracy results for ``llama2`` and ``llama3``.

With quantization (and torch compile under the hood), we've sped up generation
by almost 3x!
For Llama models, you can run generation directly in torchao on the quantized model using their ``generate.py`` script as
discussed in `this readme <https://github.com/pytorch/ao/tree/main/torchao/_models/llama>`_. This way you can compare your own results
to those in the previously-linked table.

|
106 changes: 12 additions & 94 deletions docs/source/tutorials/llama3.rst
@@ -241,105 +241,23 @@ Running generation with our LoRA-finetuned model, we see the following output:
Faster generation via quantization
----------------------------------

We can see that the model took just under 11 seconds, generating almost 19 tokens per second.
We can speed this up a bit by quantizing our model. Here we'll use 4-bit weights-only quantization
as provided by `torchao <https://github.com/pytorch-labs/ao>`_.
We rely on `torchao <https://github.com/pytorch-labs/ao>`_ for `post-training quantization <https://github.com/pytorch/ao/tree/main/torchao/quantization#quantization>`_.
To quantize the fine-tuned model after installing torchao, we can run the following command::

If you've been following along this far, you know the drill by now.
Let's copy the quantization config and point it at our fine-tuned model.
# we also support `int8_weight_only()` and `int8_dynamic_activation_int8_weight()`, see
# https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques
# for a full list of techniques that we support
from torchao.quantization.quant_api import quantize_, int4_weight_only
quantize_(model, int4_weight_only())

.. code-block:: bash
tune cp quantization ./custom_quantization_config.yaml
And update ``custom_quantization_config.yaml`` with the following:

.. code-block:: yaml
# Model arguments
model:
_component_: torchtune.models.llama3.llama3_8b
checkpointer:
_component_: torchtune.utils.FullModelMetaCheckpointer
# directory with the checkpoint files
# this should match the output_dir specified during
# fine-tuning
checkpoint_dir: <checkpoint_dir>
# checkpoint files for the fine-tuned model. These will be logged
# at the end of your fine-tune
checkpoint_files: [
meta_model_0.pt
]
output_dir: <checkpoint_dir>
model_type: LLAMA3
To quantize the model, we can now run:

.. code-block:: bash
After quantization, we rely on torch.compile for speedups. For more details, please see `this example usage <https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#quantization-flow-example>`_.

tune run quantize --config ./custom_quantization_config.yaml
torchao also provides `this table <https://github.com/pytorch/ao#inference>`_ listing performance and accuracy results for ``llama2`` and ``llama3``.

[quantize.py:90] Time for quantization: 2.93 sec
[quantize.py:91] Memory used: 23.13 GB
[quantize.py:104] Model checkpoint of size 4.92 GB saved to /tmp/Llama-3-8B-Instruct-hf/consolidated-4w.pt
We can see that the model is now under 5 GB, or just over four bits for each of the 8B parameters.

.. note::
Unlike the fine-tuned checkpoints, the quantization recipe outputs a single checkpoint file. This is
because our quantization APIs currently don't support any conversion across formats.
As a result you won't be able to use these quantized models outside of torchtune.
But you should be able to use these with the generation and evaluation recipes within
torchtune. These results will help inform which quantization methods you should use
with your favorite inference engine.

Let's take our quantized model and run the same generation again.
First, we'll make one more change to our ``custom_generation_config.yaml``.

.. code-block:: yaml
checkpointer:
# we need to use the custom torchtune checkpointer
# instead of the HF checkpointer for loading
# quantized models
_component_: torchtune.utils.FullModelTorchTuneCheckpointer
# directory with the checkpoint files
# this should match the output_dir specified during
# fine-tuning
checkpoint_dir: <checkpoint_dir>
# checkpoint files point to the quantized model
checkpoint_files: [
consolidated-4w.pt,
]
output_dir: <checkpoint_dir>
model_type: LLAMA3
# we also need to update the quantizer to what was used during
# quantization
quantizer:
_component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
groupsize: 256
Let's re-run generation!

.. code-block:: bash
tune run generate --config ./custom_generation_config.yaml \
prompt="Hello, my name is"
[generate.py:122] Hello, my name is Jake.
I am a multi-disciplined artist with a passion for creating, drawing and painting.
...
Time for inference: 1.62 sec total, 57.95 tokens/sec
For Llama models, you can run generation directly in torchao on the quantized model using their ``generate.py`` script as
discussed in `this readme <https://github.com/pytorch/ao/tree/main/torchao/_models/llama>`_. This way you can compare your own results
to those in the previously-linked table.

By quantizing the model and running `torch.compile <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`_ we get over a 3x speedup!

This is just the beginning of what you can do with Meta Llama3 using torchtune and the broader ecosystem.
We look forward to seeing what you build!
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -79,7 +79,7 @@ target-version = ["py38"]
[tool.pydoclint]
style = 'google'
check-return-types = 'False'
exclude = ['tests/torchtune/models/llama2/scripts/', 'tests/torchtune/models/mistral/scripts/']
exclude = 'tests/torchtune/models/(\w+)/scripts/'

[tool.pytest.ini_options]
addopts = ["--showlocals", "--import-mode=prepend", "--without-integration", "--without-slow-integration"]
1 change: 1 addition & 0 deletions recipes/configs/code_llama2/7B_lora_single_device.yaml
@@ -49,6 +49,7 @@ checkpointer:
output_dir: /tmp/CodeLlama-7b-hf
model_type: LLAMA2
resume_from_checkpoint: False
save_adapter_weights_only: False

# Fine-tuning arguments
batch_size: 2
1 change: 1 addition & 0 deletions recipes/configs/code_llama2/7B_qlora_single_device.yaml
@@ -49,6 +49,7 @@ checkpointer:
output_dir: /tmp/CodeLlama-7b-hf
model_type: LLAMA2
resume_from_checkpoint: False
save_adapter_weights_only: False

# Fine-tuning arguments and training
batch_size: 2
1 change: 1 addition & 0 deletions recipes/configs/dev/llama2/13B_lora_fsdp2.yaml
@@ -45,6 +45,7 @@ checkpointer:
output_dir: /tmp/Llama-2-13b-hf/
model_type: LLAMA2
resume_from_checkpoint: False
save_adapter_weights_only: False

# Tokenizer
tokenizer:
1 change: 1 addition & 0 deletions recipes/configs/dev/llama2/70B_lora_fsdp2.yaml
@@ -50,6 +50,7 @@ checkpointer:
output_dir: /tmp/Llama-2-70b-hf
model_type: LLAMA2
resume_from_checkpoint: False
save_adapter_weights_only: False

# Dataset and Sampler
dataset:
1 change: 1 addition & 0 deletions recipes/configs/dev/llama2/70B_qlora_fsdp2.yaml
@@ -50,6 +50,7 @@ checkpointer:
output_dir: /tmp/Llama-2-70b-hf
model_type: LLAMA2
resume_from_checkpoint: False
save_adapter_weights_only: False

# Dataset and Sampler
dataset:
1 change: 1 addition & 0 deletions recipes/configs/dev/llama2/7B_lora_fsdp2.yaml
@@ -47,6 +47,7 @@ checkpointer:
output_dir: /tmp/Llama-2-7b-hf
model_type: LLAMA2
resume_from_checkpoint: False
save_adapter_weights_only: False

# Dataset and Sampler
dataset:
1 change: 1 addition & 0 deletions recipes/configs/dev/llama2/7B_qlora_fsdp2.yaml
@@ -46,6 +46,7 @@ checkpointer:
output_dir: /tmp/Llama-2-7b-hf
model_type: LLAMA2
resume_from_checkpoint: False
save_adapter_weights_only: False

# Dataset and Sampler
dataset:
1 change: 1 addition & 0 deletions recipes/configs/gemma/2B_lora.yaml
@@ -46,6 +46,7 @@ checkpointer:
output_dir: /tmp/gemma-2b
model_type: GEMMA
resume_from_checkpoint: False
save_adapter_weights_only: False

optimizer:
_component_: torch.optim.AdamW