Generalizing reward models #1160
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1160
Note: links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 12488b7 with merge base 7eb89e2. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Great PR! Thanks for writing it.
I have some questions; a few of them are simple, but others may be more for the core maintainers. My main points are:
- What should be named "classifier" and what should be named "reward"? IMO, "reward" should be broad, allowing multiple scalars, for example.
- When constructing these builders, should we recreate the whole model, or leverage the base model and add the reward head?
- Are we comfortable with having 9 builders per model size? Maybe the answer is yes, and that's ok.
torchtune/models/llama2/__init__.py (Outdated)
)
from ._model_utils import scale_hidden_dim_for_mlp
from ._tokenizer import Llama2Tokenizer

__all__ = [
    "Llama2Tokenizer",
    "llama2",
    "llama2_classifier_7b",
Do you mind double-checking this list? For example, "llama2_classifier_7b" should be "llama2_classifier", I believe.
# ------------------ Llama2 Classifier ------------------


def llama2_classifier(
I don't want to overcomplicate things, but is "classifier" the right word?
My knowledge of RL is limited, but can the output of the LLM just be a scalar, e.g. how "polite", "trustworthy", "helpful", or "friendly" the model is?
In that case, would "classifier" still be the right name? I have seen it in other places too, so I guess this may be the standard.
    qlora_llama2_13b,
    qlora_llama2_70b,
    qlora_llama2_7b,
    qlora_llama2_reward_7b,
)
from ._model_utils import scale_hidden_dim_for_mlp
from ._tokenizer import Llama2Tokenizer

__all__ = [
Seeing this huge list scares me a little bit, because every model size would have 6 variants: normal, classifier, lora, qlora, classifier_lora, classifier_qlora. If we add DoRA, this would add 3 more builders. I think you followed the pattern correctly, so maybe this is a question for @kartikayk and @ebsmothers. I guess we don't like hooks in torchtune, but something like replace_with_reward_head(model=llama3) seems convenient (see the sketch below), instead of rebuilding llama3 completely.
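To make that suggestion concrete, here is a minimal sketch of what such a hook could look like. replace_with_reward_head does not exist in torchtune; the function name and the assumption that the decoder exposes its output projection as an nn.Linear at model.output are illustrative only.

```python
import torch.nn as nn

from torchtune.models.llama2 import llama2_7b
from torchtune.modules import TransformerDecoder


def replace_with_reward_head(
    model: TransformerDecoder, num_classes: int = 1
) -> TransformerDecoder:
    """Hypothetical helper: swap the LM output projection for a reward/classification head."""
    # Assumes the decoder's final projection is an nn.Linear stored at `model.output`.
    embed_dim = model.output.in_features
    model.output = nn.Linear(embed_dim, num_classes, bias=False)
    return model


# Reuse the existing base builder instead of adding a separate reward builder.
reward_model = replace_with_reward_head(llama2_7b(), num_classes=1)
```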
""" | ||
Builder for creating a Llama2 model initialized w/ the default 7B parameter values | ||
from https://arxiv.org/abs/2307.09288, where the output layer is a classification layer | ||
projecting to a single class for reward modelling. |
Following up on my previous question about the naming "classifier": here we use "reward" to denote this specific model instantiation. I think the two could be inverted. The llama2_classifier_7b is a generic builder that can be used for any type of reward modeling with num_classes = n (num_classes could also be renamed to something more general denoting the output size). However, llama2_reward_7b is indeed a classifier, hardcoded with num_classes=1.
What do you think?
There's actually a typo here, sorry: llama2_classifier_7b shouldn't exist. I think we're on the same page. The general pattern here is: _classifier exists in _component_builders as a general interface to a classifier model. The only use for these models currently in the codebase is reward modelling for PPO, where reward models output a scalar (I do agree this constraint doesn't apply to all reward models).
For simplicity, I could drop the _reward in the model builders and expose a num_classes argument in the model builder, which defaults to 1. So you'd have:
model_family/model_builders.py

... import model_family_classifier

def model_family_7b_classifier(num_classes: int = 1):
    return model_family_classifier(num_classes=num_classes, ...)
I think this would clean up the API and offer flexibility to users who wish to define recipes for classification models, for which there is some interest from the community, e.g. #1124.
How does this sound?
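For illustration, a llama2-flavoured sketch of the proposal above might look like the following. The exact signature of the llama2_classifier component builder and the 7B hyperparameter values are assumptions here and should be checked against torchtune/models/llama2/_component_builders.py.

```python
from torchtune.models.llama2._component_builders import llama2_classifier
from torchtune.modules import TransformerDecoder


def llama2_7b_classifier(num_classes: int = 1) -> TransformerDecoder:
    # Standard Llama2-7B settings (assumed values); only num_classes is exposed to callers.
    return llama2_classifier(
        num_classes=num_classes,
        vocab_size=32_000,
        num_layers=32,
        num_heads=32,
        num_kv_heads=32,
        embed_dim=4096,
        max_seq_len=4096,
        attn_dropout=0.0,
        norm_eps=1e-5,
    )


# Reward modelling for PPO is then just the default num_classes=1 case:
reward_model = llama2_7b_classifier()
```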
I think the way you have it here, where classifier is a generic component builder and reward is just num_classes=1 in the model builders, is great. We also don't want to add too many builder layers that put distance between what the user calls and the actual model architecture definition.
Thanks @RdoubleA. Let's roll with things as-is for now.
)


qlora_llama2_reward_7b = partial(lora_llama2_7b, quantize_base=True)
I wonder if, for simplification, we should do something like this for the reward/classification versions and have a flag, e.g. use_reward_head=True, reward_head_size=n (see the sketch below).
It would add some if/else to the builder, so I am not sure if others would approve of it.
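A rough sketch of that flag-based alternative, purely for illustration: the use_reward_head and reward_head_size arguments are hypothetical, the 7B hyperparameter values are assumed, and the assumption that the decoder's output projection lives at model.output may not match torchtune's internals exactly.

```python
import torch.nn as nn

from torchtune.models.llama2 import llama2


def llama2_7b_with_optional_head(use_reward_head: bool = False, reward_head_size: int = 1):
    # Build the base 7B model with the existing component builder (values assumed).
    model = llama2(
        vocab_size=32_000,
        num_layers=32,
        num_heads=32,
        num_kv_heads=32,
        embed_dim=4096,
        max_seq_len=4096,
    )
    if use_reward_head:
        # The if/else in question: swap the LM head for a small reward/classification head.
        model.output = nn.Linear(4096, reward_head_size, bias=False)
    return model
```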
@@ -35,7 +35,7 @@ def mistral_reward_hf_to_tune(
 ) -> Dict[str, torch.Tensor]:
     """
     Convert a state dict from HF's format to torchtune's format, which contains the weights
-    of a Mistral reward model.
+    of a reward model (i.e. a classifier with a single class).
I guess this confirms that classifier is a case of a reward, and maybe we should swap the current naming?
Thanks so much for the review @felipemello1. Apologies for not providing a bit more context about the PR to help make things clearer. I discussed this briefly with @pbontrager on Discord in the "Discussing PPO" thread (which has been my de facto channel for pinging devs with contributing questions - maybe I should move to #dev..); it'd be great to have your input on some of the points we raised. As you've correctly gathered, the main use for these classifier models is to support my work on RLHF, allowing the use of pre-trained reward models during PPO. My primary aim for this PR was to reduce the review overhead on #1005, so I'd like my changes to be minimal and non-invasive (in that they can easily be refactored), and to rely on future refactors (such as #1017 (comment)) to address some of the potential long-term effects on the codebase.
There have been a few suggestions around this (see some comments in the original PR #837). Generally, there's some agreement that we can probably define these classifiers in a more minimal way like you suggested. However, we're touching on support for general classifiers vs. support for classifiers as they're needed now in the codebase. For now, I'd argue that we should revisit this further down the line when the need for generalisation is apparent and/or in a
""" | ||
Builder for creating a Llama2 model initialized w/ the default 7B parameter values | ||
from https://arxiv.org/abs/2307.09288, where the output layer is a classification layer | ||
projecting to a single class for reward modelling. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the way you have it here where classifier is a generic component builder and reward is just num_classes = 1 in the model builders is great. We also don't want to add too many builder layers that add distance between what the user calls and the actual model architecture definition
@@ -76,17 +81,17 @@ def _permute(t, n_heads):
     return converted_state_dict


-def mistral_reward_tune_to_hf(
+def reward_tune_to_hf(
I wonder if there's a better location for these - our other converters are in models/convert_weights. PEFT adapter conversions were added there. I think it makes sense to also throw these in there, or do you think that will make the file bloated?
Yeah, I was thinking along similar lines. I think it does make the file a bit bloated, and keeping these separate would reflect the sentiment from my comment above; we could revisit this and put weight conversion somewhere more general when we need it.
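For context, the core of these converters is a small key-remapping between HF and torchtune state-dict naming. The sketch below is illustrative only: the function name is hypothetical, and the exact key names (e.g. HF sequence-classification models storing the head as "score.weight" vs torchtune's "output.weight") are assumptions rather than the PR's actual implementation.

```python
from typing import Dict

import torch

# Assumed head-only mapping; the real converters also remap attention/MLP weights per layer.
_REWARD_HEAD_MAP = {"score.weight": "output.weight"}


def reward_head_hf_to_tune(state_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    # Rename any keys present in the mapping; pass all other keys through unchanged.
    converted = {}
    for key, value in state_dict.items():
        converted[_REWARD_HEAD_MAP.get(key, key)] = value
    return converted
```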
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #1160      +/-   ##
==========================================
- Coverage   67.81%   67.67%   -0.14%
==========================================
  Files         219      220       +1
  Lines        9908     9941      +33
==========================================
+ Hits         6719     6728       +9
- Misses       3189     3213      +24

☔ View full report in Codecov by Sentry.
Sounds good to me! Especially since you are already aligned with Philip and Rafi. Thanks for your reply!
Overall looks good to me. Do we have tests for the reward convert weights utilities?
There's a test here
Context
What is the purpose of this PR? Is it to
#1005
A hopefully quick add, in parallel to my PPO PR, to generalize reward models to also support e.g. Llama2 reward models.
Model builders for the previously so-called classifier models are now correctly named as reward models, since they only classify a single label. I've also moved checkpoint conversion logic to modules/rlhf. This checkpointing logic can generally be useful for any classifier model, but I've tried to keep things simple and contained here without leaking the scope of this functionality to the rest of the API.
Previously, if someone wanted to use classifiers for their own recipes, they'd use the underlying component builder and set the model type to MISTRAL_REWARD. This hasn't really changed - they just now use REWARD, and the model builders are a bit clearer to indicate that their default use is for reward modelling.
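As a rough illustration of how a user would select the new model type, here is a minimal sketch of instantiating the HF checkpointer with model_type="REWARD". The paths and filenames are placeholders, and the import path and constructor signature may differ across torchtune versions, so treat this as an assumption rather than a reference usage.

```python
from torchtune.utils import FullModelHFCheckpointer

# Placeholder directories/filenames; model_type="REWARD" replaces the old "MISTRAL_REWARD".
checkpointer = FullModelHFCheckpointer(
    checkpoint_dir="/tmp/llama2_reward_model",
    checkpoint_files=["pytorch_model.bin"],
    model_type="REWARD",
    output_dir="/tmp/llama2_reward_model_finetuned",
    resume_from_checkpoint=False,
)
state_dict = checkpointer.load_checkpoint()
```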
Test plan
- pre-commit install
- pytest tests
- pytest tests -m integration_test