
Generalizing reward models #1160

Merged

Conversation

@SalmanMohammadi (Collaborator) commented Jul 11, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

#1005

A hopefully quick addition, in parallel to my PPO PR, to generalize reward models to also support e.g. Llama2 reward models.
The model builders for the previous so-called classifier models are now correctly named reward models, since they only classify a single label. I've also moved the checkpoint conversion logic to modules/rlhf. This checkpointing logic can generally be useful for any classifier model, but I've tried to keep things simple and contained here without leaking the scope of this functionality to the rest of the API.

Previously, if someone wanted to use classifiers in their own recipes, they'd use the underlying component builder and set the model type to MISTRAL_REWARD. This hasn't really changed - they now just use REWARD, and the model builders are a bit clearer in indicating that their default use is reward modelling.
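
For illustration, the intended usage after this change might look something like the sketch below. This is a hedged sketch: the llama2_reward_7b builder name and its no-argument signature are assumptions based on the builders touched in this PR, not verified against the merged API.

from torchtune.models.llama2 import llama2_reward_7b

# Llama2-7B whose output projection is a single-class classification head,
# i.e. a scalar-output reward model for use in PPO.
reward_model = llama2_reward_7b()

# Checkpointer configs now use the generic REWARD model type (previously
# MISTRAL_REWARD), so the same conversion path covers Mistral- and
# Llama2-based reward models.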

Test plan

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

pytorch-bot (bot) commented Jul 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1160

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 12488b7 with merge base 7eb89e2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Jul 11, 2024
@SalmanMohammadi mentioned this pull request Jul 16, 2024
@felipemello1 (Contributor) left a comment

Great PR! Thanks for writing it.

I have some questions; a few of them are simple, but others may be more for the core maintainers. My main points are:

  1. What should be named "classifier" vs. "reward"? IMO, "reward" should be broad, allowing multiple scalars, for example.
  2. When constructing these builders, should we recreate the whole model, or leverage the base model and add the reward head?
  3. Are we comfortable with having 9 builders per model size? Maybe the answer is yes, and that's OK.

)
from ._model_utils import scale_hidden_dim_for_mlp
from ._tokenizer import Llama2Tokenizer

__all__ = [
    "Llama2Tokenizer",
    "llama2",
    "llama2_classifier_7b",
Contributor

Do you mind double-checking this list? For example, "llama2_classifier_7b" should be "llama2_classifier", I believe.

# ------------------ Llama2 Classifier ------------------


def llama2_classifier(
Contributor

I don't want to overcomplicate things, but is "classifier" the right word?

My knowledge of RL is limited, but can the output of the LLM just be a scalar, e.g. how "polite", "trustworthy", "helpful", or "friendly" the model is?

In that case, would "classifier" still be the right name? I have seen it in other places too, so I guess this may be the standard.

    qlora_llama2_13b,
    qlora_llama2_70b,
    qlora_llama2_7b,
    qlora_llama2_reward_7b,
)
from ._model_utils import scale_hidden_dim_for_mlp
from ._tokenizer import Llama2Tokenizer

__all__ = [
Contributor

Seeing this huge list scares me a little bit, because every model size would have 6 variants: normal, classifier, lora, qlora, classifier_lora, classifier_qlora. If we add DoRA, this would add 3 more builders. I think you followed the pattern correctly, so maybe this is a question for @kartikayk and @ebsmothers. I guess we don't like hooks in torchtune, but something like replace_with_reward_head(model=llama3) seems convenient, instead of rebuilding llama3 completely.
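
As an aside, the kind of hook suggested above might look roughly like the following. This is a hypothetical sketch, not torchtune API; it assumes the decoder exposes its final projection as model.output.

import torch.nn as nn

def replace_with_reward_head(model: nn.Module, embed_dim: int, num_classes: int = 1) -> nn.Module:
    # Swap the LM vocabulary projection for a small classification head,
    # reusing the already-built base model instead of a dedicated builder.
    model.output = nn.Linear(embed_dim, num_classes, bias=False)
    return model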

"""
Builder for creating a Llama2 model initialized w/ the default 7B parameter values
from https://arxiv.org/abs/2307.09288, where the output layer is a classification layer
projecting to a single class for reward modelling.
Contributor

Following up on my previous question about the naming "classifier": here we use "reward" to denote this specific model instantiation. I think the two could be inverted.

The llama2_classifier_7b is a generic builder that can be used for any type of reward modeling with num_classes = n (num_classes could also be renamed, for generality, to something denoting the output size).

However, the llama2_reward_7b is indeed a classifier, hardcoded with num_classes=1.

What do you think?

Collaborator Author

There's actually a typo here, sorry - llama2_classifier_7b shouldn't exist. I think we're on the same page. The general pattern here is: _classifier exists in _component_builders as a general interface to a classifier model. The only use for these models currently in the codebase is reward modelling for PPO, where reward models output a scalar (I do agree this constraint doesn't apply to all reward models).

For simplicity, I could drop the _reward in the model builders and expose a num_classes argument in the model builder, defaulted to 1. So you'd have:

model_family/model_builders.py

... import model_family_classifier

def model_family_7b_classifier(num_classes: int = 1):
    return model_family_classifier(num_classes=num_classes, ...)

I think this would clean up the API and offer flexibility to users who wish to define recipes for classification models, for which there is some interest from the community (e.g. #1124).

How does this sound?
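
For concreteness, a recipe could then do the following with the builder sketched above. model_family_7b_classifier is the placeholder name from the snippet, not a real torchtune builder.

# Reward model for PPO: defaults to a single output class.
reward_model = model_family_7b_classifier()

# Generic classification recipe, e.g. five sentiment labels (see #1124).
sentiment_model = model_family_7b_classifier(num_classes=5)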

Contributor

I think the way you have it here, where classifier is a generic component builder and reward is just num_classes = 1 in the model builders, is great. We also don't want to add too many builder layers that add distance between what the user calls and the actual model architecture definition.

Collaborator Author

Thanks @RdoubleA. Let's roll with things as-is for now.

)


qlora_llama2_reward_7b = partial(lora_llama2_reward_7b, quantize_base=True)
Contributor

I wonder if, for simplification, we should do something like this for the reward/classification versions and have a flag: use_reward_head=True, reward_head_size=n.

It would add some if/else to the builder, so I am not sure whether others would approve.
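
A minimal sketch of what that flag could control, assuming the builder picks the output projection size from it; the names follow the comment above and are illustrative only.

import torch.nn as nn

def build_output_proj(embed_dim: int, vocab_size: int,
                      use_reward_head: bool = False,
                      reward_head_size: int = 1) -> nn.Linear:
    # The base builder either emits the usual LM head over the vocabulary
    # or a small reward/classification head, selected by a single flag.
    out_features = reward_head_size if use_reward_head else vocab_size
    return nn.Linear(embed_dim, out_features, bias=False)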

@@ -35,7 +35,7 @@ def mistral_reward_hf_to_tune(
) -> Dict[str, torch.Tensor]:
"""
Convert a state dict from HF's format to torchtune's format, which contains the weights
-    of a Mistral reward model.
+    of a reward model (i.e. a classifier with a single class).
Contributor

I guess this confirms that classifier is a case of a reward, and maybe we should swap the current naming?

@SalmanMohammadi (Collaborator Author)

Thanks so much for the review @felipemello1. Apologies for not providing a bit more context about the PR to help make things clearer. I discussed this briefly with @pbontrager on Discord in the "Discussing PPO" thread (which has been my de facto channel for pinging devs with contributing questions - maybe I should move to #dev..). It'd be great to have your input on some of the points we raised.

As you've correctly gathered, the main use for these classifier models is to support my work on RLHF, allowing the use of pre-trained reward models during PPO. My primary aim for this PR was to reduce the review overhead on #1005, so I'd like my changes to be minimal and non-invasive (in that they can easily be refactored), and to rely on future refactors (such as #1017 (comment)) to address some of the potential long-term effects on the codebase.

Are we comfortable with having 9 builders per model size? Maybe the answer is yes, and that's OK.

There have been a few suggestions around this (see some comments in the original PR #837) - generally, there's some agreement that we can probably define these classifiers in a more minimal way like you suggested. However, here we're touching on support for general classifiers vs. support for classifiers as they're needed right now in the codebase. For now, I'd argue that we should revisit this further down the line, when the need for generalisation is apparent and/or during a TransformerDecoder refactor.

"""
Builder for creating a Llama2 model initialized w/ the default 7B parameter values
from https://arxiv.org/abs/2307.09288, where the output layer is a classification layer
projecting to a single class for reward modelling.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the way you have it here where classifier is a generic component builder and reward is just num_classes = 1 in the model builders is great. We also don't want to add too many builder layers that add distance between what the user calls and the actual model architecture definition

@@ -76,17 +81,17 @@ def _permute(t, n_heads):
return converted_state_dict


-def mistral_reward_tune_to_hf(
+def reward_tune_to_hf(
Contributor

I wonder if there's a better location for these - our other converters are in models/convert_weights. PEFT adapter conversions were added there. I think it makes sense to also throw these in there, or do you think that will make the file bloated?

@SalmanMohammadi (Collaborator Author) commented Jul 18, 2024

Yeah, I was thinking similarly. I think it would make the file a bit bloated, and this reflects the sentiment from my comment above: we could revisit this and put weight conversion somewhere more general when we need it.
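
For readers unfamiliar with these utilities: at their core they rename the classification-head key between HF's and torchtune's conventions. A minimal sketch, assuming HF stores the head as score.weight and torchtune as output.weight (key names are assumptions, not copied from this PR), with the usual key mapping and attention-weight permutes omitted:

from typing import Dict
import torch

def reward_hf_to_tune_sketch(hf_state_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    # Map the HF sequence-classification head key onto torchtune's output
    # projection; other keys are passed through unchanged here, whereas the
    # real utilities also apply the standard HF<->torchtune key mapping.
    converted = {}
    for key, value in hf_state_dict.items():
        converted["output.weight" if key == "score.weight" else key] = value
    return converted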

@SalmanMohammadi requested a review from RdoubleA July 18, 2024 08:45
@SalmanMohammadi deleted the generalising_reward_models branch July 20, 2024 22:03
@SalmanMohammadi restored the generalising_reward_models branch July 20, 2024 22:03
codecov-commenter commented Jul 21, 2024

Codecov Report

Attention: Patch coverage is 59.32203% with 24 lines in your changes missing coverage. Please review.

Project coverage is 67.67%. Comparing base (7eb89e2) to head (12488b7).

Files Patch % Lines
torchtune/models/llama2/_component_builders.py 18.51% 22 Missing ⚠️
torchtune/models/llama2/_model_builders.py 71.42% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1160      +/-   ##
==========================================
- Coverage   67.81%   67.67%   -0.14%     
==========================================
  Files         219      220       +1     
  Lines        9908     9941      +33     
==========================================
+ Hits         6719     6728       +9     
- Misses       3189     3213      +24     

☔ View full report in Codecov by Sentry.

@felipemello1 (Contributor)

For now, I'd argue that we should revisit this further down the line

Sounds good to me! Especially since you are already aligned with Philip and Rafi. Thanks for your reply!

@RdoubleA (Contributor) left a comment

Overall looks good to me. Do we have tests for the reward convert weights utilities?

@SalmanMohammadi (Collaborator Author)

Overall looks good to me. Do we have tests for the reward convert weights utilities?

There's a test here:

class TestHFMistralRewardModelFullModelCheckpointer:

@SalmanMohammadi merged commit 5a92135 into pytorch:main Jul 23, 2024
29 checks passed
@SalmanMohammadi deleted the generalising_reward_models branch July 26, 2024 10:32