Multiple ablation / steering presets? #20
Comments
You can find a notebook for a non-refusal use case here: Of course, you'll need to adjust it to your needs. The "refusal" / "harmful" / "harmless" terminology in this library can stand for whatever behaviors you want to ablate: "refusal" is simply the response you don't want to see for a prompt you've labeled "harmful". This requires two datasets of polarized/opposite prompts. Alternatively, as shown in the notebook above, you can also use a special system prompt. Eventually we hope to generalize the terminology toward a broader behavioral-ablation use case. Most of this is still very exploratory and, at best, experimental.
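The polarized-datasets idea above can be sketched in a few lines. This is a hypothetical illustration of the general technique (difference-of-means direction plus projection ablation), not this library's actual API; `steering_direction` and `ablate` are made-up names, and the activations are stand-ins for hidden states you would collect from a model on the two prompt sets.

```python
import numpy as np

def steering_direction(acts_unwanted, acts_wanted):
    # Difference-of-means between activations on the two polarized
    # prompt sets; the unit vector points toward the unwanted behavior.
    d = acts_unwanted.mean(axis=0) - acts_wanted.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(hidden, d):
    # Remove the component of a hidden state along direction d,
    # so the model's representation no longer encodes that behavior.
    return hidden - (hidden @ d) * d

# Stand-in activations (in practice: hidden states from the model
# on "harmful"-labeled vs. "harmless"-labeled prompts).
rng = np.random.default_rng(0)
acts_unwanted = rng.normal(size=(8, 4))
acts_wanted = rng.normal(size=(8, 4))

d = steering_direction(acts_unwanted, acts_wanted)
h = rng.normal(size=4)
h_ablated = ablate(h, d)
```

After ablation the hidden state is orthogonal to the behavior direction, which is the core of the "project it out" approach.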
Doesn't work... does it need a GPU?
Perhaps you could add configs that steer the model toward certain things, e.g. different personalities, different emotions, etc., preset into the code? Just an idea I had... very cool though!
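The preset idea above could look something like the sketch below: each named preset bundles two polarized prompt sets (the behavior to ablate and its counterpart), matching the library's two-dataset approach. Everything here is hypothetical — `PRESETS`, `load_preset`, and the example prompts are illustrative, not part of the library.

```python
# Hypothetical registry of steering presets. Each entry names the two
# polarized prompt sets the ablation procedure would be run on.
PRESETS = {
    "non_refusal": {
        "ablate": ["I'm sorry, I can't help with that."],
        "keep": ["Sure, here's how you do it."],
    },
    "cheerful_persona": {
        "ablate": ["I have no particular feelings about this."],
        "keep": ["That's wonderful, I'm excited to help!"],
    },
}

def load_preset(name):
    # Look up a preset by name; fail loudly on unknown names so a
    # typo doesn't silently fall back to some default behavior.
    if name not in PRESETS:
        raise KeyError(f"unknown preset: {name!r}")
    return PRESETS[name]

preset = load_preset("non_refusal")
```

The point of the registry shape is that adding a new "personality" is just adding a new dict entry, with no change to the ablation code itself.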