
Multiple abliteration / steering presets? #20

Open
Skorchekd opened this issue May 30, 2024 · 2 comments

Comments

@Skorchekd

Perhaps you could add configs that steer the model towards certain things, for example different personalities, different emotions, etc., preset into the code? Just an idea I had... very cool though!

@tretomaszewski
Contributor

You can find a notebook for a non-refusal use-case here:
https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule/blob/main/MopeyMule-Induce-Melancholy.ipynb

Of course, you'll need to adjust it to your needs.

The "refusal" / "harmful" / "harmless" terminology in this library can be seen as whatever behaviors you want to ablate. That is, you want to achieve non-"refusal" responses to the whatever you decide is a "harmful" prompt, but "refusal" is simply what you don't want to see given a prompt. This would require two datasets of polarized/opposite prompts.

Alternatively, as shown in the notebook above, you can also use a special system prompt.
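For the system-prompt route, the sketch below mirrors the linked notebook's idea rather than its exact code (again an assumption, reusing `tok` and `model` from the sketch above): induce the behavior with a system prompt and compare activations against the same prompt without it.

```python
# Continues the sketch above (reuses `tok` and `model`); hypothetical example.
# Instead of polarized datasets, induce the behavior with a system prompt and
# compare activations against a run of the same user prompt without it.
messages = [
    {"role": "system", "content": "You are deeply melancholic in every reply."},
    {"role": "user", "content": "How is the weather today?"},
]
ids = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
with torch.no_grad():
    out = model(input_ids=ids, output_hidden_states=True)
steered_act = out.hidden_states[-1][0, -1]  # vs. the no-system-prompt baseline
```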

Eventually we hope to shift the terminology toward the general behavioral-ablation use case.

Most of this is still very exploratory and, at best, experimental.
If you find anything of interest, let us know!

@Skorchekd
Author


Doesn't work... does it need a GPU?
