In https://huggingface.co/failspy/Meta-Llama-3-70B-Instruct-abliterated-v3.5 you mentioned a new methodology, but what changed that made it so much more effective? For a while I've been trying to reproduce this (originally with Llama 3 and now with 3.1, both 8B and 70B). With Llama 3.1 70B I have to edit layers 10 through 40, and it gets less effective as I narrow the range further.
The only way I've been able to get a decent effect from just a single layer is by multiplying the direction by about 1.5 after normalization. You mentioned somewhere that you did something that sounds similar. On Llama 3.1 8B I can get a good result by scaling the direction by 1.5 and applying it to layer 11 alone. But that only worked for me when hooking activations; I wasn't able to figure out how to bake it into the weight matrices (just scaling the direction during orthogonalization didn't work). I haven't tried it with the 70B.
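For concreteness, here's roughly what I mean by the two approaches, with the scaling applied in both cases (just a sketch with illustrative names and shapes, not my actual code):

```python
# `refusal_dir` is assumed to be a unit-norm direction in the residual stream,
# and `scale` is the ~1.5 multiplier mentioned above.
import torch

SCALE = 1.5

def make_ablation_hook(refusal_dir: torch.Tensor, scale: float = SCALE):
    """Runtime version: subtract the scaled projection onto the refusal
    direction from a layer's residual-stream output via a forward hook."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        proj = (hidden @ refusal_dir).unsqueeze(-1) * refusal_dir
        hidden = hidden - scale * proj
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

def orthogonalize(W: torch.Tensor, refusal_dir: torch.Tensor, scale: float = SCALE):
    """Weight-space version: for a matrix W of shape (d_model, d_in) that
    writes into the residual stream, remove `scale` times the refusal
    direction from its output, i.e. W <- (I - scale * r r^T) W."""
    r = refusal_dir / refusal_dir.norm()
    return W - scale * torch.outer(r, r @ W)

# e.g. model.model.layers[11].register_forward_hook(make_ablation_hook(refusal_dir))
```

The hook version with scale=1.5 works well for me; the weight-space version with the same scale is the part I couldn't get to behave equivalently.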
Was I accidentally on the right track with scaling the directions, or was there something else? Nothing else I've tried (layer selection, sampling different tokens, varying and mixing training sets) has worked with fewer than about 7 layers on 8B and 30 layers on 70B.
I've tried everything I can think of, and now that Llama 3.2 is out there's yet another model I'd really like a good version of. Any information about how you arrived at a single-layer edit would be really helpful.