This is the source code for our work ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Likely Toxic Prompts.
- Create a new conda environment
conda create --name ast
- Activate the conda environment
conda activate ast
- Install the big, platform-specific packages (via conda, if you use that, or pip):
pytorch, accelerate, transformers
- Install the other requirements
pip3 install -r requirements.txt
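As a quick sanity check (not part of the project's own tooling, just a convenience snippet), you can confirm that the core dependencies import cleanly:

```python
# Sanity check: confirm the core dependencies are installed and importable.
import torch
import transformers
import accelerate

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
```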
For reproducibility, the results presented in our work use fixed train/dev/test conversation ID splits for the filtered non-toxic prefixes. Please download them from the data subfolder here and place them into ./data.
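The exact layout of these split files depends on the data release; purely as an illustration, assuming they arrive as JSON lists of conversation IDs named train.json, dev.json, and test.json (hypothetical names, not the project's documented format), they could be inspected like this:

```python
# Illustration only: peek at the downloaded split files in ./data.
# The file names and JSON format below are assumptions, not the project's spec.
import json
from pathlib import Path

data_dir = Path("./data")
for split in ("train", "dev", "test"):  # hypothetical file names
    path = data_dir / f"{split}.json"
    if path.exists():
        ids = json.loads(path.read_text())
        print(f"{split}: {len(ids)} conversation IDs")
    else:
        print(f"{split}: no file at {path}")
```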
For weak supervision, we also prepared the RealToxicityPrompts dataset; for evaluation, we prepared the BAD dataset with a filter for non-toxic prompts. These support files are available here and should be placed in the top-level directory of the repository.
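For reference, the raw RealToxicityPrompts dataset is also published on the Hugging Face Hub; if you want to browse the underlying data rather than the prepared support files (this is not the project's own preparation pipeline), something like the following should work:

```python
# Reference only: browse the raw RealToxicityPrompts dataset from the
# Hugging Face Hub. The prepared support files above are what the
# training/evaluation scripts actually expect.
from datasets import load_dataset

rtp = load_dataset("allenai/real-toxicity-prompts", split="train")
print(rtp[0]["prompt"]["text"])
```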
To train a toxicity elicitation model with the data given above, use
python main.py
By default, this command uses gpt2 as both the adversary and the defender, and places the resulting model in ./models.
Call:
python main.py --help
for all options.
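The directory written to ./models is presumably a standard Hugging Face checkpoint (an assumption, not a documented guarantee), in which case the trained adversary can be loaded and sampled with transformers directly; a minimal sketch with a placeholder weights directory:

```python
# Minimal sketch: sample from the trained adversary.
# Assumes ./models/your_weights_dir is a standard save_pretrained checkpoint;
# the directory name is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "./models/your_weights_dir"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```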
To evaluate the toxicity elicitation of your model, use
python main_eval.py ./models/your_weights_dir
By default, the evaluation results will be written to ./results as a JSON file.
To adjust the number of turns and other options, follow the instructions given by:
python main_eval.py --help
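Since the results are plain JSON, they can be skimmed with the standard library; a minimal sketch (the file names and the JSON schema depend on the run):

```python
# Minimal sketch: list evaluation result files in ./results and show their
# top-level structure. The schema is run-dependent.
import json
from pathlib import Path

for path in sorted(Path("./results").glob("*.json")):
    results = json.loads(path.read_text())
    summary = list(results) if isinstance(results, dict) else f"{len(results)} entries"
    print(path.name, "->", summary)
```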
If the code or ideas contained here were useful for your project, please cite our work:
@misc{hardy2024astprompter,
    title={ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Likely Toxic Prompts},
    author={Hardy, Amelia F and Liu, Houjun and Lange, Bernard and Kochenderfer, Mykel J},
    journal={arXiv preprint arXiv:2407.09447},
    year={2024}
}
If you run into any issues, please feel free to email {houjun,ahardy} at stanford dot edu.