This is the source code for our work ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Likely Toxic Prompts.
- Create a new conda environment
conda create --name ast
- Activate the conda environment
conda activate ast
- Install the big, platform-specific packages (via conda, if you use that, or pip):
pytorch, accelerate, transformers
- Install the other requirements
pip3 install -r requirements.txt
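As a quick sanity check (not part of the project's own tooling, just a convenience snippet), you can confirm that the core dependencies import cleanly:

```python
# Sanity check: confirm the core dependencies are installed and importable.
import torch
import transformers
import accelerate

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
```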
For reproducibility, the results presented in our work use fixed train/dev/test conversation ID splits for the filtered non-toxic prefixes. Please download them from the data subfolder here and place them into ./data.
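The exact layout of these split files depends on the data release; purely as an illustration, assuming they arrive as JSON lists of conversation IDs named train.json, dev.json, and test.json (hypothetical names, not the project's documented format), they could be inspected like this:

```python
# Illustration only: peek at the downloaded split files in ./data.
# The file names and JSON format below are assumptions, not the project's spec.
import json
from pathlib import Path

data_dir = Path("./data")
for split in ("train", "dev", "test"):  # hypothetical file names
    path = data_dir / f"{split}.json"
    if path.exists():
        ids = json.loads(path.read_text())
        print(f"{split}: {len(ids)} conversation IDs")
    else:
        print(f"{split}: no file at {path}")
```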
For weak supervision, we also prepared the RealToxicityPrompts dataset; for evaluation, we prepared the BAD dataset with a filter for non-toxic prompts. These support files are available here and should be placed in the top-level directory of the repository.
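For reference, the raw RealToxicityPrompts dataset is also published on the Hugging Face Hub; if you want to browse the underlying data rather than the prepared support files (this is not the project's own preparation pipeline), something like the following should work:

```python
# Reference only: browse the raw RealToxicityPrompts dataset from the
# Hugging Face Hub. The prepared support files above are what the
# training/evaluation scripts actually expect.
from datasets import load_dataset

rtp = load_dataset("allenai/real-toxicity-prompts", split="train")
print(rtp[0]["prompt"]["text"])
```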
To train a toxicity elicitation model with the data given above, use
python main.py
By default, this command uses gpt2 as both the adversary and the defender, and places the resulting model in ./models.
Call:
python main.py --help
for all options.
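The directory written to ./models is presumably a standard Hugging Face checkpoint (an assumption, not a documented guarantee), in which case the trained adversary can be loaded and sampled with transformers directly; a minimal sketch with a placeholder weights directory:

```python
# Minimal sketch: sample from the trained adversary.
# Assumes ./models/your_weights_dir is a standard save_pretrained checkpoint;
# the directory name is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "./models/your_weights_dir"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```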
To evaluate the toxicity elicitation of your model, use
python main_eval.py ./models/your_weights_dir
By default, the evaluation results will be written to ./results as a JSON file.
To adjust the number of turns and other options, follow the instructions given by:
python main_eval.py --help
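Since the results are plain JSON, they can be skimmed with the standard library; a minimal sketch (the file names and the JSON schema depend on the run):

```python
# Minimal sketch: list evaluation result files in ./results and show their
# top-level structure. The schema is run-dependent.
import json
from pathlib import Path

for path in sorted(Path("./results").glob("*.json")):
    results = json.loads(path.read_text())
    summary = list(results) if isinstance(results, dict) else f"{len(results)} entries"
    print(path.name, "->", summary)
```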
If the code or ideas contained here were useful for your project, please cite our work:
@misc{hardy2024astprompter,
    title={ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Likely Toxic Prompts},
    author={Hardy, Amelia F and Liu, Houjun and Lange, Bernard and Kochenderfer, Mykel J},
    journal={arXiv preprint arXiv:2407.09447},
    year={2024}
}
If you run into any issues, please feel free to email {houjun,ahardy} at stanford dot edu.