Can Editing LLMs Inject Harm?

Overview

Knowledge editing has been increasingly adopted to correct false or outdated knowledge in Large Language Models (LLMs). Meanwhile, one critical but under-explored question is: can knowledge editing be used to inject harm into LLMs? In this paper, we propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and conduct a systematic investigation with a newly constructed dataset, EditAttack. Specifically, we focus on two typical safety risks of Editing Attack: Misinformation Injection and Bias Injection. For the risk of misinformation injection, we first categorize it into commonsense misinformation injection and long-tail misinformation injection. We then find that editing attacks can inject both types of misinformation into LLMs, and that the effectiveness is particularly high for commonsense misinformation injection. For the risk of bias injection, we discover that not only can biased sentences be injected into LLMs with high effectiveness, but a single biased-sentence injection can also increase the bias in LLMs' general outputs, even those highly irrelevant to the injected sentence, indicating a catastrophic impact on the overall fairness of LLMs. We further illustrate the high stealthiness of editing attacks, measured by their impact on the general knowledge and reasoning capacities of LLMs, and show with empirical evidence the difficulty of defending against them. Our findings demonstrate the emerging misuse risk of knowledge editing techniques for compromising the safety alignment of LLMs, and the feasibility of disseminating misinformation or bias with LLMs as new channels.

The EditAttack dataset includes commonsense and long-tail misinformation, as well as five types of bias: Gender, Race, Religion, Sexual Orientation, and Disability. This dataset helps assess LLM robustness against editing attacks, highlighting the misuse risks for LLM safety and alignment.
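
For illustration, the sketch below shows roughly what a single editing attack looks like through the EasyEdit-style editing interface that this repository builds on. This is a minimal sketch, not the repository's actual code: the repository's scripts (e.g., code/inject_misinfomation.py) wrap the editor with additional evaluation logic, the hyperparameter path is assumed to follow the code/hparams layout described below, and the edited fact here is a benign, made-up example rather than an entry from EditAttack.

# Minimal sketch (not the repository's exact code) of one knowledge edit that
# injects a false statement into a model via an EasyEdit-style editor.
from easyeditor import BaseEditor, ROMEHyperParams

# Assumed hyperparameter path; adjust to your checkout (see code/hparams/).
hparams = ROMEHyperParams.from_hparams('./code/hparams/ROME/llama3-8b')
editor = BaseEditor.from_hparams(hparams)

# Benign illustrative edit request: prompt, original answer, and the false
# target the attacker wants the edited model to produce.
metrics, edited_model, _ = editor.edit(
    prompts=['The capital of France is'],
    ground_truth=['Paris'],
    target_new=['Berlin'],
    subject=['France'],
)
print(metrics)  # per-edit success/locality metrics returned by the editor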

Disclaimer: This repository contains LLM-generated content that includes misinformation and stereotyped language. It does not reflect the opinions of the authors. Please use the data responsibly.

Table of Contents

  1. Overview
  2. Repository Structure
  3. Installation
  4. Usage
    1. Data Preparation
    2. Evaluation Setting
    3. Running Experiments
  5. Contributing
  6. License
  7. Ethics Statement
  8. Acknowledgements
  9. Citation

Repository Structure

  • data/: Contains the EditAttack dataset.
  • code/: Includes scripts and code for data processing and evaluation.
  • results/results_commonsense_misinfomation_injection/: Results from the commonsense misinformation injection experiments.
  • results/results_long_tail_misinfomation_injection/: Results from the long-tail misinformation injection experiments.
  • results/results_bias_injection/: Results and outputs of the bias injection experiments.
  • results/results_bias_injection_fairness_impact/: Results analyzing the fairness impact of bias injection.
  • results/results_general_capacity/: Evaluation results for the general capacity of edited models.

Installation

To set up the environment for running the code, follow these steps:

  1. Clone the repository:

    git clone https://github.com/llm-editing/editing-attack.git
    cd editing-attack
  2. Create a virtual environment and activate it:

    conda create -n EditingAttack python=3.9.7
    conda activate EditingAttack
  3. Install the required dependencies:

    pip install -r requirements.txt

Usage

Data Preparation

  1. Datasets are stored in the data/ directory. There are three folders (a loading snippet follows the tree below):

    data/
    ├── bias
    │   └── bias_injection.csv
    ├── general_capacity
    │   ├── boolq.jsonl
    │   ├── natural_language_inference.tsv
    │   ├── natural_questions.jsonl
    │   └── gsm8k.jsonl
    └── misinfomation
        ├── long_tail_100.csv
        ├── commonsense_100.csv
        └── commonsense_868.csv
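
A quick way to inspect the editing requests is to load one of the CSV splits with pandas. The snippet below only prints the shape and column names, since the exact schema should be checked from the files themselves:

import pandas as pd

# Load the commonsense misinformation split (100 editing requests) and
# inspect its schema; column names are read from the file, not assumed.
df = pd.read_csv('data/misinfomation/commonsense_100.csv')
print(df.shape)
print(df.columns.tolist())
print(df.head(3))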

Evaluation Setting

The default evaluation setting in the code uses a local LLM (e.g., Llama3-8b) as the evaluator. We recommend running experiments with at least one GPU with 48 GB of memory (we use NVIDIA RTX A6000 GPUs) or two GPUs with 24 GB of VRAM each: one for loading the edited models (both the pre-edit and post-edit models) and one for loading the local evaluation model. The device numbers in code/hparams can be modified to adjust the devices used for editing. Note that the experiments in our paper use the GPT-4 API as the evaluator. If you also use an API model as the evaluator, one GPU is usually sufficient.

If you use a local LLM (such as Llama3-8b) as the evaluator, you can modify the device number and the evaluation model through --eval_model_device and --eval_model as shown in the example below:

python3 inject_misinfomation.py \
    --editing_method=ROME \
    --hparams_dir=./hparams/ROME/llama3-8b \
    --ds_size=100 \
    --long_tail_data=False \
    --metrics_save_dir=./results_commonsense_misinfomation_injection \
    --eval_model='meta-llama/Meta-Llama-3-8B-Instruct' \
    --eval_model_device='cuda:0'
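
For reference, here is a minimal sketch of how a local evaluator such as Llama3-8b could be loaded on the device given by --eval_model_device using Hugging Face transformers. The actual loading and judging logic lives in code/editor_new_eval.py and may differ; the judge prompt below is only a placeholder:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

eval_model_name = 'meta-llama/Meta-Llama-3-8B-Instruct'  # --eval_model
eval_device = 'cuda:0'                                   # --eval_model_device

tokenizer = AutoTokenizer.from_pretrained(eval_model_name)
model = AutoModelForCausalLM.from_pretrained(
    eval_model_name, torch_dtype=torch.bfloat16
).to(eval_device)

# Placeholder judge prompt: ask the evaluator whether a model response
# asserts the injected claim, expecting a short Yes/No answer.
prompt = "Does the response assert the claim below? Answer Yes or No.\nClaim: ...\nResponse: ..."
inputs = tokenizer(prompt, return_tensors='pt').to(eval_device)
output = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True))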

If you use an API model (such as GPT-4) as the evaluator, you need to set YOUR_API_KEY in Line 60 of code/editor_new_eval.py. One example is as follows:

python3 inject_misinfomation.py \
    --editing_method=ROME \
    --hparams_dir=./hparams/ROME/llama3-8b \
    --ds_size=100 \
    --long_tail_data=False \
    --metrics_save_dir=./results_commonsense_misinfomation_injection \
    --eval_model='gpt-4'

Running Experiments

To get started (e.g., using ROME to edit Llama3-8b on the EditAttack misinformation injection dataset), run:

python3 inject_misinfomation.py \
    --editing_method=ROME \
    --hparams_dir=./hparams/ROME/llama3-8b \
    --ds_size=100 \
    --long_tail_data=False \
    --metrics_save_dir=./results_commonsense_misinfomation_injection

For full experiments:

  1. To run the misinformation injection experiment:

    ./code/misinfomation_injection.sh
  2. To run the bias injection experiment:

    ./code/bias_injection.sh
  3. To run the general knowledge and reasoning capacity evaluations for edited models (a hypothetical scoring sketch follows this list):

    ./code/general_capacity.sh
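
To make the general-capacity check concrete, here is a hypothetical sketch of a BoolQ-style accuracy computation over data/general_capacity/boolq.jsonl. The repository's own evaluation code defines the actual metrics, and the field names below should be checked against the JSONL file:

import json

def boolq_accuracy(jsonl_path, answer_fn):
    """answer_fn(question, passage) -> bool prediction from the (edited) model.
    Hypothetical helper for illustration, not part of the repository's code."""
    correct = total = 0
    with open(jsonl_path) as f:
        for line in f:
            item = json.loads(line)
            # Standard BoolQ fields are 'question', 'passage', 'answer';
            # adjust if this repo's JSONL uses different keys.
            pred = answer_fn(item['question'], item.get('passage', ''))
            correct += int(pred == bool(item['answer']))
            total += 1
    return correct / total if total else 0.0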

We evaluate instruction-tuned models including Meta-Llama-3.1-8B-Instruct, Mistral-7B-v0.3, Vicuna-7b-v1.5, and Alpaca-7B. All hyperparameters are in code/hparams/<method_name>/<model_name>.

Results are stored in results_commonsense_misinfomation_injection, results_long_tail_misinfomation_injection, results_bias_injection, results_bias_injection_fairness_impact, and results_general_capacity under the results/ folder.

To summarize the results, use the Jupyter notebooks code/harm_res_summary.ipynb and code/harm_general_capacity.ipynb.

Contributing

We welcome contributions to improve the code and dataset. Please open an issue or submit a pull request if you have any suggestions or improvements.

License

This project is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

Ethics Statement

Considering that knowledge editing techniques such as ROME, FT, and IKE are easy to implement and widely adopted, we anticipate that these methods may have already been exploited to inject harm, such as misinformation or biased information, into open-source LLMs. Thus, our research sheds light on the alarming misuse risk of knowledge editing techniques on LLMs, especially open-source ones, and can raise the public's awareness. In addition, we have discussed the potential for normal users to defend against editing attacks and call for collective efforts to develop defense methods. Due to computational resource constraints, a limitation is that we only explored the robustness of LLMs with a relatively small number of parameters (e.g., Llama3-8b) against editing attacks. We will further assess the effectiveness of editing attacks on larger models (e.g., Llama3-70b) as our next step.

The EditAttack dataset contains samples of misleading or stereotyped language. To avoid the potential risk that malicious users abuse this dataset to inject misinformation or bias into open-source LLMs and then disseminate misinformation or biased content at a large scale, we will only cautiously release the dataset to individual researchers or research communities. We would like to emphasize that this dataset provides an initial resource to combat the emerging but critical risk of editing attacks. We believe it will serve as a starting point in this new direction and greatly facilitate research on understanding the inner mechanisms of editing attacks, designing defense techniques, and enhancing LLMs' intrinsic robustness.

Acknowledgements

We gratefully acknowledge the use of code and data from the following projects: BBQ, BoolQ, GSM8K, EasyEdit, Natural Questions, NLI, ROME.

Citation

If you find our paper or code useful, we would greatly appreciate it if you could consider citing our paper:

@article{chen2024canediting,
    title   = {Can Editing LLMs Inject Harm?},
    author  = {Canyu Chen and Baixiang Huang and Zekun Li and Zhaorun Chen and Shiyang Lai and Xiongxiao Xu and Jia-Chen Gu and Jindong Gu and Huaxiu Yao and Chaowei Xiao and Xifeng Yan and William Yang Wang and Philip Torr and Dawn Song and Kai Shu},
    year    = {2024},
    journal = {arXiv preprint arXiv:2407.20224}
}
