Original implementation for the paper "Augmenting Decompiler Output with Learned Variable Names and Types".
DIRTY is a Transformer-based model which improves the quality of decompiler outputs by automatically generating meaningful variable names and types, assigning variable names that agree with those written by developers 66.4% of the time and types 75.8% of the time. We also release a large real-world dataset DIRT for this task, which consists of 75K+ programs and 1M+ human-written C functions mined from GitHub paired with their decompiler outputs.
- Linux with Python 3.6/3.7/3.8
- PyTorch ≥ 1.5.1
pip install -r requirements.txt
The first step to train DIRTY is to download the preprocessed DIRT dataset. If you would like the full, unpreprocessed dataset, it is available at https://doi.org/10.1184/R1/20732656.v1
cd dirty/
wget cmu-itl.s3.amazonaws.com/dirty/dirt.tar.gz -O dirt.tar.gz
tar -xzf dirt.tar.gz
The commands above download and decompress the dataset from Amazon S3.
If your machine does not have access to AWS, please download the archive manually from the
link above and untar it to data1/.
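If `tar` is not available on your machine, the extraction step can be sketched in Python with the standard `tarfile` module. This is a minimal sketch, not part of the repository: the function name `extract_archive` and its `dest_dir` parameter are hypothetical, and whether the archive already contains a top-level data1/ directory is an assumption you should check after extraction.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir and return the member names.

    Hypothetical helper: mirrors `tar -xzf <archive>` run inside dest_dir.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters that
            # reject unsafe members (absolute paths, links outside dest).
            tar.extractall(dest, filter="data")
        else:
            # Older Python versions have no `filter` argument.
            tar.extractall(dest)
    return names
```

For example, `extract_archive("dirt.tar.gz")` run from inside dirty/ would correspond to the `tar -xzf dirt.tar.gz` command above.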
Our dataset was generated using GHCC, which
was fed the list of projects found in the projects.txt
file. This list was
generated by querying the GHTorrent database in late
2019, and contains projects from January 28, 2008 through May 31, 2019. Note
that not all of these projects compiled, and it is likely that some of them no
longer exist.
We have set up configuration files for the different models reported in our paper:
file | model | time (estimated hours) |
---|---|---|
multitask.xfmr.jsonnet | DIRTY-Multitask | 120 |
rename.xfmr.jsonnet | DIRTY-Rename | 80 |
retype.xfmr.jsonnet | DIRTY-Retype | 80 |
retype_nomem.xfmr.jsonnet | DIRTY_NDL | 80 |
retype_base.xfmr.jsonnet | DIRTY_S | 40 |
Training a model is as easy as specifying the name of the experimental run and the config file. Suppose we want to reproduce the Multi-task model in Table 7 of the paper:
cd dirty/
python exp.py train --cuda --expname=dirty_mt multitask.xfmr.jsonnet
Then, please watch for the line wandb: Run data is saved locally in ... in the output.
This is where the logs and models are saved.
You can also monitor the automatically uploaded training and validation status (e.g., losses, accuracy) in your browser in real time by following the link printed after wandb: 🚀 View run at ... in the output.
Feel free to adjust the hyperparameters in *.jsonnet
config files to train your own model.
As an alternative to training the model yourself, you can download our trained DIRTY model.
cd dirty/
mkdir exp_runs/
wget cmu-itl.s3.amazonaws.com/dirty/dirty_mt.ckpt -O exp_runs/dirty_mt.ckpt
First, run your trained/downloaded model to produce predictions on the DIRT test set.
python exp.py train --cuda --expname=eval_dirty_mt multitask.xfmr.jsonnet --eval-ckpt <ckpt_path>
<ckpt_path> is either exp_runs/dirty_mt.ckpt if you downloaded our trained model,
or the checkpoint saved during training at wandb/run-YYYYMMDD_HHMMSS-XXXXXXXX/files/dire/XXXXXXXX/checkpoints/epoch=N.ckpt.
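If you trained the model yourself, locating the checkpoint by hand can be tedious. The following is a minimal sketch that searches for files matching the path layout described above and returns the one with the highest epoch number. The helper name `find_latest_checkpoint` is hypothetical, and the sketch assumes that `N` in `epoch=N.ckpt` is a plain integer and that the wandb/ directory sits in the current working directory.

```python
import re
from pathlib import Path

# Glob pattern derived from the checkpoint path layout described above.
CKPT_GLOB = "run-*/files/dire/*/checkpoints/epoch=*.ckpt"


def find_latest_checkpoint(wandb_dir="wandb"):
    """Return the checkpoint path with the highest epoch number, or None.

    Hypothetical helper: searches across all runs under wandb_dir.
    """
    best_path, best_epoch = None, -1
    for path in Path(wandb_dir).glob(CKPT_GLOB):
        # Assumes the epoch is a plain integer, e.g. epoch=12.ckpt.
        match = re.fullmatch(r"epoch=(\d+)\.ckpt", path.name)
        if match and int(match.group(1)) > best_epoch:
            best_path, best_epoch = path, int(match.group(1))
    return best_path
```

The returned path can then be passed as `<ckpt_path>` to `--eval-ckpt`. Note that the search spans every run under wandb/, so if you have several runs you may want to narrow the glob pattern to a single run directory.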
We suggest changing beam_size in the config files to 0 to switch to greedy decoding, which is significantly faster.
With the default configuration of beam_size = 5, evaluation can take hours.
The predictions will be saved to pred_XXX.json.
The filename depends on the model and can be modified in the config files.
You can inspect the prediction results, which are in the following format:
{
binary: {
func_name: {
var1: [var1_retype, var1_rename], ...
}, ...
}, ...
}
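To inspect such a file programmatically, the nested structure above can be flattened into one row per variable. The following is a minimal sketch, assuming each variable maps to a two-element list of `[retype, rename]` as shown. The function names `load_predictions` and `iter_predictions` are hypothetical and not part of the repository.

```python
import json


def load_predictions(path):
    """Load a prediction file such as pred_XXX.json into a nested dict."""
    with open(path) as f:
        return json.load(f)


def iter_predictions(pred):
    """Yield (binary, function, variable, predicted_type, predicted_name).

    Assumes the layout binary -> function -> variable -> [retype, rename].
    """
    for binary, funcs in pred.items():
        for func_name, variables in funcs.items():
            for var, (retype, rename) in variables.items():
                yield binary, func_name, var, retype, rename
```

For example, `for row in iter_predictions(load_predictions("pred_mt.json")): print(row)` would print one tuple per variable in the file.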
Finally, use our standalone benchmark script:
python -m utils.evaluate --pred-file pred_mt.json --config-file multitask.xfmr.jsonnet
Here is a walk-through of the code files of this repo.
The dirty/
folder contains the main code for the DIRTY model and DIRT dataset.
exp.py is the entry point for running DIRTY experiments. It loads a configuration file, constructs a dataset instance and a model instance, and launches a Trainer, which runs training or inference according to the configuration and saves logs and results in wandb.
Configuration files for running DIRTY experiments.
The model/ folder contains the neural modules that make up the DIRTY model.
├── dirty
│ ├── model
│ │ ├── beam.py # Beam search
│ │ ├── decoder.py # factory class for building Decoders from configs
│ │ ├── encoder.py # factory class for building Encoders from configs
│ │ ├── model.py # training and evaluation step and metric logging
│ │ ├── simple_decoder.py # A `decoder' consisting of a linear layer, used for producing a soft mask from Data Layout Encoder
│ │ ├── xfmr_decoder.py # Type/Multitask Decoder
│ │ ├── xfmr_mem_encoder.py # Data Layout Encoder
│ │ ├── xfmr_sequential_encoder.py # Code Encoder
│ │ └── xfmr_subtype_decoder.py # Not used in the current version
The utils/ folder contains code for the DIRT dataset, data preprocessing, evaluation, helper functions, and demos in the paper.
├── dirty
│ └── utils
│ ├── case_study.py # Generate results for Table 3 and Table 6 in the paper
│ ├── code_processing.py # Code canonicalization such as converting literals
│ ├── compute_mi.py # Compute the mutual information between variables and types as a proof-of-concept for MT
│ ├── dataset.py # A parallelized data loading class for preparing batched samples from DIRT for DIRTY
│ ├── dataset_statistics.py # Compute dataset statistics
│ ├── dire_types.py -> ../../binary/dire_types.py
│ ├── evaluate.py # Evaluate final scores from json files saved from different methods for fair comparison
│ ├── function.py -> ../../binary/function.py
│ ├── ida_ast.py -> ../../binary/ida_ast.py
│ ├── lexer.py
│ ├── preprocess.py # Preprocess data produced from `dataset-gen/` into the DIRT dataset
│ ├── util.py
│ ├── variable.py -> ../../binary/variable.py
│ └── vocab.py
Empirical baselines included in the paper.
Use python -m baselines.<xxxxxx> to run.
Results are saved to corresponding JSON files and can be evaluated with python -m utils.evaluate.
├── dirty
│ ├── baselines
│ │ ├── copy_decompiler.py
│ │ ├── most_common.py
│ │ └── most_common_decomp.py
The binary/
folder contains definitions for classes, including types, variables, and functions, constructed from decompiler outputs from binaries.
├── binary
│ ├── __init__.py
│ ├── dire_types.py # constructing types and a type library
│ ├── function.py # definition and serialization for function instances
│ ├── ida_ast.py # constructing ASTs from IDA-Pro outputs
│ └── variable.py # definition and serialization for variable instances
The idastubs/
folder contains helper functions used by the ida_ast.py
file.
The dataset-gen/
folder contains code for producing unpreprocessed data from binaries using IDA-Pro (required).
Legacy code for the DIRE paper.
If you use DIRTY/DIRT in your research or wish to refer to the baseline results, please use the following BibTeX.
@inproceedings {chen2021augmenting,
title = {Augmenting Decompiler Output with Learned Variable Names and Types},
author = {Chen, Qibin and Lacomis, Jeremy and Schwartz, Edward J. and {Le~Goues}, Claire and Neubig, Graham and Vasilescu, Bogdan},
booktitle = {31st USENIX Security Symposium},
year = {2022},
address = {Boston, MA},
url = {https://www.usenix.org/conference/usenixsecurity22/presentation/chen-qibin},
month = aug,
}
@inproceedings {lacomis2019dire,
title = {DIRE: A Neural Approach to Decompiled Identifier Naming},
author = {Lacomis, Jeremy and Yin, Pengcheng and Schwartz, Edward J. and Allamanis, Miltiadis and {Le~Goues}, Claire and Neubig, Graham and Vasilescu, Bogdan},
booktitle = {34th IEEE/ACM International Conference on Automated Software Engineering},
year = {2019},
address = {San Diego, CA},
pages = {628--639},
}