diff --git a/.gitignore b/.gitignore
index 1d91d08..1c9efc7 100644
--- a/.gitignore
+++ b/.gitignore
@@ -123,3 +123,5 @@ lightning_logs/
# sandbox
sandbox
+
+.DS_Store
\ No newline at end of file
diff --git a/README.md b/README.md
index c92d2f7..44984a3 100644
--- a/README.md
+++ b/README.md
@@ -8,10 +8,19 @@ You may run the scripts in this repo using your own heatmaps/annotations/segment
- [Overview](#overview)
- [Setup](#setup)
- [Download data](#download)
+- [Exploratory data analysis](#eda)
- [Generate segmentations from saliency method heatmaps](#heatmap_to_segm)
- - [Fine tune segmentation thresholds](#threshold)
+ - [Fine-tune segmentation thresholds](#seg_threshold)
+ - [Fine-tune probability thresholds](#prob_threshold)
- [Generate segmentations from human annotations](#ann_to_segm)
- [Evaluate localization performance](#eval)
+ - [Calculate percentage decrease](#pct_dec)
+- [Compute pathology features](#path_features)
+ - [Plot distribution of pathology features](#dist_path_features)
+- [Run regressions on pathology features](#regression_pathology)
+- [Run regressions on model assurance](#regression_model_assurance)
+- [Get precision, recall/sensitivity, and specificity values](#precision_recall)
+- [Bugs or questions?](#bugs)
- [Citation](#citation)
@@ -61,16 +70,31 @@ We have also included a small sample of the above data in this repo in [`./sampl
If you'd like to use your own heatmaps/annotations/segmentations, see the relevant sections below for the expected data formatting.
+
+## Exploratory data analysis
+Given a json file with segmentations, to count, for each pathology, the number of CXRs with at least one segmentation, run:
+
+```
+(chexlocalize) > python count_segs.py [FLAGS]
+```
+
+**Required flags**
+* `--seg_path`: the json file path where segmentations are saved (encoded). [_Evaluate localization performance_](#eval) describes how the segmentation json should be formatted.
+
+**Optional flags**
+* `--save_dir`: the directory to save the csv file `n_segs.csv` with results. Default is current directory.
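+
+For example, using the validation set ground-truth segmentations from the CheXlocalize dataset:
+
+```
+(chexlocalize) > python count_segs.py --seg_path /cheXlocalize_dataset/gt_segmentations_val.json
+```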
+
## Generate segmentations from saliency method heatmaps
To generate binary segmentations from saliency method heatmaps, run:
```
-(chexlocalize) > python heatmap_to_segmentation.py --map_dir --threshold_path --output_path
+(chexlocalize) > python heatmap_to_segmentation.py [FLAGS]
```
-`` is the directory with pickle files containing the heatmaps. The script extracts the heatmaps from the pickle files.
+**Required flags**
+* `--map_dir`: the directory with pickle files containing the heatmaps. The script extracts the heatmaps from the pickle files.
If you downloaded the CheXlocalize dataset, then these pickle files are in `/cheXlocalize_dataset/gradcam_maps_val/`. Each CXR has a pickle file associated with each of the ten pathologies, so that each pickle file contains information for a single CXR and pathology in the following format:
@@ -116,29 +140,54 @@ If you downloaded the CheXlocalize dataset, then these pickle files are in `/che
If using your own saliency maps, please be sure to save them as pickle files using the above formatting.
-`` is an optional csv file path that you can pass in to use your own thresholds to binarize the heatmaps. As an example, we provide [`./sample/tuning_results.csv`](https://github.com/rajpurkarlab/cheXlocalize/blob/master/sample/tuning_results.csv), which saves the threshold for each pathology that maximizes mIoU on the validation set. When passing in your own csv file, make sure to follow the same formatting as this example csv. By defaul, no threshold path is passed in, in which case we will apply Otsu's method (an automatic global thresholding algorithm provided by the cv2 package).
-
-`` is the json file path used for saving the encoded segmentation masks. The json file is formatted such that it can be used as input to `eval.py` (see [_Evaluate localization performance_](#eval) for formatting details).
+**Optional flags**
+* `--threshold_path`: a csv file path with your own thresholds for binarizing the heatmaps. As an example, we provide [`./sample/tuning_results.csv`](https://github.com/rajpurkarlab/cheXlocalize/blob/master/sample/tuning_results.csv), which contains the threshold for each pathology that maximizes mIoU on the validation set. When passing in your own csv file, make sure to follow the same formatting as this example csv. By default, no threshold path is passed in, in which case we apply Otsu's method (an automatic global thresholding algorithm provided by the cv2 package).
+* `--probability_threshold_path`: a csv file path with probability cutoffs used to reduce false positive segmentation predictions. If the predicted model probability is below the cutoff, we generate a segmentation mask of all zeros regardless of thresholding scheme. As an example, we provide [`./sample/probability_tuning_results.csv`](https://github.com/rajpurkarlab/cheXlocalize/blob/master/sample/probability_tuning_results.csv). When passing in your own csv file, make sure to follow the same formatting as this example csv. See [_Fine-tune probability thresholds_](#prob_threshold) for more context.
+* `--output_path`: the json file path used for saving the encoded segmentation masks. The json file is formatted such that it can be used as input to `eval.py` (see [_Evaluate localization performance_](#eval) for formatting details). Default is `./saliency_segmentations.json`.
+* `--if_smoothing`: Set this flag to `True` to smooth the pixelated heatmaps using box filtering. Default is `False`.
+* `--k`: the kernel size used for box filtering (`int`). Default is 0. The user-defined `k` must be >= 0. If you set `k` to any number > 0, make sure to also set `if_smoothing` to `True`; otherwise, no smoothing will be performed.
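+
+For example, to binarize the CheXlocalize validation set heatmaps with the sample thresholds and box-filter smoothing (the kernel size below is only illustrative):
+
+```
+(chexlocalize) > python heatmap_to_segmentation.py --map_dir /cheXlocalize_dataset/gradcam_maps_val/ --threshold_path ./sample/tuning_results.csv --if_smoothing True --k 4
+```
+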
To store the binary segmentations efficiently, we use RLE format, and the encoding is implemented using [pycocotools](https://github.com/cocodataset/cocoapi/tree/master/PythonAPI/pycocotools). If an image has no saliency segmentations, we store a mask of all zeros.
Running this script on the validation set heatmaps from the CheXlocalize dataset should take about 10 minutes.
-
-### Fine tune segmentation thresholds
+
+### Fine-tune segmentation thresholds
To find the thresholds that maximize mIoU for each pathology on the validation set, run:
```
-(chexlocalize) > python tune_heatmap_threshold.py --map_dir --gt_path --save_dir
+(chexlocalize) > python tune_heatmap_threshold.py [FLAGS]
```
-`` is the directory with pickle files containing the heatmaps.
+**Required flags**
+* `--map_dir`: the directory with pickle files containing the heatmaps.
+* `--gt_path`: the json file where ground-truth segmentations are saved (encoded).
+
+**Optional flags**
+* `--save_dir`: the directory to save the csv file that stores the tuned thresholds. Default is current directory.
-`` is the json file where ground-truth segmentations are saved (encoded).
+This script will replicate `./sample/tuning_results.csv` when you use the CheXlocalize validation set DenseNet121 + Grad-CAM heatmaps in `/cheXlocalize_dataset/gradcam_maps_val/` as `--map_dir` and the validation set ground-truth pixel-level segmentations in `/cheXlocalize_dataset/gt_segmentations_val.json` as `--gt_path`. Running this script should take about one hour.
-`` is the directory to save the csv file that stores the tuned thresholds. Default is current directory.
+
+### Fine-tune probability thresholds
+We noticed that for many CXRs, false positive saliency segmentations were generated even though the model probability was low. To ensure that the saliency segmentations are consistent with the model's probability output, we force the segmentation mask to be all zeros whenever the predicted probability is below a chosen cutoff. We choose these cutoffs for each pathology by maximizing mIoU on the validation set.
+
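+A minimal sketch of this cutoff logic (the helper name below is hypothetical; the real logic lives in `pkl_to_mask` in `heatmap_to_segmentation.py`):
+
+```
+import numpy as np
+
+def apply_probability_cutoff(segmentation, pred_prob, prob_cutoff):
+    # force an all-zero mask when the model is not confident enough
+    if pred_prob < prob_cutoff:
+        return np.zeros_like(segmentation)
+    return segmentation
+```
+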
+To create the csv file with these cutoffs (which should be used with the flag `--probability_threshold_path` for `heatmap_to_segmentation.py`), run:
+
+```
+(chexlocalize) > python tune_probability_threshold.py [FLAGS]
+```
+
+**Required flags**
+* `--map_dir`: the directory with pickle files containing the heatmaps.
+* `--gt_path`: the json file where ground-truth segmentations are saved (encoded).
+
+**Optional flags**
+* `--save_dir`: the directory to save the csv file that stores the tuned thresholds. Default is current directory.
-This script will replicate './sample/tuning_results.csv' when you use the CheXlocalize validation set DenseNet121 + Grad-CAM heatmaps in `/cheXlocalize_dataset/gradcam_maps_val/` as `` and the validation set ground-truth pixel-level segmentations in `/cheXlocalize_dataset/gt_segmentations_val.json`. Running this script should take about one hour.
+In [our paper](https://www.medrxiv.org/content/10.1101/2021.02.28.21252634v3), we use these cutoffs to report results in Table 3 and Extended Data Fig. 4 because, when evaluating localization performance on the full dataset, we found that many false positive saliency segmentations were generated even though the model probability was low. When evaluation only uses the true positive slice of the dataset, we recommend using `tune_heatmap_threshold.py` to find the best thresholds, since false positives are not accounted for.
+
+This script will replicate `./sample/probability_tuning_results.csv` when you use the CheXlocalize validation set DenseNet121 + Grad-CAM heatmaps in `/cheXlocalize_dataset/gradcam_maps_val/` as `--map_dir` and the validation set ground-truth pixel-level segmentations in `/cheXlocalize_dataset/gt_segmentations_val.json` as `--gt_path`. Running this script should take about one hour.
## Generate segmentations from human annotations
@@ -146,12 +195,11 @@ This script will replicate './sample/tuning_results.csv' when you use the CheXlo
To generate binary segmentations from raw human annotations, run:
```
-(chexlocalize) > python annotation_to_segmentation.py --ann_path --output_path
+(chexlocalize) > python annotation_to_segmentation.py [FLAGS]
```
-`` is the json file path with raw human annotations.
-
-If you downloaded the CheXlocalize dataset, then this is the json file `/cheXlocalize_dataset/gt_annotations_val.json`. Each key of the json file is a single CXR id with its data formatted as follows:
+**Required flags**
+* `--ann_path`: the json file path with raw human annotations. If you downloaded the CheXlocalize dataset, then this is the json file `/cheXlocalize_dataset/gt_annotations_val.json`. Each key of the json file is a single CXR id with its data formatted as follows:
```
{
@@ -189,7 +237,8 @@ This input json should include only those CXRs with at least one positive ground
If using your own human annotations, please be sure to save them in a json using the above formatting.
-`` is the json file path used for saving the encoded segmentation masks. The json file is formatted such that it can be used as input to `eval.py` (see [_Evaluate localization performance_](#eval) for formatting details).
+**Optional flags**
+* `--output_path`: the json file path used for saving the encoded segmentation masks. The json file is formatted such that it can be used as input to `eval.py` (see [_Evaluate localization performance_](#eval) for formatting details). Default is `./human_segmentations.json`.
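+
+For example, using the validation set annotations from the CheXlocalize dataset:
+
+```
+(chexlocalize) > python annotation_to_segmentation.py --ann_path /cheXlocalize_dataset/gt_annotations_val.json
+```
+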
Running this script on the validation set heatmaps from the CheXlocalize dataset should take about 5 minutes.
@@ -212,16 +261,19 @@ To run evaluation, use the following command:
```
**Required flags**
-* `--metric`: options are `miou` or `hitrate`
-* `--gt_path`: Directory where ground-truth segmentations are saved (encoded). This could be the json output of `annotation_to_segmentation.py`. Or, if you downloaded the CheXlocalize dataset, then this is the json file `/cheXlocalize_dataset/gt_segmentations_val.json`.
-* `--pred_path`: If `metric = miou`, then this should be the directory where predicted segmentations are saved (encoded). This could be the json output of `heatmap_to_segmentation.py`, or, if you downloaded the CheXlocalize dataset, then this could be the json file `/cheXlocalize_dataset/gradcam_segmentations_val.json`. If `metric = hitrate`, then this should be directory with pickle files containing heatmaps (the script extracts the most representative point from the pickle files). If you downloaded the CheXlocalize dataset, then these pickle files are in `/cheXlocalize_dataset/gradcam_maps_val/`.
+* `--metric`: options are `iou` or `hitmiss`
+* `--gt_path`: Path to file where ground-truth segmentations are saved (encoded). This could be the json output of `annotation_to_segmentation.py`. Or, if you downloaded the CheXlocalize dataset, then this is the json file `/cheXlocalize_dataset/gt_segmentations_val.json`.
+* `--pred_path`:
+ * If `metric = iou`: this should be the path to file where predicted segmentations are saved (encoded). This could be the json output of `heatmap_to_segmentation.py`, or `annotation_to_segmentation.py`, or, if you downloaded the CheXlocalize dataset, then this could be the json file `/cheXlocalize_dataset/gradcam_segmentations_val.json`.
+ * If `metric = hitmiss`: when evaluating saliency methods, this should be the directory with pickle files containing heatmaps (the script extracts the most representative point from each pickle file). If you downloaded the CheXlocalize dataset, then these pickle files are in `/cheXlocalize_dataset/gradcam_maps_val/`. When evaluating human benchmark annotations, this should be a json file that holds the human annotations for the most representative points.
**Optional flags**
-* `--true_pos_only`: Default is `True`. If `True`, run evaluation only on the true positive slice of the dataset (CXRs that contain both predicted and ground-truth segmentations).
+* `--true_pos_only`: Default is `True`. If `True`, run evaluation only on the true positive slice of the dataset (CXRs that contain both predicted and ground-truth segmentations). If `False`, also include CXRs with a predicted segmentation but without a ground-truth segmentation, and include CXRs with a ground-truth segmentation but without a predicted segmentation.
* `--save_dir`: Where to save evaluation results. Default is current directory.
+* `--if_human_benchmark`: If `True`, run evaluation on human benchmark performance. Default is `False`. This flag matters especially when evaluating hit rate, since the most representative point input for saliency methods is formatted differently from the most representative point input for the human benchmark.
* `--seed`: Default is `0`. Random seed to fix for bootstrapping.
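+
+For example, to compute saliency method mIoU on the CheXlocalize validation set:
+
+```
+(chexlocalize) > python eval.py --metric iou --gt_path /cheXlocalize_dataset/gt_segmentations_val.json --pred_path /cheXlocalize_dataset/gradcam_segmentations_val.json
+```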
-Both `pred_path` (if `metric = miou`) and `gt_path` must be json files where each key is a single CXR id with its data formatted as follows:
+If `metric = iou`, both `pred_path` and `gt_path` must be json files where each key is a single CXR id with its data formatted as follows:
```
{
@@ -241,12 +293,127 @@ Both `pred_path` (if `metric = miou`) and `gt_path` must be json files where eac
}
```
-Both `pred_path` (if `metric = miou`) and `gt_path` json files must contain a key for all CXR ids (regardless of whether it has any positive ground-truth labels), and each CXR id key must have values for all ten pathologies (regardless of ground-truth label). In other words, all CXRs and images are indexed. If a CXR has no segmentations, we store a segmentation mask of all zeros. If using your own `pred_path` and `gt_path` json files as input to this script, be sure that they are formatted per the above, with segmentation masks encoded using RLE using [pycocotools](https://github.com/cocodataset/cocoapi/tree/master/PythonAPI/pycocotools).
+Both `pred_path` (if `metric = iou`) and `gt_path` json files must contain a key for every CXR id (regardless of whether it has any positive ground-truth labels), and each CXR id key must have values for all ten pathologies (regardless of ground-truth label). In other words, all CXRs and pathologies are indexed. If a CXR has no segmentations, we store a segmentation mask of all zeros. If using your own `pred_path` and `gt_path` json files as input to this script, be sure that they are formatted per the above, with segmentation masks encoded in RLE format using [pycocotools](https://github.com/cocodataset/cocoapi/tree/master/PythonAPI/pycocotools).
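+
+As a minimal sketch (assuming a binary numpy mask), RLE encoding and decoding with pycocotools looks like:
+
+```
+import numpy as np
+from pycocotools import mask as pycoco_mask
+
+segm = np.zeros((5, 5), dtype=np.uint8)  # toy binary mask
+segm[1:3, 2:4] = 1
+
+# pycocotools expects a Fortran-contiguous uint8 array
+encoded = pycoco_mask.encode(np.asfortranarray(segm))
+decoded = pycoco_mask.decode(encoded)
+assert (decoded == segm).all()
+```
+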
This evaluation script generates three csv files:
-* `{miou/hitrate}_results.csv`: IoU or hit/miss results for each CXR and each pathology.
-* `{miou/hitrate}_bootstrap_results.csv`: 1000 bootstrap samples of mIoU or hit rate for each pathology.
-* `{miou/hitrate}_summary_results.csv`: mIoU or hit rate 95% bootstrap confidence intervals for each pathology.
+* `{iou/hitmiss}_results_per_cxr.csv`: IoU or hit/miss results for each CXR and each pathology.
+* `{iou/hitmiss}_bootstrap_results_per_cxr.csv`: 1000 bootstrap samples of IoU or hit/miss for each pathology.
+* `{iou/hitmiss}_summary_results.csv`: mIoU or hit rate 95% bootstrap confidence intervals for each pathology.
+
+
+### Calculate percentage decrease
+
+To calculate the percentage decrease from the human benchmark localization metric to the saliency method pipeline localization metric, run:
+
+```
+(chexlocalize) > python calculate_percentage_decrease.py [FLAGS]
+```
+
+**Required flags**
+* `--metric`: options are `miou` or `hitrate`
+* `--hb_bootstrap_results`: Path to csv file with 1000 bootstrap samples of human benchmark IoU or hit/miss for each pathology. This is the output of `eval.py` called `{iou/hitmiss}_humanbenchmark_bootstrap_results_per_cxr.csv`.
+* `--pred_bootstrap_results`: Path to csv file with 1000 bootstrap samples of saliency method IoU or hit/miss for each pathology. This is the output of `eval.py` called `{iou/hitmiss}_bootstrap_results_per_cxr.csv`.
+
+**Optional flags**
+* `--save_dir`: Where to save results as csv files. Default is current directory.
+
+
+## Compute pathology features
+We define four pathological features: (1) number of instances (for example, bilateral Pleural Effusion would have two instances, whereas there is only one instance for Cardiomegaly), (2) size (pathology area with respect to the area of the whole CXR), (3) elongation and (4) irrectangularity (the last two features measure the complexity of the pathology shape and were calculated by fitting a rectangle of minimum area enclosing the binary mask).
+
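+A sketch of how elongation and irrectangularity fall out of `cv2.minAreaRect` (mirroring `get_geometric_features` in `compute_pathology_features.py`; the toy mask is made up):
+
+```
+import cv2
+import numpy as np
+
+segm = np.zeros((10, 10), dtype=np.uint8)  # toy binary mask
+segm[2:5, 1:9] = 1
+
+contours, _ = cv2.findContours(segm, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
+cnt = max(contours, key=len)  # longest contour
+(_, _), (w, h), _ = cv2.minAreaRect(cnt)  # minimum-area enclosing rectangle
+
+elongation = max(w, h) / min(w, h)
+irrectangularity = 1 - cv2.contourArea(cnt) / (w * h)
+```
+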
+To compute the four pathology features, run:
+
+```
+(chexlocalize) > python compute_pathology_features.py [FLAGS]
+```
+
+**Required flags**
+* `--gt_ann`: Path to json file with raw ground-truth annotations. If you downloaded the CheXlocalize dataset, then this is the json file `/cheXlocalize_dataset/gt_annotations_val.json`.
+* `--gt_seg`: Path to json file with ground-truth segmentations (encoded). This could be the json output of `annotation_to_segmentation.py`. Or, if you downloaded the CheXlocalize dataset, then this is the json file `/cheXlocalize_dataset/gt_segmentations_val.json`.
+
+**Optional flags**
+* `--save_dir`: Where to save four pathology feature dataframes as csv files. Default is current directory.
+
+Note that we use the ground-truth annotations to extract the number of instances, and we use the ground-truth segmentation masks to calculate area, elongation and rectangularity. We chose to extract number of instances from annotations because sometimes radiologists draw two instances for a pathology that are overlapping; in this case, the number of annotations would be 2, but the number of segmentations would be 1.
+
+Running this script on the validation set annotations and segmentations from the CheXlocalize dataset should take about 5 minutes.
+
+
+### Plot distribution of pathology features
+
+To plot the distribution of the four pathology features across all 10 pathologies, run:
+
+```
+(chexlocalize) > python plot_pathology_features.py [FLAGS]
+```
+
+**Required flags**
+* `--features_dir`: Path to directory that holds four csv files: `area_ratio.csv`, `elongation.csv`, `num_instances.csv`, and `rec_area_ratio.csv`. These four files are the output of `compute_pathology_features.py`.
+
+**Optional flags**
+* `--save_dir`: Where to save plots. Default is current directory.
+
+
+## Run regressions on pathology features
+We provide a script to run a simple linear regression with the evaluation metric (IoU or hit/miss) as the dependent variable (to understand the relationship between the geometric features of a pathology and saliency method localization performance). Each regression uses one of the above four geometric features as a single independent variable.
+
+```
+(chexlocalize) > python regression_pathology_features.py [FLAGS]
+```
+
+**Required flags**
+* `--features_dir`: Path to directory that holds four csv files: `area_ratio.csv`, `elongation.csv`, `num_instances.csv`, and `rec_area_ratio.csv`. These four files are the output of `compute_pathology_features.py`.
+* `--pred_iou_results`: path to csv file with saliency method IoU results for each CXR and each pathology. This is the output of `eval.py` called `iou_results_per_cxr.csv`.
+* `--pred_hitmiss_results`: path to csv file with saliency method hit/miss results for each CXR and each pathology. This is the output of `eval.py` called `hitmiss_results_per_cxr.csv`.
+
+**Optional flags**
+* `--evaluate_hb`: Default is `False`. If `True`, evaluate the human benchmark in addition to the saliency method; the flags `hb_iou_results` and `hb_hitmiss_results` (below) are then also required, and additional regressions will be run using the difference between the evaluation metrics of the saliency method pipeline and the human benchmark as the dependent variable (to understand the relationship between the geometric features of a pathology and the gap in localization performance between the saliency method pipeline and the human benchmark).
+* `--hb_iou_results`: Path to csv file with human benchmark IoU results for each CXR and each pathology. This is the output of `eval.py` called `iou_humanbenchmark_results_per_cxr.csv`.
+* `--hb_hitmiss_results`: Path to csv file with human benchmark hit/miss results for each CXR and each pathology. This is the output of `eval.py` called `hitmiss_humanbenchmark_results_per_cxr.csv`.
+* `--save_dir`: Where to save regression results. Default is current directory. If `evaluate_hb` is `True`, four files will be saved: `regression_features_pred_iou.csv`, `regression_features_pred_hitmiss.csv`, `regression_features_iou_diff.csv`, `regression_features_hitmiss_diff.csv`. If `evaluate_hb` is `False`, only two files will be saved: `regression_features_pred_iou.csv`, `regression_features_pred_hitmiss.csv`.
+
+In [our paper](https://www.medrxiv.org/content/10.1101/2021.02.28.21252634v3), only the true positive slice was included in each regression (see Table 2). Each feature is normalized using min-max normalization and the regression coefficient can be interpreted as the effect of that geometric feature on the evaluation metric at hand. The regression results report the 95% confidence interval and the Bonferroni corrected p-values. For confidence intervals and p-values, we use the standard calculation for linear models.
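+
+A minimal sketch of one such regression (statsmodels is an assumption here; the repo's own helper is `run_linear_regression` in `utils.py`, and the numbers are made up):
+
+```
+import pandas as pd
+import statsmodels.api as sm
+
+# hypothetical per-CXR feature and IoU values
+df = pd.DataFrame({'area_ratio': [0.05, 0.10, 0.20, 0.40],
+                   'iou': [0.15, 0.30, 0.35, 0.50]})
+
+# min-max normalize the feature
+x = (df['area_ratio'] - df['area_ratio'].min()) / \
+    (df['area_ratio'].max() - df['area_ratio'].min())
+
+model = sm.OLS(df['iou'], sm.add_constant(x)).fit()
+print(model.params)                # regression coefficient
+print(model.conf_int(alpha=0.05))  # 95% confidence interval
+```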
+
+
+## Run regressions on model assurance
+We provide a script to run a simple linear regression for each pathology using the model’s probability output as the single independent variable and using the predicted evaluation metric (IoU or hit/miss) as the dependent variable. The script also runs a simple regression that uses the same approach as above, but that includes all 10 pathologies.
+
+```
+(chexlocalize) > python regression_model_assurance.py [FLAGS]
+```
+
+**Required flags**
+* `--map_dir`: the directory with pickle files containing the heatmaps. The script extracts the heatmaps from the pickle files.
+* `--pred_iou_results`: path to csv file with saliency method IoU results for each CXR and each pathology. This is the output of `eval.py` called `iou_results_per_cxr.csv`.
+* `--pred_hitmiss_results`: path to csv file with saliency method hit/miss results for each CXR and each pathology. This is the output of `eval.py` called `hitmiss_results_per_cxr.csv`.
+
+**Optional flags**
+* `--save_dir`: Where to save regression results. Default is current directory.
+
+Note that in [our paper](https://www.medrxiv.org/content/10.1101/2021.02.28.21252634v3), for each of the 11 regressions, we use the _full_ dataset since the analysis of false positives and false negatives was also of interest (see Table 3). In addition to the linear regression coefficients, the regression results also report the Spearman correlation coefficients to capture any potential non-linear associations.
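+
+For instance, a Spearman correlation can be computed with scipy (assumed available; the numbers below are made up):
+
+```
+from scipy.stats import spearmanr
+
+probs = [0.9, 0.2, 0.7, 0.4]  # hypothetical model probabilities
+ious = [0.6, 0.1, 0.5, 0.2]   # hypothetical per-CXR IoU values
+rho, p_value = spearmanr(probs, ious)
+```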
+
+
+## Get precision, recall/sensitivity, and specificity values
+
+To get the precision, recall/sensitivity, and specificity values of the saliency method pipeline and the human benchmark segmentations, run:
+
+```
+(chexlocalize) > python precision_recall_specificity.py [FLAGS]
+```
+
+**Required flags**
+* `--gt_path`: the json file path where ground-truth segmentations are saved (encoded).
+* `--pred_seg_path`: the json file path where saliency method segmentations are saved (encoded). This could be the json output of `heatmap_to_segmentation.py`, or, if you downloaded the CheXlocalize dataset, then this could be the json file `/cheXlocalize_dataset/gradcam_segmentations_val.json`.
+* `--hb_seg_path`: the json file path where human benchmark segmentations are saved (encoded). This could be the json output of `annotation_to_segmentation.py`.
+
+**Optional flags**
+* `--save_dir`: the directory to save the csv file that stores the precision, recall/sensitivity, and specificity values. Default is current directory.
+
+We treat each pixel in the saliency method pipeline and the human benchmark segmentations as a classification, use each pixel in the ground-truth segmentation as the ground-truth label, and calculate precision, recall/sensitivity, and specificity over all CXRs for each pathology.
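+
+A toy sketch of this pixel-level computation (mirroring `precision_recall_specificity.py`):
+
+```
+import numpy as np
+
+pred = np.array([[1, 1], [0, 0]])  # toy predicted mask
+gt = np.array([[1, 0], [0, 0]])    # toy ground-truth mask
+
+tp = np.sum((pred == 1) & (gt == 1))  # 1
+tn = np.sum((pred == 0) & (gt == 0))  # 2
+fp = np.sum((pred == 1) & (gt == 0))  # 1
+fn = np.sum((pred == 0) & (gt == 1))  # 0
+
+precision = tp / (tp + fp)    # 0.5
+recall = tp / (tp + fn)       # 1.0
+specificity = tn / (tn + fp)  # ~0.67
+```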
+
+
+## Bugs or questions?
+Feel free to email Adriel Saporta (adriel@nyu.edu), or to open an issue in this repo, with any questions related to the code or our paper.
## Citation
diff --git a/calculate_percentage_decrease.py b/calculate_percentage_decrease.py
new file mode 100644
index 0000000..083267c
--- /dev/null
+++ b/calculate_percentage_decrease.py
@@ -0,0 +1,94 @@
+"""
+Calculate the percentage decrease from human benchmark localization
+performance to saliency method pipeline localization performance, along
+with the 95% CIs.
+"""
+from argparse import ArgumentParser
+import numpy as np
+import pandas as pd
+
+from eval import compute_cis, create_ci_record
+
+
+def create_pct_diff_df(metric, pred_bootstrap_results, hb_bootstrap_results,
+ save_dir):
+ """
+ Calculate percentage decrease per pathology from human benchmark
+ localization metric to saliency method pipeline localization metric,
+ and obtain 95% CI on the percentage decreases.
+ """
+ # get 1000 bootstrap samples of IoU or hit/miss
+ pred_bs = pd.read_csv(pred_bootstrap_results)
+ hb_bs = pd.read_csv(hb_bootstrap_results)
+
+ # use the percentage difference as the statistic;
+ # get the CI (2.5th and 97.5th percentile) on the percentage difference
+ pct_diff_bs = (hb_bs - pred_bs)/hb_bs
+ records = []
+ for task in pct_diff_bs.columns:
+ records.append(create_ci_record(pct_diff_bs[task], task))
+ pct_diff_ci = pd.DataFrame.from_records(records)
+
+ # create results df
+ pct_diff_df = pd.DataFrame()
+ pct_diff_df['hb'] = hb_bs.mean()
+ pct_diff_df['pred'] = pred_bs.mean()
+ pct_diff_df['pct_diff'] = round(
+ (pct_diff_df['hb']-pct_diff_df['pred'])/pct_diff_df['hb']*100,
+ 3)
+ pct_diff_df['pct_diff_lower'] = round(pct_diff_ci['lower'] * 100, 3).\
+ tolist()
+ pct_diff_df['pct_diff_upper'] = round(pct_diff_ci['upper'] * 100, 3).\
+ tolist()
+ pct_diff_df = pct_diff_df.sort_values(['pct_diff'], ascending=False)
+
+ # calculate avg human benchmark and saliency method localization metric
+ avg_pred = round(pct_diff_df['pred'].mean(), 3)
+ avg_hb = round(pct_diff_df['hb'].mean(), 3)
+ avg_pct_diff = round((avg_hb-avg_pred)/avg_hb * 100, 3)
+
+ # find the 95% CI of the percentage difference between average saliency
+ # method and average human benchmark
+ # - get bootstrap sample of average saliency method localization metric
+ avg_pred_bs = pred_bs.mean(axis=1)
+ # - get bootstrap sample of average human benchmark localization metric
+ avg_hb_bs = hb_bs.mean(axis=1)
+ # - use pct diff between avg saliency and avg human benchmark as the
+ # statistic, and get the 2.5th and 97.5th percentile of the bootstrap
+ # distribution to create the CI
+ avg_bs_df = 100 * (avg_hb_bs - avg_pred_bs)/avg_hb_bs
+    lower, mean, upper = compute_cis(avg_bs_df, confidence_level=0.05)
+
+ pct_diff_df.loc['Average'] = {'pred': avg_pred, 'hb': avg_hb,
+ 'pct_diff': avg_pct_diff,
+ 'pct_diff_lower': round(lower, 3),
+ 'pct_diff_upper': round(upper, 3)}
+ print(pct_diff_df)
+ pct_diff_df.to_csv(f'{save_dir}/{metric}_pct_decrease.csv')
+
+
+if __name__ == '__main__':
+ parser = ArgumentParser()
+ parser.add_argument('--metric', type=str,
+ help='options are: miou or hitrate')
+ parser.add_argument('--hb_bootstrap_results', type=str,
+ help='path to csv file with 1000 bootstrap samples of \
+ human benchmark IoU or hit/miss for each \
+ pathology')
+ parser.add_argument('--pred_bootstrap_results', type=str,
+ help='path to csv file with 1000 bootstrap samples of \
+ saliency method IoU or hit/miss for each \
+ pathology')
+ parser.add_argument('--save_dir', default='.',
+ help='where to save results')
+ parser.add_argument('--seed', type=int, default=0,
+ help='random seed to fix')
+ args = parser.parse_args()
+
+ assert args.metric in ['miou', 'hitrate'], \
+ "`metric` flag must be either `miou` or `hitrate`"
+
+ np.random.seed(args.seed)
+
+ create_pct_diff_df(args.metric, args.pred_bootstrap_results,
+ args.hb_bootstrap_results, args.save_dir)
diff --git a/compute_pathology_features.py b/compute_pathology_features.py
new file mode 100644
index 0000000..123ce5d
--- /dev/null
+++ b/compute_pathology_features.py
@@ -0,0 +1,121 @@
+"""
+Compute four pathological features:
+(1) number of instances (for example, bilateral Pleural Effusion would have
+two instances, whereas there is only one instance for Cardiomegaly),
+(2) size (pathology area with respect to the area of the whole CXR),
+(3) elongation, and
+(4) irrectangularity (the last two features measure the complexity of the
+pathology shape and were calculated by fitting a rectangle of minimum area
+enclosing the binary mask).
+
+Note that we use the ground-truth annotations to extract the number of
+instances, and we use the ground-truth segmentation masks to calculate area,
+elongation and rectangularity. We chose to extract the number of instances
+from annotations because sometimes radiologists draw two overlapping
+instances of a pathology; in this case, the number of annotations would be 2,
+but the number of segmentations would be 1.
+"""
+from argparse import ArgumentParser
+import cv2
+import glob
+import json
+import numpy as np
+import pandas as pd
+import pickle
+from pycocotools import mask
+
+from eval_constants import LOCALIZATION_TASKS
+
+
+def get_geometric_features(segm):
+ """
+ Given a segmentation mask, return geometric features.
+
+ Args:
+ segm (np.array): the binary segmentation mask
+ """
+
+ # find contours
+ contours, _ = cv2.findContours(segm.copy(), 1, 1)
+
+ # get number of instances and area
+ n_instance = len(contours)
+ area_ratio = np.sum(segm) / (segm.shape[0] * segm.shape[1])
+
+    # use the longest contour to calculate geometric features
+ max_idx = np.argmax([len(contour) for contour in contours])
+ cnt = contours[max_idx]
+
+ rect = cv2.minAreaRect(cnt)
+ (x, y), (w, h), a = rect
+
+ instance_area = cv2.contourArea(cnt)
+ elongation = max(w, h) / min(w, h)
+ rec_area_ratio = instance_area / (w * h)
+
+ return n_instance, area_ratio, elongation, rec_area_ratio
+
+
+def main(args):
+ # load ground-truth annotations (needed to extract number of instances)
+ # and ground-truth segmentations
+ with open(args.gt_ann) as f:
+ gt_ann = json.load(f)
+ with open(args.gt_seg) as f:
+ gt_seg = json.load(f)
+
+ # extract features from all cxrs with at least one pathology
+ all_instances = {}
+ all_areas = {}
+ all_elongations = {}
+ all_rec_area_ratios = {}
+ all_ids = sorted(gt_ann.keys())
+ pos_ids = sorted(gt_seg.keys())
+ for task in sorted(LOCALIZATION_TASKS):
+ print(task)
+ n_instances = []
+ areas = []
+ elongations = []
+ rec_area_ratios = []
+ for img_id in all_ids:
+ n_instance = 0
+ area = 0
+ elongation = np.nan
+ rec_area_ratio = np.nan
+ # calculate features for cxr with a pathology segmentation
+ if img_id in pos_ids:
+ gt_item = gt_seg[img_id][task]
+ gt_mask = mask.decode(gt_item)
+ if np.sum(gt_mask) > 0:
+ # use annotation to get number of instances
+ n_instance = len(gt_ann[img_id][task]) \
+ if task in gt_ann[img_id] else 0
+ # use segmentation to get other features
+ n_instance_segm, area, elongation, rec_area_ratio = \
+ get_geometric_features(gt_mask)
+ n_instances.append(n_instance)
+ areas.append(area)
+ elongations.append(elongation)
+ rec_area_ratios.append(rec_area_ratio)
+ all_instances[task] = n_instances
+ all_areas[task] = areas
+ all_elongations[task] = elongations
+ all_rec_area_ratios[task] = rec_area_ratios
+
+ instance_df = pd.DataFrame(all_instances)
+ area_df = pd.DataFrame(all_areas)
+ elongation_df = pd.DataFrame(all_elongations)
+ rec_area_ratio_df = pd.DataFrame(all_rec_area_ratios)
+
+ instance_df['img_id'] = all_ids
+ area_df['img_id'] = all_ids
+ elongation_df['img_id'] = all_ids
+ rec_area_ratio_df['img_id'] = all_ids
+
+ instance_df.to_csv(f'{args.save_dir}/num_instances.csv', index=False)
+ area_df.to_csv(f'{args.save_dir}/area_ratio.csv', index=False)
+ elongation_df.to_csv(f'{args.save_dir}/elongation.csv', index=False)
+ rec_area_ratio_df.to_csv(f'{args.save_dir}/rec_area_ratio.csv', index=False)
+
+
+if __name__ == '__main__':
+ parser = ArgumentParser()
+ parser.add_argument('--gt_ann', type=str,
+ help='path to json file with raw ground-truth annotations')
+ parser.add_argument('--gt_seg', type=str,
+ help='path to json file with ground-truth segmentations \
+ (encoded)')
+ parser.add_argument('--save_dir', default='.',
+ help='where to save feature dataframes')
+ args = parser.parse_args()
+ main(args)
diff --git a/count_segs.py b/count_segs.py
new file mode 100644
index 0000000..d19aee8
--- /dev/null
+++ b/count_segs.py
@@ -0,0 +1,44 @@
+from argparse import ArgumentParser
+import json
+import numpy as np
+import pandas as pd
+from pycocotools import mask
+
+from eval_constants import LOCALIZATION_TASKS
+
+def main(args):
+ """
+ For each pathology, count the number of CXRs with at least one segmentation.
+ """
+ with open(args.seg_path) as f:
+ seg_dict = json.load(f)
+
+ cxr_ids = sorted(seg_dict.keys())
+ segmentation_label = {}
+ for task in sorted(LOCALIZATION_TASKS):
+ print(task)
+ has_seg = []
+ for cxr_id in cxr_ids:
+ seg_item = seg_dict[cxr_id][task]
+ seg_mask = mask.decode(seg_item)
+ if np.sum(seg_mask) == 0:
+ has_segmentation = 0
+ else:
+ has_segmentation = 1
+ has_seg.append(has_segmentation)
+ segmentation_label[task] = has_seg
+
+ df = pd.DataFrame.from_dict(segmentation_label)
+ n_cxr_per_pathology = df.sum()
+ print(n_cxr_per_pathology)
+ n_cxr_per_pathology.to_csv(f'{args.save_dir}/n_segs.csv')
+
+if __name__ == "__main__":
+ parser = ArgumentParser()
+ parser.add_argument('--seg_path', type=str,
+ help='json file path where segmentations are saved \
+ (encoded)')
+ parser.add_argument('--save_dir', default='.',
+ help='where to save results')
+ args = parser.parse_args()
+ main(args)
diff --git a/eval.py b/eval.py
index 195b155..0502149 100644
--- a/eval.py
+++ b/eval.py
@@ -10,6 +10,7 @@
from tqdm import tqdm
from eval_constants import LOCALIZATION_TASKS
+from utils import CPU_Unpickler
def calculate_iou(pred_mask, gt_mask, true_pos_only):
@@ -31,7 +32,7 @@ def calculate_iou(pred_mask, gt_mask, true_pos_only):
else:
iou_score = np.sum(intersection) / (np.sum(union))
else:
- if np.sum(gt_mask) == 0:
+ if np.sum(union) == 0:
iou_score = np.nan
else:
iou_score = np.sum(intersection) / (np.sum(union))
@@ -48,7 +49,11 @@ def get_ious(gt_path, pred_path, true_pos_only):
pred_path (str): path to predicted segmentation json file (encoded)
true_pos_only (bool): if true, run evaluation only on the true positive
slice of the dataset (CXRs that contain predicted
- and ground-truth segmentations)
+ and ground-truth segmentations); if false, also
+ include CXRs with a predicted segmentation but
+ without a ground-truth segmentation, and include
+ CXRs with a ground-truth segmentation but without
+ a predicted segmentation.
Returns:
ious (dict): dict with 10 keys, one for each pathology (task). Values
@@ -62,10 +67,10 @@ def get_ious(gt_path, pred_path, true_pos_only):
pred_dict = json.load(f)
ious = {}
- cxr_ids = sorted(gt_dict.keys())
tasks = sorted(LOCALIZATION_TASKS)
for task in tasks:
+ cxr_ids = sorted(gt_dict.keys())
print(f'Evaluating {task}')
ious[task] = []
@@ -86,7 +91,20 @@ def get_ious(gt_path, pred_path, true_pos_only):
iou_score = calculate_iou(pred_mask, gt_mask, true_pos_only)
ious[task].append(iou_score)
- assert len(ious[task]) == len(gt_dict.keys())
+ # if true_pos_only is false, include cxrs that do not have ground-truth
+ # segmentations but that have predicted segmentations
+ if not true_pos_only:
+ for cxr_id in sorted(pred_dict.keys()):
+ if cxr_id not in gt_dict:
+ pred_item = pred_dict[cxr_id][task]
+ pred_mask = mask.decode(pred_item)
+ gt_mask = np.zeros(pred_item['size'])
+ assert gt_mask.shape == pred_mask.shape
+ iou_score = calculate_iou(pred_mask, gt_mask, true_pos_only)
+ ious[task].append(iou_score)
+ cxr_ids.append(cxr_id)
+ else:
+ assert len(ious[task]) == len(gt_dict.keys())
return ious, cxr_ids
@@ -131,19 +149,17 @@ def create_ci_record(perfs, task):
return record
-def create_map(pkl_path):
- """
- Create saliency map of original img size·
- """
- info = pickle.load(open(pkl_path,'rb'))
+def get_map(pkl_path):
+ info = CPU_Unpickler(open(pkl_path, 'rb')).load()
saliency_map = info['map']
img_dims = info['cxr_dims']
- map_resized = F.interpolate(saliency_map, size=(img_dims[1],img_dims[0]), mode='bilinear', align_corners=False)
+ map_resized = F.interpolate(saliency_map, size=(img_dims[1],img_dims[0]),
+ mode='bilinear', align_corners=False)
saliency_map = map_resized.squeeze().squeeze().detach().cpu().numpy()
- return saliency_map, img_dims
+ return saliency_map
-def get_hit_rates(gt_path, pred_path):
+def get_hitrates(gt_path, pred_path):
"""
Args:
gt_path (str): directory where ground-truth segmentations are saved (encoded)
@@ -182,8 +198,7 @@ def get_hit_rates(gt_path, pred_path):
gt_mask = mask.decode(gt_item)
# get saliency heatmap
- sal_map, img_dims = create_map(pkl_path)
-
+ sal_map = get_map(pkl_path)
x = np.unravel_index(np.argmax(sal_map, axis = None), sal_map.shape)[0]
y = np.unravel_index(np.argmax(sal_map, axis = None), sal_map.shape)[1]
@@ -194,70 +209,128 @@ def get_hit_rates(gt_path, pred_path):
results[img_id][task] = np.nan
all_ids = sorted(gt_dict.keys())
+ results_df = pd.DataFrame.from_dict(results, orient='index')
+ return results_df, all_ids
- return results, all_ids
+def get_hb_hitrates(gt_path, pred_path):
+ """
+ Args:
+ gt_path (str): directory where ground-truth segmentations are saved (encoded)
+ pred_path (str): json file with human annotations for most representative point
+ """
+ with open(pred_path) as f:
+ hb_salient_pts = json.load(f)
+ with open(gt_path) as f:
+ gt_dict = json.load(f)
-def evaluate(gt_path, pred_path, save_dir, metric, true_pos_only):
+ # evaluate hit
+ results = {}
+ all_ids = sorted(gt_dict.keys())
+ for task in sorted(LOCALIZATION_TASKS):
+ print(f'Evaluating {task}')
+ results[task] = []
+ for img_id in all_ids:
+ hit = np.nan
+ gt_item = gt_dict[img_id][task]
+ gt_mask = mask.decode(gt_item)
+
+            if np.sum(gt_mask) != 0:
+                if img_id in hb_salient_pts and task in hb_salient_pts[img_id]:
+                    salient_pts = hb_salient_pts[img_id][task]
+                    # a hit if any annotated salient point falls inside the
+                    # ground-truth mask
+                    hit = 0
+                    for pt in salient_pts:
+                        if gt_mask[int(pt[1]), int(pt[0])]:
+                            hit = 1
+                            break
+
+ results[task].append(hit)
+
+ results['cxr_id'] = all_ids
+ results_df = pd.DataFrame.from_dict(results)
+ results_df = results_df.set_index('cxr_id')
+ return results_df, all_ids
+
+
+def evaluate(gt_path, pred_path, save_dir, metric, true_pos_only, if_human_benchmark):
"""
Generates and saves three csv files:
- -- `{miou/hitrate}_results.csv`: IoU or hit/miss results for each CXR and each pathology.
- -- `{miou/hitrate}_bootstrap_results.csv`: 1000 bootstrap samples of mIoU or hit rate for each pathology.
- -- `{miou/hitrate}_summary_results.csv`: mIoU or hit rate 95% bootstrap confidence intervals for each pathology.
+    -- `{iou/hitmiss}_results_per_cxr.csv`: IoU or hit/miss results for each
+       CXR and each pathology.
+    -- `{iou/hitmiss}_bootstrap_results_per_cxr.csv`: 1000 bootstrap samples
+       of IoU or hit/miss for each pathology.
+    -- `{iou/hitmiss}_summary_results.csv`: mIoU or hit rate 95% bootstrap
+       confidence intervals for each pathology.
+
+    If `if_human_benchmark` is true, `humanbenchmark_` is inserted after the
+    metric prefix in each filename.
"""
# create save_dir if it does not already exist
Path(save_dir).mkdir(exist_ok=True, parents=True)
- if metric == 'miou':
+ if metric == 'iou':
ious, cxr_ids = get_ious(gt_path, pred_path, true_pos_only)
metric_df = pd.DataFrame.from_dict(ious)
- elif metric == 'hitrate':
- hit_rates, cxr_ids = get_hit_rates(gt_path, pred_path)
- metric_df = pd.DataFrame.from_dict(hit_rates, orient='index')
+ elif metric == 'hitmiss' and if_human_benchmark == False:
+ metric_df, cxr_ids = get_hitrates(gt_path, pred_path)
+ elif metric == 'hitmiss' and if_human_benchmark == True:
+ metric_df, cxr_ids = get_hb_hitrates(gt_path, pred_path)
else:
- raise ValueError('`metric` must be either `miou` or `hitrate`')
+ raise ValueError('`metric` must be either `iou` or `hitmiss`')
+
+ hb = 'humanbenchmark_' if if_human_benchmark else ''
metric_df['img_id'] = cxr_ids
- metric_df.to_csv(f'{save_dir}/{metric}_results.csv', index=False)
+ metric_df = metric_df.sort_values(by='img_id')
+ metric_df.to_csv(f'{save_dir}/{metric}_{hb}results_per_cxr.csv', index=False)
bs_df = bootstrap_metric(metric_df, 1000)
- bs_df.to_csv(f'{save_dir}/{metric}_bootstrap_results.csv', index=False)
+ bs_df.to_csv(f'{save_dir}/{metric}_{hb}bootstrap_results_per_cxr.csv', index=False)
# get confidence intervals
records = []
for task in bs_df.columns:
records.append(create_ci_record(bs_df[task], task))
- summary_df = pd.DataFrame.from_records(records)
+ summary_df = pd.DataFrame.from_records(records).sort_values(by='name')
print(summary_df)
- summary_df.to_csv(f'{save_dir}/{metric}_summary_results.csv', index=False)
+ summary_df.to_csv(f'{save_dir}/{metric}_{hb}summary_results.csv', index=False)
if __name__ == '__main__':
parser = ArgumentParser()
parser.add_argument('--metric', type=str,
- help='options are: miou or hitrate')
+ help='options are: iou or hitmiss')
parser.add_argument('--gt_path', type=str,
help='directory where ground-truth segmentations are \
saved (encoded)')
parser.add_argument('--pred_path', type=str,
help='json path where predicted segmentations are saved \
- (if metric = miou) or directory with pickle files \
- containing heat maps (if metric = hitrate)')
- parser.add_argument('--true_pos_only', default='True',
+ (if metric = iou) or directory with pickle files \
+ containing heat maps (if metric = hitmiss and \
+ if_human_benchmark = false) or json path with \
+ human annotations for most representative points \
+ (if metric = hitmiss and if_human_benchmark = \
+ true)')
+ parser.add_argument('--true_pos_only', type=str, default='True',
help='if true, run evaluation only on the true positive \
slice of the dataset (CXRs that contain predicted and \
- ground-truth segmentations)')
+ ground-truth segmentations); if false, also include cxrs \
+ with a predicted segmentation but without a ground-truth \
+ segmentation, and include cxrs with a ground-truth\
+ segmentation but without a predicted segmentation.')
parser.add_argument('--save_dir', default='.',
help='where to save evaluation results')
+ parser.add_argument('--if_human_benchmark', type=str, default='False',
+                        help='if true, script expects human benchmark inputs')
parser.add_argument('--seed', type=int, default=0,
help='random seed to fix')
args = parser.parse_args()
- assert args.metric == 'miou' or args.metric == 'hitrate', \
- "`metric` flag must be either `miou` or `hitrate`"
+ assert args.metric in ['iou', 'hitmiss'], \
+ "`metric` flag must be either `iou` or `hitmiss`"
+ assert args.if_human_benchmark in ['True', 'False'], \
+ "`if_human_benchmark` flag must be either `True` or `False`"
np.random.seed(args.seed)
evaluate(args.gt_path, args.pred_path, args.save_dir, args.metric,
- eval(args.true_pos_only))
+ eval(args.true_pos_only), eval(args.if_human_benchmark))
diff --git a/eval_constants.py b/eval_constants.py
index 7138d43..1df1de4 100644
--- a/eval_constants.py
+++ b/eval_constants.py
@@ -8,3 +8,18 @@
"Pneumothorax",
"Pleural Effusion",
"Support Devices"]
+
+CHEXPERT_TASKS = ["No Finding",
+ "Enlarged Cardiomediastinum",
+ "Cardiomegaly",
+ "Lung Lesion",
+ "Airspace Opacity",
+ "Edema",
+ "Consolidation",
+ "Pneumonia",
+ "Atelectasis",
+ "Pneumothorax",
+ "Pleural Effusion",
+ "Pleural Other",
+ "Fracture",
+ "Support Devices"]
diff --git a/heatmap_to_segmentation.py b/heatmap_to_segmentation.py
index 9f4e97c..b684f3f 100644
--- a/heatmap_to_segmentation.py
+++ b/heatmap_to_segmentation.py
@@ -17,21 +17,27 @@
import os
from pathlib import Path
import pickle
+from PIL import Image, ImageDraw
import sys
+import torch
import torch.nn.functional as F
from tqdm import tqdm
-from eval_constants import LOCALIZATION_TASKS
-from utils import encode_segmentation
+from eval_constants import CHEXPERT_TASKS, LOCALIZATION_TASKS
+from utils import CPU_Unpickler, encode_segmentation, parse_pkl_filename
-def cam_to_segmentation(cam_mask, threshold=np.nan):
+def cam_to_segmentation(cam_mask, threshold=np.nan, smoothing=False, k=0):
"""
Threshold a saliency heatmap to binary segmentation mask.
Args:
cam_mask (torch.Tensor): heat map in the original image size (H x W).
Will squeeze the tensor if there are more than two dimensions.
threshold (np.float64): threshold to use
+ smoothing (bool): if true, smooth the pixelated heatmaps using box filtering
+        k (int): size of kernel used for box filter smoothing; k must be
+            >= 0; if k > 0, `smoothing` must also be True, otherwise no
+            smoothing is performed.
Returns:
segmentation (np.ndarray): binary segmentation output
@@ -49,15 +55,41 @@ def cam_to_segmentation(cam_mask, threshold=np.nan):
# use Otsu's method to find threshold if no threshold is passed in
if np.isnan(threshold):
mask = np.uint8(255 * mask)
+
+ if smoothing:
+ heatmap = cv2.applyColorMap(mask, cv2.COLORMAP_JET)
+ gray_img = cv2.boxFilter(cv2.cvtColor(heatmap, cv2.COLOR_RGB2GRAY),
+ -1, (k, k))
+ mask = 255 - gray_img
+
maxval = np.max(mask)
- segmentation = cv2.threshold(mask, 0, maxval, cv2.THRESH_OTSU)[1]
+ thresh = cv2.threshold(mask, 0, maxval, cv2.THRESH_OTSU)[1]
+
+ # draw out contours
+ cnts = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
+ cnts = cnts[0] if len(cnts) == 2 else cnts[1]
+ polygons = []
+ for cnt in cnts:
+ if len(cnt) > 1:
+ polygons.append([list(pt[0]) for pt in cnt])
+
+ # create segmentation based on contour
+ img_dims = (mask.shape[1], mask.shape[0])
+ segmentation_output = Image.new('1', img_dims)
+ for polygon in polygons:
+ coords = [(point[0], point[1]) for point in polygon]
+ ImageDraw.Draw(segmentation_output).polygon(coords,
+ outline=1,
+ fill=1)
+ segmentation = np.array(segmentation_output, dtype="int")
else:
segmentation = np.array(mask > threshold, dtype="int")
return segmentation
-def pkl_to_mask(pkl_path, threshold=np.nan):
+def pkl_to_mask(pkl_path, threshold=np.nan, prob_cutoff=0,
+ smoothing=False, k=0):
"""
Load pickle file, get saliency map and resize to original image dimension.
Threshold the heatmap to binary segmentation.
@@ -69,8 +101,10 @@ def pkl_to_mask(pkl_path, threshold=np.nan):
Returns:
segmentation (np.ndarray): binary segmentation output
"""
- # load pickle file, get saliency map and resize
- info = pickle.load(open(pkl_path, 'rb'))
+ # load pickle file
+ info = CPU_Unpickler(open(pkl_path, 'rb')).load()
+
+ # get saliency map and resize
saliency_map = info['map']
img_dims = info['cxr_dims']
map_resized = F.interpolate(saliency_map,
@@ -78,38 +112,63 @@ def pkl_to_mask(pkl_path, threshold=np.nan):
mode='bilinear',
align_corners=False)
- # convert to segmentation
- segmentation = cam_to_segmentation(map_resized, threshold=threshold)
+ # if probability cutoffs are given, then if the cxr has a predicted
+ # probability that is lower than the cutoff, force the predicted
+ # segmentation mask to be all zeros.
+ if torch.is_tensor(info['prob']) and info['prob'].size()[0] == 14:
+ prob_idx = CHEXPERT_TASKS.index(info['task'])
+ pred_prob = info['prob'][prob_idx].item()
+ else:
+ pred_prob = info['prob']
+ if pred_prob < prob_cutoff:
+ segmentation = np.zeros((img_dims[1], img_dims[0]))
+ else:
+ # convert to segmentation
+ segmentation = cam_to_segmentation(map_resized, threshold=threshold,
+ smoothing=smoothing, k=k)
return segmentation
-def heatmap_to_mask(map_dir, output_path, threshold_path):
+def heatmap_to_mask(args):
"""
Converts all saliency maps to segmentations and stores segmentations in a
json file.
"""
print('Parsing saliency maps')
- all_paths = list(Path(map_dir).rglob('*_map.pkl'))
+ all_paths = list(Path(args.map_dir).rglob('*_map.pkl'))
results = {}
for pkl_path in tqdm(all_paths):
- # break down path to image name and task
- path = str(pkl_path).split('/')
- task = path[-1].split('_')[-2]
- img_id = '_'.join(path[-1].split('_')[:-2])
-
+ task, img_id = parse_pkl_filename(pkl_path)
if task not in LOCALIZATION_TASKS:
continue
- # get encoded segmentation mask
- if threshold_path:
- tuning_results = pd.read_csv(threshold_path)
- best_threshold = tuning_results[tuning_results['task'] == task]['threshold'].values[0]
+ # get encoded segmentation mask; check if self-defined thresholds are
+ # given to threshold heatmaps
+ if args.threshold_path:
+ tuning_results = pd.read_csv(args.threshold_path)
+ best_threshold = tuning_results[tuning_results['task'] ==
+ task]['threshold'].values[0]
else:
best_threshold = np.nan
- segmentation = pkl_to_mask(pkl_path, threshold=best_threshold)
+        # check if probability cutoffs are given; if so, the segmentation
+        # mask is forced to all zeros whenever the predicted probability
+        # falls below the pathology's cutoff
+ if args.probability_threshold_path:
+ prob_results = pd.read_csv(args.probability_threshold_path)
+ max_miou = prob_results.loc[prob_results.groupby(['task'])['mIoU'].\
+ agg('idxmax')]
+ prob_cutoff = max_miou[max_miou['task'] ==
+ task]['prob_threshold'].values[0]
+ else:
+ prob_cutoff = 0
+
+ segmentation = pkl_to_mask(pkl_path,
+ threshold=best_threshold,
+ prob_cutoff=prob_cutoff,
+ smoothing=eval(args.if_smoothing),
+ k=args.k)
encoded_mask = encode_segmentation(segmentation)
# add image and segmentation to results dict
@@ -124,10 +183,10 @@ def heatmap_to_mask(map_dir, output_path, threshold_path):
results[img_id][task] = encoded_mask
# save to json
- Path(os.path.dirname(output_path)).mkdir(exist_ok=True, parents=True)
- with open(output_path, 'w') as f:
+ Path(os.path.dirname(args.output_path)).mkdir(exist_ok=True, parents=True)
+ with open(args.output_path, 'w') as f:
json.dump(results, f)
- print(f'Segmentation masks (in RLE format) saved to {output_path}')
+ print(f'Segmentation masks (in RLE format) saved to {args.output_path}')
if __name__ == '__main__':
@@ -137,9 +196,24 @@ def heatmap_to_mask(map_dir, output_path, threshold_path):
parser.add_argument('--threshold_path', type=str,
help="csv file that stores pre-defined threshold values. \
If no path is given, script uses Otsu's.")
+ parser.add_argument('--probability_threshold_path', type=str,
+ help='csv file that stores pre-defined probability cutoffs. \
+ If a cutoff is given, then we force the predicted \
+ segmentation to be all zero if the predicted \
+ probability is below the cutoff.')
parser.add_argument('--output_path', type=str,
default='./saliency_segmentations.json',
help='json file path for saving encoded segmentations')
+ parser.add_argument('--if_smoothing', type=str, default='False',
+ help='If true, smooth the pixelated heatmaps using box \
+ filtering.')
+ parser.add_argument('--k', type=int, default=0,
+                        help='size of kernel used for box filter smoothing (int); \
+                             k must be >= 0; if k > 0, make sure to set \
+                             if_smoothing to True, otherwise no smoothing will \
+                             be performed.')
args = parser.parse_args()
+ assert args.if_smoothing in ['True', 'False'], \
+ "`if_smoothing` flag must be either `True` or `False`"
- heatmap_to_mask(args.map_dir, args.output_path, args.threshold_path)
+ heatmap_to_mask(args)
diff --git a/plot_pathology_features.py b/plot_pathology_features.py
new file mode 100644
index 0000000..4ec3238
--- /dev/null
+++ b/plot_pathology_features.py
@@ -0,0 +1,73 @@
+"""
+Plot the distribution of pathology features.
+"""
+from argparse import ArgumentParser
+import matplotlib.pyplot as plt
+import pandas as pd
+import seaborn as sns
+
+from eval_constants import LOCALIZATION_TASKS
+
+
+def plot(args):
+ # read geometric features
+ instance_df = pd.read_csv(f'{args.features_dir}/num_instances.csv').\
+ drop(['img_id'], axis=1)
+ area_df = pd.read_csv(f'{args.features_dir}/area_ratio.csv').\
+ drop(['img_id'], axis=1)
+ elongation_df = pd.read_csv(f'{args.features_dir}/elongation.csv').\
+ drop(['img_id'], axis=1)
+ rec_area_ratio_df = pd.read_csv(f'{args.features_dir}/rec_area_ratio.csv').\
+ drop(['img_id'], axis=1)
+    irrectangularity_df = 1 - rec_area_ratio_df  # irrectangularity = 1 - rec_area_ratio
+
+ # get cxrs with at least one ground-truth segmentation for any pathology
+ neg_idx = instance_df[instance_df.eq(0).all(1)].index.values
+ pos_idx = [i for i in instance_df.index.values if i not in neg_idx]
+ instance_df_pos = instance_df.iloc[pos_idx]
+ area_df_pos = area_df.iloc[pos_idx]
+ elongation_df_pos = elongation_df.iloc[pos_idx]
+ irrectangularity_df_pos = irrectangularity_df.iloc[pos_idx]
+
+ overall_df = pd.DataFrame()
+ for task in sorted(LOCALIZATION_TASKS):
+ df = pd.DataFrame()
+ data = {'n_instance': instance_df_pos[task].values,
+ 'area_ratio':area_df_pos[task].values,
+ 'elongation': elongation_df_pos[task].values,
+ 'irrectangularity': irrectangularity_df_pos[task].values,
+ 'task': task}
+ df = pd.DataFrame(data)
+ # get cxrs with ground-truth segmentation for this task
+ df = df[df.n_instance>0]
+ overall_df = pd.concat([overall_df, df])
+
+ sns.set_style("whitegrid")
+ features = ['n_instance', 'area_ratio', 'elongation', 'irrectangularity']
+ features_labels = ['Number of Instances', 'Area Ratio', 'Elongation',
+ 'Irrectangularity']
+ for feature, feature_label in zip(features, features_labels):
+        task_labels = sorted(LOCALIZATION_TASKS)
+        # abbreviate 'Enlarged Cardiomediastinum' so the x-axis label fits
+        task_labels[5] = 'E. Cardiom.'
+ plt.figure(figsize=(12,6))
+ g1 = sns.boxenplot(x='task', y=feature, data=overall_df,
+ palette=sns.color_palette("husl",10))
+ g1.set_xticklabels(task_labels, fontsize=14, rotation=60, ha="right",
+ rotation_mode="anchor")
+ plt.xlabel('')
+ plt.ylabel(feature_label,fontsize=14)
+ plt.tight_layout()
+ plt.savefig(f'{args.save_dir}/{feature}_dist.png',dpi=300)
+
+
+if __name__ == "__main__":
+ parser = ArgumentParser()
+ parser.add_argument('--features_dir', type=str,
+ help='directory with four csv files: area_ratio.csv, \
+                             elongation.csv, num_instances.csv, and \
+ rec_area_ratio.csv.')
+ parser.add_argument('--save_dir', type=str, default='.',
+ help='directory where plots will be saved')
+ args = parser.parse_args()
+
+ plot(args)
diff --git a/precision_recall_specificity.py b/precision_recall_specificity.py
new file mode 100644
index 0000000..c470b67
--- /dev/null
+++ b/precision_recall_specificity.py
@@ -0,0 +1,105 @@
+from argparse import ArgumentParser
+import json
+import numpy as np
+import pandas as pd
+from pathlib import Path
+from pycocotools import mask
+from tqdm import tqdm
+
+from eval_constants import LOCALIZATION_TASKS
+from heatmap_to_segmentation import pkl_to_mask
+
+
+def get_results(gt_dict, seg_path):
+ """
+ For each pathology, count the total number of pixels that are TP, TN, FP
+ and FN. Only include CXRs that have ground-truth segmentations.
+ """
+ with open(seg_path) as f:
+ seg_dict = json.load(f)
+
+ results = {}
+ all_ids = sorted(gt_dict.keys())
+ tasks = sorted(LOCALIZATION_TASKS)
+
+ for task in tqdm(tasks):
+ for img_id in all_ids:
+ gt_item = gt_dict[img_id][task]
+ gt_mask = mask.decode(gt_item)
+
+ if img_id not in seg_dict:
+ seg_mask = np.zeros(gt_mask.shape)
+ else:
+ seg_item = seg_dict[img_id][task]
+ seg_mask = mask.decode(seg_item)
+
+ TP = np.sum(np.logical_and(seg_mask == 1, gt_mask == 1))
+ TN = np.sum(np.logical_and(seg_mask == 0, gt_mask == 0))
+ FP = np.sum(np.logical_and(seg_mask == 1, gt_mask == 0))
+ FN = np.sum(np.logical_and(seg_mask == 0, gt_mask == 1))
+
+ if task in results:
+ results[task]['tp'] += TP
+ results[task]['tn'] += TN
+ results[task]['fp'] += FP
+ results[task]['fn'] += FN
+ else:
+ results[task] = {}
+ results[task]['tp'] = TP
+ results[task]['tn'] = TN
+ results[task]['fp'] = FP
+ results[task]['fn'] = FN
+ return results
+
+
+def calculate_precision_recall_specificity(dict_item):
+ TP = dict_item['tp']
+ TN = dict_item['tn']
+ FP = dict_item['fp']
+ FN = dict_item['fn']
+ precision = TP/(TP+FP)
+ recall = TP/(TP+FN)
+ specificity = TN/(TN+FP)
+ return precision, recall, specificity
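+# e.g. pooled pixel counts tp=8, fp=2, fn=4, tn=86 give precision = 8/10 = 0.8,
+# recall = 8/12 ≈ 0.667, and specificity = 86/88 ≈ 0.977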
+
+
+def main(args):
+ with open(args.gt_path) as f:
+ gt_dict = json.load(f)
+
+ for source in ['pred', 'hb']:
+ seg_path = args.pred_seg_path if source == 'pred' else args.hb_seg_path
+ results = get_results(gt_dict, seg_path)
+ precisions = []
+ recalls = []
+ specificities = []
+ for t in sorted(LOCALIZATION_TASKS):
+ p, r, s = calculate_precision_recall_specificity(results[t])
+ precisions.append(p)
+ recalls.append(r)
+ specificities.append(s)
+
+ df = pd.DataFrame()
+ df['pathology'] = sorted(LOCALIZATION_TASKS)
+ df['precision'] = precisions
+ df['recall/sensitivity'] = recalls
+ df['specificity'] = specificities
+        df.to_csv(f'{args.save_dir}/{source}_precision_recall_specificity.csv',
+                  index=False)
+
+
+if __name__ == '__main__':
+ parser = ArgumentParser()
+ parser.add_argument('--gt_path', type=str,
+ help='json file path where ground-truth segmentations \
+ are saved (encoded)')
+ parser.add_argument('--pred_seg_path', type=str,
+ help='json file path where saliency method segmentations \
+ are saved (encoded)')
+ parser.add_argument('--hb_seg_path', type=str,
+ help='json file path where human benchmark segmentations \
+ are saved (encoded)')
+ parser.add_argument('--save_dir', default='.',
+ help='where to save precision/recall results')
+ args = parser.parse_args()
+
+ main(args)
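+
+# Example invocation (placeholder file paths):
+#   (chexlocalize) > python precision_recall_specificity.py \
+#       --gt_path gt_segmentations.json \
+#       --pred_seg_path pred_segmentations.json \
+#       --hb_seg_path hb_segmentations.json --save_dir .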
diff --git a/regression_model_assurance.py b/regression_model_assurance.py
new file mode 100644
index 0000000..b2f27f7
--- /dev/null
+++ b/regression_model_assurance.py
@@ -0,0 +1,102 @@
+"""
+Run a simple linear regression for each pathology, using the model's probability
+output as the single independent variable and the localization evaluation metric
+(IoU or hit/miss) of the saliency method's segmentations as the dependent
+variable. The script also runs one overall regression of the same form that
+pools all ten pathologies.
+"""
+from argparse import ArgumentParser
+import pandas as pd
+from pathlib import Path
+import pickle
+import torch
+
+from eval_constants import CHEXPERT_TASKS, LOCALIZATION_TASKS
+from utils import format_ci, parse_pkl_filename, run_linear_regression
+
+
+def get_model_probability(map_dir):
+ """
+    Extract the model's predicted probability for each CXR and each pathology.
+ """
+ prob_dict = {}
+ img_ids = []
+ for task in sorted(LOCALIZATION_TASKS):
+ print(f'Extracting model probability for {task}')
+ probs = []
+ pkl_paths = sorted(list(Path(map_dir).rglob(f"*{task}_map.pkl")))
+ for pkl_path in pkl_paths:
+ # get model probability
+ task, img_id = parse_pkl_filename(pkl_path)
+            info = pickle.load(open(pkl_path, 'rb'))
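+            # the model may store one probability per CheXpert task (a
+            # length-14 tensor); if so, index out this pathology's entry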
+ if torch.is_tensor(info['prob']) and info['prob'].size()[0] == 14:
+ prob_idx = CHEXPERT_TASKS.index(info['task'])
+ pred_prob = info['prob'][prob_idx]
+ else:
+ pred_prob = info['prob']
+
+ probs.append(pred_prob)
+ if img_id not in img_ids:
+ img_ids.append(img_id)
+ prob_dict[task] = probs
+
+ prob_df = pd.DataFrame.from_dict(prob_dict)
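+    # assumes the sorted pkl paths enumerate CXRs in sorted img_id order, so
+    # each task's probability column aligns with the img_id column set below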
+ prob_df['img_id'] = sorted(img_ids)
+ return prob_df
+
+
+def run_model_assurance_regression(args):
+ """Run regression using model probability as the independent variable."""
+ pred_results = pd.read_csv(args.pred_results)
+ model_probs_df = get_model_probability(args.map_dir)
+ y = args.metric
+
+ # align localization perf metrics and probabilities
+ ids = pred_results['img_id'].tolist()
+ prob_results = model_probs_df[model_probs_df['img_id'].isin(ids)]
+
+ coef_summary = pd.DataFrame(columns = ["lower", "mean", "upper",
+ "coef_pval","corr_lower", "corr","corr_upper",
+ "corr_pval", "feature", "task"])
+ overall_regression = pd.DataFrame()
+
+ for task in sorted(LOCALIZATION_TASKS):
+ # create regression data frame
+ data = {y: pred_results[task].values,
+ 'prob': prob_results[task].tolist()}
+ regression_df = pd.DataFrame(data)
+ overall_regression = pd.concat([overall_regression, regression_df])
+
+ # run regression
+ results = run_linear_regression(regression_df, task, y, 'prob')
+ coef_summary = pd.concat([coef_summary, results])
+
+ # add overall regression
+ results = run_linear_regression(overall_regression, 'Overall', y, 'prob')
+ coef_summary = pd.concat([coef_summary, results])
+    coef_summary = coef_summary.apply(format_ci,
+                                      bonferroni_correction=1,
+                                      axis=1)\
+                                      [['task', 'n',
+                                        'Linear regression coefficients',
+                                        'Spearman correlations']]
+    coef_summary.to_csv(f'{args.save_dir}/regression_modelprob_{y}.csv',
+                        index=False)
+
+
+if __name__ == "__main__":
+ parser = ArgumentParser()
+ parser.add_argument('--metric', type=str,
+ help='options are: iou or hitmiss')
+ parser.add_argument('--map_dir', type=str,
+ help='directory with pickle files containing heatmaps')
+ parser.add_argument('--pred_results', type=str,
+ help='path to csv file with saliency method IoU or \
+ hit/miss results for each CXR and each pathology.')
+ parser.add_argument('--save_dir', type=str, default='.',
+ help='where to save regression results')
+ args = parser.parse_args()
+ assert args.metric in ['iou', 'hitmiss'], \
+ "`metric` flag must be either `iou` or `hitmiss`"
+
+ run_model_assurance_regression(args)
diff --git a/regression_pathology_features.py b/regression_pathology_features.py
new file mode 100644
index 0000000..a8d3239
--- /dev/null
+++ b/regression_pathology_features.py
@@ -0,0 +1,117 @@
+"""
+Run regression using a pathology characteristic as the independent variable and
+evaluation metric as the dependent variable.
+"""
+from argparse import ArgumentParser
+import pandas as pd
+
+from eval_constants import LOCALIZATION_TASKS
+from utils import format_ci, run_linear_regression
+
+
+def normalize(column):
+    """Min-max scale a feature column; constant columns are returned as-is."""
+    if column.min() == column.max():
+        return column
+    else:
+        return (column - column.min()) / (column.max() - column.min())
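+# e.g. normalize(pd.Series([2.0, 4.0, 6.0])) -> [0.0, 0.5, 1.0]; scaling every
+# feature to [0, 1] makes regression coefficients comparable across features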
+
+
+def run_features_regression(args):
+    evaluate_hb = args.evaluate_hb == 'True'  # parse the flag without eval()
+
+ # read localization performance
+ pred_iou_results = pd.read_csv(args.pred_iou_results)
+ pred_hitmiss_results = pd.read_csv(args.pred_hitmiss_results)
+ if evaluate_hb:
+ hb_iou_results = pd.read_csv(args.hb_iou_results)
+ hb_hitmiss_results = pd.read_csv(args.hb_hitmiss_results)
+
+ # read geometric features
+ instance_df = pd.read_csv(f'{args.features_dir}/num_instances.csv').\
+ drop(['img_id'], axis=1)
+ area_df = pd.read_csv(f'{args.features_dir}/area_ratio.csv').\
+ drop(['img_id'], axis=1)
+ elongation_df = pd.read_csv(f'{args.features_dir}/elongation.csv').\
+ drop(['img_id'], axis=1)
+ rec_area_ratio_df = pd.read_csv(f'{args.features_dir}/rec_area_ratio.csv').\
+ drop(['img_id'], axis=1)
+    irrectangularity_df = 1 - rec_area_ratio_df  # irrectangularity = 1 - rec_area_ratio
+
+ # get cxrs with at least one ground-truth segmentation for any pathology
+ neg_idx = instance_df[instance_df.eq(0).all(1)].index.values
+ pos_idx = [i for i in instance_df.index.values if i not in neg_idx]
+ instance_df_pos = instance_df.iloc[pos_idx]
+ area_df_pos = area_df.iloc[pos_idx]
+ elongation_df_pos = elongation_df.iloc[pos_idx]
+ irrectangularity_df_pos = irrectangularity_df.iloc[pos_idx]
+
+ # create regression dataframe
+ if evaluate_hb:
+ regression_cols = ['pred_iou', 'iou_diff', 'pred_hitmiss', 'hitmiss_diff']
+ else:
+ regression_cols = ['pred_iou', 'pred_hitmiss']
+
+ for y in regression_cols:
+ overall_regression = pd.DataFrame()
+        coef_summary = pd.DataFrame(columns=['lower', 'upper', 'mean',
+                                             'coef_pval', 'corr', 'corr_pval',
+                                             'corr_lower', 'corr_upper',
+                                             'feature'])
+ for task in sorted(LOCALIZATION_TASKS):
+ data = {'pred_iou': pred_iou_results[task].values,
+ 'pred_hitmiss': pred_hitmiss_results[task].values,
+ 'n_instance': instance_df_pos[task].values,
+                    'area_ratio': area_df_pos[task].values,
+ 'elongation': elongation_df_pos[task].values,
+ 'irrectangularity': irrectangularity_df_pos[task].values,
+ 'img_id': pred_iou_results['img_id']}
+ if evaluate_hb:
+ data['iou_diff'] = hb_iou_results[task].values \
+ - pred_iou_results[task].values
+ data['hitmiss_diff'] = hb_hitmiss_results[task].values \
+ - pred_hitmiss_results[task].values
+ df = pd.DataFrame(data)
+ # get cxrs with ground-truth segmentation for this task
+            df = df[df.n_instance > 0]
+ overall_regression = pd.concat([overall_regression, df])
+
+ features = ['n_instance', 'area_ratio', 'elongation', 'irrectangularity']
+ for feature in features:
+ overall_regression[feature] = normalize(overall_regression[feature])
+ results = run_linear_regression(overall_regression, 'Overall', y, feature)
+ coef_summary = pd.concat([coef_summary, results])
+
+ coef_summary = coef_summary.apply(format_ci,
+ bonferroni_correction=4,
+ axis=1)\
+ [['task', 'feature', 'Linear regression coefficients']]
+ coef_summary.to_csv(f'{args.save_dir}/regression_features_{y}.csv',
+ index=False)
+
+
+if __name__ == "__main__":
+ parser = ArgumentParser()
+    parser.add_argument('--features_dir', type=str,
+                        help='directory with four csv files: area_ratio.csv, \
+                              elongation.csv, num_instances.csv, and \
+                              rec_area_ratio.csv.')
+ parser.add_argument('--pred_iou_results', type=str,
+ help='path to csv file with saliency method IoU results \
+ for each CXR and each pathology.')
+ parser.add_argument('--pred_hitmiss_results', type=str,
+ help='path to csv file with saliency method hit/miss \
+ results for each CXR and each pathology.')
+ parser.add_argument('--evaluate_hb', type=str, default='False',
+ help='if true, evaluate human benchmark in addition to \
+ saliency method.')
+ parser.add_argument('--hb_iou_results', type=str,
+ help='path to csv file with human benchmark IoU results \
+ for each CXR and each pathology.')
+ parser.add_argument('--hb_hitmiss_results', type=str,
+ help='path to csv file with human benchmark hit/miss \
+ results for each CXR and each pathology.')
+ parser.add_argument('--save_dir', type=str, default='.',
+ help='where to save regression results')
+ args = parser.parse_args()
+ assert args.evaluate_hb in ['True', 'False'], \
+ "`evaluate_hb` flag must be either `True` or `False`"
+
+ run_features_regression(args)
diff --git a/requirements.txt b/requirements.txt
index 1eff6bb..cf3091d 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,16 +1,20 @@
-cycler==0.11.0
+cycler==0.10.0
fonttools==4.33.3
-kiwisolver==1.4.3
-matplotlib==3.5.2
-numpy==1.23.0
-opencv-python==4.6.0.66
+kiwisolver==1.2.0
+matplotlib==3.1.3
+numpy==1.18.1
+opencv-python==4.3.0.36
packaging==21.3
-pandas==1.4.3
-Pillow==9.2.0
+pandas==1.2.1
+Pillow==7.1.2
pycocotools==2.0.4
-pyparsing==3.0.9
-python-dateutil==2.8.2
-six==1.16.0
-torch==1.12.0
+pyparsing==2.4.7
+python-dateutil==2.8.1
+scipy==1.7.3
+seaborn==0.10.1
+six==1.15.0
+statsmodels==0.13.2
+torch==1.4.0
+torchvision==0.5.0
tqdm==4.64.0
-typing-extensions==4.3.0
+typing-extensions==3.7.4.3
diff --git a/sample/probability_tuning_results.csv b/sample/probability_tuning_results.csv
new file mode 100644
index 0000000..f85b46e
--- /dev/null
+++ b/sample/probability_tuning_results.csv
@@ -0,0 +1,90 @@
+prob_threshold,mIoU,task
+0.0,0.123,Airspace Opacity
+0.1,0.134,Airspace Opacity
+0.2,0.162,Airspace Opacity
+0.3,0.184,Airspace Opacity
+0.4,0.204,Airspace Opacity
+0.5,0.198,Airspace Opacity
+0.6,0.147,Airspace Opacity
+0.7,0.072,Airspace Opacity
+0.8,0.007,Airspace Opacity
+0.0,0.067,Atelectasis
+0.1,0.089,Atelectasis
+0.2,0.115,Atelectasis
+0.3,0.134,Atelectasis
+0.4,0.123,Atelectasis
+0.5,0.081,Atelectasis
+0.6,0.02,Atelectasis
+0.7,0.008,Atelectasis
+0.8,0.0,Atelectasis
+0.0,0.132,Cardiomegaly
+0.1,0.223,Cardiomegaly
+0.2,0.203,Cardiomegaly
+0.3,0.158,Cardiomegaly
+0.4,0.134,Cardiomegaly
+0.5,0.091,Cardiomegaly
+0.6,0.084,Cardiomegaly
+0.7,0.016,Cardiomegaly
+0.8,0.016,Cardiomegaly
+0.0,0.054,Consolidation
+0.1,0.211,Consolidation
+0.2,0.081,Consolidation
+0.3,0.0,Consolidation
+0.4,0.0,Consolidation
+0.5,0.0,Consolidation
+0.6,0.0,Consolidation
+0.7,0.0,Consolidation
+0.8,0.0,Consolidation
+0.0,0.063,Edema
+0.1,0.12,Edema
+0.2,0.161,Edema
+0.3,0.189,Edema
+0.4,0.218,Edema
+0.5,0.233,Edema
+0.6,0.188,Edema
+0.7,0.136,Edema
+0.8,0.053,Edema
+0.0,0.191,Enlarged Cardiomediastinum
+0.1,0.041,Enlarged Cardiomediastinum
+0.2,0.01,Enlarged Cardiomediastinum
+0.3,0.005,Enlarged Cardiomediastinum
+0.4,0.0,Enlarged Cardiomediastinum
+0.5,0.0,Enlarged Cardiomediastinum
+0.6,0.0,Enlarged Cardiomediastinum
+0.7,0.0,Enlarged Cardiomediastinum
+0.8,0.0,Enlarged Cardiomediastinum
+0.1,0.0,Lung Lesion
+0.2,0.0,Lung Lesion
+0.3,0.0,Lung Lesion
+0.4,0.0,Lung Lesion
+0.5,0.0,Lung Lesion
+0.6,0.0,Lung Lesion
+0.7,0.0,Lung Lesion
+0.8,0.0,Lung Lesion
+0.0,0.046,Pleural Effusion
+0.1,0.079,Pleural Effusion
+0.2,0.103,Pleural Effusion
+0.3,0.117,Pleural Effusion
+0.4,0.123,Pleural Effusion
+0.5,0.123,Pleural Effusion
+0.6,0.133,Pleural Effusion
+0.7,0.097,Pleural Effusion
+0.8,0.068,Pleural Effusion
+0.0,0.008,Pneumothorax
+0.1,0.035,Pneumothorax
+0.2,0.068,Pneumothorax
+0.3,0.066,Pneumothorax
+0.4,0.08,Pneumothorax
+0.5,0.102,Pneumothorax
+0.6,0.125,Pneumothorax
+0.7,0.085,Pneumothorax
+0.8,0.085,Pneumothorax
+0.0,0.087,Support Devices
+0.1,0.123,Support Devices
+0.2,0.147,Support Devices
+0.3,0.16,Support Devices
+0.4,0.167,Support Devices
+0.5,0.164,Support Devices
+0.6,0.16,Support Devices
+0.7,0.144,Support Devices
+0.8,0.114,Support Devices
diff --git a/tune_probability_threshold.py b/tune_probability_threshold.py
new file mode 100644
index 0000000..228adc2
--- /dev/null
+++ b/tune_probability_threshold.py
@@ -0,0 +1,119 @@
+"""
+When evaluating mIoU on the full dataset, we ensure that the final binary
+segmentation is consistent with model probability output by applying another
+layer of thresholding such that the segmentation mask is all zeros if the
+predicted probability is below a chosen level.
+
+The probability threshold is determined per pathology by maximizing the
+mIoU on the validation set.
+"""
+from argparse import ArgumentParser
+import json
+import numpy as np
+import pandas as pd
+from pathlib import Path
+import pickle
+from pycocotools import mask
+import torch
+import torch.nn.functional as F
+from tqdm import tqdm
+
+from eval import calculate_iou
+from eval_constants import CHEXPERT_TASKS, LOCALIZATION_TASKS
+from heatmap_to_segmentation import cam_to_segmentation
+from utils import parse_pkl_filename
+
+
+def compute_miou(cutoff, pkl_paths, gt):
+    """Calculate mIoU given a probability cutoff and a list of pkl paths."""
+ ious = []
+ for pkl_path in tqdm(pkl_paths):
+ # get saliency segmentation
+ info = pickle.load(open(pkl_path, 'rb'))
+ img_dims = info['cxr_dims']
+ map_resized = F.interpolate(info['map'], size=(img_dims[1],img_dims[0]),
+ mode='bilinear', align_corners=False)
+        # handle model outputs that store one probability per CheXpert task
+        # (a length-14 tensor) by indexing out this pathology's entry
+        if torch.is_tensor(info['prob']) and info['prob'].size()[0] == 14:
+            prob_idx = CHEXPERT_TASKS.index(info['task'])
+            pred_prob = info['prob'][prob_idx]
+        else:
+            pred_prob = info['prob']
+
+ if pred_prob > cutoff:
+ segm = cam_to_segmentation(map_resized)
+ pred_mask = np.array(segm)
+ else:
+ pred_mask = np.zeros((img_dims[1],img_dims[0]))
+
+ # get gt segmentation
+        task, img_id = parse_pkl_filename(pkl_path)
+ if img_id in gt:
+ gt_item = gt[img_id][task]
+ gt_mask = mask.decode(gt_item)
+ else:
+ gt_mask = np.zeros((img_dims[1],img_dims[0]))
+
+ iou = calculate_iou(pred_mask, gt_mask, true_pos_only=False)
+ ious.append(iou)
+
+ miou = round(np.nanmean(np.array(ious)), 3)
+ return miou
+
+
+def find_threshold(task, gt_dict, cam_dir):
+ """
+ For a given task, find the probability threshold with max mIoU on val set.
+ """
+ cam_pkl = sorted(list(Path(cam_dir).rglob(f"*{task}_map.pkl")))
+ cutoffs = np.arange(0,.9,.1)
+ # We make this one exception for Lung Lesion. On the val set, using
+ # threshold = 0 gives mIoU of 0.001, whereas other thresholds yield mIoU of
+ # 0. This is because there is only one Lung Lesion segmentation in the
+ # CheXpert val set, which makes the 0.001 improvement less trustworthy.
+ # If no threshold is set, then we will end up with 668 saliency
+ # segmentations on Lung Lesion (given the low prevalence of this pathology,
+    # this will result in false positives).
+ if task == 'Lung Lesion':
+ cutoffs = np.arange(0.1,.9,.1)
+
+    mious = [compute_miou(cutoff, cam_pkl, gt_dict) for cutoff in cutoffs]
+    best_cutoff = cutoffs[mious.index(max(mious))]
+    print(f"cutoffs: {cutoffs}; mIoUs: {mious}; best cutoff: {best_cutoff}")
+    return cutoffs, mious
+
+
+def main(args):
+ with open(args.gt_path) as f:
+ gt_dict = json.load(f)
+
+ tuning_results = pd.DataFrame(columns=['prob_threshold','mIoU','task'])
+ for task in sorted(LOCALIZATION_TASKS):
+ print(f"Task: {task}")
+        cutoffs, mious = find_threshold(task, gt_dict, args.map_dir)
+        df = pd.DataFrame({'prob_threshold': [round(c, 1) for c in cutoffs],
+                           'mIoU': [round(m, 3) for m in mious],
+                           'task': task})
+        tuning_results = pd.concat([tuning_results, df], ignore_index=True)
+
+ tuning_results.to_csv(f'{args.save_dir}/probability_tuning_results.csv',
+ index=False)
+
+
+if __name__ == '__main__':
+ parser = ArgumentParser()
+ parser.add_argument('--map_dir', type=str,
+ help='directory with pickle files containing heatmaps \
+ and model output')
+ parser.add_argument('--gt_path', type=str,
+ help='json file where ground-truth segmentations are \
+ saved (encoded)')
+    parser.add_argument('--save_dir', type=str, default='.',
+                        help='where to save the probability thresholds tuned \
+                              on the validation set')
+    args = parser.parse_args()
+
+    main(args)
diff --git a/utils.py b/utils.py
index 56382f7..fd0e77b 100644
--- a/utils.py
+++ b/utils.py
@@ -1,5 +1,27 @@
+import io
+import math
import numpy as np
+import pandas as pd
+import pickle
from pycocotools import mask
+from scipy import stats
+import statsmodels.formula.api as smf
+import torch
+
+
+class CPU_Unpickler(pickle.Unpickler):
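+    """Unpickler that remaps CUDA tensor storages to CPU, so that heatmap
+    pickles saved on a GPU machine can be loaded on CPU-only hardware
+    (usage sketch: CPU_Unpickler(open(path, 'rb')).load())."""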
+ def find_class(self, module, name):
+ if module == 'torch.storage' and name == '_load_from_bytes':
+ return lambda b: torch.load(io.BytesIO(b), map_location='cpu')
+ else:
+ return super().find_class(module, name)
+
+
+def parse_pkl_filename(pkl_path):
+ path = str(pkl_path).split('/')
+ task = path[-1].split('_')[-2]
+ img_id = '_'.join(path[-1].split('_')[:-2])
+ return task, img_id
def encode_segmentation(segmentation_arr):
@@ -14,3 +36,71 @@ def encode_segmentation(segmentation_arr):
Rs = mask.encode(segmentation)
Rs['counts'] = Rs['counts'].decode()
return Rs
+
+
+def run_linear_regression(regression_df, task, y, x):
+ """
+    Run a linear regression on the regression dataframe of a single pathology.
+
+    Args:
+        regression_df (pd.DataFrame): regression data for one pathology
+        task (str): localization task (a pathology name, or 'Overall')
+        y (str): name of the dependent variable column
+        x (str): name of the independent variable column
+ """
+ est = smf.ols(f"{y} ~ {x}", data = regression_df)
+ est2 = est.fit()
+ ci = est2.conf_int(alpha=0.05, cols=None)
+ lower, upper = ci.loc[x]
+ mean = est2.params.loc[x]
+ pval = est2.pvalues.loc[x]
+    corr, corr_pval = stats.spearmanr(regression_df[y].values,
+                                      regression_df[x].values,
+                                      nan_policy='omit')
+ n = len(regression_df[regression_df[y].notnull()])
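+    # 95% CI for the Spearman correlation via the Fisher z-transformation:
+    # atanh(corr) is approximately normal with standard error 1/sqrt(n - 3),
+    # so the interval is tanh(atanh(corr) ± 1.96/sqrt(n - 3))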
+ stderr = 1.0 / math.sqrt(n - 3)
+ delta = 1.96 * stderr
+ lower_r = math.tanh(math.atanh(corr) - delta)
+ upper_r = math.tanh(math.atanh(corr) + delta)
+
+    results = {'lower': round(lower, 3),
+               'upper': round(upper, 3),
+               'mean': round(mean, 3),
+               'coef_pval': pval,
+               'corr_lower': round(lower_r, 3),
+               'corr_upper': round(upper_r, 3),
+               'corr': round(corr, 3),
+               'corr_pval': corr_pval,
+               'n': int(n),
+               'feature': x,
+               'task': task}
+ return pd.DataFrame([results])
+
+
+def format_ci(row, **kwargs):
+ """Format confidence interval."""
+ def format_stats_sig(p_val):
+ """Output *, **, *** based on p-value."""
+ stats_sig_level = ''
+ if p_val < 0.001 / kwargs['bonferroni_correction']:
+ stats_sig_level = '***'
+ elif p_val < 0.01 / kwargs['bonferroni_correction']:
+ stats_sig_level = '**'
+ elif p_val < 0.05 / kwargs['bonferroni_correction']:
+ stats_sig_level = '*'
+ return stats_sig_level
+
+ # CI on linear regression coefficients
+ p_val = row['coef_pval']
+ stats_sig_level = format_stats_sig(p_val)
+ mean = row['mean']
+ upper = row['upper']
+ lower = row['lower']
+ row['Linear regression coefficients'] = f'{mean}, ({lower}, {upper}){stats_sig_level}'
+
+ # CI on spearman correlations
+ p_val = row['corr_pval']
+ stats_sig_level = format_stats_sig(p_val)
+ mean = row['corr']
+ upper = row['corr_upper']
+ lower = row['corr_lower']
+ row['Spearman correlations'] = f'{mean}, ({lower}, {upper}){stats_sig_level}'
+
+ return row