- This is the code accompanying the paper O. Hrynenko, A. Cavallaro, "Identifying Privacy Personas", accepted at the Proceedings on Privacy Enhancing Technologies, 2025.
- This code computes the dissimilarity matrix and constructs a dendrogram (without pruning) as part of a processing pipeline described in the paper.
- We provide randomly generated dummy data for demonstration purposes: feature_vector_generation_set_p_dummy.csv and feature_vector_generation_set_p_dummy_prime.csv.
- The paper includes both qualitative and quantitative analyses. The code builds on the previously conducted qualitative analysis (coding, trait formation, annotation). The output of this code can be used for the subsequent quantitative analysis, namely Boschloo's test.
Install R:
sudo apt install r-base-core
Clone the project:
git clone [email protected]:idiap/identifying-privacy-personas.git
cd identifying-privacy-personas
Open the constants.py file to provide the following essential information:
- path_to_data – the name of your data folder,
- feature_vector_generation_set_p – the name of the file with the participants' feature vectors, $p_i$,
- feature_vector_generation_set_p_prime – the name of the file with the participants' feature vectors, $p_i'$,
- max_likert_distance – the maximum possible distance between the participants in the Likert space,
- number_of_likert_variables – the number of the Likert explanatory variables,
- num_of_participants_generation_set – the number of the participants in the generation set.
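As an illustration, a filled-in constants.py might look like the sketch below. Every value is a placeholder rather than the repository default. The helper names likert_min and likert_max are hypothetical. Whether the file names carry the .csv extension and whether max_likert_distance is Euclidean are assumptions to check against the actual file.

```python
# Hypothetical constants.py: all values are placeholders for illustration.
import math

path_to_data = "data"  # assumed folder name

# Dummy files shipped with the repository (extension handling is an assumption)
feature_vector_generation_set_p = "feature_vector_generation_set_p_dummy.csv"
feature_vector_generation_set_p_prime = "feature_vector_generation_set_p_dummy_prime.csv"

number_of_likert_variables = 3  # assumed number of Likert explanatory variables
likert_min, likert_max = 1, 5   # hypothetical helpers: a 5-point scale

# Assuming Euclidean distance, the two most distant participants differ by
# (likert_max - likert_min) on every Likert variable.
max_likert_distance = (likert_max - likert_min) * math.sqrt(number_of_likert_variables)

num_of_participants_generation_set = 10  # assumed size of the generation set
```

With these placeholder values, max_likert_distance evaluates to 4·√3.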
Install the project:
pip install .
Run ./scripts/run.sh
The input to this step is a feature_vector_generation_set_p_prime
file that contains
compute_dissimilarity_matrix(path_to_data = path_to_data,
input_file_name = feature_vector_generation_set_p_prime,
outfile_name = dissimilarity_matrix_generation_set
)
This function is called in the run.sh
file:
python ipp/steps/step_1.py
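The dissimilarity measure itself is defined in the paper and implemented in ipp/steps/step_1.py. The sketch below is not that implementation. It only illustrates, under the assumption of a Euclidean distance over Likert answers, how a constant such as max_likert_distance can scale pairwise distances into the range [0, 1]. The function name likert_dissimilarity_matrix and the sample data are hypothetical.

```python
import math


def likert_dissimilarity_matrix(vectors, max_distance):
    """Return a symmetric matrix of pairwise Euclidean distances scaled by max_distance."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(vectors[i], vectors[j]) / max_distance
            matrix[i][j] = matrix[j][i] = d  # dissimilarity is symmetric
    return matrix


# Three hypothetical participants answering two 5-point Likert items
participants = [(1, 1), (5, 5), (1, 5)]
max_likert_distance = 4 * math.sqrt(2)  # largest possible distance for two items

D = likert_dissimilarity_matrix(participants, max_likert_distance)
```

Here the first two participants sit at opposite corners of the Likert space, so their scaled dissimilarity is 1, while the diagonal stays at 0.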
For dendrogram construction, use the corresponding R script step_2.R. The default name of the output folder is "Converted_R_output_generation_set". The output folder contains
cluster_in_r()
The paths to the input file and to the output folder are set, and the function is called, in the run.sh file:
Rscript "ipp/steps/step_2.R" $path_to_data/$dissimilarity_matrix_generation_set $path_to_data/$path_to_r_results
For subsequent analysis we recommend using Python, so we convert the output from R into files that can be read from Python. The function below saves the information for each cluster into a separate file (one file per cluster, for each level of the dendrogram). The output of this call is a set of files "u.v.csv", where
unparsing_for_python(path_to_data = path_to_data,
file_name_binary_descriptor = feature_vector_generation_set_p,
path_to_r_results = path_to_r_results,
path_to_parsed_results = path_to_parsed_results
)
In the paper, we represent a descriptor as the frequency of appearance of the traits in a cluster. For further use of the pipeline described in the paper, namely for Boschloo's test, we recommend storing the count of how many times a trait appeared and the number of people in a cluster separately.
save_descriptors_to_table(path_to_data = path_to_data,
path_to_parsed_results = path_to_parsed_results,
number_of_participants = num_of_participants_generation_set
)
save_number_of_ppl_to_dictionary(path_to_data = path_to_data,
path_to_parsed_results = path_to_parsed_results,
number_of_participants = num_of_participants_generation_set
)
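To make the distinction between counts and frequencies concrete, here is a minimal sketch. It assumes each participant is a binary trait vector. The function name trait_counts and the toy cluster are hypothetical and do not reflect the repository's data format.

```python
def trait_counts(cluster_members):
    """Count, for each trait, how many members of the cluster exhibit it."""
    return [sum(column) for column in zip(*cluster_members)]


# Hypothetical cluster: four participants, three binary traits
cluster = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]

counts = trait_counts(cluster)            # stored for the later statistical test
size = len(cluster)                       # number of people in the cluster
frequencies = [c / size for c in counts]  # the descriptor as reported in the paper
```

Keeping counts and size separately preserves the raw numbers needed to build the 2x2 contingency tables that Boschloo's test operates on. The frequency alone would lose the cluster size.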
Additionally, we save to a table how each parent cluster u.v is split into the clusters u+1.j and u+1.k.
save_cluster_splits(path_to_data = path_to_data,
path_to_r_results = path_to_r_results,
number_of_participants = num_of_participants_generation_set,
outfile_name = dendrogram_cluster_splits_generation_set
)
These functions are called in the run.sh
file:
python ipp/steps/step_3.py
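The split table can be pictured with the sketch below. It assumes that u is the dendrogram level and v the index of a cluster within that level, and that a cluster which is not split is carried over unchanged to the next level. The memberships, the function name find_splits, and the column names are hypothetical and may differ from what save_cluster_splits writes.

```python
import csv
import io

# Hypothetical cluster memberships keyed by "u.v" labels
clusters = {
    "1.1": {1, 2, 3, 4, 5},
    "2.1": {1, 2},
    "2.2": {3, 4, 5},
    "3.1": {1, 2},  # carried over unchanged from 2.1
    "3.2": {3},
    "3.3": {4, 5},
}


def find_splits(clusters):
    """Return (parent, child_1, child_2) rows for every cluster split at the next level."""
    rows = []
    for parent, members in clusters.items():
        level = int(parent.split(".")[0])
        # Children live one level down and are proper subsets of the parent
        children = [
            label
            for label, child_members in clusters.items()
            if int(label.split(".")[0]) == level + 1 and child_members < members
        ]
        if len(children) == 2:
            rows.append((parent, children[0], children[1]))
    return rows


splits = find_splits(clusters)

# Write the table to an in-memory buffer; a real run would write to a file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["parent", "child_1", "child_2"])
writer.writerows(splits)
```

In this toy dendrogram, cluster 1.1 splits into 2.1 and 2.2, and cluster 2.2 splits into 3.2 and 3.3. Cluster 2.1 produces no row because it is not split.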