How do you anonymize PII values? Can you limit this to specific locales? #667

npatki · 2021-12-10T23:34:55Z

npatki
Dec 10, 2021
Maintainer

Problem

Most of the time, you want the synthetic data to reuse the same category values that exist in your real data.

For example, consider our demo dataset for students. Only certain values for degree_type are allowed (Sci&Tech, Comms&Mgmt), and the synthetic data should reuse those same values.

The problem arises when your data contains Personally Identifiable Information (PII), which can be used to identify a person. Reusing the same PII values in the synthetic data would leak sensitive information.

In the students dataset, the address column is PII. We want to make sure the synthetic data doesn't contain the exact addresses that appear in our real data.
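To make the distinction concrete, here is a minimal sketch in plain Python (it does not use the SDV) that checks which values a synthetic table shares with the real one. The rows and the shared_values helper are hypothetical and exist only for illustration: sharing degree_type values is expected, while sharing address values would be a leak.

```python
# Hypothetical toy rows; the column names follow the students example.
real_rows = [
    {"degree_type": "Sci&Tech", "address": "12 Elm St, Springfield"},
    {"degree_type": "Comms&Mgmt", "address": "98 Oak Ave, Riverton"},
]
synthetic_rows = [
    {"degree_type": "Comms&Mgmt", "address": "12 Elm St, Springfield"},  # reused real address
    {"degree_type": "Sci&Tech", "address": "400 Pine Rd, Lakeside"},
]


def shared_values(real, synthetic, column):
    """Return the values of `column` that appear in both datasets."""
    return {row[column] for row in real} & {row[column] for row in synthetic}


# Reusing category values is what we want for a column like degree_type.
reused_categories = shared_values(real_rows, synthetic_rows, "degree_type")

# Any overlap in a PII column means real information has leaked.
leaked_addresses = shared_values(real_rows, synthetic_rows, "address")
```

A non-empty leaked_addresses set is the situation the anonymization options are meant to prevent.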

Solution

When creating synthetic data, you can ask the SDV to stop reusing values and generate entirely new values instead. You can do this using anonymization options. The SDV allows you to:

Fully anonymize PII data in an irreversible way.
Pseudo-anonymize PII data by applying a consistent mapping. Anyone with access to the mapping can look up the actual values in reverse.
Provide locales to make the PII data more realistic. For example, you may want to generate addresses from specific locations.
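The difference between the three options can be sketched in plain Python. This is a conceptual illustration only, not the SDV's API: the FAKE_ADDRESSES pools, the anonymize and pseudo_anonymize functions, and the locale keys are all assumptions made up for this example (a real tool would generate fake values rather than pick from a fixed list).

```python
import random

# Hypothetical pools of fake addresses, keyed by locale.
FAKE_ADDRESSES = {
    "en_US": ["1 Main St, Austin, TX", "22 Lake Dr, Boise, ID", "7 Hill Rd, Salem, OR"],
    "en_CA": ["5 King St, Toronto, ON", "80 Rue Peel, Montreal, QC"],
}


def _pool(locales):
    """Collect the fake values available for the requested locales."""
    return [addr for loc in locales for addr in FAKE_ADDRESSES[loc]]


def anonymize(values, locales, seed=None):
    """Full anonymization: draw a random fake per row and keep no mapping."""
    rng = random.Random(seed)
    pool = _pool(locales)
    return [rng.choice(pool) for _ in values]


def pseudo_anonymize(values, locales, seed=None):
    """Pseudo-anonymization: map each distinct real value to one distinct fake value."""
    rng = random.Random(seed)
    pool = _pool(locales)
    distinct = list(dict.fromkeys(values))  # distinct values, original order
    if len(distinct) > len(pool):
        raise ValueError("not enough fake values for a one-to-one mapping")
    mapping = dict(zip(distinct, rng.sample(pool, len(distinct))))
    return [mapping[v] for v in values], mapping


real = ["12 Elm St, Springfield", "98 Oak Ave, Riverton", "12 Elm St, Springfield"]

# Irreversible: nothing links the output back to the real rows.
fully_anonymous = anonymize(real, locales=["en_US"], seed=0)

# Consistent and reversible for anyone who holds `mapping`.
pseudonymous, mapping = pseudo_anonymize(real, locales=["en_CA"], seed=0)
reverse_lookup = {fake: original for original, fake in mapping.items()}
recovered = [reverse_lookup[v] for v in pseudonymous]
```

In this sketch the repeated real address always maps to the same fake address under pseudo-anonymization, which is why the mapping itself must be protected as carefully as the original data.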

Resources

In 2023, we released the new SDV 1.0 library with an improved API and workflow. To learn more about anonymization options, check out our new resources!
