Problem
Most of the time, you want the synthetic data to reuse the same category values that exist in your real data.
For example, see our demo dataset for students below. Only certain values for degree_type are allowed (Sci&Tech, Comms&Mgmt). The synthetic data should reuse those same values.
The problem arises when a column contains Personally Identifiable Information (PII), which can be used to identify a person. Reusing the same PII values in the synthetic data will leak sensitive information.
The column address is PII. We want to make sure the synthetic data doesn't contain the exact addresses that are in our real data.
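One way to make this concern concrete is to check whether any real address shows up verbatim in the synthetic column. The snippet below is a minimal sketch in plain Python, not part of the SDV API; the helper name `leaked_values` and the sample addresses are made up for illustration.

```python
def leaked_values(real_values, synthetic_values):
    """Return the sorted values that appear in both the real and synthetic columns."""
    return sorted(set(real_values) & set(synthetic_values))


# Made-up example rows (assumption: not taken from the SDV demo dataset).
real_addresses = ["12 Oak St", "98 Elm Ave", "5 Pine Rd"]
synthetic_addresses = ["77 Birch Ln", "98 Elm Ave", "3 Cedar Ct"]

leaks = leaked_values(real_addresses, synthetic_addresses)
print(leaks)  # any value listed here was copied from the real data
```

An empty result means no address was reused exactly. It does not prove the synthetic data is safe in any broader sense.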
Solution
When creating synthetic data, you can ask the SDV to stop reusing values and generate entirely new values instead. You can do this using anonymization options. The SDV allows you to:
Fully anonymize PII data in an irreversible way
Pseudo-anonymize PII data by applying a consistent mapping. Anyone with access to the mapping can look up the original values.
Provide locales to make the PII data more realistic. For example, you may want to generate addresses from specific locations.
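The difference between the first two options can be sketched in plain Python. This is a conceptual illustration only, not the SDV's implementation or API: `make_fake_address`, `anonymize`, and `pseudo_anonymize` are hypothetical helpers, and the fake address generator is a toy stand-in for a real fake-data provider. Locales are not covered here.

```python
import random


def make_fake_address(rng):
    """Build a made-up address; a toy stand-in for a real fake-data generator."""
    streets = ["Maple", "Cedar", "Birch", "Willow"]
    return f"{rng.randint(1, 9999)} {rng.choice(streets)} St"


def anonymize(values, rng):
    """Full anonymization: every row gets a fresh fake value and no mapping is kept."""
    return [make_fake_address(rng) for _ in values]


def pseudo_anonymize(values, rng):
    """Pseudo-anonymization: each distinct real value maps to exactly one fake value."""
    mapping = {}
    used = set()
    for value in values:
        if value not in mapping:
            fake = make_fake_address(rng)
            while fake in used:  # keep the mapping one-to-one
                fake = make_fake_address(rng)
            used.add(fake)
            mapping[value] = fake
    return [mapping[value] for value in values], mapping


rng = random.Random(0)
real = ["12 Oak St", "98 Elm Ave", "12 Oak St"]

fully_anonymous = anonymize(real, rng)
pseudo, mapping = pseudo_anonymize(real, rng)

# Whoever holds the mapping can reverse the pseudo-anonymized values.
reverse = {fake: original for original, fake in mapping.items()}
recovered = [reverse[fake] for fake in pseudo]
```

In this sketch, repeated real values stay linked after pseudo-anonymization (rows 1 and 3 share one fake address), and `recovered` equals the original list. With full anonymization there is no stored mapping, so the original values cannot be looked up.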
Resources
In 2023, we released the new SDV 1.0 library with an improved API and workflow. To learn more about anonymization options, check out our new resources!