Differential privacy with Python
A very common challenge data engineers and information managers face when handling an organization’s information is granting the appropriate access to the employees and external entities involved. This must be done in a way that shares useful data with analysts and engineers while maintaining customers’ privacy and respecting data regulations and compliance requirements.
In this post we will implement an anonymization algorithm using the excellent cn-protect Python module from CryptoNumerics, an awesome company focused on data privacy solutions.
We will show important concepts and functionality for anonymizing data without the hassle of setting up complex infrastructure.
The full code is available in the following GitHub repo, but first, let me present some important concepts and a bit of history.
Traditional information flow has been the following:
- Business applications generate data which is stored in transactional databases (optimized for high row insertion rates and fast row-wise retrieval).
- At night, on weekends, or whenever network traffic is at its minimum, companies execute their ETLs (e.g., DataStage or Pentaho) to update their read replicas (same schema as the transactional databases), data warehouses (columnar SQL databases), or data lakes (plain denormalized files on Hadoop), optimizing the data for analytics. Only the ETL team has access to the information at this stage.
- Analysts and business users typically access information in the read replica, data warehouse, or data lake. This makes it critical for the organization to keep reliable, up-to-date identity management systems that grant the correct access to the entire analytics team.
Having covered the standard information flow and procedures in organizations, let’s take a look at what important DBMS and data lake technologies have implemented to accomplish security and privacy objectives.
Traditional database management systems implement schemes for handling access permissions at different granularity levels. For example, PostgreSQL has the PostgreSQL Anonymizer extension, Microsoft SQL Server has built-in data masking functionality, and Oracle has its own data masking module.
Data lake technologies based on the Hadoop stack have also been complemented by other projects that provide data masking and privacy, to mention a few: Apache Ranger and Apache Atlas.
But wait, you might be wondering: what the hell is data masking?
Data masking is the process of hiding some part of the information, as shown in the following example:
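Below is a minimal sketch of the idea in Python; the masking rule and the sample values are made up purely for illustration:

def mask(value, visible=4):
    # Replace all but the last `visible` characters with '*'
    return '*' * (len(value) - visible) + value[-visible:]

print(mask('4111111111111111'))  # ************1111 (credit card number)
print(mask('555-867-5309'))      # ********5309     (phone number)

The masked values keep their length and their last few characters, so they remain recognizable in reports while hiding the identifying portion.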
While data masking may be sufficient in many cases, it also has some important flaws that need to be considered.
- Important statistical information may be lost in the process.
- Masked data may still be vulnerable to de-anonymization attacks, which consist of matching the anonymized data against public or otherwise available information, thereby recovering the complete data of the anonymized subjects. A quick fact: a study by Latanya Sweeney found that 87 percent of the U.S. population can be identified using a combination of their gender, birthdate, and zip code.
With that in mind, let me introduce the concept of Differential Privacy.
In essence, differential privacy alters the information so that subjects cannot be re-identified while keeping the data useful enough for statistics and machine learning purposes. A closely related anonymization model, and the one we will use here, is k-anonymity, where an individual cannot be distinguished from at least k-1 other individuals.
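To make this concrete, here is a quick way to measure the k of a table with pandas (the tiny table and its column names are hypothetical): group the rows by their quasi-identifier columns and take the size of the smallest group.

import pandas as pd

df = pd.DataFrame({
    'gender':  ['F', 'F', 'M', 'M', 'M'],
    'age':     ['20-25', '20-25', '20-25', '20-25', '20-25'],
    'zipcode': ['102**', '102**', '102**', '102**', '102**'],
})

# The smallest group of identical quasi-identifier combinations
# determines the k of the whole table.
k = df.groupby(['gender', 'age', 'zipcode']).size().min()
print(f'This table is {k}-anonymous')  # This table is 2-anonymous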
Differential privacy is a very common technique used by government agencies when making demographic information public.
Another key aspect to keep in mind while handling company data is compliance with government data regulations, a few of them being HIPAA, GDPR, and CCPA.
- HIPAA: the Health Insurance Portability and Accountability Act is a U.S. law signed by President Bill Clinton in 1996. It was created to protect the privacy of patients’ information in the healthcare industry.
- GDPR: the General Data Protection Regulation is a European Union regulation that aims to increase privacy and extend data rights for EU residents.
- CCPA: the California Consumer Privacy Act is a regulation that aims to enhance consumer privacy rights in California.
A little bit of math and computational concepts
- Cardinality: the number of distinct values a feature can take in a dataset. This concept is really important when considering k-anonymity, since a high-cardinality dataset requires more aggressive generalization of its features than a low-cardinality one (see the snippet after this list).
- Computational complexity theory: broadly speaking, the field of study that focuses on the number of calculations and the resources an algorithm needs to execute. A fair understanding of this field is necessary to comprehend how the k-anonymity algorithm is implemented internally, since finding an optimal k-anonymization is an NP-hard problem and can be quite expensive computationally.
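For instance, pandas can report each feature’s cardinality directly; a minimal snippet (assuming the dataset file we generate later in this post):

import pandas as pd

df = pd.read_csv('dataset.txt')  # the dataset generated later in this post
print(df.nunique())              # number of distinct values per column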
Installation
First we have to make sure Java 11 is installed, since cn-protect relies on it. If Java 11 is not installed, check the following link to install it.
After Java is correctly installed, we proceed to install the cn-protect module:
$ pip install cn-protect==0.9.4
Finally, with our environment ready, we proceed to what we like most: programming!
Step 1: Generate the sample dataset
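The generator script itself is embedded in the repo; the sketch below shows what generate_data.py might look like. The exact columns (gender, age, zipcode, nationality) and value ranges are my assumptions, not necessarily the original code:

# generate_data.py -- minimal sketch of the dataset generator
import argparse
import csv
import random

NATIONALITIES = ['Mexican', 'Canadian', 'French', 'German', 'Japanese']

def main():
    parser = argparse.ArgumentParser(description='Generate a sample dataset')
    parser.add_argument('-n', type=int, default=1000000, help='number of rows')
    parser.add_argument('-o', default='dataset.txt', help='output file')
    args = parser.parse_args()

    with open(args.o, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['gender', 'age', 'zipcode', 'nationality'])
        for _ in range(args.n):
            writer.writerow([
                random.choice(['M', 'F']),
                random.randint(18, 90),
                random.randint(10000, 19999),  # matches the zipcode range seen later
                random.choice(NATIONALITIES),
            ])

if __name__ == '__main__':
    main()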
The above script generates a base dataset for us to anonymize. We execute it with the following command:
$ python generate_data.py -n 1000000 -o dataset.txt
After the command executes successfully, we should have a dataset.txt file containing a million rows with the structure described above.
Step 2: Execute the anonymization process
Once we have a generated dataset, we proceed to the anonymization process.
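The anonymization script is also embedded in the repo. Based on the hierarchy calls explained below and on cn-protect’s documented API, its core looks roughly like the following sketch; treat the imports, the itypes assignments, the column names, and the return value of protect() as best-effort assumptions rather than the author’s exact code:

# anonymization.py -- rough sketch of the anonymization script
import argparse
import pandas as pd
from cn.protect import Protect
from cn.protect.privacy import KAnonymity
from cn.protect.hierarchy import OrderHierarchy, DataHierarchy

parser = argparse.ArgumentParser()
parser.add_argument('-k', type=int, default=3)
parser.add_argument('-d', default='data/data.txt')
parser.add_argument('-o', default='dataset.txt')
args = parser.parse_args()

df = pd.read_csv(args.d)
prot = Protect(df)

# Mark the quasi-identifier columns (assumed names).
prot.itypes.Age = 'quasi'
prot.itypes.Zipcode = 'quasi'
prot.itypes.Nationality = 'quasi'

# Generalization hierarchies, explained in detail below.
prot.hierarchies.Age = OrderHierarchy('interval', 5, 2, 2, 2, 2)
prot.hierarchies.Nationality = DataHierarchy(pd.read_csv('data/Nationality.csv'))

# Apply k-anonymity with the requested k and write the result
# (assuming protect() returns the anonymized DataFrame).
prot.privacy_model = KAnonymity(args.k)
prot.protect().to_csv(args.o, index=False)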
Some key aspects of the above script need explanation:
Hierarchies: relate to the way in which the dataset is generalized to comply with the specified k parameter. cn-protect supports three types of hierarchies:
- OrderHierarchy: a generalization technique for numeric features in which a continuous range of values is bucketed into groups. For example, the piece of code below specifies that the age column can be generalized into ranges of 5, 10, 20, 40, or 80 years if needed; the arguments after the first interval are cumulative multipliers:
5, 5*2 = 10, 10*2 = 20, 20*2 = 40, 40*2 = 80
prot.hierarchies.Age = OrderHierarchy('interval', 5, 2, 2, 2, 2)
- DataHierarchy: a generalization technique for categorical features that require another level of abstraction to comply with the k value. For example, the following piece of code specifies how to generalize the nationality feature by reading a CSV file:
prot.hierarchies.Nationality = DataHierarchy(pd.read_csv("data/Nationality.csv"))
If the nationality column needs to be generalized to comply with the specified k value, the original nationality will be replaced by the continent it belongs to; a hypothetical mapping file is sketched below.
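I don’t reproduce the repo’s actual Nationality.csv here, but a hypothetical two-column version of such a mapping file could look like this:

nationality,continent
Mexican,America
Canadian,America
French,Europe
German,Europe
Japanese,Asia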
- IntervalHierarchy: to be completely honest, I have not been able to fully understand this type of hierarchy; however, the library’s own brief description is quoted below:
“Interval based hierarchy for numeric data
An interval hierarchy allows full control over the size of the hierarchies,
including specifying heterogeneous sizes, as well as heterogeneous
combinations of those hierarchies to deal with data whose distributions
may not function well with an OrderHierarchy class”.
Anonymization process
To compare adequate levels of anonymization, we will try different k values, checking both the output characteristics and the information loss.
$ python anonymization.py -k 3 -d data/data.txt -o dataset.txt
With k = 3 we can see that only the zipcode and age columns are generalized, with relatively small intervals.
Next, we run the algorithm with k = 100.
$ python anonymization.py -k 100 -d data/data.txt -o dataset.txt
With k = 100, no additional columns were generalized, but the information loss increased from 0.08 to 0.12.
Next we run with k = 1000
$ python anonymization.py -k 1000 -d data/data.txt -o dataset.txt
With k = 1000, the nationality column is now generalized to the corresponding continent, while the other generalized columns remain the same.
Finally, we run with k = 10000.
$ python anonymization.py -k 10000 -d data/data.txt -o dataset.txt
With k = 10000 we can see another substantial change: the zipcode column is generalized into a single category (10000–19000), which makes the column useless and results in significant information loss.
From the graph above (information loss versus k) we can clearly observe the trade-off between the level of anonymity (the k value) and the utility of the anonymized information.
Considerations
Anonymization use cases vary greatly with business needs, technological maturity, and regulations. However, they must all strike the right balance between privacy, utility, computing infrastructure costs, and regulatory compliance in order to provide the maximum possible value to the organization.