Pseudonymization, Anonymization, Encryption ... what is the difference?
The year 2018 will, at least in Europe, be a turning point for data privacy and personal information protection. In this article, I will focus on personal data processing within the organization so that the risk of its disclosure is minimized (i.e., cyber theft). Since I have experience in this area, I will describe methods of de-identification of personal data, such as pseudonymization, anonymization, and encryption.
Organizations typically run a production IT system in which personal data is located, namely customers’ emails, names, addresses, bank accounts, and even credit card information. Likewise, the database can contain similar information about employees, suppliers, and partners. Sensitive data may leak from these systems, but it is a good and common practice to restrict access to the IT system, and it is difficult to gain unauthorized access to the relevant production databases. However, support systems and applications are often linked to these production systems so they can export data from production databases, including personal data. Typically, these support systems don't have strict rules in place, and sensitive personal data can often be found in unsecured storage systems, like on a common notebook disk drive. From there, data leakage is very easy, and organizations lose control over this data. One common example is in the export of production data to the marketing department or even an external agency.
And this is just one way of many how your organization can leak sensitive data.
The data you do not have cannot leak
The basic principle is that we must only collect, process, and store personal data that is really needed. The organization should then complete the personal data inventory, identify the data fields that contain sensitive personal information, and explore their reasons for collecting and storing this information. If the data breach risk is greater than the added value, then it is appropriate to stop collecting and storing this information. An example might be a mailing list for a newsletter campaign. Such a list may represent a high risk for the organization compared to its added value. Also, during the personal data inventory process, you may find that the original data collection purpose is no longer applicable but the data is still being collected. This is, of course, a completely unnecessary risk and it needs to be eliminated without delay.
De-identify anything else
When it comes to personal information that we need to collect, process, or store for business reasons, access should be limited to the necessary IT systems and staff only. This approach should be automated and monitored. Moreover, an audit log should be created that stores records of all times the data is accessed. These measures serve to demonstrate that the protection of confidential personal data is a serious concern for the organization and that it is not taken lightly. This approach is also efficient preemptive action against a possible internal attack. Such measures will also be useful if a data leak occurs and the organization needs evidence in their investigation.
All data exports from this restricted core database must go through a de-identification process. The goal of de-identification is to adjust personal data in a way so that it is no longer possible to identify the originating person.
Picture: While a person can usually be readily identified from a picture taken directly of them, the task of identifying them on the basis of limited data is harder.
All sensitive data fields are de-identified to remove private information and to retain the information value of the data for subsequent analysis or research. It is necessary to ensure protection from statistical disclosure, which is a method that enables the identification of the person by advanced data research (e.g., determining a person's identity from a known geographical location of that person at some point in time).
Methods of de-identification of personal data
De-identification methods apply to individual data fields, their parts, or their groups.
Picture: De-identification of the individual data field.
Picture: De-identification of the group of data fields.
This method applies to data fields that have a clear link between them. For instance: name and email, or geographic information (geography and longitude).
Picture: De-identification of part of a data field.
De-identification of part of a data field can be applied to e-mail addresses and telephone numbers where it is possible to keep information about that data point for analytical reasons. For instance: the email domain or the prefix of the phone number.
GDPR (a new European Union data privacy law) defines pseudonymization as the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information. By holding the de-identified data separately from the “additional information,” the GDPR permits data handlers to use personal data more liberally without fear of infringing on the rights of data subjects. This is because the data only becomes identifiable when both elements are held together.
According to the GDPR, Pseudonymization may facilitate the processing of personal data beyond the original collection purposes.
Picture: Illustration of the pseudonymization process at work.
Pseudonymization translates a sensitive data field into a pseudorandom string (hence the name). The resulting string is always the same for the same input, so that analytical correlations are still possible. This process is alternatively called “data tokenization.”
The pseudonymization can be configured (or initialized) by a digital secret key so that only those with access to that secret key can pseudonymize inputs into the same output. This means that an external attacker without the secret key cannot guess the pseudonymized form of the email even if he knows the initial unprotected email. Also, you can periodically change this secret value to further increase protection of the data privacy.
We recommend to use cryptographic hash functions at the pseudonymization core such as SHA-384 or SHA-512.
Anonymization is the irreversible removal of information that could lead to an individual being identified, either on the basis of the removed information or in combination with other information. This definition emphasizes that anonymized data must be stripped of any identifiable information, making it impossible to derive insights on a discreet individual, even by the party that is responsible for the anonymization.
When done properly, anonymization places the processing and storage of personal data outside the scope of the GDPR.
Picture: Illustration of the data anonymization at work
Suppression or data masking is an extreme form of anonymization. It replaces information with pre-defined fixed text (or a black tape). Data masking is very simple to implement and very effective in removal of sensitive data. On the other hand, any statistical or analytical value of data is lost in the masking process.
Picture: Illustration of the suppression process at work. The sensitive information has been replaced by XXX.
In this method, individual values of fields are replaced with a broader category. For example, the value 19 of the field Age may be replaced by ' ≤ 20', the value '23' by '20 < Age ≤ 30', etc. This is the permanent modification of sensitive data in such a manner that it cannot be attributed back to the specific person. It is also an irreversible process, meaning that the data cannot be restored back to its original identifiable form.
Generalized data is not subject to GDPR restrictions.
Picture: Illustration of the generalization process at work.
By combining generalization and suppression, one can archive so-called k-anonymity. This is an attempt to solve the problem: "Given person-specific data, produce an export of the data with scientifically guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remains practically useful."
Encryption translates data into another form so that only people or a system with access to a secret key (formally called a “decryption key) can read it. Under Article 32 of GDPR, controllers are required to implement risk-based measures to protect data security. One such measure is the “encryption of personal data” that “renders the data unintelligible to any person who is not authorized to access it.”
Organisations can use encryption to meet the GDPR’s data security requirements.
Picture: Illustration of the encryption process at work.
There are two major data encryption schemes: symmetric encryption scheme and asymmetric encryption scheme. Frequently, these two schemes are mixed together, and it is then called “hybrid encryption scheme.”
Today’s standard of symmetric encryption scheme is AES. It is very fast, often accelerated by a processor (CPU). For that reason, it has a negligible impact on the performance of the system. Symmetric encryption has its weakness in the presence of the secret key on the encryption side, which is inherently difficult to protect. Any person (such as the system admin) with access to a production system can steal the secret key and use it decrypt data. Whilst some hardware solutions exist, this is a difficult problem to crack.
Asymmetric encryption schemes such as RSA, DSA, or ECC use two keys: public and private. Encryption of data uses the public key and even if an attacker attains this public key, he is not able to decrypt protected data. Asymmetric encryption, however, is far slower that symmetric encryption and, for that reason, it is rarely used for data encryption on its own.
The purpose of homomorphic encryption is to allow computation of encrypted data. Homomorphic encryption is a form of encryption that allows computation of encrypted data, generating an encrypted result which, when decrypted, matches the result of the operations as if they had been performed on the original data.
Cloud computing platforms can perform computations on homomorphically encrypted data without ever having access to the original data.
Picture: Illustration of the homomorphic encryption at work.
Practical deployment of homomorphic encryption is complicated by a number of challenges because the Fully Homomorphic Encryption system has not yet been developed. Restrictions are mainly based around how mathematical functions are supported over encrypted data. Hopefully this will improve over the course of time.
There are plenty of ways to de-identify personal information. It is important to start early and pick methods correctly so that their deployment will not impact the effectiveness of the organization. It is also important not to underestimate the actual deployment of the de-identifying tool, because it can be time consuming.
TeskaLabs has developed a specialized tool called TurboCat.io that offers all of the above-mentioned methods out-of-the-box. TurboCat.io de-identifies any sensitive data without a negative impact on vital business functions. It is designed to prevent data leakage by de-identifying any Personally Identifiable Information (PII) in a given data flow.
Picture "A reflection on sand and water" © Tomas Castelazo, www.tomascastelazo.com / Wikimedia Commons
(CC BY-SA 4.0)
Data anonymization tool for GDPRMore information
You Might Be Interested in Reading These Articles
Data encryption is a critical part of GDPR compliance although there are no explicit GDPR encryption requirements. The regulation vaguely states that businesses must enforce safeguards and security measures to protect all consumer data that they handle. The GDPR refers to pseudonymization and encryption as “appropriate technical and organizational measures.
Published on May 16, 2018