Mathy Vanhoef: February 2012

Privacy is a difficult concept and there are many sides to the privacy issues we are facing in our digital age. For my master thesis I studied privacy in databases, where the goal is to find a mathematical definition of privacy. But in this post I won't focus too much on the math behind it all, instead I'll go over some interesting observations I have made during my work, and explain some of the basic concepts. To get started we'll look at some privacy fiasco's that occurred in the past.

1. AOL Search Fiasco

We begin with the AOL search fiasco where AOL released around 20 million search queries from 65,000 AOL users. In an attempt to protect privacy, the username of each individual has been changed to a random ID. However, different searches made by the same person still have the same ID. Shortly after the release the person with ID 441779 was identified as Thelma Arnold. By combining all the searches she made, it became possible to find out her real identity. This already shows that simply removing unique identifiers such as a person's name, address or social security number do not provide privacy, and more generally that anonymity does not imply privacy (more on this later). No real care was given to anonymizing the search results. So this is not a true example that removing identifying attributes (eg., name, address, etc.) fails to protect privacy, as no real care was given to anonymizing the dataset. This can be deduced from the observation that the person who was responsible for releasing the data was quickly fired.

2. The HMO Records

In this attack the medical records of the governor of Massachusetts were revealed. The attack begins with the following observation: One can uniquely identify 63% of the U.S. population with only knowing the gender, ZIP code and date of birth of an individual. We now turn our attention to two different datasets. The first one is the voter registration list for Cambridge Massachusetts and includes information of each voter. A bit simplified the dataset can be represented using the following relation:

VoterRegistration(ZipCode, BirthDate, Gender, Name, Address, DateLastVoted)

The second dataset is the Massachusetts Group Insurance Commissions medical encounter data, containing medical information on state employees (and thus also contained the medical records of the governor of Massachusetts). A small part of this medical data, without all identifying information such as name, address and social security number removed, were released. It can be represented using the following relation:

MedicalData(VisitDate, Diagnosis, Procedure, ZipCode, BirthDate, Gender)

As we can see both datasets can be linked to each other by combining gender, ZIP code and date of birth! So even though the names were removed in the medical dataset, researchers were still able to link the data back to individuals. As a result the medical records of the governor of Massachusetts were found.

3. Kaggle

Finally we come to an example that shows it's possible to find correspondences between social networks merely based on the structure of underlying friendship relations. Even more interesting is that the art of "de-anonymizing" a dataset was used to win a competition. Kaggle organized a competition where participants were given a dataset of people and their underlying relations, which we will call a social graph (each node is represented using a random ID and no information such as username, name, location, etc. were included). An example of a social graph is shown in the figure below:

The circles/letters represent people and a line is drawn between them if they are friends. However not all relationships were given to the participants. In fact, the goal of the competition was to determine whether certain given relationships, which were not present in the original social graph given to the contestants, were fake or real. Of course if we'd knew who the individuals behind all the nodes are, we could just look up if two nodes are really friends or not! So if we manage to de-anonymize the social graph given by Kaggle we can game the competition: Instead of making a machine learning algorithm to predict the answer we simply look them up.

It was found out the social graph was created by crawling Flickr. So the researchers made their own crawl of Flickr and based on the structure alone created a mapping between their own crawl and the Kaggle social graph. In other words they now know which individuals correspond to the supposedly anonymous nodes, and thus they identified the individuals using only the structure of his or her social graph.

The Difficulty of Absolute Privacy Protection

It should be obvious by now: Assuring privacy is hard. We can't simply remove attributes such as name, address, social security number, etc. from a dataset. The reason is that seemingly innocent information such as gender, ZIP code, and birthdate can still be used to uniquely identify an individual. Clearly the need for a rigorous privacy definition is needed. Researchers have proposed theoretical models such a k-anonymity, but it turned out to have problems so L-diversity was suggested. I turn weakness were found in L-diversity, resulting in a new definition called T-closeness. But still there are problems even for T-closeness. So in practice assuring privacy is hard, and even finding a robust mathematical definition appears to be a challenging task. What can be done?!

Before answering that question we're going to argument that the situation, at least from a theoretical point of view, is even worse than one might already think. There is in fact theoretical evidence suggesting that absolute privacy protection is impossible. This proof was heavily based on an earlier attempt at the proof. Now of course not releasing your information does provide absolute privacy. But what they have proven is that if a database is sufficiently useful there is always a piece of external information that, combined with the output of the database, violates the privacy of an individual. An example explains this best. Assume that learning the salary of an individual is considered a privacy violation. Further assume that we know the salary of Alice is 200 EUR higher than the average salary of a Belgian citizen (this is the external information). So we don't know her exact salary. Let's say we now receive access to a database containing the salary of every Belgian citizen. From this database we can calculate the average salary of a Belgian citizen, say 2500 EUR. Combining this with the external information teaches us that the salary of Alice is 2700 EUR! Et voila, the privacy of Alice has been violated. All due to gaining access to a database from which we merely learned an average of a particular value.

So it seems it's very difficult (impossible) to assure that there will be no privacy violation whatsoever. What can we do? The answer is simple. We will reduce the probability of a privacy violation as much as possible.

Differential Privacy

One of the promising definitions in the privacy research community is called differential privacy. It provides relative privacy prevention, meaning that the probability of a possible privacy violating occurring can be greatly reduced. Important is that when using differential privacy the dataset is not handed out to the public. Instead users can pose questions (queries) to the dataset (eg., over the internet), and answers will be given in such a way to reduce the probability of a privacy violation. In practice this is done by adding a small amount of random noise to the answer. Note that there is always a tension between accuracy of the data, and the privacy guarantees provided by the data release. The higher the privacy guarantees, the lower the accuracy.

Differential privacy assures that, when you provide your personal information to the database, the answers to queries will not differ significantly. In other words handing over your information should not be visible in the calculated answers. This results in a more natural notion of privacy: When giving your information your privacy will on be very minimally reduced (remember that we consider absolute privacy protection impossible).

More formally, when you have two databases, one with your personal information (D1) and one without your personal information (D2). Them the probability (Pr) that an answer mechanism (K) returns a specific answer (S) should be nearly identical (multiplication by exp(epsilon)) for both databases. Mathematically this becomes

The parameter epsilon defines how much the probabilities are allowed to vary. A small epsilon such as 0.01 means the probabilities should be almost identical (within multiplicative factor 1.01). However for a larger epsilon such as 1 the probabilities can differ by a larger amount (they must now be within multiplicative factor 2.71). The reason we use the exp(epsilon) instead of just epsilon is because manipulating formulas that use exp(epsilon) is a lot more straightforward.

We can now design algorithms that answer queries while assuring differential privacy. In short we can now prove we assure a certain definition of privacy! There are still some difficulties in implementing differential privacy in practice. There first one is that it's not clear what a good value for epsilon would be. The second problem is that you cannot ask an unlimited amount of questions (queries) under differential privacy. So in practice a reliable mechanism must still be designed to ensure only a limited amount of queries are answered.

Another downside is that the actual dataset is not released. For researchers being able to visually inspect the dataset and first "play around with it" can be important to gain a better understanding of it. Therefore the problem of releasing an anonymous dataset, while minimizing the probability of a privacy violation, is still an important topic.

Anonymity vs. Privacy

Another important observation is that anonymity does not imply privacy. Take for example k-anonymity. It states that an individual must be indistinguishable to at least k-1 other individuals. Below we give an example of a 4-anonymous database. Based on the non-sensitive attributes, which are used to identify an individual, we notice there are always at least 4 rows having the same values. Hence each individual is indistinguishable with at least 3 other individuals.

If you know a person having Zip code 1303 and an age of 34 is present in the database, he or she can correspond to any of the four last rows. So the anonymity of that individual is preserved. However all four rows specify that he or she has cancer! Hence we learn our targeted individual has cancer. Anonymity is preserved while privacy was violated.

The Ethics of Using Leaked Datasets

Another observation made during my master thesis is that it's hard for academic researchers to obtain dataset to test their algorithms or hypothesis. Even worse is that most are forced to create their own datasets by crawling public profiles on sites such as Twitter, LiveJournal and Flickr. This means their test data only contains public profiles, creating a potentially significant bias in their datasets. Another problem is that researchers won't make these dataset publically available, probably out of fear of getting sued. And that is not an irrational fear, demonstrated by the fact that Pete Warden was sued by facebook for crawling publicly available data. Because these datasets are hard to obtain, peer review becomes difficult. If another, independent, researcher doesn't have access to the data he or she will have a very hard time trying to reproduce experiments and results.

But more interestingly is that there are public dataset available! It's just that no one seems to dare to use them. As mentioned the AOL dataset was publicly released, but researchers are hesitant to use it. Another dataset, called the Facebook100 data, is also a very interesting case. Researchers were given social graphs of 100 American institutions in anonymized form (private attributes such as name, address, etc. were removed). Amazingly the dataset contains all friendship relations between individuals present in the dataset, independent of their privacy settings on facebook. As we've seen an unmodified social graph can be vulnerable to re-identification attacks (see the Kaggle case). A week after the release facebook requested the dataset to be taken down. Deleting data on the internet is hard however, and copies of it are still floating around. Nevertheless, because facebook requested to dataset not to be used, researchers seem to be hesitant to use this dataset.

"Privacy issues are preventing a leap forward in study of human behavior by preventing the collection and public dissemination of high-quality datasets. Are academics overly-sensitive to privacy issues?"

It's clear why researchers aren't touching these datasets. Either they include too much personal information, the data provider has requested to take the dataset down, or out of fear of getting sued. The question is if this behavior really benifits the user, the one who we're trying to protect. The answer is NO. Let's look at all the players in this game:

Researchers don't conduct research on the "leaked" datasets. This hinders scientific progress.
Users are less aware of vulnerabilities as researchers can't demonstrate them.
Companies benefit as they don't get bad press about leaked data & possible privacy issues.
Malicious individuals benefit since vulnerabilities remain unknown, which in turns morivates users to continue sharing information publicly. They are also not limited by moral objections and will use the leaked data if valuable.

Currently only the companies and malicious individuals benefit from this behavior. Scientists and users are actually in a bad position by not allowing/doing research on public/leaked datasets. A utopian view would be that researches conduct analysis on the datasets and make advancements in their field. At the same time users can be warned about potential vulnerabilities caused by the leaked data before hackers will abuse them.

Fear Based Privacy

A mistake to watch out for, at least in my opinion, would be what I call fear based privacy. Yes, you have to remain diligent to make sure the government and companies don't collect too much information. And yes, you should be careful about the information you provide to the public! But one must first carefully consider the arguments for, and against, making information public. A knee-jerk reaction saying that any information you share might potentially be abused by a malicious individual is not a valid argument. It's an argument purely based on fear: "Even if we don't know yet how a hacker could abuse certain information, who knows he or she might find a method in the future, so it's better not to share anything."

Now don't get me wrong here. Currently I believe that an absurd amount of information is being collected, especially on the internet, and that this is not a good thing. But a good balance must be found between the benefit of sharing information, and the potential risks for sharing said information. Compare it to driving a car. It involves the risk of possibly getting in an accident. However, the benefit that a car provides outweighs its risks, and instead of completely avoiding cars we attempt to reduce the probability and severity of accidents.

Interesting Links

33 Bits of Entropy: Privacy related blog by the researcher Arvind Narayanan who is behind the Kaggle de-anonymization. He also has a twitter acount.
Database privacy at Microsoft Research.
28c3 talk by Conrad Lee titled Privacy Invasion or Innovative Science?
A Firm Foundation for Private Data Analysis: A good paper covering the applications of differential privacy.

Special thanks goes to Jan Van den Bussche of University Hasselt (Belgium) for helping me during my master thesis.

Mathy Vanhoef

Pages

Monday 27 February 2012

TEDxUHasselt Salon: How you could be hacked

Wednesday 1 February 2012

Foundations of Privacy