Self-learning computer programs can sift through large volumes of data and detect regularities. Such programs are used, for example, in modern spam filters: they observe which e-mails users mark as spam and learn from this what characterizes spam, without being explicitly told how to recognize it.
Detecting correlations in collected patient data
Machine learning of this kind helps not only to improve spam filters but, more generally, wherever large data sets need to be examined for patterns, for example in search engines, in automatic speech recognition or in the analysis of medical data. Here, algorithms could recognize correlations in a stack of patient files and, for example, find criteria for diagnosing diseases at an early stage.
"However, it must be ensured that the privacy of individual patients is protected," warns Professor Dr. Hans Simon from the Chair of Mathematics and Computer Science. Together with doctoral students Francesco Alda and Filipp Valovich, he is working on the question of how sensitive data can be statistically analyzed without revealing the anonymity of the individuals involved.
Specifically, the mathematicians are investigating a form of machine learning that is restricted to counting queries. It works as follows: If an algorithm is to evaluate a patient database, for example, it is only allowed to ask yes/no questions and count for how many patients the answer is "yes."
Allowed questions are, for example: Is the person a smoker? Is the person male? Does the person weigh more than 80 kilograms? When a learning algorithm sends these questions to a patient database, it receives three numbers in response: the number of smokers, the number of men, and the number of patients heavier than 80 kilograms.
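To make the restriction concrete, here is a minimal sketch in Python of how such counting queries might look; the records, field names and the 80-kilogram threshold are invented for illustration and are not taken from the project.

```python
# Minimal sketch: counting queries on a toy patient table.
patients = [
    {"smoker": True,  "sex": "m", "weight_kg": 92},
    {"smoker": False, "sex": "f", "weight_kg": 63},
    {"smoker": True,  "sex": "m", "weight_kg": 78},
]

def counting_query(records, predicate):
    """Ask a yes/no question of every record and return how often the answer is 'yes'."""
    return sum(1 for r in records if predicate(r))

print(counting_query(patients, lambda r: r["smoker"]))          # number of smokers
print(counting_query(patients, lambda r: r["sex"] == "m"))      # number of men
print(counting_query(patients, lambda r: r["weight_kg"] > 80))  # patients over 80 kg
```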
Make data noisy
The principle of counting queries sounds anonymous - after all, the result is merely a statistical summary. Nevertheless, it is possible to obtain information about individual patients. Hans Simon and his colleagues are therefore investigating mechanisms such as "randomized response schemes" that are designed to protect the privacy of patients.
In this process, the patient data are made noisy, i.e., randomly altered. Put simply, it is like rolling a die for each patient and adding the number rolled to the values in that patient's record. The process changes individual patient data drastically and unpredictably, but ideally it is no more noticeable in statistical summaries than the random variation already present in the data.
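As an illustration of this kind of scheme, the following sketch flips each patient's yes/no answer with some probability before counting, a textbook randomized-response mechanism; the flip probability and the unbiasing step are standard choices for illustration, not details quoted from the Bochum project.

```python
import random

def randomized_response(true_answers, p=0.75):
    # Each patient reports the true answer with probability p and the flipped
    # answer otherwise; p = 0.75 is an illustrative choice.
    return [a if random.random() < p else not a for a in true_answers]

def estimate_true_count(noisy_answers, p=0.75):
    # E[noisy yes-count] = p*T + (1-p)*(n-T), so the true count T can be
    # recovered on average from the observed noisy count.
    n = len(noisy_answers)
    observed = sum(noisy_answers)
    return (observed - (1 - p) * n) / (2 * p - 1)

# Example: 1,000 patients, 300 of whom truly answer "yes" (e.g. smokers).
truth = [True] * 300 + [False] * 700
noisy = randomized_response(truth)
print(sum(truth), round(estimate_true_count(noisy)))  # true count vs. noisy estimate
```

Any individual noisy answer says little about the patient it came from, yet the aggregate count can still be estimated well.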
What does anonymous mean?
To come close to the ideal case, the noise must satisfy certain conditions. "Our challenge is to find and clearly formulate such conditions," explains Hans Simon. "So we are looking for conditions on the random noise that, on the one hand, ensure that the individual data are changed so much that patients can no longer be identified, but on the other hand guarantee that the altered dataset still gives the learning algorithm answers similar to those of the original dataset."
This is quickly described in one sentence, but in practice it means painstaking, intricate intellectual work for the mathematicians. First, they must define exactly what the terms at issue mean: what exactly should it mean that patients remain "anonymous"? What exactly should it mean that the responses to the learning algorithm are "similar"?
They then look for conditions from which it can be inferred that noise anonymizes patient data but does not substantially corrupt statistical summaries.
Tracking down individual patients in a database
In their search, the mathematicians led by Hans Simon rely on the concept of "differential privacy." It goes back to the U.S. computer scientist Cynthia Dwork and answers the question of what anonymity can mean in strictly mathematical terms: differential privacy is a measure of how well an individual patient can be identified in a database.
To see the idea, assume one has a concrete counting query and two databases A and B that differ only in a single patient. Random noise satisfies differential privacy if it makes virtually no difference whether database A or database B was queried; that is, if the probability of any given answer to the counting query is nearly the same in both cases.
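For orientation, Dwork's standard formulation of this condition can be written as follows (the article itself gives only the informal version): a random mechanism M is \(\varepsilon\)-differentially private if, for all databases A and B differing in a single patient and every set S of possible answers,

\[
\Pr[M(A) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(B) \in S],
\]

where a small \(\varepsilon\) means that the answer distributions for A and B are almost indistinguishable.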
"So the overall result of the query does not depend so much on a single patient," Hans Simon explains. Similarly, he and his colleagues use exact definitions to determine mathematically exactly what it means that similar statistical conclusions can be drawn from the noisy data as from the original data; this is referred to as accuracy.
Representing data as vectors
Hans Simon's team wants to meet both requirements when adding noise to the data, that is, to ensure both differential privacy and accuracy. To do this, the researchers used a trick: they recognized a connection to another topic in mathematics, so-called linear arrangements, and translated the problem into that setting.
The scientists represent each individual patient record and each allowed counting query as a vector, that is, as an arrow in a geometric space. The answer is "yes" if the arrows of the patient record and the counting query form an acute angle, and "no" if they form an obtuse angle.
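A small sketch of this geometric reading, with made-up two-dimensional vectors (the actual construction uses its own, higher-dimensional representation):

```python
import numpy as np

# Each row is one patient record encoded as a vector; the query is also a vector.
# The concrete numbers are invented for illustration.
records = np.array([[ 0.8,  0.6],
                    [-0.7,  0.7],
                    [ 0.9, -0.4]])
query = np.array([1.0, 0.2])

# Positive inner product -> acute angle -> answer "yes";
# negative inner product -> obtuse angle -> answer "no".
answers = records @ query > 0
print(answers.sum())  # the counting query's result: how many records answer "yes"
```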
Adding noise to vectors instead of original data
Now the noise is no longer added to the original patient data, but to the arrows assigned to them. Hans Simon and his team have proven that noising works well with this method, that is, it satisfies both differential privacy and accuracy. The prerequisite is that the acute angles are really acute, i.e. significantly smaller than 90 degrees, and the obtuse angles are really obtuse, i.e. significantly larger than 90 degrees.
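One way to write this prerequisite down, assuming records \(u_i\) and queries \(v_j\) are represented by unit vectors, is a margin condition with some \(\gamma > 0\) (the symbols and the exact form are an illustrative reading, not quoted from the project):

\[
\langle u_i, v_j \rangle \ge \gamma \ \text{for a "yes" answer},
\qquad
\langle u_i, v_j \rangle \le -\gamma \ \text{for a "no" answer},
\]

so that no angle between a record and a query comes too close to 90 degrees.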
The mathematicians from Bochum have thus reached a first interim goal, but they want to extend their results. "Translating patient data and algorithm queries into geometric space is a major hurdle," laments Hans Simon. "There is no general recipe for how to do it. Moreover, there are types of database queries that are known to have no suitable geometric representation."
It would be great if real data could be replaced by synthetic data.
Hans Simon has a vision based on an idea by U.S. researcher Avrim Blum: "It would be great if real data could be replaced by synthetic data, from which you could no longer extract information about individual patients, but from which you could still draw the same statistical conclusions." One could then make such synthetic data available to the general public without restriction. "From results already available in the research community, we know that this ideal goal will not be achievable," says Hans Simon. But it could be a driving force that produces further important results in the tension between statistical data analysis and privacy.
The original article appeared in the RUB news.