At a time when terrorist attacks are unfortunately a recurring item on the agenda, many regard the very topic of privacy and the protection of confidentiality as suspect in itself.

But is giving up privacy really a necessary evil for guaranteeing the safety of citizens? Or, given the increasingly pervasive use of Big Data, does it risk turning into a boomerang?

The false trade-off between security and privacy

It has become commonplace to think that, in order to be kept safe, citizens must renounce their “claims” to confidentiality.

In reality, as we will see shortly, an increase in the available information does not necessarily bring an increase in the “signal” (that is, the genuinely relevant information); more likely it brings an increase in the “noise” (useless, misleading or simply random information).

The search for the signal then resembles the search for a needle in a haystack, with the aggravating circumstance that, as the information grows, the needle stays the same size while the haystack grows out of all proportion!

Let’s try to clarify the concept with a numerical example.

Looking for a needle in a haystack that grows exponentially

Suppose we have the data on hotel reservations (which obviously include name, surname, address, etc.) made by a very large number of tourists, on the order of 1 billion. We want to find out whether any potential terrorists are among them: for example, two people of different nationality and residence who meet in the same hotel, anywhere in the world, on two different days. We treat such a pattern as suspicious and interpret it as a sign that a possible attack is being planned.

So, let’s summarize the data of our example (see note 1) and try some simple calculations:

  1. the hotel bookings concern 1 billion (10^9) individuals of all nationalities;
  2. every tourist goes to a hotel 1 day out of 100;
  3. we focus our investigation on 100,000 (10^5) hotels, each of which can accommodate 100 people;
  4. our analysis covers a period of 1000 days.
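
As a quick sanity check, the four assumptions above can be written down and tested for internal consistency. The following is only a minimal sketch; the variable names are illustrative and do not come from the original example.

```python
# Assumptions from the list above (names are illustrative, not from the source).
PEOPLE = 10**9          # individuals in the booking data
DAYS_PER_VISIT = 100    # each person is in a hotel 1 day out of 100
HOTELS = 10**5          # hotels under observation
BEDS_PER_HOTEL = 100    # capacity of each hotel
DAYS = 1000             # length of the observation period

# On an average day, 1% of the population is staying in some hotel.
guests_per_day = PEOPLE // DAYS_PER_VISIT

# Total number of people the hotels can host on one day.
total_capacity = HOTELS * BEDS_PER_HOTEL

print(guests_per_day, total_capacity)
```

Both quantities come out at ten million, so the hotels considered are just enough to host everyone who is travelling on a given day.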

Based on these assumptions, let’s start now by evaluating the probability that two people will meet in the same hotel on two different days.

First, consider a single tourist: under assumption 2, the probability that he or she is staying in one of the 100,000 (10^5) hotels on a given day is 0.01 (1/100).

Consequently, the probability that any two given people both visit a hotel on a given day is 0.0001 (the product of the individual probabilities, which we treat as independent: 0.01 x 0.01).

The probability that these two people visit the same hotel on a given day is 0.0001 (10^-4) divided by the number of available hotels (100,000, or 10^5).

Therefore, this probability is equal to:

10^-4 / 10^5 = 10^-9 (0.000000001, or one in a billion).

Similarly, the probability that they visit the same hotel on two specific days (an event which in our hypothesis constitutes the “alarm bell”) is equal to:

10^-9 x 10^-9 = 10^-18
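
The chain of probabilities derived so far can be reproduced with exact fractions, which avoids floating-point rounding at these tiny magnitudes. This is a sketch under the same assumption used in the text (tourists choose days and hotels independently and at random); the variable names are illustrative.

```python
from fractions import Fraction

p_visit = Fraction(1, 100)   # one person is in a hotel on a given day
hotels = 10**5               # number of hotels under observation

# Both people are in some hotel on the same given day (independent choices).
p_both_visit = p_visit * p_visit

# Both are in the *same* hotel on that day: any one of the hotels, equally likely.
p_same_hotel_one_day = p_both_visit / hotels

# The coincidence repeats on a second given day.
p_same_hotel_two_days = p_same_hotel_one_day ** 2

print(p_both_visit, p_same_hotel_one_day, p_same_hotel_two_days)
```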

At this point, we just have to estimate the number of “suspicious encounters” that set off the alarm bell (i.e. two people visiting the same hotel on two different days).

This value is the number of possible pairs of tourists, multiplied by the number of possible pairs of days (drawn from the 1000-day observation period), multiplied by the probability that a given pair of tourists visits the same hotel on two given days (which we already know is 10^-18):

C(10^9, 2) x C(1000, 2) x 10^-18 ≈ (5 x 10^17) x (5 x 10^5) x 10^-18 = 250,000
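
The estimate can be checked without the rounding used above by computing the binomial coefficients exactly. A minimal sketch, using only the Python standard library; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

people_pairs = comb(10**9, 2)   # ways to choose 2 tourists out of 1 billion
day_pairs = comb(1000, 2)       # ways to choose 2 days out of 1000
p_two_days = Fraction(1, 10**18)  # probability derived in the text

# Expected number of purely coincidental "suspicious" pairs.
expected_pairs = people_pairs * day_pairs * p_two_days

print(f"{float(expected_pairs):,.0f}")
```

The exact figure is slightly below the rounded 250,000, because C(n, 2) is n(n-1)/2 rather than n^2/2; the order of magnitude is unchanged.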

In other words, we would have to investigate about 250,000 pairs of “suspects”, people who may well be completely innocent, since two individuals turning up in the same hotel on two different days is a coincidence that chance alone produces at this scale (as our probabilistic estimate has just shown).
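
To see how the haystack grows as more data is collected, the same estimate can be wrapped in a small function and evaluated for longer observation periods. This is only a sketch: the function name and parameters are hypothetical, and the population and number of hotels are held fixed as an assumption while the number of days varies.

```python
from fractions import Fraction
from math import comb


def expected_suspicious_pairs(people: int, hotels: int, days: int,
                              visit_prob: Fraction = Fraction(1, 100)) -> Fraction:
    """Expected number of pairs seen in the same hotel on two different days,
    assuming everyone picks days and hotels independently and at random."""
    p_same_hotel_one_day = visit_prob * visit_prob / hotels
    p_two_days = p_same_hotel_one_day ** 2
    return comb(people, 2) * comb(days, 2) * p_two_days


# Same population and hotels as in the text; only the observation period changes.
for days in (100, 1000, 10_000):
    estimate = expected_suspicious_pairs(people=10**9, hotels=10**5, days=days)
    print(f"{days:>6} days -> about {float(estimate):,.0f} chance pairs")
```

Under these assumptions a tenfold longer observation period yields roughly a hundredfold more coincidental pairs, while the number of real plotters (the needle) does not change.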

In addition to intruding into the private lives of a disproportionate number of harmless (and innocent) citizens, the police would be faced with an investigative effort that is simply unsustainable in practice.

It is for reasons like these that a security project proposed by the Bush administration in 2002 under the evocative name “Total Information Awareness” was shut down early and not refinanced.

But in 2002 the Big Data Analytics “craze” had not yet exploded, and since then many seem to have been reconsidering the idea…

  1. The example is adapted from the one presented in the masterful text “Mining of Massive Datasets” by Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman.