“If we torture data long enough, they will confess
(revealing the secret messages that God has sent us)”

R. Coase (freely readapted quote)

The search for “hidden” connections and obscure meanings within data, if not supported by rigorous scientific methodology, can lead us to detect correlations produced simply by chance (also known as “spurious correlations”), which arise all the more easily as the size of the datasets grows.

In this sense, the example of numerology is instructive: an ancient practice that has survived to the present day, it has come back into vogue thanks to Dan Brown’s famous book “The Da Vinci Code”.

The results obtained through these practices have no scientific value (and should therefore remain confined to fiction).

Science doesn’t play with numbers

Nevertheless, numbers retain their “halo” of plausibility, owing to their strong power of suggestion, which is tied to their “narrative” characteristics, such as:

  • reconstructing the “facts” so that they form a complete story that appears to correspond to the “truth” (by leveraging confirmation bias);

  • relying on “cherry-picked” numerical samples, thereby evoking an “appearance of scientificity” (by virtue of the “data-driven” fallacy, according to which “data always speak for themselves”).

We discuss numerology here precisely to avoid falling into the same methodological errors when dealing with large amounts of data, as in Big Data analytics.

Let’s see how to avoid falling prey to data illusions.

Everything was already written in the Bible…

To persuade people of the genuine nature of such “predictions”, many authors resort to the “halo of sacredness” commonly associated with sacred texts.

In fact, they claim, notorious attacks (such as the assassination of President Kennedy or, more recently, that of Israeli Prime Minister Yitzhak Rabin) were “predicted” by the sacred texts, if only we had been able to “read between the lines”, as proposed by Michael Drosnin, author of the famous book “The Bible Code”.

Similar “experiments” were conducted by authors such as Doron Witztum, Eliyahu Rips and Yoav Rosenberg, and they even appeared in reputable publications, such as the journal Statistical Science.

Drosnin even challenged the skeptics to prove that such coincidences were merely the work of chance, stating that he would change his mind only if someone were able to find similar “premonitions” in “profane” writings such as Melville’s novel “Moby Dick”.

…but it was Chance that provided the Ink

To his bewilderment, Drosnin promptly got what he had asked for.

Drosnin had not actually revealed anything secret or “premonitory” that could not be extracted from any text, provided that it is long enough.
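The effect can be reproduced on any text. The sketch below assumes that the “hidden messages” are searched as equidistant letter sequences (a word spelled by reading every k-th letter), the kind of search usually associated with “Bible code” claims. The function name `find_els`, the random “text” and the search word are illustrative assumptions, not taken from Drosnin’s or McKay’s work.

```python
import random
import string


def find_els(text, word, max_skip):
    """Return (start, skip) pairs where `word` is spelled by every skip-th letter.

    Hypothetical helper for illustration: punctuation and spaces are dropped,
    and skips from 1 to max_skip are tried in the forward direction only.
    """
    letters = [c for c in text.lower() if c.isalpha()]
    word = word.lower()
    n, m = len(letters), len(word)
    hits = []
    for skip in range(1, max_skip + 1):
        span = (m - 1) * skip  # distance between first and last letter
        for start in range(n - span):
            if all(letters[start + i * skip] == word[i] for i in range(m)):
                hits.append((start, skip))
    return hits


if __name__ == "__main__":
    # A "text" made of purely random letters: no author, no message.
    rng = random.Random(42)
    text = "".join(rng.choice(string.ascii_lowercase) for _ in range(20000))
    for length in (500, 5000, 20000):
        hits = find_els(text[:length], "rat", max_skip=30)
        print(f"{length:>6} letters -> {len(hits)} hidden occurrences of 'rat'")
```

Because the shorter texts are prefixes of the longer one, the number of hits can only stay the same or grow with the length. For random letters the expected count grows roughly in proportion to the text length multiplied by the number of skips tried, so a long enough text will “contain” almost any short word.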

Accepting Drosnin’s own challenge, Prof. Brendan McKay was able to find, even in the “profane” novel “Moby Dick”, “predictions” of attacks that would take place only many years later.

Below we report some of the most striking “predictions” from Melville’s famous novel:

  • the murder of Indian Prime Minister Indira Gandhi (on October 31, 1984);

  • the assassination of the Reverend Martin Luther King (Tennessee, April 4, 1968);

  • the death of Lady Diana.

In addition, of course, to the attacks on J.F. Kennedy and Y. Rabin, already “foretold” by the Bible (according to Drosnin):

  • the attack on Kennedy;

  • the attack on Rabin.

In conclusion,

“Prediction” is in the data, as much as
“Truth” (and Beauty) is in the eye of the beholder

As a matter of fact, the risk of running into random (spurious) correlations increases as the amount of available data grows.

Just as it is possible to draw false conclusions from a sufficiently long text, it is equally possible to draw any conclusion from large amounts of data by “cherry picking”.
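A small simulation can make this growth tangible. The sketch below is only an illustration under stated assumptions: the variables are independent Gaussian noise, the helper names `pearson` and `strongest_chance_correlation` are hypothetical, and “cherry picking” is modelled as reporting only the strongest correlation among all pairs of variables.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def strongest_chance_correlation(series):
    """Cherry-pick the largest |r| among all pairs of series."""
    return max(abs(pearson(x, y)) for x, y in itertools.combinations(series, 2))


if __name__ == "__main__":
    rng = random.Random(0)
    n_obs = 20  # observations per variable
    # 200 variables that are unrelated by construction (pure noise).
    noise = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(200)]
    for k in (5, 20, 200):
        best = strongest_chance_correlation(noise[:k])
        print(f"{k:>3} unrelated variables -> strongest |r| found: {best:.2f}")
```

Since each smaller set of variables is contained in the next larger one, the strongest correlation found can never decrease as variables are added: the more data we collect, the more impressive the best “discovery” looks, even though every variable is noise.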

How to detect (and unmask) the distortions (biases) due to chance will be the subject of upcoming posts.

Stay Tuned!