The Pandora Papers shook the world. The massive leak has dominated headlines worldwide and called into question the financial propriety of some of the world's most powerful people.
What are the Pandora Papers?
The Pandora Papers investigation is the world's largest-ever journalistic collaboration, with over 600 journalists from 150 media outlets in 117 countries participating.
The investigation is based on the leak of confidential records from 14 offshore service providers who provide professional services to wealthy individuals and corporations looking to establish shell companies, trusts, foundations, and other entities in low- or no-tax jurisdictions. The entities allow owners to hide their identities from the public and, in some cases, regulators. Frequently, the providers assist them in opening bank accounts in countries with light financial regulations.
The Pandora Papers revelations reportedly came from an enormous amount of data: 2.94 terabytes in total, comprising 11.9 million records and documents dating back to the 1970s. But the question remains: how do you handle a leak of this magnitude?
If this whole incident sounds familiar, it is because it closely echoes the 2016 Panama Papers scandal, which has so far exposed US$2 billion of questionable tax evasion between the 1970s and 2016 (46 years), based on 2.6 terabytes of data in 11.5 million documents.
However, Nik Vora, Vice President of Neo4j APAC, said in an interview that the Pandora Papers presented a massive data management challenge. Whereas the Panama Papers analysed information from just a single provider, the Pandora Papers involved 2.94 terabytes of information in over 11.9 million records dating from the 1970s to 2020, obtained from 14 providers in at least 38 jurisdictions.
“The Pandora Papers connected offshore activity to more than twice as many politicians (330) and public officials [as did the Panama Papers], from more than 90 countries and territories, including 35 current and former country leaders. For comparison, the FinCEN Files investigation has exposed $2 trillion of questionable funds/transactions between 1999 and 2017 (18 years), conducted by 400+ journalists and 108 other media partners in 88 countries,” he explained.
Finding Hidden Connections in a Maze of Unstructured Data
According to a report from the International Consortium of Investigative Journalists (ICIJ), only 4% of the files were structured, with data in tables (spreadsheets, CSV files, and a few "dbf files"). To explore and analyse the data, the ICIJ identified the files containing beneficial ownership information and structured them by company and jurisdiction. Each provider's data required its own process.
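To make the structuring step concrete, here is a minimal Python sketch of how tabular files from different providers might be mapped onto one common record format. It is an illustration only, not the ICIJ's actual pipeline: the provider names, column headings, sample rows and the `OwnershipRecord` type are all invented for this example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class OwnershipRecord:
    """One beneficial-ownership fact in a common shape (hypothetical schema)."""
    provider: str
    company: str
    jurisdiction: str
    owner: str


# Hypothetical column headings: each provider labels the same facts differently.
COLUMN_MAPS = {
    "provider_a": {"company": "Company Name", "jurisdiction": "Jurisdiction",
                   "owner": "Beneficial Owner"},
    "provider_b": {"company": "entity", "jurisdiction": "country",
                   "owner": "ubo_name"},
}

# Invented sample rows, standing in for two providers' CSV exports.
SAMPLE_A = "Company Name,Jurisdiction,Beneficial Owner\nAcme Holdings Ltd,bvi,jane  doe\n"
SAMPLE_B = "entity,country,ubo_name\nBlue Sky Trust,Panama,JANE DOE\n"


def normalise(provider, csv_text):
    """Read one provider's CSV text and return records in the common shape."""
    mapping = COLUMN_MAPS[provider]
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append(OwnershipRecord(
            provider=provider,
            company=row[mapping["company"]].strip(),
            # Upper-case jurisdictions so "bvi" and "BVI" compare equal.
            jurisdiction=row[mapping["jurisdiction"]].strip().upper(),
            # Collapse repeated spaces and unify capitalisation of names.
            owner=" ".join(row[mapping["owner"]].split()).title(),
        ))
    return records


def group_by_company(records):
    """Group owner names under each (company, jurisdiction) pair."""
    grouped = {}
    for record in records:
        grouped.setdefault((record.company, record.jurisdiction), []).append(record.owner)
    return grouped


if __name__ == "__main__":
    combined = normalise("provider_a", SAMPLE_A) + normalise("provider_b", SAMPLE_B)
    for key, owners in group_by_company(combined).items():
        print(key, owners)
```

In this toy case one mapping table per provider is enough. Real leaked data would need far more cleaning, and name matching in particular is much harder than changing capitalisation.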
After structuring the data, the ICIJ used graph platforms such as Neo4j and Linkurious to visualise it and make it searchable. This allowed reporters to investigate connections between people and businesses across providers.
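The idea of following connections across providers can be sketched in a few lines of plain Python. This is a simplified stand-in for what a graph database does at scale, not the ICIJ's queries or the Neo4j API. The people, companies and links below are invented, and the breadth-first search is just one simple way to find a shortest chain of relationships.

```python
from collections import defaultdict, deque

# Invented relationships, as if drawn from two different providers' files.
EDGES = [
    ("Jane Doe", "Acme Holdings Ltd"),   # hypothetical record from provider A
    ("Jane Doe", "Blue Sky Trust"),      # hypothetical record from provider B
    ("John Roe", "Blue Sky Trust"),      # hypothetical record from provider B
]


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for left, right in edges:
        graph[left].add(right)
        graph[right].add(left)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest chain of nodes, or None."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sort neighbours so the traversal order is deterministic.
        for neighbour in sorted(graph[node]):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


if __name__ == "__main__":
    graph = build_graph(EDGES)
    print(shortest_path(graph, "John Roe", "Acme Holdings Ltd"))
```

Here John Roe and Acme Holdings Ltd never appear in the same record, yet the search links them through a shared trust and a shared owner. That is the kind of indirect connection that is easy to miss when each provider's files are read in isolation.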
As Emilia Díaz-Struck, ICIJ's research editor and Latin American coordinator, has said: “The way we use graph databases is always the same: to find hidden connections that are not obvious. If you find a shareholder or a person, could this person also actually be this person or entity you've seen over here, and so be connected to more things I'm not seeing yet. Whenever you have vast amounts of data, your risk is missing what is there; technology and machine-learning, things like graph databases, allow you to see things that sometimes could take you years as a human.”
When asked for examples of information that would have been difficult to find without Neo4j, Nik said the investigation exposed more than 330 politicians from more than 90 countries and territories. They bought real estate, held money in trust, and owned companies and other assets through entities in secrecy jurisdictions, sometimes anonymously.
“The Pandora Papers also reveals how banks and law firms work closely with offshore service providers to design complex corporate structures. The files show that providers don’t always know their customers,” he said.
He added that the dataset has not yet been released, but it will likely be integrated into the Offshore Leaks database, which is powered by Neo4j and Linkurious.
In addition, given that many types of data, such as PDFs, had to be converted into structured data before Neo4j could be used, the question is whether this is common and how much work goes into the preparation.
Could More Be Uncovered Using Graph Technology?
According to Nik, the ICIJ embraces a very modern approach that uses open source technology, including:
Their own platform Datashare, for entity extraction, knowledge sharing and validation.
Python data science toolkits for making sense of messy source data with machine-learning.
A graph-based solution to connect the dots, consisting of Neo4j as the underlying database and Linkurious for visualisation.
“The data extracted by the tooling is then used by journalists to do more ground-research, investigations, local validation and more before writing the stories for their publications to highlight core insights from the investigation,” he explained.
He also mentioned that one challenge with the Pandora Papers was that the data came from 14 different offshore providers, each with its own method of keeping, storing, and structuring data.
“A lot of the data was unstructured, which means there was a massive data cleansing, consolidation and verification effort necessary to get the data in a shape to work with, which also requires a lot of local knowledge,” he said. “Since the ICIJ works in secret, we have little means to help them with their investigations and guide them towards using even more leading technologies like Neo4j Graph Data Science which utilises the contextual information around entities to train and use ML models based on the network structure of data.”