Revealing the Hidden Structure of Corruption

December 9, 2021


How do you solve a problem like corruption? In this Societal Resilience case study, learn about our development of data tools that could bring new levels of transparency to public procurement data, building on collaborations with the World Bank and Inter-American Development Bank and supporting the new Microsoft ACTS (Anti-Corruption Technologies and Solutions) initiative.

The annual cost of corruption is estimated as more than $2.6 trillion, or 5% of the global gross domestic product (GDP). It is a problem that disproportionately affects those living in extreme poverty, especially when they are forced to spend a sizeable proportion of their income on bribe payments just to access basic public services like healthcare and clean water. However, the problem isn’t just about the loss of income – it’s about the loss of investment in public services when government funds are diverted by corrupt officials. Those in poverty end up paying more, and suffering more, for less. The resulting loss of faith in government and the rule of law can be devastating.

Corruption is commonly defined as the abuse of public office for private gain. But where do the potential gains come from? One major source is the staggering $10 trillion per year that governments award in contracts for public goods and services. The associated procurement process, in which bids are submitted by prospective suppliers and selected by government officials, represents an ideal opportunity for corrupt officials to award contracts based not on merit but on bribery, nepotism, cronyism, and other forms of self-interest.

While opportunistic acts of corruption are damaging, it is the systematic practice of corruption that erodes public trust and entrenches power in the hands of the few. The resulting power networks permeate both government and business, yet are all but invisible to the citizens who suffer under their influence. Building a more resilient society means addressing corruption as both a violation of fundamental human rights and the source of collective harm to societal development.

A problem of relationships and relatedness

Corrupt actors can go to great lengths to conceal evidence of the relationships that represent the means, motive, and opportunity to engage in corruption.

In the case of grand corruption, these are relationships between those in power (e.g., government officials) and those benefiting from the exercise of that power (e.g., favored suppliers awarded government contracts). In the case of collusion and cartel formation, they are relationships between coordinating suppliers operating under the guise of competitive independence. In both cases, corrupt actors use a variety of methods to suppress evidence of the relationships that facilitate their corruption, including use of front persons, bid rigging, and opaque ownership structures (e.g., anonymous shell companies, trusts, or foundations).

However, in order to benefit from corruption, corrupt actors must act in observable ways – and it is from the combination of company registration data and observable patterns of activity that we can develop an overall measure of relatedness that allows us to reason about possibly suppressed relationships. For example, the coordinated activity required for bid rigging, whether through bid rotation or suppression, creates distinctive patterns of activity that take on added significance in the presence of other information connecting the bidding companies. This could be as little as a shared phone number, email address, or postal address, or even common use of a legal representative or document template.
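As a rough illustration of how shared registration details can contribute to a relatedness signal, the sketch below scores pairs of companies by the overlap of their contact details. The company names, the records, and the choice of Jaccard overlap are all assumptions made for this example; this is not the relatedness measure used in our work, which also draws on patterns of activity.

```python
from itertools import combinations

# Hypothetical registration records: contact details per company.
companies = {
    "Alpha Ltd": {"phone:555-0101", "addr:1 Main St", "email:bids@alpha.example"},
    "Beta SA": {"phone:555-0101", "addr:1 Main St", "email:info@beta.example"},
    "Gamma Inc": {"phone:555-0199", "addr:9 Hill Rd", "email:sales@gamma.example"},
}

def jaccard(a, b):
    """Share of contact details two companies have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def shared_detail_scores(records):
    """Score every unordered pair of companies by overlapping contact details."""
    return {
        (x, y): jaccard(records[x], records[y])
        for x, y in combinations(sorted(records), 2)
    }

scores = shared_detail_scores(companies)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A high score here is only a prompt for a closer look: a shared address or phone number can have entirely legitimate explanations.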

Statistical indicators of relatedness do not guarantee a real-world relationship, of course, just as a real-world relationship does not imply corruption. We must always be cautious about drawing false conclusions. However, no matter how we interpret close relatedness in any given context, tracking the existence of potential relationships is important for understanding how risk and influence could propagate. It is transparency into potential pathways, supported by evidence from source data, that could allow us to make the connections that would otherwise evade detection.

The need for a transparency engine

There is a broad consensus in the anti-corruption field that greater transparency can help reveal corruption in action, reducing the ability of corrupt actors to go undetected and unpunished. Open data is a key enabler of transparency, with initiatives like Open Contracting, Open Corporates, and Open Ownership helping to seed an open data ecosystem that can facilitate collective action, both by those in government as well as those holding governments accountable for their actions.

However, open data only creates transparency if it is actually used, and that requires tools that enable domain experts to view, explore, and make sense of the relevant data for themselves. While open data standards like the Open Contracting Data Standard (OCDS) are being adopted by governments around the world, OCDS tools remain at an early stage in their development with significant opportunity for innovation. In particular, the potential for new technology to transform anti-corruption activity can draw inspiration from the development of breakthrough technology in another related area – web search.

In the early days of the web, ranking web pages didn’t take into account the global structure of those pages – it relied on local content matches. This is like matching a beneficial ownership query against a published dataset that is sparsely populated with a high proportion of “adversarial” content (deliberate omission, misrepresentation, misspelling, etc.). Google’s PageRank algorithm transformed the quality of search results by taking into account the global structure created by hyperlinks. With beneficial ownership and control, we similarly need to unlock the implicit value of the explicit structures found in open datasets.

Just like a search engine, we need an anti-corruption “transparency engine” that collects, integrates, and presents the right information for users to make the right decision – but it cannot make the decision for them. How a transparency engine derives and communicates evidence of relatedness is critical for human decision making around whether and how to respond to potentially material relationships.

Today, on the United Nations’ International Anti-Corruption Day 2021, Microsoft is announcing a new effort to develop the kind of transparency engine that could help to reveal the influence of common beneficial ownership and control, even in the absence of accurate ownership data. While the real work is only just beginning, it is informed by research explorations spanning multiple years and partners.

We now review this history as an introduction to our current proof-of-concept, before concluding with the announcement of two new tools, released on GitHub today, that further strengthen the ability of the community to engage in collective, evidence-based action in the fight against corruption.

Learning from domain and research partners

There is one group of organizations in a privileged position to observe corrupt networks in action, and more importantly, to cut off the finance that sustains them: the banks that lend governments the money to pay for their public projects.

In 2018, we partnered with the World Bank to explore new ways in which data visualization and AI could help to identify and mitigate corruption risks in the public procurement process. Our resulting proof-of-concept, presented at the biennial Anti-Corruption Collective Action Conference hosted by the Basel Institute on Governance, used Microsoft Power BI to create rich visual interfaces to both private datasets and “red flag” definitions provided by the World Bank.

Our World Bank collaboration also saw our first explorations towards quantifying relatedness, via various forms of spectral graph embedding closely related to the PageRank algorithm. Using both Adjacency and Laplacian Spectral Embedding (ASE and LSE), we were able to map the discrete graph structure of observed company relationships into a continuous vector space. These mappings preserve neighborhood structure while also accounting for the global structure of the graph, such that nearby companies in the graph tend to be placed at nearby points in the embedded space. This allows the distances between company vectors to define their relatedness, even in the absence of observed relationships.
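To make the idea concrete, here is a minimal sketch of adjacency spectral embedding written directly with NumPy rather than with any particular library. The seven-company graph is hypothetical, and the choice of two embedding dimensions is an assumption made for the example. Companies 0 and 1 never link to each other but share the same neighbors, so they land at the same point in the embedded space.

```python
import numpy as np

def adjacency_spectral_embedding(adjacency, dims):
    """Embed nodes using the top `dims` eigenpairs (by magnitude) of a symmetric adjacency matrix."""
    values, vectors = np.linalg.eigh(adjacency)
    top = np.argsort(-np.abs(values))[:dims]
    # Scale each eigenvector by the square root of its eigenvalue's magnitude.
    return vectors[:, top] * np.sqrt(np.abs(values[top]))

# Hypothetical observed relationships between companies 0-6.
# Companies 0 and 1 have no direct link, but both link to companies 2 and 3.
edges = [(0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6)]
n = 7
adjacency = np.zeros((n, n))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1.0

embedding = adjacency_spectral_embedding(adjacency, dims=2)

def distance(i, j):
    """Euclidean distance between two companies in the embedded space."""
    return float(np.linalg.norm(embedding[i] - embedding[j]))

print("0 vs 1 (no direct link, same neighbors):", round(distance(0, 1), 3))
print("0 vs 2 (direct link):", round(distance(0, 2), 3))
print("0 vs 4 (different group):", round(distance(0, 4), 3))
```

In practice the number of dimensions and the way distance is converted into a relatedness score both need care; the sketch only shows why an unobserved relationship can still produce a small distance.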

Using this technique, we were able to answer one of the most challenging questions posed by the World Bank – how to reveal the implicit connections between colluding suppliers who never bid or win together precisely because they don’t want any documented evidence of their relationships.

In 2020, we used the same approach in a collaboration with the Inter-American Development Bank (IDB) looking specifically at open procurement data from Colombia. This was part of the new Microsoft ACTS (Anti-Corruption Technologies and Solutions) initiative launched on UN Anti-Corruption Day 2020 – continuing a longstanding collaboration with IDB. This work, covered in three ACTS features (part 1, part 2, part 3), saw us extend our use of spectral methods to the joint embedding of dynamic graphs that evolve over time. The approach, Omnibus embedding, had previously been developed by our research collaborators at Johns Hopkins University and successfully applied by our team to a wide range of problems. In the case of our IDB collaboration, we used it to measure the change in the behavior of a company over time and detect anomalous patterns of activity in terms of contract awards.

In our current work, we are collaborating with the University of Bristol on the application of new methods for the joint embedding of dynamic multipartite graphs that span all relevant parties of the procurement ecosystem and their linked attributes: buyer and supplier organizations; tenders and line items; company owners and contact details; and so on. The underlying method of Unfolded Spectral Embedding (USE) offers a principled statistical foundation for comparing behavior at different points in time, with provable stability guarantees that constant node behavior at any time results in a constant node position. The same method also allows for combining behavioral signals from all time periods into a single vector representation for each node, enabling state-of-the-art statistical inference. These are precisely the qualities we need to establish a principled measure of relatedness, informed by manifold geometry in the embedded space, that accounts for all the different kinds of relationship that can be observed in real-world data.
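The stability property can be seen in a small sketch of the unfolded approach for the simplest case: a single node type observed over two time periods. The graphs, the helper names, and the two embedding dimensions are assumptions for illustration, and the multipartite setting described above is not covered. The snapshots are placed side by side and decomposed once, giving one all-time vector per node plus one vector per node per period.

```python
import numpy as np

def unfolded_spectral_embedding(snapshots, dims):
    """Embed same-sized adjacency matrices via one SVD of their side-by-side concatenation."""
    n = snapshots[0].shape[0]
    unfolded = np.hstack(snapshots)                 # shape: n x (n * periods)
    u, s, vt = np.linalg.svd(unfolded, full_matrices=False)
    scale = np.sqrt(s[:dims])
    anchor = u[:, :dims] * scale                    # one all-time vector per node
    dynamic = vt[:dims].T * scale                   # one vector per node per period
    per_period = [dynamic[t * n:(t + 1) * n] for t in range(len(snapshots))]
    return anchor, per_period

def adjacency(n, edges):
    """Build a symmetric 0/1 adjacency matrix from an edge list."""
    a = np.zeros((n, n))
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    return a

# Hypothetical activity: node 0 keeps the same neighbors in both periods,
# while node 5 switches from neighbors {3, 4} to neighbors {1, 2}.
n = 6
period_1 = adjacency(n, [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)])
period_2 = adjacency(n, [(0, 1), (0, 2), (1, 2), (3, 4), (1, 5), (2, 5)])

anchor, per_period = unfolded_spectral_embedding([period_1, period_2], dims=2)
drift = [float(np.linalg.norm(per_period[1][i] - per_period[0][i])) for i in range(n)]
print("movement of node 0 between periods:", round(drift[0], 6))
print("movement of node 5 between periods:", round(drift[5], 6))
```

Because both periods are embedded against the same shared basis, a node whose connections do not change is mapped to the same position in each period, while a node whose connections change moves.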

Transparency engine proof-of-concept

Our proof-of-concept solution uses dynamic multipartite graph embedding with USE to create vector-based representations of each company that incorporate information from its historic procurement activity, ownership structure, and other sources. The resulting “transparency engine” aims to detect otherwise undiscoverable sources of relatedness, and then breaks them down into visual explanations that can be explored and evaluated in the context of Microsoft Power BI. The following sequence of images demonstrates our approach applied to open government data from Brazil.

Our experimental transparency engine has shown that much can be achieved with existing open datasets, even with imperfect information on company ownership. For many kinds of anti-corruption analysis, it isn’t necessary to know precisely who controls a given company, but whether a given group of companies have sufficiently strong relatedness to suggest potential non-independence, prompting deeper investigations into possible coordination, collusion, or common beneficial control. So while establishing the ultimate beneficial owners of a company remains a significant challenge – and one that requires new global policies and reporting requirements for meaningful change – any future increase in the quality of ownership data will only improve the quality of inferred relationships.

Our embedding-based relatedness model also offers a principled foundation for the propagation of risk. We are continuing to work with our partners to understand how red flags, typically defined at the level of individual entities, could diffuse through the embedded space to create a measure of relational risk exposure – the aggregate risk that an entity is exposed to via the overall structure of all entity relationships. The promise of this ongoing research is that it could automate much of the manual due diligence work that happens today – exploring all possible pathways by which an entity could be exposed to, or expose others to, corruption risk. This represents an even greater level of transparency into the procurement ecosystem as a whole, and a new kind of tool with which to detect, disrupt, and ultimately prevent corrupt activity.
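Since this research is ongoing, the following is only a toy sketch of what relational risk exposure might look like, not a description of a finished method. It assumes each entity has its own red-flag score between 0 and 1, that pairwise relatedness scores between 0 and 1 are already available, and that risk is combined with a simple "noisy-or" rule over direct relationships only. The entity names, scores, and the combination rule are all hypothetical.

```python
def relational_risk_exposure(own_risk, relatedness):
    """Combine each entity's own risk with relatedness-weighted risk from other entities.

    Uses a one-step noisy-or: exposure = 1 - (1 - own) * product(1 - relatedness * other).
    """
    exposure = {}
    for entity, risk in own_risk.items():
        not_exposed = 1.0 - risk
        for other, other_risk in own_risk.items():
            if other == entity:
                continue
            weight = relatedness.get(frozenset((entity, other)), 0.0)
            not_exposed *= 1.0 - weight * other_risk
        exposure[entity] = 1.0 - not_exposed
    return exposure

# Hypothetical inputs: only Supplier A carries a red flag of its own.
own_risk = {"Supplier A": 0.8, "Supplier B": 0.0, "Supplier C": 0.0}
relatedness = {
    frozenset(("Supplier A", "Supplier B")): 0.9,
    frozenset(("Supplier B", "Supplier C")): 0.1,
}

exposure = relational_risk_exposure(own_risk, relatedness)
for entity, value in exposure.items():
    print(entity, round(value, 2))
```

Supplier B has no red flags of its own but inherits most of Supplier A's risk through their strong relatedness. Supplier C stays at zero because this sketch does not follow multi-step pathways, which a fuller treatment would need to do.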

Open data tools for the fight against corruption

So how do you solve a problem like corruption? There is no easy answer, but openness and transparency provide a clear path forward.

In early 2022, we will release an open-source version of our transparency engine that can be adapted and deployed for real-world use. The core algorithms for ASE, LSE, and Omnibus embedding, among others, are already available in the Microsoft graspologic package for graph statistics in Python, and we are currently working to incorporate USE and related methods.

Today, on UN Anti-Corruption Day 2021, we are also releasing two new tools that further support real-world evidence development in the fight against corruption, building on our parallel efforts in the fight against human trafficking (MSR blog, AI for Business blog, TechRepublic, TechCrunch, GeekWire).

The first release is an update to our Synthetic Data Showcase tool for privacy-preserving data sharing that reimplements the core data synthesis and aggregation components in Rust. This enables compilation to WebAssembly for optimized execution in the browser, which in turn allows us to convert our previous command line tool into an interactive client-side web application with no data ever leaving the device. Users are thereby able to curate their data release by making column selections and transformations that control the dimensionality of the sensitive dataset. This process is itself informed by metrics describing the privacy and utility of the synthetic dataset, which is regenerated on demand in a real-time feedback loop until it meets the requirements for release.

In the context of corruption, Synthetic Data Showcase could help to generate new kinds of open data that describe actual instances of corruption risk (e.g., detected using OCDS red flag definitions), not just the systems of activity in which such risks may be identified (e.g., procurement data published using OCDS). By sharing the characteristics of detected risks – but not in a way that is linkable to any individual or company – the anti-corruption community can more easily share data that can be used for higher-level risk mapping and evidence development.

The second release is a public preview of our new ShowWhy tool designed to support the kind of causal evidence development that can inform public policy. Causation represents a higher standard of evidence than the kinds of associations and correlations discovered during exploratory data analysis, but making causal claims from real-world data – in contrast to data obtained through randomized controlled trials, experiments, or A/B tests – is challenging. Since observational datasets are inherently biased, models of the causal relationships affecting both the domain and the data collection process are necessary to correct for this bias.
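A small worked example shows why this correction matters. The counts below are invented, and the sketch assumes that contract size is the only factor influencing both whether a tender is transparent and whether it is red-flagged. It uses plain stratification to adjust for that factor, which is one simple form of adjustment and not a description of how ShowWhy works internally.

```python
# Hypothetical counts: (contract size, transparent tender?, tenders, red-flagged tenders)
observations = [
    ("large", True, 80, 24),
    ("large", False, 20, 8),
    ("small", True, 20, 2),
    ("small", False, 80, 16),
]

def flag_rate(rows):
    """Fraction of tenders in the given rows that were red-flagged."""
    total = sum(n for _, _, n, _ in rows)
    return sum(flagged for _, _, _, flagged in rows) / total

def naive_effect(rows):
    """Difference in flag rate, transparent minus non-transparent, ignoring contract size."""
    treated = [r for r in rows if r[1]]
    control = [r for r in rows if not r[1]]
    return flag_rate(treated) - flag_rate(control)

def adjusted_effect(rows):
    """Average the within-stratum differences, weighted by each stratum's share of tenders."""
    total = sum(n for _, _, n, _ in rows)
    effect = 0.0
    for stratum in sorted({r[0] for r in rows}):
        in_stratum = [r for r in rows if r[0] == stratum]
        weight = sum(n for _, _, n, _ in in_stratum) / total
        effect += weight * naive_effect(in_stratum)
    return effect

naive = naive_effect(observations)
adjusted = adjusted_effect(observations)
print("naive difference in flag rate:", round(naive, 3))
print("adjusted difference in flag rate:", round(adjusted, 3))
```

In this invented data, the naive comparison suggests that transparency slightly increases red flags, simply because large contracts are both riskier and more often transparent. Within each contract size, transparency is associated with a ten-point reduction, and the adjusted estimate reflects that.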

ShowWhy aims to make the end-to-end process of causal inference accessible to domain experts, using the Microsoft Research DoWhy and EconML Python packages behind the scenes in an easy-to-use, no-code application. ShowWhy guides the user through the process of defining all the data variables, causal graphs, and effect estimators necessary to answer a causal question. The end result of this process is a collection of interactive summaries, describing both the process and results, that can be openly presented and defended to a range of audiences.

ShowWhy could enable a significantly broader cross-section of the anti-corruption community to develop evidence about the causes and consequences of corruption, adding to existing work that shows, for example, that tender transparency reduces corruption risks.

Open data policy and technology go hand-in-hand, but the current state of the art in anti-corruption tools is only scratching the surface of what is possible with modern approaches to data science and machine learning. And only by incorporating advances in visual analytics and HCI can we hope to build accessible data tools that realize the transparency promise of open data, for diverse users and audiences, spanning all spheres of government, business, and society.

Graph showing vector representations for comparing company behavior

Vector representations for comparing company behavior over a 3.5-year time period (projected down from the higher-dimensional embedded space for visualization purposes). The same company is shown in the same color, once per quarter, with changes in position representing changes in bidding activity (i.e., the buyers, tenders, and items associated with bids made by the company). Different companies with similar activity in any pair of time periods are assigned similar positions.

Scatterplots representing time-varying behavior

The same representation of time-varying behavior (right scatterplot) compared with all-time behavior (left scatterplot). Points in the left scatterplot represent a single vector-based representation of a company that integrates both time-varying bid activity and other company information (partners, address, email, phone numbers). Clusters of similar companies are shown in the same color.

Scatter chart of weakly related companies

Selecting three weakly related companies in the visual to the right and observing distant positions in both embeddings. The maximum and average relatedness measured between all pairs of companies is shown in the top right, with values of 0.17 and 0.11 (out of 1.00) confirming weak relatedness.

Scatter chart of strongly related companies

Selecting three strongly related companies in the visual to the right and observing similar positions in both embeddings. The maximum and average relatedness measured between all pairs of companies is shown in the top right, with values of 1.00 and 1.00 (out of 1.00) confirming strong relatedness.
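The maximum and average figures in these two examples summarize relatedness over every pair of selected companies. The sketch below shows that pairwise summary using a hypothetical relatedness score that is 1.0 for identical vectors and decreases with distance. The scoring function and the example vectors are assumptions for illustration and will not reproduce the values shown in the visuals.

```python
from itertools import combinations
from math import dist
from statistics import mean

def relatedness(u, v):
    """Hypothetical score in (0, 1]: 1.0 for identical vectors, falling with distance."""
    return 1.0 / (1.0 + dist(u, v))

def summarize(selection):
    """Return (maximum, average) relatedness over all pairs of selected companies."""
    scores = [relatedness(u, v) for u, v in combinations(selection.values(), 2)]
    return max(scores), mean(scores)

# Hypothetical embedded positions for two selections of three companies each.
weak = {"A": (0.0, 0.0), "B": (6.0, 0.0), "C": (0.0, 8.0)}
strong = {"D": (2.0, 3.0), "E": (2.0, 3.0), "F": (2.0, 3.0)}

weak_max, weak_avg = summarize(weak)
strong_max, strong_avg = summarize(strong)
print("weak selection:", round(weak_max, 2), round(weak_avg, 2))
print("strong selection:", round(strong_max, 2), round(strong_avg, 2))
```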

Showing an inferred edge (relationship) between companies that can be explained by “synchronous similarity” – similar behavior in the same time period, repeated across multiple time periods. Strongly synchronized behavior could come from the natural structure of competition in a given area, but it could also be an indicator of coordination, collusion, or common beneficial control.

Graph showing an inferred edge explained by asynchronous similarity

Showing an inferred edge (relationship) between companies that can be explained by “asynchronous similarity” – similar behavior in non-overlapping periods over time. Strongly asynchronized behavior could come from the natural seasonality of competition in a given area, but it could also be an indicator of the “same” company dropping out of the ecosystem and reentering under a new identity.
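The distinction between synchronous and asynchronous similarity can be sketched with simple set overlaps. In the example below, each company's activity is a mapping from time period to the buyers and items it bid on. Both scoring functions and all of the data are hypothetical, chosen only to show how the two patterns differ. They are not the measures used by the transparency engine.

```python
def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def synchronous_similarity(x, y):
    """Mean same-period overlap across all periods in which either company appears."""
    periods = sorted(set(x) | set(y))
    if not periods:
        return 0.0
    return sum(jaccard(x.get(p, set()), y.get(p, set())) for p in periods) / len(periods)

def asynchronous_similarity(x, y):
    """All-time overlap in activity, discounted by how often both were active together."""
    all_x = set().union(*x.values())
    all_y = set().union(*y.values())
    active_x = {p for p, items in x.items() if items}
    active_y = {p for p, items in y.items() if items}
    return jaccard(all_x, all_y) * (1.0 - jaccard(active_x, active_y))

bids = {"buyer:health", "item:masks"}
# Two companies bidding on the same things in alternating quarters.
first = {"2020Q1": set(bids), "2020Q3": set(bids)}
second = {"2020Q2": set(bids), "2020Q4": set(bids)}
# Two companies bidding on the same things in the same quarters.
third = {"2020Q1": set(bids), "2020Q2": set(bids)}
fourth = {"2020Q1": set(bids), "2020Q2": set(bids)}

print("alternating pair:", synchronous_similarity(first, second), asynchronous_similarity(first, second))
print("simultaneous pair:", synchronous_similarity(third, fourth), asynchronous_similarity(third, fourth))
```

As with every indicator here, a high score of either kind says nothing about legitimacy by itself; it identifies a pattern that may merit review in context.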

Graph showing synchronous similarity with a common contact

An example of synchronous similarity where the two companies of the selected edge share a substantial number of contact details, but no registered partners (owners). Shared activity (in orange; bottom and rightmost bars respectively) dominates the independent activity of each company. Common ultimate beneficial ownership and control is a distinct possibility, but this may still be legitimate depending on the context.

Graph showing asynchronous similarity with a common contact

An example of asynchronous similarity where the two companies of the selected edge share a single contact detail (e.g., address, email, or phone number), but no registered partners (owners). The two companies demonstrate alternating periods of bid activity with minimal overlap. Coordination over time is a distinct possibility, but this may still be legitimate depending on the context.

An example of asynchronous similarity where the two companies of the selected edge share a single contact detail (e.g., address, email, or phone number), but no registered partners (owners). Very few of the many shared items, buyers, and tenders occur in the same time period, and independent bid activity is almost completely separate in time. Common company identity under a different name is a distinct possibility, but this may still be legitimate depending on the context.

Graph showing clusters of related entities

Drilling down into the observed evidence for a given inferred relationship (right) in a given cluster (center), selected either directly or by searching for a given company. The user can inspect all elements of common and independent activity, including linked buyers, tenders, items, partners, contact details, and time periods. Together with information about the temporal patterns of activity exhibited by the two companies, the user can make their own judgement about the likelihood and significance of any potential real-world relationship, including whether it needs further investigation.