To Catch a Thief… You Need Entity Resolution

The FBI’s use of machine-learning-based entity resolution to catch criminals and detect fraudulent activity.


How does a 21st-century police department track down criminals? By using data science, of course! To direct their investigations, the police aggregate information about suspects from disparate data sources, building a holistic view of their lives and interactions with others. Over the course of an investigation, it’s common for officers to amass tax history from the IRS, call records from cell service providers, and arrest bookings from other police departments for thousands of suspects. In most cases, inconsistencies between records make it very difficult to reconcile information from these sources.

Is Verizon’s Bob Johnstone also the NYPD’s Robert johnston? Is the individual who lives at 433 Main St. the same as the one who lives at 433 Maine Street? Record duplication is a nontrivial problem when analyzing large amounts of data. It’s easy for a human to recognize the similarity between “Main St.” and “Maine Street,” but a human can’t sift through millions or billions of records.

The picture above shows a graph database representing the financial interactions of individuals, a small fraction of whom are criminals. This database contains thousands of duplicate records as a result of typos and non-unique entries. It is practically impossible to make sound decisions using data this convoluted.

In practice, we teach a computer to resolve unique entities from duplicated records in a data store.

Entity Resolution by Automated Clustering

Before we can combine, or “canonicalize,” duplicate records, we have to flag the duplicates. We do this by using a combination of similarity scoring, logistic regression, and clustering algorithms.

The steps in an entity resolution workflow.

Similarity scoring can be done in a variety of ways, but in the case of string fields (e.g. “Bob Johnstone” versus “Bob johnson”) we use a variation of a simple procedure called the Hamming distance. Strictly speaking, the Hamming distance only counts substituted characters between two strings of equal length; the variation that also allows characters to be added or removed is usually called the edit (or Levenshtein) distance. It is defined as the minimum number of single-character changes it takes to produce the alternate record. Transforming “Bob johnson” to “Bob Johnstone” is done by changing a lowercase “j” to uppercase, adding a “t,” and adding an “e.” That’s three changes, so we have an edit distance of three. Similarly, the edit distance between “433 Main St.” and “433 Maine Street” is five.
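As a minimal sketch of the change counting described above (with added and removed characters allowed, this is the measure usually called the Levenshtein edit distance), the two examples can be reproduced in a few lines of Python. The function name `edit_distance` is an illustrative choice, not something taken from the original project.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,                      # delete char_a
                curr[j - 1] + 1,                  # insert char_b
                prev[j - 1] + substitution_cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Bob johnson", "Bob Johnstone"))      # 3
print(edit_distance("433 Main St.", "433 Maine Street"))  # 5
```

Note that the comparison is case-sensitive, which is why “j” versus “J” counts as a change. Lowercasing both strings first would be a reasonable alternative if case carries no meaning in a given field.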

Now that we have a field-wise string distance, we need to decide how to weight the fields for an overall comparison of records. In the Bob johnson example, we must decide the relative importance of the name field and the address field. This is the machine learning part. We use an algorithm called logistic regression to set the weight of each field by showing the algorithm examples of record pairs that are duplicates and record pairs that aren’t. If the algorithm learns that addresses are more important for finding duplicates, it weights the address distances more heavily than the name distances.

Logistic regression minimizes the error between real data and predictions made with the logistic (inverse logit) function.

Got that? Now that we have overall scores between records, we need to group the records into clusters, labeling duplicates as belonging to each other and unique entries as belonging only to themselves (a cluster of one). Information theorists might pose this problem as an entropy maximization of the information space. The rest of us would scratch our heads at that statement and call it sorting records into unique groups.

Clustered data, ready for canonicalization.
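The post does not name the clustering algorithm used, so the following is only one simple way to do the grouping: treat every pair scored above a threshold as linked, then take the connected components with a union-find structure. The function name `cluster_records`, the threshold of 0.5, and the pair scores are assumptions made for illustration.

```python
def cluster_records(n_records, scored_pairs, threshold=0.5):
    """Group record indices into clusters of likely duplicates.

    scored_pairs is an iterable of (i, j, probability) tuples, where
    probability is the model's estimate that records i and j match.
    """
    parent = list(range(n_records))

    def find(i):
        # Walk up to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j, probability in scored_pairs:
        if probability >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                # Keep the smaller index as the root so output is stable.
                parent[max(root_i, root_j)] = min(root_i, root_j)

    clusters = {}
    for i in range(n_records):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical scores for five records.
pairs = [(0, 1, 0.93), (1, 2, 0.81), (0, 3, 0.07), (3, 4, 0.12)]
print(cluster_records(5, pairs))  # [[0, 1, 2], [3], [4]]
```

Records 3 and 4 never exceed the threshold with anything, so each ends up in a cluster of one. A known weakness of this approach is chaining: a single false link can merge two otherwise separate groups, so a stricter threshold or a different clustering method may be preferable in practice.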

Canonicalization of Records

Now that we know which records are duplicates, we need to do something with that information. In some cases the duplicated records will contain information we want to keep, like Bob’s arrest records. In other cases duplicated records might simply be copies of the same thing, like Bob’s driver’s license information, once from the DMV and again from his apartment application. We need to decide what information to keep and what to discard.

The zeroth-order solution here is to keep only the first instance of a record in a cluster. Technically you’d be accomplishing your goal (building a unique set of records), but you’d be throwing away a lot of information. If you’re a police officer, you probably don’t want to do that! On the other end of the spectrum, you could keep all of the information in a relational database by using a bunch of tables. When it’s time to retrieve the information, you could join the tables on the cluster IDs you generated in the entity resolution step above. More elegantly, we’ve solved this problem using a graph database. We form “same as” edges between clustered nodes to build supernodes which represent unique individuals.

A couple of unique entity supernodes popped open to show all of their properties.
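The supernode idea can be sketched without any particular graph database by using plain Python dictionaries. Everything here is illustrative: the records are made up, `build_supernodes` is a hypothetical helper, and linking each cluster member to the first member (a star of “same as” edges) is an assumption about how the edges might be laid out.

```python
from collections import defaultdict


def build_supernodes(records, clusters):
    """Merge clustered records into supernodes without discarding values.

    Returns (supernodes, same_as_edges). Each supernode keeps every
    distinct value seen for each property across its member records.
    """
    supernodes, same_as_edges = [], []
    for members in clusters:
        anchor = members[0]
        # Link every other member of the cluster to the anchor record.
        same_as_edges.extend((anchor, other) for other in members[1:])
        merged = defaultdict(set)
        for index in members:
            for key, value in records[index].items():
                merged[key].add(value)
        supernodes.append({
            "members": list(members),
            "properties": {key: sorted(values) for key, values in merged.items()},
        })
    return supernodes, same_as_edges


# Hypothetical source records and the clusters found in the previous step.
records = [
    {"name": "Bob Johnstone", "address": "433 Main St.", "source": "Verizon"},
    {"name": "Robert johnston", "address": "433 Maine Street", "source": "NYPD"},
    {"name": "Alice Smith", "address": "12 Oak Ave.", "source": "DMV"},
]
supernodes, edges = build_supernodes(records, [[0, 1], [2]])
print(edges)
print(supernodes[0]["properties"])
```

Keeping all distinct values per property sits between the two extremes described above: nothing is thrown away, yet each individual is represented once.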

Criminals Beware!

After all this work, we’re left with a usable graph, ready for police analysis. Without all the clutter of duplicate entries, agents are able to clearly see the financial transactions between individuals:

The entity-resolved, canonicalized FBI suspect graph. Note that you can actually see the nodes and their connections in this view, as compared to the original, non-resolved graph (first figure in this article).


Graham Ganssle, Ph.D., P.G.

January 10, 2018
