How does a 21st century police department track down criminals? By using data science, of course! In order to direct their investigations, the police aggregate information from disparate data sources about criminals in order to gather a holistic view of their lives and interactions with others. It’s common for officers to amass tax history from the IRS, call records from cell service providers, and arrest bookings from other police departments about thousands of suspects during the course of an investigation. In most cases, it’s very difficult to reconcile information from these disparate sources due to inconsistencies in records.
Is Verizon’s Bob Johnstone also the NYPD’s Robert johnston? Is the individual who lives at 433 Main St. the same as the one who lives at 433 Maine Street? Record duplicity is nontrivial problem when analyzing large amounts of data. It’s easy for a human to recognize the similarity between “Main St.” and “Maine Street,” but a human can’t mine through millions or billions of records.
The above picture is a graph database representing the financial interactions of individuals. A small fraction are criminals. This database contains thousands of duplicate records as a result of typos and non-unique entries. It is impossible to make decisions using data this convoluted.
In practice, we teach a computer to resolve unique entities from duplicated records in a data store.
Before we can combine, or “canonicalize,” duplicate records, we have to flag the duplicates. We do this by using a combination of similarity scoring, logistic regression, and clustering algorithms.
The steps in an entity resolution workflow.
Similarity scoring can be done in a variety of ways, but in the case of string fields (ex. “Bob Johnstone” versus “Bob johnson”) we use a variation of a simple procedure called the Hamming distance. The Hamming distance is defined as the number of changes it takes to produce the alternate record. Transforming “Bob johnson” to “Bob Johnstone” is done by adding a “t”, changing a lowercase “j” to uppercase, and adding and “e.” That’s three changes, so we have a Hamming distance of three. Similarly, the Hamming distance between “433 Main St.” and “433 Maine Street” is five.
Now that we have a field-wise string distance, we need to decide how to weight the fields for overall comparison of records. In the Bob johnson example, we must decide the relative importance of the name field and the address field. This is the machine learning part. We use an algorithm called logistic regression to set the weights of each field by showing the algorithm examples of records which are duplicates and records which aren’t duplicates. If the algorithm learns that addresses are more important for finding duplicates it weights the address distances more heavily than the name name distances.
Logistic regression is a minimization of the error between real data and predictions made with the logit equation.
Got that? Now that we have absolute scores between records we need to group records into clusters to label duplicates as belonging to each other, and unique entries as belonging only to themselves (a cluster of one). Information theorists might pose this problem as an entropy maximization of the information space. The rest of us would scratch our heads at that statement, and call it sorting records into unique groups.
Clustered data, ready for canonicalization.
Now that we know which records are duplicates, we need to do something with that information. In some cases the duplicated records will contain information we want to keep, like Bob’s arrest records. In other cases duplicated records might simply be a copy of something else, like Bob’s drivers license information, once from the DMV and again from his apartment application. We need to decide what information to keep, and what to discard.
The zeroth-order solution here is to only keep the first instance of a record in a cluster. Technically you’d be accomplishing your goal (building a unique set of records), but you’d be throwing away a lot of information. If you’re a police officer, you probably don’t want to do that! On the other end of the spectrum, you could simply keep all of the information in a relational database by using a bunch of tables. When it was time to retrieve the information you could simply join the tables on the cluster IDs you generated in the entity resolution step above. More elegantly, we’ve solved this problem using a graph database. We form “same as” edges between clustered nodes to build supernodes which represent unique individuals.
A couple of unique entity supernodes popped open to show all of their properties.
After all this work, we’re left with a usable graph, ready for police analysis. Without all the clutter of duplicate entries, the agents use are able to clearly see the financial transactions between individuals:
The entity resolved, canonicalized FBI suspect graph. Note that you can actually see the nodes and their connections in this view, as compared to the original, non-resolved graph (first figure in this article).