Graph-based machine learning, or graphML, as we refer to it at Expero, is still a relatively new area of interest. As such, we receive lots of questions about the types of work we do and how it fits into the world of analytics. There are many options for storing graph data, including several wonderful graph databases. I’m going to set aside any consideration of how the graph data is stored in this article, as it’s a tangential subject, largely independent of graph analytics.
Graph analytics is a super hot area of research right now because we’re entering a world dominated by machine learning. There are many types of traditional graph analytics which don’t require machine learning. I’m going to set these aside too, for two reasons: first, they aren’t my area of expertise, and second, there is a whole internet full of references you can read to bone up on those techniques. Within graphML itself, there are three processing paradigms which typify thousands of individual analysis types.
The first paradigm is the most boring flavor of graphML. The idea here is that, using graph queries, you extract information about particular records in a more meaningful way than a traditional data structure allows. Once you do so, you’re armed with a suite of information for each record which is more powerful than its relational counterpart, and you can inject that information into any old machine learning algorithm. Boredom notwithstanding, the outputs from your ML model will generally be of much greater accuracy. I’ll explain why using a typical example.
Traditionally, when forecasting the performance of alternative assets like collateralized loan obligations (CLOs), companies have used temporal analysis systems which take as input features like par value, interest rate, market valuation, time rate of change, and many other characteristics describing one asset. This is necessary, and simple enough to do: query your data source for all the info about your asset, feed those features into a machine learning algorithm, and you’re off to the races.
In the case of graphML processing paradigm one, you have a tremendously more powerful option at your disposal. Instead of asking the data source for the current interest rate on the single CLO, you can ask for the interest rates in all CLOs which have obligors in the same industry and are backed by assets with class A office space. This type of query is impractical (or impossible) without a graph data structure, but its result is far more representative of the complete state of the asset in question. Armed with this information as input, your machine learning algorithm is now equipped with a much more nuanced picture of the asset, and indeed of the local neighborhood in the market information space. This means more accuracy, and thus higher returns.
Any graph database system will enable you to build graph queries which extract information in graphy ways. We commonly use JanusGraph, TigerGraph, DataStax, OrientDB, Neo4j, and Neptune. Many of these use Cypher or Gremlin as their query language, enabling fast and easy retrieval of information in its native form. If your graph is small enough, you can keep it (or parts of it) in memory and use it that way. In this case, we typically use one of the following: NetworkX, Memgraph, or Trovares.
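As a minimal sketch of this paradigm, here’s what a “peer CLO” lookup might look like over a small in-memory NetworkX graph. The schema, node names, and rates below are entirely made up for illustration; a real deployment would run an equivalent Gremlin or Cypher traversal against one of the databases above.

```python
import networkx as nx

# Tiny, hypothetical graph: CLO nodes are connected to their obligors,
# and each obligor carries an industry label. Not a real schema.
g = nx.Graph()
g.add_node("clo_a", kind="clo", rate=0.052, asset_class="A_office")
g.add_node("clo_b", kind="clo", rate=0.061, asset_class="A_office")
g.add_node("clo_c", kind="clo", rate=0.048, asset_class="B_retail")
g.add_node("obligor_1", kind="obligor", industry="logistics")
g.add_node("obligor_2", kind="obligor", industry="logistics")
g.add_node("obligor_3", kind="obligor", industry="retail")
g.add_edges_from([("clo_a", "obligor_1"),
                  ("clo_b", "obligor_2"),
                  ("clo_c", "obligor_3")])

def peer_rates(graph, clo):
    """Rates of other CLOs whose obligors share an industry with this
    CLO's obligors, restricted to CLOs backed by class A office assets."""
    industries = {graph.nodes[o]["industry"] for o in graph[clo]
                  if graph.nodes[o]["kind"] == "obligor"}
    rates = []
    for n, data in graph.nodes(data=True):
        if n == clo or data.get("kind") != "clo":
            continue
        if data.get("asset_class") != "A_office":
            continue
        n_industries = {graph.nodes[o]["industry"] for o in graph[n]
                        if graph.nodes[o]["kind"] == "obligor"}
        if industries & n_industries:
            rates.append(data["rate"])
    return rates

print(peer_rates(g, "clo_a"))  # only clo_b matches both conditions
```

The returned list of peer rates is exactly the kind of neighborhood feature you’d append to a record before handing it to an ordinary ML algorithm.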
In the second processing paradigm, we don’t deal with the properties of individual data records. Instead, we analyze the structure of how each record interacts with the rest of the records: rather than looking at characteristics of an individual entity, we look at characteristics of the whole system. We do this by extracting the structure, or topology, of a graph network for each of its constituent neighborhoods. This paradigm is the most abstract, but bear with me, and I’ll try to demonstrate what I’m talking about by way of example.
It’s fairly easy for bad actors to falsify records like phone numbers and addresses in fake business operations. These are the types of information attached to a data record in a traditional organizational fraud analysis using machine learning. One gathers all the information available about an entity, feeds it into an ML algorithm, and waits for results. This strategy is insufficient in an era when digital records can be falsified. Instead, we recommend using graphML topology analysis.
It works like this: instead of analyzing the properties of businesses, you analyze the properties of how businesses are related to one another. Using graph analysis, we can easily extract the pattern of money flow and the relative relationships between businesses. That’s a key distinction: we inject into our ML algorithms the structure of the monetary relationships between businesses, not the properties of the businesses individually.
What we’ve found is that though it’s relatively easy for bad actors to fake business documents like bank accounts and holdings statements, it’s very difficult (or impossible) to disguise the flow of money in a fraud scheme. Imagine an ordinary business; money flows out to vendors and in from customers. In a Ponzi scheme, money flows into and out of an investment management organization hierarchically as a function of time. In a money laundering ring, money flows cyclically, while the net sum stays constant over a full time period. These are just a few examples, but the point is that while you can fake business parameters, you can’t fake the structure of business relationships. Using machine learning on graph topology metrics, you can discover anomalous relationships, and thus find fraud with extremely high accuracy.
Graph topology analysis can be done in a huge variety of ways, and in many cases we write custom code to perform specific types of analysis. There are open source packages, however, which perform the standard operations one would use to analyze graph topology. Our most frequently used libraries include NetworkX and Graph-Tool.
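To make that concrete, here’s a small sketch in NetworkX of the kind of topology features we’re describing. The money-flow graph, the business names, and the “cycle participation as a laundering signal” are all toy assumptions for illustration, not a production feature set.

```python
import networkx as nx

# Hypothetical directed money-flow graph: an edge u -> v means money
# moved from business u to business v.
flows = nx.DiGraph()
flows.add_edges_from([
    ("acme", "vendor_1"), ("customer_1", "acme"),   # ordinary flows
    ("shell_a", "shell_b"), ("shell_b", "shell_c"),
    ("shell_c", "shell_a"),                         # cyclic flow
])

# Which businesses sit on a cycle of money movement?
cycles = list(nx.simple_cycles(flows))
in_cycle = {n for cyc in cycles for n in cyc}

# Purely structural features per node -- nothing here comes from the
# business's own (fakeable) documents.
features = {
    n: {
        "in_degree": flows.in_degree(n),
        "out_degree": flows.out_degree(n),
        "in_cycle": n in in_cycle,
    }
    for n in flows
}
```

These per-node feature dictionaries, not the businesses’ self-reported records, are what a downstream anomaly-detection model would consume.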
In the third paradigm, we combine the analysis of node properties (paradigm one) and graph topology (paradigm two) into a single, simultaneous algorithm. The beauty here (and the reason you get such amazing results) is that you’re including all information available about every data record at analysis time. This means that you preserve not only the information content of each record, but also its relationships to all the other records in the system. These types of analyses are complete, meaning they have all of the possible information about the system, and as such they can be slow, but the results are worth the wait. We’ve seen accuracy improvements of up to 16% just by switching to this processing paradigm. Let me demonstrate why these algorithms work so well with an example.
We do a lot of work in the transportation analysis domain here at Expero. One of the problems we’re currently tackling is the evaluation of the safety of roads, so that companies can build better autonomous systems which keep drivers safer. Scoring individual segments of road with risk characteristics is an extremely difficult problem. Imagine you know something about the number of accidents on a local surface road in the suburbs at mile marker fifteen, but you don’t have a count of accidents on the parallel segment of interstate highway thirty yards away.
Even though the surface road appears to be dangerous at that particular latitude/longitude, you can’t assume the highway just next door is equally dangerous at the same location. The most accurate way to infer risk on the neighboring highway is to use graphML. In this case we use a learning algorithm called a neighborhood weighted graph convolutional network to analyze the nearby road segments’ accident counts while simultaneously analyzing the path from the highway in question to the surface road we do have information about. This type of analysis is extremely powerful because the algorithm simultaneously has information about the structure of the road network and the conditions at each point in that network.
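To give a flavor of how such an algorithm mixes structure and node features, here’s a toy single-layer graph-convolution step in the style of Kipf’s GCN, written with NumPy. The road network, accident counts, and weights are all made up, and the weight matrix is random rather than trained; the neighborhood weighted variant we use adds weighting this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Adjacency for 4 hypothetical road segments: 0-1-2 is the surface road,
# segment 3 is the parallel highway, connected to segment 1.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)

# One feature per segment: known accident counts.
# The highway (node 3) has no data, so its feature starts at 0.
H = np.array([[4.0], [7.0], [3.0], [0.0]])

# Kipf-style propagation: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)
A_hat = A + np.eye(4)                       # add self-loops
d = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # symmetric normalization
W = rng.normal(size=(1, 2))                 # untrained placeholder weights
H_next = np.maximum(0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)
```

After just one propagation step, the highway node’s representation already mixes in its neighbors’ accident counts, which is exactly the “structure plus node features, simultaneously” property the paradigm relies on.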
This is a field of intense and rapid research, and as such the tools are quickly evolving. There are fully functional open source tools one can use, such as Kipf’s GCN or Grover’s node2vec. At the time of writing (June 2019), however, the most powerful option is to build your own specialized code for your application domain. We use libraries like DeepMind’s graph_nets to build custom deep graph neural networks, or Facebook’s PBG to build embeddings for massive graphs.
Each one of the sections in this article could, themselves, be articles. In some cases, they already are (see our blogs and lightning talks on using graphML)! If you’re interested in discussing the application of graphML techniques to your business analytics projects, please drop me a line; I love brainstorming about applying these techniques in every industry.