Company for a Cup of Joe: Finding Your Tribe

Combining traditional search techniques with graph algorithms to efficiently find subgroups within data.

Company for a Cup of Joe: Finding Your Tribe

Fill out form to continue
All fields required.
Enter your info once to access all resources.
By submitting this form, you agree to Expero’s Privacy Policy.
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Kristin Stone

October 8, 2019

Company for a Cup of Joe: Finding Your Tribe

Combining traditional search techniques with graph algorithms to efficiently find subgroups within data.



As graph consultants we eagerly apply graph data modeling and algorithms to solve customer problems. A recent example was for us to solve, what’s known in healthcare as, “Care Path Determination” for a particular patient. Let’s take your average patient, call him Joe. Determining his care path goes something like this:

  1. Find all the patients that look or looked like Joe (that term is intentionally vague)
  2. For all of those patients, find the one whose health care treatment was the most successful (again intentionally vague). We’ll call this patient Bob.
  3. Now go back and see what healthcare steps or interventions Bob took and suggest that Joe take them as well.

For the purposes of this blog, we’re simply going to address step 1. We’ll cover the other steps in later blogs in this series.

Finding the “Joes”

Given a large batch of healthcare data, how do you find all the patients that “look like Joe?”  Being graph enthusiasts, we’re going to utilize a graph database to hold our data structure. Next, we’ll  apply algorithms to turn our data into discoveries. 

Who is “Joe?"

For some of our internal thesis vetting, we’ve been using data sets generated by Synthea.  Synthea provides mocked generated patient data that will serve our purposes for this blog.  The generated data provides patients, allergies, procedures, medications, immunizations as well as providers and care plans.  The data set that was generated has 1,177 patients altogether.  

First, let’s review what information is available for patients in this dataset. Central to the issue at hand is the notion of a Patient vertex - and as in real life, our patients have certain attributes. Furthermore Patients have related ideas like Conditions, Allergies and Addresses. None of these are surprising but as we explain later, the details matter and we’ll need to refer back to these figures in a bit.

In our case Joe happens to be Mr. Joe Turcotte, who is a 47 year old white male of Greek descent.  Joe has had 3 immunizations, has been on 11 medications, had 7 recorded procedures and no known allergies. 

Simple Search

A simple query search is something we’re all familiar with, and shaving off unwanted data could provide more control in the initial steps of journey mapping. This notion may be blasphemous to most graph lovers, however, we’re in the business of solving data problems in the best manner possible. 

It’s important to remember that a productive simple search requires us to know all the queried data for each entity. In determining a care path, we would reasonably want to find similar patients based on attributes such as location, age, sex, health profile, etc. all within some given time period in history. However, some or much of that information could be missing leading us to overlook vital “Joes.”

Community Detection Algorithms

Of all the graph based algorithms available, one of the most simple and powerful classes is that of community detection. The job of community detection is to seek out nodes within a graph which are similar based on the structure of the graph. There are many ways to detect communities within a graph, finding the best (eg: smart and efficient) way is the trick and, generally, has to do with your graph structure. Here are a few popular options:

  1. Modularity maximization uses the standard metric of modularity, on the scale of (-1, 1), to determine the quality of communities found by the algorithm. Specifically, Louvain Modularity takes a greedy approach and can be time consuming depending on graph size, with a complexity of O(n log(n)), where n is the size of the node set. While another popular option is the Newman Modularity method that scales O(n^2 log(n)). These are both considerably more efficient than the Girvan-Newman Modularity method which presents at O(n^3). However, a major benefit is that modularity maximization is unsupervised, meaning the number of clusters is not required a priori. By maximizing modularity, you find the tightness, or quality, of communities. 

[From wikipedia, “it has been shown that modularity suffers a resolution limit and, therefore, it is unable to detect small communities. Biological networks, including animal brains, exhibit a high degree of modularity..” A higher modularity (-1 to 1) suggests clusters where members are “more similar to each other” with a single cluster. Note that this is completely derived by the structure of the graph rather than attributes on nodes or edges.]

  1. Spectral clustering utilizes the spectrum, or eigenvectors, of the adjacency matrix of the graph to determine community assignment. Convergence can be slow, as complexity is measured at O(n^3). Thus scaling can be problematic. However, efficiency is only gauged by the size of the node set and not the edge density.  Employing preconditioning or parallel processing can enhance efficiencies. Again, this is an unsupervised method requiring no a priori information other than the structure.


  1. Label propagation randomly seeds a graph with a suspected number of initial labels. Then, the algorithm iteratively labels nodes with its closest labeled neighbor until all nodes are labeled. The complexity of label propagation is linearly dependent on its edge set, O(m), where m is the size of the edge set.  However, the algorithm does require some a priori information; namely how many communities will be detected, and therefore is only semi-supervised. So, unless we know how many groups we have, this approach - while quick - has its disadvantages. 

If we built our graph with nodes of patients and edges defined to connect patients on similar attributes, it is certainly feasible to find a community which represents all the “Joes.” Which method we use is yet to be determined. As Label Propagation requires some a priori information, it’s probably best to table that method for this use case. Modularity and spectral approaches, however, are equally valid in their ability to approximate the solution. Without knowing ground truth, knowing which approach is “better” is arbitrary. However, with a look at complexities, we can move forward here using the Louvain method for modularity maximization.  

Similarity Coefficients

Similarity is a bit different than clustering. On the surface, similarity coefficients determine how alike two graphs (or subgraphs) are. For this scenario we would look at our target patient, Joe, and his neighborhood within the graph. Next we would look at all other patients and rate how similar their neighborhoods are to the target. These coefficient calculations scale by O(n).

  1. Jaccard Similarity is simply a coefficient defined by the ratio of the size of the intersection of two sets to the size of the union. Again, in our case, we’re comparing neighborhood sets of “Joe” and someone we think might be like “Joe”. That trend will be true for each of the following metrics as well.


  1. Overlap Similarity is defined by the ratio of the size of the intersection of two sets to the size of the smaller set.


  1. Cosine Similarity measures the ratio of the dot product of the two vector sets to the product of their magnitudes and is derived from the Euclidean dot product formula. 

The purpose of calculating similarity after community detection is to rank the members in the community on how alike they are to our target patient, Joe. Then we can focus our efforts on a given top X% or top X amount. 

In the case of these coefficients, they’re all similarly efficient and easy to implement. When doing any kind of comparison measure, the key is consistency. So we really do just need to pick one and run with it. Cosine similarity seems to be a community standard and is implemented in many tool kits. However, for the sake of science, we’re going to look at all three. 

Healing Cocktail

The magic happens when we can combine all available resources. Our best approach is to start with a simple search on the few basic attributes that will necessarily be available. This best serves our graph algorithms by reducing their load to bear. In smartly utilizing simple search as a preprocessor to any community detection and/or similarity algorithm, determining a care path can be much less painful. Here we drill down to get the list we want by prescribing the following steps:

  1. Simple search
  2. Community detection
  3. Similarity

This drill down approach allows some parts to run continuously like simple search and community detection - constantly updating those communities. Then pushing an appropriate and narrow set through a similarity algorithm to find the best path forward for the patient.

Example Scenario

Now back to our Synthea dataset of 1,177 patients, we found the following:

First, looking at our target patient - “Joe” we perform a simple search based on gender, race, and age +/- 3 years. This shaves off 96% of our dataset to initially target just 45 patients, including Joe. (Note that these may not be the best attributes for our search given what we’re seeking, but they were chosen here because those baseline characteristics are expected to be known.) 

As you can see below, simple search can greatly lighten the load on an algorithm like community detection. However, due to our synthetic data set being so complete we have some interesting results. Performing community detection on these 45, left us with just one community as these 45 are highly connected on the remaining data. In a much larger data set, we may expect to see more distinct communities even after a simple search is performed. 

Looking at similarity, as stated before, each coefficient is valid in its own right and the goal is to maintain consistency. Below we consider the top ten results for each method both for the entire dataset and after performing search and community detection. It is important to note that we did not consider the attributes that were previously searched on when performing the similarity calculations. This accounts for the difference in result bounds as there were fewer and different values to compute from. 

What is interesting here is the common results between the methods before and after search and community detection. After logically narrowing down the data, the similarity coefficient  calculations agreed 125% more than before doing so. This gives us more faith in our results and cause to investigate further. 

In conclusion, by combining traditional database search methods with graph algorithms, you lighten the computational load. This method will help you to find your “Joes” and process them accordingly. If you have questions about this blog or want to meet with our technical team, email us at!

User Audience


Project Details