Fitting the Data: How NOT to do a Graph Database in SQL
Josh Perryman

This is the first post in a series looking at choosing the right persistence and query engines for your data.

We talk about data, and how several data concepts such as “Big Data” and “NoSQL” are currently in the vogue. But just as all politics is local, all data is ultimately specific to its own subject domain. Data is not all the same, and so we shouldn’t expect that the general data tools will be the best tools when working with any particular set of data. Choose the right tools with the best fit for your data and you’ll spend more time in analysis and realizing the value of your data, and less time working around the restrictions of your tools.

A client had an enterprise application that was slow. The data and queries were slow, and even customers that threw big Oracle racks at the problem didn’t see appreciable speed ups. One of the items we uncovered in our architecture review was that the slowness was caused by — among other things — a mismatch between their data’s shape in their heads and its shape in their database. In this post I’ll introduce the problem and in the following posts I’ll walk through our evaluation of some possible solutions.

The application data could be described has several thousand entities, connected to one another in various ways. One could easily sketch a snippet of their data

on a napkin with some circles for entities and lines between them for the connections, like in the picture below.

‍Graph on a napkin

Their application handled many problem domains. Imagine the circles are computers, and lines are networks, and you want to know how connectivity changes when a network goes down. Or how many 10GigE connections you’ve got. Both the nodes (computers) and the edges (networks) had properties, like CPU speed or interface type.

At this point, you should see what we saw with their data: it is a graph database. The entities are nodes in the graph and the connections are edges or relationships. (Note: I won’t go into what a graph database is, or how all social media is based on graph theory, or even that the Internet is essentially a huge graph database.)

Tall Tables

The root of the performance issue was the implementation of this graphy data using a Relational Database Management System (RDBMS), also commonly known as SQL. The organization of the data in the RDBMS was highly abstract, with all entities in one table (regardless of type), all connections in second table (regardless of type), and all properties in a third table.  Elegant, flexible, nice to program against in some cases.

You see the pattern commonly in some enterprise software: it allows you to arbitrarily add attributes to your entities (CPU speed, for example) that weren’t anticipated by the original design, without altering actual tables. Instead of adding a column (“CPU_SPEED”) to a table (“COMPUTER”), you just add a row to an existing table for every computer’s CPU speed.

You end up with three tables with just a few columns and hundreds of thousands, or millions of rows. We call them “tall tables”.

Unfortunately, RDBMS systems don’t like tall tables. Using them neuters the optimizers in their query engine. This is the key point that I’ll finish with for today.  Rarely is one concerned with all nodes in a graph, or in this case, all of the entities within the application. Usually there is some specific subset, one or two types of entities that are of interest. The query engine helps us to quickly find those types of specific interest and quickly get on with our work.

Their implementation was so abstract that it was effectively impossible for the query engine to optimize. It was hard to build indexes, and difficult to use them.

The Right Tool for the Right Job

It looked like what appeared to be a slow SQL database was actually a just a poor implementation of a graph database. But we like doing tests with real data.

We proposed migrating realistic application data into variety of graph database options: Neo4jTitan, and AllegroGraph, as well as looking at a “wide tables” approach in an RDBMS. The evaluation would score an objective set of query speeds and a subjective sense of “ease of representation and query language.” My next post will look a little deeper into the Neo4j evaluation and the value of a good query language combined with a database optimized to handle graph data.