Fitting the Data: Neo4j & Cypher
Josh Perryman

The Value of a Good Declarative Query Language

In this second post we look at how the application’s graph data fits into a leading graph database: Neo4j. Quite well, actually.

In my first post I described the problem of an application with graph data that was implemented in a SQL database using a “tall tables” approach. (Note that throughout, my use of “SQL” is vendor-agnostic: it could refer to Oracle’s flagship RDBMS, to the eponymous competing product from Microsoft, or to any of a number of other products that use some form of SQL as their query language.)

Going into this, the suspicion was that a database purpose-built to handle graph data would perform better than the tall tables implementation. Perform a search for “graph database” or “property graph database” and Neo4j will be among the first hits, so that’s where I started. Our customer provided a virtual machine with the application and two sample data sets. They also provided a couple of sample queries, which we’ll get to in a minute.

The conversion process went very smoothly, as one might expect moving from an RDBMS environment with its strong data typing and straitjacket-like schema to an environment with no strict data typing and only the barest hints of a schema. I found the Cypher query language quick to learn. Trivial queries are easy and obvious, and more complicated ones aren’t too hard to figure out.
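For instance, a trivial lookup in Cypher reads almost like the question itself. Here’s a minimal sketch (the `Person` label and the `name` and `age` properties are illustrative, not from the customer’s data):

```cypher
// Find a node by one of its property values and return a couple of properties.
// Label and property names here are hypothetical examples.
MATCH (p:Person)
WHERE p.name = 'Alice'
RETURN p.name, p.age
```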

Though the data provided wasn’t terribly large (28,000 entities, with 46,000 connections between them), there was a particular query that should have been easy for the technology to handle but was very long-running. The data contained 5 different types of entities, related to one another so that you could follow a path from type 1 to type 5 like this:

[Figure: a path through 5 entity types]

That pattern was common in the data set. So common that, even in this small sample, this specific type of path occurred nearly 4,000 times. Finding all 4,000 of those paths in this small data set running on a VM required, at best, 4 minutes and 40 seconds. A simplified version of the SQL to find them with a tall tables implementation looks like this:

SELECT E1.entityName AS 'Entity Type 1'
 , C1.connectionName AS 'Connection from type 1 to type 2' 
 , E2.entityName AS 'Entity Type 2'
 , C2.connectionName AS 'Connection from type 2 to type 3'
 , E3.entityName AS 'Entity Type 3'
 , C3.connectionName AS 'Connection from type 3 to type 4' 
 , E4.entityName AS 'Entity Type 4' 
 , C4.connectionName AS 'Connection from type 4 to type 5'
 , E5.entityName AS 'Entity Type 5' 
FROM Entities E1 
 INNER JOIN Connections C1 ON (C1.incomingEntityID = E1.entityID) 
 INNER JOIN Entities E2 ON (E2.type = 2 AND E2.entityID = C1.outgoingEntityID) 
 INNER JOIN Connections C2 ON (C2.incomingEntityID = E2.entityID) 
 INNER JOIN Entities E3 ON (E3.type = 3 AND E3.entityID = C2.outgoingEntityID) 
 INNER JOIN Connections C3 ON (C3.incomingEntityID = E3.entityID) 
 INNER JOIN Entities E4 ON (E4.type = 4 AND E4.entityID = C3.outgoingEntityID) 
 INNER JOIN Connections C4 ON (C4.incomingEntityID = E4.entityID) 
 INNER JOIN Entities E5 ON (E5.type = 5 AND E5.entityID = C4.outgoingEntityID) 
WHERE E1.type = 1

Eight joins! And we’re only looking at 2 tables! No wonder it takes nearly 5 minutes. And that’s without returning any additional property values for the entities. You’ll recall from my first post that the property values were stored in their own table, separate from the other two tall tables. Though one could use CTEs, you’re still adding more joins just to get additional properties for each entity.
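In Neo4j, by contrast, extra properties cost no extra joins: properties live directly on the node, so returning more of them is just a longer RETURN clause. A rough sketch (the `createdDate` and `status` property names are hypothetical):

```cypher
// Extra properties are read straight off the matched nodes;
// no additional joins or lookups against a separate properties table.
// Property names beyond entityName are illustrative assumptions.
MATCH (e1 { type: 1 })--(e2 { type: 2 })
RETURN e1.entityName, e1.createdDate, e1.status, e2.entityName
```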

I then implemented the same query in Cypher. Here’s a version that’s similar to the SQL query above:

MATCH (E1)-[C1]-(E2)-[C2]-(E3)-[C3]-(E4)-[C4]-(E5) 
WHERE E1.type = 1 AND E2.type = 2 AND E3.type = 3 AND E4.type = 4 AND E5.type = 5 
RETURN E1.entityName AS `Entity Type 1`
     , C1.connectionName AS `Connection from type 1 to type 2`
     , E2.entityName AS `Entity Type 2`
     , C2.connectionName AS `Connection from type 2 to type 3`
     , E3.entityName AS `Entity Type 3`
     , C3.connectionName AS `Connection from type 3 to type 4`
     , E4.entityName AS `Entity Type 4`
     , C4.connectionName AS `Connection from type 4 to type 5`
     , E5.entityName AS `Entity Type 5`

In Cypher, graph nodes are designated by parentheses (e.g. (E1)), and relationships by square brackets, with hyphens connecting them to the nodes (e.g. -[C1]-). It is generally preferable in Cypher to match on properties using a JSON-like map notation inside the node. We can even leave out the relationship brackets in between if we just want to list the node names:

MATCH (E1 { type : 1 } )--(E2 { type : 2 } )--(E3 { type : 3 } )--(E4 { type : 4 } )--(E5 { type : 5 } ) 
RETURN E1.entityName AS `Entity Type 1`
     , E2.entityName AS `Entity Type 2`
     , E3.entityName AS `Entity Type 3`
     , E4.entityName AS `Entity Type 4`
     , E5.entityName AS `Entity Type 5`

That generic query completed in 1.4 seconds. In its most optimized form (with labels on the nodes and the relationship types all specified along with their directions), it took about 400 ms, roughly 700 times faster than the SQL version.
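A sketch of what that optimized form might look like, assuming hypothetical labels (`Type1` through `Type5`) and a single relationship type (`CONNECTS_TO`); neither of those names appears in the actual data:

```cypher
// Labels let the engine start from a much smaller set of candidate nodes,
// and typed, directed relationships prune the traversal at every hop.
// Label and relationship-type names are illustrative assumptions.
MATCH (e1:Type1)-[:CONNECTS_TO]->(e2:Type2)-[:CONNECTS_TO]->(e3:Type3)
      -[:CONNECTS_TO]->(e4:Type4)-[:CONNECTS_TO]->(e5:Type5)
RETURN e1.entityName, e2.entityName, e3.entityName, e4.entityName, e5.entityName
```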

More importantly, I showed the simple Cypher query to our customer and, cleaning up the response because we’re a family-friendly blog, the individual who provided the original SQL said, “Oh my! I can read that. It’s so obvious what question it’s asking.”

We tested three or four other queries in both environments. Queries that just returned numeric properties and their sums performed about the same in SQL and Neo4j. But the queries that sought to exploit the graph nature of the data by following relationships two or three hops out tended to be 2 to 4 times faster, even with the small data set.
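As a rough illustration of the aggregate-style queries we compared (the `amount` property and `type` values are hypothetical, not from the customer’s data):

```cypher
// Sum a numeric property over nodes reached two hops out from a starting type.
// The middle node is anonymous since we only need the endpoints.
MATCH (e1 { type: 1 })--()--(e3)
RETURN e1.entityName, sum(e3.amount) AS totalAmount
```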

There are excellent technical reasons for this, and they aren’t that SQL is slow while Neo4j is fast. I may blog about those in the future, but if you’re interested in knowing now then download the Graph Databases book and check out chapter 6 on Graph Database Internals.

So we see that Neo4j’s strengths are a graph-focused query language and a query engine optimized for graph data. Both are big wins over the tall tables approach. Next week we’ll look at one of the other major property graph solutions: Titan.
