As the graph database marketplace continues to mature, many believe that 2018 is a tipping point where predominantly corporate R&D projects evolve into concrete business objectives that need to yield results. The early entrants into the space are now being joined by the (lumbering?) giants and some of the smaller players are being acquired by larger companies - sure signs of a maturing marketplace. We’ve been at it for several years now, but I still believe we’re in the middle innings of a nine-inning game.
Now that there are literally dozens of choices, many corporate customers are stuck in lengthy evaluation cycles trying to select the right graph engine for their needs. A dearth of graph skills in the marketplace lengthens and complicates the selection process. What are a CIO and CFO to do? Wait for further maturation and consolidation? Pick a leader and just “go?” Perform a labor-intensive bake-off? Our experience suggests a little of “all of the above.”
Having partnered with many of the commercial vendors and contributed to some of the open source alternatives, Expero has unique experience and a broad perspective. While this is not an exhaustive survey of the marketplace, it is informed by real-world projects and the benefit of exposure to many of the options.
Neo4J is one of the earliest entrants in the property graph marketplace, and its maturity shows. Web-based tooling in conjunction with cloud-accessible sandboxes makes exploring graphs with Neo4J an afternoon’s project. Neo’s Cypher language is quite expressive and allows mere mortals to perform most graph queries using an immensely readable, understandable language. One benefit of Neo’s longevity is the amount of help that can be found online in forums and on Stack Overflow. Need to do something more complex? Crack open your favorite Java IDE to easily extend Neo’s capabilities using the Awesome Procedures on Cypher (APOC) extension paradigm. These procedures offer everything from user-defined functions that can be referenced from within your Cypher queries to full-on parallelized graph algorithm implementations. If your analytics require more horsepower, the Neo4J Spark connector simplifies the task of reading data out of Neo into Spark RDDs where you can use the native Spark tooling.
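To give a feel for Cypher's readability, here is a minimal sketch of a "friends of friends" query held in a Python string. The Person label, the KNOWS relationship type, the name property, and the helper function are all hypothetical, chosen only for illustration; the query text and parameter map would be handed to whichever Neo4J driver you use.

```python
# Hypothetical schema: Person nodes connected by KNOWS relationships.
FRIENDS_OF_FRIENDS = """
MATCH (p:Person {name: $name})-[:KNOWS]->(:Person)-[:KNOWS]->(fof:Person)
WHERE fof <> p
RETURN DISTINCT fof.name AS name
ORDER BY name
"""


def friends_of_friends_query(name):
    """Return the Cypher text and the parameter map for one person."""
    # Passing the name as a parameter ($name) keeps user input out of the
    # query text itself.
    return FRIENDS_OF_FRIENDS.strip(), {"name": name}


query, params = friends_of_friends_query("Alice")
print(query)
print(params)
```

The pattern in the MATCH clause reads almost like a drawing of the path being searched for, which is a large part of why newcomers tend to pick the language up quickly.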
Previous years saw some large improvements to the core Neo4J database including causal clustering and Cypher language and performance improvements. This past year, more emphasis has been put on Neo as not just a database, but a graph platform. With this we expect to see an increase in tooling and integrations, easing the adoption effort for new customers and adding new functionality for the existing user base.
DataStax Enterprise Graph
Initially the commercial Apache Cassandra and Apache Solr offering, DataStax has continued to add big data components onto their platform with Apache Spark and DataStax Enterprise Graph (acquired through the Aurelius team behind Titan) along with integrated analytics and search in one comprehensive distribution. By embracing underlying open source technologies and standards (Apache TinkerPop/Gremlin), DataStax can leverage subsequent innovations originating from the community while providing corporate customers an integrated, vetted platform.
DSE Graph differentiates itself by providing seamless support for both DSE Analytics and DSE Search. Additionally, performance enhancements have been added, such as graph-specific materialized views, a query optimizer, a distributed query execution engine, and a locality data partitioner.
As with Neo4J and many of the other commercial database offerings, DataStax has put an increased emphasis on developer experience. The original desktop-based developer tools are being phased out in favor of DataStax Studio, a notebook-inspired development environment that provides query access and data visualization over DSE’s Cassandra and graph data. On the data loading side, DSE Graph Loader provides a number of different ingest format options and can perform parallelized bulk loads at high speed into your DSEG cluster. Recent versions of DSEG have added better interoperability with the Spark ecosystem, extending the Gremlin Spark support with the ability to access your graph data with Spark’s GraphFrames extensions.
Titan was (and still is) a great open source graph database and while the Aurelius folks shifted their focus entirely to DataStax Graph and TinkerPop, the rest of the community is not standing still. JanusGraph (son of Titan - get it?) is a reasonably recently forked distribution of Titan that will now live and evolve with community contributions. It maintains Titan's data storage versatility, with support for Apache Cassandra, ScyllaDB, Apache HBase, and Google Bigtable. Powerful full-text and geospatial search can be provided via integration with popular search solutions including Elasticsearch and Apache Solr. JanusGraph rapidly adopts new TinkerPop versions and provides up-to-date Gremlin and TinkerPop driver compatibility. Community contributions have been growing at a rapid clip since the fork, and third-party visualization integration options have begun to pop up. 2017 saw the first fully hosted Janus option, provided by IBM’s Compose.
Global graph analytics are available out of the box through Janus’s Spark IO formats which allow the standard TinkerPop OLAP tools to read data out of Janus into a Spark or Giraph cluster for analysis via standard Gremlin or custom vertex program graph algorithms. This can also be used to parallelize data loading.
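For readers who have not met the vertex program model, the following is a minimal sketch of the idea in plain Python, not the TinkerPop API. It labels connected components by having each vertex repeatedly pass the smallest label it has seen to its neighbours, one "superstep" at a time. The function name and the adjacency-dictionary input are assumptions made for illustration; a real OLAP job would run the same kind of logic distributed across a Spark or Giraph cluster.

```python
def connected_components(adjacency):
    """Label each vertex with the smallest vertex id reachable from it.

    adjacency maps a vertex id to the ids of its neighbours. The graph is
    treated as undirected, so every edge must be listed in both directions.
    """
    labels = {v: v for v in adjacency}

    # Superstep 0: every vertex sends its own label to its neighbours.
    inbox = {v: [] for v in adjacency}
    for v, neighbours in adjacency.items():
        for n in neighbours:
            inbox[n].append(labels[v])

    # Later supersteps: a vertex only speaks up when its label improves.
    while any(inbox.values()):
        outbox = {v: [] for v in adjacency}
        for v, messages in inbox.items():
            if not messages:
                continue
            best = min(messages)
            if best < labels[v]:
                labels[v] = best
                for n in adjacency[v]:
                    outbox[n].append(best)
        inbox = outbox
    return labels


graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
print(connected_components(graph))
```

The computation stops on its own once no vertex has anything new to say, which is the usual termination condition for this style of algorithm.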
JanusGraph is a viable option for some enterprises that do not require vendor support or wish to significantly customize the stack. But this is not for the faint of heart: administrative and development tooling is usually light for open source projects - this one’s no different.
OrientDB coined the term "multi-model database" in 2010 and offers a unique take on the property graph space. It provides an object-oriented, class-based representation (complete with multiple inheritance) of tabular data but without the constraints of a relational database or many of the other NoSQL products. Vertices and edges are represented as classes, and their property names are not limited to a global set as they are in DSE Graph and JanusGraph. Relationships are stored as links that point to the actual record ID within the database.
OrientDB's class schema support is extensive. Not only does it offer over 20 different data types, but constraints can be placed on each property. Examples of these are min/max values, regex, mandatory, default values, and read-only. What's nice about the way OrientDB handles schema is that it leaves it to the designer whether each class has a full schema, a partial schema, or no schema at all. Additionally, since OrientDB's classes support inheritance, schema from parent classes is also inherited.
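The combination of per-property constraints, optional schema, and inheritance can be pictured with a small Python model. This is a sketch of the concept only, not OrientDB's API: the PropertyRule and SchemaClass names, and the Person and Employee classes, are invented for illustration, and read-only properties are left out to keep it short.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class PropertyRule:
    mandatory: bool = False
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    regex: Optional[str] = None
    default: Any = None


@dataclass
class SchemaClass:
    name: str
    parents: list = field(default_factory=list)  # multiple inheritance
    rules: dict = field(default_factory=dict)    # property name -> PropertyRule

    def all_rules(self):
        """Merge rules inherited from parent classes with this class's own."""
        merged = {}
        for parent in self.parents:
            merged.update(parent.all_rules())
        merged.update(self.rules)
        return merged

    def validate(self, record):
        """Check declared properties; undeclared ones pass (partial schema)."""
        result = dict(record)
        for prop, rule in self.all_rules().items():
            if prop not in result and rule.default is not None:
                result[prop] = rule.default
            if prop not in result:
                if rule.mandatory:
                    raise ValueError(f"{self.name}.{prop} is mandatory")
                continue
            value = result[prop]
            if rule.minimum is not None and value < rule.minimum:
                raise ValueError(f"{self.name}.{prop} is below {rule.minimum}")
            if rule.maximum is not None and value > rule.maximum:
                raise ValueError(f"{self.name}.{prop} is above {rule.maximum}")
            if rule.regex is not None and not re.fullmatch(rule.regex, str(value)):
                raise ValueError(f"{self.name}.{prop} does not match {rule.regex}")
        return result


person = SchemaClass("Person", rules={"name": PropertyRule(mandatory=True)})
employee = SchemaClass(
    "Employee",
    parents=[person],
    rules={
        "age": PropertyRule(minimum=18, maximum=120),
        "status": PropertyRule(default="active"),
    },
)
print(employee.validate({"name": "Ada", "age": 36, "nickname": "countess"}))
```

Here Employee inherits the mandatory name rule from Person, fills in a default status, and still accepts the undeclared nickname property, which is the "partial schema" case described above.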
Learning how to communicate with OrientDB data is usually considered very easy as it offers an extended version of standard SQL, called OrientDB SQL (OSQL). Queries and CRUD operations are easy to write and understand. However, OrientDB is a true property graph database and traditional SQL is not very useful for traversing a graph. That's where the OSQL traverse command comes into play. Even so, writing complex multi-level traversals in OrientDB SQL can be tedious. The match command was introduced in version 2.2 and provides a more intuitive, declarative way to traverse any data in OrientDB with deep relationships.
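Conceptually, a multi-level traversal follows the links stored on each record outward from a starting record, level by level. The sketch below models that in plain Python rather than OSQL; the traverse function, the links dictionary, and the record IDs are made-up values for illustration, and a breadth-first strategy with a depth limit is assumed.

```python
from collections import deque


def traverse(links, start, max_depth):
    """Breadth-first walk returning (record_id, depth) pairs up to max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        rid, depth = queue.popleft()
        visited.append((rid, depth))
        if depth == max_depth:
            continue  # do not expand records at the depth limit
        for target in links.get(rid, []):
            if target not in seen:  # each record is visited only once
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Each record ID maps to the record IDs its outgoing links point at.
links = {
    "#12:0": ["#12:1", "#12:2"],
    "#12:1": ["#12:3"],
    "#12:2": ["#12:3"],
    "#12:3": ["#12:0"],
}
print(traverse(links, "#12:0", 2))
```

Because links point directly at record IDs, each hop is a lookup rather than a join, which is what makes deep traversals practical in the first place.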
In September 2017, OrientDB Ltd was acquired by CallidusCloud. Shortly thereafter, CallidusCloud was acquired by SAP.
CosmosDB’s recent release has focused much more on their NoSQL and document storage features but they’ve included support for TinkerPop/Gremlin with much less fanfare. While the database is still in early release, the offering benefits from the mature ecosystem around it with surprisingly sophisticated build and deploy tools for an initial release. Developers with exposure to the Microsoft toolset and Azure hosting will find themselves at home very quickly.
Amazon Neptune is a vertically scalable, managed property graph database with high availability features and expected large-scale read throughput. Unlike the distributed partitioned graph databases, such as DSE Graph, or the distributed Master-Master NoSQL databases, like OrientDB, Amazon Neptune makes use of a single primary instance for all write operations and up to 15 read replicas.
The Neptune DB engine is integrated into an SSD virtualized storage layer. Each database volume grows in 10 GB increments with a maximum of 64 TB. These 10 GB segments are replicated 6 times, spread across 3 Availability Zones.
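The storage figures above translate into some simple arithmetic. The sketch below works it through for a given amount of data; the function and constant names are hypothetical, terabytes are assumed to be decimal (1 TB = 1000 GB), and the six copies are assumed to be spread evenly across the three Availability Zones.

```python
import math

SEGMENT_GB = 10             # volume grows in 10 GB increments
COPIES_PER_SEGMENT = 6      # each segment is replicated 6 times
AVAILABILITY_ZONES = 3      # copies are spread across 3 zones
MAX_VOLUME_GB = 64 * 1000   # 64 TB ceiling, assuming decimal terabytes


def storage_footprint(data_gb):
    """Estimate segments and physical copies for data_gb of graph data."""
    if data_gb > MAX_VOLUME_GB:
        raise ValueError("data exceeds the 64 TB volume limit")
    segments = math.ceil(data_gb / SEGMENT_GB)
    return {
        "segments": segments,
        "allocated_gb": segments * SEGMENT_GB,
        "physical_copies": segments * COPIES_PER_SEGMENT,
        # Assumes an even spread of copies across zones.
        "copies_per_zone": COPIES_PER_SEGMENT // AVAILABILITY_ZONES,
    }


print(storage_footprint(125))
```

For 125 GB of data this gives 13 segments (130 GB allocated) and 78 physical segment copies, two per segment in each zone under the even-spread assumption.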
If the primary DB instance fails, Neptune automatically fails over to one of the read replicas. Every read replica shares the same underlying virtual storage system as the primary, and writes made to the primary instance are replicated asynchronously.
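The single-writer topology can be modelled in a few lines of Python. This is a toy sketch, not the Neptune API: the Cluster class and instance names are invented, reads are assumed to be spread round-robin, and failover simply promotes the first replica in the list, whereas the real service decides which replica to promote.

```python
class Cluster:
    MAX_REPLICAS = 15  # up to 15 read replicas per primary

    def __init__(self, primary, replicas):
        if len(replicas) > self.MAX_REPLICAS:
            raise ValueError("at most 15 read replicas are supported")
        self.primary = primary
        self.replicas = list(replicas)
        self._next = 0

    def route(self, operation):
        """Send writes to the primary and rotate reads across replicas."""
        if operation == "write" or not self.replicas:
            return self.primary
        target = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return target

    def fail_over(self):
        """Promote a replica to primary (here, simply the first one)."""
        if not self.replicas:
            raise RuntimeError("no replica available to promote")
        self.primary = self.replicas.pop(0)
        self._next = 0
        return self.primary


cluster = Cluster("primary-1", ["replica-1", "replica-2"])
print(cluster.route("write"), cluster.route("read"), cluster.route("read"))
print(cluster.fail_over(), cluster.route("write"), cluster.route("read"))
```

Because every instance reads from the same shared storage layer, promoting a replica does not require copying data first, which is what keeps this kind of failover quick.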
Whew! And there are many more... TigerGraph, ArangoDB, AgensGraph, AllegroGraph. Having trouble deciding? You’re not alone. Stay tuned. This is a fast-moving space and it’s still early. Expect many more developments, acquisitions and consolidation.