Property graph databases (PGDBs) are gaining popularity as users discover the benefits of this approach for certain use cases. People appreciate the simplified data modeling and querying that a PGDB provides when dealing with connected data. Despite these benefits, after more than 10 years of existence, even the most senior property graph offerings still lag behind many relational database (RDBMS) offerings when it comes to schema support. Some of this is by choice; in other cases, there is simply a lot of engineering catch-up to be done.
Today we’ll describe the schema definition options in three popular PGDBs, Neo4j, DataStax Enterprise Graph (DSEG), and JanusGraph, to illustrate the spectrum of support that is currently available and point out where the holes are. This is not an exhaustive list of graph databases, let alone of the property graph variety, but it provides a fairly representative view of the current playing field.
Before we continue, what do I even mean by schema? For the purposes of this discussion, I’ll consider schema as a set of constraints provided by the end user to the database management system (DBMS) with the expectation that the DBMS will enforce these constraints to maintain data integrity. Examples of schema elements in the relational world include tables, views, indexes, foreign keys, and check constraints. For property graphs we’ll look at vertex and edge labels, property keys, indexes, and a variety of other constraints.
If this is your first foray into PGDBs, a brief aside is in order to go over what I mean by property graph. There are a number of different modeling paradigms that fit under the umbrella of graph database, property graph being only one of them.
The components are straightforward. There are vertices and edges. Edges connect vertices to other vertices. Vertices and edges may also be referred to as nodes and relationships. These vertices and edges are labelled. You can think of a label as a type or class in object-oriented programming parlance. Vertices and edges have properties. Past that, there is some variation between vendors. For example, Neo4j allows a vertex to have more than one label, whereas DSEG and JanusGraph do not. DSEG and JanusGraph support properties on properties (meta-properties). All three support multi-valued properties to varying degrees. Refer to your database’s documentation for specific details on its unique features. We’ll start our overview of schema support with the most senior PGDB of the group, Neo4j.
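To make these components concrete, here is a minimal sketch of a property graph as plain Python data classes. It is a vendor-neutral illustration rather than any database's API; the class names, field names, and sample data are all assumptions made for the example, and it models the single-label-per-vertex variant.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Vertex:
    id: int
    label: str  # a single label per vertex (Neo4j would allow several)
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    label: str
    out_v: Vertex  # vertex the edge starts from
    in_v: Vertex   # vertex the edge points to
    properties: Dict[str, Any] = field(default_factory=dict)


# Hypothetical sample data: two labelled vertices joined by a labelled edge.
alice = Vertex(1, "Person", {"name": "Alice", "age": 34})
bob = Vertex(2, "Person", {"name": "Bob"})
knows = Edge("knows", alice, bob, {"since": 2015})

print(knows.out_v.properties["name"], knows.label, knows.in_v.properties["name"])
```

Note that nothing here stops a caller from attaching any label or property to anything; that gap is exactly what schema constraints are meant to close.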
Schema definition is optional in Neo4j. When it is used, its main purpose is to define indexes that are tied to specific vertex labels, and for the specification of a number of other constraints. Vertex labels are implicitly created as data is added into the system. The same goes for properties and edge labels (relationship types). Properties are not explicitly typed, so you can insert any of the supported types.
As mentioned previously, indexes can be tied to specific vertex labels. For example, you can index by name and age and restrict that to vertices marked with a specific label. Single and multi-property indexing (composite) is supported. Indexes can also be dropped after they’ve been created if they are no longer necessary.
Neo4j also supports not-null constraints on vertex and edge properties, allowing the user to specify which properties are required to have values. This is a helpful feature that is not possible with DSEG or JanusGraph. Additionally, users can define single or composite primary keys (node keys, in Neo4j’s terminology) on vertices that will guarantee vertex uniqueness by label.
* Only available in the Neo4j Enterprise Edition
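The behavior of not-null constraints and composite keys can be sketched with a small toy model. This is not Neo4j's API or Cypher syntax; the `LabelConstraints` class, its method, and the property names are hypothetical, and the sketch assumes that key properties must also be present.

```python
from typing import Any, Dict, List, Set, Tuple


class ConstraintViolation(Exception):
    """Raised when a vertex breaks a constraint declared for its label."""


class LabelConstraints:
    """Toy per-label constraints: required properties plus a composite key."""

    def __init__(self, required: List[str], key: List[str]):
        self.required = required
        self.key = key
        self._seen: Set[Tuple[Any, ...]] = set()

    def check_and_register(self, props: Dict[str, Any]) -> None:
        # Assumption of this sketch: key properties must exist, like required ones.
        for name in [*self.required, *self.key]:
            if props.get(name) is None:
                raise ConstraintViolation(f"missing required property: {name}")
        key_value = tuple(props[k] for k in self.key)
        if key_value in self._seen:
            raise ConstraintViolation(f"duplicate key: {key_value}")
        self._seen.add(key_value)


person = LabelConstraints(required=["email"], key=["first_name", "last_name"])
person.check_and_register(
    {"email": "ada@example.org", "first_name": "Ada", "last_name": "Lovelace"}
)
try:
    # Same composite key as above, so the toy model rejects it.
    person.check_and_register(
        {"email": "other@example.org", "first_name": "Ada", "last_name": "Lovelace"}
    )
except ConstraintViolation as err:
    print(err)
```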
One feature to note, also unique to Neo4j out of the three PGDBs we’re looking at, is its ability to capture statistics about schema elements. The query optimizer uses this information to plan query execution. This is a great example of how the more your database knows about the structure and statistics of your data, the better it can store and retrieve that data. For the full story on Neo4j schema, please refer to https://neo4j.com/docs/developer-manual/current/cypher/schema/.
The Janus management system allows users to create vertex and edge labels, and properties. Properties require that a property data type be specified. Janus supports a comprehensive list of data types, and the user can provide custom serializers/deserializers if they would like to use something that isn’t on the list. Vertex label, edge label, and property key definitions cannot be removed after they’ve been added, but their names can be changed. Property keys also have a cardinality that can be set. This enables users to store a list or set of property values for a given key. Take a phoneNumber property, for example: a cardinality of LIST would allow more than one phone number to be stored under that single property, including duplicates, while SET would maintain a unique set of phone numbers.
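The difference between the cardinalities can be shown with a small toy model. This is not the JanusGraph management API; the `add_value` helper and the sample phone numbers are hypothetical. SINGLE, which keeps only the latest value, is included alongside LIST and SET for contrast.

```python
from enum import Enum
from typing import Any, Dict, List


class Cardinality(Enum):
    SINGLE = "single"  # at most one value per key
    LIST = "list"      # many values, duplicates allowed
    SET = "set"        # many values, duplicates dropped


def add_value(props: Dict[str, List[Any]], key: str, value: Any,
              cardinality: Cardinality) -> None:
    """Store a value under key according to the key's cardinality."""
    values = props.setdefault(key, [])
    if cardinality is Cardinality.SINGLE:
        values[:] = [value]          # replace whatever was there
    elif cardinality is Cardinality.LIST:
        values.append(value)         # keep everything, duplicates included
    elif value not in values:
        values.append(value)         # SET: only add unseen values


numbers = ["555-0100", "555-0100", "555-0199"]

as_list: Dict[str, List[Any]] = {}
as_set: Dict[str, List[Any]] = {}
for number in numbers:
    add_value(as_list, "phoneNumber", number, Cardinality.LIST)
    add_value(as_set, "phoneNumber", number, Cardinality.SET)

print(as_list["phoneNumber"])  # three entries, duplicate kept
print(as_set["phoneNumber"])   # two entries, duplicate dropped
```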
Janus has support for a number of edge label multiplicities. This enforces how many vertices of a particular label a vertex can be connected to via the given edge. There is a caveat with edge multiplicities and vertex uniqueness constraints. These constraints are only as good as the guarantees that can be provided by the storage layer that you’re using with JanusGraph. JanusGraph on top of an ACID BerkeleyDB instance will provide stronger guarantees than running on top of an Apache Cassandra cluster that is using a custom locking protocol to maintain these constraints in the face of failure.
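As a rough illustration of what a multiplicity check involves, here is a single-process toy model. The multiplicity names follow the ones JanusGraph documents (MULTI, SIMPLE, MANY2ONE, ONE2MANY, ONE2ONE), but the `EdgeStore` class and the sample data are hypothetical, and the sketch deliberately ignores the concurrency and failure scenarios that make the real guarantee depend on the storage layer.

```python
from typing import Dict, List, Tuple


class MultiplicityViolation(Exception):
    """Raised when a new edge would break its label's multiplicity."""


class EdgeStore:
    """In-memory toy; a real store must hold these rules under concurrent writes."""

    def __init__(self, multiplicities: Dict[str, str]):
        self.multiplicities = multiplicities
        self.edges: List[Tuple[str, str, str]] = []  # (label, out vertex, in vertex)

    def add_edge(self, label: str, out_v: str, in_v: str) -> None:
        rule = self.multiplicities.get(label, "MULTI")  # MULTI places no limit
        same = [(o, i) for (lbl, o, i) in self.edges if lbl == label]
        if rule == "SIMPLE" and (out_v, in_v) in same:
            raise MultiplicityViolation(f"{label}: edge {out_v}->{in_v} already exists")
        if rule in ("MANY2ONE", "ONE2ONE") and any(o == out_v for o, _ in same):
            raise MultiplicityViolation(f"{label}: {out_v} already has an outgoing edge")
        if rule in ("ONE2MANY", "ONE2ONE") and any(i == in_v for _, i in same):
            raise MultiplicityViolation(f"{label}: {in_v} already has an incoming edge")
        self.edges.append((label, out_v, in_v))


store = EdgeStore({"mother": "MANY2ONE"})
store.add_edge("mother", "alice", "carol")
store.add_edge("mother", "bob", "carol")       # fine: carol may have many children
try:
    store.add_edge("mother", "alice", "dana")  # rejected: alice already has a mother
except MultiplicityViolation as err:
    print(err)
```

The check-then-append step is trivially safe in one process; across a distributed cluster the same two steps need locking or transactions, which is where the caveat above comes in.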
Like Neo4j, Janus schema management tasks can include adding and removing indexes. Indexes also bring the ability to add unique constraints to properties or property/vertex label pairs. These indexes can be of the single or composite variety and are stored either in the same storage back end as the rest of your graph, or an external indexing system such as Elasticsearch or Apache Solr.
As with Janus, DSEG users specify vertex and edge labels and properties and their data types up front. Indexes can be added to properties on vertices and edges. However, there are a few key additions. Unlike JanusGraph and Neo4j, properties must explicitly be associated with specific vertex labels and edge labels. Like a relational table that has a set of columns, each vertex and edge only contains the properties that have been set up in the schema. Along these lines, edge labels include an additional constraint where the user specifies the pairs of vertex labels that this edge may connect. For example, the “knows” edge can connect “Person” to “Person” and “Person” to “Pet,” but not “Person” to “House.”
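The two stricter checks described here, properties bound to labels and edges restricted to declared label pairs, can be sketched as a toy validator. This is not DSE Graph's schema API; the dictionaries, function names, and property names are hypothetical, with the “knows” example taken from the text.

```python
from typing import Any, Dict, Set, Tuple


class SchemaViolation(Exception):
    """Raised when data does not match the declared schema."""


# Hypothetical schema: which properties each vertex label may carry.
VERTEX_PROPERTIES: Dict[str, Set[str]] = {
    "Person": {"name", "age"},
    "Pet": {"name"},
    "House": {"address"},
}

# Which (out label, in label) pairs each edge label may connect.
EDGE_CONNECTIONS: Dict[str, Set[Tuple[str, str]]] = {
    "knows": {("Person", "Person"), ("Person", "Pet")},
}


def validate_vertex(label: str, props: Dict[str, Any]) -> None:
    if label not in VERTEX_PROPERTIES:
        raise SchemaViolation(f"unknown vertex label: {label}")
    extra = set(props) - VERTEX_PROPERTIES[label]
    if extra:
        raise SchemaViolation(f"{label} does not allow properties: {sorted(extra)}")


def validate_edge(label: str, out_label: str, in_label: str) -> None:
    allowed = EDGE_CONNECTIONS.get(label)
    if allowed is None or (out_label, in_label) not in allowed:
        raise SchemaViolation(f"{label} may not connect {out_label} to {in_label}")


validate_vertex("Person", {"name": "Alice"})
validate_edge("knows", "Person", "Pet")
try:
    validate_edge("knows", "Person", "House")
except SchemaViolation as err:
    print(err)
```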
DSEG does not currently support a true unique constraint. It does provide a means to efficiently upsert a vertex through custom vertex ids, but unlike a uniqueness constraint, this will not notify the user if a vertex with the same primary key already exists in the system. However, custom vertex ids do provide a nice method for explicitly defining primary keys and are the direction the database is going, favoring their usage over system-generated default ids.
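The distinction between an upsert and a uniqueness constraint can be sketched with a plain dictionary keyed by vertex id. Both functions are hypothetical illustrations rather than DSE Graph calls, and merging properties on upsert is an assumption of this sketch.

```python
from typing import Any, Dict

Store = Dict[str, Dict[str, Any]]


def upsert(store: Store, vertex_id: str, props: Dict[str, Any]) -> None:
    """Create or update silently: the caller never learns whether the id existed."""
    store.setdefault(vertex_id, {}).update(props)


def insert_unique(store: Store, vertex_id: str, props: Dict[str, Any]) -> None:
    """What a uniqueness constraint would do: refuse a second vertex with the same key."""
    if vertex_id in store:
        raise KeyError(f"vertex already exists: {vertex_id}")
    store[vertex_id] = dict(props)


store: Store = {}
upsert(store, "person:alice", {"name": "Alice"})
upsert(store, "person:alice", {"age": 34})  # no error, existing vertex is updated
print(store["person:alice"])

try:
    insert_unique(store, "person:alice", {"name": "Another Alice"})
except KeyError as err:
    print(err)
```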
I would like to see more systems support the non-null constraints that Neo4j currently supports in its enterprise edition and the strict vertex and edge property composition supported by DSE Graph. I think the JanusGraph edge multiplicity constraints are nice tools, but their usefulness and correctness need to be proven out with testing at scale and in the face of failure. Looking at a hugely popular, well thought out, and versatile RDBMS like PostgreSQL, it would be great if something akin to check constraints came onto the scene. I imagine that as PGDBs gain more traction and make their way into more and more deployments, schema support will improve and continue to grow closer to parity with their relational cousins.
The PGDB landscape is rapidly evolving, with new players coming onto the scene and existing providers adding new functionality at a rapid pace. The following table summarizes schema support across the three systems we covered as of the writing of this article (October 2017), but I’d expect this to change quickly, so be sure to check your PGDB provider for the latest and greatest information. This table notes constraints, not support for a modeling concept. (For example, Neo4j has the notion of vertex and edge labels like JanusGraph and DSEG, but it does not allow the end user to specify exactly which labels are acceptable and thereby disallow the addition of vertices with a label that is not set up.)
Ted Wilmes is a member of the Apache TinkerPop Project Management Committee and the JanusGraph Technical Steering Committee.