Written By: Colin Leister & Brian Hall
As of late Q2 of 2018, there’s a new entrant in the graph database marketplace: Amazon Neptune. Amazon Neptune is a vertically scalable, managed property graph and RDF database with high availability features (>99.99%) and expected large-scale read throughput.
Computationally, it’s more in line with Neo4j’s master/slave architecture than with DataStax Enterprise and its distributed partitioned graph or OrientDB and its master-master distributed approach. Compute scales only vertically, though storage scales horizontally. All writes go to the single primary server, while reads are distributed across the primary instance and all read replicas. Operations on the primary server are ACID with immediate consistency; durability is managed by the distributed storage layer. Read replicas are updated asynchronously via a virtualized storage layer, so all replicas read from the same commit point.
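As a rough illustration of that read/write split, here is a toy Python router. The endpoint names are hypothetical, and the round-robin policy is our assumption: the text only says that reads are distributed, not how.

```python
from itertools import cycle


class ClusterRouter:
    """Toy model of the routing described above: one writer, many readers."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Reads may be served by the primary as well as by every replica.
        self._readers = cycle([primary, *replicas])

    def endpoint_for(self, is_write):
        # Every write goes to the single primary; reads rotate round-robin
        # (the rotation policy is an assumption made for this sketch).
        return self.primary if is_write else next(self._readers)


# Hypothetical endpoint names, for illustration only.
router = ClusterRouter("primary", ["replica-1", "replica-2"])
writes = [router.endpoint_for(True) for _ in range(3)]
reads = [router.endpoint_for(False) for _ in range(6)]
```

Adding replicas in this model spreads reads more thinly but does nothing for writes, which is why write capacity scales only with the size of the primary.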
Neptune is currently schema-less and consists of a single “primary instance” with multiple read replicas. Because it supports both Gremlin 3.3.2 and SPARQL 1.1, you don’t have to choose between a triple store and a property graph.
Storage: Neptune’s DB engine is integrated into an SSD-backed virtualized storage layer. Its storage model is very low touch; no manual provisioning is required as the data size increases. Each database volume grows in 10 GB increments up to a maximum of 64 TB.
Fault Tolerance: Each Amazon Neptune database volume is divided into 10 GB segments that are replicated six times, spread across three Availability Zones. Two copies of data can be lost without affecting write availability. Three copies of data can be lost without affecting read availability. The Amazon Neptune storage system is intended to be “self-healing” in that it is continuously scanned and automatically repaired. If the primary DB instance fails, it will automatically fail over to one of the read replicas.
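The numbers above can be turned into a small worked example. This Python sketch only encodes the thresholds stated in the text (10 GB segments, six copies, two lost copies tolerated for writes, three for reads). It is not a model of Neptune's internal replication protocol, and the function names are our own.

```python
import math

SEGMENT_GB = 10          # segment size stated above
COPIES = 6               # copies per segment, spread across three AZs
MAX_LOST_FOR_WRITES = 2  # writes survive the loss of up to two copies
MAX_LOST_FOR_READS = 3   # reads survive the loss of up to three copies


def segment_count(data_gb):
    """Number of 10 GB segments needed to hold data_gb of data."""
    return math.ceil(data_gb / SEGMENT_GB)


def availability(lost_copies):
    """Which operations remain available after losing some copies of a segment."""
    if not 0 <= lost_copies <= COPIES:
        raise ValueError("lost_copies must be between 0 and 6")
    return {
        "writes": lost_copies <= MAX_LOST_FOR_WRITES,
        "reads": lost_copies <= MAX_LOST_FOR_READS,
    }
```

For example, a 25 GB graph occupies three segments, and a segment that has lost three of its six copies is still readable but no longer writable.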
Backups: Amazon Neptune supports fully automated and manual backups. A backup is a storage volume snapshot of the database instance (meaning all databases are backed up within that Neptune instance). A retention period for the backup can be specified between 1 and 35 days - the retention period defaults to 7 days (1 day for DB clusters). Databases can be recovered to any point in time during the backup retention period and a snapshot can be created manually at any time by using the console.
Primary Write + Read Replicas and Failover: An Amazon Neptune database instance consists of a single primary write instance and up to 15 read replicas. Each read replica shares the underlying virtual storage volume. Replication is asynchronous and, according to the official Amazon Neptune FAQ, the typical lag time is tens of milliseconds. Each read replica can be a failover target (with promotion priority), and the primary supports automated failover to a read replica.
When the primary instance fails, Amazon Neptune changes the CNAME record for the primary instance to one of the healthy replicas. The replica is promoted as the new primary, which should complete within 30 seconds. Without any replicas, Neptune will create a new database instance, a process that typically completes within 15 minutes. Upon connection loss to the database, applications would need to retry the request.
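The retry-on-connection-loss advice might look something like the following Python sketch. It is a generic exponential backoff helper, not part of any Neptune client library. `with_retries` and `flaky_query` are hypothetical names, and `ConnectionError` stands in for whatever exception your driver actually raises.

```python
import time


def with_retries(operation, attempts=6, base_delay=1.0, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


# Demo: a fake query that fails twice (as if a failover were in progress)
# and then succeeds. Delays are recorded instead of actually sleeping.
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("primary unavailable")
    return "ok"


delays = []
result = with_retries(flaky_query, sleep=delays.append)
```

With these defaults the helper waits 1 + 2 + 4 + 8 + 16 = 31 seconds in total before giving up, which is chosen to just cover the roughly 30-second promotion window mentioned above. A cluster without replicas would need a far longer budget.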
Indexing: Neptune takes an interesting approach to indexing: it does not expose index configuration to users. Amazon’s position is that users should not be forced to outguess the software vendor and that indexes should be maintained internally. This seems a logical extension of the ethos behind the low-touch storage approach.
Data Loading: Neptune supports bulk loading from S3 buckets in multiple formats: CSV for Gremlin, and N-Triples, N-Quads, RDF/XML, or Turtle for RDF/SPARQL. Data files may be compressed using gzip.
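To make the bulk load step more concrete, here is a Python sketch that builds the JSON body for a load request without sending it. The field names (`source`, `format`, `iamRoleArn`, `region`, `failOnError`) and the format values reflect our understanding of Amazon's loader documentation and should be checked against the current reference. The extension-to-format mapping, the bucket name, and the role ARN are illustrative assumptions.

```python
import json
import posixpath

# File extension -> loader "format" value, covering the formats listed above.
# The extensions chosen here are a convention assumed for this sketch.
FORMATS = {
    ".csv": "csv",
    ".nt": "ntriples",
    ".nq": "nquads",
    ".rdf": "rdfxml",
    ".ttl": "turtle",
}


def loader_request(s3_uri, iam_role_arn, region):
    """Build the JSON body for a bulk load of a single S3 object."""
    # Gzip-compressed files are allowed, so look past a trailing ".gz".
    name = s3_uri[:-3] if s3_uri.endswith(".gz") else s3_uri
    ext = posixpath.splitext(name)[1].lower()
    if ext not in FORMATS:
        raise ValueError(f"cannot infer a load format from {s3_uri!r}")
    return json.dumps({
        "source": s3_uri,
        "format": FORMATS[ext],
        "iamRoleArn": iam_role_arn,
        "region": region,
        "failOnError": "TRUE",
    })


# Hypothetical bucket and role, for illustration only.
payload = loader_request(
    "s3://example-bucket/people.csv.gz",
    "arn:aws:iam::123456789012:role/ExampleLoadRole",
    "us-east-1",
)
```

The resulting string would be POSTed to the cluster's loader endpoint from inside the VPC. That call is left out here so the sketch stays runnable without a cluster.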
Security: Neptune provides several layers of security. Network isolation is provided via Amazon Virtual Private Cloud (VPC). Database instances, backups, snapshots and replicas have at-rest encryption using AES-256. You can use AWS Key Management Service for key management. Username-based ACLs are currently not supported, though you can choose to secure access to the endpoint using IAM-based credentials. Amazon has released open source tooling for the SigV4 signing required to do this on their GitHub site.
Interesting Gremlin-related findings:
- Vertex and Edge IDs and Property Types: IDs must be strings. User-supplied IDs are supported provided they are unique strings. Automatically generated IDs are UUIDs converted to strings. Neptune supports the base types of boolean, byte, date, double, float, integer, long and string. Arrays, lists, maps and meta-properties are not supported.
- Vertex Property Cardinality: Neptune properties can have either set or single cardinality; list is not supported. Set cardinality is the default: if you set a property value on a vertex, the value is added to the property’s set of values if it is not already there. To replace the existing values instead of adding to them, you must specify single cardinality, e.g.:
- g.V('mycustomid').property(single, 'name', 'Bob') (Note that bulk loading only supports set cardinality.)
- Multi-Label support: Neptune supports multiple labels for a vertex, separated by “::”. A vertex added with the label "Label1::Label2::Label3" is matched by any one of its three labels, e.g.:
- hasLabel("Label1"), hasLabel("Label2"), hasLabel("Label3")
- Transactions: Neptune opens a new transaction at the beginning of each Gremlin traversal or SPARQL update and closes it upon successful completion. Upon an error, the transaction is automatically rolled back. You can include multiple statements in a transaction by using a semicolon (;) or newline character (\n). Manually committing or rolling back is not supported.
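The set versus single cardinality behavior and the “::” label convention from the list above are easy to get wrong, so here is a small pure-Python model of them. `ToyVertex` is a hypothetical class written for this post. It imitates the semantics as described and does not talk to Neptune or use a Gremlin driver.

```python
class ToyVertex:
    """In-memory imitation of the label and cardinality rules described above."""

    def __init__(self, vertex_id, label):
        self.id = str(vertex_id)              # IDs must be strings
        self.labels = set(label.split("::"))  # "A::B" carries labels A and B
        self.properties = {}

    def has_label(self, label):
        return label in self.labels

    def property(self, key, value, cardinality="set"):
        if cardinality == "single":
            # single: replace whatever values were there before
            self.properties[key] = {value}
        elif cardinality == "set":
            # set (the default): add the value unless it is already present
            self.properties.setdefault(key, set()).add(value)
        else:
            raise ValueError("only 'set' and 'single' cardinality are supported")
        return self


v = ToyVertex("mycustomid", "Label1::Label2::Label3")
v.property("name", "Bob").property("name", "Robert").property("name", "Bob")
names_with_set = set(v.properties["name"])       # both names are kept
v.property("name", "Bob", cardinality="single")  # now only "Bob" remains
```

The practical takeaway is that a plain property update silently accumulates values under the default cardinality, so updates that are meant to overwrite should pass single explicitly.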
Overall, Amazon Neptune is a very strong new entrant to the marketplace, as you would expect from a market leader like Amazon. Its low-touch administrative features and horizontal read scaling should be very attractive to users already committed to AWS.