It’s common to have folks write to the JanusGraph users and dev group looking for advice on data loading performance. In many cases, they’re running on their laptop or a small test cluster, kicking the tires, and having some issues getting acceptable performance. For this post, we’ll walk through the process of getting things running in a speedy fashion.
The experimental setup consists of JanusGraph and a Gatling-based load generator. The load generator does nothing more than send mutating traversals that create vertices in batches of 10 like this:
That’s it, not even any edges. Though basic, this will be enough to see the effects of a few different tunables on throughput and latency.
The tests were run in AWS using a standalone Gatling instance. Gatling sent write requests to a standalone JanusGraph instance running on an m4.2xlarge. JanusGraph then stored all data in a single Scylla instance hosted on an i3.2xlarge. All instances were located in the same placement group. The Scylla instance was created using Scylla’s AMI, and its default configuration was used throughout all of the experiments, so all of the tweaks were performed on the Janus side. Note that a production setup should never use fewer than three Scylla instances (unless there isn’t an SLA to speak of, I guess…), but one instance is adequate for demonstration purposes, and the following discussion is equally applicable to a clustered setup.
The experimental setup
Janus’s deployment options can be confusing to the newcomer. The support for a variety of storage backends is a bonus if Janus needs to fit into an existing infrastructure, but one question immediately comes to mind: should users run Janus colocated (on the same instances) with their storage layer? This is a tunable in its own right, but I chose to run Janus on its own instance for these experiments. If Janus is colocated with Scylla, it’s worth looking at Scylla’s smp and overprovisioned settings so that resource contention between Scylla and the Janus JVM is kept to a minimum.
These experiments will use the Apache TinkerPop Java driver to remotely connect to JanusGraph. Some people still run Janus embedded but I usually set it up to be accessed remotely so that the application workload does not commingle with the database workload.
These tests were run using the original string-based querying method, not the newer language-embedded withRemote option. A later post will look at the performance of the GLV vs. Groovy scripts. One update was made on the client driver side prior to the test runs. The following properties were added to remote-objects.yaml to allow more in-flight requests and ensure that Gatling was not the bottleneck.
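The exact values used for these tests are not reproduced here. As a rough sketch of the kind of change involved, the TinkerPop Java driver's YAML configuration has a connectionPool section that controls how many requests can be in flight. The host name and all numbers below are illustrative assumptions, and the rest of the file (such as the serializer entry) is assumed to stay as shipped.

```yaml
# remote-objects.yaml (fragment; host name and values are illustrative assumptions)
hosts: [janusgraph.example.internal]   # hypothetical host name
port: 8182
connectionPool:
  # maximum number of connections the driver keeps open per host
  maxSize: 32
  # maximum number of in-flight requests on a single connection
  maxInProcessPerConnection: 32
  # maximum number of times one connection can be borrowed simultaneously
  maxSimultaneousUsagePerConnection: 32
```

Raising these limits only helps if the load generator is the side that is saturated, so it is worth confirming that before and after the change.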
There are a lot of good performance-related tips in the Apache TinkerPop docs, and we’ll follow a few here. First, a few things about the load script.
The “addV”s will be executed by the Groovy script engine embedded in Gremlin Server. One thing in particular is a performance killer when it comes to script execution: script compilation. Gremlin Server has a script cache, so if the hash of the script that is sent over matches a cached, already compiled script, no compilation is needed. If Gremlin Server doesn’t get a hit, expect a nasty compilation penalty on every query. Usually, similar queries are sent over and over; our experiment just repeatedly inserts 10 new people. The key to getting a cache hit every time the script is sent is to turn the parts of the script that change, the property values, into variables. That way, the Groovy that is sent over will be exactly the same every time, thereby matching the cached script. The script parameterization feature can then be used to attach the unique property values to each request, and they are bound into the script’s runtime when it executes. See the Apache TinkerPop docs on script parameterization for the syntactical details.
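To make the idea concrete, here is a minimal sketch in plain Java of how a client could keep the script text constant and move the changing values into bindings. The `person` label, the `name` property, the `name0`…`name9` variable names, and the `ParameterizedBatch` class are all assumptions made for illustration, not the script used in these tests.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: one constant Gremlin-Groovy script plus per-request bindings. */
class ParameterizedBatch {

    /** Builds a chained addV traversal whose property values are variables. */
    static String script(int batchSize) {
        StringBuilder sb = new StringBuilder("g");
        for (int i = 0; i < batchSize; i++) {
            // name0, name1, ... are variable names, not literal values,
            // so the script text never changes between requests
            sb.append(".addV(\"person\").property(\"name\", name")
              .append(i)
              .append(")");
        }
        // end with count() so only a single number is serialized back
        return sb.append(".count()").toString();
    }

    /** Builds the bindings that carry the values that differ per request. */
    static Map<String, Object> bindings(int batchSize, long requestId) {
        Map<String, Object> params = new LinkedHashMap<>();
        for (int i = 0; i < batchSize; i++) {
            params.put("name" + i, "person-" + requestId + "-" + i);
        }
        return params;
    }

    public static void main(String[] args) {
        String script = script(10);
        System.out.println(script);
        // Same script text, different values on every request
        System.out.println(bindings(10, 1L));
        System.out.println(bindings(10, 2L));
    }
}
```

With the TinkerPop Java driver, the pair would then be sent with `client.submit(script, bindings)`. Because the script string is identical on every call, only the first request should pay the compilation cost.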
Looking at our script, you’ll notice that I’m including more than one addV per call. The exact number you’ll want to send over at once may vary, but the basic idea holds that there are performance benefits to be gained from batching. In addition to batching, note that I chained all of the mutations into a single traversal. This amortizes the cost of traversal compilation, which can be non-trivial when you’re going for as high a throughput as possible. Note that Gremlin is quite powerful and you can mix reads and writes in the same traversal, extending way beyond my simple insert example, so keep that in mind as you write your mutating traversals. The chosen batch size of 10 is rather arbitrary, so plan to test a few different sizes when you’re doing performance tuning.
Response serialization is a CPU-intensive operation and can eat into throughput. Notice that the queries being sent over end with a “count” step. We could also have iterated the traversal with an “iterate” step so that nothing came back in the response. If the newly minted vertex ids are needed, an “id” step can be added to the traversal to return just the ids without incurring the full overhead of serializing the vertex objects. The same advice goes for reads: if the full vertex or edge is not required in the results of a query, update the traversal to return only what is necessary.
Gremlin Server has come up a few times already so you might be saying, “Gremlin Server? I’m running JanusGraph”. You’re not wrong, but JanusGraph implements the TinkerPop APIs and includes Gremlin Server. All remote access to JanusGraph occurs through the Gremlin Server. A number of Gremlin Server settings will be discussed below.
The JanusGraph Gremlin Server startup script located in ./bin/gremlin-server.sh specifies a max heap size of 512 megabytes. Since these experiments only test write throughput, we don’t need to worry about global or long-running queries, so the majority of objects created as part of our inserts will be short-lived. It is a good idea to monitor Gremlin Server’s memory usage and, most likely, increase the heap size. If the load largely involves short-lived objects, the new generation size can also be adjusted to lessen the likelihood of unnecessary promotion into the older generations.
Per the comprehensive Gremlin Server docs, users can specify the number of worker and Gremlin pool threads. The docs recommend increasing the worker pool size to at most two times the instance core count. The Gremlin pool is responsible for script execution so, like the worker pool, if the Janus server is not fully utilized, increasing its size may allow the server to process more script requests in parallel. These settings are found in the Gremlin Server config file: conf/gremlin-server/gremlin-server.yaml
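As a sketch of what those two settings look like in the file, the fragment below sizes both pools for an 8-vCPU instance such as the m4.2xlarge used here. The specific numbers are illustrative assumptions rather than recommendations, and the rest of the file is assumed to stay as shipped.

```yaml
# conf/gremlin-server/gremlin-server.yaml (fragment; values are illustrative)
# 8 vCPUs, so 16 is the "two times the core count" ceiling for the worker pool
threadPoolWorker: 16
# threads available for script execution
gremlinPool: 16
```

It is safer to raise these one step at a time while watching CPU utilization, since oversized pools can add contention without adding throughput.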
Janus vertices are uniquely identified by ids of the Java type long. Every time an addV is called, an id needs to be assigned to that vertex. These ids come from a pool of ids that Janus “retrieves” when the first requests come in and then throughout normal operation, as the pool is exhausted. In a distributed setup, these id blocks must be retrieved carefully, in a manner where it is not possible for more than one Janus instance to grab the same pool of ids. If they did, two or more Janus instances could step on each other’s mutations. Since this operation must be safe across not only multiple threads but also multiple instances, locks are required.
If a cluster (or single instance) is under heavy load, it is common to see large spikes in latency in an untuned setup. The user may say to themselves, “hmm, Janus runs on the JVM, sounds like we have some garbage collection pauses clogging things up”. That isn’t a bad guess, and it very well could be the culprit, but they check their metrics and GC looks fine, with no stop-the-world pauses that correlate with the drops in throughput. At this point, if they move on to looking at the internal Janus threads, there is a good chance they’ll see a number of idassigner threads periodically blocking and putting a stop to the current batch of insertions. This is id allocation in action. Luckily, there are a few tunables related to id allocation that can help alleviate this pain and greatly reduce the p99 times.
Refer to the JanusGraph configuration reference for the full set of options. For now we’ll focus on block-size and renew-percentage. The default block-size is 10,000 and the default renew-percentage is 0.3. That means that when a Janus id thread starts up, it will reserve 10,000 ids, and when fewer than 30% of those ids remain, it will attempt to grab the next block asynchronously to reduce the chance of a full stop. If an application is inserting thousands of vertices a second, that block won’t last long, and even with the asynchronous renewal, threads will start to block and things will come to a grinding halt. The docs recommend that users set the block-size to approximately the number of vertices expected to be inserted in an hour, but don’t be afraid to experiment with this value and the renew-percentage.
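A back-of-the-envelope calculation shows why the defaults struggle under load. The sketch below assumes a single id pool drained at a steady rate, which is a simplification of what JanusGraph actually does, and the `IdBlockHeadroom` class and its method names are made up for illustration.

```java
/** Back-of-the-envelope model of id block consumption (a simplification). */
class IdBlockHeadroom {

    /** Seconds a freshly reserved block lasts at a steady insert rate. */
    static double secondsPerBlock(long blockSize, double verticesPerSecond) {
        return blockSize / verticesPerSecond;
    }

    /** Seconds between the renewal trigger and the block running dry. */
    static double renewalHeadroomSeconds(long blockSize,
                                         double renewPercentage,
                                         double verticesPerSecond) {
        // renewal starts once only renewPercentage of the block is left
        return blockSize * renewPercentage / verticesPerSecond;
    }

    /** Block size following the "about an hour of inserts" guidance. */
    static long hourlyBlockSize(double verticesPerSecond) {
        return Math.round(verticesPerSecond * 3600);
    }

    public static void main(String[] args) {
        double rate = 10_000;        // vertices per second
        long defaultBlock = 10_000;  // default block-size
        double renew = 0.3;          // default renew-percentage

        System.out.println("default block lasts (s): "
                + secondsPerBlock(defaultBlock, rate));
        System.out.println("default renewal headroom (s): "
                + renewalHeadroomSeconds(defaultBlock, renew, rate));

        long hourly = hourlyBlockSize(rate);
        System.out.println("hourly block size: " + hourly);
        System.out.println("hourly renewal headroom (s): "
                + renewalHeadroomSeconds(hourly, renew, rate));
    }
}
```

Under these assumptions, at 10,000 vertices per second a default block lasts about one second and the asynchronous renewal has roughly 0.3 seconds to finish before writers stall. A block sized for an hour of inserts (36,000,000 ids) leaves about 18 minutes of headroom instead.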
When a Janus instance is shut down, its unused ids go with it and are not returned to the pool. This can cause some concern because there is a finite number of ids available. The good news is, per the docs: “JanusGraph can store up to a quintillion edges (2^60) and half as many vertices”, so the likelihood of running into this limit is tiny unless, perhaps, you’re regularly reserving very, very, very large id blocks and cycling your Janus instances incredibly frequently.
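To put a rough number on “tiny”, the sketch below takes the quoted figure at face value (2^59 vertex ids, half of 2^60) and assumes the worst case where every restart throws away one entire block. It ignores how JanusGraph actually partitions the id space, and the `IdSpaceBudget` class is purely illustrative.

```java
/** Rough estimate of how long discarded id blocks take to matter. */
class IdSpaceBudget {

    // "half as many vertices" as 2^60 edges, per the quoted docs
    static final long VERTEX_ID_SPACE = 1L << 59;

    /** Number of whole blocks of the given size that fit in the id space. */
    static long blocksAvailable(long blockSize) {
        return VERTEX_ID_SPACE / blockSize;
    }

    /** Whole years until exhaustion if each restart discards one block. */
    static long yearsUntilExhausted(long blockSize, long restartsPerDay) {
        return blocksAvailable(blockSize) / (restartsPerDay * 365L);
    }

    public static void main(String[] args) {
        long blockSize = 1_000_000_000L;   // a very large block
        long restartsPerDay = 1_440L;      // one restart every minute

        System.out.println("blocks available: " + blocksAvailable(blockSize));
        System.out.println("years until exhausted: "
                + yearsUntilExhausted(blockSize, restartsPerDay));
    }
}
```

Even with a one-billion-id block discarded every minute, this simplified model takes more than a thousand years to run out.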
This is going to be a case of “Do as I Say, not as I Do” because when you’re doing performance tuning, it’s not a good idea to go and change a bunch of things all at once. One method I would suggest is the USE method described by Brendan Gregg. Having said that, hopefully these suggestions provide some food for thought when you embark upon your next JanusGraph tuning exercise.
This first experiment was run against a freshly unzipped JanusGraph install. Gatling was set up to inject 1,000 batches of vertices per second which, taking into account the batch size of 10, translates to 10,000 vertex creations per second. The defaults of particular interest, given the above discussion, are:
A smooth response rate of about 1,000 responses per second
For the second experiment, the settings were left the same as in experiment one, but since things ran so smoothly, the user injection rate was bumped up to 1,200 users / second. This does not mean Janus can handle that many with this setup; it just means that Gatling attempted to make 1,200 calls per second, regardless of Janus’s ability to handle the load.
Throughput becomes erratic
Things start to get a little more interesting. It doesn’t look like any requests are being dropped, but the Janus throughput is much choppier, interspersed with very large drops to what appears to be a complete stop. The chart below also shows a much more erratic set of response times, with the upper percentile responses hitting over 500 ms, and sometimes even over one second. If you were to hook a profiler up to the Janus JVM, you’d likely see the previously mentioned id allocation threads blocking periodically, lining up with the huge spikes in latency and the dead stops on the throughput front. The Janus instance could be pushed a little further, to the point of dropping requests, but we’ll move on to updating a number of configuration options next to see what effect that has on performance.
At 1,200 users per second, latencies periodically spike to 500 ms or more
A number of the Janus tunables have been updated for this last experiment. The Gatling user injection rate has also been bumped up from 1,200 to 1,750 users per second. Remember that a “user” here corresponds to one of those g.addV().... traversals that inserts 10 new vertices.
The experiment is run, and looking at the responses per second below, Janus doesn’t quite keep pace with the 1,750 per second, but it is pretty close. Note that the extreme drop-offs in throughput have disappeared and the results are more in line with the 1,000 users / second of the first experiment, albeit at a nicely increased throughput. The change to the default id block-size removed the most egregious of the stalls. By setting it to the very high number of 1 billion, we removed the need to go through an id block retrieval round. This is not super realistic outside of perhaps a bulk loading scenario, but it nicely illustrates the effect of id allocation on throughput. When tuning, I would not recommend shooting for never needing to retrieve id blocks. Instead, adjust the renew percentage and block size so that there is plenty of time for an asynchronous renewal to occur before the id block is starved and writes are brought to a halt.
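For a more moderate starting point than 1 billion, the two id settings could look something like the fragment below in the graph properties file that Gremlin Server loads. The values are illustrative assumptions based on the “about an hour of inserts” guidance at 10,000 vertices per second, not the settings used in this experiment.

```properties
# JanusGraph graph properties (fragment; values are illustrative assumptions)
# roughly one hour of inserts at 10,000 vertices per second
ids.block-size=36000000
# start the asynchronous renewal while 30% of the block is still unused (the default)
ids.renew-percentage=0.3
```

From there, the block size and renew percentage can be adjusted while watching for idassigner threads blocking under load.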
Results show between 16,000 and 17,500 vertex additions per second
Switching over to the response time percentiles, the p90 numbers look pretty solid and the p99s are well below the 500-1,000 ms seen in the previous experiment.
A large reduction in tail latencies compared to the untuned 1,200 user / second experiment
Depending on your data ingestion requirements, an out-of-the-box Janus install may meet your needs. If not, this post has covered a few tunables that you can add to your toolbox and possibly put to good use the next time you’re methodically studying the performance of your Janus setups.