Introduction
We get contacted frequently for JanusGraph support, tuning advice, and recommendations. The vast majority of those inquiries center on getting data loaded into JanusGraph in the first place. Some projects never get off the ground if they can't get the data loaded fast enough. By now, most everyone has already seen the JanusGraph blogs from Ted. In addition, we've built up a suite of remediation tactics based on several projects and experiences - consider this the (JanusGraph) tuneup when you take your graph to the mechanic.
Before we get started, a few caveats:
- We always try the “low hanging fruit” first. No need to remodel the database if a simple tweak to the JVM will suffice.
- Your mileage may vary. When you go to the auto shop, the standard 50k mile tuneup doesn’t catch everything...but this is a good place to start.
Low Hanging Fruit
JanusGraph version
Consider upgrading your JanusGraph version if you haven’t done so for a while. For instance, version 0.4.0 has some performance improvements related to parallelizing calls to the JanusGraph storage layer which could speed up your queries. The standard caveat applies here - this is open source software, you should always read the changelog because while you may get some performance benefits, you are also on the bleeding edge.
JVM Settings
There are several items that we like to look at here - none of which will be surprising to server side Java engineers.
- If your heap size is 8GB or larger (which is the case most of the time), you should consider using the G1 garbage collector. It has fewer knobs to twiddle and is specifically designed for large heap sizes. (This is a good primer for more reading on types of garbage collectors.)
- If you are using a containerized environment, make sure your heap size is consistent with your container’s settings. Oftentimes we see containers get killed and restarted when the JVM heap size is too big for its host container.
- Setting your heap size too low in general will cause excessive garbage collections (which cause latency) or out-of-memory (OOM) errors. In most cases, the default JanusGraph heap size is too small. A quick way to verify which collector and heap size your JVM actually picked up is sketched below.
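It is easy to set these flags and have them silently ignored, especially in containers. The following is a minimal, generic JMX sketch (not JanusGraph-specific code) that prints the heap ceiling and collectors the running JVM actually ended up with:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class JvmCheck {
    public static void main(String[] args) {
        // Max heap actually granted to this JVM (bytes -> GB)
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %.1f GB%n", maxHeap / (1024.0 * 1024 * 1024));

        // Collector names such as "G1 Young Generation" / "G1 Old Generation"
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println("GC in use: " + gc.getName());
        }
    }
}
```

Run it with the same JVM options you give JanusGraph (for example -Xmx and -XX:+UseG1GC), inside the same container, to confirm the settings actually took effect.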
JanusGraph properties
There are several properties configured in the JanusGraph properties file upon which our gremlin-server.yaml relies. Note that some of these properties may have a GLOBAL_OFFLINE mutability level. Here are some instructions on how to change them and here is the full list of options - these are the key ones we tend to tweak.
- ids.block-size: this is the number of globally unique IDs reserved in each block that a JanusGraph instance acquires, and it defaults to 10,000. Each instance maintains a pool of IDs to hand out later. For instance, every time we add a vertex, JanusGraph assigns it an ID taken from this pool. When the pool drops below a given percentage (controlled by ids.renew-percentage), JanusGraph tries to reserve a new block of this size asynchronously. However, if the insertion rate is high, the pool can be exhausted very quickly and the asynchronous renewal will eventually block worker threads while they wait for the new block of IDs. Periodic spikes in your query latencies are one potential symptom of this behavior. With a cluster of JanusGraph instances this can be even worse, since reserving a new block requires acquiring a lock across multiple threads and multiple instances. In short, setting this value too low causes transactions to block while waiting for the reservation. Setting it too high can waste IDs if a JanusGraph instance is shut down (unused IDs are not returned to the pool upon restart), but this shouldn’t be an issue if your insertion rate is high and you don’t shut instances down too often. The recommendation is to set this value to the number of vertices expected to be inserted per hour per JanusGraph instance.
- storage.batch-loading - Caution! Use this one with care because it turns off constraint checks, thereby reducing reads before writes. It is most safely used when you have highly preprocessed input data or otherwise do not need uniqueness or other referential integrity constraints enforced at load time - either because the data coming in is clean or because you’ve already done the constraint checking programmatically.
- query.batch - This boolean enables parallelization of storage access. It doesn’t help write performance directly, but it can help if you otherwise have to do a lot of reads before your writes.
- cache.db-cache - For multi-node JanusGraph deployments, it is usually best to turn this off because the cache is not distributed, so each node would have a different view of the data. In some cases it makes sense to turn it on for nodes that serve very heavy reads after the load completes. You could also turn it off for the instances in charge of writes and turn it on for the other instances that handle reads.
- query.force-index - This setting may not apply in most cases, as it is really a protective measure against runaway queries: when enabled, JanusGraph will only run queries that use an index. Most of the time these scenarios have already been addressed, but it’s always good to verify. (A sketch of setting these options programmatically follows this list.)
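These options normally live in your JanusGraph properties file, but as a minimal illustration, here is a sketch of setting them programmatically when opening the graph. The backend, hostname, and every value shown are assumptions to replace with your own:

```java
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class OpenGraph {
    public static void main(String[] args) {
        // Illustrative values only - tune them to your own workload and storage backend.
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "cql")          // assumption: Cassandra/Scylla via CQL
                .set("storage.hostname", "127.0.0.1")   // assumption: local storage node
                .set("ids.block-size", 100_000)         // ~vertices expected per hour per instance
                .set("storage.batch-loading", true)     // only if your input data is already clean
                .set("query.batch", true)               // parallelize storage reads
                .set("cache.db-cache", false)           // off for multi-node, write-heavy instances
                .set("query.force-index", false)        // enable only as a guard against runaway scans
                .open();

        // ... load data ...
        graph.close();
    }
}
```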
Gremlin Server properties
Some highly pertinent performance settings also exist in the gremlin-server.yaml file. Again there are several settings available to experiment with - these are the key ones we focus on.
- threadPoolWorker - The default value for this is 1. This pool is essentially the traffic cop that disseminates incoming work. It should be set to at most 2 * the number of cores on the JanusGraph instance.
- gremlinPool (the number of traversals that can be run in parallel) - A worker from this pool runs a traversal until the query completes. Start at the number of cores, then increase incrementally; the right value is usually a function of what your storage layer can handle, since the storage layer will eventually become the point of contention. (A sketch of overriding both settings follows.)
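You would normally edit these values directly in gremlin-server.yaml. As a hedged illustration only, Gremlin Server's Settings object can also be loaded and overridden in code; the file path and the sizing choices below are assumptions based on the guidance above:

```java
import org.apache.tinkerpop.gremlin.server.GremlinServer;
import org.apache.tinkerpop.gremlin.server.Settings;

public class StartServer {
    public static void main(String[] args) throws Exception {
        // Load the stock config, then override the two pools discussed above.
        Settings settings = Settings.read("conf/gremlin-server.yaml"); // assumption: default file location

        int cores = Runtime.getRuntime().availableProcessors();
        settings.threadPoolWorker = 2 * cores;  // upper bound from the guidance above; start lower if unsure
        settings.gremlinPool = cores;           // start at core count, then increase incrementally

        new GremlinServer(settings).start().join();
    }
}
```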
Client driver configuration
This may seem like a “Captain Obvious” moment, but make sure the client knows about all the candidate servers. These lists are not maintained automagically like in some commercial offerings - you must update these settings explicitly. Additionally, if you have single client instances producing a large amount of traffic, it may also be worth increasing the client connection pool sizes.
Please see this blog for more on driver settings.
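With the TinkerPop Java driver, that looks roughly like the sketch below; the hostnames and pool sizes are placeholders, and 8182 is simply the default Gremlin Server port:

```java
import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;

public class DriverSetup {
    public static void main(String[] args) {
        // List every candidate Gremlin Server explicitly - the driver will not discover them for you.
        Cluster cluster = Cluster.build()
                .addContactPoints("janus-1.example.com", "janus-2.example.com", "janus-3.example.com") // placeholders
                .port(8182)
                .minConnectionPoolSize(4)   // bump these for chatty single-client instances
                .maxConnectionPoolSize(16)
                .create();

        Client client = cluster.connect();
        client.submit("g.V().limit(1).count()").all().join(); // simple smoke test
        cluster.close();
    }
}
```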
Intermediate Steps
If you’ve tried all the low hanging fruit and still haven’t gotten the performance you seek, don’t despair. There are still some reasonably painless areas that can be addressed.
Storage Metrics
JanusGraph requires you to choose a storage layer, and each one has its own performance characteristics. In every case, what you’re looking for is the latency of requests to the storage layer as seen from JanusGraph. Frequently that requires collecting metrics from a storage-specific toolset.
- BigTable - Prometheus or some other GCP metrics console. This could be a blog post on its own, but for a starting primer, read this.
- Scylla - The Scylla Monitoring Stack is based on their Prometheus API and is deployed as a set of Docker containers described here.
- Cassandra - C* can be the most difficult to tune because it has the most knobs to turn. From GC to replication to consistency and compaction, this is surely an area where you can make an entire career. Most clients use something like DataDog to collect the data.
- HBase - Depending on whose HBase you are using, you may have some vendor-specific tools at your disposal. Otherwise you can consider tools like hbtop.
The JanusGraph project does not come with an out-of-the-box Prometheus setup, but JanusGraph community member Ganesh Guttikonda has put together a well documented repo that can be used to stand up a JanusGraph specific dashboard: https://github.com/gguttikonda/janusgraph-prometheus.
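Whichever backend dashboard you settle on, it can also help to turn on JanusGraph’s own backend instrumentation so you see the storage calls from JanusGraph’s side. A minimal sketch, assuming a JanusGraph version that exposes the metrics.* options (the in-memory backend is used only so the snippet runs standalone):

```java
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class MetricsEnabledGraph {
    public static void main(String[] args) {
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "inmemory")  // placeholder backend; use your real one for meaningful numbers
                .set("metrics.enabled", true)        // basic timing and operation counts on backend calls
                .set("metrics.jmx.enabled", true)    // expose them over JMX (scrape with your tool of choice)
                .open();

        // ... run a representative workload, then inspect the backend call timers ...
        graph.close();
    }
}
```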
Schema Review
You can always cripple a properly tuned JanusGraph architecture with a bad data model. Your data model should always be optimized for your expected query patterns. In practice, no two client data models have ever been exactly the same, even within the same problem space, because the shapes and sizes of data sets vary, the frequency of events varies, and the key value drivers vary by customer. Having said that, there are a few things that you should look out for; unfortunately, they are likely to require some changes to your application:
- Uniqueness Constraints - Determine whether unique constraints are truly necessary, as they impose a read before every write. Don’t add them unless absolutely necessary; if you do need them, then pull in the locking wait time (see the next section, and the sketch below).
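For context, a uniqueness constraint in JanusGraph is a composite index marked unique, typically combined with a locking consistency modifier - which is exactly what introduces the read-before-write and the lock wait discussed next. A minimal sketch, where the property key and index names are made up for illustration:

```java
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.PropertyKey;
import org.janusgraph.core.schema.ConsistencyModifier;
import org.janusgraph.core.schema.JanusGraphIndex;
import org.janusgraph.core.schema.JanusGraphManagement;

public class UniqueConstraintExample {
    public static void defineUniqueUserId(JanusGraph graph) {
        JanusGraphManagement mgmt = graph.openManagement();

        // Hypothetical key: one "userId" value allowed per graph.
        PropertyKey userId = mgmt.makePropertyKey("userId").dataType(String.class).make();
        JanusGraphIndex byUserId = mgmt.buildIndex("byUserId", Vertex.class)
                .addKey(userId)
                .unique()                 // this is what forces a read (and a lock) before every write
                .buildCompositeIndex();
        mgmt.setConsistency(byUserId, ConsistencyModifier.LOCK);

        mgmt.commit();
    }
}
```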
Locking Wait Time
When writing to the graph, locks are required (for a single node) that a single thread will hold long enough to write the node and have the storage layer acknowledge it as committed (how long that takes is storage layer dependent). The wait time is controlled by storage.lock.wait-time and defaults to 100ms; if your storage layer completes most writes in, say, 50ms, then the graph is waiting for no reason. The good news with a longer wait is that you don’t get spurious lock failures; if you set it too short, you end up with a bunch of retries. The recommended value is a small multiple of the average write time. So how low should you set it? It depends on your write latency - try ratcheting it down slowly, 10ms at a time, from the default until you find your sweet spot. Alternatively, if you’ve already gathered storage metrics, take your average write time from there and adjust accordingly.
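If you haven’t gathered storage metrics yet, one crude way to estimate your average committed write time is a simple timing loop like the one below. This is only an illustration (in-memory backend so it runs standalone, arbitrary vertex label), not a proper benchmark - point it at your real backend to get numbers worth acting on:

```java
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class WriteLatencyProbe {
    public static void main(String[] args) {
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "inmemory")   // placeholder; use your real backend
                .open();
        GraphTraversalSource g = graph.traversal();

        int n = 1_000;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            g.addV("probe").property("seq", i).iterate();
            g.tx().commit();                          // one small committed write per iteration
        }
        double avgMs = (System.nanoTime() - start) / 1_000_000.0 / n;
        System.out.printf("average committed write: %.2f ms%n", avgMs);

        graph.close();
    }
}
```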
Structural Changes… If You Must
Look for and Remediate Supernodes
Do you know what the maximum degree in your graph is? You should. If you don’t know the answer, a quick Spark Gremlin routine can give it to you (a sketch follows below). If you have queries that traverse through those nodes, you can be sure they’re not as performant as you’d like. Do your queries need to go this route? Can they originate elsewhere? Can you remodel your graph? Unfortunately there’s no succinct to-do here, but in short, removing supernodes from your graph and graph model will pay dividends when it comes to performance and operations.
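Here is one way to sketch that degree survey with the TinkerPop Java API, assuming a reasonably recent TinkerPop; the limit of 10 is arbitrary, and on a truly large graph you would back the traversal source with SparkGraphComputer rather than run it OLTP-style:

```java
import java.util.List;
import java.util.Map;

import org.apache.tinkerpop.gremlin.process.traversal.Order;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;
import org.apache.tinkerpop.gremlin.structure.T;

public class DegreeSurvey {
    // Top 10 highest-degree vertices; for big graphs, build g with .withComputer(SparkGraphComputer.class).
    public static List<Map<String, Object>> topHubs(GraphTraversalSource g) {
        return g.V()
                .project("id", "degree")
                .by(T.id)
                .by(__.bothE().count())
                .order().by(__.select("degree"), Order.desc)
                .limit(10)
                .toList();
    }
}
```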
Caching and Local Maps to Reduce Reads
Are you looking up attributes from the graph before deciding whether or what to write back? Any read in your write path ultimately steals time that could be spent writing. In some cases these reads are inevitable, but in many other cases they can be removed with some careful ingest design. That discussion is larger than the scope of this article, but tools like Kafka Streams can ease the burden of designing a data loading pipeline that uses techniques like caching to remove a large percentage of your ingest read traffic and, in turn, greatly improve your loading performance.
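As a trivial illustration of the idea (not a substitute for a real streaming pipeline), a local map that remembers which natural keys you have already resolved eliminates the repeated “does this vertex exist?” reads. The property name, vertex label, and key type here are made up for the sketch:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

public class VertexIdCache {
    // Natural key (hypothetical "userId") -> JanusGraph vertex id, filled in as we ingest.
    private final Map<String, Object> seen = new ConcurrentHashMap<>();
    private final GraphTraversalSource g;

    public VertexIdCache(GraphTraversalSource g) {
        this.g = g;
    }

    // Returns the vertex id for this userId, reading from the graph only on a cache miss.
    public Object resolve(String userId) {
        return seen.computeIfAbsent(userId, key -> {
            Optional<Object> existing = g.V().has("userId", key).id().tryNext();
            return existing.orElseGet(() ->
                    g.addV("user").property("userId", key).id().next());
        });
    }
}
```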