Running a 400+ Node Elasticsearch Cluster

Cat nodes

Given the volume of many million posts per day that Meltwater has to process, we need a technology for search and storage that can handle this kind of volume.

We have been a pretty happy users of Elasticsearch since the 0.11.X days. While we have been through some up and downs, in the end we think our choice of technology was the right one.

Elasticsearch is used to back our main media-monitoring application, where customers are able to search and analyze media data, such as News articles, (public) Facebook posts, Instagram posts, blogs and Tweets. We gather this content using a mix of APIs and crawling, enrich them and make them searchable using Elasticsearch.

In this post, we share what we’ve learned, how you can tweak Elasticsearch to improve its performance, and which pitfalls to circumvent.

If you want to learn more about our aforementioned ups and downs with Elasticsearch, see e.g. our previous posts about some numad issues and the batch percolator.

Numbers, Numbers

There is a rather large amount of news articles and tweets produced each day. On busy days, we index about 3 million editorial articles and almost 100 million social posts. We keep editorial data searchable forever (going back to 2009) and social data for 15 months. Our current disk usage is about 200 TB for primaries, and about 600 TB with replicas.

We get around 3k requests per minute. All requests go through a service, creatively named the “search-service”, which in turn does all the talking to the Elasticsearch cluster. Most of our searches are complex rules for what to include in dashboards and in news feeds. For example a customer might be interested in Tesla and Elon Musk, but chooses to exclude everything that has to do with SpaceX or PayPal. We allow users to use a flexible syntax close to the Lucene query syntax, for example:

Tesla AND "Elon Musk" NOT (SpaceX OR PayPal)

Our longest such query is 60+ pages long. The point is: Out of the 3k requests per minute, none is the easy “Barack Obama” query you enter in google. They are hideous beasts, and our ES nodes have to work hard to figure out a matching document set.

How many trees do you think the query would use?

Versions

We run a custom Elasticsearch version, based on 1.7.6. The only difference from the stock 1.7.6 version is that we used a backported version of roaring bitsets/bitmaps for caching. It’s backported from Lucene 5 to Lucene 4, which backs ES 1.x. The default bitset used for caching in Elasticsearch 1.X is very costly for sparse results, but this has been improved in Elasticsearch 2.X.

Why are we not on a newer version of Elasticsearch? The main reason is that upgrading is hard. Rolling upgrades between major versions only became a thing going from ES 5 to 6 (the upgrade from 2 to 5 was supposed to support a rolling upgrade, but that didn’t happen). Thus, we can only upgrade the cluster by doing a full cluster restart. Downtime is hard for us, but we could probably deal with the estimated 30-60 minute downtime a restart should cause. The really scary part is that there is no real rollback procedure if something goes down.

So far, we have elected to not upgrade the cluster. We would like to, but so far there have been more urgent tasks. How we actually perform the upgrade is undecided, but it might be that we choose to create another cluster rather than upgrading the current one.

Node Setup

Since June 2017, we are running our main cluster on AWS, using i3.2xlarge instances for our data nodes. We used to run the cluster in a COLO (Co-located Data Center), but moved to the AWS cloud to bring the lead time for new machines down, allowing us to be more flexible in scaling up and down.

We run three dedicated master nodes in different availability zones, with discovery.zen.minimum_master_nodes set to 2. This is a rather common pattern to avoid the split-brain problem.

The disk needed for our dataset, at 80 percent capacity and 3+ replicas, put us at around 430 data nodes. Initially we intended to have different tiers of data, with older data on slower disks, but since we have only relatively low volume older than 15 months (only editorial data, since we drop older social data), this did not make sense. The monthly hardware cost is much more than running in a COLO, but the cloud allows us to double the cluster size in approximately zero time.

You may ask why we decided to manage our own ES cluster. We considered hosted solutions but we decided to opt for our own setup, for these reasons: AWS Elasticsearch Service allows us too little control, and Elastic Cloud would cost us 2-3 times more than running directly on EC2.

To protect ourselves against an availability zone going down, the nodes are spread out over all 3 availability zones in eu-west-1. We are configuring this using the AWS plugin. It provides a node property named aws_availability_zone. We set cluster.routing.allocation.awareness.attributes to aws_availability_zone. This ensures that ES prefers to place replicas in different availability zones, and also makes routing of queries prefer nodes in the same availability zone.

The instances run Amazon Linux, with the ephemeral drive mounted in ext4. The instances have ~64 GB of RAM. We give 26 GB of heap to the ES nodes, and let the rest be used for disk caches. Why 26 GB? Because the JVM is built on black magic.

We use Terraform together with autoscaling groups to provision the instances, and Puppet to set up everything on them.

Index Structure

Since both our data and our searches are date-based, we use time-based indexing, similar to e.g. the ELK (elasticsearch, logstash, kibana) stack. We also keep different kinds of data in different indices, so that e.g. editorial documents and social documents end up in different daily indices. This allows us to drop only social indices after some time, and add some query optimizations. Every daily index runs with two shards each.

This setup results in a large number of shards (closing in on 40k). With this many shards and nodes, cluster operations sometimes take on a peculiar nature. For example, deleting indices seems to be bottlenecked by the master’s ability to push out the cluster state to all the nodes. Our cluster state is about 100 MB, but using TCP compression that can be reduced to 3MB (you can get your own cluster state from curl localhost:9200/_cluster/state/_all). The master still needs to push 1.3 GB per cluster change (430 notes * 3 MB state size). Out of this 1.3 GB, about 860 MB has to be transferred between availability zone (i.e. basically over the internet). This takes time, especially when you want to delete a few hundred indices. We are hoping that newer versions of Elasticsearch will make this better, primarily from the ES 2.0 feature of only sending the cluster diffs.

Performance

As mentioned earlier, our ES cluster needs to handle some very complex queries, in order to meet the search needs of our customers.

To deal with the query load, we have done a huge amount of work in the performance-area over the last few years. We have had to learn our fair share about performance testing an ES cluster, as evident from the quote below.

Sadly, less than a third of the queries managed to successfully complete, as the cluster went down. We are confident that the test itself caused the cluster to go down.
Excerpt from the first performance test with real queries on our new ES cluster platform

In order to control how queries are executed, we have built a plugin which exposes a set of custom query types. We use these query types to provide functionality and performance optimisations not available in stock Elasticsearch. For example, we have implemented wildcards within phrases, with support for executing within SpanNear queries. We optimise “*” to a match-all-query. And a whole lot of other things.

Performance in Elasticsearch and Lucene is highly dependent on your queries and data. There is no silver bullet. With that said, here are some tips, in order from basic to more advanced:

Limit your search to only relevant data. E.g. in daily indices, only search in relevant date ranges. For indices in the middle of your range search, don’t apply the range query/filter.
If you use wildcards, avoid leading wildcards - unless you can reverse-index the terms. Double-ended wildcards are hard to optimize for.
Learn the signs of resource consumption. Are the data nodes constantly on high CPU? High IO-wait? Look at GC statistics. They are available from profilers or through JMX agents If you spend more than 15 percent of the time in ParNewGC, go looking for memory hogs. If you have any SerialGC pauses, you have issues for real. Don’t know what I’m talking about? No worries, this series of blog posts is a good JVM performance intro. Remember that ES and the G1 collector do not play nice together.
If you have Garbage Collection issues, the go-to fix should not be trying to tune the GC settings. This is often futile, since the defaults are good. Instead, focus on reducing memory allocations. For how to do that, see below.
If you have memory issues, but no time to fix them, consider looking into Azul Zing. It’s an expensive product, but we saw a 2x speedup in throughput just by using their JVM. We ended up not using it though, since we could not justify the cost.
Consider using caching, both outside of Elasticsearch and on the Lucene level. In Elasticsearch 1.X one can control caching using filters. In later versions this seems harder, but implementing your own query type for caching is plausible. We will probably do something similar when going to 2.X.
See if you have hot-spots (e.g. one node getting all the load). You can try to spread load either by using shard allocation filtering or by trying to moving shards around yourself using cluster rerouting. We have automated the rerouting using linear optimization, but you can get a long way with simpler automation.
Set up a test environment (I prefer my laptop) where you load a representative amount of data from production (preferably at least one shard). Push it (hard) with queries taken from production. Use the local setup to experiment with resource consumption of your requests.
In combination with the point above, use a profiler on the Elasticsearch process. This is the most important tip on this list. We use both Flight Recorder through Java Mission Control and VisualVM. People who try to speculate (including paid consultants/support) about performance are wasting their (and your) time. Ask the JVM what consumes time and memory, and go spelunking in the Elasticsearch/Lucene source code to figure out what ran the code or allocated the memory.
Once you understand which part of the request causes the slowness, you can try to optimize by changing the request (e.g. changing the execution hint for terms aggregations or switching query types). Changing query types, or query order, can have huge effects. If that does not help you, you can try optimizing the ES/Lucene code. This might seem crazy, but led to a 3-4x decrease in CPU usage and 4-8x decrease in memory usage for us. Some of our changes were trivial (e.g. the indices query), while other required us to rewrite query execution completely. The resulting code is heavily geared toward our query patterns, so may or may not be viable for others to use. We have thus not made it open source so far. It could however be material for another blog post.

Graph showing diff of performance. Such impressive.

Graph Description: Our response times. With/Without a change where we rewrote the query execution of Lucene. It also meant that we no longer had nodes going out of memory multiple times per day.

A note, because I know the question will come: From previous performance tests, we would expect a small performance improvement from upgrading to ES 2.X, but nothing game-changing. That being said, if you have migrated an ES 1.X cluster to ES 2.X, we would love to hear your practical experiences with how that migration went for you.

If you have read this far, you really like Elasticsearch (or at least, you really need it). We are curious to learn about your experiences, and any tips that you can share with us. Please share your feedback and questions in the comments below.