Optimal Shard Placement in a Petabyte Scale Elasticsearch Cluster

At the heart of Meltwater’s and Fairhair.ai’s information retrieval systems lies a collection of Elasticsearch clusters containing billions of social media posts and editorial articles.

The index shards in our clusters vary greatly in their access patterns, workload and size, which presents some very interesting challenges.

This blog post describes how we use Linear Optimization modeling for distributing search and indexing workload as evenly as possible across all nodes in our clusters.

We have found that our solution distributes hardware utilization far more evenly and reduces the likelihood of any single node becoming a bottleneck. As a result we have improved our search response times and saved money on infrastructure.

Background

Fairhair.ai’s information retrieval systems contain around 40 billion social media posts and editorial articles, and handle millions of queries daily. This platform provides our customers with search results, graphs, analytics, dataset exports and high-level insights.

These massive datasets are hosted on several Elasticsearch clusters totaling 750 nodes and comprising thousands of indices and over 50,000 shards.

For more on our cluster architecture, have a look at our previous blog posts Running a 400+ Node Elasticsearch Cluster and the more recent Using Machine Learning to Load Balance Elasticsearch Queries, which both explain other crucial parts of it.

Uneven Workload Distribution

Our data and customer queries are largely date-based, and most queries hit a specific time period, like the past week, past month, last quarter or some other date range. To facilitate this indexing and query pattern we use time-based indexing, similar to the ELK stack.
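To make this concrete, here is a minimal sketch of what time-based indexing looks like in practice (the endpoint, index naming scheme and Python client code are purely illustrative, not our actual setup): each day's documents go into a dated index, and a "past week" query only has to touch the last seven indices.

```python
import datetime as dt
import requests

ES = "http://localhost:9200"  # hypothetical cluster endpoint

def index_name(day: dt.date) -> str:
    # One index per day, e.g. "posts-2018.06.15" (illustrative naming only).
    return f"posts-{day:%Y.%m.%d}"

# Write today's document into today's index.
today = dt.date.today()
requests.post(f"{ES}/{index_name(today)}/_doc",
              json={"text": "example post", "published": today.isoformat()})

# A "past week" query only needs to search the seven most recent daily indices.
last_week = ",".join(index_name(today - dt.timedelta(days=i)) for i in range(7))
resp = requests.post(f"{ES}/{last_week}/_search?ignore_unavailable=true",
                     json={"query": {"match": {"text": "example"}}})
print(resp.json()["hits"]["total"])
```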

This index architecture gives a number of benefits: for example, we can do efficient bulk indexing and drop entire indices when the data gets old. It also means that the workload for a given index varies greatly with time.

Recent indices receive most of the indexing requests and exponentially more queries than older ones.

Figure 1: Access pattern for our time-based indices. The Y-axis shows the number of queries executed, the X-axis shows the age of the index. We can clearly see weekly, monthly, and one year plateaus, followed by a long tail of lower workload for older indices.

The access pattern in Figure 1 is expected since our customers are more interested in recent information, and regularly compare this month vs last month and/or this year vs last year. Our problem is that Elasticsearch does not know about this pattern, nor does it automatically optimize for the observed workload!

The built-in Elasticsearch shard placement algorithm only takes into consideration:

  1. The number of shards on each node; it tries to balance the shard count evenly across the cluster.
  2. The high and low disk watermarks. Elasticsearch considers the available disk space on a node before deciding whether to allocate new shards to that node or to actively relocate shards away from it. A node that has reached the low watermark (i.e. 80% disk used) is not allowed to receive any more shards; a node that has reached the high watermark (i.e. 90%) will start to actively move shards away.
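For reference, the watermarks mentioned above are ordinary cluster settings that can be changed at runtime. A minimal sketch (the endpoint is hypothetical, and the 80%/90% values simply mirror the figures quoted above rather than any cluster's actual configuration):

```python
import requests

ES = "http://localhost:9200"  # hypothetical cluster endpoint

# Disk-based shard allocation thresholds, applied as transient cluster settings.
requests.put(f"{ES}/_cluster/settings", json={
    "transient": {
        # Once a node crosses the low watermark it no longer receives new shards.
        "cluster.routing.allocation.disk.watermark.low": "80%",
        # Once it crosses the high watermark, shards are actively relocated away.
        "cluster.routing.allocation.disk.watermark.high": "90%",
    }
})
```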

The fundamental assumption in this algorithm is that each shard in the cluster receives roughly the same amount of workload and that they are all of similar size, which is very far from the truth in our case.

This quickly leads to what we call hot spots developing randomly in the cluster. These hot spots come and go as the workload varies over time.

A hot spot is in essence a node that is operating near the limit of one or more system resources, such as CPU, disk I/O or network bandwidth. When this occurs, the node first queues up requests for a while, which makes response times go up; if the overload continues for long, requests are eventually rejected and errors propagate up to the user.

Another common overload event is unsustainable JVM garbage collection pressure from queries and indexing operations, which leads to the dreaded JVM GC hell phenomenon: the JVM either can’t reclaim memory fast enough, runs out of memory (OOM) and crashes, or gets stuck in a never-ending GC loop, freezes up and stops responding to requests and cluster pings.

This problem got worse and more apparent when we refactored our architecture to run on AWS. What had ‘saved’ us before was that we ran up to 4 Elasticsearch nodes on the same very powerful (24-core) bare-metal machine inside our data center. This masked the impact of the skewed shard distribution: it largely got soaked up by the relatively high number of cores per machine.

With the refactoring we started to run only one Elasticsearch node per less powerful (8-core) machine, and we immediately saw big problems with hot spotting during our early testing.

Elasticsearch effectively assigned shards at random, and with 500+ nodes in a cluster it suddenly became very likely that too many of the ‘hot’ shards were placed on a single node, and such nodes quickly got overwhelmed.

This would have led to a very bad user experience for our customers, with overloaded nodes responding slowly and sometimes even rejecting requests or crashing. If we had run that setup in production, our users would have seen frequent, seemingly random slowdowns of the UI and occasional timeouts.

At the same time, a large number of nodes, the ones assigned the ’colder’ shards, would have been effectively idle, leading to inefficient usage of our cluster resources.

Both of these problems could be avoided if we could just make Elasticsearch place its shards more intelligently, since the average system resource usage across all nodes was at a healthy 40% level.

Continuous Cluster State Changes

Another thing we observe when running 500+ nodes is that the cluster state is constantly changing. Shards move on and off nodes due to:

  • New indices being created and old indices being dropped.
  • Disk watermark triggers due to indexing and other shard movements.
  • Elasticsearch randomly deciding that a node has too few/too many shards compared to the cluster average.
  • Hardware and OS-level failures causing new AWS instances to spin up and join the cluster. With 500+ nodes this happens several times a week on average.
  • New nodes being added, almost every week, because of normal data growth.

With all of this in mind we came to the conclusion that a continuous and dynamic re-optimization algorithm was required to holistically address all of these issues at once.

The Solution - Shardonnay

After much research into the available options, we concluded that we wanted to:

  1. Build the solution ourselves. We did not find any good articles, code or other existing ideas that seemed to work well at our scale and for our use case.
  2. Run the re-balancing process outside of Elasticsearch and use the cluster reroute API instead of trying to build a plugin, like in this approach (see the sketch after this list). The main reason for this is that we wanted a rapid feedback cycle, and deploying a new plugin version on a cluster of this scale can take weeks.
  3. Use linear programming to calculate the optimal shard moves at any given time.
  4. Run the optimization continuously so that the cluster state gradually stabilizes towards an optimal state.
  5. Not move too many shards at once.
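To give a feel for what driving the reroute API from outside the cluster looks like, here is a minimal sketch (the endpoint, index and node names are made up for illustration; this is not Shardonnay's code):

```python
import requests

ES = "http://localhost:9200"  # hypothetical cluster endpoint

# Explicitly move shard 3 of a (hypothetical) index from one node to another.
resp = requests.post(f"{ES}/_cluster/reroute", json={
    "commands": [
        {
            "move": {
                "index": "posts-2018.06.01",  # hypothetical index name
                "shard": 3,
                "from_node": "node-17",       # hypothetical node names
                "to_node": "node-42",
            }
        }
    ]
})
resp.raise_for_status()
```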

One interesting observation that we made was that if we tried to move too many shards at once it was very easy to trigger cascading shard relocation storms. These events could then go on for hours with shards moving back and forth in an uncontrolled manner triggering high watermarks at various locations, which in turn led to even more shard relocations and the chain continued.

To understand why this happens, it is important to know that when a shard that is actively indexed into is moved off a node, it actually uses a lot more disk than it would if it stayed put. This has to do with how Elasticsearch persists its writes into translogs. We saw cases where an index could use up to twice as much disk on the node that was sending it away, and naturally, more disk would also be used on the receiving node. This means that a node that triggered a shard relocation due to the high disk watermark will, for a while, use even more disk until it has moved enough shards off to other nodes.

The service that we built to do all of this is called Shardonnay, as a reference to the famous Chardonnay grape.

Linear Optimization

Linear optimization (also called linear programming) is a method to achieve the best outcome, such as maximum profit or lowest cost, in a mathematical model whose requirements are represented by linear relationships.

The optimization technique is based on a set of linear variables, some constraints that must be met, and an objective function that defines what success looks like. The goal of linear optimization is to find the values of the variables that minimize the objective function while still respecting the constraints.
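As a toy example (using the open-source PuLP library purely for illustration; this is not the tooling described later): minimize the cost 2x + 3y of covering at least 10 units of work, when the cheaper activity x can handle at most 7 units.

```python
# A toy linear program, solved with the open-source PuLP library
# (illustrative only; the post does not say which LP tooling we use).
from pulp import LpMinimize, LpProblem, LpVariable, value

prob = LpProblem("toy_example", LpMinimize)

# Two non-negative decision variables.
x = LpVariable("x", lowBound=0)
y = LpVariable("y", lowBound=0)

# Objective function: minimize total cost 2x + 3y.
prob += 2 * x + 3 * y

# Constraints that any solution must respect.
prob += x + y >= 10      # must cover at least 10 units of work
prob += x <= 7           # x alone can handle at most 7 units

prob.solve()
print(value(x), value(y), value(prob.objective))  # -> 7.0 3.0 23.0
```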

Shard Allocation - Modeled as a Linear Optimization Problem

Shardonnay runs continuously, and at each iteration it performs the following algorithm:

  1. Using the Elasticsearch APIs, fetch information about the existing shards, indices and nodes in the cluster, as well as their current placement.
  2. Model that cluster state as a set of binary LP variables. Each combination of (node, index, shard, replica) gets its own variable. The LP model also has a number of carefully designed heuristics, constraints and an objective function; more on that below.
  3. Run the LP model through a linear solver, which produces an optimal solution given the constraints and the objective function. The solution is a new, updated assignment of shards to nodes.
  4. Interpret the LP solution and transform it into a set of shard moves.
  5. Instruct Elasticsearch to carry out the shard moves using the cluster reroute API.
  6. Wait for the cluster to move the shards.
  7. Start over from step 1.
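To make the steps concrete, here is a heavily simplified sketch of such a loop. It is not Shardonnay's code: it uses the PuLP library and plain HTTP calls against a hypothetical endpoint, ignores replicas and most constraints, and uses a stand-in objective, but it shows the overall shape of steps 1-7.

```python
import time
import requests
from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum, value

ES = "http://localhost:9200"  # hypothetical cluster endpoint

def fetch_cluster_state():
    """Step 1: current shard placements and the node list, via the cat APIs."""
    shards = requests.get(f"{ES}/_cat/shards?format=json").json()
    # Keep only started primaries to keep this sketch simple (no replicas).
    shards = [s for s in shards if s["prirep"] == "p" and s["state"] == "STARTED"]
    nodes = [n["name"] for n in requests.get(f"{ES}/_cat/nodes?format=json").json()]
    return shards, nodes

def plan_moves(shards, nodes, max_moves=10):
    """Steps 2-4: model placements as binary variables, solve, extract moves."""
    prob = LpProblem("shard_placement", LpMinimize)
    x = LpVariable.dicts(
        "assign",
        [(s["index"], s["shard"], n) for s in shards for n in nodes],
        cat=LpBinary,
    )

    # Each shard must end up on exactly one node.
    for s in shards:
        prob += lpSum(x[(s["index"], s["shard"], n)] for n in nodes) == 1

    # Stand-in objective: balance the *number* of shards per node while limiting
    # churn. The real model weighs workload and disk (see the cost function below).
    max_per_node = LpVariable("max_per_node", lowBound=0)
    for n in nodes:
        prob += lpSum(x[(s["index"], s["shard"], n)] for s in shards) <= max_per_node
    moved = lpSum(1 - x[(s["index"], s["shard"], s["node"])] for s in shards)
    prob += moved <= max_moves              # don't move too many shards at once
    prob += max_per_node + 0.01 * moved     # objective function

    prob.solve()
    moves = []
    for s in shards:
        for n in nodes:
            if n != s["node"] and value(x[(s["index"], s["shard"], n)]) > 0.5:
                moves.append({"move": {"index": s["index"], "shard": int(s["shard"]),
                                       "from_node": s["node"], "to_node": n}})
    return moves

def apply_moves(moves):
    """Step 5: hand the planned moves to Elasticsearch via the reroute API."""
    if moves:
        requests.post(f"{ES}/_cluster/reroute", json={"commands": moves})

while True:                                  # steps 6-7: wait, then start over
    shards, nodes = fetch_cluster_state()
    apply_moves(plan_moves(shards, nodes))
    time.sleep(600)
```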

The difficult part of this algorithm is getting the objective function and the constraints right; the rest is handled by the LP solver and Elasticsearch itself.

Unsurprisingly, it turned out that this problem was not so easy to model and solve for a cluster of this size and complexity!

The Constraints

Some of the constraints in the model are based on rules dictated by Elasticsearch itself, such as always adhering to the disk watermarks, or the fact that a replica can’t be placed on the same node as another replica of the same shard.

Other constraints we added ourselves, based on pragmatism learned from years of running large clusters. Some examples of our own constraints are:

  • Don’t move today’s indices, as they are the hottest of all and receive almost constant read and write load.
  • Prefer moving smaller shards over larger ones, as they are faster for Elasticsearch to move.
  • Create and place future shards optimally a few days before they become active and start receiving indexing and heavy query workload.

Together, these constraints ensure that shards are correctly placed and that every solution respects the shard movement restrictions.
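In LP terms, constraints like these become linear (in)equalities over the binary placement variables. Here is a hedged sketch of how a few of them could be expressed with PuLP (the data, names and numbers are made up for illustration; this is not our actual model):

```python
from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum

# Illustrative data only: two copies (primary + replica) of one shard,
# hypothetical sizes in GB and three hypothetical nodes.
nodes = ["node-1", "node-2", "node-3"]
copies = [("posts-2018.06.01", 0, 0), ("posts-2018.06.01", 0, 1)]  # (index, shard, copy)
size_gb = {c: 40 for c in copies}
disk_capacity_gb = {n: 1000 for n in nodes}
todays_index = "posts-2018.06.01"
current_node = {copies[0]: "node-1", copies[1]: "node-2"}

prob = LpProblem("constraints_sketch", LpMinimize)
x = LpVariable.dicts("assign", [(c, n) for c in copies for n in nodes], cat=LpBinary)

# Every shard copy lives on exactly one node.
for c in copies:
    prob += lpSum(x[(c, n)] for n in nodes) == 1

# Replica anti-affinity: copies of the same (index, shard) never share a node.
for n in nodes:
    for (index, shard) in {(c[0], c[1]) for c in copies}:
        prob += lpSum(x[(c, n)] for c in copies if (c[0], c[1]) == (index, shard)) <= 1

# Stay below the low disk watermark (80% of capacity, as quoted earlier).
for n in nodes:
    prob += lpSum(size_gb[c] * x[(c, n)] for c in copies) <= 0.8 * disk_capacity_gb[n]

# Don't move today's index: pin its copies to wherever they already are.
for c in copies:
    if c[0] == todays_index:
        prob += x[(c, current_node[c])] == 1
```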

The Cost Function

Our cost function weighs together a number of different factors. For example, we want to:

  • minimize the variance in indexing and search workload, to reduce problems with hot spotting
  • keep the disk utilization variance as small as possible, to achieve stable system operation
  • minimize the number of shard movements, to prevent the shard relocation storms explained above
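Variance itself is not a linear expression, so in an LP one typically minimizes a linear proxy, such as the spread between the most and the least loaded node, combined with a penalty per shard move. A sketch of such a combined cost function (the weights, load numbers and names are made up; this is not our actual cost function):

```python
from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum

# Hypothetical data: four shards with workload scores and their current nodes.
nodes = ["node-1", "node-2", "node-3"]
shards = ["s0", "s1", "s2", "s3"]
load = {"s0": 5.0, "s1": 1.0, "s2": 3.0, "s3": 2.0}
current_node = {"s0": "node-1", "s1": "node-1", "s2": "node-2", "s3": "node-3"}

prob = LpProblem("cost_sketch", LpMinimize)
x = LpVariable.dicts("assign", [(s, n) for s in shards for n in nodes], cat=LpBinary)
for s in shards:
    prob += lpSum(x[(s, n)] for n in nodes) == 1   # each shard on exactly one node

# Linear proxy for "low workload variance": squeeze every node's load between
# load_min and load_max, then minimize the spread.
load_max = LpVariable("load_max")
load_min = LpVariable("load_min")
for n in nodes:
    node_load = lpSum(load[s] * x[(s, n)] for s in shards)
    prob += node_load <= load_max
    prob += node_load >= load_min

# Penalize every shard that leaves its current node, to avoid relocation storms.
moves = lpSum(1 - x[(s, current_node[s])] for s in shards)

# Weighted sum of the competing goals (weights are illustrative only).
prob += (load_max - load_min) + 0.1 * moves

prob.solve()
```

A disk utilization term can be handled the same way, squeezing each node's disk usage between its own min and max variables and adding that spread to the weighted sum.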

LP Variable Pruning

At our scale the size of these LP models becomes a problem in itself, and we quickly learned that it’s very difficult to solve problems with upwards of 60M LP variables in any reasonable amount of time. So we applied a large number of optimization and modelling tricks to drastically cut down the problem size, including biased sampling, heuristics, divide and conquer, relaxation and iterative optimization.
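As one example of such pruning (a simplified illustration of the biased-sampling idea, not our actual heuristics): instead of creating a binary variable for every (shard, node) pair, only create variables for each shard's current node plus a small sampled set of candidate destinations.

```python
import random

def candidate_nodes(current_node, all_nodes, k=5):
    """Pick a small candidate set of destination nodes for one shard."""
    # Always keep the current node so "leave the shard where it is" stays feasible.
    others = [n for n in all_nodes if n != current_node]
    return [current_node] + random.sample(others, min(k, len(others)))

# Only the (shard, node) pairs returned here get binary LP variables,
# e.g. ~6 candidates per shard instead of ~750 nodes, which shrinks the
# model by a couple of orders of magnitude (numbers are illustrative).
```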

Figure 2: Heat map showing imbalanced workload over an Elasticsearch cluster. This shows up as a large variance of resource usage on the left side of the graph, and through continuous shard placement optimization it gradually stabilizes towards a much tighter band of resource usage.

Figure 3: Heat map showing CPU usage across all nodes in a cluster, before and after we did a tweak of the Shardonnay hotness function. A significant change in CPU usage variance from one day to the other - with the same request workload.

Figure 4: Heat map showing disk read throughput for the same time period as in Figure 3. Disk reads are also more evenly distributed across the cluster.

Results

In conclusion, all of this allows our LP solver to produce good solutions within a few minutes, even for a cluster state as large as ours, and thus to iteratively improve the cluster state towards optimality.

And best of all, the workload variance and disk utilization converged as expected, and this near-optimal state has been maintained through the many intentional and unforeseen cluster state changes we’ve had since then!

We now maintain a healthy workload distribution in our Elasticsearch clusters, all thanks to linear optimization and our service that we lovingly call Shardonnay.

Given that you have read this far, you must be very interested in distributing search and indexing workload. Thank you! :) If you want to share any related experience with shard balancing or have other questions and comments about this topic, please post them in the comment section below.