RabbitIO: A Tool to Backup and Restore Messages from RabbitMQ

Are you working with an ocean of microservices connected by a message broker like RabbitMQ? Then there is a chance you have experienced queues filling up and messages not reaching their destination, most likely due to stability issues somewhere in your service architecture.

If you want to know how our engineers investigate such issues, and why they built a tool called RabbitIO that helps them with this, then this post is for you.

As an on-call engineer in a world of microservices, you are responsible for keeping communication between them stable. At some point one of these services will become slow or break, the self-healing mechanisms put in place can fail, queues will start to clog up, and messages are no longer consumed or end up in dead letter queues.

This is where the on-call engineer, hopefully, is alerted to the problem and starts the investigation. How can the engineer quickly restore the normal flow of messages? What should be done with messages that are queueing up? We’ll look at how you can use RabbitIO as a tool in a chain of events to help resolve these issues, and then look more closely at the common problems where RabbitIO is useful.

Why we wrote RabbitIO and how to use it

The aim of RabbitIO is to be a lightweight, battle-tested tool for managing messages during our investigation steps, or in emergency situations where you quickly need to move data between RabbitMQ and disk.

We will go through the basic usage, but you can also take a look at the detailed usage description.

RabbitIO has two main commands, “out” and “in”.

“out” consumes messages out from a queue:

rabbitio out -e rabbitio-exchange -q rabbitio-queue -d data/
  • -e the exchange we bind to
  • -q the queue we consume from
  • -d the directory to store messages in

“in” publishes messages into an exchange that is bound to a queue:

rabbitio in -e rabbitio-exchange -q rabbitio-queue -f data/1_message_100.tgz
  • -e the exchange we publish to
  • -q the queue that receives the messages
  • -f the directory or tarball containing our messages

A typical investigation into a RabbitMQ queue that is filling up with messages consists of these steps:

1: Consume the messages to disk (using RabbitIO). In our example we write data from the exchange ‘rabbitio-exchange’ and queue ‘rabbitio-queue’ into the directory ‘data/’. Running the following command will consume the messages and store them as tarballs, with each message saved under a UUID filename, e.g. fef21f1e-763f-417a-92fb-efc5bf19dd0c.rio:

rabbitio out -e rabbitio-exchange -q rabbitio-queue -d data/

2: Analyze the content of these messages to find the root cause of the clogged queue

3: Fix the root cause

4: Publish the messages into the queue (using RabbitIO)

rabbitio in -e rabbitio-exchange -q rabbitio-queue -f data/1_message_100.tgz
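As an illustration of step 2, here is a minimal sketch of peeking inside a backup tarball before deciding what to do with the messages. The tarball and the message bodies below are fabricated for the example; in a real investigation the backup is produced by `rabbitio out` as in step 1.

```shell
# Fabricate a small backup so the commands below have something to
# work on (a real one comes from `rabbitio out`). The filenames and
# JSON bodies are made up for illustration.
mkdir -p data/msgs
printf '{"order_id": 42}' > data/msgs/fef21f1e.rio
printf '{"order_id": 43}' > data/msgs/a0b1c2d3.rio
tar czf data/1_message_100.tgz -C data/msgs .

# List the messages contained in the backup.
tar tzf data/1_message_100.tgz

# Print a single message body straight to stdout without unpacking.
tar xzf data/1_message_100.tgz -O ./fef21f1e.rio   # prints {"order_id": 42}
```

This lets you sample a few message bodies before committing to a full extraction.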

Common Problems with Queue Management and Consumers

Now that you know how we use RabbitIO in general, let’s look into some common situations that you might run into when handling queues. For each situation we will discuss how RabbitIO can help you to investigate these issues.

Unparsable messages

Some messages hit corner cases where the consumer does not know what to do with them, possibly due to a bug in the service. The consumer logs an error and rejects the message in question, and the message ends up in the dead letter queue.

The investigation (Step 2) may involve filtering messages based on their content, ignoring the ones that do not need to be processed, and sorting them by priority. You can use any of your favorite bash tools to do so, for example sort, sed, uniq, awk, or more advanced JSON parsers like jq. To have full access to the metadata you can use a tool called pax; otherwise tar is suitable. More details can be found in the metadata section of the RabbitIO README.
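For example, a quick way to triage a pile of extracted message bodies is to tally them by some field. The sample `.rio` files and the `"type"` field below are made up for illustration; real bodies depend on your services.

```shell
# Create a few sample message bodies, standing in for extracted .rio
# files. The "type" field is an assumption for the example.
mkdir -p triage
printf '{"type":"order.created","id":1}\n' > triage/a.rio
printf '{"type":"order.created","id":2}\n' > triage/b.rio
printf '{"type":"payment.failed","id":3}\n' > triage/c.rio

# Tally the messages per type, most frequent first.
grep -h -o '"type":"[^"]*"' triage/*.rio | sort | uniq -c | sort -rn
```

A tally like this quickly shows whether one message type dominates the backlog.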

Example script where we only need certain messages:

#!/bin/bash
set -euo pipefail

# Extract the tarball using pax to gain access to the headers.
pax -r -zf 1_message_100.tgz

# Make the header files readable.
chmod 644 PaxHeaders.0/*.rio

# Create a directory for the wanted files.
mkdir -p output

# Find the wanted messages by inspecting the pax header data.
# Loop over all header files; header and message files share
# the same filename.
cd PaxHeaders.0/
for file in *.rio
do
    if grep -q id123id "$file"; then
        # Copy the message file to the output directory.
        cp -v "../$file" "../output/$file"
    fi
done

# Compress the wanted files into a tarball that we can publish.
cd ../output
tar czvf id123id.tar.gz *.rio

# Publish with RabbitIO (Step 4).
rabbitio in -e rabbitio-exchange -q rabbitio-queue -f id123id.tar.gz

Before replaying the messages, we might have to fix a bug on the consumer side that caused the messages not to be processed in the first place, or determine that the consumer is healthy and ready to process messages again.

Slow consumer

With enough on-call experience, you’ll learn that the consumer of a queue can become slow. Messages are being processed, but not fast enough. Messages can be large, and within minutes millions of them can pile up in the queue. If the consumer cannot keep up, RabbitMQ will eventually run out of resources, either the disk space used for persistent storage or the available memory.

RabbitMQ is a message broker: it is meant for passing messages around, not for storing them. If you need durable message persistence, you are wise to look elsewhere. Once RabbitMQ runs out of disk space, it can corrupt its own underlying database. You do not want that to happen.
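RabbitMQ does have safeguards against running out of resources: it raises alarms and blocks publishers when thresholds are crossed. A sketch of the relevant `rabbitmq.conf` settings (the values here are illustrative, not recommendations):

```ini
# Block publishers when free disk space drops below this absolute limit.
disk_free_limit.absolute = 5GB

# Block publishers when memory use exceeds this fraction of available RAM.
vm_memory_high_watermark.relative = 0.4
```

These alarms buy you time, but a blocked publisher is still an outage, which is why draining the queue to disk can be the faster way out.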

Failing consumer

A well known problem is when a perfectly valid message is not processed by the consumer. This can happen when a dependent service has a temporary outage, network issues or other problems causing momentary unavailability. Retry logic will eventually exceed its limits, and the messages get rejected and put into the dead letter queue by RabbitMQ, or requeued, in which case the queue just keeps on growing.

To avoid exhausting RabbitMQ’s resources, we can temporarily attach a RabbitIO consumer to the troubled queue and write the raw messages to disk, effectively sharing the message load with the intended consumer, which typically would e.g. be storing data. Once the dependent service has recovered and the consumer has drained the queue in RabbitMQ, we can publish the messages stored on disk back to the impacted queue. No further investigation needed.

Design Goals of RabbitIO

Content separation

In tools similar to RabbitIO, we often see the message combined with its metadata in a single JSON file. What interests us is the message itself, but it is hidden inside JSON that needs parsing, and possibly encoded in something like base64, which complicates the investigation. Parsing JSON is also overhead we do not want by default. With RabbitIO we keep a message’s headers out of the way of its body by storing the headers in the pax records of the tarball. This keeps the data clean during investigations.

Lightweight

Other tools can mean long wait times for downloads or updates, which is not acceptable when you as an on-call engineer want to start investigating as fast as possible. RabbitIO is kept a small-footprint, single-binary command line tool.

Simplicity

A quick command line tool with simple configuration: it does not require an advanced configuration file, but still provides the powerful options needed for your investigations.

What is next for RabbitIO?

We are pretty happy with how RabbitIO works right now. At least it helps us, so we hope it will help others too.

There are always things that can be improved. Here are some of them:

  • Better control of the message publishing speed.
  • Adding or overwriting headers when publishing messages.
  • Filtering rules when publishing or consuming messages, based on header data or content in the body.
  • Additional storage support e.g: to AWS S3.
  • Web UI to simplify the process of consuming and publishing to different storages.

We hope that you will find RabbitIO useful. If you want to try it out, head over to the RabbitIO GitHub repo and follow the installation instructions. Also, if you have any questions or comments about how we use it for our investigations, please ask them in the comment section below.
