How we upgraded an old, 3PB large, Elasticsearch cluster without downtime. Part 4 - Tokenization and normalization for high recall in all languages

This is part 4 in our series on how we upgraded our Elasticsearch cluster without any downtime and with minimal user impact.

In part 2 we explained that we decided to do a full reindexing of our entire dataset as part of this Elasticsearch upgrade project. This blog post explains some of the changes we made to our documents during that re-indexing.

Challenging old truths

The decision to re-process all the data, combined with a mandate from the rest of the organization to actually improve things, gave us a unique opportunity to make very low-level changes to how our data was indexed.

The best thing about this freedom was that it allowed us to go beyond the constraints of our current system and address issues that would be too costly to fix at any other time.

The worst thing about this freedom was also that it allowed us to go beyond the constraints of our current system, because we had to decide what to change and what to keep. We had to walk a thin line between adding long-awaited, valuable improvements to our system and the added risk to the project if we changed too many things at the same time.

Our search engine is not just an internal tool, but something that affects the whole organization. Most importantly, we also have tens of thousands of paying customers who in turn have over 800,000 saved searches in our system, which they expect to “just keep working” before, during, and after the upgrade.

This requirement, to “be backward compatible”, stood in direct opposition to our other desire: to improve how we analyze, tokenize, and normalize our content.

Tokenization in Elasticsearch

Tokenization is a fundamental part of how Elasticsearch works behind the scenes, and for us at Meltwater it is important to be in full control of how it works. We need to be able to explain to our users why a certain document did or did not match a certain query.

To briefly explain what tokenization is we can divide the explanation into two parts:

Tokenization is the process of turning texts and queries into tokens
Elasticsearch only does low-level exact matching of tokens

When you add a document to Elasticsearch, an inverted index is created. In short, that is comparable to a table where each token in the document is the identifier and the document id is the value. If you want a simple mental model you can say that each word corresponds to its own token.

For example, these documents:

Id 1: “meltwater use elasticsearch”
Id 2: “elasticsearch is used for search”
Id 3: “search is fast”

would produce an inverted index looking like this:

meltwater	[1]
use	[1]
elasticsearch	[1,2]
is	[2,3]
used	[2]
for	[2]
search	[2,3]
fast	[3]

Searching for elasticsearch corresponds to a lookup in that table for the token elasticsearch, which will then return [1,2] pointing to documents 1 and 2. Searching for elastic would not give any hits, because that exact token does not exist in the table. Neither would a query for use match the document containing the used token since those tokens are different.
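
As a rough illustration of the table above, here is a minimal Python sketch of an inverted index built with plain whitespace splitting. The function names are made up for this example, and real Elasticsearch (Lucene) indexes are far more elaborate; the point is only that a search is an exact lookup of a token.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each whitespace-separated token to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

def search(index, token):
    """Exact-match lookup: returns an empty list when the token is not in the table."""
    return index.get(token, [])

docs = {
    1: "meltwater use elasticsearch",
    2: "elasticsearch is used for search",
    3: "search is fast",
}
index = build_inverted_index(docs)

print(search(index, "elasticsearch"))  # [1, 2]
print(search(index, "elastic"))        # [] (no such token in the table)
print(search(index, "use"))            # [1] ("used" is a different token)
```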

Tokenization is part of a process called analysis that Elasticsearch performs on a text to create tokens that fit the use case. A tokenizer splits the text into tokens, and a number of filters can modify the text and the tokens in various ways.

A simple tokenizer would split the text on whitespace.
Examples of filters include adding synonyms, stemming or lemmatization (reducing each word to its root form), and changing or removing certain characters entirely.
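
To make the order of these steps concrete, here is a toy analysis chain in Python: character filters run on the raw text, a tokenizer splits it, and token filters then modify the tokens. Everything in it (the function names, the ampersand replacement, the tiny lemma table) is invented for illustration and is not how Elasticsearch or our analyzers are implemented.

```python
# Hypothetical, tiny lemma table; a real lemmatizer uses a full dictionary.
LEMMAS = {"used": "use", "uses": "use"}

def replace_ampersand(text):
    """Character filter: rewrites raw text before it is tokenized."""
    return text.replace("&", " and ")

def whitespace_tokenizer(text):
    """Tokenizer: splits the text on whitespace."""
    return text.split()

def lowercase(tokens):
    """Token filter: lowercases every token."""
    return [token.lower() for token in tokens]

def lemmatize(tokens):
    """Token filter: replaces known word forms with their root form."""
    return [LEMMAS.get(token, token) for token in tokens]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run character filters, then the tokenizer, then token filters, in that order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze(
    "Elasticsearch is USED for R&D",
    char_filters=[replace_ampersand],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase, lemmatize],
)
print(tokens)  # ['elasticsearch', 'is', 'use', 'for', 'r', 'and', 'd']
```

With the lemma filter in place, a query for use would now match a document containing used, provided the query goes through the same chain.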

At Meltwater we want to provide our customers with the ability to make very exact queries, and this must work in all languages. Our AI language detection supports 241 languages, but even languages outside of that set are frequently added to our cluster. To achieve this we use a custom ICU tokenizer that creates many more tokens than the standard one. For example, the default tokenizer removes all punctuation and currency symbols, whereas we want to keep them as separate tokens so that they can be queried.

It is worth noting that once a document is tokenized and stored on disk, the tokenization is not easy to change. Any change requires a full re-indexing of all data, and with our huge number of documents this can take months!

During the upgrade project, we wanted to be sure that the modifications we made did not have any unexpected side effects. To achieve this, we first set up large automated property-based tests to compare the old and new implementations. As soon as we had our two parallel clusters, we could start replaying actual live customer queries in the new setup to make sure all our changes were aligned with our intentions.
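
The difference between dropping symbols and keeping them as tokens can be sketched with two regular expressions in Python. This is only a stand-in to show the idea; it is not the ICU tokenizer, and the function names are invented for the example.

```python
import re

# A run of word characters, or a single character that is neither a word
# character nor whitespace (punctuation, currency symbols and so on).
KEEP_SYMBOLS = re.compile(r"\w+|[^\w\s]")
DROP_SYMBOLS = re.compile(r"\w+")

def tokenize_drop_symbols(text):
    """Keep only runs of word characters; punctuation and symbols disappear."""
    return DROP_SYMBOLS.findall(text)

def tokenize_keep_symbols(text):
    """Keep words and emit every punctuation or symbol character as its own token."""
    return KEEP_SYMBOLS.findall(text)

text = "Shares rose 5% to $100!"
print(tokenize_drop_symbols(text))  # ['Shares', 'rose', '5', 'to', '100']
print(tokenize_keep_symbols(text))  # ['Shares', 'rose', '5', '%', 'to', '$', '100', '!']
```

Only the second variant makes it possible to query for $ or % as tokens in their own right.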
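
A property-based comparison of this kind can be sketched with nothing but the Python standard library. The two analyzers below are simplified stand-ins, not our real implementations, and the names, alphabet, and sample count are assumptions made for the example. The property being checked is: for any text that does not contain the characters the new analyzer normalizes, old and new must produce identical tokens.

```python
import random

# Small illustrative subset of characters that only the new analyzer rewrites.
NORMALIZED_VARIANTS = "’´`"

def old_analyze(text):
    """Stand-in for the old analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def new_analyze(text):
    """Stand-in for the new analyzer: same as old, plus apostrophe folding."""
    for variant in NORMALIZED_VARIANTS:
        text = text.replace(variant, "'")
    return text.lower().split()

def random_text(rng, alphabet, max_len=30):
    """Generate a random string of up to max_len characters from the alphabet."""
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))

def find_counterexample(samples=1000, seed=0):
    """Return a text on which old and new disagree, or None if none is found."""
    rng = random.Random(seed)
    # The alphabet deliberately excludes the characters in NORMALIZED_VARIANTS.
    alphabet = "abcXYZ 019é猫$'"
    for _ in range(samples):
        text = random_text(rng, alphabet)
        if old_analyze(text) != new_analyze(text):
            return text
    return None

print(find_counterexample())  # None
```

A dedicated property-based testing library would add smarter input generation and shrinking of counterexamples; the loop above only shows the shape of the check.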

Character normalization

Since we designed our old Elasticsearch cluster back in 2013, both Meltwater and the world have changed significantly. Not only has our dataset grown much larger and come to include many more languages, but the texts also contain a much wider variety of characters. One reason for this is that we started indexing more social data, with shorter texts and more “creative” ways of expression.

Our customers have two conflicting demands on our search solution: they want to get EVERYTHING that matches their query (perfect recall), but they also want ONLY the documents that match their query (perfect precision).

To achieve this, some customers have to write extremely large and precise queries to be sure to get all the matches regardless of language, character set, word ending, misspelling, and way of writing.

One area where we believed we could do a better job of helping our users was to improve character normalization. By introducing normalizers into the indexing pipeline we could gain control of this crucial part and hopefully offer a higher recall while still maintaining a high precision for the parts that our users care the most about.

We had to be careful, however, because normalizing a character effectively removes the ability to specifically search for (or exclude) that character. Since we also must handle all languages in the world, it becomes virtually impossible to know the meaning of a certain character in a certain language. For example, as a Swede, normalizing é to e to cover misspellings seems like a good idea, while a French person will most definitely disagree!

Normalizing apostrophes

So we gathered very detailed requirements from our end users, our support personnel, and our sales department. Then we came up with a list of specific known problems that we wanted to address. One of the most common gripes was apostrophes. For example:

John’s
and
John´s

are not considered the same term since the apostrophe characters are different, and thus a search for one of them will not match the other. This caused a lot of headaches and missed matches for our users. Some tried to counter it with wildcards, but that made the searches slow and unpredictable.

One solution would of course be to remove all apostrophes completely, both in the index and in the queries, but that would cause other problems. Our customers want to be able to search specifically for a word with apostrophes at times; John’s and Johns are different terms and the difference is sometimes important, especially for brand names.

So instead we identified several different apostrophes found in our documents

[ ʹ ʻ ʼ ʽ ʾ ʿ ΄ ᾿ Ꞌ ꞌ ' ‘ ’ ‛ ′ ´ ` ｀ ＇]

and set a mapping character filter to normalize them all into ‘.
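
The effect of such a mapping can be sketched in Python with str.translate. The target character here is assumed to be the plain ASCII apostrophe (U+0027); the dictionary at the end mirrors the "key => value" rule syntax of Elasticsearch's mapping character filter, but it is an illustration built from the list above, not a copy of our production settings.

```python
# The apostrophe variants listed above, and an assumed normalization target.
APOSTROPHE_VARIANTS = "ʹʻʼʽʾʿ΄᾿Ꞌꞌ'‘’‛′´`｀＇"
TARGET = "'"  # assumption: plain ASCII apostrophe (U+0027)

# Translation table from each variant's code point to the target character.
_TABLE = {ord(ch): TARGET for ch in APOSTROPHE_VARIANTS}

def normalize_apostrophes(text):
    """Replace every known apostrophe variant with the single target character."""
    return text.translate(_TABLE)

# Illustrative settings fragment in the style of a mapping character filter.
char_filter_config = {
    "type": "mapping",
    "mappings": [f"{ch} => {TARGET}" for ch in APOSTROPHE_VARIANTS if ch != TARGET],
}

print(normalize_apostrophes("John’s") == normalize_apostrophes("John´s"))  # True
print(normalize_apostrophes("Johns"))  # Johns (still a different term)
```

Because the same normalization is applied to both documents and queries, either spelling of John’s finds the other, while Johns without an apostrophe stays distinct.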

Normalizing full-width characters

Another improvement we made was to normalize full-width Latin characters. Those characters are a feature of Unicode specifically designed to provide versions of the characters that are approximately square-shaped, to blend in better with characters in other languages, such as Chinese, Japanese, or Korean.

That meant that our customers might have had to search for

Meltwater OR Ｍｅｌｔｗａｔｅｒ

to be sure to get a complete recall.

The opposite also exists for certain Japanese characters, which are made narrower to fit in with Latin script; these are called half-width characters.

With another character filter set to normalize these two specific ranges of Unicode characters, users can now search without having to add the extra OR clause.
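
One way to sketch this folding in Python is to apply Unicode NFKC normalization only to characters in the Halfwidth and Fullwidth Forms block (U+FF00 to U+FFEF), leaving all other text untouched. Using NFKC restricted to that block is an assumption made for this example; our actual character filter may map the ranges differently.

```python
import re
import unicodedata

# Runs of characters from the Unicode "Halfwidth and Fullwidth Forms" block.
WIDTH_FORMS = re.compile(r"[\uFF00-\uFFEF]+")

def fold_width(text):
    """Fold full-width Latin and half-width Katakana to their standard forms.

    Only characters inside the width-forms block are normalized, so the rest
    of the text is not affected by NFKC's many other compatibility mappings.
    """
    return WIDTH_FORMS.sub(
        lambda match: unicodedata.normalize("NFKC", match.group()), text
    )

full_width = "\uFF2D\uFF45\uFF4C\uFF54\uFF57\uFF41\uFF54\uFF45\uFF52"  # Ｍｅｌｔｗａｔｅｒ
print(fold_width(full_width))           # Meltwater
print(fold_width("\uFF76") == "\u30AB")  # True: half-width ｶ becomes カ
```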

Our users loved both of these fixes! They could now significantly reduce the length of their queries and the time it took to construct them. Many of the queries were also faster to execute.

In fact, these two changes alone were enough to create excitement for the upgrade across our sales and support organizations.

CJK script characters

Encouraged by the normalization success, we took on another issue: how we handle languages based on logograms, in particular CJK (a collective term for the Chinese, Japanese, and Korean languages).

First, we need to explain what the term logogram means in the context of languages. Logographic simply means a form of writing where each character represents a word or a meaningful part of a word, in contrast to phonographic, where each character represents a sound. English, and other languages based on Latin characters, are phonographic, whereas CJK text relies heavily on logographic characters (Korean Hangul being a phonographic exception).

When it comes to tokenization, the task of turning text into searchable tokens, phonographic languages have an advantage. Words are most often easy to identify due to word separators, such as spaces. For CJK on the other hand, any character could be a word, or any combination of characters could also be a word.

For example, in Chinese 貓頭鷹 means owl, but the individual characters mean cat - head - eagle.

Our old algorithm for CJK indexed each character separately, so a Chinese query for cat would also hit documents about owls, since they contain that token. For a query to return only documents actually related to cats, you had to add several excludes to remove the irrelevant matches.

We saw a possibility to simplify this, so we tried an algorithm that uses a comprehensive dictionary to make sure that character combinations forming valid words were indexed as such.
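
The difference between the two approaches can be sketched in Python. The dictionary-based variant below uses greedy forward maximum matching against a two-entry toy dictionary; both the algorithm choice and the dictionary are assumptions for illustration, not the actual algorithm or word list we evaluated.

```python
def unigram_tokens(text):
    """Old approach: every non-whitespace character becomes its own token."""
    return [ch for ch in text if not ch.isspace()]

def dictionary_tokens(text, dictionary):
    """Greedy forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character when nothing matches."""
    max_len = max(map(len, dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Hypothetical toy dictionary: "owl" and "cat".
DICTIONARY = {"貓頭鷹", "貓"}

print(unigram_tokens("貓頭鷹"))                 # ['貓', '頭', '鷹']
print(dictionary_tokens("貓頭鷹", DICTIONARY))  # ['貓頭鷹']
```

With unigrams, a query for 貓 (cat) matches the owl document; with dictionary tokens it does not, which gives higher precision at the cost of recall.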

However, after presenting this solution and testing it on real data, it turned out that our customers preferred the old way of handling CJK text. They had become experts at writing that particular type of exclude, and the higher recall of our old implementation turned out to be worth more to them than the ability to write shorter queries. In this case they preferred that we did not change anything.

Learnings after tokenization and normalization changes

We learned a few important lessons from these experiences:

To meet our customers’ demands in a predictable and explainable way, it is important to have a complete understanding of the full analysis and tokenization process.
Release early and include customer feedback in your process as often as possible.
Sometimes the most technically elegant solution might not suit the end users’ needs.
Lots of automated tests are great to guarantee backward compatibility.

Being a backend-oriented team, one challenge we have is that our day job does not include much interaction with actual end users. Fortunately, we always have a secret weapon: our global customer support team, who could help us test and provide invaluable insight. Without their help we could have made more mistakes like the one above during the course of the project.

This concludes the fourth part of our blog post series on how we upgraded our Elasticsearch cluster. Stay tuned for the next post that will be published sometime next week.

To keep up-to-date, please follow us on Twitter or Instagram.

Up next: Running two Elasticsearch clients in the same JVM

Previous blog posts in this series: