My team and I have built a solution that mines a stream of online articles for real-time insights for our customers. This component’s logic could be dramatically simplified if we could assume that it never receives near-duplicates of articles. While deduplication of identical documents is simple, detection of near-duplicates (i.e. “same thing, just slightly different”) is a complex but well-researched problem space.
To solve our problem, we ended up building a locality-sensitive hashing library for Elixir. Read on to find out why and how we built and open-sourced ExLSH.