Recently I had an interesting problem to solve. Given a set of documents and the key phrases found in each document, the task is to display the key phrases on a map so that related phrases appear close to each other while unrelated ones are placed further apart. This should result in a word cloud in which words are somewhat semantically clustered.
Furthermore, relatedness must be determined without any external dictionaries; the distance between phrases must be derived from the documents alone.
Semantic Clustering Algorithm
The algorithm I came up with is the following:
- Take the N most frequent key phrases found in the whole document set.
- For every pair of phrases count the number of documents in which they co-occur. The assumption is that related concepts tend to be in the same document more often than unrelated ones.
- Let the distance of a pair of phrases be the reciprocal of their co-occurrence count where this count is non-zero, and a large constant where it is zero. Then run an all-pairs shortest path algorithm. The result is a fully filled-in distance matrix in which the distance between two phrases represents their relatedness; this distance can be small even if the two phrases never co-occur in any document, as long as some closely related concepts bridge them.
- Try to find a placement in 2D in which the Euclidean distances are as close to the values of the distance matrix as possible. For this I used a multidimensional scaling algorithm called stress majorization, but other methods could work equally well.
- Placing the key phrase labels at the positions obtained in the previous step would cause many labels to overlap, depending on the number of phrases, the available screen space and the label font size. So kick the labels around a bit so that they move away from crowded areas: I simply pick a pair of overlapping labels at random, move one of them a little in the direction that decreases the overlap, and repeat until there are no overlaps or time is up.
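The steps above can be sketched in Python roughly like this. This is a minimal illustration, not the original implementation: the function names, the "big number" constant, and the rectangular label-box geometry are my own assumptions.

```python
import numpy as np

def cooccurrence_distances(docs, phrases, big=1000.0):
    """Initial distance matrix: 1 / co-occurrence count where the count
    is non-zero, a large constant otherwise; the diagonal is zero.
    `docs` is a list of sets of phrases, one set per document."""
    n = len(phrases)
    index = {p: i for i, p in enumerate(phrases)}
    counts = np.zeros((n, n))
    for doc in docs:
        present = [index[p] for p in doc if p in index]
        for a in present:
            for b in present:
                if a != b:
                    counts[a, b] += 1
    dist = np.where(counts > 0, 1.0 / np.maximum(counts, 1), big)
    np.fill_diagonal(dist, 0.0)
    return dist

def floyd_warshall(dist):
    """All-pairs shortest paths, so phrases that never co-occur can still
    end up close to each other via bridging concepts."""
    d = dist.copy()
    for k in range(len(d)):
        d = np.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    return d

def stress_majorization(d, iters=200, seed=0):
    """SMACOF-style stress majorization with unit weights: find 2-D points
    whose Euclidean distances approximate the target matrix `d`."""
    rng = np.random.default_rng(seed)
    n = len(d)
    x = rng.standard_normal((n, 2))
    for _ in range(iters):
        diff = x[:, None, :] - x[None, :, :]
        norms = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(norms, 1.0)      # avoid division by zero
        b = -(d / norms)
        np.fill_diagonal(b, 0.0)
        np.fill_diagonal(b, -b.sum(axis=1))
        x = b @ x / n                     # Guttman transform, weights = 1
    return x

def spread_labels(pos, width, height, step=0.05, max_iters=500, seed=0):
    """Pick a random overlapping pair of label boxes (all assumed to be
    `width` x `height`) and nudge one away; repeat until no overlaps
    remain or the iteration budget is spent."""
    rng = np.random.default_rng(seed)
    p = pos.copy()
    for _ in range(max_iters):
        overlaps = [(i, j)
                    for i in range(len(p)) for j in range(i + 1, len(p))
                    if abs(p[i, 0] - p[j, 0]) < width
                    and abs(p[i, 1] - p[j, 1]) < height]
        if not overlaps:
            break
        i, j = overlaps[rng.integers(len(overlaps))]
        d = p[j] - p[i]
        norm = np.linalg.norm(d)
        d = d / norm if norm > 1e-9 else rng.standard_normal(2)
        p[j] += step * d                  # push one label off the other
    return p
```

A production version would probably swap the pure-Python loops for `scipy.sparse.csgraph.floyd_warshall` and `sklearn.manifold.MDS`, but the sketch keeps every step explicit.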
For testing I used the FAO-780 data set, which is a standard test corpus for keyword extraction. This is the map I got when I displayed the top 45 phrases; colours represent the frequency of each phrase in the whole document set.
The algorithm seems to do a good job: concepts are nicely clustered; for example, there is an island for fishery well separated from forestry. It's also interesting to note how much the stupid computer seems to know about various aspects of real life: that women are in the middle of everything, that diffusion of information is closely related to international cooperation, and the sad fact that evaluation is unrelated to everything else. It's surprising, though, that one of the most frequent phrases in these documents dealing with food and agriculture is women. Agriculture has obviously changed a lot since I was at school; back then it was more about cows and manure and rain.
This was a fun task to work on, the kind I was hoping for when joining Meltwater.
(Balázs is working in Meltwater’s Enrichment team in Budapest, where he is focusing on tasks related to Natural Language Processing.)