Subscribe to our Newsletter

Fast clustering algorithms for massive datasets

Here we discuss two potential algorithms that can perform clustering extremely fast, on big data sets, as well as the graphical representation of such complex clustering structures. By extremely fast, we mean a computational complexity of order O(n) and even faster such as O(n/log n). This is much faster than good Hierarchical Agglomerative Clustering which are typically O(n^2 log n). By big data, we mean several millions, possibly a billion observations.

Potential applications:

  • Creating a keyword taxonomy to categorize the entire universe of cleaned (standardized), valuable English keywords. We are talking of about 10 million keywords made up of one, two or three tokens, that is, about 300 times the number of keywords found in a good English dictionary. The purpose might be to categorize all bid keywords that could be purchased by eBay and Amazon on Google (for pay-per-click ad campaigns), to better price them. This is the application discussed in this article.
  • Clustering millions of documents (e.g. books on or
  • Clustering web pages, or even the entire Internet, which consist of about 100 million top websites - and billions of web pages.

We also discuss whether it makes sense to perform such massive clustering, and how Map Reduce can help.

A. How to build a keyword taxonomy?

Here's the answer, from my earlier article What MapReduce can't do. Step 2 is the clustering part.

Step #1: pre-processing

You gather tons of keywords over the Internet with a web crawler (crawling Wikipedia or DMOZ directories), and compute the frequencies for each keyword, and for each "keyword pair". A "keyword pair" is two keywords found on a same web page, or close to each other on a same web page. Also by keyword, I mean stuff like "California insurance", so a keyword usually contains more than one token, but rarely more than three. With all the frequencies, you can create a table (typically containing many million keywords, even after keyword cleaning), where each entry is a pair of keywords and 3 numbers, e.g.

A="California insurance", B="home insurance", x=543, y=998, z=11


  • x is the number of occurrences of keyword A in all the web pages that you crawled
  • y is the number of occurrences of keyword B in all the web pages that you crawled
  • z is the number of occurences where A and B form a pair (e.g. they are found on a same page)

This "keyword pair" table can indeed be very easily and efficiently built using MapReduce. Note that the vast majority of keywords A and B do not form a "keyword pair", in other words, z=0. So by ignoring these null entries, your "keyword pair" table is still manageable, and might contain as little as 100 million entries.

Note: This step #1 constitutes the final step of a number of interesting applications.  For instance, it is used in search engine technology to identify or recommend keywords related to some other keywords. See an application here. This keyword API will be available (with source code and data) to students participating in our data science apprenticeship and (upon request) to bloggers posting good quality content on our network. Interestingly, a similar app was licensed to search engines (by for $500,000 a year, a few years ago.

Step #2: clustering

To create a taxonomy, you want to put these keywords into similar clusters. One way to do it is to compute a dissimilarity d(A,B) between two keywords A, B. For instances d(A, B) = z / SQRT(x * y), although other choices are possible. Note that the denominator prevents extremely popular keywords (e.g. "free") from being close to all the keywords, and from dominating the entire keyword relationship structure: indeed, it favors better keyword bonds, such as "lemon" with "law" or "pie", rather than "lemon" with "free".

The higher d(A, B), the closer keywords A and B are to each other. Now the big problem is to perform clustering - any kind of clustering, e.g. hierarchical - on the "keyword pair" table, using any kind of dissimilarity. We now discuss our solution, and a potential alternate solution.

B. Fast clustering algorithm

While this algorithm is described in the context of keyword clustering, it is straightforward to adapt it to other contexts. Here we assume that we have n = 10,000,000 unique keywords and m = 100,000,000 keyword pairs {A, B}, where d(A,B)>0. That is, an average of r=10 related keywords attached to each keyword.

Our algorithm incrementally proceeds in several (5 or 6) rounds, as follows:


Initialization (Round #0): The small data (or seeding) step

Select 10,000 seed keywords, create (say) 100 categories and create a hash table $hash where key is one of the 10,000 seed keywords, and value is a list of categories the keyword is assigned to.

For instance, $hash{"cheap car insurance"} = {"automotive","finance"}

The choice of the initial 10,000 seed keywords is very important. Until more research is done on this subject, I suggest to pick out the top 10,000 keywords, in terms of number of associations: that is, keywords A with many B's where d(A,B)>0. This will speed up the convergence of the algorithm.

Round #1: The big data step

Browse the table of m keyword pairs, from beginning to end.

When you find a pair {A, B} where (say) $hash{A} exists and $hash{B} does not, do:

  • $hash{B} = $hash{A}; 
  • $weight{B} = d(A,B)

When you find a pair {A, B} where both A and B are already in $hash, do

  • if $d(A,B) > $weight(B) then { $hash{B} = $hash{A}; $weight{B} = $d(A, B); } # B gets re-categorized to A's category
  • if $d(A,B) > $weight(A) then { $hash{A} = $hash{B}; $weight{A} = $d(A, B); } # A gets re-categorized to B's category

Round #2: Repeat Round #1 ($hash and $weight are kept in memory and keep growing at each subsequent round)

Round #3: Repeat Round #1, one more time

Round #4: Repeat Round #1, one more time

Round #5: Repeat Round #1, one more time


The computational complexity is q * m = O(n), with q being the number of rounds. This is n=10,000,000 times faster than good clustering algorithms. However, all these hash table accesses will slow it a bit to O(n log n), as $hash and $weight grow bigger at each subsequent round.

Would pre-sorting the big table of m pairs help? (sorting by d(A, B)) It would allow us to drastically reduce the number of hash table accesses (by making all the re-categorizations not needed anymore), but sorting is O(n log n), so we would not gain anything. Note that sorting can be efficiently performed with Map Reduce. The reduce step in this case, consists of merging a bunch of small, sorted tables.

This clustering algorithm seems (at first glance) easy to implement using Map Reduce, however since the big table only has 100,000,000 entries, it might fit in RAM.

You can improve computational complexity by keeping the most important m/log n entries (based on volume and d(A,B)) in the big table, and deleting the remaining entries. In practice, deleting 65% of the big table (the very long tail only, but not the long tail, from a keyword distribution point of view) will have very little impact on the performance: you will have a large bucket of un-categorized keywords, but in terms of volume, these keywords might represent less than 0.1%.


  • Alternate algorithm: One could use Tarjan's strongly connected components algorithm to perform the clustering. To proceed, you first bin the distances: d(A, B) is set to 1 if it is above some pre-specified threshold, 0 otherwise. This is a graph theory algorithm: each keyword represents a node, each pair of keywords where d(A, B) = 1, represents an edge. The computational complexity of the algorithm is O(n+m), where n is the number of keywords and m is the number of pairs (edges). To take advantage of this algorithm, you might want to store the big "keyword pair" table in a graph database (a type of NoSQL database).
  • Visualization. How do you represent these keywords, with their cluster structure determined by d(A, B), in a nice graph? 10 million keywords would fit in a 3,000 x 3,000 pixels image. For those interested in graphical representations, see the Fruchterman and Rheingold algorithm, extensively used to produce graphs similar to the one below. Note that its computational complexity is O(n^3) though, so we need to very significantly improve it for this keyword clustering application - including the graphical representation. The graphical representation could be a raster image with millions of pixels, like a heat map where color represents category and, when you point to a pixel, a keyword value shows up (rather than a vector image with dozens of nodes, see graph below). Neighboring pixels would represent strongly related keywords.
  • Do we really need this clustering algorithm? Most of the time, we are trying to answer a simpler question (e.g. which keywords are related to keyword A) or we already have a taxonomy and we want to extend or improve it. In this case, full clustering is not necessary. But it's nice to know that efficient, simple algorithms exist, if we need them.
  • Question: is this stuff statistical science, computer science or data science?
  • Data sets. I will share my own big table of keyword pairs, including the d(A, B) values, with students attending our data science apprenticeship, but I plan to make it available here as well. DMOZ data (1 million categorized URLs) can be downloaded for free (click here to get it) and it is a good starting point to extract millions of keywords (and categories!) - either just from the dataset itself, or by crawling all these URLs and related links, using distributed architecture. also offers a list of top 1 million domain names, for free (click here to download it). Finally, a good source of keyword data is query logs from search engines. These query logs are particularly useful for step #1.

Full-size image (59 K)

Related articles

Views: 82977

Comment by Oleg Okun on March 2, 2013 at 6:11am

Very interesting, Vincent. I didn't see where you set the number of clusters, though. Also algorithm description doesn't specify how cluster centroids are updated, given that you operate with text data (keywords). Could you elaborate these details?

Comment by Vincent Granville on March 5, 2013 at 11:14am

@Oleg: There is no centroid involved. The number of clusters is pre-determined by number of categories, usually between 20 and 200: for instance 20 categories and 200 sub-categories.

Interestingly, the end result is fuzzy clusters: one keyword can belong to multiple categories, with a weight indicating the strength of the association between keyword and category.

Comment by Vincent Granville on March 16, 2013 at 8:28am

Also, this technology can be used to create a product taxonomy.

Comment by BDV_Works on August 16, 2013 at 6:58pm

Both server-side preprocessing or dynamic clustering solutions can be used to map out millions of records on the fly, pls try out our latest samples on our site:


You need to be a member of BigDataNews to add comments!

Join BigDataNews

Sponsored By

On Data Science Central

© 2021   TechTarget, Inc.   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service