# BigDataNews

A Data Science Central Community

Subscribe to our Newsletter

# Clustering idea for very large datasets

Let's say you have to cluster 10 million points, for instance keywords. You have a dissimilarity function, available as a text file with 100,000,000 entries, each entry consisting of three data points:

Keyword A, Keyword B, distance between A and B denoted as d(A,B)

So, in short, you can perform k-NN (k-nearest neighbors) clustering or some other types of clustering, which typically is O(n^2) or worse, from a computational complexity point of view.

Has anyone ever used a clustering method based on sampling?

The idea is to start by sampling 1% (or less) of the 100,000,000 entries, and perform clustering on these pairs of keywords, to create a "seed" or "baseline" cluster structure.

The next step is to browse sequentially your 10,000,000 keywords, and for each keyword, find the closest cluster from the baseline cluster structure. If there are 1,000 base clusters, you'll have to perform 10 billion (10,000,000 x 1,000) computations (hash table accesses), a rather large number, but it can easily be done with a distributed architecture (Hadoop).

You could indeed repeat the whole procedure 5 times (with 5 different samples), and blend the 5 different final cluster structures that you obtain. The final output might be a table with 30,000,000 entries, where each entry consist of (1) is a pair of keywords A and B, and (2) the number of times (if >0) when A and B belong to a same cluster (based on the 5 iterations corresponding to 5 different samples). The final cluster detection algorithm consist in extracting the connected components from these 30,000,000 entries: these entries, from a graph theory point of view, are called ridges, joining two nodes A and B (here a node is a keyword).

What are the conditions for such an algorithm to work? You can assume that each keyword A has on average between 2 and 50 neighboring keywords B such that d(A,B) > 0. The number of such neighbors is usually much closer to 2, than to 50. So we might end up, after classifying all keywords, with 10 or 20% of all keywords left isolated (forming single-point clusters).

How do you address this issue, without using too much computer power? Of course if you work with 50 rather than 5 samples, or samples that represents 25% of the data rather than 1%, you'll solve the problem, but this is a time-prohibitive approach that very poorly and inefficiently leverage statistical sampling. How to do better?

Related article

Views: 970

Comment