A Data Science Central Community

Let's say you have to cluster 10 million points, for instance keywords. You have a dissimilarity function, available as a text file with 100,000,000 entries, each entry consisting of three data points:

*Keyword A, Keyword B, distance between A and B denoted as d(A,B)*

So, in short, you can perform k-NN (k-nearest neighbors) clustering or some other types of clustering, which typically is O(n^2) or worse, from a computational complexity point of view.

**Has anyone ever used a clustering method based on sampling?**

The idea is to start by sampling 1% (or less) of the 100,000,000 entries, and perform clustering on these pairs of keywords, to create a "seed" or "baseline" cluster structure.

The next step is to browse sequentially your 10,000,000 keywords, and for each keyword, find the closest cluster from the baseline cluster structure. If there are 1,000 base clusters, you'll have to perform 10 billion (10,000,000 x 1,000) computations (hash table accesses), a rather large number, but it can easily be done with a distributed architecture (Hadoop).

You could indeed repeat the whole procedure 5 times (with 5 different samples), and blend the 5 different final cluster structures that you obtain. The final output might be a table with 30,000,000 entries, where each entry consist of (1) is a pair of keywords A and B, and (2) the number of times (if >0) when A and B belong to a same cluster (based on the 5 iterations corresponding to 5 different samples). The final cluster detection algorithm consist in extracting the connected components from these 30,000,000 entries: these entries, from a graph theory point of view, are called *ridges*, joining two *nodes* A and B (here a *node* is a keyword).

What are the conditions for such an algorithm to work? You can assume that each keyword A has on average between 2 and 50 neighboring keywords B such that d(A,B) > 0. The number of such neighbors is usually much closer to 2, than to 50. So we might end up, after classifying all keywords, with 10 or 20% of all keywords left isolated (forming single-point clusters).

How do you address this issue, without using too much computer power? Of course if you work with 50 rather than 5 samples, or samples that represents 25% of the data rather than 1%, you'll solve the problem, but this is a time-prohibitive approach that very poorly and inefficiently leverage statistical sampling. How to do better?

**Related article**

© 2020 BigDataNews.com is a subsidiary of DataScienceCentral LLC and not affiliated with Systap Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Statistics -- New Foundations, Toolbox, and Machine Learning Recipes
- Book: Classification and Regression In a Weekend - With Python
- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of BigDataNews to add comments!

Join BigDataNews