A Data Science Central Community
This famous statement -- the six degrees of separation -- claims that there is at most 6 degrees of separation between you and anyone else on Earth. Here we feature a simple algorithm that simulates how we are connected, and indeed confirms the claim. We also explain how it applies to web crawlers: Any web page is connected to any other web page by a path of 6 links at most.
The algorithm below is rudimentary and can be used for simulation purposes by any programmer: It does not even use tree or graph structures. Applied to a population of 2,000,000 people, each having 20 friends, we show that there is a path involving 6 levels or intermediaries between you and anyone else. Note that the shortest path typically involves fewer levels, as some people have far more than 20 connections.
Starting with you, at level one, you have twenty friends or connections. These connections in turn have 20 friends, so at level two, you are connected to 400 people. At level three, you are connected to 7,985 people, which is a little less than 20 x 400, since some level-3 connections were already level-2 or level-1. And so on.
Note that in practice, people are connected through clusters of people, not randomly as in this model. Yet the simulator gives a pretty good idea of the process at play. In the above example, at level 5, you are already connected to 80% of all people.
Connections Simulator
The algorithm is described by this simple piece of Perl code:
$n=2000000;
$rand=0;
$npeople=1;
$new[0]=1;
for ($level=1; $level<7; $level++) {
$new[$level]=0;
for ($k=0; $k<$new[$level-1]; $k++) {
$nfriends=20;
for ($m=0; $m<$nfriends; $m++) {
$rand=(10232193*$rand + 3701101) % 54198451371;
$friend= $rand % $n;
if ($a[$friend] == 0) {
$a[$friend]++;
$new[$level]++;
}
}
}
$npeople+=$new[$level];
print "$level: $new[$level] - $npeople \n";
}
The array a stores the connections acquired at each level. The number of new connections at each level is equal to $new[$level], while the total reach (at each level) is $npeople. Note that depending on the parameters, you might not always be able to connect everyone to everyone. This is especially true if you allow people to only have two friends, rather than 20. Just like in the real world, if some clusters of people are totally disconnected from the rest of the world, you won't find a path between you and them.
The Need for a Good Random Number Generator
I have suspected for a long time that the function rand() in Perl provides a poor random generator. In this case, it fails miserably, being able to produce only 32,768 (that is, 2^15) different values, while we need 2,000,000. So I replaced it by a basic formula that fits the bill: This is what the computation of the variable $rand is for. Note that the operator % in the code, represents the modulo function.
Application to Web Crawlers
When crawling the web, the best strategy is to proceed by successive layers as we did in our simulation: You start with (say) 1,000,000 seed URLs (easy to find using DMOZ or Quantcast) or even just 500 seed URLs, and at each iteration or level, you follow new links found in webpages discovered at the previous level, making sure that you rarely if ever revisit web pages that were previously crawled. It can be done on a laptop with multiple web crawlers running in parallel, if you are interested in only capturing 99.9% of the visible Internet. We did it, and here is the size of the data gathered at each level (compressed, in bytes):
The distribution is somewhat similar to the one shown in the first table, with a drop in the last level, and a rapid build-up during the first few levels. If you play with the simulator for a while (changing the parameters), you will occasionally stumble upon very similar distributions.
It took several months to gather all these links, using a 2-second timeout for each webpage crawl, downloading only the first 16 KB of each page, using a highly distributed architecture, and accessing only a few webpages per domain. Also, our web crawler was able to get re-started and resume where it stopped, in case our Internet connection crashed. Finally, all links that crashed were given the chance to be re-visited one more time, weeks later.
Some Mathematics
Watts and Strogatz showed that the average path length between two nodes in a random network is equal to log N / log K, where N = total nodes and K = acquaintances per node. Thus if N = 300,000,000 (90% of the US population) and K = 30 then Degrees of Separation = 5.7 and if N = 6,000,000,000 and K = 30 then Degrees of Separation = 6.6.
For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on LinkedIn.
DSC Resources
Popular Articles
More on this problem in my article, https://www.datasciencecentral.com/profiles/blogs/no-hope-for-6-deg...
Comment
© 2019 BigDataNews.com is a subsidiary of DataScienceCentral LLC and not affiliated with Systap Powered by
Badges | Report an Issue | Privacy Policy | Terms of Service
You need to be a member of BigDataNews to add comments!
Join BigDataNews