A Data Science Central Community
To foster the study of the structure and dynamics of Web traffic networks, we make available a large dataset (the ‘Click Dataset’) of about 53.5 billion HTTP requests made by users at Indiana University. Gathering anonymized requests directly from the network, rather than relying on server logs and browser instrumentation, makes it possible to examine large volumes of traffic data while minimizing the biases associated with other data sources. It also captures the referrer information needed to reconstruct the subset of the Web graph actually traversed by users. The goal is to develop a better understanding of user behavior online and to create more realistic models of Web traffic. Potential applications of this data include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.
The data was generated by applying a Berkeley Packet Filter to a mirror of the traffic passing through the border router of Indiana University. This filter matched all traffic destined for TCP port 80. A long-running collection process used the pcap library to gather these packets, then applied a small set of regular expressions to their payloads to determine whether they contained HTTP GET requests. If a packet did contain a request, the collection system logged a record with the following fields:
Some important notes:
During collection, the system generated data at a rate of about 60 million requests per day, or about 30 GB/day of raw data. The data was collected between Sep 2006 and May 2010. Data is missing for about 275 days. The dataset has two collections:
The dataset is broken into hourly files. The initial line of each file has a set of flags that can be ignored. Each record looks like this:
XXXXADreferrer host path
XXXX is the timestamp (a 32-bit Unix epoch time in seconds, stored in little-endian byte order),
A is the user-agent flag (“B” for a browser or “?” for anything else, including bots),
D is the direction flag (“O” for external traffic to IU, “I” for traffic from inside IU to outside destinations),
referrer is the referrer hostname or URL (terminated by newline),
host is the target hostname (terminated by newline), and
path is the target path (terminated by newline). For further details, please refer to the paper below.
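The record layout described above can be read with a short Python sketch. This is only an illustration under stated assumptions, not the collection system's own code: the names ClickRecord and read_records are hypothetical, the timestamp is assumed to be unsigned, and text fields are decoded as Latin-1 so that arbitrary bytes never raise an error. Because the binary timestamp can itself contain a newline byte, the sketch reads the six fixed bytes first and only then reads the three newline-terminated fields.

```python
import io
import struct
from typing import BinaryIO, Iterator, NamedTuple


class ClickRecord(NamedTuple):
    """One request record (hypothetical container, not part of the dataset)."""
    timestamp: int   # Unix epoch seconds
    agent: str       # "B" for browser, "?" for other
    direction: str   # "O" or "I", as described in the text
    referrer: str
    host: str
    path: str


def read_records(stream: BinaryIO) -> Iterator[ClickRecord]:
    """Yield records from one hourly file opened in binary mode."""
    stream.readline()  # the initial line holds flags that can be ignored
    while True:
        header = stream.read(6)  # 4-byte timestamp + agent flag + direction flag
        if len(header) < 6:
            return  # end of file (or a truncated trailing record)
        # "<" selects little-endian with no padding; "I" is assumed unsigned.
        timestamp, agent, direction = struct.unpack("<Icc", header)
        # referrer, host, and path are each terminated by a newline
        referrer, host, path = (
            stream.readline().rstrip(b"\n").decode("latin-1") for _ in range(3)
        )
        yield ClickRecord(
            timestamp,
            agent.decode("latin-1"),
            direction.decode("latin-1"),
            referrer,
            host,
            path,
        )


if __name__ == "__main__":
    # Synthetic two-record example; these values are made up for illustration.
    sample = (
        b"flags\n"
        + struct.pack("<I", 1200000000) + b"BO"
        + b"example.org\nwww.example.com\n/index.html\n"
        + struct.pack("<I", 10) + b"?I"   # timestamp 10 encodes as b"\n\x00\x00\x00"
        + b"\nhost.test\n/\n"             # empty referrer
    )
    for record in read_records(io.BytesIO(sample)):
        print(record)
```

For a real hourly file, the same function would be called on a handle from open(filename, "rb"), where the file name is whatever the downloaded file is called. Reading the fixed-width header before calling readline is the key design choice: splitting the whole file on newlines would break any record whose timestamp bytes include 0x0A.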