We hope that the graph will be useful for researchers who
- develop search algorithms that rank results based on the hyperlinks between pages.
- develop spam detection methods which identify networks of web pages that are published in order to trick search engines.
- develop graph analysis algorithms and can use the hyperlink graph to test the scalability and performance of their tools.
- work in Web Science and want to analyze the linking patterns within specific topical domains in order to identify the social mechanisms that govern these domains.
1. Levels of Aggregation
We provide the hyperlink graph on four different levels of aggregation:
- Page-Level Graph - This version of the graph contains all details, with each node representing a single web page and each arc a hyperlink between two pages.
- Subdomain-Level Graph - This graph aggregates the page graph by subdomain. Each node in the graph represents a specific subdomain (like research.dws.uni-mannheim.de), and an arc exists if at least one hyperlink was found between pages that belong to a pair of subdomains. Note that subdomains can be of arbitrary depth.
- First-Level-Subdomain Graph - Each node represents a first-level subdomain (like dws.uni-mannheim.de), with all subdomains beneath it aggregated into this node.
- Pay-Level-Domain Graph - Each node represents a pay-level domain (like uni-mannheim.de). An arc exists if at least one hyperlink was found between pages contained in a pair of pay-level domains.
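The subdomain-level aggregation described above can be illustrated with a minimal Python sketch. It is not the tooling used to build the published graphs: the helper names `host_of` and `aggregate_by_host` are hypothetical, the URLs are made-up samples, and dropping links that stay inside one host is an assumption, since the text does not say how such links are treated. Aggregating to first-level subdomains or pay-level domains would additionally need a list of public suffixes, which is left out here.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the host name of a URL, e.g. 'research.dws.uni-mannheim.de'."""
    return urlsplit(url).hostname or ""


def aggregate_by_host(page_arcs):
    """Collapse page-level arcs (source URL, target URL) into host-level arcs."""
    host_arcs = set()
    for src_url, dst_url in page_arcs:
        src, dst = host_of(src_url), host_of(dst_url)
        # Assumption: links within a single host are dropped.
        if src and dst and src != dst:
            host_arcs.add((src, dst))
    return host_arcs


# Made-up sample arcs: two cross-host links and one link inside a host.
page_arcs = [
    ("http://research.dws.uni-mannheim.de/a.html", "http://dws.uni-mannheim.de/"),
    ("http://research.dws.uni-mannheim.de/b.html", "http://dws.uni-mannheim.de/team"),
    ("http://dws.uni-mannheim.de/", "http://dws.uni-mannheim.de/team"),
]
host_arcs = aggregate_by_host(page_arcs)
print(sorted(host_arcs))
```

Using a set means that any number of hyperlinks between the same pair of hosts yields a single arc, which matches the "at least one hyperlink" rule stated in the list.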
The table below gives an overview of the size of the different graphs:
|Graph||#Nodes||#Arcs|
|Page Graph||3,563 million||128,736 million|
|Subdomain Graph||101 million||2,043 million|
|1st Level Subdomain Graph||95 million||1,937 million|
|PLD Graph||43 million||623 million|
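To get a feel for how dense each graph is, the figures in the table can be turned into an average number of arcs per node. The short Python sketch below reads the first numeric column as the node count and the second as the arc count, both in millions; the variable names are illustrative only.

```python
# Sizes copied from the table above, in millions: (nodes, arcs).
sizes = {
    "Page Graph": (3563, 128736),
    "Subdomain Graph": (101, 2043),
    "1st Level Subdomain Graph": (95, 1937),
    "PLD Graph": (43, 623),
}

# Average arcs per node; the "millions" unit cancels out in the ratio.
avg_arcs = {name: arcs / nodes for name, (nodes, arcs) in sizes.items()}

for name, value in avg_arcs.items():
    print(f"{name}: about {value:.1f} arcs per node")
```

Because the table values are rounded to millions, the resulting ratios are only rough estimates.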
2. Data Formats and Download
We provide the graphs for free download in several formats. All graphs are provided in an index/arc data format. In addition, we provide the page graph in the format used by the WebGraph library and the PLD graph in the format used by Pajek. The page graphs are hosted on Amazon S3. The aggregated graphs are provided for download via a server in Mannheim, Germany.
2.1 Index/Arc Format
The Index/Arc format represents each graph using two files. Within the index file each line represents one node. The first column states the node name, the second column states the node index. Within the arc file each line represents a directed edge between two nodes, where the first column is the origin node and the second the target node. The files are sorted by index and use tabs as a delimiter. The following example files contain a graph with 106 nodes and 141 arcs.
The following table contains the links for downloading the graphs.
|Data Set||Index File||Arc File|
|Page Graph||see below (45 GB)||see below (331 GB)|
|Subdomain Graph||download (832 MB)||download (9.2 GB)|
|1st Level Subdomain Graph||download (757 MB)||download (8.7 GB)|
|PLD Graph||download (297 MB)||download (2.8 GB)|
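Once downloaded, files in the index/arc format described in section 2.1 could be read along the following lines. This is a minimal Python sketch under stated assumptions: the function names `read_index` and `read_arcs` are hypothetical, the arc file columns are assumed to hold node indices rather than node names, and the input is assumed to be already decompressed plain text. The in-memory sample data is made up for illustration and is not taken from the published files.

```python
import io


def read_index(lines):
    """Build a mapping from node index to node name from 'name<TAB>index' lines."""
    names = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, idx = line.split("\t")
        names[int(idx)] = name
    return names


def read_arcs(lines):
    """Yield (origin index, target index) pairs from 'origin<TAB>target' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            # Assumption: both columns are integer node indices.
            origin, target = line.split("\t")
            yield int(origin), int(target)


# Made-up sample data standing in for an index file and an arc file.
index_file = io.StringIO("a.example\t0\nb.example\t1\nc.example\t2\n")
arc_file = io.StringIO("0\t1\n0\t2\n2\t0\n")

names = read_index(index_file)
named_arcs = [(names[o], names[t]) for o, t in read_arcs(arc_file)]
print(named_arcs)
```

The arc reader is a generator so that arcs can be processed one line at a time. For the larger graphs, holding the whole index in a Python dictionary may not fit in memory, so a more compact structure would likely be needed.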