Web Spam Challenge 2008: Corpus

The dataset (contents, links, and labels) can be downloaded from:

It is based on a crawl of the .uk domain performed in May 2007.

Two thirds of the labels have been released for training, and the remaining third is being withheld for testing.