Corpus for Track I
For Track I we will use the WEBSPAM-UK2006 collection, compiled by the University of Rome “La Sapienza” and the University of Milan with the support of the DELIS EU - FET research project, and hosted by Yahoo! Research Barcelona.
This corpus consists of 77 million pages from 11,400 hosts. The pages have been annotated at the host level. Over 3,000 hosts have been manually labelled by at least two human judges as “Spam”, “Not Spam” or “Borderline”. In addition, 3,000 hosts have been automatically labelled as “Not Spam” because they belong to trusted domains such as .gov.uk (UK government) or .police.uk (UK police force).
The following figure shows a partial view of the host graph of the UK2006 corpus:
Black nodes are spam hosts; white nodes are non-spam hosts.
The collection is available from Yahoo! Research Barcelona at http://www.yr-bcn.es/webspam/datasets/
A set of feature vectors is also available for the hosts and pages of this collection.