Main.PhaseICorpus History

June 26, 2007, at 02:50 AM by 82.67.192.190 -
Changed line 20 from:

A set of feature vectors is also available for the hosts and pages of this collection.

to:

A set of feature vectors is also available for the hosts and pages of this collection.

May 31, 2007, at 09:40 AM by ChaTo -
Added lines 1-20:

Corpus for Track I

WEBSPAM-UK2006

For Track I we will use the WEBSPAM-UK2006 collection, compiled by the University of Rome “La Sapienza” and the University of Milan with the support of the DELIS EU-FET research project and hosted by Yahoo! Research Barcelona.

This corpus consists of 77 million pages from 11,400 hosts. The pages are annotated at the host level: over 3,000 hosts have been manually labelled by at least two human judges as “Spam”, “Not Spam” or “Borderline”. In addition, 3,000 hosts have been automatically labelled as “Not Spam” because they belong to trusted domains such as .gov.uk (UK government) or .police.uk (UK police force).

The following figure shows a partial view of the host graph of the UK2006 corpus:

http://www.yr-bcn.es/webspam/graphs/spamgraph_uk2006.jpg | Black nodes are spam, white nodes are non-spam.

The collection is available from Yahoo! Research Barcelona at http://www.yr-bcn.es/webspam/datasets/

A set of feature vectors is also available for the hosts and pages of this collection.