Corpora for Track II
The final corpora will be available in June in two formats:
- ASCII format
- MATLAB format
The corpora will be of different sizes so that they can be used with a wide variety of Machine Learning models. Please contact Ludovic DENOYER if you have any suggestions concerning the creation and distribution of these corpora.
The first collection is the WebSpam collection developed at the University of Paris 6. It is composed of a connected graph of 5,000 Web pages and is labelled at the page level. Each Web page is labelled as "Spam", "Not Spam" or "Borderline"; the last category corresponds to Web pages whose content is only partially spam, such as blog spam pages. This collection is quite small and corresponds to a classical ML problem of graph labelling. Its main advantage is that it can be used "as is" with existing Machine Learning models. Figure 1 shows part of the graph corresponding to this collection.
http://www.yr-bcn.es/webspam/graphs/spamgraph_lip6.jpg | FIGURE 1: Red nodes are spam, blue nodes are normal, and green nodes are normal pages with spam content.
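To make the graph labelling setting concrete, here is a minimal sketch of a page graph with partial labels and a naive neighbour majority vote. This is only an illustrative baseline and not a method proposed by the organisers; the toy graph, the page identifiers and the function name neighbour_vote are hypothetical and do not come from the collection.

```python
from collections import Counter

# Hypothetical toy graph (not taken from the corpus): page id -> linked page ids.
links = {
    "a": ["b", "e"],
    "b": ["a", "e"],
    "c": ["d", "e"],
    "d": ["c"],
    "e": ["a", "b", "c"],
}

# Labels known for some pages only; "d" and "e" are unlabelled.
labels = {"a": "Spam", "b": "Spam", "c": "Not Spam"}


def neighbour_vote(page, links, labels):
    """Return the most common label among a page's labelled neighbours.

    Returns None when the page has no labelled neighbour.
    """
    votes = Counter(labels[n] for n in links.get(page, []) if n in labels)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


for page in ("d", "e"):
    print(page, neighbour_vote(page, links, labels))
```

A real model for this task would typically combine page content with the link structure; the vote above only shows how labels and edges fit together in one data structure.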
The second collection is the WEBSPAM-UK-2006 collection compiled by the University of Rome "La Sapienza" and the University of Milan, and hosted by Yahoo! Research Barcelona. This corpus consists of 77 million pages from 12,000 hosts. These pages have been annotated at the level of hosts. Over 3,000 hosts have been manually labelled by at least two human judges as "Spam", "Not Spam" or "Borderline". In addition, 3,000 hosts have been automatically labelled as "Not Spam" because they belong to trusted domains such as .gov.uk (UK government) or .police.uk (UK police force).
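The automatic labelling of trusted hosts can be pictured as a simple suffix check on the host name. The sketch below is an assumption about how such a rule might look, not the procedure actually used to build the corpus; only the two suffixes named in the text are included, and the function name auto_label and the example host names are hypothetical.

```python
# Only the two suffixes mentioned in the text; the full list of
# trusted domains used for the corpus is not given here.
TRUSTED_SUFFIXES = (".gov.uk", ".police.uk")


def auto_label(host):
    """Return "Not Spam" for a host under a trusted suffix, else None.

    None means the host is not covered by the rule and would need
    a manual judgement.
    """
    host = host.lower().rstrip(".")
    if host.endswith(TRUSTED_SUFFIXES):
        return "Not Spam"
    return None


# Hypothetical host names, for illustration only.
for host in ("www.example.gov.uk", "www.example.co.uk"):
    print(host, auto_label(host))
```

The leading dot in each suffix matters: it keeps a host such as fakegov.uk from matching the .gov.uk rule.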
The collection is available from Yahoo! Research Barcelona at http://www.yr-bcn.es/webspam/
Figure 2 shows a partial view of the host graph corresponding to the WEBSPAM-UK-2006 corpus:
http://www.yr-bcn.es/webspam/graphs/spamgraph_uk2006.jpg | FIGURE 2: Black nodes are spam, white nodes are non-spam.