Challenge description

The challenge will explore the Web spam detection problem through a series of tests on Web spam labelling tasks using machine learning techniques.

General problem

In this challenge, we propose to deal with the following general task. We consider a directed graph G = (V, E), where V is the set of nodes and E ⊆ V × V is the set of edges. Each vertex v and each edge e of the graph is associated with content information, denoted c_v and c_e respectively. In the context of Web spam detection, G corresponds to the graph of Web pages: each node v is a Web page, each edge e = (v_i, v_j) corresponds to a hyperlink from page v_i to page v_j; c_v is the full content of page v, and c_e is the anchor text of the hyperlink e plus other positional/contextual information about the link. We consider that each node belongs to a particular category in the set of possible labels L, and we denote by H : V → L the function that associates a category to each node. For example, H(v) tells us whether a Web page v is "spam" or "not spam". The goal of the challenge is to compute a good approximation H̄ of H using a set of manually labeled nodes. Naturally, besides using c_v and c_e directly, a vector of features extracted from them, F = (f_1(c_v), f_2(c_v), ..., f_ℓ(c_v), g_1(c_e), g_2(c_e), ..., g_s(c_e)), can be used for computing the approximation.
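To make the setting concrete, the sketch below casts the task as plain node classification in Python, with toy data standing in for the real corpus; the graph, feature vectors, labels, and the use of scikit-learn's logistic regression are illustrative assumptions, not part of the challenge specification.

```python
# Minimal baseline: Web spam detection as node classification.
# Toy stand-ins for the real corpus; any feature vector
# F = (f_1(c_v), ..., f_l(c_v), g_1(c_e), ..., g_s(c_e)) per node would do.
import numpy as np
from sklearn.linear_model import LogisticRegression

edges = [(0, 1), (1, 2), (2, 0), (3, 1)]  # E (unused by this feature-only baseline)
features = np.random.rand(4, 8)           # one feature vector per node of V
labels = np.array([1, 0, 1, -1])          # 1 = spam, 0 = not spam, -1 = unlabeled

train = labels >= 0                       # the manually labeled nodes
clf = LogisticRegression().fit(features[train], labels[train])

# The learned approximation H_bar of H, applied to every node of the graph.
print(clf.predict(features))
```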

Proposed tasks

The task of the challenge consists in labeling the nodes of a graph, given partial labels on this graph. We will propose tests corresponding to different situations by varying:

- the size of the training set, in order to test the capacity of the methods to learn from small samples (see the sketch after this list);

- the size of the graphs, in order to test the ability of the methods to scale to large problems;

- the information available at each node;

- the nature of the nodes (either Web pages or whole sites).

For the information available at each node, we will propose:

1. A full version of the Web corpora, composed of the graph, the labels and the Web page contents.

2. A preprocessed version, composed of the graph, the labels and a set of feature vectors extracted by the organizers from the contents and links of the pages.
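As an illustration of the first variation above, a learning-curve style protocol can vary the fraction of labeled nodes available for training; in the sketch below, the synthetic data, split sizes and F1 metric are assumptions made for the example only.

```python
# Learning-curve evaluation: vary the fraction of labeled nodes used for
# training and measure how well the method generalises to the rest.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.random((1000, 8))                   # feature vectors for 1000 nodes
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # synthetic spam / not-spam labels

for train_fraction in (0.01, 0.05, 0.25):   # small to larger training sets
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_fraction, random_state=0, stratify=y)
    clf = LogisticRegression().fit(X_tr, y_tr)
    print(train_fraction, f1_score(y_te, clf.predict(X_te)))
```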

Besides the development of general machine learning methods for combining features and building automatic classifiers for graph and structured data, the challenge will make it possible to study feature extraction algorithms, which extract vectors of features from the contents and links of the Web pages, and feature aggregation and propagation algorithms, which use the graph to guide the labelling process.
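One simple instance of such a propagation algorithm (a sketch only, not a method prescribed by the challenge) repeatedly mixes each node's spam score with the mean score of the pages linking to it; the toy graph, initial scores and mixing weight alpha below are illustrative assumptions.

```python
# Score propagation over the hyperlink graph: each node's spam score is
# repeatedly averaged with the scores of the pages that link to it.
from collections import defaultdict

edges = [(0, 1), (1, 2), (2, 0), (3, 1)]   # (source page, target page)
scores = {0: 0.9, 1: 0.1, 2: 0.8, 3: 0.2}  # initial per-node spam scores

in_links = defaultdict(list)
for src, dst in edges:
    in_links[dst].append(src)

alpha = 0.5                                # weight kept on a node's own score
for _ in range(10):                        # a few propagation rounds
    scores = {
        v: alpha * s + (1 - alpha) * (
            sum(scores[u] for u in in_links[v]) / len(in_links[v])
            if in_links[v] else s)
        for v, s in scores.items()
    }
print(scores)
```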
