Training corpora

For the second part of the challenge, the graphs are fully preprocessed and you can only use the provided information (i.e., feature vectors, link matrix, and training labels). If necessary, you may derive new features, but only from this data. The use of external data sources is not allowed.

We provide two corpora:

  • The small corpus is composed of about 9,000 nodes (corresponding to hosts in a hostgraph)
  • The large corpus is composed of about 400,000 nodes (corresponding to web pages in a webgraph)

For each corpus, we provide:

  • A set of feature vectors: each vector describes one node of the graph
  • A link matrix: each non-zero entry represents an edge in the graph
  • A set of training labels: each node has a label as spam (1) or normal (0)

Note that the testing labels will have the same distribution as the training labels (about 80% normal labels and 20% spam labels).

A. File format

Each node of the graph has a unique ID, ranging from 1 to N.

Feature vectors

This file contains one sparse vector for each node; each line of the file corresponds to one vector. The first column is the ID of the node, and each following column has the format feature_number:value.
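As an illustration, here is a minimal sketch of how such a file could be loaded into a sparse matrix with SciPy. The file path, the total number of features, and the assumption that feature numbers start at 1 are ours, not part of the corpus specification.

    from scipy.sparse import csr_matrix

    def load_feature_vectors(path, n_nodes, n_features):
        rows, cols, vals = [], [], []
        with open(path) as f:
            for line in f:
                parts = line.split()
                if not parts:
                    continue
                node_id = int(parts[0])            # first column: node ID (1..N)
                for item in parts[1:]:             # remaining columns: feature_number:value
                    feat, value = item.split(":")
                    rows.append(node_id - 1)       # shift IDs to 0-based row indices
                    cols.append(int(feat) - 1)     # assumes feature numbers start at 1
                    vals.append(float(value))
        return csr_matrix((vals, (rows, cols)), shape=(n_nodes, n_features))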

Link matrix

Each line of the file corresponds to an edge between two nodes:

  • the first column is the ID of the source node,
  • the second column is the ID of the destination node
  • the last column is the weight of the edge between the two nodes

In the case of graphs of web pages, the weight is always 1 for existing hyperlinks. In the case of graphs of hosts, the weight is the number of different pairs of pages that are linked.
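The sketch below shows one possible way to read this file into a sparse adjacency matrix; the file path is an assumption, and the 1-based node IDs are shifted to 0-based indices.

    from scipy.sparse import coo_matrix

    def load_link_matrix(path, n_nodes):
        src, dst, weights = [], [], []
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) != 3:
                    continue
                src.append(int(parts[0]) - 1)      # source node ID, shifted to 0-based
                dst.append(int(parts[1]) - 1)      # destination node ID, shifted to 0-based
                weights.append(float(parts[2]))    # edge weight
        return coo_matrix((weights, (src, dst)), shape=(n_nodes, n_nodes)).tocsr()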

Training labels

Each line corresponds to one label:

  • The first column is the ID of the node
  • The second column is the label; 0 means normal, 1 means spam
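A minimal sketch for reading this file into a dictionary mapping node IDs to labels (only labelled nodes appear in the training file; the path is an assumption):

    def load_labels(path):
        labels = {}
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) != 2:
                    continue
                labels[int(parts[0])] = int(parts[1])   # node ID -> 0 (normal) or 1 (spam)
        return labels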

B. Download

Corpus #1 (small corpus)

  • Feature vectors (150 MB) - The vectors correspond to normalized TF-IDF vectors over the content of 100 pages of each host.
  • Link matrix (5.8 MB)
  • Training labels (6.1 KB)
  • Validation labels - these labels can be used as a testing set when submitting a paper to the GraphLab workshop. If the paper is accepted, participants will have to use the full testing set, which will be made available later (see timeline), for the camera-ready version.

Note: the node IDs do not correspond to the host IDs of track I of the challenge.

Corpus #2 (large corpus)

Here, one node corresponds to one page.

  • Feature vectors (150 MB) - The vectors correspond to term-frequency vectors over the content of each page.
  • Link matrix (52 MB)
  • Training labels (341 KB)
  • Validation labels - these labels can be used as a testing set when submitting a paper to the GraphLab workshop. If the paper is accepted, participants will have to use the full testing set, which will be made available later (see timeline), for the camera-ready version.

In order to generate this collection, we:

  • randomly chose a normal first Web page connected to a spam Web page,
  • then sequentially added to the set of web pages a new page connected to this set:
    • if at time 't' less than 80% of the current set consists of normal pages, a normal page is added,
    • if more than 80% of the current set consists of normal pages, a spam page is added.

The process was stopped when the size of the current set reached 400,000.
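To make the rule concrete, here is an illustrative simulation of the label sequence this process produces. It is not the organizers' script; it only tracks labels, and it resolves the unspecified case of exactly 80% normal pages by adding a spam page.

    def label_sequence(target_size=400_000, normal_ratio=0.80):
        labels = ["normal", "spam"]        # seed: a normal page linked to a spam page
        n_normal = 1
        while len(labels) < target_size:
            if n_normal / len(labels) < normal_ratio:
                labels.append("normal")    # less than 80% normal: add a normal page
                n_normal += 1
            else:
                labels.append("spam")      # otherwise: add a spam page
        return labels

    seq = label_sequence(target_size=1000)
    print(sum(l == "normal" for l in seq) / len(seq))   # close to 0.8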

Contact: ludovic [.] denoyer [] lip6 [.]fr