Evaluation Metrics

Test set and ground truth

After data collection for the web spam challenge, the labeled data was randomly split into a training set and a test set (two thirds training, one third test). The training set was released along with labels, content, links, and some pre-computed feature vectors. Each test sample was evaluated by one or more judges. These judgments will be used to compute a spamicity score for each host by averaging the assessments with the following weights:

  • NONSPAM counts as 0.0
  • BORDERLINE counts as 0.5
  • SPAM counts as 1.0

Judgments labeled CAN'T CLASSIFY will be dropped from the spamicity score calculation. Ground truth will be produced by marking samples with a spamicity score greater than 0.5 as SPAM and those below 0.5 as NONSPAM. Samples with no judgments, or with a spamicity score exactly equal to 0.5, will not be included in the test set.
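
As a concrete illustration, here is a minimal sketch of this labeling procedure in Python. Only the weights and thresholds come from the rules above; the function names and the example judgment list are hypothetical.

# Sketch of the ground-truth labeling rules described above.
# Weights and thresholds come from the rules; names are illustrative.

WEIGHTS = {"NONSPAM": 0.0, "BORDERLINE": 0.5, "SPAM": 1.0}

def spamicity(judgments):
    """Average the judge assessments, ignoring CAN'T CLASSIFY."""
    scores = [WEIGHTS[j] for j in judgments if j in WEIGHTS]
    if not scores:
        return None          # no usable judgments
    return sum(scores) / len(scores)

def ground_truth_label(judgments):
    """Return SPAM / NONSPAM, or None if the host is excluded from the test set."""
    s = spamicity(judgments)
    if s is None or s == 0.5:
        return None          # excluded: no judgments, or exactly borderline
    return "SPAM" if s > 0.5 else "NONSPAM"

# Example (hypothetical judgments for one host):
# ground_truth_label(["SPAM", "BORDERLINE", "CAN'T CLASSIFY"])  ->  "SPAM"  (spamicity 0.75)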

The test labels are here: http://chato.cl/webspam/datasets/uk2007/labels/webspam-uk2007-testlabels.tar.gz

Evaluation of the predicted spamicity

Submitted predictions are tuples: hostname, prediction, probability_spam (see the submission format). The probability_spam field is a real number corresponding to the predicted spamicity as defined above. We will use the Area Under the ROC Curve (AUC) as the evaluation metric; it measures the performance of the spamicity predictions. An easy way of calculating it (and also of obtaining a precision-recall curve) is to use the perf program, e.g.:

% cat team_predicted_spamicity.txt \
  | sed 's/NONSPAM/0/g' | sed 's/SPAM/1/g' \
  | grep -v '^#' | awk '{print $2,$3}' | perf -PRF -AUC -plot pr

This prints the points of the precision-recall curve (recall, then precision), followed by the F-measure at the 0.5 prediction threshold (PRF) and the area under the ROC curve (ROC):

0.3333 1.0000
0.6667 1.0000
0.6667 0.6667
1.0000 0.7500
1.0000 0.6000
1.0000 0.5000

PRF    0.85714   pred_thresh  0.500000
ROC    0.88889
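
As an alternative to perf, the same AUC can be computed with scikit-learn. The following is a minimal sketch under the same assumptions as the example above: a whitespace-separated file with the SPAM/NONSPAM label in the second column and the spamicity score in the third; the file name is only illustrative.

# Sketch: AUC from a file laid out as in the perf example above.
# File name and column layout are assumptions, not part of the official tooling.
from sklearn.metrics import roc_auc_score

labels, scores = [], []
with open("team_predicted_spamicity.txt") as f:
    for line in f:
        if line.startswith("#") or not line.strip():
            continue                                   # skip comment/blank lines, like grep -v '^#'
        host, label, score = line.split()[:3]
        labels.append(1 if label == "SPAM" else 0)     # SPAM -> 1, NONSPAM -> 0
        scores.append(float(score))

print("AUC:", roc_auc_score(labels, scores))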

Ranking and tie breaking

Submissions will be ranked in decreasing order of the AUC obtained from their predicted spamicity scores, so the team with the highest AUC is ranked first. If two consecutively ranked submissions differ by less than one percentage point (0.01) in AUC, a tie will be declared for that rank.

If the first two ranks produce a tie, it will be resolved in the following manner: the test set will be randomly partitioned into five disjoint subsets of 20% each, AUC will be computed separately on each subset, and the submission with the lower variance of AUC across the subsets will be declared the winner.
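
A sketch of this tie-breaking procedure, assuming it is implemented by shuffling the test hosts and splitting them into five equal parts (function and variable names are illustrative):

# Sketch of the tie-breaking rule: variance of AUC over five random disjoint 20% subsets.
# The exact splitting strategy and the names below are assumptions for illustration.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_variance(y_true, y_score, n_splits=5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y_true))        # random order of the test hosts
    parts = np.array_split(idx, n_splits)     # five disjoint ~20% subsets
    # Each subset must contain both classes for AUC to be defined.
    aucs = [roc_auc_score(np.asarray(y_true)[p], np.asarray(y_score)[p]) for p in parts]
    return np.var(aucs)

# Of the two tied submissions, the one with the lower auc_variance(...) wins.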