Main.PhaseIEvaluation History

May 31, 2007, at 09:47 AM by ChaTo -

Assessment phase

After all the teams have submitted their predictions, we will begin an assessment phase. Suppose T teams participate. We will pick 100*T hosts uniformly at random from the pool described below, and pair judges at random so that each of those hosts receives two evaluations. This gives 200*T evaluations in total, so each team will tag a minimum of 200 hosts. This figure may increase if only a few teams participate.

The hosts in the evaluation pool will be hosts that:

  • Are not in the training set. This excludes hosts that have been labeled by judges in the training set, or hosts that match any of the 'trusted' domains (e.g.:,, ...)
  • Have their homepage in the .warc summary content file and the .graph graph file (that is, the host appears here).
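
The pool construction and the uniform sampling of 100*T hosts can be sketched as follows. This is only an illustration under stated assumptions: the function names, the argument names, and the use of a suffix match for 'trusted' domains are ours and are not part of the challenge tools.

```python
import random

def build_evaluation_pool(all_hosts, training_hosts, trusted_suffixes, hosts_with_data):
    """Return the hosts eligible for assessment under the two rules above.

    Assumption: a host "matches" a trusted domain when its name ends with
    one of the given suffixes. The actual matching rule is not specified here.
    """
    pool = []
    for host in all_hosts:
        if host in training_hosts:
            continue  # already labeled in the training set
        if any(host.endswith(suffix) for suffix in trusted_suffixes):
            continue  # matches a trusted domain
        if host not in hosts_with_data:
            continue  # homepage missing from the content or graph file
        pool.append(host)
    return pool

def sample_hosts(pool, num_teams, per_team=100, seed=None):
    """Pick per_team * num_teams distinct hosts uniformly at random."""
    rng = random.Random(seed)
    # random.sample raises ValueError if the pool is too small.
    return rng.sample(pool, per_team * num_teams)
```

With T = 3 teams, sample_hosts(pool, 3) returns 300 distinct hosts, provided the pool holds at least that many.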

The assessors will have four options for each host they have to tag:

  • NONSPAM - The host does not contain spamming aspects.
  • BORDERLINE - The host contains some aspects that are suspected of being spam.
  • SPAM - The host contains spamming aspects.
  • CAN'T CLASSIFY - The assessor could not classify the host. This counts as a null judgment.

The guidelines given to the assessors will be the same as those used for building the training set. The assessors will have access to a cached version of the pages' contents.

Evaluation metrics

After collecting the assessments, a list will be created with all the judgments, and an average spamicity score will be calculated for each host by averaging its assessments using the following scores:

  • NONSPAM counts as 0.0
  • BORDERLINE counts as 0.5
  • SPAM counts as 1.0
  • CAN'T CLASSIFY does not count.

Only hosts in the assessment pool whose final spamicity score differs from 0.5 count. As ground truth, hosts with a score >0.5 are considered SPAM and hosts with a score <0.5 are considered NONSPAM.
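
The scoring rule can be sketched in a few lines of Python. The names SCORES, spamicity and ground_truth are ours, and returning None for a host with no usable judgment is an assumption, since that case is not described above.

```python
# Numeric value of each judgment. CAN'T CLASSIFY is deliberately absent,
# so it is skipped when averaging.
SCORES = {"NONSPAM": 0.0, "BORDERLINE": 0.5, "SPAM": 1.0}

def spamicity(judgments):
    """Average score of the non-null judgments, or None if there are none."""
    values = [SCORES[j] for j in judgments if j in SCORES]
    if not values:
        return None
    return sum(values) / len(values)

def ground_truth(judgments):
    """Return "SPAM", "NONSPAM", or None when the host does not count."""
    score = spamicity(judgments)
    if score is None or score == 0.5:
        return None  # excluded from the ground truth
    return "SPAM" if score > 0.5 else "NONSPAM"
```

For instance, a host judged SPAM by one assessor and BORDERLINE by the other has spamicity 0.75 and counts as SPAM, while a host judged SPAM and NONSPAM has spamicity 0.5 and is left out.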

We will use two evaluation metrics. The first one is for all the systems, and it is related to the exact (binary) classification problem. The second one is for the systems that produce non-binary spam scores (not mandatory).

Evaluation metric #1: Evaluate binary classification

The predictions from each team will be aligned with the test data in the following way, and a confusion matrix will be derived. The prediction will be 0 if NONSPAM, 1 if SPAM. Example (team_predictions.txt):

#Hostname GroundTruth Prediction
NONSPAM 1.00
NONSPAM 0.00
SPAM 1.00
NONSPAM 0.00
SPAM 0.00
SPAM 1.00

In this case, the confusion matrix is:

A = NONSPAM classified as NONSPAM = 2
B = NONSPAM classified as SPAM = 1
C = SPAM classified as NONSPAM = 1
D = SPAM classified as SPAM = 2
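
As a cross-check of the counts above, the matrix can be derived from the (GroundTruth, Prediction) pairs of the example. This is a minimal sketch; the function name confusion_matrix is ours and is not part of the challenge tools.

```python
def confusion_matrix(pairs):
    """Count the four cells A, B, C, D from (ground_truth, prediction) pairs.

    ground_truth is "NONSPAM" or "SPAM"; prediction is 0 (NONSPAM) or 1 (SPAM).
    """
    counts = {"A": 0, "B": 0, "C": 0, "D": 0}
    for truth, prediction in pairs:
        predicted_spam = float(prediction) == 1.0
        if truth == "NONSPAM":
            counts["B" if predicted_spam else "A"] += 1
        elif truth == "SPAM":
            counts["D" if predicted_spam else "C"] += 1
        else:
            raise ValueError(f"unexpected ground truth label: {truth!r}")
    return counts

# The six (GroundTruth, Prediction) pairs of the example.
example = [
    ("NONSPAM", 1.00), ("NONSPAM", 0.00), ("SPAM", 1.00),
    ("NONSPAM", 0.00), ("SPAM", 0.00), ("SPAM", 1.00),
]
print(confusion_matrix(example))
```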

The following metrics will be used, with SPAM as the positive class:

P = PRECISION = D / (B + D)
R = RECALL = D / (C + D)
F-MEASURE = 2 * ( (P*R) / (P+R) )

In this example, the F-MEASURE is 0.6667. An easy way of calculating this is to use the perf program, e.g.:

% cat team_predictions.txt \
  | sed 's/NONSPAM/0/g' | sed 's/SPAM/1/g' \
  | grep -v '^#' | awk '{print $2,$3}' | perf -PRF

PRF    0.66667   pred_thresh  0.500000
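
For readers without perf, the same figure can be reproduced from the confusion matrix counts. The sketch assumes the usual definitions of precision, D / (B + D), and recall, D / (C + D), with SPAM as the positive class; the function name f_measure and the choice to return 0.0 when nothing can be computed are ours.

```python
def f_measure(b, c, d):
    """F-MEASURE from the counts B, C and D of the confusion matrix.

    A (NONSPAM classified as NONSPAM) does not enter the formula.
    """
    precision = d / (b + d) if (b + d) else 0.0
    recall = d / (c + d) if (c + d) else 0.0
    if precision + recall == 0:
        return 0.0  # assumption: undefined cases are scored as 0
    return 2 * (precision * recall) / (precision + recall)

# Counts from the example: B = 1, C = 1, D = 2.
print(round(f_measure(1, 1, 2), 4))
```

With P = R = 2/3 the result is 2/3, which matches the 0.6667 reported by perf.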

All the entries will be sorted by F-MEASURE in decreasing order.

Evaluation metric #2: Evaluate predicted spamicity

This evaluation metric aims at measuring how well each system predicts spamicity. For the teams whose model provides a probability that a given host is spam (a spamicity), the predictions will be aligned with the test data in the following way (team_predicted_spamicity.txt):

#Hostname GroundTruth Prediction
NONSPAM 0.20
NONSPAM 0.10
SPAM 0.60
NONSPAM 0.10
SPAM 0.80
SPAM 0.90

It is important that the predicted spamicity uses 0.5 as its threshold: a value below 0.5 indicates that the system believes the host is NONSPAM, and a value above 0.5 indicates that the system believes the host is SPAM.

The area under the ROC curve (AUC) will be used as a metric. An easy way of calculating this (and also of obtaining a precision-recall curve and the F-Measure above) is to use the perf program, e.g.:

% cat team_predicted_spamicity.txt \
  | sed 's/NONSPAM/0/g' | sed 's/SPAM/1/g' \
  | grep -v '^#' | awk '{print $2,$3}' | perf -PRF -AUC -plot pr

0.3333 1.0000
0.6667 1.0000
0.6667 0.6667
1.0000 0.7500
1.0000 0.6000
1.0000 0.5000
PRF    0.85714   pred_thresh  0.500000
ROC    0.88889
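
The AUC can also be understood as the fraction of (SPAM, NONSPAM) pairs in which the SPAM host receives the higher predicted spamicity. The sketch below computes it that way. The function name auc is ours, the labels and scores are made up for illustration and are not challenge data, and counting a tie as half a pair is an assumption, since the way perf handles ties is not described here.

```python
def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison.

    labels: 1 for SPAM, 0 for NONSPAM. scores: predicted spamicity.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one SPAM and one NONSPAM host")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # SPAM host ranked above NONSPAM host
            elif p == n:
                wins += 0.5  # assumption: a tie counts as half
    return wins / (len(positives) * len(negatives))

# Made-up data: one NONSPAM host (0.70) outranks one SPAM host (0.60).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.90, 0.80, 0.70, 0.60, 0.20, 0.10]
print(round(auc(labels, scores), 5))
```

Here 8 of the 9 pairs are ordered correctly, so the AUC is 8/9. The double loop is quadratic in the number of hosts, which is fine for a check on a small sample; a rank-based formula is preferable for a full test set.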

The entries giving predicted spamicity scores will be sorted by AUC in decreasing order.


  • The ordering may differ between the two metrics. In that case, there will be two winners.
  • For a difference between entries to be considered significant, it has to be larger than the difference obtained by adding the hosts with spamicity=0.5 to the test set, labeled either as NONSPAM or as SPAM.