Web Spam Challenge: Phase I Evaluation

Assessment phase

After all the teams submit their predictions, we will begin an assessment phase. Suppose T teams participate. We will pick 100*T hosts uniformly at random from the pool described below, and assign two judges at random to each of those hosts so that every host receives two independent evaluations. Since the resulting 200*T judgments are shared among the T teams, each team will assess a minimum of 200 hosts. This figure may increase if only a few teams participate.
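
For concreteness, here is a minimal sketch of this sampling and assignment step in Python. The team names, the host pool, and the fixed random seed are illustrative placeholders, not part of the challenge:

import random

def plan_assessments(host_pool, teams, seed=0):
    """Pick 100*T hosts uniformly at random and assign two distinct judges
    to each, so the resulting 200*T judgments average out to 200 per team."""
    rng = random.Random(seed)
    t = len(teams)
    sampled_hosts = rng.sample(host_pool, 100 * t)
    assignments = {}                      # host -> (judge_a, judge_b)
    for host in sampled_hosts:
        assignments[host] = tuple(rng.sample(teams, 2))
    return assignments

# Hypothetical example: 3 teams judging hosts drawn from a pool of 10,000.
teams = ["team_a", "team_b", "team_c"]
pool = ["host%d.example" % i for i in range(10000)]
plan = plan_assessments(pool, teams)
print(len(plan))                          # 300 hosts, i.e. 600 judgments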

The hosts in the evaluation pool will be hosts that:

The assessors will have four options for each host they have to tag:

The guidelines given to the assessors will be the same as those used for building the training set. The assessors will have access to a cached version of the pages' contents.

Evaluation metrics

After collecting the assessments, all the judgments will be compiled into a single list, and an average spamicity score will be calculated for each host by averaging its assessments using the following scores:

Only hosts in the assessment pool with a final spamicity score different from 0.5 are counted. Hosts with a score greater than 0.5 are considered SPAM and hosts with a score less than 0.5 are considered NONSPAM in the ground truth.
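
As an illustration, here is a minimal sketch of this averaging and labelling step. The numeric values attached to the four assessment options are not reproduced on this page, so the judgment scores in the example are hypothetical:

def ground_truth(judgments):
    """judgments maps each host to the list of numeric scores given by its
    assessors (the numeric value of each assessment option is defined by
    the challenge guidelines and not assumed here).  Hosts whose average
    spamicity is exactly 0.5 are dropped; the others are labelled SPAM
    (greater than 0.5) or NONSPAM (less than 0.5)."""
    labels = {}
    for host, scores in judgments.items():
        spamicity = sum(scores) / len(scores)
        if spamicity == 0.5:
            continue                      # excluded from the ground truth
        labels[host] = "SPAM" if spamicity > 0.5 else "NONSPAM"
    return labels

# Hypothetical judgments, two assessors per host:
print(ground_truth({"www.hostA.example": [1.0, 0.0],    # average 0.5 -> dropped
                    "www.hostB.example": [1.0, 1.0]}))  # average 1.0 -> SPAM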

We will use two evaluation metrics. The first applies to all systems and measures performance on the exact (binary) classification problem. The second applies only to systems that produce non-binary spamicity scores (providing such scores is not mandatory).

Evaluation metric #1: Evaluate binary classification

The predictions from each team will be aligned with the test data in the following way, and a confusion matrix will be derived. The prediction will be 0 if NONSPAM, 1 if SPAM. Example (team_predictions.txt):

#Hostname GroundTruth Prediction
www.host1.co.uk NONSPAM 1.00
www.host2.co.uk NONSPAM 0.00
www.host3.co.uk SPAM 1.00
www.host4.co.uk NONSPAM 0.00
www.host5.co.uk SPAM 0.00
www.host6.co.uk SPAM 1.00

In this case, the confusion matrix is:

A = NONSPAM classified as NONSPAM = 2
B = NONSPAM classified as SPAM = 1
C = SPAM classified as NONSPAM = 1
D = SPAM classified as SPAM = 2
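
For illustration, here is a minimal sketch that reproduces these counts from a file in the team_predictions.txt format. Treating any prediction of at least 0.5 as SPAM is an assumption made here; the example file only contains the values 0.00 and 1.00:

def confusion_matrix(path):
    """Count A, B, C, D from a '#Hostname GroundTruth Prediction' file in
    which the prediction column is 0 (NONSPAM) or 1 (SPAM)."""
    a = b = c = d = 0
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            _, truth, prediction = line.split()
            predicted_spam = float(prediction) >= 0.5
            if truth == "NONSPAM":
                if predicted_spam:
                    b += 1
                else:
                    a += 1
            else:
                if predicted_spam:
                    d += 1
                else:
                    c += 1
    return a, b, c, d

print(confusion_matrix("team_predictions.txt"))   # (2, 1, 1, 2) for the example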

The following metrics will be used:

PRECISION = (D / (B+D))
RECALL = TRUE POSITIVE RATE = (D / (C+D))
FALSE POSITIVE RATE = (B / (B+A))
F-MEASURE = 2 * ( (P*R) / (P+R) )
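
A minimal sketch of these formulas in Python, applied to the counts A=2, B=1, C=1, D=2 from the example above:

def metrics(a, b, c, d):
    """Precision, recall (true positive rate), false positive rate and
    F-measure computed from the confusion-matrix counts defined above."""
    precision = d / (b + d)
    recall = d / (c + d)
    false_positive_rate = b / (b + a)
    f_measure = 2 * (precision * recall) / (precision + recall)
    return precision, recall, false_positive_rate, f_measure

precision, recall, fpr, f_measure = metrics(2, 1, 1, 2)
print(round(f_measure, 4))   # 0.6667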

In this example, the F-MEASURE is 0.6667. An easy way of calculating this is to use the perf program, e.g.:

% cat team_predictions.txt \
  | sed 's/NONSPAM/0/g' | sed 's/SPAM/1/g' \
  | grep -v '^#' | awk '{print $2,$3}' | perf -PRF

PRF    0.66667   pred_thresh  0.500000

All the entries will be sorted by F-MEASURE in decreasing order.

Evaluation metric #2: Evaluate predicted spamicity

This evaluation metric aims at measuring the quality of the predicted spamicity. For the teams that provide the probability, according to their model, that a given host is spam (a spamicity score), the predictions will be aligned with the test data in the following way (team_predicted_spamicity.txt):

#Hostname GroundTruth Prediction
www.host1.co.uk NONSPAM 0.20
www.host2.co.uk NONSPAM 0.10
www.host3.co.uk SPAM 0.60
www.host4.co.uk NONSPAM 0.10
www.host5.co.uk SPAM 0.80
www.host6.co.uk SPAM 0.90

It is important that the predicted spamicity uses 0.5 as a threshold: a value of less than 0.5 indicates that the system believes the host is NONSPAM, and a value greater than 0.5 indicates that the system believes the host is SPAM.
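
A team that produces spamicity scores can therefore derive the binary predictions used in evaluation metric #1 by thresholding at 0.5. A minimal sketch; how a score of exactly 0.5 should be treated is not specified, so mapping it to NONSPAM here is an arbitrary assumption:

def binarize(spamicity):
    """Turn a predicted spamicity into the binary prediction used by
    metric #1: above 0.5 means SPAM (1), below 0.5 means NONSPAM (0)."""
    return 1.0 if spamicity > 0.5 else 0.0

print(binarize(0.60), binarize(0.10))   # 1.0 0.0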

The area under the ROC curve (AUC) will be used as the metric. An easy way of calculating this (and also of obtaining a precision-recall curve and the F-measure defined above) is to use the perf program, e.g.:

% cat team_predicted_spamicity.txt \
  | sed 's/NONSPAM/0/g' | sed 's/SPAM/1/g' \
  | grep -v '^#' | awk '{print $2,$3}' | perf -PRF -AUC -plot pr

0.3333 1.0000
0.6667 1.0000
0.6667 0.6667
1.0000 0.7500
1.0000 0.6000
1.0000 0.5000
PRF    0.85714   pred_thresh  0.500000
ROC    0.88889

The entries giving predicted spamicity scores will be sorted by AUC in decreasing order.
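
As a cross-check on what AUC measures, here is a minimal sketch that computes it directly as the fraction of correctly ordered (SPAM, NONSPAM) pairs; the four scores used in the example are hypothetical:

def auc(labelled_scores):
    """Area under the ROC curve, computed as the fraction of
    (SPAM, NONSPAM) pairs in which the SPAM host received the higher
    predicted spamicity; ties count as half a correct pair."""
    spam = [s for label, s in labelled_scores if label == "SPAM"]
    nonspam = [s for label, s in labelled_scores if label == "NONSPAM"]
    correct = 0.0
    for s in spam:
        for n in nonspam:
            if s > n:
                correct += 1.0
            elif s == n:
                correct += 0.5
    return correct / (len(spam) * len(nonspam))

data = [("SPAM", 0.9), ("NONSPAM", 0.7), ("SPAM", 0.6), ("NONSPAM", 0.1)]
print(auc(data))   # 0.75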

