Pre-computed feature vectors

To reduce the amount of data processing required, we provide the following pre-computed feature vectors for WEBSPAM-UK2006:

Feature set 1: direct features

The file includes two direct, obvious features that are not link-based: the number of pages in the host and the number of characters in the host name. The number of pages per host is extracted from the graph files.
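As a rough illustration of how these two features could be computed, here is a minimal sketch. It assumes a hypothetical input in which each page is represented by the name of its host; the actual layout of the graph files is not described here, and the function name `direct_features` and the output keys are made up for this example.

```python
from collections import Counter


def direct_features(page_hosts):
    """Compute the two direct features for every host.

    page_hosts is a hypothetical input: an iterable with one host name
    per page (so a host with three pages appears three times).
    """
    pages_per_host = Counter(page_hosts)
    return {
        host: {
            "num_pages": count,            # number of pages in the host
            "hostname_length": len(host),  # number of characters in the host name
        }
        for host, count in pages_per_host.items()
    }


# Made-up example data, not taken from the collection.
example_pages = ["www.example.co.uk", "www.example.co.uk", "shop.example.co.uk"]
features = direct_features(example_pages)
```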

Feature set 2a: link-based features

The following file includes link-based features for the hosts, measured on both the home page and the page with the maximum PageRank in each host.

This includes link-based features such as in-degree, out-degree, PageRank, edge reciprocity, assortativity coefficient, TrustRank, Truncated PageRank, estimation of supporters, etc. See link-based features description. All these features are extracted from the graph files.

The lists of the url-ids of the home page and of the page with the maximum PageRank of each host are also available.

Please read the update of September 15th, 2007, in the file with the link-based features description.

Feature set 2b: transformed link-based features

The following file includes simple numeric transformations and combinations of the link-based features for the hosts:

These transformations were found to work better for classification in practice than the raw link-based features. They are mostly ratios between features, such as Indegree/PageRank or TrustRank/PageRank, and log(.) of several features. See entire list. All these features are extracted from the graph files, and can be derived from the link-based features above.
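The kind of transformation described here can be sketched as follows. This is only an illustration under stated assumptions: the function name, the output keys, and the small constant `eps` used to avoid division by zero and log(0) are choices made for this example, not the procedure used to build the file.

```python
import math


def transform_link_features(indegree, pagerank, trustrank, eps=1e-12):
    """Derive ratio and log features from raw link-based features.

    eps is an assumed smoothing constant that guards against division
    by zero and log(0); the provided file may handle zeros differently.
    """
    return {
        "indegree_over_pagerank": indegree / (pagerank + eps),
        "trustrank_over_pagerank": trustrank / (pagerank + eps),
        "log_indegree": math.log1p(indegree),      # log(1 + indegree)
        "log_pagerank": math.log(pagerank + eps),
    }


# Made-up values for a single host.
transformed = transform_link_features(indegree=9, pagerank=0.5, trustrank=0.25)
```

Ratios such as Indegree/PageRank compare two quantities that normally grow together, so a host whose ratio is far from typical stands out even when each raw value looks unremarkable on its own.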

Feature set 3a: content-based features

The following file includes content-based features for hosts:

These features include number of words in the home page, average word length, average length of the title, etc. for a sample of pages on each host. See here for details. All these features are extracted from the summary version of the contents of the pages (.warc files).
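A minimal sketch of features of this kind is shown below. It makes several assumptions that are not taken from the feature description: pages are given as hypothetical (title, body) pairs with the home page first, words are split on whitespace, and title length is counted in words. The function name and output keys are also invented for this example.

```python
def host_content_features(pages):
    """Compute simple content-based features for one host.

    pages is a hypothetical input: a non-empty list of (title, body)
    string pairs sampled from the host, with the home page first.
    """
    if not pages:
        raise ValueError("at least one page is required")

    _, home_body = pages[0]
    # Whitespace tokenization is an assumption made for this sketch.
    all_words = [word for _, body in pages for word in body.split()]
    avg_word_length = (
        sum(len(word) for word in all_words) / len(all_words) if all_words else 0.0
    )
    avg_title_length = sum(len(title.split()) for title, _ in pages) / len(pages)

    return {
        "home_num_words": len(home_body.split()),
        "avg_word_length": avg_word_length,
        "avg_title_length": avg_title_length,
    }


# Made-up sample of two pages from one host.
sample_pages = [
    ("Cheap pills", "buy cheap pills now"),
    ("About", "about us"),
]
content = host_content_features(sample_pages)
```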

Feature set 3b: text-based features

Text-based features, extracted from the contents of the pages, are available here.

Feature set 4: neighborhood-based features

These features are derived from the predictions obtained with the feature sets 1, 2b and 3a, using this procedure:


We will be posting more feature sets here. If you want to be notified, subscribe to our mailing list.