Main.PhaseIFeatures History


October 20, 2008, at 04:11 AM by 213.27.241.137 -
Changed lines 7-10 from:
  • uk-2006-05.obvious_features.csv.gz (139 KB)

The file includes two direct, obvious features that are not link-based: the number of pages in the host and the number of characters in the host name. The number of pages per host is extracted from the graph files.

to:
  • uk-2006-05.obvious_features.csv.gz (139 KB)

The file includes two direct, obvious features that are not link-based: the number of pages in the host and the number of characters in the host name. The number of pages per host is extracted from the graph files.

Changed lines 15-22 from:
  • uk-2006-05.link_based_features.csv.gz (1.9 MB)

This includes link-based features such as in-degree, out-degree, PageRank, edge reciprocity, assortativity coefficient, TrustRank, Truncated PageRank, estimation of supporters, etc. See link-based features description. All these features are extracted from the graph files.

The url-ids of the home page and of the page with the maximum PageRank of each host are also available.

Please read the update done on September 15th, 2007 on the file with the link-based features description.

to:
  • uk-2006-05.link_based_features.csv.gz (1.9 MB)

This includes link-based features such as in-degree, out-degree, PageRank, edge reciprocity, assortativity coefficient, TrustRank, Truncated PageRank, estimation of supporters, etc. See link-based features description. All these features are extracted from the graph files.

The url-ids of the home page and of the page with the maximum PageRank of each host are also available.

Please read the update done on September 15th, 2007 on the file with the link-based features description.

Changed lines 27-30 from:
  • uk-2006-05.link_based_features_transformed.csv.gz (8.3 MB)

These transformations were found to work better for classification in practice than the raw link-based features. They are mostly ratios between features, such as Indegree/PageRank or TrustRank/PageRank, and log(.) of several features. See entire list. All these features are extracted from the graph files, and can be derived from the link-based features above.

to:
  • uk-2006-05.link_based_features_transformed.csv.gz (8.3 MB)

These transformations were found to work better for classification in practice than the raw link-based features. They are mostly ratios between features, such as Indegree/PageRank or TrustRank/PageRank, and log(.) of several features. See entire list. All these features are extracted from the graph files, and can be derived from the link-based features above.

Changed lines 35-38 from:
  • uk-2006-05.content_based_features.csv.gz (4.4 MB)

These features include number of words in the home page, average word length, average length of the title, etc. for a sample of pages on each host. See here for details. All these features are extracted from the summary version of the contents of the pages (.warc files).

to:
  • uk-2006-05.content_based_features.csv.gz (4.4 MB)

These features include number of words in the home page, average word length, average length of the title, etc. for a sample of pages on each host. See here for details. All these features are extracted from the summary version of the contents of the pages (.warc files).

Changed lines 41-42 from:

Text-based features, extracted from the contents of the pages

to:

Text-based features, extracted from the contents of the pages

Changed lines 47-50 from:

These features are derived from the predictions obtained with the feature sets 1, 2b and 3a, using this procedure:

  • uk-2006-05.stacked_graphical_learning.csv.gz (101 KB)
to:

These features are derived from the predictions obtained with the feature sets 1, 2b and 3a, using this procedure:

  • uk-2006-05.stacked_graphical_learning.csv.gz (101 KB)
December 13, 2007, at 06:29 AM by 84.88.76.49 -
Changed lines 3-6 from:

To reduce the amount of work due to data processing and to encourage participation, we will also provide a set of features extracted from the contents and links in the collection, which may be used by the participant teams in addition to any automatic technique they choose to use.

We will be posting content-based and link-based feature vectors in early January 2007.

to:

To reduce the amount of work due to data processing, we provide the following feature vectors for WEBSPAM-UK2006:

September 21, 2007, at 11:11 AM by 216.145.54.7 -
Changed lines 19-20 from:

This includes link-based features such as in-degree, out-degree, PageRank, edge reciprocity, assortativity coefficient, TrustRank, Truncated PageRank, estimation of supporters, etc. See entire list. All these features are extracted from the graph files.

to:

This includes link-based features such as in-degree, out-degree, PageRank, edge reciprocity, assortativity coefficient, TrustRank, Truncated PageRank, estimation of supporters, etc. See link-based features description. All these features are extracted from the graph files.

Added lines 23-24:

Please read the update done on September 15th, 2007 on the file with the link-based features description.

May 31, 2007, at 09:41 AM by ChaTo -
Added lines 1-52:

Pre-computed feature vectors

To reduce the amount of work due to data processing and to encourage participation, we will also provide a set of features extracted from the contents and links in the collection, which may be used by the participant teams in addition to any automatic technique they choose to use.

We will be posting content-based and link-based feature vectors in early January 2007.

Feature set 1: direct features

  • uk-2006-05.obvious_features.csv.gz (139 KB)

The file includes two direct, obvious features that are not link-based: the number of pages in the host and the number of characters in the host name. The number of pages per host is extracted from the graph files.
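As a rough illustration of what these two features measure, the sketch below counts pages per host and measures host name length from a plain list of page URLs. This is only an assumption-laden example: the actual file was derived from the graph files, and the function name `obvious_features` and the output keys are hypothetical, not the column names of the distributed CSV.

```python
from collections import Counter
from urllib.parse import urlsplit


def obvious_features(page_urls):
    """Return {host: {"number_of_pages": ..., "hostname_length": ...}}.

    Hypothetical helper: the input is a plain iterable of page URLs,
    not the graph files used to build the real feature file.
    """
    # Count how many pages belong to each host.
    pages_per_host = Counter(urlsplit(url).hostname for url in page_urls)
    return {
        host: {
            "number_of_pages": count,
            # Number of characters in the host name.
            "hostname_length": len(host),
        }
        for host, count in pages_per_host.items()
    }


urls = [
    "http://www.example.co.uk/",
    "http://www.example.co.uk/about.html",
    "http://shop.example.co.uk/",
]
features = obvious_features(urls)
```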

Feature set 2a: link-based features

The following file includes link-based features for the hosts, measured in both the home page and the page with the maximum PageRank in each host.

  • uk-2006-05.link_based_features.csv.gz (1.9 MB)

This includes link-based features such as in-degree, out-degree, PageRank, edge reciprocity, assortativity coefficient, TrustRank, Truncated PageRank, estimation of supporters, etc. See entire list. All these features are extracted from the graph files.

The url-ids of the home page and of the page with the maximum PageRank of each host are also available.

Feature set 2b: transformed link-based features

The following file includes simple numeric transformations and combinations of the link-based features for the hosts:

  • uk-2006-05.link_based_features_transformed.csv.gz (8.3 MB)

These transformations were found to work better for classification in practice than the raw link-based features. They are mostly ratios between features, such as Indegree/PageRank or TrustRank/PageRank, and log(.) of several features. See entire list. All these features are extracted from the graph files, and can be derived from the link-based features above.
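The kind of derivation described above can be sketched as follows. This is a minimal example, not the procedure used to build the file: the dictionary keys (`indegree`, `pagerank`, `trustrank`), the output names, and the choice to return `None` for a zero denominator or a non-positive log argument are all assumptions made for illustration.

```python
import math


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator != 0 else None


def safe_log(value):
    """Return the natural log of value, or None when value is not positive."""
    return math.log(value) if value > 0 else None


def transform_link_features(raw):
    """Derive ratio and log features from one host's raw link-based features.

    The key names are hypothetical; check the feature description for the
    real column names before adapting this.
    """
    return {
        "indegree_over_pagerank": safe_ratio(raw["indegree"], raw["pagerank"]),
        "trustrank_over_pagerank": safe_ratio(raw["trustrank"], raw["pagerank"]),
        "log_indegree": safe_log(raw["indegree"]),
        "log_pagerank": safe_log(raw["pagerank"]),
    }


example = transform_link_features(
    {"indegree": 10, "pagerank": 0.5, "trustrank": 0.25}
)
```

Returning `None` rather than raising keeps hosts with missing or zero values in the data set, leaving the decision on how to handle them to the classifier pipeline.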

Feature set 3a: content-based features

The following file includes content-based features for hosts:

  • uk-2006-05.content_based_features.csv.gz (4.4 MB)

These features include number of words in the home page, average word length, average length of the title, etc. for a sample of pages on each host. See here for details. All these features are extracted from the summary version of the contents of the pages (.warc files).
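The feature files listed on this page are gzip-compressed CSV files, so they can be read without unpacking them first. The sketch below uses only the Python standard library and assumes that the first line of the file is a header row; that assumption and the function name `read_feature_rows` are illustrative and should be checked against the actual file.

```python
import csv
import gzip


def read_feature_rows(path):
    """Yield one dict per row of a gzip-compressed CSV file.

    Assumes the first line is a header; all values are returned as strings.
    """
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)
```

For example, `next(read_feature_rows("uk-2006-05.content_based_features.csv.gz"))` would return the first host's row as a dictionary keyed by column name, provided the file has been downloaded to the working directory.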

Feature set 3b: text-based features

Text-based features, extracted from the contents of the pages

Text-based features are available here

Feature set 4: neighborhood-based features

These features are derived from the predictions obtained with the feature sets 1, 2b and 3a, using this procedure:

  • uk-2006-05.stacked_graphical_learning.csv.gz (101 KB)

We will be posting more feature sets here. If you want to be notified, subscribe to our mailing list.