From Web Spam Challenge

Main: PhaseITextFeatures

Text-based features

The text-based features were computed from the HTML source of the downloaded Web pages.

Host-level text features

The host-level features describe the content of each host. They were computed from (a subset of) the Web pages on each host.

We provide feature sets of different sizes (small, medium and large).

Depending on the feature set, not every host has a feature vector.

Host-level text features (sample of up to 10 documents per host)

Updated the 25th of January, 2007

These features have been obtained by sampling 45,000 documents at random, including in the sample at most 10 documents per host.
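
This page does not say how the sampling was implemented. The sketch below is only one plausible reading of "at random, with at most 10 documents per host": shuffle the documents, then accept them in order while enforcing the per-host cap. The function name `sample_documents`, the `(host, doc_id)` tuple layout and the `seed` parameter are assumptions made for illustration.

```python
import random
from collections import Counter


def sample_documents(documents, total, per_host_cap, seed=0):
    """Return up to `total` documents, at most `per_host_cap` per host.

    `documents` is a list of (host, doc_id) tuples (a hypothetical layout).
    """
    rng = random.Random(seed)
    shuffled = list(documents)
    rng.shuffle(shuffled)          # random order over all documents

    taken = Counter()              # documents accepted so far, per host
    sample = []
    for host, doc_id in shuffled:
        if len(sample) >= total:
            break                  # global budget reached
        if taken[host] < per_host_cap:
            taken[host] += 1
            sample.append((host, doc_id))
    return sample


# Toy input: one large host and one small host.
docs = [("a.example.co.uk", i) for i in range(50)]
docs += [("b.example.co.uk", i) for i in range(3)]
picked = sample_documents(docs, total=20, per_host_cap=10)
```

With this scheme a host with fewer documents than the cap contributes all of them, which is consistent with the phrase "up to 10 documents per host".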

Each of the following files is a comma-separated-value (.csv) plain text file, where each line corresponds to one host. The first two columns are the host-id and the hostname.

The remaining columns are the components of the feature vector, in the format number_of_the_feature:value_of_the_feature. Here, each feature corresponds to one word of the vocabulary.
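
As an illustration of the line format described above, here is a minimal parsing sketch. It assumes that feature numbers are integers, that values are numeric, and that hostnames contain no commas; the function name `parse_host_line` and the example line are made up for this sketch and are not taken from the actual files.

```python
def parse_host_line(line):
    """Split one line into (host_id, hostname, {feature_number: value})."""
    fields = line.rstrip("\n").split(",")
    host_id, hostname = fields[0], fields[1]

    features = {}
    for field in fields[2:]:
        if not field:
            continue                      # tolerate a trailing comma
        number, value = field.split(":", 1)
        features[int(number)] = float(value)
    return host_id, hostname, features


# Hypothetical line: host-id, hostname, then number:value pairs.
example = "42,www.example.co.uk,3:0.5,17:2.0"
host_id, hostname, features = parse_host_line(example)
```

Only the features listed on a line are stored, so features that do not appear in a line are simply absent from the dictionary.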

Host-level text features (sample of up to 100 documents per host)

Updated the 2nd of February, 2007

These features have been obtained by sampling 350,000 documents at random, including in the sample at most 100 documents per host.

Page-level text features

Page-level text features (sample of up to 100 documents per host)

Updated the 15th of March, 2007

These features have been obtained by sampling 338,107 documents at random, including in the sample at most 100 documents per host.

Each of the following files is a comma-separated-value (.csv) plain text file, where each line corresponds to one web page. The first two columns are the url-id and the url name.

The remaining columns are the components of the feature vector, in the format number_of_the_feature:value_of_the_feature. Here, each feature corresponds to one word of the vocabulary.
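
The page-level files can be large, so it may be convenient to read them one line at a time. The following sketch streams records with Python's standard `csv` module. It assumes that any URL containing a comma is quoted in the file, which this page does not confirm; `iter_page_vectors` and the sample line are hypothetical.

```python
import csv
import io


def iter_page_vectors(lines):
    """Yield (url_id, url, {feature_number: value}) for each line."""
    for row in csv.reader(lines):
        if len(row) < 2:
            continue                      # skip blank or malformed lines
        url_id, url = row[0], row[1]
        vector = {}
        for cell in row[2:]:
            if not cell:
                continue
            # Split on the last colon: number on the left, value on the right.
            number, _, value = cell.rpartition(":")
            vector[int(number)] = float(value)
        yield url_id, url, vector


# Hypothetical content standing in for one of the page-level files.
sample = io.StringIO("7,http://www.example.co.uk/index.html,1:3.0,8:1.0\n")
pages = list(iter_page_vectors(sample))
```

For a real file, an open file object can be passed in place of the `io.StringIO` stand-in.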

Contact

If you have any problem with the text-based features, please contact:

  ludovic dot denoyer at lip6 dot fr
Retrieved from http://webspam.lip6.fr/wiki/pmwiki.php?n=Main.PhaseITextFeatures
Page last modified on May 31, 2007, at 09:42 AM