Text-based features

The text-based features were computed from the HTML source of the downloaded Web pages.

Host-level text features

The host-level features describe the content of each host. They were computed from (a subset of) the Web pages on that host.

We provide several feature sets (small, medium and large).

Depending on the feature set, not every host has a feature vector.

Host-level text features (sample of up to 10 documents per host)

Updated the 25th of January, 2007

These features have been obtained by sampling 45,000 documents at random, including in the sample at most 10 documents per host.

Each of the following files is a comma-separated-value (.csv) plain text file, where each line corresponds to one host. The first two columns are the host-id and the hostname.

The remaining columns are the components of the feature vector, in the format number_of_the_feature:value_of_the_feature. Here, each feature corresponds to one word of the vocabulary.

  • Hosts Index: a file containing the host index. Format: hostid,hostname pairs.
  • Word Index: the vocabulary used to compute the features. Format: wordid,word pairs. 1,251,994 words.
  • Frequential vectors
    • Frequential Vectors for Title: term-frequency vectors for each host (the sum of the frequency vectors of the host's Web pages), restricted to the titles of the Web pages
    • Frequential Vectors for Links: term-frequency vectors for each host (the sum of the frequency vectors of the host's Web pages), restricted to the hyperlinks of the Web pages
    • Frequential Vectors for Body: term-frequency vectors for each host (the sum of the frequency vectors of the host's Web pages), restricted to the body of the Web pages
  • TF-IDF vectors (Term frequency, inverse host frequency)
    • The number of hosts in which a word appears: the host frequency (like document frequency, but counted over hosts; 0 if the word is not listed in the file)
    • TFIDF Vectors for Title: TF-IDF vectors for each host, restricted to the titles of the Web pages
    • TFIDF Vectors for Links: TF-IDF vectors for each host, restricted to the hyperlinks of the Web pages
    • TFIDF Vectors for Body: TF-IDF vectors for each host, restricted to the body of the Web pages
  • TF-IDF normalized (TF-IDF vector with normalization)
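
As an illustration of the line format used by these files, the following Python sketch parses one line into its identifier, its name and a sparse feature vector. The function name and the example line are hypothetical (the line is made up, not taken from the data set), and the sketch assumes that hostnames contain no commas and that feature numbers are integers.

```python
def parse_feature_line(line):
    """Split one line into (id, name, {feature_number: value}).

    Assumes the layout id,name,number:value,number:value,...
    """
    fields = line.rstrip("\n").split(",")
    item_id, name = fields[0], fields[1]
    features = {}
    for field in fields[2:]:
        if not field:
            continue  # tolerate an empty field, e.g. a trailing comma
        number, value = field.split(":", 1)
        features[int(number)] = float(value)
    return item_id, name, features


# Made-up example line, for illustration only
example = "42,www.example.co.uk,7:3,1024:1,98765:12"
host_id, hostname, vector = parse_feature_line(example)
print(host_id, hostname, vector)
```

Storing the vector as a dictionary keeps only the words that actually occur, which matters with a vocabulary of more than a million words.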

Host-level text features (sample of up to 100 documents per host)

Updated the 2nd of February, 2007

These features have been obtained by sampling 350,000 documents at random, including in the sample at most 100 documents per host.

  • Word Index: the vocabulary used to compute the features. Format: wordid,word pairs. 4,923,973 words.
  • Frequential vectors
    • Frequential Vectors for Title: term-frequency vectors for each host (the sum of the frequency vectors of the host's Web pages), restricted to the titles of the Web pages
    • Frequential Vectors for Links: term-frequency vectors for each host (the sum of the frequency vectors of the host's Web pages), restricted to the hyperlinks of the Web pages
    • Frequential Vectors for Body: term-frequency vectors for each host (the sum of the frequency vectors of the host's Web pages), restricted to the body of the Web pages
  • TF-IDF vectors (Term frequency, inverse host frequency)
  • TF-IDF normalized (TF-IDF vector with normalization)
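
The exact weighting and normalization used for the TF-IDF files are not stated here. The Python sketch below shows one common way to combine a frequency vector with host frequencies, assuming the weight tf * log(N / hf), where N is the number of hosts and hf the host frequency, followed by L2 normalization. Both the formula and the function names are assumptions made for illustration, not a description of how the files were produced.

```python
import math


def tf_ihf(freq_vector, host_freq, n_hosts):
    """Weight term frequencies by inverse host frequency.

    Assumed formula: tf * log(n_hosts / hf). Words whose host
    frequency is 0 or missing are skipped.
    """
    weights = {}
    for word_id, tf in freq_vector.items():
        hf = host_freq.get(word_id, 0)
        if hf > 0:
            weights[word_id] = tf * math.log(n_hosts / hf)
    return weights


def l2_normalize(vector):
    """Scale a sparse vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in vector.values()))
    if norm == 0:
        return dict(vector)
    return {k: v / norm for k, v in vector.items()}


# Made-up numbers, for illustration only
weights = tf_ihf({1: 2.0, 2: 1.0, 3: 5.0}, {1: 10, 2: 100}, n_hosts=100)
print(l2_normalize(weights))
```

In this sketch a word that appears on every host receives weight 0, and a word without a host frequency is dropped from the vector.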

Page-level text features

Page-level text features (sample of up to 100 documents per host)

Updated the 15th of March, 2007

These features have been obtained by sampling 338,107 documents at random, including in the sample at most 100 documents per host.

Each of the following files is a comma-separated-value (.csv) plain text file, where each line corresponds to one web page. The first two columns are the url-id and the url name.

The remaining columns are the components of the feature vector, in the format number_of_the_feature:value_of_the_feature. Here, each feature corresponds to one word of the vocabulary.
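
The host-level frequency vectors are described as the sum of the frequency vectors of each host's Web pages. The Python sketch below shows that aggregation on page-level vectors that have already been parsed into dictionaries. The function name and the example URLs are hypothetical, and the sketch assumes that each URL includes its scheme (such as http://) so that the standard library can extract the hostname.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def sum_by_host(pages):
    """Sum sparse page vectors per host.

    pages: iterable of (url, {feature_number: value}) pairs.
    Returns {hostname: {feature_number: summed_value}}.
    """
    hosts = defaultdict(lambda: defaultdict(float))
    for url, vector in pages:
        hostname = urlsplit(url).hostname  # None if the URL has no scheme
        for number, value in vector.items():
            hosts[hostname][number] += value
    return {host: dict(vec) for host, vec in hosts.items()}


# Made-up pages, for illustration only
pages = [
    ("http://a.example.co.uk/x.html", {1: 2.0, 5: 1.0}),
    ("http://a.example.co.uk/y.html", {1: 1.0}),
    ("http://b.example.co.uk/", {2: 4.0}),
]
print(sum_by_host(pages))
```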

Contact

In case of any problem concerning the text-based features, please contact:

  ludovic dot denoyer at lip6 dot fr