

# Text-based features

The text-based features were computed from the HTML source of the downloaded Web pages.

## Host-level text features

The host features describe the content of each host. They were computed from (a subset of) the Web pages on each host.

We provide feature sets of different sizes (small, medium and large).

Depending on the feature set, not every host has a feature vector.

### Host-level text features (sample of up to 10 documents per host)

**Updated the 25th of January, 2007**

These features have been obtained by sampling 45,000 documents at random, including in the sample at most 10 documents per host.

Each of the following files is a comma-separated-value (.csv) plain text file, where each line corresponds to one host. The first two columns are the host-id and the hostname.

The remaining columns are the components of the feature vector, in the format `number_of_the_feature:value_of_the_feature`. Here, each feature corresponds to one word from the vocabulary.
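As an illustration of the line format described above, here is a minimal parsing sketch. The function name `parse_feature_line` and the sample line are hypothetical. The sketch assumes that the id is an integer, that the name in the second column contains no comma, and that every feature value is numeric; none of this is guaranteed by the files themselves.

```python
from typing import Dict, Tuple


def parse_feature_line(line: str) -> Tuple[int, str, Dict[int, float]]:
    """Parse one line of the form: id,name,feature:value,feature:value,...

    Assumption: the name column contains no comma.
    """
    fields = line.rstrip("\n").split(",")
    record_id = int(fields[0])
    name = fields[1]
    features: Dict[int, float] = {}
    for field in fields[2:]:
        if not field:
            continue  # tolerate a trailing comma or an empty column
        feature_id, value = field.split(":", 1)
        features[int(feature_id)] = float(value)
    return record_id, name, features


# Hypothetical sample line, not taken from the real files.
sample = "42,www.example.org,3:2,17:1,250:4.5"
host_id, hostname, vector = parse_feature_line(sample)
print(host_id, hostname, vector)
```

Storing each vector as a dictionary keeps it sparse, which matters with a vocabulary of more than a million words.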

- Hosts Index: a file containing the host index. Format: hostid,hostname pairs.
- Word Index: the vocabulary used to compute the features. Format: wordid,word pairs. **1,251,994** words.
- Frequential vectors
  - Frequential Vectors for Title: frequency vectors computed for each host (the sum of the frequency vectors of the host's Web pages), restricted to the titles of the Web pages
  - Frequential Vectors for Links: frequency vectors computed for each host (the sum of the frequency vectors of the host's Web pages), restricted to the hyperlinks of the Web pages
  - Frequential Vectors for Body: frequency vectors computed for each host (the sum of the frequency vectors of the host's Web pages), restricted to the body of the Web pages
- TF-IDF vectors (term frequency, inverse host frequency)
  - The number of hosts where a word appears: the *host* frequency (like DF, but computed over hosts; 0 if the word is not referenced in the file)
  - TFIDF Vectors for Title: tf-idf vectors computed for each host, restricted to the titles of the Web pages
  - TFIDF Vectors for Links: tf-idf vectors computed for each host, restricted to the links of the Web pages
  - TFIDF Vectors for Body: tf-idf vectors computed for each host, restricted to the body of the Web pages
- TF-IDF normalized (TF-IDF vectors with normalization)
  - TFIDF Normalized Vectors for Title
  - TFIDF Normalized Vectors for Links
  - TFIDF Normalized Vectors for Body
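The following sketch shows how host-level frequency vectors and TF-IDF weights of this kind could be derived from page-level vectors. It is only an illustration: the page does not state the exact weighting formula, so the common `tf * log(N / hf)` form is an assumption, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def host_frequency_vector(page_vectors: Iterable[Dict[int, float]]) -> Dict[int, float]:
    """Sum the frequency vectors of the pages belonging to one host."""
    total: Counter = Counter()
    for vec in page_vectors:
        total.update(vec)  # adds the counts feature by feature
    return dict(total)


def host_frequencies(host_vectors: Dict[str, Dict[int, float]]) -> Dict[int, int]:
    """Count, for each word id, the number of hosts in which it appears."""
    hf: Counter = Counter()
    for vec in host_vectors.values():
        hf.update(vec.keys())
    return dict(hf)


def tfidf(vec: Dict[int, float], hf: Dict[int, int], n_hosts: int) -> Dict[int, float]:
    """Assumed weighting: tf * log(N / hf). Words with hf == 0 are skipped."""
    return {
        word_id: tf * math.log(n_hosts / hf[word_id])
        for word_id, tf in vec.items()
        if hf.get(word_id, 0) > 0
    }


# Toy data: two hosts, the first one with two sampled pages.
hosts = {
    "a.example": host_frequency_vector([{1: 2, 2: 1}, {1: 1}]),
    "b.example": host_frequency_vector([{2: 3}]),
}
hf = host_frequencies(hosts)
weights = tfidf(hosts["a.example"], hf, n_hosts=len(hosts))
print(hosts["a.example"], hf, weights)
```

A word that appears on every host gets weight zero under this assumed formula, which is the usual motivation for the inverse-frequency factor.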

### Host-level text features (sample of up to 100 documents per host)

**Updated the 2nd of February, 2007**

These features have been obtained by sampling 350,000 documents at random, including in the sample at most 100 documents per host.

- Word Index: the vocabulary used to compute the features. Format: wordid,word pairs. **4,923,973** words.
- Frequential vectors
  - Frequential Vectors for Title: frequency vectors computed for each host (the sum of the frequency vectors of the host's Web pages), restricted to the titles of the Web pages
  - Frequential Vectors for Links: frequency vectors computed for each host (the sum of the frequency vectors of the host's Web pages), restricted to the hyperlinks of the Web pages
  - Frequential Vectors for Body: frequency vectors computed for each host (the sum of the frequency vectors of the host's Web pages), restricted to the body of the Web pages
- TF-IDF vectors (term frequency, inverse host frequency)
  - The number of hosts where a word appears: the *host* frequency (like DF, but computed over hosts)
  - TFIDF Vectors for Title: tf-idf vectors computed for each host, restricted to the titles of the Web pages
  - TFIDF Vectors for Links: tf-idf vectors computed for each host, restricted to the links of the Web pages
  - TFIDF Vectors for Body: tf-idf vectors computed for each host, restricted to the body of the Web pages
- TF-IDF normalized (TF-IDF vectors with normalization)
  - TFIDF Normalized Vectors for Title
  - TFIDF Normalized Vectors for Links
  - TFIDF Normalized Vectors for Body
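The page does not say which normalization was applied to the normalized TF-IDF vectors. As one plausible reading, the sketch below scales a sparse vector to unit Euclidean (L2) length; both the choice of norm and the name `l2_normalize` are assumptions.

```python
import math
from typing import Dict


def l2_normalize(vec: Dict[int, float]) -> Dict[int, float]:
    """Scale a sparse vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(value * value for value in vec.values()))
    if norm == 0.0:
        return dict(vec)  # an all-zero or empty vector is returned unchanged
    return {word_id: value / norm for word_id, value in vec.items()}


unit = l2_normalize({1: 3.0, 2: 4.0})
print(unit)
```

With unit-length vectors, the dot product of two hosts equals their cosine similarity, which makes hosts with very different amounts of text comparable.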

## Page-level text features

### Page-level text features (sample of up to 100 documents per host)

**Updated the 15th of March, 2007**

These features have been obtained by sampling 338,107 documents at random, including in the sample at most 100 documents per host.

Each of the following files is a comma-separated-value (.csv) plain text file, where each line corresponds to **one web page**. The first two columns are the url-id and the url name.

The remaining columns are the components of the feature vector, in the format `number_of_the_feature:value_of_the_feature`. Here, each feature corresponds to one word from the vocabulary.

- Word Index: the vocabulary used to compute the features. Format: wordid,word pairs. **4,923,973** words (same as the medium-size host corpus).
- Frequential vectors (body part of the documents)
- TF-IDF vectors
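To read a feature vector in terms of actual words, the feature numbers can be mapped back through the word index. The sketch below assumes that each line of the word index is a `wordid,word` pair with an integer id and that feature numbers are those word ids; the function names and the toy index are hypothetical.

```python
import io
from typing import Dict, TextIO


def load_word_index(handle: TextIO) -> Dict[int, str]:
    """Read wordid,word pairs into a dictionary keyed by word id."""
    index: Dict[int, str] = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        word_id, word = line.split(",", 1)  # split once, in case the word has a comma
        index[int(word_id)] = word
    return index


def decode_vector(vector: Dict[int, float], index: Dict[int, str]) -> Dict[str, float]:
    """Replace feature numbers by words; unknown ids get a placeholder key."""
    return {index.get(fid, f"<unknown:{fid}>"): value for fid, value in vector.items()}


# Toy word index held in memory instead of a real file.
toy_index = load_word_index(io.StringIO("0,spam\n1,web\n2,page\n"))
decoded = decode_vector({1: 2.0, 2: 1.0, 99: 0.5}, toy_index)
print(decoded)
```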

## Contact

In case of any problem concerning the text-based features, please contact:

ludovic dot denoyer at lip6 dot fr