Main.PhaseIResults History


October 09, 2007, at 11:15 AM by 84.88.76.49 -
Added lines 3-4:

Summary of results for all participants.

June 04, 2007, at 08:02 AM by ChaTo -
Changed lines 8-9 from:

Entry 1; Entry 2; Slides) We describe a Web spam detection algorithm that extends and propagates manual and automatic labels of Web hosts. The manual labels are derived from the training labels provided with the WEBSPAM-UK2006 dataset. The automatic labelling assigned a spam label to hosts with a low variance in the out-degree of in-neighbours and to hosts with significant overlap between their in-links and out-links. The score extension and propagation were applied to the directed host graph.

to:

Entry 1; Entry 2; Slides) We describe a Web spam detection algorithm that extends and propagates manual and automatic labels of Web hosts. The manual labels are derived from the training labels provided with the WEBSPAM-UK2006 dataset. The automatic labelling assigned a spam label to hosts with a low variance in the out-degree of in-neighbours and to hosts with significant overlap between their in-links and out-links. The score extension and propagation were applied to the directed host graph.

June 04, 2007, at 07:56 AM by ChaTo -
Changed lines 13-14 from:

Entry; Slides) We use the commercial intent and graph similarity features of our AIRWeb 2007 and 2006 publications, respectively, in addition to the features of Castillo et al., improving their classification accuracy by 3%. We use stacked graphical learning over the Weka C4.5 classifier.

to:

Entry; Slides) We use the commercial intent and graph similarity features of our AIRWeb 2007 and 2006 publications, respectively, in addition to the features of Castillo et al., improving their classification accuracy by 3%. We use stacked graphical learning over the Weka C4.5 classifier.

June 04, 2007, at 07:50 AM by ChaTo -
Changed line 21 from:

Microsoft Research, Microsoft Search Labs\\

to:

Microsoft Research; Microsoft Search Labs\\

June 04, 2007, at 07:49 AM by ChaTo -
Deleted lines 4-23:

Guanggang Geng, Chunheng Wang, Xiaobo Jin, Qiudan Li and Lei Xu
Institute of Automation, Chinese Academy of Sciences, Beijing
(Summary; Entry 1; Entry 2) Based on the fact that reputable hosts are easier to obtain on the Web than spam ones, an ensemble under-sampling classification strategy is proposed that exploits the information contained in the large number of reputable websites to full advantage. Content-based, transformed link-based, and HostRank-related features are taken into account.

Pascal Filoche, Tanguy Urvoy, Emmanuel Chauveau and Thomas Lavergne
France Telecom; ENST
(Summary; Entry 1; Entry 2; Slides) In our article at the AIRWeb 2006 workshop on hidden style similarity, we combined the approaches of HTML noise preprocessing (removing content), minhash fingerprinting and similarity clustering to spot dubious sets of web pages. For this challenge the idea is the same, but we study more preprocessing and clustering strategies, which we use to smooth the predictions of a classifier. We test two learning methodologies and submit two predictions.

Gordon Cormack
University of Waterloo
(Summary; Entry; Slides) Our 2007 Web Spam Challenge submission used an ensemble of ten content-based classifiers stacked using logistic regression. Each classifier used one of two state-of-the-art email filters -- DMC (Bratko et al. 2006) or OSBF-Lua (Assis 2006) -- applied to simple text files, with each text file acting as a proxy for a host to be classified. All text files were derived from the home page (including HTTP and redirection logs), the host name, or the host names associated with incoming or outgoing links. Except for the host names of these immediate neighbours, no information about the topology of the corpus was used.

András A. Benczúr, István Bíró, Károly Csalogány, Miklós Kurucz and Tamás Sarlós
Hungarian Academy of Sciences
(Summary; Entry; Slides) We use the commercial intent and graph similarity features of our AIRWeb 2007 and 2006 publications, respectively, in addition to the features of Castillo et al., improving their classification accuracy by 3%. We use stacked graphical learning over the Weka C4.5 classifier.

Added lines 10-19:

András A. Benczúr, István Bíró, Károly Csalogány, Miklós Kurucz and Tamás Sarlós
Hungarian Academy of Sciences
(Summary; Entry; Slides) We use the commercial intent and graph similarity features of our AIRWeb 2007 and 2006 publications, respectively, in addition to the features of Castillo et al., improving their classification accuracy by 3%. We use stacked graphical learning over the Weka C4.5 classifier.

Gordon Cormack
University of Waterloo
(Summary; Entry; Slides) Our 2007 Web Spam Challenge submission used an ensemble of ten content-based classifiers stacked using logistic regression. Each classifier used one of two state-of-the-art email filters -- DMC (Bratko et al. 2006) or OSBF-Lua (Assis 2006) -- applied to simple text files, with each text file acting as a proxy for a host to be classified. All text files were derived from the home page (including HTTP and redirection logs), the host name, or the host names associated with incoming or outgoing links. Except for the host names of these immediate neighbours, no information about the topology of the corpus was used.

Added lines 25-34:

Pascal Filoche, Tanguy Urvoy, Emmanuel Chauveau and Thomas Lavergne
France Telecom; ENST
(Summary; Entry 1; Entry 2; Slides) In our article at the AIRWeb 2006 workshop on hidden style similarity, we combined the approaches of HTML noise preprocessing (removing content), minhash fingerprinting and similarity clustering to spot dubious sets of web pages. For this challenge the idea is the same, but we study more preprocessing and clustering strategies, which we use to smooth the predictions of a classifier. We test two learning methodologies and submit two predictions.

Guanggang Geng, Chunheng Wang, Xiaobo Jin, Qiudan Li and Lei Xu
Institute of Automation, Chinese Academy of Sciences, Beijing
(Summary; Entry 1; Entry 2) Based on the fact that reputable hosts are easier to obtain on the Web than spam ones, an ensemble under-sampling classification strategy is proposed that exploits the information contained in the large number of reputable websites to full advantage. Content-based, transformed link-based, and HostRank-related features are taken into account.

June 04, 2007, at 07:48 AM by ChaTo -
Changed line 11 from:

France Telecom and ENST\\

to:

France Telecom; ENST\\

June 04, 2007, at 07:45 AM by ChaTo -
Changed lines 23-24 from:

Entry 1; Entry 2; Slides) We use the commercial intent and graph similarity features of our AIRWeb 2007 and 2006 publications, respectively, in addition to the features of Castillo et al., improving their classification accuracy by 3%. We use stacked graphical learning over the Weka C4.5 classifier.

to:

Entry; Slides) We use the commercial intent and graph similarity features of our AIRWeb 2007 and 2006 publications, respectively, in addition to the features of Castillo et al., improving their classification accuracy by 3%. We use stacked graphical learning over the Weka C4.5 classifier.

June 04, 2007, at 07:44 AM by ChaTo -
Changed lines 6-7 from:

Institute of Automation, Chinese Academy of Sciences, Beijing

to:

Institute of Automation, Chinese Academy of Sciences, Beijing\\

Changed lines 11-12 from:

France Telecom and ENST

to:

France Telecom and ENST\\

Changed lines 16-17 from:

University of Waterloo

to:

University of Waterloo\\

Changed lines 21-22 from:

Hungarian Academy of Sciences

to:

Hungarian Academy of Sciences\\

Changed lines 26-27 from:

Genie Knows

to:

Genie Knows\\

Changed lines 31-32 from:

Microsoft Research, Microsoft Search Labs

to:

Microsoft Research, Microsoft Search Labs\\

Changed lines 33-34 from:

Entry 1; Entry 2; Slides) This paper describes our contribution to the 2007 Web Spam Challenge. We computed some additional features from the data provided with the UK 2006-05 dataset, and other features from external data sources.

to:

Entry; Slides) This paper describes our contribution to the 2007 Web Spam Challenge. We computed some additional features from the data provided with the UK 2006-05 dataset, and other features from external data sources.

June 04, 2007, at 07:43 AM by ChaTo -
Changed lines 8-9 from:

Based on the fact that reputable hosts are easier to obtain on the Web than spam ones, an ensemble under-sampling classification strategy is proposed that exploits the information contained in the large number of reputable websites to full advantage. Content-based, transformed link-based, and HostRank-related features are taken into account.

to:

(Summary; Entry 1; Entry 2) Based on the fact that reputable hosts are easier to obtain on the Web than spam ones, an ensemble under-sampling classification strategy is proposed that exploits the information contained in the large number of reputable websites to full advantage. Content-based, transformed link-based, and HostRank-related features are taken into account.

Changed lines 14-15 from:

(Slides) In our article at the AIRWeb 2006 workshop on hidden style similarity, we combined the approaches of HTML noise preprocessing (removing content), minhash fingerprinting and similarity clustering to spot dubious sets of web pages. For this challenge the idea is the same, but we study more preprocessing and clustering strategies, which we use to smooth the predictions of a classifier. We test two learning methodologies and submit two predictions.

to:

(Summary; Entry 1; Entry 2; Slides) In our article at the AIRWeb 2006 workshop on hidden style similarity, we combined the approaches of HTML noise preprocessing (removing content), minhash fingerprinting and similarity clustering to spot dubious sets of web pages. For this challenge the idea is the same, but we study more preprocessing and clustering strategies, which we use to smooth the predictions of a classifier. We test two learning methodologies and submit two predictions.

Changed lines 20-21 from:

(Slides) Our 2007 Web Spam Challenge submission used an ensemble of ten content-based classifiers stacked using logistic regression. Each classifier used one of two state-of-the-art email filters -- DMC (Bratko et al. 2006) or OSBF-Lua (Assis 2006) -- applied to simple text files, with each text file acting as a proxy for a host to be classified. All text files were derived from the home page (including HTTP and redirection logs), the host name, or the host names associated with incoming or outgoing links. Except for the host names of these immediate neighbours, no information about the topology of the corpus was used.

to:

(Summary; Entry; Slides) Our 2007 Web Spam Challenge submission used an ensemble of ten content-based classifiers stacked using logistic regression. Each classifier used one of two state-of-the-art email filters -- DMC (Bratko et al. 2006) or OSBF-Lua (Assis 2006) -- applied to simple text files, with each text file acting as a proxy for a host to be classified. All text files were derived from the home page (including HTTP and redirection logs), the host name, or the host names associated with incoming or outgoing links. Except for the host names of these immediate neighbours, no information about the topology of the corpus was used.

Changed lines 26-27 from:

(Slides) We use the commercial intent and graph similarity features of our AIRWeb 2007 and 2006 publications, respectively, in addition to the features of Castillo et al., improving their classification accuracy by 3%. We use stacked graphical learning over the Weka C4.5 classifier.

to:

(Summary; Entry 1; Entry 2; Slides) We use the commercial intent and graph similarity features of our AIRWeb 2007 and 2006 publications, respectively, in addition to the features of Castillo et al., improving their classification accuracy by 3%. We use stacked graphical learning over the Weka C4.5 classifier.

Changed lines 32-33 from:

(Slides) We describe a Web spam detection algorithm that extends and propagates manual and automatic labels of Web hosts. The manual labels are derived from the training labels provided with the WEBSPAM-UK2006 dataset. The automatic labelling assigned a spam label to hosts with a low variance in the out-degree of in-neighbours and to hosts with significant overlap between their in-links and out-links. The score extension and propagation were applied to the directed host graph.

to:

(Summary; Entry 1; Entry 2; Slides) We describe a Web spam detection algorithm that extends and propagates manual and automatic labels of Web hosts. The manual labels are derived from the training labels provided with the WEBSPAM-UK2006 dataset. The automatic labelling assigned a spam label to hosts with a low variance in the out-degree of in-neighbours and to hosts with significant overlap between their in-links and out-links. The score extension and propagation were applied to the directed host graph.

Changed lines 38-39 from:

(Slides) This paper describes our contribution to the 2007 Web Spam Challenge. We computed some additional features from the data provided with the UK 2006-05 dataset, and other features from external data sources.

to:

(Summary; Entry 1; Entry 2; Slides) This paper describes our contribution to the 2007 Web Spam Challenge. We computed some additional features from the data provided with the UK 2006-05 dataset, and other features from external data sources.

May 31, 2007, at 09:45 AM by ChaTo -
Changed lines 3-4 from:

The Web Spam Challenge had entries from the following teams:

to:

The Web Spam Challenge Track I received nine entries from six teams:

May 31, 2007, at 09:39 AM by ChaTo -
Added lines 1-37:

Participants of Track I

The Web Spam Challenge had entries from the following teams:

Guanggang Geng, Chunheng Wang, Xiaobo Jin, Qiudan Li and Lei Xu
Institute of Automation, Chinese Academy of Sciences, Beijing

Based on the fact that reputable hosts are easier to obtain on the Web than spam ones, an ensemble under-sampling classification strategy is proposed that exploits the information contained in the large number of reputable websites to full advantage. Content-based, transformed link-based, and HostRank-related features are taken into account.
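
A minimal Python sketch of the ensemble under-sampling idea: the majority (reputable) class is repeatedly subsampled to the size of the spam class, one classifier is trained per balanced subsample, and their spam probabilities are averaged. The feature matrix X, labels y, and the use of scikit-learn decision trees are illustrative assumptions, not the team's implementation.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def ensemble_undersample(X, y, n_members=10, seed=0):
        # y: 1 = spam (minority class), 0 = reputable (majority class).
        rng = np.random.default_rng(seed)
        spam_idx = np.flatnonzero(y == 1)
        ham_idx = np.flatnonzero(y == 0)
        members = []
        for _ in range(n_members):
            # Each member sees all spam hosts plus an equally sized random
            # sample of reputable hosts, so its training set is balanced.
            sub = rng.choice(ham_idx, size=len(spam_idx), replace=False)
            idx = np.concatenate([spam_idx, sub])
            clf = DecisionTreeClassifier(max_depth=8, random_state=0)
            clf.fit(X[idx], y[idx])
            members.append(clf)
        return members

    def spam_score(members, X):
        # Average the members' predicted spam probabilities.
        return np.mean([m.predict_proba(X)[:, 1] for m in members], axis=0)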

Pascal Filoche, Tanguy Urvoy, Emmanuel Chauveau and Thomas Lavergne
France Telecom and ENST

(Slides) In our article at the AIRWeb 2006 workshop on hidden style similarity, we combined the approaches of HTML noise preprocessing (removing content), minhash fingerprinting and similarity clustering to spot dubious sets of web pages. For this challenge the idea is the same, but we study more preprocessing and clustering strategies, which we use to smooth the predictions of a classifier. We test two learning methodologies and submit two predictions.
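
A minimal Python sketch of the minhash-fingerprinting step, under the assumption that a page's "style" is its HTML tag sequence with the textual content discarded; the team's actual preprocessing and clustering strategies are richer than this.

    import hashlib
    import re

    def tag_shingles(html, k=4):
        # Keep only the tag names (the HTML "noise") and drop the content.
        tags = re.findall(r"<\s*(/?[a-z0-9]+)", html.lower())
        return {" ".join(tags[i:i + k]) for i in range(len(tags) - k + 1)}

    def minhash_signature(shingles, n_hashes=64):
        # One slot per salted hash; the fraction of matching slots between
        # two pages estimates the Jaccard similarity of their shingle sets.
        if not shingles:
            return ["-"] * n_hashes
        return [min(hashlib.md5(str(i).encode() + s.encode()).hexdigest()
                    for s in shingles)
                for i in range(n_hashes)]

    def estimated_similarity(sig_a, sig_b):
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)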

Gordon Cormack
University of Waterloo

(Slides) Our 2007 Web Spam Challenge submission used an ensemble of ten content-based classifiers stacked using logistic regression. Each classifier used one of two state-of-the art email filters -- DMC (Bratko et al 2006) or OSBF-Lua (Assis 2006)-- applied to simple text files, with each text file acting as a proxy for a host to be classified. All text files were derived from the home page (including http and redirection logs), the host name, or the host names associated with incoming or outgoing links. Except for the host names of these immediate neighbours, no information about the topology of the corpus was used.
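
A minimal Python sketch of the stacking step, assuming each of the ten base filters has already produced a per-host score (DMC and OSBF-Lua are external tools and are not reproduced here); scikit-learn's LogisticRegression stands in for the meta-classifier.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def stack(train_scores, y_train, test_scores):
        # train_scores, test_scores: (n_hosts, n_base_classifiers) arrays,
        # one column of scores per base filter/representation pair.
        meta = LogisticRegression()
        meta.fit(train_scores, y_train)   # learn how to weight the base filters
        return meta.predict_proba(test_scores)[:, 1]  # combined spam probability

    # Hypothetical usage with ten base-classifier score columns:
    # combined = stack(np.column_stack(cols_train), labels, np.column_stack(cols_test))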

András A. Benczúr, István Bíró, Károly Csalogány, Miklós Kurucz and Tamás Sarlós
Hungarian Academy of Sciences

(Slides) We use the commercial intent and graph similarity features of our AIRWeb 2007 and 2006 publications, respectively, in addition to the features of Castillo et al., improving their classification accuracy by 3%. We use stacked graphical learning over the Weka C4.5 classifier.
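
A minimal Python sketch of stacked graphical learning, with scikit-learn's DecisionTreeClassifier as a hypothetical stand-in for Weka's C4.5; the feature matrix and host-graph neighbour lists are assumed given, and in practice the same feature expansion is applied to the test hosts.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def stacked_graphical_learning(X, y, neighbours, n_rounds=2):
        # neighbours[i]: indices of the hosts linked to/from host i.
        clf = DecisionTreeClassifier(max_depth=8, random_state=0)
        clf.fit(X, y)
        for _ in range(n_rounds):
            p = clf.predict_proba(X)[:, 1]
            # Append each host's mean neighbour spamicity as a new feature,
            # then retrain: predictions are "stacked" back into the input.
            agg = np.array([p[nb].mean() if len(nb) else 0.0
                            for nb in neighbours])
            X = np.column_stack([X, agg])
            clf = DecisionTreeClassifier(max_depth=8, random_state=0)
            clf.fit(X, y)
        return clf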

Tony Abou-Assaleh and Tapajyoti Das
Genie Knows

(Slides) We describe a Web spam detection algorithm that extends and propagates manual and automatic labels of Web hosts. The manual labels are derived from the training labels provided with the WEBSPAM-UK2006 dataset. The automatic labelling assigned a spam label to hosts with a low variance in the out-degree of in-neighbours and to hosts with significant overlap between their in-links and out-links. The score extension and propagation were applied to the directed host graph.
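
A minimal Python sketch of propagating spam scores over a directed host graph, assuming the seed labels (manual plus heuristic) are already given; the damping factor and iteration count are illustrative, not the authors' settings.

    def propagate(out_links, seeds, alpha=0.85, n_iters=20):
        # out_links: host -> list of hosts it links to (covers every host).
        # seeds: host -> 1.0 (spam) or 0.0 (nonspam) for labelled hosts.
        scores = {h: seeds.get(h, 0.0) for h in out_links}
        for _ in range(n_iters):
            incoming = {h: 0.0 for h in out_links}
            for h, targets in out_links.items():
                if not targets:
                    continue
                # Each host passes a damped, equally split share of its
                # current spam score along its out-links.
                share = alpha * scores[h] / len(targets)
                for t in targets:
                    incoming[t] += share
            # Labelled hosts stay pinned to their seed values; the rest
            # take the propagated mass.
            scores = {h: seeds.get(h, incoming[h]) for h in out_links}
        return scores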

Dennis Fetterly, Steve Chien, Marc Najork, Mark Manasse and Alexandros Ntoulas
Microsoft Research, Microsoft Search Labs

(Slides) This paper describes our contribution to the 2007 Web Spam Challenge. We computed some additional features from the data provided with the UK 2006-05 dataset, and other features from external data sources.

Results

The results of the evaluation phase of Track I were announced during the AIRWeb'07 workshop. See the presentation of the results (400 KB PDF).