Participants of Track I

Summary of results for all participants.

The Web Spam Challenge Track I received nine entries from six teams:

Tony Abou-Assaleh and Tapajyoti Das
Genie Knows
(Summary; Entry 1; Entry 2; Slides) We describe a Web spam detection algorithm that extends and propagates manual and automatic labels of Web hosts. The manual labels are derived from the training labels provided with the WEBSPAM-UK2006 dataset. The automatic labelling assigns a spam label to hosts with low variance in the out-degree of their in-neighbours, and to hosts with significant overlap between their in-links and out-links. The score extension and propagation were applied to the directed host graph.
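A minimal sketch of label propagation over a directed host graph, in the spirit of the approach summarized above. The graph, scores, damping factor, and fixed-point iteration below are illustrative assumptions, not the authors' actual algorithm:

```python
# Hypothetical sketch: spread spam scores from labelled hosts to unlabelled
# ones along in-links. The graph, labels, and alpha are made-up examples.

def propagate_labels(graph, labels, alpha=0.85, iterations=10):
    """graph:  dict mapping each host to the list of hosts it links to
       labels: dict mapping labelled hosts to a score (1.0 = spam)"""
    scores = {h: labels.get(h, 0.0) for h in graph}
    for _ in range(iterations):
        new_scores = {}
        for host in graph:
            if host in labels:                 # manual labels stay fixed
                new_scores[host] = labels[host]
                continue
            # average the current scores of the hosts linking to this one
            incoming = [scores[h] for h, outs in graph.items() if host in outs]
            new_scores[host] = alpha * sum(incoming) / len(incoming) if incoming else 0.0
        scores = new_scores
    return scores

hosts = {"a": ["b", "c"], "b": ["c"], "c": []}
print(propagate_labels(hosts, {"a": 1.0}))
```

Here host "a" is manually labelled spam, and its out-neighbours "b" and "c" inherit damped fractions of that score.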

András A. Benczúr, István Bíró, Károly Csalogány, Miklós Kurucz and Tamás Sarlós
Hungarian Academy of Sciences
(Summary; Entry; Slides) We use the commercial-intent and graph-similarity features of our AIRWeb 2007 and 2006 publications, respectively, in addition to the features of Castillo et al., improving their classification accuracy by 3%. We use stacked graphical learning over the Weka C4.5 classifier.
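Stacked graphical learning can be sketched as follows: after a first classification pass, each host's feature vector is augmented with an aggregate of its graph neighbours' predicted spamicities, and the classifier is retrained. Everything below (hosts, features, predictions) is an illustrative assumption, not the team's code:

```python
# Loose sketch of one stacking round: append the mean neighbour prediction
# from the previous pass as an extra feature. All data here is made up.

def stacked_features(features, predictions, neighbours):
    """features:    host -> list of base features
       predictions: host -> spamicity predicted in the previous pass
       neighbours:  host -> list of neighbouring hosts in the host graph"""
    augmented = {}
    for host, feats in features.items():
        nbr_preds = [predictions[n] for n in neighbours.get(host, [])]
        avg = sum(nbr_preds) / len(nbr_preds) if nbr_preds else 0.0
        augmented[host] = feats + [avg]
    return augmented

feats = {"a": [0.1], "b": [0.9]}
preds = {"a": 0.2, "b": 0.8}
nbrs = {"a": ["b"], "b": ["a"]}
print(stacked_features(feats, preds, nbrs))  # a -> [0.1, 0.8], b -> [0.9, 0.2]
```

In practice the augmented vectors would be fed back into the base learner (here, C4.5) for another training round.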

Gordon Cormack
University of Waterloo
(Summary; Entry; Slides) Our 2007 Web Spam Challenge submission used an ensemble of ten content-based classifiers stacked using logistic regression. Each classifier used one of two state-of-the-art email filters -- DMC (Bratko et al. 2006) or OSBF-Lua (Assis 2006) -- applied to simple text files, with each text file acting as a proxy for a host to be classified. All text files were derived from the home page (including http and redirection logs), the host name, or the host names associated with incoming or outgoing links. Except for the host names of these immediate neighbours, no information about the topology of the corpus was used.
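The stacking step described above amounts to a logistic combination of the base classifiers' scores. A minimal sketch, assuming ten per-proxy scores and weights learned elsewhere (all numbers below are invented, not from the submission):

```python
# Hedged sketch of logistic-regression stacking over ten base-classifier
# scores (e.g. DMC / OSBF-Lua outputs on different text proxies of a host).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def stack(scores, weights, bias):
    """Combine base scores into a single spam probability."""
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return sigmoid(z)

# Ten illustrative base scores and uniform weights; in practice the weights
# and bias would be fit by logistic regression on held-out labelled hosts.
base_scores = [0.9, 0.8, 0.7, 0.95, 0.6, 0.85, 0.5, 0.9, 0.75, 0.8]
weights = [1.0] * 10
print(stack(base_scores, weights, bias=-5.0))
```

With mostly-spammy base scores the combined probability lands above 0.5; the learned weights let stronger base filters dominate the vote.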

Dennis Fetterly, Steve Chien, Marc Najork, Mark Manasse and Alexandros Ntoulas
Microsoft Research; Microsoft Search Labs
(Summary; Entry; Slides) This paper describes our contribution to the 2007 Web Spam Challenge. We computed some additional features from the data provided with the UK 2006-05 dataset, and other features from external data sources.

Pascal Filoche, Tanguy Urvoy, Emmanuel Chauveau and Thomas Lavergne
France Telecom; ENST
(Summary; Entry 1; Entry 2; Slides) In our AIRWeb 2006 workshop article on hidden style similarity, we combined HTML noise preprocessing (removing content), minhash fingerprinting, and similarity clustering to spot dubious sets of web pages. For this challenge the idea is the same, but we study more preprocessing and clustering strategies, which we use to smooth the predictions of a classifier. We test two learning methodologies and submit two predictions.
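The minhash fingerprinting step can be sketched as follows. The shingle size, number of hash seeds, and hash function are illustrative assumptions; the actual preprocessing (which strips content and keeps the HTML "noise") is only hinted at here:

```python
# Rough sketch of minhash fingerprints for estimating the Jaccard similarity
# of two pages' shingle sets. Parameters below are made-up assumptions.
import hashlib

def shingles(text, k=4):
    """Character k-grams of the (pre-processed) page."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(text, num_hashes=32):
    """Signature: min hash of the shingle set under several seeded hashes."""
    grams = shingles(text)
    return [min(int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16)
                for g in grams)
            for seed in range(num_hashes)]

def similarity(sig_a, sig_b):
    """Fraction of agreeing signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash("<div><a><b></b></a></div>")
b = minhash("<div><a><i></i></a></div>")
print(similarity(a, b))
```

Pages whose markup skeletons collide on many signature slots would then be grouped by the clustering step and used to smooth the classifier's predictions.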

Guanggang Geng, Chunheng Wang, Xiaobo Jin, Qiudan Li and Lei Xu
Institute of Automation, Chinese Academy of Sciences, Beijing
(Summary; Entry 1; Entry 2) Based on the fact that reputable hosts are easier to obtain on the Web than spam hosts, we propose an ensemble under-sampling classification strategy that exploits to full advantage the information contained in the large number of reputable websites. Content-based, transformed link-based, and HostRank-related features are taken into account.
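Ensemble under-sampling addresses the class imbalance by training several classifiers, each on all spam examples plus a different random subset of the more numerous reputable examples, then combining their votes. A minimal sketch of the sampling step, with invented data and no particular base learner:

```python
# Hedged sketch of ensemble under-sampling: build several balanced training
# sets by pairing every spam example with a fresh random sample of reputable
# ones. The data and the number of models are illustrative assumptions.
import random

def ensemble_undersample(spam, reputable, n_models=5, seed=0):
    """Return n_models balanced training sets (all spam + sampled reputable)."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_models):
        sample = rng.sample(reputable, len(spam))   # balance the classes
        subsets.append(spam + sample)
    return subsets

spam = [f"spam{i}" for i in range(3)]
reputable = [f"ok{i}" for i in range(20)]
for subset in ensemble_undersample(spam, reputable):
    print(len(subset))   # each balanced set has 6 examples
```

Each balanced subset would train one member of the ensemble, so every reputable example has a chance to contribute without swamping the spam class.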


The results of the evaluation phase of Track I were announced during the AIRWeb'07 workshop. See the presentation of the results (400 Kb PDF).