Learning to Label a Graph: Application to Web Spam Detection
Many domains and applications are concerned with complex data composed of elementary components linked according to some structural or logical organization. These data are often described as a graph where nodes and links hold information. In Computer Vision for example, a picture can be described by a graph which corresponds to the organization of the different regions of the pictures – each node of the graph corresponding to a region. In the text domain, the diffusion of new data formats like XML and HTML have considerably changed the domains of Information Retrieval. On the Web, documents are organized according to a graph structure where nodes correspond to the Web page and edges correspond to the hyper links between the pages. Moreover, Web pages are also structured documents containing both a content information and a logical information encoded by the HTML tags, and can be viewed as labelled trees.
Other application domains concerned with graph data include image processing, multimedia (video), natural language processing, social networks, biology, etc. Handling structured data has become a main challenge for these domains and different communities have been developing for some years their own methods for dealing with structured data. The ML community should be a major actor in this area. Some new models have recently been proposed for learning with relational data and particularly with trees and graphs. Graph labelling which consists in labelling all the vertices of a graph from a partial labelling of the graph vertices has been identified as a generic ML problem with many fields of application.
A few models have already been proposed for this task. These models have two major problems and can not be trivially extended to the Web spam detection task:
- They usually work with graphs that do not contain any content information in the nodes.
- They are limited to small collections and usually do not scale well to Web environments having millions of nodes.
For example, [HPW05] proposes a perceptron-based model able to learn with unoriented graphs where the nodes are sequentially proposed to the algorithm which penalizes if the prediction does not correspond to the true label. This model is tested on the digit recognition task with a graph composed of 1,000 nodes. [ZHS05] extends a preliminary work presented in [ZBL+03] and proposes a model based on the use of the pagerank scoring function. The authors introduce a semi-supervised technique and use this algorithm for the genre identification of Web pages over a Web graph of about 4,000 nodes.
Web Spam Detection
Web spam detection is becoming a major target application for internet search providers. The Web contains numerous profit-seeking ventures that are attracted by the prospect of reaching millions of users at a very low cost. There is an economic incentive for manipulating search engine’s listings by creating pages that score high independently of their real merit. In practice such manipulation is widespread, and in many cases, successful. One suitable way to define Web spam is any attempt to get “an unjustifiably favorable relevance or importance score for some Web page, considering the page’s true value” [GGM05]. There is a large gray area between “ethical” Search Engine Optimization (SEO) and “unethical” spam. SEO services range from ensuring that Web pages are indexable by Web crawlers, to the creation of thousands or millions of fake pages aimed at deceiving search engine ranking algorithms. Traditional IR methods assumed a controlled collection in which the authors of the documents being indexed and retrieved had no knowledge of the IR system and no intention of manipulating its behavior. On Web-IR, these assumptions are no longer valid, specially when searching at global scale. Almost every IR algorithm is prone to manipulation in its pure form.
A ranking based purely on the vector space model, for instance, can be easily manipulated by inserting many keywords in the document; a ranking based purely on counting citations can be manipulated by creating many meaningless pages pointing to a target page, and so on. Of course, ideally the search engine administrators want to stay ahead of the spammers in terms of ranking algorithms and detection methods. Fortunately, from the point of view of the search engine, the goal is just to alter the economic balance for the would-be spammer [NNMF06], not necessarily detecting 100% of the Web spam. If the search engine can maintain the costs for the spammers consistently above their expected gain from manipulating the ranking, it can really keep Web spam low.
The Adversarial Information Retrieval on the Web (AIRWeb) series of workshops was started in 2005 by the academic community. Many existing heuristics for detection are often specific to a specific type of spam and can not be used if a new Web spam technique appears. In other words, after a heuristic for Web spam detection is developed, the bubble of Web visibility tends to resurface somewhere else [GW05]. We need to propose new models able to learn to detect any type of Web Spam and that can be adapted quickly to new unknown spam techniques. Machine learning methods are the key to achieve this goal.
Key research issues
This challenge aims at bringing together the ML and IR community on the specific problem of Web spam labelling. The goal of the challenge is therefore to explore algorithmic, theoretical and practical issues regarding methods for automatically label a partially labelled Web collection. The challenge is also aimed as a discussion forum for defining new ML problems specific to Web Spam detection and Graph labelling and Structured Output Classification. Since this is the first evaluation of ML techniques for graph labelling and Web Spam this challenge will raise several new issues. Among the main issues to be investigated we will focus on the following ones.
Key issues: ML perspective
- Learning with inter-dependent variables (the nodes of the Web graph).
- Short-term or long-term dependencies: Web Spam detection methods will have to deal with both the dependencies to take into account and the complexity of the model.
- Learning with few examples: in a global-scale search engine that needs to detect Web Spam, typically the number of manually labeled examples will be several orders of magnitude smaller than the number of pages on the Web.
- Scalability: most applications concerned with Web graph will have to handle large collections (this will be the case with this challenge).
Key issues: IR/Web Mining perspective
- Feature extraction: find out which are the representative elements of the Web pages and/or domains that are more useful to separate spam from non-spam pages. This includes contentbased features extracted from the text of the pages as link-based features extracted from the Web graph.
- Feature aggregation: study how to aggregate page-level data to aggregate entire directories, hosts or domains to improve the effectiveness of the labelling process. Also study the converse problem: how to use data that is only available at the domain level to label individual pages belonging to the domain.
- Feature propagation: study how to use the graph, not only for generating features about the pages, but also for guiding the labelling process and for propagating information about the labels.
- Recall/precision trade-offs: the Web spam labelling tasks is not error-free, and the system must be carefully balanced to avoid being too strict (at the risk of marking many nonspam items as spam) or being too lenient (at the risk of letting too many spam items to go undetected). The ideal scenario is to provide tunable parameters for adapting the strictness of the labelling to different scenarios.
- Scalability: when the number of pages is very large, the cost of extracting a feature has to be balanced with its performance in the Web Spam detection task.
- Zoltán Gyöngyi and Hector Garcia-Molina. Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web, 2005.
- Zoltán Gyöngyi, Hector G. Molina, and Jan Pedersen. Combating web spam with trustrank. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada, August 2004.
- Marco Gori and Ian Witten. The bubble of web visibility. Commun. ACM, 48(3):115–117, March 2005.
- Mark Herbster, Massimiliano Pontil, and Lisa Wainer. Online learning over graphs. In ICML ’05: Proceedings of the 22nd international conference on Machine learning, pages 305–312, New York, NY, USA, 2005. ACM Press.
- A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland, May 2006.
- D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Sch¨olkopf. Learning with local and global consistency. In 18th Annual Conf. on Neural Information Processing Systems, 2003.
- Dengyong Zhou, Jiayuan Huang, and Bernhard Schölkopf. Learning from labeled and unlabeled data on a directed graph. In ICML ’05: Proceedings of the 22nd international conference on Machine learning, pages 1036–1043, New York, NY, USA, 2005. ACM Press.