Purely URL-based topic classification
Baykan E, Henzinger MH, Marian L, Weber I. 2009. Purely URL-based topic classification. 18th International World Wide Web Conference. WWW: Conference on World Wide Web, 1109–1110.
Conference Paper | Published | English
Scopus indexed
Author
Baykan, Eda;
Henzinger, Monika (ISTA);
Marian, Ludmila;
Weber, Ingmar
Abstract
Given only the URL of a web page, can we identify its topic? This is the question that we examine in this paper. Usually, web pages are classified using their content, but a URL-only classifier is preferable (i) when speed is crucial, (ii) to enable content filtering before an (objectionable) web page is downloaded, (iii) when a page's content is hidden in images, (iv) to annotate hyperlinks in a personalized web browser, without fetching the target page, and (v) when a focused crawler wants to infer the topic of a target page before devoting bandwidth to download it. We apply a machine learning approach to the topic identification task and evaluate its performance in extensive experiments on categorized web pages from the Open Directory Project (ODP). When training separate binary classifiers for each topic, we achieve typical F-measure values between 80 and 85, and a typical precision of around 85. We also ran experiments on a small data set of university web pages. For the task of classifying these pages into faculty, student, course and project pages, our methods improve over previous approaches by 13.8 points of F-measure.
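The abstract does not say which features or learning algorithm the authors used, so the following is only an illustrative sketch of the general idea: one binary classifier per topic, trained on nothing but URL strings and scored with precision and F-measure. The tokenization rule, the choice of naive Bayes, all function and class names, and the example.edu / example.org URLs are assumptions made for illustration and are not taken from the paper.

```python
import math
import re
from collections import Counter

# Assumed stop tokens; the paper's actual feature set is not given in the abstract.
STOP_TOKENS = {"http", "https", "www"}


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens (an assumed feature choice)."""
    parts = re.split(r"[^a-z]+", url.lower())
    return [t for t in parts if len(t) > 1 and t not in STOP_TOKENS]


class BinaryUrlClassifier:
    """Naive Bayes with add-one smoothing for one topic (topic vs. not topic).

    Naive Bayes is a stand-in here, not necessarily the authors' learner.
    Both classes must appear in the training data.
    """

    def fit(self, urls, labels):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}
        for url, label in zip(urls, labels):
            self.docs[label] += 1
            self.counts[label].update(url_tokens(url))
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def _log_score(self, tokens, label):
        total = sum(self.counts[label].values())
        score = math.log(self.docs[label] / (self.docs[True] + self.docs[False]))
        for tok in tokens:
            score += math.log((self.counts[label][tok] + 1) / (total + len(self.vocab)))
        return score

    def predict(self, url):
        # Tokens never seen in training are ignored.
        tokens = [t for t in url_tokens(url) if t in self.vocab]
        return self._log_score(tokens, True) > self._log_score(tokens, False)


def precision_recall_f1(predicted, actual):
    """Return (precision, recall, F-measure) on a 0-1 scale for boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical training data for a "course page" classifier.
train_urls = [
    "http://www.example.edu/courses/cs101/syllabus.html",
    "http://www.example.edu/courses/math200/schedule.html",
    "http://www.example.edu/~smith/publications.html",
    "http://www.example.edu/people/jones/cv.html",
]
train_labels = [True, True, False, False]
course_clf = BinaryUrlClassifier().fit(train_urls, train_labels)
```

Calling `course_clf.predict(url)` on an unseen URL returns a boolean for the single topic; a multi-topic setup in this style would train one such classifier per topic. Note that `precision_recall_f1` returns values between 0 and 1, whereas the abstract reports scores on a 0 to 100 scale.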
Publishing Year
2009
Date Published
2009-04-01
Proceedings Title
18th International World Wide Web Conference
Publisher
Association for Computing Machinery
Page
1109-1110
Conference
WWW: Conference on World Wide Web
Conference Location
New York, NY, United States
Conference Date
2009-04-20 – 2009-04-24
Cite this
Baykan E, Henzinger MH, Marian L, Weber I. Purely URL-based topic classification. In: 18th International World Wide Web Conference. Association for Computing Machinery; 2009:1109-1110. doi:10.1145/1526709.1526880
Baykan, E., Henzinger, M. H., Marian, L., & Weber, I. (2009). Purely URL-based topic classification. In 18th International World Wide Web Conference (pp. 1109–1110). New York, NY, United States: Association for Computing Machinery. https://doi.org/10.1145/1526709.1526880
Baykan, Eda, Monika H Henzinger, Ludmila Marian, and Ingmar Weber. “Purely URL-Based Topic Classification.” In 18th International World Wide Web Conference, 1109–10. Association for Computing Machinery, 2009. https://doi.org/10.1145/1526709.1526880.
E. Baykan, M. H. Henzinger, L. Marian, and I. Weber, “Purely URL-based topic classification,” in 18th International World Wide Web Conference, New York, NY, United States, 2009, pp. 1109–1110.
Baykan E, Henzinger MH, Marian L, Weber I. 2009. Purely URL-based topic classification. 18th International World Wide Web Conference. WWW: Conference on World Wide Web, 1109–1110.
Baykan, Eda, et al. “Purely URL-Based Topic Classification.” 18th International World Wide Web Conference, Association for Computing Machinery, 2009, pp. 1109–10, doi:10.1145/1526709.1526880.