A comprehensive study of features and algorithms for URL-based topic classification
Baykan E, Henzinger MH, Marian L, Weber I. 2011. A comprehensive study of features and algorithms for URL-based topic classification. ACM Transactions on the Web. 5(3), 15.
Journal Article | Published | English
Scopus indexed
Author
Baykan, Eda;
Henzinger, Monika (ISTA);
Marian, Ludmila;
Weber, Ingmar
Abstract
Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs make this a very challenging problem. Web page classification without a page’s content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several techniques for feature generation and classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed further over 90 while maintaining a typical level of recall between 30 and 40.
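To make the precision/recall trade-off reported in the abstract concrete, the following minimal sketch computes an F-measure at the high-precision operating point. It assumes that "F-measure" means the balanced F1 score (the harmonic mean of precision and recall) and that all figures are percentages; the abstract states neither explicitly, and the function name `f_measure` is illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the abstract's "F-measure" is the balanced F1 score.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# High-precision operating point from the abstract, values in percent:
# precision of about 90 with recall between 30 and 40.
for recall in (30, 40):
    print(f"precision=90, recall={recall}: F1={f_measure(90, recall):.1f}")
```

Under these assumptions, raising precision to 90 at a recall of 30 to 40 corresponds to an F1 of roughly 45 to 55, well below the typical 80 to 85 reported for the default setting.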
Publishing Year
2011
Date Published
2011-07-01
Journal Title
ACM Transactions on the Web
Publisher
Association for Computing Machinery
Volume
5
Issue
3
Article Number
15
Cite this
Baykan E, Henzinger MH, Marian L, Weber I. A comprehensive study of features and algorithms for URL-based topic classification. ACM Transactions on the Web. 2011;5(3). doi:10.1145/1993053.1993057
Baykan, E., Henzinger, M. H., Marian, L., & Weber, I. (2011). A comprehensive study of features and algorithms for URL-based topic classification. ACM Transactions on the Web. Association for Computing Machinery. https://doi.org/10.1145/1993053.1993057
Baykan, Eda, Monika H Henzinger, Ludmila Marian, and Ingmar Weber. “A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification.” ACM Transactions on the Web. Association for Computing Machinery, 2011. https://doi.org/10.1145/1993053.1993057.
E. Baykan, M. H. Henzinger, L. Marian, and I. Weber, “A comprehensive study of features and algorithms for URL-based topic classification,” ACM Transactions on the Web, vol. 5, no. 3. Association for Computing Machinery, 2011.
Baykan, Eda, et al. “A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification.” ACM Transactions on the Web, vol. 5, no. 3, 15, Association for Computing Machinery, 2011, doi:10.1145/1993053.1993057.