[{"type":"journal_article","day":"01","status":"public","intvolume":"         7","publication":"ACM Transactions on the Web","issue":"1","article_type":"original","date_published":"2013-03-01T00:00:00Z","month":"03","language":[{"iso":"eng"}],"scopus_import":"1","publisher":"Association for Computing Machinery","date_created":"2022-07-27T12:50:18Z","citation":{"ieee":"E. Baykan, I. Weber, and M. H. Henzinger, “A comprehensive study of techniques for URL-based web page language classification,” <i>ACM Transactions on the Web</i>, vol. 7, no. 1. Association for Computing Machinery, 2013.","apa":"Baykan, E., Weber, I., &#38; Henzinger, M. H. (2013). A comprehensive study of techniques for URL-based web page language classification. <i>ACM Transactions on the Web</i>. Association for Computing Machinery. <a href=\"https://doi.org/10.1145/2435215.2435218\">https://doi.org/10.1145/2435215.2435218</a>","chicago":"Baykan, Eda, Ingmar Weber, and Monika H Henzinger. “A Comprehensive Study of Techniques for URL-Based Web Page Language Classification.” <i>ACM Transactions on the Web</i>. Association for Computing Machinery, 2013. <a href=\"https://doi.org/10.1145/2435215.2435218\">https://doi.org/10.1145/2435215.2435218</a>.","ama":"Baykan E, Weber I, Henzinger MH. A comprehensive study of techniques for URL-based web page language classification. <i>ACM Transactions on the Web</i>. 2013;7(1). doi:<a href=\"https://doi.org/10.1145/2435215.2435218\">10.1145/2435215.2435218</a>","mla":"Baykan, Eda, et al. “A Comprehensive Study of Techniques for URL-Based Web Page Language Classification.” <i>ACM Transactions on the Web</i>, vol. 7, no. 1, 3, Association for Computing Machinery, 2013, doi:<a href=\"https://doi.org/10.1145/2435215.2435218\">10.1145/2435215.2435218</a>.","short":"E. Baykan, I. Weber, M.H. Henzinger, ACM Transactions on the Web 7 (2013).","ista":"Baykan E, Weber I, Henzinger MH. 2013. A comprehensive study of techniques for URL-based web page language classification. ACM Transactions on the Web. 7(1), 3."},"publication_status":"published","author":[{"first_name":"Eda","last_name":"Baykan","full_name":"Baykan, Eda"},{"full_name":"Weber, Ingmar","last_name":"Weber","first_name":"Ingmar"},{"first_name":"Monika H","orcid":"0000-0002-5008-6530","full_name":"Henzinger, Monika H","last_name":"Henzinger","id":"540c9bbd-f2de-11ec-812d-d04a5be85630"}],"keyword":["Computer Networks and Communications"],"abstract":[{"text":"Given only the URL of a Web page, can we identify its language? In this article we examine this question. URL-based language classification is useful when the content of the Web page is not available or downloading the content is a waste of bandwidth and time.\r\nWe built URL-based language classifiers for English, German, French, Spanish, and Italian by applying a variety of algorithms and features. As algorithms we used machine learning algorithms which are widely applied for text classification and state-of-art algorithms for language identification of text. As features we used words, various sized n-grams, and custom-made features (our novel feature set). We compared our approaches with two baseline methods, namely classification by country code top-level domains and classification by IP addresses of the hosting Web servers.\r\n\r\nWe trained and tested our classifiers in a 10-fold cross-validation setup on a dataset obtained from the Open Directory Project and from querying a commercial search engine. We obtained the lowest F1-measure for English (94) and the highest F1-measure for German (98) with the best performing classifiers.\r\n\r\nWe also evaluated the performance of our methods: (i) on a set of Web pages written in Adobe Flash and (ii) as part of a language-focused crawler. In the first case, the content of the Web page is hard to extract and in the second page downloading pages of the “wrong” language constitutes a waste of bandwidth. In both settings the best classifiers have a high accuracy with an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused crawler.","lang":"eng"}],"article_processing_charge":"No","date_updated":"2022-09-12T08:51:57Z","volume":7,"quality_controlled":"1","oa_version":"None","user_id":"2DF688A6-F248-11E8-B48F-1D18A9856A87","extern":"1","publication_identifier":{"eissn":["1559-114X"],"issn":["1559-1131"]},"_id":"11671","doi":"10.1145/2435215.2435218","year":"2013","title":"A comprehensive study of techniques for URL-based web page language classification","article_number":"3"},{"article_processing_charge":"No","date_updated":"2022-09-12T08:46:56Z","volume":5,"publication_identifier":{"eissn":["1559-114X"],"issn":["1559-1131"]},"extern":"1","_id":"11673","oa_version":"None","quality_controlled":"1","user_id":"2DF688A6-F248-11E8-B48F-1D18A9856A87","citation":{"mla":"Baykan, Eda, et al. “A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification.” <i>ACM Transactions on the Web</i>, vol. 5, no. 3, 15, Association for Computing Machinery, 2011, doi:<a href=\"https://doi.org/10.1145/1993053.1993057\">10.1145/1993053.1993057</a>.","ama":"Baykan E, Henzinger MH, Marian L, Weber I. A comprehensive study of features and algorithms for URL-based topic classification. <i>ACM Transactions on the Web</i>. 2011;5(3). doi:<a href=\"https://doi.org/10.1145/1993053.1993057\">10.1145/1993053.1993057</a>","short":"E. Baykan, M.H. Henzinger, L. Marian, I. Weber, ACM Transactions on the Web 5 (2011).","ista":"Baykan E, Henzinger MH, Marian L, Weber I. 2011. A comprehensive study of features and algorithms for URL-based topic classification. ACM Transactions on the Web. 5(3), 15.","apa":"Baykan, E., Henzinger, M. H., Marian, L., &#38; Weber, I. (2011). A comprehensive study of features and algorithms for URL-based topic classification. <i>ACM Transactions on the Web</i>. Association for Computing Machinery. <a href=\"https://doi.org/10.1145/1993053.1993057\">https://doi.org/10.1145/1993053.1993057</a>","ieee":"E. Baykan, M. H. Henzinger, L. Marian, and I. Weber, “A comprehensive study of features and algorithms for URL-based topic classification,” <i>ACM Transactions on the Web</i>, vol. 5, no. 3. Association for Computing Machinery, 2011.","chicago":"Baykan, Eda, Monika H Henzinger, Ludmila Marian, and Ingmar Weber. “A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification.” <i>ACM Transactions on the Web</i>. Association for Computing Machinery, 2011. <a href=\"https://doi.org/10.1145/1993053.1993057\">https://doi.org/10.1145/1993053.1993057</a>."},"publication_status":"published","abstract":[{"text":"Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs makes this a very challenging problem. Web page classification without a page’s content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several techniques for feature generation and classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed further over 90 while maintaining a typical level of recall between 30 and 40.","lang":"eng"}],"keyword":["Topic classification","URL","ODP"],"author":[{"full_name":"Baykan, Eda","last_name":"Baykan","first_name":"Eda"},{"id":"540c9bbd-f2de-11ec-812d-d04a5be85630","first_name":"Monika H","orcid":"0000-0002-5008-6530","full_name":"Henzinger, Monika H","last_name":"Henzinger"},{"first_name":"Ludmila","full_name":"Marian, Ludmila","last_name":"Marian"},{"first_name":"Ingmar","last_name":"Weber","full_name":"Weber, Ingmar"}],"article_number":"15","doi":"10.1145/1993053.1993057","year":"2011","title":"A comprehensive study of features and algorithms for URL-based topic classification","publication":"ACM Transactions on the Web","issue":"3","day":"01","type":"journal_article","intvolume":"         5","status":"public","date_created":"2022-07-27T13:48:11Z","month":"07","article_type":"original","date_published":"2011-07-01T00:00:00Z","scopus_import":"1","publisher":"Association for Computing Machinery","language":[{"iso":"eng"}]}]
