Web page language identification based on URLs

Baykan, Eda; Henzinger, Monika H; Weber, Ingmar

Baykan E, Henzinger MH, Weber I. 2008. Web page language identification based on URLs. Proceedings of the VLDB Endowment. 1(1), 176–187.

DOI

10.14778/1453856.1453880

Journal Article | Published | English

Scopus indexed

Author

Baykan, Eda; Henzinger, Monika (ISTA); Weber, Ingmar

Abstract

Given only the URL of a web page, can we identify its language? This is the question that we examine in this paper. Such a language classifier is useful, for example, for crawlers of web search engines, which frequently try to satisfy certain language quotas. To determine the language of an uncrawled web page, they have to download it, which is wasteful if the page is not in the desired language. With URL-based language classifiers, these redundant downloads can be avoided. We apply a variety of machine learning algorithms to the language identification task and evaluate their performance in extensive experiments for five languages: English, French, German, Spanish and Italian. Our best methods achieve an F-measure, averaged over all languages, of around .90 both for a random sample of 1,260 web pages from a large web crawl and for 25k pages from the ODP directory. For 5k pages of web search engine results, we even achieve an F-measure of .96. The achieved recall for these collections is .93, .88 and .95, respectively. Two independent human evaluators performed considerably worse on the task, with an F-measure of .75 and a typical recall of a mere .67. Using only country-code top-level domains, such as .de or .fr, yields good precision, but a typical recall below .60 and an F-measure of around .68.
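The country-code baseline and the evaluation measures mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `CCTLD_TO_LANG` mapping, the function names, and the sample URLs are assumptions made for illustration, and the lookup assumes each URL includes a scheme such as `http://`.

```python
from urllib.parse import urlsplit

# Hypothetical ccTLD-to-language table, for illustration only;
# the paper's actual mapping is not given in the abstract.
CCTLD_TO_LANG = {
    "de": "German",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
    "uk": "English",
}


def cctld_language(url):
    """Guess a language from the country-code TLD, or None if unknown."""
    host = urlsplit(url).hostname or ""   # None when the URL has no scheme
    tld = host.rsplit(".", 1)[-1].lower()
    return CCTLD_TO_LANG.get(tld)


def precision_recall_f(predicted, actual, language):
    """Precision, recall and F-measure (harmonic mean) for one language."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == language and a == language for p, a in pairs)
    fp = sum(p == language and a != language for p, a in pairs)
    fn = sum(p != language and a == language for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Made-up URLs with made-up true labels, not data from the paper.
urls = [
    "http://www.example.de/wetter",
    "http://example.com/nachrichten",
    "http://example.de/actualites",
    "http://example.com/actualites",
]
actual = ["German", "German", "French", "French"]
predicted = [cctld_language(u) for u in urls]
scores = precision_recall_f(predicted, actual, "German")
```

A lookup of this kind returns no answer for generic domains such as .com, so every page hosted there counts as a miss. That is consistent with the low recall the abstract reports for the country-code baseline.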

Publishing Year

2008

Date Published

2008-08-01

Journal Title

Proceedings of the VLDB Endowment

Publisher

Association for Computing Machinery

Volume

1

Issue

1

Page

176-187

ISSN

2150-8097

IST-REx-ID

11878

Cite this

Baykan E, Henzinger MH, Weber I. Web page language identification based on URLs. Proceedings of the VLDB Endowment. 2008;1(1):176-187. doi:10.14778/1453856.1453880

Baykan, E., Henzinger, M. H., & Weber, I. (2008). Web page language identification based on URLs. Proceedings of the VLDB Endowment. Association for Computing Machinery. https://doi.org/10.14778/1453856.1453880

Baykan, Eda, Monika H Henzinger, and Ingmar Weber. “Web Page Language Identification Based on URLs.” Proceedings of the VLDB Endowment. Association for Computing Machinery, 2008. https://doi.org/10.14778/1453856.1453880.

E. Baykan, M. H. Henzinger, and I. Weber, “Web page language identification based on URLs,” Proceedings of the VLDB Endowment, vol. 1, no. 1. Association for Computing Machinery, pp. 176–187, 2008.

Baykan E, Henzinger MH, Weber I. 2008. Web page language identification based on URLs. Proceedings of the VLDB Endowment. 1(1), 176–187.

Baykan, Eda, et al. “Web Page Language Identification Based on URLs.” Proceedings of the VLDB Endowment, vol. 1, no. 1, Association for Computing Machinery, 2008, pp. 176–87, doi:10.14778/1453856.1453880.
