---
_id: '11878'
abstract:
- lang: eng
  text: "Given only the URL of a web page, can we identify its language? This is the
    question that we examine in this paper.\r\nSuch a language classifier is, for
    example, useful for crawlers of web search engines, which frequently try to satisfy
    certain language quotas. To determine the language of an uncrawled web page, they
    have to download it, which might be wasteful if the page is not in the
    desired language. With URL-based language classifiers, these redundant downloads
    can be avoided.\r\n\r\nWe apply a variety of machine learning algorithms to the
    language identification task and evaluate their performance in extensive experiments
    for five languages: English, French, German, Spanish and Italian. Our best methods
    achieve an F-measure, averaged over all languages, of around .90 for both a random
    sample of 1,260 web pages from a large web crawl and for 25k pages from the ODP
    directory. For 5k pages of web search engine results we even achieve an F-measure
    of .96. The achieved recall for these collections is .93, .88 and .95 respectively.
    Two independent human evaluators performed considerably worse on the task, with
    an F-measure of .75 and a typical recall of a mere .67. Using only country-code
    top-level domains, such as .de or .fr, yields good precision but a typical recall
    of below .60 and an F-measure of around .68."
article_processing_charge: No
article_type: original
author:
- first_name: Eda
  full_name: Baykan, Eda
  last_name: Baykan
- first_name: Monika H
  full_name: Henzinger, Monika H
  id: 540c9bbd-f2de-11ec-812d-d04a5be85630
  last_name: Henzinger
  orcid: 0000-0002-5008-6530
- first_name: Ingmar
  full_name: Weber, Ingmar
  last_name: Weber
citation:
  ama: Baykan E, Henzinger MH, Weber I. Web page language identification based on
    URLs. <i>Proceedings of the VLDB Endowment</i>. 2008;1(1):176-187. doi:<a href="https://doi.org/10.14778/1453856.1453880">10.14778/1453856.1453880</a>
  apa: Baykan, E., Henzinger, M. H., &#38; Weber, I. (2008). Web page language identification
    based on URLs. <i>Proceedings of the VLDB Endowment</i>. Association for Computing
    Machinery. <a href="https://doi.org/10.14778/1453856.1453880">https://doi.org/10.14778/1453856.1453880</a>
  chicago: Baykan, Eda, Monika H Henzinger, and Ingmar Weber. “Web Page Language Identification
    Based on URLs.” <i>Proceedings of the VLDB Endowment</i>. Association for Computing
    Machinery, 2008. <a href="https://doi.org/10.14778/1453856.1453880">https://doi.org/10.14778/1453856.1453880</a>.
  ieee: E. Baykan, M. H. Henzinger, and I. Weber, “Web page language identification
    based on URLs,” <i>Proceedings of the VLDB Endowment</i>, vol. 1, no. 1. Association
    for Computing Machinery, pp. 176–187, 2008.
  ista: Baykan E, Henzinger MH, Weber I. 2008. Web page language identification based
    on URLs. Proceedings of the VLDB Endowment. 1(1), 176–187.
  mla: Baykan, Eda, et al. “Web Page Language Identification Based on URLs.” <i>Proceedings
    of the VLDB Endowment</i>, vol. 1, no. 1, Association for Computing Machinery,
    2008, pp. 176–87, doi:<a href="https://doi.org/10.14778/1453856.1453880">10.14778/1453856.1453880</a>.
  short: E. Baykan, M.H. Henzinger, I. Weber, Proceedings of the VLDB Endowment 1
    (2008) 176–187.
date_created: 2022-08-16T13:10:11Z
date_published: 2008-08-01T00:00:00Z
date_updated: 2023-02-17T13:55:24Z
day: '01'
doi: 10.14778/1453856.1453880
extern: '1'
intvolume: '         1'
issue: '1'
language:
- iso: eng
month: '08'
oa_version: None
page: 176-187
publication: Proceedings of the VLDB Endowment
publication_identifier:
  issn:
  - 2150-8097
publication_status: published
publisher: Association for Computing Machinery
quality_controlled: '1'
scopus_import: '1'
status: public
title: Web page language identification based on URLs
type: journal_article
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
volume: 1
year: '2008'
...
