---
_id: '11671'
abstract:
- lang: eng
  text: "Given only the URL of a Web page, can we identify its language? In this article
    we examine this question. URL-based language classification is useful when the
    content of the Web page is not available or downloading the content is a waste
    of bandwidth and time.\r\nWe built URL-based language classifiers for English,
    German, French, Spanish, and Italian by applying a variety of algorithms and features.
    As algorithms we used machine learning algorithms which are widely applied for
    text classification and state-of-the-art algorithms for language identification of
    text. As features we used words, various sized n-grams, and custom-made features
    (our novel feature set). We compared our approaches with two baseline methods,
    namely classification by country code top-level domains and classification by
    IP addresses of the hosting Web servers.\r\n\r\nWe trained and tested our classifiers
    in a 10-fold cross-validation setup on a dataset obtained from the Open Directory
    Project and from querying a commercial search engine. We obtained the lowest F1-measure
    for English (94) and the highest F1-measure for German (98) with the best performing
    classifiers.\r\n\r\nWe also evaluated the performance of our methods: (i) on a
    set of Web pages written in Adobe Flash and (ii) as part of a language-focused
    crawler. In the first case, the content of the Web page is hard to extract, and
    in the second case, downloading pages of the “wrong” language constitutes a waste
    of bandwidth. In both settings the best classifiers have a high accuracy with
    an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash
    pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused
    crawler."
article_number: '3'
article_processing_charge: No
article_type: original
author:
- first_name: Eda
  full_name: Baykan, Eda
  last_name: Baykan
- first_name: Ingmar
  full_name: Weber, Ingmar
  last_name: Weber
- first_name: Monika H
  full_name: Henzinger, Monika H
  id: 540c9bbd-f2de-11ec-812d-d04a5be85630
  last_name: Henzinger
  orcid: 0000-0002-5008-6530
citation:
  ama: Baykan E, Weber I, Henzinger MH. A comprehensive study of techniques for URL-based
    web page language classification. <i>ACM Transactions on the Web</i>. 2013;7(1).
    doi:<a href="https://doi.org/10.1145/2435215.2435218">10.1145/2435215.2435218</a>
  apa: Baykan, E., Weber, I., &#38; Henzinger, M. H. (2013). A comprehensive study
    of techniques for URL-based web page language classification. <i>ACM Transactions
    on the Web</i>. Association for Computing Machinery. <a href="https://doi.org/10.1145/2435215.2435218">https://doi.org/10.1145/2435215.2435218</a>
  chicago: Baykan, Eda, Ingmar Weber, and Monika H Henzinger. “A Comprehensive Study
    of Techniques for URL-Based Web Page Language Classification.” <i>ACM Transactions
    on the Web</i>. Association for Computing Machinery, 2013. <a href="https://doi.org/10.1145/2435215.2435218">https://doi.org/10.1145/2435215.2435218</a>.
  ieee: E. Baykan, I. Weber, and M. H. Henzinger, “A comprehensive study of techniques
    for URL-based web page language classification,” <i>ACM Transactions on the Web</i>,
    vol. 7, no. 1. Association for Computing Machinery, 2013.
  ista: Baykan E, Weber I, Henzinger MH. 2013. A comprehensive study of techniques
    for URL-based web page language classification. ACM Transactions on the Web. 7(1),
    3.
  mla: Baykan, Eda, et al. “A Comprehensive Study of Techniques for URL-Based Web
    Page Language Classification.” <i>ACM Transactions on the Web</i>, vol. 7, no.
    1, 3, Association for Computing Machinery, 2013, doi:<a href="https://doi.org/10.1145/2435215.2435218">10.1145/2435215.2435218</a>.
  short: E. Baykan, I. Weber, M.H. Henzinger, ACM Transactions on the Web 7 (2013).
date_created: 2022-07-27T12:50:18Z
date_published: 2013-03-01T00:00:00Z
date_updated: 2022-09-12T08:51:57Z
day: '01'
doi: 10.1145/2435215.2435218
extern: '1'
intvolume: '         7'
issue: '1'
keyword:
- Computer Networks and Communications
language:
- iso: eng
month: '03'
oa_version: None
publication: ACM Transactions on the Web
publication_identifier:
  eissn:
  - 1559-114X
  issn:
  - 1559-1131
publication_status: published
publisher: Association for Computing Machinery
quality_controlled: '1'
scopus_import: '1'
status: public
title: A comprehensive study of techniques for URL-based web page language classification
type: journal_article
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
volume: 7
year: '2013'
...
---
_id: '11673'
abstract:
- lang: eng
  text: Given only the URL of a Web page, can we identify its topic? We study this
    problem in detail by exploring a large number of different feature sets and algorithms
    on several datasets. We also show that the inherent overlap between topics and
    the sparsity of the information in URLs make this a very challenging problem.
    Web page classification without a page’s content is desirable when the content
    is not available at all, when a classification is needed before obtaining the
    content, or when classification speed is of utmost importance. For our experiments
    we used five different corpora comprising a total of about 3 million (URL, classification)
    pairs. We evaluated several techniques for feature generation and classification
    algorithms. The individual binary classifiers were then combined via boosting
    into metabinary classifiers. We achieve typical F-measure values between 80 and
    85, and a typical precision of around 86. The precision can be pushed above
    90 while maintaining a typical level of recall between 30 and 40.
article_number: '15'
article_processing_charge: No
article_type: original
author:
- first_name: Eda
  full_name: Baykan, Eda
  last_name: Baykan
- first_name: Monika H
  full_name: Henzinger, Monika H
  id: 540c9bbd-f2de-11ec-812d-d04a5be85630
  last_name: Henzinger
  orcid: 0000-0002-5008-6530
- first_name: Ludmila
  full_name: Marian, Ludmila
  last_name: Marian
- first_name: Ingmar
  full_name: Weber, Ingmar
  last_name: Weber
citation:
  ama: Baykan E, Henzinger MH, Marian L, Weber I. A comprehensive study of features
    and algorithms for URL-based topic classification. <i>ACM Transactions on the
    Web</i>. 2011;5(3). doi:<a href="https://doi.org/10.1145/1993053.1993057">10.1145/1993053.1993057</a>
  apa: Baykan, E., Henzinger, M. H., Marian, L., &#38; Weber, I. (2011). A comprehensive
    study of features and algorithms for URL-based topic classification. <i>ACM Transactions
    on the Web</i>. Association for Computing Machinery. <a href="https://doi.org/10.1145/1993053.1993057">https://doi.org/10.1145/1993053.1993057</a>
  chicago: Baykan, Eda, Monika H Henzinger, Ludmila Marian, and Ingmar Weber. “A Comprehensive
    Study of Features and Algorithms for URL-Based Topic Classification.” <i>ACM Transactions
    on the Web</i>. Association for Computing Machinery, 2011. <a href="https://doi.org/10.1145/1993053.1993057">https://doi.org/10.1145/1993053.1993057</a>.
  ieee: E. Baykan, M. H. Henzinger, L. Marian, and I. Weber, “A comprehensive study
    of features and algorithms for URL-based topic classification,” <i>ACM Transactions
    on the Web</i>, vol. 5, no. 3. Association for Computing Machinery, 2011.
  ista: Baykan E, Henzinger MH, Marian L, Weber I. 2011. A comprehensive study of
    features and algorithms for URL-based topic classification. ACM Transactions on
    the Web. 5(3), 15.
  mla: Baykan, Eda, et al. “A Comprehensive Study of Features and Algorithms for URL-Based
    Topic Classification.” <i>ACM Transactions on the Web</i>, vol. 5, no. 3, 15,
    Association for Computing Machinery, 2011, doi:<a href="https://doi.org/10.1145/1993053.1993057">10.1145/1993053.1993057</a>.
  short: E. Baykan, M.H. Henzinger, L. Marian, I. Weber, ACM Transactions on the Web
    5 (2011).
date_created: 2022-07-27T13:48:11Z
date_published: 2011-07-01T00:00:00Z
date_updated: 2022-09-12T08:46:56Z
day: '01'
doi: 10.1145/1993053.1993057
extern: '1'
intvolume: '         5'
issue: '3'
keyword:
- Topic classification
- URL
- ODP
language:
- iso: eng
month: '07'
oa_version: None
publication: ACM Transactions on the Web
publication_identifier:
  eissn:
  - 1559-114X
  issn:
  - 1559-1131
publication_status: published
publisher: Association for Computing Machinery
quality_controlled: '1'
scopus_import: '1'
status: public
title: A comprehensive study of features and algorithms for URL-based topic classification
type: journal_article
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
volume: 5
year: '2011'
...
