---
_id: '11905'
abstract:
- lang: eng
  text: Given only the URL of a web page, can we identify its topic? This is the question
    that we examine in this paper. Usually, web pages are classified using their content,
    but a URL-only classifier is preferable, (i) when speed is crucial, (ii) to enable
    content filtering before an (objectionable) web page is downloaded, (iii) when
    a page's content is hidden in images, (iv) to annotate hyperlinks in a personalized
    web browser, without fetching the target page, and (v) when a focused crawler
    wants to infer the topic of a target page before devoting bandwidth to download
    it. We apply a machine learning approach to the topic identification task and
    evaluate its performance in extensive experiments on categorized web pages from
    the Open Directory Project (ODP). When training separate binary classifiers for
    each topic, we achieve typical F-measure values between 80 and 85, and a typical
    precision of around 85. We also ran experiments on a small data set of university
    web pages. For the task of classifying these pages into faculty, student, course
    and project pages, our methods improve over previous approaches by 13.8 points
    of F-measure.
article_processing_charge: No
author:
- first_name: Eda
  full_name: Baykan, Eda
  last_name: Baykan
- first_name: Monika H
  full_name: Henzinger, Monika H
  id: 540c9bbd-f2de-11ec-812d-d04a5be85630
  last_name: Henzinger
  orcid: 0000-0002-5008-6530
- first_name: Ludmila
  full_name: Marian, Ludmila
  last_name: Marian
- first_name: Ingmar
  full_name: Weber, Ingmar
  last_name: Weber
citation:
  ama: 'Baykan E, Henzinger MH, Marian L, Weber I. Purely URL-based topic classification.
    In: <i>18th International World Wide Web Conference</i>. Association for Computing
    Machinery; 2009:1109-1110. doi:<a href="https://doi.org/10.1145/1526709.1526880">10.1145/1526709.1526880</a>'
  apa: 'Baykan, E., Henzinger, M. H., Marian, L., &#38; Weber, I. (2009). Purely URL-based
    topic classification. In <i>18th International World Wide Web Conference</i> (pp.
    1109–1110). New York, NY, United States: Association for Computing Machinery.
    <a href="https://doi.org/10.1145/1526709.1526880">https://doi.org/10.1145/1526709.1526880</a>'
  chicago: Baykan, Eda, Monika H Henzinger, Ludmila Marian, and Ingmar Weber. “Purely
    URL-Based Topic Classification.” In <i>18th International World Wide Web Conference</i>,
    1109–10. Association for Computing Machinery, 2009. <a href="https://doi.org/10.1145/1526709.1526880">https://doi.org/10.1145/1526709.1526880</a>.
  ieee: E. Baykan, M. H. Henzinger, L. Marian, and I. Weber, “Purely URL-based topic
    classification,” in <i>18th International World Wide Web Conference</i>, New York,
    NY, United States, 2009, pp. 1109–1110.
  ista: 'Baykan E, Henzinger MH, Marian L, Weber I. 2009. Purely URL-based topic classification.
    18th International World Wide Web Conference. WWW: Conference on World Wide Web,
    1109–1110.'
  mla: Baykan, Eda, et al. “Purely URL-Based Topic Classification.” <i>18th International
    World Wide Web Conference</i>, Association for Computing Machinery, 2009, pp.
    1109–10, doi:<a href="https://doi.org/10.1145/1526709.1526880">10.1145/1526709.1526880</a>.
  short: E. Baykan, M.H. Henzinger, L. Marian, I. Weber, in:, 18th International World
    Wide Web Conference, Association for Computing Machinery, 2009, pp. 1109–1110.
conference:
  end_date: 2009-04-24
  location: New York, NY, United States
  name: 'WWW: Conference on World Wide Web'
  start_date: 2009-04-20
date_created: 2022-08-17T11:49:53Z
date_published: 2009-04-01T00:00:00Z
date_updated: 2023-02-17T14:54:56Z
day: '01'
doi: 10.1145/1526709.1526880
extern: '1'
language:
- iso: eng
month: '04'
oa_version: None
page: 1109-1110
publication: 18th International World Wide Web Conference
publication_identifier:
  isbn:
  - 978-1-60558-487-4
publication_status: published
publisher: Association for Computing Machinery
quality_controlled: '1'
scopus_import: '1'
status: public
title: Purely URL-based topic classification
type: conference
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
year: '2009'
...
