A comparison of techniques to find mirrored hosts on the WWW
Bharat K, Broder A, Dean J, Henzinger MH. 2000. A comparison of techniques to find mirrored hosts on the WWW. Journal of the American Society for Information Science. 51(12), 1114–1122.
https://doi.org/10.1002/1097-4571(2000)9999:9999<::aid-asi1025>3.0.co;2-0
[Published Version]
Journal Article | Published | English
Scopus indexed
Author
Bharat, Krishna;
Broder, Andrei;
Dean, Jeffrey;
Henzinger, Monika (ISTA)
Abstract
We compare several algorithms for identifying mirrored hosts on the World Wide Web. The algorithms operate on the basis of URL strings and linkage data: the type of information about Web pages easily available from Web proxies and crawlers. Identification of mirrored hosts can improve Web-based information retrieval in several ways. First, by identifying mirrored hosts, search engines can avoid storing and returning duplicate documents. Second, several new information retrieval techniques for the Web make inferences based on the explicit links among hypertext documents; mirroring perturbs their graph model and degrades performance. Third, mirroring information can be used to redirect users to alternate mirror sites to compensate for various failures, and can thus improve the performance of Web browsers and proxies. We evaluated four classes of “top-down” algorithms for detecting mirrored host pairs (that is, algorithms that are based on page attributes such as URL, IP address, and hyperlinks between pages, and not on the page content) on a collection of 140 million URLs (on 230,000 hosts) and their associated connectivity information. Our best approach combines five algorithms and achieved a precision of 0.57 for a recall of 0.86 considering 100,000 ranked host pairs.
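To make the evaluation terms in the abstract concrete, the following is a minimal Python sketch of the general idea: score host pairs from URL strings alone, rank them, and measure precision and recall over the top of the ranking. It is not one of the algorithms from the paper. The Jaccard overlap of URL paths, the function names, the `example` host names, and the hand-labeled `true_mirrors` set are all assumptions made for illustration.

```python
from itertools import combinations
from urllib.parse import urlsplit


def paths_by_host(urls):
    """Group the URL paths seen for each host (host names lowercased)."""
    hosts = {}
    for url in urls:
        parts = urlsplit(url)
        hosts.setdefault(parts.netloc.lower(), set()).add(parts.path or "/")
    return hosts


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_host_pairs(urls):
    """Rank host pairs by path overlap, highest first (illustrative signal only)."""
    hosts = paths_by_host(urls)
    scored = [
        (jaccard(hosts[h1], hosts[h2]), h1, h2)
        for h1, h2 in combinations(sorted(hosts), 2)
    ]
    # Drop pairs with no shared path, then sort by score with a stable tie-break.
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: (-s[0], s[1], s[2]))
    return [(h1, h2) for _, h1, h2 in scored]


def precision_recall_at_k(ranked_pairs, true_mirrors, k):
    """Precision and recall over the first k ranked pairs."""
    top = ranked_pairs[:k]
    hits = sum(1 for pair in top if frozenset(pair) in true_mirrors)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(true_mirrors) if true_mirrors else 0.0
    return precision, recall


# Hypothetical toy data: mirror.example copies a.example; other.example does not.
urls = [
    "http://a.example/docs/index.html",
    "http://a.example/docs/faq.html",
    "http://a.example/img/logo.png",
    "http://mirror.example/docs/index.html",
    "http://mirror.example/docs/faq.html",
    "http://mirror.example/img/logo.png",
    "http://other.example/docs/index.html",
    "http://other.example/news/today.html",
]
true_mirrors = {frozenset(("a.example", "mirror.example"))}

ranked = rank_host_pairs(urls)
for k in (1, 2, 3):
    print(k, precision_recall_at_k(ranked, true_mirrors, k))
```

Comparing every pair of hosts is quadratic in the number of hosts, so this sketch only suits small samples; at the scale reported in the abstract (230,000 hosts) some form of candidate pruning would be needed, which is outside what this example shows.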
Publishing Year
2000
Date Published
2000-10-01
Journal Title
Journal of the American Society for Information Science
Publisher
Wiley
Volume
51
Issue
12
Page
1114-1122
IST-REx-ID
Cite this
Bharat K, Broder A, Dean J, Henzinger MH. A comparison of techniques to find mirrored hosts on the WWW. Journal of the American Society for Information Science. 2000;51(12):1114-1122. doi:10.1002/1097-4571(2000)9999:9999<::aid-asi1025>3.0.co;2-0
Bharat, K., Broder, A., Dean, J., & Henzinger, M. H. (2000). A comparison of techniques to find mirrored hosts on the WWW. Journal of the American Society for Information Science. Wiley. https://doi.org/10.1002/1097-4571(2000)9999:9999<::aid-asi1025>3.0.co;2-0
Bharat, Krishna, Andrei Broder, Jeffrey Dean, and Monika H Henzinger. “A Comparison of Techniques to Find Mirrored Hosts on the WWW.” Journal of the American Society for Information Science. Wiley, 2000. https://doi.org/10.1002/1097-4571(2000)9999:9999<::aid-asi1025>3.0.co;2-0.
K. Bharat, A. Broder, J. Dean, and M. H. Henzinger, “A comparison of techniques to find mirrored hosts on the WWW,” Journal of the American Society for Information Science, vol. 51, no. 12. Wiley, pp. 1114–1122, 2000.
Bharat, Krishna, et al. “A Comparison of Techniques to Find Mirrored Hosts on the WWW.” Journal of the American Society for Information Science, vol. 51, no. 12, Wiley, 2000, pp. 1114–22, doi:10.1002/1097-4571(2000)9999:9999<::aid-asi1025>3.0.co;2-0.
All files available under the following license(s):
Copyright Statement:
This Item is protected by copyright and/or related rights. [...]
Link(s) to Main File(s)
Access Level
Open Access