(Near) duplication

By some estimates, 40% of the pages on the web are duplicates of other pages, and duplicate hosts are the single largest source of duplicate pages. Many of the duplicate pages are legitimate copies, such as information repositories mirrored for redundancy and access reliability. Duplication increases the time needed to crawl and does not contribute new information to search results.

Documents

Documents that are exact duplicates of each other (for example through mirroring or plagiarism) are easy to detect with checksum techniques. Near-duplicate documents have essentially the same main content but differ in details such as ads, counters, and/or timestamps.
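The checksum approach to exact duplicates can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the choice of SHA-256 and the names `content_checksum` and `group_exact_duplicates` are assumptions made for this example.

```python
import hashlib
from typing import Dict, List


def content_checksum(text: str) -> str:
    """Return a hex digest of the document text (SHA-256 is an assumed choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def group_exact_duplicates(docs: Dict[str, str]) -> List[List[str]]:
    """Group document ids whose contents produce the same checksum.

    `docs` maps a hypothetical document id (for example a URL) to its text.
    Only groups with more than one member are returned.
    """
    by_digest: Dict[str, List[str]] = {}
    for doc_id, text in docs.items():
        by_digest.setdefault(content_checksum(text), []).append(doc_id)
    return [ids for ids in by_digest.values() if len(ids) > 1]
```

A single changed character, such as a counter or timestamp, produces a different checksum, which is why this technique does not catch near duplicates.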

  • (Near) document duplication is a well-studied problem and many algorithms exist.
  • If N-grams are already being generated for language detection, the Jaccard coefficient can possibly be used. Two documents (hosts) are considered near duplicates if the sets of shingles (contiguous N-grams) generated from them are nearly the same, that is, if their Jaccard coefficient reaches a preset threshold of, say, 0.9.
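The shingle comparison described above could look roughly like this. It is a sketch under stated assumptions: word-level shingles of length 3, whitespace tokenization, and the helper names `shingles`, `jaccard` and `is_near_duplicate` are all choices made for this example; only the 0.9 threshold comes from the text.

```python
from typing import Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of k-word shingles of a text (k=3 is an assumption)."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle holding all words, or nothing.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard coefficient: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a: str, text_b: str,
                      k: int = 3, threshold: float = 0.9) -> bool:
    """True if the shingle sets reach the preset Jaccard threshold."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold
```

Comparing every pair of documents this way grows quadratically with the collection size, so a real crawler would likely need some form of sketching or indexing on top of it.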

Hosts

Combining mirror detection with individual-page detection could provide a complete solution to the problem, but it is costly. A 'good enough' variant that detects duplicate hosts can reap most of the benefits while requiring fewer computational resources. Problems:

  • A host is merely a name in the domain name system, and duplicate hosts (duphosts) can arise from two DNS names resolving to the same IP address, to the same webserver, or to different computers that serve the same content for the two hostnames in question. ⇒ Identical IP addresses are neither necessary nor 'good enough' to identify duplicate hosts.
  • Virtual hosting can result in different sites sharing an IP address, and round-robin DNS can result in a single site having multiple IP addresses.
  • Text-based approaches that look at the content of a small part of a site are also not 'good enough'. A page can differ between two subsequent viewings if it contains dynamic content. Moreover, many unrelated sites on the web have an identical page, such as the “under construction” page, which generates a lot of false positives.
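The first two problems can be illustrated with a small sketch of why comparing IP addresses gives both false negatives and false positives. The hostnames, the addresses (taken from the reserved documentation ranges) and the function `share_ip` are all hypothetical; no real DNS lookups are performed.

```python
from typing import Dict, Set

# Hypothetical, hand-written resolution results for illustration only.
RESOLVED: Dict[str, Set[str]] = {
    # One site behind round-robin DNS: several addresses for one name.
    "www.example.com": {"192.0.2.10", "192.0.2.11"},
    # A mirror of the same content served from a different machine.
    "mirror.example.org": {"198.51.100.7"},
    # Two unrelated sites on one virtual-hosting server.
    "shop.example.net": {"203.0.113.5"},
    "blog.example.net": {"203.0.113.5"},
}


def share_ip(host_a: str, host_b: str, table: Dict[str, Set[str]]) -> bool:
    """True if the two hostnames have at least one IP address in common."""
    return bool(table[host_a] & table[host_b])


# False negative: the mirror is a duplicate host but shares no address.
mirror_detected = share_ip("www.example.com", "mirror.example.org", RESOLVED)

# False positive: unrelated virtual hosts share an address.
unrelated_flagged = share_ip("shop.example.net", "blog.example.net", RESOLVED)
```

Here `mirror_detected` is `False` although the hosts are duplicates, and `unrelated_flagged` is `True` although the sites are unrelated.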

en/problems/search/duplication.txt · Last modified: 2020/03/09 16:07 by Digital Dot