By some estimates, as many as 40% of the pages on the web are duplicates of other pages, and duplicate hosts are the single largest source of duplicate pages. Many of these duplicates are legitimate copies: information repositories mirrored for redundancy and access reliability. Duplication lengthens the crawl while contributing no new information to search results.
Documents that are exact duplicates of each other (whether from mirroring or plagiarism) are easy to detect with checksum techniques. Near-duplicate documents are identical in content but differ in small portions of the page, for example in ads, counters, or timestamps.
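As a minimal sketch of the checksum approach (the example URLs, page bodies, and the choice of SHA-1 are illustrative assumptions, not from the source), exact-duplicate detection amounts to grouping pages by a digest of their raw bytes. Note what it misses: the near-duplicate page that differs only by an ad lands in its own group.

import hashlib

def page_checksum(body: bytes) -> str:
    # Identical byte sequences always yield identical digests.
    return hashlib.sha1(body).hexdigest()

def find_exact_duplicates(pages: dict[str, bytes]) -> list[list[str]]:
    # Group URLs whose bodies are byte-for-byte identical.
    groups: dict[str, list[str]] = {}
    for url, body in pages.items():
        groups.setdefault(page_checksum(body), []).append(url)
    return [urls for urls in groups.values() if len(urls) > 1]

pages = {
    "http://a.example/p": b"<html>same content</html>",
    "http://b.example/p": b"<html>same content</html>",
    "http://c.example/q": b"<html>same content, new ad</html>",  # near-duplicate: not caught
}
print(find_exact_duplicates(pages))
# [['http://a.example/p', 'http://b.example/p']]

Catching the near-duplicate case requires content-similarity techniques (e.g., shingling) rather than exact checksums.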
Mirror detection combined with individual-page duplicate detection would provide a complete solution to the problem, but it is costly. A 'good enough' variant that detects duplicate hosts can reap most of the benefits while requiring far fewer computational resources. Problems: