
(Near) duplication

Just like on the web, (near) duplication is a problem. The dweb corpora will likely also contain multiple (near) copies of the same content. Duplicates contribute no new information to search results.

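  • Documents that are exact duplicates of each other (mirroring and plagiarism) are easy to detect: exact duplicates have the same hash, provided they are hashed with the same hashing algorithm.
  • Near duplicates are harder, because even a small change produces a different hash. Here we can perhaps use semantic knowledge.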
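
As a minimal sketch of the exact-duplicate case above, the following groups documents by a content hash. The choice of SHA-256, the helper names `content_hash` and `group_exact_duplicates`, and the sample documents are assumptions made for illustration; the page does not prescribe a particular algorithm.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def content_hash(text: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def group_exact_duplicates(docs: Dict[str, str]) -> List[List[str]]:
    """Group document ids whose content hashes are identical."""
    by_hash = defaultdict(list)
    for doc_id, text in docs.items():
        by_hash[content_hash(text)].append(doc_id)
    # Keep only hashes shared by more than one document.
    return [sorted(ids) for ids in by_hash.values() if len(ids) > 1]


# Hypothetical sample corpus: "a" and "b" are exact copies,
# "c" differs by a single character.
docs = {
    "a": "The quick brown fox.",
    "b": "The quick brown fox.",
    "c": "The quick brown fox!",
}
duplicates = group_exact_duplicates(docs)
```

Here `a` and `b` end up in the same group, while `c` does not, although it differs by only one character. This is why hashing alone does not help with near duplicates.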

en/problems/dsearch/duplication.txt · Last modified: 2020/03/09 10:43 by Digital Dot