(Near) duplication

Just like on the web, (near) duplication is a problem on the dweb. Dweb corpora will likely also contain multiple (near) copies of the same content. Duplicates do not contribute new information to search results.

  • Documents that are exact duplicates of each other (mirroring and plagiarism) are easy to detect: exact duplicates have the same hash, provided they are hashed with the same hashing algorithm.
  • Near duplicates do not share a hash. To detect them, we can perhaps use semantic knowledge of the content.
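
The exact-duplicate case can be sketched as follows. This is only a minimal illustration, assuming documents are available as raw bytes keyed by some identifier. The choice of SHA-256 and the names content_hash and group_exact_duplicates are assumptions made for this example, not part of any existing dweb tooling.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def group_exact_duplicates(docs: Dict[str, bytes]) -> List[List[str]]:
    """Group document ids whose content is byte-for-byte identical.

    `docs` maps a (hypothetical) document id to its raw content.
    """
    by_hash: Dict[str, List[str]] = defaultdict(list)
    for doc_id, data in docs.items():
        by_hash[content_hash(data)].append(doc_id)
    # Only hashes shared by two or more documents indicate duplicates.
    return [sorted(ids) for ids in by_hash.values() if len(ids) > 1]


if __name__ == "__main__":
    docs = {
        "a": b"hello dweb",
        "b": b"hello dweb",   # exact copy of "a"
        "c": b"hello dweb!",  # near copy: one extra byte
    }
    print(group_exact_duplicates(docs))
```

Note that the near copy "c" ends up with a different hash, so a plain content hash like this only catches exact duplicates.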

Last modified: 2020/03/09 10:43