Just like on the web, (near) duplication is a problem. The dweb corpora will likely also contain multiple (near) copies of the same content. Duplicates contribute no new information to search results.
Documents that are exact duplicates of each other (for example through mirroring or plagiarism) are easy to detect: exact duplicates have the same hash, provided they are hashed with the same hashing algorithm.
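
As a minimal sketch of exact-duplicate detection by hashing, the snippet below groups documents whose raw bytes are identical. The function names (`content_hash`, `group_exact_duplicates`), the choice of SHA-256, and the in-memory mapping of ids to bytes are assumptions made for illustration, not part of any existing dweb search implementation.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def group_exact_duplicates(docs: Dict[str, bytes]) -> List[List[str]]:
    """Group document ids (hypothetical keys) with byte-identical content."""
    by_hash = defaultdict(list)
    for doc_id, data in docs.items():
        by_hash[content_hash(data)].append(doc_id)
    # Only digests shared by more than one document indicate duplicates.
    return [ids for ids in by_hash.values() if len(ids) > 1]
```

Calling `group_exact_duplicates({"a": b"hello", "b": b"hello", "c": b"hello "})` groups `a` and `b` together and leaves `c` out, because a single trailing space changes the digest. This is why hashing alone catches only exact duplicates and not near duplicates.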