Distributed systems are hard to design, develop and test because of uncertainties due to:
The large number of processes running in parallel
Processes that update their variables independently
Problems specific to distributed applications that existing programming languages and tools do not directly address.
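The second point above, processes updating variables independently, is the classic source of nondeterminism. A minimal sketch: two processes each perform a read-then-write increment on a shared variable, and enumerating all interleavings of their atomic steps shows that the final value depends on the schedule (the "lost update").

```python
from itertools import permutations

# Each process does two atomic steps on shared x:
# read x into a local register, then write register + 1 back.
# Enumerating interleavings shows the outcome is schedule-dependent.

def run(schedule):
    x = 0
    regs = {"A": 0, "B": 0}
    for proc, op in schedule:
        if op == "read":
            regs[proc] = x
        else:  # write
            x = regs[proc] + 1
    return x

steps = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]

# Keep only schedules where each process reads before it writes.
def valid(s):
    return (s.index(("A", "read")) < s.index(("A", "write"))
            and s.index(("B", "read")) < s.index(("B", "write")))

outcomes = {run(s) for s in set(permutations(steps)) if valid(s)}
print(sorted(outcomes))  # → [1, 2]: both a lost update and the correct result are reachable
```

Testing a distributed system means testing across such schedules, not just one.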
Decentralised crawlers need to coordinate to avoid visiting the same hashes multiple times (although some duplication is desirable for fault tolerance), and the adopted crawling policy needs to be strictly enforced. Coordinating decentralised and/or distributed crawlers can incur significant communication overhead, limiting the number of simultaneous crawlers.
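One way to get coordination without per-hash communication (an assumed scheme, not something the notes prescribe) is rendezvous (highest-random-weight) hashing: every crawler can compute, deterministically and independently, which k crawlers own a given content hash, so duplication is bounded at exactly k for fault tolerance.

```python
import hashlib

def score(crawler_id: str, cid: str) -> int:
    # Deterministic pseudo-random weight for a (crawler, CID) pair.
    h = hashlib.sha256(f"{crawler_id}:{cid}".encode()).digest()
    return int.from_bytes(h, "big")

def owners(cid: str, crawlers: list, k: int = 2) -> list:
    # The k crawlers with the highest score are responsible for this CID.
    return sorted(crawlers, key=lambda c: score(c, cid), reverse=True)[:k]

crawlers = ["crawler-1", "crawler-2", "crawler-3", "crawler-4"]
cid = "QmExampleCid"
print(owners(cid, crawlers))  # the 2 crawlers responsible for this CID
```

Because every node computes the same answer locally, no messages are needed for assignment, and when a crawler leaves, only the CIDs it owned are reassigned.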
Could heuristics or classifiers handle resource allocation in distributed systems (crawl-queue assignment and load balancing between peers)?
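As a baseline for such heuristics, a hypothetical sketch: a greedy least-loaded assignment that hands each incoming crawl task to the peer with the smallest queued work, using a heap. The task names and costs here are illustrative only.

```python
import heapq

def assign(tasks, peers):
    """Greedily assign (task, cost) pairs to the least-loaded peer."""
    heap = [(0, p) for p in peers]  # (queued work, peer id)
    heapq.heapify(heap)
    assignment = {}
    for task, cost in tasks:
        load, peer = heapq.heappop(heap)
        assignment[task] = peer
        heapq.heappush(heap, (load + cost, peer))
    return assignment

tasks = [("cid-a", 3), ("cid-b", 1), ("cid-c", 2), ("cid-d", 1)]
print(assign(tasks, ["peer-1", "peer-2"]))
```

A learned classifier could replace the uniform cost estimate with a predicted cost per hash, keeping the same assignment loop.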
With performant decentralised full-text search algorithms, we can move to fully decentralised search, where nodes interact with IPFS directly when searching. To make the index horizontally scalable, the search engine will need to infer knowledge quickly: an efficient, scalable and distributed execution pipeline for clustering. The clustering can perhaps be achieved via a fuzzy similarity relation obtained as the transitive closure of a proximity relation.
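The last idea can be sketched concretely: given a proximity relation R (reflexive and symmetric, with membership degrees in [0, 1]), repeated max-min composition yields its transitive closure T, a fuzzy similarity relation; cutting T at a level alpha then gives a crisp partition into clusters. The matrix below is illustrative data only, and this single-machine version ignores the distribution of the pipeline.

```python
def maxmin_compose(A, B):
    # Fuzzy max-min composition of two square relation matrices.
    n = len(A)
    return [[max(min(A[i][k], B[k][j]) for k in range(n))
             for j in range(n)] for i in range(n)]

def transitive_closure(R):
    # Iterate T := T o T until a fixpoint; converges for reflexive R.
    T = R
    while True:
        T2 = maxmin_compose(T, T)
        if T2 == T:
            return T
        T = T2

def clusters(T, alpha):
    # An alpha-cut of a fuzzy similarity relation is an equivalence
    # relation, so grouping rows by T[i][j] >= alpha yields a partition.
    n = len(T)
    groups, seen = [], set()
    for i in range(n):
        if i in seen:
            continue
        g = [j for j in range(n) if T[i][j] >= alpha]
        seen.update(g)
        groups.append(g)
    return groups

R = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.4],
    [0.0, 0.4, 1.0],
]
T = transitive_closure(R)
print(clusters(T, alpha=0.5))  # → [[0, 1], [2]]
```

The max-min composition is a matrix-product-shaped operation, so it is a plausible fit for a distributed execution pipeline, though the closure iteration itself would need care at scale.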
And what if a dweb solution solving most of these problems already exists, and we can use it for a distributed index?