Extracting and indexing

Parsing a document (extracting its sequence of characters) is not trivial. It starts with three classification problems:

  • What format is it in?
  • What language is it in?
  • What character set is used?


  • A collection being indexed can include documents in different languages. With a single index this is a problem, because tokenisation and linguistic preprocessing have to be done separately for each language.
  • A single document can contain multiple languages or multiple formats. How should its main language be identified?
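The character-set question above can be approached with a simple heuristic. The following is a minimal sketch, not a robust detector: it checks for a byte-order mark and then tries a short list of candidate encodings. The function name `guess_charset` and the fallback list are assumptions made for illustration.

```python
import codecs

# Byte-order marks are checked first. UTF-32 comes before UTF-16 because
# the UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_charset(raw: bytes, fallbacks=("utf-8", "cp1252")) -> str:
    """Guess an encoding name for raw bytes (hypothetical helper)."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    # No mark found: accept the first candidate that decodes without error.
    for name in fallbacks:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so it never fails,
    # but the resulting text may be wrong.
    return "latin-1"
```

A successful decode does not prove the guess is right, which is why real systems often combine such checks with metadata (for example an HTTP header) or statistical detection.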

The choice of document unit affects precision and recall:

  • A unit is most likely a single file.
  • A mailbox file may need to be split into multiple documents.
  • Attachments to a mail may need to be split off.
  • An archive or zip file may need to be extracted and split up.
  • A group of files, for example PPT or LaTeX, may need to be merged into one file.
  • Is a book a single unit, or is each chapter or paragraph a unit?
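As an illustration of one of these choices, the sketch below splits a mail message into separate indexing units: the body and each attachment. It uses Python's standard `email` package; the function name `split_mail` and the demo message are assumptions for illustration, and a real indexer would also have to handle HTML bodies, nested messages and binary attachments.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def split_mail(raw: bytes) -> list[tuple[str, object]]:
    """Return (unit_name, content) pairs: the body first, then each attachment."""
    msg = message_from_bytes(raw, policy=policy.default)
    units = []
    body = msg.get_body(preferencelist=("plain",))
    if body is not None:
        units.append(("body", body.get_content()))
    for part in msg.iter_attachments():
        # Attachments without a filename get a placeholder name.
        units.append((part.get_filename() or "unnamed", part.get_content()))
    return units


# Build a small demo message in memory (hypothetical data).
demo = EmailMessage()
demo["Subject"] = "Report"
demo.set_content("See the attached notes.")
demo.add_attachment("Quarterly notes.", filename="notes.txt")

units = split_mail(demo.as_bytes())
```

Whether the attachment should be indexed as its own document or merged with the body is exactly the kind of decision that trades precision against recall.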

Not really a pipeline

Tokenisation, normalisation and language detection are often presented as a pipeline, but are really intertwined.

  • Tokenisation converts a string of characters into a sequence of tokens. It is not as simple as splitting the text on spaces, as the split method in Java or Python does.
  • In the linguistic preprocessing phase, the tokens are normalised into a canonical form so that the variants of a particular word can be found. The language must already have been identified before any morphological processing of the data.
    • If a lexicon is used for spelling correction, the program needs to know which language-specific rules to apply.
    • Stemming and lemmatisation are out-of-the-box tools for reducing inflections. These are effective techniques to expand recall, with lemmatisation giving up some of that recall to increase precision. Their effectiveness depends on the language.
    • The language must be known for spelling correction and for processing double-quote syntax or wildcard queries.
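To make the tokenisation point concrete, the sketch below compares plain whitespace splitting with a slightly more careful regular expression. The pattern is an assumption chosen for illustration (it keeps apostrophes and hyphens inside words) and is tuned to English-like text; other languages need different rules.

```python
import re

# A word, optionally followed by apostrophe- or hyphen-joined parts.
TOKEN_RE = re.compile(r"\w+(?:['’-]\w+)*")


def naive_tokens(text: str) -> list[str]:
    """Whitespace splitting: punctuation stays glued to the words."""
    return text.split()


def regex_tokens(text: str) -> list[str]:
    """Regex tokenisation: drops surrounding punctuation, keeps inner ' and -."""
    return TOKEN_RE.findall(text)
```

For the input `"Hello, world!"`, `naive_tokens` keeps the comma and exclamation mark attached, while `regex_tokens` returns the bare words. Neither approach decides whether `state-of-the-art` should be one token or four; that is a policy choice that depends on the language and the queries expected.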


  • To identify the language, a combination of heuristics, machine learning and simply asking the user can perhaps be used.
  • With language detection, dictionary entries can be tagged with their language, instead of collapsing words that have different meanings in different languages into a single entry.
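Tagging dictionary entries with a language could look like the following minimal sketch. It assumes the language of each document has already been detected elsewhere; the function name `build_index` and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index keyed by (language, term).

    docs is an iterable of (doc_id, language, tokens) triples.
    """
    index = defaultdict(set)
    for doc_id, lang, tokens in docs:
        for token in tokens:
            # Lower-casing stands in for real normalisation here.
            index[(lang, token.lower())].add(doc_id)
    return index


# "gift" means a present in English but poison in German.
docs = [
    (1, "en", ["a", "gift", "for", "you"]),
    (2, "de", ["das", "Gift", "wirkt", "schnell"]),
]
index = build_index(docs)
```

Because the key includes the language, a query for the English term `gift` no longer retrieves the German document, at the cost of a larger dictionary and the need to know the language of the query as well.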


en/problems/psearch/indexing.txt · Last modified: 2020/03/10 14:42 by