Normalisation to terms

In the linguistic preprocessing phase, tokens are normalised into a canonical form so that variants of a particular word can be matched.

The standard approach is to create equivalence classes: all the variants (tokens) belonging to the same equivalence class are mapped to the same term.

Implicitly defined equivalence classes

  • Deleting periods to form a term
  • Deleting hyphens to form a term
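
As a minimal sketch in Python, such implicit equivalence classes can be formed with plain string replacement (so that U.S.A. and USA map to the same term, as do anti-discriminatory and antidiscriminatory):

    def to_term(token: str) -> str:
        # Deleting periods and hyphens maps surface variants onto one term.
        return token.replace(".", "").replace("-", "")

    print(to_term("U.S.A."))               # USA
    print(to_term("anti-discriminatory"))  # antidiscriminatory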

Unicode normalisation

Normalisation is language and context/user dependent, and tokenisation, normalisation and language detection are closely intertwined. Most systems support Unicode, an international standard for consistently encoding text across the world’s written languages. Because there can be many ways to represent the same character, Unicode introduces normalisation as a way to transform characters into standard canonical forms, making it possible to recognise and match equivalent representations.

Unicode supports two kinds of canonical normalisation, plus compatibility variants:

  • Normalisation Form Canonical Decomposition (NFD) in which characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order.
  • Normalisation Form Canonical Composition (NFC) in which characters are decomposed and then recomposed by canonical equivalence.
  • Variants are NFKD and NFKC, in which the decomposition step uses compatibility forms that more aggressively standardize characters.

Support for Unicode normalisation can be found in Java and Python, as well as in open-source search engines Apache Lucene and Elastic.
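
As a minimal sketch, Python's standard unicodedata module exposes these normalisation forms:

    import unicodedata

    s = "café"                                      # 'é' as one precomposed code point
    decomposed = unicodedata.normalize("NFD", s)    # 'e' followed by a combining accent
    recomposed = unicodedata.normalize("NFC", decomposed)

    print(len(s), len(decomposed), len(recomposed))  # 4 5 4
    print(s == decomposed)                           # False: different code point sequences
    print(s == recomposed)                           # True: NFC restores the precomposed form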

Removing accents and ligatures

Examples involving accents (diacritics) are Türingen versus Tueringen in German, and Résumé versus Resume in French. Accents can change the meaning of a word, but not all keyboards support accents, and some users simply do not type them. It is generally considered a good idea to remove accents as part of indexing content and processing search queries. If the search engine serves a local context, it can be worth examining the corpus and surveying the user base first to find out how they actually write such terms.

For removing accents (diacritics), either NFD or NFKD can be used to decompose accented characters into a base character plus combining marks, which can then be stripped; Unicode normalisation on its own only transforms strings into a standard form and leaves the accents in place. The following tools remove accents directly (and can also simplify ligatures, like transforming æ into ae):

  • In Java, use StringUtils.stripAccents (from Apache Commons Lang)
  • In Python, use the Unidecode module
  • In Apache Lucene, use ASCIIFoldingFilter
  • In Elastic, use asciifolding
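
A minimal sketch of accent removal using only the Python standard library (NFKD decomposition followed by dropping the combining marks); note that ligatures such as æ are not covered this way, which is what Unidecode and the ASCII-folding filters handle:

    import unicodedata

    def strip_accents(text: str) -> str:
        # Decompose accented characters into base character + combining marks,
        # then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(strip_accents("Résumé"))    # Resume
    print(strip_accents("Türingen"))  # Turingen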

The original strings must also be kept for displaying the search results and the query: that respects the original text and mitigates the risk of confusing searchers in cases where removing accents changes the meaning of a word.

Case folding

It would be unreasonable to expect proper capitalisation from searchers, hence it is often best to make everything lower case. To convert a string to lowercase:

  • In Java, use String.toLowerCase
  • In Python, use str.lower
  • Apache Lucene has a LowerCaseFilter
  • Elastic has a lowercase token filter.
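
A minimal sketch in Python; str.lower covers the common case, while str.casefold applies the more aggressive Unicode case folding (for example, German ß becomes ss):

    print("Retrieval MODELS".lower())  # retrieval models
    print("Straße".casefold())         # strasse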

Trade-off

Each of the filters used in equivalence classification trades precision for recall, and it is entirely possible that two words with different meanings get mapped to the same representation. If this poses a significant risk, evaluate the precision of the filters through human judgment.

An alternative is asymmetric expansion: take every token from the tokeniser and add it as-is as a separate term in the dictionary of the index. Then, at query time, apply query expansion: look up in a hash table the list of words that would conceptually have been mapped together had there been equivalence classification, and retrieve the postings lists of all of those terms, combined with an OR.

This is more powerful (better functionality), but less efficient (the dictionary is larger and more postings lists have to be processed per query).
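
A minimal sketch of asymmetric expansion with hypothetical in-memory data structures; every surface form is indexed as-is, and only the query side is expanded before OR-ing the postings lists:

    # Toy postings: term -> set of document ids.
    postings = {
        "window":  {1, 4},
        "windows": {2, 4, 7},
        "Windows": {3, 7},
    }

    # Hand-built expansion table: which indexed terms a query term should match.
    # Note the asymmetry: the product name "Windows" stays narrow.
    expansions = {
        "window":  {"window", "windows"},
        "windows": {"window", "windows", "Windows"},
        "Windows": {"Windows"},
    }

    def search(query_term: str) -> set[int]:
        docs = set()
        for term in expansions.get(query_term, {query_term}):
            docs |= postings.get(term, set())  # OR of the postings lists
        return docs

    print(search("windows"))  # {1, 2, 3, 4, 7}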

Handling synonyms and homonyms is a separate design choice, and can be done by:

  • Manually constructed equivalence classes
  • Rewriting to form equivalence-class terms (a document containing one member of a synonym pair is also indexed under the other, and vice versa; see the sketch after this list)
  • Query expansion
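
A minimal sketch of the rewriting approach with a tiny hand-built synonym table (the car/automobile pair is only an illustration): a document containing one member of a pair is indexed under the other as well.

    synonyms = {"car": "automobile", "automobile": "car"}

    def index_tokens(tokens):
        # Index each token, plus its synonym when one is defined.
        expanded = []
        for token in tokens:
            expanded.append(token)
            if token in synonyms:
                expanded.append(synonyms[token])
        return expanded

    print(index_tokens(["the", "automobile", "industry"]))
    # ['the', 'automobile', 'car', 'industry']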

Various studies estimate that between 10% and 15% of search queries contain misspellings, so robust spelling correction is important. Off-the-shelf spelling correction systems such as Aspell and Hunspell are highly customisable; phonetic matching with Soundex is another option, though it is not very effective in general IR systems.
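
As an illustration of the phonetic approach, a minimal sketch of a common simplified Soundex variant (keep the first letter, code the remaining letters into digits, collapse runs, drop the vowel codes, pad to four characters):

    def soundex(term: str) -> str:
        # Letter-to-digit codes; vowel-like letters are coded 0 and dropped later.
        codes = {}
        for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                               ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
            for letter in letters:
                codes[letter] = digit

        term = term.upper()
        digits = [codes.get(ch, "") for ch in term]

        # Collapse runs of identical digits, then drop the zeros.
        collapsed = []
        for d in digits:
            if not collapsed or d != collapsed[-1]:
                collapsed.append(d)
        tail = "".join(d for d in collapsed[1:] if d not in ("", "0"))

        return (term[0] + tail + "000")[:4]

    print(soundex("Herman"), soundex("Hermann"))  # H655 H655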

An inflection is a change in the form of a word (often the ending) to express a grammatical function or attribute such as tense, mood, person, number, case, and gender. Stemming and lemmatisation are out-of-the-box tools for managing inflections:

Lemmatisation

Lemmatisation relies on a lexical knowledge base like WordNet to get the correct base forms of words. A disadvantage is that it cannot handle unknown words and that it requires specifying the word’s part of speech — otherwise, it assumes the word is a noun.

  • am, are, is → be
  • car, cars, car's, cars' → car
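
A minimal sketch using NLTK's WordNet lemmatiser (assuming the nltk package and its WordNet data are installed); note how the part of speech changes the result:

    from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet") once

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("cars"))          # car (the default part of speech is noun)
    print(lemmatizer.lemmatize("are"))           # are (treated as a noun, left unchanged)
    print(lemmatizer.lemmatize("are", pos="v"))  # be  (the verb reading gives the base form)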

Stemming

Stemming reduces terms to their “roots” (quite aggressively at that). The most common stemming approach for English is the Porter stemming algorithm (the Porter stemmer): a collection of rules (heuristics) designed to reflect how English handles inflections. This can increase recall (sacrificing precision) by matching words against their other inflected forms. It is included in all major natural language processing libraries.

Summarised, the Porter stemmer uses five phases of reduction that are applied sequentially. Each phase consists of a set of commands. Some typical rules are:

  • sses → ss
  • ies → i
  • ational → ate
  • tional → tion

There are also rules sensitive to the measure m of a word (roughly, the number of vowel-consonant sequences in the stem), so that a suffix is only removed when the remaining stem is long enough:

  • (m > 1) EMENT →
    • replacement → replac
    • cement → cement
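
A minimal sketch using NLTK's implementation of the Porter stemmer (assuming the nltk package is installed):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["caresses", "ponies", "replacement", "cement"]:
        print(word, "->", stemmer.stem(word))
    # caresses -> caress, ponies -> poni, replacement -> replac, cement -> cement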

The Lovins stemmer is a single-pass, longest-suffix-removal algorithm with about 250 rules.

Effectiveness

Both stemming and lemmatisation are effective techniques for expanding recall, with lemmatisation giving up some of that recall to increase precision. For English, morphological analysis brings only modest benefits for retrieval, as it helps recall for some queries but harms precision for others:

  • operative (dentistry), operating (system), operational (research) → oper

In other languages with a richer morphological structure (more systematic in how variant forms of words are built), such as Spanish, German and Finnish, the benefits are greater; for Finnish there is even a 30% performance gain.

Rolling our own

In English we can roll our own, using the equivalence classes from Porter stemming and WordNet lemmatisation as candidates, and then use further processing (either automatic or editorial) to improve precision by estimating word similarity, as sketched below.

  • Automatic detection of word similarity can be based on corpus analysis and training a machine-learning model.
  • Editorial detection by humans can be done at the level of token-token pairs, query-query pairs (to assess the tokens in context), or query-document pairs (to assess the end-to-end effect of query expansion).
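
A minimal sketch of the candidate-generation step, assuming nltk and its WordNet data are installed: words are grouped only when both their Porter stem and their WordNet lemma agree, and the resulting groups are then handed to an automatic or editorial similarity check (not shown).

    from collections import defaultdict
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    candidates = defaultdict(set)
    for word in ["car", "cars", "operative", "operatives", "operating"]:
        key = (stemmer.stem(word), lemmatizer.lemmatize(word))
        candidates[key].add(word)

    for key, words in candidates.items():
        print(key, sorted(words))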

Using language detection (see my simpleton language guesser), words that have different meanings in different languages need not be collapsed; instead, entries in the dictionary can be suffixed with the language, for example MIT.english and mit.german.
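
A minimal sketch, assuming some language detector (such as the simpleton language guesser mentioned above) supplies the language label:

    def index_term(token: str, language: str) -> str:
        # Keep the language in the dictionary entry so that, for example, MIT (English)
        # and mit (German) are not collapsed into one term.
        return f"{token}.{language}"

    print(index_term("MIT", "english"))  # MIT.english
    print(index_term("mit", "german"))   # mit.german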