The subtleties of ElasticSearch synonyms


Introduction

The synonyms feature in ElasticSearch is very powerful, but just like most ElasticSearch features, it hides complexities and subtleties that might be hard to understand at first glance.

Synonyms bridge the gap between related concepts and ideas (e.g. “England”, “UK”) or between slightly different vocabulary usage (like British English “colour” vs. American English “color”) in documents and queries, a gap that cannot be closed with token filters like stemmers or with fuzzy queries.

Synonym filters are part of the analysis process that converts input text into searchable tokens. Since this process is highly domain-specific, users need to provide their own appropriate rules.

In ElasticSearch, synonyms can be defined at the index level or at the query level. Defining synonyms at the index level allows them to be applied to all queries against that index, while defining synonyms at the query level allows for more targeted synonym expansion for specific queries. These two approaches are not equivalent and they might cause some confusion to developers.

In this post we will explain how such an apparently simple feature can hide quite a few complexities that are hard to sort out if you’re not paying enough attention to the results or to the documentation (any reference to the author of this post is purely coincidental 😅).

All examples are tested on the current Elasticsearch version [1].

Set the stage

First, let’s create an index to store our favorite Star Wars quotes, using the synonym token filter with a list of synonyms.

PUT /starwars
{
  "settings": {
    "number_of_shards": 1,
    "index": {
      "analysis": {
        "analyzer": {
          "index_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "stop",
              "snowball",
              "my_synonym_filter"
            ]
          }
        },
        "filter": {
          "my_synonym_filter": {
            "type": "synonym",
            "synonyms": [
              "father,dad",
              "machine=>droid"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "quote": {
        "type": "text",
        "analyzer": "index_analyzer"
      }
    }
  }
}

As you can see, we’ve defined a custom analyzer (index_analyzer) that uses a custom token filter (my_synonym_filter) for the starwars index. This custom filter has the synonym type, and we’ve explicitly provided a list of synonyms with the synonyms option. The synonyms used in this post are written in the Solr format:

  • father,dad means that those two words are equivalent because, by default, the expand parameter is set to true. Searching for father will also return matches for dad and vice versa. It is equivalent to the explicit mapping father,dad=>father,dad
  • machine=>droid means that the tokens on the left-hand side of => are replaced with the tokens on the right-hand side.

Finally, in the mappings for the document, the custom analyzer is specified for the quote field.
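
The two rule forms described above can be illustrated with a small Python sketch. It is a toy model, not the Elasticsearch implementation: the helper names parse_rules and apply_synonyms are made up for this example, only single-word synonyms are handled, and the order of the emitted tokens is not meant to mirror the real filter.

```python
def parse_rules(rules):
    """Turn Solr-style rules into a {input_token: [output_tokens]} mapping."""
    mapping = {}
    for rule in rules:
        if "=>" in rule:
            # Explicit mapping: left-hand tokens are replaced by the right-hand ones.
            left, right = rule.split("=>")
            outputs = [t.strip() for t in right.split(",")]
            for token in left.split(","):
                mapping[token.strip()] = outputs
        else:
            # Equivalent synonyms (expand=true): every token maps to the whole group.
            group = [t.strip() for t in rule.split(",")]
            for token in group:
                mapping[token] = group
    return mapping


def apply_synonyms(tokens, mapping):
    """Return, for each input token, the list of tokens emitted at its position."""
    return [mapping.get(token, [token]) for token in tokens]


mapping = parse_rules(["father,dad", "machine=>droid"])
print(apply_synonyms(["dad"], mapping))      # "dad" is expanded to both words
print(apply_synonyms(["machine"], mapping))  # "machine" is replaced by "droid"
```

The expansion case keeps the original token in the output, while the explicit mapping drops it. That difference is what the rest of the post revolves around.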

To test the newly created analyzer, we can call the analyze endpoint, which performs analysis on a text string and returns the resulting tokens:

GET /starwars/_analyze
{
  "analyzer": "index_analyzer",
  "text": "dad"
}

We can see that the token for “dad” is expanded with the “father” synonym.

{
  "tokens": [
    {
      "token": "dad",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "father",
      "start_offset": 0,
      "end_offset": 3,
      "type": "SYNONYM",
      "position": 0
    }
  ]
}

Since the output of the analysis process consists of both tokens, ElasticSearch knows that it should return matches for both those words.

Now let’s call the _analyze endpoint with “machine”:

GET /starwars/_analyze
{
  "analyzer": "index_analyzer",
  "text": "machine"
}

This is the result:

{
  "tokens": [
    {
      "token": "droid",
      "start_offset": 0,
      "end_offset": 7,
      "type": "SYNONYM",
      "position": 0
    }
  ]
}

As expected, the “machine” token is replaced by “droid” rather than kept alongside it, unlike in the previous example.

Let’s try to index some documents:

PUT /starwars/_doc/1
{
  "quote": "These are not the droids you are looking for."
}

PUT /starwars/_doc/2
{
  "quote": "Luke I'm your father."
}

PUT /starwars/_doc/3
{
  "quote": "I am a Jedi, like my dad before me."
}

Now we can perform a simple match search to see if the results match (no pun intended) our expectations. Let’s start with “dad”:

GET /starwars/_search
{
  "query": {
    "match": {
      "quote": "dad"
    }
  }
}

These are the results:

[
  {
    "_index": "starwars",
    "_id": "2",
    "_score": 0.70453453,
    "_source": {
      "quote": "Luke I'm your father."
    }
  },
  {
    "_index": "starwars",
    "_id": "3",
    "_score": 0.5791807,
    "_source": {
      "quote": "I am a Jedi, like my dad before me."
    }
  }
]

As you can see, both documents are returned, even though the first one does not contain “dad”. Synonyms are working! Running the same query with “father” returns exactly the same results, with exactly the same scores [2].

Now let’s perform searches for “droids” and “machine”. As expected, in both cases the document with id 1 is returned, with the same score.

{
  "_index": "starwars",
  "_id": "1",
  "_score": 1.2146692,
  "_source": {
    "quote": "These are not the droids you are looking for."
  }
}

⚠️ First subtlety: forgetting that, by default, the synonym analyzer is used for both indexing and searching

Intuitively, with the machine=>droid mapping defined above and the result of the searches, you would expect that, at query time, whenever a user searches for “machine”, the token is replaced with “droid”. On the other hand, if the user searches for “droid”, no replacement is performed. This is true, but you shouldn’t forget that, by default, the same analyzer is also used during indexing, meaning that “machine” is never stored in the inverted index because it is replaced by “droid”.

We can try to add another document to our index to better understand the problem.

PUT /starwars/_doc/4
{
  "quote": "He's more machine now than man. Twisted and evil."
}

Again we can perform searches for “droids” and “machine”. In both cases we have the same results:

[
  {
    "_index": "starwars",
    "_id": "1",
    "_score": 0.88044095,
    "_source": {
      "quote": "These are not the droids you are looking for."
    }
  },
  {
    "_index": "starwars",
    "_id": "4",
    "_score": 0.62191015,
    "_source": {
      "quote": "He's more machine now than man. Twisted and evil."
    }
  }
]

The explicit mapping machine=>droid behaves as if the two words were equivalent synonyms. This holds whenever the same synonym analyzer is used for both indexing and searching.

If it didn’t behave this way, then documents containing “machine” would actually be invisible to any “machine” search. Definitely not the behavior you would expect, is it?
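
To make this concrete, here is a minimal Python sketch of a toy inverted index that applies the same rule at index time and at search time. It is only an illustration under simplifying assumptions: the analyze, build_index and search helpers are made up, the plural stripping is a crude stand-in for the snowball stemmer, and stop words are not removed.

```python
import re

RULES = {"machine": ["droid"]}  # toy version of machine=>droid


def analyze(text, rules):
    """Toy analyzer: lowercase, split on non-letters, strip plurals, apply rules."""
    tokens = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word.endswith("s") and len(word) > 3:
            word = word[:-1]  # crude stand-in for the snowball stemmer
        tokens.extend(rules.get(word, [word]))
    return tokens


def build_index(docs, rules):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text, rules):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query, rules):
    """Return the ids of documents containing any analyzed query token."""
    hits = set()
    for token in analyze(query, rules):
        hits |= index.get(token, set())
    return sorted(hits)


docs = {
    1: "These are not the droids you are looking for.",
    4: "He's more machine now than man. Twisted and evil.",
}
index = build_index(docs, RULES)         # rules applied at index time...
print("machine" in index)                # False: only "droid" is stored
print(search(index, "machine", RULES))   # ...and at search time: [1, 4]
print(search(index, "droids", RULES))    # [1, 4]
```

Because the replacement happens on both sides, the word “machine” never reaches the index or the query, so both searches end up looking for “droid”.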

For this exact reason, adding a mapping such as machine=>machine,droid to an analyzer that is used for both indexing and searching is a mistake. We can see why by reindexing the same documents with that rule and searching again.

Querying for “droids” returns these results:

[
  {
    "_index": "starwars",
    "_id": "1",
    "_score": 0.888969,
    "_source": {
        "quote": "These are not the droids you are looking for."
    }
  },
  {
    "_index": "starwars",
    "_id": "4",
    "_score": 0.6333549,
    "_source": {
        "quote": "He's more machine now than man. Twisted and evil."
    }
  }
]

Querying for “machine” returns these results:

{
  "_index": "starwars",
  "_id": "4",
  "_score": 0.89498913,
  "_source": {
      "quote": "He's more machine now than man. Twisted and evil."
  }
},
{
  "_index": "starwars",
  "_id": "1",
  "_score": 0.888969,
  "_source": {
      "quote": "These are not the droids you are looking for."
  }
}

What’s going on? In the second case, the score of the document with id 1 is unchanged, but the score of the document with id 4 is higher, so it is returned as the first result. This means that the term statistics used for scoring are affected by the synonym mapping.

We can confirm this issue by using the explain parameter of the search api, which returns detailed information about score computation as part of a hit.

Note: in the following snippets I omitted the parts that are identical in both responses.

For “droids” this is the result

{
  "_explanation": {
    "value": 0.6333549,
    "description": "weight(quote:droid in 3) [PerFieldSimilarity], result of:",
    "details": [
      {
        "value": 0.6333549,
        "description": "score(freq=1.0), computed as boost * idf * tf from:",
        "details": [
          ...,
          {
            "value": 0.41533542,
            "description": "tf, computed as freq \/ (freq + k1 * (1 - b + b * dl \/ avgdl)) from:",
            "details": [
              {
                "value": 1,
                "description": "freq, occurrences of term within document",
                "details": []
              },
              {
                "value": 1.2,
                "description": "k1, term saturation parameter",
                "details": []
              },
              {
                "value": 0.75,
                "description": "b, length normalization parameter",
                "details": []
              },
              {
                "value": 8,
                "description": "dl, length of field",
                "details": []
              },
              {
                "value": 6.5,
                "description": "avgdl, average length of field",
                "details": []
              }
            ]
          }
        ]
      }
    ]
  }
}

and for “machine”:

{
  "_explanation": {
    "value": 0.89498913,
    "description": "weight(Synonym(quote:droid quote:machin) in 3) [PerFieldSimilarity], result of:",
    "details": [
      {
        "value": 0.89498913,
        "description": "score(freq=2.0), computed as boost * idf * tf from:",
        "details": [
          ...,
          {
            "value": 0.5869074,
            "description": "tf, computed as freq \/ (freq + k1 * (1 - b + b * dl \/ avgdl)) from:",
            "details": [
              {
                "value": 2,
                "description": "termFreq=2.0",
                "details": []
              },
              {
                "value": 1.2,
                "description": "k1, term saturation parameter",
                "details": []
              },
              {
                "value": 0.75,
                "description": "b, length normalization parameter",
                "details": []
              },
              {
                "value": 8,
                "description": "dl, length of field",
                "details": []
              },
              {
                "value": 6.5,
                "description": "avgdl, average length of field",
                "details": []
              }
            ]
          }
        ]
      }
    ]
  }
}

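You can clearly see that the term frequency is affected. For “droids” the freq value is 1, but for “machine” the query becomes Synonym(quote:droid quote:machin), which matches both tokens indexed for the same word in the document with id 4, so the freq value is 2.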
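
The effect on the tf component can be checked by plugging the values from the two explain outputs into the formula shown in their description field. The short Python sketch below does just that; the function name bm25_tf is made up for this example.

```python
def bm25_tf(freq, k1, b, dl, avgdl):
    """tf component, as given in the explain output:
    freq / (freq + k1 * (1 - b + b * dl / avgdl))"""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


# Values copied from the explain outputs above.
params = {"k1": 1.2, "b": 0.75, "dl": 8, "avgdl": 6.5}

tf_droids = bm25_tf(1, **params)   # explain output reports 0.41533542
tf_machine = bm25_tf(2, **params)  # explain output reports 0.5869074
print(tf_droids, tf_machine)
```

Everything except freq is identical in the two calls, so the higher score for the “machine” query comes entirely from the doubled term frequency.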

For this reason the above can be listed as one of the disadvantages of index-time synonyms, along with the following:

  • The index might get bigger, because all synonyms must be indexed.
  • Search scoring, which relies on term statistics, might suffer because synonyms are also counted, and the statistics for less common words become skewed.
  • Synonym rules can’t be changed for existing documents without reindexing.

Using synonyms in search-time analyzers on the other hand doesn’t have many of the above mentioned problems:

  • The index size is unaffected.
  • The term statistics in the corpus stay the same.
  • Changes in the synonym rules don’t require reindexing of documents.

For these reasons, the advantages of using synonyms at search time usually outweigh any slight performance gain you might get from using them at index time.

Apply synonyms at search time

Let’s rebuild the index, but this time we’re going to define a separate analyzer just for the search process.

PUT /starwars
{
  "settings": {
    "number_of_shards": 1,
    "index": {
      "analysis": {
        "analyzer": {
          "index_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "stop",
              "snowball"
            ]
          },
          "search_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "stop",
              "snowball",
              "my_synonym_filter"
            ]
          }
        },
        "filter": {
          "my_synonym_filter": {
            "type": "synonym",
            "synonyms": [
              "father,dad",
              "machine=>droid"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "quote": {
        "type": "text",
        "analyzer": "index_analyzer",
        "search_analyzer": "search_analyzer"
      }
    }
  }
}

As you can see, we moved the synonym filter from the index-time analyzer to the search-time analyzer. The search_analyzer is also specified for the quote field explicitly.

Using the mappings above, we can see that querying for “father” or “dad” returns both documents:

[
  {
    "_index": "starwars",
    "_id": "2",
    "_score": 1.3751924,
    "_source": {
      "quote": "Luke I'm your father."
    }
  },
  {
    "_index": "starwars",
    "_id": "3",
    "_score": 1.0378369,
    "_source": {
      "quote": "I am a Jedi, like my dad before me."
    }
  }
]

but querying for “machine” or “droid” returns only the document with id 1:

{
  "_index": "starwars",
  "_id": "1",
  "_score": 1.496831,
  "_source": {
    "quote": "These are not the droids you are looking for."
  }
}

The document with id 4 is in fact invisible to “machine” searches. For this reason, in this case it makes perfect sense to use the mapping machine=>machine,droid, so that the document with id 4 is returned as well:

[
  {
    "_index": "starwars",
    "_id": "1",
    "_score": 1.496831,
    "_source": {
      "quote": "These are not the droids you are looking for."
    }
  },
  {
    "_index": "starwars",
    "_id": "4",
    "_score": 1.0378369,
    "_source": {
      "quote": "He's more machine now than man. Twisted and evil."
    }
  }
]

To further confirm the correct behavior, we can again run the search with explain. It shows correctly that the Term Frequency is 1 in both cases. The scoring difference is simply due to the different length of the fields.
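
The search-time behavior described in this section can be reproduced with a toy inverted index in Python. This is only a sketch under simplifying assumptions: the analyze and search helpers are made up, the plural stripping is a crude stand-in for the snowball stemmer, and synonym rules are passed only when analyzing the query.

```python
import re


def analyze(text, rules=None):
    """Toy analyzer: lowercase, split on non-letters, strip plurals,
    then apply synonym rules if any are given (search time only here)."""
    rules = rules or {}
    tokens = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word.endswith("s") and len(word) > 3:
            word = word[:-1]  # crude stand-in for the snowball stemmer
        tokens.extend(rules.get(word, [word]))
    return tokens


docs = {
    1: "These are not the droids you are looking for.",
    4: "He's more machine now than man. Twisted and evil.",
}

# Index time: no synonym rules, so "machine" is stored as it is.
index = {}
for doc_id, text in docs.items():
    for token in analyze(text):
        index.setdefault(token, set()).add(doc_id)


def search(query, rules):
    """Return the ids of documents containing any analyzed query token."""
    hits = set()
    for token in analyze(query, rules):
        hits |= index.get(token, set())
    return sorted(hits)


print(search("machine", {"machine": ["droid"]}))             # [1]
print(search("machine", {"machine": ["machine", "droid"]}))  # [1, 4]
```

With the replacement rule the query only ever looks for “droid”, while the indexed document still contains “machine”. Keeping “machine” on the right-hand side of the rule makes both documents reachable again.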

⚠️ Second subtlety: the synonym token filter does not handle multi-word synonyms correctly

Under the hood, when a tokenizer converts a text into a stream of tokens, it also records the following:

  • The position of each token in the stream
  • The positionLength, the number of positions that a token spans

In fact, Lucene’s TokenStreams are actually graphs. The text is broken down into nodes and arcs, where each node is a position and each arc is a token. Assuming that stop words are preserved, the phrase “Luke I am your father”, when viewed as a graph looks like this:

flowchart LR
  id1((1)) -- luke --> id2((2)) -- i --> id3((3)) -- am --> id4((4)) -- your --> id5((5)) -- father --> id6((6))

Adding the dad synonym changes the DAG as follows:

flowchart LR
  id1((1)) -- luke --> id2((2)) -- i --> id3((3)) -- am --> id4((4)) -- your --> id5((5)) -- father --> id6((6))
  id5((5)) -- dad --> id6((6))

With single-word synonyms this is not an issue.

Let’s consider another example, this time with a multi-word synonym: “The ATM does not work”

flowchart LR
  id1((1)) -- the --> id2((2)) -- atm --> id3((3)) -- does --> id4((4)) -- not --> id5((5)) -- work --> id6((6))

Now let’s add the synonym atm=>automatic teller machine. We would expect the DAG to become the following:

flowchart LR
  id1((1)) -- the --> id2((2)) -- atm --> id3((3))
  id2((2)) -- automatic teller machine --> id3((3))
  id3((3)) -- does --> id4((4)) -- not --> id5((5)) -- work --> id6((6))

However, the synonym token filter does not accurately record the positionLength for multi-position tokens. In reality, the graph looks more like the following:

flowchart LR
  id1((1)) -- the --> id2((2)) -- atm --> id3((3)) -- does --> id4((4)) -- not --> id5((5)) -- work --> id6((6))
  id2((2)) -- automatic teller machine --> id5((5))

So basically the phrase becomes “the automatic teller machine work”.

Here’s how we can prove this issue. First let’s create a new index, called technology, that applies the same synonym filter both at index time and at query time:

PUT /technology
{
  "settings": {
    "number_of_shards": 1,
    "index": {
      "analysis": {
        "analyzer": {
          "index_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_synonym_filter"
            ]
          }
        },
        "filter": {
          "my_synonym_filter": {
            "type": "synonym",
            "synonyms": [
              "atm=>automatic teller machine"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "quote": {
        "type": "text",
        "analyzer": "index_analyzer"
      }
    }
  }
}

Then we can index one document

PUT /technology/_doc/1
{
  "quote": "The atm does not work"
}

Now we can clearly see what happens when we try to search for that document using the match phrase query

GET /technology/_search
{
  "query": {
    "match_phrase": {
      "quote": "the"
    }
  }
}

Repeating the search with different phrases gives the following results:

Query                                        Document matches
the                                          ✔️
the automatic teller machine                 ✔️
the automatic teller machine does            ❌
the automatic teller machine does not work   ❌
the automatic teller machine work            ✔️

You can see that both “the automatic teller machine does” and “the automatic teller machine does not work” do not match, but should. Conversely, “the automatic teller machine work” matches, but shouldn’t.
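
The matches listed above follow from the token positions. The Python sketch below models a phrase match as "the words occur at consecutive positions" and reproduces the same outcomes. The position layout is an assumption reconstructed from the description in this post, not something read back from Elasticsearch, and the phrase_matches helper is made up for the example.

```python
# Assumed positions for "the atm does not work" after atm=>automatic teller machine:
# the replacement tokens take consecutive positions and overlap the following words.
indexed = [
    ("the", 0),
    ("automatic", 1), ("teller", 2), ("machine", 3),
    ("does", 2), ("not", 3), ("work", 4),
]


def phrase_matches(phrase, tokens):
    """True if the words of `phrase` occur at consecutive positions."""
    positions = {}
    for term, pos in tokens:
        positions.setdefault(term, set()).add(pos)
    words = phrase.split()
    return any(
        all(start + offset in positions.get(word, set())
            for offset, word in enumerate(words))
        for start in positions.get(words[0], set())
    )


for phrase in [
    "the",
    "the automatic teller machine",
    "the automatic teller machine does",
    "the automatic teller machine does not work",
    "the automatic teller machine work",
]:
    print(f"{phrase!r}: {phrase_matches(phrase, indexed)}")
```

In this layout “work” sits right after “machine”, which is why the nonsensical phrase matches while the correct ones do not.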

One way to solve this is to change the synonym mapping from atm=>automatic teller machine to atm, automatic teller machine. Running a match_phrase query for the phrase “the automatic teller machine does not work” would then cause the search analyzer to add “atm” as a synonym, effectively adding it to the token stream of the query string.

flowchart LR
  id1((1)) -- the --> id2((2)) -- atm --> id3((3))
  id2((2)) -- automatic teller machine --> id3((3))
  id3((3)) -- does --> id4((4)) -- not --> id5((5)) -- work --> id6((6))

The match_phrase query (and the match query) uses this graph to generate sub-queries for the following phrases:

  • “the atm does not work”
  • “the automatic teller machine does not work”
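
Generating one phrase per path through the graph can be sketched in a few lines of Python. This is a simplified illustration of the idea, not the actual query builder: the arcs list mirrors the diagram above, and the phrases helper is made up for the example.

```python
# Token graph as (from_node, to_node, text) arcs, mirroring the diagram above.
arcs = [
    (1, 2, "the"),
    (2, 3, "atm"),
    (2, 3, "automatic teller machine"),
    (3, 4, "does"),
    (4, 5, "not"),
    (5, 6, "work"),
]


def phrases(arcs, start, end):
    """Enumerate every path from `start` to `end` as a phrase."""
    if start == end:
        return [""]
    result = []
    for src, dst, text in arcs:
        if src == start:
            for rest in phrases(arcs, dst, end):
                result.append(f"{text} {rest}".strip())
    return result


for phrase in phrases(arcs, 1, 6):
    print(phrase)
# the atm does not work
# the automatic teller machine does not work
```

Each branch in the graph doubles the number of paths that pass through it, so a query with several multi-word synonyms expands into correspondingly more phrases.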

For all these reasons, ElasticSearch recommends performing synonym expansion at query time. On top of that, search-time synonym expansion allows the use of the more sophisticated synonym_graph token filter, which handles multi-word synonyms correctly and is designed to be used as part of a search analyzer only.

That’s all! I hope you enjoyed this post.


  1. 8.6 at the time of writing

  2. The score of the second document is slightly lower because the quote field contains more text.
