Elasticsearch is mostly used for full-text search; however, it is much more powerful than that and fits many other use cases: autocomplete, spell checking, alerting, log aggregation, and a general-purpose document store. By definition, Elasticsearch is a distributed search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. It is built on top of Apache Lucene. This article assumes a basic understanding of indexes, documents, and mappings on the reader's part. The article is split into two parts: the first looks into the data structures and low-level design, and the second is dedicated to the high-level design.

Inverted Index:

An inverted index is one of the main data structures used in Elasticsearch. It is a sorted list of terms, where each term is stored along with its frequency and the list of documents that contain it.

Inverted Index

In the above image, we have depicted an inverted index: every document is tokenized into terms, and each term is stored together with its frequency and the documents in which it appears. When we query with the term "party", Elasticsearch looks up the term, finds all documents that contain it, and responds with those documents.
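As a sketch of how this lookup is triggered (the index name my-index and the field name message here are assumptions for illustration), a simple match query against a text field consults exactly this structure:

GET /my-index/_search
{
  "query": {
    "match": { "message": "party" }
  }
}

The query term is run through the same analyzer as the field, looked up in the inverted index, and the matching documents are returned ranked by relevance.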

While the above example works fine for exact matches, what happens in the case of a fuzzy search? The explanation above is a very dumbed-down version of what actually happens, so let's dig a little deeper. For instance, what happens if we have a typo in our search query, i.e., what happens if we search for Jefferies instead of Jeffery?
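Part of the answer lives at query time: Elasticsearch supports fuzzy matching based on Levenshtein edit distance, so a query term within a small number of character edits of an indexed term can still match. A minimal sketch (the index name people and the field name are assumptions for illustration):

GET /people/_search
{
  "query": {
    "match": {
      "name": {
        "query": "Jefferies",
        "fuzziness": "AUTO"
      }
    }
  }
}

With "fuzziness": "AUTO", the allowed number of edits scales with the length of the query term, up to a maximum of two.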

The other part of the answer lies in how text is processed at index and search time. Here the text analyzer plays an important part: in full-text search, analyzers are responsible for tokenizing, normalizing, and stemming the input, removing stop words, and so on, so that searches return the right results. Let's look into these components one by one.

  • Tokenizing: A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For example, the letter tokenizer divides the text into terms whenever it encounters a character that is not a letter.
POST _analyze
{
  "tokenizer": "letter",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone ]

There are multiple tokenizers; please read through the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html

  • Normalization: Tokenization enables matching on individual terms, but each token is still matched literally. Normalization enriches the tokens for better search results: lowercasing, reducing words to their singular form, adding synonyms for the original token, expanding abbreviations, and so on.
  • Stemming: Stemming is the process of reducing a word to its root form. This ensures variants of a word match during a search. For example, walking and walked can both be stemmed to the same root word: walk. Once stemmed, an occurrence of either word would match the other in a search (a combined lowercasing-and-stemming example follows this list).
  • Stop words removal: Removes stop words from a token stream. When not customized, the stop filter removes English stop words by default, as in the following example:
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "stop" ],
  "text": "a quick fox jumps over the lazy dog"
}
[ quick, fox, jumps, over, lazy, dog ]
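As referenced in the stemming item above, lowercasing and stemming can be chained as token filters in a single _analyze call. A minimal sketch using the built-in lowercase and stemmer filters (the stemmer filter defaults to English):

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "stemmer" ],
  "text": "Walking and WALKED"
}
[ walk, and, walk ]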

We have seen that an inverted index allows a query to look up the search term in a unique sorted list of terms and, from there, immediately access the list of documents that contain the term. This is ideal for full-text search, but what about aggregation and sorting? The same data structure is of little use if, for example, we want to compute the sum of the age field across all documents in an index.

Field Data / Doc Values

Consider the following documents:

id, name,    gender, age
1,  Harry,   m,      12
2,  Luna,    f,      11
3,  Ron,     m,      12
4,  Parvati, f,      14
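As a minimal sketch of how such documents could be indexed (the index name students and the field types are assumptions; doc values are enabled by default for numeric and keyword fields):

PUT /students
{
  "mappings": {
    "properties": {
      "name":   { "type": "keyword" },
      "gender": { "type": "keyword" },
      "age":    { "type": "integer" }
    }
  }
}

PUT /students/_doc/1
{ "name": "Harry", "gender": "m", "age": 12 }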

Doc values are a Lucene 4.x feature that stores field values on disk in a column-stride fashion: all values of one field are laid out together, which is filesystem-cache-friendly and well suited to custom scoring, sorting, and faceting. Because each column holds values of a single, homogeneous type, doc values also compress very efficiently.

DocValues
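Doc values are what make the earlier question, the sum of every document's age, cheap to answer. A sketch against the hypothetical students index above:

GET /students/_search
{
  "size": 0,
  "aggs": {
    "total_age": { "sum": { "field": "age" } }
  }
}

For the four documents above this returns 49 (12 + 11 + 12 + 14). The aggregation reads the age column from doc values without touching the inverted index; sorting on age works the same way.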

In this part of the article, we looked into the data structures used for searching, aggregating, and sorting, and into the role of analyzers in full-text search. In the next part, we will look into the abstractions built on top of these data structures and learn about the high-level design of Elasticsearch: what actually happens when a request is fired at an Elasticsearch cluster.

I highly recommend reading the following articles to get a deep understanding of field data and doc values.

  • Support in the Wild: My Biggest Elasticsearch Problem at Scale — how to avoid the biggest, most common problem scaling Elasticsearch, from a Support Engineer at Elastic.
  • Disk-Based Field Data a.k.a. Doc Values

Continue with Part II: How Elasticsearch Works Internally, which covers Lucene's immutable index segments and the data structures they contain (stored fields, inverted index, doc values, etc.).