
Evaluating Search Performance: Elasticsearch vs. Algolia

R&D Team

Introduction

Search is everywhere—whether you’re browsing an online store, digging through academic papers, or trying to find the right document at work, you rely on search systems to deliver the right results quickly. As the volume of digital information keeps growing, strong information retrieval tools have become essential for making that data useful.

Among the many search engine solutions available, Algolia and Elasticsearch stand out as two of the most widely used platforms. Both offer powerful tools for optimizing search functionality, but they differ significantly in how they approach indexing, querying, and ranking.

In this blog, we will compare these two solutions. Through the use of a standardized dataset, we will evaluate their performance across several key metrics. This comparison is valuable because both search engines could be implemented in the retrieval step of a Retrieval-Augmented Generation (RAG) system, as well as in traditional enterprise search environments and other similar applications.

Understanding Algolia and Elasticsearch

In a previous article, we covered the pricing and features of both search engines. In this section, we will explore how the search engines work in more detail.

Algolia

Data Storage

In Algolia, the basic units for ingested data are called “records”. A record is an object-like collection of attributes, each consisting of a name and a value. Depending on the Algolia plan, a single record may be limited to 10KB (in the Free Build plan) or 100KB (in the Grow, Premium, and Elevate plans).

As part of the ingestion process, Algolia automatically creates an objectID field for each record.

Below is an example of a corpus record saved to Algolia:

{
  "_id": "w5kjmw88",
  "title": "Weathering the pandemic: How the Caribbean Basin can use viral and environmental patterns to predict, prepare, and respond to COVID-19",
  "text": "The 2020 coronavirus pandemic is developing at different paces throughout the world. Some areas...",
  "objectID": "1f636f408f90ea_dashboard_generated_id"
}
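
As a rough sketch of this ingestion step (using the v2/v3-style algoliasearch Python client; the credentials and index name are placeholders, not from the article), records could be uploaded like so:

from algoliasearch.search_client import SearchClient

client = SearchClient.create("YOUR_APP_ID", "YOUR_ADMIN_API_KEY")  # placeholders
index = client.init_index("trec-covid")

records = [
    {
        "_id": "w5kjmw88",
        "title": "Weathering the pandemic: How the Caribbean Basin can use viral and environmental patterns...",
        "text": "The 2020 coronavirus pandemic is developing at different paces throughout the world...",
    },
]

# Let Algolia generate an objectID for each record that does not provide one
index.save_objects(records, {"autoGenerateObjectIDIfNotExist": True})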

Searching

Algolia queries do not support boolean operators such as AND, OR, or NOT. Instead, for multi-word queries, a record is returned as a search result only when every word in the query appears in it.

That being said, Algolia does offer some flexibility with its search functionality, such as the ability to use stopwords and specify searchable fields, among other features.
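
Continuing the sketch above, searchable fields and stopword removal can be set as index settings, and a query then returns only records containing every remaining query term (the settings and values shown are illustrative):

# Limit matching to the title and text fields; optionally drop stopwords from queries
index.set_settings({
    "searchableAttributes": ["title", "text"],
    "removeStopWords": True,  # omit this line to keep Algolia's default behaviour
})

response = index.search(
    "how does the coronavirus respond to changes in the weather",
    {"hitsPerPage": 10},
)
hits = response["hits"]  # ranked records; empty if no record contains all terms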

Algolia also offers semantic search capabilities, but this feature is only available with the premium Elevate plan.

Ranking

Once a query has been executed, Algolia ranks the results and breaks ties based on an ordered list of criteria.

Elasticsearch

Data Storage

In Elasticsearch, data is stored as “documents,” which can be either structured or unstructured, with a maximum document size limit of 100 MB. For consistency with Algolia during the evaluation, we will upload each corpus item, with its _id, title, and text fields, as an individual document. This ensures a fair comparison between the two search engines.
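
A minimal indexing sketch with the 8.x-style elasticsearch Python client (the connection details and index name are assumptions) might look like this:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder connection details

corpus = [
    {"_id": "0a0i6vjn",
     "title": "Zambia’s National Cancer Centre response to the COVID-19 pandemic...",
     "text": "The COVID-19 pandemic has overwhelmed health systems around the globe..."},
]

# Index one Elasticsearch document per corpus item, keyed by the original _id
actions = (
    {"_index": "trec-covid", "_id": doc["_id"],
     "_source": {"title": doc["title"], "text": doc["text"]}}
    for doc in corpus
)
helpers.bulk(es, actions)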

Searching

In contrast to Algolia’s AND-like approach, Elasticsearch uses the OR operator for all words in a query. This means that as long as any word in the query is found in the document, it will be returned as a result.

In Elasticsearch, searches can be performed either lexically or semantically. To enable semantic search, the fields being queried must first be semantically indexed.
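
Continuing the sketch above, a lexical query can be issued as a multi_match over the title and text fields; the semantic query below shows just one possible shape in recent Elasticsearch versions (8.15+), assuming the text field was mapped as semantic_text with an inference endpoint configured:

query_text = "what is the origin of COVID-19"

# Lexical (BM25) search over both fields
lexical = es.search(
    index="trec-covid",
    query={"multi_match": {"query": query_text, "fields": ["title", "text"]}},
    size=10,
)

# Semantic search (illustrative; requires the semantic_text mapping beforehand)
semantic = es.search(
    index="trec-covid",
    query={"semantic": {"field": "text", "query": query_text}},
    size=10,
)

top_ids = [hit["_id"] for hit in lexical["hits"]["hits"]]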

Ranking

Elasticsearch ranks results by relevance using the BM25 algorithm after the search is performed, so a document that matches only a single word of the query can still be returned, but it is unlikely to appear among the top results.

Evaluating Retrieval Performance

Choosing an Evaluation Dataset

For the evaluation, we selected the TREC-COVID dataset, which is part of both the MTEB and BEIR benchmarks, as the basis for comparing search engine results.

BEIR Benchmark

BEIR (Benchmarking IR) is a standard benchmark for evaluating retrieval task performance, primarily using NDCG and Recall as the key metrics.

BEIR provides 18 datasets covering a wide range of domains, used to test various retrieval tasks, including fact-checking, question answering, document retrieval, and recommendation systems.

Each BEIR dataset consists of a corpus, queries, and relevance judgments (qrels) for those queries (a sample of the data format is shown later).

MTEB Benchmark

MTEB (Massive Text Embedding Benchmark) is a standard benchmark that measures performance on 8 text embedding task types across 56 datasets. MTEB is also a superset of the BEIR benchmark, meaning the MTEB retrieval task datasets reuse those from BEIR.

TREC-COVID Dataset

Among all the datasets in MTEB and BEIR, the TREC-COVID dataset was chosen specifically for its relatively higher labeling rate.

The original dataset consists of a corpus containing 171,332 documents, 50 queries, and 24,763 qrels.

For the evaluation experiments, we removed documents that exceeded the 10 KB record size limit of Algolia’s Free plan. Only 12 documents were affected, so the vast majority of the original corpus was used for both search engines. The qrels corresponding to these excluded documents were also omitted from the evaluation.
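
A filtering step along these lines could look like the following (a minimal sketch, assuming the corpus and qrels have already been loaded as Python lists/dicts; the exact byte limit is an approximation):

import json

def filter_for_algolia(corpus, qrels, max_bytes=10_000):
    # Keep only documents whose JSON serialization fits within ~10 KB
    kept = [doc for doc in corpus
            if len(json.dumps(doc).encode("utf-8")) <= max_bytes]
    kept_ids = {doc["_id"] for doc in kept}
    # Drop qrels that point to excluded documents
    filtered_qrels = {
        qid: {cid: score for cid, score in judged.items() if cid in kept_ids}
        for qid, judged in qrels.items()
    }
    return kept, filtered_qrels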

Below are sample records from the dataset, which include the corpus, queries, and relevance judgments (qrels):

Corpus

Each corpus item includes the following fields: “_id”, “title” and “text”. Some “title” and “text” fields are empty, so both fields were set to be searchable.

{
  "_id": "0a0i6vjn",
  "title": "Zambia’s National Cancer Centre response to the COVID-19 pandemic—an opportunity for improved care",
  "text": "The COVID-19 pandemic has overwhelmed health systems around the globe even in countries with strong economies..."
}
{
  "_id": "d1stzy8w",
  "title": "Susceptibility of tree shrew to SARS-CoV-2 infection",
  "text": "Since SARS-CoV-2 became a pandemic event in the world, it has not only caused huge economic losses, but also..."
}
{
  "_id": "6jej7l24",
  "title": "Diagnosing rhinitis: viral and allergic characteristics.",
  "text": ""
}
Queries

The fields for a single query item are: “_id” and “text”.

{
    "_id": "1",
    "text": "what is the origin of COVID-19"
},
{
    "_id": "2",
    "text": "how does the coronavirus respond to changes in the weather"
},
{
    "_id": "3",
    "text": "will SARS-CoV2 infected people develop immunity? Is cross protection possible?"
}
Relevance Judgements (qrels)

Each qrel entry identifies a query, a corpus ID, and the relevance score of that corpus document for the query. Since each query is judged against many corpus documents, there are 50 queries but 24,763 qrels.

query-id	corpus-id	score
1	        005b2j4b	2.0
1	        00fmeepz	1.0
1	        g7dhmyyo	2.0

The score values are 0, 1, or 2, where:

  0 = not relevant
  1 = somewhat relevant
  2 = relevant

For the chosen metric evaluations, the difference between scores of 1 and 2 will primarily affect the results of the NDCG metric.

Choosing Evaluation Metrics and Demonstrating Metric Calculations

We have chosen to evaluate the following common metrics for retrieval tasks: Precision, Recall, NDCG, and MAP. The evaluations were performed using pytrec_eval.

Precision and Recall are not rank-aware, while NDCG and MAP are. This means that Precision and Recall scores indicate whether the correct sources were surfaced, while NDCG and MAP also account for the ordering of relevant results.

NOTE: In the metric equations below, Metric@K refers to the value of the metric calculated for the top K retrieved results.

Sample Data for Demonstrating Metric Calculations

To illustrate the calculation of these metrics, we provide sample results from Algolia searches on queries with IDs “1”, “2”, and “3.” In these examples, we assume the search engine returns up to 10 results per query.

The format of the queries and qrels data below follows the structure accepted by pytrec_eval.

Queries
{'_id': '1', 'text': 'what is the origin of COVID-19'}
{'_id': '2', 'text': 'how does the coronavirus respond to changes in the weather'}
{'_id': '3', 'text': 'will SARS-CoV2 infected people develop immunity? Is cross protection possible?'}
Qrels

Below is a sample of the qrels for queries “1”, “2”, and “3.” This representation is reordered and truncated to focus on query “1,” listing the documents returned by Algolia first so that the metric calculations are easier to follow.

(The full qrels for the queries “1”, “2” and “3” can be found here.)

{
    "1": {
        "dckuhrlf": 0,
        "96zsd27n": 0,
        "0paafp5j": 0,
        "fqs40ivc": 1,
        "hmvo5b0q": 1,
        "l2wzr3w1": 1,
        "41378qru": 0,
        "dv9m19yk": 1,
        "ipl6189w": 0,
        "084o1dmp": 0,
        "08ds967z": 1,
        ...
        },
    "2": {...},
    "3": {...}
}

Search Results (Limited to the Top 10 Results as Determined by the Engine)

pytrec_eval interprets search results similarly to how it handles qrels, where each result has an associated relevance score. While search engines may use different scoring systems or scales, pytrec_eval evaluates these scores based on their relative rank. In other words, it focuses on the rank order of results, with higher scores indicating greater relevance. The absolute value of the score itself is less important than its position relative to other results in the ranked list.

In the Algolia search results below, the corpus ID with the highest score, “dv9m19yk”, was ranked as the top result, while “26276rpr” was ranked at the bottom of the top 10.

{
    "1": {
        "26276rpr": 2,
        "dckuhrlf": 3,
        "96zsd27n": 4,
        "0paafp5j": 5,
        "hmvo5b0q": 6,
        "fqs40ivc": 7,
        "l2wzr3w1": 8,
        "41378qru": 9,
        "ipl6189w": 10,
        "dv9m19yk": 11
    },
    "2": {},
    "3": {}
}

From the results above, we can see that Algolia did not return any results for queries “2” and “3.”

We chose to use incrementing numbers and avoid duplicate scores because Algolia doesn’t return an explicit search score. Instead, we rely on the order in which results are returned, which is determined by a series of tie-breakers. You can learn more about these tie-breakers in the documentation.

Additionally, we’ve excluded scores of 0 or 1, as Algolia guarantees that all documents returned contain the query terms. Therefore, we treat all results as relevant.

With the sample data above, we can now move forward with calculating values for both non-rank-aware metrics (Precision and Recall) and rank-aware metrics (MAP and NDCG).
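
Before walking through the calculations by hand, here is a minimal pytrec_eval sketch that computes the same metrics over this sample data. The qrels variable is assumed to hold the full judgments for queries “1”, “2” and “3” shown above, with integer relevance values; the run scores only need to preserve the rank order:

import pytrec_eval

# Ranked Algolia results for query "1"; queries "2" and "3" returned nothing
run = {
    "1": {"dv9m19yk": 11.0, "ipl6189w": 10.0, "41378qru": 9.0, "l2wzr3w1": 8.0,
          "fqs40ivc": 7.0, "hmvo5b0q": 6.0, "0paafp5j": 5.0, "96zsd27n": 4.0,
          "dckuhrlf": 3.0, "26276rpr": 2.0},
    "2": {},
    "3": {},
}

evaluator = pytrec_eval.RelevanceEvaluator(
    qrels, {"P.10", "recall.10", "map_cut.10", "ndcg_cut.10"}
)
per_query = evaluator.evaluate(run)

# Average each metric over the three queries
for measure in ("P_10", "recall_10", "map_cut_10", "ndcg_cut_10"):
    average = sum(scores[measure] for scores in per_query.values()) / len(per_query)
    print(measure, round(average, 5))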

Non-Rank-Aware Metrics

Precision

Precision measures how many of the retrieved items are relevant, indicating the percentage of retrieved results that are considered correct according to the ground truth (qrels).

Precision@K=\frac{TP}{TP+FP}=\frac{TP}{K}=\frac{\text{Number of relevant items in K}}{\text{Total number of items in K}}

Where:

  TP (true positives) = retrieved items that are relevant according to the qrels
  FP (false positives) = retrieved items that are not relevant
  K = the number of top retrieved results considered

Example calculation:

For Query 1, there were 4 true positive results. Since we’re calculating Precision@10, we have:

Precision@10=\frac{4}{10}=0.4

For Queries 2 and 3, since no relevant results were retrieved, Precision@10 = 0 for both queries.

Average Precision@10:

\text{Average }Precision@10=\frac{0.4+0+0}{3}=0.13333

Recall

Recall measures how many of all relevant items were retrieved, i.e., the percentage of the total relevant set that appears in the results.

Recall@K=\frac{TP}{TP+FN}=\frac{\text{Number of relevant items in K}}{\text{Total number of relevant items}}

Where:

  TP (true positives) = relevant items that appear in the top K results
  FN (false negatives) = relevant items that were not retrieved

Example calculation:

For Query 1, there were 4 true positive results. The total number of relevant items is 637, which comes from the sum of items marked as somewhat relevant (score of 1) and relevant (score of 2) in the qrels. So, the calculation for Recall@10 is:

Recall@10=\frac{4}{637}=0.00628

For Queries 2 and 3, since no relevant results were retrieved, Recall@10 = 0 for both queries.

Average Recall@10:

\text{Average }Recall@10=\frac{0.00627943+0+0}{3}=0.00209

Rank-Aware Metrics

MAP

Mean Average Precision (MAP) evaluates the system’s ability to return relevant items and rank them appropriately, with the most relevant items appearing at the top of the list.

To calculate MAP, we first need to compute the Average Precision (AP) for a single query. AP@K aggregates the precision at each rank k = 1, …, K, counting only the ranks at which a relevant item appears, and normalizes by the total number of relevant items.

AP@K=\frac{1}{N}\sum_{k=1}^{K} \text{Precision}(k) \times \text{rel}(k)

Here:

  N = the total number of relevant items for the query (according to the qrels)
  Precision(k) = the precision computed over the top k results
  rel(k) = 1 if the item at rank k is relevant, 0 otherwise

The final MAP score is the average of the AP values over all queries.

MAP@K=\frac{1}{U}\sum_{u=1}^{U}AP@K_{u}

Where:

  U = the number of queries
  AP@K_u = the AP@K value for query u

Example calculation:

For Query 1, the table below shows the Precision@K, rel(k), and AP@K for the ordered search results.

Ordered query result #    Corpus ID    rel(k)    Precision@K    Precision(k) * rel(k)
1                         dv9m19yk     1         1 / 1          1
2                         ipl6189w     0         1 / 2          0
3                         41378qru     0         1 / 3          0
4                         l2wzr3w1     1         2 / 4          0.5
5                         fqs40ivc     1         3 / 5          0.6
6                         hmvo5b0q     1         4 / 6          0.66667
7                         0paafp5j     0         4 / 7          0
8                         96zsd27n     0         4 / 8          0
9                         dckuhrlf     0         4 / 9          0
10                        26276rpr     0         4 / 10         0

This results in:

AP@10=\frac{1+0+0+0.5+0.6+0.66667+0+0+0+0}{637}=0.00434

For Queries 2 and 3, since no relevant items were retrieved, AP@10 = 0 for both queries.

Average AP@10 (or simply MAP@10):

MAP@10=\frac{0.00434+0+0}{3}=0.00145
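
The same arithmetic in a few lines of Python, using the values from the table above:

# Precision(k) * rel(k) contributions for query "1", in ranked order
contributions = [1, 0, 0, 0.5, 0.6, 0.66667, 0, 0, 0, 0]
n_relevant = 637  # total number of relevant items for query "1" in the qrels

ap_10 = sum(contributions) / n_relevant
map_10 = (ap_10 + 0 + 0) / 3  # queries "2" and "3" contribute 0

print(round(ap_10, 5), round(map_10, 5))  # 0.00434 0.00145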

NDCG

Normalized Discounted Cumulative Gain (NDCG) measures a system’s ability to rank items based on their relevance. Unlike other metrics, NDCG accounts for how relevant the retrieved items are, using relevance values from the qrels.

NDCG is calculated by first determining the Discounted Cumulative Gain (DCG) and then normalizing it by the Ideal Discounted Cumulative Gain (IDCG).

Formula:

NDCG@K=\frac{DCG@K}{IDCG@K}

DCG Calculation:

DCG@K=\sum_{i=1}^{K}\frac{rel_{i}}{\log_{2}(i+1)}

Where:

  rel_i = the relevance score (from the qrels) of the item at rank i
  K = the number of top results considered

Example calculation:

For Query 1, we first calculate the DCG@10.

Rank (i)    corpus_id    rel_i    rel_i / log2(i + 1)
1           dv9m19yk     1        1
2           ipl6189w     0        0
3           41378qru     0        0
4           l2wzr3w1     1        0.4307
5           fqs40ivc     1        0.3869
6           hmvo5b0q     1        0.3562
7           0paafp5j     0        0
8           96zsd27n     0        0
9           dckuhrlf     0        0
10          26276rpr     0        0

By summing the last column, we get:

DCG@10=1+0+0+0.4307+0.3869+0.3562+0+0+0+0=2.1738

IDCG@10 represents the ideal scenario in which the most relevant documents for Query 1 are ranked highest. Since 305 documents have a relevance of 2, the ideal top 10 consists entirely of documents with relevance 2. We calculate IDCG@10 as follows:

IDCG@10=\frac{2}{\log_2(2)}+\frac{2}{\log_2(3)}+\frac{2}{\log_2(4)}+\frac{2}{\log_2(5)}+\frac{2}{\log_2(6)}+\frac{2}{\log_2(7)}+\frac{2}{\log_2(8)}+\frac{2}{\log_2(9)}+\frac{2}{\log_2(10)}+\frac{2}{\log_2(11)}=9.0886

Finally, we compute NDCG@10 for Query 1:

NDCG@10=\frac{DCG@10}{IDCG@10}=\frac{2.1738}{9.0886}=0.23915

For Queries 2 and 3, since no relevant items were retrieved, NDCG@10 = 0 for both queries.

Average NDCG@10:

\text{Average }NDCG@10=\frac{0.23915+0+0}{3}=0.07972
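
The NDCG@10 computation for Query 1 can likewise be reproduced in a few lines (relevance values taken from the table above; small rounding differences from the hand calculation are expected):

import math

rels = [1, 0, 0, 1, 1, 1, 0, 0, 0, 0]  # relevance of the top-10 results for query "1"

dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))
idcg = sum(2 / math.log2(i + 1) for i in range(1, 11))  # ideal: ten documents of relevance 2
ndcg = dcg / idcg

print(round(dcg, 2), round(idcg, 2), round(ndcg, 3))  # ~2.17 ~9.09 ~0.239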

Running Full Evaluations

Now that we’ve covered the selected metrics and how they’re calculated, we’ll apply them to the full dataset to compare the performance of Elasticsearch lexical search, Elasticsearch semantic search, default Algolia Search, and Algolia Search with stopwords enabled. This evaluation will help us understand how each method performs across different configurations using the same set of queries and ground-truth relevance labels.
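
Conceptually, every configuration follows the same loop: run each query against the engine, build a run dictionary, and hand it to pytrec_eval together with the qrels. Below is a condensed sketch of that loop; search_top_k is a hypothetical helper wrapping either engine's client, and queries/qrels are assumed to be loaded already (the actual implementation is in the repository linked below).

import pytrec_eval

K_VALUES = [1, 5, 10, 15, 25, 35, 45, 55]

def evaluate_engine(search_top_k, queries, qrels, k_values=K_VALUES):
    # search_top_k(query_text, k) -> ordered list of corpus IDs (hypothetical helper)
    max_k = max(k_values)
    run = {}
    for query in queries:
        doc_ids = search_top_k(query["text"], max_k)
        # Higher score = higher rank; pytrec_eval only uses the relative order
        run[query["_id"]] = {doc_id: float(len(doc_ids) - rank)
                             for rank, doc_id in enumerate(doc_ids)}

    cutoffs = ",".join(str(k) for k in k_values)
    measures = {f"P.{cutoffs}", f"recall.{cutoffs}",
                f"map_cut.{cutoffs}", f"ndcg_cut.{cutoffs}"}
    per_query = pytrec_eval.RelevanceEvaluator(qrels, measures).evaluate(run)

    # Average every reported measure over all queries
    measure_names = next(iter(per_query.values())).keys()
    return {name: sum(scores[name] for scores in per_query.values()) / len(per_query)
            for name in measure_names}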

Code Reference

You can access the full code in our GitHub repository.

Results

We evaluated the following metrics: Precision, Recall, NDCG, and MAP at different values of K: 1, 5, 10, 15, 25, 35, 45, and 55.

Results for Elasticsearch lexical search:

K     Precision    Recall     NDCG       MAP
1     0.76000      0.00203    0.72000    0.00203
5     0.71200      0.00880    0.66893    0.00777
10    0.65400      0.01589    0.61587    0.013061
15    0.63067      0.02308    0.59091    0.018215
25    0.58560      0.03441    0.55259    0.02652
35    0.56000      0.04560    0.52855    0.03396
45    0.53956      0.05595    0.50753    0.04088
55    0.51273      0.06416    0.48471    0.04630

Results for Elasticsearch semantic search:

K     Precision    Recall     NDCG       MAP
1     0.94000      0.00246    0.87000    0.00246
5     0.83600      0.01099    0.78145    0.01033
10    0.78800      0.02036    0.75003    0.01865
15    0.52533      0.02036    0.58140    0.01865
25    0.31520      0.02036    0.41907    0.01865
35    0.22514      0.02036    0.33592    0.01865
45    0.17511      0.02036    0.28382    0.01865
55    0.14327      0.02036    0.24802    0.01865

Results for default Algolia search:

K     Precision    Recall     NDCG       MAP
1     0.26000      0.00042    0.20000    0.00042
5     0.20800      0.00200    0.17705    0.00157
10    0.19200      0.00349    0.16598    0.00266
15    0.17333      0.00485    0.15414    0.00361
25    0.12640      0.00588    0.12298    0.00434
35    0.09029      0.00588    0.09858    0.00434
45    0.07022      0.00588    0.08329    0.00434
55    0.05745      0.00588    0.07269    0.00434

Results for Algolia with stopwords:

K     Precision    Recall     NDCG       MAP
1     0.38000      0.00071    0.31000    0.00071
5     0.31600      0.00350    0.27898    0.00272
10    0.27400      0.00596    0.25137    0.00440
15    0.25733      0.00829    0.23713    0.00577
25    0.19920      0.01061    0.19793    0.00721
35    0.14229      0.01061    0.15866    0.00721
45    0.11067      0.01061    0.13405    0.00721
55    0.0905       0.01061    0.11699    0.00721

Results Interpretation

Elasticsearch

Overall, Elasticsearch performs better than Algolia.

When comparing regular lexical search and semantic search within Elasticsearch, semantic search shows stronger performance at lower K-values. However, starting from K=15 and beyond, the regular lexical search begins to outperform semantic search across all metrics.

Looking more closely, we notice that for Elasticsearch semantic search both Recall and MAP remain constant at and beyond K = 15, mirroring the Recall@10 and MAP@10 scores. This is likely because the semantic search returns no more than about 10 results per query (the Precision values beyond K = 10 fall off almost exactly as 10/K). As a result, the percentage of relevant results found, and therefore the scores, stop increasing beyond that point.

Since the two approaches were evaluated separately, a potential improvement could be to combine semantic and lexical signals. Such a hybrid approach might offer the precision of semantic search at the top ranks while maintaining broader coverage through lexical matching.
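
As a purely illustrative sketch of such a hybrid (not part of the evaluation above), the two ranked ID lists could be merged client-side with reciprocal rank fusion (RRF):

def reciprocal_rank_fusion(ranked_lists, k=60):
    # Merge several ranked lists of document IDs; k dampens the weight of lower ranks
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# lexical_ids and semantic_ids: ordered corpus IDs from the two searches (assumed available)
# fused_top_10 = reciprocal_rank_fusion([lexical_ids, semantic_ids])[:10]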

Algolia

From the results, we can observe that Algolia’s scores are significantly lower compared to Elasticsearch.

Similar to Elasticsearch, we notice that Algolia’s Recall and MAP scores remain constant at and beyond K = 25, likely for the same reason.

Interestingly, we observe a slight improvement in Algolia’s metric scores when stopwords are applied. By excluding certain words from the strict AND condition, Algolia broadens its search criteria, allowing more relevant results to surface. However, this enhancement is limited. Algolia’s default behavior still requires that all (non-stopword) query terms appear in the result, which can restrict its ability to retrieve relevant records in some cases. This is evident in the following example:

Query “2”: “How does the coronavirus respond to changes in the weather”

Returns: Nothing

Here, Algolia returns no results because no records contain all of the terms in the query.

Modified Query “2”: “coronavirus respond to the weather”

Returns: records “w5kjmw88”, “gan10za0”

In this case, Algolia returns results because these records contain the specified terms.

This suggests that Algolia is better suited for keyword-based searches. However, it’s worth noting that Algolia does offer a semantic search feature called NeuralSearch, available with their most expensive plan: Elevate. Implementing this feature could potentially improve Algolia’s results for more complex queries.

Conclusion

In this article, we compared the querying performance of Algolia and Elasticsearch. While Elasticsearch performed better, its results were not perfect. Algolia, on the other hand, demonstrated limitations, particularly in handling more complex queries due to its strict search behavior.

When ranking the methods based on performance, we found the following:

  1. (Tie): Elasticsearch lexical and semantic search – semantic search excels at smaller K (up to K = 10), while lexical search performs better from K = 15 onward.
  2. Algolia search with stopwords enabled
  3. Algolia default search

To improve performance in enterprise search, both Algolia and Elasticsearch could potentially benefit from query preprocessing techniques, especially for complex queries. Combined with each engine’s native ranking, such preprocessing could yield more relevant and accurate results across diverse use cases.

References

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Evaluating Search Relevance Part 1

MTEB: Massive Text Embedding Benchmark

Prepare your records for indexing

Mean Average Precision (mAP) Explained

Evaluation Metrics for Search and Recommendation Systems
