How Search.io works

At a high level, there are two main components. The first is the index, which stores the documents and their metadata. The second is pipelines, each of which contains a series of steps that improve and augment search queries.

Let's break down a search into a series of steps to illustrate how data is indexed and how a query moves through a pipeline.

  1. First things first, indexing content

  2. A user types a query

  3. Spell-checking the query

  4. Applying synonyms

  5. Applying smarts

  6. Applying boost rules

  7. Searching the index

  8. Returning the results

1. First things first, indexing content

Before you can start searching, we need to index the data you want to search through. This is similar to the index in the back of a book: a list of terms, each referencing the pages where the term occurs.
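To make the book-index analogy concrete, here is a minimal sketch of an inverted index in Python. It is only an illustration of the general idea, not how Search.io stores data internally; the `build_inverted_index` function and the sample pages are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each lowercase term to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index


# Hypothetical pages, for illustration only.
pages = {
    "/san-francisco": "Hotels in San Francisco",
    "/new-york": "Hotels in New York",
}
index = build_inverted_index(pages)
hotel_pages = sorted(index["hotels"])  # every page that mentions "hotels"
```

Looking up a term is then a single dictionary access, which is why an index is so much faster than scanning every page for every query.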

You can index most websites automatically using our Crawler. The Crawler navigates through your website, starting at the homepage, and indexes each page it can reach. No need to worry about uploading data or writing code to index your data.

Aside from adding the content of the page, the Crawler will also index the webpage's metadata to improve the search. For example:

  • title

  • published_time

  • headings

  • url

  • keywords

The marketing manager for American Cities wants to add search to their domain americancities.com. Within minutes, all pages on the domain are indexed.
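As a rough sketch, an indexed page with the metadata fields listed above could be pictured as a record like the one below. The `PageRecord` class, the URL path, and all field values are assumptions made for illustration; the actual schema the Crawler produces may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageRecord:
    """Hypothetical shape of one crawled page; not the real Crawler schema."""
    url: str
    title: str
    published_time: str
    headings: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    content: str = ""


# Made-up example values.
record = PageRecord(
    url="https://www.americancities.com/san-francisco",
    title="San Francisco",
    published_time="2021-01-01T00:00:00Z",
    headings=["Where to stay"],
    keywords=["san francisco", "hotels"],
)
```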

2. A user types a query

The next step is pretty straightforward - the user types their query into the search bar.

With autocomplete suggestions enabled, the most relevant suggestions will be displayed as you type.
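One simple way to picture autocomplete is prefix matching against past queries, ranked by popularity. The sketch below assumes exactly that; the `suggest` function and the query counts are hypothetical, and the real suggestion model is likely more sophisticated.

```python
def suggest(prefix, popular_queries, limit=3):
    """Return the most popular known queries that start with the typed prefix."""
    prefix = prefix.lower()
    matches = [(q, n) for q, n in popular_queries.items() if q.startswith(prefix)]
    # Most popular first; ties broken alphabetically.
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return [q for q, _ in matches[:limit]]


# Made-up query counts, for illustration only.
popular_queries = {
    "san francisco hotels": 120,
    "san diego zoo": 80,
    "seattle coffee": 40,
}
suggestions = suggest("san", popular_queries)
```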

Jane heads to the website www.americancities.com to conduct some research for her upcoming vacation to San Francisco. She types in her query “san fram hotels” but doesn’t realize she has misspelled the keyword “fran”.

3. Spell-checking the query

Once a user has typed their query and pressed enter, it is sent to a query pipeline. You can think of the pipeline as a series of steps that compile smarter queries for you before searching the index.

One example of how the pipeline can improve the query is spell-checking. Unlike most search services, we look not only at individual words but also at the context of the text.

Instead of using a regular dictionary to correct misspelled words, this process uses a probability matrix. We can predict different variations of a misspelled query to determine which is most likely to lead to useful results. This ensures that brand names and domain-specific words can also be corrected.

A query pipeline receiving the query “san fram hotels” will predict the word “fram” is likely incorrect. Alternatives that are spelled correctly and highly probable to match the user intent will be added into the query automatically. This alternate query is then weighted based on its probability of being the correct alternative when ranking search results.
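The sketch below shows one way context can inform spelling correction: candidates one edit away from the typed word are weighted by how often they follow the previous word. This is a toy bigram model standing in for the probability matrix described above, not Search.io's implementation; the function names and counts are invented.

```python
import string


def edits1(word):
    """All strings one delete, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + replaces + inserts)


def alternatives(prev_word, word, bigram_counts):
    """Return {candidate: probability} for known words that follow prev_word."""
    scores = {
        cand: bigram_counts[(prev_word, cand)]
        for cand in edits1(word) | {word}
        if (prev_word, cand) in bigram_counts
    }
    total = sum(scores.values())
    return {cand: count / total for cand, count in scores.items()} if total else {}


# Made-up counts of how often each word follows "san" in past queries.
bigram_counts = {("san", "fran"): 90, ("san", "from"): 10, ("san", "diego"): 70}
alts = alternatives("san", "fram", bigram_counts)
best = max(alts, key=alts.get)
```

Both “fran” and “from” are one edit away from “fram”, but after the word “san” the first is far more likely, so it receives most of the weight.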

4. Applying synonyms

After each word of the query has been corrected, the pipeline then checks to see if synonyms have been set for any terms in the query. Synonyms help query keywords to reflect the content of an index, so the words on web pages can be matched to the search patterns of your users. Synonyms are particularly useful to localize content or when people search for brands or nicknames. For example, “pants” = “trousers”, or “San Fran” = “San Francisco”.

When the pipeline applies each synonym, this creates an alternate query that runs in tandem with the original query. Another probability matrix then determines which query is more likely to lead to the most useful results.

In the pipeline, a synonym has been set where every query for “san fran” relates exactly to the keywords “san francisco”.

This creates the query “san francisco hotels” and keeps the original query “san fram hotels”. The query “san francisco hotels” is determined to be more likely to lead to a better search result.
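Synonym expansion can be sketched as follows: the original query is kept, and one alternate query is added for each synonym that matches. The `expand_synonyms` function is hypothetical, and the example assumes the query has already been spell-corrected to “san fran hotels”.

```python
import re


def expand_synonyms(query, synonyms):
    """Return the original query plus one alternate per matching synonym."""
    variants = [query]
    for phrase, replacement in synonyms.items():
        # Match whole words only, so "san fran" does not fire inside "san francisco".
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, query):
            variants.append(re.sub(pattern, replacement, query))
    return variants


synonyms = {"san fran": "san francisco"}
variants = expand_synonyms("san fran hotels", synonyms)
```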

5. Applying smarts

The next step of the pipeline is to determine how much of the relevance algorithm is to be allocated to machine learning optimization. We use reinforcement learning to continuously optimize search result ranking.

When a user selects a result, feedback to the underlying index data will allow the algorithm to learn what results people preferred. The engine then rewards this result with a higher ranking for the same query in the future.

One particular result, titled “The Best San Francisco Cheap Hotel Deals”, is the most popular when users search for “san francisco hotels”. When this query leaves the pipeline and the best results are returned, this result will be boosted.
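The feedback loop can be pictured with a very small learner that nudges a per-query score toward 1 when a result is clicked and toward 0 when it is shown but skipped. This is a deliberately simplified stand-in for reinforcement learning, not the algorithm Search.io uses; the `ClickFeedback` class, the learning rate, and the URLs are all assumptions.

```python
class ClickFeedback:
    """Toy learner: keeps an estimated reward for each (query, url) pair."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.value = {}  # (query, url) -> estimated reward between 0 and 1

    def record(self, query, url, clicked):
        """Move the estimate toward 1.0 on a click and toward 0.0 otherwise."""
        key = (query, url)
        old = self.value.get(key, 0.0)
        reward = 1.0 if clicked else 0.0
        self.value[key] = old + self.learning_rate * (reward - old)

    def rerank(self, query, urls):
        """Order urls by learned value, highest first; ties keep their order."""
        return sorted(urls, key=lambda url: -self.value.get((query, url), 0.0))


feedback = ClickFeedback()
feedback.record("san francisco hotels", "/cheap-hotel-deals", clicked=True)
ranked = feedback.rerank(
    "san francisco hotels", ["/hotel-history", "/cheap-hotel-deals"]
)
```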

6. Applying boost rules

You can "boost" the importance of documents based on many different attributes: the title, exact matches in the description, how recently a page was published, or anything else you can think of. Boost rules are a great way to increase the relevancy of an entire section of your results, with no need to adjust individual queries and results on a micro-level.

In the americancities.com search index, there is a boost rule where results that contain /cheap-hotels/new-york/ in the URL are to be boosted by 20%. Results that match this boost rule will be lifted higher in the results set.
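A boost rule like this one can be sketched as a score adjustment followed by a re-sort. The sketch assumes that "boosted by 20%" means multiplying the relevance score by 1.2, which is one plausible reading; the `apply_url_boost` function and the scores are hypothetical.

```python
def apply_url_boost(results, url_fragment, boost=0.20):
    """Multiply the score of results whose URL contains url_fragment by (1 + boost)."""
    boosted = [
        {**r, "score": r["score"] * (1 + boost)} if url_fragment in r["url"] else r
        for r in results
    ]
    return sorted(boosted, key=lambda r: r["score"], reverse=True)


# Made-up results and scores, for illustration only.
results = [
    {"url": "/guides/new-york/", "score": 1.0},
    {"url": "/cheap-hotels/new-york/", "score": 0.9},
]
ranked = apply_url_boost(results, "/cheap-hotels/new-york/")
```

Here the matching result moves from second place to first, because its boosted score now exceeds the score of the unboosted result.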

7. Searching the index

The pipeline built many variants of the original query, based on the above steps. Those variants will be used to query the index and return the best results. That allows us to search indexes with millions of documents to find the best results - all in the blink of an eye.
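As a rough sketch of how several weighted query variants could be combined, the example below scores each document by the weighted share of variant terms it matches. The scoring formula, the `search_variants` function, the tiny index, and the weights are all assumptions chosen to keep the example small; the real ranking is more involved.

```python
from collections import defaultdict


def search_variants(index, variants):
    """Score documents by summing, per variant, weight * fraction of terms matched."""
    scores = defaultdict(float)
    for query, weight in variants:
        terms = query.split()
        for term in terms:
            for doc in index.get(term, ()):
                scores[doc] += weight / len(terms)
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical term -> documents index.
index = {
    "san": {"/sf-hotels", "/sf-history"},
    "francisco": {"/sf-hotels", "/sf-history"},
    "hotels": {"/sf-hotels", "/ny-hotels"},
}
# Each variant carries the weight the earlier pipeline steps assigned to it.
variants = [("san francisco hotels", 0.9), ("san fram hotels", 0.1)]
results = search_variants(index, variants)
ranked_docs = [doc for doc, _ in results]
```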

8. Returning the results

As the final step, the search results are sent back to the end user on the results page. Each result is given a tracking token that identifies it. Usually this token is used to record when a result is clicked. However, you can also record which results lead to another positive outcome, like a sale, sign-up, or article share.

The end user is given a list of results, ordered from most relevant to least relevant. Jane clicks on the top result, “The Best San Francisco Cheap Hotel Deals”, and finds a place to stay in "San Fran".
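Tracking tokens can be pictured as unique ids issued with each result and looked up again when an outcome is reported. The sketch below assumes random tokens kept in an in-memory dictionary; the function names and the event format are hypothetical and do not reflect the real tracking API.

```python
import uuid


def attach_tokens(query, urls):
    """Give each returned result a unique token and remember what it refers to."""
    issued = {}
    results = []
    for position, url in enumerate(urls, start=1):
        token = uuid.uuid4().hex
        issued[token] = {"query": query, "url": url, "position": position}
        results.append({"url": url, "token": token})
    return results, issued


def record_event(issued, events, token, kind="click"):
    """Log an outcome (click, sale, sign-up, ...) against a known token."""
    if token not in issued:
        raise KeyError("unknown tracking token")
    events.append({**issued[token], "event": kind})


results, issued = attach_tokens("san francisco hotels", ["/sf-hotels", "/sf-history"])
events = []
record_event(issued, events, results[0]["token"], kind="sale")
```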

In summary

That was a basic overview of how Search.io's Search works. Aside from spell-checking, synonyms, and boosting, there are a dozen other ways to customize search relevance. A library of pre-built steps makes this very easy.

For example, you can run a live A/B test on your e-commerce store. One pipeline may boost products with a higher profit margin and more stock on the floor, and another pipeline may boost products with a high customer review rating. These can then be live tested to see which better meets your goals.
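One common way to split traffic for such a test is to hash a stable user id, so each visitor consistently sees the same pipeline. The sketch below assumes that approach; the `assign_pipeline` function and the pipeline names are made up for illustration.

```python
import hashlib


def assign_pipeline(user_id, pipelines=("margin-and-stock", "review-rating")):
    """Deterministically assign a user to one pipeline based on a hash of their id."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return pipelines[digest[0] % len(pipelines)]


chosen = assign_pipeline("jane")
```

Because the assignment depends only on the user id, repeated searches by the same user always run through the same pipeline, which keeps the comparison between the two groups clean.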

And for very specific use-cases that aren't already covered, you can write your own Pipeline steps.
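Conceptually, a pipeline step is just a function that takes a query and returns an improved one, and a pipeline runs its steps in order. The sketch below illustrates that idea only; the step functions and `run_pipeline` are hypothetical and are not the actual interface for writing custom steps.

```python
def lowercase_step(query):
    """Normalize the query to lowercase."""
    return query.lower()


def strip_filler_step(query, filler=("the", "a", "an")):
    """Drop common filler words that rarely help matching."""
    return " ".join(word for word in query.split() if word not in filler)


def run_pipeline(query, steps):
    """Pass the query through each step in order."""
    for step in steps:
        query = step(query)
    return query


cleaned = run_pipeline("The San Francisco Hotels", [lowercase_step, strip_filler_step])
```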

And the best part: running through all of the above steps to perform a search takes less than 0.01 seconds on average.
