Crawling a website
This guide describes how to automatically index the pages of a website using the Search.io crawler.

Indexing your website

After successfully creating your Search.io account, you are ready to create your first collection.
A collection is a store of all your data and associated configuration. It can contain webpages, documents, or records that you want to make searchable.
Select "Crawl your website" from the available options.

Enter a collection name

Pick a descriptive name so you can distinguish collections later if you manage multiple domains, e.g. ‘my-domain-com’ or ‘my-store’.

Enter your domain URL

Enter the URL to your domain and hit "Index website" to start the indexing process.
Search.io's crawler then visits your website pages, processes the HTML document of each page, and stores records in your collection. The initial indexing process takes about 30 seconds to complete. If not all pages have been indexed as part of the initial setup, the process continues in the background. The time to index all your webpages depends on the size of your site.
If the crawler encounters an error, hop over to our Help Center or contact us. Common issues are password-protected sites and problems with canonicals.

Indexing multiple domains

If the pages you want to make accessible via search are spread across multiple domains or sub-domains, you can add additional domains. Once added, the content on the domains will automatically be indexed.

Using Sitemaps

A sitemap is a web standard that provides a list of URLs available for crawling. It must be present on the root of the domain with the name "sitemap.xml" (e.g. www.example.com/sitemap.xml). The crawler looks for sitemaps on domains that are being indexed and will visit the URLs in any sitemap it finds. If the Crawler does not find your sitemap for some reason, you can point it manually to the sitemap file.
  1. Navigate to Domains > Diagnose
  2. Enter the URL of the sitemap (e.g. www.example.com/sitemap.xml) and press "Diagnose"
  3. Press "Add to Index"
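For reference, a minimal sitemap.xml following the sitemaps.org protocol looks like this (URLs and dates are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per page you want crawled -->
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2023-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/products/widget</loc>
  </url>
</urlset>
```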

Instant Indexing

The best way to manage crawling on your site is to set up Instant Indexing. Instant Indexing ensures that new and updated pages are available immediately once visited, without having to wait for a full crawl cycle to complete.
For a modification to be picked up, the page must be visited within 30 minutes before or after the change is published.
An update is only acknowledged if the content of one of the following fields has changed:
"title", "description", "canonical", "robots", "og:title", "og:image", "og:description"
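For illustration, a page head containing all of the monitored fields might look like this (all values are placeholders):

```html
<head>
  <title>Blue Widget – Example Store</title>
  <meta name="description" content="A durable blue widget." />
  <meta name="robots" content="index, follow" />
  <link rel="canonical" href="https://www.example.com/products/blue-widget" />
  <meta property="og:title" content="Blue Widget" />
  <meta property="og:description" content="A durable blue widget." />
  <meta property="og:image" content="https://www.example.com/img/blue-widget.jpg" />
</head>
```

A change to the content of any one of these fields counts as an update for Instant Indexing purposes.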
Instant Indexing is enabled by adding a small snippet of JavaScript, also known as ping-back code, to pages on your site. When an end-user visits the page, it triggers a lightweight background request to the Search.io web crawler, which checks whether the page is new or updated and needs to be reindexed.
You can find the snippet tailored to your Collection in the Instant Indexing section in the Console.

Diagnose or add individual pages

The diagnose feature in the Domains section provides information on the status of URLs in your domains, including:
  • whether the URL has already been crawled
  • whether it redirects to another URL
  • when the URL was last visited by the crawler
  • any crawling errors
URLs that are not in your collection can also be added using the diagnose tool, and existing URLs can be manually reindexed.
  1. Navigate to the Domains section
  2. Click the "Diagnose" button
  3. Enter the URL you want to diagnose
  4. Press "Add to Index" to crawl the URL
  5. Check the status of the page by re-diagnosing the URL
The status might be "Pending" if a large number of indexing operations are running. The page is usually indexed instantly, but in some cases it can take a few minutes.

Common questions

Crawling frequency

All indexed pages are recrawled every 3-6 days. See Instant Indexing for detecting meta-data changes and updating them immediately.

Canonicals and redirects

A canonical tag (aka "rel canonical") is a way of telling search engines that a specific URL represents the master copy of a page. This is done by setting the canonical tag in the head section of the page, as below.
<link rel="canonical" href="https://www.search.io" />
Canonicals are used for a variety of reasons, such as choosing the preferred domain, http vs https preference, and consolidation of ranking "juice" for a given piece of content. Good canonicals can also help improve SEO. For more information, read how Google handles canonical tags and why the SEO community considers them important.
Canonicals are very important to the way Search.io works, and incorrect canonicals are one of the biggest reasons crawling fails to index content correctly. They are a very strong signal, and we generally won't index a URL if it has a canonical pointing elsewhere; we will instead try to index the canonical URL. The biggest mistakes we see with canonicals are:
  • Redirect loops: The canonical will point to a different URL, which will redirect back to the original, and so on.
  • Unresolvable: The URL in the canonical tag is either not a URL, does not exist, or cannot be resolved.
  • Self referential: Sometimes developers and CMSs set the canonical for each page as itself, defeating the point of canonicals.
  • All the same: Every page on a site has the exact same canonical URL (often the root domain or homepage).
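As a correct counter-example, a duplicate URL should point its canonical at the single preferred copy of the page (placeholder domain and paths):

```html
<!-- Served on the duplicate URL https://example.com/shoes?color=blue -->
<!-- Tells crawlers that the preferred copy lives at /shoes -->
<link rel="canonical" href="https://example.com/shoes" />
```

The canonical target should resolve with a 200 response and must not redirect back to the duplicate.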
You can tell if you have any of these issues using our content debug tool. You should either:
  1. Fix these issues, or
  2. Remove canonical tags from your pages altogether.
Removing all canonicals is much better than setting them incorrectly.

Indexing non-linked pages

It is common to find pages that are not linked in header, footer, navigation or from anywhere else on the website. There are two ways to make sure such pages are also added to the search index:

Submit your sitemap for index

If pages are not linked in the header, footer, navigation or anywhere else on your website, they can often be found in your sitemap.
You can submit your sitemap to the Search.io index so that even non linked pages will get a crawl status and will be visited by the crawler.
Navigate to Domains, click 'Diagnose', enter your sitemap URL, then click 'Diagnose' and 'Add to Index'. You can also follow a more in-depth guide here.

Manually index non-linked pages

Similarly, if you find individual pages are not being crawled, you can manually index them via the same diagnose tool.
Navigate to Domains, click 'Diagnose', enter the URL of the page you would like to index, then click 'Diagnose' and 'Add to Index'.

Prevent pages or content sections from being indexed

Preventing entire pages from being indexed

To stop a page from being indexed, add the attribute data-sj-noindex to an HTML element on the page.
<meta name="noindex" content="noindex" data-sj-noindex />
Note: although this will prevent our crawler from indexing the page, it will not stop other crawlers. To prevent all crawlers from indexing the page, add the attribute to the standard robots noindex meta tag:
<meta name="robots" content="noindex" data-sj-noindex />

Preventing specific content sections from being indexed

Typically the crawler is very good at ignoring navigation, ads, and other superfluous content. It will also automatically remove header and footer HTML elements where present.
If this still does not handle your situation, you can add the data-sj-ignore attribute to specific HTML elements; the crawler will then ignore that element along with all its children. Example:
<div data-sj-ignore>Unwanted content in here</div>

Debugging a page

The 'Page debug' tool allows you to see how data is extracted from your pages by our crawler.
After diagnosing a page, click 'See extended debug information' to use the Page debug tool. The Page debug tool crawls your webpage or document and gives you details of all the metadata, content, open graph data, and schema.org data extracted from your web page.
The Page debug tool helps you identify issues with your pages that degrade the quality of your search data, such as missing metadata, missing canonicals, incorrect mark-up, lack of content, and incorrect redirects.
[Page debug screenshot]

Site Search Health Report

Another tool that you can use to check for errors across your whole domain rather than a specific web page is the Search Health Report.
The Search Health Report contains helpful information about your content, meta-data, URL structure, query parameters, and server configuration. You also receive this report by email when you add a new domain or create a new collection in the Search.io Console.