This guide describes how to automatically index the pages of a website using the Search.io crawler.
Indexing your website
After successfully creating your Search.io account, you are ready to create your first collection.
A collection is a store of all your data and associated configuration. It can contain webpages, documents, or records that you want to make searchable.
Select "Crawl your website" from the available options.
Enter a collection name
Pick a descriptive name so you can distinguish your collections later if you have multiple domains, e.g. 'my-domain-com' or 'my-store'.
Enter your domain URL
Enter the URL of your domain and hit "Index website" to start the indexing process.
Search.io's crawler then visits your website's pages, processes the HTML document of each page, and stores records in your collection. The initial indexing process takes about 30 seconds to complete. If not all pages are indexed during the initial setup, the process continues in the background. The time needed to index all your webpages depends on the size of your site.
If the pages you want to make accessible via search are spread across multiple domains or sub-domains, you can add additional domains. Once added, the content on the domains will automatically be indexed.
A sitemap is a web standard that provides a list of URLs available for crawling. It must be present at the root of the domain with the name "sitemap.xml" (e.g. www.example.com/sitemap.xml). The crawler looks for sitemaps on domains that are being indexed and will visit the URLs in any sitemap it finds. If the crawler does not find your sitemap for some reason, you can point it to the sitemap file manually.
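A minimal sitemap following the sitemaps.org protocol might look like the sketch below; the domain, URLs, and dates are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2023-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/products</loc>
  </url>
</urlset>
```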
The best way to manage crawling on your site is to set up Instant Indexing. Instant Indexing ensures that new and updated pages are immediately available once visited, without having to wait for a full crawl cycle to complete.
For an update to be picked up, the page must be visited within 30 minutes before or after the modification is published.
An update is only acknowledged if the content of one of the following fields has changed.
You can find the snippet tailored to your Collection in the Instant Indexing section in the Console.
Diagnose or add individual pages
The diagnose feature in the Domains section provides information on the status of URLs in your domains, including:
whether the URL has already been crawled
whether the URL redirects to another URL
when the URL was last visited by the crawler
any crawling errors
URLs that are not in your collection can also be added using the diagnose tool, and existing URLs can be manually reindexed.
Check the status of the page by re-diagnosing the URL.
The status might be "Pending" if a high number of indexing operations are being run. A page is usually indexed instantly, but in some cases it can take a few minutes.
All indexed pages are recrawled every 3-6 days. See Instant Indexing for detecting meta-data changes and updating them immediately.
Canonicals and redirects
A canonical tag (aka "rel canonical") is a way of telling search engines that a specific URL represents the master copy of a page. This is done by setting the canonical tag in the head section of the page, as below.
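A canonical tag is a link element placed in the page's head; the URL below is a placeholder:

```html
<head>
  <!-- Tells crawlers that this URL is the master copy of the page -->
  <link rel="canonical" href="https://www.example.com/master-page" />
</head>
```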
Canonicals are very important to the way Search.io works and are one of the most common reasons the crawler fails to index content correctly. They are a very strong signal, and we generally won't index a URL if it has a canonical pointing elsewhere; we will instead try to index the canonical URL. The biggest mistakes we see with canonicals are:
Redirect loops: The canonical will point to a different URL, which will redirect back to the original, and so on.
Unresolvable: The URL in the canonical tag is either not a URL, does not exist, or cannot be resolved.
Self-referential: Sometimes developers and CMSs set the canonical of each page to itself, defeating the point of canonicals.
All the same: Every page on a site has the exact same canonical URL (often the root domain or homepage).
You can tell whether you have any of these issues using our content debug tool. You should either:
Fix these issues, or
Remove canonical tags from your pages altogether.
Removing all canonicals is much better than setting them incorrectly.
Indexing non-linked pages
It is common to find pages that are not linked in header, footer, navigation or from anywhere else on the website. There are two ways to make sure such pages are also added to the search index:
Submit your sitemap for index
If pages are not linked in the header, footer, navigation or anywhere else on your website, they can often be found in your sitemap.
You can submit your sitemap to the Search.io index so that even non linked pages will get a crawl status and will be visited by the crawler.
Navigate to Domains, click 'Diagnose', enter your sitemap URL, then click 'Diagnose' and 'Index'. You can also follow a more in-depth guide here.
Manually index non-linked pages
Similarly, if you find individual pages are not being crawled, you can manually index them via the same diagnose tool.
Navigate to Domains, click 'Diagnose', enter the URL of the page you would like to index, then click 'Diagnose' and 'Index'.
Prevent pages or content sections from being indexed
Preventing entire pages from being indexed
To stop a page from being indexed, add the attribute data-sj-noindex to an HTML element on the page.
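For example, the attribute can sit on any element of the page; placing it on the body tag is one option (a sketch, the exact element is up to you):

```html
<!-- data-sj-noindex tells the Search.io crawler to skip this page -->
<body data-sj-noindex>
  <p>Page content that should not appear in search results.</p>
</body>
```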
Note: although this will prevent our crawler from indexing the page, it will not stop other crawlers. To prevent all crawlers from indexing the page, add the attribute to the standard robots noindex meta tag:
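Combining the standard robots meta tag with the data-sj-noindex attribute might look like this:

```html
<head>
  <!-- content="noindex" blocks standard crawlers; data-sj-noindex blocks the Search.io crawler -->
  <meta name="robots" content="noindex" data-sj-noindex />
</head>
```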
Preventing specific content sections from being indexed
Typically the crawler is very good at ignoring navigation, ads, and other superfluous content. It will also automatically remove header and footer HTML elements where present.
In the case where this still does not handle your situation, you can add the data-sj-ignore attribute to specific HTML elements; the crawler will then ignore that element along with all its children. Example:
<div data-sj-ignore>Unwanted content in here</div>
Debugging a page
The 'Page debug' tool allows you to see how data is extracted from your pages by our crawler.
After diagnosing a page, click 'See extended debug information' to open the Page debug tool. The Page debug tool crawls your webpage or document and gives you details of all the metadata, content, Open Graph data, and schema.org data extracted from your web page.
The Page debug tool allows you to identify issues with your pages that degrade the quality of your search data, such as missing metadata, missing canonicals, incorrect mark-up, lack of content, and incorrect redirects.
Page debug screenshot
Site Search Health Report
Another tool that you can use to check for errors across your whole domain rather than a specific web page is the Search Health Report.
The Search Health Report contains helpful information about your content, metadata, URL structure, query parameters, and server configuration. You also receive this report by email when you add a new domain or create a new collection using the Search.io Console.