Crawling a website
This guide describes how to automatically index the pages of a website using the Search.io crawler.
After successfully creating your Search.io account, you are ready to create your first collection.
A collection is a store of all your data and associated configuration. It can contain webpages, documents, or records that you want to make searchable.
Select "Crawl your website" from the available options.
Pick a descriptive name for the collection so that you can tell your collections apart later if you have multiple domains, e.g. "my-domain-com" or "my-store".
Enter the URL of your domain, then click "Continue" to move on to the language selection screen. Choose your language, then click "Choose language and finish" to start the crawling process.
Search.io's crawler then visits the pages of your website, processes the HTML document of each page, and stores records in your collection. The initial crawling process takes about 30 seconds to complete. If not all pages have been crawled as part of the initial setup, the process continues in the background. The time needed to crawl all your webpages depends on the size of your site.
If the crawler encounters an error, visit our Help Center or contact us. Common issues are password-protected sites and problems with canonicals.
If the pages you want to make accessible via search are spread across multiple domains or sub-domains, you can add additional domains. Once added, the content on the domains will automatically be indexed.
A sitemap is a web standard that provides a list of URLs available for crawling. It must be present on the root of the domain with the name "sitemap.xml" (e.g. www.example.com/sitemap.xml). The crawler looks for sitemaps on domains that are being indexed and will visit the URLs in any sitemap it finds. If the crawler does not find your sitemap for some reason, you can point it manually to the sitemap file.
See "Submit your sitemap for index" for instructions on submitting your sitemap. You can also follow a more in-depth guide here.
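Before pointing the crawler at a sitemap, it can help to confirm that the file sits where the crawler expects it and that it parses cleanly. The following is a minimal sketch, not part of Search.io's tooling: the function names `default_sitemap_url` and `sitemap_locations` are made up for illustration, and the code assumes the sitemap uses the standard sitemaps.org XML namespace.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlsplit

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def default_sitemap_url(page_url: str) -> str:
    """Return the conventional sitemap location at the root of the page's domain."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/sitemap.xml"


def sitemap_locations(xml_text: str) -> List[str]:
    """Extract every <loc> URL from a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]


if __name__ == "__main__":
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://www.example.com/</loc></url>"
        "<url><loc>https://www.example.com/pricing</loc></url>"
        "</urlset>"
    )
    print(default_sitemap_url("https://www.example.com/blog/post"))
    print(sitemap_locations(sample))
```

In practice you would download the file at the URL returned by `default_sitemap_url` and pass its text to `sitemap_locations`; the download step is left out so the sketch stays independent of any HTTP library.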
The best way to manage crawling on your site is to set up Instant Indexing. Instant Indexing ensures that new and updated pages are available as soon as they are visited, without having to wait for a full crawl cycle to complete.
An updated page must be visited within 30 minutes before and after the modification is published.
An update is only acknowledged if the content of at least one of the following fields has changed:
"title", "description", "canonical", "robots", "og:title", "og:image", "og:description"
You can find the snippet tailored to your Collection in the Instant Indexing section in the Console.
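The rule that an update counts only when one of the listed fields changes can be pictured as a comparison between two metadata snapshots of the same page. The sketch below is illustrative only: how Search.io actually compares values is not documented here, so exact string comparison is an assumption, and `changed_fields` and `is_acknowledged_update` are hypothetical helper names.

```python
from typing import Dict, List

# Field names taken from the list above.
TRACKED_FIELDS = (
    "title", "description", "canonical", "robots",
    "og:title", "og:image", "og:description",
)


def changed_fields(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Return the tracked fields whose values differ between two snapshots."""
    # Assumption: values are compared as exact strings; a missing field
    # is treated as different from any present value.
    return [f for f in TRACKED_FIELDS if old.get(f) != new.get(f)]


def is_acknowledged_update(old: Dict[str, str], new: Dict[str, str]) -> bool:
    """True if at least one tracked field changed."""
    return bool(changed_fields(old, new))


if __name__ == "__main__":
    before = {"title": "Pricing", "description": "Our plans", "body": "v1"}
    after = {"title": "Pricing 2024", "description": "Our plans", "body": "v2"}
    print(changed_fields(before, after))
```

Under this reading, a change that touches only the page body, and none of the tracked fields, would not register as an update.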
The diagnose feature in the Crawler crawl statuses section provides information on the status of URLs in your domains, including:
- if the URL has been crawled already
- if the URL redirected to another URL
- when the URL was last visited by the crawler
- crawling errors (if any)
- if the page at the URL is in your collection's search index
URLs that are not in your collection can also be added using the diagnose tool, and existing URLs can be manually re-crawled.
- 1. Navigate to the crawl statuses section of Crawler.
- 2. Enter the URL you want to diagnose in the textbox beneath the page heading.
- 3. Press the "Diagnose" button.
- 4. Press "Crawl page".
- 5. Check the status of the page by re-diagnosing the URL.
The status might be "Pending" if a large number of indexing operations are running. A page is usually indexed instantly, but in some cases it might take a few minutes.
You can remove a page by clicking the "Delete page" button. This removes the page from your search index and also deletes its crawl status.
All indexed pages are recrawled every 3 to 6 days. See Instant Indexing for detecting metadata changes and applying them immediately.
A canonical tag (aka "rel canonical") is a way of telling search engines that a specific URL represents the master copy of a page. This is done by setting the canonical tag in the head section of the page, as below.
<link rel="canonical" href="https://www.search.io" />
Canonicals are used for a variety of reasons, such as choosing the preferred domain, http vs https preference, and consolidation of ranking "juice" for a given piece of content. Good canonicals can also help improve SEO. For more information, read how Google handles canonical tags and why the SEO community considers them important.
Canonicals are very important to the way Search.io works, and they are one of the most common reasons the crawler fails to index content correctly. They are a very strong signal: we generally won't index a URL if it has a canonical pointing elsewhere, and will instead try to index the canonical URL. The biggest mistakes we see with canonicals are:
- Redirect loops: The canonical will point to a different URL, which will redirect back to the original, and so on.
- Unresolvable: The URL in the canonical tag is either not a URL, does not exist, or cannot be resolved.
- Self-referential: Sometimes developers and CMSs set the canonical for each page to the page itself, defeating the point of canonicals.
- All the same: Every page on a site has the exact same canonical URL (often the root domain or homepage).
If your pages have any of these problems, either:
- 1. Fix these issues, or
- 2. Remove canonical tags from your pages altogether.
Removing all canonicals is much better than setting them incorrectly.
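Two of the mistakes listed above, unresolvable canonicals and every page sharing the same canonical, can be spotted offline from the HTML alone. The following is a rough sketch using only Python's standard library. It is not how the Search.io crawler evaluates canonicals; the helper names are hypothetical, and treating only absolute http(s) URLs as resolvable is a simplifying assumption. Redirect loops are not covered because detecting them requires following live HTTP redirects.

```python
from html.parser import HTMLParser
from typing import Dict, Optional
from urllib.parse import urlsplit


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.canonical is None:
            self.canonical = attr.get("href")


def find_canonical(html: str) -> Optional[str]:
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


def looks_like_url(value: Optional[str]) -> bool:
    """Assumption: only absolute http(s) URLs count as resolvable here."""
    if not value:
        return False
    try:
        parts = urlsplit(value)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def audit_canonicals(pages: Dict[str, str]) -> Dict[str, str]:
    """Map page URL -> issue label for pages with a suspicious canonical.

    `pages` maps each page URL to its HTML source.
    """
    canonicals = {url: find_canonical(html) for url, html in pages.items()}
    issues: Dict[str, str] = {}
    for url, canonical in canonicals.items():
        if canonical is not None and not looks_like_url(canonical):
            issues[url] = "unresolvable"
    values = [c for c in canonicals.values() if c]
    # Flag the case where several pages all declare one identical canonical.
    if len(pages) > 1 and len(values) == len(pages) and len(set(values)) == 1:
        for url in pages:
            issues.setdefault(url, "all the same")
    return issues
```

Calling `audit_canonicals` with a dictionary of page URLs and their HTML returns only the pages that look problematic, so an empty result means neither of the two checks fired.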
It is common to find pages that are not linked from the header, footer, navigation, or anywhere else on the website. There are two ways to make sure such pages are also added to the search index: submitting your sitemap, or crawling the pages manually.
Pages that are not linked anywhere on your website are often still listed in your sitemap.
You can submit your sitemap to the Search.io index so that even unlinked pages get a crawl status and are visited by the crawler.
- 1. Navigate to the crawl statuses section of Crawler.
- 2. Enter the URL of the sitemap into the textbox below the page heading (e.g. www.example.com/sitemap.xml) and press "Diagnose" to launch the diagnose modal.
- 3. Press "Crawl page".
Similarly, if you find individual pages are not being crawled, you can manually crawl them via the same diagnose tool.
Navigate to the crawl statuses section of Crawler. In the textbox beneath the page heading, enter the URL of the page you would like to crawl and click "Diagnose". When the diagnose modal loads with the page's crawl status, click the "Crawl page" button.
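To find out which pages are likely to need one of these two approaches, you can compare the URLs in your sitemap with the URLs that are reachable through links on your pages. The sketch below is one possible way to do that offline; `LinkCollector` and `unlinked_pages` are illustrative names, and it assumes you already have the HTML of the pages you want to scan.

```python
from html.parser import HTMLParser
from typing import Dict, Iterable, List, Set
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags on one page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(self.base_url, href))
            self.links.add(absolute)


def unlinked_pages(sitemap_urls: Iterable[str], pages: Dict[str, str]) -> List[str]:
    """Return sitemap URLs that no scanned page links to.

    `pages` maps each scanned page URL to its HTML source.
    """
    linked: Set[str] = set()
    for url, html in pages.items():
        collector = LinkCollector(url)
        collector.feed(html)
        linked |= collector.links
    return sorted(set(sitemap_urls) - linked)
```

URLs returned by `unlinked_pages` are candidates for sitemap submission or a manual crawl. The comparison is by exact string, so differences such as a trailing slash or `http` versus `https` will show up as unlinked.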
To stop a page from being crawled and indexed, add the attribute data-sj-noindex to an HTML element on the page.
<meta name="noindex" content="noindex" data-sj-noindex />
Note: although this will prevent our crawler from indexing the page, it will not stop other crawlers. Use the attribute on the standard "robots noindex meta tag" to prevent all crawlers from indexing the page:
<meta name="robots" content="noindex" data-sj-noindex />
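To verify that a page carries the intended markers before it is crawled, you can scan its HTML for the `data-sj-noindex` attribute and for a robots noindex meta tag. This is a small sketch under stated assumptions: `check_noindex` is a hypothetical helper, and the robots check only looks at a comma-separated `content` value on `<meta name="robots">`, not at HTTP headers or robots.txt.

```python
from html.parser import HTMLParser
from typing import Tuple


class NoIndexFinder(HTMLParser):
    """Detects data-sj-noindex on any element and noindex in the robots meta tag."""

    def __init__(self) -> None:
        super().__init__()
        self.sj_noindex = False
        self.robots_noindex = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # A valueless attribute is reported with a value of None.
        if "data-sj-noindex" in attr:
            self.sj_noindex = True
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = (attr.get("content") or "").lower()
            if "noindex" in [part.strip() for part in content.split(",")]:
                self.robots_noindex = True


def check_noindex(html: str) -> Tuple[bool, bool]:
    """Return (has data-sj-noindex, has robots noindex) for the given HTML."""
    finder = NoIndexFinder()
    finder.feed(html)
    return finder.sj_noindex, finder.robots_noindex
```

For the two tags shown above, the first would report only the `data-sj-noindex` marker, while the second would report both markers.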
Typically the crawler is very good at ignoring navigation, ads and other superfluous content. It will also automatically remove footer HTML elements if they are used.
If this still does not handle your situation, you can add the data-sj-ignore attribute to specific HTML elements, and the crawler will then ignore that element along with all its children. Example:
<div data-sj-ignore>Unwanted content in here</div>
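To preview roughly what text remains once ignored elements are dropped, the behaviour described above can be approximated locally. This is only an approximation and not the crawler's actual extraction logic: `visible_text` is a hypothetical helper, it skips `footer` elements and anything marked with `data-sj-ignore`, and it assumes reasonably well-formed HTML in which every non-void element has a closing tag.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IgnoringTextExtractor(HTMLParser):
    """Collects text, skipping <footer> and data-sj-ignore subtrees."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0  # > 0 while inside an ignored subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1
        elif tag == "footer" or "data-sj-ignore" in dict(attrs):
            self.skip_depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags contain no text and do not change nesting depth.
        pass

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text left after removing ignored subtrees."""
    extractor = IgnoringTextExtractor()
    extractor.feed(html)
    return " ".join(extractor.chunks)
```

Running `visible_text` on a page before and after adding `data-sj-ignore` gives a quick check that the attribute was placed on the right element.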
After diagnosing a page, click "Open in page debugger" to use the Page debug tool. The Page debug tool crawls your webpage or document and shows you all the metadata, content, Open Graph data, and schema.org data extracted from it.
The Page debug tool lets you identify issues with your pages that degrade the quality of search data, such as missing metadata, missing canonicals, incorrect markup, lack of content, and incorrect redirects.
To check for errors across your whole domain rather than on a single web page, use the Search Health Report.
The Search Health Report contains helpful information about your content, metadata, URL structure, query parameters, and server configuration. The report is also emailed to you when you add a new domain or create a new collection using the Search.io Console.