Crawling a website
This guide describes how to automatically index the pages of a website using the Search.io crawler.
After successfully creating your Search.io account, you are ready to create your first collection.
Select "Crawl your website" from the available options.
Pick a descriptive name so you can distinguish your collections later if you have multiple domains, e.g. ‘my-domain-com’ or ‘my-store’.
Enter your domain's URL, then click "Continue" to move on to the language selection screen. Choose your language, then hit "Choose language and finish" to start the crawling process.
Search.io's crawler then visits your website pages, processes the HTML document of each page, and stores records in your collection. The initial crawling process takes about 30 seconds to complete. If not all pages have been crawled as part of the initial setup, the process continues in the background; the time to crawl all your web pages depends on the size of your site.
If the pages you want to make accessible via search are spread across multiple domains or sub-domains, you can add additional domains. Once added, the content on the domains will automatically be indexed.
The best way to manage crawling on your site is to set up Instant Indexing. Instant Indexing ensures that new and updated pages are immediately available once visited, without having to wait for a full crawl cycle to complete.
Instant Indexing is enabled by adding a small snippet of JavaScript, also known as ping-back code, to the pages on your site. When a page is visited by an end user, it triggers a lightweight background request to the Search.io web crawler, which checks whether the page is new or updated and needs to be reindexed. You can find the snippet tailored to your Collection in the Console.
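The exact snippet is generated for your Collection in the Console, so copy it from there. The sketch below only illustrates the general shape of a ping-back; the endpoint URL and parameter are placeholders, not Search.io's actual API.

<!-- Illustrative sketch only; use the real snippet from the Console. -->
<script>
  // Hypothetical endpoint: notify the crawler of the current page in the background.
  var ping = new Image();
  ping.src = "https://crawler.example.com/ping?url=" +
    encodeURIComponent(window.location.href);
</script>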
The diagnose feature in the crawl status section of the Console provides information on the status of URLs in your domains, including:
if the URL has been crawled already
if the URL redirected to another URL
when the URL was last visited by the crawler
crawling errors (if any)
if the page at the URL is in your collection's search index
URLs that are not in your collection can also be added using the diagnose tool, and existing URLs can be manually re-crawled.
Navigate to the crawl status section in the Console.
Enter the URL you want to diagnose in the textbox beneath the page heading.
Press the "Diagnose" button.
Press "Crawl page".
Check the status of the page by re-diagnosing the URL.
You can remove a page by clicking the "Delete page" button. This removes the page from your search index and also deletes its crawl status.
All indexed pages are recrawled every 3-6 days. See Instant Indexing for detecting meta-data changes and updating them immediately.
A canonical tag (aka "rel canonical") is a way of telling search engines that a specific URL represents the master copy of a page. This is done by setting the canonical tag in the head section of the page, as below.
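For example, a page that is reachable at several URLs can declare its preferred URL like this (the URL below is a placeholder):

<head>
  <!-- Tells search engines that this URL is the master copy of the page -->
  <link rel="canonical" href="https://www.example.com/products/blue-widget" />
</head>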
Canonicals are used for a variety of reasons, such as choosing the preferred domain, http vs https preference, and consolidation of ranking "juice" for a given piece of content. Good canonicals can also help improve SEO.
Canonicals are very important to the way Search.io works and are one of the biggest reasons for crawling failing to index content correctly. They are a very strong signal and we generally won't index a URL if it has a canonical pointing elsewhere; we will instead try to index the canonical URL. The biggest mistakes we see with canonicals are:
Redirect loops: The canonical will point to a different URL, which will redirect back to the original, and so on.
Unresolvable: The URL in the canonical tag is either not a URL, does not exist, or cannot be resolved.
Self-referential: Sometimes developers and CMSs set the canonical for each page to itself, defeating the point of canonicals.
All the same: Every page on a site has the exact same canonical URL (often the root domain or homepage), as in the example below.
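For example, the "all the same" mistake looks like this on every page of a site (URLs are placeholders):

<!-- On https://www.example.com/products/blue-widget -->
<link rel="canonical" href="https://www.example.com/" />
<!-- Every page's canonical points at the homepage, so the crawler will try to
     index the homepage instead of the page the visitor actually sees. -->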
You can tell if you have some of these issues using our diagnostic tools. You should either:
Fix these issues, or
Remove canonical tags from your pages altogether.
Removing all canonicals is much better than setting them incorrectly.
It is common to find pages that are not linked from the header, footer, navigation, or anywhere else on the website. There are two ways to make sure such pages are also added to the search index:
If pages are not linked in the header, footer, navigation or anywhere else on your website, they can often be found in your sitemap.
You can submit your sitemap to the Search.io index so that even non-linked pages will get a crawl status and will be visited by the crawler.
Navigate to the crawl status section under Crawler in the Console.
Enter the URL of the sitemap into the textbox below the page heading (e.g. https://www.example.com/sitemap.xml) and press "Diagnose" to launch the diagnose modal.
Press "Crawl page".
Similarly, if you find individual pages are not being crawled, you can manually crawl them via the same diagnose tool. Navigate to the crawl status section of the Console. In the textbox beneath the page heading, enter the URL of the page you would like to crawl and click "Diagnose". When the diagnose modal loads with the page's crawl status, click the "Crawl page" button.
To stop a page from being crawled and indexed, add the attribute data-sj-noindex to an HTML element on the page.
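For example, the attribute can sit on any element that is already on the page; the body tag below is just one choice:

<!-- data-sj-noindex tells the Search.io crawler not to index this page -->
<body data-sj-noindex>
  <!-- page content -->
</body>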
Note: although this will prevent our crawler from indexing the page, it will not stop other crawlers. To prevent all crawlers from indexing the page, use the attribute on the standard robots noindex meta tag:
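<!-- The robots noindex meta tag is respected by well-behaved crawlers in general;
     adding data-sj-noindex to the same tag also covers the Search.io crawler. -->
<meta name="robots" content="noindex" data-sj-noindex />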
Typically the crawler is very good at ignoring navigation, ads, and other superfluous content. It will also automatically remove header and footer HTML elements if they are used. If this still does not handle your situation, you can add the data-sj-ignore attribute to specific HTML elements and the crawler will then ignore that element along with all of its children. Example:
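<!-- The related-products block below is made up; only the data-sj-ignore
     attribute matters. The crawler skips this element and all of its children. -->
<aside class="related-products" data-sj-ignore>
  <h3>You may also like</h3>
  <ul>
    <li><a href="/products/red-widget">Red widget</a></li>
  </ul>
</aside>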
The Page debug tool allows you to see how data is extracted from your pages by our crawler and to identify issues that deteriorate the quality of your search data, such as missing metadata, missing canonicals, incorrect mark-up, lack of content, and incorrect redirects. After diagnosing a page, click "Open in page debugger" to launch it: the tool crawls your web page or document and gives you details of all the extracted metadata, content, and other data from the page.
Another tool you can use to check for errors across your whole domain, rather than a specific web page, is the Search Health Report. It contains helpful information about your content, meta-data, URL structure, query parameters, and server configuration. You also get this report emailed to you when you add a new domain or create a new collection using the Search.io Console.
If the crawler encounters an error, hop over to our or . Common issues are and issues with .
A sitemap is a web standard that provides a list of URLs available for crawling. It must be present at the root of the domain with the name "sitemap.xml" (e.g. https://www.example.com/sitemap.xml). The crawler looks for sitemaps on domains that are being indexed and will visit the URLs in any sitemap it finds. If the crawler does not find your sitemap for some reason, you can point it manually to the sitemap file.
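A minimal sitemap.xml looks like this (the URLs are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
  </url>
  <url>
    <loc>https://www.example.com/products/blue-widget</loc>
  </url>
</urlset>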
See "Submit your sitemap for index" for instructions on submitting your sitemap. You can also follow a more in-depth guide.