# Adding custom fields

## What data is indexed by default? <a href="#what-data-is-indexed-by-default" id="what-data-is-indexed-by-default"></a>

The crawler extracts metadata from each page and condenses it into a standard set of fields to be added to the search index.

{% hint style="info" %}
JavaScript-rendered elements are not indexed. Scripts that change content after the DOM has loaded (e.g. Optimize running via Google Tag Manager) are also not taken into account.
{% endhint %}

### Page metadata <a href="#page-metadata" id="page-metadata"></a>

The crawler uses page metadata and content to construct a standardized set of fields:

* URL (`url`). The full URL of the page
* Title (`title`). The meta-title of the page
* Image (`image`). URL for the page image
* Language (`lang`). Language of the page content (`en`, `fr`, `de`, ...)
* Description (`description`). The meta description of the page
* Keywords (`keywords`). List of keywords for the page
* Modified Time (`modified_time`). The time when the page was last modified
* Published Time (`published_time`). The time when the page was first published
* Headings (`headings`). List of headings from the body of the page

To see the full list of fields we crawl and their associated HTML markup, visit [this page](https://kb.search.io/KB/what-html-elements-does-search-io-crawl).

{% hint style="info" %}
When multiple metadata types are used for a given field, the crawler will use OpenGraph values over others.

* Page title: `og:title` over `<title>`
* Page description: `og:description` over `<meta name="description">`
  {% endhint %}
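
The precedence rule above can be checked locally before a crawl. The following is a minimal sketch using Python's standard `html.parser`; the `TitleProbe` class and `effective_title` function are hypothetical helpers written for illustration and are not part of the crawler.

```python
from html.parser import HTMLParser


class TitleProbe(HTMLParser):
    """Collects the <title> text and the og:title meta value from a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_text = ""
        self.og_title = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attrs.get("property") == "og:title":
            self.og_title = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_text += data


def effective_title(html):
    """Return og:title when present, otherwise the <title> text."""
    probe = TitleProbe()
    probe.feed(html)
    return probe.og_title or probe.title_text.strip()


page = (
    "<head><title>Plain title</title>"
    '<meta property="og:title" content="OG title"/></head>'
)
print(effective_title(page))  # prints "OG title"
```

To confirm what is actually indexed for a live page, the Page debug tool remains the authoritative check.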

### Body content <a href="#body-content" id="body-content"></a>

The page `<body>` is summarised to provide a more concise base for searching. This process discards text inside `<head>`, `<script>`, `<header>` and `<footer>` elements.

### URL fields

Fields derived from the URL are also included for common queries (e.g. limiting to a domain or particular sub-URL structure of a site):

* Domain (`domain`). The domain of the URL
* First directory (`dir1`). The first directory of the URL, or empty if none
* Second directory (`dir2`). The second directory of the URL, or empty if none
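
As a rough illustration of how these fields relate to a URL, the sketch below derives them with Python's `urllib.parse`. It rests on assumptions that this page does not confirm: that `domain` is the full hostname, and that the last path segment is treated as the page rather than a directory unless the path ends in `/`. The `url_fields` function is hypothetical; check real values with the Page debug tool.

```python
from urllib.parse import urlsplit


def url_fields(url):
    """Derive domain, dir1 and dir2 from a URL (illustrative assumptions only)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Assumption: a path that does not end in "/" finishes with the page
    # itself, so only the segments before it count as directories.
    if not parts.path.endswith("/"):
        segments = segments[:-1]
    return {
        "domain": parts.hostname or "",
        "dir1": segments[0] if len(segments) > 0 else "",
        "dir2": segments[1] if len(segments) > 1 else "",
    }


print(url_fields("https://www.example.com/blog/2021/my-post.html"))
# {'domain': 'www.example.com', 'dir1': 'blog', 'dir2': '2021'}
```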

### Custom metadata

In addition to the above, the following metadata is also extracted if available:

* All meta tags within head
* OpenGraph tags
* Custom SJ tags
* Body content (`<body>`)

{% hint style="info" %}
To test what content of a webpage is indexed, use our [Page debug tool](https://app.search.io/page-debug).
{% endhint %}

## Indexing custom fields <a href="#indexing-additional-fields" id="indexing-additional-fields"></a>

1. Add a schema field (e.g. `authors`) and select the desired schema field type.
2. Add custom meta tags to your site (see below).
3. Crawl a page containing the custom field via the [diagnose tool](https://app.search.io/collection/crawler/crawl-statuses). Use the [preview section](/documentation/guides/general/previewing-results.md) to check that the additional field was indexed correctly.
4. Re-crawl all domains so all records are updated.

{% hint style="info" %}
Schema field names must begin with a letter and contain only letters, numbers or underscores.
{% endhint %}
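
The naming rule above can be expressed as a regular expression, which is handy if field names are generated from a CMS or template. This is a minimal sketch; `is_valid_field_name` is a hypothetical helper, and it assumes "letters" means ASCII letters.

```python
import re

# A letter first, then any mix of letters, digits or underscores.
# Assumption: only ASCII letters are accepted.
FIELD_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_field_name(name):
    """Return True when the name follows the schema field naming rule."""
    return FIELD_NAME.fullmatch(name) is not None


for candidate in ["authors", "topic_2", "2fast", "my-field"]:
    print(candidate, is_valid_field_name(candidate))
```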

#### Adding custom meta tags to your webpage <a href="#adding-custom-meta-tags-to-your-webpage" id="adding-custom-meta-tags-to-your-webpage"></a>

Filters and facets often use additional fields to provide better searching and filtering capabilities. For example, a news site might want to filter by topic or a documentation site by version.

Custom meta tags allow you to add those additional fields to your records. Meta tags are defined in HTML by adding `data` attributes to elements. To avoid name clashes with other systems, data attributes must contain the prefix `data-sj-`.

#### Defining custom fields in `<head>` elements <a href="#defining-custom-fields-in-head-elements" id="defining-custom-fields-in-head-elements"></a>

By default the crawler reads `<meta>` tags within `<head>`, but only keeps standard fields (title, description, keywords, etc). Add a `data-sj-field="fieldname"` attribute to override this behaviour and create a custom field from the meta tag's `content` attribute. This example shows an otherwise ignored `<meta>` tag being converted into a custom field `fieldname="fieldvalue"`:

```html
<meta
  property="custom meta field"
  data-sj-field="fieldname"
  content="fieldvalue"
/>
```

#### Defining custom fields in `<body>` elements <a href="#defining-custom-fields-in-body-elements" id="defining-custom-fields-in-body-elements"></a>

To capture data already rendered within an element, add `data-sj-field="fieldname"` to it:

```html
<span data-sj-field="random">This text is the value</span>
```

This will set custom field `random="This text is the value"`.

If you don't want the data rendered on the page, you can instead set the field value with the `data-sj-value` attribute.

```html
<span data-sj-field="fieldname" data-sj-value="fieldvalue">
  This text is not used because the data attribute has a value
</span>
```
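
To preview which custom fields a page declares before crawling it, the rules described so far can be approximated locally. The sketch below is an illustration only, not the crawler's implementation: `FieldProbe` and `extract_fields` are hypothetical names, and the parser assumes well-formed markup and ignores `data-sj-field` elements nested inside another captured element.

```python
from html.parser import HTMLParser


class FieldProbe(HTMLParser):
    """Collects data-sj-field values from <meta> tags and body elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._capture = None  # state for an element whose text is the value

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._capture is not None:
            # Track nesting of the same tag so the right end tag closes it.
            if tag == self._capture["tag"]:
                self._capture["depth"] += 1
            return
        name = attrs.get("data-sj-field")
        if not name:
            return
        if tag == "meta":
            self.fields[name] = attrs.get("content") or ""
        elif "data-sj-value" in attrs:
            # An explicit value takes the place of the element text.
            self.fields[name] = attrs["data-sj-value"] or ""
        else:
            self._capture = {"name": name, "tag": tag, "depth": 1, "text": []}

    def handle_endtag(self, tag):
        cap = self._capture
        if cap and tag == cap["tag"]:
            cap["depth"] -= 1
            if cap["depth"] == 0:
                # Collapse runs of whitespace in the captured text.
                self.fields[cap["name"]] = " ".join("".join(cap["text"]).split())
                self._capture = None

    def handle_data(self, data):
        if self._capture:
            self._capture["text"].append(data)


def extract_fields(html):
    probe = FieldProbe()
    probe.feed(html)
    return probe.fields
```

Calling `extract_fields` on a page containing the three snippets above would return `fieldname`, `random` and the second `fieldname`-style entry keyed by their `data-sj-field` names, with the `data-sj-value` text winning over the element text.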

#### Adding data to a list field type <a href="#adding-data-to-a-list-field-type" id="adding-data-to-a-list-field-type"></a>

You can add a list of values by repeating the same tag multiple times. The schema field type must be one of the list types (List of Strings, List of Integers, etc.).

```html
<meta data-sj-field="topics" content="Art"/>
<meta data-sj-field="topics" content="Biology"/>
<meta data-sj-field="topics" content="Chemistry"/>
```

In the example above, the strings "Art", "Biology" and "Chemistry" will be stored as a list against the field `topics`.

**Note:** If you have multiple meta tags on your page for a specific field but the schema field type is not a list, we will not index that webpage.
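
Because a page with repeated tags for a non-list field is not indexed, it can be worth scanning templates for repeated field names and comparing them against your schema. The following is a minimal sketch under that assumption; `MetaFieldCollector` and `repeated_fields` are hypothetical helpers and only look at `<meta>` tags.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MetaFieldCollector(HTMLParser):
    """Groups <meta data-sj-field="..."> content values by field name."""

    def __init__(self):
        super().__init__()
        self.values = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("data-sj-field")
        if tag == "meta" and name:
            self.values[name].append(attrs.get("content") or "")


def repeated_fields(html):
    """Return the fields that occur more than once, with their values in order."""
    collector = MetaFieldCollector()
    collector.feed(html)
    return {name: vals for name, vals in collector.values.items() if len(vals) > 1}


page = (
    '<meta data-sj-field="topics" content="Art"/>'
    '<meta data-sj-field="topics" content="Biology"/>'
    '<meta data-sj-field="topics" content="Chemistry"/>'
    '<meta data-sj-field="author" content="Ann"/>'
)
print(repeated_fields(page))  # {'topics': ['Art', 'Biology', 'Chemistry']}
```

Every field name this returns should map to a list type in the schema.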

**Localization**

**Problem:** I have very locally targeted content and want to recommend local content based on my site visitor's location.

**Solution:** On each locally targeted content page, add two pieces of meta information, for example:

```html
<span data-sj-field="lat" data-sj-value="-33.867487"></span>
<span data-sj-field="lng" data-sj-value="181.3615434"></span>
```

In the example above, the `data-sj-field` attribute indicates that the information is specific to the page. So `data-sj-field="lat"` indicates that this page has a property called `lat` with the corresponding value `-33.867487`.

#### Processed meta data vs Raw meta data <a href="#processed-meta-data-vs-raw-meta-data" id="processed-meta-data-vs-raw-meta-data"></a>

Processed metadata is the metadata that is stored in the index. Raw metadata is read by the crawler, but may not be stored in the search index. For example, links on a webpage are raw metadata: they help the crawler find linked pages, but do not need to be recorded in the search index.

