Everything You Need to Know About (XML) Sitemaps and SEO

Stephan Czysch 5/26/2023

The creation of HTML and XML sitemaps is a basic SEO principle, as it improves the crawling and indexing of your page(s). Learn everything about pitfalls and best practices for sitemaps here.

Table of contents

What is a sitemap and why is it important for websites?
What requirements do XML-Sitemaps have?
What other requirements apply to XML-Sitemaps?
What are Image-Sitemaps, News-Sitemaps, Video-Sitemaps and hreflang-Sitemaps?
Bundle Sitemaps with Sitemap Index File(s)
Where should the Sitemap be stored? And what should its name be?
How can a sitemap be created and which tools can help?
Make Sitemaps known to search engines via the robots.txt, Sitemap ping or the Webmaster Tools
How can one ensure that a Sitemap is current and error-free?
Does a Website need an XML or HTML-Sitemap?
Let’s sum it up: Here’s what you need to know about XML-Sitemaps

Sitemaps may sound like SEO from 2001. But they still have their place today and will keep it in the future, because they help to get new content indexed quickly and updated content re-indexed. Always remember: without indexing, no ranking is possible. Therefore, the use of sitemaps is one of the SEO basics. But first things first: What are sitemaps in the first place and why should you use them?

What is a sitemap and why is it important for websites?

The term sitemap is used for an overview of a website. Sitemaps are essentially a list of the individual pages of a website. They can include all pages or only selected ones.

The basic distinction is made between HTML sitemaps and XML sitemaps. While HTML sitemaps can also be used by visitors for navigation, XML sitemaps are mainly intended for search engines.

An example of an HTML sitemap can be found on the Website of the Federal Government.

On the website of the Federal Government, the sitemap is simply called “Overview” and offers jumps to deeper levels of the website. Source: bundesregierung.de

For HTML sitemaps, it is important to ensure that they are understandable for users. Therefore, it may make sense to link only particularly important (distributor) pages, instead of listing “everything”.

XML sitemaps are usually what SEO professionals mean when they talk about a sitemap. XML stands for “extensible markup language”. In XML sitemaps, not only the address of a page is listed, but also additional data points for this page.

One example of such an additional data point is the date on which the page was last updated. In this way, XML sitemaps can positively influence crawling. Are you wondering what crawling means? Crawling is the automatic retrieval of web pages by the so-called crawlers or robots of search engines. At Google, the crawler is called Googlebot.

This already describes the task of XML sitemaps and their relevance for search engine optimization: (XML) sitemaps make pages discoverable by listing their addresses. Only pages known to a search engine can be crawled, indexed and then appear in the search results.

A note at this point: Search engines like Google can also process sitemaps in other formats such as Atom or RSS feeds. Plain .txt files can be used as a sitemap, too. You can find these alternative sitemap formats and their advantages and disadvantages in the Google Help.

What requirements do XML-Sitemaps have?

Standards ensure that the web works well. This applies to “normal” websites and HTML, the Hyper Text Markup Language, as well as to XML sitemaps. In November 2006, Google, Yahoo and Microsoft, the major search engines at the time, agreed on a common standard for XML-Sitemaps. See sitemaps.org for the official website.

XML-Sitemaps can contain mandatory as well as optional information. Currently, the following information in the XML-Sitemap is mandatory:

<urlset>: This opening element declares that individual addresses are listed below. It is a “set” of URLs, hence the name urlset.
<url>: Wraps the entry for a single web address (URL). The address itself is given in the <loc> element inside it.
<loc>: Contains the address of the page (loc is short for location).

The following information is optional:

<lastmod>: Contains the date of the last update of the address. This timestamp can ensure timely (re-)crawling of an updated page.
<changefreq>: Indicates how often the page usually changes. Possible values are always, hourly, daily, weekly, monthly, yearly and never. The value never can be used, for example, for pages that are now archived, while always should only be used for pages that actually change with every page view.
<priority>: Defines the relative priority of a page with values between 1.0 and 0.0. The default value (= medium priority) is 0.5. The values can be specified in 0.1 steps within the range of 1.0 and 0.0.
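To make the elements listed above more tangible, here is a minimal Python sketch that writes a sitemap with the mandatory <urlset>, <url> and <loc> elements and an optional <lastmod>. The function name build_sitemap and the example.com addresses are assumptions for illustration only; a CMS will usually generate this file for you.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol on sitemaps.org
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML as bytes for a list of (loc, lastmod or None) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # mandatory
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod  # optional
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Hypothetical example addresses
xml_bytes = build_sitemap([
    ("https://www.example.com/", "2023-04-21"),
    ("https://www.example.com/about/", None),
])
print(xml_bytes.decode("utf-8"))
```

<changefreq> and <priority> are left out on purpose in this sketch, since Google states that it ignores both values.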

Is the optional data worth it? Only if it is filled in “correctly”.

In the Google Help, the search engine giant from Mountain View writes:

Google ignores the values <priority> and <changefreq>.
Google uses the value <lastmod> if it is consistent and verifiably correct, for example by comparing it with the last change of the page.

Here you can find the corresponding source from Google.

Regardless of Google’s view of things: Assigning a priority of 1.0 to every page, for example, makes no sense at all, because it is about the relative importance of all addresses in the sitemap to each other.

Likewise, specifying a “wrong” <lastmod> date does not bring any advantage, as Google writes. The date should only be updated if something on the page has actually changed. Search engines are able to determine whether a page has changed since their last visit, so false data is not followed blindly.
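One possible way to keep <lastmod> honest is to store a hash of the page content and only move the date when that hash changes. The following is a sketch under that assumption; the function name updated_lastmod and the idea of storing a (hash, date) pair per page are illustrative and not something Google or sitemaps.org prescribe.

```python
import hashlib


def updated_lastmod(content, today, previous=None):
    """Return a (content_hash, lastmod) pair.

    previous is the pair stored from the last run, or None for a new page.
    The lastmod date only changes when the content hash changes.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if previous is not None and previous[0] == digest:
        return previous  # content unchanged: keep the old date
    return (digest, today)


first = updated_lastmod("<h1>Hello</h1>", "2023-04-21")
same = updated_lastmod("<h1>Hello</h1>", "2023-05-02", previous=first)
changed = updated_lastmod("<h1>Hello again</h1>", "2023-05-02", previous=first)
print(same[1], changed[1])
```

In practice you would hash only the main content of the page, so that a changed footer or rotating teaser does not count as an update.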

What other requirements apply to XML-Sitemaps?

For your sitemaps to help search engines and thus search engine optimization, you need to know the following:

A maximum of 50,000 <loc> entries are possible per sitemap
The maximum file size for a sitemap is 50 MB (uncompressed)
A website can have several sitemaps
For the encoding of addresses and other information UTF-8 should be used so that all letters and characters can be “correctly” interpreted
Dates must be specified in YYYY-MM-DD syntax
A time is optional and is appended to the date. If the time is given in UTC, a Z is placed after it. For local times, such as the German time zone, the Z is omitted and the offset from UTC is appended with a + (or a - for time zones behind UTC), for example +01:00 for Central European Time (+02:00 during summer time). It then looks like this: <lastmod>2023-04-21T07:09:10+01:00</lastmod>.
Only absolute addresses should be mentioned in sitemaps. So, specify addresses in the format https://www.getindexed.io/knowledge-hub/crawled-currently-not-indexed-fix/ and not “relative” using /knowledge-hub/crawled-currently-not-indexed-fix/
Submitted addresses should actually exist and be allowed to be crawled and indexed by search engines. However, minor flaws such as a few pages that no longer exist do not cause any harm.
Optionally, a sitemap can be compressed with gzip (GNU zip). The result is similar to a .zip file, but uses .gz as the file extension.
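Several of the requirements above can be checked automatically. The following Python sketch formats <lastmod> values with a time zone, tests the 50,000 URL and 50 MB limits, flags relative addresses and compresses the result with gzip. All function names (is_absolute, format_lastmod, check_sitemap) are hypothetical helpers and not part of any sitemap tool.

```python
import gzip
from datetime import datetime, timedelta, timezone
from urllib.parse import urlsplit

MAX_URLS = 50_000             # <loc> entries per sitemap
MAX_BYTES = 50 * 1024 * 1024  # 50 MB, measured on the uncompressed file


def is_absolute(url):
    """True for addresses with scheme and host, e.g. https://www.example.com/page/."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def format_lastmod(moment):
    """Format a timezone-aware datetime for <lastmod>, using Z for UTC."""
    text = moment.isoformat(timespec="seconds")
    if text.endswith("+00:00"):
        text = text[:-6] + "Z"
    return text


def check_sitemap(urls, xml_bytes):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if len(urls) > MAX_URLS:
        problems.append(f"too many URLs: {len(urls)}")
    if len(xml_bytes) > MAX_BYTES:
        problems.append(f"file too large: {len(xml_bytes)} bytes")
    problems.extend(f"not absolute: {url}" for url in urls if not is_absolute(url))
    return problems


utc_stamp = format_lastmod(datetime(2023, 4, 21, 7, 9, 10, tzinfo=timezone.utc))
local_stamp = format_lastmod(
    datetime(2023, 4, 21, 7, 9, 10, tzinfo=timezone(timedelta(hours=1)))
)
problems = check_sitemap(["https://www.example.com/", "/knowledge-hub/"], b"<urlset/>")
compressed = gzip.compress(b"<urlset/>")  # content of a sitemap.xml.gz file

print(utc_stamp, local_stamp, problems)
```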

What are Image-Sitemaps, News-Sitemaps, Video-Sitemaps and hreflang-Sitemaps?

Sitemaps can contain not only classic web pages but also special content types like images or videos. This helps to get the corresponding file types indexed (here you can find the OMR article on the topic Indexing). With regard to the necessary data, Google made a few changes in May 2022. For example, for image sitemaps the previously optional data

caption (image caption)
geo_location (location where the image was taken)
title (picture title)
license (licence information)

are no longer supported. If your image sitemaps still contain this information, you do not need to delete it. Google simply no longer evaluates it. Accordingly, Google only interprets the previously mandatory elements <image:image> and <image:loc>. Further information on the image extension for sitemaps can be found at Google.

In the same blog post Google also announced changes for Video-Sitemaps. Here, too, optional data are no longer supported:

category
player_loc[@allow_embed]
player_loc[@autoplay]
gallery_loc
price[@all]
tvshow[@all]

For publishers (with a listing in Google News), News Sitemaps are an important topic. All important information about them can be found in the Google Help.

For international SEO, sitemaps are relevant because of the hreflang annotations. Hreflang tells search engines under which URL the same content can be found in a different language or for a different country. Think of German-language content with Germany and Switzerland as target markets. Everything you need to know about hreflang sitemaps can be found at Google.

hreflang information can be transmitted to search engines via XML Sitemaps. Source: Google Search Central

You do not have to create individual sitemaps per “type”, because the information can be combined. In other words, it is possible to combine image information and hreflang information in a single sitemap.
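As an illustration of such a combined sitemap, the following Python sketch writes hreflang and image information into the same <url> entries. It uses the xhtml and image namespaces documented by Google; the helper add_url, the example.com addresses and the language versions are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace declarations: sitemap protocol, hreflang (xhtml) and image extension
urlset = ET.Element("urlset", {
    "xmlns": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xmlns:xhtml": "http://www.w3.org/1999/xhtml",
    "xmlns:image": "http://www.google.com/schemas/sitemap-image/1.1",
})


def add_url(parent, loc, alternates, images=()):
    """Append one <url> entry with hreflang links and image locations."""
    url = ET.SubElement(parent, "url")
    ET.SubElement(url, "loc").text = loc
    for hreflang, href in alternates.items():
        ET.SubElement(url, "xhtml:link", rel="alternate", hreflang=hreflang, href=href)
    for image_url in images:
        image = ET.SubElement(url, "image:image")
        ET.SubElement(image, "image:loc").text = image_url
    return url


# Hypothetical German-language page for Germany and Switzerland
alternates = {
    "de-DE": "https://www.example.com/de/",
    "de-CH": "https://www.example.com/ch/",
}
for loc in alternates.values():
    # Every language version lists all alternates, including itself
    add_url(urlset, loc, alternates, images=["https://www.example.com/img/teaser.jpg"])

combined_xml = ET.tostring(urlset, encoding="unicode")
print(combined_xml)
```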

Bundle Sitemaps with Sitemap Index File(s)

Due to the limit of 50,000 entries per sitemap, large sites have no choice but to create multiple sitemaps. Typically, the sitemaps are then segmented by page type. audible.de, for example, has decided to set up separate sitemaps per page type, as visible in the robots.txt.

As the robots.txt file from audible.de shows, the audio book provider segments its XML sitemaps according to page type. Source: audible.de

Looking at the screenshot, something else becomes apparent: These sitemaps are called sitemap_index.xml. A sitemap index file bundles several sitemaps. This has the advantage that not every individual sitemap has to be submitted to search engines. However, we would still recommend submitting the individual sitemaps as well, especially within the webmaster tools of the important search engines. More on that in a moment.

What a sitemap index file looks like is demonstrated by the Google Help on the subject.

You can find more information about sitemap index files among other things in the Google Documentation for technical SEO topics. Source: Google Search Central
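How a large URL list could be split into several sitemaps and bundled in an index file is sketched below in Python. The elements <sitemapindex>, <sitemap> and <loc> come from the sitemap protocol; the function names, the file naming scheme and the example.com addresses are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # limit per individual sitemap


def split_into_sitemaps(urls, size=MAX_URLS):
    """Split a URL list into chunks that each fit into one sitemap."""
    return [urls[start:start + size] for start in range(0, len(urls), size)]


def build_sitemap_index(sitemap_locations):
    """Return a sitemap index file (bytes) that references the given sitemaps."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for location in sitemap_locations:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = location
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


# Hypothetical shop with 120,001 product pages
all_urls = [f"https://www.example.com/product/{n}/" for n in range(120_001)]
chunks = split_into_sitemaps(all_urls)
locations = [
    f"https://www.example.com/sitemap-products-{i}.xml"
    for i in range(1, len(chunks) + 1)
]
index_xml = build_sitemap_index(locations)
print(len(chunks), "sitemaps referenced in the index")
```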

Where should the Sitemap be stored? And what should its name be?

The storage location of an XML-Sitemap has a (theoretical) pitfall that even longtime SEO professionals do not know: the scope of a Sitemap. For this we look at the specifications on sitemaps.org:

"The storage location of the Sitemap file determines which URLs can be included in this Sitemap. A Sitemap file at the location http://example.com/catalog/sitemap.xml can contain URLs starting with http://example.com/catalog/, but not URLs starting with http://example.com/images/."

What does that mean? If the XML Sitemap is stored in a subdirectory and not in the so-called root, then this Sitemap only applies to addresses that are in the corresponding subdirectory.
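The scope rule from sitemaps.org can be expressed as a small check. This Python sketch assumes that "scope" means: same protocol and host, and a path that starts with the directory in which the sitemap file is stored. The function name in_scope is made up for this example.

```python
from urllib.parse import urlsplit


def in_scope(sitemap_url, page_url):
    """Check whether page_url may be listed in the sitemap stored at sitemap_url."""
    sitemap, page = urlsplit(sitemap_url), urlsplit(page_url)
    if (sitemap.scheme, sitemap.netloc) != (page.scheme, page.netloc):
        return False
    # Directory of the sitemap file, e.g. /catalog/ for /catalog/sitemap.xml
    directory = sitemap.path.rsplit("/", 1)[0] + "/"
    return page.path.startswith(directory)


catalog_sitemap = "http://example.com/catalog/sitemap.xml"
print(in_scope(catalog_sitemap, "http://example.com/catalog/show?item=23"))  # True
print(in_scope(catalog_sitemap, "http://example.com/images/logo.png"))       # False
print(in_scope("http://example.com/sitemap.xml", "http://example.com/images/logo.png"))  # True
```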

However, there is one exception, at least in the case of Google: the submission of the Sitemap via the Google Search Console. The Google Help says the following about this:

"You can host your sitemaps on your website, but if you do not submit them via the Search Console, they only affect successor elements of the parent directory. Therefore, a sitemap published in the root directory of the website affects all files of the website. We recommend you to publish your sitemaps there."

In short: Google recommends that the Sitemap be stored directly in the root directory. This makes it accessible under the address https://www.example.com/sitemap.xml. If you submit the Sitemap through the Google Search Console, the Sitemap can also be stored in any subdirectory.

And not only that: The Sitemap can also be stored under a completely different domain. That means the XML Sitemap of omr.com could be located on stephan-czysch.de. But in order for Google to trust this sitemap, a Sitemap: reference to my private website must be set in the robots.txt of omr.com, or this Sitemap must be submitted via the Google Search Console. The idea: Since I cannot alter the robots.txt of omr.com, the owner of omr.com must have set the reference to the Sitemap, and the Sitemap is therefore trustworthy. As usual, you can find out what exactly needs to be considered at Google.

By the way, you are free to choose the name of the Sitemap. Nobody forces you to use e.g. “sitemap.xml” as the filename.

How can a sitemap be created and which tools can help?

Okay, now let's cut to the chase! XML-Sitemaps are still relevant for search engine optimization, but how can XML-Sitemaps be created? Most websites are built with content management systems or website builders, and these usually already have a function for automatically creating XML-Sitemaps. If in doubt, just search for the name of your CMS together with “Sitemap”. If no Sitemap function is integrated in the CMS, an XML-Sitemap (or several) can often be created with free extensions.

Alternatively, online tools, so-called Sitemap Generators, will help you to generate an XML-Sitemap from a list of URLs. But this has a disadvantage, since a file generated in this way is static. This means that newly created addresses are not automatically part of the Sitemap, which defeats the actual purpose of a Sitemap. As a reminder: Its purpose is to ensure that search engines learn about the existence of a page. If in doubt, however, a static Sitemap is better than having no Sitemap at all.

By the way: In many website crawling tools like the Screaming Frog SEO Spider the function to create Sitemaps is integrated.

Make Sitemaps known to search engines via the robots.txt, Sitemap ping or the Webmaster Tools

The address of a Sitemap can have any name and location. Accordingly, it is necessary that search engines know the address of the Sitemap. Because you know: Search engines only access known addresses.

To make your Sitemap(s) known to search engines, you can use the robots.txt, a ping interface or the Webmaster Tools. In the case of Google, the Webmaster Tools are now called Google Search Console. But let's take these one at a time.

The easiest way to make the address of the Sitemap known to all search engines is to reference the Sitemap in the so-called robots.txt file. With the robots.txt, crawling can be restricted for individual areas or addresses of a website. It works on an opt-out basis: All addresses that were not explicitly excluded from crawling by a Disallow: can be crawled by search engines.

To let search engines know the address of your XML-Sitemap, you add the line “Sitemap: Address-of-sitemap” to the robots.txt. However, this has a small disadvantage: Since the robots.txt can be opened by everyone, your competitors can also find your Sitemap and with it all the addresses listed there. If you do not want that, then submitting the Sitemaps via the Webmaster Tools is the better option.
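What such a reference looks like, and how a program reads it, can be shown with Python's standard library. The robots.txt content below is a made-up example; RobotFileParser and its site_maps() method (available since Python 3.8) are part of urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with one Disallow rule and one Sitemap reference
robots_txt = """\
User-agent: *
Disallow: /internal/

Sitemap: https://www.example.com/sitemap_index.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

sitemaps = parser.site_maps()  # list of Sitemap: addresses, or None if there are none
print(sitemaps)
```

This is also why the robots.txt reference is visible to everyone: any client that can fetch the file can read the Sitemap address in the same way.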

Submitting the Sitemap addresses in the Webmaster Tools of the search engines has several advantages. On the one hand, the respective search engine definitely knows the address of the Sitemap(s) and can access it. On the other hand, in the case of Google Search Console you get detailed data about the indexing status of the respective Sitemaps. So you can see how many of the submitted pages were indexed by Google.

If you submit your Sitemap(s) to the Google Search Console, then you can analyze which of the addresses submitted via the Sitemap were actually indexed. To do this, click on the filter in the upper left corner and select “All submitted pages” or the individual Sitemaps. Source: Google Search Console

For this reason you should definitely submit your Sitemaps in the Webmaster Tools. Whether you also reference your Sitemaps in the robots.txt is up to you. Submitting the Sitemap via the Webmaster Tools is enough for the respective search engine. Within the Google Search Console you will find the input form under Indexing > Sitemaps.

You can submit your Sitemap to Google via this form in the Google Search Console. Source: Google Search Console

Speaking of which: You only get access to the Google Search Console and to competing products like Bing Webmaster Tools if you verify that you are the owner of the website. Here you find the article on Setting up the Google Search Console.

Another alternative is pinging the official Sitemap interface of Google. A Sitemap can also be submitted via the programming interface of the Google Search Console, the Google Search Console API. More information about it can be found in the Google Help for the GSC API.

How can one ensure that a Sitemap is current and error-free?

By definition, XML-Sitemaps should only contain pages that can be accessed by search engines (and users). After all, why list a page that does not exist at all, that crawlers are not allowed to visit (blocked in the robots.txt with Disallow:), or that search engines are not allowed to display in the search results?
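These three conditions can be applied as a filter before a Sitemap is written. The Python sketch below assumes that you already have a list of page records with the hypothetical keys url, status (HTTP status code) and noindex; how you collect this data depends on your CMS or crawler.

```python
from urllib.robotparser import RobotFileParser


def sitemap_candidates(pages, robots_lines, user_agent="*"):
    """Keep only URLs that exist, may be crawled and may be indexed."""
    robots = RobotFileParser()
    robots.parse(robots_lines)
    return [
        page["url"]
        for page in pages
        if page["status"] == 200                        # page exists
        and not page["noindex"]                         # page may be indexed
        and robots.can_fetch(user_agent, page["url"])   # page may be crawled
    ]


# Hypothetical input data
robots_lines = ["User-agent: *", "Disallow: /internal/"]
pages = [
    {"url": "https://www.example.com/", "status": 200, "noindex": False},
    {"url": "https://www.example.com/old/", "status": 404, "noindex": False},
    {"url": "https://www.example.com/internal/report/", "status": 200, "noindex": False},
    {"url": "https://www.example.com/thank-you/", "status": 200, "noindex": True},
]
keep = sitemap_candidates(pages, robots_lines)
print(keep)
```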

The surest way to an always up-to-date XML-Sitemap is the regular generation of a new one, e.g. with every update of pages or the publication of new articles. For error checking either the Google Search Console or a Crawling Tool can be used.

Crawlers like the Screaming Frog can not only create Sitemaps, but also use them as “source data” for the analysis of a website. Crawlers can also automatically compare the addresses found in the crawl with those in the Sitemap. This way you can detect problems within your Sitemap, for example pages that no longer exist or addresses that are missing from the Sitemap.

XML-Sitemaps can be used as a “starting list” for the analysis of a website in Crawling Tools. Source: Screaming Frog
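The comparison between crawl and Sitemap boils down to two set differences. Here is a minimal Python sketch of that idea; the function names and the sample data are assumptions, and a crawling tool will of course do considerably more than this.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the set of <loc> addresses listed in a sitemap."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)}


def compare(crawled, in_sitemap):
    """Report addresses that appear on only one side."""
    return {
        "missing_from_sitemap": sorted(crawled - in_sitemap),
        "not_found_in_crawl": sorted(in_sitemap - crawled),
    }


# Hypothetical sitemap and crawl result
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/old-page/</loc></url>
</urlset>"""
crawled = {"https://www.example.com/", "https://www.example.com/new-page/"}

report = compare(crawled, sitemap_urls(sample))
print(report)
```

Addresses under not_found_in_crawl are candidates for deleted or unlinked pages, while missing_from_sitemap points to pages the Sitemap does not cover yet.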

In principle, search engines can process your Sitemaps even if they contain errors. But that does not mean that your Sitemaps should be “neglected”.

Does a Website need an XML or HTML-Sitemap?

No, neither an XML-Sitemap nor an HTML-Sitemap is required. But the use of XML-Sitemaps in particular is recommended, since it can improve crawling.

But what about the HTML-Sitemap? Such an overview can have a positive effect on the website structure and the accessibility of the individual pages. In the broadest sense, a listing of all entries of a glossary or all brands of a shop is an HTML-(part-)Sitemap. Therefore, many more websites use an HTML-Sitemap than you might think. An example? Here you can find the OMR Glossary. I regularly work with HTML-Sitemaps in my consulting projects for customers. But they are not mandatory.

Let’s sum it up: Here’s what you need to know about XML-Sitemaps

Time for a conclusion, don’t you think? Here’s a brief summary of what you should know about XML-Sitemaps:

XML-Sitemaps help search engines to crawl your site efficiently and to find new or updated URLs
Most content management systems come with a function for generating Sitemaps. Alternatively, such a function can be retrofitted with extensions. If all else fails, Sitemap Generators come to your rescue
Submit your Sitemaps in the Webmaster Tools that are important to you like the Google Search Console
Mentioning the Sitemap in the robots.txt is optional
Sitemaps should only contain pages that actually exist and can be crawled and indexed by search engines
Bear in mind the limitations, such as the current maximum of 50,000 addresses per Sitemap
Sitemaps should not be larger than 50 MB in order to be processed correctly
You can build a sitemap for sitemaps using a so-called Sitemap index file
XML-Sitemaps for special content types such as images or news are possible. Publishers should also deal with News Sitemaps. For international websites, the hreflang annotations can be provided not only on the individual pages themselves, but also in the Sitemap.
HTML-Sitemaps can be helpful both for users and search engines
Sitemaps can have any name and can be stored outside the actual website. But the “correctness” of such an “external Sitemap” must be confirmed through a reference in the robots.txt or via the Google Search Console.

Now you know everything that is important about the topic XML-Sitemaps!

Recommended SEO Tools

You can find more recommended SEO tools on OMR Reviews and compare them. In total, we have listed over 150 SEO tools (as of December 2023) that can help you increase your organic traffic in the long term. So take a look and compare the software with the help of the verified user reviews:

Author

Stephan Czysch

Stephan Czysch (pronounced: Zisch) helps companies build better websites and thereby achieve more SEO success. As an external consultant, he builds in-house SEO teams or trains teams on a variety of topics. He is the author of several books, a regular speaker at conferences and gives public training courses at 121Watt and OMT. He is also the founder of searchanalyzer.io and getindexed.io. You can find out more about him at stephan-czysch.de.