Best Practices for XML Sitemaps

A sitemap is a data feed for a crawler. If you treat it as a passive file, you miss the point. It is a tool for managing crawl budget and signaling content freshness at scale.

Most SEOs let a plugin generate a file and never look at it again. That is a mistake. A poorly managed sitemap sends mixed signals to Googlebot, wasting its limited time on your site.

A Strict 200-OK Policy

Every URL in your XML must be a clean, canonical, indexable page. Including redirects or URLs with bad status codes (like 404s) is the fastest way to get your sitemap ignored.

When Googlebot requests a URL from your sitemap and gets a 301 redirect, it has to make a second request to reach the actual page.

Do that enough times, and the bot decides your sitemap is an unreliable data source.

Filter out the following:

  • No 301/302 redirects – Only the final destination belongs here.
  • No 404/410 errors – If a page is gone, remove the entry immediately.
  • No Non-Canonical URLs – If /page/ is your master version, do not list /page/?source=email.
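
The filter above can be sketched as a single predicate applied to crawl data before the sitemap is written. This is a minimal sketch, not a finished tool: the `PageRecord` class and its field names are hypothetical, and the example.com URLs are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageRecord:
    """One crawled URL. Field names are hypothetical, not tied to any crawler."""
    url: str
    status: int                      # HTTP status code returned for `url`
    canonical: Optional[str] = None  # value of rel="canonical", if the page has one
    noindex: bool = False            # True if the page carries a noindex directive


def sitemap_eligible(page: PageRecord) -> bool:
    """Return True only for 200-OK, self-canonical, indexable pages."""
    if page.status != 200:
        return False  # drops 301/302 redirects and 404/410 errors
    if page.canonical is not None and page.canonical != page.url:
        return False  # drops non-canonical variants such as ?source=email
    return not page.noindex


pages = [
    PageRecord("https://example.com/page/", 200, "https://example.com/page/"),
    PageRecord("https://example.com/page/?source=email", 200, "https://example.com/page/"),
    PageRecord("https://example.com/old/", 301),
    PageRecord("https://example.com/gone/", 410),
]
eligible = [p.url for p in pages if sitemap_eligible(p)]
print(eligible)  # ['https://example.com/page/']
```

Running the check at generation time means a bad URL never reaches the file, rather than being cleaned up after Googlebot has already requested it.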

Why Use an Index File?

Large sites should never use a single, massive XML file. Even if you stay under the 50,000 URL limit, a single file makes troubleshooting nearly impossible.

Use a Sitemap Index. This master file points to child sitemaps categorized by content type. This structure allows you to isolate indexing issues in Google Search Console.

Sitemap Type | Use Case
Product Sitemap | High-churn pages that update often
Blog Sitemap | Evergreen content that updates occasionally
Image/Video Sitemap | Media assets for rich results
Hreflang Sitemap | Managing international versions

If your “Product” sitemap shows 5,000 submitted URLs but only 500 indexed, you know the issue is specific to your product pages, not your entire site.
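
For illustration, an index file can be generated with nothing but Python's standard library. Treat this as a rough sketch: the function name `build_sitemap_index` and the child sitemap URLs are assumptions, so swap in whatever naming your own site uses.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(child_urls):
    """Return a sitemap index document (as a string) pointing to each child sitemap."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for child_url in child_urls:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = child_url
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical child sitemaps, one per content type.
index_xml = build_sitemap_index([
    "https://example.com/sitemap-products.xml",
    "https://example.com/sitemap-blog.xml",
])
print(index_xml)
```

Each child file then shows up as its own row in Search Console, which is what makes the per-type comparison possible.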

The Truth About Metadata Tags

Google and Bing have both explicitly stated they ignore <priority> and <changefreq>. These tags are relics. They clutter your file and provide zero ranking benefit.

The only metadata tag with real value is <lastmod>. However, it only works if it is honest. If you update the timestamp for every page every time the sitemap regenerates (even when content hasn’t changed), search engines will stop trusting your timestamps.

Rule of thumb: Only update <lastmod> when the main body content, structured data, or internal links on that page have significantly changed.
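
One way to keep timestamps honest is to store a fingerprint of the content that matters and move <lastmod> only when that fingerprint changes. The sketch below assumes you can extract the main body, structured data, and internal links into a single string. The function names are hypothetical, and a hash flags any change, so deciding what counts as "significant" is still up to what you feed in.

```python
import hashlib
from datetime import datetime, timezone


def content_fingerprint(main_content: str) -> str:
    """Hash only the content that matters (body text, structured data, internal links)."""
    return hashlib.sha256(main_content.encode("utf-8")).hexdigest()


def next_lastmod(stored_hash, stored_lastmod, main_content, now=None):
    """Return (hash, lastmod). The timestamp moves only when the fingerprint changes."""
    new_hash = content_fingerprint(main_content)
    if new_hash == stored_hash:
        return stored_hash, stored_lastmod  # unchanged content keeps its old timestamp
    now = now or datetime.now(timezone.utc)
    return new_hash, now.isoformat(timespec="seconds")


# First publish, then a regeneration with no content change.
published = datetime(2025, 12, 22, 13, 0, 0, tzinfo=timezone.utc)
h1, lastmod1 = next_lastmod(None, None, "body v1", now=published)
h2, lastmod2 = next_lastmod(h1, lastmod1, "body v1")
print(lastmod1)              # 2025-12-22T13:00:00+00:00
print(lastmod2 == lastmod1)  # True
```

Sidebar widgets, footers, and ad slots should stay out of the fingerprint, otherwise a template tweak bumps every page on the site.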

Technical Constraints

A study published in Women in Tech SEO suggests that capping each file at 30,000 URLs, well under the 50,000 maximum, can significantly accelerate Google’s processing times for large-scale websites.

This strategic fragmentation effectively “motivates” the bot to discover millions of pages within days rather than allowing files to sit unprocessed for weeks.

Requirement | Specification
Maximum URLs | 50,000 per sitemap
Maximum File Size | 50MB (uncompressed)
Encoding | UTF-8
Location | Root directory (e.g., /sitemap.xml)
Compression | Use Gzip to save crawl bandwidth
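
Here is a rough sketch of how those constraints could be enforced at generation time: split the URL list into chunks, render each chunk as UTF-8 XML, check the uncompressed size, then gzip it. The helper names `chunk` and `render_gzipped` are made up for this example, and the chunk size is a parameter so you can try 30,000 as easily as 50,000.

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000              # protocol limit per sitemap file
MAX_BYTES = 50 * 1024 * 1024   # 50MB limit, measured uncompressed


def chunk(urls, size=MAX_URLS):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def render_gzipped(urls):
    """Render one <urlset> as UTF-8 XML and return it gzip-compressed."""
    if len(urls) > MAX_URLS:
        raise ValueError("too many URLs for one sitemap file")
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = url
    raw = ET.tostring(root, encoding="utf-8", xml_declaration=True)
    if len(raw) > MAX_BYTES:
        raise ValueError("uncompressed sitemap exceeds 50MB")
    return gzip.compress(raw)


# Placeholder URLs; a tiny chunk size keeps the demo readable.
all_urls = [f"https://example.com/p/{i}/" for i in range(70)]
files = [render_gzipped(part) for part in chunk(all_urls, size=30)]
print(len(files))  # 3
```

The size check runs before compression because the 50MB limit applies to the uncompressed file.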

Discovery and Submission

Don’t wait for Google to find your sitemap.

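  • Robots.txt – Add a Sitemap: line with the full sitemap URL to your robots.txt file. The directive is independent of any User-agent group, so its position does not matter; the bottom of the file is simply a common convention. This ensures every bot that visits knows where the map is.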
  • Manual Submission – Submit the Sitemap Index URL directly in Google Search Console and Bing Webmaster Tools.
  • API Submission – For massive sites (news, large e-commerce), use the Search Console API to programmatically notify Google of sitemap updates.
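
The robots.txt step is easy to verify with Python's built-in `urllib.robotparser`, which exposes any Sitemap lines it finds through `site_maps()` (Python 3.8+). The robots.txt content below is a made-up sample; in practice you would read the live file from your own domain. The API submission step is not shown here because it depends on your Search Console credentials and client setup.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (hypothetical site).
robots_txt = """\
User-agent: *
Disallow: /cart/

Sitemap: https://example.com/sitemap_index.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns None when the file has no Sitemap line.
declared = parser.site_maps() or []
print(declared)  # ['https://example.com/sitemap_index.xml']
```

If the list comes back empty, bots that rely on robots.txt for discovery have no pointer to your sitemap.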

A Clean Entry

To see how these rules look in production, here is an example snippet of a high-performance entry.

We are stripping out the useless <priority> and <changefreq> tags to keep the file size down and the signal high.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>https://seo-automata.com/python-status-checker/</loc>
      <lastmod>2025-12-22T13:00:00+00:00</lastmod>
   </url>
</urlset>
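
If a plugin has already written <priority> and <changefreq> into your file, one possible cleanup is a small post-processing pass. This is a sketch using the standard library only; `strip_ignored_tags` is a hypothetical helper name, and the input is assumed to be a regular <urlset> document.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", NS)  # serialize with a default namespace, not an ns0: prefix


def strip_ignored_tags(sitemap_xml: str) -> str:
    """Remove <priority> and <changefreq> from every <url> entry."""
    root = ET.fromstring(sitemap_xml)
    for url in root.findall(f"{{{NS}}}url"):
        for tag in ("priority", "changefreq"):
            for node in url.findall(f"{{{NS}}}{tag}"):
                url.remove(node)
    return ET.tostring(root, encoding="unicode")


# Placeholder entry as a plugin might emit it.
bloated = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/a/</loc><lastmod>2025-12-22</lastmod>"
    "<changefreq>daily</changefreq><priority>0.8</priority></url>"
    "</urlset>"
)
clean = strip_ignored_tags(bloated)
print(clean)
```

Only <loc> and <lastmod> survive, which matches the entry shown above.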

The Validation Audit

Before you consider a sitemap “done,” you need to verify that the map matches the reality of your server.

  1. Status Check – Run your sitemap URLs through a bulk status checker. If you see anything other than a 200 OK, your sitemap is broken.
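  2. Check for Orphan Pages – Compare your sitemap list against a full site crawl. URLs that appear in the sitemap but that the crawl never reaches are orphan pages: nothing on your site links to them. Indexable pages that the crawl finds but the sitemap omits are gaps in the map. Google might find them, but you aren’t making it easy.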
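  3. GSC Coverage Report – Check the “Sitemaps” section in Search Console. Look for the “Discovered – currently not indexed” and “Crawled – currently not indexed” statuses. The first means Google knows the URL but has not crawled it yet; the second means Google crawled the page and chose not to index it. When either persists, it usually means your sitemap is technically fine, but the content quality on those pages is too low for Google to care.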
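
The first two checks boil down to set comparisons once you have the data. The sketch below assumes three inputs you collect yourself: the URLs in the sitemap, the URLs a site crawl reached by following internal links, and a URL-to-status mapping from your bulk status checker. The `audit` function and its report keys are hypothetical names.

```python
def audit(sitemap_urls, crawled_urls, statuses):
    """Compare a sitemap against crawl data and status codes.

    sitemap_urls: URLs listed in the sitemap
    crawled_urls: URLs reached by following internal links in a site crawl
    statuses:     mapping of URL -> HTTP status from a bulk status checker
    """
    sitemap_set, crawl_set = set(sitemap_urls), set(crawled_urls)
    return {
        # Anything that is not a 200 (or was never checked) breaks the 200-OK policy.
        "not_200": sorted(u for u in sitemap_set if statuses.get(u) != 200),
        # Pages the crawl found that the sitemap does not list.
        "missing_from_sitemap": sorted(crawl_set - sitemap_set),
        # Sitemap entries that no internal link leads to.
        "not_reached_by_crawl": sorted(sitemap_set - crawl_set),
    }


# Placeholder data for illustration.
report = audit(
    sitemap_urls=["https://example.com/a/", "https://example.com/b/"],
    crawled_urls=["https://example.com/a/", "https://example.com/c/"],
    statuses={"https://example.com/a/": 200, "https://example.com/b/": 404},
)
print(report)
```

An empty list under every key is the goal; anything else is a concrete to-do list.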

Conclusion

A sitemap is a communication channel, not a backup file. If you populate it with junk, the search engine will stop listening.

By keeping the file restricted to 200-OK canonical URLs and using an Index File for better tracking, you turn a basic XML file into a precise tool for crawl management.

Prune the tags, segment the files, and keep your timestamps honest.

