Home » Automations » Script: XML Sitemap Generator

Script: XML Sitemap Generator

Published on

Updated on

A featured image for the XML sitemap generator script guide.

We are going to write a Python script that doesn’t care if you have 100 pages or 10 000. It uses Generators (the yield keyword) to fetch data lazily, ensuring your server’s RAM flatlines while the disk does the heavy lifting.

This is a skeleton for a production-grade cartographer.

Architecture and Dependencies

We need three distinct components:

  • Fetcher – A generator that pulls rows from your database one at a time.
  • Writer – A function that handles the XML formatting and file I/O.
  • Controller – The logic that manages file rotation (sharding) when we hit the 50,000 URL limit.

We are keeping this vanilla. No pandas, no heavy frameworks. Just standard Python libraries.

import os
import gzip
from datetime import datetime, timezone

# Config
LIMIT_PER_FILE = 50000
OUTPUT_DIR = 'sitemaps/'
BASE_URL = 'https://example.com'
USE_GZIP = False  # Toggle for compressed output

Mock Database (Fetcher)

In a real scenario, this connects to Postgres or MySQL. Here, we simulate a massive database using a generator.

Note the yield. This is the magic. It hands over one record and pauses execution, preventing memory bloat.

def fetch_urls():
    """
    Simulates a database cursor. 
    In reality, use: cursor.execute("SELECT url, updated_at FROM pages")
    """
    # Let's pretend we have 120,000 pages
    total_records = 120000
    
    for i in range(total_records):
        # Logic to skip 'noindex' pages happens in the SQL query
        priority = "1.0" if i == 0 else "0.8"  # Homepage
        
        yield {
            'loc': f"{BASE_URL}/page-{i}",
            'lastmod': datetime.now(timezone.utc).strftime('%Y-%m-%d')
        }

Writer & Sharding Logic

This is where we enforce the “Stream, Don’t String” rule.

We open a file, write the header, loop through the generator, and if we hit the limit, we close the file and open a new one.

def generate_sitemaps():
    if not os.path.exists(OUTPUT_DIR):
        os.makedirs(OUTPUT_DIR)

    file_count = 1
    url_count = 0
    current_file = None
    generated_files = []

    # The Header and Footer for standard sitemaps
    xml_header = '<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    xml_footer = '</urlset>\n'

    def open_new_file(file_num):
        """Helper to open a new sitemap file (with optional gzip)."""
        ext = '.xml.gz' if USE_GZIP else '.xml'
        filename = os.path.join(OUTPUT_DIR, f'sitemap-{file_num}{ext}')
        
        if USE_GZIP:
            f = gzip.open(filename, 'wt', encoding='utf-8')
        else:
            f = open(filename, 'w', encoding='utf-8')
        
        f.write(xml_header)
        print(f"[*] Writing to {filename}...")
        return f, filename

    # Initialize the first file
    current_file, current_filename = open_new_file(file_count)
    generated_files.append(os.path.basename(current_filename))
    
    try:
        # Iterate through the stream
        for row in fetch_urls():
            # Check if we need to rotate files
            if url_count >= LIMIT_PER_FILE:
                # Close current file
                current_file.write(xml_footer)
                current_file.close()
                print(f"[+] Finished {os.path.basename(current_filename)}. URLs: {url_count}")
                
                # Reset and rotate
                file_count += 1
                url_count = 0
                current_file, current_filename = open_new_file(file_count)
                generated_files.append(os.path.basename(current_filename))

            # Write the URL entry
            entry = (
                f"  <url>\n"
                f"    <loc>{row['loc']}</loc>\n"
                f"    <lastmod>{row['lastmod']}</lastmod>\n"
                f"  </url>\n"
            )
            current_file.write(entry)
            url_count += 1

    finally:
        # Clean up the final file
        if current_file and not current_file.closed:
            current_file.write(xml_footer)
            current_file.close()
            print(f"[+] Finished {os.path.basename(current_filename)}. URLs: {url_count}")

    return generated_files

Index Generator (Binder)

Since we generated multiple files (because we respected the 50k limit), we need an xml Sitemap Index to rule them all.

def generate_index(sitemap_files):
    index_filename = os.path.join(OUTPUT_DIR, 'sitemap-index.xml')
    
    with open(index_filename, 'w', encoding='utf-8') as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        
        for filename in sitemap_files:
            # Construct proper URL (handle trailing slashes)
            base = BASE_URL.rstrip('/')
            sitemap_url = f"{base}/{OUTPUT_DIR}{filename}"
            
            f.write('  <sitemap>\n')
            f.write(f'    <loc>{sitemap_url}</loc>\n')
            # In a real script, check the file modification time for this date
            f.write(f'    <lastmod>{datetime.now(timezone.utc).strftime("%Y-%m-%d")}</lastmod>\n')
            f.write('  </sitemap>\n')
            
        f.write('</sitemapindex>\n')
    
    print(f"[!] Sitemap Index generated at {index_filename}")
    print(f"[!] Total sitemaps created: {len(sitemap_files)}")

# Execution
if __name__ == "__main__":
    try:
        files = generate_sitemaps()
        generate_index(files)
        print(f"\n✓ Submit this to Google: {BASE_URL}/{OUTPUT_DIR}sitemap-index.xml")
    except Exception as e:
        print(f"[ERROR] Generation failed: {e}")
        raise

Why This Script?

Memory Safety

Look at the fetch_urls function. It yields one dict at a time.

The generate_sitemaps loop consumes that dict, writes it to disk, and discards it. You could run this script against a database with 10 million rows on a cheap $5/month VPS with 512MB of RAM, and it won’t crash. It is O(1) in terms of memory complexity.

“Sitemap Index” Automation

Most custom scripts forget this. They generate one giant file that hits 51,000 lines, and Google Search Console throws a “Parsing Error.”

This script automatically handles the overflow. It shards the data into sitemap-1.xml, sitemap-2.xml, etc., and creates the parent index automatically.

Production-Grade Error Handling

Notice the try/finally block. Even if your database connection drops mid-stream, the script ensures the current file is properly closed with a valid XML footer.

No corrupted sitemaps. No debugging why Google rejected your submission.

A diagram showcasing the database XML sitemap generator script logic.

Expansion Packs

Because this is Python, you can get fancy:

  • Gzip Compression – Already implemented. Just flip USE_GZIP = True to output .xml.gz files directly, saving bandwidth. Search engines handle compressed sitemaps natively.
  • Image Sitemaps – Add <image:image> tags inside the loop if your DB has image references. The streaming architecture doesn’t care about payload complexity.
  • Hreflang – If you have a translation table, inject the xhtml:link attributes right there in the loop. Multi-language support without plugins.
  • Database Integration – Replace the mock generator with real database queries:
def fetch_urls():
    cursor.execute("""
        SELECT url, updated_at, priority 
        FROM pages 
        WHERE noindex = FALSE 
        ORDER BY priority DESC, updated_at DESC
    """)
    for row in cursor:
        yield {
            'loc': f"{BASE_URL}{row['url']}",
            'lastmod': row['updated_at'].strftime('%Y-%m-%d'),
        }

Full Mock Example

This is the mock / simulation version. It doesn’t connect to any real database.

Look at the fetch_urls() function – it just generates fake URLs in a loop:

for i in range(total_records):
    yield {
        'loc': f"{BASE_URL}/page-{i}",
        'lastmod': datetime.now(timezone.utc).strftime('%Y-%m-%d')
    }

The script:

#!/usr/bin/env python3
"""
The Python Cartographer: A Stream-Based Sitemap Generator
Generates XML sitemaps with automatic sharding for large sites.
Memory-efficient generator-based approach for millions of URLs.
"""

import os
import gzip
from datetime import datetime, timezone

# ==================== CONFIG ====================
LIMIT_PER_FILE = 50000
OUTPUT_DIR = 'sitemaps/'
BASE_URL = 'https://example.com'
USE_GZIP = False  # Toggle for compressed output


# ==================== MOCK DATABASE (FETCHER) ====================
def fetch_urls():
    """
    Simulates a database cursor using a generator.
    In production, replace with:
        cursor.execute("SELECT url, updated_at FROM pages WHERE noindex = FALSE")
        for row in cursor:
            yield { ... }
    """
    # Let's pretend we have 120,000 pages
    total_records = 120000
    
    for i in range(total_records):
        # Logic to skip 'noindex' pages happens in the SQL query
        priority = "1.0" if i == 0 else "0.8"  # Homepage
        
        yield {
            'loc': f"{BASE_URL}/page-{i}",
            'lastmod': datetime.now(timezone.utc).strftime('%Y-%m-%d')
        }


# ==================== WRITER & SHARDING ====================
def generate_sitemaps():
    """
    Streams URLs from generator and writes to sharded XML files.
    Automatically rotates files at 50k URLs per file.
    Returns list of generated filenames.
    """
    if not os.path.exists(OUTPUT_DIR):
        os.makedirs(OUTPUT_DIR)

    file_count = 1
    url_count = 0
    current_file = None
    generated_files = []

    # The Header and Footer for standard sitemaps
    xml_header = '<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    xml_footer = '</urlset>\n'

    def open_new_file(file_num):
        """Helper to open a new sitemap file (with optional gzip)."""
        ext = '.xml.gz' if USE_GZIP else '.xml'
        filename = os.path.join(OUTPUT_DIR, f'sitemap-{file_num}{ext}')
        
        if USE_GZIP:
            f = gzip.open(filename, 'wt', encoding='utf-8')
        else:
            f = open(filename, 'w', encoding='utf-8')
        
        f.write(xml_header)
        print(f"[*] Writing to {filename}...")
        return f, filename

    # Initialize the first file
    current_file, current_filename = open_new_file(file_count)
    generated_files.append(os.path.basename(current_filename))
    
    try:
        # Iterate through the stream
        for row in fetch_urls():
            # Check if we need to rotate files
            if url_count >= LIMIT_PER_FILE:
                # Close current file
                current_file.write(xml_footer)
                current_file.close()
                print(f"[+] Finished {os.path.basename(current_filename)}. URLs: {url_count}")
                
                # Reset and rotate
                file_count += 1
                url_count = 0
                current_file, current_filename = open_new_file(file_count)
                generated_files.append(os.path.basename(current_filename))

            # Write the URL entry
            entry = (
                f"  <url>\n"
                f"    <loc>{row['loc']}</loc>\n"
                f"    <lastmod>{row['lastmod']}</lastmod>\n"
                f"  </url>\n"
            )
            current_file.write(entry)
            url_count += 1

    finally:
        # Clean up the final file
        if current_file and not current_file.closed:
            current_file.write(xml_footer)
            current_file.close()
            print(f"[+] Finished {os.path.basename(current_filename)}. URLs: {url_count}")

    return generated_files


# ==================== INDEX GENERATOR (BINDER) ====================
def generate_index(sitemap_files):
    """
    Creates a sitemap index file that references all sharded sitemaps.
    This is the file you submit to Google Search Console.
    """
    index_filename = os.path.join(OUTPUT_DIR, 'sitemap-index.xml')
    
    with open(index_filename, 'w', encoding='utf-8') as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        
        for filename in sitemap_files:
            # Construct proper URL (handle trailing slashes)
            base = BASE_URL.rstrip('/')
            sitemap_url = f"{base}/{OUTPUT_DIR}{filename}"
            
            f.write('  <sitemap>\n')
            f.write(f'    <loc>{sitemap_url}</loc>\n')
            # In a real script, check the file modification time for this date
            f.write(f'    <lastmod>{datetime.now(timezone.utc).strftime("%Y-%m-%d")}</lastmod>\n')
            f.write('  </sitemap>\n')
            
        f.write('</sitemapindex>\n')
    
    print(f"[!] Sitemap Index generated at {index_filename}")
    print(f"[!] Total sitemaps created: {len(sitemap_files)}")


# ==================== EXECUTION ====================
if __name__ == "__main__":
    print("=" * 60)
    print("Python Cartographer - Sitemap Generator")
    print("=" * 60)
    
    try:
        files = generate_sitemaps()
        generate_index(files)
        
        print("\n" + "=" * 60)
        print("✓ Generation complete!")
        print(f"✓ Submit this to Google: {BASE_URL}/{OUTPUT_DIR}sitemap-index.xml")
        print("=" * 60)
        
    except Exception as e:
        print(f"\n[ERROR] Sitemap generation failed: {e}")
        raise

Converting Mock to Database Version

To connect to a real database, you need to make two changes:

1. Add Database Dependencies

At the top of the script, add the database connector:

import mysql.connector  # For MySQL/MariaDB
# OR
import psycopg2  # For PostgreSQL

Install it: pip install mysql-connector-python

2. Replace the fetch_urls() Function with:

def fetch_urls():
    # Connect to database
    conn = mysql.connector.connect(
        host='localhost',
        user='your_db_user',
        password='your_db_pass',
        database='your_db_name'
    )
    cursor = conn.cursor(dictionary=True, buffered=False)
    
    # Query your pages table
    cursor.execute("""
        SELECT url, updated_at 
        FROM pages 
        WHERE noindex = FALSE
        ORDER BY updated_at DESC
    """)
    
    # Stream results one row at a time
    for row in cursor:
        yield {
            'loc': f"{BASE_URL}{row['url']}",
            'lastmod': row['updated_at'].strftime('%Y-%m-%d')
        }
    
    cursor.close()
    conn.close()

That’s it. The rest of the script (writer, sharding, index generator) stays identical. The generator pattern works the same whether you’re yielding fake data or real database rows.

The magic is that cursor.execute() with buffered=False returns a cursor that also yields one row at a time, maintaining the same memory-efficient streaming behavior.

REST API Example

This version uses WordPress’s REST API instead of direct database access.

Why? Because most people don’t have MySQL credentials.

#!/usr/bin/env python3
"""
The Python Cartographer: WordPress REST API Edition
Generates XML sitemaps from WordPress via REST API.
Works with WordPress.com and any managed WordPress hosting.
No database access required.
"""

import os
import gzip
import requests
from datetime import datetime, timezone
from urllib.parse import urljoin, urlparse
import time

# ==================== CONFIG ====================
LIMIT_PER_FILE = 50000
OUTPUT_DIR = 'sitemaps/'
USE_GZIP = False

# WordPress Site Config
SITE_URL = 'https://example.com'  # Your WordPress site URL (no trailing slash)

# Post types to include
INCLUDE_POST_TYPES = ['posts', 'pages']  # API endpoints

# API Settings
PER_PAGE = 100  # Max items per API request (WordPress limit is 100)
REQUEST_DELAY = 0.5  # Seconds between requests (be nice to the server)


# ==================== API FETCHER ====================
def fetch_from_api(endpoint, params=None):
    """
    Fetches data from WordPress REST API with pagination.
    Handles rate limiting and errors gracefully.
    """
    url = f"{SITE_URL.rstrip('/')}/wp-json/wp/v2/{endpoint}"
    all_items = []
    page = 1
    
    if params is None:
        params = {}
    
    params['per_page'] = PER_PAGE
    params['_embed'] = 'true'  # Get additional data
    
    print(f"[*] Fetching {endpoint}...")
    
    while True:
        params['page'] = page
        
        try:
            response = requests.get(url, params=params, timeout=30)
            
            if response.status_code == 400 and page > 1:
                # No more pages
                break
            
            response.raise_for_status()
            
            items = response.json()
            
            if not items:
                break
            
            all_items.extend(items)
            print(f"    Page {page}: {len(items)} items (Total: {len(all_items)})")
            
            # Check if there are more pages
            total_pages = int(response.headers.get('X-WP-TotalPages', page))
            if page >= total_pages:
                break
            
            page += 1
            time.sleep(REQUEST_DELAY)  # Be nice to the server
            
        except requests.exceptions.RequestException as e:
            print(f"[!] API request failed on page {page}: {e}")
            if page == 1:
                raise  # Critical error on first page
            break  # Partial success, continue with what we have
    
    print(f"[✓] Fetched {len(all_items)} {endpoint}")
    return all_items


def fetch_urls():
    """
    Generator that yields URL data from WordPress REST API.
    Fetches posts and pages, formats them for sitemap.
    """
    for post_type in INCLUDE_POST_TYPES:
        items = fetch_from_api(post_type, params={'status': 'publish'})
        
        for item in items:
            # Get the URL
            url = item.get('link', '')
            
            if not url:
                continue
            
            # Get modified date
            modified = item.get('modified_gmt', item.get('date_gmt', ''))
            if modified:
                try:
                    # Parse ISO 8601 format
                    dt = datetime.fromisoformat(modified.replace('Z', '+00:00'))
                    lastmod = dt.strftime('%Y-%m-%d')
                except:
                    lastmod = datetime.now(timezone.utc).strftime('%Y-%m-%d')
            else:
                lastmod = datetime.now(timezone.utc).strftime('%Y-%m-%d')
            
            # Determine priority
            parsed_url = urlparse(url)
            if parsed_url.path in ['/', '']:
                priority = '1.0'
            elif post_type == 'pages':
                priority = '0.8'
            else:
                priority = '0.6'
            
            yield {
                'loc': url,
                'lastmod': lastmod
            }


# ==================== WRITER & SHARDING ====================
def generate_sitemaps():
    """
    Streams URLs from API and writes to sharded XML files.
    Automatically rotates files at 50k URLs per file.
    """
    if not os.path.exists(OUTPUT_DIR):
        os.makedirs(OUTPUT_DIR)

    file_count = 1
    url_count = 0
    current_file = None
    generated_files = []

    xml_header = '<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    xml_footer = '</urlset>\n'

    def open_new_file(file_num):
        """Opens a new sitemap file with optional gzip compression."""
        ext = '.xml.gz' if USE_GZIP else '.xml'
        filename = os.path.join(OUTPUT_DIR, f'sitemap-{file_num}{ext}')
        
        if USE_GZIP:
            f = gzip.open(filename, 'wt', encoding='utf-8')
        else:
            f = open(filename, 'w', encoding='utf-8')
        
        f.write(xml_header)
        print(f"[*] Writing to {filename}...")
        return f, filename

    current_file, current_filename = open_new_file(file_count)
    generated_files.append(os.path.basename(current_filename))

    try:
        for row in fetch_urls():
            if url_count >= LIMIT_PER_FILE:
                current_file.write(xml_footer)
                current_file.close()
                print(f"[+] Finished {os.path.basename(current_filename)}. URLs: {url_count}")
                
                file_count += 1
                url_count = 0
                current_file, current_filename = open_new_file(file_count)
                generated_files.append(os.path.basename(current_filename))

            entry = (
                f"  <url>\n"
                f"    <loc>{row['loc']}</loc>\n"
                f"    <lastmod>{row['lastmod']}</lastmod>\n"
                f"  </url>\n"
            )
            current_file.write(entry)
            url_count += 1

    finally:
        if current_file and not current_file.closed:
            current_file.write(xml_footer)
            current_file.close()
            print(f"[+] Finished {os.path.basename(current_filename)}. URLs: {url_count}")

    return generated_files


# ==================== INDEX GENERATOR ====================
def generate_index(sitemap_files):
    """Creates sitemap index file referencing all sharded sitemaps."""
    index_filename = os.path.join(OUTPUT_DIR, 'sitemap-index.xml')
    
    with open(index_filename, 'w', encoding='utf-8') as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        
        for filename in sitemap_files:
            base = SITE_URL.rstrip('/')
            sitemap_url = f"{base}/{OUTPUT_DIR}{filename}"
            
            f.write('  <sitemap>\n')
            f.write(f'    <loc>{sitemap_url}</loc>\n')
            f.write(f'    <lastmod>{datetime.now(timezone.utc).strftime("%Y-%m-%d")}</lastmod>\n')
            f.write('  </sitemap>\n')
        
        f.write('</sitemapindex>\n')
    
    print(f"[!] Sitemap Index: {index_filename}")
    print(f"[!] Total sitemaps: {len(sitemap_files)}")


# ==================== API VALIDATION ====================
def validate_api_access():
    """Tests if the WordPress REST API is accessible."""
    url = f"{SITE_URL.rstrip('/')}/wp-json/wp/v2"
    
    print(f"[*] Testing API access: {url}")
    
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        print("[✓] WordPress REST API is accessible")
        return True
    except requests.exceptions.RequestException as e:
        print(f"[✗] Cannot access WordPress REST API: {e}")
        print("\n[!] Make sure:")
        print("    1. SITE_URL is correct")
        print("    2. WordPress REST API is enabled")
        print("    3. Site is publicly accessible")
        return False


# ==================== EXECUTION ====================
if __name__ == "__main__":
    print("=" * 60)
    print("Python Cartographer - WordPress REST API Edition")
    print("=" * 60)
    
    # Validate config
    if SITE_URL == 'https://yoursite.com' or 'yoursite.com' in SITE_URL:
        print("\n[!] Please update SITE_URL with your WordPress site URL")
        print("    Example: SITE_URL = 'https://example.com'")
        exit(1)
    
    # Test API access
    if not validate_api_access():
        exit(1)
    
    try:
        print("\n" + "=" * 60)
        files = generate_sitemaps()
        generate_index(files)
        
        print("\n" + "=" * 60)
        print("✓ Generation complete!")
        print(f"✓ Sitemaps generated in: {OUTPUT_DIR}")
        print(f"✓ Upload these files to your WordPress site")
        print(f"✓ Then submit to Google: {SITE_URL}/{OUTPUT_DIR}sitemap-index.xml")
        print("=" * 60)
        
    except Exception as e:
        print(f"\n[ERROR] Sitemap generation failed: {e}")
        import traceback
        traceback.print_exc()
        raise

The Key Difference

Database Method (Original):

  • Direct SQL queries to wp_posts table
  • Requires database credentials from wp-config.php
  • Needs SSH or command-line access to the server
  • Lightning fast – streams millions of rows with zero HTTP overhead
  • Only works on self-hosted with full server access

REST API Method :

  • Fetches data via public HTTP endpoints (/wp-json/wp/v2/posts)
  • Works with WordPress.com, managed hosting, or any WP site
  • Runs from your local machine – no server access required
  • Slightly slower due to HTTP requests and pagination (100 items per call)
  • Includes polite rate limiting to avoid hammering your server

The REST API adds HTTP latency. For a site with 10,000 posts, you’re making ~100 API calls instead of one database query. But for most sites (under 50k posts), this is negligible, and it’s the only option if you’re on managed WordPress hosting.

The generator pattern still applies. Memory usage stays flat. The sharding logic is identical. The only difference is the data source.

If you control the server, use the database version. If you’re on WordPress.com or managed hosting, this is your tool.

Takeaways

Stop letting plugins dictate your architecture. Plugins are generic – they have to work for everyone. Your site is specific.

By writing a 100-line Python script, you gain total control over automation, crawl priority, update frequency, and server load. You turn a black box into a transparent pipeline.

This cartographer knows exactly what to map, when to rotate files, and how to keep Google happy, because you wrote the rules.


Discover more from SEO Automata by Preslav Atanasov

Subscribe now to keep reading and get access to the full archive.

Continue reading