We are going to write a Python script that doesn’t care if you have 100 pages or 10 000. It uses Generators (the yield keyword) to fetch data lazily, ensuring your server’s RAM flatlines while the disk does the heavy lifting.
This is a skeleton for a production-grade cartographer.
Architecture and Dependencies
We need three distinct components:
- Fetcher – A generator that pulls rows from your database one at a time.
- Writer – A function that handles the XML formatting and file I/O.
- Controller – The logic that manages file rotation (sharding) when we hit the 50,000 URL limit.
We are keeping this vanilla. No pandas, no heavy frameworks. Just standard Python libraries.
import os
import gzip
from datetime import datetime, timezone
# Config
LIMIT_PER_FILE = 50000
OUTPUT_DIR = 'sitemaps/'
BASE_URL = 'https://example.com'
USE_GZIP = False # Toggle for compressed outputMock Database (Fetcher)
In a real scenario, this connects to Postgres or MySQL. Here, we simulate a massive database using a generator.
Note the yield. This is the magic. It hands over one record and pauses execution, preventing memory bloat.
def fetch_urls():
"""
Simulates a database cursor.
In reality, use: cursor.execute("SELECT url, updated_at FROM pages")
"""
# Let's pretend we have 120,000 pages
total_records = 120000
for i in range(total_records):
# Logic to skip 'noindex' pages happens in the SQL query
priority = "1.0" if i == 0 else "0.8" # Homepage
yield {
'loc': f"{BASE_URL}/page-{i}",
'lastmod': datetime.now(timezone.utc).strftime('%Y-%m-%d')
}Writer & Sharding Logic
This is where we enforce the “Stream, Don’t String” rule.
We open a file, write the header, loop through the generator, and if we hit the limit, we close the file and open a new one.
def generate_sitemaps():
if not os.path.exists(OUTPUT_DIR):
os.makedirs(OUTPUT_DIR)
file_count = 1
url_count = 0
current_file = None
generated_files = []
# The Header and Footer for standard sitemaps
xml_header = '<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
xml_footer = '</urlset>\n'
def open_new_file(file_num):
"""Helper to open a new sitemap file (with optional gzip)."""
ext = '.xml.gz' if USE_GZIP else '.xml'
filename = os.path.join(OUTPUT_DIR, f'sitemap-{file_num}{ext}')
if USE_GZIP:
f = gzip.open(filename, 'wt', encoding='utf-8')
else:
f = open(filename, 'w', encoding='utf-8')
f.write(xml_header)
print(f"[*] Writing to {filename}...")
return f, filename
# Initialize the first file
current_file, current_filename = open_new_file(file_count)
generated_files.append(os.path.basename(current_filename))
try:
# Iterate through the stream
for row in fetch_urls():
# Check if we need to rotate files
if url_count >= LIMIT_PER_FILE:
# Close current file
current_file.write(xml_footer)
current_file.close()
print(f"[+] Finished {os.path.basename(current_filename)}. URLs: {url_count}")
# Reset and rotate
file_count += 1
url_count = 0
current_file, current_filename = open_new_file(file_count)
generated_files.append(os.path.basename(current_filename))
# Write the URL entry
entry = (
f" <url>\n"
f" <loc>{row['loc']}</loc>\n"
f" <lastmod>{row['lastmod']}</lastmod>\n"
f" </url>\n"
)
current_file.write(entry)
url_count += 1
finally:
# Clean up the final file
if current_file and not current_file.closed:
current_file.write(xml_footer)
current_file.close()
print(f"[+] Finished {os.path.basename(current_filename)}. URLs: {url_count}")
return generated_filesIndex Generator (Binder)
Since we generated multiple files (because we respected the 50k limit), we need an xml Sitemap Index to rule them all.
def generate_index(sitemap_files):
index_filename = os.path.join(OUTPUT_DIR, 'sitemap-index.xml')
with open(index_filename, 'w', encoding='utf-8') as f:
f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
for filename in sitemap_files:
# Construct proper URL (handle trailing slashes)
base = BASE_URL.rstrip('/')
sitemap_url = f"{base}/{OUTPUT_DIR}{filename}"
f.write(' <sitemap>\n')
f.write(f' <loc>{sitemap_url}</loc>\n')
# In a real script, check the file modification time for this date
f.write(f' <lastmod>{datetime.now(timezone.utc).strftime("%Y-%m-%d")}</lastmod>\n')
f.write(' </sitemap>\n')
f.write('</sitemapindex>\n')
print(f"[!] Sitemap Index generated at {index_filename}")
print(f"[!] Total sitemaps created: {len(sitemap_files)}")
# Execution
if __name__ == "__main__":
try:
files = generate_sitemaps()
generate_index(files)
print(f"\n✓ Submit this to Google: {BASE_URL}/{OUTPUT_DIR}sitemap-index.xml")
except Exception as e:
print(f"[ERROR] Generation failed: {e}")
raiseWhy This Script?
Memory Safety
Look at the fetch_urls function. It yields one dict at a time.
The generate_sitemaps loop consumes that dict, writes it to disk, and discards it. You could run this script against a database with 10 million rows on a cheap $5/month VPS with 512MB of RAM, and it won’t crash. It is O(1) in terms of memory complexity.
“Sitemap Index” Automation
Most custom scripts forget this. They generate one giant file that hits 51,000 lines, and Google Search Console throws a “Parsing Error.”
This script automatically handles the overflow. It shards the data into sitemap-1.xml, sitemap-2.xml, etc., and creates the parent index automatically.
Production-Grade Error Handling
Notice the try/finally block. Even if your database connection drops mid-stream, the script ensures the current file is properly closed with a valid XML footer.
No corrupted sitemaps. No debugging why Google rejected your submission.

Expansion Packs
Because this is Python, you can get fancy:
- Gzip Compression – Already implemented. Just flip
USE_GZIP = Trueto output.xml.gzfiles directly, saving bandwidth. Search engines handle compressed sitemaps natively. - Image Sitemaps – Add
<image:image>tags inside the loop if your DB has image references. The streaming architecture doesn’t care about payload complexity. - Hreflang – If you have a translation table, inject the
xhtml:linkattributes right there in the loop. Multi-language support without plugins. - Database Integration – Replace the mock generator with real database queries:
def fetch_urls():
cursor.execute("""
SELECT url, updated_at, priority
FROM pages
WHERE noindex = FALSE
ORDER BY priority DESC, updated_at DESC
""")
for row in cursor:
yield {
'loc': f"{BASE_URL}{row['url']}",
'lastmod': row['updated_at'].strftime('%Y-%m-%d'),
}Full Mock Example
This is the mock / simulation version. It doesn’t connect to any real database.
Look at the fetch_urls() function – it just generates fake URLs in a loop:
for i in range(total_records):
yield {
'loc': f"{BASE_URL}/page-{i}",
'lastmod': datetime.now(timezone.utc).strftime('%Y-%m-%d')
}The script:
#!/usr/bin/env python3
"""
The Python Cartographer: A Stream-Based Sitemap Generator
Generates XML sitemaps with automatic sharding for large sites.
Memory-efficient generator-based approach for millions of URLs.
"""
import os
import gzip
from datetime import datetime, timezone
# ==================== CONFIG ====================
LIMIT_PER_FILE = 50000
OUTPUT_DIR = 'sitemaps/'
BASE_URL = 'https://example.com'
USE_GZIP = False # Toggle for compressed output
# ==================== MOCK DATABASE (FETCHER) ====================
def fetch_urls():
"""
Simulates a database cursor using a generator.
In production, replace with:
cursor.execute("SELECT url, updated_at FROM pages WHERE noindex = FALSE")
for row in cursor:
yield { ... }
"""
# Let's pretend we have 120,000 pages
total_records = 120000
for i in range(total_records):
# Logic to skip 'noindex' pages happens in the SQL query
priority = "1.0" if i == 0 else "0.8" # Homepage
yield {
'loc': f"{BASE_URL}/page-{i}",
'lastmod': datetime.now(timezone.utc).strftime('%Y-%m-%d')
}
# ==================== WRITER & SHARDING ====================
def generate_sitemaps():
"""
Streams URLs from generator and writes to sharded XML files.
Automatically rotates files at 50k URLs per file.
Returns list of generated filenames.
"""
if not os.path.exists(OUTPUT_DIR):
os.makedirs(OUTPUT_DIR)
file_count = 1
url_count = 0
current_file = None
generated_files = []
# The Header and Footer for standard sitemaps
xml_header = '<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
xml_footer = '</urlset>\n'
def open_new_file(file_num):
"""Helper to open a new sitemap file (with optional gzip)."""
ext = '.xml.gz' if USE_GZIP else '.xml'
filename = os.path.join(OUTPUT_DIR, f'sitemap-{file_num}{ext}')
if USE_GZIP:
f = gzip.open(filename, 'wt', encoding='utf-8')
else:
f = open(filename, 'w', encoding='utf-8')
f.write(xml_header)
print(f"[*] Writing to {filename}...")
return f, filename
# Initialize the first file
current_file, current_filename = open_new_file(file_count)
generated_files.append(os.path.basename(current_filename))
try:
# Iterate through the stream
for row in fetch_urls():
# Check if we need to rotate files
if url_count >= LIMIT_PER_FILE:
# Close current file
current_file.write(xml_footer)
current_file.close()
print(f"[+] Finished {os.path.basename(current_filename)}. URLs: {url_count}")
# Reset and rotate
file_count += 1
url_count = 0
current_file, current_filename = open_new_file(file_count)
generated_files.append(os.path.basename(current_filename))
# Write the URL entry
entry = (
f" <url>\n"
f" <loc>{row['loc']}</loc>\n"
f" <lastmod>{row['lastmod']}</lastmod>\n"
f" </url>\n"
)
current_file.write(entry)
url_count += 1
finally:
# Clean up the final file
if current_file and not current_file.closed:
current_file.write(xml_footer)
current_file.close()
print(f"[+] Finished {os.path.basename(current_filename)}. URLs: {url_count}")
return generated_files
# ==================== INDEX GENERATOR (BINDER) ====================
def generate_index(sitemap_files):
"""
Creates a sitemap index file that references all sharded sitemaps.
This is the file you submit to Google Search Console.
"""
index_filename = os.path.join(OUTPUT_DIR, 'sitemap-index.xml')
with open(index_filename, 'w', encoding='utf-8') as f:
f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
for filename in sitemap_files:
# Construct proper URL (handle trailing slashes)
base = BASE_URL.rstrip('/')
sitemap_url = f"{base}/{OUTPUT_DIR}{filename}"
f.write(' <sitemap>\n')
f.write(f' <loc>{sitemap_url}</loc>\n')
# In a real script, check the file modification time for this date
f.write(f' <lastmod>{datetime.now(timezone.utc).strftime("%Y-%m-%d")}</lastmod>\n')
f.write(' </sitemap>\n')
f.write('</sitemapindex>\n')
print(f"[!] Sitemap Index generated at {index_filename}")
print(f"[!] Total sitemaps created: {len(sitemap_files)}")
# ==================== EXECUTION ====================
if __name__ == "__main__":
print("=" * 60)
print("Python Cartographer - Sitemap Generator")
print("=" * 60)
try:
files = generate_sitemaps()
generate_index(files)
print("\n" + "=" * 60)
print("✓ Generation complete!")
print(f"✓ Submit this to Google: {BASE_URL}/{OUTPUT_DIR}sitemap-index.xml")
print("=" * 60)
except Exception as e:
print(f"\n[ERROR] Sitemap generation failed: {e}")
raiseConverting Mock to Database Version
To connect to a real database, you need to make two changes:
1. Add Database Dependencies
At the top of the script, add the database connector:
import mysql.connector # For MySQL/MariaDB
# OR
import psycopg2 # For PostgreSQLInstall it: pip install mysql-connector-python
2. Replace the fetch_urls() Function with:
def fetch_urls():
# Connect to database
conn = mysql.connector.connect(
host='localhost',
user='your_db_user',
password='your_db_pass',
database='your_db_name'
)
cursor = conn.cursor(dictionary=True, buffered=False)
# Query your pages table
cursor.execute("""
SELECT url, updated_at
FROM pages
WHERE noindex = FALSE
ORDER BY updated_at DESC
""")
# Stream results one row at a time
for row in cursor:
yield {
'loc': f"{BASE_URL}{row['url']}",
'lastmod': row['updated_at'].strftime('%Y-%m-%d')
}
cursor.close()
conn.close()That’s it. The rest of the script (writer, sharding, index generator) stays identical. The generator pattern works the same whether you’re yielding fake data or real database rows.
The magic is that cursor.execute() with buffered=False returns a cursor that also yields one row at a time, maintaining the same memory-efficient streaming behavior.
REST API Example
This version uses WordPress’s REST API instead of direct database access.
Why? Because most people don’t have MySQL credentials.
#!/usr/bin/env python3
"""
The Python Cartographer: WordPress REST API Edition
Generates XML sitemaps from WordPress via REST API.
Works with WordPress.com and any managed WordPress hosting.
No database access required.
"""
import os
import gzip
import requests
from datetime import datetime, timezone
from urllib.parse import urljoin, urlparse
import time
# ==================== CONFIG ====================
LIMIT_PER_FILE = 50000
OUTPUT_DIR = 'sitemaps/'
USE_GZIP = False
# WordPress Site Config
SITE_URL = 'https://example.com' # Your WordPress site URL (no trailing slash)
# Post types to include
INCLUDE_POST_TYPES = ['posts', 'pages'] # API endpoints
# API Settings
PER_PAGE = 100 # Max items per API request (WordPress limit is 100)
REQUEST_DELAY = 0.5 # Seconds between requests (be nice to the server)
# ==================== API FETCHER ====================
def fetch_from_api(endpoint, params=None):
"""
Fetches data from WordPress REST API with pagination.
Handles rate limiting and errors gracefully.
"""
url = f"{SITE_URL.rstrip('/')}/wp-json/wp/v2/{endpoint}"
all_items = []
page = 1
if params is None:
params = {}
params['per_page'] = PER_PAGE
params['_embed'] = 'true' # Get additional data
print(f"[*] Fetching {endpoint}...")
while True:
params['page'] = page
try:
response = requests.get(url, params=params, timeout=30)
if response.status_code == 400 and page > 1:
# No more pages
break
response.raise_for_status()
items = response.json()
if not items:
break
all_items.extend(items)
print(f" Page {page}: {len(items)} items (Total: {len(all_items)})")
# Check if there are more pages
total_pages = int(response.headers.get('X-WP-TotalPages', page))
if page >= total_pages:
break
page += 1
time.sleep(REQUEST_DELAY) # Be nice to the server
except requests.exceptions.RequestException as e:
print(f"[!] API request failed on page {page}: {e}")
if page == 1:
raise # Critical error on first page
break # Partial success, continue with what we have
print(f"[✓] Fetched {len(all_items)} {endpoint}")
return all_items
def fetch_urls():
"""
Generator that yields URL data from WordPress REST API.
Fetches posts and pages, formats them for sitemap.
"""
for post_type in INCLUDE_POST_TYPES:
items = fetch_from_api(post_type, params={'status': 'publish'})
for item in items:
# Get the URL
url = item.get('link', '')
if not url:
continue
# Get modified date
modified = item.get('modified_gmt', item.get('date_gmt', ''))
if modified:
try:
# Parse ISO 8601 format
dt = datetime.fromisoformat(modified.replace('Z', '+00:00'))
lastmod = dt.strftime('%Y-%m-%d')
except:
lastmod = datetime.now(timezone.utc).strftime('%Y-%m-%d')
else:
lastmod = datetime.now(timezone.utc).strftime('%Y-%m-%d')
# Determine priority
parsed_url = urlparse(url)
if parsed_url.path in ['/', '']:
priority = '1.0'
elif post_type == 'pages':
priority = '0.8'
else:
priority = '0.6'
yield {
'loc': url,
'lastmod': lastmod
}
# ==================== WRITER & SHARDING ====================
def generate_sitemaps():
"""
Streams URLs from API and writes to sharded XML files.
Automatically rotates files at 50k URLs per file.
"""
if not os.path.exists(OUTPUT_DIR):
os.makedirs(OUTPUT_DIR)
file_count = 1
url_count = 0
current_file = None
generated_files = []
xml_header = '<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
xml_footer = '</urlset>\n'
def open_new_file(file_num):
"""Opens a new sitemap file with optional gzip compression."""
ext = '.xml.gz' if USE_GZIP else '.xml'
filename = os.path.join(OUTPUT_DIR, f'sitemap-{file_num}{ext}')
if USE_GZIP:
f = gzip.open(filename, 'wt', encoding='utf-8')
else:
f = open(filename, 'w', encoding='utf-8')
f.write(xml_header)
print(f"[*] Writing to {filename}...")
return f, filename
current_file, current_filename = open_new_file(file_count)
generated_files.append(os.path.basename(current_filename))
try:
for row in fetch_urls():
if url_count >= LIMIT_PER_FILE:
current_file.write(xml_footer)
current_file.close()
print(f"[+] Finished {os.path.basename(current_filename)}. URLs: {url_count}")
file_count += 1
url_count = 0
current_file, current_filename = open_new_file(file_count)
generated_files.append(os.path.basename(current_filename))
entry = (
f" <url>\n"
f" <loc>{row['loc']}</loc>\n"
f" <lastmod>{row['lastmod']}</lastmod>\n"
f" </url>\n"
)
current_file.write(entry)
url_count += 1
finally:
if current_file and not current_file.closed:
current_file.write(xml_footer)
current_file.close()
print(f"[+] Finished {os.path.basename(current_filename)}. URLs: {url_count}")
return generated_files
# ==================== INDEX GENERATOR ====================
def generate_index(sitemap_files):
"""Creates sitemap index file referencing all sharded sitemaps."""
index_filename = os.path.join(OUTPUT_DIR, 'sitemap-index.xml')
with open(index_filename, 'w', encoding='utf-8') as f:
f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
for filename in sitemap_files:
base = SITE_URL.rstrip('/')
sitemap_url = f"{base}/{OUTPUT_DIR}{filename}"
f.write(' <sitemap>\n')
f.write(f' <loc>{sitemap_url}</loc>\n')
f.write(f' <lastmod>{datetime.now(timezone.utc).strftime("%Y-%m-%d")}</lastmod>\n')
f.write(' </sitemap>\n')
f.write('</sitemapindex>\n')
print(f"[!] Sitemap Index: {index_filename}")
print(f"[!] Total sitemaps: {len(sitemap_files)}")
# ==================== API VALIDATION ====================
def validate_api_access():
"""Tests if the WordPress REST API is accessible."""
url = f"{SITE_URL.rstrip('/')}/wp-json/wp/v2"
print(f"[*] Testing API access: {url}")
try:
response = requests.get(url, timeout=10)
response.raise_for_status()
print("[✓] WordPress REST API is accessible")
return True
except requests.exceptions.RequestException as e:
print(f"[✗] Cannot access WordPress REST API: {e}")
print("\n[!] Make sure:")
print(" 1. SITE_URL is correct")
print(" 2. WordPress REST API is enabled")
print(" 3. Site is publicly accessible")
return False
# ==================== EXECUTION ====================
if __name__ == "__main__":
print("=" * 60)
print("Python Cartographer - WordPress REST API Edition")
print("=" * 60)
# Validate config
if SITE_URL == 'https://yoursite.com' or 'yoursite.com' in SITE_URL:
print("\n[!] Please update SITE_URL with your WordPress site URL")
print(" Example: SITE_URL = 'https://example.com'")
exit(1)
# Test API access
if not validate_api_access():
exit(1)
try:
print("\n" + "=" * 60)
files = generate_sitemaps()
generate_index(files)
print("\n" + "=" * 60)
print("✓ Generation complete!")
print(f"✓ Sitemaps generated in: {OUTPUT_DIR}")
print(f"✓ Upload these files to your WordPress site")
print(f"✓ Then submit to Google: {SITE_URL}/{OUTPUT_DIR}sitemap-index.xml")
print("=" * 60)
except Exception as e:
print(f"\n[ERROR] Sitemap generation failed: {e}")
import traceback
traceback.print_exc()
raiseThe Key Difference
Database Method (Original):
- Direct SQL queries to
wp_poststable - Requires database credentials from
wp-config.php - Needs SSH or command-line access to the server
- Lightning fast – streams millions of rows with zero HTTP overhead
- Only works on self-hosted with full server access
REST API Method :
- Fetches data via public HTTP endpoints (
/wp-json/wp/v2/posts) - Works with WordPress.com, managed hosting, or any WP site
- Runs from your local machine – no server access required
- Slightly slower due to HTTP requests and pagination (100 items per call)
- Includes polite rate limiting to avoid hammering your server
The REST API adds HTTP latency. For a site with 10,000 posts, you’re making ~100 API calls instead of one database query. But for most sites (under 50k posts), this is negligible, and it’s the only option if you’re on managed WordPress hosting.
The generator pattern still applies. Memory usage stays flat. The sharding logic is identical. The only difference is the data source.
If you control the server, use the database version. If you’re on WordPress.com or managed hosting, this is your tool.
Takeaways
Stop letting plugins dictate your architecture. Plugins are generic – they have to work for everyone. Your site is specific.
By writing a 100-line Python script, you gain total control over automation, crawl priority, update frequency, and server load. You turn a black box into a transparent pipeline.
This cartographer knows exactly what to map, when to rotate files, and how to keep Google happy, because you wrote the rules.

