You’ve got a site, maybe it’s a blog, maybe a sprawling SaaS documentation portal, or something in-between. You update content, write new posts, adjust internal linking, pull in external references, and yet, somewhere along the way, links start to die. It’s like termites quietly chewing through your SEO foundation.
Broken links are signals to search engines that your site isn’t well cared for. Bots waste crawl budget hitting dead ends. Users bounce off frustrated. Ranking potential slips. Fixing links is part of solid SEO housekeeping.
Why a Script and Not Just a Tool?
Sure, there are tools like Screaming Frog's free tier and various online checkers that will scan pages and spit results back at you in a browser, but:
- They usually limit how many links you can check per scan.
- You don’t own the logic – it’s someone else’s UI and workflow.
- They often don’t integrate easily into automation pipelines or devops tasks.
A script gives you control, flexibility and the ability to embed link checking into your CI/CD, your CMS publishing hooks, or even your local site audits.
The Core Idea
At its simplest, a broken link checker does two things: it crawls a set of pages and verifies every link it finds (internal and external), checking whether each returns a healthy HTTP status (e.g., 200 OK) or something problematic (like 404 Not Found).
That means: follow <a href=""> links, optionally inspect images, CSS, JS, forms, and keep records of what worked and what didn’t.
Let’s build that.
A Simple Script
Below is a Python example using requests and beautifulsoup4.
```python
import requests
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

# Basic config
BASE_URL = "https://yourdomain.com"
TIMEOUT = 7  # seconds

checked_links = set()
broken_links = []

def fetch_html(url):
    try:
        resp = requests.get(url, timeout=TIMEOUT)
        resp.raise_for_status()
        return resp.text
    except Exception as e:
        print(f"Failed to fetch {url} — {e}")
        return None

def check_link(url):
    try:
        resp = requests.head(url, timeout=TIMEOUT, allow_redirects=True)
        return resp.status_code
    except Exception as e:
        print(f"Error checking {url} — {e}")
        return None

def crawl_page(url):
    html = fetch_html(url)
    if not html:
        return
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.find_all("a", href=True)
    for tag in anchors:
        href = tag["href"].strip()
        # Resolve relative URLs
        full_url = urljoin(url, href)
        parsed = urlparse(full_url)
        # Normalize and avoid duplicates
        if full_url in checked_links:
            continue
        checked_links.add(full_url)
        # Check only HTTP(S)
        if parsed.scheme not in ("http", "https"):
            continue
        status = check_link(full_url)
        if not status or status >= 400:
            broken_links.append((full_url, status))
            print(f"❌ BROKEN: {full_url} → {status}")
        else:
            print(f"✅ OK: {full_url} → {status}")
        # Optionally crawl internal links deeper
        if BASE_URL in full_url:
            crawl_page(full_url)

# kick off
crawl_page(BASE_URL)

print("=== Summary ===")
print(f"Total links checked: {len(checked_links)}")
print(f"Broken links found: {len(broken_links)}")
for link, code in broken_links:
    print(f" - {link} returned {code}")
```

This simple script fetches a page, parses links, normalizes URL paths, checks their HTTP status with HEAD requests, and reports broken ones.
You can expand it with features like:
- Queuing instead of recursion (to avoid stack limits),
- Async requests for speed,
- Logging to CSV/JSON,
- Exporting results to dashboards,
- Running via cron, or integrating with GitHub Actions.
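The first item on that list, swapping recursion for an explicit queue, can be sketched as follows. This is a minimal outline, not a drop-in replacement: the `fetch_links` and `check_status` parameters are hypothetical hooks you'd wire up to the `fetch_html`/`BeautifulSoup` parsing and `check_link` function from the script above.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl_iteratively(start_url, fetch_links, check_status, base_url):
    """Breadth-first crawl using a queue instead of recursion.

    fetch_links(url)  -> list of raw hrefs found on the page
    check_status(url) -> HTTP status code, or None on a network error
    """
    queue = deque([start_url])
    checked, broken = set(), []
    while queue:
        page = queue.popleft()
        for href in fetch_links(page):
            full_url = urljoin(page, href)
            if full_url in checked:
                continue
            checked.add(full_url)
            if urlparse(full_url).scheme not in ("http", "https"):
                continue
            status = check_status(full_url)
            if not status or status >= 400:
                broken.append((full_url, status))
            elif full_url.startswith(base_url):
                queue.append(full_url)  # crawl internal pages deeper
    return checked, broken
```

Because the fetching and status-checking are injected as functions, this version is also easy to unit-test with a fake site graph, no network required.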
What Counts as “Broken” in This Case?
Not just 404. Other HTTP statuses can be problematic too:
- Client errors (4xx) like 403 or 410.
- Server errors (5xx) where a page temporarily fails.
- Redirect loops or endless chains.
- Timeouts and refused connections, where the server never responds in time.
All of these degrade user trust and SEO. Tools often lump them into “dead links” by status code.
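That bucketing can be made explicit with a small helper. The category names below are just one reasonable taxonomy for reports, not a standard; `None` stands for the timeouts and connection errors where `check_link` returned nothing.

```python
def classify_status(status):
    """Map an HTTP status (or None for a network failure) to a problem bucket."""
    if status is None:
        return "timeout-or-connection-error"
    if 400 <= status < 500:
        return "client-error"   # e.g., 403, 404, 410
    if 500 <= status < 600:
        return "server-error"   # the page failed, possibly temporarily
    return "ok"
```

Feeding each `(url, status)` pair through this before reporting lets you, say, retry server errors later while flagging client errors for immediate fixing.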
Export the Results to CSV
Printing red lines to the console feels good for five minutes. After that, you want something you can sort, filter, share, or drop into Sheets.
A simple CSV export keeps things practical.
Since our script already collects failures like this: broken_links.append((full_url, status))
We just need to pour it into a file by adding this function near the top:
```python
import csv
from datetime import datetime

def export_to_csv(broken_links):
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M")
    filename = f"broken_links_{timestamp}.csv"
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "status_code"])
        for url, status in broken_links:
            writer.writerow([url, status])
    print(f"Saved report → {filename}")
```

Then call it at the very end of the script, right after your summary:
```python
print("=== Summary ===")
print(f"Total links checked: {len(checked_links)}")
print(f"Broken links found: {len(broken_links)}")

export_to_csv(broken_links)
```

Don’t Follow Every Redirect Blindly
Some destinations chain redirects, e.g., 301 → 302 → 200. You might think that’s fine, but every hop expends crawl budget and can slow bots down. Some systems even penalize excessive redirect chains.
Handling redirect chains intelligently is a next-level enhancement.
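One way to start: `requests` exposes the intermediate responses on `Response.history`, so counting hops takes only a few lines. The `max_hops` threshold below is an arbitrary choice, not an official limit, and the summarizing logic is split out so it can be tested without a network.

```python
def summarize_chain(history_statuses, final_status, final_url, max_hops=3):
    """Summarize a redirect chain given the intermediate 3xx status codes."""
    hops = len(history_statuses)
    return {
        "final_url": final_url,
        "final_status": final_status,
        "hops": hops,
        "too_many_hops": hops > max_hops,
    }

def check_redirect_chain(url, timeout=7, max_hops=3):
    """Follow redirects and report how many hops it took to reach the end."""
    import requests  # third-party; pip install requests
    resp = requests.get(url, timeout=timeout, allow_redirects=True)
    # resp.history holds one Response object per intermediate redirect
    return summarize_chain(
        [r.status_code for r in resp.history],
        resp.status_code, resp.url, max_hops,
    )
```

Flagged URLs with `too_many_hops` are good candidates for updating the link to point straight at the final destination.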
By the way, an external broken link may simply mean the other site moved something, but from an SEO point of view it still creates a UX problem on your page.
Why Schedule a Broken Link Checker?
One scan feels productive, but sites don’t sit still. You remove a landing page, a partner reshuffles their URL structure, an image host disappears, and suddenly everything is a mess. Broken links don’t explode all at once; they accumulate.
So don’t treat link checking like a one-off task. Automate it. Run it weekly, trigger it after deploys, bake it into your pipeline. Let it hum in the background like tests or linting.
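To make the checker pipeline-friendly, have it exit non-zero whenever anything is broken; a CI step (GitHub Actions, a cron wrapper, a pre-publish hook) then fails automatically. A minimal sketch, assuming the `broken_links` list collected by the script above:

```python
import sys  # used for sys.exit at the end of the real script

def exit_code_for(broken_links):
    """0 when the crawl found no broken links, 1 otherwise (fails a CI step)."""
    return 1 if broken_links else 0

# In the script above, add as the very last line:
#     sys.exit(exit_code_for(broken_links))
```

Most CI systems treat any non-zero exit code as a failed job, so a single broken link is enough to block a deploy or trigger an alert, whichever policy you prefer.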
Visitors don’t hit dead ends, crawlers don’t waste time, and your site feels cared for. Those small signals add up, and over time, they’re the difference between a site that ranks and one that slowly decays.
Wrapping Up
Technical SEO is about signals – sitemaps, schema, canonical URLs, and more. Broken links are the most tangible signal of neglect. They’re visible, measurable, and fixable. Tools help, but a script like this gives you ownership and control.
So plug in your domain, tweak the timeouts, maybe add async or exporting, and let it run. Over time you’ll see fewer red lines in your reports and more green ticks in your logs, and that’s a pretty satisfying feeling for any SEO.

