Script: Bulk Schema Checker

Structured data audits can be a nasty business.

Someone opens a validator, pastes in one URL, gets a green check, and declares victory. Meanwhile the other 4,982 templates on the site are sitting there with broken JSON-LD, missing @type, and duplicate graphs.

What you might need is a repeatable way to pull a list of URLs, inspect their structured data in bulk, and flag the obvious failures before they turn into search losses.

This script is not a replacement for Google’s Rich Results Test or Schema.org’s validator, but a first-pass machine that tells you where the bodies are buried. At present, neither Google nor Schema.org provides an API for validating structured data.

What This Script Does

This is a bulk Python checker for pages that output JSON-LD.

It will:

Fetch a list of URLs
Extract every application/ld+json block
Parse valid JSON
Walk through @graph when present
Report missing markup
Report invalid JSON
Report missing top-level fields like @context and @type
Export the findings to CSV

That makes it useful for template QA, migration checks, release monitoring, and large-scale sanity audits.

What it will not do is certify that a page will earn a rich result. That is a different problem, and Google’s documentation is very clear on that boundary.

Operating Principle

Most structured data failures are mechanical ones.

A script tag is missing. A comma is misplaced. A deployment strips quotation marks. A CMS field pushes empty values into a graph. A template outputs Product on pages that are not products. A developer ships schema to staging and forgets production.

This script is designed for that repetitive, high-volume layer of the problem.

Architecture

The workflow is simple:

Input – A text file or CSV of URLs.
Fetch – Request the HTML with a polite timeout and user agent.
Extract – Collect all JSON-LD script blocks.
Parse – Load each block as JSON and flatten @graph items into individual entities.
Validate – Check for:
- no JSON-LD found
- invalid JSON
- missing @context
- missing @type
Export – Write one row per URL with summary fields you can sort, filter, and hand to engineering.

That is enough to find the majority of structural failures in a real-world audit.
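To make the Parse step concrete, here is a minimal sketch run against a made-up JSON-LD block. The sample markup and the helper name flatten are illustrative assumptions. It shows why @context is checked once per block while @type is checked per entity.

```python
import json

# Made-up JSON-LD block: one shared @context, two entities inside @graph.
SAMPLE_BLOCK = """
{
  "@context": "https://schema.org",
  "@graph": [
    {"@type": "Organization", "name": "Example Co"},
    {"@type": "WebSite", "url": "https://example.com/"}
  ]
}
"""


def flatten(data):
    """Return the entity dicts in a parsed block, unpacking @graph if present."""
    if isinstance(data, dict):
        graph = data.get("@graph")
        if isinstance(graph, list):
            return [item for item in graph if isinstance(item, dict)]
        return [data]
    if isinstance(data, list):
        return [item for item in data if isinstance(item, dict)]
    return []


parsed = json.loads(SAMPLE_BLOCK)
entities = flatten(parsed)
types = [entity.get("@type") for entity in entities]
print(types)  # ['Organization', 'WebSite']
print("@context" in parsed)  # True: the context sits on the block, not on each entity
```

Neither entity carries its own @context here, so a per-entity context check would flag both as broken even though the block is fine.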

The Script

import csv
import json
import os
from typing import Any, Dict, List, Tuple

import requests
from bs4 import BeautifulSoup

# Identifies the checker to servers; the example.com URL is a placeholder.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; BulkSchemaValidator/1.0; +https://example.com/bot)"
}
TIMEOUT = 15  # seconds to wait before giving up on a request


def load_urls(file_path: str) -> List[str]:
    """
    Load URLs from either:
    - .txt files: one URL per line
    - .csv files: first column assumed to contain URLs, header allowed
    """
    ext = os.path.splitext(file_path)[1].lower()
    urls: List[str] = []
    if ext == ".txt":
        with open(file_path, "r", encoding="utf-8") as f:
            for line in f:
                url = line.strip()
                if url:
                    urls.append(url)
    elif ext == ".csv":
        with open(file_path, "r", encoding="utf-8-sig", newline="") as f:
            reader = csv.reader(f)
            first_row = next(reader, None)
            if not first_row:
                return urls
            first_cell = first_row[0].strip().lower() if first_row else ""
            has_header = first_cell in {"url", "urls", "link", "links"}
            if not has_header and first_row[0].strip():
                urls.append(first_row[0].strip())
            for row in reader:
                if row and row[0].strip():
                    urls.append(row[0].strip())
    else:
        raise ValueError("Unsupported file type. Use .txt or .csv")
    return urls


def fetch_html(url: str) -> Tuple[int, str]:
    """Fetch a URL and return (HTTP status code, response body)."""
    response = requests.get(url, headers=HEADERS, timeout=TIMEOUT)
    return response.status_code, response.text


def extract_jsonld(html: str) -> List[str]:
    """Return the text of every non-empty application/ld+json script block."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = soup.find_all("script", attrs={"type": "application/ld+json"})
    texts = (block.get_text(strip=True) for block in blocks)
    return [text for text in texts if text]


def normalize_entities(data: Any) -> List[Dict[str, Any]]:
    """
    Convert a JSON-LD block into a list of entity dictionaries.
    Handles:
    - single object
    - list of objects
    - object with @graph
    """
    entities: List[Dict[str, Any]] = []
    if isinstance(data, dict):
        if "@graph" in data and isinstance(data["@graph"], list):
            for item in data["@graph"]:
                if isinstance(item, dict):
                    entities.append(item)
        else:
            entities.append(data)
    elif isinstance(data, list):
        for item in data:
            if isinstance(item, dict):
                entities.append(item)
    return entities


def validate_block_and_entities(parsed: Any, entities: List[Dict[str, Any]]) -> List[str]:
    """
    Validate JSON-LD in a way that avoids false positives for @graph structures.
    Important:
    - @context is checked at the block level, not on every entity
    - @type is checked on extracted entities
    """
    issues: List[str] = []
    if isinstance(parsed, dict):
        if "@context" not in parsed:
            issues.append("missing_@context")
    elif isinstance(parsed, list):
        # For top-level arrays, at least one object should usually carry @context
        has_context = any(isinstance(item, dict) and "@context" in item for item in parsed)
        if not has_context:
            issues.append("missing_@context")
    if not entities:
        issues.append("no_entities_extracted")
        return issues
    missing_type_count = 0
    for entity in entities:
        if "@type" not in entity:
            missing_type_count += 1
    if missing_type_count == len(entities):
        issues.append("missing_@type")
    return issues


def inspect_url(url: str) -> Dict[str, Any]:
    result: Dict[str, Any] = {
        "url": url,
        "http_status": "",
        "jsonld_blocks": 0,
        "entities_found": 0,
        "schema_types": "",
        "valid_json_blocks": 0,
        "invalid_json_blocks": 0,
        "issues": ""
    }
    try:
        status_code, html = fetch_html(url)
        result["http_status"] = status_code
        blocks = extract_jsonld(html)
        result["jsonld_blocks"] = len(blocks)
        if not blocks:
            result["issues"] = "no_jsonld_found"
            return result
        all_types: List[str] = []
        all_issues: List[str] = []
        for block in blocks:
            try:
                parsed = json.loads(block)
                result["valid_json_blocks"] += 1
                entities = normalize_entities(parsed)
                result["entities_found"] += len(entities)
                block_issues = validate_block_and_entities(parsed, entities)
                all_issues.extend(block_issues)
                for entity in entities:
                    entity_type = entity.get("@type")
                    if isinstance(entity_type, list):
                        all_types.extend([str(t) for t in entity_type])
                    elif entity_type:
                        all_types.append(str(entity_type))
            except json.JSONDecodeError:
                result["invalid_json_blocks"] += 1
                all_issues.append("invalid_json")
        unique_types = sorted(set(all_types))
        unique_issues = sorted(set(all_issues))
        result["schema_types"] = ", ".join(unique_types)
        result["issues"] = ", ".join(unique_issues) if unique_issues else "none"
        return result
    except requests.RequestException as e:
        result["issues"] = f"request_error: {type(e).__name__}"
        return result
    except Exception as e:
        result["issues"] = f"unexpected_error: {type(e).__name__}"
        return result


def run_audit(input_file: str, output_file: str) -> None:
    urls = load_urls(input_file)
    results: List[Dict[str, Any]] = []
    for url in urls:
        print(f"Checking: {url}")
        results.append(inspect_url(url))
    fieldnames = [
        "url",
        "http_status",
        "jsonld_blocks",
        "entities_found",
        "schema_types",
        "valid_json_blocks",
        "invalid_json_blocks",
        "issues"
    ]
    with open(output_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(results)
    print(f"\nDone. Report saved to: {output_file}")


if __name__ == "__main__":
    # Important: replace urls.csv with your own input file (.csv or .txt).
    run_audit("urls.csv", "schema_audit_report.csv")

Why This Version Stays Useful

The trap with automation is trying to make it too clever too early.

You do not need a giant framework to detect that 800 product pages suddenly have no JSON-LD. You do not need a machine learning pipeline to discover that your article template is outputting invalid JSON because a quote in the headline was not escaped.
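The unescaped-quote failure is easy to reproduce. The sketch below uses a made-up headline and assumes the template builds JSON by string concatenation; the variable names are illustrative only.

```python
import json

headline = 'Why "Free Shipping" Still Converts'  # hypothetical CMS value

# Naive string templating: the inner quotes end the JSON string early.
broken = (
    '{"@context": "https://schema.org", "@type": "Article", "headline": "'
    + headline
    + '"}'
)

try:
    json.loads(broken)
    broken_is_valid = True
except json.JSONDecodeError:
    broken_is_valid = False

# Serialising through json.dumps escapes the quotes correctly.
fixed = json.dumps({
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": headline,
})

print(broken_is_valid)  # False
print(json.loads(fixed)["headline"] == headline)  # True
```

This is the class of error the invalid_json flag surfaces: the page looks normal to a visitor, but the block fails to parse.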

This script gives you a workable signal fast.

It stays deliberately small and readable. That matters, because most SEO automation scripts die not from lack of ambition, but from becoming too annoying to maintain.

What the Output Tells You

A clean report might look boring, and that is the point.

You want rows like:

valid JSON blocks present
entities extracted
expected schema types appearing
no structural issues

The interesting rows are the ugly ones:

no_jsonld_found
invalid_json
missing_@type
missing_@context
request_error

That is where your QA process starts.

For example, if all category pages return CollectionPage but 20 of them return no markup at all, you likely have a rendering or template exception. If all product pages still return Product but half the blocks are invalid JSON, your deployment broke serialization.
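One way to see that kind of pattern is to count issue labels per template. The sketch below assumes the report columns written by the script and a hypothetical mapping from URL fragments to template names; the sample rows are invented.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical mapping from URL path fragment to template name.
TEMPLATES = {"/category/": "category", "/product/": "product"}


def template_for(url):
    """Return the template name for a URL, or 'other' if nothing matches."""
    for fragment, name in TEMPLATES.items():
        if fragment in url:
            return name
    return "other"


def summarize(rows):
    """Count each issue label per template from report rows."""
    summary = defaultdict(Counter)
    for row in rows:
        # The script joins multiple issues with ", ".
        for issue in row["issues"].split(", "):
            summary[template_for(row["url"])][issue] += 1
    return summary


# Inline stand-in for schema_audit_report.csv (only the columns used here).
SAMPLE_REPORT = """url,issues
https://example.com/category/shoes,none
https://example.com/category/hats,no_jsonld_found
https://example.com/product/1,invalid_json
https://example.com/product/2,"invalid_json, missing_@type"
"""

summary = summarize(csv.DictReader(io.StringIO(SAMPLE_REPORT)))
for template, counts in sorted(summary.items()):
    print(template, dict(counts))
```

To run it on a real report, replace the io.StringIO stand-in with the opened report file.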

That distinction matters when you hand the issue to developers.

Why Not Use Screaming Frog Instead?

You can, and in many cases you should.

Screaming Frog is faster for broad crawling, quick extraction, and general schema audits across a whole site. If the goal is to inspect markup at scale without writing code, it is the obvious tool.

The reason to use a script instead is control.

A custom validator lets you check a specific URL list, apply your own rules, shape the output however you want, and plug the process into QA or deployment workflows.
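As a sketch of the deployment hook, the report can be reduced to a pass/fail exit code. The list of blocking labels and the function names below are assumptions for illustration; which issues should stop a release is a local decision.

```python
import csv
import io

# Issue labels that should block a release; adjust to your own tolerance.
BLOCKING_PREFIXES = ("no_jsonld_found", "invalid_json", "request_error")


def blocking_rows(rows):
    """Return report rows whose issues column contains a blocking label."""
    failures = []
    for row in rows:
        issues = row["issues"].split(", ")
        if any(issue.startswith(BLOCKING_PREFIXES) for issue in issues):
            failures.append(row)
    return failures


def exit_code(rows):
    """Return 0 when no blocking issues are present, 1 otherwise."""
    return 1 if blocking_rows(rows) else 0


# Inline stand-in for schema_audit_report.csv (only the columns used here).
SAMPLE_REPORT = """url,issues
https://example.com/product/1,none
https://example.com/product/2,missing_@context
https://example.com/product/3,request_error: ConnectTimeout
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_REPORT)))
for row in blocking_rows(rows):
    print(f"FAIL {row['url']}: {row['issues']}")
print("exit code:", exit_code(rows))
```

In a pipeline, you would read the real report file and pass the result of exit_code to sys.exit so that a non-zero status fails the job.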

And if you have a small domain, Screaming Frog’s free tier is great for exploration. A script is better when you need something repeatable and built around your own implementation.

Limits of Bulk Validation

A script like this can validate structure. It can catch parse failures. It can tell you whether markup exists and whether core fields are present.

But the script cannot fully determine whether the page qualifies for a specific Google rich result, whether the markup matches visible on-page content, or whether the implementation violates Google quality guidelines.

The official tools cover different ground: Schema.org’s validator is built to extract Schema.org markup and identify syntax mistakes, while Google recommends using the Rich Results Test to validate markup for supported search features.

Again, Google is explicit here – structured data helps with eligibility, not certainty.

Smart Extensions

Once the base script is doing its job, you can start making it smarter.

A few practical upgrades:

Compare expected schema by URL pattern – If /product/ URLs do not contain Product, flag them.
Add rendered HTML support – For JavaScript-heavy sites, use Playwright or Selenium instead of raw requests.
Pull URLs from a crawl export – Feed in URLs from Screaming Frog, a sitemap, or a database dump.
Split reports by template type – That makes debugging faster for engineering teams.
Add required-property checks – For example, you can create custom rules for fields your implementation expects on certain templates. Just do not confuse your internal requirements with Google-wide guarantees unless the documentation actually supports that claim. Google’s schema feature docs consistently separate required properties, recommended properties, and general eligibility rules.
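The first and last upgrades could be sketched as a single rule function that takes the entities the script already extracts. The URL fragments, the type mapping, and the required-property lists below are hypothetical house rules, not Google requirements.

```python
from typing import Any, Dict, List

# Hypothetical house rules: URL fragment -> schema type the template should emit.
EXPECTED_TYPES = {"/product/": "Product", "/blog/": "Article"}

# Hypothetical internal requirements per type; these are NOT Google's rules.
REQUIRED_PROPERTIES = {"Product": ["name", "offers"], "Article": ["headline"]}


def entity_types(entity: Dict[str, Any]) -> List[str]:
    """Return @type as a list, since JSON-LD allows a string or an array."""
    value = entity.get("@type", [])
    return [str(t) for t in value] if isinstance(value, list) else [str(value)]


def check_custom_rules(url: str, entities: List[Dict[str, Any]]) -> List[str]:
    """Apply URL-pattern and required-property rules to extracted entities."""
    issues: List[str] = []
    found = {t for entity in entities for t in entity_types(entity)}
    # Rule 1: the URL pattern implies a type that must be present.
    for fragment, expected in EXPECTED_TYPES.items():
        if fragment in url and expected not in found:
            issues.append(f"expected_{expected}_missing")
    # Rule 2: entities of a known type must carry non-empty required properties.
    for entity in entities:
        for schema_type in entity_types(entity):
            for prop in REQUIRED_PROPERTIES.get(schema_type, []):
                if entity.get(prop) in (None, "", [], {}):
                    issues.append(f"{schema_type}_missing_{prop}")
    return sorted(set(issues))


entities = [{"@type": "Product", "name": "Trail Shoe", "offers": ""}]
print(check_custom_rules("https://example.com/product/trail-shoe", entities))
# ['Product_missing_offers']
print(check_custom_rules("https://example.com/product/old", [{"@type": "WebPage"}]))
# ['expected_Product_missing']
```

Treating empty strings and empty containers as missing is a deliberate choice here, since a CMS field that pushes an empty value into the graph is one of the mechanical failures described earlier.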

Final Thought

Most schema problems are operational. The markup strategy was usually fine, then someone changed a component, launched a redesign, swapped out a field mapper, or “cleaned up” a template.

That is why bulk validation can be helpful. It is the difference between assuming your structured data exists and knowing it does.

And in technical SEO, that difference is usually where the damage lives.