Simple GPT Explained for SEOs

GPT models aren’t magical beasts. They’re matrix math, attention heads, and token predictions stacked on top of each other. Read the code once carefully and the whole thing makes sense.

Most SEOs interact with these models daily without knowing what’s running underneath, and that gap shows up when you’re debugging why a piece of content ranks, how semantic search actually works, or why AI overviews keep pulling the wrong answers.

This article walks through a minimal GPT implementation, built from scratch, and explains what each part actually does.

How the Technology Works

A GPT model has one job, which is to predict the next token. A token is a piece of text turned into a number. The model never reads words directly – it reads IDs, converts them into vectors, runs those through transformer blocks, and spits out scores for what comes next.
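
The first steps of that pipeline can be sketched in a few lines of plain Python. This is a toy illustration, not the model built later in this article: the whitespace splitting, the five-entry `toy_vocab`, and the two-number vectors in `toy_embeddings` are all invented for the example. A real GPT tokenizer splits text into subword pieces, and the vectors are learned during training.

```python
# Toy illustration only: the vocabulary and vectors below are invented.
toy_vocab = {"seo": 0, "needs": 1, "content": 2, "links": 3, "<unk>": 4}

# One small vector per token ID (a real model learns these during training).
toy_embeddings = [
    [0.1, 0.9],
    [0.4, 0.2],
    [0.8, 0.3],
    [0.7, 0.5],
    [0.0, 0.0],
]


def encode(text):
    """Split on whitespace and map each word to its ID (unknown words -> <unk>)."""
    return [toy_vocab.get(word, toy_vocab["<unk>"]) for word in text.lower().split()]


def embed(token_ids):
    """Look up the vector stored for each token ID."""
    return [toy_embeddings[token_id] for token_id in token_ids]


token_ids = encode("SEO needs content")
vectors = embed(token_ids)

print(token_ids)  # [0, 1, 2]
print(vectors)    # [[0.1, 0.9], [0.4, 0.2], [0.8, 0.3]]
```

Everything after this point in a real model (the transformer blocks and the final scores) operates on those vectors, never on the original words.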

Training works by feeding the model token sequences where the target is the same sequence shifted one position forward. It predicts the next token at every position, measures how wrong it was, and adjusts its weights to be less wrong next time.
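
As a rough sketch of "shifted one position forward" and "measures how wrong it was", the snippet below builds one input/target pair and scores two made-up predictions. It assumes the usual cross-entropy loss for next-token prediction. The token IDs, the three-token vocabulary, and the raw scores are invented for illustration.

```python
import math

# Invented token IDs for illustration.
sequence = [10, 20, 30, 40, 50]
inputs = sequence[:-1]   # [10, 20, 30, 40]
targets = sequence[1:]   # [20, 30, 40, 50]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def cross_entropy(scores, target_index):
    """Loss is low when the correct token receives a high probability."""
    return -math.log(softmax(scores)[target_index])


# Pretend vocabulary of 3 tokens; index 2 is the correct next token.
confident_and_right = cross_entropy([0.0, 0.0, 5.0], target_index=2)
confident_and_wrong = cross_entropy([5.0, 0.0, 0.0], target_index=2)

print(round(confident_and_right, 3))  # 0.013
print(round(confident_and_wrong, 3))  # 5.013
```

Training nudges the weights so that, averaged over every position in every batch, this number goes down.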

Evaluation runs the same model without touching the weights. Dropout off, no gradient updates, tested on text it never trained on. The point is checking whether it actually learned language patterns or just memorized the training data.
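
In PyTorch, "dropout off, no gradient updates" usually comes down to two calls: `model.eval()` and `torch.no_grad()`. The sketch below uses a tiny stand-in network named `toy_model` (an assumption for illustration, not the GPT model from this article) to show the effect.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a real model: one linear layer followed by dropout.
toy_model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

# Evaluation mode switches dropout off, so repeated calls give the same output.
toy_model.eval()

# no_grad tells PyTorch not to track gradients, so no weight update can follow.
with torch.no_grad():
    first = toy_model(x)
    second = toy_model(x)

print(torch.equal(first, second))  # True
print(first.requires_grad)         # False

# Switch back to training mode before training resumes.
toy_model.train()
```

Forgetting `eval()` is a common source of evaluation scores that change from run to run, because dropout keeps hiding random values.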

In production, the model takes a prompt, tokenizes it, predicts one token, appends it, and repeats. That’s how generated text works, one token at a time.
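
The predict, append, repeat loop can be sketched without any neural network at all. In the snippet below, `toy_next_token` is an invented lookup table standing in for a trained model, and `predict_next` and `generate` are hypothetical helper names. A real model looks at the whole context window and returns a score for every token in its vocabulary.

```python
# Toy stand-in for a trained model: a fixed table of "most likely next token".
# The table and the token names are invented for illustration.
toy_next_token = {
    "seo": "needs",
    "needs": "good",
    "good": "content",
    "content": "<end>",
}


def predict_next(tokens):
    """Pretend model: look only at the last token and return one prediction."""
    return toy_next_token.get(tokens[-1], "<end>")


def generate(prompt_tokens, max_new_tokens):
    """Predict one token, append it to the sequence, and repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


print(generate(["seo"], max_new_tokens=5))  # ['seo', 'needs', 'good', 'content']
```

Because each new token is fed back in as input, an early wrong prediction influences everything generated after it.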

Real production systems bolt on safety layers, caching, batching, hardware scaling, and better sampling. The code below skips all of that and focuses on the core mechanics.

The code and the explanations are inspired by and based on Sebastian Raschka’s book Build a Large Language Model (From Scratch).

Figure: How a GPT-style LLM predicts the next token.

Why You Should Care as an SEO

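Google has been a language model company for years. BERT, MUM, Gemini – these are all built on the Transformer architecture and share the core ideas you will see in the code (tokens, embeddings, attention), with some differences. BERT, for example, reads context in both directions instead of strictly left to right.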

Knowing how attention works, how tokens get weighted, and how context gets processed is not trivia. It changes how you think about content.

Entity relationships, semantic relevance, why some pages get pulled into AI overviews and others don’t: a lot of that starts making more sense once you understand what the model is actually doing when it reads a page. It’s not keyword matching. It never was.

The SEOs who’ll do well in the next few years don’t need to build these models. They simply need to stop treating them as black boxes.

Fully Commented Code

The code is a complete GPT-style model in PyTorch with a real tokenizer (tiktoken), not fake random token IDs. Nearly every line has a comment explaining what it does and what the technical terms mean.

At the end there’s a glossary covering the complex terms used throughout. Read the code top to bottom and you’ll have both a working model and a clear understanding of what’s running inside it.

# ============================================================
# Educational GPT-style language model from scratch.
#
# GPT-style means:
#   A model architecture similar in structure to GPT models:
#   it reads tokens from left to right and learns to predict
#   the next token.
#
# Language model means:
#   A model that learns patterns in text so it can estimate
#   what text piece probably comes next.
#
# Uses:
#   - PyTorch for neural network building and training.
#   - tiktoken for real text tokenization.
#
# Install first:
#   pip install torch tiktoken
#
# Important:
#   This is a learning implementation.
#   It is not a production LLM.
#   It will not generate good text unless trained on real text
#   for far longer with a larger model.
# ============================================================
 
 
# Python's built-in math module gives us basic calculator-like functions.
# We use it for square roots and constants inside formulas.
# A square root is the number that, multiplied by itself, gives the original number.
# Example: sqrt(9) = 3 because 3 * 3 = 9.
import math
 
 
# PyTorch is the main library that handles tensors and neural networks.
# A tensor is a container of numbers.
# It can be:
#   - one-dimensional, like a list: [1, 2, 3]
#   - two-dimensional, like a table
#   - three-dimensional or more, like stacked tables
# Neural networks store and process information as tensors.
import torch
 
 
# torch.nn contains reusable neural network building blocks.
# A neural network is a system of adjustable mathematical layers.
# During training, those layers change their internal numbers to reduce errors.
# We give torch.nn the shorter name "nn" to make later code easier to read.
import torch.nn as nn
 
 
# Dataset lets us define our own training data format.
# DataLoader helps feed that data to the model in small groups called batches.
# A batch is a small group of examples processed together.
# Batches make training faster and more stable than processing one example at a time.
from torch.utils.data import Dataset, DataLoader
 
 
# tiktoken is a real tokenizer library.
# A tokenizer turns text into token IDs.
# A token ID is a whole number representing a piece of text.
# A token can be a word, part of a word, punctuation, or spacing pattern.
import tiktoken
 
 
# ============================================================
# 1. TOKENIZER SETUP
# ============================================================
 
 
# Choose a GPT-2-style tokenizer.
# "gpt2" means we are using the token-splitting rules from GPT-2.
# This tokenizer uses a vocabulary with 50,257 possible token IDs.
# Vocabulary means the full set of tokens the tokenizer knows.
tokenizer = tiktoken.get_encoding("gpt2")
 
 
# Ask the tokenizer how many tokens exist in its vocabulary.
# This avoids hardcoding the vocabulary size incorrectly.
# Hardcoding means manually typing a value instead of reading it from the source.
VOCAB_SIZE = tokenizer.n_vocab
 
 
# ============================================================
# 2. MODEL CONFIGURATION
# ============================================================
 
 
# This dictionary is the control panel for the model.
# A dictionary stores named settings as key-value pairs.
# Example:
#   "emb_dim": 128
# means the setting named "emb_dim" has value 128.
GPT_CONFIG = {
 
    # Number of possible token IDs the model can read or predict.
    # This must match the tokenizer vocabulary size.
    # If it is smaller than the tokenizer's vocabulary, the embedding lookup fails
    # with an out-of-range index error as soon as a larger token ID appears.
    "vocab_size": VOCAB_SIZE,
 
    # Maximum number of tokens the model can look at at once.
    # This is called the context window or context length.
    # Context means the text the model can currently see.
    "context_length": 64,
 
    # Size of each token's internal vector.
    # A vector is a list of numbers.
    # The model does not understand raw text directly.
    # It turns each token ID into a vector of numbers first.
    "emb_dim": 128,
 
    # Number of attention heads.
    # Attention is the mechanism that lets tokens decide which earlier tokens matter.
    # A head is one separate attention pathway.
    # Multiple heads let the model examine relationships in several ways at once.
    "n_heads": 4,
 
    # Number of transformer blocks.
    # A transformer block is a repeated processing unit containing attention,
    # normalization, and a feed-forward network.
    "n_layers": 2,
 
    # Dropout rate.
    # Dropout is a training technique that randomly hides some internal values.
    # This helps reduce overfitting.
    # Overfitting means the model memorizes training examples too closely
    # instead of learning patterns that generalize.
    "drop_rate": 0.1,
 
    # Whether the query/key/value layers include bias terms.
    # A bias is an extra trainable number added after a linear calculation.
    # Bias can give a layer more flexibility.
    # False keeps this educational implementation simpler.
    "qkv_bias": False
}
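

# ------------------------------------------------------------
# Optional sanity check (added sketch, not part of the original listing).
# It estimates how many trainable numbers the settings above produce,
# based on the layer shapes defined later in this file. The helper name
# count_parameters_from_config is hypothetical.
# ------------------------------------------------------------
def count_parameters_from_config(cfg):

    emb = cfg["emb_dim"]

    # Token embedding table: one vector per vocabulary entry.
    tok_emb = cfg["vocab_size"] * emb

    # Positional embedding table: one vector per position in the context.
    pos_emb = cfg["context_length"] * emb

    # Attention: query, key, and value projections plus the output projection.
    qkv = 3 * (emb * emb + (emb if cfg["qkv_bias"] else 0))
    out_proj = emb * emb + emb

    # Feed-forward: expand to 4 * emb, then shrink back (both layers use bias).
    feed_forward = (emb * 4 * emb + 4 * emb) + (4 * emb * emb + emb)

    # Two LayerNorms per block, each with a scale vector and a shift vector.
    norms = 2 * (2 * emb)

    per_block = qkv + out_proj + feed_forward + norms

    # Final LayerNorm plus the output head (which has no bias).
    final_norm = 2 * emb
    out_head = emb * cfg["vocab_size"]

    return tok_emb + pos_emb + cfg["n_layers"] * per_block + final_norm + out_head


# Example with the values used in this file (50257 is the GPT-2 vocabulary size):
#   count_parameters_from_config({"vocab_size": 50257, "context_length": 64,
#       "emb_dim": 128, "n_heads": 4, "n_layers": 2, "drop_rate": 0.1,
#       "qkv_bias": False})
# gives 13,270,016. About 12.9 million of those sit in the two
# vocabulary-sized tables (token embeddings and output head), which is why
# such a small model is still dominated by its vocabulary.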
 
 
# ============================================================
# 3. TEXT ENCODING AND DECODING HELPERS
# ============================================================
 
 
# This function turns normal text into token IDs.
# Function means a reusable named block of code.
# Example:
#   "hello world" -> [31373, 995]
def text_to_token_ids(text, tokenizer):
 
    # tokenizer.encode converts a text string into a list of integer token IDs.
    # Integer means a whole number, such as 0, 1, 2, or 50256.
    encoded = tokenizer.encode(text)
 
    # Convert the Python list into a PyTorch tensor.
    # dtype means data type.
    # torch.long means the tensor stores whole numbers.
    # Token IDs must be whole numbers because they point to rows in an embedding table.
    tensor = torch.tensor(encoded, dtype=torch.long)
 
    # Add a batch dimension at the front.
    # Dimension means one axis of a tensor.
    # Shape changes from:
    #   [num_tokens]
    # to:
    #   [1, num_tokens]
    # The 1 means "one text example in this batch".
    tensor = tensor.unsqueeze(0)
 
    # Return the tensor so the model can use it.
    return tensor
 
 
# This function turns token IDs back into readable text.
# This is called decoding.
# Encoding means text -> token IDs.
# Decoding means token IDs -> text.
def token_ids_to_text(token_ids, tokenizer):
 
    # Remove the leading batch dimension if it has size 1.
    # squeeze(0) removes the first dimension only when its size is 1.
    # Example:
    #   [[31373, 995]] -> [31373, 995]
    flat = token_ids.squeeze(0)
 
    # Convert the tensor into a regular Python list.
    # This is needed because tokenizer.decode expects a normal list of token IDs.
    ids = flat.tolist()
 
    # tokenizer.decode converts token IDs back into text.
    text = tokenizer.decode(ids)
 
    # Return the decoded text.
    return text
 
 
# ============================================================
# 4. DATASET FOR NEXT-TOKEN PREDICTION
# ============================================================
 
 
# This class prepares training examples for the language model.
# A class is a blueprint for creating objects.
# Here, the object is a dataset that PyTorch can read from.
class GPTDatasetV1(Dataset):
 
    # This setup function runs when we create the dataset.
    # __init__ is called a constructor.
    # It prepares the object when it is first created.
    #
    # token_ids: full list of token IDs from the tokenizer.
    # max_length: how many tokens each input example contains.
    # stride: how far the window moves before making the next example.
    def __init__(self, token_ids, max_length, stride):
 
        # Store input chunks here.
        # A chunk is a smaller piece cut from the full token list.
        self.input_ids = []
 
        # Store target chunks here.
        # A target is the correct answer the model should learn to predict.
        self.target_ids = []
 
        # This loop slides across the token list.
        # A loop repeats code multiple times.
        #
        # It creates examples like:
        # input:  [10, 20, 30, 40]
        # target: [20, 30, 40, 50]
        #
        # The target is shifted one token forward.
        # This teaches the model next-token prediction.
        for i in range(0, len(token_ids) - max_length, stride):
 
            # Take max_length tokens starting at position i.
            # This is what the model sees.
            input_chunk = token_ids[i:i + max_length]
 
            # Take the same window shifted one position forward.
            # This is what the model should predict.
            target_chunk = token_ids[i + 1:i + max_length + 1]
 
            # Convert the input chunk into a PyTorch tensor of whole numbers.
            # append adds one item to the end of a Python list.
            self.input_ids.append(torch.tensor(input_chunk, dtype=torch.long))
 
            # Convert the target chunk into a PyTorch tensor of whole numbers.
            self.target_ids.append(torch.tensor(target_chunk, dtype=torch.long))
 
    # This tells PyTorch how many examples are in the dataset.
    # len(dataset) will call this method.
    def __len__(self):
 
        # The number of input chunks equals the number of target chunks.
        return len(self.input_ids)
 
    # This tells PyTorch how to fetch one example by index.
    # Index means position number in a list.
    # Example: index 0 means the first item.
    def __getitem__(self, idx):
 
        # Return one input-target pair.
        return self.input_ids[idx], self.target_ids[idx]
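

# ------------------------------------------------------------
# Optional illustration (added sketch, not part of the original listing).
# The hypothetical helper below repeats the sliding-window loop from
# GPTDatasetV1 on plain Python lists, so the effect of max_length and
# stride is easy to inspect without PyTorch.
# ------------------------------------------------------------
def preview_sliding_windows(token_ids, max_length, stride):

    # Collect (input, target) pairs here.
    pairs = []

    # Same loop bounds as in GPTDatasetV1.
    for i in range(0, len(token_ids) - max_length, stride):

        # What the model sees.
        input_chunk = token_ids[i:i + max_length]

        # What the model should predict: the same window, one token later.
        target_chunk = token_ids[i + 1:i + max_length + 1]

        pairs.append((input_chunk, target_chunk))

    return pairs


# Example with invented token IDs:
#   preview_sliding_windows([10, 20, 30, 40, 50, 60], max_length=4, stride=1)
# returns
#   [([10, 20, 30, 40], [20, 30, 40, 50]),
#    ([20, 30, 40, 50], [30, 40, 50, 60])]
# With stride=4 the windows no longer overlap, so only the first pair remains.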
 
 
# ============================================================
# 5. DATA LOADER CREATION
# ============================================================
 
 
# This function creates a DataLoader from raw text.
# Raw text means ordinary text before tokenization.
# The DataLoader feeds batches into the model during training.
def create_dataloader_v1(text, tokenizer, batch_size, max_length, stride, shuffle, drop_last):
 
    # Convert the full text into token IDs.
    # Token IDs are the numeric form of the text.
    token_ids = tokenizer.encode(text)
 
    # Make sure there are enough tokens to create at least one training example.
    # We need more tokens than max_length because the target is shifted by one token.
    if len(token_ids) <= max_length:
 
        # Stop with a clear error message instead of failing later in a confusing way.
        # raise means intentionally produce an error.
        # ValueError means the input value is not acceptable.
        raise ValueError(
            f"Text is too short: got {len(token_ids)} tokens, "
            f"but need more than {max_length} tokens."
        )
 
    # Create the dataset from the token IDs.
    dataset = GPTDatasetV1(
        token_ids=token_ids,
        max_length=max_length,
        stride=stride
    )
 
    # Wrap the dataset in a DataLoader.
    # batch_size controls how many examples are grouped together.
    # shuffle=True randomizes example order during training.
    # drop_last=True discards the final incomplete batch, if one exists.
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last
    )
 
    # Return the DataLoader.
    return dataloader
 
 
# ============================================================
# 6. LAYER NORMALIZATION
# ============================================================
 
 
# LayerNorm keeps internal values stable.
# Normalization means adjusting numbers so they have a more consistent scale.
# This helps training because very large or very tiny values can make learning unstable.
class LayerNorm(nn.Module):
 
    # emb_dim is the size of each token vector.
    def __init__(self, emb_dim):
 
        # Initialize PyTorch's base module.
        # super() lets this class use setup behavior from nn.Module.
        super().__init__()
 
        # Small safety value to avoid division by zero.
        # Dividing by zero would produce inf or NaN values that spread through the model.
        self.eps = 1e-5
 
        # Trainable scale values.
        # Trainable means the model can change these numbers during learning.
        # torch.ones creates a tensor filled with 1s.
        # Starting with 1 means scale initially leaves values unchanged.
        self.scale = nn.Parameter(torch.ones(emb_dim))
 
        # Trainable shift values.
        # torch.zeros creates a tensor filled with 0s.
        # Starting with 0 means shift initially leaves values unchanged.
        self.shift = nn.Parameter(torch.zeros(emb_dim))
 
    # This function defines what happens when data passes through LayerNorm.
    # forward is the standard PyTorch name for "run this layer".
    def forward(self, x):
 
        # Calculate the average value across the last dimension.
        # Mean means average.
        # dim=-1 means "use the last axis of the tensor".
        # keepdim=True keeps the tensor shape compatible for later arithmetic.
        mean = x.mean(dim=-1, keepdim=True)
 
        # Calculate how spread out the values are around the average.
        # This spread measurement is called variance.
        # Low variance means values are close together.
        # High variance means values are spread far apart.
        # unbiased=False divides by the number of values (n) rather than n - 1,
        # which is the variance used in layer normalization.
        var = x.var(dim=-1, keepdim=True, unbiased=False)
 
        # Center and rescale the values so they are more stable.
        # x - mean shifts the values so their average is near zero.
        # torch.sqrt takes the square root.
        # Dividing by sqrt(variance) makes the spread more consistent.
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
 
        # Apply the trainable scale and shift.
        # This lets the model undo or adjust the normalization if useful.
        return self.scale * norm_x + self.shift
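

# ------------------------------------------------------------
# Optional sanity check (added sketch, not part of the original listing).
# The hypothetical helper below applies the same formula as
# LayerNorm.forward (before the trainable scale and shift) and compares
# the result with PyTorch's built-in torch.nn.functional.layer_norm.
# The input numbers are invented.
# ------------------------------------------------------------
def check_layer_norm_formula(eps=1e-5):

    # Imported here so the snippet also runs on its own.
    import torch
    import torch.nn.functional as F

    # Two made-up token vectors with very different scales.
    x = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                      [10.0, 0.0, -10.0, 0.0]])

    # Manual normalization, mirroring the forward method above.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    manual = (x - mean) / torch.sqrt(var + eps)

    # PyTorch's built-in version without scale and shift.
    builtin = F.layer_norm(x, normalized_shape=(x.shape[-1],), eps=eps)

    # Return both results and whether they agree.
    return manual, builtin, torch.allclose(manual, builtin, atol=1e-6)


# After normalization, each row has an average of about 0 and a spread of
# about 1, no matter how large or small the original numbers were.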
 
 
# ============================================================
# 7. GELU ACTIVATION
# ============================================================
 
 
# GELU is a smooth activation function commonly used in GPT-style models.
# GELU stands for Gaussian Error Linear Unit.
#
# Activation function means:
#   A mathematical rule that changes numbers inside a neural network.
#   It helps the model learn complex patterns instead of only simple straight-line relationships.
#
# Smooth means:
#   The output changes gradually rather than with sharp cutoffs.
class GELU(nn.Module):
 
    # This function transforms incoming values.
    def forward(self, x):
 
        # This is the widely used tanh approximation of GELU.
        # Approximation means a formula that is very close to the original
        # but easier or faster to compute.
        #
        # tanh is a mathematical function that squashes values smoothly into the range -1 to 1.
        # x.pow(3) means x raised to the third power, also called x cubed.
        # math.pi is the number pi, approximately 3.14159.
        #
        # You do not need to memorize this formula.
        # Its purpose is to smoothly control how much signal passes forward.
        return 0.5 * x * (
            1.0 + torch.tanh(
                math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))
            )
        )
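

# ------------------------------------------------------------
# Optional illustration (added sketch, not part of the original listing).
# The hypothetical helper below compares the tanh approximation used in
# the GELU class with the exact GELU definition, which is based on the
# error function (math.erf), for a few hand-picked inputs.
# ------------------------------------------------------------
def compare_gelu_approximation(values=(-3.0, -1.0, 0.0, 1.0, 3.0)):

    # Imported here so the snippet also runs on its own.
    import math

    rows = []

    for x in values:

        # Exact GELU: x multiplied by the standard normal cumulative probability.
        exact = 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

        # Same tanh approximation as in GELU.forward above.
        approx = 0.5 * x * (
            1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3))
        )

        rows.append((x, exact, approx))

    return rows


# Reading the rows: strongly negative inputs come out close to 0,
# an input of 0 stays 0, and strongly positive inputs pass through
# almost unchanged. The two formulas agree to roughly three decimals.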
 
 
# ============================================================
# 8. FEED-FORWARD NETWORK
# ============================================================
 
 
# The feed-forward network processes each token after attention.
#
# Feed-forward means:
#   Information moves forward through layers, from input to output,
#   without looping backward inside this block.
#
# Attention gathers context from other tokens.
# Feed-forward processing transforms each token's context-enriched vector.
class FeedForward(nn.Module):
 
    # cfg is the model settings dictionary.
    def __init__(self, cfg):
 
        # Initialize PyTorch's base module.
        super().__init__()
 
        # Build a small neural network as a sequence of steps.
        # nn.Sequential runs each layer in order.
        self.layers = nn.Sequential(
 
            # First linear layer expands the token vector.
            #
            # Linear layer means:
            #   A trainable operation that multiplies inputs by learned weights
            #   and optionally adds learned bias values.
            #
            # Weight means:
            #   A trainable number that controls how strongly one value affects another.
            #
            # Example: 128 numbers become 512 numbers.
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
 
            # GELU adds flexible nonlinear behavior.
            # Nonlinear means it can learn curved or complex relationships,
            # not only straight-line relationships.
            GELU(),
 
            # Second linear layer shrinks the vector back to the original size.
            # This returns the token representation to emb_dim so it can continue
            # through the rest of the model.
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"])
        )
 
    # This function defines how data flows through the feed-forward network.
    def forward(self, x):
 
        # Pass x through the layers and return the result.
        return self.layers(x)
 
 
# ============================================================
# 9. MULTI-HEAD CAUSAL SELF-ATTENTION
# ============================================================
 
 
# Attention lets each token decide which earlier tokens matter.
#
# Self-attention means:
#   Tokens in the same sequence attend to other tokens in that same sequence.
#
# Causal means:
#   Each token can attend only to itself and earlier tokens, never to later ones.
#   This prevents cheating during next-token prediction.
#
# Multi-head means:
#   Several attention mechanisms run in parallel.
#   Each head can learn a different kind of relationship.
class MultiHeadAttention(nn.Module):
 
    # d_in: input vector size.
    # d_out: output vector size.
    # context_length: maximum number of tokens the model can see.
    # dropout: dropout rate.
    # num_heads: number of attention heads.
    # qkv_bias: whether query/key/value layers use bias.
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
 
        # Initialize PyTorch's base module.
        super().__init__()
 
        # Make sure the vector size can be split evenly across heads.
        # % means remainder after division.
        # Example:
        #   128 % 4 = 0, so 128 divides evenly into 4 heads.
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
 
        # Store total output size.
        self.d_out = d_out
 
        # Store number of attention heads.
        self.num_heads = num_heads
 
        # Store how many vector dimensions each head gets.
        # // means integer division.
        # Example:
        #   128 // 4 = 32
        self.head_dim = d_out // num_heads
 
        # Query layer.
        # A query means:
        #   "What information is this token looking for?"
        # Projection means:
        #   A learned transformation from one vector form into another.
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
 
        # Key layer.
        # A key means:
        #   "What kind of information does this token represent?"
        # Queries compare against keys to decide relevance.
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
 
        # Value layer.
        # A value means:
        #   "What actual information can this token pass along?"
        # Attention weights decide how much of each value to use.
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
 
        # Final projection layer after all heads are combined.
        # This mixes information from the different attention heads.
        self.out_proj = nn.Linear(d_out, d_out)
 
        # Dropout used on attention weights.
        # Attention weights are the percentages saying how much each token
        # should care about other tokens.
        self.dropout = nn.Dropout(dropout)
 
        # Create a causal mask.
        #
        # Mask means:
        #   A tensor used to block certain positions.
        #
        # torch.ones creates a square grid filled with 1s.
        # torch.triu keeps the upper triangular part of that grid.
        # Upper triangular means the part above the main diagonal.
        # diagonal=1 starts one step above the main diagonal, so each token
        # can still attend to itself.
        #
        # The upper triangle represents future tokens.
        # Those must be blocked in causal language modeling.
        mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
 
        # Store the mask inside the model.
        # register_buffer stores a tensor that belongs to the model
        # but is not trainable.
        #
        # Parameter = trainable model value.
        # Buffer = stored helper value that moves with the model to CPU/GPU.
        self.register_buffer("mask", mask)
 
    # This function runs attention.
    def forward(self, x):
 
        # Read the shape of the input.
        # Shape means the size of each tensor dimension.
        #
        # b = number of examples in the batch.
        # num_tokens = number of tokens in each example.
        # d_in = number of features per token.
        b, num_tokens, d_in = x.shape
 
        # Create queries from the input.
        queries = self.W_query(x)
 
        # Create keys from the input.
        keys = self.W_key(x)
 
        # Create values from the input.
        values = self.W_value(x)
 
        # Split queries into multiple attention heads.
        # view changes the tensor shape without changing its values.
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
 
        # Split keys into multiple attention heads.
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
 
        # Split values into multiple attention heads.
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
 
        # Move the heads dimension before the token dimension.
        # transpose swaps two tensor dimensions.
        # Shape becomes:
        #   batch, heads, tokens, head_dim
        queries = queries.transpose(1, 2)
 
        # Rearrange keys in the same way.
        keys = keys.transpose(1, 2)
 
        # Rearrange values in the same way.
        values = values.transpose(1, 2)
 
        # Compare queries with keys to get attention scores.
        #
        # @ means matrix multiplication.
        # Matrix multiplication is a structured way of combining rows and columns
        # of numbers to produce relationship scores.
        #
        # keys.transpose(2, 3) swaps the last two dimensions so queries can be
        # compared against keys.
        #
        # Higher score means stronger relationship.
        attn_scores = queries @ keys.transpose(2, 3)
 
        # Select the part of the mask needed for the current sequence length.
        # bool converts numbers into True/False values.
        # True positions are the positions we will block.
        mask_bool = self.mask[:num_tokens, :num_tokens].bool()
 
        # Replace future-token scores with negative infinity.
        #
        # Negative infinity means an extremely low value.
        # After softmax, these positions become zero probability.
        #
        # masked_fill_ modifies the tensor in place.
        # In place means it changes the existing tensor rather than making a new one.
        attn_scores.masked_fill_(mask_bool, -torch.inf)
 
        # Scale the scores by the square root of head_dim, then apply softmax.
        #
        # Dot products grow with the number of dimensions. Without scaling,
        # softmax would push almost all weight onto a single token, which
        # makes training less stable. Masked positions stay at negative infinity.
        #
        # softmax turns raw scores into probabilities.
        # Probability means a number between 0 and 1 representing relative likelihood.
        # The probabilities across the compared tokens add up to 1.
        attn_weights = torch.softmax(attn_scores / math.sqrt(self.head_dim), dim=-1)
 
        # Apply dropout to attention probabilities during training.
        attn_weights = self.dropout(attn_weights)
 
        # Use the attention probabilities to combine value vectors.
        #
        # This is a weighted sum.
        # Weighted sum means values with higher attention weights contribute more.
        #
        # The result is a context-aware token representation.
        context_vec = attn_weights @ values
 
        # Move token dimension back before head dimension.
        context_vec = context_vec.transpose(1, 2)
 
        # Merge all attention heads back into one vector per token.
        #
        # contiguous makes sure the tensor is stored in memory in a layout
        # that view can safely reshape.
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
 
        # Apply the final output projection.
        context_vec = self.out_proj(context_vec)
 
        # Return the attention output.
        return context_vec
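

# ------------------------------------------------------------
# Optional illustration (added sketch, not part of the original listing).
# The hypothetical helper below isolates the masking step from
# MultiHeadAttention.forward. Every raw score is set to 0, so any
# difference between the resulting weights comes from the causal mask alone.
# ------------------------------------------------------------
def preview_causal_attention_weights(num_tokens=4):

    # Imported here so the snippet also runs on its own.
    import torch

    # Equal raw scores for every pair of tokens.
    scores = torch.zeros(num_tokens, num_tokens)

    # True above the main diagonal: these are the future positions.
    mask = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1).bool()

    # Block future positions, then turn each row into probabilities.
    masked = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(masked, dim=-1)


# For num_tokens=4 the rows are (rounded):
#   [1.00, 0.00, 0.00, 0.00]   token 1 sees only itself
#   [0.50, 0.50, 0.00, 0.00]   token 2 sees tokens 1-2
#   [0.33, 0.33, 0.33, 0.00]   token 3 sees tokens 1-3
#   [0.25, 0.25, 0.25, 0.25]   token 4 sees tokens 1-4
# In the real model the raw scores differ, so the weights inside each row
# are uneven, but the zeros for future tokens stay exactly where they are.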
 
 
# ============================================================
# 10. TRANSFORMER BLOCK
# ============================================================
 
 
# One transformer block contains:
#   - layer normalization
#   - causal self-attention
#   - residual connection
#   - layer normalization
#   - feed-forward network
#   - residual connection
#
# Residual connection means:
#   Add the original input back after a transformation.
#   This helps information and learning signals flow through deep models.
class TransformerBlock(nn.Module):
 
    # cfg contains the model settings.
    def __init__(self, cfg):
 
        # Initialize PyTorch's base module.
        super().__init__()
 
        # Create the attention part of the block.
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            dropout=cfg["drop_rate"],
            num_heads=cfg["n_heads"],
            qkv_bias=cfg["qkv_bias"]
        )
 
        # Create the feed-forward part of the block.
        self.ff = FeedForward(cfg)
 
        # Create normalization before attention.
        self.norm1 = LayerNorm(cfg["emb_dim"])
 
        # Create normalization before feed-forward.
        self.norm2 = LayerNorm(cfg["emb_dim"])
 
        # Create dropout used before residual addition.
        self.drop_resid = nn.Dropout(cfg["drop_rate"])
 
    # This function defines how data moves through one transformer block.
    def forward(self, x):
 
        # Save the original input.
        # This saved copy is called a shortcut.
        # We will add it back after attention.
        shortcut = x
 
        # Normalize values before attention.
        x = self.norm1(x)
 
        # Run causal self-attention.
        x = self.att(x)
 
        # Apply dropout.
        x = self.drop_resid(x)
 
        # Add the original input back.
        # This is the residual connection.
        x = x + shortcut
 
        # Save the updated value before feed-forward processing.
        shortcut = x
 
        # Normalize before the feed-forward network.
        x = self.norm2(x)
 
        # Run the feed-forward network.
        x = self.ff(x)
 
        # Apply dropout.
        x = self.drop_resid(x)
 
        # Add the saved value back.
        # This is another residual connection.
        x = x + shortcut
 
        # Return the processed token representations.
        return x
 
 
# ============================================================
# 11. FULL GPT MODEL
# ============================================================
 
 
# This is the complete GPT-style model.
# It combines embeddings, transformer blocks, normalization, and output prediction.
class GPTModel(nn.Module):
 
    # cfg contains all important model settings.
    def __init__(self, cfg):
 
        # Initialize PyTorch's base module.
        super().__init__()
 
        # Token embedding table.
        #
        # Embedding means:
        #   A learned vector representation of a token.
        #
        # Embedding table means:
        #   A lookup table where each token ID points to one vector.
        #
        # Example:
        #   token ID 123 gets converted into 128 learned numbers.
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
 
        # Positional embedding table.
        #
        # Positional embedding means:
        #   A learned vector that tells the model where a token appears
        #   in the sequence.
        #
        # This matters because:
        #   "SEO needs content" and "Content needs SEO"
        #   use similar words but have different order and meaning.
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
 
        # Dropout after token and position embeddings are added.
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
 
        # Stack multiple transformer blocks.
        #
        # List comprehension means:
        #   A compact Python way to create a list.
        #
        # The * operator here unpacks the list into nn.Sequential.
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )
 
        # Final normalization before prediction.
        self.final_norm = LayerNorm(cfg["emb_dim"])
 
        # Output layer.
        #
        # This converts each token vector into scores for every possible token.
        # Those scores are called logits.
        #
        # Logits means:
        #   Raw prediction scores before converting them into probabilities.
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)
 
    # This function defines the model's forward pass.
    # Forward pass means:
    #   The process of sending input through the model to get predictions.
    def forward(self, in_idx):
 
        # Read input shape.
        # batch_size = number of text examples.
        # seq_len = number of tokens per example.
        batch_size, seq_len = in_idx.shape
 
        # Safety check.
        # The model cannot process more tokens than its context length.
        if seq_len > self.pos_emb.num_embeddings:
 
            # Raise a clear error if the input is too long.
            raise ValueError(
                f"Input sequence length {seq_len} exceeds model context length "
                f"{self.pos_emb.num_embeddings}."
            )
 
        # Convert token IDs into token embeddings.
        # Shape changes from:
        #   batch, tokens
        # to:
        #   batch, tokens, embedding_dimension
        tok_embeds = self.tok_emb(in_idx)
 
        # Create position IDs:
        #   0, 1, 2, ..., seq_len - 1
        #
        # arange creates a sequence of integers.
        # device=in_idx.device ensures the position IDs are created on the same
        # device as the input, such as CPU or GPU.
        pos_ids = torch.arange(seq_len, device=in_idx.device)
 
        # Convert position IDs into position embeddings.
        pos_embeds = self.pos_emb(pos_ids)
 
        # Add token meaning and token position together.
        # Broadcasting automatically applies the position embeddings across the batch.
        # Broadcasting means PyTorch expands compatible shapes during arithmetic.
        x = tok_embeds + pos_embeds
 
        # Apply dropout after embeddings.
        x = self.drop_emb(x)
 
        # Pass through all transformer blocks.
        x = self.trf_blocks(x)
 
        # Normalize final hidden states.
        #
        # Hidden state means:
        #   The model's internal representation after processing.
        x = self.final_norm(x)
 
        # Convert hidden states into raw vocabulary scores.
        # Each token position gets one score for every token in the vocabulary.
        logits = self.out_head(x)
 
        # Return raw prediction scores.
        return logits
 
 
# ============================================================
# 12. LOSS CALCULATION
# ============================================================
 
 
# This function calculates loss for one batch.
#
# Loss means:
#   A number that measures how wrong the model was.
#   Lower loss means better predictions.
def calc_loss_batch(input_batch, target_batch, model, device):
 
    # Move input data to CPU or GPU.
    # Device means the hardware where computation happens.
    input_batch = input_batch.to(device)
 
    # Move target data to CPU or GPU.
    target_batch = target_batch.to(device)
 
    # Run the model.
    # This returns logits, which are raw prediction scores.
    logits = model(input_batch)
 
    # Cross entropy compares predicted scores to correct target token IDs.
    #
    # Cross entropy means:
    #   A standard error measurement for classification problems.
    #   Here, each next-token prediction is a classification problem:
    #   choose one correct token from the vocabulary.
    #
    # Important:
    #   We do not apply softmax before cross_entropy.
    #   PyTorch's cross_entropy expects raw logits and handles the probability
    #   conversion internally.
    #
    # logits shape before flattening:
    #   batch, tokens, vocab_size
    #
    # target shape before flattening:
    #   batch, tokens
    #
    # flatten combines dimensions.
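    # flatten(0, 1) merges the batch and token dimensions, so logits become:
    #   batch * tokens, vocab_size
    # target_batch.flatten() becomes one list of batch * tokens token IDs.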
    loss = nn.functional.cross_entropy(
        logits.flatten(0, 1),
        target_batch.flatten()
    )
 
    # Return the loss tensor.
    return loss
 
 
# This function calculates average loss over a DataLoader.
# Average means the total divided by the number of items.
# We use it to evaluate training and validation loss.
@torch.no_grad()
def calc_loss_loader(data_loader, model, device, num_batches=None):
 
    # torch.no_grad means:
    #   Do not track gradients inside this function.
    #
    # Gradient means:
    #   A learning signal showing how a model parameter should change
    #   to reduce loss.
    #
    # We turn gradients off during evaluation because we are measuring,
    # not training.
 
    # If the loader has no batches, return NaN.
    # NaN means "not a number".
    if len(data_loader) == 0:
        return float("nan")
 
    # Start total loss at zero.
    total_loss = 0.0
 
    # Use all batches unless a smaller number is requested.
    if num_batches is None:
        num_batches = len(data_loader)
 
    # Do not evaluate more batches than exist.
    else:
        num_batches = min(num_batches, len(data_loader))
 
    # Loop over batches.
    # enumerate gives both the batch number and the batch data.
    for i, (input_batch, target_batch) in enumerate(data_loader):
 
        # Stop after the requested number of batches.
        if i >= num_batches:
            break
 
        # Compute batch loss.
        loss = calc_loss_batch(input_batch, target_batch, model, device)
 
        # Add the loss to the running total as a plain Python number.
        # item() extracts a regular number from a one-value tensor.
        total_loss += loss.item()
 
    # Return average loss.
    return total_loss / num_batches
 
 
# ============================================================
# 13. TEXT GENERATION
# ============================================================
 
 
# This function generates token IDs one at a time.
#
# Greedy decoding means:
#   Always pick the highest-scoring next token.
#
# This is simple and deterministic, but the output can be repetitive.
# More advanced systems often use sampling methods such as temperature,
# top-k, or top-p sampling.
@torch.no_grad()
def generate_text_simple(model, idx, max_new_tokens, context_size):
 
    # Generate one token per loop.
    for _ in range(max_new_tokens):
 
        # Keep only the most recent tokens that fit in the context window.
        # The colon syntax selects parts of a tensor.
        # idx[:, -context_size:] means:
        #   all batch rows,
        #   only the last context_size tokens.
        idx_cond = idx[:, -context_size:]
 
        # Run the model on the current context.
        logits = model(idx_cond)
 
        # Keep only the final position's prediction.
        # That final position predicts the next token.
        logits = logits[:, -1, :]
 
        # Pick the token ID with the highest score.
        #
        # argmax means:
        #   Return the index of the largest value.
        #
        # dim=-1 means:
        #   Search across the vocabulary-score dimension.
        #
        # keepdim=True keeps the output shape compatible for concatenation.
        idx_next = torch.argmax(logits, dim=-1, keepdim=True)
 
        # Add the predicted token to the sequence.
        #
        # cat means concatenate, or join together.
        # dim=1 means join along the token-sequence dimension.
        idx = torch.cat((idx, idx_next), dim=1)
 
    # Return the full token sequence.
    return idx
 
 
# ============================================================
# 14. TRAINING LOOP
# ============================================================
 
 
# This function trains the model.
#
# Training means:
#   Repeatedly making predictions, measuring errors, and updating
#   the model's trainable numbers to reduce future errors.
def train_model_simple(model, train_loader, val_loader, optimizer, device,
                       num_epochs, eval_freq, eval_iter):
 
    # Store training loss values over time.
    train_losses = []
 
    # Store validation loss values over time.
    #
    # Validation data means:
    #   Data not used for model updates.
    #   It helps estimate whether the model generalizes beyond training examples.
    val_losses = []
 
    # Store how many tokens the model has seen over time.
    track_tokens_seen = []
 
    # Count total tokens processed.
    tokens_seen = 0
 
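    # Count optimizer update steps.
    # Starting at -1 makes the first update step 0, so the first evaluation
    # runs right after the first batch.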
    # Optimizer means:
    #   The algorithm that changes model parameters to reduce loss.
    global_step = -1
 
    # Loop over full passes through the training data.
    #
    # Epoch means:
    #   One full pass over the training dataset.
    for epoch in range(num_epochs):
 
        # Put model in training mode.
        # This enables training-specific behavior such as dropout.
        model.train()
 
        # Loop over training batches.
        for input_batch, target_batch in train_loader:
 
            # Clear old gradients.
            #
            # PyTorch accumulates gradients by default.
            # Accumulates means new values are added to old values.
            # We clear them so each update uses only the current batch.
            optimizer.zero_grad()
 
            # Calculate current batch loss.
            loss = calc_loss_batch(input_batch, target_batch, model, device)
 
            # Compute gradients through backpropagation.
            #
            # Backpropagation means:
            #   The method used to calculate how each trainable parameter
            #   contributed to the error.
            loss.backward()
 
            # Update model weights.
            #
            # Weight means:
            #   A trainable number inside the model.
            #
            # optimizer.step() changes those numbers using the gradients.
            optimizer.step()
 
            # Count how many token IDs were processed in this batch.
            #
            # numel means "number of elements".
            # For a tensor shaped [8, 64], numel is 512.
            tokens_seen += input_batch.numel()
 
            # Increase training step counter.
            global_step += 1
 
            # Evaluate every eval_freq steps.
            #
            # % means remainder after division.
            # If global_step % eval_freq == 0, this step is an evaluation step.
            if global_step % eval_freq == 0:
 
                # Switch to evaluation mode.
                # This disables dropout.
                model.eval()
 
                # Calculate training loss on a small number of batches.
                train_loss = calc_loss_loader(
                    train_loader,
                    model,
                    device,
                    num_batches=eval_iter
                )
 
                # Calculate validation loss on a small number of batches.
                val_loss = calc_loss_loader(
                    val_loader,
                    model,
                    device,
                    num_batches=eval_iter
                )
 
                # Store training loss.
                train_losses.append(train_loss)
 
                # Store validation loss.
                val_losses.append(val_loss)
 
                # Store token count.
                track_tokens_seen.append(tokens_seen)
 
                # Print progress.
                #
                # f-string means formatted string.
                # It lets us insert variable values into text.
                #
                # :.3f means show a decimal number with 3 digits after the decimal point.
                print(
                    f"Epoch {epoch + 1}, "
                    f"Step {global_step}, "
                    f"Train loss {train_loss:.3f}, "
                    f"Val loss {val_loss:.3f}"
                )
 
                # Switch back to training mode.
                model.train()
 
    # Return training history.
    return train_losses, val_losses, track_tokens_seen
 
 
# ============================================================
# 15. MAIN SCRIPT
# ============================================================
 
 
# This block runs only when this file is executed directly.
#
# __name__ is a special Python variable.
# "__main__" means this file is being run as the main program,
# not imported by another file.
if __name__ == "__main__":
 
    # Set the random seed.
    #
    # Random seed means:
    #   A starting value for random number generation.
    #
    # This makes the random parts more repeatable.
    # Repeatable does not always mean identical on every machine,
    # but it reduces unnecessary variation.
    torch.manual_seed(123)
 
    # Use GPU if available; otherwise use CPU.
    #
    # GPU means Graphics Processing Unit.
    # GPUs are often faster than CPUs for neural network training.
    #
    # CUDA is NVIDIA's system for running computations on NVIDIA GPUs.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 
    # Print which device is being used.
    print("Using device:", device)
 
    # This is a tiny demo text.
    # It exists only to prove the tokenizer + model + training loop work together.
    # A real model needs much more text.
    raw_text = (
        "Search engines crawl pages, index content, and rank documents. "
        "Good SEO depends on technical health, useful content, internal links, "
        "clear structure, and satisfying search intent. "
        "Language models learn by predicting the next token from previous tokens. "
    ) * 300
 
    # Encode the full demo text into token IDs.
    # Encoding means converting text into numeric token IDs.
    all_token_ids = tokenizer.encode(raw_text)
 
    # Print the number of tokens created by the tokenizer.
    print("Total tokens:", len(all_token_ids))
 
    # Split text at the character level for this simple demo.
    #
    # Character level means:
    #   We split the raw string by character position, not by document boundary.
    #
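    # For serious training, split documents carefully to avoid leakage.
    # Leakage means validation data accidentally contains training-like content,
    # making evaluation look better than it really is.
    # This demo text is one passage repeated 300 times, so the validation split
    # repeats training content and its loss says little about generalization.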
    split_idx = int(0.9 * len(raw_text))
 
    # First 90 percent of text becomes training text.
    train_text = raw_text[:split_idx]
 
    # Last 10 percent becomes validation text.
    val_text = raw_text[split_idx:]
 
    # Create the training DataLoader.
    train_loader = create_dataloader_v1(
        text=train_text,
        tokenizer=tokenizer,
        batch_size=8,
        max_length=GPT_CONFIG["context_length"],
        stride=GPT_CONFIG["context_length"] // 2,
        shuffle=True,
        drop_last=True
    )
 
    # Create the validation DataLoader.
    val_loader = create_dataloader_v1(
        text=val_text,
        tokenizer=tokenizer,
        batch_size=8,
        max_length=GPT_CONFIG["context_length"],
        stride=GPT_CONFIG["context_length"] // 2,
        shuffle=False,
        drop_last=False
    )
 
    # Create the GPT-style model.
    # At this moment, the model starts with random trainable values.
    # It does not yet know language.
    model = GPTModel(GPT_CONFIG)
 
    # Move the model to CPU or GPU.
    model.to(device)
 
    # Create the optimizer.
    #
    # AdamW is an optimizer commonly used for transformer training.
    # It decides how to update the model's trainable numbers after each batch.
    #
    # lr means learning rate.
    # Learning rate controls how large each update step is.
    #
    # weight_decay is a regularization technique.
    # Regularization means any method that helps reduce overfitting.
    # Weight decay gently discourages trainable weights from becoming too large.
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=4e-4,
        weight_decay=0.1
    )
 
    # Train the model for a few demo epochs.
    # This is intentionally small and will not produce a strong language model.
    train_losses, val_losses, tokens_seen = train_model_simple(
        model=model,
        train_loader=train_loader,
        val_loader=val_loader,
        optimizer=optimizer,
        device=device,
        num_epochs=3,
        eval_freq=10,
        eval_iter=2
    )
 
    # Put the model in evaluation mode before generating.
    # Evaluation mode disables dropout.
    model.eval()
 
    # Create a real text prompt.
    # Prompt means the starting text given to the model before generation.
    start_text = "Search engines"
 
    # Convert the prompt into token IDs.
    start_context = text_to_token_ids(start_text, tokenizer).to(device)
 
    # Generate new token IDs.
    generated_token_ids = generate_text_simple(
        model=model,
        idx=start_context,
        max_new_tokens=30,
        context_size=GPT_CONFIG["context_length"]
    )
 
    # Decode generated token IDs back into text.
    # .cpu() moves the tensor back to CPU memory before decoding.
    generated_text = token_ids_to_text(generated_token_ids.cpu(), tokenizer)
 
    # Print the generated text.
    print("\nGenerated text:")
    print(generated_text)
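
The generation function above always takes the highest-scoring token, and its comments mention temperature and top-k sampling as alternatives. Here is a minimal sketch of those two ideas in plain Python, without PyTorch, so the arithmetic stays visible. The helper name `sample_next_token`, the choice to treat temperature 0 as greedy decoding, and the toy logits are all assumptions made for illustration. None of them are part of the script above.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtract the largest score first for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Hypothetical helper: pick a token ID from one position's logits."""
    # Assumption for this sketch: temperature 0 means plain greedy decoding.
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Top-k: keep only the k highest-scoring token IDs.
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]

    # Dividing by temperature flattens (>1) or sharpens (<1) the distribution.
    scaled = [logits[i] / temperature for i in candidates]
    probs = softmax(scaled)

    # Draw one candidate in proportion to its probability.
    return rng.choices(candidates, weights=probs, k=1)[0]


# Toy logits for a 5-token vocabulary (made-up numbers).
toy_logits = [2.0, 0.5, 1.0, -1.0, 3.0]

greedy_id = sample_next_token(toy_logits, temperature=0)
sampled_ids = [
    sample_next_token(toy_logits, temperature=0.8, top_k=3, rng=random.Random(seed))
    for seed in range(20)
]
print("Greedy pick:", greedy_id)
print("Sampled picks:", sampled_ids)
```

With `top_k=3`, only the three highest-scoring token IDs can ever be drawn, which is why sampled output varies without wandering into very unlikely tokens. In a PyTorch version, the same steps would operate on the `logits[:, -1, :]` tensor instead of a Python list.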

Glossary of Complex Terms

Term	Plain-English meaning
Activation function	A mathematical rule that changes numbers inside a neural network so it can learn complex, nonlinear patterns.
AdamW	An optimizer commonly used for transformer models. It updates trainable model values and includes weight decay.
Argmax	An operation that returns the position of the largest value. In generation, it selects the highest-scoring next token.
Attention	A mechanism that lets each token decide which other tokens are most relevant to it.
Attention head	One separate attention pathway. Multiple heads let the model look at relationships in different ways at the same time.
Attention score	A raw number measuring how strongly one token should pay attention to another token.
Backpropagation	The method used to calculate how each trainable model value contributed to the error.
Batch	A group of training examples processed together.
Bias	An extra trainable number added after a linear calculation.
Broadcasting	Automatic expansion of compatible tensor shapes during arithmetic.
Buffer	A tensor stored inside the model that is not trainable, such as the attention mask.
Causal attention	Attention that only allows tokens to look at current and previous positions, not future positions.
Context length	The maximum number of tokens the model can process at once.
Cross entropy	A standard error measurement for classification tasks, used here to measure next-token prediction error.
CUDA	NVIDIA technology that allows PyTorch to use NVIDIA GPUs for computation.
DataLoader	A PyTorch tool that groups dataset examples into batches and feeds them to the model.
Dataset	A structured collection of examples that PyTorch can read from.
Decoding	Converting token IDs back into text.
Dimension	One axis of a tensor, such as batch, token position, or embedding feature.
Dropout	A training method that randomly hides some internal values to reduce overfitting.
Embedding	A learned vector representation of a token.
Embedding dimension	The number of values in each token vector.
Embedding table	A lookup table where each token ID points to a learned vector.
Encoding	Converting text into token IDs.
Epoch	One full pass through the training data.
Evaluation	Measuring model performance without updating model weights.
Feed-forward network	A neural network block that processes each token vector after attention.
Forward pass	Sending input through the model to get predictions.
GELU	Gaussian Error Linear Unit, a smooth activation function used in GPT-style models.
Gradient	A learning signal showing how a trainable value should change to reduce loss.
Greedy decoding	Generation method that always selects the highest-scoring next token.
Hidden state	The model's internal representation after processing.
Key	In attention, the vector used to describe what kind of information a token represents.
Layer normalization	A method that keeps internal values on a stable scale during training.
Learning rate	A setting that controls how large each model update step is.
Linear layer	A trainable layer that mixes input numbers using learned weights and optional biases.
Logits	Raw prediction scores before converting them into probabilities.
Loss	A number measuring how wrong the model's prediction was.
Mask	A tensor used to block certain token positions, such as future tokens.
Matrix multiplication	A structured way to combine rows and columns of numbers, used to compare queries and keys.
Mean	The average of a set of numbers.
Multi-head attention	Several attention mechanisms running in parallel.
NaN	Not a Number, used when a numeric result is invalid or undefined.
Neural network	A system of trainable mathematical layers that learns from examples.
Normalization	Adjusting numbers so their scale is more stable.
Numerical stability	Keeping calculations in ranges that computers can handle reliably.
Optimizer	The algorithm that updates model parameters to reduce loss.
Overfitting	When a model memorizes training examples too closely and generalizes poorly.
Parameter	A trainable value inside the model.
Positional embedding	A learned vector that tells the model where a token appears in the sequence.
Probability	A number between 0 and 1 representing relative likelihood.
Projection	A learned transformation from one vector representation to another.
Prompt	The starting text given to the model before generation.
Query	In attention, the vector describing what information a token is looking for.
Residual connection	Adding the original input back after a transformation to improve information flow.
Self-attention	Attention where tokens in one sequence attend to other tokens in the same sequence.
Shape	The size of each dimension in a tensor.
Softmax	A function that turns raw scores into probabilities that add up to 1.
Stride	How far the dataset window moves before creating the next training example.
Tensor	A container of numbers used by PyTorch. It can behave like a list, table, or higher-dimensional block.
Token	A piece of text represented by a number. It may be a word, word part, punctuation, or spacing pattern.
Token ID	The integer number representing a token.
Tokenizer	Software that converts text into token IDs and token IDs back into text.
Training	Repeatedly predicting, measuring error, and updating model values to reduce future error.
Transformer block	A processing block containing attention, normalization, feed-forward layers, and residual connections.
Validation data	Data used to measure performance but not used for model updates.
Value	In attention, the vector containing the actual information a token can pass along.
Variance	A measure of how spread out numbers are around their average.
Vector	A list of numbers representing something, such as a token.
Vocabulary	The full set of tokens the tokenizer knows.
Weight	A trainable number inside a neural network.
Weight decay	A regularization technique that discourages weights from becoming too large.
Weighted sum	A combination where values with larger weights contribute more to the result.
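
Several glossary entries (logits, softmax, probability, argmax, cross entropy) describe steps of one short calculation. The worked sketch below shows how they connect for a single prediction. It uses plain Python and made-up scores for a hypothetical four-token vocabulary, so the numbers are illustrative only.

```python
import math

# Made-up logits for a hypothetical 4-token vocabulary.
logits = [2.0, 1.0, 0.1, -1.0]

# Softmax: exponentiate each score, then divide by the total.
exps = [math.exp(score) for score in logits]
probs = [e / sum(exps) for e in exps]

# Argmax: the index of the largest score (what greedy decoding picks).
predicted_id = max(range(len(logits)), key=lambda i: logits[i])

# Cross entropy for one prediction: the negative log of the probability
# the model assigned to the correct token.
target_id = 1
loss = -math.log(probs[target_id])

# A model guessing uniformly over 4 tokens would score log(4).
uniform_loss = math.log(len(logits))

print("Probabilities:", [round(p, 3) for p in probs])
print("Predicted token ID:", predicted_id)
print("Loss when token 1 is correct:", round(loss, 3))
print("Loss of a uniform guess:", round(uniform_loss, 3))
```

Here the model gives the correct token less than a 25 percent chance, so its loss comes out higher than a uniform guess would score. Training pushes that probability up, which pushes the loss down.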

Before We Wrap Up

This wasn’t meant to make you a machine learning engineer. It was meant to make the thing less opaque. Tokens, attention, weights, and loss each have a concrete meaning, and you’ve now seen where each piece lives in actual code.

Sebastian Raschka’s book, Build a Large Language Model (From Scratch), goes much deeper if you want to keep pulling the thread. The code here is a starting point, not the full picture.