Notebook

How GPT Models Generate Text: A Step-by-Step Breakdown

February 04, 2025

Understanding Text Generation in GPT Models: A Deep Dive

In this post, we’ll explore how GPT models generate text step by step. By breaking down how tokens are processed, passed through the model, and mapped back to the vocabulary, we’ll answer some key questions about how GPT works under the hood.

How GPT Generates Text

Text generation in GPT models is an autoregressive process: the model generates one token at a time, each conditioned on all of the tokens that came before it. Here’s a simplified function for greedy text generation:

import torch

def generate_text_simple(model, idx, max_new_tokens, context_size):
    for _ in range(max_new_tokens):
        # Trim the input to fit the model's context window
        idx_cond = idx[:, -context_size:]
        
        # Pass through the model (disable gradient calculation)
        with torch.no_grad():
            logits = model(idx_cond)
        
        # Focus on the last time step
        logits = logits[:, -1, :]  # Shape: (batch_size, vocab_size)
        
        # Convert logits to probabilities
        probas = torch.softmax(logits, dim=-1)
        
        # Pick the most likely next token (greedy decoding)
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)
        
        # Append the new token to the input sequence
        idx = torch.cat((idx, idx_next), dim=1)
    
    return idx

What’s Happening Here?

This function generates new tokens one by one using greedy decoding, where we always choose the most likely token at each step. Let’s break it down further:

  1. Trimming the Input Context:

    idx_cond = idx[:, -context_size:]

    GPT has a fixed context window. If the input sequence exceeds this size, we keep only the most recent tokens.

  2. Passing the Input Through the Model:

    logits = model(idx_cond)

    The model processes the input and outputs a vector of logits for every position in the input sequence. Logits are raw, unnormalized scores that we’ll later convert to probabilities.

  3. Extracting the Last Token’s Logits:

    logits = logits[:, -1, :]  # Shape: (batch_size, vocab_size)

    Since we’re generating text autoregressively (one token at a time), we only care about the logits for the last time step.

  4. Converting Logits to Probabilities:

    probas = torch.softmax(logits, dim=-1)

    The softmax function normalizes the logits into a probability distribution over the entire vocabulary. (For greedy decoding this step is technically optional, since softmax is monotonic and doesn’t change which token has the highest score, but it’s needed when sampling.)

  5. Selecting the Most Likely Token:

    idx_next = torch.argmax(probas, dim=-1, keepdim=True)

    This line picks the token with the highest probability (greedy decoding). The token ID is appended to the input sequence.
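
To see the function end to end, here’s a minimal usage sketch. It assumes a GPT-2-style model that returns logits of shape (batch_size, seq_len, vocab_size) and uses the tiktoken GPT-2 tokenizer; the model variable, the prompt, and the 1024-token context size are illustrative placeholders, not part of the original function.

import tiktoken
import torch

tokenizer = tiktoken.get_encoding("gpt2")

# Encode the prompt into token IDs and add a batch dimension
prompt = "Hello, I am"
idx = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)  # Shape: (1, seq_len)

model.eval()  # Assumed: a GPT-style model returning (batch, seq_len, vocab_size) logits
out = generate_text_simple(model, idx, max_new_tokens=10, context_size=1024)

# Decode the generated token IDs back into text
print(tokenizer.decode(out[0].tolist()))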


Tokens, Embeddings, and Vocabulary: How It All Fits Together

Step 1: From Token to Embedding

When we feed text into GPT, it’s first tokenized into subword units, and each unit is mapped to an integer ID (e.g., with the GPT-2 tokenizer, “Hello world!” → [15496, 995, 0]). These token IDs are then mapped to embeddings, where each embedding is a dense vector of size d_model (e.g., 768 for GPT-2 small).
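
Here’s a hedged sketch of this step. The tokenizer is the real tiktoken GPT-2 encoding, but the embedding tables are randomly initialized stand-ins for GPT-2’s learned weights:

import tiktoken
import torch
import torch.nn as nn

tokenizer = tiktoken.get_encoding("gpt2")
token_ids = torch.tensor(tokenizer.encode("Hello world!")).unsqueeze(0)  # Shape: (1, seq_len)

# Randomly initialized tables standing in for GPT-2's learned embeddings
vocab_size, d_model = 50257, 768
tok_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(1024, d_model)  # GPT-2 also adds learned positional embeddings

x = tok_emb(token_ids) + pos_emb(torch.arange(token_ids.shape[1]))
print(x.shape)  # torch.Size([1, 3, 768]) -- one 768-dim vector per token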

Step 2: Passing Embeddings Through Transformer Layers

The embeddings are passed through multiple transformer layers, which apply self-attention, feedforward transformations, and normalization. This enriches each token’s embedding with contextual information from the entire input sequence.
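
Here’s a minimal sketch of one such block, written with PyTorch’s built-in nn.MultiheadAttention as a stand-in for GPT’s actual attention implementation. The causal mask is what makes the model autoregressive: each position can only attend to earlier positions.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Causal mask: position i may only attend to positions <= i
        seq_len = x.shape[1]
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

        # Self-attention with a residual connection (pre-norm, as in GPT-2)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out

        # Position-wise feedforward with a residual connection
        x = x + self.ff(self.ln2(x))
        return x

GPT-2 small stacks 12 of these blocks, each refining every token’s representation using information from the positions before it.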

Step 3: Mapping Hidden States to Logits

After the final transformer layer, each position’s hidden state is projected by a linear output layer to a logits vector of size vocab_size. Each element in this vector represents the model’s score for a particular token in the vocabulary.
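
In GPT-2 this projection is a single linear layer (the language-model head) whose weights are tied to the token embedding matrix. A minimal sketch, continuing the illustrative dimensions from above:

import torch
import torch.nn as nn

d_model, vocab_size = 768, 50257
lm_head = nn.Linear(d_model, vocab_size, bias=False)  # Stand-in for GPT-2's output projection

hidden_states = torch.randn(1, 3, d_model)  # (batch_size, seq_len, d_model) from the last block
logits = lm_head(hidden_states)
print(logits.shape)  # torch.Size([1, 3, 50257]) -- one score per vocabulary token, per position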

Step 4: Converting Logits to Tokens

To generate the next token:

  1. Softmax converts logits to probabilities.
  2. We pick the most probable token (greedy decoding), or sample from the distribution using techniques like top-k or nucleus sampling (see the sketch below).
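
As a hedged sketch of the sampling alternative mentioned above, here is one way to replace the argmax with top-k sampling; the value k=50 is just an illustrative default.

import torch

def sample_top_k(logits, k=50):
    # Keep only the k highest-scoring tokens and drop the rest
    top_logits, top_indices = torch.topk(logits, k, dim=-1)
    probas = torch.softmax(top_logits, dim=-1)

    # Sample one token from the truncated distribution
    sampled = torch.multinomial(probas, num_samples=1)  # Index within the top-k set
    return torch.gather(top_indices, -1, sampled)       # Map back to a vocabulary token ID

In generate_text_simple, this would replace the softmax/argmax pair: idx_next = sample_top_k(logits).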

Why Do We Pick a Token Instead of Directly Converting the Embedding?

This is a key insight! The model doesn’t directly output a token; it outputs a high-dimensional vector of logits with a score for every token in the vocabulary. By selecting the token with the highest score (or sampling from the distribution), we pick the token the model considers most likely to come next given the context.


Understanding the Vocabulary and Decoding Process

The vocabulary is implicitly referenced throughout the process:

  • Input tokens (idx) are indices that map to specific vocabulary tokens.
  • Model outputs logits that represent scores for every token in the vocabulary.
  • The final decoding step converts logits to token IDs, which can then be converted back to text using a tokenizer (see the short round-trip example below).
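
A short round-trip example using the tiktoken GPT-2 tokenizer (the sample text is just an illustration):

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

token_ids = tokenizer.encode("Hello world!")
print(token_ids)                    # A list of integer IDs into the 50,257-token vocabulary
print(tokenizer.decode(token_ids))  # "Hello world!"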

Visualization: Putting It All Together

Here’s a quick visualization of the key steps:

Input Text → Token IDs → Embeddings → Transformer Layers → Logits → Probabilities → Next Token

Key Takeaways

  1. GPT generates text autoregressively, one token at a time.
  2. Embeddings are processed through transformer layers, which enrich them with contextual meaning.
  3. Logits are converted to probabilities with softmax and then mapped to tokens by a decoding method (like greedy decoding or sampling).
  4. The vocabulary is referenced implicitly through token IDs, logits, and final decoding steps.

By understanding this flow, you’ll gain a deeper appreciation for how GPT models generate coherent and contextually rich text. Whether you’re building your own transformer-based model or fine-tuning GPT, these fundamentals are essential.


Next Steps

Want to experiment? Try modifying the text generation function to use top-k sampling, temperature scaling, or beam search to explore different decoding strategies!
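
As a starting point, here’s one possible variant with temperature scaling and multinomial sampling. The temperature value is only an illustrative default, and the top-k sketch from earlier could be dropped in as well:

import torch

def generate_text_sampled(model, idx, max_new_tokens, context_size, temperature=0.8):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)

        # Temperature < 1 sharpens the distribution; > 1 flattens it
        logits = logits[:, -1, :] / temperature
        probas = torch.softmax(logits, dim=-1)

        # Sample from the distribution instead of taking the argmax
        idx_next = torch.multinomial(probas, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)
    return idx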


Written by Spenser Filler who thinks.. too much sometimes