<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Understanding Attention Mechanisms in LLMs</title>
<script src="https://cdn.tailwindcss.com"></script>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
<style>
.code-block {
background-color: #2d2d2d;
color: #f8f8f2;
padding: 1rem;
border-radius: 0.5rem;
font-family: 'Courier New', Courier, monospace;
overflow-x: auto;
}
.attention-visual {
display: flex;
justify-content: center;
margin: 2rem 0;
}
.attention-node {
width: 60px;
height: 60px;
border-radius: 50%;
display: flex;
align-items: center;
justify-content: center;
font-weight: bold;
position: relative;
}
.attention-line {
position: absolute;
background-color: rgba(59, 130, 246, 0.5);
transform-origin: left center;
}
.explanation-box {
background-color: #f0f9ff;
border-left: 4px solid #3b82f6;
padding: 1rem;
margin: 1rem 0;
border-radius: 0 0.5rem 0.5rem 0;
}
.citation {
background-color: #f8fafc;
padding: 0.5rem;
margin: 0.5rem 0;
border-left: 3px solid #94a3b8;
}
</style>
</head>
<body class="bg-gray-50">
<div class="max-w-4xl mx-auto px-4 py-8">
<header class="text-center mb-12">
<h1 class="text-4xl font-bold text-blue-800 mb-4">Attention Mechanisms in Large Language Models</h1>
<p class="text-xl text-gray-600">Understanding the core innovation behind modern AI language models</p>
<div class="mt-6">
<span class="inline-block bg-blue-100 text-blue-800 px-3 py-1 rounded-full text-sm font-medium">Machine Learning</span>
<span class="inline-block bg-purple-100 text-purple-800 px-3 py-1 rounded-full text-sm font-medium ml-2">Natural Language Processing</span>
<span class="inline-block bg-green-100 text-green-800 px-3 py-1 rounded-full text-sm font-medium ml-2">Deep Learning</span>
</div>
</header>
<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Introduction to Attention</h2>
<p class="text-gray-700 mb-4">
The attention mechanism is a fundamental component of modern transformer-based language models like GPT, BERT, and others.
It allows models to dynamically focus on different parts of the input sequence when producing each part of the output sequence.
</p>
<p class="text-gray-700 mb-6">
Unlike recurrent sequence models, which must compress everything seen so far into a fixed-size hidden state, attention mechanisms enable models to learn which parts of the input are most relevant at each step of processing.
</p>
<div class="attention-visual">
<div class="flex flex-col items-center">
<div class="flex space-x-8 mb-8">
<div class="attention-node bg-blue-100 text-blue-800">Input</div>
<div class="attention-node bg-purple-100 text-purple-800">Q</div>
<div class="attention-node bg-green-100 text-green-800">K</div>
<div class="attention-node bg-yellow-100 text-yellow-800">V</div>
</div>
<div class="attention-node bg-red-100 text-red-800">Output</div>
</div>
</div>
<div class="explanation-box">
<h3 class="font-semibold text-lg text-blue-800 mb-2">Key Insight</h3>
<p>
Attention mechanisms compute a weighted sum of values (V), where the weights are determined by the compatibility between queries (Q) and keys (K).
This allows the model to focus on different parts of the input sequence dynamically.
</p>
</div>
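<p class="text-gray-700 mt-6 mb-4">
To make this concrete, here is a minimal sketch of attention as a weighted sum, written in PyTorch. The 3-token sequence and 4-dimensional vectors are illustrative toy values, not taken from any real model:
</p>
<div class="code-block">
<pre># Minimal sketch: attention output as a weighted sum of values.
# The 3-token sequence and 4-dim vectors are illustrative only.
import torch

torch.manual_seed(0)
Q = torch.randn(3, 4)  # one query vector per position
K = torch.randn(3, 4)  # one key vector per position
V = torch.randn(3, 4)  # one value vector per position

scores = Q @ K.T / (4 ** 0.5)            # query-key compatibility
weights = torch.softmax(scores, dim=-1)  # each row sums to 1
output = weights @ V                     # weighted sum of value vectors

print(weights.sum(dim=-1))  # tensor([1., 1., 1.])</pre>
</div>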
</div>
</div>
<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">The Q, K, V Triad</h2>
<div class="grid md:grid-cols-3 gap-6 mb-8">
<div class="bg-blue-50 p-4 rounded-lg">
<h3 class="font-bold text-blue-800 mb-2"><i class="fas fa-question-circle mr-2"></i>Queries (Q)</h3>
<p class="text-gray-700">
Represent what the model is "looking for" at the current position. They are learned representations that help determine which parts of the input to focus on.
</p>
</div>
<div class="bg-green-50 p-4 rounded-lg">
<h3 class="font-bold text-green-800 mb-2"><i class="fas fa-key mr-2"></i>Keys (K)</h3>
<p class="text-gray-700">
Represent what each input element "contains" or "offers". They are compared against queries to determine attention weights.
</p>
</div>
<div class="bg-yellow-50 p-4 rounded-lg">
<h3 class="font-bold text-yellow-800 mb-2"><i class="fas fa-database mr-2"></i>Values (V)</h3>
<p class="text-gray-700">
Contain the actual information that will be aggregated based on the attention weights. They represent what gets passed forward.
</p>
</div>
</div>
<h3 class="text-xl font-semibold text-gray-800 mb-4">Why We Need All Three</h3>
<p class="text-gray-700 mb-4">
The separation of Q, K, and V provides flexibility and expressive power to the attention mechanism:
</p>
<ul class="list-disc pl-6 text-gray-700 space-y-2 mb-6">
<li><strong>Decoupling:</strong> Allows different representations for what to look for (Q) versus what to retrieve (V)</li>
<li><strong>Flexibility:</strong> Enables different types of attention patterns (e.g., looking ahead vs. looking back)</li>
<li><strong>Efficiency:</strong> Permits caching of K and V for autoregressive generation (see the sketch after this list)</li>
<li><strong>Interpretability:</strong> Makes attention patterns more meaningful and analyzable</li>
</ul>
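<p class="text-gray-700 mb-4">
The efficiency point deserves a closer look. Because K and V depend only on the input tokens themselves (not on the current query), an autoregressive decoder can compute them once per token and reuse them at every later generation step. A simplified sketch, with hypothetical helper names and no multi-head splitting:
</p>
<div class="code-block mb-6">
<pre># Simplified sketch of K/V caching during autoregressive generation.
# The projections and cache handling are illustrative, not a real API.
import torch
import torch.nn as nn

d_model = 512
q_proj, k_proj, v_proj = (nn.Linear(d_model, d_model) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per generated token

def attend_to_cache(new_token):  # new_token: (1, d_model)
    # Only the newest token needs fresh K/V; earlier ones are reused.
    k_cache.append(k_proj(new_token))
    v_cache.append(v_proj(new_token))
    K = torch.cat(k_cache, dim=0)  # (t, d_model)
    V = torch.cat(v_cache, dim=0)  # (t, d_model)
    q = q_proj(new_token)          # (1, d_model)
    w = torch.softmax(q @ K.T / d_model ** 0.5, dim=-1)
    return w @ V                   # (1, d_model)

for _ in range(5):  # generate 5 steps with dummy token embeddings
    out = attend_to_cache(torch.randn(1, d_model))</pre>
</div>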
<h3 class="text-xl font-semibold text-gray-800 mb-4">How Q, K, V Are Created</h3>
<p class="text-gray-700 mb-4">
In transformer models, Q, K, and V are all derived from the same input sequence through learned linear transformations:
</p>
<div class="code-block mb-6">
<pre># Python example of creating Q, K, V
import torch
import torch.nn as nn
# Suppose we have input embeddings of shape (batch_size, seq_len, d_model)
batch_size = 32
seq_len = 10
d_model = 512
input_embeddings = torch.randn(batch_size, seq_len, d_model)
# Create linear projection layers
q_proj = nn.Linear(d_model, d_model) # Query projection
k_proj = nn.Linear(d_model, d_model) # Key projection
v_proj = nn.Linear(d_model, d_model) # Value projection
# Project inputs to get Q, K, V
Q = q_proj(input_embeddings) # Shape: (batch_size, seq_len, d_model)
K = k_proj(input_embeddings) # Shape: (batch_size, seq_len, d_model)
V = v_proj(input_embeddings) # Shape: (batch_size, seq_len, d_model)</pre>
</div>
<div class="explanation-box">
<h3 class="font-semibold text-lg text-blue-800 mb-2">Important Note</h3>
<p>
In practice, the dimensions are often split into multiple "heads" (multi-head attention), where each head learns different attention patterns.
This allows the model to attend to different aspects of the input simultaneously.
</p>
</div>
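<p class="text-gray-700 mt-6 mb-4">
A quick shape check makes the head split concrete. With the illustrative choice of d_model = 512 and 8 heads, each head works with a 64-dimensional slice of the representation:
</p>
<div class="code-block">
<pre># Splitting d_model into heads changes the shape, not the values.
# d_model = 512 and num_heads = 8 are illustrative choices.
import torch

batch_size, seq_len, d_model, num_heads = 32, 10, 512, 8
depth = d_model // num_heads  # 64 dimensions per head

Q = torch.randn(batch_size, seq_len, d_model)
Q_heads = Q.view(batch_size, seq_len, num_heads, depth).transpose(1, 2)

print(Q_heads.shape)  # torch.Size([32, 8, 10, 64])</pre>
</div>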
</div>
</div>
<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Scaled Dot-Product Attention</h2>
<p class="text-gray-700 mb-4">
The core computation in attention mechanisms is the scaled dot-product attention, which can be implemented as follows:
</p>
<div class="code-block mb-6">
<pre>def scaled_dot_product_attention(Q, K, V, mask=None):
"""
Q: Query tensor (batch_size, ..., seq_len_q, d_k)
K: Key tensor (batch_size, ..., seq_len_k, d_k)
V: Value tensor (batch_size, ..., seq_len_k, d_v)
    mask: Optional tensor with 1 at positions to be masked out, 0 elsewhere
"""
# Compute dot products between Q and K
matmul_qk = torch.matmul(Q, K.transpose(-2, -1)) # (..., seq_len_q, seq_len_k)
# Scale by square root of dimension
d_k = Q.size(-1)
scaled_attention_logits = matmul_qk / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
# Apply mask if provided (for decoder self-attention)
if mask is not None:
scaled_attention_logits += (mask * -1e9)
# Softmax to get attention weights
attention_weights = torch.softmax(scaled_attention_logits, dim=-1)
# Multiply weights by values
output = torch.matmul(attention_weights, V) # (..., seq_len_q, d_v)
return output, attention_weights</pre>
</div>
<div class="explanation-box">
<h3 class="font-semibold text-lg text-blue-800 mb-2">Scaling Explanation</h3>
<p>
The scaling factor (√dₖ) is crucial because dot products grow large in magnitude as the dimension increases.
This can push the softmax function into regions where it has extremely small gradients, making learning difficult.
Scaling by √dₖ counteracts this effect.
</p>
</div>
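<p class="text-gray-700 mb-4">
You can verify this effect directly: the standard deviation of a dot product between random vectors grows roughly with √dₖ, and dividing by √dₖ brings it back to around 1. The dimensions below are illustrative:
</p>
<div class="code-block mb-6">
<pre># Demonstration: unscaled dot products grow with dimension.
# The dimensions and random vectors are illustrative.
import torch

torch.manual_seed(0)
for d_k in (16, 256, 4096):
    q = torch.randn(1000, d_k)
    k = torch.randn(1000, d_k)
    dots = (q * k).sum(dim=-1)  # 1000 sample dot products
    # Unscaled std grows like sqrt(d_k); scaling restores it to about 1
    print(d_k, round(dots.std().item(), 1), round((dots / d_k ** 0.5).std().item(), 2))</pre>
</div>
<p class="text-gray-700 mb-6">
Without the division, the logits at large dₖ would routinely reach magnitudes where softmax saturates and gradients vanish.
</p>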
<h3 class="text-xl font-semibold text-gray-800 mb-4">Complete Multi-Head Attention Example</h3>
<div class="code-block mb-6">
<pre>class MultiHeadAttention(nn.Module):
def __init__(self, d_model, num_heads):
super(MultiHeadAttention, self).__init__()
self.num_heads = num_heads
self.d_model = d_model
assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
self.depth = d_model // num_heads
# Linear layers for Q, K, V projections
self.wq = nn.Linear(d_model, d_model)
self.wk = nn.Linear(d_model, d_model)
self.wv = nn.Linear(d_model, d_model)
self.dense = nn.Linear(d_model, d_model)
def split_heads(self, x, batch_size):
"""Split the last dimension into (num_heads, depth)"""
x = x.view(batch_size, -1, self.num_heads, self.depth)
return x.transpose(1, 2) # (batch_size, num_heads, seq_len, depth)
def forward(self, q, k, v, mask=None):
batch_size = q.size(0)
# Linear projections
q = self.wq(q) # (batch_size, seq_len, d_model)
k = self.wk(k)
v = self.wv(v)
# Split into multiple heads
q = self.split_heads(q, batch_size)
k = self.split_heads(k, batch_size)
v = self.split_heads(v, batch_size)
# Scaled dot-product attention
scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)
# Concatenate heads
scaled_attention = scaled_attention.transpose(1, 2) # (batch_size, seq_len, num_heads, depth)
concat_attention = scaled_attention.contiguous().view(batch_size, -1, self.d_model)
# Final linear layer
output = self.dense(concat_attention)
return output, attention_weights</pre>
</div>
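<p class="text-gray-700 mb-4">
A short usage sketch for the module above (the hyperparameters are illustrative):
</p>
<div class="code-block">
<pre># Usage sketch for the MultiHeadAttention module defined above.
import torch

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(32, 10, 512)  # (batch_size, seq_len, d_model)

# Self-attention: q, k, and v all come from the same sequence
out, weights = mha(x, x, x)
print(out.shape)      # torch.Size([32, 10, 512])
print(weights.shape)  # torch.Size([32, 8, 10, 10]), one map per head</pre>
</div>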
</div>
</div>
<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Types of Attention Patterns</h2>
<div class="grid md:grid-cols-2 gap-6 mb-6">
<div class="bg-indigo-50 p-4 rounded-lg">
<h3 class="font-bold text-indigo-800 mb-2"><i class="fas fa-arrows-alt-h mr-2"></i>Self-Attention</h3>
<p class="text-gray-700">
Q, K, and V all come from the same sequence. Allows each position to attend to all positions in the same sequence.
</p>
<div class="mt-3">
<span class="inline-block bg-indigo-100 text-indigo-800 px-2 py-1 rounded-full text-xs font-medium">Encoder</span>
<span class="inline-block bg-indigo-100 text-indigo-800 px-2 py-1 rounded-full text-xs font-medium ml-1">BERT</span>
</div>
</div>
<div class="bg-pink-50 p-4 rounded-lg">
<h3 class="font-bold text-pink-800 mb-2"><i class="fas fa-arrow-right mr-2"></i>Masked Self-Attention</h3>
<p class="text-gray-700">
Used in decoder to prevent positions from attending to subsequent positions (autoregressive property).
</p>
<div class="mt-3">
<span class="inline-block bg-pink-100 text-pink-800 px-2 py-1 rounded-full text-xs font-medium">Decoder</span>
<span class="inline-block bg-pink-100 text-pink-800 px-2 py-1 rounded-full text-xs font-medium ml-1">GPT</span>
</div>
</div>
<div class="bg-teal-50 p-4 rounded-lg">
<h3 class="font-bold text-teal-800 mb-2"><i class="fas fa-exchange-alt mr-2"></i>Cross-Attention</h3>
<p class="text-gray-700">
Q comes from one sequence, while K and V come from another sequence (e.g., encoder-decoder attention).
</p>
<div class="mt-3">
<span class="inline-block bg-teal-100 text-teal-800 px-2 py-1 rounded-full text-xs font-medium">Seq2Seq</span>
<span class="inline-block bg-teal-100 text-teal-800 px-2 py-1 rounded-full text-xs font-medium ml-1">Translation</span>
</div>
</div>
<div class="bg-orange-50 p-4 rounded-lg">
<h3 class="font-bold text-orange-800 mb-2"><i class="fas fa-sliders-h mr-2"></i>Sparse Attention</h3>
<p class="text-gray-700">
Only attends to a subset of positions to reduce computational complexity (e.g., local, strided, or global attention).
</p>
<div class="mt-3">
<span class="inline-block bg-orange-100 text-orange-800 px-2 py-1 rounded-full text-xs font-medium">Longformer</span>
<span class="inline-block bg-orange-100 text-orange-800 px-2 py-1 rounded-full text-xs font-medium ml-1">BigBird</span>
</div>
</div>
</div>
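<p class="text-gray-700 mt-2 mb-4">
The masked variant is easy to wire into the scaled_dot_product_attention function above, which treats 1 as "mask this position out". A short sketch (seq_len = 5 is illustrative):
</p>
<div class="code-block">
<pre># Building a causal mask for the attention function above,
# which expects 1 at positions to be masked out.
import torch

seq_len = 5  # illustrative

# Position i may not attend to positions j > i
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
print(causal_mask)
# tensor([[0., 1., 1., 1., 1.],
#         [0., 0., 1., 1., 1.],
#         [0., 0., 0., 1., 1.],
#         [0., 0., 0., 0., 1.],
#         [0., 0., 0., 0., 0.]])</pre>
</div>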
</div>
</div>
<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Key Citations and Resources</h2>
<div class="space-y-4">
<div class="citation">
<h3 class="font-semibold text-gray-800">1. Vaswani et al. (2017) - Original Transformer Paper</h3>
<p class="text-gray-600">"Attention Is All You Need" - Introduced the transformer architecture with scaled dot-product attention.</p>
<a href="https://arxiv.org/abs/1706.03762" class="text-blue-600 hover:underline inline-block mt-1">arXiv:1706.03762</a>
</div>
<div class="citation">
<h3 class="font-semibold text-gray-800">2. Jurafsky & Martin (2023) - NLP Textbook</h3>
<p class="text-gray-600">"Speech and Language Processing" - Comprehensive chapter on attention and transformer models.</p>
<a href="https://web.stanford.edu/~jurafsky/slp3/" class="text-blue-600 hover:underline inline-block mt-1">Stanford NLP Textbook</a>
</div>
<div class="citation">
<h3 class="font-semibold text-gray-800">3. Illustrated Transformer (Blog Post)</h3>
<p class="text-gray-600">Jay Alammar's visual explanation of transformer attention mechanisms.</p>
<a href="https://jalammar.github.io/illustrated-transformer/" class="text-blue-600 hover:underline inline-block mt-1">jalammar.github.io</a>
</div>
<div class="citation">
<h3 class="font-semibold text-gray-800">4. Harvard NLP (2022) - Annotated Transformer</h3>
<p class="text-gray-600">Line-by-line implementation guide with PyTorch.</p>
<a href="http://nlp.seas.harvard.edu/2018/04/03/attention.html" class="text-blue-600 hover:underline inline-block mt-1">Harvard NLP Tutorial</a>
</div>
<div class="citation">
<h3 class="font-semibold text-gray-800">5. Efficient Transformers Survey (2020)</h3>
<p class="text-gray-600">Tay et al. review various attention variants for efficiency.</p>
<a href="https://arxiv.org/abs/2009.06732" class="text-blue-600 hover:underline inline-block mt-1">arXiv:2009.06732</a>
</div>
</div>
</div>
</div>
<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Practical Considerations</h2>
<div class="grid md:grid-cols-2 gap-6">
<div>
<h3 class="text-xl font-semibold text-gray-800 mb-3"><i class="fas fa-lightbulb text-yellow-500 mr-2"></i>Tips for Implementation</h3>
<ul class="list-disc pl-6 text-gray-700 space-y-2">
<li>Use layer normalization before (not after) attention in transformer blocks</li>
<li>Initialize attention projections with small random weights</li>
<li>Monitor attention patterns during training for debugging (see the sketch after these lists)</li>
<li>Consider using flash attention for efficiency in production</li>
<li>Use attention masking carefully for padding and autoregressive generation</li>
</ul>
</div>
<div>
<h3 class="text-xl font-semibold text-gray-800 mb-3"><i class="fas fa-exclamation-triangle text-red-500 mr-2"></i>Common Pitfalls</h3>
<ul class="list-disc pl-6 text-gray-700 space-y-2">
<li>Forgetting to scale attention scores by √dₖ</li>
<li>Improper handling of attention masks</li>
<li>Not using residual connections around attention</li>
<li>Oversized attention heads that don't learn meaningful patterns</li>
<li>Ignoring attention patterns when debugging model behavior</li>
</ul>
</div>
</div>
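<p class="text-gray-700 mt-6 mb-4">
One way to act on the monitoring tip: the entropy of each attention row reveals whether a head is sharply focused or near-uniform. A rough sketch that assumes a weights tensor shaped like the one returned by the MultiHeadAttention example above (random stand-in data here):
</p>
<div class="code-block">
<pre># Sketch: summarizing attention weights for debugging.
# Assumes weights of shape (batch, num_heads, seq_q, seq_k);
# random stand-in data replaces real model output.
import torch

weights = torch.softmax(torch.randn(32, 8, 10, 10), dim=-1)

# Per-row entropy: low = focused head, high = near-uniform head
entropy = -(weights * torch.log(weights + 1e-9)).sum(dim=-1)  # (32, 8, 10)
print(entropy.mean(dim=(0, 2)))  # average entropy per head</pre>
</div>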
</div>
</div>
<footer class="text-center py-8 text-gray-600">
<p>© 2023 Understanding Attention Mechanisms in LLMs</p>
<p class="mt-2">Educational resource for machine learning students</p>
<div class="mt-4 flex justify-center space-x-4">
<a href="#" class="text-blue-600 hover:text-blue-800"><i class="fab fa-github fa-lg"></i></a>
<a href="#" class="text-blue-600 hover:text-blue-800"><i class="fab fa-twitter fa-lg"></i></a>
<a href="#" class="text-blue-600 hover:text-blue-800"><i class="fab fa-linkedin fa-lg"></i></a>
</div>
</footer>
</div>
<p style="border-radius: 8px; text-align: center; font-size: 12px; color: #fff; margin-top: 16px;position: fixed; left: 8px; bottom: 8px; z-index: 10; background: rgba(0, 0, 0, 0.8); padding: 4px 8px;">Made with <img src="https://enzostvs-deepsite.hf.space/logo.svg" alt="DeepSite Logo" style="width: 16px; height: 16px; vertical-align: middle;display:inline-block;margin-right:3px;filter:brightness(0) invert(1);"><a href="https://enzostvs-deepsite.hf.space" style="color: #fff;text-decoration: underline;" target="_blank" >DeepSite</a> - 🧬 <a href="https://enzostvs-deepsite.hf.space?remix=ontoligent/ds-5001-text-as-data" style="color: #fff;text-decoration: underline;" target="_blank" >Remix</a></p></body>
</html> |