<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>JEPA and Cognitive Architectures</title>
    <script src="https://cdn.tailwindcss.com"></script>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
    <style>
        @import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap');
        
        body {
            font-family: 'Inter', sans-serif;
            background-color: #f9fafb;
            color: #111827;
        }
        
        .gradient-header {
            background: linear-gradient(135deg, #4f46e5 0%, #7c3aed 100%);
        }
        
        .diagram-container {
            background-color: #f3f4f6;
            border-radius: 0.5rem;
            padding: 1.5rem;
            margin: 1.5rem 0;
            border-left: 4px solid #4f46e5;
        }
        
        .concept-card {
            transition: all 0.3s ease;
            border-radius: 0.5rem;
            box-shadow: 0 1px 3px rgba(0,0,0,0.1);
        }
        
        .concept-card:hover {
            transform: translateY(-2px);
            box-shadow: 0 10px 15px -3px rgba(0,0,0,0.1);
        }
        
        .section-divider {
            border-top: 2px dashed #d1d5db;
            margin: 2rem 0;
        }
        
        .key-point {
            background-color: #eef2ff;
            border-left: 4px solid #4f46e5;
            padding: 1rem;
            margin: 1rem 0;
            border-radius: 0 0.375rem 0.375rem 0;
        }
        
        code {
            background-color: #f3f4f6;
            padding: 0.2rem 0.4rem;
            border-radius: 0.25rem;
            font-family: 'Courier New', monospace;
            font-size: 0.9em;
            color: #7c3aed;
        }
        
        .pseudo-code {
            background-color: #1e293b;
            color: #f8fafc;
            padding: 1rem;
            border-radius: 0.5rem;
            font-family: 'Courier New', monospace;
            overflow-x: auto;
            margin: 1.5rem 0;
        }
        
        .pseudo-code .keyword {
            color: #f472b6;
        }
        
        .pseudo-code .comment {
            color: #94a3b8;
            font-style: italic;
        }
        
        .pseudo-code .string {
            color: #86efac;
        }
        
        .pseudo-code .function {
            color: #60a5fa;
        }
    </style>
</head>
<body class="bg-gray-50">
    <div class="max-w-5xl mx-auto px-4 py-8">
        <!-- Header -->
        <header class="gradient-header text-white rounded-xl p-8 mb-8 shadow-lg">
            <div class="flex items-center justify-between">
                <div>
                    <h1 class="text-4xl font-bold mb-2">JEPA and Cognitive Architectures</h1>
                    <p class="text-xl opacity-90">A Comprehensive Introduction to Predictive AI Systems</p>
                </div>
                <div class="bg-white/20 p-4 rounded-lg">
                    <i class="fas fa-brain text-4xl"></i>
                </div>
            </div>
        </header>
        
        <!-- Navigation -->
        <nav class="bg-white rounded-lg shadow-sm p-4 mb-8 sticky top-4 z-10">
            <ul class="flex flex-wrap gap-4 justify-center">
                <li><a href="#motivation" class="text-indigo-600 hover:text-indigo-800 font-medium">Motivation</a></li>
                <li><a href="#jepa-core" class="text-indigo-600 hover:text-indigo-800 font-medium">JEPA Core</a></li>
                <li><a href="#cognitive-arch" class="text-indigo-600 hover:text-indigo-800 font-medium">Cognitive Architecture</a></li>
                <li><a href="#modules" class="text-indigo-600 hover:text-indigo-800 font-medium">Modules</a></li>
                <li><a href="#examples" class="text-indigo-600 hover:text-indigo-800 font-medium">Examples</a></li>
                <li><a href="#conclusion" class="text-indigo-600 hover:text-indigo-800 font-medium">Conclusion</a></li>
            </ul>
        </nav>
        
        <!-- Main Content -->
        <main class="space-y-8">
            <!-- Motivation Section -->
            <section id="motivation" class="bg-white rounded-xl shadow-sm p-6">
                <h2 class="text-2xl font-bold mb-4 text-gray-800 flex items-center">
                    <i class="fas fa-lightbulb text-yellow-500 mr-3"></i>
                    <span>1. Motivation and Background</span>
                </h2>
                
                <h3 class="text-xl font-semibold mt-6 mb-3 text-gray-700">1.1 The Need for Predictive Representations</h3>
                <p class="text-gray-700 mb-4">
                    Modern AI systems must <span class="font-medium">perceive</span>, <span class="font-medium">reason</span>, and <span class="font-medium">act</span> in complex, dynamic environments. Human intelligence excels not because we memorize every detail, but because we <span class="font-medium">summarize</span>, <span class="font-medium">predict</span>, and <span class="font-medium">plan</span> using abstract representations—ignoring irrelevant noise and focusing on what is useful for future reasoning or action.
                </p>
                <p class="text-gray-700 mb-4">
                    Recent advances in deep learning (e.g., large language models, vision transformers) have shown the power of self-supervised representation learning. However, models that predict directly in input space, such as autoregressive token or pixel predictors, must account for every detail of the signal, including noise and inherently unpredictable content, which can limit robustness and sample efficiency.
                </p>
                
                <h3 class="text-xl font-semibold mt-6 mb-3 text-gray-700">1.2 Enter JEPA: Joint Embedding Predictive Architecture</h3>
                <p class="text-gray-700">
                    Proposed by Yann LeCun and colleagues, <span class="font-medium text-indigo-700">JEPA</span> offers a novel approach:
                </p>
                <ul class="list-disc pl-6 mt-2 space-y-2 text-gray-700">
                    <li><span class="font-medium">Learn representations by predicting only what is predictable</span>—not every detail, but the essential structure that allows for accurate reasoning and planning.</li>
                </ul>
                
                <div class="key-point mt-6">
                    <p class="font-medium text-gray-800">Key Insight:</p>
                    <p>JEPA focuses on learning the predictable aspects of data while ignoring unpredictable noise, leading to more robust and efficient representations.</p>
                </div>
            </section>
            
            <!-- JEPA Core Section -->
            <section id="jepa-core" class="bg-white rounded-xl shadow-sm p-6">
                <h2 class="text-2xl font-bold mb-4 text-gray-800 flex items-center">
                    <i class="fas fa-puzzle-piece text-blue-500 mr-3"></i>
                    <span>2. JEPA: Core Ideas and Mechanism</span>
                </h2>
                
                <h3 class="text-xl font-semibold mt-6 mb-3 text-gray-700">2.1 What is JEPA?</h3>
                <p class="text-gray-700 mb-4">
                    <span class="font-medium text-indigo-700">JEPA (Joint Embedding Predictive Architecture)</span> is a self-supervised learning framework in which a model embeds contexts (observed parts) and targets (future or missing parts) into a shared embedding space, and is trained so that a context's embedding is predictive of its target's embedding.
                </p>
                
                <div class="bg-blue-50 p-4 rounded-lg mb-6">
                    <p class="font-medium text-blue-800">Objective:</p>
                    <ul class="list-disc pl-6 mt-2 space-y-1 text-blue-800">
                        <li>If the context and target belong together (e.g., two halves of the same image, or a sentence and its continuation), their embeddings should be <span class="font-medium">close</span>.</li>
                        <li>If they do not (random combinations), their embeddings should be <span class="font-medium">far apart</span>.</li>
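                        <li>This is often illustrated with a <span class="font-medium">contrastive loss</span>. Published JEPA models such as I-JEPA instead train a predictor to regress the target embedding from the context embedding, with no negative pairs.</li>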
                    </ul>
                </div>
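                <p class="text-gray-700 mb-4">
                    As a rough illustration of the objective above, the sketch below computes an InfoNCE-style contrastive loss with NumPy. The function name <code>info_nce_loss</code>, the temperature value, and the random stand-in embeddings are assumptions made for this example; they are not taken from any JEPA codebase.
                </p>
<pre class="pseudo-code">
```python
import numpy as np


def info_nce_loss(context_emb, target_emb, temperature=0.1):
    """InfoNCE-style contrastive loss over a batch of (context, target) pairs.

    Row i of context_emb and row i of target_emb are assumed to be a true pair;
    every other row in the batch acts as a negative for row i.
    """
    # L2-normalise so that dot products are cosine similarities.
    c = context_emb / np.linalg.norm(context_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    logits = c @ t.T / temperature                       # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy where the correct "class" for row i is column i.
    return float(-np.mean(np.diag(log_probs)))


# Hypothetical stand-in embeddings (random vectors, not real encoder outputs).
rng = np.random.default_rng(0)
context = rng.normal(size=(8, 16))
matched = context + 0.05 * rng.normal(size=(8, 16))  # targets close to their contexts
shuffled = np.roll(matched, shift=1, axis=0)         # each target paired with the wrong context

loss_matched = info_nce_loss(context, matched)
loss_shuffled = info_nce_loss(context, shuffled)
print(loss_matched, loss_shuffled)
```
</pre>
                <p class="text-gray-700 mb-4">
                    With these stand-in embeddings, <code>loss_matched</code> comes out lower than <code>loss_shuffled</code>: pairs that belong together sit close in the embedding space, while mismatched pairs are penalized.
                </p>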
                
                <h3 class="text-xl font-semibold mt-6 mb-3 text-gray-700">2.2 Why Is This Powerful?</h3>
                <div class="grid grid-cols-1 md:grid-cols-3 gap-4 mb-6">
                    <div class="concept-card bg-white p-4 border border-gray-200">
                        <div class="text-purple-600 mb-2">
                            <i class="fas fa-filter text-xl"></i>
                        </div>
                        <h4 class="font-semibold mb-2">Focuses on Structure</h4>
                        <p class="text-sm text-gray-600">Encodes only predictable, meaningful features while ignoring noise</p>
                    </div>
                    <div class="concept-card bg-white p-4 border border-gray-200">
                        <div class="text-green-600 mb-2">
                            <i class="fas fa-shapes text-xl"></i>
                        </div>
                        <h4 class="font-semibold mb-2">Multi-Modal</h4>
                        <p class="text-sm text-gray-600">The same objective can be applied to images, video, audio, and text</p>
                    </div>
                    <div class="concept-card bg-white p-4 border border-gray-200">
                        <div class="text-red-600 mb-2">
                            <i class="fas fa-robot text-xl"></i>
                        </div>
                        <h4 class="font-semibold mb-2">Transferable Features</h4>
                        <p class="text-sm text-gray-600">Learns representations useful for reasoning and planning</p>
                    </div>
                </div>
                
                <h3 class="text-xl font-semibold mt-6 mb-3 text-gray-700">2.3 The JEPA Training Loop</h3>
                <div class="diagram-container">
                    <div class="flex flex-col items-center">
                        <div class="flex items-center justify-center space-x-8 mb-6">
                            <div class="text-center">
                                <div class="bg-indigo-100 p-3 rounded-lg inline-block">
                                    <i class="fas fa-eye text-indigo-600 text-2xl"></i>
                                </div>
                                <p class="mt-2 font-medium">Context Encoder</p>
                                <p class="text-sm text-gray-600">Takes observed input</p>
                            </div>
                            <div class="text-center">
                                <div class="bg-indigo-100 p-3 rounded-lg inline-block">
                                    <i class="fas fa-project-diagram text-indigo-600 text-2xl"></i>
                                </div>
                                <p class="mt-2 font-medium">Embedding Space</p>
                                <p class="text-sm text-gray-600">Shared representation</p>
                            </div>
                            <div class="text-center">
                                <div class="bg-indigo-100 p-3 rounded-lg inline-block">
                                    <i class="fas fa-bullseye text-indigo-600 text-2xl"></i>
                                </div>
                                <p class="mt-2 font-medium">Target Encoder</p>
                                <p class="text-sm text-gray-600">Takes future/missing part</p>
                            </div>
                        </div>
                        <div class="w-full bg-indigo-50 p-4 rounded-lg">
                            <div class="flex justify-between items-center px-4">
                                <div class="text-center">
                                    <p class="font-medium">Input Context</p>
                                    <p class="text-sm">(e.g., left image half)</p>
                                </div>
                                <div class="text-center">
                                    <p class="font-medium">Similarity</p>
                                    <p class="text-sm">Contrastive Loss</p>
                                </div>
                                <div class="text-center">
                                    <p class="font-medium">Input Target</p>
                                    <p class="text-sm">(e.g., right image half)</p>
                                </div>
                            </div>
                        </div>
                    </div>
                </div>
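                <p class="text-gray-700 mb-4">
                    The loop in the diagram can be sketched in a few lines of NumPy. Note the swap: instead of the contrastive loss named in the diagram, this sketch uses the regression variant, in which a predictor maps the context embedding onto the target embedding and the target encoder is a slow moving average of the context encoder. The linear encoders, the names <code>train_step</code> and <code>embedding_loss</code>, and all hyperparameters are assumptions chosen to keep the example small.
                </p>
<pre class="pseudo-code">
```python
import numpy as np


def embedding_loss(w_context, w_pred, w_target, x_context, x_target):
    """Mean squared distance between predicted and actual target embeddings."""
    predicted = x_context @ w_context @ w_pred
    target = x_target @ w_target
    return float(np.mean(np.sum((predicted - target) ** 2, axis=1)))


def train_step(w_context, w_pred, w_target, x_context, x_target,
               lr=0.01, momentum=0.99):
    """One gradient step on encoder and predictor, then an EMA target update.

    Assumption: linear maps stand in for deep networks, so gradients are
    written out by hand instead of using an autodiff library.
    """
    batch = x_context.shape[0]
    z_context = x_context @ w_context   # context encoder
    predicted = z_context @ w_pred      # predictor: context embedding to target embedding
    target = x_target @ w_target        # target encoder, treated as a constant
    d_pred = 2.0 * (predicted - target) / batch
    grad_w_pred = z_context.T @ d_pred
    grad_w_context = x_context.T @ (d_pred @ w_pred.T)
    new_w_context = w_context - lr * grad_w_context
    new_w_pred = w_pred - lr * grad_w_pred
    # The target encoder gets no gradient; it slowly tracks the context encoder.
    new_w_target = momentum * w_target + (1.0 - momentum) * new_w_context
    return new_w_context, new_w_pred, new_w_target


# Hypothetical toy data: targets are noisy copies of their contexts.
rng = np.random.default_rng(0)
dim_in, dim_emb, batch_size = 8, 4, 32
x_context = rng.normal(size=(batch_size, dim_in))
x_target = x_context + 0.1 * rng.normal(size=(batch_size, dim_in))

w_context = 0.5 * rng.normal(size=(dim_in, dim_emb))
w_target = w_context.copy()
w_pred = 0.5 * rng.normal(size=(dim_emb, dim_emb))

loss_before = embedding_loss(w_context, w_pred, w_target, x_context, x_target)
new_w_context, new_w_pred, new_w_target = train_step(
    w_context, w_pred, w_target, x_context, x_target)
# Evaluate against the old target weights to isolate the gradient step.
loss_after = embedding_loss(new_w_context, new_w_pred, w_target, x_context, x_target)
print(loss_before, loss_after)
```
</pre>
                <p class="text-gray-700 mb-4">
                    Keeping the target encoder out of the gradient path and updating it by moving average is one common way to discourage the trivial solution in which both encoders map every input to the same point.
                </p>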
                
                <h4 class="font-semibold mt-6 mb-2 text-gray-700">Concrete Examples:</h4>
                <div class="grid grid-cols-1 md:grid-cols-2 gap-4">
                    <div class="bg-gray-50 p-4 rounded-lg border border-gray-200">
                        <div class="flex items-center mb-2">
                            <div class="bg-purple-100 p-2 rounded-full mr-3">
                                <i class="fas fa-image text-purple-600"></i>
                            </div>
                            <h5 class="font-medium">Vision Example</h5>
                        </div>
                        <ul class="list-disc pl-6 text-sm text-gray-700">
                            <li>Context: Left half of a cat image</li>
                            <li>Target: Right half</li>
                            <li>Embeddings should be close if they come from the same photo, far otherwise</li>
                        </ul>
                    </div>
                    <div class="bg-gray-50 p-4 rounded-lg border border-gray-200">
                        <div class="flex items-center mb-2">
                            <div class="bg-green-100 p-2 rounded-full mr-3">
                                <i class="fas fa-language text-green-600"></i>
                            </div>
                            <h5 class="font-medium">Language Example</h5>
                        </div>
                        <ul class="list-disc pl-6 text-sm text-gray-700">
                            <li>Context: "The cat sat on the"</li>
                            <li>Target: "mat"</li>
                            <li>Close if the target really follows the context, far if the target is a random word</li>
                        </ul>
                    </div>
                </div>
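                <p class="text-gray-700 mt-4 mb-4">
                    To make the vision example concrete, the sketch below builds matched and mismatched (context, target) pairs by cutting images into left and right halves. The helpers <code>split_halves</code> and <code>make_pairs</code> are hypothetical names introduced here, and random arrays stand in for real photos; the function assumes a batch of at least two images.
                </p>
<pre class="pseudo-code">
```python
import numpy as np


def split_halves(images):
    """Split a batch of images (batch, height, width) into left and right halves."""
    mid = images.shape[2] // 2
    return images[:, :, :mid], images[:, :, mid:]


def make_pairs(images, rng):
    """Build pairs labelled 1 (halves of the same image) or 0 (mismatched halves)."""
    left, right = split_halves(images)
    # Rotate by a random non-zero offset so no right half keeps its own left half.
    offset = int(rng.integers(1, len(images)))
    mismatched_right = np.roll(right, shift=offset, axis=0)
    context = np.concatenate([left, left])
    target = np.concatenate([right, mismatched_right])
    labels = np.concatenate([
        np.ones(len(images), dtype=int),
        np.zeros(len(images), dtype=int),
    ])
    return context, target, labels


rng = np.random.default_rng(0)
images = rng.random(size=(4, 6, 8))  # four stand-in "photos", 6 rows by 8 columns
context, target, labels = make_pairs(images, rng)
print(context.shape, target.shape, labels)
```
</pre>
                <p class="text-gray-700 mb-4">
                    The first four pairs are true left/right halves of the same image; the last four reuse the same left halves with right halves taken from other images.
                </p>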
            </section>
            
            <!-- Cognitive Architecture Section -->
            <section id="cognitive-arch" class="bg-white rounded-xl shadow-sm p-6">
                <h2 class="text-2xl font-bold mb-4 text-gray-800 flex items-center">
                    <i class="fas fa-sitemap text-teal-500 mr-3"></i>
                    <span>3. From Representation to Reasoning: JEPA in Cognitive Architectures</span>
                </h2>
                
                <p class="text-gray-700 mb-4">
                    JEPA shines as a <span class="font-medium">perception module</span> within a larger, <span class="font-medium">modular cognitive agent</span>. This mirrors biological systems: sensory organs and cortex encode perceptions, while higher reasoning and planning are handled by specialized systems.
                </p>
                
                <h3 class="text-xl font-semibold mt-6 mb-3 text-gray-700">3.1 The Modular Agent</h3>
                <p class="text-gray-700 mb-4">
                    The LeCun-style architecture for an intelligent agent typically includes:
                </p>
                
                <div class="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-3 gap-4 mb-6">
                    <div class="concept-card bg-indigo-50 p-4">
                        <div class="flex items-center mb-2">
                            <div class="bg-indigo-100 p-2 rounded-full mr-3">
                                <i class="fas fa-eye text-indigo-600"></i>
                            </div>
                            <h4 class="font-medium">1. Perception Module (JEPA)</h4>
                        </div>
                        <p class="text-sm text-gray-700">Encodes current observation into a compact, predictive embedding</p>
                    </div>
                    <div class="concept-card bg-blue-50 p-4">
                        <div class="flex items-center mb-2">
                            <div class="bg-blue-100 p-2 rounded-full mr-3">
                                <i class="fas fa-memory text-blue-600"></i>
                            </div>
                            <h4 class="font-medium">2. Short-term Memory</h4>
                        </div>
                        <p class="text-sm text-gray-700">Stores recent sequence of embeddings (history)</p>
                    </div>
                    <div class="concept-card bg-purple-50 p-4">
                        <div class="flex items-center mb-2">
                            <div class="bg-purple-100 p-2 rounded-full mr-3">
                                <i class="fas fa-globe text-purple-600"></i>
                            </div>
                            <h4 class="font-medium">3. World Model</h4>
                        </div>
                        <p class="text-sm text-gray-700">Integrates the sequence to produce a latent state</p>
                    </div>
                    <div class="concept-card bg-green-50 p-4">
                        <div class="flex items-center mb-2">
                            <div class="bg-green-100 p-2 rounded-full mr-3">
                                <i class="fas fa-cogs text-green-600"></i>
                            </div>
                            <h4 class="font-medium">4. Configurator</h4>
                        </div>
                       
</html>