<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>JEPA and Cognitive Architectures</title>
    <script src="https://cdn.tailwindcss.com"></script>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
    <style>
        @import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap');

        body {
            font-family: 'Inter', sans-serif;
            background-color: #f9fafb;
            color: #111827;
        }

        .gradient-header {
            background: linear-gradient(135deg, #4f46e5 0%, #7c3aed 100%);
        }

        .diagram-container {
            background-color: #f3f4f6;
            border-radius: 0.5rem;
            padding: 1.5rem;
            margin: 1.5rem 0;
            border-left: 4px solid #4f46e5;
        }

        .concept-card {
            transition: all 0.3s ease;
            border-radius: 0.5rem;
            box-shadow: 0 1px 3px rgba(0,0,0,0.1);
        }

        .concept-card:hover {
            transform: translateY(-2px);
            box-shadow: 0 10px 15px -3px rgba(0,0,0,0.1);
        }

        .section-divider {
            border-top: 2px dashed #d1d5db;
            margin: 2rem 0;
        }

        .key-point {
            background-color: #eef2ff;
            border-left: 4px solid #4f46e5;
            padding: 1rem;
            margin: 1rem 0;
            border-radius: 0 0.375rem 0.375rem 0;
        }

        code {
            background-color: #f3f4f6;
            padding: 0.2rem 0.4rem;
            border-radius: 0.25rem;
            font-family: 'Courier New', monospace;
            font-size: 0.9em;
            color: #7c3aed;
        }

        .pseudo-code {
            background-color: #1e293b;
            color: #f8fafc;
            padding: 1rem;
            border-radius: 0.5rem;
            font-family: 'Courier New', monospace;
            overflow-x: auto;
            margin: 1.5rem 0;
        }

        .pseudo-code .keyword {
            color: #f472b6;
        }

        .pseudo-code .comment {
            color: #94a3b8;
            font-style: italic;
        }

        .pseudo-code .string {
            color: #86efac;
        }

        .pseudo-code .function {
            color: #60a5fa;
        }
    </style>
</head>
<body class="bg-gray-50">
    <div class="max-w-5xl mx-auto px-4 py-8">

        <header class="gradient-header text-white rounded-xl p-8 mb-8 shadow-lg">
            <div class="flex items-center justify-between">
                <div>
                    <h1 class="text-4xl font-bold mb-2">JEPA and Cognitive Architectures</h1>
                    <p class="text-xl opacity-90">A Comprehensive Introduction to Predictive AI Systems</p>
                </div>
                <div class="bg-white/20 p-4 rounded-lg">
                    <i class="fas fa-brain text-4xl"></i>
                </div>
            </div>
        </header>

        <nav class="bg-white rounded-lg shadow-sm p-4 mb-8 sticky top-4 z-10">
            <ul class="flex flex-wrap gap-4 justify-center">
                <li><a href="#motivation" class="text-indigo-600 hover:text-indigo-800 font-medium">Motivation</a></li>
                <li><a href="#jepa-core" class="text-indigo-600 hover:text-indigo-800 font-medium">JEPA Core</a></li>
                <li><a href="#cognitive-arch" class="text-indigo-600 hover:text-indigo-800 font-medium">Cognitive Architecture</a></li>
                <li><a href="#modules" class="text-indigo-600 hover:text-indigo-800 font-medium">Modules</a></li>
                <li><a href="#examples" class="text-indigo-600 hover:text-indigo-800 font-medium">Examples</a></li>
                <li><a href="#conclusion" class="text-indigo-600 hover:text-indigo-800 font-medium">Conclusion</a></li>
            </ul>
        </nav>

        <main class="space-y-8">

            <section id="motivation" class="bg-white rounded-xl shadow-sm p-6">
                <h2 class="text-2xl font-bold mb-4 text-gray-800 flex items-center">
                    <i class="fas fa-lightbulb text-yellow-500 mr-3"></i>
                    <span>1. Motivation and Background</span>
                </h2>

                <h3 class="text-xl font-semibold mt-6 mb-3 text-gray-700">1.1 The Need for Predictive Representations</h3>
                <p class="text-gray-700 mb-4">
                    Modern AI systems must <span class="font-medium">perceive</span>, <span class="font-medium">reason</span>, and <span class="font-medium">act</span> in complex, dynamic environments. Human intelligence excels not because we memorize every detail, but because we <span class="font-medium">summarize</span>, <span class="font-medium">predict</span>, and <span class="font-medium">plan</span> using abstract representations—ignoring irrelevant noise and focusing on what is useful for future reasoning or action.
                </p>
                <p class="text-gray-700 mb-4">
                    Recent advances in deep learning (e.g., large language models, vision transformers) have shown the power of self-supervised representation learning. However, standard architectures (like autoregressive models) are often forced to model all details, including noise and unpredictability, limiting robustness and sample efficiency.
                </p>

                <h3 class="text-xl font-semibold mt-6 mb-3 text-gray-700">1.2 Enter JEPA: Joint Embedding Predictive Architecture</h3>
                <p class="text-gray-700">
                    Proposed by Yann LeCun and colleagues, <span class="font-medium text-indigo-700">JEPA</span> offers a novel approach:
                </p>
                <ul class="list-disc pl-6 mt-2 space-y-2 text-gray-700">
                    <li><span class="font-medium">Learn representations by predicting only what is predictable</span>—not every detail, but the essential structure that allows for accurate reasoning and planning.</li>
                </ul>

                <div class="key-point mt-6">
                    <p class="font-medium text-gray-800">Key Insight:</p>
                    <p>JEPA focuses on learning the predictable aspects of data while ignoring unpredictable noise, leading to more robust and efficient representations.</p>
                </div>
            </section>

            <section id="jepa-core" class="bg-white rounded-xl shadow-sm p-6">
                <h2 class="text-2xl font-bold mb-4 text-gray-800 flex items-center">
                    <i class="fas fa-puzzle-piece text-blue-500 mr-3"></i>
                    <span>2. JEPA: Core Ideas and Mechanism</span>
                </h2>

                <h3 class="text-xl font-semibold mt-6 mb-3 text-gray-700">2.1 What is JEPA?</h3>
<p class="text-gray-700 mb-4"> |
|
<span class="font-medium text-indigo-700">JEPA (Joint Embedding Predictive Architecture)</span> is a self-supervised learning framework where a model is trained to embed contexts (observed parts) and targets (future or missing parts) into a shared semantic space. |
|
</p> |
                <div class="bg-blue-50 p-4 rounded-lg mb-6">
                    <p class="font-medium text-blue-800">Objective:</p>
                    <ul class="list-disc pl-6 mt-2 space-y-1 text-blue-800">
                        <li>If the context and target belong together (e.g., two halves of the same image, or a sentence and its continuation), their embeddings should be <span class="font-medium">close</span>.</li>
                        <li>If they do not (random combinations), their embeddings should be <span class="font-medium">far apart</span>.</li>
                        <li>This can be implemented with a <span class="font-medium">contrastive loss</span>, although LeCun advocates non-contrastive variants that predict the target embedding directly and rely on regularization (or a gradient-free target encoder) to prevent representational collapse. A minimal sketch of both objectives follows below.</li>
                    </ul>
                </div>
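                <p class="text-gray-700 mb-4">
                    To make the objective concrete, here is a minimal PyTorch-style sketch of both loss variants. It is illustrative only: <code>context_encoder</code>, <code>target_encoder</code>, and <code>predictor</code> are placeholders for small neural networks, not components of any specific published implementation.
                </p>
                <pre class="pseudo-code">
<span class="comment"># Minimal JEPA-style objective (illustrative sketch)</span>
import torch
import torch.nn.functional as F

def jepa_loss(context_encoder, target_encoder, predictor,
              x_context, x_target, contrastive=False):
    z_context = context_encoder(x_context)      <span class="comment"># (batch, dim)</span>
    with torch.no_grad():                        <span class="comment"># target encoder gets no gradient from this loss</span>
        z_target = target_encoder(x_target)      <span class="comment"># (batch, dim)</span>

    z_pred = predictor(z_context)                <span class="comment"># predict the target embedding from the context embedding</span>

    if not contrastive:
        <span class="comment"># Predictive (non-contrastive) objective: regress the target embedding.</span>
        return F.mse_loss(z_pred, z_target)

    <span class="comment"># Contrastive (InfoNCE-style) variant: matching pairs close, mismatched pairs far apart.</span>
    z_pred = F.normalize(z_pred, dim=-1)
    z_target = F.normalize(z_target, dim=-1)
    logits = z_pred @ z_target.t() / 0.1         <span class="comment"># pairwise similarities, temperature 0.1</span>
    labels = torch.arange(len(logits), device=logits.device)
    return F.cross_entropy(logits, labels)
</pre>
                <p class="text-gray-700 mb-4">
                    In the non-contrastive case, something must rule out the trivial solution where every input maps to the same embedding; keeping the target encoder out of the gradient path (for example, updating it as a moving average of the context encoder) is one common way to do this.
                </p>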
                <h3 class="text-xl font-semibold mt-6 mb-3 text-gray-700">2.2 Why Is This Powerful?</h3>
                <div class="grid grid-cols-1 md:grid-cols-3 gap-4 mb-6">
                    <div class="concept-card bg-white p-4 border border-gray-200">
                        <div class="text-purple-600 mb-2">
                            <i class="fas fa-filter text-xl"></i>
                        </div>
                        <h4 class="font-semibold mb-2">Focuses on Structure</h4>
                        <p class="text-sm text-gray-600">Encodes only predictable, meaningful features while ignoring noise</p>
                    </div>
                    <div class="concept-card bg-white p-4 border border-gray-200">
                        <div class="text-green-600 mb-2">
                            <i class="fas fa-shapes text-xl"></i>
                        </div>
                        <h4 class="font-semibold mb-2">Multi-Modal</h4>
                        <p class="text-sm text-gray-600">Works for vision, language, audio, video, and more</p>
                    </div>
                    <div class="concept-card bg-white p-4 border border-gray-200">
                        <div class="text-red-600 mb-2">
                            <i class="fas fa-robot text-xl"></i>
                        </div>
                        <h4 class="font-semibold mb-2">Transferable Features</h4>
                        <p class="text-sm text-gray-600">Learns representations useful for reasoning and planning</p>
                    </div>
                </div>

                <h3 class="text-xl font-semibold mt-6 mb-3 text-gray-700">2.3 The JEPA Training Loop</h3>
                <div class="diagram-container">
                    <div class="flex flex-col items-center">
                        <div class="flex items-center justify-center space-x-8 mb-6">
                            <div class="text-center">
                                <div class="bg-indigo-100 p-3 rounded-lg inline-block">
                                    <i class="fas fa-eye text-indigo-600 text-2xl"></i>
                                </div>
                                <p class="mt-2 font-medium">Context Encoder</p>
                                <p class="text-sm text-gray-600">Takes observed input</p>
                            </div>
                            <div class="text-center">
                                <div class="bg-indigo-100 p-3 rounded-lg inline-block">
                                    <i class="fas fa-project-diagram text-indigo-600 text-2xl"></i>
                                </div>
                                <p class="mt-2 font-medium">Embedding Space</p>
                                <p class="text-sm text-gray-600">Shared representation</p>
                            </div>
                            <div class="text-center">
                                <div class="bg-indigo-100 p-3 rounded-lg inline-block">
                                    <i class="fas fa-bullseye text-indigo-600 text-2xl"></i>
                                </div>
                                <p class="mt-2 font-medium">Target Encoder</p>
                                <p class="text-sm text-gray-600">Takes future/missing part</p>
                            </div>
                        </div>
                        <div class="w-full bg-indigo-50 p-4 rounded-lg">
                            <div class="flex justify-between items-center px-4">
                                <div class="text-center">
                                    <p class="font-medium">Input Context</p>
                                    <p class="text-sm">(e.g., left image half)</p>
                                </div>
<div class="text-center"> |
|
<p class="font-medium">Similarity</p> |
|
<p class="text-sm">Contrastive Loss</p> |
|
</div> |
                                <div class="text-center">
                                    <p class="font-medium">Input Target</p>
                                    <p class="text-sm">(e.g., right image half)</p>
                                </div>
                            </div>
                        </div>
                    </div>
                </div>
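                <p class="text-gray-700 mb-4">
                    The diagram above corresponds to a simple training loop. The sketch below is illustrative: <code>make_context_target_pairs</code> is a hypothetical helper (sketched after the examples that follow), <code>jepa_loss</code> is the sketch from Section 2.1, and the exponential-moving-average (EMA) update of the target encoder is one common choice (used, for example, in I-JEPA) rather than the only option.
                </p>
                <pre class="pseudo-code">
<span class="comment"># Sketch of a JEPA training loop (illustrative, not a reference implementation)</span>
for batch in dataloader:
    x_context, x_target = make_context_target_pairs(batch)   <span class="comment"># e.g., image halves or masked blocks</span>

    loss = jepa_loss(context_encoder, target_encoder, predictor,
                     x_context, x_target)                     <span class="comment"># predictive loss in embedding space</span>

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                          <span class="comment"># updates the context encoder and predictor</span>

    <span class="comment"># The target encoder tracks the context encoder via EMA (no gradients), which helps avoid collapse.</span>
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(0.996).add_(p_c, alpha=0.004)
</pre>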
                <h4 class="font-semibold mt-6 mb-2 text-gray-700">Concrete Examples:</h4>
                <div class="grid grid-cols-1 md:grid-cols-2 gap-4">
                    <div class="bg-gray-50 p-4 rounded-lg border border-gray-200">
                        <div class="flex items-center mb-2">
                            <div class="bg-purple-100 p-2 rounded-full mr-3">
                                <i class="fas fa-image text-purple-600"></i>
                            </div>
                            <h5 class="font-medium">Vision Example</h5>
                        </div>
                        <ul class="list-disc pl-6 text-sm text-gray-700">
                            <li>Context: Left half of a cat image</li>
                            <li>Target: Right half</li>
                            <li>Embeddings should be close if they come from the same photo, far otherwise</li>
                        </ul>
                    </div>
                    <div class="bg-gray-50 p-4 rounded-lg border border-gray-200">
                        <div class="flex items-center mb-2">
                            <div class="bg-green-100 p-2 rounded-full mr-3">
                                <i class="fas fa-language text-green-600"></i>
                            </div>
                            <h5 class="font-medium">Language Example</h5>
                        </div>
                        <ul class="list-disc pl-6 text-sm text-gray-700">
                            <li>Context: "The cat sat on the"</li>
                            <li>Target: "mat"</li>
                            <li>Close if the sequence is real, far if target is random</li>
                        </ul>
                    </div>
                </div>
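                <p class="text-gray-700 mt-4 mb-4">
                    Both examples reduce to the same data interface: split each sample into a context part and a target part. A hypothetical <code>make_context_target_pairs</code> for the vision example might look like the sketch below; the half-image split is only the simplest illustrative choice, and practical systems more often mask out blocks or patches instead.
                </p>
                <pre class="pseudo-code">
<span class="comment"># Hypothetical pair construction for the vision example (left half = context, right half = target)</span>
def make_context_target_pairs(images):
    <span class="comment"># images: tensor of shape (batch, channels, height, width)</span>
    width = images.shape[-1]
    x_context = images[..., : width // 2]      <span class="comment"># left half of each image</span>
    x_target = images[..., width // 2 :]       <span class="comment"># right half of each image</span>
    return x_context, x_target
</pre>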
            </section>

            <section id="cognitive-arch" class="bg-white rounded-xl shadow-sm p-6">
                <h2 class="text-2xl font-bold mb-4 text-gray-800 flex items-center">
                    <i class="fas fa-sitemap text-teal-500 mr-3"></i>
                    <span>3. From Representation to Reasoning: JEPA in Cognitive Architectures</span>
                </h2>

                <p class="text-gray-700 mb-4">
                    JEPA shines as a <span class="font-medium">perception module</span> within a larger, <span class="font-medium">modular cognitive agent</span>. This mirrors biological systems: sensory organs and cortex encode perceptions, while higher reasoning and planning are handled by specialized systems.
                </p>

                <h3 class="text-xl font-semibold mt-6 mb-3 text-gray-700">3.1 The Modular Agent</h3>
                <p class="text-gray-700 mb-4">
                    The LeCun-style architecture for an intelligent agent typically includes:
                </p>

                <div class="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-3 gap-4 mb-6">
                    <div class="concept-card bg-indigo-50 p-4">
                        <div class="flex items-center mb-2">
                            <div class="bg-indigo-100 p-2 rounded-full mr-3">
                                <i class="fas fa-eye text-indigo-600"></i>
                            </div>
                            <h4 class="font-medium">1. Perception Module (JEPA)</h4>
                        </div>
                        <p class="text-sm text-gray-700">Encodes current observation into a compact, predictive embedding</p>
                    </div>
                    <div class="concept-card bg-blue-50 p-4">
                        <div class="flex items-center mb-2">
                            <div class="bg-blue-100 p-2 rounded-full mr-3">
                                <i class="fas fa-memory text-blue-600"></i>
                            </div>
                            <h4 class="font-medium">2. Short-term Memory</h4>
                        </div>
                        <p class="text-sm text-gray-700">Stores recent sequence of embeddings (history)</p>
                    </div>
                    <div class="concept-card bg-purple-50 p-4">
                        <div class="flex items-center mb-2">
                            <div class="bg-purple-100 p-2 rounded-full mr-3">
                                <i class="fas fa-globe text-purple-600"></i>
                            </div>
                            <h4 class="font-medium">3. World Model</h4>
                        </div>
                        <p class="text-sm text-gray-700">Integrates the sequence to produce a latent state</p>
                    </div>
                    <div class="concept-card bg-green-50 p-4">
                        <div class="flex items-center mb-2">
                            <div class="bg-green-100 p-2 rounded-full mr-3">
                                <i class="fas fa-cogs text-green-600"></i>
                            </div>
<h4 class="font-medium">4. Configurator</h4> |
|
</div> |
|
|
|
</html> |