GDPR-EDPB-AI / README.md
arsiba's picture
doc: minor fixes
e267105

A newer version of the Gradio SDK is available: 5.27.1

Upgrade
metadata
title: GDPR AI
emoji: πŸš€
colorFrom: pink
colorTo: pink
sdk: gradio
sdk_version: 5.26.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: RAG AI with GDPR & EDPB PDFs

GDPR/EDPB Legal Assistant - README

Overview

The GDPR/EDPB Legal Assistant is an open-source, Retrieval-Augmented Generation (RAG)-based AI assistant designed to support privacy professionals, legal experts, and Data Protection Officers (DPOs) in navigating complex European data protection regulations, primarily focusing on the General Data Protection Regulation (GDPR) and the European Data Protection Board (EDPB) guidelines.

This project is trained on the complete archive of documents published by the EDPB, including guidelines, decisions, meeting minutes, and the full legal text of the GDPR. This vast dataset is critical for compliance and legal interpretation in the field of data protection.

The main goal is to provide a free and accessible AI assistant that allows users to quickly find relevant legal guidance and regulatory context without having to manually search through thousands of PDF documents.

Key Features

  • Retrieval-Augmented Generation (RAG): Combines document retrieval with generative answering to provide context-aware responses.
  • Legal Document Indexing: Includes the full archive of the EDPB and GDPR legal text as Faiss index.
  • Local Processing Pipeline: All PDF files are preprocessed and converted into the uploaded FAISS vector database.
  • Gradio UI: Clean and interactive interface for querying the assistant.
  • Optimized for Legal Context: Focused on accuracy, transparency, and relevance for legal professionals.

How It Works

1. PDF Preprocessing & Storage

  • All raw EDPB and GDPR documents (PDFs) are parsed locally.
  • Extracted text is split into semantically meaningful chunks.
  • Metadata (e.g., source document, publication date) is extracted and preserved.

2. Vector Embedding & Indexing

  • The text chunks are embedded using Sentence Transformers.
  • Embeddings are indexed using FAISS and stored in this repository:
    • vector_db/index.faiss
    • vector_db/chunks.pkl
    • vector_db/metadata.pkl

3. Query & Answer Generation

  • A user submits a question via the Gradio interface.
  • The system retrieves the most relevant chunks using FAISS.
  • A large language model (Qwen2-7B-Instruct) generates a detailed answer using the retrieved context.

Project Status

  • βœ… ~1,600+ EDPB documents processed
  • βœ… Local FAISS retrieval working
  • 🚧 Applying for Hugging Face GPU grant to scale inference

Roadmap

  • Fine-tune legal QA models on EDPB text
  • Improve source highlighting and reference linking
  • Extend support to other european GDPR and privacy decisions

Contributing

We welcome contributions! Feel free to open issues or submit pull requests to improve preprocessing, model inference, or the UI.