atolat30 committed · Commit 83f80d2 · 1 Parent(s): 2b61f9d

Improve text processing and add better error handling

Files changed (4):
  1. README.md +31 -4
  2. aimakerspace/text_utils.py +33 -6
  3. app.py +37 -11
  4. requirements.txt +6 -0
README.md CHANGED
@@ -166,7 +166,19 @@ Simply put, this downloads the file as a temp file, we load it in with `TextFileLoader`
 
 #### ❓ QUESTION #1:
 
-Why do we want to support streaming? What about streaming is important, or useful?
+Why do we want to support streaming? What about streaming is important, or useful?
+Streaming is important for several key reasons:
+
+1. **Responsiveness & User Experience**: Rather than waiting for the entire response to be generated before seeing anything, users see the response being built word by word. This creates a more engaging, interactive experience and makes the application feel more responsive.
+
+2. **Resource Management**: Streaming allows for better memory management since we don't need to hold the entire response in memory before sending it. This is especially important when dealing with large language models that can generate lengthy responses.
+
+3. **Early Error Detection**: If there's an issue with the generation, it can be detected early in the stream rather than after the complete response. This allows for faster error handling and recovery.
+
+4. **Token Management**: When working with API services like OpenAI that charge by the token, streaming lets us monitor and potentially control token usage in real time rather than after the fact.
+
+5. **Connection Stability**: In web applications, long-running single requests are more prone to timeouts and connection issues. Streaming breaks the response into smaller chunks, making the communication more resilient to network instability. A network problem then surfaces as a fast failure the user can retry, rather than a multi-second stall.
+
 
 ### On Chat Start:
 
@@ -208,7 +220,14 @@ Now, we'll save that into our user session!
 
 #### ❓ QUESTION #2:
 
-Why are we using User Session here? What about Python makes us need to use this? Why not just store everything in a global variable?
+Why are we using User Session here? What about Python makes us need to use this? Why not just store everything in a global variable?
+In Python, global variables belong to the whole server process rather than to an individual user. If we stored user data in global variables, it would be shared across all user sessions, meaning:
+
+1. One user's uploaded PDF would be visible to all other users
+2. Multiple users uploading PDFs would overwrite each other's data
+3. The vector database would contain a mix of documents from different users
+4. Memory usage would grow unbounded as more users upload files
+
 
 ### On Message
 
@@ -329,10 +348,18 @@ Try uploading a text file and asking some questions!
 
 Upload a PDF file of the recent DeepSeek-R1 paper and ask the following questions:
 
-1. What is RL and how does it help reasoning?
-2. What is the difference between DeepSeek-R1 and DeepSeek-R1-Zero?
+1. What is RL and how does it help reasoning?
+Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by taking actions in an environment to maximize some notion of cumulative reward. In the context of reasoning, RL helps by enabling models, such as DeepSeek-R1-Zero, to explore and develop reasoning capabilities through self-evolution without relying on supervised fine-tuning (SFT). This allows the model to discover and improve reasoning patterns autonomously, resulting in enhanced performance on reasoning tasks such as mathematics, coding, and scientific reasoning. Specifically, RL incentivizes the model to engage in complex reasoning processes and generate long chains of thought, contributing to improved outcomes on reasoning benchmarks.
+
+2. What is the difference between DeepSeek-R1 and DeepSeek-R1-Zero?
+DeepSeek-R1 and DeepSeek-R1-Zero differ primarily in their performance capabilities and handling of tasks. DeepSeek-R1 currently falls short compared to DeepSeek-V3 in areas such as function calling, multi-turn interactions, complex role-playing, and JSON output. It is also sensitive to prompts, where few-shot prompting negatively impacts its performance. In contrast, DeepSeek-R1-Zero has shown a steady improvement in reasoning capabilities through reinforcement learning (RL), demonstrating significant competitive performance on the AIME 2024 benchmark. Additionally, DeepSeek-R1-Zero focuses on producing a reasoning process followed by a final answer without content-specific biases, while DeepSeek-R1 may have limitations in language mixing and content readability.
+
 3. What is this paper about?
 
+(BAD VIBES!!!)
+I don't know the answer.
+
+
 Does this application pass your vibe check? Are there any immediate pitfalls you're noticing?
 
 ## 🚧 CHALLENGE MODE 🚧
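To ground point 1 of the streaming answer in code: a minimal sketch of token-by-token streaming using the `chainlit` and `openai` versions pinned in `requirements.txt` below (the handler and model name are illustrative, not the app's actual `main`):

```python
import chainlit as cl
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


@cl.on_message
async def handle(message: cl.Message):
    msg = cl.Message(content="")

    # Request a streamed completion instead of one blocking response.
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": message.content}],
        stream=True,
    )

    # Forward each chunk to the UI as soon as the model emits it, so the
    # user watches the answer build up instead of staring at a spinner.
    async for chunk in stream:
        token = chunk.choices[0].delta.content
        if token:
            await msg.stream_token(token)

    await msg.send()
```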
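Likewise for Question #2, Chainlit's per-session store is the mechanism `app.py` relies on. A self-contained sketch of the pattern (the `"counter"` key is illustrative; the app itself stores its pipeline under `"chain"`):

```python
import chainlit as cl


@cl.on_chat_start
async def start():
    # Values set here live in this user's session, not in the process:
    # two connected users each get their own counter.
    cl.user_session.set("counter", 0)


@cl.on_message
async def respond(message: cl.Message):
    # A module-level global here would be shared (and clobbered) across
    # every user the server process is handling concurrently.
    counter = cl.user_session.get("counter") + 1
    cl.user_session.set("counter", counter)
    await cl.Message(content=f"Messages in your session: {counter}").send()
```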
aimakerspace/text_utils.py CHANGED
@@ -40,8 +40,8 @@ class TextFileLoader:
 class CharacterTextSplitter:
     def __init__(
         self,
-        chunk_size: int = 1000,
-        chunk_overlap: int = 200,
+        chunk_size: int = 2000,
+        chunk_overlap: int = 400,
     ):
         assert (
             chunk_size > chunk_overlap
@@ -51,9 +51,24 @@ class CharacterTextSplitter:
         self.chunk_overlap = chunk_overlap
 
     def split(self, text: str) -> List[str]:
+        paragraphs = text.split('\n\n')
         chunks = []
-        for i in range(0, len(text), self.chunk_size - self.chunk_overlap):
-            chunks.append(text[i : i + self.chunk_size])
+        current_chunk = ""
+
+        for paragraph in paragraphs:
+            if len(current_chunk) + len(paragraph) > self.chunk_size:
+                if current_chunk:
+                    chunks.append(current_chunk.strip())
+                current_chunk = paragraph
+            else:
+                if current_chunk:
+                    current_chunk += "\n\n" + paragraph
+                else:
+                    current_chunk = paragraph
+
+        if current_chunk:
+            chunks.append(current_chunk.strip())
+
         return chunks
 
     def split_texts(self, texts: List[str]) -> List[str]:
@@ -85,19 +100,31 @@ class PDFLoader:
             self.load_file()
 
         except IOError as e:
+            print(f"IOError while accessing file: {str(e)}")
             raise ValueError(f"Cannot access file at '{self.path}': {str(e)}")
         except Exception as e:
+            print(f"Unexpected error while processing file: {str(e)}")
             raise ValueError(f"Error processing file at '{self.path}': {str(e)}")
 
     def load_file(self):
         with open(self.path, 'rb') as file:
             # Create PDF reader object
             pdf_reader = PyPDF2.PdfReader(file)
+            print(f"PDF loaded successfully. Number of pages: {len(pdf_reader.pages)}")
 
             # Extract text from each page
             text = ""
-            for page in pdf_reader.pages:
-                text += page.extract_text() + "\n"
+            for i, page in enumerate(pdf_reader.pages):
+                page_text = page.extract_text()
+                if not page_text.strip():
+                    print(f"Warning: Page {i+1} appears to be empty or unreadable")
+                text += page_text + "\n"
+                print(f"Processed page {i+1}, extracted {len(page_text)} characters")
+
+            if not text.strip():
+                print("Warning: No text was extracted from the PDF")
+            else:
+                print(f"Successfully extracted {len(text)} characters of text")
 
             self.documents.append(text)
 
app.py CHANGED
@@ -96,6 +96,7 @@ async def on_chat_start():
     ).send()
 
     file = files[0]
+    print(f"Received file: {file.name} ({file.type})")
 
     msg = cl.Message(
         content=f"Processing `{file.name}`..."
@@ -103,15 +104,32 @@ async def on_chat_start():
     await msg.send()
 
     # load the file
-    texts = process_file(file)
-
-    print(f"Processing {len(texts)} text chunks")
+    try:
+        texts = process_file(file)
+        print(f"Successfully processed file. Generated {len(texts)} text chunks")
+        print("Sample of first chunk:", texts[0][:200] if texts else "No texts generated")
+    except Exception as e:
+        print(f"Error processing file: {str(e)}")
+        await cl.Message(content=f"Error processing file: {str(e)}").send()
+        return
 
     # Create a dict vector store
-    vector_db = VectorDatabase()
-    vector_db = await vector_db.abuild_from_list(texts)
+    try:
+        vector_db = VectorDatabase()
+        vector_db = await vector_db.abuild_from_list(texts)
+        print("Successfully created vector database")
+    except Exception as e:
+        print(f"Error creating vector database: {str(e)}")
+        await cl.Message(content=f"Error creating vector database: {str(e)}").send()
+        return
 
-    chat_openai = ChatOpenAI()
+    try:
+        chat_openai = ChatOpenAI()
+        print("Successfully initialized ChatOpenAI")
+    except Exception as e:
+        print(f"Error initializing ChatOpenAI: {str(e)}")
+        await cl.Message(content=f"Error initializing ChatOpenAI: {str(e)}").send()
+        return
 
     # Create a chain
     retrieval_augmented_qa_pipeline = RetrievalAugmentedQAPipeline(
@@ -129,11 +147,19 @@
 @cl.on_message
 async def main(message):
     chain = cl.user_session.get("chain")
+    if not chain:
+        await cl.Message(content="Error: Chat session not initialized. Please try uploading the file again.").send()
+        return
 
     msg = cl.Message(content="")
-    result = await chain.arun_pipeline(message.content)
-
-    async for stream_resp in result["response"]:
-        await msg.stream_token(stream_resp)
+    try:
+        result = await chain.arun_pipeline(message.content)
+        print(f"Retrieved {len(result['context'])} relevant contexts")
+
+        async for stream_resp in result["response"]:
+            await msg.stream_token(stream_resp)
 
-    await msg.send()
+        await msg.send()
+    except Exception as e:
+        print(f"Error in chat pipeline: {str(e)}")
+        await cl.Message(content=f"Error processing your question: {str(e)}").send()
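For reference, `main` above only assumes that `arun_pipeline` returns a dict with a sized `context` list and an async-iterable `response` of string tokens. A minimal stand-in matching that contract (the class name and echo behavior are illustrative, not the repo's actual pipeline):

```python
from typing import AsyncIterator


class EchoPipeline:
    """Minimal stand-in for the pipeline interface main() relies on."""

    async def arun_pipeline(self, user_query: str) -> dict:
        async def token_stream() -> AsyncIterator[str]:
            # The real pipeline streams tokens from the chat model; this
            # stand-in just echoes the question back in pieces.
            for token in ["You ", "asked: ", user_query]:
                yield token

        return {"context": [], "response": token_stream()}
```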
requirements.txt ADDED
@@ -0,0 +1,6 @@
+chainlit==2.0.4
+numpy==2.2.2
+openai==1.59.9
+pydantic==2.10.1
+pypdf2==3.0.1
+websockets==14.2