atolat30 committed · Commit 83f80d2 · 1 Parent(s): 2b61f9d

Improve text processing and add better error handling

Files changed (4):
  1. README.md +31 -4
  2. aimakerspace/text_utils.py +33 -6
  3. app.py +37 -11
  4. requirements.txt +6 -0
README.md CHANGED
@@ -166,7 +166,19 @@ Simply put, this downloads the file as a temp file, we load it in with `TextFileLoader`
 
 #### ❓ QUESTION #1:
 
-Why do we want to support streaming? What about streaming is important, or useful?
+Why do we want to support streaming? What about streaming is important, or useful?
+Streaming is important for several key reasons:
+
+1. **Responsiveness & User Experience**: Rather than waiting for the entire response to be generated before seeing anything, users see the response being built word by word. This creates a more engaging, interactive experience and makes the application feel more responsive.
+
+2. **Resource Management**: Streaming allows for better memory management since we don't need to hold the entire response in memory before sending it. This is especially important when dealing with large language models that can generate lengthy responses.
+
+3. **Early Error Detection**: If there's an issue with the generation, it can be detected early in the stream rather than after the complete response. This allows for faster error handling and recovery.
+
+4. **Token Management**: When working with API services like OpenAI that charge by the token, streaming lets us monitor and potentially control token usage in real time rather than after the fact.
+
+5. **Connection Stability**: In web applications, long-running single requests are more prone to timeouts and connection issues. Streaming breaks the response into smaller chunks, making the communication more resilient to network instability. A network problem then surfaces as a fast failure the user can retry, rather than a multi-second stall.
+
 
 ### On Chat Start:
 
@@ -208,7 +220,14 @@ Now, we'll save that into our user session!
 
 #### ❓ QUESTION #2:
 
-Why are we using User Session here? What about Python makes us need to use this? Why not just store everything in a global variable?
+Why are we using User Session here? What about Python makes us need to use this? Why not just store everything in a global variable?
+In Python, global variables belong to the whole server process rather than to an individual user. If we stored user data in global variables, it would be shared across all user sessions, meaning:
+
+1. One user's uploaded PDF would be visible to all other users
+2. Multiple users uploading PDFs would overwrite each other's data
+3. The vector database would contain a mix of documents from different users
+4. Memory usage would grow unbounded as more users upload files
+
 
 ### On Message
 
@@ -329,10 +348,18 @@ Try uploading a text file and asking some questions!
 
 Upload a PDF file of the recent DeepSeek-R1 paper and ask the following questions:
 
-1. What is RL and how does it help reasoning?
-2. What is the difference between DeepSeek-R1 and DeepSeek-R1-Zero?
+1. What is RL and how does it help reasoning?
+Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by taking actions in an environment to maximize some notion of cumulative reward. In the context of reasoning, RL helps by enabling models, such as DeepSeek-R1-Zero, to explore and develop reasoning capabilities through self-evolution without relying on supervised fine-tuning (SFT). This allows the model to discover and improve reasoning patterns autonomously, resulting in enhanced performance on reasoning tasks such as mathematics, coding, and scientific reasoning. Specifically, RL incentivizes the model to engage in complex reasoning processes and generate long chains of thought, contributing to improved outcomes on reasoning benchmarks.
+
+2. What is the difference between DeepSeek-R1 and DeepSeek-R1-Zero?
+DeepSeek-R1 and DeepSeek-R1-Zero differ primarily in their performance capabilities and handling of tasks. DeepSeek-R1 currently falls short compared to DeepSeek-V3 in areas such as function calling, multi-turn interactions, complex role-playing, and JSON output. It is also sensitive to prompts, where few-shot prompting negatively impacts its performance. In contrast, DeepSeek-R1-Zero has shown a steady improvement in reasoning capabilities through reinforcement learning (RL), demonstrating significant competitive performance on the AIME 2024 benchmark. Additionally, DeepSeek-R1-Zero focuses on producing a reasoning process followed by a final answer without content-specific biases, while DeepSeek-R1 may have limitations in language mixing and content readability.
+
 3. What is this paper about?
 
+(BAD VIBES!!!)
+I don't know the answer.
+
+
 Does this application pass your vibe check? Are there any immediate pitfalls you're noticing?
 
 ## 🚧 CHALLENGE MODE 🚧
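To ground point 1 of the streaming answer in code: a minimal sketch of token-by-token streaming using the `chainlit` and `openai` versions pinned in `requirements.txt` below (the handler and model name are illustrative, not the app's actual `main`):

```python
import chainlit as cl
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


@cl.on_message
async def handle(message: cl.Message):
    msg = cl.Message(content="")

    # Request a streamed completion instead of one blocking response.
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": message.content}],
        stream=True,
    )

    # Forward each chunk to the UI as soon as the model emits it, so the
    # user watches the answer build up instead of staring at a spinner.
    async for chunk in stream:
        token = chunk.choices[0].delta.content
        if token:
            await msg.stream_token(token)

    await msg.send()
```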
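Likewise for Question #2, Chainlit's per-session store is the mechanism `app.py` relies on. A self-contained sketch of the pattern (the `"counter"` key is illustrative; the app itself stores its pipeline under `"chain"`):

```python
import chainlit as cl


@cl.on_chat_start
async def start():
    # Values set here live in this user's session, not in the process:
    # two connected users each get their own counter.
    cl.user_session.set("counter", 0)


@cl.on_message
async def respond(message: cl.Message):
    # A module-level global here would be shared (and clobbered) across
    # every user the server process is handling concurrently.
    counter = cl.user_session.get("counter") + 1
    cl.user_session.set("counter", counter)
    await cl.Message(content=f"Messages in your session: {counter}").send()
```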
aimakerspace/text_utils.py CHANGED
@@ -40,8 +40,8 @@ class TextFileLoader:
 class CharacterTextSplitter:
     def __init__(
         self,
-        chunk_size: int = 1000,
-        chunk_overlap: int = 200,
+        chunk_size: int = 2000,
+        chunk_overlap: int = 400,
     ):
         assert (
             chunk_size > chunk_overlap
@@ -51,9 +51,24 @@ class CharacterTextSplitter:
         self.chunk_overlap = chunk_overlap
 
     def split(self, text: str) -> List[str]:
+        paragraphs = text.split('\n\n')
         chunks = []
-        for i in range(0, len(text), self.chunk_size - self.chunk_overlap):
-            chunks.append(text[i : i + self.chunk_size])
+        current_chunk = ""
+
+        for paragraph in paragraphs:
+            if len(current_chunk) + len(paragraph) > self.chunk_size:
+                if current_chunk:
+                    chunks.append(current_chunk.strip())
+                current_chunk = paragraph
+            else:
+                if current_chunk:
+                    current_chunk += "\n\n" + paragraph
+                else:
+                    current_chunk = paragraph
+
+        if current_chunk:
+            chunks.append(current_chunk.strip())
+
         return chunks
 
     def split_texts(self, texts: List[str]) -> List[str]:
@@ -85,19 +100,31 @@ class PDFLoader:
             self.load_file()
 
         except IOError as e:
+            print(f"IOError while accessing file: {str(e)}")
             raise ValueError(f"Cannot access file at '{self.path}': {str(e)}")
         except Exception as e:
+            print(f"Unexpected error while processing file: {str(e)}")
             raise ValueError(f"Error processing file at '{self.path}': {str(e)}")
 
     def load_file(self):
         with open(self.path, 'rb') as file:
             # Create PDF reader object
             pdf_reader = PyPDF2.PdfReader(file)
+            print(f"PDF loaded successfully. Number of pages: {len(pdf_reader.pages)}")
 
             # Extract text from each page
             text = ""
-            for page in pdf_reader.pages:
-                text += page.extract_text() + "\n"
+            for i, page in enumerate(pdf_reader.pages):
+                page_text = page.extract_text()
+                if not page_text.strip():
+                    print(f"Warning: Page {i+1} appears to be empty or unreadable")
+                text += page_text + "\n"
+                print(f"Processed page {i+1}, extracted {len(page_text)} characters")
+
+            if not text.strip():
+                print("Warning: No text was extracted from the PDF")
+            else:
+                print(f"Successfully extracted {len(text)} characters of text")
 
             self.documents.append(text)
 
app.py CHANGED
@@ -96,6 +96,7 @@ async def on_chat_start():
     ).send()
 
     file = files[0]
+    print(f"Received file: {file.name} ({file.type})")
 
     msg = cl.Message(
         content=f"Processing `{file.name}`..."
@@ -103,15 +104,32 @@ async def on_chat_start():
     await msg.send()
 
     # load the file
-    texts = process_file(file)
-
-    print(f"Processing {len(texts)} text chunks")
+    try:
+        texts = process_file(file)
+        print(f"Successfully processed file. Generated {len(texts)} text chunks")
+        print("Sample of first chunk:", texts[0][:200] if texts else "No texts generated")
+    except Exception as e:
+        print(f"Error processing file: {str(e)}")
+        await cl.Message(content=f"Error processing file: {str(e)}").send()
+        return
 
     # Create a dict vector store
-    vector_db = VectorDatabase()
-    vector_db = await vector_db.abuild_from_list(texts)
+    try:
+        vector_db = VectorDatabase()
+        vector_db = await vector_db.abuild_from_list(texts)
+        print("Successfully created vector database")
+    except Exception as e:
+        print(f"Error creating vector database: {str(e)}")
+        await cl.Message(content=f"Error creating vector database: {str(e)}").send()
+        return
 
-    chat_openai = ChatOpenAI()
+    try:
+        chat_openai = ChatOpenAI()
+        print("Successfully initialized ChatOpenAI")
+    except Exception as e:
+        print(f"Error initializing ChatOpenAI: {str(e)}")
+        await cl.Message(content=f"Error initializing ChatOpenAI: {str(e)}").send()
+        return
 
     # Create a chain
     retrieval_augmented_qa_pipeline = RetrievalAugmentedQAPipeline(
@@ -129,11 +147,19 @@
 @cl.on_message
 async def main(message):
     chain = cl.user_session.get("chain")
+    if not chain:
+        await cl.Message(content="Error: Chat session not initialized. Please try uploading the file again.").send()
+        return
 
     msg = cl.Message(content="")
-    result = await chain.arun_pipeline(message.content)
-
-    async for stream_resp in result["response"]:
-        await msg.stream_token(stream_resp)
+    try:
+        result = await chain.arun_pipeline(message.content)
+        print(f"Retrieved {len(result['context'])} relevant contexts")
+
+        async for stream_resp in result["response"]:
+            await msg.stream_token(stream_resp)
 
-    await msg.send()
+        await msg.send()
+    except Exception as e:
+        print(f"Error in chat pipeline: {str(e)}")
+        await cl.Message(content=f"Error processing your question: {str(e)}").send()
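For reference, `main` above only assumes that `arun_pipeline` returns a dict with a sized `context` list and an async-iterable `response` of string tokens. A minimal stand-in matching that contract (the class name and echo behavior are illustrative, not the repo's actual pipeline):

```python
from typing import AsyncIterator


class EchoPipeline:
    """Minimal stand-in for the pipeline interface main() relies on."""

    async def arun_pipeline(self, user_query: str) -> dict:
        async def token_stream() -> AsyncIterator[str]:
            # The real pipeline streams tokens from the chat model; this
            # stand-in just echoes the question back in pieces.
            for token in ["You ", "asked: ", user_query]:
                yield token

        return {"context": [], "response": token_stream()}
```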
requirements.txt ADDED
@@ -0,0 +1,6 @@
+chainlit==2.0.4
+numpy==2.2.2
+openai==1.59.9
+pydantic==2.10.1
+pypdf2==3.0.1
+websockets==14.2