Improve text processing and add better error handling
- README.md +31 -4
- aimakerspace/text_utils.py +33 -6
- app.py +37 -11
- requirements.txt +6 -0
README.md CHANGED
```diff
@@ -166,7 +166,19 @@ Simply put, this downloads the file as a temp file, we load it in with `TextFileLoader`
 
 #### ❓ QUESTION #1:
 
-Why do we want to support streaming? What about streaming is important, or useful?
+Why do we want to support streaming? What about streaming is important, or useful?
+Streaming is important for several key reasons:
+
+1. **Responsiveness & User Experience**: Rather than waiting for the entire response to be generated before seeing anything, users watch the response build word by word. This makes the application feel more responsive and the interaction more engaging.
+
+2. **Resource Management**: Streaming allows better memory management, since we don't need to hold the entire response in memory before sending it. This matters especially with large language models that can generate lengthy responses.
+
+3. **Early Error Detection**: If generation goes wrong, the problem shows up early in the stream rather than only after the complete response, allowing faster error handling and recovery.
+
+4. **Token Management**: When working with API services like OpenAI that charge per token, streaming lets us monitor and potentially control token usage in real time rather than after the fact.
+
+5. **Connection Stability**: In web applications, long-running single requests are more prone to timeouts and connection issues. Streaming breaks the response into smaller chunks, so a network problem surfaces as a fast failure the user can retry, instead of a long stall.
+
 
 ### On Chat Start:
 
```
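To make point 1 concrete, here is a minimal sketch of token streaming in Chainlit, the same `stream_token` pattern `app.py` uses below; the `fake_token_stream` generator is a hypothetical stand-in for a real model stream:

```python
import asyncio
import chainlit as cl


async def fake_token_stream():
    # Stand-in for a real LLM stream (e.g. the OpenAI API with stream=True).
    for token in ["Streaming ", "shows ", "tokens ", "as ", "they ", "arrive."]:
        await asyncio.sleep(0.05)  # simulate per-token generation latency
        yield token


@cl.on_message
async def main(message: cl.Message):
    msg = cl.Message(content="")
    # Each token is pushed to the UI immediately instead of after completion.
    async for token in fake_token_stream():
        await msg.stream_token(token)
    await msg.send()
```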
```diff
@@ -208,7 +220,14 @@ Now, we'll save that into our user session!
 
 #### ❓ QUESTION #2:
 
-Why are we using User Session here? What about Python makes us need to use this? Why not just store everything in a global variable?
+Why are we using User Session here? What about Python makes us need to use this? Why not just store everything in a global variable?
+In Python, a global variable is shared across the entire application process, not scoped per user. If we stored user data in globals, it would be shared across all user sessions, meaning:
+
+1. One user's uploaded PDF would be visible to all other users
+2. Multiple users uploading PDFs would overwrite each other's data
+3. The vector database would contain a mix of documents from different users
+4. Memory usage would grow unbounded as more users upload files
+
 
 ### On Message
 
```
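A minimal sketch of the contrast (the `docs` key is hypothetical; `cl.user_session` is the Chainlit API the commit relies on):

```python
import chainlit as cl

# BAD: a module-level global is shared by every connected user, so the most
# recent upload would silently overwrite everyone else's data.
shared_docs: list = []


@cl.on_chat_start
async def start():
    # GOOD: cl.user_session is scoped to a single websocket session, so each
    # user gets an independent store.
    cl.user_session.set("docs", [])


@cl.on_message
async def main(message: cl.Message):
    docs = cl.user_session.get("docs")  # only this user's data
    docs.append(message.content)
    await cl.Message(content=f"Stored {len(docs)} message(s) in your session.").send()
```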
```diff
@@ -329,10 +348,18 @@ Try uploading a text file and asking some questions!
 
 Upload a PDF file of the recent DeepSeek-R1 paper and ask the following questions:
 
-1. What is RL and how does it help reasoning?
-2. What is the difference between DeepSeek-R1 and DeepSeek-R1-Zero?
+1. What is RL and how does it help reasoning?
+Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by taking actions in an environment to maximize some notion of cumulative reward. In the context of reasoning, RL helps by enabling models such as DeepSeek-R1-Zero to explore and develop reasoning capabilities through self-evolution, without relying on supervised fine-tuning (SFT). This allows the model to discover and improve reasoning patterns autonomously, resulting in enhanced performance on reasoning tasks such as mathematics, coding, and scientific reasoning. Specifically, RL incentivizes the model to engage in complex reasoning processes and generate long chains of thought, contributing to improved outcomes on reasoning benchmarks.
+
+2. What is the difference between DeepSeek-R1 and DeepSeek-R1-Zero?
+DeepSeek-R1 and DeepSeek-R1-Zero differ primarily in their performance capabilities and handling of tasks. DeepSeek-R1 currently falls short of DeepSeek-V3 in areas such as function calling, multi-turn interactions, complex role-playing, and JSON output. It is also sensitive to prompts: few-shot prompting negatively impacts its performance. In contrast, DeepSeek-R1-Zero has shown steady improvement in reasoning capabilities through reinforcement learning (RL), demonstrating competitive performance on the AIME 2024 benchmark. Additionally, DeepSeek-R1-Zero focuses on producing a reasoning process followed by a final answer without content-specific biases, while DeepSeek-R1 may have limitations in language mixing and content readability.
+
 3. What is this paper about?
 
+(BAD VIBES!!!)
+I don't know the answer.
+
+
 Does this application pass your vibe check? Are there any immediate pitfalls you're noticing?
 
 ## 🚧 CHALLENGE MODE 🚧
```
aimakerspace/text_utils.py CHANGED
```diff
@@ -40,8 +40,8 @@ class TextFileLoader:
 class CharacterTextSplitter:
     def __init__(
         self,
-        chunk_size: int = 1000,
-        chunk_overlap: int = 200,
+        chunk_size: int = 2000,
+        chunk_overlap: int = 400,
     ):
         assert (
             chunk_size > chunk_overlap
```
```diff
@@ -51,9 +51,24 @@ class CharacterTextSplitter:
         self.chunk_overlap = chunk_overlap
 
     def split(self, text: str) -> List[str]:
+        paragraphs = text.split('\n\n')
         chunks = []
-        for i in range(0, len(text), self.chunk_size - self.chunk_overlap):
-            chunks.append(text[i : i + self.chunk_size])
+        current_chunk = ""
+
+        for paragraph in paragraphs:
+            if len(current_chunk) + len(paragraph) > self.chunk_size:
+                if current_chunk:
+                    chunks.append(current_chunk.strip())
+                current_chunk = paragraph
+            else:
+                if current_chunk:
+                    current_chunk += "\n\n" + paragraph
+                else:
+                    current_chunk = paragraph
+
+        if current_chunk:
+            chunks.append(current_chunk.strip())
+
         return chunks
 
     def split_texts(self, texts: List[str]) -> List[str]:
```
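Note that the rewritten `split()` never consults `self.chunk_overlap`, and a single paragraph longer than `chunk_size` passes through as one oversized chunk. A quick usage sketch of the new paragraph-aware behavior (the sample text is illustrative):

```python
from aimakerspace.text_utils import CharacterTextSplitter

splitter = CharacterTextSplitter(chunk_size=60, chunk_overlap=20)

text = (
    "First paragraph about RL.\n\n"
    "Second paragraph about DeepSeek-R1.\n\n"
    "Third paragraph about evaluation."
)

# Paragraphs are greedily packed into chunks of roughly chunk_size characters
# (the size check ignores the "\n\n" joiner, so chunks can slightly exceed it).
for i, chunk in enumerate(splitter.split(text)):
    print(f"chunk {i}: {len(chunk)} chars -> {chunk!r}")
```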
```diff
@@ -85,19 +100,31 @@ class PDFLoader:
             self.load_file()
 
         except IOError as e:
+            print(f"IOError while accessing file: {str(e)}")
             raise ValueError(f"Cannot access file at '{self.path}': {str(e)}")
         except Exception as e:
+            print(f"Unexpected error while processing file: {str(e)}")
             raise ValueError(f"Error processing file at '{self.path}': {str(e)}")
 
     def load_file(self):
         with open(self.path, 'rb') as file:
             # Create PDF reader object
             pdf_reader = PyPDF2.PdfReader(file)
+            print(f"PDF loaded successfully. Number of pages: {len(pdf_reader.pages)}")
 
             # Extract text from each page
             text = ""
-            for page in pdf_reader.pages:
-                text += page.extract_text()
+            for i, page in enumerate(pdf_reader.pages):
+                page_text = page.extract_text()
+                if not page_text.strip():
+                    print(f"Warning: Page {i+1} appears to be empty or unreadable")
+                text += page_text + "\n"
+                print(f"Processed page {i+1}, extracted {len(page_text)} characters")
+
+            if not text.strip():
+                print("Warning: No text was extracted from the PDF")
+            else:
+                print(f"Successfully extracted {len(text)} characters of text")
 
             self.documents.append(text)
```
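A usage sketch for the loader (the `sample.pdf` path is hypothetical, and this assumes `PDFLoader.__init__` stores the path and initializes `self.documents`, as the method bodies above imply):

```python
from aimakerspace.text_utils import PDFLoader

loader = PDFLoader("sample.pdf")  # hypothetical path to a local PDF
loader.load_file()                # prints per-page diagnostics as it extracts text
print(f"Loaded {len(loader.documents)} document(s)")
print(loader.documents[0][:200])  # preview the first 200 extracted characters
```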
app.py CHANGED
```diff
@@ -96,6 +96,7 @@ async def on_chat_start():
     ).send()
 
     file = files[0]
+    print(f"Received file: {file.name} ({file.type})")
 
     msg = cl.Message(
         content=f"Processing `{file.name}`..."
```
```diff
@@ -103,15 +104,32 @@ async def on_chat_start():
     await msg.send()
 
     # load the file
-    texts = process_file(file)
-
-    print(f"Processing {len(texts)} text chunks")
+    try:
+        texts = process_file(file)
+        print(f"Successfully processed file. Generated {len(texts)} text chunks")
+        print("Sample of first chunk:", texts[0][:200] if texts else "No texts generated")
+    except Exception as e:
+        print(f"Error processing file: {str(e)}")
+        await cl.Message(content=f"Error processing file: {str(e)}").send()
+        return
 
     # Create a dict vector store
-    vector_db = VectorDatabase()
-    vector_db = await vector_db.abuild_from_list(texts)
+    try:
+        vector_db = VectorDatabase()
+        vector_db = await vector_db.abuild_from_list(texts)
+        print("Successfully created vector database")
+    except Exception as e:
+        print(f"Error creating vector database: {str(e)}")
+        await cl.Message(content=f"Error creating vector database: {str(e)}").send()
+        return
 
-    chat_openai = ChatOpenAI()
+    try:
+        chat_openai = ChatOpenAI()
+        print("Successfully initialized ChatOpenAI")
+    except Exception as e:
+        print(f"Error initializing ChatOpenAI: {str(e)}")
+        await cl.Message(content=f"Error initializing ChatOpenAI: {str(e)}").send()
+        return
 
     # Create a chain
     retrieval_augmented_qa_pipeline = RetrievalAugmentedQAPipeline(
```
```diff
@@ -129,11 +147,19 @@
 @cl.on_message
 async def main(message):
     chain = cl.user_session.get("chain")
+    if not chain:
+        await cl.Message(content="Error: Chat session not initialized. Please try uploading the file again.").send()
+        return
 
     msg = cl.Message(content="")
-    result = await chain.arun_pipeline(message.content)
-
-    async for stream_resp in result["response"]:
-        await msg.stream_token(stream_resp)
+    try:
+        result = await chain.arun_pipeline(message.content)
+        print(f"Retrieved {len(result['context'])} relevant contexts")
+
+        async for stream_resp in result["response"]:
+            await msg.stream_token(stream_resp)
 
-    await msg.send()
+        await msg.send()
+    except Exception as e:
+        print(f"Error in chat pipeline: {str(e)}")
+        await cl.Message(content=f"Error processing your question: {str(e)}").send()
```
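The handler above assumes `arun_pipeline` returns a dict with a `'context'` list and an async-iterable `'response'`. A hypothetical stub that satisfies this contract, useful for exercising the streaming and error paths without calling OpenAI:

```python
from typing import AsyncIterator


class StubPipeline:
    """Hypothetical stand-in for RetrievalAugmentedQAPipeline, matching the
    {'context': [...], 'response': <async iterator>} shape main() expects."""

    async def arun_pipeline(self, user_query: str) -> dict:
        async def response() -> AsyncIterator[str]:
            for token in ["Stubbed ", "answer ", "for: ", user_query]:
                yield token

        return {"context": [("placeholder context", 1.0)], "response": response()}
```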
requirements.txt ADDED

```diff
@@ -0,0 +1,6 @@
+chainlit==2.0.4
+numpy==2.2.2
+openai==1.59.9
+pydantic==2.10.1
+pypdf2==3.0.1
+websockets==14.2
```