Improve context retrieval and prompt templates
README.md CHANGED

@@ -359,6 +359,22 @@ DeepSeek-R1 and DeepSeek-R1-Zero differ primarily in their performance capabilit
 (BAD VIBES!!!)
 I don't know the answer.
 
+More details
+
+```
+Using 457 words of context
+
+Final messages being sent to the model:
+
+System prompt:
+{'role': 'system', 'content': 'You are a helpful AI assistant that answers questions based on the provided context. \nYour task is to:\n1. Carefully read and understand the context\n2. Answer the user\'s question using ONLY the information from the context\n3. If the answer cannot be found in the context, say "I cannot find the answer in the provided context"\n4. If you find partial information, share what you found and indicate if more information might be needed\n\nRemember: Only use information from the provided context to answer questions.'}
+
+User prompt:
+{'role': 'user', 'content': 'Context:\nKumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022. P . Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui. Math-shepherd: A label- free step-by-step verifier for llms in mathematical reasoning. arXiv preprint arXiv:2312.08935 , 2023. X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022. Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. CoRR , abs/2406.01574, 2024. URL https://doi.org/10.48550/arXiv.2406.01574 . C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying llm-based software engineering\nagents. arXiv preprint, 2024. H. Xin, Z. Z. Ren, J. Song, Z. Shao, W. Zhao, H. Wang, B. Liu, L. Zhang, X. Lu, Q. Du, W. Gao, Q. Zhu, D. Yang, Z. Gou, Z. F. Wu, F. Luo, and C. Ruan. Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search, 2024. URL https://arxiv.org/abs/2408.08152 . J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023. 19 Appendix A. Contributions and Acknowledgments Core Contributors Daya Guo Dejian Yang Haowei Zhang Junxiao Song Ruoyu Zhang Runxin Xu Qihao Zhu Shirong Ma Peiyi Wang Xiao Bi Xiaokang Zhang Xingkai Yu Yu Wu Z.F. Wu Zhibin Gou Zhihong Shao Zhuoshu Li Ziyi Gao Contributors Aixin Liu Bing Xue Bingxuan Wang Bochao Wu Bei Feng Chengda Lu Chenggang Zhao Chengqi Deng Chong Ruan Damai Dai Deli Chen Dongjie Ji Erhang Li Fangyun Lin Fucong Dai Fuli Luo* Guangbo Hao Guanting Chen Guowei Li\nGong, N. Duan, and T. Baldwin. CMMLU: Measur- ing massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212 , 2023. T. Li, W.-L. Chiang, E. Frick, L. Dunlap, T. Wu, B. Zhu, J. E. Gonzalez, and I. Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv preprint arXiv:2406.11939, 2024. H. Lightman, V . Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023. B. Y. Lin. ZeroEval: A Unified Framework for Evaluating Language Models, July 2024. URL https://github.com/WildEval/ZeroEval . MAA. American invitational mathematics examination - aime. In American Invitational Mathematics Examination -AIME 2024 , February 2024. URL https://maa.org/math -competitions/american-invitational-mathematics-examination-aime . OpenAI. Hello GPT-4o, 2024a. URL https://openai.com/index/hello-gpt-4o/ . OpenAI. Learning to reason\n\n\nQuestion:\nWhat is this paper about?\n'}
+Retrieved 3 relevant contexts
+2025-04-15 02:06:19 - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
+```
+
 
 Does this application pass your vibe check? Are there any immediate pitfalls you're noticing?
 
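For reference, the "Final messages being sent to the model" log above corresponds to a single chat-completions request. Below is a minimal sketch of that call, assuming the current openai Python client; the model name and the truncated message bodies are placeholders, not values read from the app.

```python
# Sketch only: "gpt-4o-mini" and the elided message bodies are assumptions,
# not taken from the app's configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    {"role": "system", "content": "You are a helpful AI assistant that answers questions based on the provided context. ..."},
    {"role": "user", "content": "Context:\n...\n\nQuestion:\nWhat is this paper about?\n"},
]

# This is the POST to https://api.openai.com/v1/chat/completions seen in the log line above.
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)
```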
app.py CHANGED

@@ -27,9 +27,12 @@ user_prompt_template = """\
 Context:
 {context}
 
+Based on the above context, please answer the following question. If the answer cannot be found in the context, say "I cannot find the answer in the provided context." If you find partial information, share what you found and indicate if more information might be needed.
+
 Question:
 {question}
-
+
+Please provide a clear and concise answer based ONLY on the information in the context above."""
 user_role_prompt = UserRolePrompt(user_prompt_template)
 
 class RetrievalAugmentedQAPipeline:
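To see what the revised user prompt renders to, here is a small sketch assuming plain str.format-style substitution (the repo's UserRolePrompt wrapper may fill the template differently); the context and question values are made up.

```python
# Only the template text comes from the diff above; the example values are hypothetical.
user_prompt_template = """\
Context:
{context}

Based on the above context, please answer the following question. If the answer cannot be found in the context, say "I cannot find the answer in the provided context." If you find partial information, share what you found and indicate if more information might be needed.

Question:
{question}

Please provide a clear and concise answer based ONLY on the information in the context above."""

print(user_prompt_template.format(
    context="DeepSeek-R1 is a reasoning model trained with large-scale reinforcement learning.",
    question="What is this paper about?",
))
```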
@@ -38,23 +41,29 @@ class RetrievalAugmentedQAPipeline:
         self.vector_db_retriever = vector_db_retriever
 
     async def arun_pipeline(self, user_query: str):
-        # Get more contexts
-
+        # Get more contexts with a broader search
+        print("\nSearching for relevant contexts...")
+        context_list = self.vector_db_retriever.search_by_text(user_query, k=5) # Increased from 3 to 5
+
         print("\nRetrieved contexts:")
         for i, (context, score) in enumerate(context_list):
             print(f"\nContext {i+1} (score: {score:.3f}):")
-            print(context[:
+            print(context[:500] + "..." if len(context) > 500 else context) # Show more context
 
         # Limit total context length to approximately 3000 tokens (12000 characters)
         context_prompt = ""
         total_length = 0
         max_length = 12000 # Reduced from 24000 to 12000
 
-
-
-
-
-        total_length
+        # Sort contexts by score before truncating
+        sorted_contexts = sorted(context_list, key=lambda x: x[1], reverse=True)
+
+        for context, score in sorted_contexts:
+            if total_length + len(context) > max_length:
+                print(f"\nSkipping context with score {score:.3f} due to length limit")
+                continue
+            context_prompt += context + "\n"
+            total_length += len(context)
 
         print(f"\nUsing {len(context_prompt.split())} words of context")
 
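The budgeting logic in this hunk can be exercised on its own. Here is a self-contained sketch with made-up (context, score) pairs; in the app these come from self.vector_db_retriever.search_by_text(user_query, k=5).

```python
# Made-up (context, score) pairs stand in for the retriever's output.
context_list = [
    ("short but highly relevant chunk", 0.91),
    ("x" * 13000, 0.88),  # longer than the whole budget, will be skipped
    ("another moderately relevant chunk", 0.73),
]

max_length = 12000  # roughly 3000 tokens at ~4 characters per token
context_prompt = ""
total_length = 0

# Highest-scoring chunks get first claim on the character budget.
for context, score in sorted(context_list, key=lambda x: x[1], reverse=True):
    if total_length + len(context) > max_length:
        # `continue` (not `break`): skip the oversized chunk but keep trying
        # shorter, lower-scoring ones.
        print(f"Skipping context with score {score:.3f} due to length limit")
        continue
    context_prompt += context + "\n"
    total_length += len(context)

print(f"Using {len(context_prompt.split())} words of context")
```

Because the loop uses continue rather than break, a single oversized chunk does not end context collection; lower-scoring chunks that still fit within the 12,000-character budget are kept.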