fikird committed on
Commit 48922fa · 1 Parent(s): ad4c231

Complete rewrite of ISE with advanced RAG and OSINT capabilities

Files changed (8)
  1. README.md +102 -40
  2. app.py +200 -281
  3. engines/image.py +164 -0
  4. engines/osint.py +167 -0
  5. engines/search.py +133 -0
  6. requirements.txt +27 -43
  7. utils/helpers.py +160 -0
  8. utils/web.py +128 -0
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- title: Intelligent Search Engine
  emoji: 🔍
  colorFrom: blue
  colorTo: indigo
@@ -9,62 +9,124 @@ app_file: app.py
  pinned: false
  ---

- # 🔍 Intelligent Search Engine

- An AI-powered search engine that provides intelligent summaries and insights from web content.

- ## Features

- - 🌐 Web search powered by DuckDuckGo
- - 🤖 AI-powered content summarization
- - 📊 Semantic search capabilities
- - 📱 Clean, responsive UI

- ## Technical Details

  ### Core Components

- 1. **Search Engine (`search_engine.py`)**
-    - DuckDuckGo integration for web search
-    - Content processing and summarization
-    - URL validation and metadata extraction

- 2. **Web Interface (`app.py`)**
-    - Gradio-based UI
-    - Error handling
-    - Result formatting

- ### Models

- - Summarization: facebook/bart-base
- - Embeddings: sentence-transformers/all-MiniLM-L6-v2

- ### Dependencies

- - Python 3.10
- - Gradio 5.7.1
- - Transformers
- - DuckDuckGo Search
- - BeautifulSoup4
- - Langchain
- - Sentence Transformers

- ## Usage

- 1. Enter your search query in the text box
- 2. Adjust the number of results using the slider
- 3. Click "Submit" to see the results

- ## Example Queries

- - "Latest developments in artificial intelligence"
- - "Climate change solutions"
- - "Space exploration news"

- ## Deployment

- This project is deployed on Hugging Face Spaces, optimized for CPU environments.

- ## License

- Apache 2.0
  ---
+ title: Intelligent Search Engine (ISE)
  emoji: 🔍
  colorFrom: blue
  colorTo: indigo

  pinned: false
  ---

+ # 🔍 Intelligent Search Engine (ISE)

+ An advanced OSINT search engine with RAG capabilities and multi-modal search features.

+ ## 🌟 Features

+ ### 🌐 Intelligent Search
+ - Web search with context understanding
+ - AI-powered answer synthesis
+ - Source citation and verification
+ - RAG-based knowledge retrieval

+ ### 👤 OSINT Capabilities
+ - Username search across multiple platforms
+ - Person search (name, age, location)
+ - Social media profile exploration
+ - Personal information gathering
+ - Historical data retrieval
+
+ ### 📸 Image Analysis
+ - Face detection and recognition
+ - Object and scene recognition
+ - Image metadata extraction
+ - Similar image search
+ - Cross-reference with social media
+
+ ### 🗺️ Location Intelligence
+ - Geographic information analysis
+ - Location-based searching
+ - Address validation and normalization
+ - Proximity analysis
+
+ ## 🛠️ Technology Stack

  ### Core Components
+ - Python 3.10+
+ - LangChain for RAG capabilities
+ - HuggingFace Transformers
+ - PyTorch (CPU optimized)
+ - Gradio for UI

+ ### Search & Scraping
+ - DuckDuckGo Search
+ - Google Search Python
+ - BeautifulSoup4
+ - Requests/AIOHTTP

+ ### OSINT Tools
+ - Holehe
+ - Sherlock Project
+ - Python WHOIS
+ - Geopy

+ ### Image Processing
+ - Face Recognition
+ - Pillow
+ - Torchvision

+ ## 📦 Installation

+ 1. Clone the repository:
+ ```bash
+ git clone https://github.com/yourusername/intelligent-search-engine.git
+ cd intelligent-search-engine
+ ```

+ 2. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 3. Run the application:
+ ```bash
+ python app.py
+ ```
+
+ ## 🎯 Usage
+
+ ### Web Interface
+ The application provides a user-friendly web interface with multiple tabs:
+
+ 1. **Search Tab**
+    - Enter your search query
+    - Get AI-powered answers with sources
+
+ 2. **Username Search Tab**
+    - Search usernames across platforms
+    - View consolidated social media presence
+
+ 3. **Person Search Tab**
+    - Search by name, location, age
+    - Get comprehensive personal information
+
+ 4. **Image Analysis Tab**
+    - Upload images for analysis
+    - Detect faces and objects
+    - Search for similar images

+ ## 🔒 Privacy & Security

+ - No sensitive data storage
+ - Anonymized result presentation
+ - Rate limiting for API calls
+ - Basic URL validation
+ - Secure data handling

+ ## 🤝 Contributing

+ 1. Fork the repository
+ 2. Create a feature branch
+ 3. Commit your changes
+ 4. Push to the branch
+ 5. Create a Pull Request

+ ## 📝 License

+ This project is licensed under the MIT License - see the LICENSE file for details.

+ ## ⚠️ Disclaimer

+ This tool is for educational and research purposes only. Users are responsible for complying with applicable laws and regulations regarding information gathering and privacy.
app.py CHANGED
@@ -1,306 +1,225 @@
1
- import gradio as gr
 
 
 
2
  import asyncio
3
- from search_engine import search, advanced_search
4
- from osint_engine import create_report
5
- import time
 
 
 
6
 
7
- def format_results(results):
8
- if not results:
 
 
 
 
 
 
9
  return "No results found."
10
 
11
- if isinstance(results, list):
12
- # Format web search results
13
- formatted_results = []
14
- for result in results:
15
- formatted_result = f"""
16
- ### [{result['title']}]({result['url']})
 
 
17
 
18
- {result['summary']}
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
19
 
20
- **Source:** {result['url']}
21
- **Published:** {result.get('published_date', 'N/A')}
22
- """
23
- formatted_results.append(formatted_result)
24
- return "\n---\n".join(formatted_results)
25
- elif isinstance(results, dict):
26
- # Format OSINT results
27
- if "error" in results:
28
- return f"Error: {results['error']}"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
29
 
30
- formatted = []
 
 
31
 
32
- # Web results
33
- if "web" in results:
34
- formatted.append(format_results(results["web"]))
35
 
36
- # Username/Platform results
37
- if "platforms" in results:
38
- platforms = results["platforms"]
39
- if platforms:
40
- formatted.append("\n### πŸ” Platform Results\n")
41
- for platform in platforms:
42
- formatted.append(f"""
43
- - **Platform:** {platform['platform']}
44
- **URL:** [{platform['url']}]({platform['url']})
45
- **Status:** {'Found βœ…' if platform.get('exists', False) else 'Not Found ❌'}
46
- """)
47
-
48
- # Image analysis
49
- if "analysis" in results:
50
- analysis = results["analysis"]
51
- if analysis:
52
- formatted.append("\n### πŸ–ΌοΈ Image Analysis\n")
53
- for key, value in analysis.items():
54
- formatted.append(f"- **{key.title()}:** {value}")
55
-
56
- # Similar images
57
- if "similar_images" in results:
58
- similar = results["similar_images"]
59
- if similar:
60
- formatted.append("\n### πŸ” Similar Images\n")
61
- for img in similar:
62
- formatted.append(f"- [{img['source']}]({img['url']})")
63
-
64
- # Location info
65
- if "location" in results:
66
- location = results["location"]
67
- if location and not isinstance(location, str):
68
- formatted.append("\n### πŸ“ Location Information\n")
69
- for key, value in location.items():
70
- if key != 'raw':
71
- formatted.append(f"- **{key.title()}:** {value}")
72
-
73
- # Domain info
74
- if "domain" in results:
75
- domain = results["domain"]
76
- if domain and not isinstance(domain, str):
77
- formatted.append("\n### 🌐 Domain Information\n")
78
- for key, value in domain.items():
79
- formatted.append(f"- **{key.title()}:** {value}")
80
-
81
- # Historical data
82
- if "historical" in results:
83
- historical = results["historical"]
84
- if historical:
85
- formatted.append("\n### πŸ“… Historical Data\n")
86
- for entry in historical[:5]: # Limit to 5 entries
87
- formatted.append(f"""
88
- - **Date:** {entry.get('timestamp', 'N/A')}
89
- **URL:** [{entry.get('url', 'N/A')}]({entry.get('url', '#')})
90
- **Type:** {entry.get('mime_type', 'N/A')}
91
- """)
92
 
93
- return "\n".join(formatted) if formatted else "No relevant information found."
94
- else:
95
- return str(results)
96
-
97
- def safe_search(query, search_type="web", max_results=5, platform=None,
98
- image_url=None, phone=None, location=None, domain=None,
99
- name=None, address=None, progress=gr.Progress()):
100
- """Safe wrapper for search functions"""
101
- try:
102
- kwargs = {
103
- "max_results": max_results,
104
- "platform": platform,
105
- "phone": phone,
106
- "location": location,
107
- "domain": domain,
108
- "name": name,
109
- "address": address
110
- }
111
 
112
- progress(0, desc="Initializing search...")
113
- time.sleep(0.5) # Show loading state
 
114
 
115
- if search_type == "web":
116
- progress(0.3, desc="Searching web...")
117
- results = search(query, max_results)
118
- else:
119
- # For async searches
120
- if search_type == "image" and image_url:
121
- query = image_url
122
- progress(0.5, desc=f"Performing {search_type} search...")
123
- loop = asyncio.new_event_loop()
124
- asyncio.set_event_loop(loop)
125
- results = loop.run_until_complete(advanced_search(query, search_type, **kwargs))
126
- loop.close()
127
 
128
- progress(0.8, desc="Processing results...")
129
- time.sleep(0.5) # Show processing state
130
- progress(1.0, desc="Done!")
131
- return format_results(results)
132
  except Exception as e:
133
  return f"Error: {str(e)}"
134
 
135
- # Create Gradio interface
136
- with gr.Blocks(theme=gr.themes.Soft()) as demo:
137
- gr.Markdown("# πŸ” Intelligent Search Engine")
138
- gr.Markdown("""
139
- An AI-powered search engine with advanced OSINT capabilities.
140
-
141
- Features:
142
- - Web search with AI summaries
143
- - Username search across platforms
144
- - Image search and analysis
145
- - Social media profile search
146
- - Personal information gathering
147
- - Historical data search
148
- """)
149
-
150
- with gr.Tab("Web Search"):
151
- with gr.Row():
152
- query_input = gr.Textbox(
153
- label="Search Query",
154
- placeholder="Enter your search query...",
155
- lines=2
156
- )
157
- max_results = gr.Slider(
158
- minimum=1,
159
- maximum=10,
160
- value=5,
161
- step=1,
162
- label="Number of Results"
163
- )
164
- search_button = gr.Button("πŸ” Search", variant="primary")
165
- results_output = gr.Markdown(label="Search Results")
166
- search_button.click(
167
- fn=safe_search,
168
- inputs=[query_input, gr.State("web"), max_results],
169
- outputs=results_output,
170
- show_progress=True
171
- )
172
 
173
- with gr.Tab("Username Search"):
174
- username_input = gr.Textbox(
175
- label="Username",
176
- placeholder="Enter username to search..."
177
- )
178
- username_button = gr.Button("πŸ” Search Username", variant="primary")
179
- username_output = gr.Markdown(label="Username Search Results")
180
- username_button.click(
181
- fn=safe_search,
182
- inputs=[username_input, gr.State("username")],
183
- outputs=username_output,
184
- show_progress=True
185
- )
186
 
187
- with gr.Tab("Image Search"):
188
- with gr.Row():
189
- image_url = gr.Textbox(
190
- label="Image URL",
191
- placeholder="Enter image URL to search..."
192
- )
193
- image_upload = gr.Image(
194
- label="Or Upload Image",
195
- type="filepath"
196
- )
197
- image_button = gr.Button("πŸ” Search Image", variant="primary")
198
- image_output = gr.Markdown(label="Image Search Results")
199
-
200
- def handle_image_search(url, uploaded_image):
201
- if uploaded_image:
202
- return safe_search(uploaded_image, "image", image_url=uploaded_image)
203
- return safe_search(url, "image", image_url=url)
204
-
205
- image_button.click(
206
- fn=handle_image_search,
207
- inputs=[image_url, image_upload],
208
- outputs=image_output,
209
- show_progress=True
210
- )
211
-
212
- with gr.Tab("Social Media Search"):
213
- with gr.Row():
214
- social_username = gr.Textbox(
215
- label="Username",
216
- placeholder="Enter username..."
217
- )
218
- platform = gr.Dropdown(
219
- choices=[
220
- "all", "twitter", "instagram", "facebook", "linkedin",
221
- "github", "reddit", "youtube", "tiktok", "pinterest",
222
- "snapchat", "twitch", "medium", "devto", "stackoverflow"
223
- ],
224
- value="all",
225
- label="Platform"
226
- )
227
- social_button = gr.Button("πŸ” Search Social Media", variant="primary")
228
- social_output = gr.Markdown(label="Social Media Results")
229
- social_button.click(
230
- fn=safe_search,
231
- inputs=[social_username, gr.State("social"), gr.State(5), platform],
232
- outputs=social_output,
233
- show_progress=True
234
- )
235
-
236
- with gr.Tab("Personal Info"):
237
- with gr.Group():
238
- with gr.Row():
239
- name = gr.Textbox(label="Full Name", placeholder="John Doe")
240
- address = gr.Textbox(label="Address/Location", placeholder="City, Country")
241
- initial_search = gr.Button("πŸ” Find Possible Matches", variant="primary")
242
- matches_output = gr.Markdown(label="Possible Matches")
243
-
244
- with gr.Row(visible=False) as details_row:
245
- selected_person = gr.Dropdown(
246
- choices=[],
247
- label="Select Person",
248
- interactive=True
249
  )
250
- details_button = gr.Button("πŸ” Get Detailed Info", variant="secondary")
251
-
252
- details_output = gr.Markdown(label="Detailed Information")
253
 
254
- def find_matches(name, address):
255
- return safe_search(name, "personal", name=name, location=address)
 
 
 
 
 
 
 
 
 
 
 
 
 
256
 
257
- def get_details(person):
258
- if not person:
259
- return "Please select a person first."
260
- return safe_search(person, "personal", name=person)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
261
 
262
- initial_search.click(
263
- fn=find_matches,
264
- inputs=[name, address],
265
- outputs=matches_output
266
- ).then(
267
- lambda: gr.Row(visible=True),
268
- None,
269
- details_row
270
- )
271
-
272
- details_button.click(
273
- fn=get_details,
274
- inputs=[selected_person],
275
- outputs=details_output,
276
- show_progress=True
277
- )
278
-
279
- with gr.Tab("Historical Data"):
280
- url_input = gr.Textbox(
281
- label="URL",
282
- placeholder="Enter URL to search historical data..."
283
- )
284
- historical_button = gr.Button("πŸ” Search Historical Data", variant="primary")
285
- historical_output = gr.Markdown(label="Historical Data Results")
286
- historical_button.click(
287
- fn=safe_search,
288
- inputs=[url_input, gr.State("historical")],
289
- outputs=historical_output,
290
- show_progress=True
291
- )
292
 
293
- gr.Markdown("""
294
- ### Examples
295
- Try these example searches:
296
- - Web Search: "Latest developments in artificial intelligence"
297
- - Username: "johndoe"
298
- - Image URL: "https://images.app.goo.gl/w5BtxZKvzg6BdkGE8"
299
- - Social Media: "techuser" on Twitter
300
- - Personal Info: "John Smith" in "New York, USA"
301
- - Historical Data: "example.com"
302
- """)
303
 
304
- # Launch the app
305
  if __name__ == "__main__":
306
- demo.launch()
 
 
1
+ """
2
+ Intelligent Search Engine with RAG and OSINT capabilities.
3
+ """
4
+ import os
5
  import asyncio
6
+ import gradio as gr
7
+ from engines.search import SearchEngine
8
+ from engines.osint import OSINTEngine
9
+ from engines.image import ImageEngine
10
+ import markdown2
11
+ from typing import Dict, Any, List
12
 
13
+ # Initialize engines
14
+ search_engine = SearchEngine()
15
+ osint_engine = OSINTEngine()
16
+ image_engine = ImageEngine()
17
+
18
+ def format_search_results(results: Dict[str, Any]) -> str:
19
+ """Format search results with markdown."""
20
+ if not results or "answer" not in results:
21
  return "No results found."
22
 
23
+ formatted = f"### Answer\n{results['answer']}\n\n"
24
+
25
+ if results.get("sources"):
26
+ formatted += "\n### Sources\n"
27
+ for i, source in enumerate(results["sources"], 1):
28
+ formatted += f"{i}. [{source}]({source})\n"
29
+
30
+ return formatted
31
 
32
+ def format_osint_results(results: Dict[str, Any]) -> str:
33
+ """Format OSINT results with markdown."""
34
+ formatted = "### OSINT Results\n\n"
35
+
36
+ if "error" in results:
37
+ return f"Error: {results['error']}"
38
+
39
+ if "found_on" in results:
40
+ formatted += "#### Social Media Presence\n"
41
+ for platform in results["found_on"]:
42
+ formatted += f"- {platform['platform']}: [{platform['url']}]({platform['url']})\n"
43
+
44
+ if "person_info" in results:
45
+ person = results["person_info"]
46
+ formatted += f"\n#### Personal Information\n"
47
+ formatted += f"- Name: {person.get('name', 'N/A')}\n"
48
+ if person.get("age"):
49
+ formatted += f"- Age: {person['age']}\n"
50
+ if person.get("location"):
51
+ formatted += f"- Location: {person['location']}\n"
52
+ if person.get("gender"):
53
+ formatted += f"- Gender: {person['gender']}\n"
54
+
55
+ return formatted
56
 
57
+ async def search_query(query: str) -> str:
58
+ """Handle search queries."""
59
+ try:
60
+ results = await search_engine.search(query)
61
+ return format_search_results(results)
62
+ except Exception as e:
63
+ return f"Error: {str(e)}"
64
+
65
+ async def search_username(username: str) -> str:
66
+ """Search for username across platforms."""
67
+ try:
68
+ results = await osint_engine.search_username(username)
69
+ return format_osint_results(results)
70
+ except Exception as e:
71
+ return f"Error: {str(e)}"
72
+
73
+ async def search_person(name: str, location: str = "", age: str = "", gender: str = "") -> str:
74
+ """Search for person information."""
75
+ try:
76
+ age_int = int(age) if age.strip() else None
77
+ person = await osint_engine.search_person(
78
+ name=name,
79
+ location=location if location.strip() else None,
80
+ age=age_int,
81
+ gender=gender if gender.strip() else None
82
+ )
83
+ return format_osint_results({"person_info": person.to_dict()})
84
+ except Exception as e:
85
+ return f"Error: {str(e)}"
86
+
87
+ async def analyze_image_file(image) -> str:
88
+ """Analyze uploaded image."""
89
+ try:
90
+ if not image:
91
+ return "No image provided."
92
 
93
+ # Read image data
94
+ with open(image.name, "rb") as f:
95
+ image_data = f.read()
96
 
97
+ # Analyze image
98
+ results = await image_engine.analyze_image(image_data)
 
99
 
100
+ if "error" in results:
101
+ return f"Error analyzing image: {results['error']}"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
102
 
103
+ # Format results
104
+ formatted = "### Image Analysis Results\n\n"
105
+
106
+ # Add predictions
107
+ formatted += "#### Content Detection\n"
108
+ for pred in results["predictions"]:
109
+ confidence = pred["confidence"] * 100
110
+ formatted += f"- {pred['label']}: {confidence:.1f}%\n"
 
 
 
 
 
 
 
 
 
 
111
 
112
+ # Add face detection results
113
+ formatted += f"\n#### Face Detection\n"
114
+ formatted += f"- Found {len(results['faces'])} faces\n"
115
 
116
+ # Add metadata
117
+ formatted += f"\n#### Image Metadata\n"
118
+ metadata = results["metadata"]
119
+ formatted += f"- Size: {metadata['width']}x{metadata['height']}\n"
120
+ formatted += f"- Format: {metadata['format']}\n"
121
+ formatted += f"- Mode: {metadata['mode']}\n"
122
+
123
+ return formatted
 
 
 
 
124
 
 
 
 
 
125
  except Exception as e:
126
  return f"Error: {str(e)}"
127
 
128
+ def create_ui() -> gr.Blocks:
129
+ """Create the Gradio interface."""
130
+ with gr.Blocks(title="Intelligent Search Engine", theme=gr.themes.Soft()) as app:
131
+ gr.Markdown("""
132
+ # 🔍 Intelligent Search Engine
133
 
134
+ Advanced search engine with RAG and OSINT capabilities.
135
+ """)
 
136
 
137
+ with gr.Tabs():
138
+ # Intelligent Search Tab
139
+ with gr.Tab("🌐 Search"):
140
+ with gr.Column():
141
+ search_input = gr.Textbox(
142
+ label="Enter your search query",
143
+ placeholder="What would you like to know?"
144
+ )
145
+ search_button = gr.Button("Search", variant="primary")
146
+ search_output = gr.Markdown(label="Results")
147
+
148
+ search_button.click(
149
+ fn=search_query,
150
+ inputs=search_input,
151
+ outputs=search_output
 
152
  )
 
 
 
153
 
154
+ # Username Search Tab
155
+ with gr.Tab("👤 Username Search"):
156
+ with gr.Column():
157
+ username_input = gr.Textbox(
158
+ label="Enter username",
159
+ placeholder="Username to search across platforms"
160
+ )
161
+ username_button = gr.Button("Search Username", variant="primary")
162
+ username_output = gr.Markdown(label="Results")
163
+
164
+ username_button.click(
165
+ fn=search_username,
166
+ inputs=username_input,
167
+ outputs=username_output
168
+ )
169
 
170
+ # Person Search Tab
171
+ with gr.Tab("👥 Person Search"):
172
+ with gr.Column():
173
+ name_input = gr.Textbox(
174
+ label="Full Name",
175
+ placeholder="Enter person's name"
176
+ )
177
+ location_input = gr.Textbox(
178
+ label="Location (optional)",
179
+ placeholder="City, Country"
180
+ )
181
+ age_input = gr.Textbox(
182
+ label="Age (optional)",
183
+ placeholder="Enter age"
184
+ )
185
+ gender_input = gr.Dropdown(
186
+ label="Gender (optional)",
187
+ choices=["", "Male", "Female", "Other"]
188
+ )
189
+ person_button = gr.Button("Search Person", variant="primary")
190
+ person_output = gr.Markdown(label="Results")
191
+
192
+ person_button.click(
193
+ fn=search_person,
194
+ inputs=[name_input, location_input, age_input, gender_input],
195
+ outputs=person_output
196
+ )
197
 
198
+ # Image Analysis Tab
199
+ with gr.Tab("🖼️ Image Analysis"):
200
+ with gr.Column():
201
+ image_input = gr.File(
202
+ label="Upload Image",
203
+ file_types=["image"]
204
+ )
205
+ image_button = gr.Button("Analyze Image", variant="primary")
206
+ image_output = gr.Markdown(label="Results")
207
+
208
+ image_button.click(
209
+ fn=analyze_image_file,
210
+ inputs=image_input,
211
+ outputs=image_output
212
+ )
 
213
 
214
+ gr.Markdown("""
215
+ ### 📝 Notes
216
+ - The search engine uses RAG (Retrieval-Augmented Generation) for intelligent answers
217
+ - OSINT capabilities include social media presence, personal information, and image analysis
218
+ - All searches are conducted using publicly available information
219
+ """)
220
+
221
+ return app
 
 
222
 
 
223
  if __name__ == "__main__":
224
+ app = create_ui()
225
+ app.launch(share=True)
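
The async handlers above can also be exercised without launching the Gradio UI. A minimal smoke-test sketch, assuming `app.py` is importable from the project root (importing it instantiates the three engines, which downloads the underlying models on first run); the query and username below are placeholders:

```python
import asyncio

from app import search_query, search_username  # importing app.py initializes the engines


async def main():
    # RAG web search handler: returns a markdown-formatted answer with sources.
    print(await search_query("open source intelligence tools"))

    # Username handler: results depend on network access and the OSINT backends.
    print(await search_username("example_user"))


if __name__ == "__main__":
    asyncio.run(main())
```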
engines/image.py ADDED
@@ -0,0 +1,164 @@
1
+ """
2
+ Image analysis engine for processing and analyzing images.
3
+ """
4
+ from typing import Dict, Any, List, Optional
5
+ import io
6
+ from PIL import Image
7
+ import torch
8
+ from torchvision import transforms
9
+ from transformers import AutoFeatureExtractor, AutoModelForImageClassification
10
+ import face_recognition
11
+ import numpy as np
12
+ from tenacity import retry, stop_after_attempt, wait_exponential
13
+
14
+ class ImageEngine:
15
+ def __init__(self):
16
+ # Initialize image classification model
17
+ self.feature_extractor = AutoFeatureExtractor.from_pretrained(
18
+ "microsoft/resnet-50"
19
+ )
20
+ self.model = AutoModelForImageClassification.from_pretrained(
21
+ "microsoft/resnet-50"
22
+ )
23
+
24
+ # Set up image transforms
25
+ self.transform = transforms.Compose([
26
+ transforms.Resize(256),
27
+ transforms.CenterCrop(224),
28
+ transforms.ToTensor(),
29
+ transforms.Normalize(
30
+ mean=[0.485, 0.456, 0.406],
31
+ std=[0.229, 0.224, 0.225]
32
+ )
33
+ ])
34
+
35
+ @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
36
+ async def analyze_image(self, image_data: bytes) -> Dict[str, Any]:
37
+ """Analyze image content and detect objects/faces."""
38
+ try:
39
+ # Load image
40
+ image = Image.open(io.BytesIO(image_data)).convert('RGB')
41
+
42
+ # Prepare image for model
43
+ inputs = self.feature_extractor(images=image, return_tensors="pt")
44
+
45
+ # Get model predictions
46
+ with torch.no_grad():
47
+ outputs = self.model(**inputs)
48
+ probs = outputs.logits.softmax(-1)
49
+
50
+ # Get top predictions
51
+ top_probs, top_indices = torch.topk(probs, k=5)
52
+
53
+ # Convert predictions to list
54
+ predictions = [
55
+ {
56
+ "label": self.model.config.id2label[idx.item()],
57
+ "confidence": prob.item()
58
+ }
59
+ for prob, idx in zip(top_probs[0], top_indices[0])
60
+ ]
61
+
62
+ # Analyze faces
63
+ np_image = np.array(image)
64
+ face_locations = face_recognition.face_locations(np_image)
65
+ face_encodings = face_recognition.face_encodings(np_image, face_locations)
66
+
67
+ faces = []
68
+ for i, (face_encoding, face_location) in enumerate(zip(face_encodings, face_locations)):
69
+ face = {
70
+ "id": i + 1,
71
+ "location": {
72
+ "top": face_location[0],
73
+ "right": face_location[1],
74
+ "bottom": face_location[2],
75
+ "left": face_location[3]
76
+ },
77
+ "encoding": face_encoding.tolist()
78
+ }
79
+ faces.append(face)
80
+
81
+ # Get image metadata
82
+ metadata = {
83
+ "format": image.format,
84
+ "mode": image.mode,
85
+ "size": image.size,
86
+ "width": image.width,
87
+ "height": image.height
88
+ }
89
+
90
+ return {
91
+ "predictions": predictions,
92
+ "faces": faces,
93
+ "metadata": metadata
94
+ }
95
+
96
+ except Exception as e:
97
+ return {"error": str(e)}
98
+
99
+ async def compare_faces(self, face1_data: bytes, face2_data: bytes) -> Dict[str, Any]:
100
+ """Compare two faces and determine if they are the same person."""
101
+ try:
102
+ # Load and process first image
103
+ image1 = face_recognition.load_image_file(io.BytesIO(face1_data))
104
+ face1_encoding = face_recognition.face_encodings(image1)
105
+
106
+ if not face1_encoding:
107
+ return {"error": "No face found in first image"}
108
+
109
+ # Load and process second image
110
+ image2 = face_recognition.load_image_file(io.BytesIO(face2_data))
111
+ face2_encoding = face_recognition.face_encodings(image2)
112
+
113
+ if not face2_encoding:
114
+ return {"error": "No face found in second image"}
115
+
116
+ # Compare faces
117
+ results = face_recognition.compare_faces(
118
+ [face1_encoding[0]], face2_encoding[0]
119
+ )
120
+
121
+ # Calculate face distance (lower means more similar)
122
+ face_distance = face_recognition.face_distance(
123
+ [face1_encoding[0]], face2_encoding[0]
124
+ )
125
+
126
+ return {
127
+ "match": bool(results[0]),
128
+ "confidence": float(1 - face_distance[0]),
129
+ "distance": float(face_distance[0])
130
+ }
131
+
132
+ except Exception as e:
133
+ return {"error": str(e)}
134
+
135
+ async def search_similar_faces(self,
136
+ target_encoding: List[float],
137
+ face_database: List[Dict[str, Any]],
138
+ threshold: float = 0.6) -> List[Dict[str, Any]]:
139
+ """Search for similar faces in a database of face encodings."""
140
+ try:
141
+ matches = []
142
+ target_encoding = np.array(target_encoding)
143
+
144
+ for face_data in face_database:
145
+ if "encoding" not in face_data:
146
+ continue
147
+
148
+ current_encoding = np.array(face_data["encoding"])
149
+ distance = face_recognition.face_distance([target_encoding], current_encoding)[0]
150
+
151
+ if distance < threshold:
152
+ matches.append({
153
+ "face_id": face_data.get("id"),
154
+ "confidence": float(1 - distance),
155
+ "metadata": face_data.get("metadata", {})
156
+ })
157
+
158
+ # Sort matches by confidence
159
+ matches.sort(key=lambda x: x["confidence"], reverse=True)
160
+
161
+ return matches
162
+
163
+ except Exception as e:
164
+ return [{"error": str(e)}]
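
A usage sketch for the `ImageEngine` above, assuming it is importable as `engines.image` and that `face_recognition` (which requires dlib) and the `microsoft/resnet-50` weights are available; the two file names are placeholders:

```python
import asyncio

from engines.image import ImageEngine


async def main():
    engine = ImageEngine()  # loads the ResNet-50 classifier and feature extractor

    with open("photo_a.jpg", "rb") as f:
        data_a = f.read()
    with open("photo_b.jpg", "rb") as f:
        data_b = f.read()

    # Object/scene predictions, face locations, and basic metadata for one image.
    analysis = await engine.analyze_image(data_a)
    print(analysis.get("predictions"))
    print(analysis.get("metadata"))

    # Face-to-face comparison between the two images (match flag plus distance).
    print(await engine.compare_faces(data_a, data_b))


if __name__ == "__main__":
    asyncio.run(main())
```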
engines/osint.py ADDED
@@ -0,0 +1,167 @@
1
+ """
2
+ OSINT engine for comprehensive information gathering.
3
+ """
4
+ from typing import Dict, List, Any, Optional
5
+ import asyncio
6
+ import json
7
+ from dataclasses import dataclass
8
+ import holehe.core as holehe
9
+ from sherlock import sherlock
10
+ import face_recognition
11
+ import numpy as np
12
+ from PIL import Image
13
+ import io
14
+ import requests
15
+ from geopy.geocoders import Nominatim
16
+ from geopy.exc import GeocoderTimedOut
17
+ import whois
18
+ from datetime import datetime
19
+ from tenacity import retry, stop_after_attempt, wait_exponential
20
+
21
+ @dataclass
22
+ class PersonInfo:
23
+ name: str
24
+ age: Optional[int] = None
25
+ location: Optional[str] = None
26
+ gender: Optional[str] = None
27
+ social_profiles: List[Dict[str, str]] = None
28
+ images: List[str] = None
29
+
30
+ def to_dict(self) -> Dict[str, Any]:
31
+ return {
32
+ "name": self.name,
33
+ "age": self.age,
34
+ "location": self.location,
35
+ "gender": self.gender,
36
+ "social_profiles": self.social_profiles or [],
37
+ "images": self.images or []
38
+ }
39
+
40
+ class OSINTEngine:
41
+ def __init__(self):
42
+ self.geolocator = Nominatim(user_agent="intelligent_search_engine")
43
+ self.known_platforms = [
44
+ "Twitter", "Instagram", "Facebook", "LinkedIn", "GitHub",
45
+ "Reddit", "YouTube", "TikTok", "Pinterest", "Snapchat",
46
+ "Twitch", "Medium", "Dev.to", "Stack Overflow"
47
+ ]
48
+
49
+ @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
50
+ async def search_username(self, username: str) -> Dict[str, Any]:
51
+ """Search for username across multiple platforms."""
52
+ results = []
53
+
54
+ # Use holehe for email-based search
55
+ email = f"{username}@gmail.com" # Example email
56
+ holehe_results = await holehe.check_email(email)
57
+
58
+ # Use sherlock for username search
59
+ sherlock_results = sherlock.sherlock(username, self.known_platforms, verbose=False)
60
+
61
+ # Combine results
62
+ for platform, data in {**holehe_results, **sherlock_results}.items():
63
+ if data.get("exists", False):
64
+ results.append({
65
+ "platform": platform,
66
+ "url": data.get("url", ""),
67
+ "confidence": data.get("confidence", "high")
68
+ })
69
+
70
+ return {
71
+ "username": username,
72
+ "found_on": results
73
+ }
74
+
75
+ async def search_person(self, name: str, location: Optional[str] = None,
76
+ age: Optional[int] = None, gender: Optional[str] = None) -> PersonInfo:
77
+ """Search for information about a person."""
78
+ person = PersonInfo(
79
+ name=name,
80
+ age=age,
81
+ location=location,
82
+ gender=gender
83
+ )
84
+
85
+ # Initialize social profiles list
86
+ person.social_profiles = []
87
+
88
+ # Search for social media profiles
89
+ username_variants = [
90
+ name.replace(" ", ""),
91
+ name.replace(" ", "_"),
92
+ name.replace(" ", "."),
93
+ name.lower().replace(" ", "")
94
+ ]
95
+
96
+ for username in username_variants:
97
+ results = await self.search_username(username)
98
+ person.social_profiles.extend(results.get("found_on", []))
99
+
100
+ return person
101
+
102
+ @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
103
+ async def analyze_image(self, image_data: bytes) -> Dict[str, Any]:
104
+ """Analyze an image for faces and other identifiable information."""
105
+ try:
106
+ # Load image
107
+ image = face_recognition.load_image_file(io.BytesIO(image_data))
108
+
109
+ # Detect faces
110
+ face_locations = face_recognition.face_locations(image)
111
+ face_encodings = face_recognition.face_encodings(image, face_locations)
112
+
113
+ results = {
114
+ "faces_found": len(face_locations),
115
+ "faces": []
116
+ }
117
+
118
+ # Analyze each face
119
+ for i, (face_encoding, face_location) in enumerate(zip(face_encodings, face_locations)):
120
+ face_data = {
121
+ "location": {
122
+ "top": face_location[0],
123
+ "right": face_location[1],
124
+ "bottom": face_location[2],
125
+ "left": face_location[3]
126
+ }
127
+ }
128
+ results["faces"].append(face_data)
129
+
130
+ return results
131
+ except Exception as e:
132
+ return {"error": str(e)}
133
+
134
+ async def search_location(self, location: str) -> Dict[str, Any]:
135
+ """Gather information about a location."""
136
+ try:
137
+ # Geocode the location
138
+ location_data = self.geolocator.geocode(location, timeout=10)
139
+
140
+ if not location_data:
141
+ return {"error": "Location not found"}
142
+
143
+ return {
144
+ "address": location_data.address,
145
+ "latitude": location_data.latitude,
146
+ "longitude": location_data.longitude,
147
+ "raw": location_data.raw
148
+ }
149
+ except GeocoderTimedOut:
150
+ return {"error": "Geocoding service timed out"}
151
+ except Exception as e:
152
+ return {"error": str(e)}
153
+
154
+ async def analyze_domain(self, domain: str) -> Dict[str, Any]:
155
+ """Analyze a domain for WHOIS and other information."""
156
+ try:
157
+ w = whois.whois(domain)
158
+ return {
159
+ "registrar": w.registrar,
160
+ "creation_date": w.creation_date,
161
+ "expiration_date": w.expiration_date,
162
+ "last_updated": w.updated_date,
163
+ "status": w.status,
164
+ "name_servers": w.name_servers
165
+ }
166
+ except Exception as e:
167
+ return {"error": str(e)}
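
A sketch of the lighter-weight `OSINTEngine` helpers above (geocoding and WHOIS), assuming the module is importable as `engines.osint`; the location and domain are only examples, and `search_username` is skipped here because it depends on the holehe/sherlock imports shown above:

```python
import asyncio

from engines.osint import OSINTEngine


async def main():
    engine = OSINTEngine()

    # Geocode a free-form location string via Nominatim (network access required).
    print(await engine.search_location("Berlin, Germany"))

    # WHOIS lookup: registrar, creation/expiration dates, status, name servers.
    print(await engine.analyze_domain("example.com"))


if __name__ == "__main__":
    asyncio.run(main())
```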
engines/search.py ADDED
@@ -0,0 +1,133 @@
1
+ """
2
+ RAG-based search engine with intelligent answer synthesis.
3
+ """
4
+ from typing import List, Dict, Any, Optional
5
+ import asyncio
6
+ from langchain.chains import RetrievalQAWithSourcesChain
7
+ from langchain.embeddings import HuggingFaceEmbeddings
8
+ from langchain.vectorstores import FAISS
9
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
10
+ from langchain.docstore.document import Document
11
+ from duckduckgo_search import DDGS
12
+ from googlesearch import search as gsearch
13
+ import requests
14
+ from bs4 import BeautifulSoup
15
+ from tenacity import retry, stop_after_attempt, wait_exponential
16
+
17
+ class SearchEngine:
18
+ def __init__(self):
19
+ self.embeddings = HuggingFaceEmbeddings(
20
+ model_name="sentence-transformers/all-mpnet-base-v2"
21
+ )
22
+ self.text_splitter = RecursiveCharacterTextSplitter(
23
+ chunk_size=500,
24
+ chunk_overlap=50
25
+ )
26
+
27
+ @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
28
+ async def search_web(self, query: str, max_results: int = 10) -> List[Dict[str, str]]:
29
+ """Perform web search using multiple search engines."""
30
+ results = []
31
+
32
+ # DuckDuckGo Search
33
+ try:
34
+ with DDGS() as ddgs:
35
+ ddg_results = [r for r in ddgs.text(query, max_results=max_results)]
36
+ results.extend(ddg_results)
37
+ except Exception as e:
38
+ print(f"DuckDuckGo search error: {e}")
39
+
40
+ # Google Search
41
+ try:
42
+ google_results = gsearch(query, num_results=max_results)
43
+ results.extend([{"link": url, "title": url} for url in google_results])
44
+ except Exception as e:
45
+ print(f"Google search error: {e}")
46
+
47
+ return results[:max_results]
48
+
49
+ @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
50
+ async def fetch_content(self, url: str) -> Optional[str]:
51
+ """Fetch and extract content from a webpage."""
52
+ try:
53
+ headers = {
54
+ "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
55
+ }
56
+ response = requests.get(url, headers=headers, timeout=10)
57
+ response.raise_for_status()
58
+
59
+ soup = BeautifulSoup(response.text, "html.parser")
60
+
61
+ # Remove unwanted elements
62
+ for element in soup(["script", "style", "nav", "footer", "header"]):
63
+ element.decompose()
64
+
65
+ text = soup.get_text(separator="\n", strip=True)
66
+ return text
67
+ except Exception as e:
68
+ print(f"Error fetching {url}: {e}")
69
+ return None
70
+
71
+ async def process_search_results(self, query: str) -> Dict[str, Any]:
72
+ """Process search results and create a RAG-based answer."""
73
+ # Perform web search
74
+ search_results = await self.search_web(query)
75
+
76
+ # Fetch content from search results
77
+ documents = []
78
+ for result in search_results:
79
+ url = result.get("link")
80
+ if not url:
81
+ continue
82
+
83
+ content = await self.fetch_content(url)
84
+ if content:
85
+ # Split content into chunks
86
+ chunks = self.text_splitter.split_text(content)
87
+ for chunk in chunks:
88
+ doc = Document(
89
+ page_content=chunk,
90
+ metadata={"source": url, "title": result.get("title", url)}
91
+ )
92
+ documents.append(doc)
93
+
94
+ if not documents:
95
+ return {
96
+ "answer": "I couldn't find any relevant information.",
97
+ "sources": []
98
+ }
99
+
100
+ # Create vector store
101
+ vectorstore = FAISS.from_documents(documents, self.embeddings)
102
+
103
+ # Create retrieval chain
104
+ chain = RetrievalQAWithSourcesChain.from_chain_type(
105
+ llm=None, # We'll implement custom answer synthesis
106
+ retriever=vectorstore.as_retriever()
107
+ )
108
+
109
+ # Get relevant documents
110
+ relevant_docs = chain.retriever.get_relevant_documents(query)
111
+
112
+ # For now, return the most relevant chunks and sources
113
+ sources = []
114
+ content = []
115
+ for doc in relevant_docs[:3]:
116
+ if doc.metadata["source"] not in sources:
117
+ sources.append(doc.metadata["source"])
118
+ content.append(doc.page_content)
119
+
120
+ return {
121
+ "answer": "\n\n".join(content),
122
+ "sources": sources
123
+ }
124
+
125
+ async def search(self, query: str) -> Dict[str, Any]:
126
+ """Main search interface."""
127
+ try:
128
+ return await self.process_search_results(query)
129
+ except Exception as e:
130
+ return {
131
+ "answer": f"An error occurred: {str(e)}",
132
+ "sources": []
133
+ }
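
A minimal end-to-end call of the `SearchEngine` above, assuming it is importable as `engines.search` and that network access plus the pinned langchain/sentence-transformers versions are available; the query is an example, and both the success and error paths return the same `answer`/`sources` keys:

```python
import asyncio

from engines.search import SearchEngine


async def main():
    engine = SearchEngine()  # loads the all-mpnet-base-v2 embedding model

    result = await engine.search("latest developments in artificial intelligence")

    # `answer` holds the returned text (currently the most relevant chunks),
    # `sources` the cited URLs.
    print(result["answer"][:500])
    print(result["sources"])


if __name__ == "__main__":
    asyncio.run(main())
```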
requirements.txt CHANGED
@@ -1,58 +1,42 @@
1
- # Base dependencies
 
 
2
  numpy>=1.23.5
3
- scikit-learn>=1.2.2
4
- scipy>=1.10.1
5
  pandas>=2.0.2
6
  tqdm>=4.65.0
7
- Pillow==10.0.0
 
8
  requests==2.31.0
 
 
 
 
 
 
 
9
 
10
- # PyTorch CPU (pre-built wheels)
11
  --extra-index-url https://download.pytorch.org/whl/cpu
12
  torch==2.0.1+cpu
13
  torchvision==0.15.2+cpu
14
- torchaudio==2.0.2+cpu
15
-
16
- # Transformers and embeddings
17
  transformers==4.31.0
18
- tokenizers==0.13.3
19
- --extra-index-url https://huggingface.github.io/pytorch-transformers/whl/cpu/
20
  sentence-transformers==2.2.2
21
- huggingface-hub>=0.16.4
22
 
23
- # Web interface
24
  gradio==3.40.1
25
 
26
- # Search and scraping
27
- duckduckgo-search==3.8.5
28
- beautifulsoup4==4.12.2
29
- lxml==4.9.3
30
- googlesearch-python==1.2.3
31
- waybackpy==3.0.6
32
- google==3.0.0
33
-
34
- # LangChain and dependencies
35
- langchain==0.0.335
36
- pydantic==1.10.13
37
-
38
- # Browser automation
39
- selenium==4.15.2
40
- webdriver-manager==4.0.1
41
-
42
- # Networking and async
43
- aiohttp==3.8.5
44
- httpx==0.24.1
45
- async-timeout==4.0.3
46
- attrs==23.1.0
47
- multidict==6.0.4
48
- yarl==1.9.2
49
- frozenlist==1.4.0
50
- charset-normalizer==3.2.0
51
- idna==3.4
52
- certifi==2023.7.22
53
- urllib3==2.0.4
54
-
55
- # Domain info
56
  python-whois==0.8.0
57
  geopy==2.4.1
58
- protobuf==4.25.1
 
 
 
1
+ # Core dependencies
2
+ langchain==0.0.335
3
+ pydantic==1.10.13
4
  numpy>=1.23.5
 
 
5
  pandas>=2.0.2
6
  tqdm>=4.65.0
7
+
8
+ # Web and Networking
9
  requests==2.31.0
10
+ aiohttp==3.8.5
11
+ httpx==0.24.1
12
+ beautifulsoup4==4.12.2
13
+ selenium==4.15.2
14
+ webdriver-manager==4.0.1
15
+ googlesearch-python==1.2.3
16
+ duckduckgo-search==3.8.5
17
 
18
+ # ML and AI
19
  --extra-index-url https://download.pytorch.org/whl/cpu
20
  torch==2.0.1+cpu
21
  torchvision==0.15.2+cpu
 
 
 
22
  transformers==4.31.0
 
 
23
  sentence-transformers==2.2.2
 
24
 
25
+ # UI
26
  gradio==3.40.1
27
 
28
+ # OSINT Tools
 
 
 
29
  python-whois==0.8.0
30
  geopy==2.4.1
31
+ socid-extractor==1.0.0
32
+ holehe==1.61
33
+ sherlock-project==0.14.3
34
+
35
+ # Image Processing
36
+ Pillow==10.0.0
37
+ face-recognition==1.3.0
38
+
39
+ # Utilities
40
+ python-dotenv==1.0.0
41
+ tenacity==8.2.3
42
+ retry==0.9.2
utils/helpers.py ADDED
@@ -0,0 +1,160 @@
1
+ """
2
+ Common helper functions for the search engine.
3
+ """
4
+ from typing import Dict, Any, List, Optional
5
+ import re
6
+ from datetime import datetime
7
+ import hashlib
8
+ import json
9
+
10
+ def clean_text(text: str) -> str:
11
+ """Clean and normalize text content."""
12
+ # Remove extra whitespace
13
+ text = re.sub(r"\s+", " ", text)
14
+
15
+ # Remove special characters
16
+ text = re.sub(r"[^\w\s.,!?-]", "", text)
17
+
18
+ return text.strip()
19
+
20
+ def extract_entities(text: str) -> Dict[str, List[str]]:
21
+ """Extract basic entities from text."""
22
+ entities = {
23
+ "emails": [],
24
+ "phones": [],
25
+ "urls": [],
26
+ "dates": []
27
+ }
28
+
29
+ # Extract emails
30
+ email_pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
31
+ entities["emails"] = re.findall(email_pattern, text)
32
+
33
+ # Extract phone numbers
34
+ phone_pattern = r"\+?\d{1,4}?[-.\s]?\(?\d{1,3}?\)?[-.\s]?\d{1,4}[-.\s]?\d{1,4}[-.\s]?\d{1,9}"
35
+ entities["phones"] = re.findall(phone_pattern, text)
36
+
37
+ # Extract URLs
38
+ url_pattern = r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+"
39
+ entities["urls"] = re.findall(url_pattern, text)
40
+
41
+ # Extract dates
42
+ date_pattern = r"\d{1,2}[-/]\d{1,2}[-/]\d{2,4}"
43
+ entities["dates"] = re.findall(date_pattern, text)
44
+
45
+ return entities
46
+
47
+ def generate_hash(data: Any) -> str:
48
+ """Generate a hash for data deduplication."""
49
+ if isinstance(data, (dict, list)):
50
+ data = json.dumps(data, sort_keys=True)
51
+ elif not isinstance(data, str):
52
+ data = str(data)
53
+
54
+ return hashlib.md5(data.encode()).hexdigest()
55
+
56
+ def format_date(date_str: str) -> Optional[str]:
57
+ """Format date string to consistent format."""
58
+ date_formats = [
59
+ "%Y-%m-%d",
60
+ "%d/%m/%Y",
61
+ "%m/%d/%Y",
62
+ "%Y/%m/%d",
63
+ "%d-%m-%Y",
64
+ "%m-%d-%Y"
65
+ ]
66
+
67
+ for fmt in date_formats:
68
+ try:
69
+ date_obj = datetime.strptime(date_str, fmt)
70
+ return date_obj.strftime("%Y-%m-%d")
71
+ except ValueError:
72
+ continue
73
+
74
+ return None
75
+
76
+ def extract_name_parts(full_name: str) -> Dict[str, str]:
77
+ """Extract first, middle, and last names."""
78
+ parts = full_name.strip().split()
79
+
80
+ if len(parts) == 1:
81
+ return {
82
+ "first_name": parts[0],
83
+ "middle_name": None,
84
+ "last_name": None
85
+ }
86
+ elif len(parts) == 2:
87
+ return {
88
+ "first_name": parts[0],
89
+ "middle_name": None,
90
+ "last_name": parts[1]
91
+ }
92
+ else:
93
+ return {
94
+ "first_name": parts[0],
95
+ "middle_name": " ".join(parts[1:-1]),
96
+ "last_name": parts[-1]
97
+ }
98
+
99
+ def generate_username_variants(name: str) -> List[str]:
100
+ """Generate possible username variants from a name."""
101
+ name = name.lower()
102
+ parts = name.split()
103
+ variants = []
104
+
105
+ if len(parts) >= 2:
106
+ first, last = parts[0], parts[-1]
107
+ variants.extend([
108
+ first + last,
109
+ first + "_" + last,
110
+ first + "." + last,
111
+ first[0] + last,
112
+ first + last[0],
113
+ last + first,
114
+ last + "_" + first,
115
+ last + "." + first
116
+ ])
117
+
118
+ if len(parts) == 1:
119
+ variants.extend([
120
+ parts[0],
121
+ parts[0] + "123",
122
+ "the" + parts[0],
123
+ "real" + parts[0]
124
+ ])
125
+
126
+ return list(set(variants))
127
+
128
+ def calculate_text_similarity(text1: str, text2: str) -> float:
129
+ """Calculate simple text similarity score."""
130
+ # Convert to sets of words
131
+ set1 = set(text1.lower().split())
132
+ set2 = set(text2.lower().split())
133
+
134
+ # Calculate Jaccard similarity
135
+ intersection = len(set1.intersection(set2))
136
+ union = len(set1.union(set2))
137
+
138
+ return intersection / union if union > 0 else 0.0
139
+
140
+ def extract_social_links(text: str) -> List[Dict[str, str]]:
141
+ """Extract social media profile links from text."""
142
+ social_patterns = {
143
+ "twitter": r"https?://(?:www\.)?twitter\.com/([a-zA-Z0-9_]+)",
144
+ "facebook": r"https?://(?:www\.)?facebook\.com/([a-zA-Z0-9.]+)",
145
+ "instagram": r"https?://(?:www\.)?instagram\.com/([a-zA-Z0-9_.]+)",
146
+ "linkedin": r"https?://(?:www\.)?linkedin\.com/in/([a-zA-Z0-9_-]+)",
147
+ "github": r"https?://(?:www\.)?github\.com/([a-zA-Z0-9_-]+)"
148
+ }
149
+
150
+ results = []
151
+ for platform, pattern in social_patterns.items():
152
+ matches = re.finditer(pattern, text)
153
+ for match in matches:
154
+ results.append({
155
+ "platform": platform,
156
+ "username": match.group(1),
157
+ "url": match.group(0)
158
+ })
159
+
160
+ return results
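
The helpers above are pure functions, so they can be exercised directly; a short sketch with example inputs (the email, URL, and name are placeholders):

```python
from utils.helpers import (
    calculate_text_similarity,
    extract_entities,
    generate_username_variants,
)

text = "Contact jane.doe@example.com or see https://github.com/janedoe for details."

# Regex-based extraction of emails, phones, URLs, and dates.
print(extract_entities(text))

# Candidate usernames derived from a full name (built from a set, so unordered).
print(generate_username_variants("Jane Doe"))

# Jaccard similarity over word sets, in [0.0, 1.0].
print(calculate_text_similarity("open source intelligence", "open source tools"))
```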
utils/web.py ADDED
@@ -0,0 +1,128 @@
1
+ """
2
+ Web scraping and processing utilities.
3
+ """
4
+ from typing import Dict, Any, List, Optional
5
+ import requests
6
+ from bs4 import BeautifulSoup
7
+ import re
8
+ from urllib.parse import urlparse, urljoin
9
+ from tenacity import retry, stop_after_attempt, wait_exponential
10
+
11
+ class WebUtils:
12
+ def __init__(self):
13
+ self.session = requests.Session()
14
+ self.session.headers.update({
15
+ "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
16
+ })
17
+
18
+ @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
19
+ async def fetch_url(self, url: str, timeout: int = 10) -> Optional[str]:
20
+ """Fetch content from a URL."""
21
+ try:
22
+ response = self.session.get(url, timeout=timeout)
23
+ response.raise_for_status()
24
+ return response.text
25
+ except Exception as e:
26
+ print(f"Error fetching {url}: {e}")
27
+ return None
28
+
29
+ def extract_text(self, html: str) -> str:
30
+ """Extract clean text from HTML content."""
31
+ soup = BeautifulSoup(html, "html.parser")
32
+
33
+ # Remove unwanted elements
34
+ for element in soup(["script", "style", "nav", "footer", "header"]):
35
+ element.decompose()
36
+
37
+ # Get text and clean it
38
+ text = soup.get_text(separator="\n", strip=True)
39
+
40
+ # Remove excessive newlines
41
+ text = re.sub(r"\n\s*\n", "\n\n", text)
42
+
43
+ return text.strip()
44
+
45
+ def extract_metadata(self, html: str, url: str) -> Dict[str, Any]:
46
+ """Extract metadata from HTML content."""
47
+ soup = BeautifulSoup(html, "html.parser")
48
+
49
+ metadata = {
50
+ "url": url,
51
+ "title": None,
52
+ "description": None,
53
+ "keywords": None,
54
+ "author": None,
55
+ "published_date": None
56
+ }
57
+
58
+ # Extract title
59
+ metadata["title"] = (
60
+ soup.title.string if soup.title else None
61
+ )
62
+
63
+ # Extract meta tags
64
+ meta_tags = soup.find_all("meta")
65
+ for tag in meta_tags:
66
+ # Description
67
+ if tag.get("name", "").lower() == "description":
68
+ metadata["description"] = tag.get("content")
69
+
70
+ # Keywords
71
+ elif tag.get("name", "").lower() == "keywords":
72
+ metadata["keywords"] = tag.get("content")
73
+
74
+ # Author
75
+ elif tag.get("name", "").lower() == "author":
76
+ metadata["author"] = tag.get("content")
77
+
78
+ # Published date
79
+ elif tag.get("name", "").lower() in ["published_time", "publication_date"]:
80
+ metadata["published_date"] = tag.get("content")
81
+
82
+ return metadata
83
+
84
+ def extract_links(self, html: str, base_url: str) -> List[str]:
85
+ """Extract all links from HTML content."""
86
+ soup = BeautifulSoup(html, "html.parser")
87
+ links = []
88
+
89
+ for link in soup.find_all("a"):
90
+ href = link.get("href")
91
+ if href:
92
+ # Convert relative URLs to absolute
93
+ absolute_url = urljoin(base_url, href)
94
+ # Only include http(s) URLs
95
+ if absolute_url.startswith(("http://", "https://")):
96
+ links.append(absolute_url)
97
+
98
+ return list(set(links)) # Remove duplicates
99
+
100
+ def is_valid_url(self, url: str) -> bool:
101
+ """Check if a URL is valid."""
102
+ try:
103
+ result = urlparse(url)
104
+ return all([result.scheme, result.netloc])
105
+ except Exception:
106
+ return False
107
+
108
+ def clean_url(self, url: str) -> str:
109
+ """Clean and normalize a URL."""
110
+ # Remove tracking parameters
111
+ parsed = urlparse(url)
112
+ path = parsed.path
113
+
114
+ # Remove common tracking parameters
115
+ query_params = []
116
+ if parsed.query:
117
+ for param in parsed.query.split("&"):
118
+ if "=" in param:
119
+ key = param.split("=")[0].lower()
120
+ if not any(track in key for track in ["utm_", "ref_", "source", "campaign"]):
121
+ query_params.append(param)
122
+
123
+ # Rebuild URL
124
+ clean_url = f"{parsed.scheme}://{parsed.netloc}{path}"
125
+ if query_params:
126
+ clean_url += "?" + "&".join(query_params)
127
+
128
+ return clean_url
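
A sketch of the `WebUtils` flow above, assuming the module is importable as `utils.web`; note that `fetch_url` is declared async but uses a blocking `requests.Session` internally, so it simply runs synchronously inside the event loop. The URL is an example:

```python
import asyncio

from utils.web import WebUtils


async def main():
    web = WebUtils()

    html = await web.fetch_url("https://example.com")
    if html:
        # Title, description, keywords, author, and published date where present.
        print(web.extract_metadata(html, "https://example.com"))
        # Absolute http(s) links found on the page, de-duplicated.
        print(web.extract_links(html, "https://example.com")[:5])

    # Tracking parameters (utm_*, ref_*, source, campaign) are stripped.
    print(web.clean_url("https://example.com/page?utm_source=x&id=42"))


if __name__ == "__main__":
    asyncio.run(main())
```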