Spaces: tensor-boy/ISE

fikird committed on Dec 2, 2024

Commit 48922fa

1 Parent(s): ad4c231

Complete rewrite of ISE with advanced RAG and OSINT capabilities

Files changed (8)

README.md +102 -40
app.py +200 -281
engines/image.py +164 -0
engines/osint.py +167 -0
engines/search.py +133 -0
requirements.txt +27 -43
utils/helpers.py +160 -0
utils/web.py +128 -0

README.md CHANGED

@@ -1,5 +1,5 @@
 ---
-title: Intelligent Search Engine
 emoji: 🔍
 colorFrom: blue
 colorTo: indigo
@@ -9,62 +9,124 @@ app_file: app.py
 pinned: false
 ---
-# 🔍 Intelligent Search Engine
-An AI-powered search engine that provides intelligent summaries and insights from web content.
-## Features
-- 🌐 Web search powered by DuckDuckGo
-- 🤖 AI-powered content summarization
-- 📊 Semantic search capabilities
-- 📱 Clean, responsive UI
-## Technical Details
 ### Core Components
-1. **Search Engine (`search_engine.py`)**
-   - DuckDuckGo integration for web search
-   - Content processing and summarization
-   - URL validation and metadata extraction
-2. **Web Interface (`app.py`)**
-   - Gradio-based UI
-   - Error handling
-   - Result formatting
-### Models
-- Summarization: facebook/bart-base
-- Embeddings: sentence-transformers/all-MiniLM-L6-v2
-### Dependencies
-- Python 3.10
-- Gradio  5.7.1
-- Transformers
-- DuckDuckGo Search
-- BeautifulSoup4
-- Langchain
-- Sentence Transformers
-## Usage
-1. Enter your search query in the text box
-2. Adjust the number of results using the slider
-3. Click "Submit" to see the results
-## Example Queries
-- "Latest developments in artificial intelligence"
-- "Climate change solutions"
-- "Space exploration news"
-## Deployment
-This project is deployed on Hugging Face Spaces, optimized for CPU environments.
-## License
-Apache 2.0

 ---
+title: Intelligent Search Engine (ISE)
 emoji: 🔍
 colorFrom: blue
 colorTo: indigo
 pinned: false
 ---
+# 🔍 Intelligent Search Engine (ISE)
+An advanced OSINT search engine with RAG capabilities and multi-modal search features.
+## 🌟 Features
+### 🌐 Intelligent Search
+- Web search with context understanding
+- AI-powered answer synthesis
+- Source citation and verification
+- RAG-based knowledge retrieval
+### 👤 OSINT Capabilities
+- Username search across multiple platforms
+- Person search (name, age, location)
+- Social media profile exploration
+- Personal information gathering
+- Historical data retrieval
+### 📸 Image Analysis
+- Face detection and recognition
+- Object and scene recognition
+- Image metadata extraction
+- Similar image search
+- Cross-reference with social media
+### 🗺️ Location Intelligence
+- Geographic information analysis
+- Location-based searching
+- Address validation and normalization
+- Proximity analysis
+## 🛠️ Technology Stack
 ### Core Components
+- Python 3.10+
+- LangChain for RAG capabilities
+- HuggingFace Transformers
+- PyTorch (CPU optimized)
+- Gradio for UI
+### Search & Scraping
+- DuckDuckGo Search
+- Google Search Python
+- BeautifulSoup4
+- Requests/AIOHTTP
+### OSINT Tools
+- Holehe
+- Sherlock Project
+- Python WHOIS
+- Geopy
+### Image Processing
+- Face Recognition
+- Pillow
+- Torchvision
+## 📦 Installation
+1. Clone the repository:
+```bash
+git clone https://github.com/yourusername/intelligent-search-engine.git
+cd intelligent-search-engine
+```
+2. Install dependencies:
+```bash
+pip install -r requirements.txt
+```
+3. Run the application:
+```bash
+python app.py
+```
+## 🎯 Usage
+### Web Interface
+The application provides a user-friendly web interface with multiple tabs:
+1. **Search Tab**
+   - Enter your search query
+   - Get AI-powered answers with sources
+2. **Username Search Tab**
+   - Search usernames across platforms
+   - View consolidated social media presence
+3. **Person Search Tab**
+   - Search by name, location, age
+   - Get comprehensive personal information
+4. **Image Analysis Tab**
+   - Upload images for analysis
+   - Detect faces and objects
+   - Search for similar images
+## 🔒 Privacy & Security
+- No sensitive data storage
+- Anonymized result presentation
+- Rate limiting for API calls
+- Basic URL validation
+- Secure data handling
+## 🤝 Contributing
+1. Fork the repository
+2. Create a feature branch
+3. Commit your changes
+4. Push to the branch
+5. Create a Pull Request
+## 📝 License
+This project is licensed under the MIT License - see the LICENSE file for details.
+## ⚠️ Disclaimer
+This tool is for educational and research purposes only. Users are responsible for complying with applicable laws and regulations regarding information gathering and privacy.

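The README lists "Basic URL validation" under Privacy & Security but does not say what the check is, and `utils/web.py` is not shown in this view. As a rough sketch only, one common way to do such a check with the standard library is below. The helper name `is_probably_valid_url` and the http/https-only policy are assumptions for illustration, not code from this commit.

```python
from urllib.parse import urlparse


def is_probably_valid_url(url: str) -> bool:
    """Return True for absolute http(s) URLs that have a host part.

    Hypothetical helper for illustration; not taken from utils/web.py.
    """
    try:
        parsed = urlparse(url.strip())
    except ValueError:
        # urlparse raises on some malformed input, e.g. an unclosed IPv6 literal
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


if __name__ == "__main__":
    for candidate in ("https://example.com/page", "ftp://example.com", "not a url"):
        print(candidate, "->", is_probably_valid_url(candidate))
```

A check like this only confirms the URL's shape. It does not confirm that the host resolves or that the target is safe to fetch.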
app.py CHANGED

@@ -1,306 +1,225 @@
-import gradio as gr
 import asyncio
-from search_engine import search, advanced_search
-from osint_engine import create_report
-import time
-def format_results(results):
-    if not results:
         return "No results found."
-    if isinstance(results, list):
-        # Format web search results
-        formatted_results = []
-        for result in results:
-            formatted_result = f"""
-### [{result['title']}]({result['url']})
-{result['summary']}
-**Source:** {result['url']}
-**Published:** {result.get('published_date', 'N/A')}
-            """
-            formatted_results.append(formatted_result)
-        return "\n---\n".join(formatted_results)
-    elif isinstance(results, dict):
-        # Format OSINT results
-        if "error" in results:
-            return f"Error: {results['error']}"
-        formatted = []
-        # Web results
-        if "web" in results:
-            formatted.append(format_results(results["web"]))
-        # Username/Platform results
-        if "platforms" in results:
-            platforms = results["platforms"]
-            if platforms:
-                formatted.append("\n### 🔍 Platform Results\n")
-                for platform in platforms:
-                    formatted.append(f"""
-- **Platform:** {platform['platform']}
-  **URL:** [{platform['url']}]({platform['url']})
-  **Status:** {'Found ✅' if platform.get('exists', False) else 'Not Found ❌'}
-""")
-        # Image analysis
-        if "analysis" in results:
-            analysis = results["analysis"]
-            if analysis:
-                formatted.append("\n### 🖼️ Image Analysis\n")
-                for key, value in analysis.items():
-                    formatted.append(f"- **{key.title()}:** {value}")
-        # Similar images
-        if "similar_images" in results:
-            similar = results["similar_images"]
-            if similar:
-                formatted.append("\n### 🔍 Similar Images\n")
-                for img in similar:
-                    formatted.append(f"- [{img['source']}]({img['url']})")
-        # Location info
-        if "location" in results:
-            location = results["location"]
-            if location and not isinstance(location, str):
-                formatted.append("\n### 📍 Location Information\n")
-                for key, value in location.items():
-                    if key != 'raw':
-                        formatted.append(f"- **{key.title()}:** {value}")
-        # Domain info
-        if "domain" in results:
-            domain = results["domain"]
-            if domain and not isinstance(domain, str):
-                formatted.append("\n### 🌐 Domain Information\n")
-                for key, value in domain.items():
-                    formatted.append(f"- **{key.title()}:** {value}")
-        # Historical data
-        if "historical" in results:
-            historical = results["historical"]
-            if historical:
-                formatted.append("\n### 📅 Historical Data\n")
-                for entry in historical[:5]:  # Limit to 5 entries
-                    formatted.append(f"""
-- **Date:** {entry.get('timestamp', 'N/A')}
-  **URL:** [{entry.get('url', 'N/A')}]({entry.get('url', '#')})
-  **Type:** {entry.get('mime_type', 'N/A')}
-""")
-        return "\n".join(formatted) if formatted else "No relevant information found."
-    else:
-        return str(results)
-def safe_search(query, search_type="web", max_results=5, platform=None,
-                image_url=None, phone=None, location=None, domain=None,
-                name=None, address=None, progress=gr.Progress()):
-    """Safe wrapper for search functions"""
-    try:
-        kwargs = {
-            "max_results": max_results,
-            "platform": platform,
-            "phone": phone,
-            "location": location,
-            "domain": domain,
-            "name": name,
-            "address": address
-        }
-        progress(0, desc="Initializing search...")
-        time.sleep(0.5)  # Show loading state
-        if search_type == "web":
-            progress(0.3, desc="Searching web...")
-            results = search(query, max_results)
-        else:
-            # For async searches
-            if search_type == "image" and image_url:
-                query = image_url
-            progress(0.5, desc=f"Performing {search_type} search...")
-            loop = asyncio.new_event_loop()
-            asyncio.set_event_loop(loop)
-            results = loop.run_until_complete(advanced_search(query, search_type, **kwargs))
-            loop.close()
-        progress(0.8, desc="Processing results...")
-        time.sleep(0.5)  # Show processing state
-        progress(1.0, desc="Done!")
-        return format_results(results)
     except Exception as e:
         return f"Error: {str(e)}"
-# Create Gradio interface
-with gr.Blocks(theme=gr.themes.Soft()) as demo:
-    gr.Markdown("# 🔍 Intelligent Search Engine")
-    gr.Markdown("""
-    An AI-powered search engine with advanced OSINT capabilities.
-    Features:
-    - Web search with AI summaries
-    - Username search across platforms
-    - Image search and analysis
-    - Social media profile search
-    - Personal information gathering
-    - Historical data search
-    """)
-    with gr.Tab("Web Search"):
-        with gr.Row():
-            query_input = gr.Textbox(
-                label="Search Query",
-                placeholder="Enter your search query...",
-                lines=2
-            )
-            max_results = gr.Slider(
-                minimum=1,
-                maximum=10,
-                value=5,
-                step=1,
-                label="Number of Results"
-            )
-        search_button = gr.Button("🔍 Search", variant="primary")
-        results_output = gr.Markdown(label="Search Results")
-        search_button.click(
-            fn=safe_search,
-            inputs=[query_input, gr.State("web"), max_results],
-            outputs=results_output,
-            show_progress=True
-        )
-    with gr.Tab("Username Search"):
-        username_input = gr.Textbox(
-            label="Username",
-            placeholder="Enter username to search..."
-        )
-        username_button = gr.Button("🔍 Search Username", variant="primary")
-        username_output = gr.Markdown(label="Username Search Results")
-        username_button.click(
-            fn=safe_search,
-            inputs=[username_input, gr.State("username")],
-            outputs=username_output,
-            show_progress=True
-        )
-    with gr.Tab("Image Search"):
-        with gr.Row():
-            image_url = gr.Textbox(
-                label="Image URL",
-                placeholder="Enter image URL to search..."
-            )
-            image_upload = gr.Image(
-                label="Or Upload Image",
-                type="filepath"
-            )
-        image_button = gr.Button("🔍 Search Image", variant="primary")
-        image_output = gr.Markdown(label="Image Search Results")
-        def handle_image_search(url, uploaded_image):
-            if uploaded_image:
-                return safe_search(uploaded_image, "image", image_url=uploaded_image)
-            return safe_search(url, "image", image_url=url)
-        image_button.click(
-            fn=handle_image_search,
-            inputs=[image_url, image_upload],
-            outputs=image_output,
-            show_progress=True
-        )
-    with gr.Tab("Social Media Search"):
-        with gr.Row():
-            social_username = gr.Textbox(
-                label="Username",
-                placeholder="Enter username..."
-            )
-            platform = gr.Dropdown(
-                choices=[
-                    "all", "twitter", "instagram", "facebook", "linkedin",
-                    "github", "reddit", "youtube", "tiktok", "pinterest",
-                    "snapchat", "twitch", "medium", "devto", "stackoverflow"
-                ],
-                value="all",
-                label="Platform"
-            )
-        social_button = gr.Button("🔍 Search Social Media", variant="primary")
-        social_output = gr.Markdown(label="Social Media Results")
-        social_button.click(
-            fn=safe_search,
-            inputs=[social_username, gr.State("social"), gr.State(5), platform],
-            outputs=social_output,
-            show_progress=True
-        )
-    with gr.Tab("Personal Info"):
-        with gr.Group():
-            with gr.Row():
-                name = gr.Textbox(label="Full Name", placeholder="John Doe")
-                address = gr.Textbox(label="Address/Location", placeholder="City, Country")
-            initial_search = gr.Button("🔍 Find Possible Matches", variant="primary")
-            matches_output = gr.Markdown(label="Possible Matches")
-            with gr.Row(visible=False) as details_row:
-                selected_person = gr.Dropdown(
-                    choices=[],
-                    label="Select Person",
-                    interactive=True
                 )
-                details_button = gr.Button("🔍 Get Detailed Info", variant="secondary")
-            details_output = gr.Markdown(label="Detailed Information")
-        def find_matches(name, address):
-            return safe_search(name, "personal", name=name, location=address)
-        def get_details(person):
-            if not person:
-                return "Please select a person first."
-            return safe_search(person, "personal", name=person)
-        initial_search.click(
-            fn=find_matches,
-            inputs=[name, address],
-            outputs=matches_output
-        ).then(
-            lambda: gr.Row(visible=True),
-            None,
-            details_row
-        )
-        details_button.click(
-            fn=get_details,
-            inputs=[selected_person],
-            outputs=details_output,
-            show_progress=True
-        )
-    with gr.Tab("Historical Data"):
-        url_input = gr.Textbox(
-            label="URL",
-            placeholder="Enter URL to search historical data..."
-        )
-        historical_button = gr.Button("🔍 Search Historical Data", variant="primary")
-        historical_output = gr.Markdown(label="Historical Data Results")
-        historical_button.click(
-            fn=safe_search,
-            inputs=[url_input, gr.State("historical")],
-            outputs=historical_output,
-            show_progress=True
-        )
-    gr.Markdown("""
-    ### Examples
-    Try these example searches:
-    - Web Search: "Latest developments in artificial intelligence"
-    - Username: "johndoe"
-    - Image URL: "https://images.app.goo.gl/w5BtxZKvzg6BdkGE8"
-    - Social Media: "techuser" on Twitter
-    - Personal Info: "John Smith" in "New York, USA"
-    - Historical Data: "example.com"
-    """)
-# Launch the app
 if __name__ == "__main__":
-    demo.launch()

+"""
+Intelligent Search Engine with RAG and OSINT capabilities.
+"""
+import os
 import asyncio
+import gradio as gr
+from engines.search import SearchEngine
+from engines.osint import OSINTEngine
+from engines.image import ImageEngine
+import markdown2
+from typing import Dict, Any, List
+# Initialize engines
+search_engine = SearchEngine()
+osint_engine = OSINTEngine()
+image_engine = ImageEngine()
+def format_search_results(results: Dict[str, Any]) -> str:
+    """Format search results with markdown."""
+    if not results or "answer" not in results:
         return "No results found."
+    formatted = f"### Answer\n{results['answer']}\n\n"
+    if results.get("sources"):
+        formatted += "\n### Sources\n"
+        for i, source in enumerate(results["sources"], 1):
+            formatted += f"{i}. [{source}]({source})\n"
+    return formatted
+def format_osint_results(results: Dict[str, Any]) -> str:
+    """Format OSINT results with markdown."""
+    formatted = "### OSINT Results\n\n"
+    if "error" in results:
+        return f"Error: {results['error']}"
+    if "found_on" in results:
+        formatted += "#### Social Media Presence\n"
+        for platform in results["found_on"]:
+            formatted += f"- {platform['platform']}: [{platform['url']}]({platform['url']})\n"
+    if "person_info" in results:
+        person = results["person_info"]
+        formatted += f"\n#### Personal Information\n"
+        formatted += f"- Name: {person.get('name', 'N/A')}\n"
+        if person.get("age"):
+            formatted += f"- Age: {person['age']}\n"
+        if person.get("location"):
+            formatted += f"- Location: {person['location']}\n"
+        if person.get("gender"):
+            formatted += f"- Gender: {person['gender']}\n"
+    return formatted
+async def search_query(query: str) -> str:
+    """Handle search queries."""
+    try:
+        results = await search_engine.search(query)
+        return format_search_results(results)
+    except Exception as e:
+        return f"Error: {str(e)}"
+async def search_username(username: str) -> str:
+    """Search for username across platforms."""
+    try:
+        results = await osint_engine.search_username(username)
+        return format_osint_results(results)
+    except Exception as e:
+        return f"Error: {str(e)}"
+async def search_person(name: str, location: str = "", age: str = "", gender: str = "") -> str:
+    """Search for person information."""
+    try:
+        age_int = int(age) if age.strip() else None
+        person = await osint_engine.search_person(
+            name=name,
+            location=location if location.strip() else None,
+            age=age_int,
+            gender=gender if gender.strip() else None
+        )
+        return format_osint_results({"person_info": person.to_dict()})
+    except Exception as e:
+        return f"Error: {str(e)}"
+async def analyze_image_file(image) -> str:
+    """Analyze uploaded image."""
+    try:
+        if not image:
+            return "No image provided."
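+        # Read image data; depending on the Gradio version, gr.File passes either
+        # a path string or a file-like object that exposes the path as .name
+        image_path = image if isinstance(image, str) else image.name
+        with open(image_path, "rb") as f: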
+            image_data = f.read()
+        # Analyze image
+        results = await image_engine.analyze_image(image_data)
+        if "error" in results:
+            return f"Error analyzing image: {results['error']}"
+        # Format results
+        formatted = "### Image Analysis Results\n\n"
+        # Add predictions
+        formatted += "#### Content Detection\n"
+        for pred in results["predictions"]:
+            confidence = pred["confidence"] * 100
+            formatted += f"- {pred['label']}: {confidence:.1f}%\n"
+        # Add face detection results
+        formatted += f"\n#### Face Detection\n"
+        formatted += f"- Found {len(results['faces'])} faces\n"
+        # Add metadata
+        formatted += f"\n#### Image Metadata\n"
+        metadata = results["metadata"]
+        formatted += f"- Size: {metadata['width']}x{metadata['height']}\n"
+        formatted += f"- Format: {metadata['format']}\n"
+        formatted += f"- Mode: {metadata['mode']}\n"
+        return formatted
     except Exception as e:
         return f"Error: {str(e)}"
+def create_ui() -> gr.Blocks:
+    """Create the Gradio interface."""
+    with gr.Blocks(title="Intelligent Search Engine", theme=gr.themes.Soft()) as app:
+        gr.Markdown("""
+        # 🔍 Intelligent Search Engine
+        Advanced search engine with RAG and OSINT capabilities.
+        """)
+        with gr.Tabs():
+            # Intelligent Search Tab
+            with gr.Tab("🌐 Search"):
+                with gr.Column():
+                    search_input = gr.Textbox(
+                        label="Enter your search query",
+                        placeholder="What would you like to know?"
+                    )
+                    search_button = gr.Button("Search", variant="primary")
+                    search_output = gr.Markdown(label="Results")
+                search_button.click(
+                    fn=search_query,
+                    inputs=search_input,
+                    outputs=search_output
                 )
+            # Username Search Tab
+            with gr.Tab("👤 Username Search"):
+                with gr.Column():
+                    username_input = gr.Textbox(
+                        label="Enter username",
+                        placeholder="Username to search across platforms"
+                    )
+                    username_button = gr.Button("Search Username", variant="primary")
+                    username_output = gr.Markdown(label="Results")
+                username_button.click(
+                    fn=search_username,
+                    inputs=username_input,
+                    outputs=username_output
+                )
+            # Person Search Tab
+            with gr.Tab("👥 Person Search"):
+                with gr.Column():
+                    name_input = gr.Textbox(
+                        label="Full Name",
+                        placeholder="Enter person's name"
+                    )
+                    location_input = gr.Textbox(
+                        label="Location (optional)",
+                        placeholder="City, Country"
+                    )
+                    age_input = gr.Textbox(
+                        label="Age (optional)",
+                        placeholder="Enter age"
+                    )
+                    gender_input = gr.Dropdown(
+                        label="Gender (optional)",
+                        choices=["", "Male", "Female", "Other"]
+                    )
+                    person_button = gr.Button("Search Person", variant="primary")
+                    person_output = gr.Markdown(label="Results")
+                person_button.click(
+                    fn=search_person,
+                    inputs=[name_input, location_input, age_input, gender_input],
+                    outputs=person_output
+                )
+            # Image Analysis Tab
+            with gr.Tab("🖼️ Image Analysis"):
+                with gr.Column():
+                    image_input = gr.File(
+                        label="Upload Image",
+                        file_types=["image"]
+                    )
+                    image_button = gr.Button("Analyze Image", variant="primary")
+                    image_output = gr.Markdown(label="Results")
+                image_button.click(
+                    fn=analyze_image_file,
+                    inputs=image_input,
+                    outputs=image_output
+                )
+        gr.Markdown("""
+        ### 📝 Notes
+        - The search engine uses RAG (Retrieval-Augmented Generation) for intelligent answers
+        - OSINT capabilities include social media presence, personal information, and image analysis
+        - All searches are conducted using publicly available information
+        """)
+    return app
 if __name__ == "__main__":
+    app = create_ui()
+    app.launch(share=True)

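Judging from `format_search_results` alone, `SearchEngine.search()` is expected to return a dict with an `"answer"` string and an optional `"sources"` list of URLs. `engines/search.py` is not shown in this view, so that shape is an inference rather than a documented contract. The sketch below is a standalone copy of the formatter with a made-up sample payload, which makes the rendered markdown easy to inspect.

```python
from typing import Any, Dict


def format_search_results(results: Dict[str, Any]) -> str:
    """Standalone copy of the formatter added to app.py in this commit."""
    if not results or "answer" not in results:
        return "No results found."
    formatted = f"### Answer\n{results['answer']}\n\n"
    if results.get("sources"):
        formatted += "\n### Sources\n"
        # Each source URL is used as both the link text and the link target
        for i, source in enumerate(results["sources"], 1):
            formatted += f"{i}. [{source}]({source})\n"
    return formatted


# Hypothetical payload; the real search() output may carry additional keys.
sample = {
    "answer": "Example answer text.",
    "sources": ["https://example.com/a", "https://example.com/b"],
}

if __name__ == "__main__":
    print(format_search_results(sample))
    print(format_search_results({}))  # empty input falls back to the default message
```

Because the formatter only reads `"answer"` and `"sources"`, any extra keys the engine returns are silently ignored.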
engines/image.py ADDED

@@ -0,0 +1,164 @@

+"""
+Image analysis engine for processing and analyzing images.
+"""
+from typing import Dict, Any, List, Optional
+import io
+from PIL import Image
+import torch
+from torchvision import transforms
+from transformers import AutoFeatureExtractor, AutoModelForImageClassification
+import face_recognition
+import numpy as np
+from tenacity import retry, stop_after_attempt, wait_exponential
+class ImageEngine:
+    def __init__(self):
+        # Initialize image classification model
+        self.feature_extractor = AutoFeatureExtractor.from_pretrained(
+            "microsoft/resnet-50"
+        )
+        self.model = AutoModelForImageClassification.from_pretrained(
+            "microsoft/resnet-50"
+        )
+        # Set up image transforms
+        self.transform = transforms.Compose([
+            transforms.Resize(256),
+            transforms.CenterCrop(224),
+            transforms.ToTensor(),
+            transforms.Normalize(
+                mean=[0.485, 0.456, 0.406],
+                std=[0.229, 0.224, 0.225]
+            )
+        ])
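+    # Note: analyze_image catches every exception and returns an error dict,
+    # so this @retry decorator never sees a failure and never retries.
+    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))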
+    async def analyze_image(self, image_data: bytes) -> Dict[str, Any]:
+        """Analyze image content and detect objects/faces."""
+        try:
+            # Load image
+            image = Image.open(io.BytesIO(image_data)).convert('RGB')
+            # Prepare image for model
+            inputs = self.feature_extractor(images=image, return_tensors="pt")
+            # Get model predictions
+            with torch.no_grad():
+                outputs = self.model(**inputs)
+                probs = outputs.logits.softmax(-1)
+            # Get top predictions
+            top_probs, top_indices = torch.topk(probs, k=5)
+            # Convert predictions to list
+            predictions = [
+                {
+                    "label": self.model.config.id2label[idx.item()],
+                    "confidence": prob.item()
+                }
+                for prob, idx in zip(top_probs[0], top_indices[0])
+            ]
+            # Analyze faces
+            np_image = np.array(image)
+            face_locations = face_recognition.face_locations(np_image)
+            face_encodings = face_recognition.face_encodings(np_image, face_locations)
+            faces = []
+            for i, (face_encoding, face_location) in enumerate(zip(face_encodings, face_locations)):
+                face = {
+                    "id": i + 1,
+                    "location": {
+                        "top": face_location[0],
+                        "right": face_location[1],
+                        "bottom": face_location[2],
+                        "left": face_location[3]
+                    },
+                    "encoding": face_encoding.tolist()
+                }
+                faces.append(face)
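+            # Get image metadata. convert('RGB') returns a copy whose .format is None,
+            # so read the format from the original bytes (Image.open only parses the header).
+            metadata = {
+                "format": Image.open(io.BytesIO(image_data)).format,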
+                "mode": image.mode,
+                "size": image.size,
+                "width": image.width,
+                "height": image.height
+            }
+            return {
+                "predictions": predictions,
+                "faces": faces,
+                "metadata": metadata
+            }
+        except Exception as e:
+            return {"error": str(e)}
+    async def compare_faces(self, face1_data: bytes, face2_data: bytes) -> Dict[str, Any]:
+        """Compare two faces and determine if they are the same person."""
+        try:
+            # Load and process first image
+            image1 = face_recognition.load_image_file(io.BytesIO(face1_data))
+            face1_encoding = face_recognition.face_encodings(image1)
+            if not face1_encoding:
+                return {"error": "No face found in first image"}
+            # Load and process second image
+            image2 = face_recognition.load_image_file(io.BytesIO(face2_data))
+            face2_encoding = face_recognition.face_encodings(image2)
+            if not face2_encoding:
+                return {"error": "No face found in second image"}
+            # Compare faces
+            results = face_recognition.compare_faces(
+                [face1_encoding[0]], face2_encoding[0]
+            )
+            # Calculate face distance (lower means more similar)
+            face_distance = face_recognition.face_distance(
+                [face1_encoding[0]], face2_encoding[0]
+            )
+            return {
+                "match": bool(results[0]),
+                "confidence": float(1 - face_distance[0]),
+                "distance": float(face_distance[0])
+            }
+        except Exception as e:
+            return {"error": str(e)}
+    async def search_similar_faces(self,
+                                 target_encoding: List[float],
+                                 face_database: List[Dict[str, Any]],
+                                 threshold: float = 0.6) -> List[Dict[str, Any]]:
+        """Search for similar faces in a database of face encodings."""
+        try:
+            matches = []
+            target_encoding = np.array(target_encoding)
+            for face_data in face_database:
+                if "encoding" not in face_data:
+                    continue
+                current_encoding = np.array(face_data["encoding"])
+                distance = face_recognition.face_distance([target_encoding], current_encoding)[0]
+                if distance < threshold:
+                    matches.append({
+                        "face_id": face_data.get("id"),
+                        "confidence": float(1 - distance),
+                        "metadata": face_data.get("metadata", {})
+                    })
+            # Sort matches by confidence
+            matches.sort(key=lambda x: x["confidence"], reverse=True)
+            return matches
+        except Exception as e:
+            return [{"error": str(e)}]

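Both `compare_faces` and `search_similar_faces` report `confidence` as `1 - distance`. In the `face_recognition` library, `face_distance` is the Euclidean distance between 128-value encodings, and 0.6 is that library's default match tolerance. The sketch below reproduces the thresholding and ranking logic with the standard library only, using toy 3-value encodings. The function names and the clamp to zero are assumptions added for illustration; the committed code does not clamp, so `compare_faces` can report a negative confidence when the distance exceeds 1.

```python
import math
from typing import Any, Dict, List, Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length encodings."""
    return math.dist(a, b)


def rank_matches(target: Sequence[float],
                 database: List[Dict[str, Any]],
                 threshold: float = 0.6) -> List[Dict[str, Any]]:
    """Keep entries closer than `threshold`, best match first."""
    matches = []
    for entry in database:
        if "encoding" not in entry:
            continue  # skip records without an encoding, as the diff does
        distance = euclidean_distance(target, entry["encoding"])
        if distance < threshold:
            matches.append({
                "face_id": entry.get("id"),
                "distance": distance,
                # The clamp is an addition for illustration; it only has an
                # effect when the threshold is raised above 1.
                "confidence": max(0.0, 1.0 - distance),
            })
    matches.sort(key=lambda m: m["confidence"], reverse=True)
    return matches


if __name__ == "__main__":
    # Toy 3-value encodings; real face encodings have 128 values.
    db = [
        {"id": "b", "encoding": [0.3, 0.4, 0.0]},
        {"id": "a", "encoding": [0.0, 0.0, 0.1]},
        {"id": "c", "encoding": [1.0, 1.0, 1.0]},
        {"id": "d"},  # no encoding, ignored
    ]
    for match in rank_matches([0.0, 0.0, 0.0], db):
        print(match)
```

Reporting the raw distance alongside the derived confidence, as `compare_faces` already does, keeps the result interpretable even where `1 - distance` is not a calibrated probability.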
engines/osint.py ADDED

@@ -0,0 +1,167 @@

+"""
+OSINT engine for comprehensive information gathering.
+"""
+from typing import Dict, List, Any, Optional
+import asyncio
+import json
+from dataclasses import dataclass
+import holehe.core as holehe
+from sherlock import sherlock
+import face_recognition
+import numpy as np
+from PIL import Image
+import io
+import requests
+from geopy.geocoders import Nominatim
+from geopy.exc import GeocoderTimedOut
+import whois
+from datetime import datetime
+from tenacity import retry, stop_after_attempt, wait_exponential
+@dataclass
+class PersonInfo:
+    name: str
+    age: Optional[int] = None
+    location: Optional[str] = None
+    gender: Optional[str] = None
+    social_profiles: Optional[List[Dict[str, str]]] = None
+    images: Optional[List[str]] = None
+    def to_dict(self) -> Dict[str, Any]:
+        return {
+            "name": self.name,
+            "age": self.age,
+            "location": self.location,
+            "gender": self.gender,
+            "social_profiles": self.social_profiles or [],
+            "images": self.images or []
+        }
+class OSINTEngine:
+    def __init__(self):
+        self.geolocator = Nominatim(user_agent="intelligent_search_engine")
+        self.known_platforms = [
+            "Twitter", "Instagram", "Facebook", "LinkedIn", "GitHub",
+            "Reddit", "YouTube", "TikTok", "Pinterest", "Snapchat",
+            "Twitch", "Medium", "Dev.to", "Stack Overflow"
+        ]
+    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
+    async def search_username(self, username: str) -> Dict[str, Any]:
+        """Search for username across multiple platforms."""
+        results = []
+        # Use holehe for email-based search
+        email = f"{username}@gmail.com"  # Example email
+        holehe_results = await holehe.check_email(email)
+        # Use sherlock for username search
+        sherlock_results = sherlock.sherlock(username, self.known_platforms, verbose=False)
+        # Combine results
+        for platform, data in {**holehe_results, **sherlock_results}.items():
+            if data.get("exists", False):
+                results.append({
+                    "platform": platform,
+                    "url": data.get("url", ""),
+                    "confidence": data.get("confidence", "high")
+                })
+        return {
+            "username": username,
+            "found_on": results
+        }
+    async def search_person(self, name: str, location: Optional[str] = None,
+                          age: Optional[int] = None, gender: Optional[str] = None) -> PersonInfo:
+        """Search for information about a person."""
+        person = PersonInfo(
+            name=name,
+            age=age,
+            location=location,
+            gender=gender
+        )
+        # Initialize social profiles list
+        person.social_profiles = []
+        # Search for social media profiles
+        username_variants = [
+            name.replace(" ", ""),
+            name.replace(" ", "_"),
+            name.replace(" ", "."),
+            name.lower().replace(" ", "")
+        ]
+        for username in username_variants:
+            results = await self.search_username(username)
+            person.social_profiles.extend(results.get("found_on", []))
+        return person
+    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
+    async def analyze_image(self, image_data: bytes) -> Dict[str, Any]:
+        """Analyze an image for faces and other identifiable information."""
+        try:
+            # Load image
+            image = face_recognition.load_image_file(io.BytesIO(image_data))
+            # Detect faces
+            face_locations = face_recognition.face_locations(image)
+            face_encodings = face_recognition.face_encodings(image, face_locations)
+            results = {
+                "faces_found": len(face_locations),
+                "faces": []
+            }
+            # Analyze each face
+            for face_encoding, face_location in zip(face_encodings, face_locations):
+                face_data = {
+                    "location": {
+                        "top": face_location[0],
+                        "right": face_location[1],
+                        "bottom": face_location[2],
+                        "left": face_location[3]
+                    }
+                }
+                results["faces"].append(face_data)
+            return results
+        except Exception as e:
+            return {"error": str(e)}
+    async def search_location(self, location: str) -> Dict[str, Any]:
+        """Gather information about a location."""
+        try:
+            # Geocode the location
+            location_data = self.geolocator.geocode(location, timeout=10)
+            if not location_data:
+                return {"error": "Location not found"}
+            return {
+                "address": location_data.address,
+                "latitude": location_data.latitude,
+                "longitude": location_data.longitude,
+                "raw": location_data.raw
+            }
+        except GeocoderTimedOut:
+            return {"error": "Geocoding service timed out"}
+        except Exception as e:
+            return {"error": str(e)}
+    async def analyze_domain(self, domain: str) -> Dict[str, Any]:
+        """Analyze a domain for WHOIS and other information."""
+        try:
+            w = whois.whois(domain)
+            return {
+                "registrar": w.registrar,
+                "creation_date": w.creation_date,
+                "expiration_date": w.expiration_date,
+                "last_updated": w.updated_date,
+                "status": w.status,
+                "name_servers": w.name_servers
+            }
+        except Exception as e:
+            return {"error": str(e)}
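A note on `PersonInfo`: its list fields default to `None` and are patched up in `to_dict` with `or []`. A minimal sketch of an alternative, assuming a hypothetical class name `PersonInfoSketch` that is not part of the commit, uses `field(default_factory=list)` so every instance starts with its own empty list:

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class PersonInfoSketch:
    """Hypothetical variant of PersonInfo; not the class from the commit."""
    name: str
    age: Optional[int] = None
    location: Optional[str] = None
    gender: Optional[str] = None
    # default_factory builds a fresh list for every instance
    social_profiles: List[Dict[str, str]] = field(default_factory=list)
    images: List[str] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        # asdict() recursively copies nested lists and dicts
        return asdict(self)


a = PersonInfoSketch(name="Ada Lovelace")
b = PersonInfoSketch(name="Alan Turing")
a.social_profiles.append({"platform": "GitHub", "url": "https://github.com/example"})
```

With this shape, `search_person` would not need to assign `person.social_profiles = []` before extending it, and appending to one instance leaves the others untouched.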

engines/search.py ADDED

	@@ -0,0 +1,133 @@

+"""
+RAG-based search engine with intelligent answer synthesis.
+"""
+from typing import List, Dict, Any, Optional
+import asyncio
+from langchain.chains import RetrievalQAWithSourcesChain
+from langchain.embeddings import HuggingFaceEmbeddings
+from langchain.vectorstores import FAISS
+from langchain.text_splitter import RecursiveCharacterTextSplitter
+from langchain.docstore.document import Document
+from duckduckgo_search import DDGS
+from googlesearch import search as gsearch
+import requests
+from bs4 import BeautifulSoup
+from tenacity import retry, stop_after_attempt, wait_exponential
+class SearchEngine:
+    def __init__(self):
+        self.embeddings = HuggingFaceEmbeddings(
+            model_name="sentence-transformers/all-mpnet-base-v2"
+        )
+        self.text_splitter = RecursiveCharacterTextSplitter(
+            chunk_size=500,
+            chunk_overlap=50
+        )
+    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
+    async def search_web(self, query: str, max_results: int = 10) -> List[Dict[str, str]]:
+        """Perform web search using multiple search engines."""
+        results = []
+        # DuckDuckGo Search
+        try:
+            with DDGS() as ddgs:
+                ddg_results = [r for r in ddgs.text(query, max_results=max_results)]
+                results.extend(ddg_results)
+        except Exception as e:
+            print(f"DuckDuckGo search error: {e}")
+        # Google Search
+        try:
+            google_results = gsearch(query, num_results=max_results)
+            results.extend([{"link": url, "title": url} for url in google_results])
+        except Exception as e:
+            print(f"Google search error: {e}")
+        return results[:max_results]
+    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
+    async def fetch_content(self, url: str) -> Optional[str]:
+        """Fetch and extract content from a webpage."""
+        try:
+            headers = {
+                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
+            }
+            response = requests.get(url, headers=headers, timeout=10)
+            response.raise_for_status()
+            soup = BeautifulSoup(response.text, "html.parser")
+            # Remove unwanted elements
+            for element in soup(["script", "style", "nav", "footer", "header"]):
+                element.decompose()
+            text = soup.get_text(separator="\n", strip=True)
+            return text
+        except Exception as e:
+            print(f"Error fetching {url}: {e}")
+            return None
+    async def process_search_results(self, query: str) -> Dict[str, Any]:
+        """Process search results and create a RAG-based answer."""
+        # Perform web search
+        search_results = await self.search_web(query)
+        # Fetch content from search results
+        documents = []
+        for result in search_results:
+            # DuckDuckGo results use "href"; the Google results built above use "link"
+            url = result.get("link") or result.get("href")
+            if not url:
+                continue
+            content = await self.fetch_content(url)
+            if content:
+                # Split content into chunks
+                chunks = self.text_splitter.split_text(content)
+                for chunk in chunks:
+                    doc = Document(
+                        page_content=chunk,
+                        metadata={"source": url, "title": result.get("title", url)}
+                    )
+                    documents.append(doc)
+        if not documents:
+            return {
+                "answer": "I couldn't find any relevant information.",
+                "sources": []
+            }
+        # Create vector store
+        vectorstore = FAISS.from_documents(documents, self.embeddings)
+        # Query the retriever directly: no LLM is wired in yet, and building
+        # RetrievalQAWithSourcesChain with llm=None fails validation
+        retriever = vectorstore.as_retriever()
+        # Get relevant documents
+        relevant_docs = retriever.get_relevant_documents(query)
+        # For now, return the most relevant chunks and sources
+        sources = []
+        content = []
+        for doc in relevant_docs[:3]:
+            if doc.metadata["source"] not in sources:
+                sources.append(doc.metadata["source"])
+            content.append(doc.page_content)
+        return {
+            "answer": "\n\n".join(content),
+            "sources": sources
+        }
+    async def search(self, query: str) -> Dict[str, Any]:
+        """Main search interface."""
+        try:
+            return await self.process_search_results(query)
+        except Exception as e:
+            return {
+                "answer": f"An error occurred: {str(e)}",
+                "sources": []
+            }
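One observation on the `@retry` decorators in this file: `fetch_content` catches every exception and returns `None`, so tenacity, which by default retries only when an exception propagates, likely never retries it. The sketch below illustrates the intended behavior with the standard library only. The names `retry_async` and `flaky_fetch` are hypothetical, and the delays are shortened for illustration; it is not the commit's implementation.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def retry_async(
    func: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.01,
) -> Optional[T]:
    """Call func up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return await func()
        except Exception as exc:
            if attempt == attempts - 1:
                # Out of attempts: fall back to None, as fetch_content does
                print(f"giving up after {attempts} attempts: {exc}")
                return None
            await asyncio.sleep(base_delay * 2 ** attempt)
    return None


calls = {"count": 0}


async def flaky_fetch() -> str:
    """Hypothetical stand-in for a network call; fails on its first two calls."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "page text"


result = asyncio.run(retry_async(flaky_fetch))
```

The key design point is that the error is swallowed only after the last attempt. With tenacity, the equivalent would be to let the exception escape the decorated function and handle the final failure in the caller.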

requirements.txt CHANGED

@@ -1,58 +1,42 @@
-# Base dependencies
 numpy>=1.23.5
-scikit-learn>=1.2.2
-scipy>=1.10.1
 pandas>=2.0.2
 tqdm>=4.65.0
-Pillow==10.0.0
 requests==2.31.0
-# PyTorch CPU (pre-built wheels)
 --extra-index-url https://download.pytorch.org/whl/cpu
 torch==2.0.1+cpu
 torchvision==0.15.2+cpu
-torchaudio==2.0.2+cpu
-# Transformers and embeddings
 transformers==4.31.0
-tokenizers==0.13.3
---extra-index-url https://huggingface.github.io/pytorch-transformers/whl/cpu/
 sentence-transformers==2.2.2
-huggingface-hub>=0.16.4
-# Web interface
 gradio==3.40.1
-# Search and scraping
-duckduckgo-search==3.8.5
-beautifulsoup4==4.12.2
-lxml==4.9.3
-googlesearch-python==1.2.3
-waybackpy==3.0.6
-google==3.0.0
-# LangChain and dependencies
-langchain==0.0.335
-pydantic==1.10.13
-# Browser automation
-selenium==4.15.2
-webdriver-manager==4.0.1
-# Networking and async
-aiohttp==3.8.5
-httpx==0.24.1
-async-timeout==4.0.3
-attrs==23.1.0
-multidict==6.0.4
-yarl==1.9.2
-frozenlist==1.4.0
-charset-normalizer==3.2.0
-idna==3.4
-certifi==2023.7.22
-urllib3==2.0.4
-# Domain info
 python-whois==0.8.0
 geopy==2.4.1
-protobuf==4.25.1

+# Core dependencies
+langchain==0.0.335
+pydantic==1.10.13
 numpy>=1.23.5
 pandas>=2.0.2
 tqdm>=4.65.0
+# Web and Networking
 requests==2.31.0
+aiohttp==3.8.5
+httpx==0.24.1
+beautifulsoup4==4.12.2
+selenium==4.15.2
+webdriver-manager==4.0.1
+googlesearch-python==1.2.3
+duckduckgo-search==3.8.5
+# ML and AI
 --extra-index-url https://download.pytorch.org/whl/cpu
 torch==2.0.1+cpu
 torchvision==0.15.2+cpu
 transformers==4.31.0
 sentence-transformers==2.2.2
+# UI
 gradio==3.40.1
+# OSINT Tools
 python-whois==0.8.0
 geopy==2.4.1
+socid-extractor==1.0.0
+holehe==1.61
+sherlock-project==0.14.3
+# Image Processing
+Pillow==10.0.0
+face-recognition==1.3.0
+# Utilities
+python-dotenv==1.0.0
+tenacity==8.2.3
+retry==0.9.2

utils/helpers.py ADDED

	@@ -0,0 +1,160 @@

+"""
+Common helper functions for the search engine.
+"""
+from typing import Dict, Any, List, Optional
+import re
+from datetime import datetime
+import hashlib
+import json
+def clean_text(text: str) -> str:
+    """Clean and normalize text content."""
+    # Remove extra whitespace
+    text = re.sub(r"\s+", " ", text)
+    # Remove special characters
+    text = re.sub(r"[^\w\s.,!?-]", "", text)
+    return text.strip()
+def extract_entities(text: str) -> Dict[str, List[str]]:
+    """Extract basic entities from text."""
+    entities = {
+        "emails": [],
+        "phones": [],
+        "urls": [],
+        "dates": []
+    }
+    # Extract emails
+    email_pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
+    entities["emails"] = re.findall(email_pattern, text)
+    # Extract phone numbers
+    phone_pattern = r"\+?\d{1,4}?[-.\s]?\(?\d{1,3}?\)?[-.\s]?\d{1,4}[-.\s]?\d{1,4}[-.\s]?\d{1,9}"
+    entities["phones"] = re.findall(phone_pattern, text)
+    # Extract URLs
+    url_pattern = r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+"
+    entities["urls"] = re.findall(url_pattern, text)
+    # Extract dates
+    date_pattern = r"\d{1,2}[-/]\d{1,2}[-/]\d{2,4}"
+    entities["dates"] = re.findall(date_pattern, text)
+    return entities
+def generate_hash(data: Any) -> str:
+    """Generate a hash for data deduplication."""
+    if isinstance(data, (dict, list)):
+        data = json.dumps(data, sort_keys=True)
+    elif not isinstance(data, str):
+        data = str(data)
+    return hashlib.md5(data.encode()).hexdigest()
+def format_date(date_str: str) -> Optional[str]:
+    """Format date string to consistent format."""
+    date_formats = [
+        "%Y-%m-%d",
+        "%d/%m/%Y",
+        "%m/%d/%Y",
+        "%Y/%m/%d",
+        "%d-%m-%Y",
+        "%m-%d-%Y"
+    ]
+    for fmt in date_formats:
+        try:
+            date_obj = datetime.strptime(date_str, fmt)
+            return date_obj.strftime("%Y-%m-%d")
+        except ValueError:
+            continue
+    return None
+def extract_name_parts(full_name: str) -> Dict[str, Optional[str]]:
+    """Extract first, middle, and last names."""
+    parts = full_name.strip().split()
+    if len(parts) <= 1:
+        # Empty input yields None for every part instead of raising IndexError
+        return {
+            "first_name": parts[0] if parts else None,
+            "middle_name": None,
+            "last_name": None
+        }
+    elif len(parts) == 2:
+        return {
+            "first_name": parts[0],
+            "middle_name": None,
+            "last_name": parts[1]
+        }
+    else:
+        return {
+            "first_name": parts[0],
+            "middle_name": " ".join(parts[1:-1]),
+            "last_name": parts[-1]
+        }
+def generate_username_variants(name: str) -> List[str]:
+    """Generate possible username variants from a name."""
+    name = name.lower()
+    parts = name.split()
+    variants = []
+    if len(parts) >= 2:
+        first, last = parts[0], parts[-1]
+        variants.extend([
+            first + last,
+            first + "_" + last,
+            first + "." + last,
+            first[0] + last,
+            first + last[0],
+            last + first,
+            last + "_" + first,
+            last + "." + first
+        ])
+    if len(parts) == 1:
+        variants.extend([
+            parts[0],
+            parts[0] + "123",
+            "the" + parts[0],
+            "real" + parts[0]
+        ])
+    return list(set(variants))
+def calculate_text_similarity(text1: str, text2: str) -> float:
+    """Calculate simple text similarity score."""
+    # Convert to sets of words
+    set1 = set(text1.lower().split())
+    set2 = set(text2.lower().split())
+    # Calculate Jaccard similarity
+    intersection = len(set1.intersection(set2))
+    union = len(set1.union(set2))
+    return intersection / union if union > 0 else 0.0
+def extract_social_links(text: str) -> List[Dict[str, str]]:
+    """Extract social media profile links from text."""
+    social_patterns = {
+        "twitter": r"https?://(?:www\.)?twitter\.com/([a-zA-Z0-9_]+)",
+        "facebook": r"https?://(?:www\.)?facebook\.com/([a-zA-Z0-9.]+)",
+        "instagram": r"https?://(?:www\.)?instagram\.com/([a-zA-Z0-9_.]+)",
+        "linkedin": r"https?://(?:www\.)?linkedin\.com/in/([a-zA-Z0-9_-]+)",
+        "github": r"https?://(?:www\.)?github\.com/([a-zA-Z0-9_-]+)"
+    }
+    results = []
+    for platform, pattern in social_patterns.items():
+        matches = re.finditer(pattern, text)
+        for match in matches:
+            results.append({
+                "platform": platform,
+                "username": match.group(1),
+                "url": match.group(0)
+            })
+    return results
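As a worked example of the Jaccard score used by `calculate_text_similarity`, the sketch below repeats the same word-set computation in a standalone function and applies it to near-duplicate filtering. The names `jaccard` and `drop_near_duplicates`, and the 0.8 threshold, are assumptions made for illustration and do not appear in the commit.

```python
from typing import List


def jaccard(text1: str, text2: str) -> float:
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    set1 = set(text1.lower().split())
    set2 = set(text2.lower().split())
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0


def drop_near_duplicates(snippets: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a snippet only if it is below the threshold against all kept ones."""
    kept: List[str] = []
    for snippet in snippets:
        if all(jaccard(snippet, other) < threshold for other in kept):
            kept.append(snippet)
    return kept


# Shared words: open, source, tools (3); distinct words overall: 5
score = jaccard("open source intelligence tools", "open source search tools")
unique = drop_near_duplicates(["a b c d e", "a b c d e", "x y z"])
```

Here `score` is 3/5 = 0.6, and `unique` keeps the first and third snippets because the second is identical to the first.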

utils/web.py ADDED

	@@ -0,0 +1,128 @@

+"""
+Web scraping and processing utilities.
+"""
+from typing import Dict, Any, List, Optional
+import requests
+from bs4 import BeautifulSoup
+import re
+from urllib.parse import urlparse, urljoin
+from tenacity import retry, stop_after_attempt, wait_exponential
+class WebUtils:
+    def __init__(self):
+        self.session = requests.Session()
+        self.session.headers.update({
+            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
+        })
+    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
+    async def fetch_url(self, url: str, timeout: int = 10) -> Optional[str]:
+        """Fetch content from a URL."""
+        try:
+            response = self.session.get(url, timeout=timeout)
+            response.raise_for_status()
+            return response.text
+        except Exception as e:
+            print(f"Error fetching {url}: {e}")
+            return None
+    def extract_text(self, html: str) -> str:
+        """Extract clean text from HTML content."""
+        soup = BeautifulSoup(html, "html.parser")
+        # Remove unwanted elements
+        for element in soup(["script", "style", "nav", "footer", "header"]):
+            element.decompose()
+        # Get text and clean it
+        text = soup.get_text(separator="\n", strip=True)
+        # Remove excessive newlines
+        text = re.sub(r"\n\s*\n", "\n\n", text)
+        return text.strip()
+    def extract_metadata(self, html: str, url: str) -> Dict[str, Any]:
+        """Extract metadata from HTML content."""
+        soup = BeautifulSoup(html, "html.parser")
+        metadata = {
+            "url": url,
+            "title": None,
+            "description": None,
+            "keywords": None,
+            "author": None,
+            "published_date": None
+        }
+        # Extract title
+        metadata["title"] = (
+            soup.title.string if soup.title else None
+        )
+        # Extract meta tags
+        meta_tags = soup.find_all("meta")
+        for tag in meta_tags:
+            # Description
+            if tag.get("name", "").lower() == "description":
+                metadata["description"] = tag.get("content")
+            # Keywords
+            elif tag.get("name", "").lower() == "keywords":
+                metadata["keywords"] = tag.get("content")
+            # Author
+            elif tag.get("name", "").lower() == "author":
+                metadata["author"] = tag.get("content")
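+            # Published date (Open Graph pages expose it via the "property" attribute)
+            elif (tag.get("name") or tag.get("property") or "").lower() in [
+                "published_time", "publication_date", "article:published_time"
+            ]:
+                metadata["published_date"] = tag.get("content")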
+        return metadata
+    def extract_links(self, html: str, base_url: str) -> List[str]:
+        """Extract all links from HTML content."""
+        soup = BeautifulSoup(html, "html.parser")
+        links = []
+        for link in soup.find_all("a"):
+            href = link.get("href")
+            if href:
+                # Convert relative URLs to absolute
+                absolute_url = urljoin(base_url, href)
+                # Only include http(s) URLs
+                if absolute_url.startswith(("http://", "https://")):
+                    links.append(absolute_url)
+        return list(set(links))  # Remove duplicates
+    def is_valid_url(self, url: str) -> bool:
+        """Check if a URL is valid."""
+        try:
+            result = urlparse(url)
+            return all([result.scheme, result.netloc])
+        except Exception:
+            return False
+    def clean_url(self, url: str) -> str:
+        """Clean and normalize a URL."""
+        parsed = urlparse(url)
+        path = parsed.path
+        # Drop tracking parameters; note that parameters without "=" and any
+        # URL fragment are also discarded
+        query_params = []
+        if parsed.query:
+            for param in parsed.query.split("&"):
+                if "=" in param:
+                    key = param.split("=")[0].lower()
+                    if not any(track in key for track in ["utm_", "ref_", "source", "campaign"]):
+                        query_params.append(param)
+        # Rebuild URL
+        clean_url = f"{parsed.scheme}://{parsed.netloc}{path}"
+        if query_params:
+            clean_url += "?" + "&".join(query_params)
+        return clean_url
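`clean_url` matches tracking names by substring, so a parameter such as `resource` is also removed because it contains `source`. A minimal sketch of a stricter variant, assuming a hypothetical function name `strip_tracking` and an illustrative choice of prefixes and exact keys, can rely on `urllib.parse` to split and rebuild the URL:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative lists; which names count as tracking is an assumption
TRACKING_PREFIXES = ("utm_", "ref_")
TRACKING_KEYS = {"source", "campaign"}


def strip_tracking(url: str) -> str:
    """Remove tracking query parameters while keeping path and fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_KEYS
    ]
    # urlencode re-encodes the surviving pairs; urlunsplit omits an empty query
    return urlunsplit(parts._replace(query=urlencode(kept)))


cleaned = strip_tracking(
    "https://example.com/docs?utm_source=news&resource=api&page=2#intro"
)
bare = strip_tracking("https://example.com/?utm_medium=email")
```

Matching on prefixes and exact keys keeps `resource=api` in place, and the fragment survives because the URL is rebuilt from all of its parsed components.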