[deploy] push to HF

- .dockerignore +3 -0
- .gitignore +15 -0
- Dockerfile +20 -0
- README.md +117 -11
- app.py +14 -0
- requirements.txt +7 -0
- src/models.py +56 -0
- src/prompts.py +64 -0
- src/synth_data_gen.py +42 -0
- src/ui.py +159 -0
- src/utils.py +49 -0
.dockerignore
ADDED
@@ -0,0 +1,3 @@
+.env
+__pycache__/
+*.py[cod]
.gitignore
ADDED
@@ -0,0 +1,15 @@
+# Ignore Jupyter Notebook checkpoints
+.ipynb_checkpoints/
+
+# Ignore output folder
+output/
+
+# Ignore environment files
+.env
+
+# Ignore Python cache files
+**/__pycache__/
+*.py[cod]
+
+# Ignore .gitattributes file
+.gitattributes
Dockerfile
ADDED
@@ -0,0 +1,20 @@
+# Use a Python 3.10 slim image as the base
+FROM python:3.10-slim
+
+# Set working directory inside the container
+WORKDIR /app
+
+# Copy the requirements.txt first (for better caching)
+COPY requirements.txt .
+
+# Install the dependencies from requirements.txt
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Now, copy the rest of the project files into the container
+COPY . .
+
+# Expose the port where Gradio will run (default: 7860)
+EXPOSE 7860
+
+# Set the default command to run the app
+CMD ["python", "app.py"]
README.md
CHANGED
@@ -1,11 +1,117 @@
-
-
-
-
-
-
-
-
-
-
-
+# 🧬 SynthDataGen: AI-Powered Synthetic Dataset Generator
+
+<img src="assets/logo.jpg" alt="DataSynth_logo" width="200">
+
+<a href="https://synthdatagen-app.onrender.com/">👀 <b>Live Demo</b></a>
+
+📷 <b>Screenshots</b>
+
+<a href="screenshot_1.png"><img src="assets/screenshot_1.png" width="400"></a>
+<a href="screenshot_2.png"><img src="assets/screenshot_2.png" width="335"></a>
+
+
+## 📖 Overview
+**SynthDataGen** is an AI-powered tool that creates **realistic synthetic data** for any project. You don't need to collect real information; just tell SynthDataGen what kind of data you want, and it will **quickly generate** it. Thanks to its **easy-to-use web interface** built with Gradio, **anyone** can start making custom datasets right away.
+
+### 🔑 **Key Features**
+- Generates **various types of datasets**, such as **tabular**, **time-series**, or **text** data.
+- Saves output in different **formats**, including **JSON**, **CSV**, **Parquet**, and **Markdown**.
+- Uses **AI models** such as **GPT** and **Claude** to create the dataset automatically.
+- A short **description of the desired dataset** is all that's needed to trigger the generation process.
+- Provides a **download link** once the dataset is ready, making it easy to save and use.
+- The **interface updates options automatically** and includes helpful **examples for inspiration**.
+
+### 🎯 **How It Works**
+1️⃣ Describe the dataset to generate by entering a short business problem or topic.
+
+2️⃣ Select the dataset type, output format, AI model, and number of samples.
+
+3️⃣ Download the generated dataset once it's ready — clean, structured, and ready to use.
+
+### 🤔 **Why Choose SynthDataGen?**
+- ⏰ **Time Saver**: Automatically creates tabular, time-series, or text data, so there is no need to gather real data yourself.
+- ⚙️ **Flexible and Accessible**: Supports multiple formats (JSON, CSV, Parquet, Markdown) with a beginner-friendly interface.
+- 🤖 **Powered by GPT & Claude**: Uses two top AI models to produce realistic synthetic data for prototyping or research.
+
+### 🔧 **SynthDataGen Customization**
+SynthDataGen is fully customizable through Python code. You can easily modify:
+- ✏️ The **system prompt**, to control how the AI models generate code
+- 🤖 The **model lineup**: add new **frontier** or **open-source models** (e.g., LLaMA, DeepSeek, Qwen), or integrate any model from **Hugging Face libraries** and **inference endpoints**
+- 📊 The **dataset types**, by adding new categories such as image metadata or dialogue transcripts
+- 📁 The **output formats**, such as YAML or XML
+- 🎨 The **interface styling**, including layout, colors, and themes
+
+### 🏗️ **Architecture**
+
+<a href="func_architecture.png"><img src="assets/func_architecture.png"></a>
+<a href="tech_architecture.png"><img src="assets/tech_architecture.png"></a>
+
+## ⚙️ Setup & Installation
+
+**1. Clone the Repository**
+```bash
+git clone https://github.com/lisek75/synthdatagen_app.git
+cd synthdatagen_app
+```
+
+**2. Install Dependencies**
+
+```bash
+conda env create -f synthdatagen_env.yml
+conda activate synthdatagen
+```
+**3. Configure API Keys & Endpoints**
+
+Create a `.env` file with the following variables:
+```bash
+OPENAI_API_KEY=your_openai_api_key
+ANTHROPIC_API_KEY=your_anthropic_api_key
+```
+Ensure that the `.env` file remains **secure** and is never shared publicly.
+
+
+## 🚀 Running the Gradio App
+
+**Run the Application Locally**
+```bash
+python app.py
+```
+
+**Run the Application with Docker**
+
+To run the app using Docker, either build the image yourself or use the pre-built image from Docker Hub.
+
+- Build and run the app locally:
+  Build the image from the provided Dockerfile using your own Docker Hub username:
+```bash
+docker build -t <user-dockerhub-username>/synthdatagen:v1.0 .
+docker run -d --name synthdatagen-container -p 7860:7860 --env-file .env <user-dockerhub-username>/synthdatagen:v1.0
+```
+This will build the Docker image and run the app in a container.
+
+- Run the app directly from Docker Hub:
+  Pull the pre-built image from the Docker Hub repository (⚠️ make sure to use the latest version tag from Docker Hub).
+  Check: https://hub.docker.com/r/lizk75/synthdatagen/tags
+
+```bash
+docker pull lizk75/synthdatagen:v1.0
+docker run -d --name synthdatagen-container -p 7860:7860 --env-file .env lizk75/synthdatagen:v1.0
+```
+
+
+## 🧑‍💻 Usage Guide
+- You can launch the app directly from:
+  - The **demo link** provided at the top of this README.
+  - Or by running it **locally** with `python app.py` from Visual Studio or any other IDE.
+- **Describe your dataset** by entering a clear business problem or topic.
+- Select the **dataset type** and **output format**.
+- Choose an **AI model** (GPT or Claude).
+- Set the desired **number of samples**.
+- Click **Create Dataset** and download the generated file.
+
+
+## 📓 Google Colab
+A **notebook version** is available for users who prefer running the app in a notebook environment. The notebook includes additional **open-source models** that require a **GPU**, which is why it's recommended to run it on Google Colab or a local machine with GPU support.
+
+https://github.com/lisek75/nlp_llms_notebook/blob/main/07_data_generator.ipynb
app.py
ADDED
@@ -0,0 +1,14 @@
+from src.ui import build_ui
+
+if __name__ == "__main__":
+    # Build the user interface
+    ui = build_ui()
+
+    # Launch the UI in the browser with access to the "output" folder
+    ui.launch(
+        inbrowser=True,
+        allowed_paths=["output"],
+        server_name="0.0.0.0",
+        server_port=7860
+    )
+
requirements.txt
ADDED
@@ -0,0 +1,7 @@
+python-dotenv==1.0.1
+openai==1.65.5
+anthropic==0.49.0
+gradio==5.21.0
+pandas
+numpy
+pyarrow
src/models.py
ADDED
@@ -0,0 +1,56 @@
+from openai import OpenAI
+import anthropic
+import os
+from dotenv import load_dotenv
+
+# Load environment variables from .env file
+load_dotenv(override=True)
+
+# Retrieve API keys from environment
+openai_api_key = os.getenv("OPENAI_API_KEY")
+anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")
+
+# Warn if any API key is missing
+if not openai_api_key:
+    print("❌ OpenAI API Key is missing!")
+
+if not anthropic_api_key:
+    print("❌ Anthropic API Key is missing!")
+
+# Initialize API clients
+openai = OpenAI(api_key=openai_api_key)
+claude = anthropic.Anthropic()
+
+# Model names
+OPENAI_MODEL = "gpt-4o-mini"
+CLAUDE_MODEL = "claude-3-5-sonnet-20240620"
+
+# Call OpenAI's GPT model with prompt and system message
+def get_gpt_completion(prompt, system_message):
+    try:
+        response = openai.chat.completions.create(
+            model=OPENAI_MODEL,
+            messages=[
+                {"role": "system", "content": system_message},
+                {"role": "user", "content": prompt}
+            ],
+            stream=False,
+        )
+        return response.choices[0].message.content
+    except Exception as e:
+        print(f"GPT error: {e}")
+        raise
+
+# Call Anthropic's Claude model with prompt and system message
+def get_claude_completion(prompt, system_message):
+    try:
+        result = claude.messages.create(
+            model=CLAUDE_MODEL,
+            max_tokens=2000,
+            system=system_message,
+            messages=[{"role": "user", "content": prompt}]
+        )
+        return result.content[0].text
+    except Exception as e:
+        print(f"Claude error: {e}")
+        raise
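For reference, this is a minimal sketch of the chat payload that `get_gpt_completion` above assembles before handing it to the OpenAI client. No API call is made here, and the `system_message` and `prompt` strings are illustrative placeholders, not values from the app:

```python
# Illustrative placeholders standing in for the app's real prompt strings.
system_message = "You generate synthetic datasets based on a business problem."
prompt = "Generate a synthetic text dataset in JSON format. Samples: 10"

# The same system/user message structure get_gpt_completion builds.
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": prompt},
]
print(messages[0]["role"], messages[1]["role"])
```

The Claude path carries the same information, but passes the system text via the `system` parameter instead of a message with role `system`.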
src/prompts.py
ADDED
@@ -0,0 +1,64 @@
+from datetime import datetime
+
+system_message = """
+You are a helpful assistant whose main purpose is to generate synthetic datasets based on a given business problem.
+
+🔹 General Guidelines:
+- Be accurate and concise.
+- Use only standard Python libraries (pandas, numpy, os, datetime, etc.)
+- The dataset must contain the requested number of samples.
+- Always respect the requested output format exactly.
+- If multiple entities exist, save each to a separate file.
+- Do not use f-strings anywhere in the code — not in file paths or in content. Use standard string concatenation instead.
+
+🔹 File Path Rules:
+- Define the full file path using os.path.join(...) — exactly as shown — no shortcuts or direct strings.
+- Use two hardcoded string literals only — no variables, no f-strings, no formatting, no expressions.
+- First argument: full directory path (use forward slashes).
+- Second argument: full filename with timestamp and correct extension.
+- Example: os.path.join("C:/Users/.../output", "sales_20250323_123456.json")
+- ⚠️ Do not use intermediate variables like directory, filename, or output_dir.
+- ⚠️ Do not skip or replace any of the above instructions. They are required for the code to work correctly.
+
+🔹 File Saving Instructions:
+
+- ✅ CSV:
+  df.to_csv(file_path, index=False, encoding="utf-8")
+
+- ✅ JSON:
+  with open(file_path, "w", encoding="utf-8") as f:
+      df.to_json(f, orient="records", lines=False, force_ascii=False)
+
+- ✅ Parquet:
+  df.to_parquet(file_path, engine="pyarrow", index=False)
+
+- ✅ Markdown (for Text):
+  - Generate properly formatted Markdown content.
+  - Save it as a `.md` file using UTF-8 encoding.
+"""
+
+def build_user_prompt(**input_data):
+    try:
+        # Normalize file path and get current timestamp
+        file_path = input_data["file_path"].replace("\\", "/")
+        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+
+        # Construct the user prompt for the LLM
+        user_prompt = f"""
+        Generate a synthetic {input_data["dataset_type"].lower()} dataset in {input_data["output_format"].upper()} format.
+        Business problem: {input_data["business_problem"]}
+        Samples: {input_data["num_samples"]}
+        Directory: {file_path}
+        Timestamp: {timestamp}
+        """
+        return user_prompt
+
+    except KeyError as e:
+        # Handle missing keys in input_data
+        print(f"Missing input key: {e}")
+        raise
+    except Exception as e:
+        # Log any other error during prompt building
+        print(f"Error in build_user_prompt: {e}")
+        raise
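To preview what the user prompt looks like, here is a standalone re-creation of `build_user_prompt`'s formatting logic, inlined so it runs without the `src` package. The sample inputs are made up for illustration:

```python
from datetime import datetime

# Inline re-creation of build_user_prompt's f-string logic (a sketch, not
# an import of the real module).
def build_user_prompt(**input_data):
    file_path = input_data["file_path"].replace("\\", "/")
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"""
    Generate a synthetic {input_data["dataset_type"].lower()} dataset in {input_data["output_format"].upper()} format.
    Business problem: {input_data["business_problem"]}
    Samples: {input_data["num_samples"]}
    Directory: {file_path}
    Timestamp: {timestamp}
    """

# Hypothetical inputs, mirroring what the UI passes to the generator.
prompt = build_user_prompt(
    business_problem="Customer reviews for sentiment analysis",
    dataset_type="Text",
    output_format="json",
    num_samples=50,
    file_path="output",
)
print(prompt)
```

Note how the dataset type is lowercased and the output format uppercased before being interpolated, which is why the dropdown values ("Text", "JSON", "csv", ...) can be passed through unchanged.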
src/synth_data_gen.py
ADDED
@@ -0,0 +1,42 @@
+import os
+from datetime import datetime
+from .prompts import build_user_prompt, system_message
+from .models import get_gpt_completion, get_claude_completion
+from .utils import execute_code_in_virtualenv
+
+class SynthDataGen:
+    def __init__(self, output_dir="output"):
+        # Set the default output directory and ensure it's created
+        self.output_dir = output_dir
+        os.makedirs(self.output_dir, exist_ok=True)
+
+    def get_timestamp(self):
+        # Return current timestamp for file naming
+        return datetime.now().strftime("%Y%m%d_%H%M%S")
+
+    def generate_dataset(self, **input_data):
+        try:
+            # Add output directory path to input data
+            input_data["file_path"] = self.output_dir
+
+            # Build the prompt to send to the LLM
+            prompt = build_user_prompt(**input_data)
+
+            # Call the selected LLM based on the model
+            if input_data["model"] == "GPT":
+                code = get_gpt_completion(prompt, system_message)
+            elif input_data["model"] == "Claude":
+                code = get_claude_completion(prompt, system_message)
+            else:
+                raise ValueError("Invalid model selected.")
+
+            # Execute the generated code and return the output file path
+            file_path = execute_code_in_virtualenv(code)
+            return file_path
+
+        except Exception as e:
+            # Log and re-raise any errors
+            print(f"Error in generate_dataset: {e}")
+            raise
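The model dispatch in `generate_dataset` can be sketched standalone, with the LLM calls stubbed out. The `fake_gpt`/`fake_claude`/`dispatch` names below are illustrative stand-ins, not part of the real module:

```python
# Stubs that return a fenced code block, like the real completion functions do.
def fake_gpt(prompt, system_message):
    return "```python\nprint('from gpt')\n```"

def fake_claude(prompt, system_message):
    return "```python\nprint('from claude')\n```"

# The same branching generate_dataset uses to pick a model.
def dispatch(model, prompt, system_message="..."):
    if model == "GPT":
        return fake_gpt(prompt, system_message)
    elif model == "Claude":
        return fake_claude(prompt, system_message)
    raise ValueError("Invalid model selected.")

code = dispatch("GPT", "generate something")
print(code)
```

Any value other than "GPT" or "Claude" raises `ValueError`, which the surrounding `try/except` logs and re-raises to the UI layer.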
src/ui.py
ADDED
@@ -0,0 +1,159 @@
+import os
+import gradio as gr
+import threading
+from src.synth_data_gen import SynthDataGen
+
+generator = SynthDataGen()
+
+# Update the output format choices based on the selected dataset type
+def update_output_format(dataset_type):
+    if dataset_type in ["Tabular", "Time-series"]:
+        return gr.update(choices=["JSON", "csv", "Parquet"], value="JSON")
+    elif dataset_type == "Text":
+        return gr.update(choices=["JSON", "Markdown"], value="JSON")
+
+def update_pipeline(business_problem, dataset_type, output_format, num_samples, model):
+    # Check if business problem is empty
+    if not business_problem.strip():
+        yield [gr.update(visible=False), gr.update(visible=True), "❌ Please enter a business problem before generating."]
+        return
+
+    # Initial feedback while generating
+    yield [gr.update(visible=False), gr.update(visible=False), "⏳ Generating dataset..."]
+
+    try:
+        # Pack inputs into a dictionary for the generator
+        input_data = {
+            "business_problem": business_problem,
+            "dataset_type": dataset_type,
+            "output_format": output_format,
+            "num_samples": num_samples,
+            "model": model
+        }
+
+        # Generate dataset file
+        file_path = generator.generate_dataset(**input_data)
+        print("🧪 File result returned:", file_path)
+
+        # Check if file exists and return success message + file path
+        if isinstance(file_path, str) and os.path.exists(file_path):
+            threading.Timer(60, os.remove, args=[file_path]).start()  # Auto-delete after 60s
+            yield [gr.update(value=file_path, visible=True), gr.update(visible=True), "✅ Dataset ready for download."]
+        else:
+            # Handle invalid or missing file
+            yield [gr.update(visible=False), gr.update(visible=True), "❌ Error: File not created or path invalid."]
+
+    except Exception as e:
+        # Catch and display any errors in the pipeline
+        yield [gr.update(visible=False), gr.update(visible=True), f"❌ Pipeline error: {e}"]
+
+def build_ui(css_path="assets/styles.css"):
+    with open(css_path, "r") as f:
+        css = f.read()
+
+    with gr.Blocks(css=css, title="🧬SynthDataGen") as ui:
+        with gr.Column(elem_id="app-container"):
+            gr.Markdown("<h1 id='app-title'>SynthDataGen 🧬 </h1>")
+            gr.Markdown("<h2 id='app-subtitle'>AI-Powered Synthetic Dataset Generator</h2>")
+
+            gr.HTML("""
+                <div id="intro-text">
+                    <p>With SynthDataGen, easily generate <strong>diverse datasets in different formats</strong> for testing, development, and AI training.</p>
+                    <h4>🎯 How It Works:</h4>
+                    <ol>
+                        <li>1️⃣ Define your business problem or dataset topic.</li>
+                        <li>2️⃣ Select the dataset type, output format, model, and number of samples.</li>
+                        <li>3️⃣ Receive your synthetic dataset — ready to download and use!</li>
+                    </ol>
+                </div>
+            """)
+
+            gr.HTML("""
+                <div id="learn-more-button">
+                    <a href="https://github.com/lisek75/synthdatagen_app/blob/main/README.md" class="button-link" target="_blank">Learn More</a>
+                </div>
+            """)
+
+            gr.Markdown("""
+                <p><strong>🧠 Need inspiration?</strong> Try one of these examples:</p>
+                <ul>
+                    <li>Movie summaries for genre classification.</li>
+                    <li>Generate customer chats with realistic dialogue, chat_id, timestamp, names, sentiment label, and aligned transcript.</li>
+                    <li>Create daily stock prices for 2 companies with typical fields like date, ticker, open, close, high, low, and volume.</li>
+                </ul>
+            """)
+
+            gr.Markdown("<p><strong>Start generating your synthetic datasets now!</strong> 🗂️✨</p>")
+
+            with gr.Group(elem_id="input-container"):
+
+                business_problem = gr.Textbox(
+                    placeholder="Describe the dataset you want (e.g., Job postings, Customer reviews, Sensor data, Movie titles)",
+                    lines=2,
+                    label="📌 Business Problem",
+                    elem_classes=["label-box"],
+                    elem_id="business-problem-box"
+                )
+
+                with gr.Row(elem_classes="column-gap"):
+                    with gr.Column(scale=1):
+                        dataset_type = gr.Dropdown(
+                            ["Tabular", "Time-series", "Text"],
+                            value="Tabular",
+                            label="📊 Dataset Type",
+                            elem_classes=["label-box"],
+                            elem_id="custom-dropdown"
+                        )
+
+                    with gr.Column(scale=1):
+                        output_format = gr.Dropdown(
+                            choices=["JSON", "csv", "Parquet"],
+                            value="JSON",
+                            label="📁 Output Format",
+                            elem_classes=["label-box"],
+                            elem_id="custom-dropdown"
+                        )
+
+                # Bind the update function to the dataset type dropdown
+                dataset_type.change(
+                    update_output_format,
+                    inputs=[dataset_type],
+                    outputs=[output_format]
+                )
+
+                with gr.Row(elem_classes="row-spacer column-gap"):
+                    with gr.Column(scale=1):
+                        model = gr.Dropdown(
+                            ["GPT", "Claude"],
+                            value="GPT",
+                            label="🤖 Model",
+                            elem_classes=["label-box"],
+                            elem_id="custom-dropdown"
+                        )
+
+                    with gr.Column(scale=1):
+                        num_samples = gr.Slider(
+                            minimum=10,
+                            maximum=1000,
+                            value=10,
+                            step=1,
+                            interactive=True,
+                            label="🔢 Number of Samples",
+                            elem_classes=["label-box"]
+                        )
+
+            # Hidden file component for dataset download
+            file_download = gr.File(label="Download Dataset", visible=False, elem_id="download-box")
+
+            # Component to display status messages
+            status_message = gr.Markdown("", label="Status")
+
+            # Button to trigger dataset generation
+            run_btn = gr.Button("Create a dataset", elem_id="run-btn")
+            run_btn.click(
+                update_pipeline,
+                inputs=[business_problem, dataset_type, output_format, num_samples, model],
+                outputs=[file_download, run_btn, status_message]
+            )
+
+    return ui  # Return the complete UI
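The dataset-type/format coupling in `update_output_format` can be checked without Gradio. This sketch returns plain dicts in place of `gr.update(...)` (that substitution is an assumption made only so the snippet runs standalone):

```python
# Standalone re-creation of update_output_format's branching; plain dicts
# stand in for gr.update(...) so no Gradio install is needed.
def update_output_format(dataset_type):
    if dataset_type in ["Tabular", "Time-series"]:
        return {"choices": ["JSON", "csv", "Parquet"], "value": "JSON"}
    elif dataset_type == "Text":
        return {"choices": ["JSON", "Markdown"], "value": "JSON"}

print(update_output_format("Text"))
```

Tabular and time-series datasets offer JSON/csv/Parquet, text datasets offer JSON/Markdown, and every switch resets the selection to JSON.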
src/utils.py
ADDED
@@ -0,0 +1,49 @@
+import re
+import os
+import subprocess
+import sys
+
+# Extract the Python code block from an LLM response
+def extract_code(text):
+    try:
+        match = re.search(r"```python(.*?)```", text, re.DOTALL)
+        if match:
+            code = match.group(0).strip()
+        else:
+            code = ""
+            print("No matching code block found.")
+        return code.replace("```python\n", "").replace("```", "")
+    except Exception as e:
+        print(f"Code extraction error: {e}")
+        raise
+
+# Extract the file path from a code string that uses os.path.join()
+def extract_file_path(code_str):
+    try:
+        match = re.search(r'os\.path\.join\(\s*["\'](.+?)["\']\s*,\s*["\'](.+?)["\']\s*\)', code_str)
+        if match:
+            folder = match.group(1)
+            filename = match.group(2)
+            return os.path.join(folder, filename)
+        print("No file path found.")
+        return None
+    except Exception as e:
+        print(f"File path extraction error: {e}")
+        raise
+
+# Execute extracted Python code in a subprocess using the given interpreter
+def execute_code_in_virtualenv(text, python_interpreter=sys.executable):
+    if not python_interpreter:
+        raise EnvironmentError("Python interpreter not found.")
+
+    code_str = extract_code(text)
+    command = [python_interpreter, "-c", code_str]
+
+    try:
+        print("✅ Running script:", command)
+        result = subprocess.run(command, check=True, capture_output=True, text=True)
+        file_path = extract_file_path(code_str)
+        print("✅ Extracted file path:", file_path)
+        return file_path
+    except subprocess.CalledProcessError as e:
+        return (f"Execution error:\n{e.stderr.strip()}", None)
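The two regexes above are the contract between the LLM's response and the executor: the code must sit in a ```python fence, and the file path must be a literal `os.path.join("dir", "file")` call. This sketch applies both patterns to a made-up response (the functions are copied inline so the snippet runs without the `src` package):

```python
import re
import os

# Inline copies of the extraction helpers from src/utils.py.
def extract_code(text):
    match = re.search(r"```python(.*?)```", text, re.DOTALL)
    code = match.group(0).strip() if match else ""
    return code.replace("```python\n", "").replace("```", "")

def extract_file_path(code_str):
    match = re.search(r'os\.path\.join\(\s*["\'](.+?)["\']\s*,\s*["\'](.+?)["\']\s*\)', code_str)
    if match:
        return os.path.join(match.group(1), match.group(2))
    return None

# A made-up LLM response containing a fenced code block.
response = (
    "Here is the script:\n"
    "```python\n"
    'path = os.path.join("output", "sales_20250323_123456.json")\n'
    "```\n"
)
code = extract_code(response)
path = extract_file_path(code)
print(path)
```

This is also why the system prompt in src/prompts.py insists on two hardcoded string literals and forbids f-strings and intermediate variables: anything else would defeat the `extract_file_path` regex.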