[deploy] push to HF

- .dockerignore +3 -0
- .gitignore +15 -0
- Dockerfile +20 -0
- README.md +117 -11
- app.py +14 -0
- requirements.txt +7 -0
- src/models.py +56 -0
- src/prompts.py +64 -0
- src/synth_data_gen.py +42 -0
- src/ui.py +159 -0
- src/utils.py +49 -0
.dockerignore
ADDED
@@ -0,0 +1,3 @@
+.env
+__pycache__/
+*.py[cod]
.gitignore
ADDED
@@ -0,0 +1,15 @@
+# Ignore Jupyter Notebook checkpoints
+.ipynb_checkpoints/
+
+# Ignore output folder
+output/
+
+# Ignore environment files
+.env
+
+# Ignore Python cache files
+**/__pycache__/
+*.py[cod]
+
+# Ignore .gitattributes file
+.gitattributes
Dockerfile
ADDED
@@ -0,0 +1,20 @@
+# Use a Python 3.10 slim image as the base
+FROM python:3.10-slim
+
+# Set working directory inside the container
+WORKDIR /app
+
+# Copy the requirements.txt first (for better caching)
+COPY requirements.txt .
+
+# Install the dependencies from requirements.txt
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Now, copy the rest of the project files into the container
+COPY . .
+
+# Expose the port where Gradio will run (default: 7860)
+EXPOSE 7860
+
+# Set the default command to run the app
+CMD ["python", "app.py"]
README.md
CHANGED
@@ -1,11 +1,117 @@
-
-
-
-
-
-
-
-
-
-
-
+# 🧬 SynthDataGen: AI-Powered Synthetic Dataset Generator
+
+<img src="assets/logo.jpg" alt="DataSynth_logo" width="200">
+
+<a href="https://synthdatagen-app.onrender.com/">👀 <b>Live Demo</b></a>
+
+📷 <b>Screenshots</b>
+
+<a href="screenshot_1.png"><img src="assets/screenshot_1.png" width="400"></a>
+<a href="screenshot_2.png"><img src="assets/screenshot_2.png" width="335"></a>
+
+
+## 📖 Overview
+**SynthDataGen** is an AI-powered tool that creates **realistic synthetic data** for any project. You don't need to collect real information; just tell SynthDataGen what kind of data you want, and it will **quickly generate** it. Thanks to its **easy-to-use web interface** built with Gradio, **anyone** can start making custom datasets right away.
+
+### 🔑 **Key Features**
+- Generates **various types of datasets**, such as **tabular**, **time-series**, or **text** data.
+- Saves output in different **formats**, including **JSON**, **CSV**, **Parquet**, and **Markdown**.
+- Uses **AI models** such as **GPT** and **Claude** to create the dataset automatically.
+- A short **description of the desired dataset** is all that's needed to trigger the generation process.
+- Provides a **download link** once the dataset is ready, making it easy to save and use.
+- The **interface updates options automatically** and includes helpful **examples for inspiration**.
+
+### 🎯 **How It Works**
+1️⃣ Describe the dataset to generate by entering a short business problem or topic.
+
+2️⃣ Select the dataset type, output format, AI model, and number of samples.
+
+3️⃣ Download the generated dataset once it's ready — clean, structured, and ready to use.
+
+### 🤔 **Why Choose SynthDataGen?**
+- ⏰ **Time Saver**: Automatically creates tabular, time-series, or text data, so there is no need to gather real data yourself.
+- ⚙️ **Flexible and Accessible**: Supports multiple formats (JSON, CSV, Parquet, Markdown) with a beginner-friendly interface.
+- 🤖 **Powered by GPT & Claude**: Uses two top AI models to produce realistic synthetic data for prototyping or research.
+
+### 🔧 **SynthDataGen Customization**
+SynthDataGen is fully customizable through Python code. You can easily modify:
+- ✏️ The **system prompt**, to control how the AI models generate code
+- 🤖 The **model lineup**: add new **frontier** or **open-source models** (e.g., LLaMA, DeepSeek, Qwen), or integrate any model from **Hugging Face libraries** and **inference endpoints**
+- 📊 The **dataset types**, by adding new categories such as image metadata or dialogue transcripts
+- 📁 The **output formats**, such as YAML or XML
+- 🎨 The **interface styling**, including layout, colors, and themes
+
+### 🏗️ **Architecture**
+
+<a href="func_architecture.png"><img src="assets/func_architecture.png"></a>
+<a href="tech_architecture.png"><img src="assets/tech_architecture.png"></a>
+
+## ⚙️ Setup & Installation
+
+**1. Clone the Repository**
+```bash
+git clone https://github.com/lisek75/synthdatagen_app.git
+cd synthdatagen_app
+```
+
+**2. Install Dependencies**
+
+```bash
+conda env create -f synthdatagen_env.yml
+conda activate synthdatagen
+```
+**3. Configure API Keys & Endpoints**
+
+Create a `.env` file with the following variables:
+```bash
+OPENAI_API_KEY=your_openai_api_key
+ANTHROPIC_API_KEY=your_anthropic_api_key
+```
+Ensure that the `.env` file remains **secure** and is never shared publicly.
+
+
+## 🚀 Running the Gradio App
+
+**Run the Application Locally**
+```bash
+python app.py
+```
+
+**Run the Application with Docker**
+
+To run the app using Docker, either build the image yourself or use the pre-built image from Docker Hub.
+
+- Build and run the app locally:
+  Build the image from the provided Dockerfile using your own Docker Hub username:
+```bash
+docker build -t <user-dockerhub-username>/synthdatagen:v1.0 .
+docker run -d --name synthdatagen-container -p 7860:7860 --env-file .env <user-dockerhub-username>/synthdatagen:v1.0
+```
+This will build the Docker image and run the app in a container.
+
+- Run the app directly from Docker Hub:
+  Pull the pre-built image from the Docker Hub repository (⚠️ make sure to use the latest version tag from Docker Hub).
+  Check: https://hub.docker.com/r/lizk75/synthdatagen/tags
+
+```bash
+docker pull lizk75/synthdatagen:v1.0
+docker run -d --name synthdatagen-container -p 7860:7860 --env-file .env lizk75/synthdatagen:v1.0
+```
+
+
+## 🧑‍💻 Usage Guide
+- You can launch the app directly from:
+  - The **demo link** provided at the top of this README.
+  - Or by running it **locally** with `python app.py` from Visual Studio or any other IDE.
+- **Describe your dataset** by entering a clear business problem or topic.
+- Select the **dataset type** and **output format**.
+- Choose an **AI model** (GPT or Claude).
+- Set the desired **number of samples**.
+- Click **Create Dataset** and download the generated file.
+
+
+## 📓 Google Colab
+A **notebook version** is available for users who prefer running the app in a notebook environment. The notebook includes additional **open-source models** that require a **GPU**, which is why it's recommended to run it on Google Colab or a local machine with GPU support.
+
+https://github.com/lisek75/nlp_llms_notebook/blob/main/07_data_generator.ipynb
app.py
ADDED
@@ -0,0 +1,14 @@
+from src.ui import build_ui
+
+if __name__ == "__main__":
+    # Build the user interface
+    ui = build_ui()
+
+    # Launch the UI in the browser with access to the "output" folder
+    ui.launch(
+        inbrowser=True,
+        allowed_paths=["output"],
+        server_name="0.0.0.0",
+        server_port=7860
+    )
+
requirements.txt
ADDED
@@ -0,0 +1,7 @@
+python-dotenv==1.0.1
+openai==1.65.5
+anthropic==0.49.0
+gradio==5.21.0
+pandas
+numpy
+pyarrow
src/models.py
ADDED
@@ -0,0 +1,56 @@
+from openai import OpenAI
+import anthropic
+import os
+from dotenv import load_dotenv
+
+# Load environment variables from .env file
+load_dotenv(override=True)
+
+# Retrieve API keys from environment
+openai_api_key = os.getenv("OPENAI_API_KEY")
+anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")
+
+# Warn if any API key is missing
+if not openai_api_key:
+    print("❌ OpenAI API Key is missing!")
+
+if not anthropic_api_key:
+    print("❌ Anthropic API Key is missing!")
+
+# Initialize API clients
+openai = OpenAI(api_key=openai_api_key)
+claude = anthropic.Anthropic()
+
+# Model names
+OPENAI_MODEL = "gpt-4o-mini"
+CLAUDE_MODEL = "claude-3-5-sonnet-20240620"
+
+# Call OpenAI's GPT model with prompt and system message
+def get_gpt_completion(prompt, system_message):
+    try:
+        response = openai.chat.completions.create(
+            model=OPENAI_MODEL,
+            messages=[
+                {"role": "system", "content": system_message},
+                {"role": "user", "content": prompt}
+            ],
+            stream=False,
+        )
+        return response.choices[0].message.content
+    except Exception as e:
+        print(f"GPT error: {e}")
+        raise
+
+# Call Anthropic's Claude model with prompt and system message
+def get_claude_completion(prompt, system_message):
+    try:
+        result = claude.messages.create(
+            model=CLAUDE_MODEL,
+            max_tokens=2000,
+            system=system_message,
+            messages=[{"role": "user", "content": prompt}]
+        )
+        return result.content[0].text
+    except Exception as e:
+        print(f"Claude error: {e}")
+        raise
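For reference, this is a minimal sketch of the chat payload that `get_gpt_completion` above assembles before handing it to the OpenAI client. No API call is made here, and the `system_message` and `prompt` strings are illustrative placeholders, not values from the app:

```python
# Illustrative placeholders standing in for the app's real prompt strings.
system_message = "You generate synthetic datasets based on a business problem."
prompt = "Generate a synthetic text dataset in JSON format. Samples: 10"

# The same system/user message structure get_gpt_completion builds.
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": prompt},
]
print(messages[0]["role"], messages[1]["role"])
```

The Claude path carries the same information, but passes the system text via the `system` parameter instead of a message with role `system`.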
src/prompts.py
ADDED
@@ -0,0 +1,64 @@
+from datetime import datetime
+
+system_message = """
+You are a helpful assistant whose main purpose is to generate synthetic datasets based on a given business problem.
+
+🔹 General Guidelines:
+- Be accurate and concise.
+- Use only standard Python libraries (pandas, numpy, os, datetime, etc.)
+- The dataset must contain the requested number of samples.
+- Always respect the requested output format exactly.
+- If multiple entities exist, save each to a separate file.
+- Do not use f-strings anywhere in the code — not in file paths or in content. Use standard string concatenation instead.
+
+🔹 File Path Rules:
+- Define the full file path using os.path.join(...) — exactly as shown — no shortcuts or direct strings.
+- Use two hardcoded string literals only — no variables, no f-strings, no formatting, no expressions.
+- First argument: full directory path (use forward slashes).
+- Second argument: full filename with timestamp and correct extension.
+- Example: os.path.join("C:/Users/.../output", "sales_20250323_123456.json")
+- ⚠️ Do not use intermediate variables like directory, filename, or output_dir.
+- ⚠️ Do not skip or replace any of the above instructions. They are required for the code to work correctly.
+
+🔹 File Saving Instructions:
+
+- ✅ CSV:
+  df.to_csv(file_path, index=False, encoding="utf-8")
+
+- ✅ JSON:
+  with open(file_path, "w", encoding="utf-8") as f:
+      df.to_json(f, orient="records", lines=False, force_ascii=False)
+
+- ✅ Parquet:
+  df.to_parquet(file_path, engine="pyarrow", index=False)
+
+- ✅ Markdown (for Text):
+  - Generate properly formatted Markdown content.
+  - Save it as a `.md` file using UTF-8 encoding.
+"""
+
+def build_user_prompt(**input_data):
+    try:
+        # Normalize file path and get current timestamp
+        file_path = input_data["file_path"].replace("\\", "/")
+        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+
+        # Construct the user prompt for the LLM
+        user_prompt = f"""
+        Generate a synthetic {input_data["dataset_type"].lower()} dataset in {input_data["output_format"].upper()} format.
+        Business problem: {input_data["business_problem"]}
+        Samples: {input_data["num_samples"]}
+        Directory: {file_path}
+        Timestamp: {timestamp}
+        """
+        return user_prompt
+
+    except KeyError as e:
+        # Handle missing keys in input_data
+        print(f"Missing input key: {e}")
+        raise
+    except Exception as e:
+        # Log any other error during prompt building
+        print(f"Error in build_user_prompt: {e}")
+        raise
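To preview what the user prompt looks like, here is a standalone re-creation of `build_user_prompt`'s formatting logic, inlined so it runs without the `src` package. The sample inputs are made up for illustration:

```python
from datetime import datetime

# Inline re-creation of build_user_prompt's f-string logic (a sketch, not
# an import of the real module).
def build_user_prompt(**input_data):
    file_path = input_data["file_path"].replace("\\", "/")
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"""
    Generate a synthetic {input_data["dataset_type"].lower()} dataset in {input_data["output_format"].upper()} format.
    Business problem: {input_data["business_problem"]}
    Samples: {input_data["num_samples"]}
    Directory: {file_path}
    Timestamp: {timestamp}
    """

# Hypothetical inputs, mirroring what the UI passes to the generator.
prompt = build_user_prompt(
    business_problem="Customer reviews for sentiment analysis",
    dataset_type="Text",
    output_format="json",
    num_samples=50,
    file_path="output",
)
print(prompt)
```

Note how the dataset type is lowercased and the output format uppercased before being interpolated, which is why the dropdown values ("Text", "JSON", "csv", ...) can be passed through unchanged.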
src/synth_data_gen.py
ADDED
@@ -0,0 +1,42 @@
+import os
+from datetime import datetime
+from .prompts import build_user_prompt, system_message
+from .models import get_gpt_completion, get_claude_completion
+from .utils import execute_code_in_virtualenv
+
+class SynthDataGen:
+    def __init__(self, output_dir="output"):
+        # Set the default output directory and ensure it's created
+        self.output_dir = output_dir
+        os.makedirs(self.output_dir, exist_ok=True)
+
+    def get_timestamp(self):
+        # Return current timestamp for file naming
+        return datetime.now().strftime("%Y%m%d_%H%M%S")
+
+    def generate_dataset(self, **input_data):
+        try:
+            # Add output directory path to input data
+            input_data["file_path"] = self.output_dir
+
+            # Build the prompt to send to the LLM
+            prompt = build_user_prompt(**input_data)
+
+            # Call the selected LLM based on the model
+            if input_data["model"] == "GPT":
+                code = get_gpt_completion(prompt, system_message)
+            elif input_data["model"] == "Claude":
+                code = get_claude_completion(prompt, system_message)
+            else:
+                raise ValueError("Invalid model selected.")
+
+            # Execute the generated code and return the output file path
+            file_path = execute_code_in_virtualenv(code)
+            return file_path
+
+        except Exception as e:
+            # Log and re-raise any errors
+            print(f"Error in generate_dataset: {e}")
+            raise
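The model dispatch in `generate_dataset` can be sketched standalone, with the LLM calls stubbed out. The `fake_gpt`/`fake_claude`/`dispatch` names below are illustrative stand-ins, not part of the real module:

```python
# Stubs that return a fenced code block, like the real completion functions do.
def fake_gpt(prompt, system_message):
    return "```python\nprint('from gpt')\n```"

def fake_claude(prompt, system_message):
    return "```python\nprint('from claude')\n```"

# The same branching generate_dataset uses to pick a model.
def dispatch(model, prompt, system_message="..."):
    if model == "GPT":
        return fake_gpt(prompt, system_message)
    elif model == "Claude":
        return fake_claude(prompt, system_message)
    raise ValueError("Invalid model selected.")

code = dispatch("GPT", "generate something")
print(code)
```

Any value other than "GPT" or "Claude" raises `ValueError`, which the surrounding `try/except` logs and re-raises to the UI layer.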
src/ui.py
ADDED
@@ -0,0 +1,159 @@
+import os
+import gradio as gr
+import threading
+from src.synth_data_gen import SynthDataGen
+
+generator = SynthDataGen()
+
+# Update the output format choices based on the selected dataset type
+def update_output_format(dataset_type):
+    if dataset_type in ["Tabular", "Time-series"]:
+        return gr.update(choices=["JSON", "csv", "Parquet"], value="JSON")
+    elif dataset_type == "Text":
+        return gr.update(choices=["JSON", "Markdown"], value="JSON")
+
+def update_pipeline(business_problem, dataset_type, output_format, num_samples, model):
+    # Check if business problem is empty
+    if not business_problem.strip():
+        yield [gr.update(visible=False), gr.update(visible=True), "❌ Please enter a business problem before generating."]
+        return
+
+    # Initial feedback while generating
+    yield [gr.update(visible=False), gr.update(visible=False), "⏳ Generating dataset..."]
+
+    try:
+        # Pack inputs into a dictionary for the generator
+        input_data = {
+            "business_problem": business_problem,
+            "dataset_type": dataset_type,
+            "output_format": output_format,
+            "num_samples": num_samples,
+            "model": model
+        }
+
+        # Generate dataset file
+        file_path = generator.generate_dataset(**input_data)
+        print("🧪 File result returned:", file_path)
+
+        # Check if file exists and return success message + file path
+        if isinstance(file_path, str) and os.path.exists(file_path):
+            threading.Timer(60, os.remove, args=[file_path]).start()  # Auto-delete after 60s
+            yield [gr.update(value=file_path, visible=True), gr.update(visible=True), "✅ Dataset ready for download."]
+        else:
+            # Handle invalid or missing file
+            yield [gr.update(visible=False), gr.update(visible=True), "❌ Error: File not created or path invalid."]
+
+    except Exception as e:
+        # Catch and display any errors in the pipeline
+        yield [gr.update(visible=False), gr.update(visible=True), f"❌ Pipeline error: {e}"]
+
+def build_ui(css_path="assets/styles.css"):
+    with open(css_path, "r") as f:
+        css = f.read()
+
+    with gr.Blocks(css=css, title="🧬SynthDataGen") as ui:
+        with gr.Column(elem_id="app-container"):
+            gr.Markdown("<h1 id='app-title'>SynthDataGen 🧬 </h1>")
+            gr.Markdown("<h2 id='app-subtitle'>AI-Powered Synthetic Dataset Generator</h2>")
+
+            gr.HTML("""
+                <div id="intro-text">
+                    <p>With SynthDataGen, easily generate <strong>diverse datasets in different formats</strong> for testing, development, and AI training.</p>
+                    <h4>🎯 How It Works:</h4>
+                    <ol>
+                        <li>1️⃣ Define your business problem or dataset topic.</li>
+                        <li>2️⃣ Select the dataset type, output format, model, and number of samples.</li>
+                        <li>3️⃣ Receive your synthetic dataset — ready to download and use!</li>
+                    </ol>
+                </div>
+            """)
+
+            gr.HTML("""
+                <div id="learn-more-button">
+                    <a href="https://github.com/lisek75/synthdatagen_app/blob/main/README.md" class="button-link" target="_blank">Learn More</a>
+                </div>
+            """)
+
+            gr.Markdown("""
+                <p><strong>🧠 Need inspiration?</strong> Try one of these examples:</p>
+                <ul>
+                    <li>Movie summaries for genre classification.</li>
+                    <li>Generate customer chats with realistic dialogue, chat_id, timestamp, names, sentiment label, and aligned transcript.</li>
+                    <li>Create daily stock prices for 2 companies with typical fields like date, ticker, open, close, high, low, and volume.</li>
+                </ul>
+            """)
+
+            gr.Markdown("<p><strong>Start generating your synthetic datasets now!</strong> 🗂️✨</p>")
+
+            with gr.Group(elem_id="input-container"):
+
+                business_problem = gr.Textbox(
+                    placeholder="Describe the dataset you want (e.g., Job postings, Customer reviews, Sensor data, Movie titles)",
+                    lines=2,
+                    label="📌 Business Problem",
+                    elem_classes=["label-box"],
+                    elem_id="business-problem-box"
+                )
+
+                with gr.Row(elem_classes="column-gap"):
+                    with gr.Column(scale=1):
+                        dataset_type = gr.Dropdown(
+                            ["Tabular", "Time-series", "Text"],
+                            value="Tabular",
+                            label="📊 Dataset Type",
+                            elem_classes=["label-box"],
+                            elem_id="custom-dropdown"
+                        )
+
+                    with gr.Column(scale=1):
+                        output_format = gr.Dropdown(
+                            choices=["JSON", "csv", "Parquet"],
+                            value="JSON",
+                            label="📁 Output Format",
+                            elem_classes=["label-box"],
+                            elem_id="custom-dropdown"
+                        )
+
+                # Bind the update function to the dataset type dropdown
+                dataset_type.change(
+                    update_output_format,
+                    inputs=[dataset_type],
+                    outputs=[output_format]
+                )
+
+                with gr.Row(elem_classes="row-spacer column-gap"):
+                    with gr.Column(scale=1):
+                        model = gr.Dropdown(
+                            ["GPT", "Claude"],
+                            value="GPT",
+                            label="🤖 Model",
+                            elem_classes=["label-box"],
+                            elem_id="custom-dropdown"
+                        )
+
+                    with gr.Column(scale=1):
+                        num_samples = gr.Slider(
+                            minimum=10,
+                            maximum=1000,
+                            value=10,
+                            step=1,
+                            interactive=True,
+                            label="🔢 Number of Samples",
+                            elem_classes=["label-box"]
+                        )
+
+            # Hidden file component for dataset download
+            file_download = gr.File(label="Download Dataset", visible=False, elem_id="download-box")
+
+            # Component to display status messages
+            status_message = gr.Markdown("", label="Status")
+
+            # Button to trigger dataset generation
+            run_btn = gr.Button("Create a dataset", elem_id="run-btn")
+            run_btn.click(
+                update_pipeline,
+                inputs=[business_problem, dataset_type, output_format, num_samples, model],
+                outputs=[file_download, run_btn, status_message]
+            )
+
+    return ui  # Return the complete UI
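The dataset-type/format coupling in `update_output_format` can be checked without Gradio. This sketch returns plain dicts in place of `gr.update(...)` (that substitution is an assumption made only so the snippet runs standalone):

```python
# Standalone re-creation of update_output_format's branching; plain dicts
# stand in for gr.update(...) so no Gradio install is needed.
def update_output_format(dataset_type):
    if dataset_type in ["Tabular", "Time-series"]:
        return {"choices": ["JSON", "csv", "Parquet"], "value": "JSON"}
    elif dataset_type == "Text":
        return {"choices": ["JSON", "Markdown"], "value": "JSON"}

print(update_output_format("Text"))
```

Tabular and time-series datasets offer JSON/csv/Parquet, text datasets offer JSON/Markdown, and every switch resets the selection to JSON.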
src/utils.py
ADDED
@@ -0,0 +1,49 @@
+import re
+import os
+import subprocess
+import sys
+
+# Extract the Python code block from an LLM response
+def extract_code(text):
+    try:
+        match = re.search(r"```python(.*?)```", text, re.DOTALL)
+        if match:
+            code = match.group(0).strip()
+        else:
+            code = ""
+            print("No matching code block found.")
+        return code.replace("```python\n", "").replace("```", "")
+    except Exception as e:
+        print(f"Code extraction error: {e}")
+        raise
+
+# Extract the file path from a code string that uses os.path.join()
+def extract_file_path(code_str):
+    try:
+        match = re.search(r'os\.path\.join\(\s*["\'](.+?)["\']\s*,\s*["\'](.+?)["\']\s*\)', code_str)
+        if match:
+            folder = match.group(1)
+            filename = match.group(2)
+            return os.path.join(folder, filename)
+        print("No file path found.")
+        return None
+    except Exception as e:
+        print(f"File path extraction error: {e}")
+        raise
+
+# Execute extracted Python code in a subprocess using the given interpreter
+def execute_code_in_virtualenv(text, python_interpreter=sys.executable):
+    if not python_interpreter:
+        raise EnvironmentError("Python interpreter not found.")
+
+    code_str = extract_code(text)
+    command = [python_interpreter, "-c", code_str]
+
+    try:
+        print("✅ Running script:", command)
+        result = subprocess.run(command, check=True, capture_output=True, text=True)
+        file_path = extract_file_path(code_str)
+        print("✅ Extracted file path:", file_path)
+        return file_path
+    except subprocess.CalledProcessError as e:
+        return (f"Execution error:\n{e.stderr.strip()}", None)
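The two regexes above are the contract between the LLM's response and the executor: the code must sit in a ```python fence, and the file path must be a literal `os.path.join("dir", "file")` call. This sketch applies both patterns to a made-up response (the functions are copied inline so the snippet runs without the `src` package):

```python
import re
import os

# Inline copies of the extraction helpers from src/utils.py.
def extract_code(text):
    match = re.search(r"```python(.*?)```", text, re.DOTALL)
    code = match.group(0).strip() if match else ""
    return code.replace("```python\n", "").replace("```", "")

def extract_file_path(code_str):
    match = re.search(r'os\.path\.join\(\s*["\'](.+?)["\']\s*,\s*["\'](.+?)["\']\s*\)', code_str)
    if match:
        return os.path.join(match.group(1), match.group(2))
    return None

# A made-up LLM response containing a fenced code block.
response = (
    "Here is the script:\n"
    "```python\n"
    'path = os.path.join("output", "sales_20250323_123456.json")\n'
    "```\n"
)
code = extract_code(response)
path = extract_file_path(code)
print(path)
```

This is also why the system prompt in src/prompts.py insists on two hardcoded string literals and forbids f-strings and intermediate variables: anything else would defeat the `extract_file_path` regex.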