Lizk75 committed
Commit e365a68 · 1 Parent(s): 855c78c

[deploy] push to HF

Files changed (11)
  1. .dockerignore +3 -0
  2. .gitignore +15 -0
  3. Dockerfile +20 -0
  4. README.md +117 -11
  5. app.py +14 -0
  6. requirements.txt +7 -0
  7. src/models.py +56 -0
  8. src/prompts.py +64 -0
  9. src/synth_data_gen.py +42 -0
  10. src/ui.py +159 -0
  11. src/utils.py +49 -0
.dockerignore ADDED
@@ -0,0 +1,3 @@
+ .env
+ __pycache__/
+ *.py[cod]
.gitignore ADDED
@@ -0,0 +1,15 @@
+ # Ignore Jupyter Notebook checkpoints
+ .ipynb_checkpoints/
+
+ # Ignore output folder
+ output/
+
+ # Ignore environment files
+ .env
+
+ # Ignore Python cache files
+ **/__pycache__/
+ *.py[cod]
+
+ # Ignore .gitattributes file
+ .gitattributes
Dockerfile ADDED
@@ -0,0 +1,20 @@
+ # Use a Python 3.10 slim image as the base
+ FROM python:3.10-slim
+
+ # Set working directory inside the container
+ WORKDIR /app
+
+ # Copy the requirements.txt first (for better caching)
+ COPY requirements.txt .
+
+ # Install the dependencies from requirements.txt
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Now, copy the rest of the project files into the container
+ COPY . .
+
+ # Expose the port where Gradio will run (default: 7860)
+ EXPOSE 7860
+
+ # Set the default command to run the app
+ CMD ["python", "app.py"]
README.md CHANGED
@@ -1,11 +1,117 @@
- ---
- title: Synthdatagen
- emoji: 🔥
- colorFrom: yellow
- colorTo: red
- sdk: docker
- pinned: false
- short_description: AI-powered platform for generating structured synthetic data
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # 🧬 SynthDataGen: AI-Powered Synthetic Dataset Generator
+
+ <img src="assets/logo.jpg" alt="DataSynth_logo" width="200">
+
+ <a href="https://synthdatagen-app.onrender.com/">👀 <b>Live Demo</b></a>
+
+ 📷 <b>Screenshots</b>
+
+ <a href="screenshot_1.png"><img src="assets/screenshot_1.png" width="400"></a>
+ <a href="screenshot_2.png"><img src="assets/screenshot_2.png" width="335"></a>
+
+
+ ## 📖 Overview
+ **SynthDataGen** is an AI-powered tool that creates **realistic synthetic data** for any project. You don't need to collect real information: just tell SynthDataGen what kind of data you want, and it will **quickly generate** it. Thanks to its **easy-to-use web interface** built with Gradio, **anyone** can start making custom datasets right away.
+
+ ### 🔑 **Key Features**
+ - Generates **various types of datasets**, such as **tables**, **time-series data**, or **text content**.
+ - Saves the output in multiple **formats**: **JSON**, **CSV**, **Parquet**, or **Markdown**.
+ - Uses **AI models** like **GPT** and **Claude** to create the dataset automatically.
+ - Needs only a short **description of the desired dataset** to trigger the generation process.
+ - Provides a **download link** once the dataset is ready, making it easy to save and use.
+ - The **interface updates options automatically** and includes helpful **examples for inspiration**.
+
+ ### 🎯 **How It Works**
+ 1️⃣ Describe the dataset to generate by entering a short business problem or topic.
+
+ 2️⃣ Select the dataset type, output format, AI model, and number of samples.
+
+ 3️⃣ Download the generated dataset once it's ready: clean, structured, and ready to use.
+
+ ### 🤔 **Why Choose SynthDataGen?**
+ - ⏰ **Time Saver**: Automatically creates tables, time-series, or text data, so there is no need to gather real data yourself.
+ - ⚙️ **Flexible and Accessible**: Supports multiple formats (JSON, CSV, Parquet, Markdown) with a beginner-friendly interface.
+ - 🤖 **Powered by GPT & Claude**: Uses two top AI models to produce realistic synthetic data for prototyping or research.
+
+ ### 🔧 **SynthDataGen Customization**
+ SynthDataGen is fully customizable through Python code. You can easily modify:
+ - ✏️ **System prompt**, to control how the AI models generate code
+ - 🤖 **Models**: add new frontier or open-source models (e.g., LLaMA, DeepSeek, Qwen), or integrate any model from Hugging Face libraries and inference endpoints
+ - 📊 **Dataset types**, by adding new categories like image metadata or dialogue transcripts
+ - 📁 **Output formats**, such as YAML or XML
+ - 🎨 **Interface styling**, including layout, colors, and themes
+
+ ### 🏗️ **Architecture**
+
+ <a href="func_architecture.png"><img src="assets/func_architecture.png"></a>
+ <a href="tech_architecture.png"><img src="assets/tech_architecture.png"></a>
+
+ ## ⚙️ Setup & Installation
+
+ **1. Clone the Repository**
+ ```bash
+ git clone https://github.com/lisek75/synthdatagen_app.git
+ cd synthdatagen_app
+ ```
+
+ **2. Install Dependencies**
+
+ ```bash
+ conda env create -f synthdatagen_env.yml
+ conda activate synthdatagen
+ ```
+
+ **3. Configure API Keys & Endpoints**
+
+ Create a `.env` file with the following variables:
+ ```bash
+ OPENAI_API_KEY=your_openai_api_key
+ ANTHROPIC_API_KEY=your_anthropic_api_key
+ ```
+ Ensure that the `.env` file remains **secure** and is not shared publicly.
+
+
+ ## 🚀 Running the Gradio App
+
+ **Run the Application Locally**
+ ```bash
+ python app.py
+ ```
+
+ **Run the Application with Docker**
+
+ To run the app using Docker, you can either build the image yourself or use the pre-built image from Docker Hub.
+
+ - Build and run the app locally, using your own Docker Hub username:
+ ```bash
+ docker build -t <your-dockerhub-username>/synthdatagen:v1.0 .
+ docker run -d --name synthdatagen-container -p 7860:7860 --env-file .env <your-dockerhub-username>/synthdatagen:v1.0
+ ```
+ This builds the Docker image and runs the app in a container.
+
+ - Run the app directly from Docker Hub by pulling the pre-built image (⚠️ make sure to use the latest version tag listed at https://hub.docker.com/r/lizk75/synthdatagen/tags):
+ ```bash
+ docker pull lizk75/synthdatagen:v1.0
+ docker run -d --name synthdatagen-container -p 7860:7860 --env-file .env lizk75/synthdatagen:v1.0
+ ```
+
+
+ ## 🧑‍💻 Usage Guide
+ - Launch the app from:
+   - The **demo link** provided at the top of this README.
+   - Or **locally**, by running `python app.py` from Visual Studio or any other IDE.
+ - **Describe your dataset** by entering a clear business problem or topic.
+ - Select the **dataset type** and **output format**.
+ - Choose an **AI model** (GPT or Claude).
+ - Set the desired **number of samples**.
+ - Click **Create Dataset** and download the generated file.
+
+
+ ## 📓 Google Colab
+ A **notebook version** is available for users who prefer running the app in a notebook environment. The notebook includes additional **open-source models** that require a **GPU**, which is why it's recommended to run it on Google Colab or a local machine with GPU support.
+
+ https://github.com/lisek75/nlp_llms_notebook/blob/main/07_data_generator.ipynb
app.py ADDED
@@ -0,0 +1,14 @@
+ from src.ui import build_ui
+
+ if __name__ == "__main__":
+     # Build the user interface
+     ui = build_ui()
+
+     # Launch the UI in the browser with access to the "output" folder
+     ui.launch(
+         inbrowser=True,
+         allowed_paths=["output"],
+         server_name="0.0.0.0",
+         server_port=7860
+     )
+
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ python-dotenv==1.0.1
+ openai==1.65.5
+ anthropic==0.49.0
+ gradio==5.21.0
+ pandas
+ numpy
+ pyarrow
src/models.py ADDED
@@ -0,0 +1,56 @@
+ from openai import OpenAI
+ import anthropic
+ import os
+ from dotenv import load_dotenv
+
+ # Load environment variables from .env file
+ load_dotenv(override=True)
+
+ # Retrieve API keys from environment
+ openai_api_key = os.getenv("OPENAI_API_KEY")
+ anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")
+
+ # Warn if any API key is missing
+ if not openai_api_key:
+     print("❌ OpenAI API Key is missing!")
+
+ if not anthropic_api_key:
+     print("❌ Anthropic API Key is missing!")
+
+ # Initialize API clients
+ openai = OpenAI(api_key=openai_api_key)
+ claude = anthropic.Anthropic(api_key=anthropic_api_key)
+
+ # Model names
+ OPENAI_MODEL = "gpt-4o-mini"
+ CLAUDE_MODEL = "claude-3-5-sonnet-20240620"
+
+ # Call OpenAI's GPT model with prompt and system message
+ def get_gpt_completion(prompt, system_message):
+     try:
+         response = openai.chat.completions.create(
+             model=OPENAI_MODEL,
+             messages=[
+                 {"role": "system", "content": system_message},
+                 {"role": "user", "content": prompt}
+             ],
+             stream=False,
+         )
+         return response.choices[0].message.content
+     except Exception as e:
+         print(f"GPT error: {e}")
+         raise
+
+ # Call Anthropic's Claude model with prompt and system message
+ def get_claude_completion(prompt, system_message):
+     try:
+         result = claude.messages.create(
+             model=CLAUDE_MODEL,
+             max_tokens=2000,
+             system=system_message,
+             messages=[{"role": "user", "content": prompt}]
+         )
+         return result.content[0].text
+     except Exception as e:
+         print(f"Claude error: {e}")
+         raise
src/prompts.py ADDED
@@ -0,0 +1,64 @@
+ from datetime import datetime
+
+ system_message = """
+ You are a helpful assistant whose main purpose is to generate synthetic datasets based on a given business problem.
+
+ 🔹 General Guidelines:
+ - Be accurate and concise.
+ - Use only standard Python libraries (pandas, numpy, os, datetime, etc.)
+ - The dataset must contain the requested number of samples.
+ - Always respect the requested output format exactly.
+ - If multiple entities exist, save each to a separate file.
+ - Do not use f-strings anywhere in the code — not in file paths or in content. Use standard string concatenation instead.
+
+ 🔹 File Path Rules:
+ - Define the full file path using os.path.join(...) — exactly as shown — no shortcuts or direct strings.
+ - Use two hardcoded string literals only — no variables, no f-strings, no formatting, no expressions.
+ - First argument: full directory path (use forward slashes).
+ - Second argument: full filename with timestamp and correct extension.
+ - Example: os.path.join("C:/Users/.../output", "sales_20250323_123456.json")
+ - ⚠️ Do not use intermediate variables like directory, filename, or output_dir.
+ - ⚠️ Do not skip or replace any of the above instructions. They are required for the code to work correctly.
+
+ 🔹 File Saving Instructions:
+
+ - ✅ CSV:
+   df.to_csv(file_path, index=False, encoding="utf-8")
+
+ - ✅ JSON:
+   with open(file_path, "w", encoding="utf-8") as f:
+       df.to_json(f, orient="records", lines=False, force_ascii=False)
+
+ - ✅ Parquet:
+   df.to_parquet(file_path, engine="pyarrow", index=False)
+
+ - ✅ Markdown (for Text):
+   - Generate properly formatted Markdown content.
+   - Save it as a `.md` file using UTF-8 encoding.
+ """
+
+ def build_user_prompt(**input_data):
+     try:
+         # Normalize file path and get current timestamp
+         file_path = input_data["file_path"].replace("\\", "/")
+         timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+
+         # Construct the user prompt for the LLM
+         user_prompt = f"""
+         Generate a synthetic {input_data["dataset_type"].lower()} dataset in {input_data["output_format"].upper()} format.
+         Business problem: {input_data["business_problem"]}
+         Samples: {input_data["num_samples"]}
+         Directory: {file_path}
+         Timestamp: {timestamp}
+         """
+         return user_prompt
+
+     except KeyError as e:
+         # Handle missing keys in input_data
+         print(f"Missing input key: {e}")
+         raise
+     except Exception as e:
+         # Log any other error during prompt building
+         print(f"Error in build_user_prompt: {e}")
+         raise
+
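Apart from the timestamp, `build_user_prompt` is a pure string builder, so it is easy to exercise standalone. Below is a condensed, self-contained sketch of the same logic (the wording is slightly simplified from the source template, and the sample inputs are invented for the demo):

```python
from datetime import datetime

# Condensed standalone version of build_user_prompt() from src/prompts.py
# (same input_data keys; template wording slightly simplified).
def build_user_prompt(**input_data):
    # Normalize file path and get current timestamp, as in the source
    file_path = input_data["file_path"].replace("\\", "/")
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return (
        f"Generate a synthetic {input_data['dataset_type'].lower()} dataset "
        f"in {input_data['output_format'].upper()} format.\n"
        f"Business problem: {input_data['business_problem']}\n"
        f"Samples: {input_data['num_samples']}\n"
        f"Directory: {file_path}\n"
        f"Timestamp: {timestamp}\n"
    )

# Hypothetical inputs, mirroring what the Gradio UI would pass along
prompt = build_user_prompt(
    business_problem="Customer reviews for sentiment analysis",
    dataset_type="Tabular",
    output_format="csv",
    num_samples=50,
    file_path="output",
)
print(prompt)
```

Note how the dataset type is lower-cased and the format upper-cased before being interpolated, which is why the UI's format labels can be spelled either way.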
src/synth_data_gen.py ADDED
@@ -0,0 +1,42 @@
+ import os
+ from datetime import datetime
+ from .prompts import build_user_prompt, system_message
+ from .models import get_gpt_completion, get_claude_completion
+ from .utils import execute_code_in_virtualenv
+
+ class SynthDataGen:
+     def __init__(self, output_dir="output"):
+         # Set the default output directory and ensure it exists
+         self.output_dir = output_dir
+         os.makedirs(self.output_dir, exist_ok=True)
+
+     def get_timestamp(self):
+         # Return current timestamp for file naming
+         return datetime.now().strftime("%Y%m%d_%H%M%S")
+
+     def generate_dataset(self, **input_data):
+         try:
+             # Add output directory path to input data
+             input_data["file_path"] = self.output_dir
+
+             # Build the prompt to send to the LLM
+             prompt = build_user_prompt(**input_data)
+
+             # Call the selected LLM based on the model
+             if input_data["model"] == "GPT":
+                 code = get_gpt_completion(prompt, system_message)
+             elif input_data["model"] == "Claude":
+                 code = get_claude_completion(prompt, system_message)
+             else:
+                 raise ValueError("Invalid model selected.")
+
+             # Execute the generated code and return the output file path
+             file_path = execute_code_in_virtualenv(code)
+             return file_path
+
+         except Exception as e:
+             # Log and re-raise any errors
+             print(f"Error in generate_dataset: {e}")
+             raise
+
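The model-selection branch inside `generate_dataset` can be sketched in isolation. In the sketch below the two completion functions are stubs standing in for the real OpenAI/Anthropic calls (stub names and return values are invented for the demo), so the dispatch logic runs without API keys:

```python
# Stubs standing in for get_gpt_completion / get_claude_completion
# so the dispatch runs offline (return values are invented).
def fake_gpt_completion(prompt, system_message):
    return "print('from gpt')"

def fake_claude_completion(prompt, system_message):
    return "print('from claude')"

# Mirrors the if/elif dispatch inside SynthDataGen.generate_dataset
def get_code(model, prompt, system_message=""):
    if model == "GPT":
        return fake_gpt_completion(prompt, system_message)
    elif model == "Claude":
        return fake_claude_completion(prompt, system_message)
    # Any other value is rejected, exactly as in the source
    raise ValueError("Invalid model selected.")

print(get_code("GPT", "generate 10 rows"))
```

This is also the seam where the README's customization section suggests plugging in additional models: add another branch (or a dict of callables) here.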
src/ui.py ADDED
@@ -0,0 +1,159 @@
+ import os
+ import gradio as gr
+ import threading
+ from src.synth_data_gen import SynthDataGen
+
+ generator = SynthDataGen()
+
+ # Update the output format choices based on the selected dataset type
+ def update_output_format(dataset_type):
+     if dataset_type in ["Tabular", "Time-series"]:
+         return gr.update(choices=["JSON", "CSV", "Parquet"], value="JSON")
+     elif dataset_type == "Text":
+         return gr.update(choices=["JSON", "Markdown"], value="JSON")
+
+ def update_pipeline(business_problem, dataset_type, output_format, num_samples, model):
+     # Check if business problem is empty
+     if not business_problem.strip():
+         yield [gr.update(visible=False), gr.update(visible=True), "❌ Please enter a business problem before generating."]
+         return
+
+     # Initial feedback while generating
+     yield [gr.update(visible=False), gr.update(visible=False), "⏳ Generating dataset..."]
+
+     try:
+         # Pack inputs into a dictionary for the generator
+         input_data = {
+             "business_problem": business_problem,
+             "dataset_type": dataset_type,
+             "output_format": output_format,
+             "num_samples": num_samples,
+             "model": model
+         }
+
+         # Generate dataset file
+         file_path = generator.generate_dataset(**input_data)
+         print("🧪 File result returned:", file_path)
+
+         # Check if file exists and return success message + file path
+         if isinstance(file_path, str) and os.path.exists(file_path):
+             threading.Timer(60, os.remove, args=[file_path]).start()  # Auto-delete after 60s
+             yield [gr.update(value=file_path, visible=True), gr.update(visible=True), "✅ Dataset ready for download."]
+         else:
+             # Handle invalid or missing file
+             yield [gr.update(visible=False), gr.update(visible=True), "❌ Error: File not created or path invalid."]
+
+     except Exception as e:
+         # Catch and display any errors in the pipeline
+         yield [gr.update(visible=False), gr.update(visible=True), f"❌ Pipeline error: {e}"]
+
+ def build_ui(css_path="assets/styles.css"):
+     with open(css_path, "r") as f:
+         css = f.read()
+
+     with gr.Blocks(css=css, title="🧬SynthDataGen") as ui:
+         with gr.Column(elem_id="app-container"):
+             gr.Markdown("<h1 id='app-title'>SynthDataGen 🧬 </h1>")
+             gr.Markdown("<h2 id='app-subtitle'>AI-Powered Synthetic Dataset Generator</h2>")
+
+             gr.HTML("""
+                 <div id="intro-text">
+                     <p>With SynthDataGen, easily generate <strong>diverse datasets in different formats</strong> for testing, development, and AI training.</p>
+                     <h4>🎯 How It Works:</h4>
+                     <ol>
+                         <li>1️⃣ Define your business problem or dataset topic.</li>
+                         <li>2️⃣ Select the dataset type, output format, model, and number of samples.</li>
+                         <li>3️⃣ Receive your synthetic dataset — ready to download and use!</li>
+                     </ol>
+                 </div>
+             """)
+
+             gr.HTML("""
+                 <div id="learn-more-button">
+                     <a href="https://github.com/lisek75/synthdatagen_app/blob/main/README.md" class="button-link" target="_blank">Learn More</a>
+                 </div>
+             """)
+
+             gr.Markdown("""
+                 <p><strong>🧠 Need inspiration?</strong> Try one of these examples:</p>
+                 <ul>
+                     <li>Movie summaries for genre classification.</li>
+                     <li>Generate customer chats with realistic dialogue, chat_id, timestamp, names, sentiment label, and aligned transcript.</li>
+                     <li>Create daily stock prices for 2 companies with typical fields like date, ticker, open, close, high, low, and volume.</li>
+                 </ul>
+             """)
+
+             gr.Markdown("<p><strong>Start generating your synthetic datasets now!</strong> 🗂️✨</p>")
+
+             with gr.Group(elem_id="input-container"):
+
+                 business_problem = gr.Textbox(
+                     placeholder="Describe the dataset you want (e.g., Job postings, Customer reviews, Sensor data, Movie titles)",
+                     lines=2,
+                     label="📌 Business Problem",
+                     elem_classes=["label-box"],
+                     elem_id="business-problem-box"
+                 )
+
+                 with gr.Row(elem_classes="column-gap"):
+                     with gr.Column(scale=1):
+                         dataset_type = gr.Dropdown(
+                             ["Tabular", "Time-series", "Text"],
+                             value="Tabular",
+                             label="📊 Dataset Type",
+                             elem_classes=["label-box"],
+                             elem_id="custom-dropdown"
+                         )
+
+                     with gr.Column(scale=1):
+                         output_format = gr.Dropdown(
+                             choices=["JSON", "CSV", "Parquet"],
+                             value="JSON",
+                             label="📁 Output Format",
+                             elem_classes=["label-box"],
+                             elem_id="custom-dropdown"
+                         )
+
+                 # Bind the update function to the dataset type dropdown
+                 dataset_type.change(
+                     update_output_format,
+                     inputs=[dataset_type],
+                     outputs=[output_format]
+                 )
+
+                 with gr.Row(elem_classes="row-spacer column-gap"):
+                     with gr.Column(scale=1):
+                         model = gr.Dropdown(
+                             ["GPT", "Claude"],
+                             value="GPT",
+                             label="🤖 Model",
+                             elem_classes=["label-box"],
+                             elem_id="custom-dropdown"
+                         )
+
+                     with gr.Column(scale=1):
+                         num_samples = gr.Slider(
+                             minimum=10,
+                             maximum=1000,
+                             value=10,
+                             step=1,
+                             interactive=True,
+                             label="🔢 Number of Samples",
+                             elem_classes=["label-box"]
+                         )
+
+             # Hidden file component for dataset download
+             file_download = gr.File(label="Download Dataset", visible=False, elem_id="download-box")
+
+             # Component to display status messages
+             status_message = gr.Markdown("", label="Status")
+
+             # Button to trigger dataset generation
+             run_btn = gr.Button("Create a dataset", elem_id="run-btn")
+             run_btn.click(
+                 update_pipeline,
+                 inputs=[business_problem, dataset_type, output_format, num_samples, model],
+                 outputs=[file_download, run_btn, status_message]
+             )
+
+     return ui  # Return the complete UI
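The `update_output_format` handler reduces to a small lookup table. Here is a gradio-free sketch of the same rule for reference (`FORMATS_BY_TYPE` and `allowed_formats` are illustrative names, not part of the source, and the format labels are normalized to upper case here):

```python
# Dataset type -> allowed output formats, as enforced by update_output_format()
FORMATS_BY_TYPE = {
    "Tabular": ["JSON", "CSV", "Parquet"],
    "Time-series": ["JSON", "CSV", "Parquet"],
    "Text": ["JSON", "Markdown"],
}

def allowed_formats(dataset_type):
    # Fall back to JSON, the one format every dataset type supports
    return FORMATS_BY_TYPE.get(dataset_type, ["JSON"])

print(allowed_formats("Text"))
```

In the real UI the same mapping is applied via `gr.update(choices=..., value="JSON")` whenever the dataset-type dropdown changes.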
src/utils.py ADDED
@@ -0,0 +1,49 @@
+ import re
+ import os
+ import subprocess
+ import sys
+
+ # Extract the Python code block from an LLM response
+ def extract_code(text):
+     try:
+         match = re.search(r"```python(.*?)```", text, re.DOTALL)
+         if match:
+             return match.group(1).strip()
+         print("No matching code block found.")
+         return ""
+     except Exception as e:
+         print(f"Code extraction error: {e}")
+         raise
+
+ # Extract file path from a code string that uses os.path.join()
+ def extract_file_path(code_str):
+     try:
+         match = re.search(r'os\.path\.join\(\s*["\'](.+?)["\']\s*,\s*["\'](.+?)["\']\s*\)', code_str)
+         if match:
+             folder = match.group(1)
+             filename = match.group(2)
+             return os.path.join(folder, filename)
+         print("No file path found.")
+         return None
+     except Exception as e:
+         print(f"File path extraction error: {e}")
+         raise
+
+ # Execute extracted Python code in a subprocess using the given interpreter
+ def execute_code_in_virtualenv(text, python_interpreter=sys.executable):
+     if not python_interpreter:
+         raise EnvironmentError("Python interpreter not found.")
+
+     code_str = extract_code(text)
+     command = [python_interpreter, "-c", code_str]
+
+     try:
+         print("✅ Running script:", command)
+         subprocess.run(command, check=True, capture_output=True, text=True)
+         file_path = extract_file_path(code_str)
+         print("✅ Extracted file path:", file_path)
+         return file_path
+     except subprocess.CalledProcessError as e:
+         # Surface the script's stderr; return None so callers get one type
+         print(f"Execution error:\n{e.stderr.strip()}")
+         return None
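The two regexes in `src/utils.py` can be exercised end to end on a made-up model reply. In this sketch the reply text and filename are invented for the demo, and the code fence is assembled from `"` + "`" + `" * 3` only so the snippet nests cleanly in Markdown:

```python
import re
import os

# Build a fake LLM reply containing a fenced python block
# (fence assembled at runtime to avoid nesting literal backticks).
fence = "`" * 3
reply = (
    "Here is the script:\n"
    + fence + "python\n"
    + 'file_path = os.path.join("output", "sales_20250323_123456.json")\n'
    + fence + "\n"
)

# Same pattern as extract_code(): grab the body of the ```python ... ``` block
code = re.search(fence + "python(.*?)" + fence, reply, re.DOTALL).group(1).strip()

# Same pattern as extract_file_path(): recover the two os.path.join() literals
m = re.search(r'os\.path\.join\(\s*["\'](.+?)["\']\s*,\s*["\'](.+?)["\']\s*\)', code)
path = os.path.join(m.group(1), m.group(2))
print(path)
```

This is why the system prompt insists on two hardcoded string literals inside `os.path.join(...)`: the second regex can only recover the output path when both arguments are plain literals, with no variables or f-strings in between.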