---
title: Moroccan Darija TTS Demo
emoji: 🎙️
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 5.27.0
app_file: app.py
pinned: false
---
# Moroccan Darija Text-to-Speech Model

This project implements a Text-to-Speech (TTS) system for Moroccan Darija using the SpeechT5 architecture. It is fine-tuned on the DODa-audio-dataset to generate natural-sounding Darija speech from text input.
## Table of Contents

- [Dataset Overview](#dataset-overview)
- [Installation](#installation)
- [Model Training](#model-training)
- [Inference](#inference)
- [Gradio Demo](#gradio-demo)
- [Project Features](#project-features)
- [Potential Applications](#potential-applications)
- [Limitations and Future Work](#limitations-and-future-work)
- [License](#license)
- [Acknowledgments](#acknowledgments)
## Dataset Overview

The **DODa audio dataset** contains 12,743 sentences recorded by 7 contributors (4 females, 3 males). Key characteristics:

- Audio recordings standardized at a 16 kHz sample rate
- Multiple text representations (Latin script, Arabic script, and English translations)
- High-quality recordings with manual corrections
### Dataset Structure

| Column Name | Description |
|-------------|-------------|
| **audio** | Speech recordings for Darija sentences |
| **darija_Ltn** | Darija sentences using Latin letters |
| **darija_Arab_new** | Corrected Darija sentences using Arabic script |
| **english** | English translation of Darija sentences |
| **darija_Arab_old** | Original (uncorrected) Darija sentences in Arabic script |
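
For orientation, the snippet below shows one way to load and inspect the dataset with the Hugging Face `datasets` library. This is a minimal sketch; the `train` split name and 16 kHz casting are assumptions based on the description above, not a guarantee of the dataset's published layout.

```python
from datasets import load_dataset, Audio

# Load the DODa audio dataset from the Hugging Face Hub
dataset = load_dataset("atlasia/DODa-audio-dataset", split="train")

# Decode audio at the 16 kHz rate expected by SpeechT5
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

# Inspect one example: audio array plus the parallel text columns
example = dataset[0]
print(example["darija_Ltn"])        # Darija in Latin script
print(example["darija_Arab_new"])   # corrected Arabic-script version
print(example["english"])           # English translation
print(example["audio"]["sampling_rate"], len(example["audio"]["array"]))
```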
### Speaker Distribution

The dataset includes recordings from 7 speakers (4 females, 3 males) with the following distribution:

```
Samples 0-999       -> Female 1
Samples 1000-1999   -> Male 3
Samples 2000-2730   -> Female 2
Samples 2731-2800   -> Male 1
Samples 2801-3999   -> Male 2
Samples 4000-4999   -> Male 1
Samples 5000-5999   -> Female 3
Samples 6000-6999   -> Male 1
Samples 7000-7999   -> Female 4
Samples 8000-8999   -> Female 1
Samples 9000-9999   -> Male 2
Samples 10000-11999 -> Male 1
Samples 12000-12350 -> Male 2
Samples 12351-12742 -> Male 1
```
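
Because speaker identity is implied only by a sample's position in the dataset, a small helper can map a row index to a speaker label using the ranges above. This is an illustrative sketch; the labels follow the table and are not an official dataset field.

```python
# Index ranges taken from the distribution above (inclusive bounds)
SPEAKER_RANGES = [
    (0, 999, "female_1"), (1000, 1999, "male_3"), (2000, 2730, "female_2"),
    (2731, 2800, "male_1"), (2801, 3999, "male_2"), (4000, 4999, "male_1"),
    (5000, 5999, "female_3"), (6000, 6999, "male_1"), (7000, 7999, "female_4"),
    (8000, 8999, "female_1"), (9000, 9999, "male_2"), (10000, 11999, "male_1"),
    (12000, 12350, "male_2"), (12351, 12742, "male_1"),
]

def speaker_for_index(i: int) -> str:
    """Return the speaker label for a dataset row index."""
    for start, end, speaker in SPEAKER_RANGES:
        if start <= i <= end:
            return speaker
    raise ValueError(f"Index {i} is outside the 0-12742 range")
```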
## Installation

To set up the project environment:

```bash
# Clone the repository
git clone https://github.com/yourusername/darija-tts.git
cd darija-tts

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
## Model Training

The model training process involves:

1. **Data Loading**: Loading the DODa audio dataset from Hugging Face
2. **Data Preprocessing**: Normalizing text and extracting speaker embeddings (see the sketch after this list)
3. **Model Setup**: Configuring the SpeechT5 model for Darija TTS
4. **Training**: Fine-tuning the model using the prepared dataset
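
For the speaker-embedding step, a common approach (used in the public SpeechT5 fine-tuning recipe) is to extract x-vectors with SpeechBrain. The snippet below sketches that idea; the encoder model and function name are assumptions, not necessarily what the notebook does.

```python
import torch
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference.speaker in SpeechBrain >= 1.0

# x-vector speaker encoder commonly paired with SpeechT5
speaker_model = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    run_opts={"device": "cuda" if torch.cuda.is_available() else "cpu"},
)

def create_speaker_embedding(waveform):
    """Turn a 16 kHz mono waveform (1-D array) into a normalized 512-dim x-vector."""
    with torch.no_grad():
        emb = speaker_model.encode_batch(torch.tensor(waveform))
        emb = torch.nn.functional.normalize(emb, dim=2)
    return emb.squeeze().cpu()  # shape: (512,)
```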
To run the training:

```bash
# Open the Jupyter notebook
jupyter notebook notebooks/train_darija_tts.ipynb
```

Key training parameters:

- Learning rate: 1e-4
- Batch size: 4 (with gradient accumulation: 8)
- Training steps: 1000
- Evaluation frequency: every 100 steps
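
These hyperparameters roughly correspond to a `Seq2SeqTrainingArguments` configuration like the one below. This is a sketch only: the output path mirrors the one used at inference time, and the warmup and fp16 settings are assumptions not stated in this README.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./models/speecht5_finetuned_Darija",
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size of 32
    max_steps=1000,
    eval_strategy="steps",
    eval_steps=100,                  # evaluate every 100 steps
    save_steps=100,
    warmup_steps=100,                # assumption; tune as needed
    fp16=True,                       # assumption; enable only on GPU
    label_names=["labels"],
    report_to=[],
)
# The notebook then passes these arguments, the model, the processed dataset,
# and a padding data collator to a Seq2SeqTrainer and calls trainer.train().
```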
## Inference

To generate speech from text after training:

```python
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import torch
import soundfile as sf

# Load the fine-tuned model, processor, and HiFi-GAN vocoder
model_path = "./models/speecht5_finetuned_Darija"
processor = SpeechT5Processor.from_pretrained(model_path)
model = SpeechT5ForTextToSpeech.from_pretrained(model_path)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Load a precomputed speaker embedding
speaker_embedding = torch.load("./data/speaker_embeddings/female_embedding.pt")

# Tokenize the input text
text = "Salam, kifach nta lyoum?"
inputs = processor(text=text, return_tensors="pt")

# Generate speech
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)

# Save the audio at 16 kHz
sf.write("output.wav", speech.numpy(), 16000)
```
## Gradio Demo

The project includes a Gradio demo that provides a user-friendly interface for text-to-speech conversion:

```bash
# Run the demo locally
cd demo
python app.py
```
The demo features:

- Text input field for Darija text (Latin script)
- Voice selection (male/female)
- Speech speed adjustment
- Audio playback of generated speech
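
A minimal sketch of such an interface is shown below. The embedding file paths and the naive speed handling are illustrative assumptions; the actual `app.py` may be organized differently.

```python
import gradio as gr
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

MODEL_PATH = "./models/speecht5_finetuned_Darija"  # assumed local path
processor = SpeechT5Processor.from_pretrained(MODEL_PATH)
model = SpeechT5ForTextToSpeech.from_pretrained(MODEL_PATH)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Hypothetical per-voice embedding files
EMBEDDINGS = {
    "Female": torch.load("./data/speaker_embeddings/female_embedding.pt"),
    "Male": torch.load("./data/speaker_embeddings/male_embedding.pt"),
}

def synthesize(text: str, voice: str, speed: float):
    inputs = processor(text=text, return_tensors="pt")
    speech = model.generate_speech(inputs["input_ids"], EMBEDDINGS[voice], vocoder=vocoder)
    # Naive speed control by changing the playback rate (also shifts pitch);
    # a production demo would use proper time-stretching instead.
    return (int(16000 * speed), speech.numpy())

demo = gr.Interface(
    fn=synthesize,
    inputs=[
        gr.Textbox(label="Darija text (Latin script)", value="Salam, kifach nta lyoum?"),
        gr.Dropdown(["Female", "Male"], value="Female", label="Voice"),
        gr.Slider(0.5, 1.5, value=1.0, step=0.1, label="Speed"),
    ],
    outputs=gr.Audio(label="Generated speech"),
    title="Moroccan Darija TTS Demo",
)

if __name__ == "__main__":
    demo.launch()
```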
### Deploying to Hugging Face Spaces

To deploy the demo to Hugging Face Spaces:

1. Push your model to the Hugging Face Hub
2. Create a new Space with the Gradio SDK
3. Upload the demo files to the Space

See the notebook for detailed deployment instructions.
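
For reference, these steps can also be scripted with the `huggingface_hub` library. This is a sketch; the repository and Space names are placeholders, and it assumes you are already logged in (for example via `huggingface-cli login`).

```python
from huggingface_hub import HfApi

api = HfApi()

# 1. Push the fine-tuned model (placeholder repo name)
api.create_repo("your-username/speecht5-darija-tts", exist_ok=True)
api.upload_folder(
    folder_path="./models/speecht5_finetuned_Darija",
    repo_id="your-username/speecht5-darija-tts",
)

# 2. Create a Gradio Space and 3. upload the demo files (placeholder Space name)
api.create_repo(
    "your-username/darija-tts-demo",
    repo_type="space",
    space_sdk="gradio",
    exist_ok=True,
)
api.upload_folder(
    folder_path="./demo",
    repo_id="your-username/darija-tts-demo",
    repo_type="space",
)
```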
## Project Features

- **Multi-Speaker TTS**: Generate speech in both male and female voices
- **Voice Cloning**: Uses speaker embeddings to preserve voice characteristics
- **Speed Control**: Adjust the speech rate as needed
- **Text Normalization**: Handles varied text input through normalization (see the illustrative sketch after this list)
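
As an illustration of the kind of normalization involved (not the project's actual implementation), a simple normalizer for Latin-script input might look like this:

```python
import re

def normalize_darija_text(text: str) -> str:
    """Very simple normalization for Latin-script Darija input (illustrative only)."""
    text = text.strip().lower()
    # Collapse repeated whitespace
    text = re.sub(r"\s+", " ", text)
    # Drop characters the TTS tokenizer is unlikely to cover
    text = re.sub(r"[^a-z0-9'\s.,!?]", "", text)
    return text

print(normalize_darija_text("  Salam,   KIFACH  nta lyoum? "))
# -> "salam, kifach nta lyoum?"
```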
## Potential Applications

- **Voice Assistants**: Build voice assistants that speak Moroccan Darija
- **Accessibility Tools**: Create tools for people with visual impairments
- **Language Learning**: Develop applications for learning Darija pronunciation
- **Content Creation**: Generate voiceovers for videos or audio content
- **Public Announcements**: Create automated announcement systems in Darija
## Limitations and Future Work

Current limitations:

- The model may struggle with code-switching between Darija and other languages
- Pronunciation of certain loanwords might be inconsistent
- Limited emotional range in the generated speech

Future improvements:

- Fine-tune with more diverse speech data
- Implement emotion control for expressive speech
- Add support for Arabic script input
- Develop a multilingual version supporting Darija, Arabic, and French
## License

This project is released under the MIT License. The DODa audio dataset is also available under the MIT License.
## Acknowledgments

- The [DODa audio dataset](https://huggingface.co/datasets/atlasia/DODa-audio-dataset) creators
- Hugging Face for the Transformers library and model hosting
- Microsoft Research for the SpeechT5 model architecture |