---
title: Moroccan Darija TTS Demo
emoji: 🎙️
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 5.27.0
app_file: app.py
pinned: false
---
Moroccan Darija Text-to-Speech Model
This project implements a Text-to-Speech (TTS) system for Moroccan Darija using the SpeechT5 architecture. It's fine-tuned on the DODa-audio-dataset to generate natural-sounding Darija speech from text input.
Table of Contents
- Dataset Overview
- Project Structure
- Installation
- Model Training
- Inference
- Gradio Demo
- Project Features
- Potential Applications
- Limitations and Future Work
- License
Dataset Overview
The DODa audio dataset contains 12,743 sentences recorded by 7 contributors (4 females, 3 males). Key characteristics:
- Audio recordings standardized at 16kHz sample rate
- Multiple text representations (Latin script, Arabic script, and English translations)
- High-quality recordings with manual corrections
Dataset Structure
| Column Name | Description |
|---|---|
| audio | Speech recordings for Darija sentences |
| darija_Ltn | Darija sentences using Latin letters |
| darija_Arab_new | Corrected Darija sentences using Arabic script |
| english | English translation of Darija sentences |
| darija_Arab_old | Original (uncorrected) Darija sentences in Arabic script |
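As a quick check, the dataset and these columns can be inspected with the datasets library. A minimal sketch, assuming the dataset is hosted on the Hub under the ID atlasia/DODa-audio-dataset (substitute the actual repository name if it differs):

```python
from datasets import load_dataset

# Dataset ID is an assumption; replace it with the actual Hub repository name.
dataset = load_dataset("atlasia/DODa-audio-dataset", split="train")

print(dataset.column_names)  # expected: the five columns listed above
sample = dataset[0]
print(sample["darija_Ltn"], "->", sample["english"])
print(sample["audio"]["sampling_rate"])  # recordings are standardized at 16 kHz
```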
Speaker Distribution
The dataset includes recordings from 7 speakers (4 females, 3 males) with the following distribution:
| Samples | Speaker |
|---|---|
| 0-999 | Female 1 |
| 1000-1999 | Male 3 |
| 2000-2730 | Female 2 |
| 2731-2800 | Male 1 |
| 2801-3999 | Male 2 |
| 4000-4999 | Male 1 |
| 5000-5999 | Female 3 |
| 6000-6999 | Male 1 |
| 7000-7999 | Female 4 |
| 8000-8999 | Female 1 |
| 9000-9999 | Male 2 |
| 10000-11999 | Male 1 |
| 12000-12350 | Male 2 |
| 12351-12742 | Male 1 |
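When processing the dataset per speaker (for example, to compute one embedding per voice), a small lookup like the following can map a row index to its speaker; the helper name and structure are illustrative, not part of the dataset:

```python
# Sample-index ranges from the distribution table above.
SPEAKER_RANGES = [
    (0, 999, "Female 1"), (1000, 1999, "Male 3"), (2000, 2730, "Female 2"),
    (2731, 2800, "Male 1"), (2801, 3999, "Male 2"), (4000, 4999, "Male 1"),
    (5000, 5999, "Female 3"), (6000, 6999, "Male 1"), (7000, 7999, "Female 4"),
    (8000, 8999, "Female 1"), (9000, 9999, "Male 2"), (10000, 11999, "Male 1"),
    (12000, 12350, "Male 2"), (12351, 12742, "Male 1"),
]

def speaker_for_index(idx: int) -> str:
    """Return the speaker label for a dataset row index (illustrative helper)."""
    for start, end, speaker in SPEAKER_RANGES:
        if start <= idx <= end:
            return speaker
    raise ValueError(f"Index {idx} is outside the dataset range 0-12742")
```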
Installation
To set up the project environment:
```bash
# Clone the repository
git clone https://github.com/yourusername/darija-tts.git
cd darija-tts

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
Model Training
The model training process involves:
- Data Loading: Loading the DODa audio dataset from Hugging Face
- Data Preprocessing: Normalizing text and extracting speaker embeddings (see the embedding sketch after this list)
- Model Setup: Configuring the SpeechT5 model for Darija TTS
- Training: Fine-tuning the model using the prepared dataset
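For the speaker-embedding part of preprocessing, a common approach (and the one assumed in this sketch) is to use SpeechBrain x-vectors, which are 512-dimensional and match the speaker-embedding size SpeechT5 expects; the exact procedure used in the notebook may differ:

```python
import torch
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference.classifiers in SpeechBrain >= 1.0

# Assumption: x-vectors are used as speaker embeddings, as in the usual SpeechT5 fine-tuning recipe.
speaker_model = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")

def create_speaker_embedding(waveform):
    """Compute a normalized 512-dim speaker embedding from a 16 kHz waveform."""
    with torch.no_grad():
        embeddings = speaker_model.encode_batch(torch.tensor(waveform).unsqueeze(0))
        embeddings = torch.nn.functional.normalize(embeddings, dim=2)
    return embeddings.squeeze()  # shape: (512,)
```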
To run the training:
```bash
# Open the Jupyter notebook
jupyter notebook notebooks/train_darija_tts.ipynb
```
Key training parameters:
- Learning rate: 1e-4
- Batch size: 4 (with gradient accumulation: 8)
- Training steps: 1000
- Evaluation frequency: Every 100 steps
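A sketch of how these parameters might map onto Hugging Face Seq2SeqTrainingArguments; the output directory is a placeholder and any flag not listed above is an assumption:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./models/speecht5_finetuned_Darija",  # placeholder path
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    max_steps=1000,
    eval_strategy="steps",  # called evaluation_strategy in older transformers releases
    eval_steps=100,
    save_steps=100,          # assumption: checkpoint at the same cadence as evaluation
    label_names=["labels"],
    report_to=[],            # assumption: disable external experiment logging
)
```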
Inference
To generate speech from text after training:
```python
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import torch
import soundfile as sf

# Load the fine-tuned model, its processor, and the HiFi-GAN vocoder
model_path = "./models/speecht5_finetuned_Darija"
processor = SpeechT5Processor.from_pretrained(model_path)
model = SpeechT5ForTextToSpeech.from_pretrained(model_path)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Load a speaker embedding (generate_speech expects a tensor of shape (1, 512))
speaker_embedding = torch.load("./data/speaker_embeddings/female_embedding.pt")

# Tokenize the input text
text = "Salam, kifach nta lyoum?"
inputs = processor(text=text, return_tensors="pt")

# Generate speech
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)

# Save the audio at the dataset's 16 kHz sample rate
sf.write("output.wav", speech.numpy(), 16000)
```
Gradio Demo
The project includes a Gradio demo that provides a user-friendly interface for text-to-speech conversion:
```bash
# Run the demo locally
cd demo
python app.py
```
The demo features:
- Text input field for Darija text (Latin script)
- Voice selection (male/female)
- Speech speed adjustment
- Audio playback of generated speech
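A minimal sketch of how such an interface can be wired up in Gradio; the synthesize stub below just returns silence, whereas the actual app.py runs the SpeechT5 pipeline from the Inference section and selects a speaker embedding based on the chosen voice:

```python
import numpy as np
import gradio as gr

def synthesize(text, voice, speed):
    # Illustrative stub: the real app would run SpeechT5 with a male or female
    # speaker embedding and apply the requested speed; here we return 1 s of silence.
    sample_rate = 16000
    waveform = np.zeros(sample_rate, dtype=np.float32)
    return sample_rate, waveform

demo = gr.Interface(
    fn=synthesize,
    inputs=[
        gr.Textbox(label="Darija text (Latin script)"),
        gr.Radio(["Female", "Male"], value="Female", label="Voice"),
        gr.Slider(0.5, 2.0, value=1.0, label="Speech speed"),
    ],
    outputs=gr.Audio(label="Generated speech"),
    title="Moroccan Darija TTS Demo",
)

if __name__ == "__main__":
    demo.launch()
```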
Deploying to Hugging Face Spaces
To deploy the demo to Hugging Face Spaces:
- Push your model to the Hugging Face Hub
- Create a new Space with the Gradio SDK
- Upload the demo files to the Space
See the notebook for detailed deployment instructions.
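The Space creation and upload steps can also be scripted with the huggingface_hub library; the Space ID and folder path below are placeholders:

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes you are already logged in, e.g. via `huggingface-cli login`

repo_id = "your-username/darija-tts-demo"  # placeholder Space ID
api.create_repo(repo_id=repo_id, repo_type="space", space_sdk="gradio", exist_ok=True)
api.upload_folder(folder_path="./demo", repo_id=repo_id, repo_type="space")
```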
Project Features
- Multi-Speaker TTS: Generate speech in both male and female voices
- Voice Cloning: Uses speaker embeddings to preserve each speaker's voice characteristics
- Speed Control: Adjust the speech rate as needed
- Text Normalization: Handles various text inputs through proper normalization
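As an illustration of the text-normalization step, a minimal sketch is shown below; the actual rules applied during training may differ:

```python
import re
import unicodedata

def normalize_darija_text(text: str) -> str:
    """Illustrative normalization for Latin-script Darija input."""
    text = unicodedata.normalize("NFKC", text)  # unify Unicode representations
    text = text.lower()                         # lowercase the Latin script
    text = re.sub(r"[^\w\s']", " ", text)       # strip punctuation (illustrative choice)
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

print(normalize_darija_text("  Salam,   kifach nta LYOUM? "))  # -> "salam kifach nta lyoum"
```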
Potential Applications
- Voice Assistants: Build voice assistants that speak Moroccan Darija
- Accessibility Tools: Create tools for people with visual impairments
- Language Learning: Develop applications for learning Darija pronunciation
- Content Creation: Generate voiceovers for videos or audio content
- Public Announcements: Create automated announcement systems in Darija
Limitations and Future Work
Current limitations:
- The model may struggle with code-switching between Darija and other languages
- Pronunciation of certain loanwords might be inconsistent
- Limited emotional range in the generated speech
Future improvements:
- Fine-tune with more diverse speech data
- Implement emotion control for expressive speech
- Add support for Arabic script input
- Develop a multilingual version supporting Darija, Arabic, and French
License
This project is released under the MIT License. The DODa audio dataset is also available under the MIT License.
Acknowledgments
- The DODa audio dataset creators
- Hugging Face for the Transformers library and model hosting
- Microsoft Research for the SpeechT5 model architecture