---
title: Moroccan Darija TTS Demo
emoji: 🎙️
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 5.27.0
app_file: app.py
pinned: false
---

# Moroccan Darija Text-to-Speech Model

This project implements a Text-to-Speech (TTS) system for Moroccan Darija using the SpeechT5 architecture. It's fine-tuned on the DODa-audio-dataset to generate natural-sounding Darija speech from text input.


## Dataset Overview

The DODa audio dataset contains 12,743 sentences recorded by 7 contributors (4 females, 3 males). Key characteristics:

- Audio recordings standardized at a 16 kHz sample rate
- Multiple text representations (Latin script, Arabic script, and English translations)
- High-quality recordings with manual corrections

### Dataset Structure

| Column Name | Description |
|---|---|
| `audio` | Speech recordings of the Darija sentences |
| `darija_Ltn` | Darija sentences in Latin script |
| `darija_Arab_new` | Corrected Darija sentences in Arabic script |
| `english` | English translation of the Darija sentences |
| `darija_Arab_old` | Original (uncorrected) Darija sentences in Arabic script |

### Speaker Distribution

The recordings from the 7 speakers are grouped by sample index as follows:

```
Samples 0-999       -> Female 1
Samples 1000-1999   -> Male 3
Samples 2000-2730   -> Female 2
Samples 2731-2800   -> Male 1
Samples 2801-3999   -> Male 2
Samples 4000-4999   -> Male 1
Samples 5000-5999   -> Female 3
Samples 6000-6999   -> Male 1
Samples 7000-7999   -> Female 4
Samples 8000-8999   -> Female 1
Samples 9000-9999   -> Male 2
Samples 10000-11999 -> Male 1
Samples 12000-12350 -> Male 2
Samples 12351-12742 -> Male 1
```
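For programmatic access, the ranges above can be turned into a small lookup helper. This is a sketch; the `speaker_for_index` function is not part of the project, only the range boundaries come from the table above:

```python
# Map a sample index (0-12742) to its speaker, per the ranges above.
SPEAKER_RANGES = [
    (0, 999, "Female 1"),
    (1000, 1999, "Male 3"),
    (2000, 2730, "Female 2"),
    (2731, 2800, "Male 1"),
    (2801, 3999, "Male 2"),
    (4000, 4999, "Male 1"),
    (5000, 5999, "Female 3"),
    (6000, 6999, "Male 1"),
    (7000, 7999, "Female 4"),
    (8000, 8999, "Female 1"),
    (9000, 9999, "Male 2"),
    (10000, 11999, "Male 1"),
    (12000, 12350, "Male 2"),
    (12351, 12742, "Male 1"),
]

def speaker_for_index(i: int) -> str:
    """Return the speaker label for dataset sample index i."""
    for start, end, speaker in SPEAKER_RANGES:
        if start <= i <= end:
            return speaker
    raise IndexError(f"sample index {i} out of range 0-12742")
```

This kind of mapping is useful when selecting per-speaker subsets, e.g. to compute one speaker embedding per voice.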

## Installation

To set up the project environment:

```bash
# Clone the repository
git clone https://github.com/yourusername/darija-tts.git
cd darija-tts

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

## Model Training

The model training process involves:

1. **Data Loading**: Loading the DODa audio dataset from Hugging Face
2. **Data Preprocessing**: Normalizing text and extracting speaker embeddings
3. **Model Setup**: Configuring the SpeechT5 model for Darija TTS
4. **Training**: Fine-tuning the model on the prepared dataset

To run the training:

```bash
# Open the Jupyter notebook
jupyter notebook notebooks/train_darija_tts.ipynb
```

Key training parameters:

- Learning rate: 1e-4
- Batch size: 4 (with gradient accumulation steps: 8)
- Training steps: 1000
- Evaluation frequency: every 100 steps
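These parameters map onto a `Seq2SeqTrainingArguments` configuration along the following lines. This is a sketch, not the notebook's exact configuration; values not listed above (warmup, mixed precision, save cadence) are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the training configuration. Only the learning rate, batch
# size, accumulation, step count, and eval frequency come from the
# parameters listed above; the remaining fields are assumptions.
training_args = Seq2SeqTrainingArguments(
    output_dir="./models/speecht5_finetuned_Darija",
    per_device_train_batch_size=4,   # batch size: 4
    gradient_accumulation_steps=8,   # effective batch size: 32
    learning_rate=1e-4,
    max_steps=1000,                  # training steps
    eval_strategy="steps",
    eval_steps=100,                  # evaluate every 100 steps
    warmup_steps=100,                # assumption
    fp16=True,                       # assumption: mixed precision
    save_steps=100,                  # assumption
    report_to="none",
)
```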

## Inference

To generate speech from text after training:

```python
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import torch
import soundfile as sf

# Load the fine-tuned model, processor, and vocoder
model_path = "./models/speecht5_finetuned_Darija"
processor = SpeechT5Processor.from_pretrained(model_path)
model = SpeechT5ForTextToSpeech.from_pretrained(model_path)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Load a precomputed speaker embedding
speaker_embedding = torch.load("./data/speaker_embeddings/female_embedding.pt")

# Process the input text
text = "Salam, kifach nta lyoum?"
inputs = processor(text=text, return_tensors="pt")

# Generate speech
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)

# Save the audio at the dataset's 16 kHz sample rate
sf.write("output.wav", speech.numpy(), 16000)
```

## Gradio Demo

The project includes a Gradio demo that provides a user-friendly interface for text-to-speech conversion:

```bash
# Run the demo locally
cd demo
python app.py
```

The demo features:

- A text input field for Darija text (Latin script)
- Voice selection (male/female)
- Speech speed adjustment
- Audio playback of the generated speech
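One simple way to implement the speed adjustment (an assumption for illustration; `app.py` may use a different method) is to resample the generated waveform by linear interpolation:

```python
import numpy as np

def change_speed(waveform: np.ndarray, speed: float) -> np.ndarray:
    """Time-stretch a mono waveform by resampling with linear
    interpolation. speed > 1 shortens the audio (faster speech),
    speed < 1 lengthens it. Note: naive resampling also shifts
    pitch; a phase vocoder would preserve it."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    n_out = int(round(len(waveform) / speed))
    old_idx = np.arange(len(waveform))
    new_idx = np.linspace(0, len(waveform) - 1, n_out)
    return np.interp(new_idx, old_idx, waveform)
```

Applied to the model output before playback, `change_speed(speech.numpy(), 1.5)` would yield speech 1.5x faster than the original.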

## Deploying to Hugging Face Spaces

To deploy the demo to Hugging Face Spaces:

1. Push your model to the Hugging Face Hub
2. Create a new Space with the Gradio SDK
3. Upload the demo files to the Space

See the notebook for detailed deployment instructions.

## Project Features

- **Multi-Speaker TTS**: Generates speech in both male and female voices
- **Voice Cloning**: Uses speaker embeddings to preserve each voice's characteristics
- **Speed Control**: Adjusts the speech rate as needed
- **Text Normalization**: Handles varied text input through proper normalization
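As an illustration of the text-normalization step, a minimal sketch for Latin-script input (the exact rules used by this project are in the notebook; the rules below are assumptions):

```python
import re
import unicodedata

def normalize_darija(text: str) -> str:
    """Minimal normalization sketch for Latin-script Darija input:
    strip accents, lowercase, drop characters outside letters,
    digits, and basic punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ?!.,-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

Consistent normalization matters because the tokenizer only sees characters present in the fine-tuning text; unseen symbols would otherwise map to unknown tokens.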

## Potential Applications

- **Voice Assistants**: Build voice assistants that speak Moroccan Darija
- **Accessibility Tools**: Create tools for people with visual impairments
- **Language Learning**: Develop applications for learning Darija pronunciation
- **Content Creation**: Generate voiceovers for videos or audio content
- **Public Announcements**: Create automated announcement systems in Darija

## Limitations and Future Work

Current limitations:

- The model may struggle with code-switching between Darija and other languages
- Pronunciation of certain loanwords can be inconsistent
- Limited emotional range in the generated speech

Future improvements:

- Fine-tune with more diverse speech data
- Implement emotion control for expressive speech
- Add support for Arabic-script input
- Develop a multilingual version supporting Darija, Arabic, and French

## License

This project is released under the MIT License. The DODa audio dataset is also available under the MIT License.

## Acknowledgments

- The creators of the DODa audio dataset
- Hugging Face for the Transformers library and model hosting
- Microsoft Research for the SpeechT5 model architecture