---
title: Moroccan Darija TTS Demo
emoji: 🎙️
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 5.27.0
app_file: app.py
pinned: false
---

# Moroccan Darija Text-to-Speech Model

This project implements a Text-to-Speech (TTS) system for Moroccan Darija using the SpeechT5 architecture. It's fine-tuned on the DODa-audio-dataset to generate natural-sounding Darija speech from text input.


## Dataset Overview

The DODa audio dataset contains 12,743 sentences recorded by 7 contributors (4 females, 3 males). Key characteristics:

- Audio recordings standardized at a 16 kHz sample rate
- Multiple text representations (Latin script, Arabic script, and English translations)
- High-quality recordings with manual corrections

### Dataset Structure

| Column Name | Description |
|---|---|
| `audio` | Speech recordings of the Darija sentences |
| `darija_Ltn` | Darija sentences in Latin script |
| `darija_Arab_new` | Corrected Darija sentences in Arabic script |
| `english` | English translation of the Darija sentences |
| `darija_Arab_old` | Original (uncorrected) Darija sentences in Arabic script |

### Speaker Distribution

The recordings from the 7 speakers are grouped by sample index as follows:

```
Samples 0-999       -> Female 1
Samples 1000-1999   -> Male 3
Samples 2000-2730   -> Female 2
Samples 2731-2800   -> Male 1
Samples 2801-3999   -> Male 2
Samples 4000-4999   -> Male 1
Samples 5000-5999   -> Female 3
Samples 6000-6999   -> Male 1
Samples 7000-7999   -> Female 4
Samples 8000-8999   -> Female 1
Samples 9000-9999   -> Male 2
Samples 10000-11999 -> Male 1
Samples 12000-12350 -> Male 2
Samples 12351-12742 -> Male 1
```
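For programmatic access, the ranges above can be turned into a small lookup helper. This is a sketch; the `speaker_for_index` function is not part of the project, only the range boundaries come from the table above:

```python
# Map a sample index (0-12742) to its speaker, per the ranges above.
SPEAKER_RANGES = [
    (0, 999, "Female 1"),
    (1000, 1999, "Male 3"),
    (2000, 2730, "Female 2"),
    (2731, 2800, "Male 1"),
    (2801, 3999, "Male 2"),
    (4000, 4999, "Male 1"),
    (5000, 5999, "Female 3"),
    (6000, 6999, "Male 1"),
    (7000, 7999, "Female 4"),
    (8000, 8999, "Female 1"),
    (9000, 9999, "Male 2"),
    (10000, 11999, "Male 1"),
    (12000, 12350, "Male 2"),
    (12351, 12742, "Male 1"),
]

def speaker_for_index(i: int) -> str:
    """Return the speaker label for dataset sample index i."""
    for start, end, speaker in SPEAKER_RANGES:
        if start <= i <= end:
            return speaker
    raise IndexError(f"sample index {i} out of range 0-12742")
```

This kind of mapping is useful when selecting per-speaker subsets, e.g. to compute one speaker embedding per voice.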

## Installation

To set up the project environment:

```bash
# Clone the repository
git clone https://github.com/yourusername/darija-tts.git
cd darija-tts

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

## Model Training

The model training process involves:

1. **Data Loading**: Loading the DODa audio dataset from Hugging Face
2. **Data Preprocessing**: Normalizing text and extracting speaker embeddings
3. **Model Setup**: Configuring the SpeechT5 model for Darija TTS
4. **Training**: Fine-tuning the model on the prepared dataset

To run the training:

```bash
# Open the Jupyter notebook
jupyter notebook notebooks/train_darija_tts.ipynb
```

Key training parameters:

- Learning rate: 1e-4
- Batch size: 4 (with gradient accumulation steps: 8)
- Training steps: 1000
- Evaluation frequency: every 100 steps
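These parameters map onto a `Seq2SeqTrainingArguments` configuration along the following lines. This is a sketch, not the notebook's exact configuration; values not listed above (warmup, mixed precision, save cadence) are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the training configuration. Only the learning rate, batch
# size, accumulation, step count, and eval frequency come from the
# parameters listed above; the remaining fields are assumptions.
training_args = Seq2SeqTrainingArguments(
    output_dir="./models/speecht5_finetuned_Darija",
    per_device_train_batch_size=4,   # batch size: 4
    gradient_accumulation_steps=8,   # effective batch size: 32
    learning_rate=1e-4,
    max_steps=1000,                  # training steps
    eval_strategy="steps",
    eval_steps=100,                  # evaluate every 100 steps
    warmup_steps=100,                # assumption
    fp16=True,                       # assumption: mixed precision
    save_steps=100,                  # assumption
    report_to="none",
)
```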

## Inference

To generate speech from text after training:

```python
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import torch
import soundfile as sf

# Load the fine-tuned model, processor, and vocoder
model_path = "./models/speecht5_finetuned_Darija"
processor = SpeechT5Processor.from_pretrained(model_path)
model = SpeechT5ForTextToSpeech.from_pretrained(model_path)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Load a precomputed speaker embedding
speaker_embedding = torch.load("./data/speaker_embeddings/female_embedding.pt")

# Process the input text
text = "Salam, kifach nta lyoum?"
inputs = processor(text=text, return_tensors="pt")

# Generate speech
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)

# Save the audio at the dataset's 16 kHz sample rate
sf.write("output.wav", speech.numpy(), 16000)
```

## Gradio Demo

The project includes a Gradio demo that provides a user-friendly interface for text-to-speech conversion:

```bash
# Run the demo locally
cd demo
python app.py
```

The demo features:

- A text input field for Darija text (Latin script)
- Voice selection (male/female)
- Speech speed adjustment
- Audio playback of the generated speech
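One simple way to implement the speed adjustment (an assumption for illustration; `app.py` may use a different method) is to resample the generated waveform by linear interpolation:

```python
import numpy as np

def change_speed(waveform: np.ndarray, speed: float) -> np.ndarray:
    """Time-stretch a mono waveform by resampling with linear
    interpolation. speed > 1 shortens the audio (faster speech),
    speed < 1 lengthens it. Note: naive resampling also shifts
    pitch; a phase vocoder would preserve it."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    n_out = int(round(len(waveform) / speed))
    old_idx = np.arange(len(waveform))
    new_idx = np.linspace(0, len(waveform) - 1, n_out)
    return np.interp(new_idx, old_idx, waveform)
```

Applied to the model output before playback, `change_speed(speech.numpy(), 1.5)` would yield speech 1.5x faster than the original.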

## Deploying to Hugging Face Spaces

To deploy the demo to Hugging Face Spaces:

1. Push your model to the Hugging Face Hub
2. Create a new Space with the Gradio SDK
3. Upload the demo files to the Space

See the notebook for detailed deployment instructions.

## Project Features

- **Multi-Speaker TTS**: Generates speech in both male and female voices
- **Voice Cloning**: Uses speaker embeddings to preserve each voice's characteristics
- **Speed Control**: Adjusts the speech rate as needed
- **Text Normalization**: Handles varied text input through proper normalization
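As an illustration of the text-normalization step, a minimal sketch for Latin-script input (the exact rules used by this project are in the notebook; the rules below are assumptions):

```python
import re
import unicodedata

def normalize_darija(text: str) -> str:
    """Minimal normalization sketch for Latin-script Darija input:
    strip accents, lowercase, drop characters outside letters,
    digits, and basic punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ?!.,-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

Consistent normalization matters because the tokenizer only sees characters present in the fine-tuning text; unseen symbols would otherwise map to unknown tokens.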

## Potential Applications

- **Voice Assistants**: Build voice assistants that speak Moroccan Darija
- **Accessibility Tools**: Create tools for people with visual impairments
- **Language Learning**: Develop applications for learning Darija pronunciation
- **Content Creation**: Generate voiceovers for videos or audio content
- **Public Announcements**: Create automated announcement systems in Darija

## Limitations and Future Work

Current limitations:

- The model may struggle with code-switching between Darija and other languages
- Pronunciation of certain loanwords can be inconsistent
- Limited emotional range in the generated speech

Future improvements:

- Fine-tune with more diverse speech data
- Implement emotion control for expressive speech
- Add support for Arabic-script input
- Develop a multilingual version supporting Darija, Arabic, and French

## License

This project is released under the MIT License. The DODa audio dataset is also available under the MIT License.

## Acknowledgments

- The creators of the DODa audio dataset
- Hugging Face for the Transformers library and model hosting
- Microsoft Research for the SpeechT5 model architecture