---
title: Moroccan Darija TTS Demo
emoji: 🎙️
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 5.27.0
app_file: app.py
pinned: false
---
# Moroccan Darija Text-to-Speech Model
This project implements a Text-to-Speech (TTS) system for Moroccan Darija using the SpeechT5 architecture. It's fine-tuned on the DODa-audio-dataset to generate natural-sounding Darija speech from text input.
## Table of Contents
- [Dataset Overview](#dataset-overview)
- [Installation](#installation)
- [Model Training](#model-training)
- [Inference](#inference)
- [Gradio Demo](#gradio-demo)
- [Project Features](#project-features)
- [Potential Applications](#potential-applications)
- [Limitations and Future Work](#limitations-and-future-work)
- [License](#license)
- [Acknowledgments](#acknowledgments)
## Dataset Overview
The **DODa audio dataset** contains 12,743 sentences recorded by 7 contributors (4 female, 3 male). Key characteristics:
- Audio recordings standardized at a 16 kHz sample rate
- Multiple text representations (Latin script, Arabic script, and English translations)
- High-quality recordings with manual corrections
### Dataset Structure
| Column Name | Description |
|-------------|-------------|
| **audio** | Speech recordings for Darija sentences |
| **darija_Ltn** | Darija sentences using Latin letters |
| **darija_Arab_new** | Corrected Darija sentences using Arabic script |
| **english** | English translation of Darija sentences |
| **darija_Arab_old** | Original (uncorrected) Darija sentences in Arabic script |
### Speaker Distribution
Each of the 7 speakers recorded one or more contiguous ranges of sample indices:
```
Samples 0-999 -> Female 1
Samples 1000-1999 -> Male 3
Samples 2000-2730 -> Female 2
Samples 2731-2800 -> Male 1
Samples 2801-3999 -> Male 2
Samples 4000-4999 -> Male 1
Samples 5000-5999 -> Female 3
Samples 6000-6999 -> Male 1
Samples 7000-7999 -> Female 4
Samples 8000-8999 -> Female 1
Samples 9000-9999 -> Male 2
Samples 10000-11999 -> Male 1
Samples 12000-12350 -> Male 2
Samples 12351-12742 -> Male 1
```
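Because each speaker owns contiguous index ranges, a sample index can be mapped to its speaker with a sorted lookup. The sketch below transcribes the ranges listed above; the names `speaker_for_index` and `samples_per_speaker` are hypothetical helpers, not part of the dataset or this project.

```python
from bisect import bisect_right
from collections import Counter

# (first sample index, speaker label), transcribed from the table above.
_RANGES = [
    (0, "Female 1"), (1000, "Male 3"), (2000, "Female 2"),
    (2731, "Male 1"), (2801, "Male 2"), (4000, "Male 1"),
    (5000, "Female 3"), (6000, "Male 1"), (7000, "Female 4"),
    (8000, "Female 1"), (9000, "Male 2"), (10000, "Male 1"),
    (12000, "Male 2"), (12351, "Male 1"),
]
_STARTS = [start for start, _ in _RANGES]
NUM_SAMPLES = 12_743  # indices 0..12742


def speaker_for_index(index: int) -> str:
    """Return the speaker label for a dataset sample index."""
    if not 0 <= index < NUM_SAMPLES:
        raise IndexError(f"sample index out of range: {index}")
    # bisect_right finds the first range start greater than index;
    # the range just before it is the one containing index.
    return _RANGES[bisect_right(_STARTS, index) - 1][1]


def samples_per_speaker() -> Counter:
    """Count how many samples each speaker contributed."""
    counts = Counter()
    ends = _STARTS[1:] + [NUM_SAMPLES]
    for (start, label), end in zip(_RANGES, ends):
        counts[label] += end - start
    return counts
```

Calling `samples_per_speaker()` makes the imbalance between speakers visible, which may matter when choosing which speaker's recordings to use for a speaker embedding.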
## Installation
To set up the project environment:
```bash
# Clone the repository
git clone https://github.com/yourusername/darija-tts.git
cd darija-tts
# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
```
## Model Training
The model training process involves:
1. **Data Loading**: Loading the DODa audio dataset from Hugging Face
2. **Data Preprocessing**: Normalizing text and extracting speaker embeddings
3. **Model Setup**: Configuring the SpeechT5 model for Darija TTS
4. **Training**: Fine-tuning the model using the prepared dataset
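The README does not spell out what the text normalization in step 2 does. As a rough illustration only, the sketch below shows the kind of cleanup often applied to Latin-script input before tokenization. The function name `normalize_darija_latin` and every rule in it (Unicode folding, straight quotes, lowercasing, whitespace collapsing) are assumptions, not the project's actual preprocessing. Digits are left untouched because Latin-script Darija commonly uses digits such as 3, 7, and 9 as letters.

```python
import re
import unicodedata

# Map curly quotes to their plain ASCII equivalents (assumed rule).
_QUOTES = str.maketrans({"\u2019": "'", "\u2018": "'", "\u201c": '"', "\u201d": '"'})


def normalize_darija_latin(text: str) -> str:
    """Hypothetical normalizer for Latin-script Darija input."""
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters
    text = text.translate(_QUOTES)              # straighten curly quotes
    text = text.lower()                         # assumed: case carries no meaning
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    return text
```

Whether lowercasing is appropriate depends on the tokenizer vocabulary used for fine-tuning, so that rule in particular should be checked against the training notebook.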
To run the training:
```bash
# Open the Jupyter notebook
jupyter notebook notebooks/train_darija_tts.ipynb
```
Key training parameters:
- Learning rate: 1e-4
- Batch size: 4, with 8 gradient accumulation steps (effective batch size of 32 per device)
- Training steps: 1000
- Evaluation frequency: Every 100 steps
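To put these numbers in perspective, the sketch below works out the effective batch size and roughly how many passes over the data 1000 steps amount to. It assumes a single device, that each training step is one optimizer update after gradient accumulation, and that all 12,743 sentences are used for training; none of these is confirmed by the README, and `training_budget` is a hypothetical helper.

```python
def training_budget(
    batch_size: int = 4,
    grad_accum_steps: int = 8,
    max_steps: int = 1000,
    eval_every: int = 100,
    dataset_size: int = 12_743,  # assumed: the whole dataset is the training set
) -> dict:
    """Derive simple training-budget figures from the listed parameters."""
    effective_batch = batch_size * grad_accum_steps   # samples per optimizer update
    samples_seen = effective_batch * max_steps        # samples processed in total
    return {
        "effective_batch": effective_batch,
        "samples_seen": samples_seen,
        "approx_epochs": samples_seen / dataset_size,
        "evaluations": max_steps // eval_every,
    }


if __name__ == "__main__":
    for name, value in training_budget().items():
        print(f"{name}: {value}")
```

Under these assumptions the run processes 32,000 samples, about two and a half passes over the dataset, with 10 evaluation points along the way.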
## Inference
To generate speech from text after training:
```python
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import torch
import soundfile as sf
# Load models
model_path = "./models/speecht5_finetuned_Darija"
processor = SpeechT5Processor.from_pretrained(model_path)
model = SpeechT5ForTextToSpeech.from_pretrained(model_path)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
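# Load the speaker embedding; generate_speech expects a float tensor of
# shape (1, 512) with the default SpeechT5 configuration
speaker_embedding = torch.load("./data/speaker_embeddings/female_embedding.pt")
# Tokenize the input text (Latin-script Darija)
text = "Salam, kifach nta lyoum?"
inputs = processor(text=text, return_tensors="pt")
# Generate speech; passing the vocoder returns a waveform instead of a spectrogram
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
# Save the audio at the 16 kHz sample rate used by the dataset
sf.write("output.wav", speech.numpy(), 16000)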
```
## Gradio Demo
The project includes a Gradio demo that provides a user-friendly interface for text-to-speech conversion:
```bash
# Run the demo locally
cd demo
python app.py
```
The demo features:
- Text input field for Darija text (Latin script)
- Voice selection (male/female)
- Speech speed adjustment
- Audio playback of generated speech
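The README does not describe how the demo implements speed adjustment. One naive option, shown here purely as an illustration, is to resample the waveform by linear interpolation and play it back at the original 16 kHz rate. This also shifts the pitch, so a pitch-preserving result would need a proper time-stretching algorithm. `change_speed` is a hypothetical helper, not the demo's code.

```python
def change_speed(samples: list[float], speed: float) -> list[float]:
    """Resample a waveform so it plays `speed` times faster at the same rate.

    speed > 1 shortens the audio, speed < 1 lengthens it. Pitch shifts too.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    n = len(samples)
    if n == 0:
        return []
    out_len = max(1, round(n / speed))
    out = []
    for i in range(out_len):
        pos = i * speed                 # fractional position in the input
        lo = min(int(pos), n - 1)       # sample at or before pos
        hi = min(lo + 1, n - 1)         # next sample, clamped at the end
        frac = pos - lo
        # Linear interpolation between the two neighbouring samples.
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```

With the waveform from the Inference section, this could be applied as `change_speed(speech.numpy().tolist(), 1.25)` before writing the result with `sf.write`.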
### Deploying to Hugging Face Spaces
To deploy the demo to Hugging Face Spaces:
1. Push your model to the Hugging Face Hub
2. Create a new Space with the Gradio SDK
3. Upload the demo files to the Space
See the notebook for detailed deployment instructions.
## Project Features
- **Multi-Speaker TTS**: Generate speech in both male and female voices
- **Voice Cloning**: Uses speaker embeddings to preserve a speaker's voice characteristics
- **Speed Control**: Adjust the speech rate as needed
- **Text Normalization**: Normalizes input text before synthesis so that varied inputs are handled consistently
## Potential Applications
- **Voice Assistants**: Build voice assistants that speak Moroccan Darija
- **Accessibility Tools**: Create tools for people with visual impairments
- **Language Learning**: Develop applications for learning Darija pronunciation
- **Content Creation**: Generate voiceovers for videos or audio content
- **Public Announcements**: Create automated announcement systems in Darija
## Limitations and Future Work
Current limitations:
- The model may struggle with code-switching between Darija and other languages
- Pronunciation of certain loanwords might be inconsistent
- Limited emotional range in the generated speech
Future improvements:
- Fine-tune with more diverse speech data
- Implement emotion control for expressive speech
- Add support for Arabic script input
- Develop a multilingual version supporting Darija, Arabic, and French
## License
This project is released under the MIT License. The DODa audio dataset is also available under the MIT License.
## Acknowledgments
- The creators of the [DODa audio dataset](https://huggingface.co/datasets/atlasia/DODa-audio-dataset)
- Hugging Face for the Transformers library and model hosting
- Microsoft Research for the SpeechT5 model architecture