---
title: Moroccan Darija TTS Demo
emoji: 🎙️
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 5.27.0
app_file: app.py
pinned: false
---
# Moroccan Darija Text-to-Speech Model

This project implements a Text-to-Speech (TTS) system for Moroccan Darija using the SpeechT5 architecture. It is fine-tuned on the DODa-audio-dataset to generate natural-sounding Darija speech from text input.
## Table of Contents

- [Dataset Overview](#dataset-overview)
- [Installation](#installation)
- [Model Training](#model-training)
- [Inference](#inference)
- [Gradio Demo](#gradio-demo)
- [Project Features](#project-features)
- [Potential Applications](#potential-applications)
- [Limitations and Future Work](#limitations-and-future-work)
- [License](#license)
- [Acknowledgments](#acknowledgments)
## Dataset Overview

The **DODa audio dataset** contains 12,743 sentences recorded by 7 contributors (4 females, 3 males). Key characteristics:

- Audio recordings standardized at a 16 kHz sample rate
- Multiple text representations (Latin script, Arabic script, and English translations)
- High-quality recordings with manual corrections
### Dataset Structure

| Column Name | Description |
|-------------|-------------|
| **audio** | Speech recordings for Darija sentences |
| **darija_Ltn** | Darija sentences using Latin letters |
| **darija_Arab_new** | Corrected Darija sentences using Arabic script |
| **english** | English translation of Darija sentences |
| **darija_Arab_old** | Original (uncorrected) Darija sentences in Arabic script |
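
For orientation, the snippet below shows one way to load and inspect the dataset with the Hugging Face `datasets` library. This is a minimal sketch; the `train` split name and 16 kHz casting are assumptions based on the description above, not a guarantee of the dataset's published layout.

```python
from datasets import load_dataset, Audio

# Load the DODa audio dataset from the Hugging Face Hub
dataset = load_dataset("atlasia/DODa-audio-dataset", split="train")

# Decode audio at the 16 kHz rate expected by SpeechT5
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

# Inspect one example: audio array plus the parallel text columns
example = dataset[0]
print(example["darija_Ltn"])        # Darija in Latin script
print(example["darija_Arab_new"])   # corrected Arabic-script version
print(example["english"])           # English translation
print(example["audio"]["sampling_rate"], len(example["audio"]["array"]))
```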
### Speaker Distribution

The dataset includes recordings from 7 speakers (4 females, 3 males) with the following distribution:

```
Samples 0-999       -> Female 1
Samples 1000-1999   -> Male 3
Samples 2000-2730   -> Female 2
Samples 2731-2800   -> Male 1
Samples 2801-3999   -> Male 2
Samples 4000-4999   -> Male 1
Samples 5000-5999   -> Female 3
Samples 6000-6999   -> Male 1
Samples 7000-7999   -> Female 4
Samples 8000-8999   -> Female 1
Samples 9000-9999   -> Male 2
Samples 10000-11999 -> Male 1
Samples 12000-12350 -> Male 2
Samples 12351-12742 -> Male 1
```
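
Because speaker identity is implied only by a sample's position in the dataset, a small helper can map a row index to a speaker label using the ranges above. This is an illustrative sketch; the labels follow the table and are not an official dataset field.

```python
# Index ranges taken from the distribution above (inclusive bounds)
SPEAKER_RANGES = [
    (0, 999, "female_1"), (1000, 1999, "male_3"), (2000, 2730, "female_2"),
    (2731, 2800, "male_1"), (2801, 3999, "male_2"), (4000, 4999, "male_1"),
    (5000, 5999, "female_3"), (6000, 6999, "male_1"), (7000, 7999, "female_4"),
    (8000, 8999, "female_1"), (9000, 9999, "male_2"), (10000, 11999, "male_1"),
    (12000, 12350, "male_2"), (12351, 12742, "male_1"),
]

def speaker_for_index(i: int) -> str:
    """Return the speaker label for a dataset row index."""
    for start, end, speaker in SPEAKER_RANGES:
        if start <= i <= end:
            return speaker
    raise ValueError(f"Index {i} is outside the 0-12742 range")
```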
## Installation

To set up the project environment:

```bash
# Clone the repository
git clone https://github.com/yourusername/darija-tts.git
cd darija-tts

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
## Model Training

The model training process involves:

1. **Data Loading**: Loading the DODa audio dataset from Hugging Face
2. **Data Preprocessing**: Normalizing text and extracting speaker embeddings (see the sketch after this list)
3. **Model Setup**: Configuring the SpeechT5 model for Darija TTS
4. **Training**: Fine-tuning the model using the prepared dataset
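
For the speaker-embedding step, a common approach (used in the public SpeechT5 fine-tuning recipe) is to extract x-vectors with SpeechBrain. The snippet below sketches that idea; the encoder model and function name are assumptions, not necessarily what the notebook does.

```python
import torch
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference.speaker in SpeechBrain >= 1.0

# x-vector speaker encoder commonly paired with SpeechT5
speaker_model = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    run_opts={"device": "cuda" if torch.cuda.is_available() else "cpu"},
)

def create_speaker_embedding(waveform):
    """Turn a 16 kHz mono waveform (1-D array) into a normalized 512-dim x-vector."""
    with torch.no_grad():
        emb = speaker_model.encode_batch(torch.tensor(waveform))
        emb = torch.nn.functional.normalize(emb, dim=2)
    return emb.squeeze().cpu()  # shape: (512,)
```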
To run the training:

```bash
# Open the Jupyter notebook
jupyter notebook notebooks/train_darija_tts.ipynb
```

Key training parameters:

- Learning rate: 1e-4
- Batch size: 4 (with gradient accumulation: 8)
- Training steps: 1000
- Evaluation frequency: every 100 steps
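
These hyperparameters roughly correspond to a `Seq2SeqTrainingArguments` configuration like the one below. This is a sketch only: the output path mirrors the one used at inference time, and the warmup and fp16 settings are assumptions not stated in this README.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./models/speecht5_finetuned_Darija",
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size of 32
    max_steps=1000,
    eval_strategy="steps",
    eval_steps=100,                  # evaluate every 100 steps
    save_steps=100,
    warmup_steps=100,                # assumption; tune as needed
    fp16=True,                       # assumption; enable only on GPU
    label_names=["labels"],
    report_to=[],
)
# The notebook then passes these arguments, the model, the processed dataset,
# and a padding data collator to a Seq2SeqTrainer and calls trainer.train().
```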
## Inference

To generate speech from text after training:

```python
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import torch
import soundfile as sf

# Load the fine-tuned model, processor, and HiFi-GAN vocoder
model_path = "./models/speecht5_finetuned_Darija"
processor = SpeechT5Processor.from_pretrained(model_path)
model = SpeechT5ForTextToSpeech.from_pretrained(model_path)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Load a precomputed speaker embedding
speaker_embedding = torch.load("./data/speaker_embeddings/female_embedding.pt")

# Tokenize the input text
text = "Salam, kifach nta lyoum?"
inputs = processor(text=text, return_tensors="pt")

# Generate speech
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)

# Save the audio at 16 kHz
sf.write("output.wav", speech.numpy(), 16000)
```
## Gradio Demo

The project includes a Gradio demo that provides a user-friendly interface for text-to-speech conversion:

```bash
# Run the demo locally
cd demo
python app.py
```
The demo features:

- Text input field for Darija text (Latin script)
- Voice selection (male/female)
- Speech speed adjustment
- Audio playback of generated speech
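
A minimal sketch of such an interface is shown below. The embedding file paths and the naive speed handling are illustrative assumptions; the actual `app.py` may be organized differently.

```python
import gradio as gr
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

MODEL_PATH = "./models/speecht5_finetuned_Darija"  # assumed local path
processor = SpeechT5Processor.from_pretrained(MODEL_PATH)
model = SpeechT5ForTextToSpeech.from_pretrained(MODEL_PATH)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Hypothetical per-voice embedding files
EMBEDDINGS = {
    "Female": torch.load("./data/speaker_embeddings/female_embedding.pt"),
    "Male": torch.load("./data/speaker_embeddings/male_embedding.pt"),
}

def synthesize(text: str, voice: str, speed: float):
    inputs = processor(text=text, return_tensors="pt")
    speech = model.generate_speech(inputs["input_ids"], EMBEDDINGS[voice], vocoder=vocoder)
    # Naive speed control by changing the playback rate (also shifts pitch);
    # a production demo would use proper time-stretching instead.
    return (int(16000 * speed), speech.numpy())

demo = gr.Interface(
    fn=synthesize,
    inputs=[
        gr.Textbox(label="Darija text (Latin script)", value="Salam, kifach nta lyoum?"),
        gr.Dropdown(["Female", "Male"], value="Female", label="Voice"),
        gr.Slider(0.5, 1.5, value=1.0, step=0.1, label="Speed"),
    ],
    outputs=gr.Audio(label="Generated speech"),
    title="Moroccan Darija TTS Demo",
)

if __name__ == "__main__":
    demo.launch()
```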
### Deploying to Hugging Face Spaces

To deploy the demo to Hugging Face Spaces:

1. Push your model to the Hugging Face Hub
2. Create a new Space with the Gradio SDK
3. Upload the demo files to the Space

See the notebook for detailed deployment instructions.
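
For reference, these steps can also be scripted with the `huggingface_hub` library. This is a sketch; the repository and Space names are placeholders, and it assumes you are already logged in (for example via `huggingface-cli login`).

```python
from huggingface_hub import HfApi

api = HfApi()

# 1. Push the fine-tuned model (placeholder repo name)
api.create_repo("your-username/speecht5-darija-tts", exist_ok=True)
api.upload_folder(
    folder_path="./models/speecht5_finetuned_Darija",
    repo_id="your-username/speecht5-darija-tts",
)

# 2. Create a Gradio Space and 3. upload the demo files (placeholder Space name)
api.create_repo(
    "your-username/darija-tts-demo",
    repo_type="space",
    space_sdk="gradio",
    exist_ok=True,
)
api.upload_folder(
    folder_path="./demo",
    repo_id="your-username/darija-tts-demo",
    repo_type="space",
)
```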
## Project Features

- **Multi-Speaker TTS**: Generate speech in both male and female voices
- **Voice Cloning**: Uses speaker embeddings to preserve voice characteristics
- **Speed Control**: Adjust the speech rate as needed
- **Text Normalization**: Handles varied text input through normalization (see the illustrative sketch after this list)
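
As an illustration of the kind of normalization involved (not the project's actual implementation), a simple normalizer for Latin-script input might look like this:

```python
import re

def normalize_darija_text(text: str) -> str:
    """Very simple normalization for Latin-script Darija input (illustrative only)."""
    text = text.strip().lower()
    # Collapse repeated whitespace
    text = re.sub(r"\s+", " ", text)
    # Drop characters the TTS tokenizer is unlikely to cover
    text = re.sub(r"[^a-z0-9'\s.,!?]", "", text)
    return text

print(normalize_darija_text("  Salam,   KIFACH  nta lyoum? "))
# -> "salam, kifach nta lyoum?"
```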
## Potential Applications

- **Voice Assistants**: Build voice assistants that speak Moroccan Darija
- **Accessibility Tools**: Create tools for people with visual impairments
- **Language Learning**: Develop applications for learning Darija pronunciation
- **Content Creation**: Generate voiceovers for videos or audio content
- **Public Announcements**: Create automated announcement systems in Darija
## Limitations and Future Work

Current limitations:

- The model may struggle with code-switching between Darija and other languages
- Pronunciation of certain loanwords might be inconsistent
- Limited emotional range in the generated speech

Future improvements:

- Fine-tune with more diverse speech data
- Implement emotion control for expressive speech
- Add support for Arabic script input
- Develop a multilingual version supporting Darija, Arabic, and French
## License

This project is released under the MIT License. The DODa audio dataset is also available under the MIT License.
## Acknowledgments

- The [DODa audio dataset](https://huggingface.co/datasets/atlasia/DODa-audio-dataset) creators
- Hugging Face for the Transformers library and model hosting
- Microsoft Research for the SpeechT5 model architecture |