# whisper-medium-creole-oswald
This model is a fine-tuned version of openai/whisper-medium on the creole-text-voice dataset.
The main objective is to create a highly accurate (99% target) Haitian Creole speech-to-text model, capable of transcribing diverse Haitian voices across accents, regions, and speaking styles.
## 🧠 Model description
whisper-medium-creole-oswald is optimized for Haitian Creole automatic speech recognition (ASR). It builds upon the Whisper architecture by OpenAI and adapts it to Haitian Creole through transfer learning and fine-tuning on a high-quality curated dataset containing hours of Haitian Creole audio-text pairs.
- Architecture: Whisper Medium
- Fine-tuned for: Haitian Creole (Kreyòl Ayisyen)
- Vocabulary: Based on Latin script (Creole orthography), preserving diacritics and linguistic nuances.
- Voice types: Trained primarily on female synthetic voices.
- Sampling rate: 16 kHz
- Training objective: Maximize transcription accuracy for everyday Creole speech
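Since the model expects mono 16 kHz input, audio recorded at other sampling rates should be resampled before inference. A minimal loading/resampling sketch with librosa (the file name is a placeholder):

```python
import librosa

# librosa resamples to the requested rate on load; Whisper expects 16 kHz mono
speech, sr = librosa.load("sample.wav", sr=16000, mono=True)  # "sample.wav" is a placeholder
```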
## ✅ Intended uses
Transcribe Haitian Creole speech from:
- Voice notes
- Radio shows
- Interviews
- Public speeches
- Educational content
- Synthetic voices
Enable Creole voice interfaces in:
- Voice assistants
- Transcription services
- Language-learning tools
- Chatbots and accessibility platforms
## ⚠️ Limitations
- May struggle with:
  - Heavily code-switched speech (Creole mixed with French or English)
  - Very poor audio quality (e.g., heavy background noise)
  - Very fast or mumbled speech in some dialects
  - Long-duration audio files (see the chunking sketch after this list)
- Not optimized for real-time transcription on low-resource devices
- Fine-tuned on a specific dataset, so it may generalize less well to completely unseen voice types or rare accents
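For long recordings, the transformers ASR pipeline can split the audio into overlapping windows and stitch the transcripts back together. A minimal sketch (the 30 s chunk length and the file name are assumptions, not values from this card):

```python
from transformers import pipeline

# Chunked long-form inference: audio is split into 30 s windows with
# overlapping strides, and the partial transcripts are merged.
pipe = pipeline(
    "automatic-speech-recognition",
    model="jsbeaudry/whisper-medium-oswald",
    chunk_length_s=30,
)

result = pipe("long_interview.wav")  # hypothetical file path
print(result["text"])
```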
## 📊 Training and evaluation data
The model was trained on the creole-text-voice dataset, which includes:
- 5 hours of synthetic Haitian Creole speech
- Annotated, time-aligned text transcripts following standard Creole orthography
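If the dataset is published on the Hub, it could be loaded roughly as below; the dataset id `jsbeaudry/creole-text-voice` and the `audio` column name are assumptions based on the names used in this card:

```python
from datasets import load_dataset, Audio

# Hypothetical dataset id; the card only names the dataset "creole-text-voice"
ds = load_dataset("jsbeaudry/creole-text-voice", split="train")
# Decode audio at the model's 16 kHz sampling rate
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
```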
Planned sources for future iterations:
- Public domain radio and podcast archives
- Open-access interviews and spoken-word audio
- Community-submitted voice samples
Preprocessing steps:
- Voice Activity Detection (VAD)
- Noise filtering and audio normalization
- Manual transcript review and correction
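The exact preprocessing pipeline is not published; below is a minimal sketch of the normalization and silence-trimming steps using librosa, with `librosa.effects.trim` standing in for a dedicated VAD model:

```python
import librosa
import numpy as np

def preprocess(path):
    # Load at the model's 16 kHz target rate
    audio, sr = librosa.load(path, sr=16000, mono=True)
    # Peak-normalize to [-1, 1]
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak
    # Trim leading/trailing silence (a simple energy-based stand-in for VAD)
    audio, _ = librosa.effects.trim(audio, top_db=30)
    return audio, sr
```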
## Model usage script
```python
# Load model directly
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import librosa

processor = AutoProcessor.from_pretrained("jsbeaudry/whisper-medium-oswald")
model = AutoModelForSpeechSeq2Seq.from_pretrained("jsbeaudry/whisper-medium-oswald")

def transcript(audio_file_path):
    # Load audio at the 16 kHz rate the model expects
    speech_array, sampling_rate = librosa.load(audio_file_path, sr=16000)

    # Extract log-mel input features
    input_features = processor(
        speech_array, sampling_rate=sampling_rate, return_tensors="pt"
    ).input_features

    # Generate predicted token ids
    predicted_ids = model.generate(input_features)

    # Decode the predictions; batch_decode returns a list of strings
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    return transcription[0]

text = transcript("/path_audio")
print(text)
```
## Model usage with Gradio (UI)
```python
from transformers import pipeline
import gradio as gr

# Load Whisper model
print("Loading model...")
pipe = pipeline(model="jsbeaudry/whisper-medium-oswald")
print("Model loaded successfully.")

# Transcription function
def transcribe(audio_path):
    if audio_path is None:
        return "Please upload or record an audio file first."
    result = pipe(audio_path)
    return result["text"]

# Build Gradio interface
def create_interface():
    with gr.Blocks(title="Whisper Medium - Haitian Creole") as demo:
        gr.Markdown("# 🎙️ Whisper Medium Creole ASR")
        gr.Markdown(
            "Upload an audio file or record your voice in Haitian Creole. "
            "Then click **Transcribe** to see the result."
        )

        with gr.Row():
            with gr.Column():
                # Gradio 3.x API; in Gradio 4+, use sources=["upload"] / sources=["microphone"]
                audio_input = gr.Audio(source="upload", type="filepath", label="🎧 Upload Audio")
                audio_input2 = gr.Audio(source="microphone", type="filepath", label="🎤 Record Audio")
            with gr.Column():
                transcribe_button = gr.Button("🚀 Transcribe")
                output_text = gr.Textbox(label="📝 Transcribed Text", lines=4)

        # Both handlers fire on each click; if both inputs are set,
        # the second result overwrites the first in the textbox
        transcribe_button.click(fn=transcribe, inputs=audio_input, outputs=output_text)
        transcribe_button.click(fn=transcribe, inputs=audio_input2, outputs=output_text)

    return demo

if __name__ == "__main__":
    interface = create_interface()
    interface.launch()
```
## Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (torch implementation) with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 5
- mixed_precision_training: Native AMP
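For reference, these settings map onto transformers' `Seq2SeqTrainingArguments` roughly as in the sketch below; `output_dir` is a placeholder, and anything not listed above is an assumption:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-creole-oswald",  # placeholder
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=5,
    fp16=True,  # Native AMP mixed precision
)
```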
## Framework versions
- Transformers 4.46.1
- PyTorch 2.6.0+cu124
- Datasets 3.5.0
- Tokenizers 0.20.3
## 📚 Citation
If you use this model, please cite:
```bibtex
@misc{whispermediumcreoleoswald2025,
  title        = {Whisper Medium Creole - Oswald},
  author       = {Jean Sauvenel Beaudry},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/solvexalab/whisper-medium-creole-oswald}}
}
```