The Problem with Whisper Alone
OpenAI's Whisper is one of the best transcription models available. But it has one major limitation: it doesn't identify speakers. A meeting transcript looks like one continuous block of text with no indication of who said what.
// Whisper output (no speaker labels)
"Let's review the Q4 targets. We're currently at 85% of target. The new campaign should help close the gap. Great. Can you send the updated forecast by Friday?"
The Solution: Pre-process with Diarization
By separating speakers before running Whisper, you can transcribe each speaker's audio independently, then combine the results with proper attribution.
// Result: Labeled transcript
[Sarah]: "Let's review the Q4 targets."
[Mike]: "We're currently at 85% of target."
[Lisa]: "The new campaign should help close the gap."
[Sarah]: "Great. Can you send the updated forecast by Friday?"
Step-by-Step Workflow
Upload to SplitBySpeakers
Upload your audio file. Our AI identifies unique speakers and creates separate audio tracks for each.
meeting.mp3 → SplitBySpeakers
↓
speaker_1.mp3, speaker_2.mp3, speaker_3.mp3
Identify Each Speaker
Listen to a few seconds of each track to identify the speaker. Rename the files accordingly.
speaker_1.mp3 → sarah_ceo.mp3
speaker_2.mp3 → mike_sales.mp3
speaker_3.mp3 → lisa_marketing.mp3
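If there are many tracks, the renaming step can be scripted. The following is a minimal sketch, assuming the tracks sit in one folder and that you have already listened to each one; the `RENAMES` mapping and the `rename_tracks` helper are illustrative names, not part of any tool mentioned here.

```python
from pathlib import Path

# Hypothetical mapping from generated track names to the names
# chosen after listening to each track.
RENAMES = {
    "speaker_1.mp3": "sarah_ceo.mp3",
    "speaker_2.mp3": "mike_sales.mp3",
    "speaker_3.mp3": "lisa_marketing.mp3",
}


def rename_tracks(folder, renames):
    """Rename tracks inside `folder` and return the new paths.

    Tracks that are not present are skipped; an existing target
    file is treated as an error rather than overwritten.
    """
    folder = Path(folder)
    renamed = []
    for old_name, new_name in renames.items():
        source = folder / old_name
        target = folder / new_name
        if not source.exists():
            continue
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        source.rename(target)
        renamed.append(target)
    return renamed
```

Refusing to overwrite an existing file is a deliberate choice: a wrong label is hard to spot later, so it is safer to stop than to silently replace a track.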
Run Whisper on Each Track
Transcribe each speaker's audio separately using Whisper. This ensures each transcript is linked to the correct speaker.
# Using the open-source Whisper command-line tool
whisper sarah_ceo.mp3 --output_format json
whisper mike_sales.mp3 --output_format json
whisper lisa_marketing.mp3 --output_format json
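The same commands can be driven from Python when there are more than a handful of tracks. This is a sketch under assumptions: it expects the open-source `whisper` command-line tool to be installed and on the PATH, and it uses the tool's `--output_format` and `--output_dir` options. The helper names and the `transcripts` folder are illustrative.

```python
import subprocess
from pathlib import Path


def build_whisper_command(audio_path, output_dir="transcripts"):
    """Build the argument list for one Whisper CLI run with JSON output."""
    return [
        "whisper", str(audio_path),
        "--output_format", "json",
        "--output_dir", str(output_dir),
    ]


def transcribe_tracks(audio_paths, output_dir="transcripts", run=subprocess.run):
    """Run Whisper on each track and return the expected JSON paths.

    The CLI names each output file after the input file's base name,
    so sarah_ceo.mp3 is expected to produce sarah_ceo.json in output_dir.
    `run` is injectable so the function can be exercised without Whisper.
    """
    json_paths = []
    for audio_path in audio_paths:
        run(build_whisper_command(audio_path, output_dir), check=True)
        json_paths.append(Path(output_dir) / (Path(audio_path).stem + ".json"))
    return json_paths
```

Passing `check=True` makes a failed transcription raise immediately instead of leaving a missing JSON file to be discovered at merge time.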
Merge with Timestamps
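Whisper provides start and end timestamps for each segment. Use them to interleave the transcripts in chronological order, adding speaker labels. This only works if each speaker track keeps the timeline of the original recording, with silence where that speaker is not talking; if a track has its silences trimmed, its timestamps no longer line up with the other tracks.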
Python Example
Here's a basic Python script to merge Whisper transcripts with speaker labels:
import json


def merge_transcripts(speaker_files: dict) -> list:
    """
    speaker_files: {"Sarah": "sarah.json", "Mike": "mike.json"}
    Returns sorted list of segments with speaker labels
    """
    all_segments = []
    for speaker, filepath in speaker_files.items():
        # Each file is the JSON output Whisper wrote for one speaker's track
        with open(filepath, encoding="utf-8") as f:
            data = json.load(f)
        for segment in data["segments"]:
            all_segments.append({
                "speaker": speaker,
                "start": segment["start"],
                "end": segment["end"],
                "text": segment["text"].strip()
            })
    # Sort by start time
    return sorted(all_segments, key=lambda x: x["start"])


# Usage
speakers = {
    "Sarah (CEO)": "sarah_ceo.json",
    "Mike (Sales)": "mike_sales.json",
    "Lisa (Marketing)": "lisa_marketing.json"
}
transcript = merge_transcripts(speakers)
for seg in transcript:
    print(f"[{seg['speaker']}]: {seg['text']}")
Pro Tips
Use Whisper's JSON Output
JSON format includes timestamps for each segment, making it easy to merge transcripts chronologically.
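As an illustration of what those timestamps allow, the sketch below reads one Whisper JSON file and prefixes each segment with a clock-style timestamp. It assumes the file has a top-level `segments` list whose entries carry `start` (in seconds) and `text`; the function names are illustrative.

```python
import json


def format_timestamp(seconds):
    """Convert a time in seconds to HH:MM:SS, dropping fractions of a second."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def segment_lines(whisper_json_path):
    """Return one '[HH:MM:SS] text' line per segment in a Whisper JSON file."""
    with open(whisper_json_path, encoding="utf-8") as f:
        data = json.load(f)
    return [
        f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}"
        for seg in data["segments"]
    ]
```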
Higher Quality = Better Results
Both diarization and transcription work better with high-quality source audio. Use a lossless format such as WAV or FLAC when possible.
Handle Overlapping Speech
When speakers talk over each other, both get transcribed. Use timestamps to indicate overlapping segments in your output.
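One way to find those overlaps is to compare segment time ranges after merging. The sketch below is an assumption about how you might do it, not a feature of Whisper: it takes segments shaped like the merged list (with `speaker`, `start`, and `end` keys) and adds an `overlaps` flag to any segment that shares time with a segment from a different speaker.

```python
def mark_overlaps(segments):
    """Return segments sorted by start time, each with an 'overlaps' flag.

    A segment is flagged when its time range intersects the range of a
    segment spoken by someone else. The input list is not modified.
    """
    ordered = sorted(segments, key=lambda s: s["start"])
    marked = [dict(seg, overlaps=False) for seg in ordered]
    for i, current in enumerate(marked):
        for later in marked[i + 1:]:
            if later["start"] >= current["end"]:
                # Sorted by start, so nothing further can overlap `current`.
                break
            if later["speaker"] != current["speaker"]:
                current["overlaps"] = True
                later["overlaps"] = True
    return marked
```

When printing the transcript, a flagged segment could then be given a marker such as "(overlapping)" next to the speaker label.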
Batch Processing
For multiple recordings, script the entire pipeline: upload to SplitBySpeakers API → download tracks → run Whisper → merge.
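The shape of such a script might look like the sketch below. Nothing here reflects the actual SplitBySpeakers API, which this article does not describe: the upload-and-download step is a caller-supplied function (`split_speakers`, a hypothetical name), as are the transcription and merge steps, so the skeleton only fixes the order of operations.

```python
def process_recordings(recordings, split_speakers, transcribe, merge):
    """Run each recording through split -> transcribe -> merge.

    All three steps are supplied by the caller (hypothetical signatures):
      split_speakers(recording) -> {speaker_label: track_path}
      transcribe(track_path)    -> path to that track's Whisper JSON
      merge({speaker_label: json_path}) -> list of labeled segments
    Returns {recording: merged_segments}.
    """
    results = {}
    for recording in recordings:
        tracks = split_speakers(recording)
        json_files = {
            label: transcribe(track_path)
            for label, track_path in tracks.items()
        }
        results[str(recording)] = merge(json_files)
    return results
```

Keeping the steps as separate functions means each one can be retried or swapped without touching the others, for example if a single transcription fails partway through a batch.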
Alternative Approaches
whisperX
Open-source project that adds diarization directly to Whisper. Good for technical users comfortable with Python.
pyannote + Whisper
Use pyannote.audio for diarization, then transcribe segments with Whisper. More setup but highly customizable.
AssemblyAI / Deepgram
Commercial APIs that offer combined transcription + diarization. Easier to use but more expensive.
The SplitBySpeakers + Whisper approach gives you the best of both worlds: state-of-the-art transcription accuracy from Whisper, with clean speaker separation you can verify and label before transcription.