How to Use Speaker Diarization with OpenAI Whisper | Integration Guide

The Problem with Whisper Alone

OpenAI's Whisper is one of the best transcription models available. But it has one major limitation: it doesn't identify speakers. A meeting transcript looks like one continuous block of text with no indication of who said what.

// Whisper output (no speaker labels)

"Let's review the Q4 targets. We're currently at 85% of target. The new campaign should help close the gap. Great. Can you send the updated forecast by Friday?"

The Solution: Pre-process with Diarization

By separating speakers before running Whisper, you can transcribe each speaker's audio independently, then combine the results with proper attribution.

// Result: Labeled transcript

[Sarah]: "Let's review the Q4 targets."

[Mike]: "We're currently at 85% of target."

[Lisa]: "The new campaign should help close the gap."

[Sarah]: "Great. Can you send the updated forecast by Friday?"

Step-by-Step Workflow

Upload to SplitBySpeakers

Upload your audio file. Our AI identifies unique speakers and creates separate audio tracks for each.

meeting.mp3 → SplitBySpeakers

↓

speaker_1.mp3, speaker_2.mp3, speaker_3.mp3

Identify Each Speaker

Listen to a few seconds of each track to identify the speaker. Rename the files accordingly.

speaker_1.mp3 → sarah_ceo.mp3

speaker_2.mp3 → mike_sales.mp3

speaker_3.mp3 → lisa_marketing.mp3
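If there are many tracks, the renaming step can be scripted. The following is a minimal sketch rather than part of SplitBySpeakers itself: the `RENAMES` mapping and the `rename_tracks` helper are hypothetical names, and the mapping has to be filled in by hand after you have listened to each track.

```python
from pathlib import Path

# Hypothetical mapping from generated track names to the labels chosen
# after listening to each track.
RENAMES = {
    "speaker_1.mp3": "sarah_ceo.mp3",
    "speaker_2.mp3": "mike_sales.mp3",
    "speaker_3.mp3": "lisa_marketing.mp3",
}

def rename_tracks(track_dir: Path, renames: dict[str, str]) -> list[Path]:
    """Rename tracks inside track_dir; skip missing files, never overwrite."""
    renamed = []
    for old_name, new_name in renames.items():
        source = track_dir / old_name
        target = track_dir / new_name
        if not source.exists():
            continue  # this track was not produced (fewer speakers than expected)
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        source.rename(target)
        renamed.append(target)
    return renamed
```

Calling `rename_tracks(Path("tracks"), RENAMES)` returns the new paths, which can be passed straight to the transcription step.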

Run Whisper on Each Track

Transcribe each speaker's audio separately using Whisper. This ensures each transcript is linked to the correct speaker.

# Using the open-source Whisper command-line tool (pip install openai-whisper)
whisper sarah_ceo.mp3 --output_format json
whisper mike_sales.mp3 --output_format json
whisper lisa_marketing.mp3 --output_format json
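The same step can also be driven from Python. This is a sketch that assumes the open-source `openai-whisper` package (which needs ffmpeg installed); the helper names `load_whisper_transcriber` and `transcribe_tracks` are illustrative, and the `transcribe_fn` parameter exists only so that a different transcription backend can be swapped in.

```python
import json
from pathlib import Path
from typing import Callable

def load_whisper_transcriber(model_name: str = "base") -> Callable[[str], dict]:
    """Return a function that transcribes one audio file with openai-whisper."""
    import whisper  # imported lazily so the rest of the module works without it

    model = whisper.load_model(model_name)
    return lambda path: model.transcribe(path)

def transcribe_tracks(
    tracks: list[Path], transcribe_fn: Callable[[str], dict]
) -> list[Path]:
    """Transcribe each track and save the result as <track name>.json beside it."""
    outputs = []
    for track in tracks:
        result = transcribe_fn(str(track))  # dict with "text" and "segments"
        out_path = track.with_suffix(".json")
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(result, f, ensure_ascii=False, indent=2)
        outputs.append(out_path)
    return outputs
```

A typical call would be `transcribe_tracks([Path("sarah_ceo.mp3")], load_whisper_transcriber())`. Loading the model once and reusing it for every track avoids paying the model load time per file.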

Merge with Timestamps

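Whisper's JSON output includes start and end timestamps for every segment. Use them to interleave the per-speaker transcripts in chronological order and attach speaker labels. This only works if each speaker track keeps the original recording's timeline (with silence where the other speakers talk), so that timestamps from different tracks are comparable.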

Python Example

Here's a basic Python script to merge Whisper transcripts with speaker labels:

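import json

def merge_transcripts(speaker_files: dict[str, str]) -> list[dict]:
    """
    Merge per-speaker Whisper JSON files into one chronological transcript.

    speaker_files: {"Sarah": "sarah.json", "Mike": "mike.json"}
    Returns a list of segments sorted by start time, each with a speaker label.
    """
    all_segments = []

    for speaker, filepath in speaker_files.items():
        # Whisper's JSON output stores the timed segments under "segments"
        with open(filepath, encoding="utf-8") as f:
            data = json.load(f)

        for segment in data["segments"]:
            text = segment["text"].strip()
            if not text:
                continue  # skip segments that contain no text
            all_segments.append({
                "speaker": speaker,
                "start": segment["start"],
                "end": segment["end"],
                "text": text
            })

    # Sort by start time; ties are broken by end time
    return sorted(all_segments, key=lambda x: (x["start"], x["end"]))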

# Usage
speakers = {
    "Sarah (CEO)": "sarah_ceo.json",
    "Mike (Sales)": "mike_sales.json",
    "Lisa (Marketing)": "lisa_marketing.json"
}

transcript = merge_transcripts(speakers)

for seg in transcript:
    print(f"[{seg['speaker']}]: {seg['text']}")
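Whisper often splits one person's turn into several short segments, so a merged transcript can repeat the same speaker label many times in a row. The sketch below shows one possible way to collapse consecutive segments from the same speaker and to print a timestamp per turn. The helper names and the sample data are illustrative, not taken from a real recording.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS, or H:MM:SS once past one hour."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"

def group_turns(segments: list[dict]) -> list[dict]:
    """Collapse consecutive segments by the same speaker into one turn."""
    turns: list[dict] = []
    for seg in segments:
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            turns[-1]["end"] = max(turns[-1]["end"], seg["end"])
            turns[-1]["text"] += " " + seg["text"]
        else:
            turns.append(dict(seg))  # copy so the input list is not mutated
    return turns

# Made-up segments in the same shape as the merged transcript
sample = [
    {"speaker": "Sarah", "start": 0.0, "end": 2.5, "text": "Great."},
    {"speaker": "Sarah", "start": 2.5, "end": 6.0,
     "text": "Can you send the updated forecast by Friday?"},
    {"speaker": "Mike", "start": 6.4, "end": 7.1, "text": "Will do."},
]

for turn in group_turns(sample):
    print(f"[{format_timestamp(turn['start'])}] [{turn['speaker']}]: {turn['text']}")
```

The input must already be sorted by start time, as the merged transcript is.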

Pro Tips

Use Whisper's JSON Output

JSON format includes timestamps for each segment, making it easy to merge transcripts chronologically.

Higher Quality = Better Results

Both diarization and transcription work better with high-quality source audio. Use a lossless format such as WAV or FLAC when possible.

Handle Overlapping Speech

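When speakers talk over each other, each track is transcribed separately, so the merged transcript contains segments from different speakers that cover the same time range. Compare start and end timestamps to detect these segments and mark them as overlapping in your output.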
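One way to detect overlapping speech is to compare each segment's time range with the segments that start before it ends. The sketch below assumes segments shaped like those in the merge script (`speaker`, `start`, `end`, `text`); the `mark_overlaps` helper and the `overlaps` field are hypothetical names.

```python
def mark_overlaps(segments: list[dict]) -> list[dict]:
    """Return time-sorted copies of the segments with an 'overlaps' flag.

    A segment is flagged when it intersects in time with a segment
    from a different speaker.
    """
    ordered = sorted(segments, key=lambda s: s["start"])
    marked = [dict(s, overlaps=False) for s in ordered]
    for i, current in enumerate(marked):
        for later in marked[i + 1:]:
            if later["start"] >= current["end"]:
                break  # sorted by start, so no later segment can intersect
            if later["speaker"] != current["speaker"]:
                current["overlaps"] = True
                later["overlaps"] = True
    return marked
```

When printing, a flagged segment could carry a marker such as `(overlapping)` after the speaker label.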

Batch Processing

For multiple recordings, script the entire pipeline: upload to SplitBySpeakers API → download tracks → run Whisper → merge.
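The transcription part of such a pipeline could be sketched as follows. The upload and download steps are left out because the SplitBySpeakers API is not described here. The directory layout (`<root>/<recording>/<speaker>.<ext>`) and the `build_whisper_commands` helper are assumptions made for illustration; the command-line flags are those of the open-source Whisper CLI.

```python
from pathlib import Path

AUDIO_SUFFIXES = {".mp3", ".wav", ".flac", ".m4a"}

def build_whisper_commands(root: Path) -> list[list[str]]:
    """Build one whisper CLI command per speaker track found under root.

    Assumed layout: root/<recording>/<speaker>.<ext>.
    The JSON output is written next to each track.
    """
    commands = []
    for recording_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for track in sorted(recording_dir.iterdir()):
            if track.suffix.lower() not in AUDIO_SUFFIXES:
                continue  # ignore notes, existing transcripts, and so on
            commands.append([
                "whisper", str(track),
                "--output_format", "json",
                "--output_dir", str(recording_dir),
            ])
    return commands
```

Each command can then be executed with `subprocess.run(cmd, check=True)`. Building the list first makes it easy to review what will run before any audio is processed.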

Alternative Approaches

whisperX

Open-source project that adds diarization directly to Whisper. Good for technical users comfortable with Python.

pyannote + Whisper

Use pyannote.audio for diarization, then transcribe segments with Whisper. More setup but highly customizable.

AssemblyAI / Deepgram

Commercial APIs that offer combined transcription + diarization. Easier to use but more expensive.

The SplitBySpeakers + Whisper approach gives you the best of both worlds: state-of-the-art transcription accuracy from Whisper, with clean speaker separation you can verify and label before transcription.
