Speaker separation playbook

How to separate speakers in one audio file

You recorded two, three, maybe four people, and it all landed on a single mixed track. Here is the honest, practical way to pull each voice back out into its own clean file, what actually works, and where the shortcuts quietly waste your afternoon.

18 min read · Updated July 2026

Why everyone ends up on one track

Almost nobody plans to end up with a single mixed recording of a multi-person conversation. It just happens. You sit down to record a remote interview, you hit the button in Zoom or Google Meet or a phone call recorder, and forty minutes later you have one audio file with two or three voices baked together. The moment you try to edit, you realize the problem: you cannot turn one person down without turning the other person down too. They share the same waveform.
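To make "they share the same waveform" concrete, here is a minimal Python sketch. The sample values and the names host, guest, and apply_gain are invented for illustration; real audio has tens of thousands of samples per second, but the arithmetic is the same.

```python
# Two voices as tiny lists of sample values (hypothetical numbers).
host = [0.5, -0.25, 0.0, 0.25]
guest = [0.25, 0.25, -0.5, 0.0]

# A mono mixdown adds the voices together, sample by sample.
mix = [h + g for h, g in zip(host, guest)]


def apply_gain(samples, gain):
    """Scale every sample by the same factor, as a volume fader does."""
    return [s * gain for s in samples]


# Turning the mix down by half...
quieter_mix = apply_gain(mix, 0.5)

# ...is exactly the same as turning BOTH voices down by half.
both_halved = [
    h + g for h, g in zip(apply_gain(host, 0.5), apply_gain(guest, 0.5))
]
assert quieter_mix == both_halved
```

Once only mix exists, there is no fader that reaches host without also reaching guest.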

The most common culprit is the single-device recording. One microphone in a room, or one phone on the table, captures everyone at once. Remote calls are worse in a subtle way: even though each person is on their own connection, most default recorders mix everyone down to a single mono or stereo export before you ever get the file. Zoom can record separate audio files per participant, but that setting is off by default and buried in preferences, so the vast majority of recordings arrive pre-mixed.

You might reasonably ask: why not just always record multitrack and skip this entire problem? Fair point, and if you can, you absolutely should. A local double-ender, where each participant records their own side, is the gold standard and no AI beats clean isolated source recordings. But real life gets in the way. The guest forgets to start their local recorder. The double-ender file corrupts. You are handed a phone memo from a client who will never install recording software. You are working with archive audio that was captured years ago. In every one of those cases the multitrack ship has sailed, and you are left with the single file you actually have. That is the situation this playbook is built for, and it is far more common than the tidy multitrack world people assume everyone lives in.

The good news is that the file you have is usually good enough to work with. If two people are mostly taking turns and the recording is reasonably clear, you can get each voice onto its own track. The rest of this guide is about how.

What ‘separating speakers’ actually means

This phrase gets used loosely, and the looseness causes people to buy the wrong tool. There are three genuinely different things hiding under “separate the speakers,” and they produce completely different outputs. Getting the mental model right up front saves you hours.

1. Speaker diarization = labelling who spoke when

Diarization answers the question “who is talking at each moment?” It draws a timeline that says Speaker A from 0:00 to 0:12, Speaker B from 0:12 to 0:30, and so on. Critically, it does not touch the audio itself. You still have one mixed file; you just have labels describing it. This is what you get from most transcription engines, and if you want the underlying theory, we wrote a plain-language explainer on what speaker diarization is and a more technical piece on speaker diarization with Whisper.
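As a rough picture of what a diarization result contains, here is a small Python sketch. The segment times are made up and the talk_time helper is hypothetical; real engines each have their own output format.

```python
from collections import defaultdict

# A diarization result is only a timeline of labels (times in seconds).
# These segments are invented illustration data.
segments = [
    ("Speaker A", 0.0, 12.0),
    ("Speaker B", 12.0, 30.0),
    ("Speaker A", 30.0, 41.0),
]


def talk_time(segments):
    """Total seconds attributed to each speaker label."""
    totals = defaultdict(float)
    for speaker, start, end in segments:
        totals[speaker] += end - start
    return dict(totals)


print(talk_time(segments))  # {'Speaker A': 23.0, 'Speaker B': 18.0}
```

Notice that nothing in this structure holds audio. You can count who talked for how long, but there is still only one mixed file to edit.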

2. Source separation = isolating the actual audio

Source separation is the one people usually mean when they are frustrated in an editor. It takes the mixed recording and reconstructs a separate audio stream for each voice, so you end up with one real, playable, isolated file per speaker plus, often, the background music on its own. This is what SplitBySpeakers does. Not labels on a timeline, but genuine isolated audio you can drag into any editor and treat like a multitrack session that never existed. If you want the engineering underneath it, here is how AI audio separation works.

3. Transcript speaker labels = text only

The third thing is purely textual: a transcript that tags each paragraph with a speaker name. Useful for reading and searching, useless if what you needed was audio you can level and edit. The trap is that all three of these get marketed with the same words. If your goal is to edit each voice independently, you need source separation, and only source separation gives you files. Keep that distinction in your head for the rest of this guide, because the tool advice later depends entirely on it.

The manual methods (and why they mostly fail)

Before AI separation existed, editors developed a set of manual tricks to fake speaker separation out of a mono mix. They are worth understanding, partly because you will still see them recommended in old forum threads, and partly because knowing why they fail explains why the automatic approach is such a jump.

Clip-splitting by hand in Audacity

The most literal method is to open the file in Audacity, listen through, and cut the waveform every time the speaker changes, then move each chunk onto its own track. For a two-person interview with clean turn-taking this technically works, but it is brutally slow, roughly real-time plus overhead, and it collapses the instant both people talk at once, because a single overlapping clip cannot belong to two tracks. You end up arbitrating every crosstalk moment by hand.
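The hand-cutting method can be sketched in a few lines of Python, which also shows where it breaks. This is an illustration, not how Audacity works internally; split_by_labels and the toy data are hypothetical.

```python
def split_by_labels(samples, sample_rate, segments):
    """Copy each labelled span of a mono mix onto that speaker's track.

    segments: (speaker, start_seconds, end_seconds) tuples, i.e. the cut
    points you would mark by ear. Everything outside a speaker's spans
    stays silent (0.0) on that speaker's track.
    """
    tracks = {}
    for speaker, start, end in segments:
        track = tracks.setdefault(speaker, [0.0] * len(samples))
        first = int(start * sample_rate)
        last = min(int(end * sample_rate), len(samples))
        track[first:last] = samples[first:last]
    return tracks


# Toy data: 8 samples at 4 samples per second (real audio uses 44100 or so).
mix = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
cuts = [("host", 0.0, 1.25), ("guest", 1.0, 2.0)]  # 0.25 s of crosstalk
tracks = split_by_labels(mix, 4, cuts)

# The overlapping sample is copied to both tracks unchanged, so each
# track still contains both voices at that moment.
assert tracks["host"][4] == tracks["guest"][4] == mix[4]
```

Cutting only moves chunks of the mix around. During overlap, the chunk you move still has both people in it.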

EQ carving, noise gates, and panning

The more “clever” tricks try to use signal processing. EQ carving assumes each voice lives in a different frequency band, so you boost one range and cut another. The problem is that two human voices overlap enormously in frequency; carving one out guts the other and leaves both sounding thin. Noise gates are pitched as a way to silence the “other” speaker when the target is quiet, but a gate cannot tell voices apart; it only reacts to volume, so it chops words and does nothing during overlap. Panning tricks only help if the speakers were already recorded to different stereo positions, which on a mono file they were not.
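The gate's blind spot is easy to see in code. Below is a deliberately simplified Python sketch: a real gate follows a smoothed envelope with attack and release times, while this one decides per sample, and all the levels are invented.

```python
def noise_gate(samples, threshold):
    """Mute any sample whose absolute level is below the threshold.

    Simplified on purpose: the only thing the gate ever looks at is level.
    """
    return [s if abs(s) >= threshold else 0.0 for s in samples]


# Hypothetical levels: a loud target voice, a quiet second voice, and a
# stretch where both talk at once (their samples summed in the mix).
target_alone = [0.6, -0.7, 0.65]
other_alone = [0.2, -0.15, 0.1]
overlap = [0.8, -0.85, 0.75]

mix = target_alone + other_alone + overlap
gated = noise_gate(mix, threshold=0.3)

print(gated[3:6])  # [0.0, 0.0, 0.0]  the quiet voice alone is muted
print(gated[6:])   # [0.8, -0.85, 0.75]  overlap passes through untouched
```

The gate only works here because the second voice happens to be quieter. Swap the levels and it would mute the wrong person, and in the overlap it never removes anyone.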

Here is the honest version of the counter-argument: manual methods are not worthless. On a pristine recording of two people who never interrupt each other, patient clip-splitting produces a real result and costs you nothing but time. But that is a narrow case, and the effort scales linearly with length while the quality falls off a cliff the moment reality intrudes with overlap, laughter, or a third voice. For anything longer than a few minutes, the manual route is a false economy. This is the anti-pattern section for a reason: recognize these techniques so you can decide, deliberately, not to spend your evening on them.

The automatic approach: AI speaker separation

The automatic approach replaces all of that manual labor with a trained model. You upload the mixed file, the AI analyzes the whole recording, learns the sonic fingerprint of each distinct voice, and reconstructs a separate isolated track for every speaker it finds, plus the background music on its own if there is any. You download one clean file per person. There is no clip-cutting, no EQ guesswork, and no gate chopping words in half.

Mechanically, the flow is simple: upload, wait, download. A typical recording finishes in around two minutes. That speed comes from the fact that this is a batch process, not a live one. The model needs the whole file to do its best work, because it uses context from the entire recording to keep each speaker consistent from start to finish. That is the honest trade-off worth stating plainly: this is not real-time. You cannot point it at a live meeting and get separated streams as people talk. If live separation is what you need, this is the wrong tool. For recorded files, though, batch is a feature, not a compromise, because it is what allows the quality to hold across a long conversation.

Practically, the tool accepts MP3, WAV, and M4A audio, as well as video files, up to 100MB, and it handles up to roughly five speakers in a single recording. If you want the deeper editorial walkthrough of the same idea, our companion article on how to separate speakers in audio covers the same ground with more examples.
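If you handle many recordings, a quick pre-flight check against those limits can save a failed upload. The Python sketch below is not part of SplitBySpeakers. The function name, the choice of video extensions, and reading "100MB" as decimal megabytes are all assumptions.

```python
import os

MAX_BYTES = 100 * 1000 * 1000  # assumption: "100MB" read as decimal megabytes
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a"}  # the audio formats named above
VIDEO_EXTENSIONS = {".mp4", ".mov"}  # assumption: the guide only says "video"


def upload_problems(filename, size_bytes):
    """Return reasons the file may be rejected (an empty list if none found)."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in AUDIO_EXTENSIONS | VIDEO_EXTENSIONS:
        problems.append(f"unexpected extension: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(
            f"file is {size_bytes / 1e6:.1f} MB, over the 100 MB limit"
        )
    return problems


print(upload_problems("interview.wav", 50_000_000))  # []
```

For a file on disk, you would pass os.path.getsize(path) as size_bytes.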

The obvious counter-argument: AI is not magic, and the output is not always studio-perfect. On difficult source material you can hear faint bleed or slight artifacts. We are not going to pretend otherwise. But for the overwhelmingly common case, a couple of people having a mostly-orderly conversation, the result is clean enough to edit, level, and publish, and it takes two minutes instead of two hours.

Step by step: separating a two-person recording

Here is the concrete sequence for the most common job, a two-person interview or conversation. Each step names what “done” looks like, and an “if you’re behind” branch for when it does not go cleanly.

1. Confirm and prep the source file

Done looks like: a single file under 100MB in MP3, WAV, M4A, or video format, that plays start to finish without dropouts. You can hear both voices clearly, even if they share the track.

If you're behind: if the file is over 100MB, export a lower-bitrate MP3 or trim dead air off the ends first. If a whole stretch is inaudible, note the timecode now, because no tool recovers a voice that was never captured.

2. Upload to SplitBySpeakers

Done looks like: the file is uploaded and processing has started. You do not need to configure anything; the model detects the number of speakers on its own.

If you're behind: an upload that stalls is almost always the file size or a flaky connection. Re-export smaller and retry on a stable network rather than fighting the upload.

3. Wait for processing (~2 minutes)

Done looks like: the job finishes and you see one downloadable track per detected speaker, plus a background-music track if the recording had music.

If you're behind: a long recording takes proportionally longer, so a two-minute estimate can stretch on a full episode. Let it run; it is faster than any manual method even at its slowest.

4. Spot-check each isolated track

Done looks like: you play 20-30 seconds of each track and hear one voice clearly dominant, with the other voice pushed far down or gone.

If you're behind: if you hear noticeable bleed, jump to the hard-cases section. Often the fix is accepting a slightly imperfect track and cleaning the worst moment by hand, rather than re-running.

5. Download and name your tracks

Done looks like: separate files saved with clear names like host.wav and guest.wav, ready to drop into your editor as independent tracks.

If you're behind: if you cannot tell which file is which, play the first line of each; the person who speaks first is your reference. Rename immediately so you never re-guess.

That entire loop, prep through download, is usually under ten minutes of your attention for a normal interview, and most of that is waiting. Compared with hand-cutting the same file in Audacity, it is not a close contest.
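For the over-100MB branch in step 1, the arithmetic behind "export a lower-bitrate MP3" is simple: file size is roughly bitrate times duration. The Python sketch below assumes constant-bitrate MP3, ignores the small overhead of tags and headers, and reads 100MB as 100,000,000 bytes. The function names are hypothetical.

```python
# A selection of standard MP3 bitrates, highest first (kilobits per second).
COMMON_MP3_KBPS = [320, 256, 192, 160, 128, 96, 64]


def estimated_size_bytes(duration_seconds, kbps):
    """Approximate size of a constant-bitrate file: bits per second / 8."""
    return duration_seconds * kbps * 1000 / 8


def highest_bitrate_that_fits(duration_seconds, limit_bytes=100_000_000):
    """Highest listed bitrate whose estimated size is within the limit."""
    for kbps in COMMON_MP3_KBPS:
        if estimated_size_bytes(duration_seconds, kbps) <= limit_bytes:
            return kbps
    return None  # too long even at 64 kbps: trim or split the file instead


print(highest_bitrate_that_fits(40 * 60))  # 320  (a 40-minute interview)
print(highest_bitrate_that_fits(90 * 60))  # 128  (a 90-minute episode)
```

Leave yourself some margin in practice, since real encoders add a little overhead on top of this estimate.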

Handling crosstalk, overlap, music, and 3+ speakers

The clean two-person case is the easy one. Real conversations get messier, and it is worth being straight about how each kind of mess behaves. Crosstalk and overlap, where two people talk at literally the same time, is the single hardest thing for any separation system, human or machine. When two voices are perfectly simultaneous, the information is physically entangled in the waveform, and the model has to make a judgement call. Brief overlaps, the normal “mm-hm” and the occasional interruption, separate well. Long stretches of both people talking over each other are where you will hear the most bleed.
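If you already have who-spoke-when labels for a recording, for example from a transcript, you can roughly estimate how much of it is true overlap before you judge the result. Here is a minimal Python sketch with invented turn times. It assumes each speaker's own turns do not overlap one another.

```python
def overlap_seconds(turns_a, turns_b):
    """Total time during which both speakers are marked as talking.

    Each argument is a list of (start_seconds, end_seconds) turns.
    """
    total = 0.0
    for a_start, a_end in turns_a:
        for b_start, b_end in turns_b:
            # Length of the intersection of the two turns, or 0 if disjoint.
            total += max(0.0, min(a_end, b_end) - max(a_start, b_start))
    return total


# Hypothetical 41-second exchange with two short interruptions.
host_turns = [(0.0, 12.0), (29.0, 41.0)]
guest_turns = [(11.5, 30.0)]

print(overlap_seconds(host_turns, guest_turns))  # 1.5
```

A second and a half of overlap in a 41-second exchange is the "brief interruption" case. A figure that approaches the length of the recording is the hard case described above.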

Background music is actually the friendlier case, because music sounds nothing like speech, so it lands cleanly on its own track and leaves the voices clearer. Recordings with three or more speakers work too, up to about five, but expect the difficulty to rise with each added voice, especially if two of them have similar timbre. Two deep male voices of similar pitch are harder to tell apart than a contrasting pair.

Vignette: the four-person roundtable

The scenario. A podcast producer inherited a lively four-person roundtable recorded on a single room mic. People interrupted constantly, laughed over each other, and two of the guests had noticeably similar voices. She needed each panelist isolated so she could ride levels, because one guest was twice as loud as the rest.

What surprised her. She expected the overlap to ruin everything. Instead, the separated tracks were clean during the 90 percent of the show where people took turns, and only got muddy in the genuinely simultaneous laughter. The two similar voices, which she was sure would merge, stayed mostly distinct.

The lesson. Judge separation by the parts you actually need to edit, not by the worst two seconds. She rode levels on the clean stretches and simply left the group laughter as-is, because nobody edits a laugh. The hard moments were the moments that did not matter.

What to do with the separated tracks

Getting isolated tracks is the means, not the end. The whole point is what you can now do that you could not do with the mixed file. First, you can edit each voice independently. If your guest mumbled and your host boomed, you can raise one and tame the other without touching the balance of the whole recording. Second, you can process them separately: de-noise the track that was recorded in a bad room, add compression to the quiet speaker, gate only the track that has hum. None of that was possible when both voices shared one waveform.

Leveling is usually the biggest win. Drop each track into your editor, set each speaker to a consistent loudness, and the conversation suddenly sounds professionally balanced. From there you hand off: the clean per-speaker audio makes transcription and captions far more accurate, since the transcriber is no longer guessing through crosstalk. This is the natural bridge into a full podcast editing pass or a targeted interview cleanup. If your next step is a polished episode end to end, the follow-on playbook on how to clean up your podcast audio picks up exactly here.
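Your editor does the leveling for you, but the idea underneath is simple enough to sketch in Python. This example uses plain RMS level as a stand-in for loudness; editors typically offer a LUFS-based loudness measure, which also weights frequencies. The -20 dB target and the sample values are arbitrary choices for illustration.

```python
import math


def rms_dbfs(samples):
    """RMS level in dB relative to full scale (1.0), for float samples."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(mean_square) if mean_square > 0 else float("-inf")


def level(samples, target_dbfs=-20.0):
    """Apply one constant gain so the track's RMS level hits the target."""
    current = rms_dbfs(samples)
    if current == float("-inf"):
        return list(samples)  # a silent track has nothing to level
    gain = 10 ** ((target_dbfs - current) / 20)
    return [s * gain for s in samples]


# Hypothetical separated tracks: a booming host and a faint guest.
host = [0.5, -0.5, 0.5, -0.5]
guest = [0.05, -0.05, 0.05, -0.05]

# After leveling, both tracks sit at the same RMS level.
for track in (level(host), level(guest)):
    assert abs(rms_dbfs(track) - (-20.0)) < 1e-9
```

The point is that each track gets its own gain. On the mixed file there was only one gain to set, so the gap between the two voices could never close.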

Vignette: the unbalanced client interview

The scenario. A freelance editor got a single-file Zoom recording of a founder interview. The founder was loud and close to the mic; the interviewer was faint and echoey. On the mixed file, every attempt to raise the interviewer also raised the founder into clipping.

What surprised him. Once separated, he could de-noise and compress the interviewer’s track alone without any of that processing touching the founder. The echo, which he assumed was permanent, dropped noticeably once it was on an isolated track he could treat directly.

The lesson. Separation is not just about volume; it unlocks per-voice processing. The moment each speaker is independent, every fix you know how to do becomes available again. He delivered a balanced interview from a recording he would previously have called unusable.

Tools: what to use, and what to avoid

The fastest way to waste an afternoon is to reach for a tool built for a different job. More people fail at this by picking wrong than by executing wrong, so here is the honest map, starting with what to avoid and why.

Avoid: vocal remover and karaoke tools

Tools like Lalal.ai and Moises are built to split music into stems, vocals versus instruments. They are excellent at that and useless for your job, because they think in “vocals” as one bucket. They will happily merge two speakers into a single “vocal” track, which is the exact opposite of what you want.

Avoid: real-time meeting bots

Live meeting assistants that join a call and transcribe on the fly cannot help with a file you already have. They also do not hand you isolated audio; they hand you text. If the conversation already happened and lives in a file, a real-time bot has nothing to offer.

Avoid: pure transcription tools and blanket noise plugins

Transcription services such as Otter.ai and its alternatives only label who spoke in the text; they never give you separated audio. And blanket noise-reduction plugins reduce noise across the whole mix; they do not know one voice from another, so they cannot separate speakers at all.

So what should you use? For turning one mixed file into isolated per-speaker tracks, use a purpose-built source separation tool. That is what SplitBySpeakers is for. It is deliberately not a full editor like Descript or Audacity, and if you want a timeline and text-based editing, our Descript alternatives rundown is more useful. And if you are genuinely torn between an all-in-one editor and a transcription-first tool, the Descript vs Otter comparison lays out the trade-off. The honest positioning: use the right narrow tool for separation, then take the clean tracks into whatever editor you already like.

You can see plans and what is included on the pricing page.

A 20-minute workflow

Here is a repeatable operational checklist for turning a raw single-file recording into balanced, per-speaker audio ready for your editor. Each step has a “done” marker and an “if you’re behind” branch so you never stall.

Minutes 0-3: prep and upload

Done: source file is under 100MB, trimmed of obvious dead air, and uploaded.

Behind: export a smaller MP3 and skip trimming for now; you can tidy the ends after separation.

Minutes 3-5: let it process

Done: one isolated track per speaker (plus music) is ready to download.

Behind: on a long file, start reviewing your edit notes while it runs instead of watching the bar.

Minutes 5-10: spot-check and download

Done: each track auditioned for 20-30 seconds, confirmed one dominant voice, files named clearly.

Behind: if one track has bleed, download it anyway and flag the timecodes to fix by hand later.

Minutes 10-16: level in your editor

Done: each speaker set to a consistent loudness so no one is noticeably louder than the rest.

Behind: normalize each track to a target level as a fast first pass, then fine-tune only the outlier.

Minutes 16-20: process and hand off

Done: per-voice de-noise or compression applied where needed, tracks exported or sent to transcription.

Behind: skip optional processing and ship the leveled version; balanced beats perfect on a deadline.

Twenty minutes is a realistic target for a normal two-person interview once you have done it a few times. A messy four-person show takes longer in the leveling stage, but the separation itself is the same hands-off upload-and-wait step regardless. The workflow scales with editing ambition, not with the separation.

When this won’t work (and what to do instead)

No tool is right for every situation, and pretending otherwise just wastes your time and ours. Here are the cases where speaker separation will disappoint you, and what to reach for instead.

  • Heavily overlapping simultaneous speech. If your recording is mostly people talking over each other at once, the underlying audio is too entangled to fully separate. You will get partial results at best. The real fix is upstream: record multitrack next time so overlap never shares a waveform.

  • Very low-quality sources. Heavily compressed, distorted, or barely-audible recordings do not contain enough clean signal to reconstruct. If a human can barely tell the voices apart, the model will struggle too.

  • More than about five speakers. A large panel or a crowded room pushes past what the tool reliably handles. For a big group, separate the segments that matter, or go back to the source and record participants individually.

  • Music production and karaoke. If you want to pull a lead vocal off a song or build an instrumental, that is music stem separation, a different job for a different tool. This is for conversations, not songs.

The honest bottom line: speaker separation is a sharp tool for a specific, common problem. You have one mixed recording of a handful of people mostly taking turns, and you need it as isolated tracks. Inside that box it is genuinely excellent and saves hours. Outside it, be willing to use something else. Knowing the edge of the tool is part of using it well.

Turn one mixed file into clean per-speaker tracks

Upload your recording and get one isolated track per speaker in about two minutes. Plans from $19/mo, money-back guarantee.
