What ‘clean’ actually means for a podcast
Before you touch a single fader, it helps to agree on what ‘clean’ even means, because most people who ask how to clean up podcast audio are actually chasing three different goals at once and confusing themselves in the process. Clean audio is first and foremost intelligible: every word from every speaker is easy to catch on cheap earbuds in a noisy train, not just on your studio monitors. Second, it is balanced: the quiet guest and the loud host sit at roughly the same perceived volume, so listeners never lunge for the volume knob. Third, it is free of distraction: no refrigerator hum, no chair squeaks that pull focus, no crosstalk where two voices smear into mush.
Notice what is not on that list. Clean does not mean sterile. It does not mean every breath is surgically removed, every room reflection erased, every voice flattened into the same processed sheen you hear on over-produced true-crime shows. That is the trap beginners fall into: they equate ‘more processing’ with ‘more professional,’ and they end up with audio that sounds like it was recorded inside a tin can and then run through a phone filter. The honest counter-argument here is that some genres genuinely want a hyper-polished, larger-than-life sound, and if that is your brand, fine. But for the overwhelming majority of conversational, interview, and solo shows, the goal is transparency. The listener should never think about the audio at all.
A useful mental test: play a 30-second clip to someone who is not a podcaster and ask what they noticed. If they say ‘the hum’ or ‘the guy on the left was too quiet,’ you have real problems worth fixing. If they say ‘it sounded a bit robotic,’ you have over-processed and need to back off. If they just react to what was said, you are done. That last case is the target, and holding it in your head stops you from grinding on inaudible flaws that no listener will ever notice. For a broader view of where cleanup fits into a release workflow, our podcast post-production tips walk through the full pipeline from raw file to published episode.
The order of operations that actually matters
Audio cleanup is one of those crafts where the sequence matters as much as the tools. Do the right steps in the wrong order and you will spend twice the time fighting problems you created yourself. The order that consistently works is: separate speakers, balance levels, de-noise, EQ, compress, then set final loudness. Read that as a mental model, not a rigid law, but understand why each step earns its place before the next.
Separation comes first because almost every later move is easier, safer, and more precise when each voice lives on its own track. Balancing comes second because there is no point de-noising a track you are about to turn up 9 dB, which would also turn its noise floor up 9 dB. De-noising comes before EQ because noise reduction changes the tonal balance, and you want to shape the clean result, not the noisy one. EQ comes before compression because a compressor reacts to whatever frequencies are loudest, so if you have a low-end rumble you have not removed yet, the compressor will duck the whole voice every time that rumble spikes. Compression comes before final loudness because loudness normalization is just a gain move applied to the finished, controlled signal.
The honest caveat: professionals break this order all the time. A seasoned engineer might EQ before de-noising because they know their room, or compress in two gentle stages around the noise reduction. That is fine when you understand the trade-offs. But if you are asking how to clean up podcast audio, you almost certainly benefit more from a reliable default than from clever exceptions. Run the steps in order for a dozen episodes, learn what each one actually does to your sound, and only then start bending the sequence. The rest of this guide walks each step in that exact order, and if you want to go deeper on the editorial side, our podcast editing best practices cover the cut-and-arrange decisions that sit alongside the technical cleanup.
Step 1: separate the speakers first
Here is the single decision that unlocks every other step: get each speaker onto their own isolated track. If you recorded a multitrack session where every mic already lives on a separate channel, you are ahead of the game. But the reality for most shows (remote interviews recorded on one call, in-person chats captured on a single field recorder, salvaged Zoom audio) is that everyone is glued together in one mixed file. That single file is where cleanup goes to die, because any adjustment you make hits every voice at once.
This is exactly where SplitBySpeakers does one specific job well: you upload the mixed audio or video file, and it uses AI to separate the individual speakers, and any background music, into clean isolated tracks, one file per speaker. It is automatic and upload-based rather than a live or real-time tool, and a typical file comes back in around two minutes. It accepts MP3, WAV, M4A and video up to 100 MB, and it handles up to roughly five speakers. Be clear about what it is not: it is not a full DAW, it will not level, EQ, or compress for you, and it is not a transcription service. It does the separation step, and then you take those clean stems into your own editor for everything that follows. If you want the deeper walkthrough of that specific task, the sibling playbook on how to separate speakers in one audio file covers the upload flow in detail.
Why does this unlock everything else? Because once each voice is alone, you can raise the timid guest without raising the host, de-noise a hissy laptop mic without touching the clean studio mic, and gate one track’s chair squeaks without silencing the other person mid-sentence. The fair counter-argument is that AI separation is not perfect: on heavy overlap or very similar voices you can get faint artifacts. In practice, though, an imperfect isolated track you can actually work on beats a pristine mixed track you cannot. Separation is leverage, and it is the reason this step sits at position one. Editors who do this every week can see how it fits a repeatable routine on our podcast editing use-case page.
Step 2: balance the levels between speakers
With each voice on its own track, the next job is to make them sit at the same perceived volume. Uneven levels (the host booms and the guest whispers) are one of the most common complaints listeners have about amateur podcasts, and they are also the easiest fix once you have separated tracks. The goal is not to make every track hit the same peak; peaks are misleading because a single loud laugh can spike a meter while the overall voice stays quiet. The goal is to match loudness, which is what your ears actually track.
Practically, aim each individual speaker track at somewhere around -16 to -20 LUFS integrated before your final loudness pass. That gives you clean headroom to glue everything together later without clipping. Use your editor’s loudness meter (most modern DAWs and even free tools have one) and nudge each track’s gain until the numbers land in that window. Do not eyeball the waveform; a fat waveform can be quiet and a thin one can be loud. The measurable target is what keeps this objective: you are not guessing, you are matching a number.
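The gain move itself is plain arithmetic once the meter has given you a number. Here is a minimal sketch, assuming you have already read an integrated loudness value off your editor's meter: the code does not measure LUFS, the function names are made up for illustration, and the -18 LUFS default is simply one pick from inside the window above. It relies on the fact that a static gain of N dB shifts a track's loudness reading by (to a close approximation) N LU.

```python
def gain_to_target_db(measured_lufs: float, target_lufs: float = -18.0) -> float:
    """Static gain in dB that moves a track from its measured loudness to the target."""
    return target_lufs - measured_lufs


def db_to_linear(gain_db: float) -> float:
    """Convert a dB gain into the amplitude multiplier applied to samples."""
    return 10 ** (gain_db / 20)


# Hypothetical meter readings for two separated tracks (not real measurements).
readings = {"host": -14.5, "guest": -25.0}

for name, lufs in readings.items():
    gain_db = gain_to_target_db(lufs)
    print(f"{name}: {gain_db:+.1f} dB (x{db_to_linear(gain_db):.2f} amplitude)")
```

In practice you would type the dB figure into the track's gain field and re-meter, since the reading after the move is what counts.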
One nuance worth naming honestly: a single static gain move per track only works when a speaker’s level is consistent within their own track. If your guest leaned in and out of the mic, their track’s short-term loudness can swing from roughly -14 to -26 LUFS across the episode, and no single gain setting fixes that. For those cases you need clip-gain automation, riding the level up in quiet passages and down in loud ones, before you ever reach for a compressor. It is tedious, and this is the one step where cleanup genuinely takes time. But matched levels are the foundation the entire mix rests on; skimp here and every later step inherits the imbalance. This kind of level-matching is the heart of our interview cleanup workflow, where a quiet remote guest is the norm rather than the exception.
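To make the idea of clip-gain automation concrete, here is a rough sketch that plans one gain value per window of audio. It is an illustration under loud assumptions, not how any particular editor does it: it uses windowed RMS in dBFS as a stand-in for loudness (RMS is not LUFS), and the target, silence threshold, and maximum change are hypothetical defaults you would tune by ear.

```python
import math


def rms_dbfs(samples):
    """RMS level of a block of float samples (-1.0..1.0) in dBFS."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(mean_square) if mean_square > 0 else float("-inf")


def clip_gain_plan(samples, window, target_db=-20.0, silence_db=-50.0, max_change_db=9.0):
    """Return (start_index, gain_db) per window, nudging each window toward target_db.

    Windows quieter than silence_db get 0 dB so pauses (and their noise) are not boosted.
    The gain is clamped to +/- max_change_db to avoid extreme rides.
    """
    plan = []
    for start in range(0, len(samples), window):
        level = rms_dbfs(samples[start:start + window])
        if level < silence_db:
            gain = 0.0
        else:
            gain = max(-max_change_db, min(max_change_db, target_db - level))
        plan.append((start, gain))
    return plan
```

A plan like this is only a starting point: you would still smooth the transitions between windows and listen through, because abrupt gain steps in the middle of a word are audible.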
Step 3: kill crosstalk and mic bleed
Crosstalk is the sound of one person’s voice leaking into another person’s microphone, and mic bleed is its close cousin. In a room with two open mics, the host’s voice arrives at the guest’s mic a few milliseconds later and slightly colored, so when you cut to the guest’s track you hear a hollow, phasey echo of the host. It makes editing miserable: every time you try to clean a pause on one track, the other person is faintly present, and hard edits click because the bleed doesn’t line up.
The blunt truth is that crosstalk is only truly fixable once speakers are on separate tracks, which is another reason separation is step one. On isolated tracks you have real options: gate each track so it only opens when that person is actually talking, manually silence the gaps between their lines, and lean on the isolation itself, because AI-separated stems already push the other voices down dramatically compared to a raw shared-room recording. On a single mixed file, none of this is possible: you cannot mute one voice’s bleed without muting the conversation. The counter-argument is that aggressive gating can chop the front off quiet words, so set the gate’s threshold and release gently and check every transition by ear.
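If you have never looked inside a gate, the logic is simpler than the plugin UI suggests. The sketch below is a deliberately crude frame-based gate with a hold time, written to show what threshold and release are doing; every parameter value is an assumption for illustration. It switches gain at frame boundaries, whereas a real gate ramps the gain smoothly to avoid clicks, so treat it as a diagram in code rather than something to run on a release.

```python
def gate(samples, frame=256, threshold_db=-45.0, hold_frames=4, floor_db=-60.0):
    """Attenuate frames whose peak is below the threshold, after a hold period.

    threshold_db: frames peaking above this are treated as "this person is talking".
    hold_frames:  how many quiet frames stay open after speech, so word tails survive.
    floor_db:     how far closed frames are turned down (not a hard mute).
    """
    threshold = 10 ** (threshold_db / 20)
    floor_gain = 10 ** (floor_db / 20)
    out = []
    hold = 0
    for start in range(0, len(samples), frame):
        block = samples[start:start + frame]
        peak = max(abs(s) for s in block)
        if peak >= threshold:
            hold = hold_frames      # speech detected: open and re-arm the hold
            gain = 1.0
        elif hold > 0:
            hold -= 1               # recently open: stay open a little longer
            gain = 1.0
        else:
            gain = floor_gain       # nothing but bleed or room: turn it down
        out.extend(s * gain for s in block)
    return out
```

A longer hold is the code-level version of a gentle release: it costs you a little more bleed after each line in exchange for not clipping the ends of quiet words.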
Vignette: the interview that echoed itself
The scenario. A two-person show recorded both hosts in the same small room on two cardioid mics, straight into one stereo file. On playback, every punchline had a faint delayed double of itself, and hard cuts between speakers clicked audibly.
What surprised them. They assumed the fix was a noise-reduction plugin. Nothing they tried touched it, because the ‘noise’ was actually intelligible speech bleeding across mics, and broadband NR has no idea a voice is unwanted. The problem was structural, not spectral.
The lesson. They ran the mixed file through separation to get two isolated stems and gated each gently, and the double image vanished. Cuts went silent because each track’s gaps were now genuinely empty. Crosstalk is a track problem, so it needs a track solution, not a plugin.
Step 4: tame room noise, echo, and background music
Now that levels are matched and voices are isolated, deal with the ambient junk: the steady hum of an air conditioner, computer fans, street noise, and the slap-back echo of an untreated room. The key word for this step is gentle. Broadband noise reduction works by learning a noise profile and subtracting it, and the more aggressively you subtract, the more you carve into the voice itself, producing the tell-tale underwater warble. Start with the lightest setting that meaningfully reduces the noise floor and stop the moment the voice starts to sound processed. A little residual room tone is almost always more natural than an over-scrubbed voice.
For steady tonal hums, a narrow notch or high-pass filter often removes the offender more transparently than broadband NR, since you are surgically removing a specific frequency rather than everything. For intermittent noise (chair creaks, keyboard clicks, a dog in another room), manual editing and gating on the isolated track usually beats any automatic tool, because you can simply silence the gaps where the noise lives without touching the speech. Echo and reverb are the hardest to fix in post; de-reverb tools exist but they degrade the voice quickly, so treat heavy room echo as a re-record signal rather than a post problem, which we return to at the end.
Background music baked into the recording is its own headache: a cafe interview with a song playing, a promo bed left running under a voiceover. Once that music is mixed into the same signal as the voice, no EQ fully removes it without gutting the speech. Separation helps here too, because the same tool that splits speakers can pull the background music out into its own stem, leaving a cleaner voice track. We cover that specific task in the guide on how to remove background music from audio. The honest limit: if the music is loud and dense relative to the voice, expect a usable improvement, not a miracle.
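To see why a notch is so much more surgical than broadband NR, here is a small pure-Python sketch of a biquad notch filter using the widely published RBJ Audio EQ Cookbook coefficient formulas. The 60 Hz hum frequency, the Q of 10, and the 8 kHz sample rate are assumptions chosen for the demo, not recommendations; real hum often carries harmonics that need their own notches.

```python
import math


def notch_coefficients(freq_hz, sample_rate, q=10.0):
    """Biquad notch coefficients (RBJ cookbook form), normalized so a0 = 1."""
    w0 = 2 * math.pi * freq_hz / sample_rate
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b = (1 / a0, -2 * math.cos(w0) / a0, 1 / a0)
    a = (-2 * math.cos(w0) / a0, (1 - alpha) / a0)
    return b, a


def biquad(samples, b, a):
    """Run a direct-form I biquad over a list of float samples."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def sine(freq_hz, sample_rate, seconds, amplitude=0.5):
    n = int(sample_rate * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


SAMPLE_RATE = 8000  # low rate keeps the pure-Python demo fast
b, a = notch_coefficients(60.0, SAMPLE_RATE)  # 60 Hz hum is an assumed example
hum = sine(60.0, SAMPLE_RATE, 1.0)
voice_band = sine(1000.0, SAMPLE_RATE, 1.0)

# Compare levels once the filter has settled (second half of each signal).
half = SAMPLE_RATE // 2
print("hum kept:  ", rms(biquad(hum, b, a)[half:]) / rms(hum[half:]))
print("1 kHz kept:", rms(biquad(voice_band, b, a)[half:]) / rms(voice_band[half:]))
```

The design point is the Q value: a higher Q makes the notch narrower, which spares more of the voice but takes longer to settle and misses a hum that drifts in pitch.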
Step 5: EQ and compression per voice
This is where isolated tracks pay off one more time, because EQ and compression should be tuned per voice, not slapped across the whole mix as one global preset. A deep-voiced host and a bright-voiced guest need different treatment, and applying one setting to both means one of them ends up wrong. Start each voice with a high-pass filter: roll off everything below roughly 80 to 100 Hz for most voices, since there is little down there but rumble, plosive thumps, and desk vibration. That single move cleans up more mud than any other EQ decision.
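For the curious, a high-pass filter is only a few lines of arithmetic. This is a minimal sketch of a second-order (12 dB per octave) high-pass using the published RBJ Audio EQ Cookbook formulas, assuming a 90 Hz cutoff and an 8 kHz sample rate purely for the demo. Your editor's built-in filter will be steeper or gentler depending on its settings, so the exact attenuation figures here are illustrative only.

```python
import math


def highpass_coefficients(cutoff_hz, sample_rate, q=0.7071):
    """Second-order high-pass biquad (RBJ cookbook form), normalized so a0 = 1."""
    w0 = 2 * math.pi * cutoff_hz / sample_rate
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b = ((1 + cos_w0) / 2 / a0, -(1 + cos_w0) / a0, (1 + cos_w0) / 2 / a0)
    a = (-2 * cos_w0 / a0, (1 - alpha) / a0)
    return b, a


def run_filter(samples, b, a):
    """Apply the biquad sample by sample (direct form I)."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def level_kept(freq_hz, b, a, sample_rate=8000):
    """Fraction of a test tone's RMS level that survives the filter (after settling)."""
    tone = [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(sample_rate)]
    filtered = run_filter(tone, b, a)
    half = sample_rate // 2

    def rms(xs):
        return math.sqrt(sum(x * x for x in xs) / len(xs))

    return rms(filtered[half:]) / rms(tone[half:])


b, a = highpass_coefficients(90.0, 8000)  # 90 Hz cutoff: an assumed midpoint of 80-100 Hz
for freq in (30.0, 90.0, 1000.0):
    print(f"{freq:>6.0f} Hz keeps {level_kept(freq, b, a):.2f} of its level")
```

The pattern to notice is that rumble well below the cutoff is cut hard while the speech range passes essentially untouched, which is why this move is so safe to apply first.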
From there, EQ to taste and to the specific voice: a gentle dip around 200 to 400 Hz if a voice sounds boxy, a small lift in the 3 to 6 kHz presence range if a voice sounds dull and hard to follow. Make small moves, a couple of dB at a time, and compare against the untouched track constantly so you do not drift into a harsh, hyped sound. The goal is to help intelligibility, not to remodel someone’s voice into something it is not.
Compression evens out the remaining dynamic swings so the loud and quiet parts of a single voice sit closer together. Reach for a gentle ratio, around 2:1 to 3:1, with a threshold set so the compressor is only pulling down a few dB on the loudest words, not clamping the whole performance. Medium attack lets the natural transients of consonants through so speech stays crisp; a moderate release keeps it from breathing audibly. The counter-argument you will hear is that broadcast voices use heavy compression, and they do, but they also have engineers riding it and a deliberate sound in mind. For a conversational podcast, light and per-voice beats heavy and global every time.
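Ratio and threshold are easier to set once you have seen the arithmetic behind them. The sketch below models only the static curve of a compressor: it ignores attack, release, knee, and make-up gain, and the -18 dBFS threshold is a hypothetical value picked for illustration, since the right threshold depends entirely on your track's level.

```python
def compressed_level_db(input_db, threshold_db=-18.0, ratio=2.5):
    """Static compressor curve.

    Below the threshold the level is untouched; above it, the output rises
    1 dB for every `ratio` dB of input.
    """
    if input_db <= threshold_db:
        return input_db
    return threshold_db + (input_db - threshold_db) / ratio


def gain_reduction_db(input_db, threshold_db=-18.0, ratio=2.5):
    """How many dB the compressor pulls a given input level down."""
    return input_db - compressed_level_db(input_db, threshold_db, ratio)


# Hypothetical word levels in dBFS, from a quiet aside to a loud laugh.
for level in (-30.0, -18.0, -12.0, -8.0):
    print(f"{level:6.1f} dBFS -> {compressed_level_db(level):6.1f} dBFS "
          f"({gain_reduction_db(level):.1f} dB reduction)")
```

If the reduction on your loudest words comes out in the double digits, that is the signal to raise the threshold or lower the ratio, not to add make-up gain.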
The mistakes that make audio worse
Most ruined podcast audio is not under-processed; it is over-processed, or processed in the wrong order, by someone trying hard to help. Knowing what not to do saves more episodes than any new plugin. Here are the anti-patterns that do the most damage, and why each one backfires.
Cranking noise reduction to kill every hiss. Aggressive broadband NR carves holes in the voice and produces that watery, underwater, metallic warble. A quiet, natural room tone is far less distracting than a scrubbed, artifact-ridden voice. Use the lightest setting that helps and stop.
Over-compressing until it pumps. A high ratio and a low threshold squash all the life out of a voice and make the noise floor audibly breathe up and down between words. Once you hear pumping, you have gone too far: back off the ratio and raise the threshold.
One preset for every voice. Slapping the same EQ and compressor across the whole mix guarantees it flatters one speaker and mangles another. Different voices need different treatment; that is the entire reason to work on separated tracks.
Chasing loudness too early. Pushing everything loud at the start, before balancing and cleaning, just amplifies noise and locks in problems. Loudness is the last step, applied to a finished, controlled signal, never the first.
Vignette: the episode that got worse in post
The scenario. A solo host inherited a decent raw recording and, wanting it to sound ‘radio ready,’ stacked a strong noise reducer, a fast heavy compressor, and a loudness maximizer on the master before doing anything else.
What surprised them. The exported file sounded noticeably worse than the raw one: thin, brittle, with an audible whoosh of noise rising in every pause. They had assumed more plugins meant more polish, and were shocked that doing less would have sounded better.
The lesson. They deleted the whole chain, high-passed, applied a gentle 2.5:1 compressor, and set loudness once at the end. The result was cleaner in ten minutes than the over-processed version had been in two hours. Restraint is a technique, not a shortcut.
A repeatable 30-minute per-episode checklist
The point of a checklist is that cleanup stops being a creative agony and becomes a routine you can run on autopilot. Here is a per-episode pass with a concrete ‘what done looks like’ for each step, plus an ‘if you’re behind’ branch for when the clock is against you.
Separate (≈3 min)
Upload the mixed file, get one isolated track per speaker plus any music stem. Done looks like: each voice on its own track, other voices clearly pushed down. If behind: still do this; it is the step that makes every later shortcut possible.
Balance levels (≈8 min)
Meter each track and gain it toward -16 to -20 LUFS integrated. Done looks like: every speaker within about 2 LU of each other on the loudness meter. If behind: set one static gain per track and skip clip-gain automation; note the worst swings for a later pass.
De-noise (≈5 min)
Apply the lightest broadband NR that lowers the noise floor, plus a notch for any steady hum. Done looks like: pauses are quiet with no watery artifacts on speech. If behind: a single mild NR preset across each voice track; do not chase perfection.
EQ + compress per voice (≈8 min)
High-pass ~80–100 Hz, small presence tweak, gentle 2–3:1 compression. Done looks like: each voice is clear and even with no pumping. If behind: high-pass and one gentle compressor per track, skip the surgical EQ.
Final loudness + spot check (≈6 min)
Sum to the mix, normalize to about -16 LUFS, scan the timeline for clicks and loud breaths. Done looks like: the export hits target loudness with no clipping and no obvious glitches on earbuds. If behind: normalize and listen to the first and last minute plus three random spots rather than the whole thing.
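The "no clipping" half of that last check can be reasoned about before you export. Here is a small sketch, assuming the mix is available as a list of float samples in the -1.0 to 1.0 range and that you already know the gain your normalization step will apply. It looks at sample peaks only (not inter-sample "true" peaks), and the 0 dBFS ceiling is an assumed default; many people leave a small extra margin.

```python
import math


def peak_dbfs(samples):
    """Highest absolute sample value, expressed in dBFS."""
    peak = max((abs(s) for s in samples), default=0.0)
    return 20 * math.log10(peak) if peak > 0 else float("-inf")


def headroom_after_gain_db(samples, gain_db, ceiling_dbfs=0.0):
    """Margin left under the ceiling once gain_db is applied.

    Positive means safe; negative means the gain would push peaks past the ceiling.
    """
    return ceiling_dbfs - (peak_dbfs(samples) + gain_db)


# Hypothetical mix whose loudest sample sits at half of full scale (about -6 dBFS).
mix = [0.0, 0.5, -0.25, 0.1]
for gain in (4.0, 8.0):
    margin = headroom_after_gain_db(mix, gain)
    verdict = "ok" if margin >= 0 else "would clip"
    print(f"+{gain:.0f} dB: {margin:+.2f} dB headroom ({verdict})")
```

When the margin comes out negative, the fix is more control earlier in the chain (leveling and compression) or a limiter, rather than simply accepting the clipped export.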
Thirty minutes is realistic once the routine is muscle memory and your source recordings are reasonable. The single biggest time sink is level automation on a wildly inconsistent track, which is why capturing clean levels at the source pays back every single episode. If you find yourself blowing past an hour every time, the problem is upstream in the recording, not in your post skills.
When to re-record instead of fixing in post
An honest cleanup guide has to admit where post-production runs out of road, because pretending every problem is fixable wastes your evening and still ships a bad episode. Post is good at what is additive and separable: uneven levels, steady hum, mild noise, tonal balance, and, with separation, splitting glued-together voices and pulling out background music. Those are the wins you can reliably bank.
Post is bad at what is destructive or already baked in. Heavy room echo and reverb cannot be meaningfully un-baked; de-reverb tools trade the echo for a hollowed, robotic voice. Severe clipping, where the recording was so hot the waveform got squared off, has permanently lost information that no plugin restores. A voice buried under louder noise or music, where the signal-to-noise ratio is upside down, will improve but never sound clean. And digital dropouts or glitchy connection artifacts from a bad call are gaps in the data, not something to filter. When you hit one of these, the fastest path to a good episode is often to pick up the phone and re-record the affected chunk, not to spend three hours making a bad clip slightly less bad.
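Severe clipping is one of the few items on that list you can check for mechanically before deciding whether to re-record. The sketch below flags runs of consecutive samples pinned at full scale, which is what a squared-off waveform looks like in the data. The 0.999 ceiling and the three-sample minimum run are assumed thresholds for illustration; a recording that was clipped and then turned down afterwards will flatten at a lower level and slip past this check.

```python
def clipped_runs(samples, ceiling=0.999, min_run=3):
    """Find runs of consecutive samples at or above the ceiling (in absolute value).

    Returns a list of (start_index, length) tuples for runs of at least min_run samples.
    """
    runs = []
    start = None
    for i, s in enumerate(samples):
        if abs(s) >= ceiling:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - start))
            start = None
    # Handle a run that continues to the very end of the audio.
    if start is not None and len(samples) - start >= min_run:
        runs.append((start, len(samples) - start))
    return runs


# Hypothetical snippet: one flat-topped peak, then a brief touch of full scale.
snippet = [0.2, 1.0, 1.0, 1.0, 1.0, 0.3, -1.0, -1.0, 0.1]
print(clipped_runs(snippet))
```

A handful of short runs on a laugh may be tolerable; long runs on ordinary speech are the case where the information is gone and a re-record is the honest answer.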
The counter-argument is that re-recording is not always possible: a one-time guest, a live moment you cannot recreate. Fair. In those cases, do the best cleanup you can, be transparent with listeners if a segment is rough, and lean on the tools that genuinely help (separation to isolate the voice, gentle NR, careful leveling) while accepting the ceiling. If you are comparing where a dedicated separation step fits against all-in-one editors and transcription tools, our breakdown of Descript alternatives and Otter.ai alternatives lays out what each type of tool is actually for, and the head-to-head on Descript vs Otter helps if you are choosing an editing-plus-transcription stack to pair with your separation step. If your real goal was searchable text with speaker names rather than clean audio, the guide to transcribe with speaker labels points you at the right kind of tool instead.