I want to remove vocals from MP3 sound tracks. I searched Google and tried a few programs, but none of them gave convincing results. I am planning to read the MP3 file, get the waveform, and remove the parts of the waveform that are above a specified limit.
Do you have any suggestions on how to proceed?
-- Update
I just want code that can read the MP3 file format. Is there any software that does this?
This isn't so much an "algorithm" as a "trick", but it could be automated in code. It works mostly for stereo tracks where the vocals are centered. If the vocals are centered, they appear equally in both channels. If you invert one of the channels and then mix the two back together, the waveforms of the centered vocals cancel out and are virtually removed. You can do this manually with most good audio editors, such as Audacity. It doesn't give you perfect results, and the rest of the audio suffers a bit too, but it makes for great karaoke tracks :)
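As a rough illustration of why the inversion trick cancels centered material, here is a minimal sketch on plain Python lists. The function name `remove_center` and all sample values are made up for the example; real audio would come from a decoded file.

```python
def remove_center(left, right):
    """Invert the right channel and mix it with the left, sample by sample."""
    inverted_right = [-r for r in right]
    return [l + r_inv for l, r_inv in zip(left, inverted_right)]

# Hypothetical samples: a centered vocal plus one instrument on each side
vocal = [100, -200, 300, -400]
guitar_left = [10, 20, 30, 40]
piano_right = [5, -5, 5, -5]

left = [v + g for v, g in zip(vocal, guitar_left)]
right = [v + p for v, p in zip(vocal, piano_right)]

result = remove_center(left, right)
print(result)  # [5, 25, 25, 45]: the vocal is gone, guitar minus piano remains
```

Note that the result is a single mono signal; anything else that is identical in both channels disappears along with the vocal.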
Source: http://www.cdf.utoronto.ca/~csc209h/summer/a2/a2.html, written by Daniel Zingaro.
Sounds are waves of air pressure. When
a sound is generated, a sound wave
consisting of compressions (increases
in pressure) and rarefactions
(decreases in pressure) moves through
the air. This is similar to what
happens if you throw a stone into a
pond: the water rises and falls in a
repeating wave.
When a microphone records sound, it
takes a measure of the air pressure
and returns it as a value. These
values are called samples and can be
positive or negative corresponding to
increases or decreases in air
pressure. Each time the air pressure
is recorded, we are sampling the
sound. Each sample records the sound
at an instant in time; the faster we
sample, the more accurate is our
representation of the sound. The
sampling rate refers to how many times
per second we sample the sound. For
example, CD-quality sound uses a
sampling rate of 44100 samples per
second; sampling someone's voice for
use in a VOIP conversation uses far
less than this. Sampling rates of
11025 (voice quality), 22050, and
44100 (CD quality) are common...
For mono sounds (those with one sound
channel), a sample is simply a
positive or negative integer that
represents the amount of compression
in the air at the point the sample was
taken. For stereo sounds (which we use
in this assignment), a sample is
actually made up of two integer
values: one for the left speaker and
one for the right...
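To make the stereo layout concrete, here is a small sketch that splits raw stereo data into its two channels. It assumes 16-bit little-endian samples interleaved as left, right, left, right, which is the usual layout for PCM WAV data but is not spelled out in the quoted text. The name `split_stereo` is illustrative.

```python
import struct

def split_stereo(pcm_bytes):
    """Split interleaved 16-bit little-endian stereo PCM into two lists."""
    count = len(pcm_bytes) // 2  # number of 16-bit values
    values = struct.unpack("<%dh" % count, pcm_bytes[:count * 2])
    return list(values[0::2]), list(values[1::2])

# Two hypothetical stereo samples: (1, -1) and (1000, -1000)
data = struct.pack("<4h", 1, -1, 1000, -1000)
left, right = split_stereo(data)
print(left, right)  # [1, 1000] [-1, -1000]
```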
Here's how the algorithm [to remove vocals] works.
Copy the first 44 bytes verbatim from the input file to the output
file. Those 44 bytes contain important
header information that should not be
modified.
Next, treat the rest of the input file as a sequence of shorts. Take
each pair of shorts left and right,
and compute combined = (left - right)
/ 2. Write two copies of combined to
the output file.
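The two steps above can be sketched in Python as follows. This assumes a 16-bit stereo PCM WAV file whose header is exactly 44 bytes, as the assignment states; real-world WAV files can carry extra chunks, in which case the standard library's `wave` module is a safer way to read them. The function name `remove_vocals` is illustrative.

```python
import struct

HEADER_SIZE = 44  # per the assignment; not true of every WAV file

def remove_vocals(in_path, out_path):
    """Copy the header, then replace each (left, right) pair with (left - right) / 2."""
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        dst.write(src.read(HEADER_SIZE))
        while True:
            frame = src.read(4)  # one stereo sample: two 16-bit shorts
            if len(frame) < 4:
                dst.write(frame)  # copy any trailing partial frame unchanged
                break
            left, right = struct.unpack("<hh", frame)
            combined = int((left - right) / 2)  # truncates toward zero, like C
            dst.write(struct.pack("<hh", combined, combined))

# Example call (file names are placeholders):
# remove_vocals("song.wav", "song_no_vocals.wav")
```

Reading four bytes at a time keeps the sketch short; for large files, processing bigger blocks would be faster.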
Why Does This Work?
For the curious, a brief explanation
of the vocal-removal algorithm is in
order. As you noticed from the
algorithm, we are simply subtracting
one channel from the other (and then
dividing by 2 to keep the volume from
getting too loud). So why does
subtracting the left channel from the
right channel magically remove vocals?
When music is recorded, it is
sometimes the case that vocals are
recorded by a single microphone, and
that single vocal track is used for
the vocals in both channels. The other
instruments in the song are recorded
by multiple microphones, so that they
sound different in both channels.
Subtracting one channel from the other
takes away everything that is "in
common" between those two channels
which, if we're lucky, means removing
the vocals.
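In symbols, under the idealized assumption that the vocal signal $v$ is identical in both channels while the instrument parts $a_L$ and $a_R$ differ (these symbols are introduced here for illustration and do not appear in the quoted text):

```latex
\begin{aligned}
\text{left} &= v + a_L, \qquad \text{right} = v + a_R \\
\text{combined} &= \frac{\text{left} - \text{right}}{2}
  = \frac{(v + a_L) - (v + a_R)}{2}
  = \frac{a_L - a_R}{2}
\end{aligned}
```

The vocal term $v$ drops out entirely, and so does any instrument that happens to be identical in both channels.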
Of course, things rarely work so well.
Try your vocal remover on this
badly-behaved wav file. Sure, the
vocals are gone, but so is the body of
the music! Apparently, some of the
instruments were also recorded
"centred", so that they are removed
along with the vocals when channels
are subtracted.
You can use the pydub library (see its documentation for details). It depends on FFmpeg, so it can read any file format that FFmpeg can decode.
Then you can do the following:
from pydub import AudioSegment

# Placeholder paths; replace with your own files
myAudioFile = "input.mp3"
myAudioFile_CentersOut = "output_centers_out.mp3"

# Read in the audio file and get the two mono tracks
sound_stereo = AudioSegment.from_file(myAudioFile, format="mp3")
sound_monoL, sound_monoR = sound_stereo.split_to_mono()

# Invert the phase of the right channel
sound_monoR_inv = sound_monoR.invert_phase()

# Mix the left channel with the inverted right channel; this cancels out the centered content
sound_CentersOut = sound_monoL.overlay(sound_monoR_inv)

# Export the merged audio file
sound_CentersOut.export(myAudioFile_CentersOut, format="mp3")
Above a specified limit? That sounds like a high-pass filter... You could use phase cancellation if you had the a cappella track along with the original. Otherwise, unless it's an old 60s-era track that has the vocals directly in the middle and everything else hard-panned, I don't think there's a super clean way of removing vocals.
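For the a cappella case, a minimal sketch of the idea on plain lists of 16-bit samples might look like this. It assumes the two tracks are exactly sample-aligned and at the same gain, which is rarely true without manual alignment. The function name `subtract_acapella` and the sample values are hypothetical.

```python
def subtract_acapella(mix, acapella):
    """Subtract an aligned a cappella track from the full mix, sample by sample."""
    if len(mix) != len(acapella):
        raise ValueError("tracks must be the same length and sample-aligned")
    out = []
    for m, a in zip(mix, acapella):
        diff = m - a
        out.append(max(-32768, min(32767, diff)))  # clamp to the 16-bit range
    return out

# Hypothetical samples: the mix is instrumental plus vocals
instrumental = [10, -20, 30]
vocals = [1000, 2000, -3000]
mix = [i + v for i, v in zip(instrumental, vocals)]
print(subtract_acapella(mix, vocals))  # [10, -20, 30]
```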