So here's the idea: you can generate a spectrogram from an audio file using shorttime Fourier transform (stft). Then some people have generated something called a "binary mask" to generate different audio (ie. with background noise removed etc.) from the inverse stft.
Here's what I understand:
stft is a simple equation that is applied to the audio file, which generates the information that can easily be displayed a spectrogram. By taking the inverse of the stft matrix, and multiplying it by a matrix of the same size (the binary matrix) you can create a new matrix with information to generate an audio file with the masked sound.
Once I do the matrix multiplication, how is the new audio file created?
It's not much but here's what I've got in terms of code:
from librosa import load
from librosa.core import stft, istft
y, sample_rate = load('1.wav')
spectrum = stft(y)
back_y = istft(spectrum)
Thank you, and here are some slides that got me this far. I'd appreciate it if you could give me an example/demo in python
Librosa's STFT is full-featured so unless you're very careful with how you manipulate the spectrum, you won't get a sensible output from its
istft
.Here's a pair of functions,
stft
andistft
, that I wrote from scratch that represent the forward and inverse STFT, along with a helper method that gives you the time and frequency locations of each pixel in the STFT array, plus a demo:This is in a gist if that's easier for you to read.
The entire module assumes real-only data and uses Numpy's
rfft
functions. You can definitely generalize this to complex data (or use librosa), but for your application (audio masking), using the real-only transforms makes it easier to ensure that everything works out and the output of the inverse STFT is real-only (it's easy to mess this up if you're doing the fully-general complex STFT, where you need to be careful in maintaining symmetries).The demo first generates some test data and confirms that the
istft
on thestft
of the data produces the data again. The test data is a chirp that starts at 5 Hz and goes down at 2 Hz per second, so over ~10 seconds of data, the chirp's frequency wraps around and ends up at around 15 Hz. The demo plots the STFT (by taking the absolute value of the STFT array):So
stft.py
file,import stft
,spectrum = stft.stft(y, 128)
,stft.
to functions defined instft.py
!),spectrum
array, beforeback_y = stft.istft(spectrum, 128)
.Masking/amplifying/attenuating frequency content means just scaling some bins of the
spectrum
array. If you have specific questions on how to do that, let us know. But this hopefully will give you a foolproof way of applying arbitrary effects.If you really want to use librosa's functions, let us know and we can show you how to do that too.