可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I am building a tool which is supposed to run on a server and analyze sound files. I want to do this in Ruby as all my other tools are written in Ruby as well. But I am having trouble finding a good way of accomplishing this.

A lot of the examples I've found has been doing visualizers and graphical stuff. I just need the FFT data, nothing more. I need to both get the audio data, and do a FFT on it. My end goal is to calculate some stuff like the mean/median/mode, 25th-percentile, and 75th-percentile over all frequencies (weighted amplitude), the BPM, and perhaps some other good characteristic to later be able to cluster similar sounds together.

First I tried to use ruby-audio and fftw3 but I never go the two to really work together. The documentation wasn't good either so I really didn't know what data was being shuffled around. Next I tried to use bplay / brec and limit my Ruby script to just use STDIN and perform an FFT on that (still using fftw3). But I couldn't get bplay/brec to work since the server doesn't have a sound card and I didn't manage to just get the audio directly to STDOUT without going to an audio device first.

Here's the closest I've gotten:

# extracting audio from wav with ruby-audio
buf = RubyAudio::Buffer.float(1024)
RubyAudio::Sound.open(fname) do |snd|
    while snd.read(buf) != 0
        # ???
    end
end

# performing FFT on audio
def get_fft(input, window_size)
    data = input.read(window_size).unpack("s*")
    na = NArray.to_na(data)
    fft = FFTW3.fft(na).to_a[0, window_size/2]
    return fft
end

So now I'm stuck and can't find any more good results on Google. So perhaps you SO guys can help me out?

Thanks!

回答1:

Here's the final solution to what I was trying to achieve, thanks a lot to Randall Cook's helpful advice. The code to extract sound wave and FFT of a wav file in Ruby:

require "ruby-audio"
require "fftw3"

fname = ARGV[0]
window_size = 1024
wave = Array.new
fft = Array.new(window_size/2,[])

begin
    buf = RubyAudio::Buffer.float(window_size)
    RubyAudio::Sound.open(fname) do |snd|
        while snd.read(buf) != 0
            wave.concat(buf.to_a)
            na = NArray.to_na(buf.to_a)
            fft_slice = FFTW3.fft(na).to_a[0, window_size/2]
            j=0
            fft_slice.each { |x| fft[j] << x; j+=1 }
        end
    end

rescue => err
    log.error "error reading audio file: " + err
    exit
end

# now I can work on analyzing the "fft" and "wave" arrays...

回答2:

I think there are two problems here. One is getting the samples, the other is performing the FFT.

To get the samples, there are two main steps: decoding and downmixing. To decode wav files, you just need to parse the header so you can know how to interpret the samples. For mp3 files, you'll need to do a full decode. Once the audio has been decoded, if you are not interested in processing the stereo channels separately, you may need to downmix it into mono, since the FFT expects a single channel as input. If you don't mind venturing outside of Ruby, the sox tool makes this easy. For example sox song.mp3 -b 16 song.raw channels 1 should convert an mp3 to a mono file of pure PCM samples (i.e. 16-bit integers). BTW, a quick search revealed the ruby/audio library (perhaps it is the one mentioned in your post). It looks pretty good, especially since it wraps libsndfile.

To perform the FFT, I see three options. One is to use this snippet of code that performs an FFT. I'm no Ruby expert, but it looks like it might be OK. The second option is to use NArray. It has a ton of mathematical methods, including FFTW, available in a separate module, a tarball for which is linked in the middle of the NArray page. The third option is to write your own FFT code. It's not an especially complicated algorithm, and could give you great experience with numerical processing in Ruby (if you need that).

You are probably aware of this, but the FFT expects complex input and generates complex output. Audio signals are real, of course, so the imaginary component of the input should always be zero (a + 0*i). Since your input is real, the output will be symmetrical about the midpoint of the output array. You can safely ignore the upper half. If you want the energy in a particular frequency bin (they are spaced linearly up to half the sample rate), you'll need to compute the magnitude of the complex value (sqrt(real*real + imag*imag)).

One more thing: Because frequency zero (the DC offset of the signal) and the Nyquist frequency (half the sample rate) have no phase components, some FFT implementations put them together into the same complex bin (one in the real component, one in the imaginary component, typically of the first bin). You can create some simple signals (all 1s for just a DC signal, and alternating +1, -1 for a Nyquist signal) and see what the FFT output looks like.