How to downsample audio recorded from the mic in real time

Posted 2020-07-30 00:55

Question:

I am using the following JavaScript to record audio and send it to a websocket server:

const recordAudio = () =>
    new Promise(async resolve => {

        const constraints = {
            audio: {
                sampleSize: 16,
                channelCount: 1,
                sampleRate: 8000
            },
            video: false
        };

        var mediaRecorder;
        const stream = await navigator.mediaDevices.getUserMedia(constraints);

        var options = {
            audioBitsPerSecond: 128000,
            mimeType: 'audio/webm;codecs=pcm'
        };
        mediaRecorder = new MediaRecorder(stream, options);
        var track = stream.getAudioTracks()[0];
        var constraints2 = track.getConstraints();
        var settings = track.getSettings();


        const audioChunks = [];

        mediaRecorder.addEventListener("dataavailable", event => {
            audioChunks.push(event.data);
            webSocket.send(event.data);
        });

        const start = () => mediaRecorder.start(30);

        const stop = () =>
            new Promise(resolve => {
                mediaRecorder.addEventListener("stop", () => {
                    const audioBlob = new Blob(audioChunks);
                    const audioUrl = URL.createObjectURL(audioBlob);
                    const audio = new Audio(audioUrl);
                    const play = () => audio.play();
                    resolve({
                        audioBlob,
                        audioUrl,
                        play
                    });
                });

                mediaRecorder.stop();
            });

        resolve({
            start,
            stop
        });
    });

This is for real-time STT (speech-to-text), and the websocket server refused to send any response. By debugging, I confirmed that the sampleRate is not changing to 8 kHz. Upon researching, I found out that this is a known bug in both Chrome and Firefox. I found some other resources, like stackoverflow1 and IBM_STT, but I have no idea how to adapt them to my code. Those resources refer to a buffer, but all I have in my code is a MediaStream (stream) and event.data (a Blob). I am new to both JavaScript and the Audio API, so please pardon me if I did something wrong.

If this helps, I have equivalent Python code that sends data from the mic to the websocket server, and it works. Library used: PyAudio. Code:

 import pyaudio

 p = pyaudio.PyAudio()
 stream = p.open(format=pyaudio.paInt16,  # 16-bit samples (the constant, not a string)
                 channels=1,
                 rate=8000,
                 input=True,
                 frames_per_buffer=10)

 print("* recording, please speak")

 packet_size = int((30/1000)*8000)  # 30 ms of audio: normally 240 samples or 480 bytes

 frames = []

 # while True:
 for i in range(0, 1000):
     packet = stream.read(packet_size)
     ws.send(packet, binary=True)  # ws is the already-open websocket client (not shown)

Answer 1:

To downsample in real time, follow these steps:

  1. First, get the stream instance:

    const stream = await navigator.mediaDevices.getUserMedia(constraints);
    
  2. Create an AudioContext and a media stream source from this stream.

    var audioContext = new AudioContext();
    var input = audioContext.createMediaStreamSource(stream);
    
  3. Create a script processor so that you can work with the raw sample buffers. The one below takes 4096 samples from the stream at a time, continuously, and has 1 input channel and 1 output channel.

    var scriptNode = audioContext.createScriptProcessor(4096, 1, 1);
    
  4. Connect your input to scriptNode. You can also connect scriptNode to the destination, depending on your requirements.

    input.connect(scriptNode);
    scriptNode.connect(audioContext.destination);
    
  5. The script processor fires an onaudioprocess event for every block of 4096 samples, and in its handler you can do whatever you want with them. The variable downsampled will contain 4096 / (sampling ratio) samples, where the sampling ratio is the input rate divided by the target rate. floatTo16BitPCM then converts that to your required 16-bit format, since the original data is in 32-bit float format.

    scriptNode.onaudioprocess = function(audioProcessingEvent) {
        var inputBuffer = audioProcessingEvent.inputBuffer;
        // The output buffer holds the samples that would be played; nothing is written to it here
        var outputBuffer = audioProcessingEvent.outputBuffer;

        // Loop through the output channels (in this case there is only one)
        for (var channel = 0; channel < outputBuffer.numberOfChannels; channel++) {
            var inputData = inputBuffer.getChannelData(channel);
            var outputData = outputBuffer.getChannelData(channel);

            // Downsample the 32-bit float samples, then convert them to 16-bit PCM
            var downsampled = downsample(inputData);
            var sixteenBitBuffer = floatTo16BitPCM(downsampled);
        }
    };
    
  6. Your sixteenBitBuffer will contain the data you require.

    The downsample and floatTo16BitPCM functions are explained in the Watson API documentation: IBM Watson Speech to Text API
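
As a rough illustration of the two helper functions named in the steps above, here is a minimal sketch. It is not the Watson implementation: this downsample simply averages each group of input samples (a crude low-pass filter), and its three-argument signature (samples, inputRate, outputRate) is an assumption made for this example, so the one-argument call in step 5 would become something like downsample(inputData, audioContext.sampleRate, 8000).

```javascript
// Assumed signature for illustration only; not taken from the Watson code.
// Reduces the sample rate by averaging the input samples that fall into
// each output sample's time slot.
function downsample(samples, inputRate, outputRate) {
    if (outputRate >= inputRate) {
        return Float32Array.from(samples); // nothing to reduce, return a copy
    }
    const ratio = inputRate / outputRate;
    const outputLength = Math.floor(samples.length / ratio);
    const output = new Float32Array(outputLength);
    for (let i = 0; i < outputLength; i++) {
        const start = Math.floor(i * ratio);
        const end = Math.min(Math.floor((i + 1) * ratio), samples.length);
        let sum = 0;
        for (let j = start; j < end; j++) {
            sum += samples[j];
        }
        output[i] = sum / (end - start);
    }
    return output;
}

// Converts float samples in the range [-1, 1] to signed 16-bit integers.
// Out-of-range values are clamped first. Int16Array stores values in the
// platform's byte order.
function floatTo16BitPCM(samples) {
    const output = new Int16Array(samples.length);
    for (let i = 0; i < samples.length; i++) {
        const s = Math.max(-1, Math.min(1, samples[i]));
        output[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
    }
    return output;
}
```

With a 48 kHz context and an 8 kHz target, each 4096-sample block yields 682 output samples, so the block sizes coming out of the handler will not line up with the 240-sample packets used in the Python version.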

You won't need a MediaRecorder instance. The Watson API is open source, so you can look at how they implemented this for their use case and find a more streamlined approach. You should be able to salvage the important functions from their code.
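
To mirror the Python code from the question, which sends 30 ms packets (240 samples, or 480 bytes, at 8 kHz), the 16-bit output can be regrouped into fixed-size packets before sending. The following is a minimal sketch under that assumption; createPacketizer is a hypothetical helper written for this example, not part of any library.

```javascript
// Hypothetical helper: buffers incoming 16-bit samples and calls onPacket
// once for every complete packet of packetSize samples.
function createPacketizer(packetSize, onPacket) {
    let pending = new Int16Array(0); // samples left over from earlier calls
    return function push(samples) {
        // Append the new samples to the leftovers.
        const merged = new Int16Array(pending.length + samples.length);
        merged.set(pending, 0);
        merged.set(samples, pending.length);

        let offset = 0;
        while (merged.length - offset >= packetSize) {
            // slice() copies, so every packet owns its own ArrayBuffer.
            onPacket(merged.slice(offset, offset + packetSize));
            offset += packetSize;
        }
        pending = merged.slice(offset);
    };
}

const TARGET_RATE = 8000; // Hz, as in the question
const PACKET_MS = 30;     // packet duration, as in the Python code
const PACKET_SIZE = (TARGET_RATE * PACKET_MS) / 1000; // 240 samples = 480 bytes
```

Inside the onaudioprocess handler this could be used as push(sixteenBitBuffer), where push comes from createPacketizer(PACKET_SIZE, packet => webSocket.send(packet.buffer)) and webSocket is the socket from the question.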