
How to hook a real-time audio stream endpoint to Direct Line Speech

Posted 2020-07-31 20:47

Question:

I am trying to hook up my real-time audio endpoint, which produces a continuous audio stream, to the Direct Line Speech (DLS) endpoint, which in turn interacts with my Azure bot API.

I have a websocket API that continuously receives an audio stream in binary format, which I intend to forward to the DLS endpoint for continuous speech-to-text with my bot.

Based on the feedback and answer here, I have been able to hook up my Direct Line Speech endpoint to a real-time stream.

I've tried a sample WAV file, which DLS transcribes correctly, and my bot is able to retrieve the text and operate on it.

I am using the ListenOnce() API with a PushAudioInputStream to push the audio stream to the DLS speech endpoint.

The code below shows the internals of my ListenOnce() method:

// Create a push stream
using (var pushStream = AudioInputStream.CreatePushStream())
{
    using (var audioInput = AudioConfig.FromStreamInput(pushStream))
    {
        // Create a new Dialog Service Connector
        this.connector = new DialogServiceConnector(dialogServiceConfig, audioInput);
        // ... also subscribe to events for this.connector

        // Open a connection to Direct Line Speech channel
        this.connector.ConnectAsync();
        Debug.WriteLine("Connecting to DLS");

        pushStream.Write(dataBuffer, dataBuffer.Length);

        try
        {
            this.connector.ListenOnceAsync();
            System.Diagnostics.Debug.WriteLine("Started ListenOnceAsync");
        }
        catch (Exception ex)
        {
            System.Diagnostics.Debug.WriteLine($"ListenOnceAsync failed: {ex.Message}");
        }
    }
}

The dataBuffer in the above code is the 'chunk' of binary data I receive on my websocket:

const int maxMessageSize = 1024 * 4; // 4 KB
var dataBuffer = new byte[maxMessageSize];

while (webSocket.State == WebSocketState.Open)
{
    var result = await webSocket.ReceiveAsync(new ArraySegment<byte>(dataBuffer), CancellationToken.None);
    if (result.MessageType == WebSocketMessageType.Close)
    {
        Trace.WriteLine($"Received websocket close message: {result.CloseStatus.Value}, {result.CloseStatusDescription}");
        await webSocket.CloseAsync(result.CloseStatus.Value, result.CloseStatusDescription, CancellationToken.None);
    }
    else if (result.MessageType == WebSocketMessageType.Text)
    {
        var message = Encoding.UTF8.GetString(dataBuffer, 0, result.Count);
        Trace.WriteLine($"Received websocket text message: {message}");
    }
    else // binary
    {
        Trace.WriteLine("Received websocket binary message");
        ListenOnce(dataBuffer); // calls the ListenOnce method shown above
    }
}

But the above code doesn't work. I believe I have a couple of issues/questions with this approach:

  1. I believe I am not chunking the data correctly, so Direct Line Speech may not be receiving the full audio it needs for correct speech-to-text conversion.
  2. I know the DLS API supports ListenOnceAsync(), but I'm not sure whether it supports ASR (i.e., knowing when the speaker on the other side has stopped talking).
  3. Can I just get the websocket URL for the Direct Line Speech endpoint and assume DLS correctly consumes the direct websocket stream?

Answer 1:

I believe I am not chunking the data correctly, so Direct Line Speech may not be receiving the full audio it needs for correct speech-to-text conversion.

DialogServiceConnector.ListenOnceAsync will listen until the stream is closed (or until enough silence is detected). You never close your stream except by disposing it at the end of your using block. You could await ListenOnceAsync, but then you'd have to make sure you close the stream first. If you don't await ListenOnceAsync, you can close the stream whenever you want, but you should probably do it as soon as you finish writing to it, and you must make sure you don't dispose of the stream (or the config) before ListenOnceAsync has had a chance to complete.
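For example, here is a minimal sketch of that ordering, assuming the same dialogServiceConfig from the question and a hypothetical audioBytes buffer that already holds one complete utterance:

// A minimal sketch: write the audio, close the stream, then await.
// audioBytes is a hypothetical buffer holding one complete utterance.
using (var pushStream = AudioInputStream.CreatePushStream())
using (var audioInput = AudioConfig.FromStreamInput(pushStream))
using (var connector = new DialogServiceConnector(dialogServiceConfig, audioInput))
{
    await connector.ConnectAsync();

    pushStream.Write(audioBytes, audioBytes.Length);
    pushStream.Close(); // signal end-of-audio before awaiting

    // The stream has ended, so ListenOnceAsync can run to completion.
    var result = await connector.ListenOnceAsync();
    Debug.WriteLine($"Recognized: {result.Text}");
}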

You also want to make sure ListenOnceAsync gets the full utterance. If you're only receiving 4 KB at a time, that is almost certainly not a full utterance. If you want to keep your chunks at 4 KB, it may be a good idea to keep ListenOnceAsync running across multiple iterations of that loop rather than calling it over and over for every chunk you get.
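One way to structure that, sketched here under the assumption that dialogServiceConfig and webSocket are set up as in the question, is to start ListenOnceAsync once before the receive loop and feed every binary chunk into the same push stream:

// Sketch: one long-running ListenOnceAsync fed by many small chunks.
var pushStream = AudioInputStream.CreatePushStream();
var audioInput = AudioConfig.FromStreamInput(pushStream);
var connector = new DialogServiceConnector(dialogServiceConfig, audioInput);
await connector.ConnectAsync();

// Start listening once; do not await yet so the loop can keep feeding audio.
var listenTask = connector.ListenOnceAsync();

var dataBuffer = new byte[1024 * 4]; // 4 KB chunks, as in the question
while (webSocket.State == WebSocketState.Open)
{
    var result = await webSocket.ReceiveAsync(new ArraySegment<byte>(dataBuffer), CancellationToken.None);
    if (result.MessageType == WebSocketMessageType.Binary)
    {
        // Push only the bytes actually received in this frame.
        pushStream.Write(dataBuffer, result.Count);
    }
}

// The socket has closed: end the audio so ListenOnceAsync can complete.
pushStream.Close();
var recognition = await listenTask;
Debug.WriteLine($"Recognized: {recognition.Text}");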

I know the DLS API supports ListenOnceAsync(), but I'm not sure whether it supports ASR (i.e., knowing when the speaker on the other side has stopped talking).

I think you will have to determine on the client side when the speaker stops talking, and then have the client send a message over your websocket indicating that you should close the audio stream for ListenOnceAsync.
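In the receive loop, that could look like the following fragment, where "end-of-utterance" is a hypothetical marker you and your client agree on, not part of any API:

// Sketch: inside the websocket receive loop, treat a text control message
// (an assumed client-side convention) as the end-of-audio signal.
if (result.MessageType == WebSocketMessageType.Text)
{
    var message = Encoding.UTF8.GetString(dataBuffer, 0, result.Count);
    if (message == "end-of-utterance") // hypothetical marker
    {
        pushStream.Close(); // lets the pending ListenOnceAsync complete
    }
}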

It looks like ListenOnceAsync does support ASR.

Can I just get the websocket URL for the Direct Line Speech endpoint and assume DLS correctly consumes the direct websocket stream?

You could try it, but I would not assume that myself. Direct Line Speech is still in preview, and I don't expect compatibility to come easily.