Ultra Fast Text to Speech (WAV -> MP3) in ASP.NET

This question is essentially about the suitability of Microsoft's Speech API (SAPI) for server workloads and whether it can be used reliably inside w3wp for speech synthesis. We have an asynchronous controller that uses the native System.Speech assembly in .NET 4 (not the Microsoft.Speech one that ships as part of Microsoft Speech Platform - Runtime Version 11) and lame.exe to generate MP3s as follows:

    [CacheFilter]
    public void ListenAsync(string url)
    {
        string fileName = string.Format(@"C:\test\{0}.wav", Guid.NewGuid());

        try
        {
            // Run the synthesizer on a dedicated thread and wait for it below.
            var t = new System.Threading.Thread(() =>
            {
                using (SpeechSynthesizer ss = new SpeechSynthesizer())
                {
                    ss.SetOutputToWaveFile(fileName, new SpeechAudioFormatInfo(22050, AudioBitsPerSample.Eight, AudioChannel.Mono));
                    ss.Speak("Here is a test sentence...");
                    ss.SetOutputToNull();
                }   // the using block disposes the synthesizer

                // Convert the WAV to MP3 with lame.exe, then delete the WAV.
                var process = new Process() { EnableRaisingEvents = true };
                process.StartInfo.FileName = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, @"bin\lame.exe");
                process.StartInfo.Arguments = string.Format("-V2 {0} {1}", fileName, fileName.Replace(".wav", ".mp3"));
                process.StartInfo.UseShellExecute = false;
                process.StartInfo.RedirectStandardOutput = false;
                process.StartInfo.RedirectStandardError = false;
                process.Exited += (sender, e) =>
                {
                    System.IO.File.Delete(fileName);

                    AsyncManager.OutstandingOperations.Decrement();
                };

                AsyncManager.OutstandingOperations.Increment();
                process.Start();
            });

            t.Start();
            t.Join();
        }
        catch { }

        AsyncManager.Parameters["fileName"] = fileName;
    }

    public FileResult ListenCompleted(string fileName)
    {
        return base.File(fileName.Replace(".wav", ".mp3"), "audio/mp3");
    }

The first question is why SpeechSynthesizer needs to run on a separate thread like that in order to return (this has been reported elsewhere on SO), and whether implementing a STAThreadRouteHandler for this request would be more efficient or scalable than the approach above.

Second, what are the options for running SpeakAsync in an ASP.NET (MVC or WebForms) context? None of the options I've tried seem to work (see update below).

Any other suggestions for how to improve this pattern (i.e. two dependencies that must execute serially, each of which has async support) are welcome. I don't feel this scheme is sustainable under load, especially considering the known memory leaks in SpeechSynthesizer. I am considering running this service on a different stack altogether.

Update: Neither the Speak nor the SpeakAsync option appears to work under the STAThreadRouteHandler. The former produces:

    System.InvalidOperationException: Asynchronous operations are not allowed in this context. Page starting an asynchronous operation has to have the Async attribute set to true and an asynchronous operation can only be started on a page prior to PreRenderComplete event.
       at System.Web.LegacyAspNetSynchronizationContext.OperationStarted()
       at System.ComponentModel.AsyncOperationManager.CreateOperation(Object userSuppliedState)
       at System.Speech.Internal.Synthesis.VoiceSynthesis..ctor(WeakReference speechSynthesizer)
       at System.Speech.Synthesis.SpeechSynthesizer.get_VoiceSynthesizer()
       at System.Speech.Synthesis.SpeechSynthesizer.SetOutputToWaveFile(String path, SpeechAudioFormatInfo formatInfo)

The latter results in:

    System.InvalidOperationException: The asynchronous action method 'Listen' cannot be executed synchronously.
       at System.Web.Mvc.Async.AsyncActionDescriptor.Execute(ControllerContext controllerContext, IDictionary`2 parameters)

It seems like a custom STA thread pool (with ThreadStatic instances of the COM object) is a better approach: http://marcinbudny.blogspot.ca/2012/04/dealing-with-sta-coms-in-web.html
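One plausible reading of the first stack trace above is that the failure is not about COM apartments at all: the synthesizer's constructor path calls AsyncOperationManager.CreateOperation, which registers the operation with the current SynchronizationContext, and the ASP.NET context refuses it. A thread created with new Thread starts without a SynchronizationContext, which would explain why the Start/Join pattern works. The sketch below reproduces that mechanism in a console program without System.Speech or ASP.NET; RejectingSynchronizationContext and SyncContextDemo are hypothetical names, and the rejecting context is only a stand-in for what the ASP.NET context appears to do in the trace.

```csharp
using System;
using System.ComponentModel;
using System.Threading;

// Hypothetical stand-in for ASP.NET's legacy synchronization context:
// it rejects any attempt to register an asynchronous operation.
sealed class RejectingSynchronizationContext : SynchronizationContext
{
    public override void OperationStarted()
    {
        throw new InvalidOperationException(
            "Asynchronous operations are not allowed in this context.");
    }
}

static class SyncContextDemo
{
    // Returns true if AsyncOperationManager.CreateOperation throws on the calling thread.
    public static bool CreateOperationThrows()
    {
        try
        {
            // Same call that appears in the stack trace under VoiceSynthesis..ctor.
            AsyncOperation op = AsyncOperationManager.CreateOperation(null);
            op.OperationCompleted();
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }

    // Runs the same check on a brand-new thread, which has no SynchronizationContext installed.
    public static bool CreateOperationThrowsOnNewThread()
    {
        bool threw = false;
        var t = new Thread(() => { threw = CreateOperationThrows(); });
        t.Start();
        t.Join();
        return threw;
    }

    static void Main()
    {
        // Simulate a request thread whose context rejects async operations.
        SynchronizationContext.SetSynchronizationContext(new RejectingSynchronizationContext());

        Console.WriteLine("Request-like thread throws: " + CreateOperationThrows());
        Console.WriteLine("Fresh thread throws: " + CreateOperationThrowsOnNewThread());
    }
}
```

If this reading is right, the dedicated thread matters because it lacks the request's SynchronizationContext, not because of its apartment state.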

Update #2: It doesn't seem like System.Speech.Synthesis.SpeechSynthesizer needs STA treatment; it seems to run fine on MTA threads so long as you follow that Start/Join pattern. Here's a new version that correctly uses SpeakAsync (the issue there was disposing it prematurely!) and splits the WAV generation and the MP3 generation into two separate requests:

[CacheFilter]
[ActionName("listen-to-text")]
public void ListenToTextAsync(string text)
{
    AsyncManager.OutstandingOperations.Increment();   

    var t = new Thread(() =>
    {
        SpeechSynthesizer ss = new SpeechSynthesizer();
        string fileName = string.Format(@"C:\test\{0}.wav", Guid.NewGuid());

        ss.SetOutputToWaveFile(fileName, new SpeechAudioFormatInfo(22050,
                                                                   AudioBitsPerSample.Eight,
                                                                   AudioChannel.Mono));
        ss.SpeakCompleted += (sender, e) =>
        {
            ss.SetOutputToNull();
            ss.Dispose();

            AsyncManager.Parameters["fileName"] = fileName;
            AsyncManager.OutstandingOperations.Decrement();
        };

        CustomPromptBuilder pb = new CustomPromptBuilder(settings.DefaultVoiceName);
        pb.AppendParagraphText(text);
        ss.SpeakAsync(pb);               
    });

    t.Start();
    t.Join();                    
}

[CacheFilter]
public ActionResult ListenToTextCompleted(string fileName)
{
    return RedirectToAction("mp3", new { fileName = fileName });
}

[CacheFilter]
[ActionName("mp3")]
public void Mp3Async(string fileName) 
{
    var process = new Process()
    {
        EnableRaisingEvents = true,
        StartInfo = new ProcessStartInfo()
        {
            FileName = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, @"bin\lame.exe"),
            Arguments = string.Format("-V2 {0} {1}", fileName, fileName.Replace(".wav", ".mp3")),
            UseShellExecute = false,
            RedirectStandardOutput = false,
            RedirectStandardError = false
        }
    };

    process.Exited += (sender, e) =>
    {
        System.IO.File.Delete(fileName);
        AsyncManager.Parameters["fileName"] = fileName;
        AsyncManager.OutstandingOperations.Decrement();
    };

    AsyncManager.OutstandingOperations.Increment();
    process.Start();
}

[CacheFilter]
public ActionResult Mp3Completed(string fileName) 
{
    return base.File(fileName.Replace(".wav", ".mp3"), "audio/mp3");
}

Answer 1:

I/O is very expensive on a server. How many concurrent streams of WAV writing do you think a server hard drive can sustain? Why not do it all in memory and only write the MP3 once it is fully processed? MP3s are much smaller, so the I/O will be engaged for a much shorter time. You can even change the code to return the stream directly to the user instead of saving an MP3 file if you want.

See the related question: "How can I use LAME to encode a WAV to an MP3 in C#?"
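The in-memory approach suggested in this answer could be sketched roughly as follows, assuming the LAME command-line encoder is used with "-" for both input and output so that it reads the WAV from stdin and writes the MP3 to stdout. ProcessPipe, PipeThrough, EncodeWavToMp3, and the lamePath parameter are hypothetical names introduced for illustration; the WAV bytes would come from something like SetOutputToWaveStream on a MemoryStream. No temporary files are written.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;

static class ProcessPipe
{
    // Sends 'input' to the process's stdin and returns everything it writes to stdout.
    public static byte[] PipeThrough(string exePath, string arguments, byte[] input)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = arguments,
            UseShellExecute = false,
            CreateNoWindow = true,
            RedirectStandardInput = true,
            RedirectStandardOutput = true
        };

        using (var process = Process.Start(startInfo))
        using (var output = new MemoryStream())
        {
            // Feed stdin on another thread so a full stdout pipe cannot deadlock us.
            Task writer = Task.Factory.StartNew(() =>
            {
                using (Stream stdin = process.StandardInput.BaseStream)
                {
                    stdin.Write(input, 0, input.Length);
                }   // disposing closes stdin, which signals end of input
            });

            // Drain stdout until the child process closes it.
            process.StandardOutput.BaseStream.CopyTo(output);
            writer.Wait();
            process.WaitForExit();

            if (process.ExitCode != 0)
                throw new InvalidOperationException("Process exited with code " + process.ExitCode);

            return output.ToArray();
        }
    }

    // Assumption: lamePath points at a LAME command-line binary; "- -" means stdin to stdout.
    public static byte[] EncodeWavToMp3(byte[] wavBytes, string lamePath)
    {
        return PipeThrough(lamePath, "--quiet -V2 - -", wavBytes);
    }
}
```

The returned bytes could then be handed to the client through a FileContentResult or written to disk once, which keeps the large intermediate WAV off the hard drive entirely. Spawning a process per request still has a cost, so this is a starting point to measure rather than a guaranteed fix.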

Answer 2:

This question is a bit old now, but this is what I'm doing and it's been working great so far:

    public Task<FileStreamResult> Speak(string text)
    {
        return Task.Factory.StartNew(() =>
        {
            using (var synthesizer = new SpeechSynthesizer())
            {
                var ms = new MemoryStream();
                synthesizer.SetOutputToWaveStream(ms);
                synthesizer.Speak(text);

                ms.Position = 0;
                return new FileStreamResult(ms, "audio/wav");
            }
        });
    }

This might help someone.