I'm making an audio player using XAudio2. We stream data in packets of 640 bytes, at a sample rate of 8000 Hz and a sample depth of 16 bits. We use SlimDX to access XAudio2.
But when playing sound, we notice that the sound quality is bad. This, for example, is a 3 kHz sine curve, captured with Audacity.
I have condensed the audio player to the bare basics, but the audio quality is still bad. Is this a bug in XAudio2, SlimDX, or my code, or is this simply an artifact that occurs when one goes from 8 kHz to 44.1 kHz? The last one seems unreasonable, as we also generate PCM WAV files which are played perfectly by Windows Media Player.
The following is the basic implementation, which generates the broken sine.
    public partial class MainWindow : Window
    {
        private XAudio2 device = new XAudio2();
        private WaveFormatExtensible format = new WaveFormatExtensible();
        private SourceVoice sourceVoice = null;
        private MasteringVoice masteringVoice = null;
        private Guid KSDATAFORMAT_SUBTYPE_PCM = new Guid("00000001-0000-0010-8000-00aa00389b71");
        private AutoResetEvent BufferReady = new AutoResetEvent(false);
        private PlayBufferPool PlayBuffers = new PlayBufferPool();

        public MainWindow()
        {
            InitializeComponent();
            Closing += OnClosing;
            format.Channels = 1;
            format.BitsPerSample = 16;
            format.FormatTag = WaveFormatTag.Extensible;
            format.BlockAlignment = (short)(format.Channels * (format.BitsPerSample / 8));
            format.SamplesPerSecond = 8000;
            format.AverageBytesPerSecond = format.SamplesPerSecond * format.BlockAlignment;
            format.SubFormat = KSDATAFORMAT_SUBTYPE_PCM;
        }

        private void OnClosing(object sender, CancelEventArgs cancelEventArgs)
        {
            sourceVoice.Stop();
            sourceVoice.Dispose();
            masteringVoice.Dispose();
            PlayBuffers.Dispose();
        }

        private void button_Click(object sender, RoutedEventArgs e)
        {
            masteringVoice = new MasteringVoice(device);
            PlayBuffer buffer = PlayBuffers.NextBuffer();
            GenerateSine(buffer.Buffer);
            buffer.AudioBuffer.AudioBytes = 640;
            sourceVoice = new SourceVoice(device, format, VoiceFlags.None, 8);
            sourceVoice.BufferStart += new EventHandler<ContextEventArgs>(sourceVoice_BufferStart);
            sourceVoice.BufferEnd += new EventHandler<ContextEventArgs>(sourceVoice_BufferEnd);
            sourceVoice.SubmitSourceBuffer(buffer.AudioBuffer);
            sourceVoice.Start();
        }

        private void sourceVoice_BufferEnd(object sender, ContextEventArgs e)
        {
            BufferReady.Set();
        }

        private void sourceVoice_BufferStart(object sender, ContextEventArgs e)
        {
            BufferReady.WaitOne(1000);
            PlayBuffer nextBuffer = PlayBuffers.NextBuffer();
            nextBuffer.DataStream.Position = 0;
            nextBuffer.AudioBuffer.AudioBytes = 640;
            GenerateSine(nextBuffer.Buffer);
            Result r = sourceVoice.SubmitSourceBuffer(nextBuffer.AudioBuffer);
        }

        private void GenerateSine(byte[] buffer)
        {
            double sampleRate = 8000.0;
            double amplitude = 0.25 * short.MaxValue;
            double frequency = 3000.0;
            for (int n = 0; n < buffer.Length / 2; n++)
            {
                short[] s = { (short)(amplitude * Math.Sin((2 * Math.PI * n * frequency) / sampleRate)) };
                Buffer.BlockCopy(s, 0, buffer, n * 2, 2);
            }
        }
    }
    public class PlayBuffer : IDisposable
    {
        #region Private variables
        private IntPtr BufferPtr;
        private GCHandle BufferHandle;
        #endregion

        #region Constructors
        public PlayBuffer()
        {
            Index = 0;
            Buffer = new byte[640 * 4]; // 640 bytes = 40 ms of 16-bit mono audio at 8000 Hz
            BufferHandle = GCHandle.Alloc(this.Buffer, GCHandleType.Pinned);
            BufferPtr = new IntPtr(BufferHandle.AddrOfPinnedObject().ToInt32());
            DataStream = new DataStream(BufferPtr, 640 * 4, true, false);
            AudioBuffer = new AudioBuffer();
            AudioBuffer.AudioData = DataStream;
        }

        public PlayBuffer(int index)
            : this()
        {
            Index = index;
        }
        #endregion

        #region Destructor
        ~PlayBuffer()
        {
            Dispose();
        }
        #endregion

        #region Properties
        protected int Index { get; private set; }
        public byte[] Buffer { get; private set; }
        public DataStream DataStream { get; private set; }
        public AudioBuffer AudioBuffer { get; private set; }
        #endregion

        #region Public functions
        public void Dispose()
        {
            if (AudioBuffer != null)
            {
                AudioBuffer.Dispose();
                AudioBuffer = null;
            }
            if (DataStream != null)
            {
                DataStream.Dispose();
                DataStream = null;
            }
        }
        #endregion
    }
    public class PlayBufferPool : IDisposable
    {
        #region Private variables
        private int _currentIndex = -1;
        private PlayBuffer[] _buffers = new PlayBuffer[2];
        #endregion

        #region Constructors
        public PlayBufferPool()
        {
            for (int i = 0; i < 2; i++)
                Buffers[i] = new PlayBuffer(i);
        }
        #endregion

        #region Destructor
        ~PlayBufferPool()
        {
            Dispose();
        }
        #endregion

        #region Properties
        protected int CurrentIndex
        {
            get { return _currentIndex; }
            set { _currentIndex = value; }
        }

        protected PlayBuffer[] Buffers
        {
            get { return _buffers; }
            set { _buffers = value; }
        }
        #endregion

        #region Public functions
        public void Dispose()
        {
            for (int i = 0; i < Buffers.Length; i++)
            {
                if (Buffers[i] == null)
                    continue;
                Buffers[i].Dispose();
                Buffers[i] = null;
            }
        }

        public PlayBuffer NextBuffer()
        {
            CurrentIndex = (CurrentIndex + 1) % Buffers.Length;
            return Buffers[CurrentIndex];
        }
        #endregion
    }
Some extra details:
This is used to replay recorded voice in various compression formats such as A-law, µ-law or TrueSpeech. The data is sent in small packets, decoded, and passed to this player. This is why we use such a low sampling rate and such small buffers. There are no problems with our data, however: writing the same data to a WAV file results in perfect playback in WMP or VLC.
Edit: We have now "solved" this by rewriting the player in NAudio. I'd still be interested in any input as to what is happening here. Is it our approach with the PlayBuffers, or is it simply a bug/limitation in DirectX or the wrappers? I tried using SharpDX instead of SlimDX, but that did not change the result at all.
It looks as if the upsampling is done without a proper anti-imaging (reconstruction) filter. The cutoff frequency is far too high (above the original Nyquist frequency of 4 kHz), so a lot of the spectral images (aliases) of the signal are preserved, resulting in output resembling piecewise-linear interpolation between the samples taken at 8000 Hz.
Although all your different options are doing an upconversion from 8kHz to 44.1kHz, the way in which they do that is important, and the fact that one library does it well is no proof that the upconversion is not the source of error in the other.
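To make the difference concrete, here is a minimal sketch that resamples the question's 3 kHz tone from 8 kHz to 44.1 kHz in two ways and measures how much of the 5 kHz image (8000 Hz minus 3000 Hz) survives. It is only an illustration of the point above, not a description of what XAudio2 does internally. The names `ResampleDemo`, `Linear`, `WindowedSinc` and `AmplitudeAt` are made up for this example, and the Hann window with a half-width of 32 input samples is an arbitrary choice.

```csharp
using System;

public static class ResampleDemo
{
    const double InRate = 8000.0;    // source sample rate from the question
    const double OutRate = 44100.0;  // assumed output rate of the audio device
    const double ToneHz = 3000.0;    // test tone from the question
    const int HalfWidth = 32;        // assumed sinc kernel half-width, in input samples

    // Input sample k of a unit-amplitude 3 kHz sine, valid for any integer k.
    static double Input(long k)
    {
        return Math.Sin(2 * Math.PI * ToneHz * k / InRate);
    }

    // Piecewise-linear interpolation between the two neighbouring input samples.
    public static double Linear(double pos)
    {
        long i = (long)Math.Floor(pos);
        double frac = pos - i;
        return Input(i) * (1 - frac) + Input(i + 1) * frac;
    }

    // Band-limited interpolation with a Hann-windowed sinc kernel.
    public static double WindowedSinc(double pos)
    {
        long centre = (long)Math.Floor(pos);
        double sum = 0;
        for (long k = centre - HalfWidth + 1; k <= centre + HalfWidth; k++)
        {
            double x = pos - k; // distance in input samples, |x| <= HalfWidth
            double sinc = x == 0 ? 1.0 : Math.Sin(Math.PI * x) / (Math.PI * x);
            double hann = 0.5 * (1 + Math.Cos(Math.PI * x / HalfWidth));
            sum += Input(k) * sinc * hann;
        }
        return sum;
    }

    // Amplitude of the component at freqHz in one second of resampled output,
    // measured with a single-bin DFT.
    public static double AmplitudeAt(Func<double, double> interpolate, double freqHz)
    {
        int n = (int)OutRate;
        double re = 0, im = 0;
        for (int m = 0; m < n; m++)
        {
            double y = interpolate(m * InRate / OutRate); // position in input samples
            double phase = 2 * Math.PI * freqHz * m / OutRate;
            re += y * Math.Cos(phase);
            im -= y * Math.Sin(phase);
        }
        return 2 * Math.Sqrt(re * re + im * im) / n;
    }

    public static void Main()
    {
        Console.WriteLine("linear:        3 kHz = {0:F3}, 5 kHz image = {1:F3}",
            AmplitudeAt(Linear, 3000), AmplitudeAt(Linear, 5000));
        Console.WriteLine("windowed sinc: 3 kHz = {0:F3}, 5 kHz image = {1:F3}",
            AmplitudeAt(WindowedSinc, 3000), AmplitudeAt(WindowedSinc, 5000));
    }
}
```

For linear interpolation, theory predicts gains of sinc²(3000/8000) and sinc²(5000/8000), so the 5 kHz image should come out at roughly a third of the wanted tone, which is clearly audible and visible in a recording. The windowed-sinc version should leave the 3 kHz tone close to full amplitude and push the image down to a small fraction of it.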
It's been a while since I worked with sound and frequencies, but here is what I remember: You have a sample rate of 8000 Hz and want a sine frequency of 3000 Hz. So for 1 second you have 8000 samples, and in that second you want your sine to oscillate 3000 times, which leaves fewer than 3 samples per cycle. That is below the Nyquist frequency (half your sample rate, here 4000 Hz), but barely (see the Nyquist–Shannon sampling theorem). So I would not expect good quality here.
In fact, step through the GenerateSine method and you'll see that s[0] will contain the values 0, 5792, -8191, 5792, 0, -5792, 8191, -5792, 0, 5792, ... Nonetheless, this doesn't explain the odd sine you recorded, and I'm not sure how many samples the human ear needs to hear a "good" sine wave.
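If you'd rather not step through the debugger, a small sketch like the following prints the same values. It reuses the formula from GenerateSine in the question; the class and method names (`SinePreview`, `FirstSamples`) are made up for this example.

```csharp
using System;

public static class SinePreview
{
    // Same formula as GenerateSine in the question, returning the first `count` samples.
    public static short[] FirstSamples(int count)
    {
        double sampleRate = 8000.0;
        double amplitude = 0.25 * short.MaxValue;
        double frequency = 3000.0;
        var samples = new short[count];
        for (int n = 0; n < count; n++)
        {
            // The cast truncates toward zero, exactly as in the question's code.
            samples[n] = (short)(amplitude * Math.Sin((2 * Math.PI * n * frequency) / sampleRate));
        }
        return samples;
    }

    public static void Main()
    {
        // Prints: 0, 5792, -8191, 5792, 0, -5792, 8191, -5792, 0, 5792
        Console.WriteLine(string.Join(", ", FirstSamples(10)));
    }
}
```

At 3000 Hz and 8000 samples per second there are 8/3 (about 2.67) samples per cycle, so the pattern repeats every 8 samples, which cover exactly 3 cycles.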