Question:

I am starting out with audio recording using my Android smartphone.

I successfully saved voice recordings to a PCM file. When I parse the data and print out the signed, 16-bit values, I can create a graph like the one below. However, I do not understand the amplitude values along the y-axis.

What exactly are the units for the amplitude values? The values are signed 16-bit, so they must range from -32K to +32K. But what do these values represent? Decibels?
If I use 8-bit values, then the values must range from -128 to +128. How would that get mapped to the volume/"loudness" of the 16-bit values? Would you just use a 16-to-1 quantisation mapping?
Why are there negative values? I would think that complete silence would result in values of 0.

If someone can point me to a website with information on what's being recorded, I would appreciate it. I found webpages on the PCM file format, but not what the data values are.

Answer 1:

Think of the surface of the microphone. When it's silent, the surface is motionless at position zero. When you talk, the air around your mouth vibrates. Vibrations are spring-like and move in both directions: back and forth, up and down, or in and out. The vibrations in the air make the microphone surface vibrate as well, moving up and down. When it moves down, that might be sampled as a positive value. When it moves up, that might be sampled as a negative value. (Or it could be the opposite.) When you stop talking, the surface settles back to the zero position.

The numbers you get from your PCM recording data depend on the gain of the system. With common 16-bit samples, the range is from -32768 to 32767 for the largest possible excursion of a vibration that can be recorded without distortion, clipping or overflow. Usually the gain is set a bit lower so that the maximum values aren't right on the edge of distortion.

ADDED:

8-bit PCM audio is often an unsigned data type, with the range from 0..255, with a value of 128 indicating "silence". So you have to add/subtract this bias, as well as scale by about 256 to convert between 8-bit and 16-bit audio PCM waveforms.
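The bias-and-scale conversion described above can be sketched as follows. This is a minimal illustration, assuming unsigned 8-bit samples and signed 16-bit samples; the function names `u8_to_s16` and `s16_to_u8` are hypothetical and not part of any Android API.

```python
def u8_to_s16(sample: int) -> int:
    """Convert one unsigned 8-bit PCM sample (0..255) to signed 16-bit."""
    if not 0 <= sample <= 255:
        raise ValueError("expected an unsigned 8-bit sample")
    # Remove the 128 bias so silence becomes 0, then scale by 256.
    return (sample - 128) * 256


def s16_to_u8(sample: int) -> int:
    """Convert one signed 16-bit PCM sample (-32768..32767) to unsigned 8-bit."""
    if not -32768 <= sample <= 32767:
        raise ValueError("expected a signed 16-bit sample")
    # Keep the top 8 bits (floor division by 256), then add the bias back.
    return (sample >> 8) + 128
```

Silence maps to silence in both directions (128 becomes 0 and 0 becomes 128). Going from 16-bit to 8-bit discards the low 8 bits, which is where the extra quantization noise of 8-bit audio comes from.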

Answer 2:

The raw numbers are an artefact of the quantization process used to convert an analog audio signal into digital. It makes more sense to think of an audio signal as a vibration around 0, extending as far as +1 and -1 for maximum excursion of the signal. Outside that, you get clipping, which distorts the harmonics and sounds terrible.

However, computers don't work all that well in terms of fractions, so that range is mapped onto 65,536 discrete integer values, which for signed 16-bit samples run from -32768 to +32767. In most applications like this, +32767 is considered the maximum positive excursion of the microphone's or speaker's diaphragm. There is no correlation between a sample point and a sound pressure level, unless you start factoring in the characteristics of the recording (or playback) circuits.
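To view the samples on the -1 to +1 scale described above, the raw integers can be divided by full scale. The sketch below assumes the PCM data is little-endian, signed 16-bit; that layout is an assumption about the recording, and `pcm16_to_floats` is a hypothetical helper name.

```python
import struct


def pcm16_to_floats(raw: bytes) -> list:
    """Decode little-endian signed 16-bit PCM bytes into floats in [-1.0, 1.0)."""
    count = len(raw) // 2  # two bytes per sample; ignore a trailing odd byte
    samples = struct.unpack("<%dh" % count, raw[: count * 2])
    # Dividing by 32768 maps -32768 to -1.0 and 32767 to just under +1.0.
    return [s / 32768.0 for s in samples]
```

For example, `pcm16_to_floats(struct.pack("<3h", 0, 16384, -32768))` gives `[0.0, 0.5, -1.0]`.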

(BTW, 16-bit audio is very standard and widely used. It is a good balance of signal-to-noise ratio and dynamic range. 8-bit is noisy unless you do some funky non-standard scaling.)

Answer 3:

Why are there negative values? I would think that complete silence would result in values of 0

The diaphragm on a microphone vibrates in both directions and as a result creates positive and negative voltages. A value of 0 is silence as it indicates that the diaphragm is not moving. See how microphones work

Small clarification: The position of the diaphragm is being recorded. Silence occurs when there is no vibration, when there is no change in position. So the vibration you are seeing is what is pushing the air and creating changes in air pressure over time. The air is no longer being pushed at the top and bottom peaks of any vibration, so the peaks are when silence occurs. The loudest part of the signal is when the position changes the fastest which is somewhere in the middle of the peaks. The speed with which the diaphragm moves from one peak to another determines the amount of pressure that's generated by the diaphragm. When the top and bottom peaks are reduced to zero (or some other number they share) then there is no vibration and no sound at all. Also as the diaphragm slows down so that there's a greater space of time between peaks, there is less sound pressure being generated or recorded.

I recommend the Yamaha Sound Reinforcement Handbook for more in-depth reading. A basic grasp of calculus also helps in understanding audio and vibration.

Answer 4:

Lots of good answers here, but they don't directly address your questions in an easy-to-read way.

What exactly are the units for the amplitude values? The values are signed 16-bit, so they must range from -32K to +32K. But what do these values represent? Decibels?

The values have no unit. They simply represent a number that has come out of an analog-to-digital converter. The numbers from the A/D converter are a function of the microphone and pre-amplifier characteristics.
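Although the raw values carry no unit, they are often expressed in decibels relative to full scale (dBFS), which is a relative measure and not a sound pressure level. A minimal sketch, assuming signed 16-bit samples; `peak_dbfs` and `FULL_SCALE` are hypothetical names used for illustration.

```python
import math

FULL_SCALE = 32768  # magnitude of the most negative signed 16-bit sample


def peak_dbfs(samples) -> float:
    """Peak level of signed 16-bit samples in dB relative to full scale."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")  # digital silence has no finite level
    return 20.0 * math.log10(peak / FULL_SCALE)
```

A peak at half of full scale comes out at about -6 dBFS. Relating that figure to real-world loudness would still require the microphone and pre-amplifier characteristics mentioned above.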

If I use 8-bit values, then the values must range from -128 to +128. How would that get mapped to the volume/"loudness" of the 16-bit values? Would you just use a 16-to-1 quantisation mapping?

I don't understand this question. If you are recording 8-bit audio, your values will be 8 bits wide. Are you converting 8-bit audio to 16-bit?

Why are there negative values? I would think that complete silence would result in values of 0

The diaphragm on a microphone vibrates in both directions and as a result creates positive and negative voltages. A value of 0 is silence as it indicates that the diaphragm is not moving. See how microphones work

For more details on how sound is represented digitally, see here.

Answer 5:

The 16-bit numbers are the A/D converter values from your microphone (you knew this). Note also that the amplifier between your microphone and the A/D converter has Automatic Gain Control (AGC). The AGC actively changes the amplification of the microphone signal to prevent too much voltage from hitting the A/D converter (usually less than 2 V DC). There is also a DC bias that sets the input signal in the middle of the A/D converter's range (say 1 V DC).

So, when there is no sound hitting the microphone, the AGC amplifier sends a flat-line 1.0 V DC signal to the A/D converter. When sound waves hit the microphone, it creates a corresponding AC voltage wave. The AGC amplifier takes the AC voltage wave, centers it at 1.0 V DC, and sends it to the A/D converter. The A/D converter samples the voltage (say 44,100 times per second) and outputs signed 16-bit values. In this example, -32,768 corresponds to 0.0 V DC and +32,767 to just under 2.0 V DC, so each step is about 30.5 microvolts. A value of +100 is about 1.00305 V DC and -100 is about 0.99695 V DC at the A/D converter's input.

Positive values are above 1.0 V DC; negative values are below 1.0 V DC.
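The mapping from sample value to input voltage can be sketched as follows. It assumes the example figures used in this answer (a 0 to 2 V input range biased at 1.0 V) together with a signed 16-bit sample range of -32768 to 32767; real hardware will differ, and the constant and function names are hypothetical.

```python
V_MID = 1.0     # assumed DC bias at the A/D converter input, in volts
V_SPAN = 2.0    # assumed full input range, 0.0 V to 2.0 V
LEVELS = 65536  # number of distinct codes in a 16-bit converter


def sample_to_volts(sample: int) -> float:
    """Map a signed 16-bit sample to the assumed A/D input voltage."""
    if not -32768 <= sample <= 32767:
        raise ValueError("expected a signed 16-bit sample")
    # Each code step is V_SPAN / LEVELS volts, about 30.5 microvolts here.
    return V_MID + sample * (V_SPAN / LEVELS)
```

Under these assumptions, a sample of 0 sits at 1.0 V, -32768 sits at 0.0 V, and +100 sits about 3 mV above the midpoint.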

Note that some audio systems apply a logarithmic curve to the signal (companding, such as mu-law or A-law in telephony) so that quantization better matches how the human ear perceives loudness; plain linear PCM does not. In digital audio systems (with ADCs), Digital Signal Processing puts this curve on the signal. DSP chips are big business; TI has made a fortune from them in all kinds of applications, not just audio processing. DSPs can apply very complicated math to a real-time stream of data that would choke a general-purpose processor such as the ARM in an iPhone. Say you are sending 2 MHz pulses to an array of 256 ultrasound sensors/receivers: you get the idea.