APE File
https://wiki.hydrogenaud.io/index.php?title=Main_Page
Digital audio
Sound is a waveform, and digital audio is a digital representation of that waveform. The digital representation is made by "sampling" the amplitude of the analog signal many times per second; conceptually, this can be seen as recording the "height" of the sound wave many times per second. Audio CDs store 44,100 samples per second, and since CDs are stereo, that means 44,100 left samples and 44,100 right samples every second. Each sample is a 16-bit integer. A WAV file is basically a header followed by an interleaved array of L, R, L, R, ... samples. Since each sample pair takes 32 bits (16 for the left channel and 16 for the right) and there are 44,100 pairs per second, one second of audio takes 1,411,200 bits, or 176,400 bytes.
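As a rough sanity check of those numbers, here is a minimal sketch (the struct and names are only illustrative) that lays out one second of CD audio as interleaved 16-bit pairs and recomputes the data rate:

    #include <stdint.h>
    #include <stdio.h>

    #define SAMPLE_RATE 44100          /* CD audio: 44,100 sample pairs per second */

    /* One stereo sample pair as it sits in the WAV data chunk: left, then right. */
    typedef struct {
        int16_t left;
        int16_t right;
    } StereoSample;

    int main(void) {
        /* One second of audio is just an array of interleaved pairs
         * (assuming no struct padding, sizeof(StereoSample) is 4 bytes). */
        size_t bytes_per_second = SAMPLE_RATE * sizeof(StereoSample);
        printf("bytes per second: %zu\n", bytes_per_second);      /* 176400 */
        printf("bits per second:  %zu\n", bytes_per_second * 8);  /* 1411200 */
        return 0;
    }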
Lossless compression
Lossless compression can be decomposed into several basic steps. They are described in detail below.
1. Convert to X, Y
The first step in lossless compression is to re-model the L and R channels as X and Y values that can be coded more efficiently. There is usually a great deal of correlation between the L and R channels, and it can be exploited in a number of ways; one popular way is mid/side encoding. Here, a mid (X) and side (Y) value are encoded instead of the L and R values. The mid (X) is the average of the left and right channels, while the side (Y) is the difference between them: X = (L + R) / 2, Y = (L - R)
To get back to L and R: R = X - (Y / 2), L = R + Y
?? How are overflow and the loss of precision from the integer division handled?
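One common answer, offered here only as a sketch of how other lossless codecs (e.g. FLAC-style mid/side coding) handle it, not necessarily Monkey's Audio's exact method: L + R and L - R always have the same parity, so the bit dropped by the integer division in X = (L + R) / 2 is the same as the low bit of Y and can be restored on decode.

    #include <stdio.h>

    /* Forward: X = (L + R) / 2 (floor), Y = L - R.
     * The bit lost by the division equals the low bit of Y, so nothing is really lost.
     * The >> 1 assumes arithmetic right shift for negatives, as on virtually all compilers. */
    static void to_mid_side(int l, int r, int *x, int *y) {
        *y = l - r;
        *x = (l + r) >> 1;
    }

    /* Inverse: rebuild L + R from X and the low bit of Y, then undo the sum/difference. */
    static void to_left_right(int x, int y, int *l, int *r) {
        int sum = x * 2 + (y & 1);   /* L + R with its low bit restored */
        *l = (sum + y) / 2;          /* sum + y is exactly 2L */
        *r = (sum - y) / 2;          /* sum - y is exactly 2R */
    }

    int main(void) {
        int x, y, l, r;
        to_mid_side(-3, 2, &x, &y);
        to_left_right(x, y, &l, &r);
        printf("X=%d Y=%d -> L=%d R=%d\n", x, y, l, r);   /* back to L=-3, R=2 exactly */
        return 0;
    }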
2. Predictor
Next, the X and Y data are passed through a predictor to remove any remaining redundancy. Basically, the goal of this stage is to make the values in the X and Y arrays as small as possible while keeping them losslessly decompressible. This stage is what most distinguishes one compression scheme from another, and there are almost countless ways to do it. Here is an example using simple linear algebra, where PX and PY are the predicted X and Y, X[-1] is the previous X value, and X[-2] is the value before that: PX = (2 * X[-1]) - X[-2], PY = (2 * Y[-1]) - Y[-2]
For example, if X = (2, 8, 24, ?): PX = (2 * X[-1]) - X[-2] = (2 * 24) - 8 = 40
These predicted values are then compared with the actual values, and the difference (the error) is passed on to the next stage for encoding.
Most good predictors are adaptive, so they adjust to the "predictability" of the current data. For example, let's use a factor "m" with a range of 0 to 1024, where 0 means the prediction is not used at all and 1024 means it is used at full strength. After each prediction, m is adjusted up or down depending on how useful the prediction was. Continuing the previous example: X = (2, 8, 24, ?), PX = (2 * X[-1]) - X[-2] = (2 * 24) - 8 = 40
If ? = 45 and m = 512: [final value] = ? - (PX * m / 1024) = 45 - (40 * 512 / 1024) = 45 - 20 = 25. ?? If the predictor adapts, does its state need to be saved, and how much space would that take? Afterwards, m is adjusted upward, because a higher m would have made the prediction more effective.
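A minimal sketch of this kind of adaptive blend (the update step of 16 and the single-sample walkthrough are only illustrative, not Monkey's Audio's actual update rule). Note that in designs like this the decoder can replay exactly the same adaptation from the values it has already decoded, so m itself normally never needs to be stored:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        int x1 = 24, x2 = 8;    /* the two previous X values from the example */
        int actual = 45;        /* the "?" value */
        int m = 512;            /* adaptation factor, 0..1024 */

        int p = 2 * x1 - x2;                      /* order-2 prediction: 40 */
        int residual = actual - (p * m / 1024);   /* 45 - 20 = 25, sent on for encoding */
        printf("prediction=%d residual=%d\n", p, residual);

        /* Adapt: the full prediction would have left |45 - 40| = 5, much better than
         * no prediction (|45|), so the prediction helped and m is nudged upward. */
        if (abs(actual - p) < abs(actual))
            m = (m + 16 > 1024) ? 1024 : m + 16;
        else
            m = (m - 16 < 0) ? 0 : m - 16;
        printf("new m=%d\n", m);
        return 0;
    }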
Using different prediction equations, and running more than one prediction pass, can make a considerable difference in the compression level. Here is a quick list of prediction equations of different orders, as shown in the Shorten technical documentation:
P0 = 0
P1 = X[-1]
P2 = (2 * X[-1]) - X[-2]
P3 = (3 * X[-1]) - (3 * X[-2]) + X[-3]
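Those equations map directly onto code; a minimal sketch (the function names are only illustrative) that evaluates each order against the 2, 8, 24 example:

    #include <stdio.h>

    /* Shorten-style polynomial predictors of order 0-3.
     * x1, x2, x3 are the samples one, two and three positions back. */
    static int p0(void)                   { return 0; }
    static int p1(int x1)                 { return x1; }
    static int p2(int x1, int x2)         { return 2 * x1 - x2; }
    static int p3(int x1, int x2, int x3) { return 3 * x1 - 3 * x2 + x3; }

    int main(void) {
        /* Predict the value that follows 2, 8, 24 with each order. */
        printf("P0=%d P1=%d P2=%d P3=%d\n",
               p0(), p1(24), p2(24, 8), p3(24, 8, 2));   /* 0, 24, 40, 50 */
        return 0;
    }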
3. Data encoding
The goal of the stages above is to make all of the numbers as small as possible by eliminating any correlation between them. Once that is done, the resulting numbers must be written to disk. Why are smaller numbers better? Because they require fewer bits to represent. For example, suppose we want to encode this array of numbers (stored as 32-bit longs): base 10: 10, 14, 15, 46; or in binary, base 2: 1010, 1110, 1111, 101110. Storing each of these as a separate 32-bit long is obviously very inefficient: it would take 128 bits, and looking at the same numbers in base 2 makes it clear there must be a better way. Ideally we would just stick the four numbers together using only the necessary bits, so 1010, 1110, 1111, 101110 without the commas would be 101011101111101110. The problem is that we then have no way of knowing where one number ends and the next begins. To store smaller numbers in fewer bits while still being able to decompress them, we use "entropy encoding". What follows is a closer look at a common entropy encoding system called Rice encoding. Monkey's Audio uses a slightly more advanced entropy encoder, but the fundamentals of Rice encoding are still useful.
Entropy encoding
Rice encoding
https://en.wikipedia.org/wiki/Golomb_coding
Rice encoding is a way of representing smaller numbers with fewer bits while still being able to tell where one number ends and the next begins. Basically it works like this:
1. Make your best guess at how many bits a number will take, and call that k.
2. Take the rightmost k bits of the number and remember them.
3. Look at the value of the number without its rightmost k bits (the overflow that did not fit in k bits).
4. Encode the number as one zero for each unit of the overflow from step 3, followed by a terminating 1 to show that the overflow is done, followed by the k bits from step 2.
Example
Let's look at an example and encode the fourth number in the series 10, 14, 15, 46.
1. Make your best guess at how many bits the number will take and call it k: the first three numbers take 4 bits, so k = 4 seems like a reasonable guess.
2. Take the rightmost k bits of the number and remember them: the rightmost 4 bits of 46 (101110) are 1110.
3. Look at the value of the number without its rightmost k bits: taking the 1110 off the right of 101110 leaves 10, or 2 in base 10. That is the overflow that did not fit in k bits.
4. Use these values to encode the number: we write two 0s (one for each unit of the overflow), then the terminating 1, then the k bits 1110. Altogether that gives: 0011110
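A minimal sketch of that procedure for non-negative values, writing the bits out as text so each piece is visible (rice_encode is just an illustrative name, not Monkey's Audio's API):

    #include <stdio.h>

    /* Rice-encode a non-negative value into 'out' as '0'/'1' characters:
     * (n >> k) zeroes, a terminating 1, then the k low bits of n. */
    static void rice_encode(unsigned n, unsigned k, char *out) {
        unsigned overflow = n >> k;
        while (overflow--)
            *out++ = '0';
        *out++ = '1';
        for (int bit = (int)k - 1; bit >= 0; bit--)
            *out++ = ((n >> bit) & 1) ? '1' : '0';
        *out = '\0';
    }

    int main(void) {
        unsigned values[] = {10, 14, 15, 46};
        char bits[64];
        for (int i = 0; i < 4; i++) {
            rice_encode(values[i], 4, bits);
            printf("%2u -> %s\n", values[i], bits);   /* 46 -> 0011110 */
        }
        return 0;
    }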
Now, to undo this, we take 0011110 and k = 4 and work backwards. We first see that the overflow is 2 (there are two zeros before the terminating 1). We also see that the last four bits are 1110. So we take the value 10 (the overflow) and the value 1110 (the k bits) and do a little shifting and OR-ing: (overflow << k) | (k bits) = (10 << 4) | 1110 = 101110, which is 46 again.
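Going the other way, a matching decoder sketch over the same text representation of the bits:

    #include <stdio.h>

    /* Decode one Rice-coded value from '0'/'1' characters: count zeroes up to the
     * terminating 1 (the overflow), read k more bits, then (overflow << k) | k bits. */
    static unsigned rice_decode(const char *bits, unsigned k) {
        unsigned overflow = 0;
        while (*bits == '0') {
            overflow++;
            bits++;
        }
        bits++;                                    /* skip the terminating 1 */
        unsigned low = 0;
        for (unsigned i = 0; i < k; i++)
            low = (low << 1) | (unsigned)(*bits++ - '0');
        return (overflow << k) | low;
    }

    int main(void) {
        printf("%u\n", rice_decode("0011110", 4)); /* prints 46 */
        return 0;
    }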
A more technical version of the same example
Here is a more technical, math-oriented description of the same process. Let n be the integer to encode and k the number of bits to encode directly. The encoded output is:
Sign bit (1 is positive, 0 is negative)
n / (2^k) zeroes
A terminating 1
The k least significant bits of n
For example, if n = 578 and k = 8, the output is 100101000010:
Sign bit (1 is positive, 0 is negative) = [1]
n / (2^k) zeroes: 578 / 256 = 2, so [00]
Terminating 1: [1]
The k least significant bits of n: 578 is 1001000010 in binary, so the 8 least significant bits are [01000010]
Putting 1-4 together: [1][00][1][01000010] = 100101000010
During encoding, the best k is determined by looking at the average of the last so many values (averaging the last 16-128 values works fairly well) and then picking the most efficient k for that average (basically guessing what the next value will be and choosing the best k for it). For a value n, the best k is roughly log(n) / log(2).
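A small sketch of that k-selection idea (the 4-value window here just reuses the running example; a real coder would average over something like the last 16-128 values as described above):

    #include <stdio.h>
    #include <math.h>

    /* Pick k as roughly log2 of the average of the recent values. */
    static unsigned choose_k(const unsigned *recent, int count) {
        double sum = 0.0;
        for (int i = 0; i < count; i++)
            sum += recent[i];
        double avg = (count > 0) ? sum / count : 1.0;
        if (avg < 1.0)
            avg = 1.0;
        return (unsigned)(log(avg) / log(2.0));   /* the log(n)/log(2) rule from the text */
    }

    int main(void) {
        unsigned recent[] = {10, 14, 15, 46};
        printf("k = %u\n", choose_k(recent, 4));  /* average 21.25 -> k = 4 */
        return 0;
    }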