This is the html version of the book:
Thomas Wiegand and Heiko Schwarz (2010): "Source Coding: Part I of Fundamentals of Source and Video Coding", Now Publishers, Foundations and Trends® in Signal Processing: Vol. 4, No. 1-2, pp. 1-222.
1 Introduction
The advances in source coding technology along with the rapid developments and improvements of network infrastructures, storage capacity, and computing power are enabling an increasing number of multimedia applications. In this text, we will describe and analyze fundamental source coding techniques that are found in a variety of multimedia applications, with the emphasis on algorithms that are used in video coding applications. The present first part of the text concentrates on the description of fundamental source coding techniques, while the second part describes their application in modern video coding.
1.1 The Communication Problem
The application areas of digital video today range from multimedia messaging, video telephony, and video conferencing over mobile TV, wireless and wired Internet video streaming, standard- and high-definition TV broadcasting, subscription and pay-per-view services to personal video recorders, digital camcorders, and optical storage media such as the digital versatile disc (DVD) and Blu-ray disc. Digital video transmission over satellite, cable, and terrestrial channels is typically based on H.222.0/MPEG-2 systems [37], while wired and wireless real-time conversational services often use H.32x [32, 33, 34] or SIP [64], and mobile transmission systems using the Internet and mobile networks are usually based on RTP/IP [68]. In all these application areas, the same basic principles of video compression are employed.
The block structure for a typical video transmission scenario is illustrated in Fig. 1.1. The video capture generates a video signal s that is discrete in space and time. Usually, the video capture consists of a camera that projects the 3-dimensional scene onto an image sensor. Cameras typically generate 25 to 60 frames per second. For the considerations in this text, we assume that the video signal s consists of progressively-scanned pictures. The video encoder maps the video signal s into the bitstream b. The bitstream is transmitted over the error control channel and the received bitstream b′ is processed by the video decoder that reconstructs the decoded video signal s′ and presents it via the video display to the human observer. The visual quality of the decoded video signal s′ as shown on the video display affects the viewing experience of the human observer. This text focuses on the video encoder and decoder part, which together is called a video codec.
The error characteristic of the digital channel can be controlled by the channel encoder, which adds redundancy to the bits at the video encoder output b. The modulator maps the channel encoder output to an analog signal, which is suitable for transmission over a physical channel. The demodulator interprets the received analog signal as a digital signal, which is fed into the channel decoder. The channel decoder processes the digital signal and produces the received bitstream b′, which may be identical to b even in the presence of channel noise. The sequence of the five components, channel encoder, modulator, channel, demodulator, and channel decoder, is lumped into one box, which is called the error control channel. According to Shannon's basic work [69, 70], which also laid the groundwork for the subject of this text, the amount of transmission errors can be controlled by introducing redundancy at the channel encoder and by introducing delay.
The basic communication problem may be posed as conveying source data with the highest fidelity possible without exceeding an available bit rate, or it may be posed as conveying the source data using the lowest bit rate possible while maintaining a specified reproduction fidelity [69]. In either case, a fundamental tradeoff is made between bit rate and signal fidelity. The ability of a source coding system to suitably choose this tradeoff is referred to as its coding efficiency or rate distortion performance. Video codecs are thus primarily characterized in terms of:
- the bit rate that they require for representing a video signal, and
- the distortion of the decoded video signal relative to the original signal.
However, in practical video transmission systems, the following additional issues must be considered:
- the delay that is introduced by encoding, transmission, and decoding, and
- the computational complexity and memory requirements of the video codec.
The practical source coding design problem can be stated as follows: Given a maximum allowed delay and a maximum allowed complexity, achieve an optimal tradeoff between bit rate and distortion for the range of network environments envisioned in the scope of the applications.
1.2 Scope and Overview of the Text
This text provides a description of the fundamentals of source and video coding. It is aimed at aiding students and engineers to investigate the subject. When we felt that a result is of fundamental importance to the video codec design problem, we chose to deal with it in greater depth. However, we make no attempt at exhaustive coverage of the subject, since it is too broad and too deep to fit the compact presentation format that is chosen here (and our time limit to write this text). We will also not be able to cover all the possible applications of video coding. Instead, our focus is on the source coding fundamentals of video coding. This means that we will leave out a number of areas, including implementation aspects of video coding and the whole subject of video transmission and error-robust coding.
The text is divided into two parts. In the first part, the fundamentals of source coding are introduced, while the second part explains their application to modern video coding.
In the present first part, we describe basic source coding techniques that are also found in video codecs. In order to keep the presentation simple, we focus on the description for one-dimensional discrete-time signals. The extension of source coding techniques to two-dimensional signals, such as video pictures, will be highlighted in the second part of the text in the context of video coding. Chapter 2 gives a brief overview of the concepts of probability, random variables, and random processes, which build the basis for the descriptions in the following chapters. In Chapter 3, we explain the fundamentals of lossless source coding and present lossless techniques that are found in the video coding area in some detail. The following chapters deal with the topic of lossy compression. Chapter 4 summarizes important results of rate distortion theory, which builds the mathematical basis for analyzing the performance of lossy coding techniques. Chapter 5 treats the important subject of quantization, which can be considered as the basic tool for choosing a tradeoff between transmission bit rate and signal fidelity. Due to its importance in video coding, we will mainly concentrate on the description of scalar quantization. But we also briefly introduce vector quantization in order to show the structural limitations of scalar quantization and motivate the later discussed techniques of predictive coding and transform coding. Chapter 6 covers the subject of prediction and predictive coding. These concepts are found in several components of video codecs. Well-known examples are the motion-compensated prediction using previously coded pictures, the intra prediction using already coded samples inside a picture, and the prediction of motion parameters. In Chapter 7, we explain the technique of transform coding, which is used in most video codecs for efficiently representing prediction error signals.
The second part of the text will describe the application of the fundamental source coding techniques to video coding. We will discuss the basic structure and the basic concepts that are used in video coding and highlight their application in modern video coding standards. Additionally, we will consider advanced encoder optimization techniques that are relevant for achieving a high coding efficiency. The effectiveness of various design aspects will be demonstrated based on experimental results.
1.3 The Source Coding Principle
The present first part of the text describes the fundamental concepts of source coding. We explain various known source coding principles and demonstrate their efficiency based on one-dimensional model sources. For additional information on information-theoretical aspects of source coding, the reader is referred to the excellent monographs in [11, 22, 4]. For the overall subject of source coding including algorithmic design questions, we recommend the two fundamental texts by Gersho and Gray [16] and Jayant and Noll [44].
The primary task of a source codec is to represent a signal with the minimum number of (binary) symbols without exceeding an "acceptable level of distortion", which is determined by the application. Two types of source coding techniques are typically named:
- Lossless coding: the sequence of source symbols is mapped to a bitstream from which it can be exactly reconstructed; it is applicable only to discrete sources.
- Lossy coding: the reconstructed signal is allowed to deviate from the original signal, which permits a reduction of the bit rate at the cost of some distortion.
Chapter 2 briefly reviews the concepts of probability, random variables, and random processes. Lossless source coding will be described in Chapter 3. Chapters 5, 6, and 7 give an introduction to the lossy coding techniques that are found in modern video coding applications. In Chapter 4, we provide some important results of rate distortion theory, which will be used for discussing the efficiency of the presented lossy coding techniques.
2 Random Processes
The primary goal of video communication, and signal transmission in general, is the transmission of new information to a receiver. Since the receiver does not know the transmitted signal in advance, the source of information can be modeled as a random process. This permits the description of source coding and communication systems using the mathematical framework of the theory of probability and random processes. If reasonable assumptions are made with respect to the source of information, the performance of source coding algorithms can be characterized based on probabilistic averages. The modeling of information sources as random processes builds the basis for the mathematical theory of source coding and communication.
In this chapter, we give a brief overview of the concepts of probability, random variables, and random processes and introduce models for random processes, which will be used in the following chapters for evaluating the efficiency of the described source coding algorithms. For further information on the theory of probability, random variables, and random processes, the interested reader is referred to [45, 60, 25].
2.1 Probability
Probability theory is a branch of mathematics that concerns the description and modeling of random events. The basis for modern probability theory is the axiomatic definition of probability that was introduced by Kolmogorov in [45] using concepts from set theory.
We consider an experiment with an uncertain outcome, which is called a random experiment. The union of all possible outcomes ζ of the random experiment is referred to as the certain event or sample space of the random experiment and is denoted by 𝒪. A subset 𝒜 of the sample space is called an event. To each event 𝒜 a measure P(𝒜) is assigned, which is referred to as the probability of the event 𝒜. The measure of probability satisfies the following three axioms:
P(𝒜) ≥ 0.  (2.1)
P(𝒪) = 1.  (2.2)
P(𝒜 ∪ ℬ) = P(𝒜) + P(ℬ),  if 𝒜 ∩ ℬ = ∅.  (2.3)
In addition to the axioms, the notion of the independence of two events and the conditional probability are introduced:
Two events 𝒜_{i} and 𝒜_{j} are independent if
P(𝒜_{i} ∩ 𝒜_{j}) = P(𝒜_{i}) · P(𝒜_{j}).  (2.4)
The conditional probability of an event 𝒜_{i} given another event 𝒜_{j}, with P(𝒜_{j}) > 0, is defined as
P(𝒜_{i} | 𝒜_{j}) = P(𝒜_{i} ∩ 𝒜_{j}) / P(𝒜_{j}).  (2.5)
The definitions (2.4) and (2.5) imply that, if two events 𝒜_{i} and 𝒜_{j} are independent and P(𝒜_{j}) > 0, the conditional probability of the event 𝒜_{i} given the event 𝒜_{j} is equal to the marginal probability of 𝒜_{i},
P(𝒜_{i} | 𝒜_{j}) = P(𝒜_{i}).  (2.6)
A direct consequence of the definition of conditional probability in (2.5) is Bayes' theorem,
P(𝒜_{i} | 𝒜_{j}) = P(𝒜_{j} | 𝒜_{i}) · P(𝒜_{i}) / P(𝒜_{j}),  with P(𝒜_{i}), P(𝒜_{j}) > 0,  (2.7)
which describes the interdependency of the conditional probabilities P(𝒜_{i}|𝒜_{j}) and P(𝒜_{j}|𝒜_{i}) for two events 𝒜_{i} and 𝒜_{j}.
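As a quick numeric illustration of (2.7), the following Python sketch computes P(𝒜_{i}|𝒜_{j}) from P(𝒜_{j}|𝒜_{i}); the probability values are hypothetical and chosen only for illustration.

```python
# Bayes' theorem (2.7): P(Ai|Aj) = P(Aj|Ai) * P(Ai) / P(Aj)
# Hypothetical example values, chosen only for illustration.
p_ai = 0.3                      # P(Ai)
p_aj_given_ai = 0.6             # P(Aj|Ai)
p_aj_given_not_ai = 0.2         # P(Aj|not Ai)

# Total probability: P(Aj) = P(Aj|Ai)P(Ai) + P(Aj|not Ai)P(not Ai)
p_aj = p_aj_given_ai * p_ai + p_aj_given_not_ai * (1.0 - p_ai)

p_ai_given_aj = p_aj_given_ai * p_ai / p_aj
print(p_ai_given_aj)            # 0.5625
```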
2.2 Random Variables
A concept that we will use throughout this text is that of random variables, which will be denoted with upper-case letters. A random variable S is a function of the sample space 𝒪 that assigns a real value S(ζ) to each outcome ζ ∈ 𝒪 of a random experiment.
The cumulative distribution function (cdf) of a random variable S is denoted by F_{S}(s) and specifies the probability of the event {S ≤ s},
F_{S}(s) = P(S ≤ s).  (2.8)
The cdf is a nondecreasing function with F_{S}(−∞) = 0 and F_{S}(∞) = 1. The concept of defining a cdf can be extended to sets of two or more random variables S = {S_{0},…,S_{N−1}}. The function
F_{S}(s) = P(S ≤ s) = P(S_{0} ≤ s_{0},…,S_{N−1} ≤ s_{N−1})  (2.9)
is referred to as N-dimensional cdf, joint cdf, or joint distribution. A set S of random variables is also referred to as a random vector and is also denoted using the vector notation S = (S_{0},…,S_{N−1})^{T}. For the joint cdf of two random variables X and Y we will use the notation F_{XY}(x,y) = P(X ≤ x, Y ≤ y). The joint cdf of two random vectors X and Y will be denoted by F_{XY}(x,y) = P(X ≤ x, Y ≤ y).
The conditional cdf or conditional distribution of a random variable S given an event ℬ, with P(ℬ) > 0, is defined as the conditional probability of the event {S ≤ s} given the event ℬ,
F_{S|ℬ}(s | ℬ) = P(S ≤ s | ℬ) = P({S ≤ s} ∩ ℬ) / P(ℬ).  (2.10)
The conditional distribution of a random variable X given another random variable Y is denoted by F_{X|Y}(x|y) and defined as
F_{X|Y}(x|y) = F_{XY}(x,y) / F_{Y}(y).  (2.11)
Similarly, the conditional cdf of a random vector X given another random vector Y is given by F_{X|Y}(x|y) = F_{XY}(x,y) / F_{Y}(y).
2.2.1 Continuous Random Variables
A random variable S is called a continuous random variable, if its cdf F_{S}(s) is a continuous function. The probability P(S = s) is equal to zero for all values of s. An important function of continuous random variables is the probability density function (pdf), which is defined as the derivative of the cdf,
f_{S}(s) = dF_{S}(s) / ds.  (2.12)
Since the cdf F_{S}(s) is a monotonically nondecreasing function, the pdf f_{S}(s) is greater than or equal to zero for all values of s. Important examples for pdf’s, which we will use later in this text, are given below.
Uniform pdf:
f_{S}(s) = 1/A  for |s − μ_{S}| ≤ A/2, and f_{S}(s) = 0 otherwise  (A > 0)  (2.13)
Laplacian pdf:
f_{S}(s) = (1 / (√2 · σ_{S})) · e^{−√2·|s−μ_{S}|/σ_{S}}  (2.14)
Gaussian pdf:
f_{S}(s) = (1 / (√(2π) · σ_{S})) · e^{−(s−μ_{S})^{2}/(2σ_{S}^{2})}  (2.15)
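As a sanity check, the following Python sketch evaluates the three densities (2.13)-(2.15) and verifies numerically that each integrates to one; the parameter values (A = 2, μ_{S} = 0, σ_{S} = 1) are arbitrary illustrations.

```python
import numpy as np

def uniform_pdf(s, A=2.0):
    # Uniform pdf (2.13), centered at mu = 0 for simplicity.
    return np.where(np.abs(s) <= A / 2, 1.0 / A, 0.0)

def laplacian_pdf(s, mu=0.0, sigma=1.0):
    # Laplacian pdf (2.14) with mean mu and standard deviation sigma.
    return 1.0 / (sigma * np.sqrt(2)) * np.exp(-np.abs(s - mu) * np.sqrt(2) / sigma)

def gaussian_pdf(s, mu=0.0, sigma=1.0):
    # Gaussian pdf (2.15) with mean mu and standard deviation sigma.
    return 1.0 / np.sqrt(2 * np.pi * sigma**2) * np.exp(-(s - mu)**2 / (2 * sigma**2))

s = np.linspace(-10, 10, 100001)
for pdf in (uniform_pdf, laplacian_pdf, gaussian_pdf):
    print(pdf.__name__, np.trapz(pdf(s), s))   # each value should be close to 1.0
```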
The concept of defining a probability density function is also extended to random vectors S = (S_{0},…,S_{N−1})^{T}. The multivariate derivative of the joint cdf F_{S}(s),
f_{S}(s) = ∂^{N} F_{S}(s) / (∂s_{0} ∂s_{1} ⋯ ∂s_{N−1}),  (2.16)
is referred to as the N-dimensional pdf, joint pdf, or joint density. For two random variables X and Y, we will use the notation f_{XY}(x,y) for denoting the joint pdf of X and Y. The joint density of two random vectors X and Y will be denoted by f_{XY}(x,y).
The conditional pdf or conditional density f_{S|ℬ}(s|ℬ) of a random variable S given an event ℬ, with P(ℬ) > 0, is defined as the derivative of the conditional distribution F_{S|ℬ}(s|ℬ), i.e., f_{S|ℬ}(s|ℬ) = dF_{S|ℬ}(s|ℬ)/ds. The conditional density of a random variable X given another random variable Y is denoted by f_{X|Y}(x|y) and defined as
f_{X|Y}(x|y) = f_{XY}(x,y) / f_{Y}(y).  (2.17)
Similarly, the conditional pdf of a random vector X given another random vector Y is given by f_{X|Y}(x|y) = f_{XY}(x,y) / f_{Y}(y).
2.2.2 Discrete Random Variables
A random variable S is said to be a discrete random variable if its cdf F_{S}(s) represents a staircase function. A discrete random variable S can only take values of a countable set 𝒜 = {a_{0},a_{1},…}, which is called the alphabet of the random variable. For a discrete random variable S with an alphabet 𝒜, the function
p_{S}(a) = P(S = a),  ∀a ∈ 𝒜,  (2.18)
which gives the probabilities that S is equal to a particular alphabet letter, is referred to as probability mass function (pmf). The cdf F_{S}(s) of a discrete random variable S is given by the sum of the probability masses p(a) with a ≤ s,
F_{S}(s) = Σ_{a≤s} p_{S}(a).  (2.19)
With the Dirac delta function δ it is also possible to use a pdf f_{S} for describing the statistical properties of a discrete random variable S with a pmf p_{S}(a),
f_{S}(s) = Σ_{a ∈ 𝒜} p_{S}(a) · δ(s − a).  (2.20)
Examples for pmf’s that will be used in this text are listed below. The pmf’s are specified in terms of parameters p and M, where p is a real number in the open interval (0,1) and M is an integer greater than 1. The binary and uniform pmf are specified for discrete random variables with a finite alphabet, while the geometric pmf is specified for random variables with a countably infinite alphabet.
Binary pmf:
p_{S}(a_{0}) = p,  p_{S}(a_{1}) = 1 − p  (alphabet 𝒜 = {a_{0},a_{1}})  (2.21)
Uniform pmf:
p_{S}(a_{i}) = 1/M,  ∀a_{i} ∈ 𝒜 = {a_{0},…,a_{M−1}}  (2.22)
Geometric pmf:
p_{S}(a_{i}) = (1 − p) · p^{i},  ∀a_{i} ∈ 𝒜 = {a_{0},a_{1},…}  (2.23)
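The following Python sketch instantiates the three pmfs (2.21)-(2.23) for illustrative parameter values and checks that each sums to one (the geometric sum is truncated, so it only approaches one):

```python
import numpy as np

p, M = 0.3, 4                       # illustrative parameters: p in (0,1), integer M > 1

binary = np.array([p, 1 - p])                     # (2.21)
uniform = np.full(M, 1.0 / M)                     # (2.22)
geometric = (1 - p) * p ** np.arange(100)         # (2.23), truncated to 100 letters

print(binary.sum(), uniform.sum(), geometric.sum())   # each close to 1.0
# Mean letter index of the geometric pmf: sum_i i*(1-p)*p^i = p/(1-p)
print(np.dot(np.arange(100), geometric), p / (1 - p))
```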
The pmf for a random vector S = (S_{0},…,S_{N−1})^{T} is defined by
p_{S}(a) = P(S = a) = P(S_{0} = a_{0},…,S_{N−1} = a_{N−1})  (2.24)
and is also referred to as N-dimensional pmf or joint pmf. The joint pmf for two random variables X and Y, or for two random vectors X and Y, will likewise be denoted by p_{XY}(a_{x},a_{y}).
The conditional pmf p_{S|ℬ}(a|ℬ) of a random variable S given an event ℬ, with P(ℬ) > 0, specifies the conditional probabilities of the events {S = a} given the event ℬ, i.e., p_{S|ℬ}(a|ℬ) = P(S = a | ℬ). The conditional pmf of a random variable X given another random variable Y is denoted by p_{X|Y}(a_{x}|a_{y}) and defined as
p_{X|Y}(a_{x}|a_{y}) = p_{XY}(a_{x},a_{y}) / p_{Y}(a_{y}).  (2.25)
Similarly, the conditional pmf of a random vector X given another random vector Y is given by p_{X|Y}(a_{x}|a_{y}) = p_{XY}(a_{x},a_{y}) / p_{Y}(a_{y}).
Statistical properties of random variables are often expressed using probabilistic averages, which are referred to as expectation values or expected values. The expectation value of an arbitrary function g(S) of a continuous random variable S is defined by the integral
E{g(S)} = ∫_{−∞}^{∞} g(s) f_{S}(s) ds.  (2.26)
For discrete random variables S, it is defined as the sum
E{g(S)} = Σ_{a ∈ 𝒜} g(a) p_{S}(a).  (2.27)
Two important expectation values are the mean μ_{S} and the variance σ_{S}^{2} of a random variable S, which are given by
μ_{S} = E{S}  and  σ_{S}^{2} = E{(S − μ_{S})^{2}}.  (2.28)
For the following discussion of expectation values, we consider continuous random variables. For discrete random variables, the integrals have to be replaced by sums and the pdf’s have to be replaced by pmf’s.
The expectation value of a function g(S) of a set of N random variables S = {S_{0},…,S_{N−1}} is given by
E{g(S)} = ∫_{ℝ^{N}} g(s) f_{S}(s) ds.  (2.29)
The conditional expectation value of a function g(S) of a random variable S given an event ℬ, with P(ℬ) > 0, is defined by
E{g(S) | ℬ} = ∫_{−∞}^{∞} g(s) f_{S|ℬ}(s|ℬ) ds.  (2.30)
The conditional expectation value of a function g(X) of a random variable X given a particular value y of another random variable Y is specified by
E{g(X) | y} = E{g(X) | Y = y} = ∫_{−∞}^{∞} g(x) f_{X|Y}(x|y) dx  (2.31)
and represents a deterministic function of the value y. If the value y is replaced by the random variable Y, the expression E{g(X)|Y} specifies a new random variable that is a function of the random variable Y. The expectation value E{Z} of a random variable Z = E{g(X)|Y} can be computed using the iterative expectation rule,
E{ E{g(X)|Y} } = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} g(x) f_{X|Y}(x|y) dx ) f_{Y}(y) dy = E{g(X)}.  (2.32)
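A small Monte-Carlo sketch can make the iterative expectation rule (2.32) concrete; the model for Y and X given Y is an arbitrary Gaussian choice used purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Illustrative model: Y ~ N(0,1), and given Y = y, X ~ N(2*y, 1).
y = rng.standard_normal(n)
x = 2 * y + rng.standard_normal(n)

g = lambda v: v**2                      # arbitrary function g

# Inner expectation E{g(X)|Y=y} for this model: E{X^2 | y} = (2y)^2 + 1.
inner = (2 * y)**2 + 1

# E{E{g(X)|Y}} should match E{g(X)} (here both are close to 5).
print(inner.mean(), g(x).mean())
```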
In analogy to (2.29), the concept of conditional expectation values is also extended to random vectors.
2.3 Random Processes
We now consider a series of random experiments that are performed at time instants t_{n}, with n being an integer greater than or equal to 0. The outcome of each random experiment at a particular time instant t_{n} is characterized by a random variable S_{n} = S(t_{n}). The series of random variables S = {S_{n}} is called a discrete-time random process. The statistical properties of a discrete-time random process S can be characterized by the Nth order joint cdf
F_{S_{k}}(s) = P(S_{k} ≤ s_{0},…,S_{k+N−1} ≤ s_{N−1}).  (2.33)
Random processes S that represent a series of continuous random variables S_{n} are called continuous random processes and random processes for which the random variables S_{n} are of discrete type are referred to as discrete random processes. For continuous random processes, the statistical properties can also be described by the Nth order joint pdf, which is given by the multivariate derivative
f_{S_{k}}(s) = ∂^{N} F_{S_{k}}(s) / (∂s_{0} ⋯ ∂s_{N−1}).  (2.34)
For discrete random processes, the Nth order joint cdf F_{S_{k}}(s) can also be specified using the Nth order joint pmf,
F_{S_{k}}(s) = Σ_{a ∈ 𝒜^{N}: a ≤ s} p_{S_{k}}(a),  (2.35)
where 𝒜^{N} represents the product space of the alphabets 𝒜_{n} for the random variables S_{n} with n = k,…,k+N−1, and
p_{S_{k}}(a) = P(S_{k} = a_{0},…,S_{k+N−1} = a_{N−1})  (2.36)
represents the Nth order joint pmf.
The statistical properties of random processes S = {S_{n}} are often characterized by an Nth order autocovariance matrix C_{N}(t_{k}) or an Nth order autocorrelation matrix R_{N}(t_{k}). The Nth order autocovariance matrix is defined by
C_{N}(t_{k}) = E{ (S_{k}^{(N)} − μ_{N}(t_{k})) · (S_{k}^{(N)} − μ_{N}(t_{k}))^{T} },  (2.37)
where S_{k}^{(N)} represents the vector (S_{k},…,S_{k+N−1})^{T} of N successive random variables and μ_{N}(t_{k}) = E{S_{k}^{(N)}} is the Nth order mean. The Nth order autocorrelation matrix is defined by
R_{N}(t_{k}) = E{ S_{k}^{(N)} · (S_{k}^{(N)})^{T} }.  (2.38)
A random process is called stationary if its statistical properties are invariant to a shift in time. For stationary random processes, the Nth order joint cdf F_{Sk}(s), pdf f_{Sk}(s), and pmf p_{Sk}(a) are independent of the first time instant t_{k} and are denoted by F_{S}(s), f_{S}(s), and p_{S}(a), respectively. For the random variables S_{n} of stationary processes we will often omit the index n and use the notation S.
For stationary random processes, the Nth order mean, the Nth order autocovariance matrix, and the Nth order autocorrelation matrix are independent of the time instant t_{k} and are denoted by μ_{N}, C_{N}, and R_{N}, respectively. The Nth order mean μ_{N} is a vector with all N elements being equal to the mean μ_{S} of the random variables S. The Nth order autocovariance matrix C_{N} = E{(S^{(N)} − μ_{N})(S^{(N)} − μ_{N})^{T}} is a symmetric Toeplitz matrix,
C_{N} = σ_{S}^{2} ·
⎡ 1        ρ_{1}    ρ_{2}    ⋯  ρ_{N−1} ⎤
⎢ ρ_{1}    1        ρ_{1}    ⋯  ρ_{N−2} ⎥
⎢ ρ_{2}    ρ_{1}    1        ⋯  ρ_{N−3} ⎥
⎢ ⋮        ⋮        ⋮        ⋱  ⋮       ⎥
⎣ ρ_{N−1}  ρ_{N−2}  ρ_{N−3}  ⋯  1       ⎦  (2.39)
A Toeplitz matrix is a matrix with constant values along all descending diagonals from left to right. For information on the theory and application of Toeplitz matrices the reader is referred to the standard reference [29] and the tutorial [23]. The (k,l)th element of the autocovariance matrix C_{N} is given by the autocovariance function ϕ_{k,l} = E{(S_{k} − μ_{S})(S_{l} − μ_{S})}. For stationary processes, the autocovariance function depends only on the absolute difference |k − l| and can be written as ϕ_{k,l} = ϕ_{|k−l|} = σ_{S}^{2} · ρ_{|k−l|}. The Nth order autocorrelation matrix R_{N} is also a symmetric Toeplitz matrix. The (k,l)th element of R_{N} is given by r_{k,l} = ϕ_{k,l} + μ_{S}^{2}.
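As a minimal sketch of this structure, the following Python snippet builds the stationary autocovariance matrix (2.39) with scipy's Toeplitz helper; σ_{S} and the correlation sequence ρ_{k} are illustrative values (the code assumes scipy is available).

```python
import numpy as np
from scipy.linalg import toeplitz

N, sigma_s = 4, 1.5
rho = 0.9 ** np.arange(N)        # illustrative correlation sequence rho_k (here rho_k = 0.9^k)

C_N = sigma_s**2 * toeplitz(rho) # symmetric Toeplitz matrix with entries sigma^2 * rho_|k-l|
print(C_N)
```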
A random process S = {S_{n}} for which the random variables S_{n} are independent is referred to as a memoryless random process. If a memoryless random process is additionally stationary, it is also said to be independent and identically distributed (iid), since the random variables S_{n} are independent and their cdfs F_{S_{n}}(s) = P(S_{n} ≤ s) do not depend on the time instant t_{n}. The Nth order cdf F_{S}(s), pdf f_{S}(s), and pmf p_{S}(a) for iid processes, with s = (s_{0},…,s_{N−1})^{T} and a = (a_{0},…,a_{N−1})^{T}, are given by the products
F_{S}(s) = ∏_{n=0}^{N−1} F_{S}(s_{n}),  f_{S}(s) = ∏_{n=0}^{N−1} f_{S}(s_{n}),  p_{S}(a) = ∏_{n=0}^{N−1} p_{S}(a_{n}),  (2.40)
where F_{S}(s), f_{S}(s), and p_{S}(a) are the marginal cdf, pdf, and pmf, respectively, for the random variables S_{n}.
A Markov process is characterized by the property that future outcomes do not depend on past outcomes, but only on the present outcome,
P(S_{n} ≤ s_{n} | S_{n−1} = s_{n−1}, S_{n−2} = s_{n−2},…) = P(S_{n} ≤ s_{n} | S_{n−1} = s_{n−1}).  (2.41)
This property can also be expressed in terms of the pdf,
f_{S_{n}}(s_{n} | s_{n−1}, s_{n−2},…) = f_{S_{n}}(s_{n} | s_{n−1}),  (2.42)
for continuous random processes, or in terms of the pmf,
p_{S_{n}}(a_{n} | a_{n−1}, a_{n−2},…) = p_{S_{n}}(a_{n} | a_{n−1}),  (2.43)
for discrete random processes.
Given a continuous zero-mean iid process Z = {Z_{n}}, a stationary continuous Markov process S = {S_{n}} with mean μ_{S} can be constructed by the recursive rule
S_{n} = Z_{n} + ρ · (S_{n−1} − μ_{S}) + μ_{S},  (2.44)
where ρ, with |ρ| < 1, represents the correlation coefficient between successive random variables S_{n−1} and S_{n}. Since the random variables Z_{n} are independent, a random variable S_{n} only depends on the preceding random variable S_{n−1}. The variance σ_{S}^{2} of the stationary Markov process S is given by
σ_{S}^{2} = E{(S_{n} − μ_{S})^{2}} = σ_{Z}^{2} / (1 − ρ^{2}),  (2.45)
where σ_{Z}^{2} = E{Z_{n}^{2}} denotes the variance of the zero-mean iid process Z. The autocovariance function of the process S is given by
ϕ_{k,l} = ϕ_{|k−l|} = σ_{S}^{2} · ρ^{|k−l|}.  (2.46)
Each element ϕ_{k,l} of the Nth order autocovariance matrix C_{N} thus equals σ_{S}^{2} times a nonnegative integer power of the correlation coefficient ρ.
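To illustrate (2.44)-(2.46), the following sketch generates a stationary Markov process from Gaussian innovations (an arbitrary choice of zero-mean iid process) and compares the empirical variance and lag-1 autocovariance with σ_{Z}²/(1−ρ²) and σ_{S}²·ρ:

```python
import numpy as np

rng = np.random.default_rng(1)
rho, mu_s, sigma_z, n = 0.8, 3.0, 1.0, 200_000

z = sigma_z * rng.standard_normal(n)      # zero-mean iid innovations Z_n
s = np.empty(n)
s[0] = mu_s                               # start at the mean (transient is negligible here)
for i in range(1, n):                     # recursion (2.44)
    s[i] = z[i] + rho * (s[i - 1] - mu_s) + mu_s

var_theory = sigma_z**2 / (1 - rho**2)                       # (2.45)
cov1 = np.mean((s[:-1] - mu_s) * (s[1:] - mu_s))             # empirical phi_1
print(s.var(), var_theory)                # both close to 2.78
print(cov1, var_theory * rho)             # both close to 2.22, eq. (2.46)
```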
In the following chapters, we will often obtain expressions that depend on the determinant |C_{N}| of the Nth order autocovariance matrix C_{N}. For stationary continuous Markov processes given by (2.44), the determinant |C_{N}| can be expressed by a simple relationship. Using Laplace's formula, we can expand the determinant of the Nth order autocovariance matrix along the first column,
|C_{N}| = Σ_{k=0}^{N−1} (−1)^{k} · ϕ_{k,0} · |C_{N}^{(k,0)}|,  (2.47)
where C_{N}^{(k,l)} represents the matrix that is obtained by removing the kth row and lth column from C_{N}. The first row of each matrix C_{N}^{(k,0)}, with k > 1, is equal to the second row of the same matrix multiplied by the correlation coefficient ρ. Hence, the first two rows of these matrices are linearly dependent and the determinants |C_{N}^{(k,0)}|, with k > 1, are equal to 0. Thus, we obtain
|C_{N}| = ϕ_{0,0} · |C_{N}^{(0,0)}| − ϕ_{1,0} · |C_{N}^{(1,0)}| = σ_{S}^{2} · |C_{N}^{(0,0)}| − σ_{S}^{2} ρ · |C_{N}^{(1,0)}|.  (2.48)
The matrix C_{N}^{(0,0)} represents the autocovariance matrix C_{N−1} of the order N−1. The matrix C_{N}^{(1,0)} is equal to C_{N−1} except that the first row is multiplied by the correlation coefficient ρ. Hence, the determinant |C_{N}^{(1,0)}| is equal to ρ·|C_{N−1}|, which yields the recursive rule
|C_{N}| = σ_{S}^{2} (1 − ρ^{2}) · |C_{N−1}|.  (2.49)
By using the expression |C_{1}| = σ_{S}^{2} for the determinant of the 1st order autocovariance matrix, we obtain the relationship
|C_{N}| = σ_{S}^{2N} (1 − ρ^{2})^{N−1}.  (2.50)
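The closed form (2.50) is easy to verify numerically for the Markov autocovariance matrix with entries σ_{S}²·ρ^{|k−l|}; the parameter values below are arbitrary.

```python
import numpy as np
from scipy.linalg import toeplitz

sigma_s, rho, N = 1.3, 0.7, 6

C_N = sigma_s**2 * toeplitz(rho ** np.arange(N))        # phi_{k,l} = sigma^2 * rho^|k-l|, eq. (2.46)
det_direct = np.linalg.det(C_N)
det_closed = sigma_s**(2 * N) * (1 - rho**2)**(N - 1)   # eq. (2.50)
print(det_direct, det_closed)    # the two values agree up to floating-point error
```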
A continuous random process S = {S_{n}} is said to be a Gaussian process if all finite collections of random variables S_{n} represent Gaussian random vectors. The Nth order pdf of a stationary Gaussian process S with mean μ_{S} and variance σ_{S}^{2} is given by
f_{S}(s) = (1 / ((2π)^{N/2} |C_{N}|^{1/2})) · e^{−(s−μ_{N})^{T} C_{N}^{−1} (s−μ_{N}) / 2},  (2.51)
where s is a vector of N consecutive samples, μ_{N} is the Nth order mean (a vector with all N elements being equal to the mean μ_{S}), and C_{N} is an Nth order nonsingular autocovariance matrix given by (2.39).
A continuous random process is called a Gauss-Markov process if it satisfies the requirements for both Gaussian processes and Markov processes. The statistical properties of a stationary Gauss-Markov process are completely specified by its mean μ_{S}, its variance σ_{S}^{2}, and its correlation coefficient ρ. The stationary continuous process in (2.44) is a stationary Gauss-Markov process if the random variables Z_{n} of the zero-mean iid process Z have a Gaussian pdf f_{Z}(s).
The Nth order pdf of a stationary Gauss-Markov process S with the mean μ_{S}, the variance σ_{S}^{2}, and the correlation coefficient ρ is given by (2.51), where the elements ϕ_{k,l} of the Nth order autocovariance matrix C_{N} depend on the variance σ_{S}^{2} and the correlation coefficient ρ and are given by (2.46). The determinant |C_{N}| of the Nth order autocovariance matrix of a stationary Gauss-Markov process can be written according to (2.50).
2.4 Summary of Random Processes
In this chapter, we gave a brief review of the concepts of random variables and random processes. A random variable is a function of the sample space of a random experiment. It assigns a real value to each possible outcome of the random experiment. The statistical properties of random variables can be characterized by cumulative distribution functions (cdf’s), probability density functions (pdf’s), probability mass functions (pmf’s), or expectation values.
Finite collections of random variables are called random vectors. A countably infinite sequence of random variables is referred to as (discrete-time) random process. Random processes for which the statistical properties are invariant to a shift in time are called stationary processes. If the random variables of a process are independent, the process is said to be memoryless. Random processes that are stationary and memoryless are also referred to as independent and identically distributed (iid) processes. Important models for random processes, which will also be used in this text, are Markov processes, Gaussian processes, and Gauss-Markov processes.
Beside reviewing the basic concepts of random variables and random processes, we also introduced the notations that will be used throughout the text. For simplifying formulas in the following chapters, we will often omit the subscripts that characterize the random variable(s) or random vector(s) in the notations of cdf’s, pdf’s, and pmf’s.
3 Lossless Source Coding
Lossless source coding describes a reversible mapping of sequences of discrete source symbols into sequences of codewords. In contrast to lossy coding techniques, the original sequence of source symbols can be exactly reconstructed from the sequence of codewords. Lossless coding is also referred to as noiseless coding or entropy coding. If the original signal exhibits statistical properties or dependencies that can be exploited for data compression, lossless coding techniques can provide a reduction in transmission rate. Basically all source codecs, and in particular all video codecs, include a lossless coding part by which the coding symbols are efficiently represented inside a bitstream.
In this chapter, we give an introduction to lossless source coding. We analyze the requirements for unique decodability, introduce a fundamental bound for the minimum average codeword length per source symbol that can be achieved with lossless coding techniques, and discuss various lossless source codes with respect to their efficiency, applicability, and complexity. For further information on lossless coding techniques, the reader is referred to the overview of lossless compression techniques in [67].
3.1 Classification of Lossless Source Codes
In this text, we restrict our considerations to the practically important case of binary codewords. A codeword is a sequence of binary symbols (bits) of the alphabet ℬ = {0,1}. Let S = {S_{n}} be a stochastic process that generates sequences of discrete source symbols. The source symbols s_{n} are realizations of the random variables S_{n}. By the process of lossless coding, a message s^{(L)} = {s_{0},…,s_{L−1}} consisting of L source symbols is converted into a sequence b^{(K)} = {b_{0},…,b_{K−1}} of K bits.
In practical coding algorithms, a message s^{(L)} is often split into blocks s^{(N)} = {s_{n},…,s_{n+N−1}} of N symbols, with 1 ≤ N ≤ L, and a codeword b^{(ℓ)}(s^{(N)}) = {b_{0},…,b_{ℓ−1}} of ℓ bits is assigned to each of these blocks s^{(N)}. The length ℓ of a codeword b^{(ℓ)}(s^{(N)}) can depend on the symbol block s^{(N)}. The codeword sequence b^{(K)} that represents the message s^{(L)} is obtained by concatenating the codewords b^{(ℓ)}(s^{(N)}) for the symbol blocks s^{(N)}. A lossless source code can be described by the encoder mapping
γ : s^{(N)} → b^{(ℓ)}(s^{(N)}),  (3.1)
which specifies a mapping from the set of finite length symbol blocks to the set of finite length binary codewords. The decoder mapping
γ^{−1} : b^{(ℓ)}(s^{(N)}) → s^{(N)},  (3.2)
is the inverse of the encoder mapping γ.
Depending on whether the number N of symbols in the blocks s^{(N)} and the number ℓ of bits for the associated codewords are fixed or variable, the following categories can be distinguished:
- Fixed-to-fixed mapping: a fixed number of symbols is mapped to a codeword of fixed length.
- Fixed-to-variable mapping: a fixed number of symbols is mapped to a variable-length codeword (sec. 3.2 and sec. 3.3.1).
- Variable-to-fixed mapping: a variable number of symbols is mapped to a codeword of fixed length.
- Variable-to-variable mapping: a variable number of symbols is mapped to a variable-length codeword (the V2V codes of sec. 3.3.2).
3.2 Variable-Length Coding for Scalars
In this section, we consider lossless source codes that assign a separate codeword to each symbol s_{n} of a message s^{(L)}. It is supposed that the symbols of the message s^{(L)} are generated by a stationary discrete random process S = {S_{n}}. The random variables S_{n} = S are characterized by a finite symbol alphabet 𝒜 = {a_{0},…,a_{M−1}} (the fundamental concepts and results shown in this section are also valid for countably infinite symbol alphabets, M → ∞) and a marginal pmf p(a) = P(S = a). The lossless source code associates each letter a_{i} of the alphabet with a binary codeword b_{i} = {b_{0}^{i},…,b_{ℓ(a_{i})−1}^{i}} of a length ℓ(a_{i}) ≥ 1. The goal of the lossless code design is to minimize the average codeword length
ℓ̄ = E{ℓ(S)} = Σ_{i=0}^{M−1} p(a_{i}) · ℓ(a_{i}),  (3.3)
while ensuring that each message s^{(L)} is uniquely decodable given its coded representation b^{(K)}.
3.2.1 Unique Decodability
A code is said to be uniquely decodable if and only if each valid coded representation b^{(K)} of a finite number K of bits can be produced by only one possible sequence of source symbols s^{(L)}.
A necessary condition for unique decodability is that each letter a_{i} of the symbol alphabet is associated with a different codeword. Codes with this property are called nonsingular codes and ensure that a single source symbol is unambiguously represented. But if messages with more than one symbol are transmitted, nonsingularity is not sufficient to guarantee unique decodability, as will be illustrated in the following.
Table 3.1 shows five example codes for a source with a four-letter alphabet and a given marginal pmf. Code A has the smallest average codeword length, but since the symbols a_{2} and a_{3} cannot be distinguished, code A is a singular code and is not uniquely decodable. Although code B is a nonsingular code, it is not uniquely decodable either, since the concatenation of the letters a_{1} and a_{0} produces the same bit sequence as the letter a_{2}. The remaining three codes are uniquely decodable, but differ in other properties. While code D has an average codeword length of 2.125 bit per symbol, the codes C and E have an average codeword length of only 1.75 bit per symbol, which is, as we will show later, the minimum achievable average codeword length for the given source. Beside being uniquely decodable, the codes D and E are also instantaneously decodable, i.e., each alphabet letter can be decoded right after the bits of its codeword are received. The code C does not have this property. If a decoder for the code C receives a bit equal to 0, it has to wait for the next bit equal to 0 before a symbol can be decoded. Theoretically, the decoder might need to wait until the end of the message. The value of the next symbol depends on how many bits equal to 1 are received between the zero bits.
Binary codes can be represented using binary trees as illustrated in Fig. 3.1. A binary tree is a data structure that consists of nodes, with each node having zero, one, or two descendant nodes. A node and its descendant nodes are connected by branches. A binary tree starts with a root node, which is the only node that is not a descendant of any other node. Nodes that are not the root node but have descendants are referred to as interior nodes, whereas nodes that do not have descendants are called terminal nodes or leaf nodes.
In a binary code tree, all branches are labeled with ‘0’ or ‘1’. If two branches depart from the same node, they have different labels. Each node of the tree represents a codeword, which is given by the concatenation of the branch labels from the root node to the considered node. A code for a given alphabet can be constructed by associating all terminal nodes and zero or more interior nodes of a binary code tree with one or more alphabet letters. If each alphabet letter is associated with a distinct node, the resulting code is nonsingular. In the example of Fig. 3.1, the nodes that represent alphabet letters are filled.
A code is said to be a prefix code if no codeword for an alphabet letter represents the codeword or a prefix of the codeword for any other alphabet letter. If a prefix code is represented by a binary code tree, this implies that each alphabet letter is assigned to a distinct terminal node, but not to any interior node. It is obvious that every prefix code is uniquely decodable. Furthermore, we will prove later that for every uniquely decodable code there exists a prefix code with exactly the same codeword lengths. Examples for prefix codes are the codes D and E in Table 3.1.
Based on the binary code tree representation, the parsing rule for prefix codes can be specified as follows: Start at the root node and follow the branches according to the received bits; as soon as a node that is associated with an alphabet letter is reached, output the corresponding letter and continue parsing the following bits at the root node.
The parsing rule reveals that prefix codes are not only uniquely decodable, but also instantaneously decodable. As soon as all bits of a codeword are received, the transmitted symbol is immediately known. Due to this property, it is also possible to switch between different independently designed prefix codes inside a bitstream (i.e., because symbols with different alphabets are interleaved according to a given bitstream syntax) without impacting the unique decodability.
A necessary condition for uniquely decodable codes is given by the Kraft inequality,
Σ_{i=0}^{M−1} 2^{−ℓ(a_{i})} ≤ 1.  (3.4)
For proving this inequality, we consider the term
( Σ_{i=0}^{M−1} 2^{−ℓ(a_{i})} )^{L} = Σ_{i_{0}=0}^{M−1} Σ_{i_{1}=0}^{M−1} ⋯ Σ_{i_{L−1}=0}^{M−1} 2^{−(ℓ(a_{i_{0}}) + ℓ(a_{i_{1}}) + ⋯ + ℓ(a_{i_{L−1}}))}.  (3.5)
The term ℓ_{L} = ℓ(a_{i_{0}}) + ℓ(a_{i_{1}}) + ⋯ + ℓ(a_{i_{L−1}}) represents the combined codeword length for coding L symbols. Let A(ℓ_{L}) denote the number of distinct symbol sequences that produce a bit sequence with the same length ℓ_{L}. A(ℓ_{L}) is equal to the number of terms 2^{−ℓ_{L}} that are contained in the sum on the right side of (3.5). For a uniquely decodable code, A(ℓ_{L}) must be less than or equal to 2^{ℓ_{L}}, since there are only 2^{ℓ_{L}} distinct bit sequences of length ℓ_{L}. If the maximum length of a codeword is ℓ_{max}, the combined codeword length ℓ_{L} lies inside the interval [L, L·ℓ_{max}]. Hence, a uniquely decodable code must fulfill the inequality
( Σ_{i=0}^{M−1} 2^{−ℓ(a_{i})} )^{L} = Σ_{ℓ_{L}=L}^{L·ℓ_{max}} A(ℓ_{L}) · 2^{−ℓ_{L}} ≤ Σ_{ℓ_{L}=L}^{L·ℓ_{max}} 2^{ℓ_{L}} · 2^{−ℓ_{L}} = L·(ℓ_{max} − 1) + 1.  (3.6)
The left side of this inequality grows exponentially with L, while the right side grows only linearly with L. If the Kraft inequality (3.4) is not fulfilled, we can always find a value of L for which the condition (3.6) is violated. And since the constraint (3.6) must be obeyed for all values of L ≥ 1, this proves that the Kraft inequality specifies a necessary condition for uniquely decodable codes.
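A direct translation of the Kraft inequality (3.4) into a small helper that tests whether a set of codeword lengths can belong to a uniquely decodable code:

```python
def satisfies_kraft(lengths):
    """Check the Kraft inequality (3.4) for a list of codeword lengths."""
    return sum(2.0 ** -l for l in lengths) <= 1.0

print(satisfies_kraft([1, 2, 3, 3]))   # True:  1/2 + 1/4 + 1/8 + 1/8 = 1
print(satisfies_kraft([1, 2, 2, 3]))   # False: 1/2 + 1/4 + 1/4 + 1/8 > 1
```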
The Kraft inequality does not only provide a necessary condition for uniquely decodable codes; it is also always possible to construct a uniquely decodable code for any given set of codeword lengths {ℓ_{0},ℓ_{1},…,ℓ_{M−1}} that satisfies the Kraft inequality. We prove this statement for prefix codes, which represent a subset of uniquely decodable codes. Without loss of generality, we assume that the given codeword lengths are ordered as ℓ_{0} ≤ ℓ_{1} ≤ ⋯ ≤ ℓ_{M−1}. Starting with an infinite binary code tree, we choose an arbitrary node of depth ℓ_{0} (i.e., a node that represents a codeword of length ℓ_{0}) for the first codeword and prune the code tree at this node. For the next codeword length ℓ_{1}, one of the remaining nodes with depth ℓ_{1} is selected. A continuation of this procedure yields a prefix code for the given set of codeword lengths, unless we cannot select a node for a codeword length ℓ_{i} because all nodes of depth ℓ_{i} have already been removed in previous steps. It should be noted that the selection of a codeword of length ℓ_{k} removes 2^{ℓ_{i}−ℓ_{k}} codewords with a length of ℓ_{i} ≥ ℓ_{k}. Consequently, for the assignment of a codeword of length ℓ_{i}, the number of available codewords is given by
n(ℓ_{i}) = 2^{ℓ_{i}} − Σ_{k=0}^{i−1} 2^{ℓ_{i}−ℓ_{k}} = 2^{ℓ_{i}} · ( 1 − Σ_{k=0}^{i−1} 2^{−ℓ_{k}} ).  (3.7)
If the Kraft inequality (3.4) is fulfilled, we obtain
n(ℓ_{i}) ≥ 2^{ℓ_{i}} · Σ_{k=i}^{M−1} 2^{−ℓ_{k}} ≥ 2^{ℓ_{i}} · 2^{−ℓ_{i}} = 1.  (3.8)
Hence, it is always possible to construct a prefix code, and thus a uniquely decodable code, for a given set of codeword lengths that satisfies the Kraft inequality.
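The constructive proof translates directly into code: given lengths that satisfy (3.4), codewords can be assigned greedily from a canonical counter. The sketch below is one standard way to realize this construction; it is not the only valid assignment.

```python
def prefix_code_from_lengths(lengths):
    """Build codewords for lengths satisfying the Kraft inequality (canonical code)."""
    codewords, value, prev_len = [], 0, 0
    for l in sorted(lengths):
        value <<= (l - prev_len)          # extend the current code value to depth l
        codewords.append(format(value, '0{}b'.format(l)))
        value += 1                        # next free node at this depth
        prev_len = l
    return codewords

print(prefix_code_from_lengths([1, 2, 3, 3]))   # ['0', '10', '110', '111']
```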
The proof shows another important property of prefix codes. Since all uniquely decodable codes fulfill the Kraft inequality and it is always possible to construct a prefix code for any set of codeword lengths that satisfies the Kraft inequality, there do not exist uniquely decodable codes that have a smaller average codeword length than the best prefix code. Due to this property and since prefix codes additionally provide instantaneous decodability and are easy to construct, all variable length codes that are used in practice are prefix codes.
3.2.2 Entropy
Based on the Kraft inequality, we now derive a lower bound for the average codeword length of uniquely decodable codes. The expression (3.3) for the average codeword length can be rewritten as
ℓ̄ = Σ_{i=0}^{M−1} p(a_{i}) · ℓ(a_{i}) = −Σ_{i=0}^{M−1} p(a_{i}) · log_{2} 2^{−ℓ(a_{i})}.  (3.9)
With the definition q(a_{i}) = 2^{−ℓ(a_{i})} / ( Σ_{k=0}^{M−1} 2^{−ℓ(a_{k})} ), we obtain
ℓ̄ = −log_{2}( Σ_{k=0}^{M−1} 2^{−ℓ(a_{k})} ) + Σ_{i=0}^{M−1} p(a_{i}) · log_{2}( p(a_{i}) / q(a_{i}) ) − Σ_{i=0}^{M−1} p(a_{i}) · log_{2} p(a_{i}).  (3.10)
Since the Kraft inequality is fulfilled for all uniquely decodable codes, the first term on the right side of (3.10) is greater than or equal to 0. The second term is also greater than or equal to 0, as can be shown using the inequality ln x ≤ x − 1 (with equality if and only if x = 1),
Σ_{i=0}^{M−1} p(a_{i}) · log_{2}( p(a_{i}) / q(a_{i}) ) ≥ (log_{2} e) · Σ_{i=0}^{M−1} p(a_{i}) · ( 1 − q(a_{i}) / p(a_{i}) ) = (log_{2} e) · ( 1 − Σ_{i=0}^{M−1} q(a_{i}) ) = 0.  (3.11)
The inequality (3.11) is also referred to as divergence inequality for probability mass functions. The average codeword length for uniquely decodable codes is bounded by
ℓ̄ ≥ H(S),  (3.12)
with
H(S) = E{−log_{2} p(S)} = −Σ_{i=0}^{M−1} p(a_{i}) · log_{2} p(a_{i}).  (3.13)
The lower bound H(S) is called the entropy of the random variable S and depends only on the associated pmf p. Often the entropy of a random variable with a pmf p is also denoted as H(p). The redundancy of a code is given by the difference
ϱ = ℓ̄ − H(S).  (3.14)
The entropy H(S) can also be considered as a measure for the uncertainty that is associated with the random variable S.
The inequality (3.12) is an equality if and only if the first and second terms on the right side of (3.10) are equal to 0. This is only the case if the Kraft inequality is fulfilled with equality and q(a_{i}) = p(a_{i}), ∀a_{i} ∈ 𝒜. The resulting conditions ℓ(a_{i}) = −log_{2} p(a_{i}), ∀a_{i} ∈ 𝒜, can only hold if all alphabet letters have probabilities that are integer powers of 1/2.
For deriving an upper bound for the minimum average codeword length, we choose ℓ(a_{i}) = ⌈−log_{2} p(a_{i})⌉, ∀a_{i} ∈ 𝒜, where ⌈x⌉ represents the smallest integer greater than or equal to x. Since these codeword lengths satisfy the Kraft inequality, as can be shown using ⌈x⌉ ≥ x,
Σ_{i=0}^{M−1} 2^{−⌈−log_{2} p(a_{i})⌉} ≤ Σ_{i=0}^{M−1} 2^{log_{2} p(a_{i})} = Σ_{i=0}^{M−1} p(a_{i}) = 1,  (3.15)
we can always construct a uniquely decodable code. For the average codeword length of such a code, we obtain, using ⌈x⌉ < x + 1,
ℓ̄ = Σ_{i=0}^{M−1} p(a_{i}) · ⌈−log_{2} p(a_{i})⌉ < Σ_{i=0}^{M−1} p(a_{i}) · ( 1 − log_{2} p(a_{i}) ) = H(S) + 1.  (3.16)
The minimum average codeword length ℓ̄_{min} that can be achieved with uniquely decodable codes that assign a separate codeword to each letter of an alphabet always satisfies the inequality
H(S) ≤ ℓ̄_{min} < H(S) + 1.  (3.17)
The upper limit is approached for a source with a twoletter alphabet and a pmf {p,1  p} if the letter probability p approaches 0 or 1 [15].
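The following sketch computes the entropy (3.13) and the code lengths ⌈−log₂ p(a_{i})⌉ used in the proof, confirming the bounds (3.17); the pmf is an arbitrary (dyadic) example, for which the lower bound is attained exactly.

```python
import math

pmf = [0.5, 0.25, 0.125, 0.125]           # example pmf (dyadic, so the bound is tight)

entropy = -sum(p * math.log2(p) for p in pmf)                 # H(S), eq. (3.13)
lengths = [math.ceil(-math.log2(p)) for p in pmf]             # code lengths from the proof
avg_len = sum(p * l for p, l in zip(pmf, lengths))

assert sum(2.0 ** -l for l in lengths) <= 1.0                 # Kraft inequality (3.15)
print(entropy, avg_len)       # H(S) = 1.75 and avg_len = 1.75, so H <= l < H+1 holds
```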
3.2.3 The Huffman Algorithm
For deriving an upper bound for the minimum average codeword length, we chose ℓ(a_{i}) = ⌈−log_{2} p(a_{i})⌉, ∀a_{i} ∈ 𝒜. The resulting code has a redundancy ϱ = ℓ̄ − H(S) that is always less than 1 bit per symbol, but it does not necessarily achieve the minimum average codeword length. For developing an optimal uniquely decodable code, i.e., a code that achieves the minimum average codeword length, it is sufficient to consider the class of prefix codes, since for every uniquely decodable code there exists a prefix code with exactly the same codeword lengths. An optimal prefix code has the following properties:
- For any two alphabet letters a_{i} and a_{j} with p(a_{i}) > p(a_{j}), the associated codeword lengths satisfy ℓ(a_{i}) ≤ ℓ(a_{j});
- There are always two codewords that have the maximum codeword length and differ only in the final bit.
These conditions can be proved as follows. If the first condition is not fulfilled, an exchange of the codewords for the symbols a_{i} and a_{j} would decrease the average codeword length while preserving the prefix property. And if the second condition is not satisfied, i.e., if for a particular codeword with the maximum codeword length there does not exist a codeword that has the same length and differs only in the final bit, the removal of the last bit of the particular codeword would preserve the prefix property and decrease the average codeword length.
Both conditions for optimal prefix codes are obeyed if two codewords with the maximum length that differ only in the final bit are assigned to the two letters a_{i} and a_{j} with the smallest probabilities. In the corresponding binary code tree, a parent node for the two leaf nodes that represent these two letters is created. The two letters a_{i} and a_{j} can then be treated as a new letter with a probability of p(a_{i}) + p(a_{j}) and the procedure of creating a parent node for the nodes that represent the two letters with the smallest probabilities can be repeated for the new alphabet. The resulting iterative algorithm was developed and proved to be optimal by Huffman in [30]. Based on the construction of a binary code tree, the Huffman algorithm for a given alphabet 𝒜 with a marginal pmf p can be summarized as follows:
1. Select the two letters with the smallest probabilities and create a parent node for the two leaf nodes that represent these letters in the binary code tree.
2. Replace the two letters by a new letter with a probability equal to the sum of the two probabilities, yielding a reduced alphabet.
3. If the reduced alphabet contains more than one letter, repeat the previous two steps.
4. Assign the codeword for each alphabet letter by concatenating the branch labels from the root node to the leaf node that represents this letter.
A detailed example for the application of the Huffman algorithm is given in Fig. 3.2. Optimal prefix codes are often generally referred to as Huffman codes. It should be noted that there exist multiple optimal prefix codes for a given marginal pmf. A tighter bound than in (3.17) on the redundancy of Huffman codes is provided in [15].
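A compact Python sketch of the algorithm above, using a heap to repeatedly merge the two least probable entries; it reproduces an optimal prefix code for the dyadic example pmf used earlier (the exact 0/1 labels may vary between equally optimal codes).

```python
import heapq
from itertools import count

def huffman_code(pmf):
    """Return a dict letter -> codeword for the given pmf (Huffman algorithm)."""
    tick = count()  # tie-breaker so the heap never compares the tree payloads
    heap = [(p, next(tick), {letter: ''}) for letter, p in pmf.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)   # the two smallest probabilities
        p1, _, code1 = heapq.heappop(heap)
        merged = {k: '0' + v for k, v in code0.items()}   # prepend branch labels
        merged.update({k: '1' + v for k, v in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tick), merged))
    return heap[0][2]

print(huffman_code({'a0': 0.5, 'a1': 0.25, 'a2': 0.125, 'a3': 0.125}))
# e.g. {'a0': '0', 'a1': '10', 'a2': '110', 'a3': '111'}
```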
3.2.4 Conditional Huffman Codes
Until now, we considered the design of variable length codes for the marginal pmf of stationary random processes. However, for random processes {S_{n}} with memory, it can be beneficial to design variable length codes for conditional pmfs and switch between multiple codeword tables depending on already coded symbols.
Table 3.2 Conditional pmfs p(a|a_{k}) and conditional entropies H(S_{n}|a_{k}) for an example Markov source; the last row shows the marginal pmf p(a) and the marginal entropy H(S).

              a_{0}   a_{1}   a_{2}    entropy
p(a|a_{0})    0.90    0.05    0.05     H(S_{n}|a_{0}) = 0.5690
p(a|a_{1})    0.15    0.80    0.05     H(S_{n}|a_{1}) = 0.8842
p(a|a_{2})    0.25    0.15    0.60     H(S_{n}|a_{2}) = 1.3527
p(a)          0.64    0.24    0.1      H(S) = 1.2575
As an example, we consider a stationary discrete Markov process with a three-symbol alphabet 𝒜 = {a_{0},a_{1},a_{2}}. The statistical properties of this process are completely characterized by three conditional pmfs p(a|a_{k}) = P(S_{n} = a | S_{n−1} = a_{k}) with k = 0,1,2, which are given in Table 3.2. An optimal prefix code for a given conditional pmf can be designed in exactly the same way as for a marginal pmf. A corresponding Huffman code design for the example Markov source is shown in Table 3.3. For comparison, Table 3.3 also lists a Huffman code for the marginal pmf. The codeword table that is chosen for coding a symbol s_{n} depends on the value of the preceding symbol s_{n−1}. It is important to note that an independent code design for the conditional pmfs is only possible for instantaneously decodable codes, i.e., for prefix codes.
Table 3.3 Huffman codes for the conditional pmfs and the marginal pmf of the Markov source specified in Table 3.2.

         Huffman codes for conditional pmfs                Huffman code for
a_{i}    S_{n−1}=a_{0}   S_{n−1}=a_{1}   S_{n−1}=a_{2}     marginal pmf
a_{0}    1               00              00                1
a_{1}    00              1               01                00
a_{2}    01              01              1                 01
ℓ̄        1.1             1.2             1.4               1.3556
The average codeword length ℓ̄(S_{n−1} = a_{k}) of an optimal prefix code for each of the conditional pmfs is guaranteed to lie in the half-open interval [H(S_{n}|a_{k}), H(S_{n}|a_{k}) + 1), where
H(S_{n}|a_{k}) = H(S_{n}|S_{n−1} = a_{k}) = −Σ_{i=0}^{M−1} p(a_{i}|a_{k}) · log_{2} p(a_{i}|a_{k})  (3.18)
denotes the conditional entropy of the random variable S_{n} given the event {S_{n−1} = a_{k}}. The resulting average codeword length for the conditional code is
ℓ̄ = Σ_{k=0}^{M−1} p(a_{k}) · ℓ̄(S_{n−1} = a_{k}).  (3.19)
The resulting lower bound for the average codeword length is the conditional entropy H(S_{n}|S_{n−1}) of the random variable S_{n} given the random variable S_{n−1}, which is given by
H(S_{n}|S_{n−1}) = E{−log_{2} p(S_{n}|S_{n−1})} = Σ_{k=0}^{M−1} p(a_{k}) · H(S_{n}|a_{k}) = −Σ_{k=0}^{M−1} Σ_{i=0}^{M−1} p(a_{i},a_{k}) · log_{2} p(a_{i}|a_{k}),  (3.20)
where p(a_{i},a_{k}) = P(S_{n} = a_{i}, S_{n−1} = a_{k}) denotes the joint pmf of the random variables S_{n} and S_{n−1}. The conditional entropy H(S_{n}|S_{n−1}) specifies a measure for the uncertainty about S_{n} given the value of S_{n−1}. The minimum average codeword length ℓ̄_{min} that is achievable with the conditional code design is bounded by
H(S_{n}|S_{n−1}) ≤ ℓ̄_{min} < H(S_{n}|S_{n−1}) + 1.  (3.21)
As can be easily shown from the divergence inequality (3.11), the conditional entropy never exceeds the marginal entropy,
H(S_{n}|S_{n−1}) ≤ H(S_{n}),  (3.22)
with equality if and only if S_{n} and S_{n−1} are independent; hence, the conditional code design cannot be worse than the code design for the marginal pmf.
For our example, the average codeword length of the conditional code design is 1.1578 bit per symbol, which is about 14.6% smaller than the average codeword length of the Huffman code for the marginal pmf.
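These numbers are easy to reproduce: the sketch below computes the stationary pmf of the Markov source in Table 3.2 and the average codeword length (3.19) of the conditional Huffman codes in Table 3.3.

```python
import numpy as np

# Conditional pmfs p(a|a_k) from Table 3.2 (row k = previous symbol a_k).
P = np.array([[0.90, 0.05, 0.05],
              [0.15, 0.80, 0.05],
              [0.25, 0.15, 0.60]])

# Stationary pmf: left eigenvector of P for eigenvalue 1, normalized to sum 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()
print(pi)                      # approx. [0.6444, 0.2444, 0.1111]

# Average codeword lengths of the conditional Huffman codes (Table 3.3).
cond_len = np.array([1.1, 1.2, 1.4])
print(pi @ cond_len)           # approx. 1.1578 bit per symbol, eq. (3.19)
```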
For sources with memory that do not satisfy the Markov property, it may be possible to further decrease the average codeword length if more than one preceding symbol is used in the condition. However, the number of codeword tables increases exponentially with the number of considered symbols. To reduce the number of tables, the number of outcomes for the condition can be partitioned into a small number of events, and for each of these events, a separate code can be designed. As an application example, the CAVLC design in the H.264/AVC video coding standard [36] includes conditional variable-length codes.
3.2.5 Adaptive Huffman Codes
In practice, the marginal and conditional pmfs of a source are usually not known, and sources are often nonstationary. Conceptually, the pmf(s) can be simultaneously estimated in encoder and decoder and a Huffman code can be redesigned after coding a particular number of symbols. This would, however, tremendously increase the complexity of the coding process. A fast algorithm for adapting Huffman codes was proposed by Gallager [15]. But even this algorithm is considered too complex for video coding applications, so that adaptive Huffman codes are rarely used in this area.
3.3 Variable-Length Coding for Vectors
Although scalar Huffman codes achieve the smallest average codeword length among all uniquely decodable codes that assign a separate codeword to each letter of an alphabet, they can be very inefficient if there are strong dependencies between the random variables of a process. For sources with memory, the average codeword length per symbol can be decreased if multiple symbols are coded jointly. Huffman codes that assign a codeword to a block of two or more successive symbols are referred to as block Huffman codes or vector Huffman codes and represent an alternative to conditional Huffman codes. The joint coding of multiple symbols is also advantageous for iid processes for which one of the probability masses is close to one.
3.3.1 Huffman Codes for Fixed-Length Vectors
We consider stationary discrete random sources S = {S_{n}} with an M-ary alphabet 𝒜 = {a_{0},…,a_{M−1}}. If N symbols are coded jointly, the Huffman code has to be designed for the joint pmf
p(a_{0},…,a_{N−1}) = P(S_{n} = a_{0},…,S_{n+N−1} = a_{N−1})
of a block of N successive symbols. The average codeword length ℓ̄_{min} per symbol for an optimum block Huffman code is bounded by
H(S_{n},…,S_{n+N−1}) / N ≤ ℓ̄_{min} < H(S_{n},…,S_{n+N−1}) / N + 1/N,  (3.23)
where
H(S_{n},…,S_{n+N−1}) = E{−log_{2} p(S_{n},…,S_{n+N−1})}  (3.24)
is referred to as the block entropy for a set of N successive random variables {S_{n},…,S_{n+N−1}}. The limit
H̄(S) = lim_{N→∞} H(S_{0},…,S_{N−1}) / N  (3.25)
is called the entropy rate of a source S. It can be shown that the limit in (3.25) always exists for stationary sources [14]. The entropy rate H̄(S) represents the greatest lower bound for the average codeword length ℓ̄ per symbol that can be achieved with lossless source coding techniques,
ℓ̄ ≥ H̄(S).  (3.26)
For iid processes, the entropy rate
H̄(S) = lim_{N→∞} ( N · H(S) ) / N = H(S)  (3.27)
is equal to the marginal entropy H(S). For stationary Markov processes, the entropy rate
H̄(S) = H(S_{n}|S_{n−1})  (3.28)
is equal to the conditional entropy of S_{n} given the preceding symbol S_{n−1}.
Table 3.4 Block Huffman codes for the Markov source specified in Table 3.2: (a) Huffman code for blocks of two symbols; (b) average codeword lengths ℓ̄ per symbol and numbers N_{C} of codewords depending on the number N of jointly coded symbols.

(a)
a_{i}a_{k}    p(a_{i},a_{k})   codeword
a_{0}a_{0}    0.58             1
a_{0}a_{1}    0.032            00001
a_{0}a_{2}    0.032            00010
a_{1}a_{0}    0.036            0010
a_{1}a_{1}    0.195            01
a_{1}a_{2}    0.012            000000
a_{2}a_{0}    0.027            00011
a_{2}a_{1}    0.017            000001
a_{2}a_{2}    0.06             0011

(b)
N    ℓ̄        N_{C}
1    1.3556   3
2    1.0094   9
3    0.9150   27
4    0.8690   81
5    0.8462   243
6    0.8299   729
7    0.8153   2187
8    0.8027   6561
9    0.7940   19683
As an example for the design of block Huffman codes, we consider the discrete Markov process specified in Table 3.2. The entropy rate H̄(S) for this source is 0.7331 bit per symbol. Table 3.4(a) shows a Huffman code for the joint coding of two symbols. The average codeword length per symbol for this code is 1.0094 bit per symbol, which is smaller than the average codeword length obtained with the Huffman code for the marginal pmf and the conditional Huffman code that we developed in sec. 3.2. As shown in Table 3.4(b), the average codeword length can be further reduced by increasing the number N of jointly coded symbols. If N approaches infinity, the average codeword length per symbol for the block Huffman code approaches the entropy rate. However, the number N_{C} of codewords that must be stored in an encoder and decoder grows exponentially with the number N of jointly coded symbols. In practice, block Huffman codes are only used for a small number of symbols with small alphabets.
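The entropy rate quoted above follows from (3.28); the sketch below computes H̄(S) = Σ_{k} π_{k} · H(S_{n}|a_{k}) for the source of Table 3.2.

```python
import numpy as np

P = np.array([[0.90, 0.05, 0.05],      # conditional pmfs p(a|a_k) from Table 3.2
              [0.15, 0.80, 0.05],
              [0.25, 0.15, 0.60]])

w, v = np.linalg.eig(P.T)              # stationary pmf (left eigenvector for eigenvalue 1)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

cond_entropies = -np.sum(P * np.log2(P), axis=1)    # H(S_n|a_k) for each row
print(pi @ cond_entropies)             # approx. 0.7331 bit per symbol, the entropy rate
```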
In general, the number of symbols in a message is not a multiple of the block size N. The last block of source symbols may contain fewer than N symbols, and, in that case, it cannot be represented with the block Huffman code. If the number of symbols in a message is known to the decoder (e.g., because it is determined by a given bitstream syntax), an encoder can send the codeword for any of the letter combinations that contain the last block of source symbols as a prefix. At the decoder side, the additionally decoded symbols are discarded. If the number of symbols that are contained in a message cannot be determined in the decoder, a special symbol for signaling the end of a message can be added to the alphabet.
3.3.2 Huffman Codes for Variable-Length Vectors
An additional degree of freedom for designing Huffman codes, or generally variablelength codes, for symbol vectors is obtained if the restriction that all codewords are assigned to symbol blocks of the same size is removed. Instead, the codewords can be assigned to sequences of a variable number of successive symbols. Such a code is also referred to as V2V code in this text. In order to construct a V2V code, a set of letter sequences with a variable number of letters is selected and a codeword is associated with each of these letter sequences. The set of letter sequences has to be chosen in a way that each message can be represented by a concatenation of the selected letter sequences. An exception is the end of a message, for which the same concepts as for block Huffman codes (see above) can be used.
Fig. 3.3 Example for an M-ary tree representing sequences of a variable number of letters of the alphabet 𝒜 = {a_{0},a_{1},a_{2}}, with an associated variable-length code.
Similarly as for binary codes, the set of letter sequences can be represented by an M-ary tree as depicted in Fig. 3.3. In contrast to binary code trees, each node has up to M descendants and each branch is labeled with a letter of the M-ary alphabet 𝒜 = {a_{0},a_{1},…,a_{M−1}}. All branches that depart from a particular node are labeled with different letters. The letter sequence that is represented by a particular node is given by the concatenation of the branch labels from the root node to the particular node. An M-ary tree is said to be a full tree if each node is either a leaf node or has exactly M descendants.
We constrain our considerations to full M-ary trees for which all leaf nodes and only the leaf nodes are associated with codewords. This restriction yields a V2V code that fulfills the necessary condition stated above and additionally has the following useful properties:
- None of the selected letter sequences is a prefix of another selected letter sequence;
- Except possibly for the symbols at the end of a message, every message can be parsed into a concatenation of the selected letter sequences.
The first property implies that any message can only be represented by a single sequence of codewords. The only exception is that, if the last symbols of a message do not represent a letter sequence that is associated with a codeword, one of multiple codewords can be selected as discussed above.
Let N_{ℒ} denote the number of leaf nodes in a full M-ary tree T. Each leaf node ℒ_{k} represents a sequence a_{k} = {a_{0}^{k},a_{1}^{k},…,a_{N_{k}−1}^{k}} of N_{k} alphabet letters. The associated probability p(ℒ_{k}) for coding a symbol sequence {S_{n},…,S_{n+N_{k}−1}} is given by
p(ℒ_{k}) = ∏_{m=0}^{N_{k}−1} p(a_{m}^{k} | a_{0}^{k},…,a_{m−1}^{k}, ℬ),  (3.29)
where ℬ represents the event that the preceding symbols {S_{0},…,S_{n−1}} were coded using a sequence of complete codewords of the V2V tree. The term p(a_{m} | a_{0},…,a_{m−1}, ℬ) denotes the conditional pmf for a random variable S_{n+m} given the random variables S_{n} to S_{n+m−1} and the event ℬ. For iid sources, the probability p(ℒ_{k}) for a leaf node ℒ_{k} simplifies to
p(ℒ_{k}) = ∏_{m=0}^{N_{k}−1} p(a_{m}^{k}).  (3.30)
For stationary Markov sources, the probabilities p(ℒ_{k}) are given by
p(ℒ_{k}) = p(a_{0}^{k} | ℬ) · ∏_{m=1}^{N_{k}−1} p(a_{m}^{k} | a_{m−1}^{k}).  (3.31)
The conditional pmfs p(a_{m} | a_{0},…,a_{m−1}, ℬ) are given by the structure of the M-ary tree T and the conditional pmfs p(a_{m} | a_{0},…,a_{m−1}) for the random variables S_{n+m} given the preceding random variables S_{n} to S_{n+m−1}.
As an example, we show how the pmf p(a|ℬ) = P(S_{n} = a | ℬ) that is conditioned on the event ℬ can be determined for Markov sources. In this case, the probability p(a_{m}|ℬ) = P(S_{n} = a_{m} | ℬ) that a letter sequence starts with a particular letter a_{m} of the alphabet 𝒜 = {a_{0},a_{1},…,a_{M−1}} is given by
p(a_{m}|ℬ) = Σ_{k=0}^{N_{ℒ}−1} p(ℒ_{k}) · p(a_{m} | a_{N_{k}−1}^{k}),  (3.32)
where a_{N_{k}−1}^{k} denotes the last letter of the sequence represented by the leaf node ℒ_{k}, and the probabilities p(ℒ_{k}) are themselves linear functions of the unknowns p(a|ℬ) according to (3.31).
These M equations form a homogeneous linear equation system that has one set of nontrivial solutions p(a|ℬ) = κ·{x_{0},x_{1},…,x_{M−1}}. The scale factor κ, and thus the pmf p(a|ℬ), can be uniquely determined by using the constraint Σ_{m=0}^{M−1} p(a_{m}|ℬ) = 1.
After the conditional pmfs p(a_{m} | a_{0},…,a_{m−1}, ℬ) have been determined, the pmf p(ℒ) for the leaf nodes can be calculated. An optimal prefix code for the selected set of letter sequences, which is represented by the leaf nodes of a full M-ary tree T, can be designed using the Huffman algorithm for the pmf p(ℒ). Each leaf node ℒ_{k} is associated with a codeword of ℓ_{k} bits. The average codeword length per symbol is given by the ratio of the average codeword length per letter sequence and the average number of letters per letter sequence,
ℓ̄ = ( Σ_{k=0}^{N_{ℒ}−1} p(ℒ_{k}) · ℓ_{k} ) / ( Σ_{k=0}^{N_{ℒ}−1} p(ℒ_{k}) · N_{k} ).  (3.33)
For selecting the set of letter sequences, i.e., the full M-ary tree T, we assume that the set of applicable V2V codes for an application is given by parameters such as the maximum number of codewords (number of leaf nodes). Given such a finite set of full M-ary trees, we can select the full M-ary tree T for which the Huffman code yields the smallest average codeword length per symbol ℓ̄.
Table 3.5 V2V codes for the Markov source specified in Table 3.2: (a) V2V code with N_{ℒ} = 9 codewords; (b) average codeword lengths ℓ̄ per symbol depending on the number N_{ℒ} of codewords.

(a)
a_{k}              p(ℒ_{k})   codeword
a_{0}a_{0}         0.5799     1
a_{0}a_{1}         0.0322     00001
a_{0}a_{2}         0.0322     00010
a_{1}a_{0}         0.0277     00011
a_{1}a_{1}a_{0}    0.0222     000001
a_{1}a_{1}a_{1}    0.1183     001
a_{1}a_{1}a_{2}    0.0074     0000000
a_{1}a_{2}         0.0093     0000001
a_{2}              0.1708     01

(b)
N_{ℒ}   ℓ̄
5       1.1784
7       1.0551
9       1.0049
11      0.9733
13      0.9412
15      0.9293
17      0.9074
19      0.8980
21      0.8891
As an example for the design of a V2V Huffman code, we again consider the stationary discrete Markov source specified in Table 3.2. Table 3.5(a) shows a V2V code that minimizes the average codeword length per symbol among all V2V codes with up to 9 codewords. The average codeword length is 1.0049 bit per symbol, which is about 0.4% smaller than the average codeword length for the block Huffman code with the same number of codewords. As indicated in Table 3.5(b), when increasing the number of codewords, the average codeword length for V2V codes usually decreases faster than for block Huffman codes. The V2V code with 17 codewords already has an average codeword length that is smaller than that of the block Huffman code with 27 codewords.
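Plugging the leaf probabilities, codeword lengths, and sequence lengths of Table 3.5(a) into (3.33) reproduces the quoted 1.0049 bit per symbol:

```python
# Leaf entries from Table 3.5(a): (probability, codeword length, letters per sequence)
leaves = [(0.5799, 1, 2), (0.0322, 5, 2), (0.0322, 5, 2),
          (0.0277, 5, 2), (0.0222, 6, 3), (0.1183, 3, 3),
          (0.0074, 7, 3), (0.0093, 7, 2), (0.1708, 2, 1)]

avg_bits    = sum(p * l for p, l, n in leaves)   # average codeword length per sequence
avg_letters = sum(p * n for p, l, n in leaves)   # average number of letters per sequence
print(avg_bits / avg_letters)                    # approx. 1.0049 bit per symbol, eq. (3.33)
```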
An application example of V2V codes is the run-level coding of transform coefficients in MPEG-2 Video [39]. An often-used variant of V2V codes is run-length coding. In run-length coding, the number of successive occurrences of a particular alphabet letter, referred to as run, is transmitted using a variable-length code. In some applications, only runs for the most probable alphabet letter (including runs equal to 0) are transmitted and are always followed by a codeword for one of the remaining alphabet letters. In other applications, the codeword for a run is followed by a codeword specifying the alphabet letter, or vice versa. V2V codes are particularly attractive for binary iid sources. As we will show in sec. 3.5, a universal lossless source coding concept can be designed using V2V codes for binary iid sources in connection with the concepts of binarization and probability interval partitioning.
3.4 Elias Coding and Arithmetic Coding
Huffman codes achieve the minimum average codeword length among all uniquely decodable codes that assign a separate codeword to each element of a given set of alphabet letters or letter sequences. However, if the pmf for a symbol alphabet contains a probability mass that is close to one, a Huffman code with an average codeword length close to the entropy rate can only be constructed if a large number of symbols is coded jointly. Such a block Huffman code does, however, require a huge codeword table and is thus impractical for real applications. Additionally, a Huffman code for fixed- or variable-length vectors is not applicable, or at least very inefficient, for symbol sequences in which symbols with different alphabets and pmfs are irregularly interleaved, as is often found in image and video coding applications, where the order of symbols is determined by a sophisticated syntax.
Furthermore, the adaptation of Huffman codes to sources with unknown or varying statistical properties is usually considered too complex for real-time applications. It is desirable to develop a code construction method that achieves an average codeword length close to the entropy rate, provides a simple mechanism for dealing with nonstationary sources, and is characterized by a complexity that increases linearly with the number of coded symbols.
The popular method of arithmetic coding provides these properties. The initial idea is attributed to P. Elias (as reported in [1]) and is also referred to as Elias coding. The first practical arithmetic coding schemes were published by Pasco [61] and Rissanen [63]. In the following, we first present the basic concept of Elias coding and continue by highlighting some aspects of practical implementations. For further details, the interested reader is referred to [78], [58], and [65].
We consider the coding of symbol sequences s = {s_{0}, s_{1}, …, s_{N−1}} that represent realizations of a sequence of discrete random variables S = {S_{0}, S_{1}, …, S_{N−1}}. The number N of symbols is assumed to be known to both encoder and decoder. Each random variable S_{n} can be characterized by a distinct M_{n}-ary alphabet A_{n}. The statistical properties of the sequence of random variables S are completely described by the joint pmf p(s) = P(S = s) = P(S_{0} = s_{0}, S_{1} = s_{1}, …, S_{N−1} = s_{N−1}).
A symbol sequence s_{a} = {s_{0}^{a}, s_{1}^{a}, …, s_{N−1}^{a}} is considered to be less than another symbol sequence s_{b} = {s_{0}^{b}, s_{1}^{b}, …, s_{N−1}^{b}} if and only if there exists an integer n, with 0 ≤ n ≤ N−1, so that
 s_{i}^{a} = s_{i}^{b}  for i = 0, …, n−1   and   s_{n}^{a} < s_{n}^{b}.   (3.34)
Using this definition, the probability mass of a particular symbol sequence s can be written as
 P(S = s) = P(S ≤ s) − P(S < s).   (3.35)
This expression indicates that a symbol sequence s can be represented by an interval I_{N} between two successive values of the cumulative probability mass function P(S ≤ s). The corresponding mapping of a symbol sequence s to a half-open interval I_{N} ⊂ [0,1) is given by
 I_{N}(s) = [ L_{N}, L_{N} + W_{N} )   with   L_{N} = P(S < s)  and  W_{N} = P(S = s).   (3.36)
The interval width W_{N} is equal to the probability P(S = s) of the associated symbol sequence s. In addition, the intervals for different realizations of the random vector S are always disjoint. This can be shown by considering two symbol sequences s_{a} and s_{b}, with s_{a} < s_{b}. The lower interval boundary L_{N}^{b} of the interval I_{N}(s_{b}) satisfies
 L_{N}^{b} = P(S < s_{b}) ≥ P(S < s_{a}) + P(S = s_{a}) = L_{N}^{a} + W_{N}^{a},   (3.38)
i.e., it is greater than or equal to the upper interval boundary of the interval I_{N}(s_{a}). Hence, any real number v inside the interval I_{N}(s) uniquely identifies the symbol sequence s. We represent v as a binary fraction with K bits after the binary point,
 v = ∑_{i=0}^{K−1} b_{i} ⋅ 2^{−i−1} = (0.b_{0}b_{1}…b_{K−1})_b.
In order to identify the symbol sequence s, we then only need to transmit the bit sequence b = {b_{0}, b_{1}, …, b_{K−1}}. The Elias code for the sequence of random variables S is given by the assignment of bit sequences b to the N-symbol sequences s.
For obtaining codewords that are as short as possible, we should choose the real numbers v that can be represented with the minimum number of bits. The distance between successive binary fractions with K bits after the binary point is 2^{−K}. In order to guarantee that some binary fraction with K bits after the binary point falls in an interval of size W_{N}, we need K ≥ −log_{2} W_{N} bits. Consequently, we choose
 K = K(s) = ⌈ −log_{2} W_{N} ⌉,   (3.39)
where ⌈x⌉ represents the smallest integer greater than or equal to x. The binary fraction v, and thus the bit sequence b, is determined by
 v = ⌈ L_{N} ⋅ 2^{K} ⌉ ⋅ 2^{−K}.   (3.40)
An application of the inequalities ⌈x⌉≥ x and ⌈x⌉ < x + 1 to (3.40) and (3.39) yields
 L_{N} ≤ v < L_{N} + 2^{−K} ≤ L_{N} + W_{N},   (3.41)
which proves that the selected binary fraction v always lies inside the interval I_{N}. The Elias code obtained by choosing K = ⌈−log_{2} W_{N}⌉ associates each N-symbol sequence s with a distinct codeword b.
An important property of the Elias code is that the codewords can be iteratively constructed. For deriving the iteration rules, we consider subsequences s^{(n)} = {s_{0}, s_{1}, …, s_{n−1}} that consist of the first n symbols, with 1 ≤ n ≤ N, of the symbol sequence s. Each of these subsequences s^{(n)} can be treated in the same way as the symbol sequence s. Given the interval width W_{n} for the subsequence s^{(n)} = {s_{0}, s_{1}, …, s_{n−1}}, the interval width W_{n+1} for the subsequence s^{(n+1)} = {s^{(n)}, s_{n}} can be derived by
 W_{n+1} = W_{n} ⋅ p(s_{n}),   (3.42)
with p(s_{n}) being the conditional probability mass function P(S_{n} = s_{n} | S_{0} = s_{0}, …, S_{n−1} = s_{n−1}). Similarly, the iteration rule for the lower interval border L_{n} is given by
 L_{n+1} = L_{n} + W_{n} ⋅ c(s_{n}),   (3.43)
where c(s_{n}) represents a cumulative probability mass function (cmf) and is given by
 c(s_{n}) = ∑_{∀a ∈ A_{n}: a < s_{n}} p(a | s_{0}, …, s_{n−1}).   (3.44)
By setting W_{0} = 1 and L_{0} = 0, the iteration rules (3.42) and (3.43) can also be used for calculating the interval width and lower interval border of the first subsequence s^{(1)} = {s_{0}}. Equation (3.43) directly implies L_{n+1} ≥ L_{n}. By combining (3.43) and (3.42), we also obtain
 L_{n+1} + W_{n+1} = L_{n} + W_{n} ⋅ ( c(s_{n}) + p(s_{n}) ) ≤ L_{n} + W_{n},   (3.45)
which shows that each interval I_{n+1} is fully contained in the preceding interval I_{n}.
The iteration rules have been derived for the general case of dependent and differently distributed random variables S_{n}. For iid processes and Markov processes, the general conditional pmf in (3.42) and (3.44) can be replaced with the marginal pmf p(s_{n}) = P(S_{n} = s_{n}) and the conditional pmf p(s_{n} | s_{n−1}) = P(S_{n} = s_{n} | S_{n−1} = s_{n−1}), respectively.
As an example, we consider the iid process in Table 3.6. Besides the pmf p(a) and cmf c(a), the table also specifies a Huffman code. Suppose we intend to transmit the symbol sequence s = ‘CABAC’. If we use the Huffman code, the transmitted bit sequence would be b = ‘10001001’. The iterative code construction process for the Elias coding is illustrated in Table 3.7. The constructed codeword is identical to the codeword that is obtained with the Huffman code. Note that the codewords of an Elias code only have the same number of bits as the Huffman code if all probability masses are integer powers of 1∕2, as in our example.
Table 3.7: Iterative construction of the Elias codeword for the symbol sequence ‘CABAC’
 s_0 = ‘C’:   W_1 = W_0 ⋅ p(‘C’) = 1 ⋅ 2^{−1} = 2^{−1} = (0.1)_b
              L_1 = L_0 + W_0 ⋅ c(‘C’) = 0 + 1 ⋅ 2^{−1} = 2^{−1} = (0.1)_b
 s_1 = ‘A’:   W_2 = W_1 ⋅ p(‘A’) = 2^{−1} ⋅ 2^{−2} = 2^{−3} = (0.001)_b
              L_2 = L_1 + W_1 ⋅ c(‘A’) = L_1 + 2^{−1} ⋅ 0 = 2^{−1} = (0.100)_b
 s_2 = ‘B’:   W_3 = W_2 ⋅ p(‘B’) = 2^{−3} ⋅ 2^{−2} = 2^{−5} = (0.00001)_b
              L_3 = L_2 + W_2 ⋅ c(‘B’) = L_2 + 2^{−3} ⋅ 2^{−2} = 2^{−1} + 2^{−5} = (0.10001)_b
 s_3 = ‘A’:   W_4 = W_3 ⋅ p(‘A’) = 2^{−5} ⋅ 2^{−2} = 2^{−7} = (0.0000001)_b
              L_4 = L_3 + W_3 ⋅ c(‘A’) = L_3 + 2^{−5} ⋅ 0 = 2^{−1} + 2^{−5} = (0.1000100)_b
 s_4 = ‘C’:   W_5 = W_4 ⋅ p(‘C’) = 2^{−7} ⋅ 2^{−1} = 2^{−8} = (0.00000001)_b
              L_5 = L_4 + W_4 ⋅ c(‘C’) = L_4 + 2^{−7} ⋅ 2^{−1} = 2^{−1} + 2^{−5} + 2^{−8} = (0.10001001)_b
 termination: K = ⌈ −log_2 W_5 ⌉ = 8,  v = ⌈ L_5 ⋅ 2^K ⌉ ⋅ 2^{−K} = 2^{−1} + 2^{−5} + 2^{−8},  b = ‘10001001’
Based on the derived iteration rules, we state an iterative encoding and decoding algorithm for Elias codes. The algorithms are specified for the general case using multiple symbol alphabets and conditional pmfs and cmfs. For stationary processes, all alphabets A_{n} can be replaced by a single alphabet A. For iid sources, Markov sources, and other simple source models, the conditional pmfs p(s_{n} | s_{0}, …, s_{n−1}) and cmfs c(s_{n} | s_{0}, …, s_{n−1}) can be simplified as discussed above.
Encoding algorithm: starting with W_{0} = 1 and L_{0} = 0, refine the interval for n = 0, …, N−1 using the iteration rules (3.42) and (3.43); then determine the codeword length K according to (3.39) and transmit the K bits of the binary fraction v given by (3.40).
Decoding algorithm: starting with W_{0} = 1 and L_{0} = 0, determine, for n = 0, …, N−1, the letter s_{n} for which the binary fraction v represented by the received bit sequence lies inside the associated subinterval, output s_{n}, and refine the interval using (3.42) and (3.43).
Since the iterative interval refinement is the same at the encoder and decoder side, Elias coding provides a simple mechanism for the adaptation to sources with unknown or nonstationary statistical properties. Conceptually, for each source symbol s_{n}, the pmf p(s_{n} | s_{0}, …, s_{n−1}) can be simultaneously estimated at the encoder and decoder side based on the already coded symbols s_{0} to s_{n−1}. For this purpose, a source can often be modeled as a process with independent random variables or as a Markov process. For the simple model of independent random variables, the pmf p(s_{n}) for a particular symbol s_{n} can be approximated by the relative frequencies of the alphabet letters inside the sequence of the preceding N_{W} coded symbols. The chosen window size N_{W} adjusts the trade-off between a fast adaptation and an accurate probability estimation. The same approach can also be applied for higher-order probability models such as the Markov model. In this case, the conditional pmf is approximated by the corresponding relative conditional frequencies.
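A backward-adaptive estimate of this kind can be sketched in a few lines; the add-one offset below is an assumption made here to avoid zero probability masses and is not prescribed by the text.

    from collections import Counter

    # Relative frequencies inside a window of the last N_W coded symbols.
    def adaptive_pmf(coded_symbols, alphabet, N_W=64):
        window = coded_symbols[-N_W:]
        counts = Counter(window)
        total = len(window) + len(alphabet)     # add-one smoothing (assumption)
        return {a: (counts[a] + 1) / total for a in alphabet}

Since the decoder has already reconstructed the symbols s_{0} to s_{n−1}, it arrives at exactly the same estimate without any side information.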
The average codeword length per symbol for the Elias code is given by
 ℓ̄ = (1/N) ⋅ E{ K } = (1/N) ⋅ E{ ⌈ −log_{2} W_{N} ⌉ } = (1/N) ⋅ E{ ⌈ −log_{2} p(S) ⌉ }.   (3.46)
By applying the inequalities ⌈x⌉ ≥ x and ⌈x⌉ < x + 1, we obtain
 (1/N) ⋅ H(S_{0}, …, S_{N−1})  ≤  ℓ̄  <  (1/N) ⋅ H(S_{0}, …, S_{N−1}) + 1/N.   (3.47)
It should be noted that the Elias code is not guaranteed to be prefix-free, i.e., a codeword for a particular symbol sequence may be a prefix of the codeword for another symbol sequence. Hence, the Elias code as described above can only be used if the length of the codeword is known at the decoder side^{5}. A prefix-free Elias code can be constructed if the lengths of all codewords are increased by one, i.e., by choosing
 K = ⌈ −log_{2} W_{N} ⌉ + 1.   (3.48)
The Elias code has several desirable properties, but it is still impractical, since the precision that is required for representing the interval widths and lower interval boundaries grows without bound for long symbol sequences. The widely used approach of arithmetic coding is a variant of Elias coding that can be implemented with fixed-precision integer arithmetic.
For the following considerations, we assume that the probability masses p(s_{n} | s_{0}, …, s_{n−1}) are given with a fixed number V of binary digits after the binary point. We will omit the conditions “s_{0}, …, s_{n−1}” and represent the pmfs p(a) and cmfs c(a) by
 p(a) = p_{V}(a) ⋅ 2^{−V}   and   c(a) = c_{V}(a) ⋅ 2^{−V},   (3.49)
where p_{V}(a) and c_{V}(a) are V-bit positive integers.
The key observation for designing arithmetic coding schemes is that the Elias code remains decodable if the interval width W_{n+1} satisfies
 0 < W_{n+1} ≤ W_{n} ⋅ p(s_{n}).   (3.50)
This guarantees that the interval I_{n+1} is always nested inside the interval I_{n}. Equation (3.43) implies L_{n+1} ≥ L_{n}, and by combining (3.43) with the inequality (3.50), we obtain
 L_{n+1} + W_{n+1} ≤ L_{n} + W_{n}.   (3.51)
Hence, we can represent the interval width W_{n} with a fixed number of precision bits if we round it toward zero in each iteration step.
Let the interval width W_{n} be represented by a U-bit integer A_{n} and an integer z_{n} ≥ U according to
 W_{n} = A_{n} ⋅ 2^{−z_{n}}.   (3.52)
We restrict A_{n} to the range
 2^{U−1} ≤ A_{n} < 2^{U},   (3.53)
so that W_{n} is represented with a maximum precision of U bits. In order to suitably approximate W_{0} = 1, the values of A_{0} and z_{0} are set equal to 2^{U} − 1 and U, respectively. The interval refinement can then be specified by
 A_{n+1} = ⌊ A_{n} ⋅ p_{V}(s_{n}) ⋅ 2^{−y_{n}} ⌋,   (3.54)
 z_{n+1} = z_{n} + V − y_{n},   (3.55)
where y_{n} is a bit shift parameter with 0 ≤ y_{n} ≤ V. These iteration rules guarantee that (3.50) is fulfilled. It should also be noted that the operation ⌊x ⋅ 2^{−y}⌋ specifies a simple right shift of the binary representation of x by y binary digits. To fulfill the constraint in (3.53), the bit shift parameter y_{n} has to be chosen according to
 2^{U−1} ≤ ⌊ A_{n} ⋅ p_{V}(s_{n}) ⋅ 2^{−y_{n}} ⌋ < 2^{U}.   (3.56)
The value of y_{n} can be determined by a series of comparison operations.
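A single refinement step of this fixed-precision variant can be written directly in integer arithmetic; the following sketch (with illustrative values of U and V and the assumption p_{V}(s_{n}) ≥ 1) finds y_{n} by the series of comparisons mentioned above.

    U, V = 12, 16                              # illustrative precisions

    def refine(A_n, z_n, pV_sn):
        prod = A_n * pV_sn                     # at most U + V bits
        y_n = 0
        while (prod >> y_n) >= (1 << U):       # enforce A_{n+1} < 2^U      (3.56)
            y_n += 1
        A_next = prod >> y_n                   # floor(A_n p_V(s_n) 2^(-y)) (3.54)
        z_next = z_n + V - y_n                 #                            (3.55)
        assert A_next >= 1 << (U - 1)          # lower bound of (3.53) holds
        return A_next, z_next

    A0, z0 = (1 << U) - 1, U                   # W_0 is approximated by 1 - 2^(-U)
    print(refine(A0, z0, pV_sn=1 << (V - 1)))  # a letter with probability 1/2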
Given the fixed-precision representation of the interval width W_{n}, we investigate the impact on the lower interval boundary L_{n}. The binary representation of the product
 W_{n} ⋅ c(s_{n}) = A_{n} ⋅ c_{V}(s_{n}) ⋅ 2^{−(z_{n}+V)}   (3.58)
can be classified into four categories. The trailing bits that follow the (z_{n} + V)-th bit after the binary point are equal to 0, but may be modified by following interval updates. The preceding U + V bits are directly modified by the update L_{n+1} = L_{n} + W_{n} ⋅ c(s_{n}) and are referred to as active bits. The active bits are preceded by a sequence of zero or more 1-bits and a leading 0-bit (if present). These c_{n} bits are called outstanding bits and may be modified by a carry from the active bits. The z_{n} − c_{n} − U bits after the binary point, which are referred to as settled bits, are not modified in any following interval update. Furthermore, these bits cannot be modified by the rounding operation that generates the final codeword, since all intervals I_{n+k}, with k > 0, are nested inside the interval I_{n} and the binary representation of the interval width W_{n} = A_{n} ⋅ 2^{−z_{n}} also consists of z_{n} − U 0-bits after the binary point. And since the number of bits in the final codeword,
 K = ⌈ −log_{2} W_{N} ⌉ = ⌈ z_{N} − log_{2} A_{N} ⌉,   (3.59)
is always greater than or equal to the number of settled bits, the settled bits can be transmitted as soon as they have become settled. Hence, in order to represent the lower interval boundary L_{n}, it is sufficient to store the U + V active bits and a counter for the number of 1-bits that precede the active bits.
For the decoding of a particular symbol s_{n}, it has to be determined whether the binary fraction v in (3.40) that is represented by the transmitted codeword falls inside the interval I_{n+1}(a_{i}) for an alphabet letter a_{i}. Given the described fixed-precision interval refinement, it is sufficient to compare the c_{n+1} outstanding bits and the U + V active bits of the lower interval boundary L_{n+1} with the corresponding bits of the transmitted codeword and the upper interval boundary L_{n+1} + W_{n+1}.
It should be noted that the number of outstanding bits can become arbitrarily large. In order to force an output of bits, the encoder can insert a 0-bit if it detects a sequence of a particular number of 1-bits. The decoder can identify the additionally inserted bit and interpret it as extra carry information. This technique is, for example, used in the MQ-coder [72] of JPEG 2000 [41].
In comparison to Elias coding, the usage of the presented fixed-precision approximation increases the codeword length for coding a symbol sequence s = {s_{0}, s_{1}, …, s_{N−1}}. Given W_{N} for n = N in (3.52), the excess rate of arithmetic coding over Elias coding is given by
 Δℓ = (1/N) ⋅ ⌈ −log_{2} W_{N} ⌉ − (1/N) ⋅ ⌈ −log_{2} P(S = s) ⌉ < (1/N) + (1/N) ⋅ log_{2}( P(S = s) / W_{N} ),   (3.60)
where we used the inequalities ⌈x⌉ < x + 1 and ⌈x⌉ ≥ x to derive the upper bound on the right side. We shall further take into account that we may have to approximate the real pmfs p(a) in order to represent the probability masses as multiples of 2^{−V}. Let q(a) represent an approximated pmf that is used for arithmetic coding and let p_{min} denote the minimum probability mass of the corresponding real pmf p(a). The pmf approximation can always be done in a way that the difference p(a) − q(a) is less than 2^{−V}, which gives
 ∏_{n=0}^{N−1} q(s_{n}) ≥ ∏_{n=0}^{N−1} p(s_{n}) ⋅ ( 1 − 2^{−V}/p_{min} )^{N} = P(S = s) ⋅ ( 1 − 2^{−V}/p_{min} )^{N}.   (3.61)
An application of the inequality ⌊x⌋ > x − 1 to the interval refinement (3.54) with the approximated pmf q(a) yields
 W_{n+1} > W_{n} ⋅ q(s_{n}) − 2^{−z_{n+1}}.   (3.62)
By using the relationship W_{n+1} ≥ 2^{U−1−z_{n+1}}, which is a direct consequence of (3.53), we obtain
 W_{n+1} > W_{n} ⋅ q(s_{n}) ⋅ ( 1 + 2^{1−U} )^{−1}.   (3.63)
Inserting the expressions (3.61) and (3.63) into (3.60) yields an upper bound for the increase in codeword length per symbol,
 Δℓ < (1/N) + log_{2}( 1 + 2^{1−U} ) − log_{2}( 1 − 2^{−V}/p_{min} ).   (3.64)
If we consider, for example, the coding of N = 1000 symbols with U = 12, V = 16, and p_{min} = 0.02, the increase in codeword length in relation to Elias coding is guaranteed to be less than 0.003 bit per symbol.
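Plugging the quoted example values into the reconstructed bound (3.64) confirms the stated number:

    from math import log2

    N, U, V, p_min = 1000, 12, 16, 0.02
    bound = 1/N + log2(1 + 2**(1 - U)) - log2(1 - 2**(-V) / p_min)
    print(bound)    # ~0.0028 bit per symbol, i.e. less than 0.003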
Arithmetic coding with binary symbol alphabets is referred to as binary arithmetic coding. It is the most popular type of arithmetic coding in image and video coding applications. The main reason for using binary arithmetic coding is its reduced complexity. It is particularly advantageous for adaptive coding, since the rather complex estimation of M-ary pmfs can be replaced by the simpler estimation of binary pmfs. Well-known examples of efficient binary arithmetic coding schemes that are used in image and video coding are the MQ-coder [72] in the picture coding standard JPEG 2000 [41] and the M-coder [55] in the video coding standard H.264/AVC [36].
In general, a symbol sequence s = {s_{0}, s_{1}, …, s_{N−1}} has to be first converted into a sequence c = {c_{0}, c_{1}, …, c_{B−1}} of binary symbols before binary arithmetic coding can be applied. This conversion process is often referred to as binarization, and the elements of the resulting binary sequences c are also called bins. The number B of bins in a sequence c can depend on the actual source symbol sequence s. Hence, the bin sequences c can be interpreted as realizations of a variable-length sequence of binary random variables C = {C_{0}, C_{1}, …, C_{B−1}}.
Conceptually, the binarization mapping S → C represents a lossless coding step and any lossless source code could be applied for this purpose. It is only important that the used lossless source code is uniquely decodable. The average codeword length that is achieved by the binarization mapping does not have any impact on the efficiency of binary arithmetic coding, since the block entropy for the sequence of random variables S = {S_{0}, S_{1}, …, S_{N−1}},
 H(S_{0}, S_{1}, …, S_{N−1}) = E{ −log_{2} p(S_{0}, S_{1}, …, S_{N−1}) },
is equal to the entropy of the variable-length binary random vector C = {C_{0}, C_{1}, …, C_{B−1}}. The actual compression is achieved by the arithmetic coding. The above result also shows that binary arithmetic coding can provide the same coding efficiency as M-ary arithmetic coding if the influence of the finite-precision arithmetic is negligible.
In practice, the binarization is usually done with very simple prefix codes for the random variables S_{n}. As we assume that the order of the different random variables is known to both encoder and decoder, different prefix codes can be used for each random variable without impacting unique decodability. A typical example of a binarization mapping, which is called truncated unary binarization, is illustrated in Table 3.8.
Table 3.8: Truncated unary binarization of an M-ary alphabet
 S_n       number of bins B   C_0  C_1  C_2  …  C_{M−3}  C_{M−2}
 a_0       1                  1
 a_1       2                  0    1
 a_2       3                  0    0    1
 …         …                  …
 a_{M−3}   M−2                0    0    0    …  1
 a_{M−2}   M−1                0    0    0    …  0        1
 a_{M−1}   M−1                0    0    0    …  0        0
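The mapping of Table 3.8 is simple enough to state directly; the sketch below is a straightforward transcription of the table.

    # Truncated unary binarization (Table 3.8): letter a_i is mapped to i 0-bins
    # followed by a terminating 1-bin; for the last letter a_{M-1} the
    # terminating 1-bin is omitted.
    def truncated_unary(i, M):
        return [0] * i + [1] if i < M - 1 else [0] * (M - 1)

    def truncated_unary_inverse(bins, M):
        i = 0
        while i < M - 1 and bins[i] == 0:   # read bins up to the first 1-bin
            i += 1
        return i                            # index of the decoded letter a_i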
The binary pmfs for the random variables C_{i} can be directly derived from the pmfs of the random variables S_{n}. For the example in Table 3.8, the binary pmf {P(C_{i} = 0), 1 − P(C_{i} = 0)} for a random variable C_{i} is given by
 P(C_{i} = 0) = P(S_{n} > a_{i}) / P(S_{n} ≥ a_{i}),   (3.65)
where we omitted the condition for the binary pmf. For coding nonstationary sources, it is usually preferable to directly estimate the marginal or conditional pmf’s for the binary random variables instead of the pmf’s for the source signal.
3.5 Probability Interval Partitioning Entropy Coding
For some applications, arithmetic coding is still considered too complex. As a less complex alternative, a lossless coding scheme called probability interval partitioning entropy (PIPE) coding has recently been proposed [54]. It combines concepts from binary arithmetic coding and Huffman coding for variable-length vectors with a quantization of the binary probability interval.
A block diagram of the PIPE coding structure is shown in Fig. 3.4. It is assumed that the input symbol sequences s = {s_{0}, s_{1}, …, s_{N−1}} represent realizations of a sequence S = {S_{0}, S_{1}, …, S_{N−1}} of random variables. Each random variable can be characterized by a distinct alphabet A_{n}. The number N of source symbols is assumed to be known to the encoder and decoder. Similarly as for binary arithmetic coding, a symbol sequence s = {s_{0}, s_{1}, …, s_{N−1}} is first converted into a sequence c = {c_{0}, c_{1}, …, c_{B−1}} of B binary symbols (bins). Each bin c_{i} can be considered as a realization of a corresponding random variable C_{i} and is associated with a pmf. The binary pmf is given by the probability P(C_{i} = 0), which is known to the encoder and decoder. Note that the conditional dependencies have been omitted in order to simplify the description.
The key observation for designing a low-complexity alternative to binary arithmetic coding is that an appropriate quantization of the binary probability interval has only a minor impact on the coding efficiency. This is employed by partitioning the binary probability interval into a small number U of half-open intervals I_{k} = (p_{k}, p_{k+1}], with 0 ≤ k < U. Each bin c_{i} is assigned to the interval I_{k} for which p_{k} < P(C_{i} = 0) ≤ p_{k+1}. As a result, the bin sequence c is decomposed into U bin sequences u_{k} = {u_{0}^{k}, u_{1}^{k}, …}, with 0 ≤ k < U. For the purpose of coding, each of the bin sequences u_{k} can be treated as a realization of a binary iid process with a pmf {p_{I_k}, 1 − p_{I_k}}, where p_{I_k} denotes a representative probability for an interval I_{k}, and can be efficiently coded with a V2V code as described in sec. 3.3. The resulting U codeword sequences b_{k} are finally multiplexed in order to produce a data packet for the symbol sequence s.
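The decomposition step amounts to a simple lookup per bin, as the following sketch shows; the inner interval borders are taken from Table 3.11 further below, and the representation of a bin as a (value, probability) pair is an assumption made for this illustration.

    import bisect

    borders = [0.1326, 0.3294]                   # inner borders for U = 3 intervals

    def decompose(bins_with_prob, borders):
        sequences = [[] for _ in range(len(borders) + 1)]
        for bin_value, p0 in bins_with_prob:     # p0 = P(C_i = 0) <= 0.5
            k = bisect.bisect_left(borders, p0)  # interval with p_k < p0 <= p_{k+1}
            sequences[k].append(bin_value)
        return sequences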
Given the U probability intervals _{k} = (p_{k},p_{k+1}] and corresponding V2V codes, the PIPE coding process can be summarized as follows:
The binarization process is the same as for binary arithmetic coding described in sec. 3.4. Typically, each symbol s_{n} of the input symbol sequence s = {s_{0}, s_{1}, …, s_{N−1}} is converted into a sequence c_{n} of a variable number of bins using a simple prefix code, and these bin sequences c_{n} are concatenated to produce the bin sequence c that uniquely represents the input symbol sequence s. Here, a distinct prefix code can be used for each random variable S_{n}. Given the prefix codes, the conditional binary pmfs P(C_{i} | C_{0}, …, C_{i−1}) can be directly derived based on the conditional pmfs for the random variables S_{n}. The binary pmfs can either be fixed or they can be simultaneously estimated at the encoder and decoder side^{6}. In order to simplify the following description, we omit the conditional dependencies and specify the binary pmf for the i-th bin by the probability P(C_{i} = 0).
For the purpose of binary coding, it is preferable to use bin sequences c for which all probabilities P(C_{i} = 0) are less than or equal to 0.5. This property can be ensured by inverting a bin value c_{i} if the associated probability P(C_{i} = 0) is greater than 0.5. The inverse operation can be done at the decoder side, so that the unique decodability of a symbol sequence s from the associated bin sequence c is not influenced. For PIPE coding, we assume that this additional operation is done during the binarization and that all bins c_{i} of a bin sequence c are associated with probabilities P(C_{i} = 0) ≤ 0.5.
Table 3.9: Bin probabilities for the truncated unary binarization of the Markov source given in Table 3.2 (after inversion of all bins with P(C_i = 0) > 0.5)
                                    C_0(S_n)   C_1(S_n)
 P(C_i(S_n) = 0 | S_{n−1} = a_0)    0.10       0.50
 P(C_i(S_n) = 0 | S_{n−1} = a_1)    0.15       1/17
 P(C_i(S_n) = 0 | S_{n−1} = a_2)    0.25       0.20
As an example, we consider the binarization for the stationary Markov source specified in Table 3.2. If the truncated unary binarization given in Table 3.8 is used and all bins with probabilities P(C_{i} = 0) greater than 0.5 are inverted, we obtain the bin probabilities given in Table 3.9. C_{i}(S_{n}) denotes the random variable that corresponds to the i-th bin inside the bin sequences for the random variable S_{n}.
The half-open probability interval (0, 0.5], which includes all possible bin probabilities P(C_{i} = 0), is partitioned into U intervals I_{k} = (p_{k}, p_{k+1}]. This set of intervals is characterized by U − 1 interval borders p_{k} with k = 1, …, U−1. Without loss of generality, we assume p_{k} < p_{k+1}. The outer interval borders are fixed and given by p_{0} = 0 and p_{U} = 0.5. Given the interval boundaries, the sequence of bins c is decomposed into U separate bin sequences u_{k} = (u_{0}^{k}, u_{1}^{k}, …), where each bin sequence u_{k} contains the bins c_{i} with P(C_{i} = 0) ∈ I_{k}. Each bin sequence u_{k} is coded with a binary coder that is optimized for a representative probability p_{I_k} for the interval I_{k}.
For analyzing the impact of the probability interval partitioning, we assume that we can design a lossless code for binary iid processes that achieves the entropy limit. The average codeword length ℓ_{b}(p, p_{I_k}) for coding a bin c_{i} with the probability p = P(C_{i} = 0) using an optimal code for the representative probability p_{I_k} is given by
 ℓ_{b}(p, p_{I_k}) = −p ⋅ log_{2}(p_{I_k}) − (1 − p) ⋅ log_{2}(1 − p_{I_k}).   (3.66)
When we further assume that the relative frequencies of the bin probabilities p inside a bin sequence c are given by the pdf f(p), the average codeword length per bin ℓ̄_{b} for a given set of U intervals I_{k} with representative probabilities p_{I_k} can then be written as
 ℓ̄_{b} = ∑_{k=0}^{U−1} ∫_{p_{k}}^{p_{k+1}} ℓ_{b}(p, p_{I_k}) ⋅ f(p) dp.   (3.67)
Minimization with respect to the interval borders p_{k} and representative probabilities p_{I_k} yields the equation system
 p_{k}^{*} :  ℓ_{b}(p_{k}^{*}, p_{I_{k−1}}) = ℓ_{b}(p_{k}^{*}, p_{I_k}),   (3.68)
 p_{I_k}^{*} = ( ∫_{p_{k}}^{p_{k+1}} p ⋅ f(p) dp ) / ( ∫_{p_{k}}^{p_{k+1}} f(p) dp ).   (3.69)
Given the pdf f(p) and the number of intervals U, the interval partitioning can be derived by an iterative algorithm that alternately updates the interval borders p_{k} and interval representatives p_{I_k}. As an example, Fig. 3.5 shows the probability interval partitioning for a uniform distribution f(p) of the bin probabilities and U = 4 intervals. As can be seen, the probability interval partitioning leads to a piecewise linear approximation of the binary entropy function H(p).
Fig. 3.5 Example for the partitioning of the probability interval (0, 0.5] into 4 intervals assuming a uniform distribution of the bin probabilities p = P(C_{i} = 0).
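A sketch of this alternating optimization for a uniform f(p) on (0, 0.5] follows; the grid resolution and the number of iterations are illustrative choices, and the closed-form border update exploits that (3.68) is linear in p for the cost function (3.66).

    import numpy as np

    U = 4
    grid = np.linspace(1e-4, 0.5, 5001)           # f(p) uniform on this grid
    borders = np.linspace(0.0, 0.5, U + 1)        # initial borders p_0 .. p_U
    for _ in range(50):
        reps = []                                  # representatives (3.69)
        for k in range(U):
            sel = grid[(grid > borders[k]) & (grid <= borders[k + 1])]
            reps.append(sel.mean())                # centroid for uniform f(p)
        for k in range(1, U):                      # borders: intersection of the
            a = np.log2((1 - reps[k - 1]) / (1 - reps[k]))   # neighboring cost
            b = np.log2(reps[k] / reps[k - 1]) + a           # functions (3.68)
            borders[k] = a / b
    print(borders)
    print(reps)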
The relative increase in the average codeword length per bin with respect to the binary entropy function is given by
 ϱ̄ = ( ℓ̄_{b} − ∫_{0}^{0.5} H(p) ⋅ f(p) dp ) / ( ∫_{0}^{0.5} H(p) ⋅ f(p) dp ).   (3.70)
Table 3.10 lists the increases in average codeword length per bin for a uniform and a linearly increasing (f(p) = 8p) distribution of the bin probabilities for selected numbers U of intervals.
Table 3.10: Relative increase in average codeword length per bin for a uniform (ϱ̄_uni) and a linearly increasing (ϱ̄_lin) distribution of the bin probabilities
 U            1      2     4     8     12    16
 ϱ̄_uni [%]   12.47  3.67  1.01  0.27  0.12  0.07
 ϱ̄_lin [%]   5.68   1.77  0.50  0.14  0.06  0.04
We now consider the probability interval partitioning for the Markov source specified in Table 3.2. As shown in Table 3.9, the binarization described above led to 6 different bin probabilities. For the truncated unary binarization of a Markov source, the relative frequency h(p_{ij}) that a bin with probability p_{ij} = P(C_{i}(S_{n}) = 0 | S_{n−1} = a_{j}) occurs inside the bin sequence c is equal to
 (3.71) 
The distribution of the bin probabilities is thus a discrete one, given by the six probabilities p_{ij} and their relative frequencies h(p_{ij}). Table 3.11 shows the partitioning of the probability interval (0, 0.5] into U = 3 intervals that was obtained for this distribution.
Table 3.11: Probability interval partitioning with U = 3 intervals for the bin probabilities of Table 3.9
 interval I_k = (p_k, p_{k+1}]    representative p_{I_k}
 I_0 = (0, 0.1326]                0.0900
 I_1 = (0.1326, 0.3294]           0.1848
 I_2 = (0.3294, 0.5]              0.5000
For the purpose of binary coding, a bin sequence u_{k} for the probability interval I_{k} can be treated as a realization of a binary iid process with a pmf {p_{I_k}, 1 − p_{I_k}}. The statistical dependencies between the bins have already been exploited by associating each bin c_{i} with a probability P(C_{i} = 0) that depends on previously coded bins or symbols according to the employed probability modeling. The V2V codes described in sec. 3.3 are simple but very efficient lossless source codes for binary iid processes U_{k} = {U_{n}^{k}}. Using these codes, a variable number of bins is mapped to a variable-length codeword. By considering a sufficiently large number of table entries, these codes can achieve an average codeword length close to the entropy rate H̄(U_{k}) = H(U_{n}^{k}).
Table 3.12: V2V codes with up to 8 codewords for the interval representatives of Table 3.11

 code for p_{I_0} = 0.09  (ℓ̄_0 = 0.4394, ϱ_0 = 0.69%):
   bin sequence   codeword
   1111111        1
   0              011
   10             0000
   110            0001
   1110           0010
   11110          0011
   111110         0100
   1111110        0101

 code for p_{I_1} = 0.1848  (ℓ̄_1 = 0.6934, ϱ_1 = 0.42%):
   bin sequence   codeword
   111            1
   110            001
   011            010
   1011           011
   00             00000
   100            00001
   010            00010
   1010           00011

 code for p_{I_2} = 0.5  (ℓ̄_2 = 1, ϱ_2 = 0%):
   bin sequence   codeword
   1              1
   0              0
As an example, Table 3.12 shows V2V codes for the interval representatives p_{I_k} of the probability interval partitioning given in Table 3.11. These codes achieve the minimum average codeword length per bin among all V2V codes with up to 8 codewords. The table additionally lists the average codeword lengths per bin ℓ̄_{k} and the corresponding redundancies ϱ_{k} = ( ℓ̄_{k} − H̄(U_{k}) ) / H̄(U_{k}). The code redundancies could be further decreased if V2V codes with more than 8 codewords were considered. When we assume that the number N of symbols approaches infinity, the average codeword length per symbol for the applied truncated unary binarization is given by
 ℓ̄ = ℓ̄_{b} ⋅ ( E{B} / N ),   (3.72)
where the first term represents the average codeword length per bin for the bin sequence c and the second term is the bin-to-symbol ratio. For our simple example, the average codeword length for the PIPE coding is ℓ̄ = 0.7432 bit per symbol. It is only 1.37% larger than the entropy rate and significantly smaller than the average codeword lengths for the scalar, conditional, and block Huffman codes that we developed in sec. 3.2 and sec. 3.3.
In general, the average codeword length per symbol can be further decreased if the V2V codes and the probability interval partitioning are jointly optimized. This can be achieved by an iterative algorithm that alternately optimizes the interval representatives p_{I_k}, the V2V codes for the interval representatives, and the interval borders p_{k}. Each codeword entry m of a binary V2V code 𝒞_{k} is characterized by the number x_{m} of 0-bins, the number y_{m} of 1-bins, and the length ℓ_{m} of the codeword. As can be concluded from the description of V2V codes in sec. 3.3, the average codeword length for coding a bin c_{i} with a probability p = P(C_{i} = 0) using a V2V code 𝒞_{k} is given by
 ℓ̄_{b}(p, 𝒞_{k}) = ( ∑_{m=0}^{V−1} p^{x_{m}} (1−p)^{y_{m}} ⋅ ℓ_{m} ) / ( ∑_{m=0}^{V−1} p^{x_{m}} (1−p)^{y_{m}} ⋅ (x_{m} + y_{m}) ),   (3.73)
where V denotes the number of codeword entries. Hence, an optimal interval border p_{k} is given by the intersection point of the functions ℓ̄_{b}(p, 𝒞_{k−1}) and ℓ̄_{b}(p, 𝒞_{k}) for the V2V codes of the neighboring intervals.
Fig. 3.6 Difference between the average codeword length and the binary entropy function H(p) for a probability interval partitioning into U = 12 intervals assuming optimal binary codes and a real design with V2V codes of up to 65 codeword entries. The distribution of bin probabilities is assumed to be uniform.
As an example, we jointly derived the partitioning into U = 12 probability intervals and corresponding V2V codes with up to 65 codeword entries for a uniform distribution of bin probabilities. Fig. 3.6 shows the difference between the average codeword length per bin and the binary entropy function H(p) for this design and for a theoretically optimal probability interval partitioning assuming optimal binary codes with ℓ̄_{k} = H(p_{I_k}). The overall redundancy with respect to the entropy limit is 0.24% for the jointly optimized design and 0.12% for the probability interval partitioning assuming optimal binary codes.
The U codeword sequences b_{k} that are generated by the different binary encoders for a set of source symbols (e.g., a slice of a video picture) can be written to different partitions of a data packet. This enables a parallelization of the bin encoding and decoding process. At the encoder side, each subsequence u_{k} is written to a different buffer and the actual binary encoding can be done in parallel. At the decoder side, the U codeword sequences b_{k} can be decoded in parallel and the resulting bin sequences u_{k} can be stored in separate bin buffers. The remaining entropy decoding process can then be designed in a way that it simply reads bins from the corresponding U bin buffers.
The separate transmission of the codeword streams requires the signaling of partitioning information. Furthermore, parallelized entropy coding is often not required for small data packets. In such a case, the codewords of the U codeword sequences can be interleaved without any rate overhead. The decoder can simply read a new codeword from the bitstream if a new bin is requested by the decoding process and all bins of the previously read codeword for the corresponding interval _{k} have been used. At the encoder side, it has to be ensured that the codewords are written in the same order in which they are read at the decoder side. This can be efficiently realized by introducing a codeword buffer.
For PIPE coding, the concept of unique decodability has to be extended. Since the binarization is done using prefix codes, it is always invertible^{7}. However, the resulting sequence of bins c is partitioned into U subsequences u_{k}
 {u_{0}, u_{1}, …, u_{U−1}} = γ_{p}(c),   (3.74)
and each of these subsequences u_{k} is separately coded. The bin sequence c is uniquely decodable if each subsequence of bins u_{k} is uniquely decodable and the partitioning rule γ_{p} is known to the decoder. The partitioning rule γ_{p} is given by the probability interval partitioning {I_{k}} and the probabilities P(C_{i} = 0) that are associated with the coding bins c_{i}. Hence, the probability interval partitioning {I_{k}} has to be known at the decoder side and the probability P(C_{i} = 0) for each bin c_{i} has to be derived in the same way at the encoder and decoder side.
3.6 Comparison of Lossless Coding Techniques
In the preceding sections, we presented different lossless coding techniques. We now compare these techniques with respect to their coding efficiency for the stationary Markov source specified in Table 3.2 and different message sizes L. In Fig. 3.7, the average codeword lengths per symbol for the different lossless source codes are plotted over the number L of coded symbols. For each number of coded symbols, the shown average codeword lengths were calculated as mean values over a set of one million different realizations of the example Markov source and can be considered as accurate approximations of the expected average codeword lengths per symbol. For comparison, Fig. 3.7 also shows the entropy rate and the instantaneous entropy rate, which is given by
 H̄_{inst}(S, L) = (1/L) ⋅ H(S_{0}, S_{1}, …, S_{L−1}),   (3.75)
and represents the greatest lower bound for the average codeword length per symbol when a message of L symbols is coded.
Fig. 3.7 Comparison of lossless coding techniques for the stationary Markov source specified in Table 3.2 and different numbers L of coded symbols.
For L = 1 and L = 5, the scalar Huffman code and the Huffman code for blocks of 5 symbols, respectively, achieve the minimum average codeword length, which confirms that Huffman codes are optimal codes for a given set of letters or letter sequences with a fixed pmf. But if more than 10 symbols are coded, all investigated Huffman codes have a lower coding efficiency than arithmetic and PIPE coding. For large numbers of coded symbols, the average codeword length for arithmetic coding approaches the entropy rate. The average codeword length for PIPE coding is only slightly larger; the difference to arithmetic coding could be further reduced by increasing the number of probability intervals and the number of codewords for the V2V tables.
3.7 Adaptive Coding
The design of Huffman codes and the coding process for arithmetic codes and PIPE codes require that the statistical properties of a source, i.e., the marginal pmf or the joint or conditional pmf’s of up to a certain order, are known. Furthermore, the local statistical properties of real data such as image and video signals usually change with time. The average codeword length can often be decreased if a lossless code is flexible and can be adapted to the local statistical properties of a source. The approaches for adaptive coding are classified into approaches with forward adaptation and approaches with backward adaptation. The basic coding structure for these methods is illustrated in Fig. 3.8.
In adaptive coding methods with forward adaptation, the statistical properties of a block of successive samples are analyzed in the encoder and an adaptation signal is included in the bitstream. This adaptation signal can be, for example, a Huffman code table, one or more pmf’s, or an index into a predefined list of Huffman codes or pmf’s. The decoder adjusts the used code for the block of samples according to the transmitted information. Disadvantages of this approach are that the required side information increases the transmission rate and that forward adaptation introduces a delay.
Methods with backward adaptation estimate the local statistical properties based on already coded symbols simultaneously at encoder and decoder side. As mentioned in sec. 3.2, the adaptation of Huffman codes is a quite complex task, so that backward adaptive VLC coding is rarely used in practice. But for arithmetic coding, in particular binary arithmetic coding, and PIPE coding, the backward adaptive estimation of pmf’s can be easily integrated in the coding process. Backward adaptive coding methods do not introduce a delay and do not require the transmission of any side information. However, they are not robust against transmission errors. For this reason, backward adaptation is usually only used inside a transmission packet. It is also possible to combine backward and forward adaptation. As an example, the arithmetic coding design in H.264/AVC [36] supports the transmission of a parameter inside a data packet that specifies one of three initial sets of pmf’s, which are then adapted based on the actually coded symbols.
3.8 Summary of Lossless Source Coding
We have introduced the concept of uniquely decodable codes and investigated the design of prefix codes. Prefix codes provide the useful property of instantaneous decodability and it is possible to achieve an average codeword length that is not larger than the average codeword length for any other uniquely decodable code. The measures of entropy and block entropy have been derived as lower bounds for the average codeword length for coding a single symbol and a block of symbols, respectively. A lower bound for the average codeword length per symbol for any lossless source coding technique is the entropy rate.
Huffman codes have been introduced as optimal codes that assign a separate codeword to a given set of letters or letter sequences with a fixed pmf. However, for sources with memory, an average codeword length close to the entropy rate can only be achieved if a large number of symbols is coded jointly, which requires large codeword tables and is not feasible in practical coding systems. Furthermore, the adaptation of Huffman codes to timevarying statistical properties is typically considered as too complex for video coding applications, which often have realtime requirements.
Arithmetic coding represents a fixedprecision variant of Elias coding and can be considered as a universal lossless coding method. It does not require the storage of a codeword table. The arithmetic code for a symbol sequence is iteratively constructed by successively refining a cumulative probability interval, which requires a fixed number of arithmetic operations per coded symbol. Arithmetic coding can be elegantly combined with backward adaptation to the local statistical behavior of the input source. For the coding of long symbol sequences, the average codeword length per symbol approaches the entropy rate.
As an alternative to arithmetic coding, we presented the probability interval partitioning entropy (PIPE) coding. The input symbols are binarized using simple prefix codes and the resulting sequence of binary symbols is partitioned into a small number of bin sequences, which are then coded using simple binary V2V codes. PIPE coding provides the same simple mechanism for probability modeling and backward adaptation as arithmetic coding. However, the complexity is reduced in comparison to arithmetic coding and PIPE coding provides the possibility to parallelize the encoding and decoding process. For long symbol sequences, the average codeword length per symbol is similar to that of arithmetic coding.
It should be noted that there are various other approaches to lossless coding, including Lempel-Ziv coding [79], Tunstall coding [73, 66], and Burrows-Wheeler coding [7]. These methods are not considered in this text, since they are not used in the video coding area.
4 Rate Distortion Theory
In lossy coding, the reconstructed signal is not identical to the source signal, but represents only an approximation of it. A measure of the deviation between the approximation and the original signal is referred to as distortion. Rate distortion theory addresses the problem of determining the minimum average number of bits per sample that is required for representing a given source without exceeding a given distortion. The greatest lower bound for the average number of bits is referred to as the rate distortion function and represents a fundamental bound on the performance of lossy source coding algorithms, just as the entropy rate represents a fundamental bound for lossless source coding. For deriving the results of rate distortion theory, no particular coding technique is assumed. The applicability of rate distortion theory includes discrete and continuous random processes.
In this chapter, we give an introduction to rate distortion theory and derive rate distortion bounds for some important model processes. We will use these results in the following chapters for evaluating the performance of different lossy coding techniques. For further details, the reader is referred to the comprehensive treatments of the subject in [22, 4] and the overview in [11].
4.1 The Operational Rate Distortion Function
A lossy source coding system as illustrated in Fig. 4.1 consists of an encoder and a decoder. Given a sequence of source symbols s, the encoder generates a sequence of codewords b. The decoder converts the sequence of codewords b into a sequence of reconstructed symbols s′.
The encoder operation can be decomposed into an irreversible encoder mapping α, which maps a sequence of input samples s onto a sequence of indexes i, and a lossless mapping γ, which converts the sequence of indexes i into a sequence of codewords b. The encoder mapping α can represent any deterministic mapping that produces a sequence of indexes i of a countable alphabet. This includes the methods of scalar quantization, vector quantization, predictive coding, and transform coding, which will be discussed in the following chapters. The lossless mapping γ can represent any lossless source coding technique, including the techniques that we discussed in chapter 3. The decoder operation consists of a lossless mapping γ^{−1}, which represents the inverse of the lossless mapping γ and converts the sequence of codewords b into the sequence of indexes i, and a deterministic decoder mapping β, which maps the sequence of indexes i to a sequence of reconstructed symbols s′. A lossy source coding system Q is characterized by the mappings α, β, and γ. The triple Q = (α, β, γ) is also referred to as source code or simply as code throughout this text.
A simple example of a source code is an N-dimensional block code Q_{N} = (α_{N}, β_{N}, γ_{N}), by which blocks of N consecutive input samples are independently coded. Each block of input samples s^{(N)} = {s_{0}, …, s_{N−1}} is mapped to a vector of K quantization indexes i^{(K)} = α_{N}(s^{(N)}) using a deterministic mapping α_{N}, and the resulting vector of indexes i^{(K)} is converted into a variable-length bit sequence b^{(ℓ)} = γ_{N}(i^{(K)}). At the decoder side, the recovered vector i^{(K)} = γ_{N}^{−1}(b^{(ℓ)}) of indexes is mapped to a block s′^{(N)} = β_{N}(i^{(K)}) of N reconstructed samples using the deterministic decoder mapping β_{N}.
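A toy instance of the decomposition Q = (α, β, γ) can make the roles of the three mappings concrete; the uniform scalar quantizer for α/β and the fixed-length 8-bit code for γ are illustrative choices made here, not constructions from the text.

    def alpha(s, step=1.0):        # irreversible encoder mapping: samples -> indexes
        return [round(x / step) for x in s]

    def gamma(indexes):            # lossless mapping: indexes -> bit sequence
        return ''.join(format(i & 0xFF, '08b') for i in indexes)

    def gamma_inv(bits):           # inverse lossless mapping: bits -> indexes
        vals = [int(bits[j:j + 8], 2) for j in range(0, len(bits), 8)]
        return [v - 256 if v >= 128 else v for v in vals]

    def beta(indexes, step=1.0):   # decoder mapping: indexes -> reconstruction
        return [i * step for i in indexes]

    s = [0.3, -1.2, 2.7]
    s_rec = beta(gamma_inv(gamma(alpha(s))))   # [0.0, -1.0, 3.0]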
In the following, we will use the notations α_{N}, β_{N}, and γ_{N} also for representing the encoder, decoder, and lossless mappings for the first N samples of an input sequence, independently of whether the source code Q represents an Ndimensional block code.
4.1.1 Distortion
For continuous random processes, the encoder mapping α cannot be invertible, since real numbers cannot be represented by indexes of a countable alphabet and hence cannot be losslessly described by a finite number of bits. Consequently, the reproduced symbol sequence s′ is not the same as the original symbol sequence s. In general, if the decoder mapping β is not the inverse of the encoder mapping α, the reconstructed symbols are only an approximation of the original symbols. For measuring the goodness of such an approximation, distortion measures are defined that express the difference between a set of reconstructed samples and the corresponding original samples as a nonnegative real value. A smaller distortion corresponds to a higher approximation quality. A distortion of zero specifies that the reproduced samples are identical to the corresponding original samples.
In this text, we restrict our considerations to the important class of additive distortion measures. The distortion between a single reconstructed symbol s′ and the corresponding original symbol s is defined as a function d_{1}(s,s′), which satisfies
 d_{1}(s, s′) ≥ 0,   (4.1)
with equality if and only if s = s′. Given such a distortion measure d_{1}(s,s′), the distortion between a set of N reconstructed samples s′ = {s′_{0},s′_{1},…,s′_{N1}} and the corresponding original samples s = {s_{0},s_{1},…,s_{N1}} is defined by
 d_{N}(s, s′) = (1/N) ∑_{i=0}^{N−1} d_{1}(s_{i}, s′_{i}).   (4.2)
The most commonly used additive distortion measure is the squared error, d_{1}(s,s′) = (s  s′)^{2}. The resulting distortion measure for sets of samples is the mean squared error (MSE),
 d_{N}(s, s′) = (1/N) ∑_{i=0}^{N−1} ( s_{i} − s′_{i} )^{2}.   (4.3)
The reasons for the popularity of squared error distortion measures are their simplicity and the mathematical tractability of the associated optimization problems. Throughout this text, we will explicitly use the squared error and mean squared error as distortion measures for single samples and sets of samples, respectively. It should, however, be noted that in most video coding applications the quality of the reconstructed signal is ultimately judged by human observers, and the MSE does not correlate well with the quality that is perceived by human observers. Nonetheless, MSE-based quality measures are widely used in the video coding community. The investigation of alternative distortion measures for video coding applications is still an active field of research.
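For the toy code sketched above, the MSE (4.3) between the original and the reconstructed block is easily evaluated:

    def mse(s, s_rec):             # mean squared error (4.3)
        return sum((x - y) ** 2 for x, y in zip(s, s_rec)) / len(s)

    print(mse([0.3, -1.2, 2.7], [0.0, -1.0, 3.0]))   # ~0.0733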
In order to evaluate the approximation quality of a code Q, rather than measuring distortion for a given finite symbol sequence, we are interested in a measure for the expected distortion for very long symbol sequences. Given a random process S = {S_{n}}, the distortion δ(Q) associated with a code Q is defined as the limit of the expected distortion as the number of coded symbols approaches infinity,
 δ(Q) = lim_{N→∞} E{ d_{N}( S^{(N)}, β_{N}(α_{N}(S^{(N)})) ) },   (4.4)
if the limit exists. S^{(N)} = {S_{0},S_{1},…,S_{N1}} represents the sequence of the first N random variables of the random process S and β_{N}(α_{N}(⋅)) specifies the mapping of the first N input symbols to the corresponding reconstructed symbols as given by the code Q.
For stationary processes S with a multivariate pdf f(s) and a block code Q_{N} = (α_{N},β_{N},γ_{N}), the distortion δ(Q_{N}) is given by
 δ(Q_{N}) = ∫ d_{N}( s, β_{N}(α_{N}(s)) ) ⋅ f(s) ds.   (4.5)
4.1.2 Rate
Besides the distortion δ(Q), another important property required for evaluating the performance of a code Q is its rate. For the coding of a finite symbol sequence s^{(N)}, we define the transmission rate as the average number of bits per input symbol,
 r_{N}(s^{(N)}) = (1/N) ⋅ | γ_{N}( α_{N}(s^{(N)}) ) |,   (4.6)
where γ_{N}(α_{N}(⋅)) specifies the mapping of the N input symbols to the bit sequence b^{(ℓ)} of ℓ bits as given by the code Q and the operator |⋅| is defined to return the number of bits in the bit sequence that is specified as its argument. Similarly as for the distortion, we are interested in a measure for the expected number of bits per symbol for long sequences. For a given random process S = {S_{n}}, the rate r(Q) associated with a code Q is defined as the limit of the expected number of bits per symbol as the number of transmitted symbols approaches infinity,
 r(Q) = lim_{N→∞} (1/N) ⋅ E{ | γ_{N}( α_{N}(S^{(N)}) ) | },   (4.7)
if the limit exists. For stationary random processes S and a block code Q_{N} = (α_{N}, β_{N}, γ_{N}), the rate r(Q_{N}) is given by
 r(Q_{N}) = (1/N) ⋅ ∫ | γ_{N}( α_{N}(s) ) | ⋅ f(s) ds,   (4.8)
where f(s) is the Nth order joint pdf of the random process S.
4.1.3 Operational Rate Distortion Function
For a given source S, each code Q is associated with a rate distortion point (R,D), which is given by R = r(Q) and D = δ(Q). In the diagram of Fig. 4.2, the rate distortion points for selected codes are illustrated as dots. The rate distortion plane can be partitioned into a region of achievable rate distortion points and a region of nonachievable rate distortion points. A rate distortion point (R,D) is called achievable if there is a code Q with r(Q) ≤ R and δ(Q) ≤ D. The boundary between the regions of achievable and nonachievable rate distortion points specifies the minimum rate R that is required for representing the source S with a distortion less than or equal to a given value D or, alternatively, the minimum distortion D that can be achieved if the source S is coded at a rate less than or equal to a given value R. The function R(D) that describes this fundamental bound for a given source S is called the operational rate distortion function and is defined as the infimum of rates r(Q) for all codes Q that achieve a distortion δ(Q) less than or equal to D,
 R(D) = inf_{Q: δ(Q) ≤ D} r(Q).   (4.9)
Fig. 4.2 illustrates the relationship between the region of achievable rate distortion points and the operational rate distortion function. The inverse of the operational rate distortion function is referred to as operational distortion rate function D(R) and is defined by
 D(R) = inf_{Q: r(Q) ≤ R} δ(Q).   (4.10)
Fig. 4.2 Operational rate distortion function as boundary of the region of achievable rate distortion points. The dots represent rate distortion points for selected codes.
The terms operational rate distortion function and operational distortion rate function are not only used for specifying the best possible performance over all codes Q without any constraints, but also for specifying the performance bound for sets of source codes that are characterized by particular structural or complexity constraints. As an example, such a set of source codes could be the class of scalar quantizers or the class of scalar quantizers with fixed-length codewords. With 𝒬 denoting the set of source codes Q with a particular constraint, the operational rate distortion function for a given source S and codes with the particular constraint is defined by
 R_{𝒬}(D) = inf_{Q ∈ 𝒬: δ(Q) ≤ D} r(Q).   (4.11)
Similarly, the operational distortion rate function for a given source S and a set of codes with a particular constraint is defined by
 D_{𝒬}(R) = inf_{Q ∈ 𝒬: r(Q) ≤ R} δ(Q).   (4.12)
It should be noted that in contrast to information rate distortion functions, which will be introduced in the next section, operational rate distortion functions are not convex. They are more likely to be step functions, i.e., piecewise constant functions.
4.2 The Information Rate Distortion Function
In the previous section, we have shown that the operational rate distortion function specifies a fundamental performance bound for lossy source coding techniques. But unless we suitably restrict the set of considered codes, it is virtually impossible to determine the operational rate distortion function according to the definition in (4.9). A more accessible expression for a performance bound of lossy codes is given by the information rate distortion function, which was originally introduced by Shannon in [69, 70].
In the following, we first introduce the concept of mutual information before we define the information rate distortion function and investigate its relationship to the operational rate distortion function.
4.2.1 Mutual Information
Although this chapter deals with the lossy coding of random sources, we will introduce the quantity of mutual information for general random variables and vectors of random variables.
Let X and Y be two discrete random variables with alphabets A_{X} = {x_{0}, x_{1}, …, x_{M_X−1}} and A_{Y} = {y_{0}, y_{1}, …, y_{M_Y−1}}, respectively. As shown in sec. 3.2, the entropy H(X) represents a lower bound for the average codeword length of a lossless source code for the random variable X. It can also be considered as a measure for the uncertainty that is associated with the random variable X or as a measure for the average amount of information that is required to describe the random variable X. The conditional entropy H(X|Y) can be interpreted as a measure for the uncertainty that we have about the random variable X if we observe the random variable Y or as the average amount of information that is required to describe the random variable X if the random variable Y is known. The mutual information between the discrete random variables X and Y is defined as the difference
 I(X;Y) = H(X) − H(X|Y).   (4.13)
The mutual information I(X;Y ) is a measure for the reduction of the uncertainty about the random variable X due to the observation of Y . It represents the average amount of information that the random variable Y contains about the random variable X. Inserting the formulas for the entropy (3.13) and conditional entropy (3.20) yields
 I(X;Y) = ∑_{i=0}^{M_X−1} ∑_{j=0}^{M_Y−1} p_{XY}(x_{i}, y_{j}) ⋅ log_{2} ( p_{XY}(x_{i}, y_{j}) / ( p_{X}(x_{i}) ⋅ p_{Y}(y_{j}) ) ),   (4.14)
where p_{X} and p_{Y } represent the marginal pmf’s of the random variables X and Y , respectively, and p_{XY } denotes the joint pmf.
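For discrete random variables, (4.14) translates directly into a few lines of code; the joint pmf below is an illustrative example, not data from the text.

    import numpy as np

    def mutual_information(p_xy):             # I(X;Y) per (4.14), p_xy: joint pmf
        p_x = p_xy.sum(axis=1, keepdims=True)
        p_y = p_xy.sum(axis=0, keepdims=True)
        m = p_xy > 0                          # skip zero-probability entries
        return float((p_xy[m] * np.log2(p_xy[m] / (p_x @ p_y)[m])).sum())

    p_xy = np.array([[0.25, 0.25],
                     [0.00, 0.50]])           # assumed joint pmf
    print(mutual_information(p_xy))           # ~0.3113 bit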
For extending the concept of mutual information to general random variables, we consider two random variables X and Y with the marginal pdf’s f_{X} and f_{Y}, respectively, and the joint pdf f_{XY}. Either or both of the random variables may be discrete or continuous or of mixed type. Since the entropy, as introduced in sec. 3.2, is only defined for discrete random variables, we investigate the mutual information for discrete approximations X_{Δ} and Y_{Δ} of the random variables X and Y.
With Δ being a step size, the alphabet of the discrete approximation X_{Δ} of a random variable X is defined by A_{XΔ} = {…, x_{−1}, x_{0}, x_{1}, …} with x_{i} = i ⋅ Δ. The event {X_{Δ} = x_{i}} is defined to be equal to the event {x_{i} ≤ X < x_{i+1}}. Furthermore, we define an approximation f_{X}^{(Δ)} of the pdf f_{X} for the random variable X, which is constant inside each half-open interval [x_{i}, x_{i+1}), as illustrated in Fig. 4.3, and is given by
 f_{X}^{(Δ)}(x) = (1/Δ) ∫_{x_{i}}^{x_{i+1}} f_{X}(x′) dx′   for x ∈ [x_{i}, x_{i+1}).   (4.15)
The pmf p_{XΔ} for the random variable X_{Δ} can then be expressed as
 p_{XΔ}(x_{i}) = P(X_{Δ} = x_{i}) = f_{X}^{(Δ)}(x_{i}) ⋅ Δ.   (4.16)
Similarly, we define a piecewise constant approximation f_{XY}^{(Δ)} for the joint pdf f_{XY} of two random variables X and Y, which is constant inside each 2-dimensional interval [x_{i}, x_{i+1}) × [y_{j}, y_{j+1}). The joint pmf p_{XΔYΔ} of the two discrete approximations X_{Δ} and Y_{Δ} is then given by
 p_{XΔYΔ}(x_{i}, y_{j}) = f_{XY}^{(Δ)}(x_{i}, y_{j}) ⋅ Δ^{2}.   (4.17)
Using the relationships (4.16) and (4.17), we obtain for the mutual information of the discrete random variables X_{Δ} and Y _{Δ}
 I(X_{Δ}; Y_{Δ}) = ∑_{i} ∑_{j} f_{XY}^{(Δ)}(x_{i}, y_{j}) ⋅ Δ^{2} ⋅ log_{2} ( f_{XY}^{(Δ)}(x_{i}, y_{j}) / ( f_{X}^{(Δ)}(x_{i}) ⋅ f_{Y}^{(Δ)}(y_{j}) ) ).   (4.18)
If the step size Δ approaches zero, the discrete approximations X_{Δ} and Y _{Δ} approach the random variables X and Y . The mutual information I(X;Y ) for random variables X and Y can be defined as limit of the mutual information I(X_{Δ};Y _{Δ}) as Δ approaches zero,
 I(X;Y) = lim_{Δ→0} I(X_{Δ}; Y_{Δ}).   (4.19)
If the step size Δ approaches zero, the piecewise constant pdf approximations f_{XY }^{(Δ)}, f_{X}^{(Δ)}, and f_{Y }^{(Δ)} approach the pdf’s f_{XY }, f_{X}, and f_{Y }, respectively, and the sum in (4.18) approaches the integral
 I(X;Y) = ∫ ∫ f_{XY}(x, y) ⋅ log_{2} ( f_{XY}(x, y) / ( f_{X}(x) ⋅ f_{Y}(y) ) ) dx dy,   (4.20)
which represents the definition of mutual information.
The formula (4.20) shows that the mutual information I(X;Y) is symmetric with respect to the random variables X and Y. The average amount of information that a random variable X contains about another random variable Y is equal to the average amount of information that Y contains about X. Furthermore, the mutual information I(X;Y) is greater than or equal to zero, with equality if and only if f_{XY}(x, y) = f_{X}(x) ⋅ f_{Y}(y) for all x and y, i.e., if and only if the random variables X and Y are independent. This is a direct consequence of the divergence inequality for probability density functions f and g,
 ∫ f(s) ⋅ log_{2} ( f(s) / g(s) ) ds ≥ 0,   (4.21)
which is fulfilled with equality if and only if the pdfs f and g are the same. The divergence inequality can be proved using the inequality ln x ≤ x − 1 (with equality if and only if x = 1),
 −∫ f(s) ⋅ log_{2} ( f(s) / g(s) ) ds = ∫ f(s) ⋅ log_{2} ( g(s) / f(s) ) ds ≤ log_{2} e ⋅ ∫ f(s) ⋅ ( g(s)/f(s) − 1 ) ds = log_{2} e ⋅ ( ∫ g(s) ds − ∫ f(s) ds ) = 0.   (4.22)
For N-dimensional random vectors X = (X_{0}, X_{1}, …, X_{N−1})^{T} and Y = (Y_{0}, Y_{1}, …, Y_{N−1})^{T}, the definition of mutual information can be extended according to
 I(X;Y) = ∫ ∫ f_{XY}(x, y) ⋅ log_{2} ( f_{XY}(x, y) / ( f_{X}(x) ⋅ f_{Y}(y) ) ) dx dy,   (4.23)
where f_{X} and f_{Y } denote the marginal pdfs for the random vectors X and Y , respectively, and f_{XY } represents the joint pdf.
We now assume that the random vector Y is a discrete random vector that is associated with an alphabet A_Y^N. Then, the pdf f_Y and the conditional pdf f_{Y|X} can be written as
f_Y(y) = ∑_{a ∈ A_Y^N} p_Y(a) ⋅ δ_N(y − a)    (4.24)
f_{Y|X}(y|x) = ∑_{a ∈ A_Y^N} p_{Y|X}(a|x) ⋅ δ_N(y − a)    (4.25)
with δ_N(⋅) denoting the N-dimensional Dirac delta function, where p_Y denotes the pmf of the discrete random vector Y , and p_{Y|X} denotes the conditional pmf of Y given the random vector X. Inserting f_{XY} = f_{Y|X} ⋅ f_X and the expressions (4.24) and (4.25) into the definition (4.23) of mutual information for vectors yields
I(X;Y ) = ∫ f_X(x) ∑_{a ∈ A_Y^N} p_{Y|X}(a|x) log_2 ( p_{Y|X}(a|x) ∕ p_Y(a) ) dx    (4.26)
This expression can be rewritten as
I(X;Y ) = H(Y ) − ∫ f_X(x) H(Y |X = x) dx    (4.27)
where H(Y ) is the entropy of the discrete random vector Y and
H(Y |X = x) = −∑_{a ∈ A_Y^N} p_{Y|X}(a|x) log_2 p_{Y|X}(a|x)    (4.28)
is the conditional entropy of Y given the event {X = x}. Since the conditional entropy H(Y |X = x) is always nonnegative, we have
I(X;Y ) ≤ H(Y )    (4.29)
Equality is obtained if and only if H(Y |X = x) is zero for all x and, hence, if and only if the random vector Y is a deterministic function of the random vector X.
If we consider two random processes X = {X_{n}} and Y = {Y _{n}} and represent the random variables for N consecutive time instants as random vectors X^{(N)} and Y ^{(N)}, the mutual information I(X^{(N)};Y ^{(N)}) between the random vectors X^{(N)} and Y ^{(N)} is also referred to as Nth order mutual information and denoted by I_{N}(X;Y ).
4.2.2 Information Rate Distortion Function
Suppose we have a source S = {S_n} that is coded using a lossy source coding system given by a code Q = (α, β, γ). The output of the lossy coding system can be described by the random process S′ = {S′_n}. Since coding is a deterministic process given by the mapping β(α(⋅)), the random process S′ describing the reconstructed samples is a deterministic function of the input process S. Nonetheless, the statistical properties of the deterministic mapping given by a code Q can be described by a conditional pdf g^Q(s′|s) = g_{S′_n|S_n}(s′|s). If we consider, as an example, simple scalar quantization, the conditional pdf g^Q(s′|s) represents, for each value of s, a shifted Dirac delta function. In general, g^Q(s′|s) consists of a sum of scaled and shifted Dirac delta functions. Note that the random variables S′_n are always discrete and, hence, the conditional pdf g^Q(s′|s) can also be represented by a conditional pmf. Instead of single samples, we can also consider the mapping of blocks of N successive input samples S to blocks of N successive output samples S′. For each value of N > 0, the statistical properties of a code Q can then be described by the conditional pdf g_N^Q(s′|s) = g_{S′|S}(s′|s).
For the following considerations, we define the Nth order distortion
δ_N(g_N) = ∫ ∫ d_N(s, s′) ⋅ g_N(s′|s) ⋅ f_S(s) ds ds′    (4.30)
Given a source S, with an Nth order pdf f_S, and an additive distortion measure d_N, the Nth order distortion δ_N(g_N) is completely determined by the conditional pdf g_N = g_{S′|S}. The distortion δ(Q) that is associated with a code Q and was defined in (4.4) can be written as
δ(Q) = lim_{N→∞} δ_N( g_N^Q )    (4.31)
Similarly, the Nth order mutual information I_{N}(S;S′) between blocks of N successive input samples and the corresponding blocks of output samples can be written as
I_N(S;S′) = ∫ ∫ g_N(s′|s) f_S(s) log_2 ( g_N(s′|s) ∕ f_{S′}(s′) ) ds ds′    (4.32)
with
f_{S′}(s′) = ∫ g_N(s′|s) f_S(s) ds    (4.33)
For a given source S, the Nth order mutual information only depends on the Nth order conditional pdf g_{N}.
We now consider any source code Q with a distortion δ(Q) that is less than or equal to a given value D. As mentioned above, the output process S′ of a source coding system is always discrete. We have shown in sec. 3.3.1 that the average codeword length for lossless coding of a discrete source cannot be smaller than the entropy rate of the source. Hence, the rate r(Q) of the code Q is greater than or equal to the entropy rate of S′,
r(Q) ≥ H̄(S′)    (4.34)
By using the definition of the entropy rate H̄(S′) in (3.25) and the relationship (4.29), we obtain
r(Q) ≥ H̄(S′) = lim_{N→∞} H_N(S′) ∕ N ≥ lim_{N→∞} I_N(S;S′) ∕ N    (4.35)
where H_N(S′) denotes the block entropy for the random vectors S′ of N successive reconstructed samples and I_N(S;S′) is the mutual information between the N-dimensional random vectors S and the corresponding reconstructions S′. A deterministic mapping as given by a source code is a special case of a random mapping. Hence, the Nth order mutual information I_N(g_N^Q) for a particular code Q with δ_N(g_N^Q) ≤ D cannot be smaller than the smallest Nth order mutual information I_N(g_N) that can be achieved using any random mapping g_N = g_{S′|S} with δ_N(g_N) ≤ D,
I_N( g_N^Q ) ≥ inf_{g_N: δ_N(g_N) ≤ D} I_N( g_N )    (4.36)
Consequently, the rate r(Q) is always greater than or equal to
R^{(I)}(D) = lim_{N→∞} inf_{g_N: δ_N(g_N) ≤ D} (1∕N) I_N( g_N )    (4.37)
This fundamental lower bound for all lossy source coding techniques is called the information rate distortion function. Every code Q that yields a distortion δ(Q) less than or equal to any given value D for a source S is associated with a rate r(Q) that is greater than or equal to the information rate distortion function R^{(I)}(D) for the source S,
r(Q) ≥ R^{(I)}(D)    (4.38)
This relationship is called the fundamental source coding theorem. The information rate distortion function was first derived by Shannon for iid sources [69, 70] and is for that reason also referred to as Shannon rate distortion function.
If we restrict our considerations to iid sources, the Nth order joint pdf f_S(s) can be represented as the product ∏_{i=0}^{N−1} f_S(s_i) of the marginal pdf f_S(s), with s = (s_0, …, s_{N−1})^T. Hence, for every N, the Nth order distortion δ_N(g_N^Q) and mutual information I_N(g_N^Q) for a code Q can be expressed using a scalar conditional pdf g^Q = g_{S′|S},
δ_N( g_N^Q ) = δ_1( g^Q )   and   I_N( g_N^Q ) = N ⋅ I_1( g^Q )    (4.39)
Consequently, the information rate distortion function R^{(I)}(D) for iid sources is equal to the socalled first order information rate distortion function,
R^{(I)}(D) = R_1^{(I)}(D) = inf_{g: δ_1(g) ≤ D} I_1( g )    (4.40)
In general, the function
R_N^{(I)}(D) = inf_{g_N: δ_N(g_N) ≤ D} (1∕N) I_N( g_N )    (4.41)
is referred to as the Nth order information rate distortion function. If N approaches infinity, the Nth order information rate distortion function approaches the information rate distortion function,
R^{(I)}(D) = lim_{N→∞} R_N^{(I)}(D)    (4.42)
We have shown that the information rate distortion function represents a fundamental lower bound for all lossy coding algorithms. Using the concept of typical sequences, it can additionally be shown that the information rate distortion function is also asymptotically achievable [4, 22, 11], meaning that for any ε > 0 there exists a code Q with δ(Q) ≤ D and r(Q) ≤ R^{(I)}(D) + ε. Hence, subject to suitable technical assumptions the information rate distortion function is equal to the operational rate distortion function. In the following text, we use the notation R(D) and the term rate distortion function to denote both the operational and information rate distortion function. The term operational rate distortion function will mainly be used for denoting the operational rate distortion function for restricted classes of codes.
The inverse of the information rate distortion function is called the information distortion rate function or simply the distortion rate function and is given by
D^{(I)}(R) = inf { D : R^{(I)}(D) ≤ R }    (4.43)
Using this definition, the fundamental source coding theorem (4.38) can also be written as
δ(Q) ≥ D^{(I)}( r(Q) )    (4.44)
The information rate distortion function is defined as a mathematical function of a source. However, an analytical derivation of the information rate distortion function is very difficult or even impossible, except for some special random processes. An iterative technique for numerically computing close approximations of the rate distortion function for iid sources was developed by Blahut and Arimoto in [6, 3] and is referred to as the Blahut-Arimoto algorithm. An overview of the algorithm can be found in [22, 11].
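To make the iteration concrete, here is a compact Python sketch of the Blahut-Arimoto algorithm for a discretized iid source; the grid, the slope parameter values, and the fixed iteration count are choices of this example, not prescribed by the text.

    import numpy as np

    def blahut_arimoto(p, d, lam, n_iter=200):
        """One point of the rate distortion curve of a discretized source.
        p: source pmf, d: distortion matrix, lam > 0: slope parameter that
        trades rate against distortion."""
        q = np.full(d.shape[1], 1.0 / d.shape[1])   # initial output pmf
        w = np.exp(-lam * d)                        # precomputed weights
        for _ in range(n_iter):
            phi = w * q                             # unnormalized conditional pmf
            phi /= phi.sum(axis=1, keepdims=True)
            q = p @ phi                             # updated output pmf
        dist = float(np.sum(p[:, None] * phi * d))
        rate = float(np.sum(p[:, None] * phi * np.log2(phi / q)))
        return rate, dist

    # Example: unit-variance Gaussian discretized on a grid of 65 points
    x = np.linspace(-4.0, 4.0, 65)
    p = np.exp(-x**2 / 2); p /= p.sum()
    d = (x[:, None] - x[None, :])**2                # MSE distortion matrix
    for lam in (1.0, 4.0, 16.0):
        print(blahut_arimoto(p, d, lam))            # R approaches 0.5*log2(1/D)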
4.2.3 Properties of the Rate Distortion Function
In the following, we state some important properties of the rate distortion function R(D) for the MSE distortion measure.^{1} For proofs of these properties, the reader is referred to [4, 22, 11].
^{1} The properties hold more generally. In particular, all stated properties are valid for additive distortion measures for which the single-letter distortion d_1(s, s′) is equal to zero if s = s′ and is greater than zero if s ≠ s′.
The rate distortion function R(D) is a non-increasing and convex function of D, and there exists a value D_max such that
R(D) = 0   for all   D ≥ D_max    (4.45)
For the MSE distortion measure, the value of D_{max} is equal to the variance σ^{2} of the source.
Furthermore, for discrete sources, the rate distortion function for vanishing distortion approaches the entropy rate,
lim_{D→0} R(D) = R(0) = H̄(S)    (4.46)
The last property shows that the fundamental bound for lossless coding is a special case of the fundamental bound for lossy coding.
4.3 The Shannon Lower Bound
For most random processes, an analytical expression for the rate distortion function cannot be given. In the following, we show how a useful lower bound for the rate distortion function of continuous random processes can be calculated. Before we derive this so-called Shannon lower bound, we introduce the concept of differential entropy.
The mutual information I(X;Y ) of two continuous N-dimensional random vectors X and Y is defined in (4.23). Using the relationship f_{XY} = f_{X|Y} ⋅ f_Y, the integral in this definition can be decomposed into a part that only depends on one of the random vectors and a part that depends on both random vectors,
I(X;Y ) = h(X) − h(X|Y )    (4.47)
with
h(X) = −∫ f_X(x) log_2 f_X(x) dx    (4.48)
and
h(X|Y ) = −∫ ∫ f_{XY}(x,y) log_2 f_{X|Y}(x|y) dx dy    (4.49)
In analogy to the discrete entropy introduced in Chapter 3, the quantity h(X) is called the differential entropy of the random vector X and the quantity h(X|Y ) is referred to as the conditional differential entropy of the random vector X given the random vector Y . Since I(X;Y ) is always nonnegative, we can conclude that conditioning reduces the differential entropy,
h(X|Y ) ≤ h(X)    (4.50)
similarly as conditioning reduces the discrete entropy.
For continuous random processes S = {S_n}, the random variables S_n for N consecutive time instants can be represented as a random vector S^{(N)} = (S_0, …, S_{N−1})^T. The differential entropy h(S^{(N)}) for the vectors S^{(N)} is then also referred to as Nth order differential entropy and is denoted by
h_N(S) = h( S^{(N)} ) = −∫ f_S(s) log_2 f_S(s) ds    (4.51)
If, for a continuous random process S, the limit
h̄(S) = lim_{N→∞} h_N(S) ∕ N    (4.52)
exists, it is called the differential entropy rate of the process S.
The differential entropy has a different meaning than the discrete entropy. This can be illustrated by considering an iid process S = {S_n} with a uniform pdf f(s), with f(s) = 1∕A for |s| ≤ A∕2 and f(s) = 0 for |s| > A∕2. The first order differential entropy for this process is
h(S) = −∫_{−A∕2}^{A∕2} (1∕A) log_2 (1∕A) ds = log_2 A    (4.53)
In Fig. 4.4, the differential entropy h(S) for the uniform iid process is shown as a function of the parameter A. In contrast to the discrete entropy, the differential entropy can be either positive or negative. The discrete entropy is only finite for discrete alphabet sources; for continuous alphabet sources it is infinite. The differential entropy, in contrast, is mainly useful for continuous random processes; for discrete random processes, it can be considered to be minus infinity.
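A quick numerical check of (4.53) (our illustration, not from the book): integrating −f log_2 f over the support reproduces log_2 A, which is indeed negative for A < 1.

    import numpy as np

    # Verify h(S) = log2(A) for a uniform pdf of width A by numerical integration
    for A in (0.5, 1.0, 2.0):
        s = np.linspace(-A / 2, A / 2, 10001)
        f = np.full_like(s, 1.0 / A)
        h = -np.trapz(f * np.log2(f), s)              # -integral of f log2 f
        print(A, round(h, 4), round(np.log2(A), 4))   # h equals log2(A)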
As an example, we consider a stationary Gaussian random process with a mean μ and an Nth order autocovariance matrix C_N. The Nth order pdf f_G(s) is given in (2.51), where μ_N represents a vector with all N elements being equal to the mean μ. For the Nth order differential entropy h_N^{(G)}(S) of the stationary Gaussian process, we obtain
h_N^{(G)}(S) = −∫ f_G(s) log_2 f_G(s) ds = (1∕2) log_2 ( (2π)^N |C_N| ) + ( (log_2 e) ∕ 2 ) ∫ f_G(s) (s − μ_N)^T C_N^{−1} (s − μ_N) ds    (4.54)
By reformulating the matrix multiplication in the last integral as a sum, it can be shown that for any random process with an Nth order pdf f(s) and an Nth order autocovariance matrix C_N,
∫ f(s) (s − μ_N)^T C_N^{−1} (s − μ_N) ds = N    (4.55)
A step-by-step derivation of this result can be found in [11]. Inserting (4.55) into (4.54) and using log_2 e = (ln 2)^{−1} yields
h_N^{(G)}(S) = (1∕2) log_2 ( (2π)^N |C_N| ) + (N∕2) log_2 e = (1∕2) log_2 ( (2πe)^N |C_N| )    (4.56)
Now, we consider any stationary random process S with a mean μ and an Nth order autocovariance matrix C_{N}. The Nth order pdf of this process is denoted by f(s). Using the divergence inequality (4.21), we obtain for its Nth order differential entropy,
h_N(S) = −∫ f(s) log_2 f(s) ds ≤ −∫ f(s) log_2 f_G(s) ds = h_N^{(G)}(S)    (4.58)
Hence, the Nth order differential entropy of any stationary nonGaussian process is less than the Nth order differential entropy of a stationary Gaussian process with the same Nth order autocovariance matrix C_{N}.
As shown in (4.56), the Nth order differential entropy of a stationary Gaussian process depends on the determinant of its Nth order autocovariance matrix C_N. The determinant |C_N| is given by the product of the eigenvalues ξ_i of the matrix C_N, |C_N| = ∏_{i=0}^{N−1} ξ_i. The trace of the Nth order autocovariance matrix, tr(C_N), is given by the sum of its eigenvalues, tr(C_N) = ∑_{i=0}^{N−1} ξ_i, and, according to (2.39), also by tr(C_N) = N ⋅ σ², with σ² being the variance of the Gaussian process. Hence, for a given variance σ², the sum of the eigenvalues is constant. With the inequality of arithmetic and geometric means,
(1∕N) ∑_{i=0}^{N−1} x_i ≥ ( ∏_{i=0}^{N−1} x_i )^{1∕N}    (4.59)
which holds with equality if and only if x_0 = x_1 = … = x_{N−1}, we obtain the inequality
|C_N|^{1∕N} = ( ∏_{i=0}^{N−1} ξ_i )^{1∕N} ≤ (1∕N) ∑_{i=0}^{N−1} ξ_i = σ²    (4.60)
Equality holds if and only if all eigenvalues of C_N are the same, i.e., if and only if the Gaussian process is iid. Consequently, the Nth order differential entropy of a stationary process S with a variance σ² is bounded by
h_N(S) ≤ (N∕2) log_2 ( 2πe σ² )    (4.61)
It is maximized if and only if the process is a Gaussian iid process.
Using the relationship (4.47) and the notation I_N(g_N) = I_N(S;S′), the rate distortion function R(D) defined in (4.37) can be written as
R(D) = lim_{N→∞} inf_{g_N: δ_N(g_N) ≤ D} (1∕N) ( h_N(S) − h(S|S′) )    (4.62)
Since conditioning reduces the differential entropy, as has been shown in (4.50), the rate distortion function is bounded by
R(D) ≥ R_L(D)    (4.63)
with
R_L(D) = lim_{N→∞} (1∕N) h_N(S) − sup_{g_N: δ_N(g_N) ≤ D} (1∕N) h(S − S′)    (4.64)
The lower bound R_{L}(D) is called the Shannon lower bound (SLB).
For stationary processes and the MSE distortion measure, the distortion δ_N(g_N) in (4.64) is equal to the variance σ_Z² of the process Z = S − S′. Furthermore, we have shown in (4.61) that the maximum Nth order differential entropy for a stationary process with a given variance σ_Z² is equal to (N∕2) log_2 (2πe σ_Z²). Hence, the Shannon lower bound for stationary processes and MSE distortion is given by
R_L(D) = h̄(S) − (1∕2) log_2 ( 2πe D )    (4.65)
Since we concentrate on the MSE distortion measure in this text, we call R_{L}(D) given in (4.65) the Shannon lower bound in the following without mentioning that it is only valid for the MSE distortion measure.
The Nth order differential entropy for iid sources S = {S_{n}} is equal to
h_N(S) = N ⋅ h(S)    (4.66)
where h(S) denotes the first order differential entropy. Hence, the Shannon lower bound for iid sources is given by
R_L(D) = h(S) − (1∕2) log_2 ( 2πe D )    (4.67)
or, written as a distortion rate function,
D_L(R) = ( 1 ∕ (2πe) ) ⋅ 2^{2 h(S)} ⋅ 2^{−2R}    (4.68)
In the following, the differential entropy h(S) and the Shannon lower bound D_{L}(R) are given for three distributions. For the example of the Laplacian iid process with σ^{2} = 1, Fig. 4.5 compares the Shannon lower bound D_{L}(R) with the distortion rate function D(R), which was calculated using the BlahutArimoto algorithm [6, 3].
Uniform pdf:
h(S) = (1∕2) log_2 ( 12 σ² ),   D_L(R) = ( 6 ∕ (πe) ) ⋅ σ² ⋅ 2^{−2R}    (4.69)
Laplacian pdf:
h(S) = (1∕2) log_2 ( 2 e² σ² ),   D_L(R) = ( e ∕ π ) ⋅ σ² ⋅ 2^{−2R}    (4.70)
Gaussian pdf:
h(S) = (1∕2) log_2 ( 2πe σ² ),   D_L(R) = σ² ⋅ 2^{−2R}    (4.71)
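The following small Python sketch (ours) evaluates the three distortion rate bounds (4.69)-(4.71) for unit variance; it shows the common slope of about 6.02 dB per bit and the constant SNR offsets between the three pdfs.

    import numpy as np

    # Shannon lower bounds (4.69)-(4.71) as distortion rate functions, sigma^2 = 1
    bounds = {
        'uniform':   lambda R: 6.0 / (np.pi * np.e) * 2.0**(-2 * R),
        'laplacian': lambda R: np.e / np.pi * 2.0**(-2 * R),
        'gaussian':  lambda R: 2.0**(-2 * R),
    }
    R = np.arange(0, 5)
    for name, dl in bounds.items():
        snr = 10 * np.log10(1.0 / dl(R))   # SNR in dB for unit variance
        print(name, np.round(snr, 2))
    # Each additional bit increases the SNR by about 6.02 dB for all three pdfs.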
Fig. 4.5 Comparison of the Shannon lower bound D_{L}(R) and the distortion rate function D(R) for a Laplacian iid source with unit variance (σ^{2} = 1).
The comparison of the Shannon lower bound D_{L}(R) and the distortion rate function D(R) for the Laplacian iid source in Fig. 4.5 indicates that the Shannon lower bound approaches the distortion rate function for small distortions or high rates. For various distortion measures, including the MSE distortion, it can in fact be shown that the Shannon lower bound approaches the rate distortion function as the distortion approaches zero,
lim_{D→0} ( R(D) − R_L(D) ) = 0    (4.72)
Consequently, the Shannon lower bound represents a suitable reference for the evaluation of lossy coding techniques at high rates or small distortions. Proofs for the asymptotic tightness of the Shannon lower bound for various distortion measures can be found in [48, 5, 47].
For sources with memory, an exact analytic derivation of the Shannon lower bound is usually not possible. One of the few examples for which the Shannon lower bound can be expressed analytically is the stationary Gaussian process. The Nth order differential entropy for a stationary Gaussian process has been derived in (4.56). Inserting this result into the definition of the Shannon lower bound (4.65) yields
R_L(D) = lim_{N→∞} (1∕(2N)) log_2 |C_N| − (1∕2) log_2 D    (4.73)
where C_{N} is the Nth order autocorrelation matrix. The determinant of a matrix is given by the product of its eigenvalues. With ξ_{i}^{(N)}, for i = 0,1,…,N  1, denoting the N eigenvalues of the Nth order autocorrelation matrix C_{N}, we obtain
R_L(D) = lim_{N→∞} (1∕(2N)) ∑_{i=0}^{N−1} log_2 ( ξ_i^{(N)} ∕ D )    (4.74)
In order to proceed, we restrict our considerations to Gaussian processes with zero mean, in which case the autocovariance matrix C_{N} is equal to the autocorrelation matrix R_{N}, and apply Grenander and Szegö’s theorem [29] for sequences of Toeplitz matrices. For a review of Toeplitz matrices, including the theorem for sequences of Toeplitz matrices, we recommend the tutorial [23]. Grenander and Szegö’s theorem can be stated as follows: If R_{N} is a sequence of Hermitian Toeplitz matrices with elements ϕ_{k} on the kth diagonal, the infimum Φ_{inf} = inf _{ω}Φ(ω) and supremum Φ_{sup} = sup_{ω}Φ(ω) of the Fourier series
Φ(ω) = ∑_{k=−∞}^{∞} ϕ_k ⋅ e^{−jωk}    (4.75)
are finite, and the function G is continuous in the interval [Φ_{inf},Φ_{sup}], then
lim_{N→∞} (1∕N) ∑_{i=0}^{N−1} G( ξ_i^{(N)} ) = (1∕(2π)) ∫_{−π}^{π} G( Φ(ω) ) dω    (4.76)
where ξ_i^{(N)}, for i = 0, 1, …, N − 1, denote the eigenvalues of the Nth matrix R_N.
A matrix is called Hermitian if it is equal to its conjugate transpose. This property is always fulfilled for real symmetric matrices as the autocorrelation matrices of stationary processes. Furthermore, the Fourier series (4.75) for the elements of the autocorrelation matrix R_{N} is the power spectral density Φ_{SS}(ω). If we assume that the power spectral density is finite and greater than 0 for all frequencies ω, the limit in (4.74) can be replaced by an integral according to (4.76). The Shannon lower bound R_{L}(D) of a stationary Gaussian process with zeromean and a power spectral density Φ_{SS}(ω) is given by
R_L(D) = (1∕(4π)) ∫_{−π}^{π} log_2 ( Φ_{SS}(ω) ∕ D ) dω    (4.77)
A nonzero mean does not have any impact on the Shannon lower bound R_L(D), although it does change the power spectral density Φ_{SS}(ω).
For a stationary zero-mean Gauss-Markov process, the entries of the autocorrelation matrix are given by ϕ_k = σ² ρ^{|k|}, where σ² is the signal variance and ρ is the correlation coefficient between successive samples. Using the relationship ∑_{k=1}^{∞} a^k e^{−jkx} = a ∕ ( e^{jx} − a ), we obtain
Φ_{SS}(ω) = σ² ( 1 − ρ² ) ∕ ( 1 − 2ρ cos ω + ρ² )    (4.78)
Inserting this relationship into (4.77) yields
R_L(D) = (1∕(4π)) ∫_{−π}^{π} log_2 ( Φ_{SS}(ω) ∕ D ) dω = (1∕2) log_2 ( σ² (1 − ρ²) ∕ D )    (4.79)
where we used ∫_0^π ln ( a² − 2ab cos x + b² ) dx = 2π ln a, for a ≥ b > 0. As discussed above, the mean of a stationary process does not have any impact on the Shannon rate distortion function or the Shannon lower bound. Hence, the distortion rate function D_L(R) for the Shannon lower bound of a stationary Gauss-Markov process with a variance σ² and a correlation coefficient ρ is given by
D_L(R) = ( 1 − ρ² ) ⋅ σ² ⋅ 2^{−2R}    (4.80)
This result can also be obtained by directly inserting the formula (2.50) for the determinant |C_N| of the Nth order autocovariance matrix for Gauss-Markov processes into the expression (4.73).
4.4 Rate Distortion Function for Gaussian Sources
Stationary Gaussian sources play a fundamental role in rate distortion theory. We have shown that the Gaussian source maximizes the differential entropy, and thus also the Shannon lower bound, for a given variance or autocovariance function. Stationary Gaussian sources are also one of the few examples for which the rate distortion function can be derived exactly.
4.4.1 Gaussian IID Sources
Before stating another important property of Gaussian iid sources, we calculate their rate distortion function. To this end, we first derive a lower bound and then show that this lower bound is achievable. To prove that the lower bound is achievable, it is sufficient to show that there is a conditional pdf g_{S′|S}(s′|s) for which the mutual information I_1(g_{S′|S}) is equal to the lower bound for a given distortion D.
The Shannon lower bound for Gaussian iid sources has been derived in sec. 4.3 in the form of the distortion rate function D_L(R). The corresponding rate distortion function is given by
R_L(D) = (1∕2) log_2 ( σ² ∕ D )    (4.81)
where σ² is the signal variance. For proving that the rate distortion function is achievable, it is more convenient to look at the pdf of the reconstruction f_{S′}(s′) and the conditional pdf g_{S|S′}(s|s′) of the input given the reconstruction.
For distortions D < σ², we choose
f_{S′}(s′) = ( 2π (σ² − D) )^{−1∕2} ⋅ e^{ −(s′ − μ)² ∕ ( 2 (σ² − D) ) }    (4.82)
g_{S|S′}(s|s′) = ( 2π D )^{−1∕2} ⋅ e^{ −(s − s′)² ∕ ( 2 D ) }    (4.83)
where μ denotes the mean of the Gaussian iid process. It should be noted that the conditional pdf g_{S|S′} represents a Gaussian pdf for the random variables Z_n = S_n − S′_n, which are given by the difference of the corresponding random variables S_n and S′_n. We now verify that the pdf f_S(s) that we obtain with the choices (4.82) and (4.83) represents the Gaussian pdf with a mean μ and a variance σ². Since the random variables S_n can be represented as the sum S′_n + Z_n, the pdf f_S(s) is given by the convolution of f_{S′}(s′) and g_{S|S′}(s|s′). And since means and variances add when normal densities are convolved, the pdf f_S(s) that is obtained is a Gaussian pdf with a mean μ = μ + 0 and a variance σ² = (σ² − D) + D. Hence, the choices (4.82) and (4.83) are valid, and the conditional pdf g_{S′|S}(s′|s) can be calculated using Bayes’ rule,
g_{S′|S}(s′|s) = g_{S|S′}(s|s′) ⋅ f_{S′}(s′) ∕ f_S(s)    (4.84)
The resulting distortion is given by the variance of the difference process Z_n = S_n − S′_n,
δ_1( g_{S′|S} ) = E{ ( S_n − S′_n )² } = D    (4.85)
For the mutual information, we obtain
I_1( g_{S′|S} ) = h(S) − h(S|S′) = (1∕2) log_2 ( 2πe σ² ) − (1∕2) log_2 ( 2πe D ) = (1∕2) log_2 ( σ² ∕ D )    (4.86)
The results show that, for any distortion D < σ², we can find a conditional pdf g_{S′|S} that achieves the Shannon lower bound. For greater distortions, we choose g_{S′|S}(s′|s) = δ(s′ − μ), i.e., all samples are mapped to the mean μ, which gives a distortion of σ² and a rate of zero. Consequently, the rate distortion function for Gaussian iid sources is given by
R(D) = (1∕2) log_2 ( σ² ∕ D )   for 0 < D < σ²,   and   R(D) = 0   for D ≥ σ²    (4.87)
The corresponding distortion rate function is given by
D(R) = σ² ⋅ 2^{−2R}    (4.88)
It is important to note that the rate distortion function for a Gaussian iid process is equal to the Shannon lower bound for the entire range of rates. Furthermore, it can be shown [4] that for every iid process with a given variance σ^{2}, the rate distortion function lies below that of the Gaussian iid process with the same variance.
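As a tiny usage sketch (ours, not from the book), the closed forms (4.87) and (4.88) can be evaluated directly:

    import numpy as np

    def rate_gaussian_iid(D, var=1.0):
        """Rate distortion function (4.87) of a Gaussian iid source, in bits."""
        return max(0.0, 0.5 * np.log2(var / D))

    def dist_gaussian_iid(R, var=1.0):
        """Distortion rate function (4.88): each extra bit quarters the MSE."""
        return var * 2.0**(-2 * R)

    print(rate_gaussian_iid(0.25))   # 1.0 bit for D = sigma^2 / 4
    print(dist_gaussian_iid(2.0))    # 0.0625, i.e. an SNR of about 12.04 dB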
4.4.2 Gaussian Sources with Memory
For deriving the rate distortion function R(D) for a stationary Gaussian process with memory, we decompose it into a number N of independent stationary Gaussian sources. The Nth order rate distortion function R_{N}(D) can then be expressed using the rate distortion function for Gaussian iid processes and the rate distortion function R(D) is obtained by considering the limit of R_{N}(D) as N approaches infinity.
As we stated in sec. 2.3, the Nth order pdf of a stationary Gaussian process is given by
f_S(s) = ( (2π)^N |C_N| )^{−1∕2} ⋅ e^{ −(1∕2) (s − μ_N)^T C_N^{−1} (s − μ_N) }    (4.89)
where s is a vector of N consecutive samples, μ_N is a vector with all N elements being equal to the mean μ, and C_N is the Nth order autocovariance matrix. Since C_N is a symmetric and real matrix, it has N real eigenvalues ξ_i^{(N)}, for i = 0, 1, …, N − 1. The eigenvalues are solutions of the equation
C_N ⋅ v_i^{(N)} = ξ_i^{(N)} ⋅ v_i^{(N)}    (4.90)
where v_i^{(N)} represents a nonzero vector with unit norm, which is called a unit-norm eigenvector corresponding to the eigenvalue ξ_i^{(N)}. Let A_N be the matrix whose columns are built from the N unit-norm eigenvectors,
A_N = ( v_0^{(N)}, v_1^{(N)}, …, v_{N−1}^{(N)} )    (4.91)
By combining the N equations (4.90) for i = 0, 1, …, N − 1, we obtain the matrix equation
C_N ⋅ A_N = A_N ⋅ Ξ_N    (4.92)
where
Ξ_N = diag( ξ_0^{(N)}, ξ_1^{(N)}, …, ξ_{N−1}^{(N)} )    (4.93)
is a diagonal matrix that contains the N eigenvalues of C_{N} on its main diagonal. The eigenvectors are orthogonal to each other and A_{N} is an orthogonal matrix.
Given the stationary Gaussian source {S_{n}}, we construct a source {U_{n}} by decomposing the source {S_{n}} into vectors S of N successive random variables and applying the transform
U = A_N^{−1} ⋅ S    (4.94)
to each of these vectors. Since A_N is orthogonal, its inverse A_N^{−1} exists and is equal to its transpose A_N^T. The resulting source {U_n} is given by the concatenation of the random vectors U. Similarly, the inverse transform for the reconstructions {U′_n} and {S′_n} is given by
S′ = A_N ⋅ U′    (4.95)
with U′ and S′ denoting the corresponding vectors of N successive random variables. Since the coordinate mapping (4.95) is the inverse of the mapping (4.94), the Nth order mutual information I_{N}(U;U′) is equal to the Nth order mutual information I_{N}(S;S′). A proof of this statement can be found in [4]. Furthermore, since A_{N} is orthogonal, the transform
u = A_N^{−1} ⋅ s = A_N^T ⋅ s    (4.96)
preserves the Euclidean norm (we will show in sec. 7.2 that every orthogonal transform preserves the MSE distortion). The MSE distortion between any realization s of the random vector S and its reconstruction s′,
d_N(s, s′) = (1∕N) ∑_{i=0}^{N−1} ( s_i − s′_i )² = (1∕N) ‖ s − s′ ‖²    (4.97)
is equal to the distortion between the corresponding vector u and its reconstruction u′. Hence, the Nth order rate distortion function R_{N}(D) for the stationary Gaussian source {S_{n}} is equal to the Nth order rate distortion function for the random process {U_{n}}.
A linear transformation of a Gaussian random vector results in another Gaussian random vector. For the mean vector and the autocovariance matrix of U, we obtain
μ_U = A_N^{−1} ⋅ μ_N    (4.98)
and
C_U = A_N^{−1} C_N A_N = Ξ_N    (4.99)
Since the autocovariance matrix Ξ_N is diagonal, the Nth order pdf of U factors into the product
f_U(u) = ∏_{i=0}^{N−1} f_{U_i}(u_i)    (4.100)
of the pdf’s of the Gaussian components U_i. Consequently, the components U_i are independent of each other.
In sec. 4.2.2, we have shown how the Nth order mutual information and the Nth order distortion for a code Q can be described by a conditional pdf g_N^Q = g_{U′|U} that characterizes the mapping of the random vectors U onto the corresponding reconstruction vectors U′. Due to the independence of the components U_i of the random vectors U, the Nth order mutual information I_N(g_N^Q) and the Nth order distortion δ_N(g_N^Q) for a code Q can be written as
I_N( g_N^Q ) = ∑_{i=0}^{N−1} I_1( g_i^Q )   and   δ_N( g_N^Q ) = (1∕N) ∑_{i=0}^{N−1} δ_1( g_i^Q )    (4.101)
where g_i^Q = g_{U′_i|U_i} specifies the conditional pdf for the mapping of a vector component U_i onto its reconstruction U′_i. Consequently, the Nth order distortion rate function D_N(R) can be expressed by
D_N(R) = min_{R_0, …, R_{N−1}: (1∕N) ∑ R_i = R} (1∕N) ∑_{i=0}^{N−1} D_i( R_i )    (4.102)
where D_i(R_i) denotes the first order distortion rate function for the vector component U_i. The first order distortion rate function for Gaussian sources has been derived in sec. 4.4.1 and is given by
D_i( R_i ) = σ_i² ⋅ 2^{−2 R_i}    (4.103)
The variances σ_i² of the vector components U_i are equal to the eigenvalues ξ_i^{(N)} of the Nth order autocovariance matrix C_N. Hence, the Nth order distortion rate function can be written as
D_N(R) = min_{R_0, …, R_{N−1}: (1∕N) ∑ R_i = R} (1∕N) ∑_{i=0}^{N−1} ξ_i^{(N)} ⋅ 2^{−2 R_i}    (4.104)
With the inequality of arithmetic and geometric means, which holds with equality if and only if all elements have the same value, we obtain
(1∕N) ∑_{i=0}^{N−1} ξ_i^{(N)} ⋅ 2^{−2 R_i} ≥ ( ∏_{i=0}^{N−1} ξ_i^{(N)} ⋅ 2^{−2 R_i} )^{1∕N} = ξ̃^{(N)} ⋅ 2^{−2R}    (4.105)
where ξ̃^{(N)} denotes the geometric mean of the eigenvalues ξ_i^{(N)}. For a given Nth order mutual information R, the distortion is minimized if and only if ξ_i^{(N)} ⋅ 2^{−2R_i} is equal to ξ̃^{(N)} ⋅ 2^{−2R} for all i = 0, …, N − 1, which yields
D_N(R) = ξ̃^{(N)} ⋅ 2^{−2R}    (4.106)
In the above result, we have ignored the fact that the mutual information R_i for a component U_i cannot be less than zero. Since the distortion rate function given in (4.103) is steeper at low rates R_i, the mutual information R_i for components with ξ_i^{(N)} < ξ̃^{(N)} ⋅ 2^{−2R} has to be set equal to zero, and the mutual information R has to be distributed among the remaining components in order to minimize the distortion. This can be elegantly specified by introducing a parameter θ, with θ ≥ 0, and setting the component distortions according to
D_i = min( θ, ξ_i^{(N)} )    (4.107)
This concept is also known as inverse water-filling for independent Gaussian sources [57], where the parameter θ can be interpreted as the water level. Using (4.103), we obtain for the mutual information R_i,
R_i = (1∕2) log_2 ( ξ_i^{(N)} ∕ D_i ) = max( 0, (1∕2) log_2 ( ξ_i^{(N)} ∕ θ ) )    (4.108)
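A small Python sketch of this inverse water-filling rule (ours; the eigenvalue spectrum used in the example is arbitrary):

    import numpy as np

    def reverse_waterfilling(eigenvalues, theta):
        """Per-sample rate and distortion per (4.107)/(4.108) for water level theta."""
        xi = np.asarray(eigenvalues, dtype=float)
        D_i = np.minimum(theta, xi)                        # component distortions
        R_i = np.maximum(0.0, 0.5 * np.log2(xi / theta))   # component rates
        return R_i.mean(), D_i.mean()

    # Hypothetical eigenvalue spectrum of some C_N; sweep the water level theta
    xi = np.array([4.0, 2.0, 1.0, 0.5, 0.25])
    for theta in (0.1, 0.5, 2.0):
        R, D = reverse_waterfilling(xi, theta)
        print(theta, round(R, 3), round(D, 3))
    # Components with xi below theta receive zero rate and full distortion xi.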
The Nth order rate distortion function R_N(D) can be expressed by the following parametric formulation, with θ ≥ 0,
D_N(θ) = (1∕N) ∑_{i=0}^{N−1} D_i = (1∕N) ∑_{i=0}^{N−1} min( θ, ξ_i^{(N)} )    (4.109)
R_N(θ) = (1∕N) ∑_{i=0}^{N−1} R_i = (1∕N) ∑_{i=0}^{N−1} max( 0, (1∕2) log_2 ( ξ_i^{(N)} ∕ θ ) )    (4.110)
The rate distortion function R(D) for the stationary Gaussian random process {S_{n}} is given by the limit
R(D) = lim_{N→∞} R_N(D)    (4.111)
which yields the parametric formulation, with θ > 0,
D(θ) = lim_{N→∞} (1∕N) ∑_{i=0}^{N−1} min( θ, ξ_i^{(N)} ),   R(θ) = lim_{N→∞} (1∕N) ∑_{i=0}^{N−1} max( 0, (1∕2) log_2 ( ξ_i^{(N)} ∕ θ ) )    (4.112)
For Gaussian processes with zero mean (C_N = R_N), we can apply the theorem for sequences of Toeplitz matrices (4.76) to express the rate distortion function using the power spectral density Φ_{SS}(ω) of the source. A parametric formulation, with θ ≥ 0, for the rate distortion function R(D) for a stationary Gaussian source with zero mean and a power spectral density Φ_{SS}(ω) is given by
D(θ) = (1∕(2π)) ∫_{−π}^{π} min( θ, Φ_{SS}(ω) ) dω    (4.113)
R(θ) = (1∕(4π)) ∫_{−π}^{π} max( 0, log_2 ( Φ_{SS}(ω) ∕ θ ) ) dω    (4.114)
The minimization in the parametric formulation (4.113) and (4.114) of the rate distortion function is illustrated in Fig. 4.6. At each frequency, the variance of the corresponding frequency component, as given by the power spectral density Φ_{SS}(ω), is compared to the parameter θ, which represents the mean squared error of the frequency component. If Φ_{SS}(ω) is larger than θ, a mutual information of (1∕2) log_2 ( Φ_{SS}(ω) ∕ θ ) is assigned to that frequency component; otherwise, a mutual information of zero is assigned to it.
Fig. 4.6 Illustration of parametric equations for the rate distortion function of stationary Gaussian processes.
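As an illustration (our sketch, not from the book), the parametric pair (4.113)/(4.114) can be evaluated by numerical integration; here we use the Gauss-Markov power spectral density (4.78) with unit variance and check the result against the Shannon lower bound (4.80), which becomes exact once θ lies below the minimum of Φ_SS(ω).

    import numpy as np

    def rd_point_from_psd(psd, theta, n=8192):
        """One (R, D) point from the parametric equations (4.113)/(4.114)."""
        w = np.linspace(-np.pi, np.pi, n)
        phi = psd(w)
        D = np.trapz(np.minimum(theta, phi), w) / (2 * np.pi)
        R = np.trapz(np.maximum(0.0, np.log2(phi / theta)), w) / (4 * np.pi)
        return R, D

    rho = 0.9   # correlation coefficient of the Gauss-Markov process
    psd = lambda w: (1 - rho**2) / (1 - 2 * rho * np.cos(w) + rho**2)   # (4.78)
    for theta in (0.001, 0.01, 0.1):
        R, D = rd_point_from_psd(psd, theta)
        slb = (1 - rho**2) * 2.0**(-2 * R)    # distortion rate SLB (4.80)
        print(round(R, 3), round(D, 5), round(slb, 5))
    # For theta below min(PSD), about 0.0526 here, D equals theta and the SLB.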
For stationary zero-mean Gauss-Markov sources with a variance σ² and a correlation coefficient ρ, the power spectral density Φ_{SS}(ω) is given by (4.78). If we choose the parameter θ according to
θ ≤ min_ω Φ_{SS}(ω) = σ² ( 1 − ρ ) ∕ ( 1 + ρ )    (4.115)
we obtain the parametric equations
D(θ) = θ    (4.116)
and
R(θ) = (1∕(4π)) ∫_{−π}^{π} log_2 ( Φ_{SS}(ω) ∕ θ ) dω = (1∕2) log_2 ( σ² (1 − ρ²) ∕ θ )    (4.117)
Hence, for distortions D ≤ σ² (1 − ρ) ∕ (1 + ρ), the rate distortion function of a stationary Gauss-Markov process coincides with the Shannon lower bound,
R(D) = (1∕2) log_2 ( σ² (1 − ρ²) ∕ D )    (4.118)
Conversely, for rates R ≥ log_2 (1 + ρ), the distortion rate function of a stationary Gauss-Markov process coincides with the Shannon lower bound,
D(R) = ( 1 − ρ² ) ⋅ σ² ⋅ 2^{−2R}    (4.119)
For Gaussian iid sources (ρ = 0), these results are identical to (4.87) and (4.88). Fig. 4.7 shows distortion rate functions for stationary Gauss-Markov processes with different correlation factors ρ. The distortion is plotted as signal-to-noise ratio SNR = 10 log_{10} ( σ² ∕ D ).
Fig. 4.7 Distortion rate functions for Gauss-Markov processes with different correlation factors ρ. The distortion D is plotted as signal-to-noise ratio SNR = 10 log_{10} ( σ² ∕ D ).
We have noted above that the rate distortion function of the Gaussian iid process with a given variance specifies an upper bound for the rate distortion functions of all iid processes with the same variance. This statement can be generalized to stationary Gaussian processes with memory. The rate distortion function of the stationary zeromean Gaussian process as given parametrically by (4.113) and (4.114) specifies an upper limit for the rate distortion functions of all other stationary processes with the same power spectral density Φ_{SS}(ω). A proof of this statement can be found in [4].
4.5 Summary of Rate Distortion Theory
Rate distortion theory addresses the problem of finding the greatest lower bound for the average number of bits that is required for representing a signal without exceeding a given distortion. We introduced the operational rate distortion function that specifies this fundamental bound as the infimum over all source codes. A fundamental result of rate distortion theory is that the operational rate distortion function is equal to the information rate distortion function, which is defined as the infimum over all conditional pdf’s for the reconstructed samples given the original samples. Due to this equality, both the operational and the information rate distortion function are usually referred to as the rate distortion function. It has further been noted that, for the MSE distortion measure, the lossless coding theorem specifying that the average codeword length per symbol cannot be less than the entropy rate represents a special case of rate distortion theory for discrete sources with zero distortion.
For most sources and distortion measures, it is not known how to analytically derive the rate distortion function. A useful lower bound for the rate distortion function is given by the socalled Shannon lower bound. The difference between the Shannon lower bound and the rate distortion function approaches zero as the distortion approaches zero or the rate approaches infinity. Due to this property, it represents a suitable reference for evaluating the performance of lossy coding schemes at high rates. For the MSE distortion measure, an analytical expression for the Shannon lower bound can be given for typical iid sources as well as for general stationary Gaussian sources.
An important class of processes is the class of stationary Gaussian processes. For Gaussian iid processes and MSE distortion, the rate distortion function coincides with the Shannon lower bound for all rates. The rate distortion function for general stationary Gaussian sources with zero mean and MSE distortion can be specified as a parametric expression using the power spectral density. It has also been noted that the rate distortion function of the stationary Gaussian process with zero mean and a particular power spectral density represents an upper bound for all stationary processes with the same power spectral density, which leads to the conclusion that Gaussian sources are the most difficult to code.
5 Quantization
Lossy source coding systems, which we have introduced in chapter 4, are characterized by the fact that the reconstructed signal is not identical to the source signal. The process that introduces the corresponding loss of information (or signal fidelity) is called quantization. An apparatus or algorithmic specification that performs the quantization process is referred to as a quantizer. Each lossy source coding system includes a quantizer. The rate distortion point associated with a lossy source coding system is to a wide extent determined by the quantization process that is used. For this reason, the analysis of quantization techniques is of fundamental interest for the design of source coding systems.
In this chapter, we analyze the quantizer design and the performance of various quantization techniques with the emphasis on scalar quantization, since it is the most widely used quantization technique in video coding. To illustrate the inherent limitation of scalar quantization, we will also briefly introduce the concept of vector quantization and show its advantage with respect to the achievable rate distortion performance. For further details, the reader is referred to the comprehensive treatment of quantization in [16] and the overview of the history and theory of quantization in [28].
5.1 Structure and Performance of Quantizers
In the broadest sense, quantization is an irreversible deterministic mapping of an input quantity to an output quantity. For all cases of practical interest, the set of obtainable values for the output quantity is finite and includes fewer elements than the set of possible values for the input quantity. If the input quantity and the output quantity are scalars, the process of quantization is referred to as scalar quantization. A very simple variant of scalar quantization is the rounding of a real input value to its nearest integer value. Scalar quantization is by far the most popular form of quantization and is used in virtually all video coding applications. However, as we will see later, there is a gap between the operational rate distortion curve for optimal scalar quantizers and the fundamental rate distortion bound. This gap can only be reduced if a vector of more than one input sample is mapped to a corresponding vector of output samples. In this case, the input and output quantity are vectors and the quantization process is referred to as vector quantization. Vector quantization can asymptotically achieve the fundamental rate distortion bound if the number of samples in the input and output vector approaches infinity.
A quantizer Q of dimension N specifies a mapping of the N-dimensional Euclidean space R^N into a finite set of reconstruction vectors inside the N-dimensional Euclidean space R^N,
Q: R^N → { s′_0, s′_1, …, s′_{K−1} }    (5.1)
(Although we restrict our considerations to finite sets of reconstruction vectors, some of the presented quantization methods and derivations are also valid for countably infinite sets of reconstruction vectors.)
If the dimension N of the quantizer Q is equal to 1, it is a scalar quantizer; otherwise, it is a vector quantizer. The number K of reconstruction vectors is also referred to as the size of the quantizer Q. The deterministic mapping Q associates a subset C_i of the N-dimensional Euclidean space R^N with each of the reconstruction vectors s′_i. The subsets C_i, with 0 ≤ i < K, are called quantization cells and are defined by
C_i = { s ∈ R^N : Q(s) = s′_i }    (5.2)
From this definition, it follows that the quantization cells C_i form a partition of the N-dimensional Euclidean space R^N,
⋃_{i=0}^{K−1} C_i = R^N   with   C_i ∩ C_j = ∅  for  i ≠ j    (5.3)
Given the quantization cells C_i and the associated reconstruction values s′_i, the quantization mapping Q can be specified by
Q(s) = s′_i   for all   s ∈ C_i    (5.4)
A quantizer is completely specified by the set of reconstruction values and the associated quantization cells.
For analyzing the design and performance of quantizers, we consider the quantization of symbol sequences {s_n} that represent realizations of a random process {S_n}. For the case of vector quantization (N > 1), the samples of the input sequence {s_n} shall be arranged in vectors, resulting in a sequence of symbol vectors {s_n}. Usually, the input sequence {s_n} is decomposed into blocks of N samples and the components of an input vector s_n are built from the samples of such a block, but other arrangements are also possible. In any case, the sequence of input vectors {s_n} can be considered to represent a realization of a vector random process {S_n}. It should be noted that the domain of the input vectors s_n can be a subset of the N-dimensional space R^N, which is the case if the random process {S_n} is discrete or its marginal pdf f(s) is zero outside a finite interval. However, even in this case, we can generally consider quantization as a mapping of the N-dimensional Euclidean space R^N into a finite set of reconstruction vectors.
Fig. 5.1 shows a block diagram of a quantizer Q. Each input vector s_{n} is mapped onto one of the reconstruction vectors, given by Q(s_{n}). The average distortion D per sample between the input and output vectors depends only on the statistical properties of the input sequence {s_{n}} and the quantization mapping Q. If the random process {S_{n}} is stationary, it can be expressed by
D = E{ d_N( S_n, Q(S_n) ) } = ∫ d_N( s, Q(s) ) f_S(s) ds    (5.5)
where f_{S} denotes the joint pdf of the vector components of the random vectors S_{n}. For the MSE distortion measure, we obtain
D = (1∕N) ∑_{i=0}^{K−1} ∫_{C_i} ‖ s − s′_i ‖² f_S(s) ds    (5.6)
Unlike the distortion D, the average transmission rate is not only determined by the quantizer Q and the input process. As illustrated in Fig. 5.1, we have to consider the lossless coding γ by which the sequence of reconstruction vectors {Q(s_n)} is mapped onto a sequence of codewords. For calculating the performance of a quantizer or for designing a quantizer, we have to make reasonable assumptions about the lossless coding γ. It is certainly not a good idea to design a quantizer under the assumption of a lossless coding with an average codeword length per symbol close to the entropy, but then to use it in combination with fixed-length codewords for the reconstruction vectors. Similarly, a quantizer that has been optimized under the assumption of fixed-length codewords is not optimal if it is used in combination with advanced lossless coding techniques such as Huffman coding or arithmetic coding.
The rate R of a coding system consisting of a quantizer Q and a lossless coding γ is defined as the average codeword length per input sample. For stationary input processes {S_{n}}, it can be expressed by
R = (1∕N) ∑_{i=0}^{K−1} p( s′_i ) ⋅ ℓ̄( s′_i )    (5.7)
where ℓ̄(s′_i) denotes the average codeword length that is obtained for a reconstruction vector s′_i with the lossless coding γ, and p(s′_i) denotes the pmf for the reconstruction vectors, which is given by
p( s′_i ) = P( Q(S) = s′_i ) = ∫_{C_i} f_S(s) ds    (5.8)
The probability of a reconstruction vector does not depend on the reconstruction vector itself, but only on the associated quantization cell C_i.
Fig. 5.2 Lossy source coding system consisting of a quantizer, which is decomposed into an encoder mapping α and a decoder mapping β, and a lossless coder γ.
A quantizer Q can be decomposed into two parts, an encoder mapping α, which maps the input vectors s_n to quantization indexes i, with 0 ≤ i < K, and a decoder mapping β, which maps the quantization indexes i to the associated reconstruction vectors s′_i. The quantizer mapping can then be expressed by Q(s) = β(α(s)). The loss of signal fidelity is introduced as a result of the encoder mapping α; the decoder mapping β merely maps the quantization indexes i to the associated reconstruction vectors s′_i. The combination of the encoder mapping α and the lossless coding γ forms an encoder of a lossy source coding system, as illustrated in Fig. 5.2. The corresponding decoder is given by the inverse lossless coding γ^{−1} and the decoder mapping β.
In scalar quantization (N = 1), the input and output quantities are scalars. Hence, a scalar quantizer Q of size K specifies a mapping of the real line R into a set of K reconstruction levels,
Q: R → { s′_0, s′_1, …, s′_{K−1} }    (5.9)
In the general case, a quantization cell C_i corresponds to a set of intervals of the real line. We restrict our considerations to regular scalar quantizers, for which each quantization cell C_i represents a single interval of the real line and the reconstruction levels s′_i are located inside the associated quantization cells C_i. Without loss of generality, we further assume that the quantization cells are ordered in increasing order of the values of their lower interval boundary. When we further assume that the quantization intervals include the lower, but not the higher, interval boundary, each quantization cell can be represented by a half-open interval C_i = [u_i, u_{i+1}) (in a strict mathematical sense, the first quantization cell is an open interval C_0 = (−∞, u_1)). The interval boundaries u_i are also referred to as decision thresholds. The interval sizes Δ_i = u_{i+1} − u_i are called quantization step sizes. Since the quantization cells must form a partition of the real line R, the values u_0 and u_K are fixed and given by u_0 = −∞ and u_K = ∞. Consequently, K reconstruction levels and K − 1 decision thresholds can be chosen in the quantizer design.
The quantizer mapping Q of a scalar quantizer as defined above can be represented by a piecewise-constant input-output function, as illustrated in Fig. 5.3. All input values s with u_i ≤ s < u_{i+1} are assigned to the corresponding reproduction level s′_i.
In the following treatment of scalar quantization, we generally assume that the input process is stationary. For continuous random processes, scalar quantization can then be interpreted as a discretization of the marginal pdf f(s), as illustrated in Fig. 5.4.
For any stationary process {S_n} with a marginal pdf f(s), the quantizer output is a discrete random process {S′_n} with a marginal pmf
p( s′_i ) = ∫_{u_i}^{u_{i+1}} f(s) ds    (5.10)
The average distortion D (for the MSE distortion measure) is given by
D = E{ d_1( S, Q(S) ) } = ∑_{i=0}^{K−1} ∫_{u_i}^{u_{i+1}} ( s − s′_i )² f(s) ds    (5.11)
The average rate R depends on the lossless coding γ and is given by
R = E{ ℓ̄( Q(S) ) } = ∑_{i=0}^{K−1} p( s′_i ) ⋅ ℓ̄( s′_i )    (5.12)
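To make (5.11) and (5.12) concrete, here is a small Python sketch (ours) that evaluates the distortion and rate of an arbitrary scalar quantizer by numerical integration; the thresholds, levels, and codeword lengths in the example are often-quoted Lloyd-Max values for a unit-variance Gaussian and serve only as an illustration.

    import numpy as np
    from scipy import integrate, stats

    def quantizer_rate_distortion(u, levels, pdf, lengths):
        """Distortion (5.11) and rate (5.12) of a scalar quantizer.
        u holds the K-1 finite decision thresholds; u_0 = -inf, u_K = inf."""
        bounds = np.concatenate(([-np.inf], u, [np.inf]))
        D = R = 0.0
        for i, s_i in enumerate(levels):
            d, _ = integrate.quad(lambda s: (s - s_i)**2 * pdf(s),
                                  bounds[i], bounds[i + 1])
            p, _ = integrate.quad(pdf, bounds[i], bounds[i + 1])
            D += d                      # interval distortion contribution
            R += p * lengths[i]         # probability times codeword length
        return D, R

    # Often-quoted Lloyd-Max values for K = 4 and a unit-variance Gaussian
    u = [-0.9816, 0.0, 0.9816]
    levels = [-1.510, -0.4528, 0.4528, 1.510]
    D, R = quantizer_rate_distortion(u, levels, stats.norm.pdf, [2, 2, 2, 2])
    print(round(D, 4), R)               # D ~ 0.1175 at R = 2 bit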
5.2.1 Scalar Quantization with FixedLength Codes
We will first investigate scalar quantizers in connection with fixed-length codes. The lossless coding γ is assumed to assign a codeword of the same length to each reconstruction level. For a quantizer of size K, the codeword length must be greater than or equal to ⌈log_2 K⌉. Under these assumptions, the quantizer size K should be a power of 2. If K is not a power of 2, the quantizer requires the same minimum codeword length as a quantizer of size K′ = 2^{⌈log_2 K⌉}, but since K < K′, the quantizer of size K′ can achieve a smaller distortion. For simplifying the following discussion, we define the rate R according to
R = log_2 K    (5.13)
but inherently assume that K represents a power of 2.
A very simple form of quantization is pulse-code-modulation (PCM) for random processes with a finite amplitude range. PCM is a quantization process for which all quantization intervals have the same size Δ and the reproduction values s′_i are placed in the middle between the decision thresholds u_i and u_{i+1}. For general input signals, this is not possible, since it would result in an infinite number of quantization intervals K and hence an infinite rate for our fixed-length code assumption. However, if the input random process has a finite amplitude range of [s_min, s_max], the quantization process is actually a mapping of the finite interval [s_min, s_max] to the set of reproduction levels. Hence, we can set u_0 = s_min and u_K = s_max. The width A = s_max − s_min of the amplitude interval is then evenly split into K quantization intervals, resulting in a quantization step size
Δ = A ∕ K = A ⋅ 2^{−R}    (5.14)
The quantization mapping for PCM can be specified by
Q(s) = s_min + ( ⌊ ( s − s_min ) ∕ Δ ⌋ + 1∕2 ) ⋅ Δ    (5.15)
As an example, we consider PCM quantization of a stationary random process with a uniform distribution, f(s) = 1∕A for −A∕2 ≤ s ≤ A∕2. The distortion as defined in (5.11) becomes
D = ∑_{i=0}^{K−1} ∫_{u_i}^{u_{i+1}} ( s − s′_i )² (1∕A) ds    (5.16)
By carrying out the integration and inserting (5.14), we obtain the operational distortion rate function,
D_{PCM}(R) = Δ² ∕ 12 = ( A² ∕ 12 ) ⋅ 2^{−2R} = σ² ⋅ 2^{−2R}    (5.17)
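A quick Monte Carlo check of (5.17) (our sketch): quantizing uniform samples with a mid-reconstruction uniform quantizer reproduces D = Δ²∕12 at every rate.

    import numpy as np

    rng = np.random.default_rng(0)
    A = 2.0
    s = rng.uniform(-A / 2, A / 2, 1_000_000)    # uniform source samples
    for R in (1, 2, 3, 4):
        K = 2**R
        delta = A / K                            # step size (5.14)
        i = np.clip(np.floor((s + A / 2) / delta), 0, K - 1)   # interval index
        sq = -A / 2 + (i + 0.5) * delta          # mid-interval reconstruction
        print(R, round(np.mean((s - sq)**2), 6), round(delta**2 / 12, 6))
    # The empirical MSE matches D = delta^2 / 12 from (5.17) at every rate.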
For stationary random processes with an infinite amplitude range, we have to choose u_0 = −∞ and u_K = ∞. The inner interval boundaries u_i, with 0 < i < K, and the reconstruction levels s′_i can be evenly distributed around the mean value μ of the random variables S. For symmetric distributions (μ = 0), this gives
u_i = Δ ⋅ ( i − K∕2 ),   for 0 < i < K    (5.18)
s′_i = Δ ⋅ ( i − K∕2 ) + Δ∕2,   for 0 ≤ i < K    (5.19)
Fig. 5.5 PCM quantization of stationary random processes with uniform (U), Laplacian (L), and Gaussian (G) distributions: (left) operational distortion rate functions in comparison to the corresponding Shannon lower bounds (for variances σ^{2} = 1); (right) optimal quantization step sizes.
For the application of PCM quantization to stationary random processes with an infinite amplitude interval, we have chosen the quantization step size for a given quantizer size K by minimizing the distortion. A natural extension of this concept is to minimize the distortion with respect to all parameters of a scalar quantizer of a given size K. The optimization variables are the K − 1 decision thresholds u_i, with 0 < i < K, and the K reconstruction levels s′_i, with 0 ≤ i < K. The obtained quantizer is called a pdf-optimized scalar quantizer with fixed-length codes.
For deriving a condition for the reconstruction levels s′_i, we first assume that the decision thresholds u_i are given. The overall distortion (5.11) is the sum of the distortions D_i for the quantization intervals C_i = [u_i, u_{i+1}). For given decision thresholds, the interval distortions D_i are mutually independent and are determined by the corresponding reconstruction levels s′_i,
D_i = ∫_{u_i}^{u_{i+1}} d_1( s, s′_i ) f(s) ds    (5.20)
By using the conditional pdf f(s|s′_i) = f(s) ∕ p(s′_i), for s ∈ C_i, we obtain
D_i = p( s′_i ) ⋅ ∫_{u_i}^{u_{i+1}} d_1( s, s′_i ) f( s | s′_i ) ds = p( s′_i ) ⋅ E{ d_1( S, s′_i ) | S ∈ C_i }    (5.21)
Since p(s′_{i}) does not depend on s′_{i}, the optimal reconstruction levels s′_{i}^{*} are given by
s′_i^* = arg min_{s′} E{ d_1( S, s′ ) | S ∈ C_i }    (5.22)
which is also called the generalized centroid condition. For the squared error distortion measure d_1(s, s′) = (s − s′)², the optimal reconstruction levels s′_i^* are the conditional means (centroids)
s′_i^* = E{ S | S ∈ C_i } = ( ∫_{u_i}^{u_{i+1}} s f(s) ds ) ∕ ( ∫_{u_i}^{u_{i+1}} f(s) ds )    (5.23)
This can be easily proved by the inequality
E{ ( S − s′ )² } = E{ ( ( S − E{S} ) + ( E{S} − s′ ) )² } = E{ ( S − E{S} )² } + ( E{S} − s′ )² ≥ E{ ( S − E{S} )² }    (5.24)
where all expectation values are conditioned on S ∈ C_i.
If the reproduction levels s′_{i} are given, the overall distortion D is minimized if each input value s is mapped to the reproduction level s′_{i} that minimizes the corresponding sample distortion d_{1}(s,s′_{i}),
Q(s) = arg min_{s′_i} d_1( s, s′_i )    (5.25)
This condition is also referred to as the nearest neighbor condition. Since a decision threshold u_{i} influences only the distortions D_{i} of the neighboring intervals, the overall distortion is minimized if
d_1( u_i, s′_{i−1} ) = d_1( u_i, s′_i )    (5.26)
holds for all decision thresholds u_{i}, with 0 < i < K. For the squared error distortion measure, the optimal decision thresholds u_{i}^{*}, with 0 < i < K, are given by
u_i^* = ( s′_{i−1} + s′_i ) ∕ 2    (5.27)
The expressions (5.23) and (5.27) can also be obtained by setting the partial derivatives of the distortion (5.11) with respect to the decision thresholds u_{i} and the reconstruction levels s′_{i} equal to zero [56].
The necessary conditions for the optimal reconstruction levels (5.22) and decision thresholds (5.25) depend on each other. A corresponding iterative algorithm for minimizing the distortion of a quantizer of given size K was suggested by Lloyd [49] and is commonly called the Lloyd algorithm. The obtained quantizer is referred to as Lloyd quantizer or Lloyd-Max quantizer (Lloyd and Max independently observed the two necessary conditions for optimality). For a given pdf f(s), first an initial set of unique reconstruction levels {s′_i} is arbitrarily chosen, then the decision thresholds {u_i} and reconstruction levels {s′_i} are alternately determined according to (5.25) and (5.22), respectively, until the algorithm converges. It should be noted that the fulfillment of the conditions (5.22) and (5.25) is in general not sufficient to guarantee the optimality of the quantizer. The conditions are only sufficient if the pdf f(s) is log-concave. One of the examples for which the Lloyd algorithm yields a unique solution independent of the initial set of reconstruction levels is the Gaussian pdf.
Often, the marginal pdf f(s) of a random process is not known a priori. In such a case, the Lloyd algorithm can be applied using a training set. If the training set includes a sufficiently large number of samples, the obtained quantizer is an accurate approximation of the Lloyd quantizer. Using the encoder mapping α (see sec. 5.1), the Lloyd algorithm for a training set of samples {s_n} and a given quantizer size K can be stated as follows:
(1) Choose an initial set of unique reconstruction levels {s′_i}.
(2) Associate all samples of the training set {s_n} with one of the quantization intervals C_i according to α(s_n) = arg min_i d_1(s_n, s′_i) (nearest neighbor condition) and update the decision thresholds {u_i} accordingly.
(3) Update the reconstruction levels {s′_i} according to the centroid condition, s′_i = E{ S | α(S) = i }, where the expectation value is taken over the training set.
(4) Repeat steps (2) and (3) until convergence.
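Below is a compact Python sketch of this training-set variant (ours; the quantile initialization and the stopping rule are implementation choices, and the code assumes every quantization cell stays populated). For a unit-variance Gaussian training set and K = 4, it reproduces the well-known Lloyd-Max levels of approximately ±0.453 and ±1.510.

    import numpy as np

    def lloyd(samples, k, n_iter=100, tol=1e-4):
        """Lloyd algorithm on a training set with MSE distortion (sketch)."""
        levels = np.quantile(samples, (np.arange(k) + 0.5) / k)   # initial levels
        prev_d = np.inf
        for _ in range(n_iter):
            u = (levels[:-1] + levels[1:]) / 2    # decision thresholds (5.27)
            idx = np.digitize(samples, u)         # nearest-neighbor mapping (5.25)
            levels = np.array([samples[idx == i].mean() for i in range(k)])  # (5.23)
            d = np.mean((samples - levels[idx])**2)
            if prev_d - d < tol * d:              # relative distortion reduction
                break
            prev_d = d
        return levels, u, d

    rng = np.random.default_rng(1)
    levels, u, d = lloyd(rng.standard_normal(100_000), k=4)
    print(np.round(levels, 3), round(d, 4))   # about +/-0.453, +/-1.510; D ~ 0.118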
As a first example, we applied the Lloyd algorithm with a training set of more than 10,000 samples and the MSE distortion measure to a Gaussian pdf with unit variance. We used two different initializations for the reconstruction levels. Convergence was determined if the relative distortion reduction between two iteration steps was less than 1%, i.e., (D_k − D_{k+1}) ∕ D_{k+1} < 0.01. The algorithm quickly converged after 6 iterations for both initializations to the same overall distortion D_F^*. The obtained reconstruction levels {s′_i} and decision thresholds {u_i} as well as the iteration processes for the two initializations are illustrated in Fig. 5.6.
Fig. 5.6 Lloyd algorithm for a Gaussian pdf with unit variance and two initializations: (top) final reconstruction levels and decision thresholds; (middle) reconstruction levels and decision thresholds as function of the iteration step; (bottom) overall SNR and SNR for the quantization intervals as function of the iteration step.
The same algorithm with the same two initializations was also applied to a Laplacian pdf with unit variance. Also for this distribution, the algorithm quickly converged after 6 iterations for both initializations to the same overall distortion D_{F }^{*}. The obtained quantizer and the iteration processes are illustrated in Fig. 5.7.
Fig. 5.7 Lloyd algorithm for a Laplacian pdf with unit variance and two initializations: (top) final reconstruction levels and decision thresholds; (middle) reconstruction levels and decision thresholds as function of the iteration step; (bottom) overall SNR and SNR for the quantization intervals as function of the iteration step.
5.2.2 Scalar Quantization with VariableLength Codes
We have investigated the design of quantizers that minimize the distortion for a given number K of reconstruction levels, which is equivalent to a quantizer optimization under the assumption that all reconstruction levels are signaled with codewords of the same length. Now we consider the quantizer design in combination with variable-length codes γ.
The average codeword length that is associated with a particular reconstruction level s′_i is denoted by ℓ̄(s′_i). If we use a scalar Huffman code, ℓ̄(s′_i) is equal to the length of the codeword γ(s′_i) that is assigned to s′_i. According to (5.12), the average rate R is given by
R = ∑_{i=0}^{K−1} p( s′_i ) ⋅ ℓ̄( s′_i )    (5.28)
The average distortion is the same as for scalar quantization with fixedlength codes and is given by (5.11).
Since distortion and rate influence each other, they cannot be minimized independently. The optimization problem can be stated as
min D   subject to   R ≤ R_max    (5.29)
or equivalently,
min R   subject to   D ≤ D_max    (5.30)
with R_max and D_max being a given maximum rate and a given maximum distortion, respectively. The constrained minimization problem can be formulated as an unconstrained minimization of the Lagrangian functional
J = D + λ R = E{ d_1( S, Q(S) ) } + λ ⋅ E{ ℓ̄( Q(S) ) }    (5.31)
The parameter λ, with 0 ≤ λ < ∞, is referred to as the Lagrange parameter. The solution of the minimization of (5.31) is a solution of the constrained minimization problems (5.29) and (5.30) in the following sense: if there is a Lagrange parameter λ that yields a particular rate R_max (or a particular distortion D_max), the corresponding distortion D (or rate R) is a solution of the constrained optimization problem.
In order to derive necessary conditions similar to those for the quantizer design with fixed-length codes, we first assume that the decision thresholds u_i are given. Since the rate R is independent of the reconstruction levels s′_i, the optimal reconstruction levels are found by minimizing the distortion D. This is the same optimization problem as for the scalar quantizer with fixed-length codes. Hence, the optimal reconstruction levels s′_i^* are given by the generalized centroid condition (5.22).
The optimal average codeword lengths ℓ̄(s′_i) also depend only on the decision thresholds u_i. Given the decision thresholds and thus the probabilities p(s′_i), the average codeword lengths ℓ̄(s′_i) can be determined. If we, for example, assume that the reconstruction levels are coded using a scalar Huffman code, the Huffman code could be constructed given the pmf p(s′_i), which directly yields the codeword lengths ℓ̄(s′_i). In general, it is, however, justified to approximate the average rate R by the entropy H(S′) and set the average codeword length equal to
ℓ̄( s′_i ) = −log_2 p( s′_i )    (5.32)
This underestimates the true rate by a small amount. For Huffman coding, the difference is always less than 1 bit per symbol, and for arithmetic coding it is usually much smaller. When the entropy is used as an approximation of the rate during the quantizer design, the obtained quantizer is also called an entropy-constrained scalar quantizer. At this point, we ignore that, for sources with memory, the lossless coding γ can exploit dependencies between output samples, for example, by using block Huffman coding or arithmetic coding with conditional probabilities. This extension is discussed in sec. 5.2.6.
For deriving a necessary condition for the decision thresholds u_i, we now assume that the reconstruction levels s′_i and the average codeword lengths ℓ̄(s′_i) are given. Similarly as for the nearest neighbor condition in sec. 5.2.1, the quantization mapping Q(s) that minimizes the Lagrangian functional J is given by
Q(s) = arg min_{s′_i} [ d_1( s, s′_i ) + λ ⋅ ℓ̄( s′_i ) ]    (5.33)
A mapping Q(s) that minimizes the term d_1(s, s′_i) + λ ℓ̄(s′_i) for each source symbol s also minimizes the expected value in (5.31). A rigorous proof of this statement can be found in [71]. The decision thresholds u_i have to be selected in a way that the term d_1(s, s′_i) + λ ℓ̄(s′_i) is the same for both neighboring intervals,
d_1( u_i, s′_{i−1} ) + λ ⋅ ℓ̄( s′_{i−1} ) = d_1( u_i, s′_i ) + λ ⋅ ℓ̄( s′_i )    (5.34)
For the MSE distortion measure, we obtain
u_i^* = ( s′_{i−1} + s′_i ) ∕ 2 + ( λ ∕ 2 ) ⋅ ( ℓ̄( s′_i ) − ℓ̄( s′_{i−1} ) ) ∕ ( s′_i − s′_{i−1} )    (5.35)
The consequence is a shift of the decision threshold u_{i} from the midpoint between the reconstruction levels toward the interval with the longer average codeword length, i.e., the less probable interval.
Fig. 5.8 Lagrangian minimization: (left) independent operational distortion rate curves for the 5 symbols, where each circle represents one of 6 available distortion rate points; (right) the small dots show the average distortion and rate for all possible combinations of the 5 different quantizers with their 6 rate distortion points, the circles show the solutions to the Lagrangian minimization problem.
Lagrangian minimization as in (5.33) is a very important concept in modern video coding. Hence, we have conducted a simple experiment to illustrate the minimization approach. For that, we simulated the encoding of a 5-symbol sequence {s_i}. The symbols are assumed to be mutually independent and have different distributions. We have generated one operational distortion rate function D_i(R) = a_i² ⋅ 2^{−2R} for each symbol, with a_i² being randomly chosen. For each operational distortion rate function, we have selected 6 rate points R_{i,k}, which represent the available quantizers.
The Lagrangian optimization process is illustrated in Fig. 5.8. The diagram on the left shows the 5 operational distortion rate functions D_{i}(R) with the available rate points R_{i,k}. The right diagram shows the average distortion and rate for each combination of rate points for encoding the 5symbol sequence. The results of the minimization of D_{i}(R_{i,k}) + λR_{i,k} with respect to R_{i,k} for different values of the Lagrange parameter λ are marked by circles. This experiment illustrates that the Lagrangian minimization approach yields a result on the convex hull of the admissible distortion rate points.
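A re-creation of this experiment in a few lines of Python (ours; the random constants are arbitrary): for each value of λ, the minimization of D + λR is performed independently for each symbol, and the chosen points sweep the convex hull of the achievable rate distortion pairs.

    import numpy as np

    rng = np.random.default_rng(2)
    a2 = rng.uniform(0.5, 2.0, 5)                   # random variance factors a_i^2
    R_pts = np.sort(rng.uniform(0, 4, (5, 6)), 1)   # 6 available rate points per symbol
    D_pts = a2[:, None] * 2.0**(-2 * R_pts)         # operational D(R) per symbol

    for lam in (0.05, 0.2, 1.0):
        # independent minimization of D + lambda * R for each of the 5 symbols
        k = np.argmin(D_pts + lam * R_pts, axis=1)
        R = R_pts[np.arange(5), k].mean()
        D = D_pts[np.arange(5), k].mean()
        print(round(lam, 2), round(R, 3), round(D, 4))
    # Larger lambda favors lower rates; each solution lies on the convex hull.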
Given the necessary conditions for an optimal quantizer with variable-length codes, we can construct an iterative design algorithm similar to the Lloyd algorithm. If we use the entropy as measure for the average rate, the algorithm is also referred to as entropy-constrained Lloyd algorithm. Using the encoder mapping α, the variant that uses a sufficiently large training set {s_{n}} can be stated as follows for a given value of λ:

1. Choose an initial quantizer size N, an initial set of reconstruction levels {s′_{i}}, and an initial set of average codeword lengths ℓ̄(s′_{i}).
2. Associate all samples of the training set {s_{n}} with one of the quantization intervals C_{i} according to α(s) = arg min_{i} [ d(s, s′_{i}) + λ ℓ̄(s′_{i}) ] and update the decision thresholds {u_{i}} accordingly.
3. Update the reconstruction levels according to s′_{i} = arg min_{s′} E{ d(S, s′) | α(S) = i }, where the expectation value is taken over the training set.
4. Update the average codeword lengths according to ℓ̄(s′_{i}) = −log_{2} p(s′_{i}), where the pmf p(s′_{i}) is estimated over the training set.
5. Repeat steps 2 to 4 until convergence.
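The following Python sketch implements the steps above for the MSE distortion, with codeword lengths estimated as −log2 of the empirical interval probabilities. The initialization and all parameter values are illustrative choices, not the book's setup.

```python
# Sketch of the entropy-constrained Lloyd algorithm on a training set.
import numpy as np

def ec_lloyd(samples, n_init, lam, n_iter=50):
    levels = np.linspace(samples.min(), samples.max(), n_init)
    lengths = np.full(n_init, np.log2(n_init))   # start from FLC lengths
    for _ in range(n_iter):
        # step 2: encoder mapping alpha(s) = argmin_i d(s,s'_i) + lam*l_i
        idx = np.argmin((samples[:, None] - levels) ** 2 + lam * lengths, axis=1)
        p = np.bincount(idx, minlength=len(levels)) / len(samples)
        # step 3: conditional-mean update of all used reconstruction levels
        for i in np.flatnonzero(p):
            levels[i] = samples[idx == i].mean()
        # step 4: codeword lengths from the empirical pmf; unused levels
        # disappear, so the final size K can be smaller than n_init
        keep = p > 0
        levels, lengths = levels[keep], -np.log2(p[keep])
    idx = np.argmin((samples[:, None] - levels) ** 2 + lam * lengths, axis=1)
    return levels, lengths, idx

rng = np.random.default_rng(1)
s = rng.normal(size=100_000)                 # unit-variance Gaussian source
levels, lengths, idx = ec_lloyd(s, n_init=20, lam=0.05)
p = np.bincount(idx, minlength=len(levels)) / len(s)
print(f"K={len(levels)}, rate={p @ lengths:.3f} bit, "
      f"SNR={-10 * np.log10(np.mean((s - levels[idx]) ** 2)):.2f} dB")
```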
As mentioned above, the entropy constraint in the algorithm causes a shift of the cost function that depends on the pmf p(s′_{i}). If two reconstruction levels s′_{i} and s′_{i+1} compete for a sample, the more probable level has the higher chance of being chosen, and the probability of a reconstruction level that is rarely chosen is further reduced. As a consequence, reconstruction levels can get “removed” and the quantizer size K of the final result can be smaller than the initial quantizer size N.
The number N of initial reconstruction levels is critical for the quantizer performance after convergence. Fig. 5.9 illustrates the results of the entropy-constrained Lloyd algorithm after convergence for a Laplacian pdf and different numbers of initial reconstruction levels, where the rate is measured as the entropy of the reconstruction symbols. It can be seen that a larger number of initial reconstruction levels always leads to a smaller or equal distortion (higher or equal SNR) at the same rate than a smaller number of initial reconstruction levels.
Fig. 5.9 Operational distortion-rate curves after convergence of the entropy-constrained Lloyd algorithm for different numbers of initialized reconstruction levels. The rate R is measured as the entropy of the reconstruction symbols.
As a first example, we applied the entropy-constrained Lloyd algorithm with the MSE distortion to a Gaussian pdf with unit variance. The resulting average distortion corresponds to an SNR of 10.45 dB at an average rate R, measured as entropy, of 2 bit per symbol. The obtained optimal reconstruction levels and decision thresholds are depicted in Fig. 5.10. This figure also illustrates the iteration process for two different initializations. For initialization A, the initial number of reconstruction levels is sufficiently large, and during the iteration process the size of the quantizer is reduced. With initialization B, however, the desired quantizer performance is not achieved, because the number of initial reconstruction levels is too small for the chosen value of λ.
Fig. 5.10 Entropy-constrained Lloyd algorithm for a Gaussian pdf with unit variance and two initializations: (top) final reconstruction levels and decision thresholds; (middle) reconstruction levels and decision thresholds as a function of the iteration step; (bottom) overall distortion D and rate R, measured as entropy, as a function of the iteration step.
The same experiment was carried out for a Laplacian pdf with unit variance. Here, the resulting average distortion corresponds to an SNR of 11.46 dB at an average rate R, measured as entropy, of 2 bit per symbol. The obtained optimal reconstruction levels and decision thresholds as well as the iteration processes are illustrated in Fig. 5.11. Similarly as for the Gaussian pdf, the number of initial reconstruction levels for initialization B is too small for the chosen value of λ, so that the desired quantization performance is not achieved. For initialization A, the initial quantizer size is large enough, and the number of quantization intervals is reduced during the iteration process.
Fig. 5.11 Entropy-constrained Lloyd algorithm for a Laplacian pdf with unit variance and two initializations: (top) final reconstruction levels and decision thresholds; (middle) reconstruction levels and decision thresholds as a function of the iteration step; (bottom) overall distortion D and rate R, measured as entropy, as a function of the iteration step.
5.2.3 High-Rate Operational Distortion-Rate Functions
In general, it is impossible to analytically state the operational distortion-rate function for optimized quantizer designs. One of the few exceptions is the uniform distribution, for which the operational distortion-rate function for all discussed quantizer designs is given in (5.17). For stationary input processes with continuous random variables, we can, however, derive the asymptotic operational distortion-rate functions for very high rates (R → ∞) or, equivalently, for small distortions (D → 0). The resulting relationships are referred to as high-rate approximations and approach the true operational distortion-rate functions as the rate approaches infinity. We recall that as the rate approaches infinity, the (information) distortion-rate function approaches the Shannon lower bound. Hence, for high rates, the performance of a quantizer design can be evaluated by comparing the high-rate approximation of the operational distortion-rate function with the Shannon lower bound.
The general assumption that we use for deriving high-rate approximations is that the sizes Δ_{i} of the quantization intervals [u_{i}, u_{i+1}) are so small that the marginal pdf f(s) of a continuous input process is nearly constant inside each interval,
f(s) ≈ f(s′_{i})   for   s ∈ [u_{i}, u_{i+1}).        (5.36)
The probabilities of the reconstruction levels can be approximated by
p(s′_{i}) = ∫_{u_{i}}^{u_{i+1}} f(s) ds ≈ f(s′_{i}) · Δ_{i}.        (5.37)
For the average distortion D, we obtain
D = E{ d(S, Q(S)) } = Σ_{i=0}^{K−1} ∫_{u_{i}}^{u_{i+1}} (s − s′_{i})² f(s) ds ≈ Σ_{i=0}^{K−1} f(s′_{i}) ∫_{u_{i}}^{u_{i+1}} (s − s′_{i})² ds.        (5.38)
An integration of the right side of (5.38) yields
D ≈ (1/3) Σ_{i=0}^{K−1} f(s′_{i}) [ (u_{i+1} − s′_{i})³ − (u_{i} − s′_{i})³ ].        (5.39)
For each quantization interval, the distortion is minimized if the term (u_{i+1} − s′_{i})³ is equal to the term −(u_{i} − s′_{i})³, which yields
s′_{i} = (u_{i} + u_{i+1}) / 2 = u_{i} + Δ_{i}/2.        (5.40)
By inserting (5.40) into (5.39), we obtain the following expression for the average distortion at high rates,
D ≈ (1/12) Σ_{i=0}^{K−1} f(s′_{i}) · Δ_{i}³ = (1/12) Σ_{i=0}^{K−1} p(s′_{i}) · Δ_{i}².        (5.41)
For deriving the asymptotic operational distortion-rate functions, we will use the expression (5.41) with equality, but keep in mind that it is only asymptotically correct for Δ_{i} → 0.
For PCM quantization of random processes with a finite amplitude range of width A, we can directly insert the expression (5.14) into the distortion approximation (5.41). Since Σ_{i=0}^{K−1} p(s′_{i}) is equal to 1, this yields the asymptotic operational distortion-rate function
D_{PCM}(R) = (A² / 12) · 2^{−2R}.        (5.42)
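The approximation is easy to check numerically. The sketch below applies a uniform PCM quantizer with step size Δ = A·2^(−R) to a Gaussian source clipped to an amplitude range of width A and compares the measured MSE with Δ²/12; the chosen values of A and R are arbitrary.

```python
# Numerical check of D ~ Delta^2/12 for PCM (illustrative parameters).
import numpy as np

rng = np.random.default_rng(2)
A, R = 8.0, 6
delta = A * 2.0 ** (-R)
s = np.clip(rng.normal(size=200_000), -A / 2, A / 2 - 1e-9)
sq = (np.floor((s + A / 2) / delta) + 0.5) * delta - A / 2   # cell midpoints
print(f"measured D = {np.mean((s - sq) ** 2):.3e},  Delta^2/12 = {delta**2/12:.3e}")
```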
In order to derive the asymptotic operational distortion-rate function for optimal scalar quantizers in combination with fixed-length codes, we again start with the distortion approximation in (5.41). By using the relationship Σ_{i=0}^{K−1} K^{−1} = 1, it can be reformulated as
D ≈ (1/12) ( Σ_{i=0}^{K−1} f(s′_{i}) Δ_{i}³ ) ( Σ_{i=0}^{K−1} K^{−1} )².        (5.43)
Using Hölder's inequality
( Σ_{i} x_{i} )^{α} · ( Σ_{i} y_{i} )^{1−α} ≥ Σ_{i} x_{i}^{α} y_{i}^{1−α}   for   0 ≤ α ≤ 1,        (5.44)
with equality if and only if x_{i} is proportional to y_{i}, it follows
D ≥ (1 / (12 K²)) ( Σ_{i=0}^{K−1} f(s′_{i})^{1/3} Δ_{i} )³.        (5.45)
Equality is achieved if the terms f(s′_{i}) Δ_{i}³ are proportional to 1/K, i.e., if they are the same for all quantization intervals. Hence, the average distortion at high rates is minimized if all quantization intervals have the same contribution to the overall distortion D.
We have intentionally chosen α = 1/3 in order to obtain an expression in which Δ_{i} appears without an exponent inside the sum. Remembering that the used distortion approximation is asymptotically valid for small intervals Δ_{i}, the summation in (5.45) can be written as an integral,
D ≈ (1 / (12 K²)) ( ∫_{−∞}^{∞} f(s)^{1/3} ds )³.        (5.46)
As discussed in sec. 5.2.1, the rate R for a scalar quantizer with fixed-length codes is given by R = log_{2} K. This yields the following asymptotic operational distortion-rate function for optimal scalar quantizers with fixed-length codes,
D_{F}(R) = σ² ε_{F}² 2^{−2R}   with   ε_{F}² = (1 / (12 σ²)) ( ∫_{−∞}^{∞} f(s)^{1/3} ds )³,        (5.47)
where the factor ε_{F}² depends only on the marginal pdf f(s) of the input process. The result (5.47) was reported by Panter and Dite in [59] and is also referred to as the Panter and Dite formula.
In sec. 5.2.2, we have discussed that the rate R for an optimized scalar quantizer with variable-length codes can be approximated by the entropy H(S′) of the output random variables S′. We ignore that, for the quantization of sources with memory, the output samples are not mutually independent and hence a lossless code that exploits the dependencies between the output samples may achieve a rate below the scalar entropy H(S′).
By using the entropy H(S′) of the output random variables S′ as an approximation of the rate R and applying the high-rate approximation p(s′_{i}) = f(s′_{i}) Δ_{i}, we obtain

R = H(S′) = −Σ_{i=0}^{K−1} p(s′_{i}) log_{2} p(s′_{i}) = −Σ_{i=0}^{K−1} f(s′_{i}) Δ_{i} log_{2} f(s′_{i}) − Σ_{i=0}^{K−1} f(s′_{i}) Δ_{i} log_{2} Δ_{i}.        (5.48)

Since we investigate the asymptotic behavior for small interval sizes Δ_{i}, the first term in (5.48) can be formulated as an integral, which represents the differential entropy h(S), yielding

R = h(S) − Σ_{i=0}^{K−1} p(s′_{i}) log_{2} Δ_{i}.        (5.49)

We continue with applying Jensen's inequality for concave functions φ(x), such as φ(x) = log_{2} x, and positive weights a_{i} that sum to 1,

Σ_{i=0}^{K−1} a_{i} φ(x_{i}) ≤ φ( Σ_{i=0}^{K−1} a_{i} x_{i} ).        (5.50)
Setting a_{i} = p(s′_{i}) and x_{i} = Δ_{i}² in (5.50) bounds the second term of (5.49), which gives

R ≥ h(S) − (1/2) log_{2} ( Σ_{i=0}^{K−1} p(s′_{i}) Δ_{i}² ).        (5.51)

By additionally using the distortion approximation (5.41), i.e., Σ_{i} p(s′_{i}) Δ_{i}² = 12 D, we obtain

D_{V}(R) = σ² ε_{V}² 2^{−2R}   with   ε_{V}² = 2^{2h(S)} / (12 σ²).        (5.52)
Similarly as for the Panter and Dite formula, the factor ε_{V}² depends only on the marginal pdf f(s) of the input process. The result (5.52) was established by Gish and Pierce in [17] using variational calculus and is also referred to as the Gish and Pierce formula. The use of Jensen's inequality to obtain the same result was first published in [27].
We now compare the asymptotic operational distortion-rate functions for the discussed quantizer designs with the Shannon lower bound (SLB) for iid sources. All high-rate approximations as well as the Shannon lower bound can be written as
D_{X}(R) = σ² ε_{X}² 2^{−2R},        (5.53)
where the subscript X stands for optimal scalar quantizers with fixed-length codes (F), optimal scalar quantizers with variable-length codes (V), or the Shannon lower bound (L). The factors ε_{X}² depend only on the pdf f(s) of the source random variables. For the high-rate approximations, ε_{F}² and ε_{V}² are given by (5.47) and (5.52), respectively. For the Shannon lower bound, ε_{L}² is equal to 2^{2h(S)} / (2πe σ²), as can easily be derived from (4.68). Table 5.1 provides an overview of the various factors ε_{X}² for three example distributions.
Table 5.1 Comparison of the factors ε_{X}² for three example distributions.

                   Shannon Lower    Panter & Dite             Gish & Pierce
                   Bound (SLB)      (Pdf-Opt w. FLC)          (Uniform Q. w. VLC)
  Uniform pdf      ≈ 0.7            1    (1.53 dB to SLB)     1      (1.53 dB to SLB)
  Laplacian pdf    ≈ 0.86           4.5  (7.1 dB to SLB)      ≈ 1.23 (1.53 dB to SLB)
  Gaussian pdf     1                ≈ 2.72 (4.34 dB to SLB)   ≈ 1.42 (1.53 dB to SLB)
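The entries of the table can be recomputed by numerical integration using the formulas above: for unit-variance pdfs, ε_{L}² = 2^{2h(S)}/(2πe), ε_{F}² = (∫ f(s)^{1/3} ds)³/12, and ε_{V}² = 2^{2h(S)}/12. The following sketch does this with a simple Riemann sum.

```python
# Sketch: recompute the factors of Table 5.1 for unit-variance pdfs.
import numpy as np

x = np.linspace(-25.0, 25.0, 500_001)
dx = x[1] - x[0]
pdfs = {
    "uniform":   np.where(np.abs(x) <= np.sqrt(3.0), 1 / (2 * np.sqrt(3.0)), 0.0),
    "laplacian": np.exp(-np.sqrt(2.0) * np.abs(x)) / np.sqrt(2.0),
    "gaussian":  np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi),
}
for name, f in pdfs.items():
    pos = f > 0
    h = -(np.where(pos, f * np.log2(np.where(pos, f, 1.0)), 0.0)).sum() * dx
    eps_L2 = 2.0 ** (2 * h) / (2 * np.pi * np.e)    # Shannon lower bound
    eps_F2 = ((f ** (1 / 3)).sum() * dx) ** 3 / 12  # Panter & Dite
    eps_V2 = 2.0 ** (2 * h) / 12                    # Gish & Pierce
    print(f"{name:9s}  SLB {eps_L2:.3f}  P&D {eps_F2:.3f}  G&P {eps_V2:.3f}")
```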
If we reformulate (5.53) as a signal-to-noise ratio (SNR), we obtain
SNR_{X}(R) = 10 log_{10}( σ² / D_{X}(R) ) = −10 log_{10} ε_{X}² + R · 20 log_{10} 2.        (5.54)
For all high-rate approximations including the Shannon lower bound, the SNR is a linear function of the rate with a slope of 20 log_{10} 2 ≈ 6.02. Hence, for high rates, the MSE distortion decreases by approximately 6 dB per bit, independently of the source distribution.
A further remarkable fact is obtained by comparing the asymptotic operational distortion-rate function for optimal scalar quantizers with variable-length codes with the Shannon lower bound. The ratio D_{V}(R)/D_{L}(R) is constant and equal to πe/6, corresponding to 1.53 dB. The corresponding rate difference R_{V}(D) − R_{L}(D) is equal to (1/2) log_{2}(πe/6) ≈ 0.25. At high rates, the distortion of an optimal scalar quantizer with variable-length codes is thus only 1.53 dB larger than the Shannon lower bound, and for low distortions, the rate increase with respect to the Shannon lower bound is only 0.25 bit per sample. Due to this fact, scalar quantization with variable-length coding is extensively used in modern video coding.
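Both constants follow directly from the ratio of the ε² factors derived above:

```latex
\frac{D_V(R)}{D_L(R)} = \frac{\varepsilon_V^2}{\varepsilon_L^2}
  = \frac{2^{2h(S)}/12}{2^{2h(S)}/(2\pi e)} = \frac{\pi e}{6}, \qquad
10\log_{10}\frac{\pi e}{6} \approx 1.53\ \mathrm{dB}, \qquad
\tfrac{1}{2}\log_2\frac{\pi e}{6} \approx 0.25\ \text{bit}.
```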
5.2.4 Approximation for Distortion-Rate Functions
The asymptotic operational distortion-rate functions for scalar quantizers that we have derived in sec. 5.2.3 can only be used as approximations for high rates. For several optimization problems, it is, however, desirable to have a simple and reasonably accurate approximation of the distortion-rate function for the entire range of rates. In the following, we attempt to derive such an approximation for the important case of entropy-constrained scalar quantization (ECSQ).
If we assume that the optimal entropy-constrained scalar quantizer for a particular normalized distribution (zero mean and unit variance) and its operational distortion-rate function g(R) are known, the optimal quantizer for the same distribution but with different mean and variance can be constructed by an appropriate shifting and scaling of the quantization intervals and reconstruction levels. The distortion-rate function D(R) of the resulting scalar quantizer can then be written as
D(R) = σ² · g(R),        (5.55)
where σ² denotes the variance of the input distribution. Hence, it is sufficient to derive an approximation for the normalized operational distortion-rate function g(R).
For optimal ECSQ, the function g(R) and its derivative g′(R) should have the following properties:

g(0) = 1,        (5.56)

lim_{R→∞} g(R) / ( ε_{V}² 2^{−2R} ) = 1,        (5.57)

g′(R) is continuous and less than zero for all R ≥ 0.        (5.58)
A function that satisfies the above conditions is

g(R) = (ε_{V}² / a) · ln( a · 2^{−2R} + 1 ).        (5.59)
The factor a is chosen in a way that g(0) is equal to 1. By numerical optimization, we obtained a = 0.9519 for the Gaussian pdf and a = 0.5 for the Laplacian pdf. For proving that condition (5.57) is fulfilled, we can substitute x = 2^{−2R} and develop the Taylor series of the resulting function
g(x) = (ε_{V}² / a) · ln( a·x + 1 )        (5.60)
around x_{0} = 0, which gives
g(x) ≈ ε_{V}² · ( x − a x²/2 + a² x³/3 − … ).        (5.61)
Since the remaining terms of the Taylor series are negligible for small values of x (large rates R), (5.59) approaches the high-rate approximation ε_{V}² 2^{−2R} as the rate R approaches infinity. The first derivative of (5.59) is given by
g′(R) = − (2 ln 2) · ε_{V}² · 2^{−2R} / ( a · 2^{−2R} + 1 ).        (5.62)
It is continuous and always less than zero.
Fig. 5.12 Operational distortion-rate functions for a Gaussian (left) and Laplacian (right) pdf with unit variance. The diagrams show the (information) distortion-rate function, the high-rate approximation ε_{V}² 2^{−2R}, and the approximation g(R) given in (5.59). Additionally, results of the EC-Lloyd algorithm with the rate being measured as entropy are shown.
The quality of the approximation (5.59) for the operational distortion-rate function of an entropy-constrained quantizer for a Gaussian and a Laplacian pdf is illustrated in Fig. 5.12. For the Gaussian pdf, the approximation provides a sufficiently accurate match to the results of the entropy-constrained Lloyd algorithm and will be used later. For the Laplacian pdf, the approximation is less accurate for low bit rates.
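A quick numerical sanity check of (5.59), using the a-values stated above and the high-rate factors ε_{V}² = πe/6 (Gaussian) and ε_{V}² = e²/6 (Laplacian):

```python
# Check g(0) ~ 1 and the convergence of g(R) to the high-rate asymptote.
import numpy as np

cases = {"gaussian": (np.pi * np.e / 6, 0.9519), "laplacian": (np.e ** 2 / 6, 0.5)}
for name, (eps_v2, a) in cases.items():
    g = lambda R: (eps_v2 / a) * np.log(a * 2.0 ** (-2 * R) + 1)
    print(f"{name}: g(0) = {g(0):.4f},  g(4)/(eps_V^2 * 2^-8) = "
          f"{g(4) / (eps_v2 * 2.0 ** -8):.4f}")
```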
5.2.5 Performance Comparison for Gaussian Sources
In the following, we compare the rate-distortion performance of the discussed scalar quantizer designs with the rate-distortion bound for unit-variance stationary Gauss-Markov sources with ρ = 0 and ρ = 0.9. The distortion-rate functions for both sources, the operational distortion-rate functions for PCM (uniform, fixed-rate), the Lloyd design, and the entropy-constrained Lloyd design (EC-Lloyd), as well as the Panter & Dite and Gish & Pierce asymptotes are depicted in Fig. 5.13. The rate for quantizers with fixed-length codes is given by the binary logarithm of the quantizer size K. For quantizers with variable-length codes, it is measured as the entropy of the reconstruction levels.
The scalar quantizer designs behave identically for both sources, as only the marginal pdf f(s) is relevant for the quantizer design algorithms. For high rates, the entropy-constrained Lloyd design and the Gish & Pierce approximation yield an SNR that is 1.53 dB smaller than the (information) distortion-rate function for the Gauss-Markov source with ρ = 0. The rate-distortion performance of the quantizers with fixed-length codes is worse, particularly for rates above 1 bit per sample. It is, however, important to note that this does not mean that the Lloyd algorithm yields a worse performance than the entropy-constrained Lloyd algorithm. Both quantizers are (locally) optimal with respect to their application area. The Lloyd algorithm results in an optimized quantizer for fixed-length coding, while the entropy-constrained Lloyd algorithm yields an optimized quantizer for variable-length coding (with an average codeword length close to the entropy).
The distortion-rate function for the Gauss-Markov source with ρ = 0.9 is far away from the operational distortion-rate functions of the investigated scalar quantizer designs. The reason is that we assumed a lossless coding γ that achieves a rate close to the entropy H(S′) of the output process. A combination of scalar quantization and advanced lossless coding techniques that exploit dependencies between the output samples is discussed in the next section.
5.2.6 Scalar Quantization for Sources with Memory
In the previous sections, we concentrated on combinations of scalar quantization with lossless coding techniques that do not exploit dependencies between the output samples. As a consequence, the rate-distortion performance depended only on the marginal pdf of the input process, and for stationary sources with memory the performance was identical to the performance for iid sources with the same marginal distribution. If we, however, apply scalar quantization to sources with memory, the output samples are not independent. The dependencies can be exploited by advanced lossless coding techniques such as conditional Huffman codes, block Huffman codes, or arithmetic codes that use conditional pmfs in the probability modeling stage.
The design goal of Lloyd quantizers was to minimize the distortion for a quantizer of a given size K. Hence, the Lloyd quantizer design does not change for sources with memory. The design of the entropy-constrained Lloyd quantizer, however, can be extended by considering advanced entropy coding techniques. The conditions for the determination of the reconstruction levels and decision thresholds (given the average codeword lengths ℓ̄(s′_{i})) do not change; only the determination of the average codeword lengths in step 4 of the entropy-constrained Lloyd algorithm needs to be modified. We can design a lossless code such as a conditional or block Huffman code based on the joint pmf of the output samples (which is given by the joint pdf of the input source and the decision thresholds) and determine the resulting average codeword lengths. But, following the same arguments as in sec. 5.2.2, we can also approximate the average codeword lengths based on the corresponding conditional entropy or block entropy.
For the following considerations, we assume that the input source is stationary and that its joint pdf for N successive samples is given by f_{N}(s). If we employ a conditional lossless code (conditional Huffman code or arithmetic code) that exploits the conditional pmf of a current output sample S′_{n} given the last N output samples S′_{n−1}, …, S′_{n−N}, the average codeword lengths ℓ̄(s′_{i}) can be set equal to the ratio of the contribution of s′_{i} to the conditional entropy H(S′_{n} | S′_{n−1}, …, S′_{n−N}) and the symbol probability p(s′_{i}),
ℓ̄(s′_{i}) = − (1 / p(s′_{i})) Σ_{k=0}^{K^{N}−1} p_{N+1}(s′_{i}, s′_{k}) · log_{2}( p_{N+1}(s′_{i}, s′_{k}) / p_{N}(s′_{k}) ),        (5.63)
where k is an index that indicates one of the K^{N} combinations s′_{k} of the last N output samples, p is the marginal pmf of the output samples, and p_{N} and p_{N+1} are the joint pmfs for N and N + 1 successive output samples, respectively. It should be noted that the argument of the logarithm represents the conditional pmf of an output sample S′_{n} given the N preceding output samples.
Each joint pmf for N successive output samples, including the marginal pmf p with N = 1, is determined by the joint pdf f_{N} of the input source and the decision thresholds,
p_{N}(s′_{k}) = ∫_{u_{k}}^{u_{k+1}} f_{N}(s) ds,        (5.64)
where u_{k} and u_{k+1} represent the ordered sets of lower and upper interval boundaries for the vector s′_{k} of output samples. Hence, the average codeword lengths ℓ̄(s′_{i}) can be directly derived based on the joint pdf of the input process and the decision thresholds. In a similar way, the average codeword lengths for block codes of N samples can be approximated based on the block entropy for N successive output samples.
We now investigate the asymptotic operational distortion-rate function for high rates. If we again assume that we employ a conditional lossless code that exploits the conditional pmf using the N preceding output samples, the rate R can be approximated by the corresponding conditional entropy H(S′_{n} | S′_{n−1}, …, S′_{n−N}),
R ≈ H(S′_{n} | S′_{n−1}, …, S′_{n−N}) = −Σ_{i} Σ_{k} p_{N+1}(s′_{i}, s′_{k}) log_{2}( p_{N+1}(s′_{i}, s′_{k}) / p_{N}(s′_{k}) ).        (5.65)
For small quantization intervals Δ_{i} (high rates), we can assume that the joint pdfs f_{N} of the input source are nearly constant inside each N-dimensional hypercube given by a combination of quantization intervals, which yields the approximations
p_{N}(s′_{k}) ≈ f_{N}(s′_{k}) · Δ_{k}   and   p_{N+1}(s′_{i}, s′_{k}) ≈ f_{N+1}(s′_{i}, s′_{k}) · Δ_{i} · Δ_{k},        (5.66)
where Δ_{k} represents the product of the quantization interval sizes that are associated with the vector of reconstruction levels s′_{k}. By inserting these approximations into (5.65), we obtain
R ≈ h(S_{n} | S_{n−1}, …, S_{n−N}) − Σ_{i} p(s′_{i}) log_{2} Δ_{i},        (5.69)
which is similar to (5.49). In the same way as for (5.49) in sec. 5.2.3, we can now apply Jensen's inequality and then insert the high-rate approximation (5.41) for the MSE distortion measure. As a consequence of Jensen's inequality, we note that also for conditional lossless codes, the optimal quantizer design for high rates has uniform quantization step sizes. The asymptotic operational distortion-rate function for an optimal quantizer with conditional lossless codes is given by
D_{C}(R) = (1/12) · 2^{2 h(S_{n} | S_{n−1}, …, S_{n−N})} · 2^{−2R}.        (5.70)
In comparison to the Gish & Pierce asymptote (5.52), the first-order differential entropy h(S) is replaced by the conditional differential entropy of S_{n} given the N preceding input samples.
In a similar way, we can also derive the asymptotic operational distortion-rate function for block entropy codes (such as the block Huffman code) of size N. We obtain the result that also for block entropy codes, the optimal quantizer design for high rates has uniform quantization step sizes. The corresponding asymptotic operational distortion-rate function is
D_{B}(R) = (1/12) · 2^{2 h(S_{n}, …, S_{n+N−1}) / N} · 2^{−2R},        (5.71)
where h(S_{n}, …, S_{n+N−1}) denotes the joint differential entropy of N successive input symbols.
The achievable operational distortion-rate function depends on the complexity of the applied lossless coding technique (which is basically given by the parameter N). For investigating the asymptotically achievable operational distortion-rate function for arbitrarily complex entropy coding techniques, we take the limit N → ∞, which yields
D(R) = (1/12) · 2^{2 h̄(S)} · 2^{−2R},        (5.72)
where h̄(S) denotes the differential entropy rate of the input source. A comparison with the Shannon lower bound (4.65) shows that the asymptotically achievable distortion for high rates and arbitrarily complex entropy coding is 1.53 dB larger than the fundamental performance bound. The corresponding rate increase is 0.25 bit per sample. It should be noted that this asymptotic bound can only be achieved for high rates. Furthermore, in general, the entropy coding would require the storage of a very large set of codewords or conditional probabilities, which is virtually impossible in real applications.
5.3 Vector Quantization

The investigation of scalar quantization (SQ) showed that it is impossible to achieve the fundamental performance bound using a source coding system consisting of scalar quantization and lossless coding. For high rates, the difference to the fundamental performance bound is 1.53 dB or 0.25 bit per sample. This gap can only be reduced if multiple samples are jointly quantized, i.e., by vector quantization (VQ). Although vector quantization is rarely used in video coding, we will give a brief overview in order to illustrate its design, performance, complexity, and the reason for the limitation of scalar quantization.
In N-dimensional vector quantization, an input vector s consisting of N samples is mapped to a set of K reconstruction vectors {s′_{i}}. We will generally assume that the input vectors are blocks of N successive samples of a realization of a stationary random process {S_{n}}. Similarly as for scalar quantization, we restrict our considerations to regular vector quantizers, for which the quantization cells are convex sets and each reconstruction vector is an element of the associated quantization cell. (Regular quantizers are optimal with respect to the MSE distortion measure. A set of points in ℝ^{N} is convex if, for any two points of the set, all points on the straight line connecting the two points are also elements of the set.) The average distortion and average rate of a vector quantizer are given by (5.5) and (5.7), respectively.
5.3.1 Vector Quantization with Fixed-Length Codes
We first investigate a vector quantizer design that minimizes the distortion D for a given quantizer size K, i.e., the counterpart of the Lloyd quantizer. The necessary conditions for the reconstruction vectors and quantization cells can be derived in the same way as for the Lloyd quantizer in sec. 5.2.1 and are given by
s′_{i} = arg min_{s′} E{ d(S, s′) | Q(S) = i }        (5.73)
and
Q(s) = arg min_{i} d(s, s′_{i}).        (5.74)
The extension of the Lloyd algorithm to vector quantization [46] is referred to as the Linde-Buzo-Gray (LBG) algorithm. For a sufficiently large training set {s_{n}} and a given quantizer size K, the algorithm can be stated as follows:

1. Choose an initial set of K reconstruction vectors {s′_{i}}.
2. Associate all vectors of the training set {s_{n}} with one of the quantization cells C_{i} according to a(s) = arg min_{i} d(s, s′_{i}).
3. Update the reconstruction vectors according to s′_{i} = arg min_{s′} E{ d(S, s′) | a(S) = i }, where the expectation value is taken over the training set.
4. Repeat steps 2 and 3 until convergence.
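A minimal Python sketch of these steps for the MSE distortion is given below; initializing the codebook with randomly selected training vectors is an illustrative choice.

```python
# Sketch of the LBG algorithm for 2-dimensional vector quantization.
import numpy as np

def lbg(train, K, n_iter=50):
    rng = np.random.default_rng(3)
    codebook = train[rng.choice(len(train), K, replace=False)].copy()
    for _ in range(n_iter):
        # step 2: nearest-neighbor assignment (squared error)
        d2 = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        idx = np.argmin(d2, axis=1)
        # step 3: centroid (conditional mean) update over the training set
        for i in range(len(codebook)):
            if np.any(idx == i):
                codebook[i] = train[idx == i].mean(axis=0)
    return codebook, idx

rng = np.random.default_rng(4)
train = rng.normal(size=(50_000, 2))   # N = 2, Gaussian iid, unit variance
codebook, idx = lbg(train, K=16)       # R = log2(16)/2 = 2 bit per sample
mse = np.mean((train - codebook[idx]) ** 2)   # per-sample MSE
print(f"SNR = {-10 * np.log10(mse):.2f} dB")
```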
Fig. 5.14 Illustration of the LBG algorithm for a quantizer with N = 2 and K = 16 and a Gaussian iid process with unit variance. The lines mark the boundaries of the quantization cells, the crosses show the reconstruction vectors, and the lightcolored dots represent the samples of the training set.
As an example, we designed a two-dimensional vector quantizer for a Gaussian iid process with unit variance. The selected quantizer size is K = 16, corresponding to a rate of 2 bit per (scalar) sample. The chosen initialization as well as the obtained quantization cells and reconstruction vectors after the 8th and 49th iteration of the LBG algorithm are illustrated in Fig. 5.14. In Fig. 5.15, the distortion is plotted as a function of the iteration step.
After the 8th iteration, the two-dimensional vector quantizer shows a similar distortion (9.30 dB) as the scalar Lloyd quantizer at the same rate of R = 2 bit per (scalar) sample. This can be explained by the fact that the quantization cells are approximately rectangular and that such rectangular cells would also be constructed by a corresponding scalar quantizer (if we illustrate the result for 2 consecutive samples). After the 49th iteration, the cells of the vector quantizer are shaped in a way that a scalar quantizer cannot create, and the SNR is improved to 9.67 dB.
Fig. 5.16 Illustration of the LBG algorithm for a quantizer with N = 2 and K = 256 and a Gaussian iid process with unit variance: (left) resulting quantization cells and reconstruction vectors after 49 iterations; (right) distortion as a function of the iteration step.
Fig. 5.16 shows the result of the LBG algorithm for a vector quantizer with N = 2 and K = 256, corresponding to a rate of R = 4 bit per sample, for the Gaussian iid source with unit variance. After the 49th iteration, the gain of two-dimensional VQ compared to SQ with fixed-length codes is around 0.9 dB, resulting in an SNR of 20.64 dB (compared to a conjectured optimum of 21.05 dB [50]). The result indicates that at higher bit rates, the gain of VQ relative to SQ with fixed-length codes increases.
Fig. 5.17 Results of the LBG algorithm for a two-dimensional VQ with a size of K = 16 (top) and K = 256 (bottom) for a Laplacian iid source with unit variance.
Fig. 5.17 illustrates the results of a two-dimensional VQ design for a Laplacian iid source with unit variance and two different quantizer sizes K. For K = 16, which corresponds to a rate of R = 2 bit per sample, the SNR is 8.87 dB. Compared to SQ with fixed-length codes at the same rate, a gain of 1.32 dB has been achieved. For a rate of R = 4 bit per sample (K = 256), the SNR gain is increased to 1.84 dB, resulting in an SNR of 19.4 dB (compared to a conjectured optimum of 19.99 dB [50]).
5.3.2 Vector Quantization with Variable-Length Codes
For designing a vector quantizer with variable-length codes, we have to minimize the distortion D subject to a rate constraint, which can be effectively done using Lagrangian optimization. Following the arguments in sec. 5.2.2, it is justified to approximate the rate by the entropy H(Q(S)) of the output vectors and to set the average codeword lengths equal to ℓ̄(s′_{i}) = −log_{2} p(s′_{i}). Such a quantizer design is also referred to as entropy-constrained vector quantization (ECVQ). The necessary conditions for the reconstruction vectors and quantization cells can be derived in the same way as for the entropy-constrained scalar quantizer (ECSQ) and are given by (5.73) and
Q(s) = arg min_{i} [ d(s, s′_{i}) + λ ℓ̄(s′_{i}) ].        (5.75)
The extension of the entropy-constrained Lloyd algorithm to vector quantization [9] is also referred to as the Chou-Lookabaugh-Gray (CLG) algorithm. For a sufficiently large training set {s_{n}} and a given Lagrange parameter λ, the CLG algorithm can be stated as follows:

1. Choose an initial quantizer size N, an initial set of reconstruction vectors {s′_{i}}, and an initial set of average codeword lengths ℓ̄(s′_{i}).
2. Associate all vectors of the training set {s_{n}} with one of the quantization cells C_{i} according to a(s) = arg min_{i} [ d(s, s′_{i}) + λ ℓ̄(s′_{i}) ].
3. Update the reconstruction vectors according to s′_{i} = arg min_{s′} E{ d(S, s′) | a(S) = i }, where the expectation value is taken over the training set.
4. Update the average codeword lengths according to ℓ̄(s′_{i}) = −log_{2} p(s′_{i}).
5. Repeat steps 2 to 4 until convergence.
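Compared to the LBG sketch above, only the assignment cost and the codeword-length update change. A sketch with illustrative parameters:

```python
# Sketch of the CLG algorithm (entropy-constrained vector quantization).
import numpy as np

def clg(train, n_init, lam, n_iter=50):
    rng = np.random.default_rng(5)
    cb = train[rng.choice(len(train), n_init, replace=False)].copy()
    lengths = np.full(n_init, np.log2(n_init))   # initial FLC lengths
    for _ in range(n_iter):
        # step 2: Lagrangian assignment cost d(s,s'_i) + lam * length_i
        d2 = ((train[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        idx = np.argmin(d2 + lam * lengths, axis=1)
        p = np.bincount(idx, minlength=len(cb)) / len(train)
        # step 3: conditional-mean update of all used reconstruction vectors
        for i in np.flatnonzero(p):
            cb[i] = train[idx == i].mean(axis=0)
        # step 4: codeword lengths from the empirical pmf; drop unused cells
        keep = p > 0
        cb, lengths = cb[keep], -np.log2(p[keep])
    return cb, lengths

rng = np.random.default_rng(6)
train = rng.normal(size=(30_000, 2))   # N = 2, Gaussian iid, unit variance
cb, lengths = clg(train, n_init=64, lam=0.1)
print(f"codebook size after convergence: {len(cb)}")
```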
As examples, we designed two-dimensional ECVQs for a Gaussian and a Laplacian iid process with unit variance and an average rate, measured as entropy, of R = 2 bit per sample. The results of the CLG algorithm are illustrated in Fig. 5.18. The SNR gain compared to an ECSQ design with the same rate is 0.26 dB for the Gaussian and 0.37 dB for the Laplacian distribution.
Fig. 5.18 Results of the CLG algorithm for a Gaussian (top) and Laplacian (bottom) iid source with unit variance and a rate (entropy) of R = 2 bit per sample. The dashed line in the diagrams on the right shows the distortion for an ECSQ design with the same rate.
5.3.3 The Vector Quantization Advantage
The examples for the LBG and CLG algorithms showed that vector quantization increases the coding efficiency compared to scalar quantization. According to the intuitive analysis in [52], the performance gain can be attributed to three different effects: the space filling advantage, the shape advantage, and the memory advantage. In the following, we will briefly explain and discuss these advantages. We will see that the space filling advantage is the only effect that can be exclusively achieved with vector quantization. The associated performance gain is bounded by 1.53 dB or 0.25 bit per sample. This bound is asymptotically achieved for large quantizer dimensions and large rates, and corresponds exactly to the gap between the operational distortion-rate function for scalar quantization with arbitrarily complex entropy coding and the rate-distortion bound at high rates. For a deeper analysis of the vector quantization advantages, the reader is referred to the discussion in [52] and the quantitative analysis in [50].
When we analyze the results of scalar quantization in higher dimensions, we see that the N-dimensional space is partitioned into N-dimensional hyperrectangles (Cartesian products of intervals). This does, however, not represent the densest packing in ℝ^{N}. With vector quantization of dimension N, we have extra freedom in choosing the shapes of the quantization cells. The associated increase in coding efficiency is referred to as the space filling advantage.
The space filling advantage can be observed in the example for the LBG algorithm with N = 2 and a Gaussian iid process in Fig. 5.14. After the 8th iteration, the distortion is approximately equal to the distortion of the scalar Lloyd quantizer with the same rate, and the reconstruction cells are approximately rectangular. However, the densest packing in two dimensions is achieved by hexagonal quantization cells. After the 49th iteration of the LBG algorithm, the quantization cells in the center of the distribution look approximately like hexagons. For higher rates, the convergence toward hexagonal cells is even better visible, as can be seen in Figs. 5.16 and 5.17.
Fig. 5.19 Convergence of the LBG algorithm with N = 2 toward hexagonal quantization cells for a uniform iid process.
To further illustrate the space filling advantage, we have conducted another experiment for a uniform iid process with an amplitude range of width A = 10. The operational distortion-rate function for scalar quantization is given by D(R) = (A²/12) · 2^{−2R}. For a scalar quantizer of size K = 10, we obtain a rate (measured as entropy) of 3.32 bit per sample and an SNR of 19.98 dB. The LBG design with N = 2 and K = 100 is associated with about the same rate. The partitioning converges toward a hexagonal lattice, as illustrated in Fig. 5.19, and the SNR is increased to 20.08 dB.
The gain due to choosing the densest packing is independent of the source distribution or any statistical dependencies between the random variables of the input process. The space filling gain is bounded by 1.53 dB, which can be asymptotically achieved for high rates if the dimensionality of the vector quantizer approaches infinity [50].
Fig. 5.20 Shape advantage for Gaussian and Laplacian iid sources as a function of the vector quantizer dimension N.
The shape advantage describes the effect that the quantization cells of optimal VQ designs adapt to the shape of the source pdf. In the examples for the CLG algorithm, we have, however, seen that, even though ECVQ provides a better performance than VQ with fixed-length codes, the gain due to VQ is reduced if we employ variable-length coding for both VQ and SQ. When comparing ECVQ with ECSQ for iid sources, the gain of VQ reduces to the space filling advantage, while the shape advantage is exploited by the variable-length coding. However, VQ with fixed-length codes can also exploit the gain that ECSQ shows compared to SQ with fixed-length codes [50].
The shape advantage for high rates has been estimated in [50]. Fig. 5.20 shows this gain for Gaussian and Laplacian iid random processes. In practice, the shape advantage is exploited by using scalar quantization in combination with entropy coding techniques such as Huffman coding or arithmetic coding.
For sources with memory, there are linear or nonlinear dependencies between the samples. In optimal VQ designs, the partitioning of the N-dimensional space into quantization cells is chosen in a way that these dependencies are exploited. This is illustrated in Fig. 5.21, which shows the ECVQ result of the CLG algorithm for N = 2 and a Gauss-Markov process with a correlation factor of ρ = 0.9 for two different values of the Lagrange parameter λ.
A quantitative estimation of the gain resulting from the memory advantage at high rates was done in [50]. Fig. 5.22 shows the memory gain for Gauss-Markov sources with different correlation factors as a function of the quantizer dimension N.
Fig. 5.22 Memory gain as a function of the quantizer dimension N for Gauss-Markov sources with different correlation factors ρ.
For sources with strong dependencies between the samples, such as video signals, the memory gain is much larger than the shape and space filling gains. In video coding, a suitable exploitation of the statistical dependencies between samples is one of the most relevant design aspects. The linear dependencies between samples can also be exploited by combining scalar quantization with linear prediction or linear transforms. These techniques are discussed in chapters 6 and 7. By combining scalar quantization with the advanced entropy coding techniques discussed in sec. 5.2.6, it is possible to partially exploit both linear and nonlinear dependencies.
5.3.4 Performance and Complexity
For further evaluating the performance of vector quantization, we compared the operational distortion-rate functions of CLG designs with different quantizer dimensions N to the rate-distortion bound and to the operational distortion-rate functions of scalar quantizers with fixed-length and variable-length codes. (In this comparison, it is assumed that the dependencies between the output samples or output vectors are not exploited by the applied lossless coding.) The corresponding rate-distortion curves for a Gauss-Markov process with a correlation factor of ρ = 0.9 are depicted in Fig. 5.23. For quantizers with fixed-length codes, the rate is given by the binary logarithm of the quantizer size K; for quantizers with variable-length codes, the rate is measured as the entropy of the reconstruction levels or reconstruction vectors.
Fig. 5.23 Estimated vector quantization advantage at high rates [50] for a Gauss-Markov source with a correlation factor of ρ = 0.9.
The operational distortion-rate curves for vector quantizers of dimensions N = 2, 5, 10, and 100, labeled with "VQ, K = N(e)", show the theoretical performance for high rates, which has been estimated in [50]. These theoretical results have been verified for N = 2 by designing entropy-constrained vector quantizers using the CLG algorithm. The theoretical vector quantizer performance for a quantizer dimension of N = 100 is very close to the distortion-rate function of the investigated source. In fact, vector quantization can asymptotically achieve the rate-distortion bound as the dimension N approaches infinity. Moreover, vector quantization can be interpreted as the most general lossy source coding system. Each source coding system that maps a vector of N samples to one of K codewords (or codeword sequences) can be designed as a vector quantizer of dimension N and size K.
Despite its excellent coding efficiency, vector quantization is rarely used in video coding. The main reason is the associated complexity. On the one hand, a general vector quantizer requires the storage of a large codebook. This issue becomes even more problematic for systems that must be able to encode and decode sources at different bit rates, as is required for video codecs. On the other hand, the computational complexity for associating an input vector with the best reconstruction vector in the rate-distortion sense is very large in comparison to the encoding process for the scalar quantizers used in practice. One way to reduce the requirements on storage and computational complexity is to impose structural constraints on the vector quantizer. Examples of such structural constraints include tree-structured VQ, multistage VQ, shape-gain VQ, lattice codebook VQ, and predictive VQ.
In particular, predictive VQ can be seen as a generalization of a number of very popular techniques, including motion compensation in video coding. For the actual quantization, video codecs mostly use a simple scalar quantizer with uniformly distributed reconstruction levels (sometimes with a dead-zone around zero), which is combined with entropy coding and techniques such as linear prediction or linear transforms in order to exploit the shape of the source distribution and the statistical dependencies of the source. For video coding, the complexity of vector quantizers, including those with structural constraints, is considered too large in relation to the achievable performance gains.
In this chapter, we have discussed quantization, starting with scalar quantizers. The Lloyd quantizer, which is constructed using an iterative procedure, provides the minimum distortion for a given number of reconstruction levels. It is the optimal quantizer design if the reconstruction levels are transmitted using fixed-length codes. The extension of the quantizer design to variable-length codes is achieved by minimizing the distortion D subject to a rate constraint R < R_{max}, which can be formulated as the minimization of a Lagrangian functional D + λR. The corresponding iterative design algorithm includes a sufficiently accurate estimation of the codeword lengths that are associated with the reconstruction levels. Usually, the codeword lengths are estimated based on the entropy of the output signal, in which case the quantizer design is also referred to as the entropy-constrained Lloyd quantizer.
At high rates, the operational distortion-rate functions for scalar quantization with fixed-length and variable-length codes as well as the Shannon lower bound can be described by
D_{X}(R) = σ² ε_{X}² 2^{−2R},        (5.76)
where X either indicates the Shannon lower bound or scalar quantization with fixed-length or variable-length codes. For a given X, the factor ε_{X}² depends only on the statistical properties of the input source. If the output samples are coded with an arbitrarily complex entropy coding scheme, the difference between the operational distortion-rate function for optimal scalar quantization and the Shannon lower bound is 1.53 dB or 0.25 bit per sample at high rates. Another remarkable result is that, at high rates, optimal scalar quantization with variable-length codes is achieved if all quantization intervals have the same size.
In the second part of the chapter, we discussed the extension of scalar quantization to vector quantization, by which the rate-distortion bound can be asymptotically achieved as the quantizer dimension approaches infinity. The coding efficiency improvement of vector quantization relative to scalar quantization can be attributed to three different effects: the space filling advantage, the shape advantage, and the memory advantage. While the space filling advantage can only be achieved by vector quantizers, the shape and memory advantages can also be exploited by combining scalar quantization with suitable entropy coding and techniques such as linear prediction and linear transforms.
Despite its superior rate-distortion performance, vector quantization is rarely used in video coding applications because of its complexity. Instead, modern video codecs combine scalar quantization with entropy coding, linear prediction, and linear transforms in order to achieve a high coding efficiency at a moderate complexity level.
6 Predictive Coding
In the previous chapter, we investigated the design and rate-distortion performance of quantizers. We showed that the fundamental rate-distortion bound can be virtually achieved by unconstrained vector quantization of a sufficiently large dimension. However, due to the very large amount of data in video sequences and the real-time requirements that are found in most video coding applications, only low-complexity scalar quantizers are typically used in this area. For iid sources, the achievable operational distortion-rate function for high-rate scalar quantization lies at most 1.53 dB or 0.25 bit per sample above the fundamental rate-distortion bound. This represents a suitable trade-off between coding efficiency and complexity. But if there is a large amount of dependencies between the samples of an input signal, as is the case in video sequences, the rate-distortion performance of simple scalar quantizers becomes significantly worse than the rate-distortion bound. A source coding system consisting of a scalar quantizer and an entropy coder can exploit the statistical dependencies in the input signal only if the entropy coder uses higher-order conditional or joint probability models. The complexity of such an entropy coder is, however, close to that of a vector quantizer, so that such a design is unsuitable in practice. Furthermore, video sequences are highly nonstationary, and conditional or joint probabilities for nonstationary sources are typically very difficult to estimate accurately. It is desirable to combine scalar quantization with additional tools that can efficiently exploit the statistical dependencies in a source at a low complexity level. One such coding concept is predictive coding, which we will investigate in this chapter. The concepts of prediction and predictive coding are widely used in modern video coding. Well-known examples are intra prediction, motion-compensated prediction, and motion vector prediction.
The basic structure of predictive coding is illustrated in Fig. 6.1 using the notation of random variables. The source samples {s_{n}} are not directly quantized. Instead, each sample s_{n} is predicted from previous samples. The prediction value ŝ_{n} is subtracted from the value of the input sample s_{n}, yielding a residual or prediction error sample u_{n} = s_{n} − ŝ_{n}. The residual sample u_{n} is then quantized using scalar quantization. The output of the quantizer is a reconstructed value u′_{n} for the residual sample u_{n}. At the decoder side, the reconstruction u′_{n} of the residual sample is added to the prediction ŝ_{n}, yielding the reconstructed output sample s′_{n} = ŝ_{n} + u′_{n}.
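A minimal sketch of this encoder/decoder loop, assuming a fixed first-order linear predictor that operates on the reconstructed samples and a uniform residual quantizer; the Gauss-Markov test source and all parameter values are illustrative.

```python
# Sketch of a predictive coding (DPCM) loop with scalar quantization.
import numpy as np

rng = np.random.default_rng(7)
rho, n = 0.9, 10_000
s = np.zeros(n)
for i in range(1, n):                        # Gauss-Markov test source
    s[i] = rho * s[i - 1] + rng.normal() * np.sqrt(1 - rho ** 2)

h, delta = 0.9, 0.25                         # predictor coefficient, step size
s_rec, prev = np.zeros(n), 0.0
for i in range(n):
    s_hat = h * prev                         # prediction from reconstruction
    u = s[i] - s_hat                         # prediction error sample
    u_rec = np.round(u / delta) * delta      # uniform scalar quantizer
    s_rec[i] = s_hat + u_rec                 # decoder-side reconstruction
    prev = s_rec[i]
print(f"reconstruction SNR = {-10 * np.log10(np.mean((s - s_rec) ** 2)):.2f} dB")
```

Note that the encoder forms its prediction from the reconstructed sample s′_{n−1} rather than the original s_{n−1}, so that encoder and decoder stay synchronized; this point is discussed in detail later in the chapter.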
Intuitively, we can say that the better the future of a random process is predicted from its past and the more redundancy the random process contains, the less new information is contributed by each successive observation of the process. In the context of predictive coding, the predictors ŝ_{n} should be chosen in a way that they can be easily computed and result in a rate distortion efficiency of the predictive coding system that is as close as possible to the rate distortion bound.
In this chapter, we discuss the design of predictors with the emphasis on linear predictors and analyze predictive coding systems. For further details, the reader is referred to the classic tutorial [51], and the detailed treatments in [75] and [24].
Prediction is a statistical estimation procedure in which the value of a particular random variable S_{n} of a random process {S_{n}} is estimated based on the values of other random variables of the process. Let B_{n} be a set of observed random variables. As a typical example, the observation set can represent the N random variables B_{n} = {S_{n−1}, S_{n−2}, …, S_{n−N}} that precede the random variable S_{n} to be predicted. The predictor for the random variable S_{n} is a deterministic function of the observation set B_{n} and is denoted by A_{n}(B_{n}). In the following, we will omit this functional notation and consider the prediction of a random variable S_{n} as another random variable denoted by Ŝ_{n},
Ŝ_{n} = A_{n}(B_{n}).        (6.1)
The prediction error or residual is given by the difference between the random variable S_{n} to be predicted and its prediction Ŝ_{n}. It can also be interpreted as a random variable and is denoted by U_{n},
U_{n} = S_{n} − Ŝ_{n}.        (6.2)
If we predict all random variables of a random process {S_{n}}, the sequence of predictions {Ŝ_{n}} and the sequence of residuals {U_{n}} are random processes. The prediction can then be interpreted as a mapping of an input random process {S_{n}} to an output random process {U_{n}} representing the sequence of residuals as illustrated in Fig. 6.2.
In order to derive optimal predictors, we first have to discuss how the goodness of a predictor can be evaluated. In the context of predictive coding, the ultimate goal is to achieve the minimum distortion between the original and reconstructed samples subject to a given maximum rate. For the MSE distortion measure (or, in general, for all additive difference distortion measures), the distortion between a vector of N input samples s and the associated vector of reconstructed samples s′ is equal to the distortion between the corresponding vector of residuals u and the associated vector of reconstructed residuals u′,
d(s, s′) = (1/N) Σ_{i} (s_{i} − s′_{i})² = (1/N) Σ_{i} ( (u_{i} + ŝ_{i}) − (u′_{i} + ŝ_{i}) )² = d(u, u′).        (6.3)
Hence, the operational distortion-rate function of a predictive coding system is equal to the operational distortion-rate function for scalar quantization of the prediction residuals. As stated in sec. 5.2.4, the operational distortion-rate curve for scalar quantization of the residuals can be stated as D(R) = σ_{U}² · g(R), where σ_{U}² is the variance of the residuals and the function g(R) depends only on the type of the distribution of the residuals. Hence, the rate-distortion efficiency of a predictive coding system depends on the variance of the residuals and the type of their distribution. We will neglect the dependency on the distribution type and define that a predictor A_{n}(B_{n}) given an observation set B_{n} is optimal if it minimizes the variance σ_{U}² of the prediction error. In the literature [51, 75, 24], the most commonly used criterion for the optimality of a predictor is the minimization of the MSE between the input signal and its prediction. This is equivalent to the minimization of the second moment ϵ_{U}² = σ_{U}² + μ_{U}², i.e., the energy, of the prediction error signal. Since the minimization of the second moment ϵ_{U}² implies a minimization of the variance σ_{U}² and of the mean μ_{U} (we will later prove this statement for linear prediction), we will also consider the minimization of the mean squared prediction error ϵ_{U}².
When considering the more general criterion of the mean squared prediction error, the selection of the optimal predictor A_{n}(B_{n}) given an observation set B_{n} is equivalent to the minimization of
ϵ_{U}² = E{ U_{n}² } = E{ ( S_{n} − A_{n}(B_{n}) )² }.        (6.4)
The solution to this minimization problem is given by the conditional mean of the random variable S_{n} given the observation set B_{n},
Ŝ_{n}* = A_{n}*(B_{n}) = E{ S_{n} | B_{n} }.        (6.5)
This can be proved by using the formulation

ϵ_{U}² = E{ ( S_{n} − E{S_{n} | B_{n}} + E{S_{n} | B_{n}} − A_{n}(B_{n}) )² }
     = E{ ( S_{n} − E{S_{n} | B_{n}} )² } + 2 E{ ( S_{n} − E{S_{n} | B_{n}} ) ( E{S_{n} | B_{n}} − A_{n}(B_{n}) ) } + E{ ( E{S_{n} | B_{n}} − A_{n}(B_{n}) )² }.

Since E{S_{n} | B_{n}} and A_{n}(B_{n}) are deterministic functions given the observation set B_{n}, the middle term is equal to zero and we can write

ϵ_{U}² = E{ ( S_{n} − E{S_{n} | B_{n}} )² } + E{ ( E{S_{n} | B_{n}} − A_{n}(B_{n}) )² } ≥ E{ ( S_{n} − E{S_{n} | B_{n}} )² },        (6.9)

which proves that the conditional mean E{S_{n} | B_{n}} minimizes the mean squared prediction error for a given observation set B_{n}.
We will show later that in predictive coding the observation set B_{n} must consist of reconstructed samples. If we, for example, use the last N reconstructed samples as observation set, B_{n} = {S′_{n−1}, …, S′_{n−N}}, it is conceptually possible to construct a table in which the conditional expectations E{S_{n} | s′_{n−1}, …, s′_{n−N}} are stored for all possible combinations of the values of s′_{n−1} to s′_{n−N}. This is in some way similar to scalar quantization with an entropy coder that employs the conditional probabilities p(s_{n} | s′_{n−1}, …, s′_{n−N}) and does not significantly reduce the complexity. For obtaining a low-complexity alternative to this scenario, we have to introduce structural constraints for the predictor A_{n}(B_{n}). Before we state a reasonable structural constraint, we derive the optimal predictors according to (6.5) for two examples.
As a first example, we consider a stationary Gaussian source and derive the optimal predictor for a random variable S_{n} given a vector S_{n−k} = (S_{n−k}, …, S_{n−k−N+1})^{T}, with k > 0, of N preceding samples. The conditional distribution f(S_{n} | S_{n−k}) of jointly Gaussian random variables is also Gaussian. The conditional mean E{S_{n} | S_{n−k}} and thus the optimal predictor is given by (see, for example, [26])
Ŝ_{n} = E{ S_{n} | S_{n−k} } = μ_{S} + c_{k}^{T} C_{N}^{−1} ( S_{n−k} − μ_{S} e_{N} ),        (6.10)
where μ_{S} represents the mean of the Gaussian process, e_{N} is the Ndimensional vector with all elements equal to 1, and C_{N} is the Nth order autocovariance matrix, which is given by
C_{N} = E{ ( S_{n−k} − μ_{S} e_{N} ) ( S_{n−k} − μ_{S} e_{N} )^{T} }.        (6.11)
The vector c_{k} is an autocovariance vector and is given by
c_{k} = E{ ( S_{n} − μ_{S} ) ( S_{n−k} − μ_{S} e_{N} ) }.        (6.12)
Autoregressive processes are an important model for random sources. An autoregressive process of order m, also referred to as an AR(m) process, is given by the recursive formula

S_{n} = Z_{n} + μ_{S} + Σ_{i=1}^{m} a_{i} ( S_{n−i} − μ_{S} ) = Z_{n} + μ_{S} + a_{m}^{T} ( S_{n−1} − μ_{S} e_{m} ),        (6.13)

where μ_{S} is the mean of the random process, a_{m} = (a_{1}, …, a_{m})^{T} is a constant parameter vector, and {Z_{n}} is a zero-mean iid process. We consider the prediction of a random variable S_{n} given the vector S_{n−1} of the N directly preceding samples, where N is greater than or equal to the order m. The optimal predictor is given by the conditional mean E{S_{n} | S_{n−1}}. By defining an N-dimensional parameter vector a_{N} = (a_{1}, …, a_{m}, 0, …, 0)^{T}, we obtain

Ŝ_{n} = E{ S_{n} | S_{n−1} } = μ_{S} + a_{N}^{T} ( S_{n−1} − μ_{S} e_{N} ).        (6.14)

For both considered examples, the optimal predictor is a linear function of the observation vector. In a strict sense, it is an affine function if the mean μ_{S} of the considered process is nonzero. If we only want to minimize the variance of the prediction residual, we do not need the constant offset and can use strictly linear predictors. For predictive coding systems, affine predictors have the advantage that the scalar quantizer can be designed for zero-mean sources. Due to their simplicity and their effectiveness for a wide range of random processes, linear (and affine) predictors are the most important class of predictors for video coding applications. It should, however, be noted that nonlinear dependencies in the input process cannot be exploited using linear or affine predictors. In the following, we will concentrate on the investigation of linear prediction and linear predictive coding.
In the following, we consider linear and affine prediction of a random variable S_{n} given an observation vector S_{n−k} = (S_{n−k}, …, S_{n−k−N+1})^{T}, with k > 0, of N preceding samples. We restrict our considerations to stationary processes. In this case, the prediction function A_{n}(S_{n−k}) is independent of the time instant of the random variable to be predicted and is denoted by A(S_{n−k}). For the more general affine form, the predictor is given by
Ŝ_{n} = h_{0} + h_{N}^{T} S_{n−k},        (6.15)
where the constant vector h_{N} = (h_{1}, …, h_{N})^{T} and the constant offset h_{0} are the parameters that characterize the predictor. For linear predictors, the constant offset h_{0} is equal to zero.
The variance σ_{U}² of the prediction residual depends on the predictor parameters and can be written as

σ_{U}² = E{ ( U_{n} − E{U_{n}} )² } = E{ ( ( S_{n} − E{S_{n}} ) − h_{N}^{T} ( S_{n−k} − E{S_{n−k}} ) )² }.        (6.16)

The constant offset h_{0} has no influence on the variance of the residual; the variance σ_{U}² depends only on the parameter vector h_{N}. By further reformulating the expression (6.16), we obtain

σ_{U}²(h_{N}) = σ_{S}² − 2 h_{N}^{T} c_{k} + h_{N}^{T} C_{N} h_{N},        (6.17)

where σ_{S}² is the variance of the input process and C_{N} and c_{k} are the autocovariance matrix and the autocovariance vector of the input process given by (6.11) and (6.12), respectively. The mean squared prediction error is given by

ϵ_{U}²(h_{N}, h_{0}) = σ_{U}²(h_{N}) + μ_{U}²   with   μ_{U} = μ_{S} ( 1 − h_{N}^{T} e_{N} ) − h_{0},        (6.18)

which is minimized, for any given parameter vector h_{N}, by the offset

h_{0} = μ_{S} ( 1 − h_{N}^{T} e_{N} ).        (6.19)
This selection of h_{0} yields a mean of μ_{U} = 0 for the prediction error signal, and the MSE between the input signal and its prediction, ϵ_{U}², is equal to the variance σ_{U}² of the prediction residual. Due to this simple relationship, we restrict the following considerations to linear predictors
Ŝ_{n} = h_{N}^{T} S_{n−k}        (6.20)
and the minimization of the variance σ_{U}^{2}. But we keep in mind that the affine predictor that minimizes the mean squared prediction error can be obtained by additionally selecting an offset h_{0} according to (6.19). The structure of a linear predictor is illustrated in Fig. 6.3.
A linear predictor is called an optimal linear predictor if its parameter vector h_{N} minimizes the variance σ_{U}²(h_{N}) given in (6.17). The solution to this minimization problem can be obtained by setting the partial derivatives of σ_{U}² with respect to the parameters h_{i}, with 1 ≤ i ≤ N, equal to zero. This yields the linear equation system
C_{N} h_{N} = c_{k}.        (6.21)
We will prove below that this solution indeed minimizes the variance σ_{U}². The N equations of the equation system (6.21) are also called the normal equations or the Yule-Walker equations. If the autocovariance matrix C_{N} is nonsingular, the optimal parameter vector is given by
h_{N}* = C_{N}^{−1} c_{k}.        (6.22)
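As a concrete example, the following sketch solves the normal equations for a zero-mean Gauss-Markov source with autocovariances φ_k = ρ^|k| (an assumption used only for this illustration) and verifies the resulting residual variance, as well as the orthogonality property discussed below, on simulated data.

```python
# Sketch: optimal one-step linear predictor via the normal equations.
import numpy as np

rho, N = 0.9, 3
k = np.arange(N)
C = rho ** np.abs(k[:, None] - k[None, :])   # autocovariance matrix C_N
c = rho ** np.arange(1, N + 1)               # autocovariance vector c_1
h = np.linalg.solve(C, c)                    # C_N h = c_1
print("h* =", np.round(h, 6))                # ~ (0.9, 0, 0) for this source
print("predicted residual variance:", 1 - c @ h)   # sigma_S^2 - c^T C^-1 c

rng = np.random.default_rng(8)
s = np.zeros(100_000)
for i in range(1, len(s)):                   # simulate the source
    s[i] = rho * s[i - 1] + rng.normal() * np.sqrt(1 - rho ** 2)
u = s[N:] - sum(h[j] * s[N - 1 - j:len(s) - 1 - j] for j in range(N))
print("measured residual variance:", u.var())        # ~ 1 - rho^2 = 0.19
print("corr(u_n, s_{n-1}) =", np.corrcoef(u, s[N - 1:-1])[0, 1])   # ~ 0
```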
The autocovariance matrix C_{N} of a stationary process is singular if and only if N successive random variables S_{n}, S_{n+1}, …, S_{n+N−1} are linearly dependent (see [75]), i.e., if the input process is deterministic. We ignore this case and assume that C_{N} is always nonsingular.
By inserting (6.22) into (6.17), we obtain the minimum prediction error variance

σ_{U}²(h_{N}*) = σ_{S}² − 2 (h_{N}*)^{T} c_{k} + (h_{N}*)^{T} C_{N} h_{N}* = σ_{S}² − c_{k}^{T} C_{N}^{−1} c_{k}.        (6.23)

Note that (h_{N}*)^{T} = c_{k}^{T} C_{N}^{−1} follows from the fact that the autocovariance matrix C_{N}, and thus also its inverse C_{N}^{−1}, is symmetric. We now prove that the solution given by the normal equations (6.21) indeed minimizes the prediction error variance. For this purpose, we investigate the prediction error variance for an arbitrary parameter vector h_{N}, which can be represented as h_{N} = h_{N}* + δ_{N}. Inserting this relationship into (6.17) and using (6.21) yields

σ_{U}²(h_{N}) = σ_{S}² − c_{k}^{T} C_{N}^{−1} c_{k} + δ_{N}^{T} C_{N} δ_{N} ≥ σ_{U}²(h_{N}*),        (6.25)

since the quadratic form δ_{N}^{T} C_{N} δ_{N} cannot be negative for an autocovariance matrix C_{N}. This proves that (6.21) specifies the parameter vector h_{N}* that minimizes the prediction error variance.
In the following, we derive another important property of optimal linear predictors. We consider the more general affine predictor and investigate the correlation between the observation vector S_{n−k} and the prediction residual U_{n},
E{ U_{n} S_{n−k} } = E{ ( S_{n} − h_{0} − (h_{N}*)^{T} S_{n−k} ) S_{n−k} } = 0.        (6.27)
Hence, optimal affine prediction yields a prediction residual U_{n} that is uncorrelated with the observation vector S_{n−k}. For optimal linear predictors, equation (6.27) holds only for zero-mean input signals. In general, only the covariance between the prediction residual and each observation is equal to zero,
E{ ( U_{n} − E{U_{n}} ) ( S_{n−k} − E{S_{n−k}} ) } = 0.        (6.28)
The linear prediction of a single random variable S_{n} given an observation vector S_{n−k} can also be extended to the prediction of a vector S_{n+K−1} = (S_{n+K−1}, S_{n+K−2}, …, S_{n})^{T} of K random variables. For each random variable of S_{n+K−1}, the optimal linear or affine predictor can be derived as discussed above. If the parameter vectors h_{N} are arranged in a matrix and the offsets h_{0} are arranged in a vector, the prediction can be written as
Ŝ_{n+K−1} = H_{K} S_{n−k} + h_{K},        (6.29)
where H_{K} is a K × N matrix whose rows are given by the corresponding parameter vectors h_{N}, and h_{K} is a K-dimensional vector whose elements are given by the corresponding offsets h_{0}.
The most often used prediction is the one-step prediction, in which a random variable S_{n} is predicted using the N directly preceding random variables S_{n−1} = (S_{n−1}, …, S_{n−N})^{T}. For this case, we now derive some useful expressions for the minimum prediction error variance σ_{U}²(h_{N}*), which will be used later for deriving an asymptotic bound.
For the one-step prediction, the normal equations (6.21) can be written in matrix notation as
⎛ φ_{0}     φ_{1}     ⋯  φ_{N−1} ⎞ ⎛ h_{1}^{N} ⎞   ⎛ φ_{1} ⎞
⎜ φ_{1}     φ_{0}     ⋯  φ_{N−2} ⎟ ⎜ h_{2}^{N} ⎟ = ⎜ φ_{2} ⎟
⎜   ⋮         ⋮       ⋱    ⋮     ⎟ ⎜    ⋮      ⎟   ⎜   ⋮   ⎟
⎝ φ_{N−1}  φ_{N−2}  ⋯  φ_{0}   ⎠ ⎝ h_{N}^{N} ⎠   ⎝ φ_{N} ⎠        (6.30)
where the factors h_{k}^{N} represent the elements of the optimal parameter vector h_{N}* = (h_{1}^{N}, …, h_{N}^{N})^{T} for linear prediction using the N preceding samples, and the covariances E{ (S_{n} − μ_{S}) (S_{n+k} − μ_{S}) } are denoted by φ_{k}. By adding a matrix column to the left, multiplying the parameter vector h_{N}* by −1, and adding an element equal to 1 at the top of the parameter vector, we obtain
