I Introduction
Consider the problem of estimating a random vector from observations under the posterior density (1), where is a normalization constant, sometimes called the partition function, and are penalty functions. Although both and the penalties may depend on , our notation suppresses this dependence. We are interested in two problems:

MAP estimation: In maximum a posteriori (MAP) estimation, we wish to find the point estimate , equivalently stated as
(2) 
MMSE estimation and approximate inference: In minimum mean-squared error (MMSE) estimation, we wish to compute the posterior mean , and possibly also approximations of the posterior covariance or marginal posterior densities .
For the MAP estimation problem (2), the separable structure of the objective function can be exploited by one of several optimization methods, including variants of the iterative shrinkage and thresholding algorithm (ISTA) [1, 2, 3, 4, 5, 6] and the alternating direction method of multipliers (ADMM) [7, 8, 9, 10].
The MMSE and inference problems, however, are more difficult [11], even in the case of convex penalties [12, 13]. In recent years, there has been considerable interest in approximations of loopy belief propagation [14, 15] for both MMSE estimation and approximate inference. These methods include variants of expectation propagation (EP) [16, 17, 18] and, more recently, approximate message passing (AMP) [19, 20, 21, 22, 23]. For a posterior of the form (1), both EP and AMP reduce the inference problem to a sequence of subproblems involving only one penalty at a time. These “local” problems are computationally tractable under suitable penalties. Moreover, in certain large random instances, these methods are provably optimal [20, 24, 25]. Due to their generality, these methods have been successfully applied to a wide range of problems, e.g., [26, 27, 28, 29, 30, 31, 32, 33].
Despite their computational simplicity, the convergence and accuracy of these methods are not fully understood. This work analyzes one promising EP-type method known as expectation consistent approximate inference (EC), originally proposed by Opper and Winther in [17]. As shown in [18], EC can be interpreted as a parallel form of the EP method from [16], while being closely related to the adaptive TAP method from [34, 35].
As we now describe, our work contributes to the extension and understanding of Opper and Winther’s EC method.

Generalization: We propose and analyze a generalization of the EC algorithm that we call Generalized EC (GEC). The proposed method can be applied to arbitrary penalties and , and can be used for both MAP and MMSE inference through an appropriate choice of estimation functions. Standard EC typically applies only to MMSE inference, often with one penalty being quadratic. GEC also supports a generalization of the covariance diagonalization step, which is one of the key computational bottlenecks in standard EC [12].

Fixed points: It is well known that, when the standard EC algorithm converges, its fixed points can be interpreted as saddle points of an energy function [17, 18] similar to the Bethe Free Energy (BFE) that arises in the analysis of loopy BP [15]. We provide a similar energy-function interpretation of the MMSE-GEC algorithm (Theorem 3). Our analysis shows that the so-called first- and second-order terms output by MMSE-GEC can be interpreted as estimates of the posterior mean and variance. Regarding the fixed points of MAP-GEC, we show that the first-order terms are critical points of the objective function (2), and that the second-order terms can be interpreted as estimates of the local curvature of the objective function.
Convergence: A critical concern for both EP and AMP is convergence [12, 36, 37, 38]. This situation is perhaps not surprising, given that they derive from loopy BP, which may also diverge. Most provably convergent alternative approaches are based on variants of double-loop methods such as [17, 13]. Other modifications to improve stability include damping and fractional updates [39, 36, 40] and sequential updating [41], which increase the likelihood of convergence at the cost of convergence speed. Our analysis of GEC convergence considers the first- and second-order terms separately, a decoupling technique also used in [18, 42]. We show that, for strictly convex, smooth penalties, the standard updates of the first-order terms are provably convergent. For MAP-GEC, the second-order terms converge as well.

Relation to the replica prediction of optimality: In [43], Tulino et al. used a replica analysis from statistical physics to predict the MMSE when estimating a random vector from noisy measurements of the linear transformation under large, unitarily invariant, random . This work extended the replica analyses in [44, 45, 46], which applied to i.i.d. . (See also [47].) In [48, 49], Çakmak et al. proposed a variant of AMP (called S-AMP) using closely related methods. In this work, we show that, when GEC is applied to linear regression, the prediction of the posterior MSE satisfies a fixed-point equation that exactly matches the replica prediction from [43].
Relation to ADMM: ADMM [7] is a popular approach to optimization problems of the form (2) with convex . Essentially, ADMM iterates individual optimizations of and together with a “dual update” that (asymptotically) enforces consistency between the individual optimizations. The dual update involves a fixed step size, whose choice affects the convergence speed of the algorithm. In this work, we show that GEC can be interpreted as a variant of ADMM with two dual updates, each with a step size that is adapted according to the local curvature of the corresponding penalty .
II The Generalized EC Algorithm
II-A Estimation and Diagonalization
The proposed GEC algorithm involves two key operations: i) estimation, which computes an estimate of using one penalty at a time; and ii) diagonalization of a sensitivity term.
Estimation
The estimation function is constructed differently for the MAP and MMSE cases. In the MAP case, the estimation function is given by
(3) 
where and (componentwise), and where
for any and positive . The estimation function (3) is a scaled version of what is often called the proximal operator.
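Since (3) is a scaled version of the proximal operator, a concrete scalar-penalty instance is easy to sketch. The following minimal example (all names hypothetical, with an ℓ1 penalty standing in for a generic separable penalty) evaluates the operator in closed form via soft-thresholding:

```python
import numpy as np

def prox_l1(v, gamma):
    """Proximal-operator-style MAP estimation function for f(x) = lam*||x||_1:
    argmin_x lam*||x||_1 + (gamma/2)*||x - v||^2,
    which reduces to elementwise soft-thresholding with threshold lam/gamma.
    (lam and gamma are hypothetical names for the penalty weight and precision.)"""
    lam = 1.0  # hypothetical penalty weight
    return np.sign(v) * np.maximum(np.abs(v) - lam / gamma, 0.0)

v = np.array([2.0, -0.3, 0.5])
x_hat = prox_l1(v, gamma=2.0)   # threshold lam/gamma = 0.5 → [1.5, 0., 0.]
```

For a separable penalty like this one, the Jacobian needed later in Algorithm 1 is diagonal and can also be computed componentwise.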
For the MMSE problem, the estimation function is
(4) 
where the expectation is with respect to the conditional density
(5) 
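For a scalar component, the MMSE estimation function (4) under the tilted density (5) can be approximated by brute-force quadrature. A minimal sketch, assuming (hypothetically, since the original notation is not reproduced here) that the tilted density has the form p(x) ∝ exp(-f(x) - (gamma/2)(x - r)^2):

```python
import numpy as np

def mmse_estimate(f, r, gamma, grid):
    """Posterior mean of p(x) ∝ exp(-f(x) - (gamma/2)(x - r)^2),
    computed by quadrature on a 1-D grid (illustrative only)."""
    logp = -f(grid) - 0.5 * gamma * (grid - r) ** 2
    w = np.exp(logp - logp.max())   # subtract max for numerical stability
    w /= w.sum()                    # normalize the discretized density
    return np.sum(w * grid)

grid = np.linspace(-10, 10, 20001)
f = lambda x: np.abs(x)             # hypothetical Laplace-prior penalty
x_hat = mmse_estimate(f, r=2.0, gamma=4.0, grid=grid)
```

With the penalty set to zero, the tilted density is Gaussian and the estimate reduces to r, which is a useful sanity check.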
Diagonalization
In its most general form, the diagonalization operator is an affine linear map from to . Several instances of diagonalization are relevant to our work. For example, vector-valued diagonalization,
(6) 
which simply returns a dimensional vector containing the diagonal elements of , and uniform diagonalization,
(7) 
which returns a constant vector containing the average diagonal element of . Here, denotes the dimensional vector with all elements equal to one.
For the separable GLM, it will be useful to consider a block uniform diagonalization. In this case, we partition
(8) 
with . Conformal to the partition, we define the block uniform diagonalization
(9) 
where is the th diagonal block of .
We note that any of these diagonalization operators can be used with either the MAP or MMSE estimation functions.
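The three diagonalization operators above are straightforward to implement. A sketch in NumPy (helper names hypothetical):

```python
import numpy as np

def diag_vec(C):
    """Vector-valued diagonalization: the vector of diagonal entries of C."""
    return np.diag(C).copy()

def diag_unif(C):
    """Uniform diagonalization: a constant vector holding the average diagonal entry."""
    N = C.shape[0]
    return np.full(N, np.trace(C) / N)

def diag_block_unif(C, sizes):
    """Block uniform diagonalization: within each diagonal block,
    replace the diagonal entries by their block average."""
    out, start = [], 0
    for n in sizes:
        block = C[start:start + n, start:start + n]
        out.append(np.full(n, np.trace(block) / n))
        start += n
    return np.concatenate(out)

C = np.array([[2., 1., 0.],
              [1., 4., 0.],
              [0., 0., 6.]])
d_vec   = diag_vec(C)               # → [2., 4., 6.]
d_unif  = diag_unif(C)              # → [4., 4., 4.]
d_block = diag_block_unif(C, [2, 1])  # → [3., 3., 6.]
```

Each operator maps a matrix to a vector, consistent with the affine-linear-map description above.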
II-B Algorithm Description
The generalized EC (GEC) algorithm is specified in Algorithm 1. There, is the Jacobian matrix of evaluated at , is the diagonal matrix whose diagonal equals , “” is componentwise vector division, and “” is componentwise vector multiplication. Note that it is not necessary to compute the full matrix in line 5; it suffices to compute only the diagonalization .
It will sometimes be useful to rewrite Algorithm 1 in a scaled form. Define and . Then GEC can be rewritten as
(10a)  
(10b)  
(10c) 
Note that, in line 5 of Algorithm 1, we are required to compute the (scaled) Jacobian of the estimation function. For the MAP estimation function (3), this quantity becomes [20]
(11) 
where is the minimizer in (3) and is the Hessian of at that minimizer. For the MMSE estimation function, this scaled Jacobian becomes the covariance matrix
(12) 
where the covariance is taken with respect to the density (5).
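The relation (12) between the Jacobian of the MMSE estimator and a posterior covariance can be checked numerically in a scalar example. Below, assuming (hypothetically) that the tilted density (5) has the form p(x) ∝ exp(-(a/2)x^2 - (gamma/2)(x - r)^2) and one common scaling convention, the finite-difference derivative of the posterior mean with respect to r equals gamma times the posterior variance:

```python
import numpy as np

def posterior_stats(a, r, gamma, grid):
    """Mean and variance of p(x) ∝ exp(-(a/2)x^2 - (gamma/2)(x - r)^2) by quadrature."""
    logp = -0.5 * a * grid**2 - 0.5 * gamma * (grid - r)**2
    w = np.exp(logp - logp.max())
    w /= w.sum()
    mean = np.sum(w * grid)
    var = np.sum(w * (grid - mean)**2)
    return mean, var

a, gamma, r = 3.0, 2.0, 1.0
grid = np.linspace(-10, 10, 40001)
eps = 1e-4
m_plus, _ = posterior_stats(a, r + eps, gamma, grid)
m_minus, _ = posterior_stats(a, r - eps, gamma, grid)
jac = (m_plus - m_minus) / (2 * eps)   # d(posterior mean)/dr
_, var = posterior_stats(a, r, gamma, grid)
# Gaussian case: jac = gamma/(a + gamma) = 0.4 and gamma*var = 0.4 agree
```

In this Gaussian case both quantities equal gamma/(a + gamma), illustrating how the sensitivity term doubles as a variance estimate.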
II-C Examples
SLR with Separable Prior
Suppose that we aim to estimate given noisy linear measurements of the form
(13) 
where is a known matrix and is independent of . Statisticians often refer to this problem as standard linear regression (SLR). Suppose also that has independent elements with marginal densities :
(14) 
Then, the posterior takes the form of (1) when
(15) 
The separable nature of implies that, in both the MAP and MMSE cases, the output of the estimator (recall (3) and (4)) can be computed in a componentwise manner, as can the diagonal terms of its Jacobian. Likewise, the quadratic nature of implies that the output of can be computed by solving a linear system.
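To illustrate the linear-system solve for the quadratic penalty, here is a sketch of the regularized least-squares update (notation hypothetical: tau_w plays the role of the noise precision, and (r, gamma) the incoming estimate and precision):

```python
import numpy as np

def slr_quadratic_estimate(A, y, tau_w, r, gamma):
    """Estimate for the quadratic penalty f(x) = (tau_w/2)||y - A x||^2:
    argmin_x (tau_w/2)||y - A x||^2 + (gamma/2)||x - r||^2,
    which reduces to a single linear solve."""
    N = A.shape[1]
    H = tau_w * A.T @ A + gamma * np.eye(N)
    return np.linalg.solve(H, tau_w * A.T @ y + gamma * r)

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
x_true = rng.standard_normal(5)
y = A @ x_true                       # noiseless measurements for illustration
x_hat = slr_quadratic_estimate(A, y, tau_w=1e3, r=np.zeros(5), gamma=1.0)
# with strong likelihood weight, x_hat ≈ x_true
```

The Hessian H here is exactly the matrix whose diagonalized inverse feeds the sensitivity computation in Algorithm 1.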
GLM with Separable Prior
Now suppose that, instead of (13), we have a more general likelihood with the form
(16) 
Statisticians often refer to (16) as the generalized linear model (GLM) [50, 51]. To pose the GLM in a format convenient for GEC, we define the new vector . Then, the posterior can be placed in the form of (1) using the penalties
where constrains to the nullspace of . Because the first penalty remains separable, the MAP and MMSE functions can be evaluated componentwise, as in separable SLR. For the second penalty , MAP or MMSE estimation simply becomes projection onto a linear space.
III Fixed Points of GEC
III-A Consistency
We now characterize the fixed points of GEC for both MAP and MMSE estimation functions. For both scenarios, we will need the following simple consistency result.
Lemma 1.
Consider GEC (Algorithm 1) with arbitrary estimation functions and arbitrary diagonalization operator . For any fixed point with , we have
(18a)  
(18b) 
III-B MAP Estimation
We first examine GEC’s fixed points for MAP estimation.
Theorem 1.
Proof.
See Appendix A.
III-C MAP Estimation and Curvature
Note that Theorem 1 applies to an arbitrary diagonalization operator . This raises two questions: i) what is the role of the diagonalization operator , and ii) how can the fixed point be interpreted as a result of that diagonalization? We now show that, under certain additional conditions and certain choices of , can be related to the curvature of the optimization objective in (2).
Let be a stationary point of (2) and let be the Hessian of at . Then, the Hessian of the objective function in (2) is . Furthermore, let
(19) 
so that is the diagonal of the inverse Hessian. Geometrically speaking, this inverse Hessian measures the curvature of the objective function at the critical point .
We now identify two cases where : i) when are diagonal, and ii) when are free. To define “free,” consider the Stieltjes transform of any real symmetric matrix :
(20) where are the eigenvalues of . Also, let denote the so-called transform of , given by (21), where the inverse is in the sense of functional composition. The Stieltjes and transforms are discussed in detail in [52]. We will say that and are “free” if
(22) 
An important example of freeness is the following. Suppose that the penalty functions are given by for some matrices and functions . Then
It is shown in [52] that, if are fixed and are unitarily invariant random matrices, then are asymptotically free in certain limits as . Freeness will thus occur in the limit of large problems with unitarily invariant random matrices.
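The Stieltjes transform (20) of the empirical eigenvalue distribution can be computed directly from the eigenvalues. A sketch (one common sign convention is assumed; the paper's convention may differ), evaluated on a Wigner-like symmetric matrix of the unitarily invariant type for which asymptotic freeness holds:

```python
import numpy as np

def stieltjes(eigs, s):
    """Stieltjes transform of the empirical eigenvalue distribution,
    S(s) = (1/N) * sum_i 1/(lambda_i - s), for s outside the spectrum."""
    return np.mean(1.0 / (eigs - s))

rng = np.random.default_rng(1)
N = 200
X = rng.standard_normal((N, N))
A = (X + X.T) / np.sqrt(2 * N)      # Wigner-like symmetric matrix, spectrum ≈ [-2, 2]
eigs = np.linalg.eigvalsh(A)
S = stieltjes(eigs, s=-3.0)         # evaluate safely to the left of the spectrum
```

For the semicircle law the exact value at s = -3 is (3 - sqrt(5))/2 ≈ 0.38, which the empirical transform approaches as N grows.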
Theorem 2.
Proof.
See Appendix B.
III-D MMSE Estimation
We now consider the fixed points of GEC under MMSE estimation functions. It is well known that the fixed points of the standard EC algorithm are critical points of a certain free-energy optimization for approximate inference [17, 18]. We derive a similar characterization of the fixed points of GEC.
Let be the density (1) for some fixed . Then, given any density , it is straightforward to show that the KL divergence between and can be expressed as
(23) 
where is the differential entropy of and the constant term does not depend on . Thus, in principle, we could compute by minimizing (23) over all densities . Of course, this minimization is generally intractable since it involves a search over an dimensional density.
To approximate the minimization, define
(24) 
where , and are densities on the variable . Note that minimization of (23) over is equivalent to the optimization
(25) 
under the constraint
(26) 
The energy function (24) is known as the Bethe Free Energy (BFE). Under the constraint (26), the BFE matches the original energy function (23). However, BFE minimization under the constraint (26) is equally intractable.
As with EC, the GEC algorithm can be derived as a relaxation of the above BFE optimization, wherein (26) is replaced by the so-called moment-matching constraints:
(27a)  
(27b) 
Thus, instead of requiring a perfect match between the densities as in (26), GEC requires only a match in their first moments and certain diagonal components of their second moments. Note that, for the vector-valued diagonalization (6), (27b) is equivalent to
which requires only that the marginal second moments match. Under the uniform diagonalization (7), (27b) is equivalent to
requiring only that the average marginal second moments match.
Theorem 3.
Consider GEC (Algorithm 1) with the MMSE estimation functions (4) and either vector-valued (6) or uniform (7) diagonalization. For any fixed point with , let and be the common values of and from Lemma 1. Also let
(28) 
for from (5) and let be the Gaussian density
(29) 
Then, are stationary points of the optimization (25) subject to the moment matching constraints (27). In addition, is the mean, and the marginal precision, of these densities:
(30)  
(31) 
Proof.
See Appendix C.
III-E An Unbiased Estimate of
As described in Section II-C, a popular application of GEC is to approximate the marginals of the posterior density (1) in the case that the first penalty describes the prior and the second penalty describes the likelihood. That is,
Here, we have made the dependence of on explicit. The GEC algorithm produces three estimates for the posterior density: , , and . Consider the first of these estimates, . From (28) and (5), this belief estimate is given by
(32) 
where is a normalization constant.
If we model as a random vector, then (32) implies that
From Bayes’ rule, we know , which together with (32) implies
For to be an admissible prior density on , it must satisfy , , and . It is straightforward to show that one admissible choice is
Under this choice, we get
(33) 
in which case can be interpreted as an unbiased estimate of with zero-mean Gaussian estimation error of covariance . The situation above is reminiscent of AMP algorithms [19, 20]. Specifically, their state-evolution analyses [24] show that, under large i.i.d. , they recursively produce a sequence of vectors that can be modeled as realizations of the true vector plus zero-mean white Gaussian noise. They then compute a sequence of estimates of by “denoising” each .
IV Convergence of the First-Order Terms for Strictly Convex Penalties
We first analyze the convergence of GEC with fixed “second-order terms” and . To this end, fix at arbitrary values and assume that are fixed points of (10b). Then Lemma 1 implies that . With and fixed, the (scaled) GEC algorithm (10) updates only . In particular, (10c) implies that this update is
(34) 
We analyze the recursion (34) under the following assumption.
Assumption 1.
For , fix and suppose that is differentiable in . Also define
(35) 
and assume that is symmetric and that there exist constants such that, for all ,
(36) 
This assumption is valid under strictly convex penalties:
Lemma 2.
Proof.
See [53].
We then have the following convergence result.
Theorem 4.
Proof.
See Appendix D.
V Convergence of the Second-Order Terms for MAP Estimation
V-A Convergence
We now examine the convergence of the second-order terms and . The convergence results that we present here apply only to the case of MAP estimation (3) under strictly convex penalties that satisfy the conditions in Lemma 2. Furthermore, they assume that Algorithm 1 is initialized with a pair yielding , where is a local minimizer of (2).
Theorem 5.
Consider GEC (Algorithm 1) under the MAP estimation functions (3) with penalties that are strictly convex functions satisfying the assumptions in Lemma 2. Construct and as follows:
Choose arbitrary and run Algorithm 1 under fixed and fixed (for ) until convergence (as guaranteed by Theorem 4). Then record the final value of as .
Finally, run Algorithm 1 from the initialization without keeping and fixed.

For all subsequent iterations, we will have , where is the unique global minimizer of (2).

If is either the vector-valued or uniform diagonalization operator, then the second-order terms will converge to unique fixed points.
Proof.
See Appendix E.
VI Relation to the Replica Prediction
Consider the separable SLR problem described in Section II-C for any matrix and noise precision . Consider GEC under the penalty functions (15), MMSE estimation (4), and uniform diagonalization (7). Thus, will have identical components of scalar value .
Suppose that is the belief estimate generated at a fixed point of GEC. Since in (14) is separable, (32) implies
In the sequel, let and denote the mean and variance with respect to the marginal density
(38) 
From (27a), the GEC estimate satisfies , which is the posterior mean under the estimated density (38). Also, from (27b) and the definition of the uniform diagonal operator (7), the components of are identical and satisfy
(39) 
which is the average of the marginal posterior variances. Equivalently, is the average estimation MSE,
We will show that the value of can be characterized in terms of the singular values of . Let , and let denote its Stieltjes transform (20) and its transform (21). We then have the following.
Theorem 6.
Proof.
See Appendix F.
It is interesting to compare this result with that in [43], which considers exactly this estimation problem in the limit of large with certain unitarily invariant random matrices and i.i.d. . That work uses a replica symmetric (RS) analysis to predict that the asymptotic MSE satisfies the fixed point equations
(41) 
where the expectation is over . This Gaussian distribution is exactly the predicted likelihood in (33). Hence, if is i.i.d. and follows the likelihood in (33), then the MSE predicted from the GEC-estimated posterior must satisfy the same fixed-point equation as the minimum MSE predicted by the replica method in the limit as . In particular, if this equation has a unique fixed point, then the GEC-predicted MSE will match the minimum MSE given by the replica method. Of course, these arguments are not a rigorous proof of optimality. The analysis relies on the GEC model with a particular choice of prior on . Also, the replica method itself is not rigorous. Nevertheless, the arguments do provide some hope that GEC is optimal in certain asymptotic and random regimes.
VII Relation to ADMM
We conclude by relating GEC to the well-known alternating direction method of multipliers (ADMM) [7, 8, 9, 10]. Consider the MAP minimization problem (2). To solve it via ADMM, we rewrite the minimization as the constrained optimization
(42) 
The division of the variable into two variables and is often called variable splitting. Corresponding to the constrained optimization (42), define the augmented Lagrangian,
(43) 
where is a dual vector and is an adjustable weight. The ADMM algorithm for this problem iterates the steps
(44a)  
(44b)  
(44c) 
where it becomes evident that can also be interpreted as a step size. The benefit of the ADMM method is that the minimizations involve only one penalty, or , at a time. A classic result [7] shows that if the penalties are convex (not necessarily smooth) and (2) has a unique minimum, then the ADMM algorithm will converge to that minimum. Our next result relates MAP-GEC to ADMM.
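As a concrete illustration of the iteration (44), here is a minimal scaled-form ADMM sketch for two hypothetical quadratic penalties, for which each minimization has a closed form:

```python
import numpy as np

def admm(b1, b2, rho=1.0, iters=100):
    """Scaled-form ADMM for min_x (1/2)||x - b1||^2 + (1/2)||x - b2||^2,
    split as f1(x) + f2(z) with the constraint x = z.
    rho is the fixed step size; b1, b2 are hypothetical data vectors."""
    x = np.zeros_like(b1)
    z = np.zeros_like(b1)
    u = np.zeros_like(b1)           # scaled dual vector
    for _ in range(iters):
        x = (b1 + rho * (z - u)) / (1 + rho)   # minimize over x (f1 step)
        z = (b2 + rho * (x + u)) / (1 + rho)   # minimize over z (f2 step)
        u = u + x - z                          # dual update with fixed step
    return x

b1 = np.array([1.0, 3.0])
b2 = np.array([3.0, 1.0])
x_star = admm(b1, b2)   # converges to the minimizer (b1 + b2)/2 = [2., 2.]
```

Here rho is held fixed throughout, which is exactly the design choice that GEC replaces with a curvature-adapted step size.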
Theorem 7.
Note that, in the above description, we have been explicit about the iteration number to be precise about the timing of the updates. We see that a variant of ADMM can be interpreted as a special case of GEC with particular, fixed step sizes. This variant differs from the standard ADMM updates in that it performs two updates of the dual parameters in each iteration. Alternatively, we can think of GEC as a variant of ADMM that uses an adaptive step size. From our discussion above, we know that the GEC algorithm can be interpreted as adapting the step-size values to match the local “curvature” of the objective function.