Optimization of the Map Coefficients

A crucial step in constructing transport maps is the optimization of the map coefficients, which determine how well the map represents the target distribution. This process can be approached in two distinct ways, depending on the available information about the target distribution [1].

Map-from-density

One way to construct a transport map is to directly optimize its parameters based on the (unnormalized) target density, as shown in Banana: Map from Density. This approach requires access to the target density function and uses quadrature schemes to approximate integrals, as introduced in Quadrature Methods.

Formally, we define the following optimization problem to determine the coefficients of the parameterized map :

As noted by [1], this optimization problem is generally non-convex. Specifically, it is only convex when the target density is log-concave. Especially in Bayesian inference, where the target density represents the posterior density, the function is not log-concave, resulting in a non-convex optimization problem.

In this package, map optimization is performed with the help of Optim.jl, and support a wide range of optimizers and options (such as convergence criteria and printing preferences). Specifically, we can pass our optimize! function the desired optimizer and options. For a full overview of available options, see the Optim.jl configuration documentation.

To perform the optimization of the map coefficients, we call:

julia

optimize!(M::PolynomialMap, target_density::Function, quadrature::AbstractQuadratureWeights;
  optimizer::Optim.AbstractOptimizer = LBFGS(), options::Optim.Options = Optim.Options())

We have to provide the polynomial map M, the target density function, and a quadrature scheme. Optionally, we can specify the optimizer (default is LBFGS()) and options.

Set initial coefficients

As the starting point of the optimization, the map coefficients can be set using setcoefficients!(M, coeffs), where coeffs is a vector of coefficients.

Usage

First we load the packages:

julia

using TransportMaps
using Optim
using Distributions
using Plots

Then, define the target density and quadrature scheme. Here, we use the same banana-shaped density as in Banana: Map from Density:

julia

banana_density(x) = logpdf(Normal(), x[1]) + logpdf(Normal(), x[2] - x[1]^2)
target = MapTargetDensity(banana_density)
quadrature = GaussHermiteWeights(3, 2)

Set optimization options to print the trace every 20 iterations:

julia

opts_trace = Optim.Options(iterations=200, show_trace=true, show_every=20, store_trace=true)

                x_abstol = 0.0
                x_reltol = 0.0
                f_abstol = 0.0
                f_reltol = 0.0
                g_abstol = 1.0e-8
          outer_x_abstol = 0.0
          outer_x_reltol = 0.0
          outer_f_abstol = 0.0
          outer_f_reltol = 0.0
          outer_g_abstol = 1.0e-8
           f_calls_limit = 0
           g_calls_limit = 0
           h_calls_limit = 0
       allow_f_increases = true
 allow_outer_f_increases = true
        successive_f_tol = 1
              iterations = 200
        outer_iterations = 1000
             store_trace = true
           trace_simplex = false
              show_trace = true
          extended_trace = false
           show_warnings = true
              show_every = 20
                callback = nothing
              time_limit = NaN

We will try the following optimizers from Optim.jl, ordered from simplest to most sophisticated:

Gradient Descent

The most basic optimization algorithm, Gradient Descent iteratively moves in the direction of the negative gradient. It is simple and robust, but can be slow to converge, especially for ill-conditioned problems.

julia

M_gd = PolynomialMap(2, 2)
res_gd = optimize!(M_gd, target, quadrature; optimizer=GradientDescent(), options=opts_trace)
println(res_gd)

Iter     Function value   Gradient norm
     0     3.397609e+00     4.804530e-01
 * time: 4.887580871582031e-5
    20     2.848534e+00     3.621462e-02
 * time: 0.5577878952026367
    40     2.842559e+00     1.621869e-02
 * time: 0.9312169551849365
    60     2.840523e+00     8.783921e-03
 * time: 1.3299839496612549
    80     2.840045e+00     7.258646e-03
 * time: 1.698009967803955
   100     2.839827e+00     4.187938e-03
 * time: 2.0947048664093018
   120     2.839672e+00     2.128402e-03
 * time: 2.5090909004211426
   140     2.839645e+00     2.133251e-03
 * time: 2.870330810546875
   160     2.839627e+00     8.449095e-04
 * time: 3.2306418418884277
   180     2.839621e+00     6.755565e-04
 * time: 3.6458280086517334
   200     2.839619e+00     3.546867e-04
 * time: 4.03125
┌ Warning: Optimization has not converged.
└ @ TransportMaps ~/work/TransportMaps.jl/TransportMaps.jl/src/optimization/mapfromdensity.jl:158
 * Status: failure (reached maximum number of iterations)

 * Candidate solution
    Final objective value:     2.839619e+00

 * Found with
    Algorithm:     Gradient Descent

 * Convergence measures
    |x - x'|               = 5.77e-05 ≰ 0.0e+00
    |x - x'|/|x'|          = 2.33e-05 ≰ 0.0e+00
    |f(x) - f(x')|         = 2.93e-08 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 1.03e-08 ≰ 0.0e+00
    |g(x)|                 = 3.55e-04 ≰ 1.0e-08

 * Work counters
    Seconds run:   4  (vs limit Inf)
    Iterations:    200
    f(x) calls:    465
    ∇f(x) calls:   465
    ∇f(x)ᵀv calls: 0

Conjugate Gradient

Conjugate Gradient improves upon basic gradient descent by using conjugate directions, which can accelerate convergence for large-scale or quadratic problems. It requires gradient information but not the Hessian.

julia

M_cg = PolynomialMap(2, 2)
res_cg = optimize!(M_cg, target, quadrature; optimizer=ConjugateGradient(), options=opts_trace)
println(res_cg)

Iter     Function value   Gradient norm
     0     3.397609e+00     4.804530e-01
 * time: 4.506111145019531e-5
    20     2.839618e+00     7.770856e-05
 * time: 0.4335150718688965
    40     2.839618e+00     1.415130e-07
 * time: 0.6424970626831055
 * Status: success

 * Candidate solution
    Final objective value:     2.839618e+00

 * Found with
    Algorithm:     Conjugate Gradient

 * Convergence measures
    |x - x'|               = 6.87e-09 ≰ 0.0e+00
    |x - x'|/|x'|          = 2.77e-09 ≰ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00
    |g(x)|                 = 2.63e-08 ≰ 1.0e-08

 * Work counters
    Seconds run:   1  (vs limit Inf)
    Iterations:    56
    f(x) calls:    118
    ∇f(x) calls:   63
    ∇f(x)ᵀv calls: 0

Nelder-Mead

Nelder-Mead is a derivative-free optimizer that uses a simplex of points to search for the minimum. It is useful when gradients are unavailable or unreliable, but may be less efficient for high-dimensional or smooth problems.

julia

M_nm = PolynomialMap(2, 2)
res_nm = optimize!(M_nm, target, quadrature; optimizer=NelderMead(), options=opts_trace)
println(res_nm)

Iter     Function value    √(Σ(yᵢ-ȳ)²)/n
------   --------------    --------------
     0     3.385910e+00     6.615819e-03
 * time: 0.0005309581756591797
    20     3.325358e+00     1.182567e-02
 * time: 0.053598880767822266
    40     3.217377e+00     1.768618e-02
 * time: 0.10069608688354492
    60     3.145705e+00     4.179719e-03
 * time: 0.16812396049499512
    80     3.117396e+00     5.441360e-03
 * time: 0.20492887496948242
   100     3.093641e+00     3.315425e-03
 * time: 0.23890209197998047
   120     3.070293e+00     3.700208e-03
 * time: 0.3277099132537842
   140     3.052176e+00     3.555031e-03
 * time: 0.36157798767089844
   160     3.040402e+00     2.725353e-03
 * time: 0.39615607261657715
   180     3.027753e+00     2.904953e-03
 * time: 0.7958660125732422
   200     3.012278e+00     2.578709e-03
 * time: 0.8264880180358887
┌ Warning: Optimization has not converged.
└ @ TransportMaps ~/work/TransportMaps.jl/TransportMaps.jl/src/optimization/mapfromdensity.jl:158
 * Status: failure (reached maximum number of iterations)

 * Candidate solution
    Final objective value:     3.012278e+00

 * Found with
    Algorithm:     Nelder-Mead

 * Convergence measures
    √(Σ(yᵢ-ȳ)²)/n ≰ 1.0e-08

 * Work counters
    Seconds run:   1  (vs limit Inf)
    Iterations:    200
    f(x) calls:    305

BFGS

BFGS is a quasi-Newton method that builds up an approximation to the Hessian matrix using gradient evaluations. It is generally faster and more robust than gradient descent and conjugate gradient for smooth problems.

julia

M_bfgs = PolynomialMap(2, 2)
res_bfgs = optimize!(M_bfgs, target, quadrature; optimizer=BFGS(), options=opts_trace)
println(res_bfgs)

Iter     Function value   Gradient norm
     0     3.397609e+00     4.804530e-01
 * time: 4.887580871582031e-5
 * Status: success

 * Candidate solution
    Final objective value:     2.839618e+00

 * Found with
    Algorithm:     BFGS

 * Convergence measures
    |x - x'|               = 7.15e-08 ≰ 0.0e+00
    |x - x'|/|x'|          = 2.88e-08 ≰ 0.0e+00
    |f(x) - f(x')|         = 5.33e-15 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 1.88e-15 ≰ 0.0e+00
    |g(x)|                 = 3.02e-09 ≤ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    16
    f(x) calls:    19
    ∇f(x) calls:   19
    ∇f(x)ᵀv calls: 0

LBFGS

LBFGS is a limited-memory version of BFGS, making it suitable for large-scale problems where storing the full Hessian approximation is impractical. It is the default optimizer in many scientific computing packages due to its efficiency and reliability.

julia

M_lbfgs = PolynomialMap(2, 2)
res_lbfgs = optimize!(M_lbfgs, target, quadrature; optimizer=LBFGS(), options=opts_trace)
println(res_lbfgs)

Iter     Function value   Gradient norm
     0     3.397609e+00     4.804530e-01
 * time: 4.1961669921875e-5
 * Status: success

 * Candidate solution
    Final objective value:     2.839618e+00

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 6.67e-08 ≰ 0.0e+00
    |x - x'|/|x'|          = 2.68e-08 ≰ 0.0e+00
    |f(x) - f(x')|         = 1.78e-15 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 6.26e-16 ≰ 0.0e+00
    |g(x)|                 = 1.85e-09 ≤ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    17
    f(x) calls:    22
    ∇f(x) calls:   22
    ∇f(x)ᵀv calls: 0

Finally, we can compare the results by means of variance diagnostic:

julia

samples_z = randn(1000, 2)
v_gd = variance_diagnostic(M_gd, target, samples_z)
v_cg = variance_diagnostic(M_cg, target, samples_z)
v_nm = variance_diagnostic(M_nm, target, samples_z)
v_bfgs = variance_diagnostic(M_bfgs, target, samples_z)
v_lbfgs = variance_diagnostic(M_lbfgs, target, samples_z)

println("Variance diagnostic GradientDescent:   ", v_gd)
println("Variance diagnostic ConjugateGradient: ", v_cg)
println("Variance diagnostic NelderMead:        ", v_nm)
println("Variance diagnostic BFGS:              ", v_bfgs)
println("Variance diagnostic LBFGS:             ", v_lbfgs)

Variance diagnostic GradientDescent:   0.0004248933777234051
Variance diagnostic ConjugateGradient: 0.0004262577686722196
Variance diagnostic NelderMead:        0.1050219660068103
Variance diagnostic BFGS:              0.0004262577395179497
Variance diagnostic LBFGS:             0.0004262577604438052

We can visualize the convergence of all optimizers:

julia

plot([res_gd.trace[i].iteration for i in 1:length(res_gd.trace)], lw=2,
    [res_gd.trace[i].g_norm for i in 1:length(res_gd.trace)], label="GradientDescent")
plot!([res_cg.trace[i].iteration for i in 1:length(res_cg.trace)], lw=2,
    [res_cg.trace[i].g_norm for i in 1:length(res_cg.trace)], label="ConjugateGradient")
plot!([res_nm.trace[i].iteration for i in 1:length(res_nm.trace)], lw=2,
    [res_nm.trace[i].g_norm for i in 1:length(res_nm.trace)], label="NelderMead")
plot!([res_bfgs.trace[i].iteration for i in 1:length(res_bfgs.trace)], lw=2,
    [res_bfgs.trace[i].g_norm for i in 1:length(res_bfgs.trace)], label="BFGS")
plot!([res_lbfgs.trace[i].iteration for i in 1:length(res_lbfgs.trace)], lw=2,
    [res_lbfgs.trace[i].g_norm for i in 1:length(res_lbfgs.trace)], label="LBFGS")
plot!(xaxis=:log, yaxis=:log, xlabel="Iteration", ylabel="Gradient norm",
    title="Convergence of different optimizers", xlims=(1, 200),
    legend=:bottomleft)

It becomes clear, that LBFGS and BFGS are the most efficient optimizers in this case, while Nelder-Mead struggles to keep up.

Map-from-samples

Another strategy of constructing a transport map is to use samples of the target density, as seen in Banana: Map from Samples. The formulation of transport map estimation in this way has the benefit to transform the problem into a convex optimization problem, when reference density is log-concave [1]. Since we can choose the reference density, we can leverage this property to simplify the optimization process.

When the map is constructed from samples, the optimization problem is formulated by minimizing the Kullback-Leibler divergence between the pushforward of the reference density and the empirical distribution of the samples. We denote the transport map by , which pushes forward the target distribution to the reference distribution. This leads to the following optimization problem:

where are samples from the target distribution, and is the density of the reference distribution.

To perform the optimization, we can use the same optimize! function as before, but now we pass samples instead of a target density and quadrature scheme. Similarly, we can specify the optimizer and options:

julia

optimize!(M::PolynomialMap, samples::AbstractArray{<:Real};
  optimizer::Optim.AbstractOptimizer = LBFGS(), options::Optim.Options = Optim.Options())

This page was generated using Literate.jl.

Optimization of the Map Coefficients ​

Map-from-density ​

Usage ​

Gradient Descent ​

Conjugate Gradient ​

Nelder-Mead ​

BFGS ​

LBFGS ​

Map-from-samples ​

Optimization of the Map Coefficients

Map-from-density

Usage

Gradient Descent

Conjugate Gradient

Nelder-Mead

BFGS

LBFGS

Map-from-samples