1 Introduction
There has been a flurry of recent work studying the convergence properties of policy gradient methods (agarwal2019optimality; bhandari2019global; shani2019adaptive; zhang2019global; wang2019neural; zhang2020variational; mei2020global). We revisit the finite time analysis of policy gradient methods in the simplest setting: finite state and action problems with a policy class consisting of all stochastic policies and with exact gradient evaluations. This setting was recently studied by agarwal2019optimality and shani2019adaptive, who view these problems as instances of smooth nonlinear optimization problems and suggest small stepsizes to control for the error due to local linearization. Despite nonconvexity of the objective, their proofs show policy gradient methods find an –optimal policy within either or iterations, depending on the precise algorithm used.
Instead of viewing the problem through the lens of nonlinear optimization, we take a policy iteration perspective. We highlight that many forms of policy gradient can work with extremely large stepsizes and attain a linear rate of convergence, meaning they require only iterations to reach an
–optimal policy. Our results cover many first order methods applied to the policy gradient loss function, including projected gradient descent, FrankWolfe, mirror descent, and natural gradient descent. In an idealized setting where stepsizes are set by line search, a one paragraph proof applies to all algorithms.
Many caveats apply to these results. The literature aims to effectively approximate natural policy gradient updates with complex deep neural networks
(schulman2015trust), but it is unclear whether understanding of other first order algorithms contributes to developing practical algorithms. Small stepsizes may be critical in practice for controlling certain approximation errors and for stabilizing algorithms. None of these issues are present in simple tabular RL problems, however, and given the flurry of interest in policy gradient convergence rates, we believe it is valuable for researchers to have a clear understanding in this idealized case.On concurrent work.
These authors developed the main results in this note many months ago^{1}^{1}1These results were mentioned in seminars at Cornell on November 5 2019, Georgia tech on March 11 2020, and the RL theory seminar on May 19 2020. They appeared on the second author’s website on May 19.. Concurrently, two other sets of authors have also concluded that policy gradient methods should converge linearly (cen2020fast; mei2020global). Those papers both contain sophisticated analysis of particular policy gradient procedures. Because this note contains short proofs that apply to a broad range of algorithms, it may help some readers gain clear understanding.
2 Problem Formulation
Consider a Markov decision process (MDP), which is a sixtuple
, consisting of a state space , action space , cost function , transition kernel , discount factor and initial distribution . We assume the state space to be finite and index the states as . For each state , we assume that there are is finite set of deterministic actions to choose from and take the action space,to be the set of all probability distributions over the
deterministic actions. That is, any actionis a probability vector where each component
denotes the probability of taking the action. A stationary policy specifies an action distribution as afunction of the state. We use the notation to denote the component of . Let denote the set of all stationary policies.The transition kernel specifies the probability of transitioning to a state upon choosing action in state . The cost function denotes the instantaneous expected cost incurred when selecting action in state . Cost and transition functions are linear in the action probabilities, meaning
(1) 
where is the th standard basis vector, representing one of the possible deterministic actions. With some abuse of notation, we sometime write and . We assume that perperiod costs are nonnegative and uniformly bounded, meaning for all and . The assumption that costs are nonnegative is without loss of generality, as one can always add the same large constant to the cost of each state and action without changing the decision problem.
Costtogo functions and Bellman operators.
Let and be the costtogo function for any policy defined as:
Define the Bellman operator under the policy as
The costtogo function under policy is the unique solution to the Bellman equation, . Similarly, the Bellman optimality operator, denoted by is defined as . The optimal costtogo function, which satisfies for all is the unique fixed point of and that there is at least one optimal policy, that attains this minimum for every .
The stateaction costtogo function corresponding to a policy ,
(2) 
measures the cumulative expected cost of taking action in state and applying thereafter. For any polices , we have the relations
(3) 
Note that for any and , the Qfunction is linear in , as we can write: . As before, we sometimes write , with some abuse of notation.
Loss function and initial distribution.
Policy gradient methods seek to minimize the scalar loss function
in which the states are weighted by their initial probabilities under and we have normalized costs by for convenience. We assume throughout that is supported on the entire state space, meaning that for all which implies that if and only if .
State distributions.
We define the discounted stateoccupancy measure under any policy and initial state distribution as:
(4) 
where and are both row vectors and denotes the Markov transition matrix under , i.e. . Recognize that is the distribution of the state at time , and so evaluates the discounted fraction of time the system spends at state under policy . Note that we have as we assumed for all .
3 Linear convergence of policy iteration
We briefly revisit the classic policy iteration algorithm as our analysis of policy gradient methods is intricately tied to it. At the same time, we review some basic properties related to Bellma operators. Starting with an initial policy , policy iteration first evaluates the corresponding costtogo function , and then finds the policy with
Expressed in terms of Bellman operators, is defined by the property .
Analysis of policy iteration follows from a few basic properties of Bellman operators. First, they are monotone, meaning the elementwise inequality implies and . Next, they are contraction operators contraction operators with respect to the maximum norm. That is, and hold for for any . The unique fixed points of and are and , respectively. Proofs can be found in bertsekas1995dynamic or puterman2014markov.
Let us now apply these properties to analyze policy iteration. Observe that
(5) 
Inductively applying to each side and using its monotonicity property yields the policy improvement property,
(6) 
Since we have,
From this, we conclude that policy iteration converges at least linearly. If is produced by policy iteration, then iterating the inequality above shows
In fact, sometimes policy iteration converges quadratically in the limit (puterman2014markov).
4 A sharp connection between policy gradient and policy iteration
We specialize the presentation of the policy gradient theorem in bhandari2020global to this setting to tabular problems. Define the weighed policy iteration or ”Bellman” objective:
where denotes the weighted inner product and denotes a weighting that places weight on any stateaction pair . For a probability distribution with for each , the policy iteration update to is a minimizer .
The policy gradient theorem connections the gradients of the infinite horizon cost function to gradients of the single period cost function underlying policy iteration. In particular, we have
Equivalently, we can write a first order Taylor expansion of as
To recap, we have interpreted the policy gradient as the gradient of a weighted policy iteration objective. In tabular problems, the policy iteration problem is trivial: the problem is to minimize a linear function of over the simplex and the solution is to put all weight on one action for each state. First order methods applied to can essentially solve such a problem in a single iteration and we’ll show that for this reason several first order methods applied to will converge as rapidly as policy iteration.
5 Policy gradient methods for finite MDPs
We write all algorithms in terms of their evolution in the space of policies . Several of them could instead be viewed as operating in the space of parameters for some parameterized policy class. We discuss this option in Remark 1, but keep our formulation and results focused on the space of policies.
Note that is the fold product of the probability simplex. This form of the policy class will cause certain policy gradient updates to decouple across states. Recall that is our notation for the vector of action probabilites selected at stae .
 FrankWolfe.

Starting with some policy , an iteration of the FrankWolfe algorithm computes
(7) and then updates the policy to . We use the notation in (7) as it is exactly a policy iteration update to so FrankWolfe mimics a softpolicy iteration step, akin to the update in kakade2002approximately. Note, the minimization problem in (7) decouples across states to optimize a linear objective over the probability simplex, so
is a pointmass that places all weight on .
 Projected Gradient Descent.

Starting with some policy , an iteration of the projected gradient descent algorithm with constant stepsize updates to the solution of a regularized problem
As (the regularization term tends to zero), converges to the solution of (7), which is exactly the policy iteration update as noted above. For intermediate values of , the projected gradient update decouples across states and takes the form:
which is a gradient step followed by a projection onto the probability simplex. Note that from an implementation perspective, projections onto the probability simplex involves a computationally efficient () softthresholding operation (duchi2008efficient).
 Mirrordescent.

The mirror descent method adapts to the geometry of the probability simplex by using a noneuclidean regularizer. We focus on using the Kullback Leibler (KL) divergence, a natural choice for the regularizer, under which an iteration of mirror descent updates policy to :
where KL divergence is defined as . It is well know that the solution to this optimization problem is the exponentiated gradient update (bubeck2015convex, Section 6.3),
Again, we can see that converges to a policy iteration update as .
 Natural policy gradient and TRPO.

We consider the natural policy gradient (NPG) algorithm of kakade2002natural which is closely related to the widely used TRPO algorithm of schulman2015trust. We focus on NPG applied to the softmax parameterization above for which it is actually an instance of mirror descent with a specific regularizer. In particular, beginning with some policy , an iteration of NPG updates to :
Here, we have used a natural regularizer that penalizes changes to the the action distribution at states in proportion to their occupancy measure . This yields a type of soft policy iteration update at each state. As discussed above, that this KL divergence regularized problem is solved by an exponential weights update is well known (bubeck2015convex, Section 6.3).
A potential source of confusion is that natural policy gradient is usually described as steepest descent in a variable metric defined by a certain fisher information matrix. But it is known to be equivalent to mirror descent under some conditions (raskutti2015information). In this case, readers can check that the exponentiated update above matches the explicit formula for the NPG update given in kakade2002natural and agarwal2019optimality for the case of softmax policies.
The choice of stepsizes is an important issue for most first order methods. Each of the algorithms above can be applied with a sequence of stepsizes to produce a sequence of policies . We define one stepsize selection rule below.
Exact line search.
At iteration , the update rules for each of the algorithms described above actually specify a new policy for a range of stepsizes, . We consider an idealized stepsize rule using exact line search, which directly optimizes over this choice of stepsize at each iteration, selecting where whenever this minimizer exists. More generally, we define
(8) 
where denotes the closed curve of policies traced out by varying . For FrankWolfe, is the line segment connecting the current policy and its policy iteration update . Under NPG, is a curve where and as . Since is not attainable under any fixed , this curve is not closed. By taking the closure, and defining line search via (8), certain formulas become cleaner. Of course, it is also possible to nearly solve (8) without taking the closure and obtain essentially the same results^{2}^{2}2For example, if we select a parameter that offers half the possible improvement, meaning , then our results follows with some extra factors of 2 in the bounds. One essentially needs to modify Equation (9) in the proof and the rest is the same..
Remark 1 (Policy parameterization and infima vs minima.).
Algorithms for policy gradients are usually presented in terms of parameterized policies. For example, a policy gradient algorithm might search over the parameter of a softmax policy , defined by . Many of our algorithms could equivalent be viewed as searching over . For example, NPG updates the policy by solving
That is, it solves the same minimization problem as described above, but over the parameterized policy class. The class of softmax policies can approximate any policy to arbitrary precision, so there is no practical distinction between defining FrankWolfe, Projected Gradient Descent, Mirror Descent or NPG iterations as solving these optimization problems over the policy parameter or over the policy . But mathematical analysis is much cleaner over the class of policies because it is closed. For example, it contains an optimal policy, whereas the class of softmax policies can only come infinitesimally close to an optimal policy. In practice, optimization problems are never solved beyond machine precision, so we don’t view the distinction between infimum and minimum to be relevant to the paper’s main insights.
6 Main result: geometric convergence
So far, we have described variants of policy gradients for tabular MDPs. All of these algorithms essentially make policy iteration updates when their stepsizes are large. Intuitively, it makes sense to expect that their convergence behavior closely resemble results for policy iteration rather than the analysis of gradient descent. We quantify this precisely in Theorem 1 below.
Our first result confirms that all of the algorithms we presented in the previous section converge geometrically if stepsizes are set by exact line search on . Again, the idea is that a policy gradient is a policy iteration update for an appropriate choice of stepsize. Our proof effectively shows that exact line search updates make at least as much progress in reducing as a policy iteration update. The mismatch between the policy gradient loss , which governs the stepsize choice, and the maximum norm, which governs policy iteration convergence, is the source of the term in the bound.
Our second results shows that dependence on the initial distribution can be avoided by forcing the algorithm to use appropriately large constant stepsizes. The simplest result applies to the FrankWolfe algorithm, which we already showed to be exactly equivalent to a soft policy iteration update. Similar results should apply to other algorithms as well, but more intricate calculations are needed to quantify how large stepsizes must be.
Theorem 1 (Geometric convergence).
Suppose one of the firstorder algorithms in subsection 5 is applied to minimize over with stepsize sequence . Let denote the initial policy and denote the sequence of iterates. The following bounds apply:

[label=()]

Exact line search. If either FrankWolfe, projected gradient descent, mirror descent, or NPG is applied with stepsizes chosen by exact line search as in (8), then

Constant stepsize FrankWolfe. Under FrankWolfe with constant stepsize ,
Proof.
Throughout the proof, we use some standard properties of the Bellman operator as described in Section 3. We denote to be the policy iteration update to any policy .
Part (a): Exact linesearch:
Under each algorithm and at each iteration , the policy iteration update is contained in the class introduced in (8). Therefore, for each algorithm,
(9) 
Recall policy improvement property in (6), which shows . Denote . We have,
Rearranging terms gives,
where the later inequalities follow by inductively applying the first one. We immediatley have the looser bound . The final result follows from observing .
Part (b): Constant stepsize FrankWolfe:
The proof makes simple modifications to the classic policy iteration analysis review in Section 3. Recall from Section 5 that the FrankWolfe update exactly equals a softpolicy iteration update:
where is the policy iteration update to . Thus, starting from a feasible policy , we always maintain feasibility for . By the linearity in (1), for any state ,
where the last step uses that by definition of the policy iteration update. Using as in (5), we get
Monotonicty of implies using that . Therefore,
Subtracting from both sides shows
Since the above inequality holds elementwise,
where we again use that and as is a contraction. Iterating over the above equation gives us our final result:
∎
plus 0.4ex
Comments
There are no comments yet.