• Skip to primary navigation
  • Skip to content
  • Skip to primary navigation
  • Skip to content

Early Career Research

Early Career Research at IDRE

UCLA Logo
Institute for Digital Research and Education
Image Link to I.D.R.E's TwitterImage Link to I.D.R.E's LinkedInImage Link to I.D.R.E's Facebook page
  • About
  • IDRE Fellows
  • ECR Projects
  • ECR Resources
  • ECR Events
You are here: Home / ECR Projects / Markov Decision Problem

Markov Decision Problem

Policy Iteration (PI) is a popular family of algorithms to compute an optimal policy for a given Markov Decision Problem (MDP). Starting from an arbitrary initial policy, PI repeatedly performs locally-improving switches until an optimal policy is found. The exact form of the switching rule gives rise to different variants of PI. Two decades ago, Mansour and Singh [1999] provided the first non-trivial upper bound on the number of iterations taken by “Howard’s PI” (HPI), a widely-used variant of PI. They also proposed a randomised variant (RPI) with an even tighter upper bound. Their bounds for HPI and RPI have not been improved subsequently, although curiously, these algorithms tend to perform much better in
practice than theoretically-superior variants.
We revisit the algorithms and analysis of Mansour and Singh [1999]. We prove a novel result on the structure of the policy space for k-action MDPs, k≥2, which generalizes a result known for k=2. Also proposing a new counting argument, we obtain a bound of (O((k logk)^0.5))^n iterations for an algorithm akin to RPI, improving significantly upon Mansour and Singh’s original bound of roughly O((k/2)^n). Similar analysis of a randomised variant of HPI also yields an upper bound of (O((k logk)^0.5))^n iterations, registering the first exponential improvement for HPI over the trivial bound of k^n . Our other contributions include a lower bound of Ω(n) iterations for RPI and an upper bound of 1.6001^n iterations for a randomised variant of “BatchSwitching PI” [Kalyanakrishnan et al., 2016a] on 2-action MDPs—the tightest upper bound shown yet for the PI family.

Keywords: Markov Decision Processes, Planning, Control, Randomised Algorithms

Author: Meet Taraviya

UCLA OIT

© 2025 UC REGENTS TERMS OF USE & PRIVACY POLICY

  1. HOME
  2. CONTACT
  3. PEOPLE
  4. LOGIN