1 Introduction
Reinforcement learning (RL) (Sutton and Barto, 2018) is a field of machine learning that tackles the problem of learning how to act in an unknown dynamic environment. An agent interacts with the environment and receives feedback on its actions in the form of a state-dependent reward signal. Using this experience, the agent's goal is to find a policy that maximizes the long-term reward.

There are two main approaches for learning such a policy: model-based and model-free. The model-based approach estimates the system's model and uses it to assess the long-term effects of actions via full-planning (e.g., Jaksch et al. 2010). Model-based RL algorithms usually enjoy good performance guarantees, which are measured by their regret – the difference between the sum of rewards gained by playing an optimal policy and the sum of rewards that the agent accumulates (Jaksch et al., 2010; Bartlett and Tewari, 2009). Nevertheless, model-based algorithms suffer from high space and computation complexity. The first is caused by the need to store a model. The second is due to the frequent full-planning, which requires a full solution of the estimated model. Alternatively, model-free RL algorithms directly estimate quantities that take into account the long-term effect of an action, thus avoiding model estimation and planning operations altogether (Jin et al., 2018). These algorithms usually enjoy better computational and space complexity, but seem to have worse performance guarantees.

In many applications, the high computational complexity of model-based RL is infeasible. Thus, practical model-based approaches alleviate this computational burden by using short-term planning, e.g., Dyna (Sutton, 1991), instead of full-planning. To the best of our knowledge, there are no regret guarantees for such algorithms, even in the tabular setting. This raises the following question: Can a model-based approach coupled with short-term planning enjoy the favorable performance of model-based RL?
In this work, we show that model-based algorithms that use 1-step planning can achieve the same performance as algorithms that perform full-planning, thus answering the above question affirmatively. As Table 1 shows, this narrows the computational-complexity gap between the model-based and model-free approaches. To this end, we adapt the Real-Time Dynamic Programming (RTDP) algorithm (Barto et al., 1995), which finds the optimal policy of a known model by acting greedily based on 1-step planning. We demonstrate how RTDP can be incorporated into model-based RL algorithms, and prove that the regret of the resulting algorithms remains unchanged, while their computational complexity drastically decreases.
The contributions of our paper are as follows. We first prove regret bounds for RTDP when the model is known. To do so, we establish some concentration results on Decreasing Bounded Processes, which are of independent interest. We then show that the regret bound translates into a Uniform Probably Approximately Correct (PAC) (Dann et al., 2017) bound for RTDP that greatly improves existing PAC results (Strehl et al., 2006). Next, we move to the learning problem, where the model is unknown. Based on the results developed for RTDP, and focusing on finite-horizon MDPs, we adapt UCRL2 (Jaksch et al., 2010) and EULER (Zanette and Brunskill, 2019), both of which act by full-planning, into UCRL2 with Greedy Policies (UCRL2-GP) and EULER with Greedy Policies (EULER-GP): model-based algorithms that act by 1-step planning. The adapted versions are shown to preserve the performance guarantees, while improving in terms of computational complexity.

Table 1: Regret, time complexity, and space complexity of the compared algorithms.

Algorithm | Regret | Time Complexity | Space Complexity
UCRL2 (Jaksch et al., 2010) | | |
UCBVI (Azar et al., 2017) | | |
EULER (Zanette and Brunskill, 2019) | | |
UCRL2-GP | | |
EULER-GP | | |
Q-learning (Jin et al., 2018) | | |
Lower bounds | | – | –
2 Notations and Definitions
We consider finite-horizon MDPs with time-independent dynamics (Bertsekas and Tsitsiklis, 1996). A finite-horizon MDP is defined by a tuple whose components include the state and action spaces, with cardinalities $S$ and $A$, respectively. The immediate reward for taking an action $a$ at state $s$ is a random variable with expectation $r(s,a)$. The transition probability $p(s' \mid s, a)$ is the probability of transitioning to state $s'$ upon taking action $a$ at state $s$. Furthermore, we consider the maximum number of non-zero transition probabilities across all state-action pairs; if this number is unknown to the designer of the algorithm in advance, then we set it to $S$. The initial state of each episode is arbitrarily chosen, and $H$ is the horizon, i.e., the number of timesteps in each episode. Throughout the paper, we use $h$ and $k$ to denote the timestep inside an episode and the index of an episode, respectively.

A deterministic policy $\pi$ is a mapping from states and timestep indices to actions. We denote by $\pi_h(s)$ the action taken at timestep $h$ at state $s$ according to a policy $\pi$. The quality of a policy $\pi$ from state $s$ at timestep $h$ is measured by its value function, defined as
\[
V^{\pi}_h(s) = \mathbb{E}\Big[\sum_{h'=h}^{H} r\big(s_{h'}, \pi_{h'}(s_{h'})\big) \,\Big|\, s_h = s\Big],
\]
where the expectation is over the environment's randomness. An optimal policy $\pi^*$ maximizes this value for all states $s$ and timesteps $h$, and the corresponding optimal value is denoted by $V^*_h(s)$. The optimal value satisfies the optimal Bellman equation, i.e.,
\[
V^*_h(s) = \max_{a}\Big\{ r(s,a) + \sum_{s'} p(s' \mid s, a)\, V^*_{h+1}(s') \Big\}, \qquad V^*_{H+1}(s) = 0 \ \text{ for all } s. \tag{1}
\]
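As a point of reference for the 1-step planning studied below, the following is a minimal Python sketch of full-planning, i.e., solving the optimal Bellman equation (1) by backward induction over a known model. The array shapes, the reward range, and the function name are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

def backward_induction(P, R, H):
    """Full-planning: solve the finite-horizon optimal Bellman equation (1).

    P: transition tensor of shape (S, A, S) with P[s, a, s'] = p(s' | s, a).
    R: expected rewards of shape (S, A), assumed to lie in [0, 1].
    Returns the optimal values V (shape (H + 1, S)) and a greedy policy (shape (H, S)).
    """
    S, A = R.shape
    V = np.zeros((H + 1, S))            # V[H] plays the role of V*_{H+1} = 0
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V[h + 1]            # Q[s, a] = r(s, a) + sum_s' p(s'|s, a) V[h+1, s']
        V[h] = Q.max(axis=1)
        pi[h] = Q.argmax(axis=1)
    return V, pi
```

Note that such a sweep touches every state-action pair at every timestep, which is exactly the per-episode cost that the greedy-policy algorithms of Section 4 avoid.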
We consider an agent that repeatedly interacts with an MDP in a sequence of episodes $k = 1, \ldots, K$. The performance of the agent is measured by its regret; a standard episodic form of this definition is given below. Throughout this work, the policy $\pi_k$ is computed by a 1-step planning operation with respect to the value function estimated by the algorithm at the end of the previous episode; we also call such a policy a greedy policy. Moreover, $s_h^k$ and $a_h^k$ stand, respectively, for the state and the action taken at timestep $h$ of episode $k$.
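For concreteness, the regret admits the standard episodic form (a reconstruction using the notation above, where $s_1^k$ is the initial state of episode $k$ and $\pi_k$ is the greedy policy played in it):
\[
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \Big( V^*_1(s_1^k) - V^{\pi_k}_1(s_1^k) \Big).
\]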
Next, we define the filtration that includes all events (states, actions, and rewards) until the end of an episode, as well as the initial state of the following episode. We denote the total number of timesteps (samples), the number of times the agent has visited each state-action pair, and the empirical average of a random variable; the latter two quantities are based on the experience gathered until the end of the episode and are measurable with respect to this filtration. We also define the probability of visiting a given state-action pair at a given timestep of an episode, and note that this probability is measurable as well.
We use $\tilde{O}(\cdot)$ to refer to a quantity up to a polylogarithmic expression of a quantity at most polynomial in the problem parameters (the numbers of states and actions, the horizon, the number of episodes, and the confidence parameter). Similarly, $\lesssim$ represents an inequality that holds up to numerical constants or polylogarithmic factors. For a probability distribution $p$ over the domain of a function $V$, we write $pV := \sum_{x} p(x) V(x)$. Lastly, $\Delta_S$ denotes the set of probability distributions over the state space.

3 Real-Time Dynamic Programming
RTDP (Barto et al., 1995) is a well-known algorithm for solving an MDP when a model of the environment is given. Unlike, e.g., Value Iteration (VI) (Bertsekas and Tsitsiklis, 1996), which solves an MDP by offline calculations, RTDP solves an MDP in a real-time manner. As mentioned in Barto et al. (1995), RTDP can be interpreted as an asynchronous VI adapted into a real-time algorithm.
Algorithm 1 contains the pseudocode of RTDP for finite-horizon MDPs. The value function is initialized with an optimistic value, i.e., an upper bound of the optimal value. At each timestep $h$ of episode $k$, the agent acts greedily from the current state with respect to the current value at the next timestep, $h+1$. It then updates the value of the current state at timestep $h$ according to the optimal Bellman operator. As we show in the following, this maintained value function always upper bounds the optimal value. Note that since the action at a fixed state is chosen according to the value estimate from the previous episode, the greedy policy is measurable with respect to the past.
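The following is a minimal Python sketch of this loop for a known model; the array shapes, the reward range [0, 1], and the fixed initial state are simplifying assumptions, not part of the paper's Algorithm 1.

```python
import numpy as np

def rtdp(P, R, H, s0, num_episodes, seed=0):
    """Finite-horizon RTDP sketch: greedy 1-step planning plus a Bellman update
    of the visited state only.

    P: transition tensor of shape (S, A, S); R: expected rewards of shape (S, A) in [0, 1].
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    # Optimistic initialization: at timestep h there are H - h rewards left, each at most 1.
    V = np.array([[float(H - h)] * S for h in range(H + 1)])  # V[H] acts as V_{H+1} = 0

    for _ in range(num_episodes):
        s = s0
        for h in range(H):
            q = R[s] + P[s] @ V[h + 1]      # 1-step lookahead with the exact model
            a = int(np.argmax(q))
            V[h][s] = q[a]                  # Bellman update of the visited state
            s = rng.choice(S, p=P[s, a])    # environment transition
    return V
```

Because only the visited state is updated at each timestep, an episode of this sketch costs on the order of H * A * S operations rather than a full sweep over all states.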
Since RTDP is an online algorithm, i.e., it updates its value estimates through interactions with the environment, it is natural to measure its performance in terms of regret. The rest of this section is devoted to providing expected and high-probability bounds on the regret of RTDP, which also lead to PAC bounds for this algorithm. In Section 4, based on the observations from this section, we establish minimax regret bounds for 1-step greedy model-based RL.
We start by stating two basic properties of RTDP in the following lemma: the value is always optimistic and is monotonically decreasing across episodes (see proof in Appendix B). Although the first property is known (Barto et al., 1995), to the best of our knowledge, the second one has not been proven in previous work.
Lemma 1.
For all , , and , it holds that (i) and (ii) .
The following lemma, which we believe is new, relates the difference between the optimistic value and the real value to the expected cumulative update of the value function at the end of the episode (see proof in Appendix B).
Lemma 2 (Value Update for Exact Model).
The expected cumulative value update of RTDP at the episode satisfies
The result relates the difference between the optimistic value and the value of the greedy policy to the expected update along the trajectory created by following that policy. Thus, for example, if the optimistic value is overestimated, then the value update throughout this episode is expected to be large.
3.1 Regret and PAC Analysis
Using Lemma 1, we observe that the sequence of values is decreasing and bounded from below. Thus, intuitively, the decrements of the values cannot be indefinitely large. Importantly, Lemma 2 states that when the expected decrements of the values are small, the optimistic value is close to the value of the greedy policy, and thus to the optimal value, since the former upper bounds the latter.
Building on this reasoning, we establish a general result on decreasing processes. This result allows us to formally justify the aforementioned reasoning and derive regret bounds for RTDP. The proof utilizes self-normalized concentration bounds (de la Peña et al., 2007) applied to martingales, and can be found in Appendix A.
Definition 1 (Decreasing Bounded Process).
We call a random process, adapted to a filtration, a Decreasing Bounded Process if it satisfies the following properties:

1. The process is decreasing, i.e., each term is almost surely no larger than its predecessor.

2. The process is bounded, i.e., its terms almost surely lie in a bounded interval.
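Written out under the assumption that the process is denoted $(X_n)_{n \ge 0}$, adapted to a filtration $(\mathcal{F}_n)_{n \ge 0}$, and bounded by a constant $C > 0$ (a reconstruction of the notation, not the paper's exact display), the two properties read
\[
X_{n+1} \le X_n \ \text{ a.s. for all } n, \qquad 0 \le X_n \le C \ \text{ a.s. for all } n.
\]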
Theorem 3 (Regret Bound of a Decreasing Bounded Process).
Let be a Decreasing Bounded Process and be its round regret. Then,
Specifically, it holds that
We are now ready to prove the central result of this section, the expected and highprobability regret bounds on RTDP (see full proof in Appendix B).
Theorem 4 (Regret Bounds for RTDP).
The following regret bounds hold for RTDP:


For any , with probability , for all ,
Proof Sketch.
Theorem 4 exhibits a regret bound that does not grow with the number of episodes. While it is expected that RTDP, which has access to the exact model, would achieve better performance than an RL algorithm without such access, a regret bound of this form is a noteworthy result. Indeed, it leads to the following Uniform PAC (see Dann et al. 2017 for the definition) and PAC guarantees for RTDP (see proofs in Appendix B). To the best of our knowledge, both are the first PAC guarantees for RTDP.¹

¹ Existing PAC results on RTDP analyze variations of RTDP in which the accuracy parameter is an input to the algorithm.
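The conversion behind these corollaries can be sketched as follows. Assuming the high-probability regret bound of Theorem 4 is some quantity $\bar{R}$ that does not grow with the number of episodes $K$, every episode in which the played policy is more than $\epsilon$-suboptimal contributes at least $\epsilon$ to the regret, so the number $N_\epsilon$ of such episodes satisfies
\[
\epsilon \, N_\epsilon \le \bar{R} \quad \Longrightarrow \quad N_\epsilon \le \frac{\bar{R}}{\epsilon} \qquad \text{simultaneously for all } \epsilon > 0,
\]
which is the Uniform PAC requirement; the symbols $\bar{R}$ and $N_\epsilon$ are placeholders used here for illustration.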
Corollary 5 (RTDP is Uniform PAC).
Let and be the number of episodes in which RTDP outputs a policy with . Then,
Corollary 6 (RTDP is PAC).
Consider the number of episodes in which RTDP outputs a non-optimal policy, and define the (unknown) gap of the MDP. Then,
4 Exploration in Model-based RL: Greedy Policy Achieves Minimax Regret
We start this section by formulating a general optimistic RL scheme that acts by 1-step planning (see Algorithm 2). Then, we establish Lemma 7, which generalizes Lemma 2 to the case where a non-exact model is used for the value updates. Using this lemma, we offer a novel regret decomposition for algorithms that follow Algorithm 2. Based on this decomposition, we analyze generalizations of UCRL2 (Jaksch et al., 2010) (for finite-horizon MDPs) and EULER (Zanette and Brunskill, 2019) that use greedy policies instead of solving an MDP (full-planning) at the beginning of each episode. Surprisingly, we find that both generalized algorithms do not suffer from performance degradation, up to numerical constants and logarithmic factors. Thus, we conclude that there exists an RL algorithm that achieves the minimax regret bound while acting according to greedy policies.
Consider the general RL scheme that explores by greedy policies, as depicted in Algorithm 2. The value is initialized optimistically, and the algorithm interacts with the unknown environment in an episodic manner. At each timestep, a greedy action from the current state is calculated optimistically, based on the empirical model and the current value at the next timestep. This is done in a subroutine called ‘ModelBaseOptimisticQ’. We further assume that the optimistic Q-function is computed from an optimistic model, to which we refer below. The agent interacts with the environment based on the greedy policy with respect to this Q-function and uses the gathered experience to update the empirical model at the end of the episode.
By construction of the update rule (see Line 7 of Algorithm 2), the value is a decreasing function of the episode index for every state and timestep. Thus, property (ii) in Lemma 1 holds for Algorithm 2. Furthermore, the algorithms analyzed in this section are also optimistic with high probability, i.e., property (i) in Lemma 1 holds as well. Finally, since the value update uses the empirical quantities from the previous episode, the greedy policy is still measurable with respect to the past.
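A schematic Python sketch of this scheme is given below. It is not the paper's Algorithm 2: the environment interface, the abstract bonus function, and the clipping/minimum operations are illustrative assumptions chosen so that the value stays optimistic and monotonically decreasing.

```python
import numpy as np

def greedy_optimistic_rl(env, S, A, H, num_episodes, bonus):
    """Model-based RL that explores with greedy (1-step planned) policies.

    env.reset() -> initial state; env.step(a) -> (reward, next_state), reward in [0, 1].
    bonus(n) is an abstract exploration bonus as a function of the visit count n.
    """
    n = np.zeros((S, A))                              # visit counts
    r_hat = np.zeros((S, A))                          # empirical mean rewards
    p_hat = np.full((S, A, S), 1.0 / S)               # empirical transition model
    V = np.array([[float(H - h)] * S for h in range(H + 1)])  # optimistic initialization

    for _ in range(num_episodes):
        s = env.reset()
        episode = []
        for h in range(H):
            # Optimistic 1-step planning at the current state only.
            q = r_hat[s] + p_hat[s] @ V[h + 1] + np.array([bonus(n[s, a]) for a in range(A)])
            q = np.minimum(q, H - h)                  # values never exceed the remaining horizon
            a = int(np.argmax(q))
            V[h][s] = min(V[h][s], q[a])              # keep the value monotonically decreasing
            r, s_next = env.step(a)
            episode.append((s, a, r, s_next))
            s = s_next
        # Update the empirical model with the gathered experience at the end of the episode.
        for (s_t, a_t, r_t, s_n) in episode:
            n[s_t, a_t] += 1
            c = n[s_t, a_t]
            r_hat[s_t, a_t] += (r_t - r_hat[s_t, a_t]) / c
            p_hat[s_t, a_t] *= (c - 1) / c
            p_hat[s_t, a_t, s_n] += 1.0 / c
    return V
```

Instantiating the optimism (here a simple additive bonus) with UCRL2-style or EULER-style confidence widths corresponds, roughly, to the two algorithms analyzed next.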
The following lemma generalizes Lemma 2 to the case where, unlike in RTDP, the update rule does not use the exact model (see proof in Appendix C).
Lemma 7 (Value Update for Optimistic Model).
The expected cumulative value update of Algorithm 2 in the episode is bounded by
In the rest of this section, we consider two instantiations of the subroutine ‘ModelBaseOptimisticQ’ in Algorithm 2. We use the bonus terms of UCRL2 and of EULER to acquire an optimistic Q-function. These two options lead to the UCRL2 with Greedy Policies (UCRL2-GP) and EULER with Greedy Policies (EULER-GP) algorithms.
4.1 UCRL2 with Greedy Policies for Finite-Horizon MDPs
We form the optimistic local model based on the confidence set of UCRL2 (Jaksch et al., 2010). This amounts to using Algorithm 3 as the subroutine ‘ModelBaseOptimisticQ’ in Algorithm 2. The maximization problem on Line 3 of Algorithm 3 is common when using a bonus based on an optimistic model (Jaksch et al., 2010), and it can be solved efficiently (e.g., Strehl and Littman 2008, Section 3.1.5). A full version of the algorithm can be found in Appendix D.
Thus, Algorithm 3 requires far fewer operations per episode than full-planning; in particular, it saves the need to perform Extended Value Iteration (Jaksch et al., 2010), which is substantially more expensive per episode. Despite the significant improvement in terms of computational complexity, the regret of UCRL2-GP is similar to that of UCRL2 (Jaksch et al., 2010), as the following theorem formalizes (see proof in Appendix D).
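The inner maximization can be carried out by the standard procedure of Strehl and Littman (2008, Section 3.1.5): shift as much probability mass as the L1 confidence radius allows onto the highest-value next state, and remove it from the lowest-value ones. The sketch below illustrates that procedure with assumed inputs (the empirical transition vector, the value at the next timestep, and a confidence radius); it is not the paper's Algorithm 3.

```python
import numpy as np

def optimistic_transition(p_hat, v_next, eps):
    """Maximize sum_s' p(s') * v_next[s'] over {p : ||p - p_hat||_1 <= eps, p a distribution}."""
    p = p_hat.copy()
    best = int(np.argmax(v_next))
    budget = min(eps / 2.0, 1.0 - p[best])
    p[best] += budget
    # Remove the same amount of mass from the states with the smallest values.
    for s in np.argsort(v_next):
        if budget <= 0:
            break
        if s == best:
            continue
        removed = min(p[s], budget)
        p[s] -= removed
        budget -= removed
    return p
```

The optimistic action-value at the visited state is then obtained by combining the (bonus-inflated) empirical reward with the inner product of this optimistic transition vector and the value at the next timestep, and maximizing over actions.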
Theorem 8 (Regret Bound of UCRL2-GP).
For any time , with probability at least , the regret of UCRL2-GP is bounded by .
Proof Sketch.
Using the optimism of the value function (see Section D.2) and applying Lemma 7, we bound the regret as follows:
(4) 
Thus, the regret is upper bounded by two terms. As in Theorem 4, by applying Lemma 11 (Appendix A), the first term in (4) is a sum of Decreasing Bounded Processes and can thus be bounded accordingly. The presence of the second term in (4) is common in recent regret analyses (e.g., Dann et al. 2017). Using standard techniques (Jaksch et al., 2010; Dann et al., 2017; Zanette and Brunskill, 2019), this term can be bounded (up to additive constant factors) with high probability. ∎
4.2 EULER with Greedy Policies
In this section, we use bonus terms as in EULER (Zanette and Brunskill, 2019). Similarly to the previous section, this amounts to replacing the subroutine ‘ModelBaseOptimisticQ’ in Algorithm 2 with a subroutine based on the bonus terms from Zanette and Brunskill (2019). Algorithm 5 in Appendix E contains the pseudocode of the algorithm. The bonus terms in EULER are based on the empirical Bernstein inequality and on tracking both an upper bound and a lower bound on the value. Using these, EULER achieves both minimax-optimal and problem-dependent regret bounds.
EULER (Zanette and Brunskill, 2019) performs a full sweep of computations per episode (the same as the VI algorithm), while EULER-GP requires substantially fewer. Despite this advantage in computational complexity, EULER-GP exhibits minimax regret bounds similar to those of EULER (see proof in Appendix E), much like the equivalent performance of UCRL2 and UCRL2-GP proved in Section 4.1.
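As one illustrative ingredient, the empirical Bernstein inequality (Maurer and Pontil, 2009) yields a variance-aware confidence width of the following form; this is only a generic sketch, not EULER's exact bonus terms, which are more involved.

```python
import math

def empirical_bernstein_bonus(sample_var, n, delta, value_range=1.0):
    """Empirical-Bernstein confidence width for a mean estimated from n samples
    lying in an interval of length `value_range` (Maurer and Pontil, 2009)."""
    if n <= 1:
        return value_range
    log_term = math.log(2.0 / delta)
    return (math.sqrt(2.0 * sample_var * log_term / n)
            + 7.0 * value_range * log_term / (3.0 * (n - 1)))
```

A bonus of this flavor shrinks quickly for state-action pairs whose outcomes have small empirical variance, which is what underlies the tighter, problem-dependent guarantees.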
Theorem 9 (Regret Bound of EULER-GP).
Let be an upper bound on the total reward collected within an episode. Define and . With probability at least , for any time jointly on all episodes , the regret of EULER-GP is bounded by . Thus, it is also bounded by .
5 Related Work
Real-Time Dynamic Programming: RTDP (Barto et al., 1995) has been extensively used and has many variants that exhibit superior empirical performance (e.g., Bonet and Geffner 2003; McMahan et al. 2005; Smith and Simmons 2006). For planning in discounted MDPs, Strehl et al. (2006) proved PAC bounds for a modified version of the algorithm that updates the value only if the update size is larger than a given threshold. Bonet and Geffner (2003) proved a convergence rate for Labeled RTDP in the stochastic shortest path problem, with an equivalent rate in finite-horizon MDPs. Nevertheless, these algorithms explicitly use an accuracy parameter to mark states with an accurate value estimate. We prove that RTDP converges without knowing such a parameter. Indeed, Strehl et al. (2006) posed whether the original RTDP is PAC as an open problem. Furthermore, no regret bound for RTDP has been reported in the literature.
Regret bounds for RL: The most renowned algorithms with regret guarantees for undiscounted infinite-horizon MDPs are UCRL2 (Jaksch et al., 2010) and REGAL (Bartlett and Tewari, 2009), which have been extended throughout the years (e.g., by Fruit et al. 2018; Talebi and Maillard 2018). Recently, there has been increasing interest in regret bounds for MDPs with a finite horizon and stationary dynamics. In this scenario, UCRL2 enjoys a regret guarantee as well. Azar et al. (2017) proposed UCBVI, with an improved regret bound that is also asymptotically tight (Osband and Van Roy, 2016). Dann et al. (2018) presented ORLC, which achieves tight regret bounds and (nearly) tight PAC guarantees for non-stationary MDPs. Finally, Zanette and Brunskill (2019) proposed EULER, an algorithm that enjoys tight minimax regret bounds and has additional problem-dependent bounds that encapsulate the MDP's complexity. All of these algorithms are model-based and require full-planning. Model-free RL was analyzed by Jin et al. (2018); there, the authors exhibit regret bounds that are worse, by a polynomial factor, than the lower bound. To the best of our knowledge, there are no model-based algorithms with regret guarantees that avoid full-planning. It is worth noting that while all the above algorithms, and the ones in this work, rely on the Optimism in the Face of Uncertainty principle (Lai and Robbins, 1985), there also exist Thompson Sampling model-based RL algorithms (Osband et al., 2013; Gopalan and Mannor, 2015; Agrawal and Jia, 2017; Osband and Van Roy, 2017). There, a model is sampled from a distribution over models, on which full-planning takes place.

Greedy policies in model-based RL: By adjusting RTDP to the case where the model is unknown, Strehl et al. (2012) formulated model-based RL algorithms that act using a greedy policy. They proved a sample complexity bound for discounted MDPs. To the best of our knowledge, there are no regret bounds for model-based RL algorithms that act by greedy policies.
Practical model-based RL: Due to the high computational complexity of planning in model-based RL, most practical algorithms are model-free (e.g., Mnih et al. 2015). Algorithms that do use a model usually only take advantage of local information. For example, Dyna (Sutton, 1991; Peng et al., 2018) randomly samples state-action pairs and updates them according to a local model. Other papers use the local model to plan for a short horizon from the current state (Tamar et al., 2016; Hafner et al., 2018). The performance of such algorithms depends heavily on the planning horizon, which in turn dramatically increases the computational complexity.
6 Conclusions and Future Work
In this work, we established that tabular model-based RL algorithms can explore by 1-step planning instead of full-planning, without suffering from performance degradation. Specifically, exploring with model-based greedy policies can be minimax optimal in terms of regret. Put differently, the variance caused by exploring with greedy policies is smaller than the variance caused by learning a sufficiently good model. Indeed, the extra term that appears due to the greedy exploration (e.g., the first term in (4)) is a constant term, smaller than the existing constant terms of UCRL2 and EULER.

This work raises and highlights some interesting research questions. The obvious ones are extensions to average-reward and discounted MDPs, as well as to Thompson-sampling-based RL algorithms. Although these scenarios are harder or different in terms of analysis, we believe this work introduces the relevant approach for tackling them. Another interesting question is the applicability of the results in large-scale problems, where a tabular representation is infeasible and approximation must be used. There, algorithms that act using lookahead policies, instead of 1-step planning, are expected to yield better performance, as they are less sensitive to value approximation errors (e.g., Bertsekas and Tsitsiklis 1996; Jiang et al. 2018; Efroni et al. 2018b, a). Even then, full-planning, as opposed to short-horizon planning, might be unnecessary. Lastly, establishing whether the model-based approach is or is not provably better than the model-free approach, as the current state of the literature suggests, remains an important open problem.
Acknowledgments
We thank Oren Louidor for illuminating discussions regarding the Decreasing Bounded Process, and Esther Derman for very helpful comments.
References
 Agrawal and Jia [2017] Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194, 2017.
 Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 263–272. JMLR.org, 2017.

 Bartlett and Tewari [2009] Peter L Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.
 Barto et al. [1995] Andrew G Barto, Steven J Bradtke, and Satinder P Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81–138, 1995.
 Bertsekas and Tsitsiklis [1996] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-Dynamic Programming, volume 5. Athena Scientific, Belmont, MA, 1996.
 Bonet and Geffner [2003] Blai Bonet and Hector Geffner. Labeled RTDP: Improving the convergence of real-time dynamic programming. In ICAPS, volume 3, pages 12–21, 2003.
 Dann et al. [2017] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
 Dann et al. [2018] Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable reinforcement learning. arXiv preprint arXiv:1811.03056, 2018.
 de la Peña et al. [2007] Victor H de la Peña, Michael J Klass, Tze Leung Lai, et al. Pseudo-maximization and self-normalized processes. Probability Surveys, 4:172–192, 2007.
 de la Peña et al. [2008] Victor H de la Peña, Tze Leung Lai, and Qi-Man Shao. Self-Normalized Processes: Limit Theory and Statistical Applications. Springer Science & Business Media, 2008.
 Efroni et al. [2018a] Yonathan Efroni, Gal Dalal, Bruno Scherrer, and Shie Mannor. How to combine tree-search methods in reinforcement learning. arXiv preprint arXiv:1809.01843, 2018a.
 Efroni et al. [2018b] Yonathan Efroni, Gal Dalal, Bruno Scherrer, and Shie Mannor. Multiple-step greedy policies in approximate and online reinforcement learning. In Advances in Neural Information Processing Systems, pages 5238–5247, 2018b.
 Fruit et al. [2018] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. arXiv preprint arXiv:1802.04020, 2018.
 Gopalan and Mannor [2015] Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. In Conference on Learning Theory, pages 861–898, 2015.
 Hafner et al. [2018] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
 Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
 Jiang et al. [2018] Daniel R Jiang, Emmanuel Ekwedike, and Han Liu. Feedback-based tree search for reinforcement learning. arXiv preprint arXiv:1805.05935, 2018.
 Jin et al. [2018] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
 Lai and Robbins [1985] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
 Maurer and Pontil [2009] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
 McMahan et al. [2005] H Brendan McMahan, Maxim Likhachev, and Geoffrey J Gordon. Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees. In Proceedings of the 22nd International Conference on Machine Learning, pages 569–576. ACM, 2005.
 Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Osband and Van Roy [2016] Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.
 Osband and Van Roy [2017] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2701–2710. JMLR.org, 2017.
 Osband et al. [2013] Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
 Peng et al. [2018] Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, Kam-Fai Wong, and Shang-Yu Su. Deep Dyna-Q: Integrating planning for task-completion dialogue policy learning. arXiv preprint arXiv:1801.06176, 2018.
 Smith and Simmons [2006] Trey Smith and Reid Simmons. Focused real-time dynamic programming for MDPs: Squeezing more out of a heuristic. In AAAI, pages 1227–1232, 2006.
 Strehl and Littman [2008] Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
 Strehl et al. [2006] Alexander L Strehl, Lihong Li, and Michael L Littman. PAC reinforcement learning bounds for RTDP and Rand-RTDP. In Proceedings of the AAAI Workshop on Learning for Search, 2006.
 Strehl et al. [2012] Alexander L Strehl, Lihong Li, and Michael L Littman. Incremental model-based learners with formal learning-time guarantees. arXiv preprint arXiv:1206.6870, 2012.
 Sutton [1991] Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991.
 Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
 Talebi and Maillard [2018] Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in MDPs. arXiv preprint arXiv:1803.01626, 2018.
 Tamar et al. [2016] Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162, 2016.
 Weissman et al. [2003] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep., 2003.
 Zanette and Brunskill [2019] Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. arXiv preprint arXiv:1901.00210, 2019.
Appendix A Proofs on Decreasing Bounded Processes
In this section, we state and prove useful results on Decreasing Bounded Processes (see Definition 1). These results will be used in the proofs of the central theorems of this work.
See Theorem 3.
Proof.
Without loss of generality, assume , since otherwise the results are trivial. We start by remarking that is almost surely monotonically increasing, since . Define the martingale difference process
and the martingale process . Since almost surely, can be bounded by . Also define the quadratic variations as and . Next, recall Theorem 2.7 of [de la Peña et al., 2007]:
Theorem 10.
Let and be two random variables, such that for all , we have
(5) 
Then, ,
(6) 
Condition (5) holds for and , due to Theorem 9.21 of [de la Peña et al., 2008]. can be easily bounded by . To bound , we first calculate and :
Thus,
In (*), we used the fact that the process is decreasing almost surely, which allows us to conclude that the cross-term is non-positive. In (**), we bounded the remaining term. We can also bound this quantity, since each of the summands is almost surely non-negative, and thus,
Combining all of the above bounds yields
Finally, we can bound by
Combining everything we obtain
Equivalently, substituting into (6), we have
Next, notice that for , the function is monotonically increasing for any :
Moreover, for ,
where the inequality holds since . Thus, if , then , and we can bound the probability that by
and setting , we obtain
We remark that since the regret is monotonically increasing almost surely, this bound also implies that
To obtain a uniform bound, that is, a bound that holds for all rounds, note that the random sequence is monotonically increasing and bounded. Thus, by monotone convergence,