## Sequential Frameworks For Statistics-based Value Function Representation In Approximate Dynamic Programming

##### Abstract

Dynamic programming (DP) was derived by Bellman in 1957 as a mathematical programming method to solve multistage decision problems. The DP concept has great potential, but it only can solve small problems exactly under very limiting restrictions such as linear dynamics, quadratic cost, Gaussian random variable, etc. With advances in computational power, a new family of dynamic programming known as approximate dynamic programming (ADP) has emerged. A real-world DP problem is often high-dimensional and stochastic with continuous state variables. To handle this type of problem, ADP methods discretize/sample the continuous state space and approximate the future value (or cost-to-go) function by some statistical modeling techniques. The earliest strategies used full finite grid discretizations with multilinear or spline interpolation. Under a statistical perspective, more efficient design of experiments methods, such as orthogonal arrays (OAs) and number theoretic methods (NTMs), combined with flexible statistical modeling methods, such as multivariate adaptive regression splines (MARS) and neural networks (NNs), enabled approximate solutions to higher-dimensional problems. The above statistical perspective still maintains a traditional DP solution approach. By contrast, machine learning approaches evolved in the artificial intelligence community to approximately ``learn'' the DP solution via an iterative search. These learning based methods fall under various names, including reinforcement learning (RL), adaptive critic, and neuro-dynamic programming. These learning based ADP methods were initiated from the theories of psychology and animal learning, but now have evolved as an important branch-stream of machine learning methods. Compared with the previous ADP methods developed in statistical and operations research communities, this kind of methods can adaptively and gradually learn the DP solution with certain learning algorithms. However, the practical success of RL approaches is still limited due to extremely high computational cost. The RL approach to ADP is sequential in nature, and this dissertation seeks to improve upon the statistical perspective by developing sequential approaches in the spirit of RL. The existing ADP methods assume fixed model structures for approximating the future value (or cost-to-go) function. In practice, this model structure is difficult to identify, in many cases requiring a time-consuming trial-and-error process. In addition, the statistical perspective requires determination of the discretization sample size in advance. The iterative approach of RL, automatically determines sample size and uses system dynamics to explore the state space. In this dissertation, two types of sequential algorithms are developed. The first type uses a sequential concept based on consistency theory to both identify the approximating model structure and determine sample size. The second type uses system dynamics to sequentially identify the state space region. The first type of sequential algorithm builds an adaptive value function approximation while the size of the state space sample grows. In the statistical perspective to ADP, there are two components to value function approximation: (1)~design of experiments and (2)~statistical modeling. For design of experiments, NTM low-discrepancy sequence sampling techniques are employed because of the sequential nature in which they are generated. For statistical modeling, feed-forward NN models are used because of their consistency ability. The adaptive value function approximation (AVFA) is implemented in each stage of the backward-solving DP framework. Three different AVFA algorithms are derived based on the consistency concept and then tested on a nine-dimensional inventory forecasting problem. The first algorithm increments the size of the state space training data in each sequential step, and for each sample size a successive model search process is performed to find an optimal NN model. The second algorithm improves on the first by reducing the computation of the successive model search process. The third algorithm uses a more natural perspective of the consistency concept, where in each step, either the sample size is incremented or the complexity of the NN model is grown; an optimal NN model is not directly sought in this algorithm, but rather the consistency concept implies that convergence will yield the optimal model. The second type of sequential algorithm conducts state space exploration forwards through the stages. The objective is to identify the appropriate region in the state space to model the future value function. The specification of this region is needed in the design of experiments component of the statistical perspective; however, in practice, this region is typically unknown. Hence, this sequential state space exploration (SSSE) approach fills an important need. Since decisions are needed to move forward through the stages, both random and optimal decisions are explored. The SSSE algorithm is combined with the AVFA algorithm to yield a novel self-organized forward-backward ADP solution framework. This framework consists of two phases. The first phase has a single forward SSSE step using random decisions to identify an initial state space region for each stage and a single backward AVFA step to build initial future value function approximations over these regions. The second phase iterates between a forward SSSE step with optimal decisions and a backward AVFA step to update the state space regions and the future value function approximations until the state space regions stop changing. This new sequential SSSE-AVFA algorithm is also tested on a nine-dimensional stochastic inventory forecasting problem.