Outline
Definition of Markov decision processes, policies, and performance criteria.
Dynamic programming with known transition dynamics: Value Iteration, Policy Iteration.
Dynamic Programming II
Dynamic programming with unknown transition dynamics: Q-learning.
Linear Programming
Algorithms based on the primal and dual linear programming formulations of RL: constraint
sampling, REPS, and DICE methods.
Policy Gradient I
Policy parameterization, REINFORCE, and techniques for computing unbiased estimators of
the policy gradient.
Policy Gradient II
Non-concavity of the policy gradient objective, global convergence of projected gradient
descent, global convergence of natural policy gradient, TRPO, and PPO.
Deep and Robust Reinforcement Learning
Importance of robustness in RL, robust RL as a zero-sum Markov game.
Imitation Learning
Motivation, setting, maximum causal entropy IRL, GAIL, and LP approaches.
Alignment and Reasoning with Reinforcement Learning
Brief introduction to language models, alignment, RLHF, reasoning, and reasoning in modern
models (OpenAI o1, DeepSeek-R1).
Final lecture
Project Presentations