Reinforcement learning to achieve efficient equilibria in repeated games with partial observability ‒ SYCAMORE ‐ EPFL

Outline

This is a Master’s thesis project at the intersection of reinforcement learning and game theory. The goal is to design online learning algorithms that learn efficient equilibria of a game that is repeated over a long horizon. In particular, the focus is on leader follower games when the follower’s actions are partially observable. The project is theoretical in scope.

Motivation

Leader follower games are a class of two player games in game theory. The first player, also called the leader, picks an action which the follower observes. The follower then selects an action that maximizes his utility in response to the leader’s action. The objective of the leader is to select a policy that ensures the best payoff for her accounting for the follower’s response. There are several examples of such games, these include games like chess or Go, but also real world applications such as design of insurance contracts by insurance companies, design of CEO wage structures by company shareholders, as well as pricing design in electricity markets or transportation.

In many leader follower games in the real world, the leader cannot fully observe the actions of the follower. This can lead to inefficient outcomes for the leader [1], because she has to pick her first move under a noisy feedback from the follower. While one-shot versions of leader follower games with partial observability cannot achieve efficient outcomes in general, positive results emerge when the game is repeated many number of times [2]. This can be done by adapting the leader’s future actions over time following feedback from past (noisy) observations of the follower’s actions. The goal of the project is to investigate the use of online learning [3] and reinforcement learning algorithms [4] to learn these adaptive policies and achieve efficient outcomes in such games in the long run.

As an example, think of a game where company’s shareholders need to design the wage structure for a CEO. The company’s profit is a function of both the CEO’s efforts, as well as external factors such as output of other workers and market conditions. From observed company profits alone, the shareholders cannot infer whether it was caused by a high performing CEO, or because of external factors. The shareholders must thus design the CEO’s wage structure under partial observability of CEO’s actions, which may lead to misaligned incentives and sometimes inefficient outcomes for the company.

However, since the shareholders and the CEO typically interact multiple times, the shareholders can collect a history of realized company profits and use it to infer whether the CEO has put in the required effort on average. Accordingly, they can adapt the wages over time. Similarly, if the CEO knows that the shareholders can estimate his efforts from a large history of interactions, it is in his best interest to perform well to get a higher wage in the long run. Thus, repetition can help the company achieve long run efficiency.

Milestones

Weeks 1-4: Literature review and problem formulation.
Weeks 5-14: Design of algorithm, convergence analysis.
Weeks 14-16: Writing and evaluation of results.

Requirements

We look for self motivated students with a strong background in mathematics and theoretical computer science. Prior theory projects related to online learning/reinforcement learning are a bonus.

Contact

Interested students can reach out to [email protected] or [email protected].

References

[1] George Georgiadis. “Contracting with moral hazard”. In: Elgar Encyclopedia on the Economics of Competition, Regulation and Antitrust. Ed. by Michael D. Noel. Edward Elgar Publishing, Dec. 31, 2024, pp. 24–37. isbn: 978-1-80220-054-6 978-1-80220-053-9.

[2] Drew Fudenberg, David Levine, and Eric Maskin. “The Folk Theorem with Imperfect Public Information”. In: Econometrica 62.5 (Sept. 1994), p. 997. issn: 00129682.

[3] Ehud Kalai and Ehud Lehrer. “Rational Learning Leads to Nash Equilibrium”. In: Econometrica 61.5 (Sept. 1993), p. 1019. issn: 00129682.

[4] Galit Askenazi-Golan et al. The Bounds of Algorithmic Collusion; Q-learning, Gradient Learning, and the Folk Theorem. Mar. 3, 2026.