ALT Highlights – Equilibrium Computation and the Foundations of Deep Learning

Windows On Theory 2021-04-30

[Guest post by Kush Bhatia and Cyrus Rashtchian, foreword by Gautam Kamath]

Welcome to ALT Highlights, a series of blog posts spotlighting various happenings at the recent conference ALT 2021, including plenary talks, tutorials, trends in learning theory, and more! To reach a broad audience, the series will be disseminated as guest posts on different blogs in machine learning and theoretical computer science. Boaz has kindly agreed to host a post in this series. This initiative is organized by the Learning Theory Alliance, and overseen by Gautam Kamath. All posts in ALT Highlights are indexed on the official Learning Theory Alliance blog.

This is the third post in the series, an interview with Constantinos Daskalakis and coverage of his ALT 2021 keynote talk, written by Kush Bhatia and Cyrus Rashtchian.


To make a decision for ourselves, we need to think about the impact of our actions on our objectives. But when our actions affect other people, and their actions affect our objectives, we also need to consider their incentives and choose our actions in anticipation of theirs. This increase in complexity also arises in situations involving multiple decision-making machines (e.g., self-driving cars), automated systems (e.g., algorithmic stock trading), or living organisms (e.g., groups of cells).

Studying decision-making in so-called multi-agent environments has been a question of interest to mathematicians and economists for centuries. In the 1830s, Antoine Augustin Cournot developed a theory of competition to model oligopolies, inspired by observing competition in a spring water duopoly. Jumping forward to the 20th century, researchers converged on the view that a fruitful approach to analyzing multi-agent systems is to study them at “equilibrium”, that is, in situations where the system is stable in the sense that all parties are satisfied with their actions. A fundamental concept of equilibrium, studied by John von Neumann, and later by von Neumann and Oskar Morgenstern, is a collection of actions, one per agent, such that no agent has an incentive to deviate from their choice given the actions of the other agents. While a nice proposal, von Neumann and Morgenstern could only show that such an equilibrium is guaranteed to exist in situations conforming to what is called a “two-player zero-sum game”. In such games, two agents are in exact competition with each other; whatever one player wins the other loses.1 In 1950, John F. Nash showed that this notion of equilibrium, named Nash equilibrium in his honor, indeed exists for most naturally occurring multi-agent problems.2

While Nash established the existence of equilibrium, one question bothered economists and computer scientists alike: “Is it possible to efficiently find the Nash equilibrium of a game?”

Throughout the rest of the 20th century, many people proposed algorithms for computing Nash equilibrium, but none of them succeeded in showing that it can be done efficiently (i.e., in polynomial time in the size of the game). During his early graduate school days at UC Berkeley, Constantinos (a.k.a. Costis) Daskalakis obsessed over this question. Then, in 2006, Costis, along with co-authors Paul Goldberg and Christos Papadimitriou, who was also his PhD advisor, showed a surprising result: finding a Nash equilibrium is computationally intractable!

Figure 1. Utility functions encode the value an agent gets from a particular decision. (a) In the classical Economics and Game Theory literature, these trade off the benefits and costs of actions. For example, consider an agent deciding on the quantity x of basic material to procure, while the other agent sets the price y per unit of this material. If we let g(x) denote the revenue from selling the product produced using x units of basic material, where g is a concave function of x (capturing diminishing returns), then the overall utility of the agent choosing x is f(x,y) = g(x) - x \cdot y, an overall concave function. (b) Modern multi-agent problems involve agents choosing parameters of deep neural networks. This quickly leads to non-concave agent utilities.

A central concept in analyzing multi-agent systems is the utility function of each interacting agent. This function captures the value the agent derives as a function of their own action as well as the actions of the other agents. A common assumption is that the utility function of each agent is a concave function of their own action for any collection of actions committed by the others. Concavity often arises when agents trade off benefits and costs from their actions, taking into account properties like diminishing returns and risk-aversion (see Figure 1 for an illustration). It is also crucial in guaranteeing that equilibria exist.

From Game Theory and Economics to Deep Learning

Recently, Costis has shifted his attention to the more general setting where the underlying utilities can be arbitrary non-concave functions of the agents’ actions. “Earlier, I was interested in the problem of equilibrium computation for its fundamental applications in Game Theory and Economics and its intimate connections to duality theory, topology and complexity theory. As Machine Learning is now moving towards multi-agent learning, studying more general setups arising from nonconcave agent utilities becomes increasingly relevant,” says Costis. He elaborates that the recent success of deep learning methods has largely been in single-agent setups and that the next frontier is to replicate this success in multi-agent settings. It is at this intersection of deep learning and multi-agent learning that non-concave utility functions arise: when actions correspond to setting the parameters of deep neural networks, agent utilities quickly become non-concave in the space of these parameters (see Figure 1).

The focus of Costis’ keynote talk, on joint work with Stratis Skoulakis and Manolis Zampetakis [DSZ21], was the simplest multi-agent problem: two-player zero-sum games. In these games, two players, the min and the max player, choose actions x and y respectively, which are constrained to lie in some compact and convex set S, i.e., (x,y) \in S. The agents share an objective function f(x,y) that min wants to minimize and max wants to maximize. Classical studies, going back to von Neumann’s celebrated work [vN28], focus on the case where this objective is a convex function of the min player’s action and a concave function of the max player’s action, the so-called “convex-concave setting”. For any function f which is convex-concave, there exists a Nash equilibrium, i.e., a point (x^*, y^*) satisfying

f(x^*, y) \leq f(x^*, y^*) \leq f(x, y^*) \quad \text{for all}\; x, y \text{ such that } (x, y^*), (x^*, y) \in S.\qquad (1)

Thus the min (max) player has no incentive to deviate from x^* (y^*) as long as the other player’s action remains fixed. The existence of such a point (x^*, y^*) follows from von Neumann’s minimax theorem and Rosen’s generalization of this theorem to the case where the agents’ actions are jointly constrained [Ros65]. However, when f is not convex-concave, the minimax theorem fails, and we lose the existence of Nash equilibrium. For a simple example, consider f(x,y) = (x-y)^2 over the space S=[0,1]^2. Given any pair of actions (x, y): if x \neq y, the min player can improve by moving x towards y; if x = y, the max player can improve by moving y away from x. Thus, no pair (x^*, y^*) satisfies equation (1).
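For readers who prefer to see this concretely, here is a minimal numerical sanity check of the example above (our own illustration, not code from [DSZ21]): on a fine grid over S=[0,1]^2, every candidate point admits a profitable deviation for at least one of the two players, so no grid point satisfies equation (1).

```python
import numpy as np

# Minimal sanity check (illustrative, not from [DSZ21]): for f(x, y) = (x - y)^2
# on S = [0, 1]^2, every candidate point on a fine grid admits a profitable
# deviation for the min player or the max player, so no point satisfies (1).
def f(x, y):
    return (x - y) ** 2

grid = np.linspace(0.0, 1.0, 101)
tol = 1e-12

for x_star in grid:
    for y_star in grid:
        # The min player can improve if some feasible x gives a strictly smaller value.
        min_improves = np.min(f(grid, y_star)) < f(x_star, y_star) - tol
        # The max player can improve if some feasible y gives a strictly larger value.
        max_improves = np.max(f(x_star, grid)) > f(x_star, y_star) + tol
        assert min_improves or max_improves

print("No grid point satisfies the Nash equilibrium condition (1).")
```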

Nonconvex-nonconcave utility functions naturally arise in adversarial training applications, such as Generative Adversarial Network (GAN) training, where the goal is to learn how to generate new data, such as images, from the same distribution that generated a collection of given data. Specifically, GANs are trained by trying to identify the equilibrium of a two-player zero-sum game between a generator model (the min player) and a discriminator model (the max player). Each of these models is viewed as an agent choosing the parameters of a deep neural network, and the objective, capturing how close the generated distribution is to the target distribution, amounts to a nonconvex-nonconcave function of the underlying network parameters, which the min player aims to minimize and the max player aims to maximize.
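For concreteness, in the original GAN formulation of Goodfellow et al. (a standard example, not specific to [DSZ21]), the generator G_{\theta_g} and the discriminator D_{\theta_d} are deep networks and the two players solve

\min_{\theta_g} \max_{\theta_d} \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D_{\theta_d}(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)\big],

an objective that is nonconvex in \theta_g and nonconcave in \theta_d.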

Given that Nash equilibria may not exist when the objective is not convex-concave, what type of solutions should we target when studying two-player zero-sum games with such objectives? “One property that we would like our target solutions to possess is that they are universal, i.e. they are guaranteed to exist for any objective function. We can take them to a practitioner and tell them that they are always plausible targets for their computations,” says Costis. With this in mind, Costis and his co-authors consider a relaxed equilibrium concept, called (\epsilon, \delta)-Nash equilibrium. This is a pair (x^*, y^*) satisfying

f(x^*, y^*) < f(x, y^*) + \epsilon \quad \text{for all}\; x \text{ such that } \|x - x^*\| \leq \delta \text{ and } (x, y^*) \in S,

f(x^*, y^*) > f(x^*, y) - \epsilon \quad \text{for all}\; y \text{ such that } \|y - y^*\| \leq \delta \text{ and } (x^*, y) \in S.

This relaxes Nash equilibrium: given strategy y^* for the max player, the min player can improve by at most \epsilon by changing their action in a ball of radius \delta around x^*, and a symmetric condition holds for the max player, given strategy x^* for the min player. One of the main insights of their paper is that such local Nash equilibria are guaranteed to exist as long as \delta is a small enough function of \epsilon and f’s smoothness, namely whenever \delta \le \sqrt{2\epsilon \over L} where L is f’s smoothness. This non-trivial result is established via an application of Brouwer’s fixed point theorem.
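As a rough illustration of the definition (our own sketch; the names f, feasible, and the sampling-based test are hypothetical and not part of [DSZ21]), one can empirically probe whether a candidate point behaves like an (\epsilon, \delta)-Nash equilibrium by sampling deviations inside the \delta-balls around x^* and y^*. Such a test can only rule a point out, never certify it.

```python
import numpy as np

# Illustrative sketch (not from [DSZ21]): probe whether (x_star, y_star) behaves
# like an (eps, delta)-Nash equilibrium by sampling deviations of norm <= delta.
# A failed check rules the point out; passing it is only weak evidence.
def looks_like_local_nash(f, feasible, x_star, y_star, eps, delta,
                          trials=10_000, seed=0):
    rng = np.random.default_rng(seed)
    base = f(x_star, y_star)

    def random_point_in_ball(center):
        direction = rng.normal(size=center.shape)
        direction /= np.linalg.norm(direction) + 1e-12
        return center + rng.uniform(0.0, delta) * direction

    for _ in range(trials):
        x = random_point_in_ball(x_star)
        if feasible(x, y_star) and f(x, y_star) <= base - eps:
            return False  # the min player has a profitable local deviation
        y = random_point_in_ball(y_star)
        if feasible(x_star, y) and f(x_star, y) >= base + eps:
            return False  # the max player has a profitable local deviation
    return True  # no violation found among the sampled deviations
```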

Can algorithms find a local Nash equilibrium?

The next question, pertaining to computational complexity, is to determine whether finding an (\epsilon, \delta)-Nash equilibrium is algorithmically tractable in the regime of parameters (small enough \delta) where it is guaranteed to exist. As a first step, Costis and his co-authors focus on first-order algorithms, which have access to the gradient of the objective function. Examples include gradient descent and variants thereof, which have been the main computational engine behind the success of deep learning in single-agent problems. A classical result for these methods in minimization settings is that they are efficient in computing (\epsilon,\delta)-minima of non-convex, smooth objectives f. These are points x^* such that f(x^*) < f(x) + \epsilon for all feasible x with \|x - x^*\| \leq \delta. Namely, given query access to the gradient \nabla f of some L-smooth objective f with values normalized to [0,1], it is possible to compute (\epsilon,\delta)-minima using a number of steps and queries to \nabla f that is polynomial in L and 1/\epsilon, as long as \delta \le \sqrt{2\epsilon \over L}. In contrast to minimization, a main contribution of Costis’ work is to establish an intractability result for min-maximization, showing that the number of gradient queries for any first-order algorithm to compute (\epsilon,\delta)-Nash equilibria must be exponential in at least one of 1/\epsilon, the dimension d, or the smoothness L of the objective.
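To make the minimization side of this contrast concrete, here is a schematic sketch (our own illustration; the example objective and the stopping threshold are arbitrary choices, not the precise guarantee above) of how plain gradient descent with the standard 1/L step size drives the gradient norm down, yielding an approximate local minimum after polynomially many gradient queries.

```python
import numpy as np

# Schematic sketch (illustrative constants): gradient descent with step size 1/L
# on an L-smooth objective until the gradient norm is small. A point with small
# gradient norm is an approximate local minimum of a smooth function.
def descend_to_small_gradient(grad_f, x0, L, grad_tol, max_iters=100_000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= grad_tol:
            break  # gradient is small enough: stop here
        x = x - g / L  # the standard 1/L step for an L-smooth objective
    return x

# Example: the smooth nonconvex objective f(x) = sum(cos(x_i)), whose gradient
# is -sin(x), is 1-smooth; descent from the all-ones vector quickly reaches a
# point with tiny gradient (close to the local minimizer at pi in each coordinate).
x_hat = descend_to_small_gradient(lambda x: -np.sin(x), x0=np.ones(5),
                                  L=1.0, grad_tol=1e-6)
```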

Theorem 1 (informal). First-order methods need a number of queries to \nabla f that is exponential in at least one of 1/\epsilon, L, or the dimension to find (\epsilon,\delta)-Nash equilibria, even when \delta \le \sqrt{2\epsilon \over L}, i.e., in the regime in which they are guaranteed to exist.

This theorem tells us that there exist objective functions for which the min-maximization problem can be computationally intractable for any first-order algorithm. Requiring many queries is one way to say that the problem is hard in practice. Indeed, practitioners have found it notoriously hard to get the discriminator-generator neural networks to converge to good solutions for generative modelling problems using gradient-based methods.

From a technical perspective, this work represents a new approach for proving lower bounds in optimization. Classical lower bounds in the optimization literature, going back to Nemirovsky and Yudin [NY83], target the black-box setting: an algorithm is given access to an oracle which outputs some desired information about a function when presented with a query. For example, a first-order oracle outputs the gradient of a function f at a given input. Costis shared that he and his co-authors first tried to construct a black-box lower bound for local Nash equilibria directly. However, they were unsuccessful. Any direct construction they tried ended up introducing spurious local Nash equilibria, which first-order algorithms might find in polynomial time. Their direct attempts at a lower bound failed to capture the computational hardness of the problem. They quickly realized that they needed a deeper understanding of the problem at hand and better insight into what made it harder than minimization.

Figure 2. (a) Black-box models work with an oracle which an algorithm queries to obtain information. White-box models provide algorithms access to functions which can compute the desired information. (b) Architecture of black-box intractability results (focus of the optimization literature): first show computational hardness results in the white-box model (focus of computational complexity); then compose those with black-box lower bounds for any problem in the class for which hardness results were established.

That insight came when they switched to studying the complexity of the white-box version of the problem, wherein the optimization algorithm can look inside the oracle that computes \nabla f(x) and possibly f(x) as well. One might wonder why one would want to consider such white-box models over black-box ones if the goal is to prove intractability results for methods that have limited access to f. Indeed, proving an intractability result in the white-box model is much harder because the set of algorithms that use white-box access to the objective is strictly larger than the set of algorithms using only black-box access. However, the key difference is that we are not looking for the same kind of hardness in the two models. In the black-box model, we are looking for unconditional computational hardness, that is, showing that any algorithm will require exponentially many queries to the gradient oracle. On the other hand, in the white-box model, we would like to show complexity-theoretic hardness, i.e., show that solving the problem at hand is at least as hard (or exactly as hard) as solving the hardest problems in some complexity class. Such a complexity-theoretic hardness result is conditional; it says that solving this problem will be computationally intractable as long as some computational complexity conjecture, such as P \neq NP, holds. Importantly, showing hardness (or completeness) of a problem in some complexity class typically entails a fine-grained understanding of the nature of the problem and how that enables it to encode other problems in the target complexity class.

In the white-box model, the authors show that the problem of computing local Nash equilibria is PPAD-complete. In other words, computing this local equilibrium concept in zero-sum games with nonconvex-nonconcave objectives is exactly as hard as computing Nash equilibria in general-sum games with concave agent utilities.3 This result is established by exhibiting a reduction from a variant of the Sperner coloring problem,4 which is a PPAD-complete problem, to a discrete nonconvex-nonconcave min-maximization problem, where the two agents choose points on a hypergrid.

Unexpected challenges in high dimensions

Having established this result, Costis and his coauthors presumed that the hardest part of the problem was behind them. However, another challenge awaited them. They still had to construct a continuous interpolation of their discrete function that satisfies the desired Lipschitz and smoothness properties in a computationally efficient manner. To understand the challenge, consider a simple two-dimensional example with two actions per agent. Suppose we are given prescribed values for f(x,y) on all four vertices of \{0,1\}^2, and our goal is to construct a continuous and smooth function on [0,1]^2 that matches the prescribed function values at the corners. A simple approach is to define f(x,y) at any point (x,y) using a smooth interpolation of the values at all four corners of \{0,1\}^2. This works, but does not scale to high dimensions. An approach that would scale computationally in high dimensions is to first triangulate [0,1]^2 by chopping it along its main diagonal and then interpolate the function on each triangle separately. However, this simple approach fails since the gradient of the interpolated function can be discontinuous when crossing the diagonal. “This part turned out to be more technically challenging than we had thought in high dimensions,” says Costis. He and his coauthors overcame the issue by proposing a new smooth and computationally efficient interpolation scheme, which they expect will have more applications in transferring hardness results from discrete problems to continuous problems.
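The failure of the naive triangulation can be seen in the tiny example below (our own toy illustration with arbitrary corner values, not the construction in [DSZ21]): cutting [0,1]^2 along its main diagonal and interpolating linearly on each triangle produces a function whose gradient jumps as one crosses the diagonal.

```python
import numpy as np

# Toy illustration (arbitrary corner values, not the construction in [DSZ21]):
# piecewise-linear interpolation over the two triangles obtained by cutting
# [0,1]^2 along its main diagonal has a discontinuous gradient on the diagonal.
# Corner values: f(0,0)=0, f(1,0)=1, f(0,1)=0, f(1,1)=0.
def triangulated_gradient(x, y):
    """Gradient of the piecewise-linear interpolant at (x, y), off the diagonal."""
    if x >= y:
        # Lower triangle with vertices (0,0), (1,0), (1,1):
        # the affine interpolant of the corner values here is f(x, y) = x - y.
        return np.array([1.0, -1.0])
    else:
        # Upper triangle with vertices (0,0), (0,1), (1,1):
        # the affine interpolant of the corner values here is identically 0.
        return np.array([0.0, 0.0])

# Two points that straddle the diagonal but are arbitrarily close to each other:
print(triangulated_gradient(0.5 + 1e-9, 0.5))  # [ 1. -1.]
print(triangulated_gradient(0.5 - 1e-9, 0.5))  # [ 0.  0.]
```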

To obtain Theorem 1, the authors show that one can translate their complexity-theoretic hardness in the white-box model to an unconditional intractability result in the black-box model. This follows immediately from their reduction from Sperner to local Nash equilibrium. Indeed, finding local Nash equilibria in the min-max instance at the output of their reduction provides solutions to the Sperner instance at its input. Moreover, a single query (function value or gradient value) to the min-max objective requires O(d) queries to the Sperner coloring circuit in order to be computed. Finally, it is known that, with black-box queries to the Sperner coloring circuit, exponentially many queries are necessary to compute a solution [HPV89, Pap94]. An exponential black-box lower bound for local Nash equilibrium thus follows.

This proof architecture can be used more generally to prove intractability results for optimization problems involving smooth objectives. First, ignore the black-box model and focus on identifying the complexity class that captures the complexity of the problem in the white-box model. Then, focus on obtaining a hardness result for a discrete version of the problem. Once this is established, one can use the techniques presented in this work to lift this intractability from the discrete to the continuous problem. If there are black-box lower bounds for any problem residing in the complexity class for which the white-box version of the problem is hard, then these lower bounds can be composed with hardness reductions to establish lower bounds for the black-box version of the problem. As an aside, Costis mentions that it would be interesting if one could establish the lower bound of Theorem 1 in the black-box model directly, i.e., without going through the PPAD machinery.

Looking forward

Costis ends on an optimistic note: “While this might appear as a negative result, it really is a positive one.” Explaining further, he says that a philosophical consequence of his intractability results is that the multi-agent future of deep learning is going to have a lot of interesting “texture” — it will involve a large breadth of communities and motivate a plethora of problems at the interface of theoretical computer science, game theory, economics and machine learning. Costis envisions a change in balance in the multi-agent world: while recent successes of deep learning in single-agent problems capitalize on access to large data, unprecedented computational power, and effective inductive biases, multi-agent problems will demand much stronger inductive biases, invoking domain expertise in order to develop effective models and useful learning targets, as well as to discover algorithms that attain those targets.

Indeed, domain expertise has been crucial in some recent high-profile machine learning achievements in multi-agent settings: the AlphaGo agent for playing the game of Go and the Libratus agent for playing Texas Hold’em. For these, a game-theoretic understanding has been infused into the structure and training of the learning algorithm. In addition to using deep neural networks, AlphaGo uses a Monte Carlo tree search procedure to determine the best next move as well as to collect data for the training of the neural networks during self play, while Libratus uses counterfactual regret minimization to approximate the equilibrium of the game. The success of both algorithms required combining machine learning expertise with game-theoretic expertise about how to solve the games at hand.

More broadly, Costis urges young researchers to move beyond the classical statistical paradigm which assumes independent and identically distributed observations, and embrace learning challenges that are motivated from learning problems with state, incomplete or biased data, data with dependencies, and multi-agent learning applications. In particular, he would like to see more activity in obtaining new models and algorithms for reinforcement and multi-agent learning, better tools for high-dimensional learning problems with data bias and dependencies, as well as deeper connections to causal inference and econometrics. “There is a lot of beautiful mathematics to be done and new continents to explore motivated by these challenges.”

Acknowledgements

We are thankful to Margalit Glasgow, Gautam Kamath, Praneeth Netrapalli, Arun Sai Suggala, and Manolis Zampetakis for providing valuable feedback on the blog. We would like to especially thank Costis Daskalakis for helpful conversations related to the technical and philosophical aspects of this work, and valuable comments throughout the writing of this article.

Notes

1 As we discuss later, this existence result only holds when the gains of each player are a concave function of their own actions.

2 This led to Nash winning the Nobel prize in Economics in 1994 with John Harsanyi and Reinhard Selten.

3 PPAD is the complexity class that exactly captures the complexity of computing Nash equilibria in general-sum games, computing fixed points of Lipschitz functions in convex and compact domains, and many other equilibrium and fixed point computation problems.

4 In Sperner, we are given white-box access to a circuit that computes colors for the vertices of a canonical simplicial subdivision of the d-dimensional simplex. Each vertex receives one of the colors in \{1,\ldots,d+1\}, and no vertex lying on facet i of the simplex receives color i. The goal is to find a simplex of the subdivision whose vertices all have distinct colors.

Bibliography

[DSZ21] Constantinos Daskalakis, Stratis Skoulakis, and Manolis Zampetakis. The complexity of constrained min-max optimization. Symposium on Theory of Computing, 2021.

[HPV89] Michael D. Hirsch, Christos H. Papadimitriou, and Stephen A. Vavasis. Exponential lower bounds for finding Brouwer fixed points. Journal of Complexity, 1989.

[NY83] Arkadi S. Nemirovsky and David B. Yudin. Problem complexity and method efficiency in optimization. Wiley, 1983.

[Pap94] Christos H. Papadimitriou. On the complexity of the parity argument and other inefficient proofs of existence. Journal of Computer and System Sciences, 1994.

[Ros65] J. Ben Rosen. Existence and uniqueness of equilibrium points for concave n-person games. Econometrica, 1965.

[vN28] John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 1928.