Modern machine learning is characterized by two key features: high-dimensional models and very large datasets. Each of these features presents its own unique challenges, from basic issues such as storing and accessing all of the data to more intricate mathematical questions such as finding good algorithms to search through the high-dimensional space of models. In our recent work, which we’re happy to announce received a best paper award at this year’s Conference on Neural Information Processing Systems (NeurIPS), we discovered a surprising new connection between these two key aspects: the well-known phenomenon of acceleration for smooth optimization can always be applied to distributed optimization, even in non-smooth settings, because of an inherent smoothness in the communication process!
Progress and the search for the unknown
The challenges of distributed computation and optimization have been studied separately for more than half a century. Take, for instance, the easier yet still prevalent setting of linear models such as logistic regression. There, the mathematical theory of local search techniques, such as gradient descent, in high-dimensional spaces is by now almost complete. This is largely due to work from the 1980s school of Russian researchers and engineers, including central figures Arkadi Nemirovski and Yurii Nesterov. They developed a theory of optimal local search methods, obtaining concrete algorithms, such as momentum-based accelerated gradient descent, as well as proofs that, in a precise sense, no better algorithms exist. This theory was tremendously influential and put modern machine learning on solid footing. Another example is optimization with noisy gradient evaluations, which produced the famous stochastic gradient descent technique. But research in these areas remains very active today, and modern machine learning comes with a host of other fundamental issues for which optimal methods are not yet known. Of particular interest is the question of how to design optimal local search methods for distributed convex optimization, taking into account both the high-dimensional search space and the huge datasets.
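To give a concrete feel for these optimal methods, here is a minimal sketch of momentum-based accelerated gradient descent (Nesterov’s method) for a smooth convex objective. The function name, step sizes, and toy quadratic are illustrative choices for this post, not code from the papers discussed here.

```python
import numpy as np

def nesterov_accelerated_gd(grad, x0, L, n_iters=100):
    """Nesterov's accelerated gradient descent for an L-smooth convex objective.

    grad: callable returning the gradient of the objective
    x0:   starting point (NumPy array)
    L:    smoothness constant (Lipschitz constant of the gradient)
    """
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iters):
        x_next = y - grad(y) / L                          # gradient step at the extrapolated point
        t_next = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
        y = x_next + ((t - 1) / t_next) * (x_next - x)    # momentum (extrapolation) step
        x, t = x_next, t_next
    return x

# Toy usage: minimize the smooth quadratic f(x) = 0.5 * ||A x - b||^2.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
L = np.linalg.norm(A, 2) ** 2                             # the gradient A^T(Ax - b) is L-Lipschitz
x_hat = nesterov_accelerated_gd(lambda x: A.T @ (A @ x - b), np.zeros(5), L)
```

On L-smooth convex problems this momentum scheme converges at the O(1/k^2) rate, compared to O(1/k) for plain gradient descent, which is exactly the acceleration phenomenon referred to above.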
Microsoft Research has long been invested in finding answers to this question, including foundational work on the relationship between distributed learning and mini-batch gradient descent. In a recent collaboration between Microsoft Research AI and the Microsoft Research-Inria Joint Centre in Paris, new advances were made on the question of optimal methods for distributed optimization. A paper on smooth optimization was published at the International Conference on Machine Learning (ICML) last year. Our NeurIPS 2018 paper, “Optimal Algorithms for Non-Smooth Distributed Optimization in Networks,” explores the much more delicate question of non-smooth optimization.
An entirely different realm
It is well-known that for serial smooth convex optimization, momentum-based techniques yield optimal convergence rates. A slightly underwhelming finding of the ICML 2017 paper is that in the distributed setting, one can simply implement such optimal serial methods together with the master/slave protocol, and this in turn will yield an optimal distributed method. At this point, it is crucial to remember what mathematically optimal actually means. A method is “optimal” only within the set of assumptions under which optimality is proven. In particular, there is a tremendously important weakness in the master/slave protocol: It is not robust against machine failures, which are ubiquitous in distributed computing. Once one adds in the requirement of robustness—for example, by requiring that communication follows the so-called gossip protocol—one enters an entirely different realm. In this regime, a much deeper method turns out to be optimal, one based on a primal-dual representation of the distributed objective. Such representations have been known for a long time and are the basis of many practical algorithms, such as alternating direction method of multipliers (ADMM). The ICML paper proves that under a local smoothness condition, such primal-dual techniques are in fact optimal.
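To make the distinction concrete, here is a minimal sketch of the gossip communication pattern: decentralized averaging in which each machine only ever exchanges information with its neighbors through a gossip matrix. The ring network, the matrix W, and the helper gossip_average are illustrative assumptions for this post, not code from the papers.

```python
import numpy as np

def gossip_average(values, W, n_rounds=50):
    """Decentralized averaging: each node repeatedly mixes its vector with
    those of its neighbors through a gossip matrix W.

    values: (n_nodes, dim) array of local vectors
    W:      (n_nodes, n_nodes) doubly stochastic matrix supported on the
            network graph (W[i, j] > 0 only if i and j are neighbors)
    """
    x = values.copy()
    for _ in range(n_rounds):
        x = W @ x          # one communication round: x_i <- sum_j W[i, j] * x_j
    return x               # every row converges to the global average

# Toy usage on a ring of 4 machines, each holding one local estimate.
W = np.array([[0.5, 0.25, 0.0, 0.25],
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.25, 0.0, 0.25, 0.5]])
local = np.array([[1.0], [2.0], [3.0], [4.0]])
print(gossip_average(local, W))   # each entry approaches the average, 2.5
```

Because no node plays a special role, there is no single point of failure, which is precisely the robustness that the master/slave protocol lacks.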
The discovery of hidden smoothness
For optimization experts, distributed optimization in the non-smooth scenario might seem like an unlikely place to look for acceleration, as it is well-known that for serial non-smooth optimization, plain (sub)gradient descent is already optimal and no acceleration is possible. The surprising twist here, though, and indeed the key observation in our NeurIPS paper, is that the distributed nature of the problem always induces some smoothness. Slightly more precisely, while each machine might face a difficult non-smooth problem, the aggregation process resulting from communication is actually a smooth process.
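One standard way to see this, sketched here under the usual assumption of a symmetric positive semidefinite gossip matrix W whose kernel is the consensus subspace, is to rewrite the distributed problem as a saddle point in which the machines interact only through a linear, and hence perfectly smooth, communication operator:

```latex
% Consensus reformulation (sketch): n machines, local objectives f_i,
% gossip matrix W \succeq 0 with \ker W = \{ x : x_1 = \dots = x_n \}.
\min_{x_1, \dots, x_n} \sum_{i=1}^{n} f_i(x_i)
  \quad \text{s.t.} \quad x_1 = x_2 = \dots = x_n
\qquad \Longleftrightarrow \qquad
\min_{x} \max_{\lambda} \; \sum_{i=1}^{n} f_i(x_i)
  + \big\langle \lambda, \sqrt{W}\, x \big\rangle .
```

The local terms f_i may be arbitrarily non-smooth, but the coupling between machines is bilinear, so the communication side of the problem is as smooth as one could hope for.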
This hidden smoothness reveals itself in primal-dual methods, where the primal problem is non-smooth while the dual problem is smooth. The idea then becomes, roughly, to use an optimal non-smooth method in the primal together with an optimal smooth method in the dual. In other words, non-smooth distributed optimization always allows for accelerated communication despite the non-smoothness of the original problem. The astute reader might have noticed that invoking primal-dual methods inevitably means that one is considering the local regularity (that is, on each machine) of the distributed function, leaving open the question of optimal methods under a global regularity assumption. Our NeurIPS paper also proposes an algorithm for that scenario based on deep ideas in randomized algorithms. However, proving optimality remains a difficult technical challenge. In particular, we conjecture that optimal methods under global regularity might inherently suffer from the high dimensionality of the space. Proving such a statement is an exciting open problem!
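The following Python sketch shows the shape of such a primal-dual scheme on the saddle-point formulation above, using a standard Chambolle-Pock-style splitting: the non-smooth local functions enter only through proximal steps computed on each machine, while every dual update is a pure communication (gossip) step. The step sizes, the path-graph Laplacian used as W, and the names primal_dual_sketch and prox_fs are illustrative assumptions; the algorithm in our NeurIPS paper is more refined, in particular accelerating the communication steps.

```python
import numpy as np

def primal_dual_sketch(prox_fs, W, dim, n_iters=2000, tau=0.5, sigma=0.5):
    """Schematic decentralized primal-dual iteration for
       min_x  sum_i f_i(x_i)  subject to  x_1 = ... = x_n,
    written as the saddle point  min_x max_lambda  sum_i f_i(x_i) + <lambda, sqrt(W) x>.

    Storing y = sqrt(W) @ lambda, every dual update is a single multiplication
    by the gossip matrix W (one communication round); the non-smooth local
    functions only enter through their proximal operators, evaluated locally.

    prox_fs: list of callables, prox_fs[i](v, tau) = prox of tau * f_i at v
    W:       (n, n) gossip matrix (symmetric PSD, kernel = consensus vectors);
             the step sizes must satisfy tau * sigma * lambda_max(W) < 1
    """
    n = len(prox_fs)
    x = np.zeros((n, dim))
    x_bar = np.zeros((n, dim))
    y = np.zeros((n, dim))                     # y = sqrt(W) @ lambda, stored per machine
    for _ in range(n_iters):
        y = y + sigma * (W @ x_bar)            # dual step: communication only (the smooth part)
        x_new = np.stack([prox_fs[i](x[i] - tau * y[i], tau) for i in range(n)])
        x_bar = 2 * x_new - x                  # extrapolation step
        x = x_new
    return x

# Toy usage: three machines, f_i(x) = |x - a_i| (non-smooth); the consensus
# minimizer of sum_i f_i is the median of the a_i, here 2.0.
a = [1.0, 2.0, 10.0]
prox_fs = [lambda v, t, ai=ai: ai + np.sign(v - ai) * np.maximum(np.abs(v - ai) - t, 0.0)
           for ai in a]
W = np.array([[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]])  # path-graph Laplacian
print(primal_dual_sketch(prox_fs, W, dim=1))   # each row should approach the median, 2.0
```

The only interaction between machines is the multiplication by W in the dual update, so the entire aggregation happens through the smooth, communication-only part of the iteration.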