MLlib*: Fast Training of GLMs using Spark MLlib

  • Zhipeng Zhang,
  • Jiawei Jiang,
  • Wentao Wu,
  • Ce Zhang,
  • Lele Yu,
  • Bin Cui

IEEE International Conference on Data Engineering (ICDE 2019)

At Tencent Inc., more than 80% of the data are
extracted and transformed using Spark. However, the commonly
used machine learning systems are TensorFlow, XGBoost, and
Angel, whereas Spark MLlib, the official Spark package for
machine learning, is seldom used. One reason for this neglect
is the general belief that Spark is slow when it comes
to distributed machine learning, so users have to undergo
the painful procedure of moving data in and out of Spark. The
question of why Spark is slow, however, remains open.

In this paper, we study the performance of MLlib with a focus
on training generalized linear models using gradient descent.
Based on a detailed examination, we identify two bottlenecks in
MLlib: its pattern of model update and its pattern of communication.
To address these two bottlenecks, we tweak the implementation
of MLlib using two state-of-the-art and well-known techniques,
model averaging and AllReduce. We show that the resulting system,
which we call MLlib*, significantly improves over MLlib and
achieves performance similar to or even better than that of
specialized distributed machine learning systems (such as Petuum
and Angel), on both public and Tencent workloads.
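To make the model-averaging idea concrete, below is a minimal sketch (not the paper's actual implementation) of one communication round for logistic regression, a representative GLM, on a Spark RDD: each partition runs a few local SGD steps on its own replica of the model, and the replicas are then averaged into the next global model. The names (Example, localSgd, oneRound) and hyperparameters are illustrative assumptions; MLlib* additionally replaces Spark's driver-centric aggregation with AllReduce-style communication, which the plain reduce below only approximates.

import org.apache.spark.rdd.RDD

object ModelAveragingSketch {
  // A labeled example: feature vector x and binary label y in {0, 1}.
  case class Example(x: Array[Double], y: Double)

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def dot(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum

  // Run `localSteps` SGD updates on one partition's data, starting from w0.
  def localSgd(examples: Iterator[Example], w0: Array[Double],
               lr: Double, localSteps: Int): Array[Double] = {
    val w = w0.clone()
    val data = examples.toArray
    var step = 0
    while (step < localSteps && data.nonEmpty) {
      val ex = data(step % data.length)
      val err = sigmoid(dot(w, ex.x)) - ex.y // logistic-loss gradient factor
      var j = 0
      while (j < w.length) { w(j) -= lr * err * ex.x(j); j += 1 }
      step += 1
    }
    w
  }

  // One communication round: broadcast the model, train a local replica
  // per partition, then average the replicas into the next global model.
  def oneRound(data: RDD[Example], w: Array[Double],
               lr: Double, localSteps: Int): Array[Double] = {
    val bw = data.sparkContext.broadcast(w)
    val numParts = data.getNumPartitions
    val summed = data
      .mapPartitions(it => Iterator.single(localSgd(it, bw.value, lr, localSteps)))
      .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
    summed.map(_ / numParts) // model averaging
  }
}

By contrast, MLlib's mini-batch gradient descent broadcasts the model and aggregates a gradient through the driver at every step; communicating once per round of local training is where model averaging saves time, and an AllReduce implementation further removes the driver from the critical path.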