Flexible and Scalable Deep Learning with MMLSpark

  • Sudarshan Raghunathan
  • Akshaya Annavajhala
  • Danil Kirsanov
  • Eduardo de Leon
  • Eli Barzilay
  • Ilya Matiach
  • Joe Davison
  • Maureen Busch
  • Miruna Oprescu
  • Ratan Sur
  • Roope Astala
  • Tong Wen
  • ChangYoung Park

4th International Conference on Predictive Applications and APIs

In this work we detail a novel open-source library, called MMLSpark, that combines the flexible deep learning library Cognitive Toolkit with the distributed computing framework Apache Spark. To achieve this, we have contributed Java language bindings to the Cognitive Toolkit and added several new components to the Spark ecosystem. We also integrate the popular image processing library OpenCV with Spark, and present a tool that automatically generates PySpark wrappers from any SparkML estimator; we use this tool to expose all of our work to the PySpark ecosystem. Finally, we provide a large library of tools for working and developing within the Spark ecosystem. We apply this work to the automated classification of snow leopards in camera trap images, and provide an end-to-end solution for the non-profit conservation organization, the Snow Leopard Trust.
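The abstract mentions automated generation of PySpark wrappers but does not describe how the generator works. As a rough illustration of the general idea only, the sketch below turns a declarative list of parameters into Python source for a thin wrapper class with chained setters. Every name in it (ParamSpec, generate_wrapper, load_wrapper, ToyModelWrapper, and the two example parameters) is hypothetical and is not part of MMLSpark; the sketch uses only the Python standard library and does not touch Spark.

```python
# Toy sketch of wrapper code generation. All names are hypothetical and this
# is NOT MMLSpark's actual generator; it only illustrates the general idea.
from dataclasses import dataclass
from typing import List


@dataclass
class ParamSpec:
    """Hypothetical description of one estimator parameter."""
    name: str
    default: object
    doc: str


def generate_wrapper(class_name: str, params: List[ParamSpec]) -> str:
    """Return Python source for a wrapper class with a setter/getter per param."""
    lines = [f"class {class_name}:", "    def __init__(self):"]
    if params:
        for p in params:
            # repr() keeps string defaults quoted in the generated source
            lines.append(f"        self._{p.name} = {p.default!r}")
    else:
        lines.append("        pass")
    for p in params:
        cap = p.name[0].upper() + p.name[1:]
        lines += [
            "",
            f"    def set{cap}(self, value):",
            f'        """Set {p.name}: {p.doc}"""',
            f"        self._{p.name} = value",
            "        return self  # returning self allows chained setters",
            "",
            f"    def get{cap}(self):",
            f"        return self._{p.name}",
        ]
    return "\n".join(lines) + "\n"


def load_wrapper(class_name: str, params: List[ParamSpec]):
    """Execute the generated source and return the resulting class object."""
    namespace = {}
    exec(generate_wrapper(class_name, params), namespace)
    return namespace[class_name]


# Assumed example parameters, chosen purely for illustration.
specs = [
    ParamSpec("inputCol", "image", "name of the input column"),
    ParamSpec("miniBatchSize", 10, "rows scored per batch"),
]
Wrapper = load_wrapper("ToyModelWrapper", specs)
model = Wrapper().setInputCol("pixels")
```

Here the parameter list is written by hand; a real generator would presumably derive it by inspecting the estimator itself, a step this sketch leaves out.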

Publication Downloads

Synapse Machine Learning

February 12, 2020

SynapseML is an ecosystem of tools aimed at expanding the distributed computing framework Apache Spark in several new directions. SynapseML adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK), LightGBM, and OpenCV. These tools enable powerful and highly scalable predictive and analytical models for a variety of data sources.