Detection of Prevalent Malware Families with Deep Learning

Jack W. Stokes; Christian Seifert; Jerry Li; Nizar Hejazi

2019 Military Communications Conference | October 2019

Published by IEEE | Organized by IEEE

Attackers evolve their malware over time in order to evade detection, and the rate of change varies from family to family depending on the amount of resources these groups devote to their “product”. This rapid change forces anti-malware companies to devote substantial human and automated effort to combating these threats. These companies track thousands of distinct malware families and their variants, but the most prevalent families are often particularly problematic. While some companies employ many analysts to investigate and create new signatures for these highly prevalent families, we take a different approach and propose a new deep learning system that learns a semantic feature embedding which better discriminates the files within each of these families. Identifying files which are close in a metric space is the key aspect of malware clustering systems. The DeepSim system employs a Siamese Neural Network (SNN), an architecture which has previously shown promising results in other domains, to learn this embedding for the cosine distance in the feature space. The error rate for K-Nearest Neighbor classification using DeepSim’s SNN with two hidden layers is 0.011%, compared to 0.42% for a Jaccard Index-based baseline which has been used by several previously proposed systems to identify similar malware files.
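
To make the two similarity measures in the abstract concrete, the following is a minimal sketch of K-Nearest Neighbor classification under a Jaccard index over feature sets and under cosine similarity over embedding vectors. It is not the DeepSim implementation: the function names, the toy feature sets, the family labels, and the hand-written two-dimensional "embeddings" are all illustrative assumptions, and no Siamese network is trained here. In the system described above, the embedding vectors would instead come from the learned SNN.

```python
import math
from collections import Counter


def jaccard_index(a, b):
    """Jaccard index |A & B| / |A | B| of two feature sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_u * norm_v)


def knn_predict(query, labeled, similarity, k=1):
    """Majority label among the k labeled items most similar to the query."""
    ranked = sorted(labeled, key=lambda item: similarity(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled files, described by sets of observed features.
labeled_sets = [
    ({"CreateFile", "WriteFile", "RegSetValue"}, "family_a"),
    ({"Connect", "Send", "Recv"}, "family_b"),
]
query_set = {"CreateFile", "WriteFile", "Connect"}

# Hypothetical embeddings standing in for the output of a trained network.
labeled_vecs = [
    ((1.0, 0.1), "family_a"),
    ((0.1, 1.0), "family_b"),
]
query_vec = (0.9, 0.2)

jaccard_label = knn_predict(query_set, labeled_sets, jaccard_index)
cosine_label = knn_predict(query_vec, labeled_vecs, cosine_similarity)
```

Passing the similarity function as a parameter keeps the neighbor search identical for both measures, so the only difference between the two classifiers is the representation being compared.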