SoReL-20M: Open source dataset of 20 million PE samples for malware research

On December 14, the cybersecurity company Sophos and ReversingLabs jointly released SoReL-20M, which they describe as the largest malware research dataset to date. It is intended to help build effective defenses and to improve security detection and response capabilities.

SoReL-20M is a dataset of metadata, tags and features for 20 million Windows Portable Executable (PE) files, including 10 million disarmed malware samples. The goal is to provide a dataset large enough for designing machine learning methods that detect malware. The release also includes PyTorch and LightGBM models pre-trained on this data as baselines.

There are many public datasets in natural language processing and image processing, such as MNIST, ImageNet, CIFAR-10, IMDB Reviews, Sentiment140 and WordNet. In network security, by contrast, building standardized, labeled datasets is very challenging, because the data contains large amounts of personally identifiable information, sensitive network infrastructure data, proprietary intellectual property and so on, not to mention the risk of providing malware to unknown third parties.

EMBER (Endgame Malware BEnchmark for Research), released in 2018, is an open source malware benchmark dataset with only 1.1 million samples. It provides just a single label per sample (malware or non-malware), which restricts the scope of the experiments it can support.

The goal of SoReL-20M is to solve this problem with 20 million PE samples: 10 million disarmed malware samples (modified so that they cannot be executed), plus features and metadata extracted from 10 million non-malware samples.



In addition, a machine-learning-based tagging model is used to generate human-understandable semantic descriptions of the important attributes of a given malware sample.
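To make the difference between a single malware/non-malware label and per-sample descriptive tags concrete, here is a minimal Python sketch. The record layout, field names (`sha256`, `is_malware`, `tags`) and tag values are hypothetical and chosen only for illustration; they are not taken from the actual SoReL-20M schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleRecord:
    """Hypothetical metadata record for one PE file (not the real schema)."""
    sha256: str                      # file hash used as the sample identifier
    is_malware: bool                 # single binary label, as in EMBER
    tags: frozenset = frozenset()    # descriptive tags, e.g. "ransomware"


def tag_counts(records):
    """Count how often each tag appears among the malicious samples."""
    counts = Counter()
    for record in records:
        if record.is_malware:
            counts.update(record.tags)
    return counts


# Three made-up records: two malicious samples with tags, one benign sample.
records = [
    SampleRecord("a" * 64, True, frozenset({"ransomware", "dropper"})),
    SampleRecord("b" * 64, True, frozenset({"dropper"})),
    SampleRecord("c" * 64, False),
]

counts = tag_counts(records)
```

With only the binary label, the two malicious records above are indistinguishable; with tags, a model can also be trained and evaluated on what kind of behavior a sample exhibits.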

The release of SoReL-20M is consistent with recent industry trends. In October 2020, Microsoft released the Adversarial ML Threat Matrix to help security analysts detect, respond to, and remediate adversarial attacks against machine learning systems.

ReversingLabs researchers said that the idea of threat intelligence sharing in the security field is not new, but it is critical. Artificial intelligence and machine learning have become key to detecting new malware and targeted attacks, and their applications are becoming more widespread.

GitHub page: