Finding a golden nugget in a lake full of trash

For a while I was wondering: where am I going to find interesting malware? I have these huge sources of unorganized data: Malshare, VirusTotal, VirusShare, Malware Bazaar and AnyRun (and so many more!), but they hold so much data that unless you are looking for something very specific, it’s highly doubtful you would find something interesting right off the bat. It felt like all the big boy companies have access to so many resources while I, a single analyst, have access to so many databases but no way to organize all the IOCs, domains and samples. Just a huge mess of data. It’s all about perspective, and the way I saw the world looked like this:

[Image]

I wanted that golden nugget! The problem wasn’t the lack of data but the mess of the data. I needed a way to gain access to an endless amount of data, but also a way to navigate through it and normalize it. I started looking at tools that might help me solve this problem without setting up 10 virtual machines with honeypots and a whole instance of some threat intelligence platform. And then I was introduced to Splunk. Splunk allowed me to process huge data sets and create beautiful dashboards so I could manage my data and search for very specific things.

Required Tools and Knowledge:

I’ve created a Python script that merges Malware Bazaar and Malshare into one full database. The script is highly flexible and has special objects that allow anyone to create any database and merge it with other databases to build huge collections of malware. In this post I will show you how to use it. Any developers who wish to improve my script or use it, please don’t forget to credit me :3. In addition, I’ve tried to document the script as well as possible so anyone can edit it. You can get a copy of my script here:

https://github.com/DanusMinimus/MalwareLake

I’ve tried to make this script as generic and flexible as I possibly could. The difference in database structure between Malshare and Malware Bazaar did not make this easy.

**Important details - main.py**

[Image: screenshot of the Database constructor in main.py]

This code snippet is the most important part of the code. The Database constructor function takes the following parameters:

The function createDatabase queries the database URLs and generates the database. It accepts one parameter:

How does it work?

[Image: screenshot of the createDatabase function]

First, the function checks if a path to the database exists. If it doesn’t, it creates one. If no database file exists, the function downloads the database from the URL and writes it to the current working directory. It unzips the database if needed and filters out double quotes and space characters from it.
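The steps above can be sketched roughly like this. This is not the code from the repository, just a minimal illustration using only the standard library. The names `create_database`, `fetch_bytes` and `clean_line` are made up for the example, and taking the first file inside the zip archive is an assumption.

```python
import io
import os
import urllib.request
import zipfile


def fetch_bytes(url):
    """Download the raw bytes behind a URL."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def clean_line(line):
    """Drop double quotes and space characters from one line of the dump."""
    return line.replace('"', "").replace(" ", "")


def create_database(url, path, fetch=fetch_bytes):
    """Download a database dump to `path` unless a copy already exists there."""
    if os.path.exists(path):
        return path  # database file already present, nothing to download

    # Create the folder that should hold the database if it is missing.
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)

    raw = fetch(url)

    # Unzip if needed (assumption: the dump is the first file in the archive).
    if zipfile.is_zipfile(io.BytesIO(raw)):
        with zipfile.ZipFile(io.BytesIO(raw)) as archive:
            raw = archive.read(archive.namelist()[0])

    text = raw.decode("utf-8", errors="replace")
    with open(path, "w", encoding="utf-8", newline="") as handle:
        for line in text.splitlines():
            handle.write(clean_line(line) + "\n")
    return path
```

Passing the download function in as `fetch` is only there so the sketch can be tried without hitting a real provider. Note that stripping every space also removes spaces inside values, which is what the description above implies.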

Finally, the function generateFullDB accepts two Database objects and merges them together. The second parameter is optional: if you want to create a main database from a single Database object, pass None instead.

[Image: screenshot of the generateFullDB function]

First, the function creates a full data frame. This sends specific API queries to each database provider to extract extra data for each hash, which produces a normalized template for the main database. The functions that perform this step can be found in module_api_parser.py. After the data frames are created, the function checks if a main database exists. If it doesn’t, it creates a new one by concatenating both database objects. If a main database does exist, the function concatenates the database objects and appends them to it, leaving out the header names of the concatenated objects.
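Here is a rough sketch of the create-or-append logic, written with the standard `csv` module instead of data frames. It is not the script’s actual implementation. The column names in `COLUMNS`, the `source` column and the functions `normalize` and `generate_full_db` are all hypothetical, and the API enrichment step is left out completely.

```python
import csv
import os

# Hypothetical template columns; the real ones are defined by the script.
COLUMNS = ["sha256", "file_type", "signature", "tags", "vt_score", "source"]


def normalize(row, source):
    """Map a provider-specific row onto the shared template."""
    record = {column: row.get(column, "") for column in COLUMNS}
    record["source"] = source
    return record


def generate_full_db(main_path, first_rows, second_rows=None):
    """Concatenate normalized rows and create or extend the main CSV file."""
    rows = list(first_rows) + list(second_rows or [])
    is_new = not os.path.exists(main_path)
    with open(main_path, "a", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        if is_new:
            writer.writeheader()  # the header is written only on creation
        writer.writerows(rows)
    return len(rows)
```

Calling `generate_full_db("main.csv", bazaar_rows, malshare_rows)` once creates the file with a header. Calling it again only appends rows, so the header never repeats in the middle of the data.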

The result should be a CSV file with the following template:

[Image: screenshot of the CSV template]

It’s worth noting that the Malware Bazaar database is very rich and detailed, while Malshare lacks some values I wish it had. This database can be uploaded to any SIEM platform. In my case I chose Splunk (you can learn about Splunk for free here: https://www.splunk.com/en_us/training/courses/splunk-fundamentals-1.html), which resulted in the following dashboard:

[Image: screenshot of the Splunk dashboard]

I can use this dashboard to search for samples by time, file type, signature, tags, VirusTotal score and SHA256 hash. This gives me access to a wide variety of unique malware samples. For example, here are all the malware samples with a VirusTotal score lower than 5, which implies a really low detection rate:

[Image: screenshot of the dashboard filtered to samples with a VirusTotal score below 5]

Here are all the malware samples that arrived as files disguised as COVID-19 information and also have a low VirusTotal detection rate:

[Image: screenshot of the dashboard showing COVID-19 themed samples with low detection]
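If you don’t have Splunk at hand, the same two searches can be approximated straight on the CSV. The sketch below is only an illustration: the column names `vt_score` and `tags`, the function `low_detection` and the sample rows are all made up, and matching "covid" against the tags is an assumption about where that information lives.

```python
import csv
import io


def low_detection(rows, max_score=5, keyword=None):
    """Yield rows scoring below max_score, optionally matching a keyword in tags."""
    for row in rows:
        try:
            score = float(row["vt_score"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a usable score
        if score >= max_score:
            continue
        if keyword and keyword.lower() not in (row.get("tags") or "").lower():
            continue
        yield row


# Made-up sample data standing in for the merged database.
sample = io.StringIO(
    "sha256,tags,vt_score\n"
    "aaa,covid-19;doc,3\n"
    "bbb,emotet,41\n"
    "ccc,loader,2\n"
    "ddd,covid-19,\n"
)
rows = list(csv.DictReader(sample))

print([r["sha256"] for r in low_detection(rows)])                   # ['aaa', 'ccc']
print([r["sha256"] for r in low_detection(rows, keyword="covid")])  # ['aaa']
```

Rows with a missing score are skipped rather than treated as zero, so a sample that was never scanned does not show up as "undetected".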

That’s about it guys, have fun!