Like anyone else who works in IR, I found myself with a machine and wasn't sure whether it had malware on it. I figured I'd use the RDS (Reference Data Set) from the NSRL (National Software Reference Library) to "subtract" out known good files. Even on a "modern" machine (defined as SSD storage with multiple cores), just "subtracting" out the known good is still time consuming. So, I did the following to make it much faster.
A simple "grep -v" would've taken >25 hours. With my process it was 15 seconds.
My system:
- Intel i7-4930K (6c, 12t) 3.4GHz
- ASUS X79 Deluxe Motherboard
- 64GB RAM
- Corsair 240GB ssd (nothing too fancy)
- Fedora 25
How did I accomplish this magic?
Get the NSRL
I won't go into details, but something like this:
- Download the "Modern" and "Legacy" RDS ISOs from the NSRL
- Mount them (mount -o loop RDS_257_legacy.iso /mnt/legacy/ ; mount -o loop RDS_257_modern.iso /mnt/modern/)
- Uncompress the big files
- cd ; mkdir nsrl ; cd nsrl
- mkdir modern ; cd modern
- unzip /mnt/modern/NSRLFile.txt.zip
- mv NSRLFile.txt NSRLFile-modern.txt
- cut -f2 -d, NSRLFile-modern.txt | cut -f2 -d\" | sort -u > nsrl-modern-su.md5
- cd ..
- mkdir legacy ; cd legacy
- unzip /mnt/legacy/NSRLFile.txt.zip
- mv NSRLFile.txt NSRLFile-legacy.txt
- cut -f2 -d, NSRLFile-legacy.txt | cut -f2 -d\" | sort -u > nsrl-legacy-su.md5
- cd ..
- Combine legacy & modern
- cd ~/nsrl
- cat {legacy,modern}/nsrl*-su.md5 | sort -u > nsrl-modern_legacy.md5
It's just that easy! ;)
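To see what the cut/cut/sort pipeline above actually does, here is a minimal sketch on a tiny sample file. The header layout is what I'd assume NSRLFile.txt looks like (SHA-1 first, MD5 second), and every hash and filename in the sample rows is made up for illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)

# A header row plus three made-up records in the assumed NSRLFile.txt layout.
cat > "$workdir/NSRLFile-sample.txt" <<'EOF'
"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"
"1111111111111111111111111111111111111111","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","00000000","a.dll",10,1,"WIN",""
"2222222222222222222222222222222222222222","BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB","00000000","b.exe",20,1,"WIN",""
"3333333333333333333333333333333333333333","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","00000000","a-copy.dll",10,2,"WIN",""
EOF

# First cut: field 2 (comma-delimited) is the quoted MD5.
# Second cut: splitting on the double quote strips the quotes.
# sort -u: collapse duplicate hashes (the same file shipped in two products).
cut -f2 -d, "$workdir/NSRLFile-sample.txt" | cut -f2 -d\" | sort -u \
  > "$workdir/nsrl-sample-su.md5"

cat "$workdir/nsrl-sample-su.md5"
```

Note that the header row survives as a literal MD5 line in the output. It never matches a real 32-character hash, so I didn't bother filtering it out.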
Source Image
To "prepare" my source image, I had run the "file" command on everything in the image. I then grepped through that output file for "executable". That gave me executables formatted as:
- MS-DOS
- PE32
- DLLs
- All kinds of other stuff
I then ran md5sum on every file in that list and saved the output in "executables.md5s". There were 6800 unique executables listed in this file.
Obviously, I skipped some steps here. Hit me up in the comments if you want details.
Subtracting the NSRL from Source
My first attempt was:
grep -vi -f nsrl-modern_legacy.md5 executables.md5s
That ran out of memory and crashed.
I've done A BUNCH of work with "grep -v -f A B". I've learned that it can still be VERY fast if B is HUGE. What slows down grep is when A gets big. So, let's keep A small and get this done!
Normally, subtracting a bunch of things out of one input file must be done sequentially. This is slow. Boo!
Second Attempt
My second attempt taught me that I'm going to have to do this backwards. I want to find the intersection of these two files. Once I have that (very small) list, I can quickly subtract it from executables.md5s.
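Here is a rough sketch of that "intersect first, then subtract" idea on toy data. The file names and four-letter "hashes" are made up, and the -F (fixed strings) and -x (whole-line match) flags are my additions for the sketch, not something the commands in this post use.

```shell
workdir=$(mktemp -d)

# Stand-in for the huge known-good list (made-up values).
printf '%s\n' AAAA BBBB CCCC DDDD > "$workdir/known-good.md5"
# Stand-in for the small list of hashes from the suspect machine.
printf '%s\n' BBBB EEEE DDDD FFFF > "$workdir/suspect.md5"

# Step 1: small pattern file, big input. The result is the intersection.
grep -F -x -f "$workdir/suspect.md5" "$workdir/known-good.md5" \
  > "$workdir/both.md5"

# Step 2: subtract that (very small) intersection from the suspect list.
grep -F -x -v -f "$workdir/both.md5" "$workdir/suspect.md5" \
  > "$workdir/unknown.md5"

cat "$workdir/unknown.md5"
```

Both greps get a small pattern file, which is the whole point: the giant list only ever shows up as the input being searched, never as the list of patterns.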
Third Attempt
This attempt took me down a fascinating path that was totally fruitless. I'll write a different blog on that later. :)
Fourth Attempt
The "executables.md5s" file was the output of the "md5sum" program, so each line holds a hash followed by a filename. I pulled out just the md5sums. Also, NSRL uses all uppercase, while md5sum output is all lowercase by default. This took care of both:
awk '{print toupper($1);}' executables.md5s > executables.md5sonly
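If you want to convince yourself what that awk one-liner does, here is a small sketch run against two made-up lines in md5sum's "hash, two spaces, filename" layout. The paths and the second hash are invented for illustration.

```shell
# Made-up lines in md5sum's output layout.
sample=$(printf '%s\n' \
  'd41d8cd98f00b204e9800998ecf8427e  /mnt/image/empty.bin' \
  '0123456789abcdef0123456789abcdef  /mnt/image/Program Files/app.exe')

# $1 is the first whitespace-separated field (the hash), so spaces in the
# filename don't matter; toupper() matches the NSRL's uppercase convention.
normalized=$(printf '%s\n' "$sample" | awk '{print toupper($1);}')

printf '%s\n' "$normalized"
```

Uppercasing once up front is also what lets the later greps drop the -i flag.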
I tried:
pv nsrl-modern_legacy.md5 | grep -f executables.md5sonly -i > tacos
If you don't know pv, go check it out! It's like "cat" with a "done-o-meter"!
This was going to take 23 hours!
Final Attempt
I have 6 cores, 12 threads of execution. I wanted to go with 4 x Cores for my number of runs. This command splits the already (comparatively) small executables.md5sonly into 24 files:
cd ~/nsrl ; mkdir exesplit ; cd exesplit
split -d -n l/24 ../executables.md5sonly exesplit.
time ls exesplit.* | parallel --jobs 24 grep -f {} ../nsrl-modern_legacy.md5 > all.out
This kicked off 24 parallel jobs of grep, all searching through the NSRL for md5s from my suspect machine. There were 3700 md5sums in all.out.
It took 7.8 seconds.
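For anyone who wants to try the split-and-fan-out pattern without the full NSRL, here is a toy sketch. It swaps in "xargs -P" for GNU parallel (in case parallel isn't installed), uses 3 chunks instead of 24, and adds -F and -x to grep; those substitutions and all the sample values are my assumptions for the sketch, not what I ran above.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Made-up stand-ins: a "big" known-good list and a small suspect list.
printf '%s\n' AAAA BBBB CCCC DDDD EEEE FFFF > nsrl-sample.md5
printf '%s\n' BBBB XXXX DDDD YYYY FFFF ZZZZ > suspect-sample.md5

mkdir exesplit
# Split on line boundaries into 3 chunks (the post uses 24).
split -d -n l/3 suspect-sample.md5 exesplit/exesplit.

# One grep per chunk, up to 3 running at once. grep exits 1 when a chunk
# matches nothing, which makes xargs report failure, hence the "|| true".
ls exesplit/exesplit.* \
  | xargs -P 3 -I{} grep -F -x -f {} nsrl-sample.md5 > all.out || true

# Jobs finish in no particular order, so sort before looking at the result.
sort all.out
```

One difference worth knowing: GNU parallel buffers each job's output so lines from different jobs don't get mixed together, while xargs makes no such promise. With one short hash per line that is unlikely to bite, but it's a reason to prefer parallel for the real run.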
But... I'm not quite done yet. That just shows the files that are in my suspect machine AND in the NSRL. So, what I need is the files that are in the suspect machine that AREN'T in the NSRL. That's simple:
cd ~/nsrl
time grep -v -f exesplit/all.out executables.md5sonly > executables-nonsrl.md5
That took 2.2 seconds.
I'm now down to 3100 md5sums listed in executables-nonsrl.md5. I eliminated over half my files to check for malice.
Virus Total
My co-worker has a script to run these against VirusTotal. Running all 6800 files would've taken >24 hours. Running only 3100 takes <12 hours. And we can be pretty confident that we won't be querying for things that are known good.