Big Data’ing The Umbrella DNS Popularity List

Recently I started looking at the Umbrella DNS Popularity List and did a blog post about it here. The data seemed valuable and lacking at the same time, so I spent my *limited* free time this week learning R and RStudio.
Protip: If you want to play along at home, there is an RStudio Docker container, so all you need to do is:

docker run -d -p 8787:8787 -e USER=<username> -e PASSWORD=<password> rocker/rstudio

Getting today’s list loaded into R is as simple as:

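# Get today's list
library(readr)                         # provides read_csv()
fn <- "top-1m.csv"                     # name of the CSV inside the zip
if (file.exists(fn)) file.remove(fn)   # remove any stale copy first
temp <- tempfile()
# mode = "wb" keeps the zip intact on Windows
download.file("http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip", temp, mode = "wb")
unzip(temp, fn)
today <- read_csv(fn, col_names = FALSE)
unlink(temp)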

Now you have Umbrella's top 1 million DNS names ready to be “big data’ed”.
At the start of this project I wanted to do the following:
Search the DNS names for keywords. (Done).
Map all the DNS records on a map. (Done, Kinda).
Compare today’s and yesterday’s records for new DNS records.
Check all the DNS records against Censys and record open ports, and software.
Check all the DNS records against VirusTotal and see if any of them are known bad.
Check all the DNS records against SSLLabs and record SSL grade.
Take a nap.
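The comparison item in the list above is not done yet, but the core of it could be sketched as a set difference. This is only a rough illustration: the two small data frames stand in for two daily downloads, and the column names rank and domain are my assumption (read_csv with col_names = FALSE would give X1 and X2 instead).

```r
# Stand-in data; in practice these would be two saved daily lists.
# Column names "rank" and "domain" are assumed for illustration.
yesterday_df <- data.frame(
  rank = 1:4,
  domain = c("google.com", "netflix.com", "cisco.com", "example.org"),
  stringsAsFactors = FALSE
)
today_df <- data.frame(
  rank = 1:4,
  domain = c("google.com", "cisco.com", "netflix.com", "new-site.example"),
  stringsAsFactors = FALSE
)

# Names present today that were absent yesterday
new_domains <- setdiff(today_df$domain, yesterday_df$domain)

# Keep today's rank alongside each new name
new_records <- today_df[today_df$domain %in% new_domains, , drop = FALSE]
print(new_records)
```

Names that only changed rank (cisco.com and netflix.com here) are not reported, because the comparison is on the name alone.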
My limited results so far follow with hopefully more to come.

Search The DNS Names

I wanted to be able to search the list for a keyword and build a table and map of the results. This was fairly easy, and with the help of leaflet and datatables, here is the output of searching today’s data for cisco.
Here is the map:

Here is a link to the data. 
Here is the R code I wrote:
https://gist.github.com/jgamblin/7615b81cedd10e44d4f2220347b69cb0
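The gist has the full version. As a minimal sketch of just the keyword-filter step (not the code from the gist), something like the following would work in base R. The data frame, its column names, and the search_domains helper are all hypothetical stand-ins.

```r
# Stand-in for the downloaded list; column names are assumed.
top <- data.frame(
  rank = 1:5,
  domain = c("google.com", "www.cisco.com", "netflix.com",
             "tools.cisco.com", "sanfrancisco.example"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose domain contains the keyword.
# Both sides are lower-cased so the literal (fixed) match ignores case.
search_domains <- function(df, keyword) {
  hit <- grepl(tolower(keyword), tolower(df$domain), fixed = TRUE)
  df[hit, , drop = FALSE]
}

hits <- search_domains(top, "cisco")
print(hits)
```

A plain substring match also picks up names like sanfrancisco.example, so the results may need a second pass if you only want one organization's domains.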

Map All The DNS Records On A Map.

I got started on this and quickly realized that looking up the GeoIP information and mapping a million DNS records was going to take a week, so I decided to do the top 25,000 as a POC and come back and do all 1,000,000 later (maybe).
Here is the 25,000 Map:
Here is the R code I wrote:
https://gist.github.com/jgamblin/ccf3390bc5d2ce922cd5df38a40617b4
I also built a map with the Top 100K on it but it is huge (Load at your own risk).
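As a back-of-the-envelope check on the "week" estimate above: if a million records really take a week, that implies a bit over half a second per record, and the 25,000 subset should finish in a few hours at the same rate. The per-record time here is derived from that estimate, not measured, and top_n is a hypothetical helper that assumes the list is already ordered by rank.

```r
n_total  <- 1e6                # full list
n_poc    <- 25000              # proof-of-concept subset
week_sec <- 7 * 24 * 60 * 60   # seconds in a week

# Per-record time implied by "a million records takes a week"
# (an assumption taken from the estimate, not a measurement)
sec_per_record <- week_sec / n_total

# Projected time for the top 25,000 at the same rate, in hours
poc_hours <- n_poc * sec_per_record / 3600

cat(sprintf("%.4f s per record, %.1f hours for the POC\n",
            sec_per_record, poc_hours))

# Hypothetical helper: if the list is ordered by rank,
# the top N are simply the first N rows.
top_n <- function(df, n) head(df, n)
```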

…More to come.

I will be spending some more time on this over the next couple of weeks, but I can't thank @EngelhardtCR and @hrbrmstr enough for all the help they have been over the last week. They are true data scientists and I am just a hacker with a blog.  : )
If you have any questions or suggestions, please let me know on Twitter at @jgamblin.
Here is a picture semi related to this blog post to make it look pretty when I share it on social media. 
