AlienVault Longitudinal Study Part 4
chartsnthings 2014-06-09
In Part 1 we looked atacquiring raw data, and wrangling it into a time series dataset. InPart 2 we looked attypes of threats in the time series. In Part 3we looked at countries. Now we will examine countries and types incombination in the AlienVault reputation database.
Just as we shaped our dataset for better understanding in previous posts, we willuse the top 25 countries in the cctab table, and ignore address types with a compounddescription, i.e. with a semicolon in the type field. Here is the SQL to make ourcore dataset for this post:
-- Create table with countries and type summariesDROP TABLE IF EXISTS avcctyp;CREATE TABLE avcctyp ( d DATE, cc CHAR(2), type CHAR(50), avgrisk DECIMAL(4,2), avgrel DECIMAL(4,2), minrisk INTEGER UNSIGNED, minrel INTEGER UNSIGNED, maxrisk INTEGER UNSIGNED, maxrel INTEGER UNSIGNED, count INTEGER UNSIGNED);-- Insert countries, types, and average, min, max scores INSERT INTO avcctypSELECT d, cc, type, AVG(risk) AS avgrisk, AVG(rel) AS avgrel, MIN(risk) AS minrisk, MIN(rel) AS minrel, MAX(risk) AS maxrisk, MAX(rel) AS maxrel, COUNT(*) AS count FROM avtrack, avip WHERE avtrack.iind=avip.iindGROUP BY d, cc, type;-- Delete compound types (with semicolon) and non-top-25 countries DELETE FROM avcctyp WHERE type LIKE '%;%';DELETE FROM avcctyp WHERE cc NOT IN (SELECT * FROM cctab);
Let us now load a data frame in R with the country and type data:
library(RMySQL)avdb = dbConnect(MySQL(), user='xxxxxx', password='zzzzzzzz', dbname='avrep')rs <- dbSendQuery(avdb, statement = "SELECT * FROM avcctyp;")avcctyp <- fetch(rs, n = -1) # extract all rowsavcctyp$d = as.Date(avcctyp$d, "%Y-%m-%d")
We now have a data frame in R called avcctyp with the top 25 countries (from thefirst day of our series) and their types, averages and counts for each day of our time series.
Looking at one country, and one day, this SQL gives us a feel for what we can explore:
MariaDB [avrep]> SELECT * FROM avcctyp WHERE d='2013-10-01' AND cc='US';+------------+------+----------------------+---------+--------+---------+--------+---------+--------+-------+| d | cc | type | avgrisk | avgrel | minrisk | minrel | maxrisk | maxrel | count |+------------+------+----------------------+---------+--------+---------+--------+---------+--------+-------+| 2013-10-01 | US | APT | 4.00 | 2.00 | 4 | 2 | 4 | 2 | 1 || 2013-10-01 | US | C&C | 5.83 | 4.47 | 2 | 4 | 10 | 6 | 239 || 2013-10-01 | US | Malicious Host | 3.99 | 3.01 | 1 | 1 | 9 | 6 | 616 || 2013-10-01 | US | Malware distribution | 4.00 | 3.00 | 4 | 3 | 4 | 3 | 1 || 2013-10-01 | US | Malware Domain | 5.81 | 2.05 | 1 | 1 | 10 | 4 | 2061 || 2013-10-01 | US | Malware IP | 5.11 | 3.02 | 1 | 1 | 10 | 5 | 1483 || 2013-10-01 | US | Scanning Host | 2.07 | 2.00 | 1 | 1 | 6 | 4 | 40228 || 2013-10-01 | US | Spamming | 6.11 | 2.39 | 1 | 2 | 10 | 5 | 1161 |+------------+------+----------------------+---------+--------+---------+--------+---------+--------+-------+8 rows in set (0.02 sec)
Here are scanning hosts by country:
ggplot(avcctyp[avcctyp$type=="Scanning Host",], aes(d, count)) + geom_point() + theme_bw() + scale_x_date(breaks=NULL) + xlab("Day") + ylab("Count") + facet_grid(. ~ cc)
And what we see:
Here are malware domains by country:
ggplot(avcctyp[avcctyp$type=="Malware Domain",], aes(d, count)) + geom_point() + theme_bw() + scale_x_date(breaks=NULL) + xlab("Day") + ylab("Count") + facet_grid(. ~ cc)
And what we see:
You can see that each type of address could be investigated this way to determine country basedpatterns in the various address types. We can also see the various types for a particular country:
ggplot(avcctyp[avcctyp$cc=="US",], aes(d, count)) + geom_point() + theme_bw() + scale_y_log10() + scale_x_date(breaks=NULL) + xlab("Day") + ylab("Count") + facet_grid(. ~ type)
And what we see:
This graphic shows us the various types of addresses as facets for United States entries.
There are many other relationships that could be explored in the Alien Vault threat feed. Forexample, our dataframe has average, minimum and maximum risk and reliability ratings. Countriesand types could be more fully explored based on ratings,but these will have to be left as exercises for the reader. This four part series gaveyou a sense of how to start with a raw data feed and analyze it from a variety ofperspectives. We hope you’ll try the code we’ve published here, and take it farther withyour own exploration.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...