Mapping every IPv4 address
R-bloggers 2014-09-15
Summary:
During July I was working with a commercial data source that provides extra data around IP addresses and it dawned on me: rather than pinging billions of IP addresses and creating a map, I could create a map from all the geolocation data I had at my fingertips. At a high level I could answer “Where are all the IPv4 addresses worldwide?” But in reality what I created was a map communicating “Where do the geolocation services think all the IPv4 addresses are worldwide?” At the end of July I put together a plot in about an hour and tossed it onto Twitter. It is still getting retweets over a month later, in spite of the redundancy in the title.
Bob and I have talked quite a bit before about the (questionable) value of maps and how they can be eye-catching, but they often lack the substance to communicate a clear message. The problem may be compounded when IP geolocation is the data source for maps. Hopefully I can point out some of the issues in this post as we walk through how to gather and map every IPv4 address in the world.
Step 2: Get the data
I already did step 1 by defining our goal, which as a question is, “Where does the geolocation service think all the IPv4 addresses are worldwide?” Step 2, then, is getting data to support our research. When I created the original map I used data from a commercial geolocation service. Since most readers won’t have a subscription, we can use Maxmind and their free geolocation data. Start by downloading the “GeoLite City” database in CSV/zip format (a 28 MB download) and unzip it to get the “GeoLiteCity-Location.csv” file. Since the first line of the CSV is a copyright statement, you have to skip 1 line when reading it in. Because this is quite a big file, you should leverage the data.table function fread():
library(data.table)
geo <- fread("data/GeoLiteCity-Location.csv", header=TRUE, skip=1)
# how many rows?
geoRows <- nrow(geo)
print(geoRows)
## [1] 557986
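To see what skip=1 does without downloading the file, here is a minimal sketch on an in-memory stand-in. The copyright text, column names, and values below are made up for illustration and are not taken from the real Maxmind file; the text= argument assumes a reasonably recent data.table (1.11.0 or later).

```r
library(data.table)

# Toy stand-in for the top of the CSV: a copyright line, then the header,
# then data rows. All names and values here are illustrative assumptions.
sample_csv <- paste(
  "Copyright (c) example notice",
  "locId,country,city,latitude,longitude",
  "1,US,Example City,38.0,-97.0",
  "2,GB,Another City,51.5,-0.13",
  sep = "\n"
)

# skip = 1 drops the copyright line, so the second line is read as the header
toy <- fread(text = sample_csv, header = TRUE, skip = 1)
print(names(toy))
print(nrow(toy))
```

Without skip=1, fread() would have to guess where the real table starts; being explicit keeps the read predictable.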
Right away here, you can see some challenges with IP geolocation. There are around 4.3 billion total IPv4 addresses (2^32), about 3.7 billion of which are routable (the rest are reserved), and yet the data has a total of only 557,986 rows. It’s probably a safe bet that many addresses are aggregated together.
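A quick back-of-the-envelope calculation makes the gap concrete. This is only a rough sketch: it assumes every address maps to exactly one of the location rows, which is a simplification, and the variable names are mine.

```r
# Size of the IPv4 address space versus the number of location rows
total_ipv4 <- 2^32      # every possible 32-bit address
geo_rows   <- 557986    # row count printed by nrow(geo) above

# Average number of addresses each row would have to stand for,
# under the simplifying assumption of one row per address
avg_per_row <- total_ipv4 / geo_rows
print(round(avg_per_row))
```

Under that assumption, each location row would represent several thousand addresses on average.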
You can jump right to a map here and plot the latitude/longitude in that file, but to save processing time, you can remove duplicate points with the unique() function. Then load up a world map and plot the points on it.
geomap1 <- unique(geo, by=c("latitude", "longitude"))
library(maps)
library(ggplot2)
# load the world
world_map <- map_data("world")
# strip off Antarctica for aesthetics
world_map <- subset(world_map, region != "Antarctica") # sorry penguins
# set up the plot with the map data
gg <- ggplot(world_map)
# now add a map layer
gg <- gg + geom_map(data=world_map, map=world_map, aes(map_id=region), fill="white", color="gray70")
# and the IP geolocation points
gg <- gg + geom_point(data=geomap1, aes(longitude, latitude), colour="#AA3333", alpha=1/10, size=0.5)
# basic theme
gg <- gg + theme_bw()
# show the map
print(gg)
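The de-duplication step at the top of that code can be sketched on toy data. The table below is invented for illustration; it shows that unique() with by= keeps one row per latitude/longitude pair, and that a grouped count reveals how many rows were collapsed into each point.

```r
library(data.table)

# Toy table: four made-up locations, two of which share the same coordinates
toy <- data.table(
  city      = c("A", "B", "C", "D"),
  latitude  = c(38.0, 38.0, 51.5, 51.5),
  longitude = c(-97.0, -97.0, -0.13, 2.35)
)

# Keep the first row for each distinct latitude/longitude pair
pts <- unique(toy, by = c("latitude", "longitude"))
print(nrow(pts))

# Count how many original rows fall on each point
counts <- toy[, .N, by = .(latitude, longitude)]
print(counts)
```

On the full data the same unique() call produces the geomap1 table that is plotted on the map.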
The resulting map is interesting. Notice that the alpha on the points is set to 1/10, meaning it takes ten points on top of one another to make the color solid (red in this case). One thing we didn’t do, though, is account for the density of the <