Some of you might have seen Beyond "Soda, Pop, or Coke". I am facing a similar problem and would like to create a plot like that. In my case, I have a very large number of geo-coded observations (over 1 million) and a binary attribute x. I would like to show the distribution of x on a map with a color scale ranging from 0 to 1 for p(x=1).
I am open to other approaches but Katz's approach for Beyond "Soda, Pop, or Coke" is described here and uses these packages: fields, maps, mapproj, plyr, RANN, RColorBrewer, scales, and zipcode. His approach relies on k-nearest neighbor kernal smoothing with Gaussian kernel. He first defines a distance for each location t on the map to all observations and then uses a distance-weighted estimate for p(x=1|t) (probability that x is 1 conditional on the location). The formula is here.
When I understand this correctly, creating such a map in R involves these steps:
- Build grid that covers the entire region of the shapefile (let's call the points in the grid t). I tried this approach using
polygrid
but failed so far. Code is below. - For each t, calculate the distance to all the observations (or just find the k clostest points and calculate the distance for this subset)
- calculate p(x=1|t) according to the formula defined here
- plot all t with an appropriate colorscale that ranges from 0 to 1
Here is some example data and I two concrete questions. First, how do solve my problem with step 1? As the second map below shows, my current approach fails. That is a clear R implementation question and once that is solved, I should be able to complete the other steps. Second and more broadly, is that the right approach or would you suggest a different way to create heatmap with distribution of attribute values?
load libraries and open shapefile and packages
# set path
path = PATH # CHANGE THIS!!
# load libraries
library("stringr")
library("rgdal")
library("maptools")
library("maps")
library("RANN")
library("fields")
library("plyr")
library("geoR")
library("ggplot2")
# open shapefile
map.proj = CRS(" +proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0 +datum=NAD83 +units=us-ft +no_defs +ellps=GRS80 +towgs84=0,0,0")
proj4.longlat=CRS("+proj=longlat +ellps=GRS80")
shape = readShapeSpatial(str_c(path,"test-shape"),proj4string=map.proj)
shape = spTransform(shape, proj4.longlat)
# open data
df=readRDS(str_c(path,"df.rds"))
plot data
# plot shapefile with points
par (mfrow=c(1,1),mar=c(0,0,0,0), cex=0.8, cex.lab=0.8, cex.main=0.8, mgp=c(1.2,0.15,0), cex.axis=0.7, tck=-0.02,bg = "white")
plot(shape@bbox[1,],shape@bbox[2,],type='n',asp=1,axes=FALSE,xlab="",ylab="")
with(subset(df,attr==0),points(lon,lat,pch=20,col="#303030",bg="#303030",cex=0.4))
with(subset(df,attr==1),points(lon,lat,pch=20,col="#E16A3F",bg="#E16A3F",cex=0.4))
plot(shape,add=TRUE,border="black",lwd=0.2)
1) Build grid that covers the entire region of shapefile
# get the bounding box for ROI an convert to a list
bboxROI = apply(bbox(shape), 1, as.list)
# create a sequence from min(x) to max(x) in each dimension
seqs = lapply(bboxROI, function(x) seq(x$min, x$max, by= 0.001))
# rename to xgrid and ygrid
names(seqs) <- c('xgrid','ygrid')
# get borders of entire SpatialPolygonsDataFrame
borders = rbind.fill.matrix(llply(shape@polygons,function(p1) {
rbind.fill.matrix(llply(p1@Polygons,function(p2) p2@coords))
}))
# create grid
thegrid = do.call(polygrid,c(seqs, borders = list(borders)))
# add grid points to previous plot
points(thegrid[,1],thegrid[,2],pch=20,col="#33333333",bg="#33333333",cex=0.4)