MDI Data Workshop

Mapping Spatial Dependence:
An Introduction to Spatial Data Analysis

Le Bao

Massive Data Institute, Georgetown University

March 19, 2024

MDI Data Workshop

  • Interdependence in data
    • This time: spatial dependence
    • April 15-16: Next Generation Generative Models for Vision Related Tasks with Dr. Rupayan Mallick


Plan

  • The basics of spatial data
  • Exploring spatial dependence
    • Spatial autocorrelation
  • Modeling spatial dependence
    • Spatial interpolation
    • Spatial regression


Basics: Types of Spatial Data

  • Vector data
    • Spatial objects are often depicted using vector data, which details the object’s “geometry” or “shape” and includes additional variables, or “attributes”.
    • At its simplest level, vector data comprises of individual points stored as coordinate pairs that indicate a physical location in the world.
name longitude latitude aqi
River Terrace -76.95807 38.89557 42
Rockville -77.10688 39.11431 30
Springfield -77.18347 38.76835 34

Vector Data

plot(dmv_air$longitude, dmv_air$latitude, pch=20, 
     col=heat.colors(3)[rank(dmv_air$aqi)], 
     xlim=c(-77.25, -76.75), main='Air Quality Index', 
     ylab = "Latitude", xlab = "Longitude", bg="white")
text(dmv_air$longitude, dmv_air$latitude, labels=dmv_air$name, 
     pos=4, cex=0.7, offset = 0.5)
legend("topright", legend=unique(aqi), fill=heat.colors(3)[order(unique(aqi))], 
     title="AQI", bg="white")

Basics: Types of Spatial Data

  • Raster data
    • Raster data is typically used to depict continuous spatial phenomena (e.g. altitude or temperature gradients).
    • A raster dataset subdivides the spatial area into a grid of uniform rectangles, each holding one or more values for the variables of interest.
    • In raster data, geometry is inherently encoded rather than explicitly stored as coordinates.

Raster Data

library(raster)
dc_extent <- extent(-77.12, -76.91, 38.79, 39.05)
ras <- raster(dc_extent, nrow=25, ncol=25)
values(ras) <- runif(ncell(ras))
ras
class      : RasterLayer 
dimensions : 25, 25, 625  (nrow, ncol, ncell)
resolution : 0.0084, 0.0104  (x, y)
extent     : -77.12, -76.91, 38.79, 39.05  (xmin, xmax, ymin, ymax)
crs        : NA 
source     : memory
names      : layer 
values     : 0.0003433817, 0.9985121  (min, max)

Raster Data

plot(dc_extent, col=NA)
plot(ras, add=T, legend=F)

Vector Data

  • Vector data consists of points, lines, and polygons, etc.
    • For example, a point might represent a location of air quality sensor with attributes including the observation date, the observed air quality, and the pollutant’s information.
    • A line, a series of connected coordinate pairs, could represent a river system.
    • Polygons, which are closed polylines, are often used to represent areas such as geographical areas.

Line

library(sp)
line_coords <- data.frame(
  longitude = c(-77.3, -77.2086, -77.2367, -77.2555, -77.2865, -76.8),
  latitude = c(38.92, 38.9041, 38.8954, 38.8816, 38.8688, 38.85)
)

line <- SpatialLines(list(Lines(Line(line_coords), ID="segment")))
line
class       : SpatialLines 
features    : 1 
extent      : -77.3, -76.8, 38.85, 38.92  (xmin, xmax, ymin, ymax)
crs         : NA 

Line

plot(line, col="blue", lwd=2, main="Potomac River", 
     ylab="Latitude", xlab="Longitude")

Polygon

polygon_coords <- rbind(
    c(-77.11976, 38.93434),
    c(-77.04102, 38.99555),
    c(-77.00255, 38.96553),
    c(-76.90939, 38.89285),
    c(-76.97950, 38.83781),
    c(-77.03901, 38.79164),
    c(-77.03907, 38.84127),
    c(-77.03910, 38.86811),
    c(-77.09020, 38.90421),
    c(-77.11976, 38.93434)
)

dc_polygons <- SpatialPolygons(list(Polygons(list(Polygon(polygon_coords)), 
                                             ID="DC")))

Polygon

plot(dc_polygons, main="Washington D.C.", col="lightblue", border="blue", 
     xlim = c(-77.3865, -76.8094), ylim = c(38.76835, 39.11431))
points(dmv_air$longitude, dmv_air$latitude, pch=20, col=heat.colors(3)[rank(dmv_air$aqi)])
text(dmv_air$longitude, dmv_air$latitude, labels=dmv_air$name, pos=4, cex=0.7, offset = 0.5)
lines(line_coords, col="blue", lwd=2)

Shapefile and GeoJSON

  • Shapefile
library(sf)
dc_wards <- st_read("./data/Wards_from_2022/Wards_from_2022.shp")
Reading layer `Wards_from_2022' from data source 
  `/Users/baole/Desktop/spatial-workshop/data/Wards_from_2022/Wards_from_2022.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 8 features and 325 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -77.1198 ymin: 38.79164 xmax: -76.90915 ymax: 38.99597
Geodetic CRS:  WGS 84
  • GEOJSON
airbnb_nb <- st_read("./data/neighbourhoods.geojson")
Reading layer `neighbourhoods' from data source 
  `/Users/baole/Desktop/spatial-workshop/data/neighbourhoods.geojson' 
  using driver `GeoJSON'
Simple feature collection with 39 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -77.11934 ymin: 38.81801 xmax: -76.90915 ymax: 38.99597
Geodetic CRS:  WGS 84

Projection

  • Three-dimensional angular system to a two-dimensional planar system.
  • A coordinate reference system (CRS) defines, with the help of coordinates, how the two-dimensional, projected map is related to real locations on the earth.
    • A planar CRS comprises a projection, a datum, and a set of parameters that specify aspects like the map’s center.
  • The most commonly used datum globally is WGS84 (World Geodesic System 1984), closely related to NAD83 (North American Datum of 1983)
  • Most CRSs are assigned an “EPSG code” (European Petroleum Survey Group), which provides a unique identifier to a CRS.
    • EPSG:4326, also known as the WGS84 projection (because it’s based on WGS84’s ellipsoid), is a coordinate system used in Google Earth and GSP systems. EPSG:4269 is based on NAD83.

Projection and CRS

head(dc_wards)
Simple feature collection with 6 features and 325 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -77.08172 ymin: 38.79164 xmax: -76.90915 ymax: 38.9573
Geodetic CRS:  WGS 84
  WARD   NAME          REP_NAME
1    8 Ward 8 Trayon White, Sr.
2    6 Ward 6     Charles Allen
3    7 Ward 7      Vincent Gray
4    2 Ward 2      Brooke Pinto
5    1 Ward 1    Brianne Nadeau
6    5 Ward 5    Zachary Parker
                                                             WEB_URL
1     https://www.dccouncil.us/council/councilmember-trayon-white-sr
2               https://www.dccouncil.us/council/councilmember-allen
3                          https://dccouncil.us/council/vincent-gray
4 https://www.dccouncil.us/council/ward-2-councilmember-brooke-pinto
5                        https://dccouncil.us/council/brianne-nadeau
6                       https://dccouncil.us/council/kenyan-mcduffie
       REP_PHONE             REP_EMAIL
1 (202) 724-8045   twhite@dccouncil.us
2 (202) 724-8072   callen@dccouncil.us
3 (202) 724-8068    vgray@dccouncil.us
4 (202) 724-8058   bpinto@dccouncil.us
5 (202) 724-8181  bnadeau@dccouncil.us
6 (202) 724-8028 zparker@dccouncil.gov
                                  REP_OFFICE WARD_ID  LABEL STUSAB SUMLEV
1 1350 Pennsylvania Ave, Suite 400, NW 20004       8 Ward 8     DC    610
2 1350 Pennsylvania Ave, Suite 110, NW 20004       6 Ward 6     DC    610
3 1350 Pennsylvania Ave, Suite 406, NW 20004       7 Ward 7     DC    610
4 1350 Pennsylvania Ave, Suite 106, NW 20004       2 Ward 2     DC    610
5 1350 Pennsylvania Ave, Suite 108, NW 20004       1 Ward 1     DC    610
6 1350 Pennsylvania Ave, Suite 102, NW 20004       5 Ward 5     DC    610
           GEOID GEOCODE STATE POP100 HU100 P0010001 P0010002 P0010003 P0010004
1 610U600US11008   11008    11  85246 39164    85246    81405     8096    70959
2 610U600US11006   11006    11  84266 52768    84266    77187    50624    19562
3 610U600US11007   11007    11  85685 38968    85685    81893     7038    71628
4 610U600US11002   11002    11  89485 53217    89485    81571    57597    10196
5 610U600US11001   11001    11  85285 45694    85285    76084    42307    18741
6 610U600US11005   11005    11  89617 41794    89617    82833    22248    51307
  P0010005 P0010006 P0010007 P0010008 P0010009 P0010010 P0010011 P0010012
1      305      751       34     1260     3841     3341      988      148
2      216     5028       54     1703     7079     6449      851      388
3      343      591       40     2253     3792     3299      935      140
4      317     9853       64     3544     7914     7407      726      359
5      696     5273       88     8979     9201     8497      956      330
6      495     2731       87     5965     6784     6057     1189      241
  P0010013 P0010014 P0010015 P0010016 P0010017 P0010018 P0010019 P0010020
1      290       22      608      513      156       26      483        4
2     1528       47     2847      228      157       11      274       13
3      219       17      654      591      150       13      486        5
4     1666       34     4056      120       95       20      193       10
5     1294       38     4558      130      197        3      638       10
6      838       22     2146      396      260       12      736        7
  P0010021 P0010022 P0010023 P0010024 P0010025 P0010026 P0010027 P0010028
1        0       46       30       24        3      427      217       32
2        0       31       31       38        5      518      125       73
3        2       32       29       19        7      444      233       42
4        2       61       21       42        2      431       97       51
5        8      239       21       61       14      609      155       50
6        2      133       23       38       14      630      242       40
  P0010029 P0010030 P0010031 P0010032 P0010033 P0010034 P0010035 P0010036
1        5       66        7        1       33       18       20        1
2        4      106       11        0      106       20       33        0
3        1       53        5        0       24       10       21        9
4        2       81        9        0       96       23       34        4
5        5      114       24        0      145       16       49        0
6        1      123        8        1       81       30       42        1
  P0010037 P0010038 P0010039 P0010040 P0010041 P0010042 P0010043 P0010044
1        9        3        5        3        3        0        0        0
2        8        3       17        5        5        1        0        0
3       10        3       15        6        7        2        1        0
4        5        3        9        3        8        4        0        0
5        7        3       19        2        9        6        0        0
6        8        4       25       10        3        9        1        0
  P0010045 P0010046 P0010047 P0010048 P0010049 P0010050 P0010051 P0010052
1        2        2       66       15        3       34        2        6
2        0        1      101       18        0       57        7       14
3        1        1       43       12        0       23        0        3
4        0        2       74       25        0       39        1        2
5        1        4       87       10        1       52        2       10
6        0        1       92       17        0       55        1        7
  P0010053 P0010054 P0010055 P0010056 P0010057 P0010058 P0010059 P0010060
1        0        0        4        0        2        0        0        0
2        0        0        1        0        2        2        0        0
3        0        1        2        0        1        0        0        0
4        0        0        2        0        4        1        0        0
5        0        0        5        0        4        2        1        0
6        0        0        3        3        0        4        1        0
  P0010061 P0010062 P0010063 P0010064 P0010065 P0010066 P0010067 P0010068
1        0        0        7        4        1        0        1        0
2        0        0       10        4        5        0        0        0
3        1        0        6        0        4        0        1        1
4        0        0        1        1        0        0        0        0
5        0        0        8        1        3        0        1        2
6        1        0        5        4        0        0        0        0
  P0010069 P0010070 P0010071 P0020001 P0020002 P0020003 P0020004 P0020005
1        1        0        0    85246     3081    82165    79340     7670
2        1        1        1    84266     6122    78144    73945    49107
3        0        0        0    85685     4078    81607    78876     6624
4        0        1        1    89485     9744    79741    75843    55391
5        1        0        0    85285    17269    68016    64204    40017
6        1        0        0    89617    10419    79198    75285    21217
  P0020006 P0020007 P0020008 P0020009 P0020010 P0020011 P0020012 P0020013
1    70357      213      731       29      340     2825     2475      935
2    19243      121     4977       50      447     4199     3885      800
3    70969      239      568       32      444     2731     2384      907
4    10000      120     9803       54      475     3898     3669      695
5    18330      163     5202       70      422     3812     3506      862
6    50554      218     2687       76      533     3913     3516     1122
  P0020014 P0020015 P0020016 P0020017 P0020018 P0020019 P0020020 P0020021
1      132      272       19      104      482      152       23      297
2      338     1492       40      630      206      153       11      156
3       93      205       11      131      571      143       13      262
4      297     1633       27      662       99       87       20       85
5      238     1279       30      548      118      189        3      172
6      187      802       14      338      381      251       11      349
  P0020022 P0020023 P0020024 P0020025 P0020026 P0020027 P0020028 P0020029
1        4        0       12       29       14        0      315      194
2       12        0        2       22       23        0      268       97
3        3        2        6       29        8        0      326      202
4        9        2        2       15       36        0      200       72
5        8        7        2       19       29        2      275      103
6        7        1        9       22       21        1      368      197
  P0020030 P0020031 P0020032 P0020033 P0020034 P0020035 P0020036 P0020037
1       30        5       37        3        0        7       15        6
2       62        3       38        8        0        5       18       11
3       36        1       35        5        0        4       10        4
4       45        2       23        6        0        3       19        7
5       48        3       48       19        0        4       15        2
6       33        1       53        7        0        1       26       13
  P0020038 P0020039 P0020040 P0020041 P0020042 P0020043 P0020044 P0020045
1        0        8        2        2        3        2        0        0
2        0        7        3        5        5        5        1        0
3        0        8        2        7        6        5        0        1
4        3        4        2        6        3        4        1        0
5        0        6        3       11        2        8        1        0
6        1        6        4        8       10        0        7        1
  P0020046 P0020047 P0020048 P0020049 P0020050 P0020051 P0020052 P0020053
1        0        0        1       32       13        3       10        1
2        0        0        0       40       16        0        9        6
3        0        0        0       19       10        0        7        0
4        0        0        0       28       20        0        2        1
5        0        0        2       27        5        0       10        2
6        0        0        0       27       13        0        8        0
  P0020054 P0020055 P0020056 P0020057 P0020058 P0020059 P0020060 P0020061
1        5        0        0        0        0        0        0        0
2        8        0        0        0        0        0        1        0
3        1        0        1        0        0        0        0        0
4        2        0        0        0        0        2        1        0
5        7        0        0        0        0        1        2        0
6        5        0        0        0        0        0        0        1
  P0020062 P0020063 P0020064 P0020065 P0020066 P0020067 P0020068 P0020069
1        0        0        0        3        2        0        0        1
2        0        0        0        6        4        1        0        0
3        0        0        0        2        0        0        0        1
4        0        0        0        1        1        0        0        0
5        0        0        0        4        1        0        0        1
6        0        0        0        2        2        0        0        0
  P0020070 P0020071 P0020072 P0020073 P0030001 P0030002 P0030003 P0030004
1        0        0        0        0    62507    59754     6972    50933
2        0        1        0        0    74262    68775    45847    16404
3        1        0        0        0    66683    63965     5876    55656
4        0        0        0        0    83628    76770    54915     9062
5        2        0        0        0    75254    67609    39386    15702
6        0        0        0        0    74096    69224    19777    41961
  P0030005 P0030006 P0030007 P0030008 P0030009 P0030010 P0030011 P0030012
1      237      646       25      941     2753     2413      675      111
2      177     4765       50     1532     5487     4988      610      332
3      265      504       33     1631     2718     2375      667       90
4      272     9442       53     3026     6858     6476      630      322
5      543     4954       82     6942     7645     7102      757      291
6      414     2490       75     4507     4872     4360      798      193
  P0030013 P0030014 P0030015 P0030016 P0030017 P0030018 P0030019 P0030020
1      177       19      464      378      104       19      375        4
2      966       42     2422      181      116        7      221       12
3      114       12      498      469       92       11      352        3
4     1372       31     3677       97       61       16      159       10
5     1001       31     3905       94      167        3      554        9
6      480       14     1605      327      181       10      582        6
  P0030021 P0030022 P0030023 P0030024 P0030025 P0030026 P0030027 P0030028
1        0       41       25       19        2      293      161       14
2        0       24       24       30        1      404      101       48
3        1       21       24       14        7      306      159       33
4        1       48       20       30        2      317       73       32
5        6      203       19       49       13      469      129       35
6        2      100       22       28       12      449      181       23
  P0030029 P0030030 P0030031 P0030032 P0030033 P0030034 P0030035 P0030036
1        5       45        4        1       24       10        9        0
2        4       91        4        0       96       11       18        0
3        1       36        2        0       15        6       14        8
4        1       58        7        0       79       16       20        4
5        4       87       15        0      134        8       29        0
6        0       95        3        1       59       23       25        1
  P0030037 P0030038 P0030039 P0030040 P0030041 P0030042 P0030043 P0030044
1        8        3        3        2        2        0        0        0
2        5        3       12        4        5        1        0        0
3        7        2       12        2        7        2        0        0
4        5        3        7        2        5        3        0        0
5        4        2        8        1        4        6        0        0
6        5        2       20        7        0        2        1        0
  P0030045 P0030046 P0030047 P0030048 P0030049 P0030050 P0030051 P0030052
1        0        2       43       10        3       24        1        2
2        0        1       88       15        0       52        5       12
3        0        0       32        7        0       20        0        3
4        0        2       64       24        0       33        1        0
5        0        3       68        4        1       44        2        8
6        0        1       60       11        0       40        0        3
  P0030053 P0030054 P0030055 P0030056 P0030057 P0030058 P0030059 P0030060
1        0        0        1        0        2        0        0        0
2        0        0        1        0        2        1        0        0
3        0        0        1        0        1        0        0        0
4        0        0        2        0        3        1        0        0
5        0        0        3        0        4        2        0        0
6        0        0        2        3        0        1        0        0
  P0030061 P0030062 P0030063 P0030064 P0030065 P0030066 P0030067 P0030068
1        0        0        4        3        1        0        0        0
2        0        0        7        3        4        0        0        0
3        0        0        5        0        4        0        0        1
4        0        0        1        1        0        0        0        0
5        0        0        6        1        3        0        0        2
6        0        0        3        3        0        0        0        0
  P0030069 P0030070 P0030071 P0040001 P0040002 P0040003 P0040004 P0040005
1        0        0        0    62507     2195    60312    58274     6649
2        0        0        0    74262     5257    69005    65905    44499
3        0        0        0    66683     2938    63745    61796     5529
4        0        0        0    83628     8618    75010    71714    52844
5        0        0        0    75254    13901    61353    58271    37464
6        0        0        0    74096     7776    66320    63538    18976
  P0040006 P0040007 P0040008 P0040009 P0040010 P0040011 P0040012 P0040013
1    50546      178      628       23      250     2038     1779      651
2    16156      102     4722       48      378     3100     2854      581
3    55244      197      484       25      317     1949     1718      648
4     8908      111     9397       46      408     3296     3121      605
5    15389      129     4890       65      334     3082     2860      690
6    41437      180     2453       69      423     2782     2492      763
  P0040014 P0040015 P0040016 P0040017 P0040018 P0040019 P0040020 P0040021
1       99      168       17       83      358      100       16      241
2      303      947       36      514      169      114        7      131
3       65      106       10       98      454       86       11      201
4      271     1344       26      585       83       59       16       80
5      218      993       24      473       86      163        3      150
6      151      462        7      261      317      175        9      292
  P0040022 P0040023 P0040024 P0040025 P0040026 P0040027 P0040028 P0040029
1        4        0        7       25       10        0      233      152
2       11        0        2       18       21        0      207       82
3        1        1        5       24        8        0      218      136
4        9        1        1       15       26        0      150       56
5        7        5        2       18       26        2      203       88
6        6        1        9       21       17        1      273      157
  P0040030 P0040031 P0040032 P0040033 P0040034 P0040035 P0040036 P0040037
1       14        5       26        3        0        6        7        4
2       40        3       34        4        0        5       10        9
3       27        1       21        2        0        2        6        2
4       26        1       17        5        0        3       15        7
5       34        2       37       12        0        3        8        2
6       17        0       45        3        0        1       20        9
  P0040038 P0040039 P0040040 P0040041 P0040042 P0040043 P0040044 P0040045
1        0        7        2        2        2        2        0        0
2        0        4        3        3        4        5        1        0
3        0        7        1        6        2        5        0        0
4        3        4        2        5        2        4        0        0
5        0        3        2        4        1        4        1        0
6        1        3        2        7        7        0        0        1
  P0040046 P0040047 P0040048 P0040049 P0040050 P0040051 P0040052 P0040053
1        0        0        1       25        9        3       10        1
2        0        0        0       35       13        0        8        5
3        0        0        0       12        6        0        5        0
4        0        0        0       24       20        0        1        1
5        0        0        2       16        1        0        5        2
6        0        0        0       16        7        0        6        0
  P0040054 P0040055 P0040056 P0040057 P0040058 P0040059 P0040060 P0040061
1        2        0        0        0        0        0        0        0
2        8        0        0        0        0        0        1        0
3        1        0        0        0        0        0        0        0
4        0        0        0        0        0        1        1        0
5        5        0        0        0        0        1        2        0
6        3        0        0        0        0        0        0        0
  P0040062 P0040063 P0040064 P0040065 P0040066 P0040067 P0040068 P0040069
1        0        0        0        1        1        0        0        0
2        0        0        0        4        3        1        0        0
3        0        0        0        1        0        0        0        0
4        0        0        0        1        1        0        0        0
5        0        0        0        3        1        0        0        0
6        0        0        0        1        1        0        0        0
  P0040070 P0040071 P0040072 P0040073 H0010001 H0010002 H0010003 P0050001
1        0        0        0        0    39164    34608     4556     3362
2        0        0        0        0    52768    45285     7483     1991
3        1        0        0        0    38968    34958     4010     4528
4        0        0        0        0    53217    46543     6674    13218
5        2        0        0        0    45694    41035     4659     5066
6        0        0        0        0    41794    37211     4583     6759
  P0050002 P0050003 P0050004 P0050005 P0050006 P0050007 P0050008 P0050009
1     1054       23       32      750      249     2308        0      563
2      194       31       29      134        0     1797      655      255
3     2557     2197       34      289       37     1971        0        0
4      205       27        0      178        0    13013    11470        0
5      305        0      173      132        0     4761     4009        0
6     1007        0       40      967        0     5752     3053        0
  P0050010 OBJECTID                               GLOBALID CREATED_US
1     1745        1 {E31550AE-6FAE-4B74-909F-52B283BFAF68}       <NA>
2      887        2 {765C4F49-9292-4BDB-AA24-39F4EE43359F}       <NA>
3     1971        3 {73F07042-7D7F-452B-9BB3-0F87B0EC5418}       <NA>
4     1543        4 {7F8C2A51-427C-45FC-91EB-9693656AED9C}       <NA>
5      752        5 {C3C6E2E7-E68D-49B2-970C-D60675EA7B4B}       <NA>
6     2699        6 {6C10DD95-DE70-4F26-94BA-79F378FA74E0}       <NA>
  CREATED_DA LAST_EDITE LAST_EDI_1 SHAPEAREA SHAPELEN
1       <NA>       <NA>       <NA>         0        0
2       <NA>       JLAY 2023-12-07         0        0
3       <NA>       <NA>       <NA>         0        0
4       <NA>       <NA>       <NA>         0        0
5       <NA>       JLAY 2023-12-07         0        0
6       <NA>       JLAY 2023-12-07         0        0
                        geometry
1 POLYGON ((-76.99392 38.8777...
2 POLYGON ((-77.00908 38.8716...
3 POLYGON ((-76.94186 38.9185...
4 POLYGON ((-77.0327 38.88262...
5 POLYGON ((-77.03523 38.9374...
6 POLYGON ((-76.99144 38.9573...
st_crs(airbnb_nb)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]

Exploring Spatial Dependence

Tobler’s first law of geography

“Everything is related to everything else, but near things are more related than distant things.”

                                                                              - Tobler (1970)
  • Examples of spatial dependence: crime patterns, disease spread, housing market, political orientations, transportation, environmental equity, etc.

  • We oftentimes even use spatial dependence without realizing it:

Spatial Weights Matrix

  • A vast number of spatial analysis methods are based on a spatial matrix \mathbf{W}_{ij} of size n \times n with i an observation and j the “neighbours” of that observation. \mathbf{W}_{ij} represents the degree of spatial relationship between i and j.

  • Classically one can define :

    • Neighbouring matrix (\mathbf{W}_{ij} = 1 if j is a neighbour of i, 0 otherwise)
    • Distance matrix (\mathbf{W}_{ij} = the distance between i and j, modified by a function like \frac{1}{distance} or \frac{1}{distance^2})
    • Interaction matrix (\mathbf{W}_{ij} = the degree of interaction between i and j, the measure of the interaction depends on the subject of the analysis)

Spatial Weights Matrix: Neighborhood

library(spdep)
library(sf)
columbus <- st_read(system.file("shapes/columbus.shp",
               package = "spData"), quiet = TRUE)
nb <- poly2nb(columbus, queen = TRUE)
head(nb)
[[1]]
[1] 2 3

[[2]]
[1] 1 3 4

[[3]]
[1] 1 2 4 5

[[4]]
[1] 2 3 5 8

[[5]]
[1]  3  4  6  8  9 11 15 16

[[6]]
[1] 5 9

Neighborhood

plot(st_geometry(columbus), border = "lightgray")
plot(nb, st_geometry(columbus), add = TRUE)

Neighborhood

id <- 25 # select an area
columbus$neighbors <- "other"
columbus$neighbors[id] <- "area"
columbus$neighbors[nb[[id]]] <- "neighbors"
cols <- c("gray30", "gray", "white")
plot(st_geometry(columbus), col=cols[as.factor(columbus$neighbors)], main="Spatial Neighborhoods")
legend("topright", legend=c("Selected Area", "Neighbors", "Other"), fill=cols, bty="n")

Neighborhood

  • Neighborhood Matrix

k Nearest Neighbors

  • E.g.: neighbors based on 3 nearest neighbors

k Nearest Neighbors

cols_ctrs <- st_centroid(columbus)
knb <- knn2nb(knearneigh(cols_ctrs, k = 5)) # k=3 number nearest neighbors
plot(st_geometry(columbus), border = "lightgray")
plot(knb, st_geometry(columbus), add = TRUE)

Spatial Weight Matrix: Distance

  • Distance matrix
dist_mat <- spDists(as.matrix(st_coordinates(cols_ctrs)), longlat = FALSE)
dist_mat[1:6,1:6]
          [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
[1,] 0.0000000 0.5987183 0.5796856 0.7480069 1.0878333 1.2779137
[2,] 0.5987183 0.0000000 0.7118774 0.3397537 0.9983315 1.5134864
[3,] 0.5796856 0.7118774 0.0000000 0.5609564 0.5233702 0.8053411
[4,] 0.7480069 0.3397537 0.5609564 0.0000000 0.6901508 1.3018772
[5,] 1.0878333 0.9983315 0.5233702 0.6901508 0.0000000 0.7533395
[6,] 1.2779137 1.5134864 0.8053411 1.3018772 0.7533395 0.0000000
  • Inverse distance weights (IDW) matrix
inv_dist_mat <- 1 / (dist_mat)
diag(inv_dist_mat) <- 0
inv_dist_mat[1:6,1:6]
          [,1]      [,2]     [,3]      [,4]      [,5]      [,6]
[1,] 0.0000000 1.6702347 1.725073 1.3368861 0.9192585 0.7825254
[2,] 1.6702347 0.0000000 1.404736 2.9433086 1.0016712 0.6607261
[3,] 1.7250730 1.4047363 0.000000 1.7826697 1.9106935 1.2417099
[4,] 1.3368861 2.9433086 1.782670 0.0000000 1.4489588 0.7681216
[5,] 0.9192585 1.0016712 1.910694 1.4489588 0.0000000 1.3274227
[6,] 0.7825254 0.6607261 1.241710 0.7681216 1.3274227 0.0000000

Inverse Distance Weights

id <- 25
columbus$weights <- inv_dist_mat[id, ]

# Plot the map, coloring regions based on the selected area's weights
plot(columbus["weights"], col = gray(1 - columbus$weights/max(columbus$weights)), main = "Inverse Distance Weights")

Exercise 1: Getting Started with Spatial Data

  1. Download and load the libraries
library(sf); library(sp); library(spdep)
  1. Download and load the data of DC wards ().
dc_wards <- st_read("./data/Wards_from_2022/Wards_from_2022.shp", quiet = TRUE)
  1. Plot a DC wards map.
st_geometry()
plot()
  1. Find the neighborhood (and k nearest neighbors) for DC wards and plot the neighbors of Ward 2.
poly2nb() # Polygon to neighborhood
st_centroid()
knearneigh(); knn2nb() #k nearest neighbor and knn to neighborhood
  1. Create a distance matrix between wards.
spDists() #distance

Exercise 1: Spatial Data and Spatial Matrix

dc_wards <- st_read("./data/Wards_from_2022/Wards_from_2022.shp", quiet = TRUE)
nb <- poly2nb(dc_wards, queen = TRUE)
plot(st_geometry(dc_wards), border = "lightgray")
plot(nb, st_geometry(dc_wards), add = TRUE)
dcw_ctrs <- st_centroid(dc_wards)
knb <- knn2nb(knearneigh(dcw_ctrs, k = 3)) # k=3 number 
plot(st_geometry(dc_wards), border = "lightgray")
plot(knb, st_geometry(dc_wards), add = TRUE)

Exercise 1: Spatial Data and Spatial Matrix

dc_wards$neighbors <- "other"
dc_wards$neighbors[dc_wards$WARD == 2] <- "selected area"
dc_wards$neighbors[knb[dc_wards$WARD == 2][[1]]] <- "neighbors"
cols <- c("gray30", "gray", "white")
plot(st_geometry(dc_wards), col=cols[as.factor(dc_wards$neighbors)], main="Spatial Neighborhoods")
legend("topright", legend=c("Selected Area", "Neighbors", "Other"), fill=cols, bty="n")
spDists(as.matrix(st_coordinates(dcw_ctrs)))

Autocorrelation

  • Spatial autocorrelation is used to describe the extent to which a variable is correlated with itself through space.

    • Positive spatial autocorrelation occurs when observations with similar values are closer together (i.e., clustered).
  • Global vs. local spatial autocorrelation

  • Statistical tests

    • Moran’s I (Moran 1950): correlation coefficient.
    • Geary’s C (Geary 1954): distance-based and more sensitive to local spatial autocorrelation.

Autocorrelation: Moran’s I

  • Global Moran’s I

I = \frac{n \sum\limits_i \sum\limits_j w_{ij}(Y_i - \bar{Y})(Y_j - \bar{Y})}{\left(\sum\limits_{i \neq j} w_{ij}\right) \sum\limits_i (Y_i - \bar{Y})^2}

  • Expected value of Moran’s I: E(I) = \frac{-1}{n-1}

  • Test statistic: z = \frac{I - E(I)}{\sqrt{Var(I)}}

Moran’s I

  • To what extent the Moran’s I value is different from random

Moran’s I

  • Local Moran’s I

I_i = \frac{n(Y_i - \bar{Y})}{\sum_j (Y_j - \bar{Y})^2} \sum_j w_{ij}(Y_j - \bar{Y})

  • Note that the global Moran’s I is proportional to the sum of the local Moran’s I obtained for all locations:

I = \frac{1}{\sum\limits_{i \neq j} w_{ij}} \sum\limits_{i} I_i

Autocorrelation: Boston House Price

boston <- st_read(system.file("shapes/boston_tracts.shp",
                           package = "spData"), quiet = TRUE)
head(boston[,1:7])
Simple feature collection with 6 features and 7 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -71.1753 ymin: 42.33 xmax: -71.1238 ymax: 42.3737
Geodetic CRS:  NAD27
  poltract                    TOWN TOWNNO TRACT      LON     LAT MEDV
1     0001 Boston Allston-Brighton     74     1 -71.0830 42.2172 17.8
2     0002 Boston Allston-Brighton     74     2 -71.0950 42.2120 21.7
3     0003 Boston Allston-Brighton     74     3 -71.1007 42.2100 22.7
4     0004 Boston Allston-Brighton     74     4 -71.0930 42.2070 22.6
5     0005 Boston Allston-Brighton     74     5 -71.0905 42.2033 25.0
6     0006 Boston Allston-Brighton     74     6 -71.0865 42.2100 19.9
                        geometry
1 POLYGON ((-71.1238 42.3689,...
2 POLYGON ((-71.1546 42.3573,...
3 POLYGON ((-71.1685 42.3601,...
4 POLYGON ((-71.15391 42.3461...
5 POLYGON ((-71.1479 42.337, ...
6 POLYGON ((-71.1382 42.3535,...

Boston House Price

plot(boston["MEDV"], main="Boston median prices of owner-occupied housing ($1000 USD)")

Boston House Price

mapview::mapview(boston, zcol = "MEDV")

Autocorrelation of House Price

nb <- poly2nb(boston) 
nbw <- nb2listw(nb, style = "W")

# Global Moran's I
gmoran <- moran.test(boston$MEDV, nbw,
                     alternative = "greater")
gmoran

    Moran I test under randomisation

data:  boston$MEDV  
weights: nbw    

Moran I statistic standard deviate = 23.35, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     0.6266753872     -0.0019801980      0.0007248686 

Monte Carlo Method of Moran’s I

  • A permutation bootstraping approach
gmoranMC <- moran.mc(boston$MEDV, nbw, nsim = 1000)
gmoranMC

    Monte-Carlo simulation of Moran I

data:  boston$MEDV 
weights: nbw  
number of simulations + 1: 1001 

statistic = 0.62668, observed rank = 1001, p-value = 0.000999
alternative hypothesis: greater

Monte Carlo Method of Moran’s I

hist(gmoranMC$res, breaks = round(seq(min(gmoranMC$res)-0.01,max(gmoranMC$res)+0.01,0.01), 3), 
     main = "Distribution of Results")
abline(v = gmoranMC$statistic, col = "red", lwd=1.5, lty=2)

Clusters Based on Autocorrelation

  • We can use local Moran’s I to identify cluster.
lmoran <- localmoran(boston$MEDV, nbw, alternative = "two.sided")
head(lmoran)
             Ii          E.Ii       Var.Ii        Z.Ii Pr(z != E(Ii))
1 -0.3457508492 -5.254157e-04 3.275376e-02 -1.90753363   5.645152e-02
2  0.0175875407 -1.626873e-05 2.045711e-03  0.38921049   6.971204e-01
3  0.0123379633 -6.557001e-07 4.089699e-05  1.92939381   5.368199e-02
4 -0.0001654033 -1.059064e-07 1.331742e-05 -0.04529559   9.638717e-01
5  0.3591628595 -1.427815e-04 7.898947e-03  4.04277384   5.282257e-05
6  0.0545610965 -1.625936e-04 1.357382e-02  0.46970410   6.385664e-01
boston$pvals <- lmoran[, 5]

Clusters

moran.plot(as.vector(boston$MEDV), nbw)
text(12.5, 45, "Low-High"); text(35, 45, "High-High");text(12.5, 7.5, "Low-Low"); text(35, 7.5, "High-Low")

Exercise 2: Autocorrelation

  • DC Airbnb price
  • Listing data
listings <- read.csv("./data/listings.csv")
head(listings)
    id                                                                 name
1 3686      Home in Washington · ★4.64 · 1 bedroom · 1 bed · 1 private bath
2 3943 Townhouse in Washington · ★4.83 · 1 bedroom · 1 bed · 1 private bath
3 4197    Home in Washington · ★4.85 · 1 bedroom · 1 bed · 1.5 shared baths
4 4529       Home in Washington · ★4.66 · 1 bedroom · 1 bed · 1 shared bath
5 4967             Home in Washington · ★4.74 · 1 bedroom · 1 bed · 3 baths
6 5589       Rental unit in Washington · ★4.50 · 1 bedroom · 1 bed · 1 bath
  host_id host_name neighbourhood_group
1    4645      Vita                  NA
2    5059      Vasa                  NA
3    5061    Sandra                  NA
4    5803   Bertina                  NA
5    7086    Edward                  NA
6    6527       Ami                  NA
                                      neighbourhood latitude longitude
1                                Historic Anacostia 38.86339 -76.98889
2 Edgewood, Bloomingdale, Truxton Circle, Eckington 38.91195 -77.00456
3                        Capitol Hill, Lincoln Park 38.88719 -76.99472
4                      Eastland Gardens, Kenilworth 38.90585 -76.94469
5    Ivy City, Arboretum, Trinidad, Carver Langston 38.91217 -76.99249
6    Kalorama Heights, Adams Morgan, Lanier Heights 38.91887 -77.04008
        room_type price minimum_nights number_of_reviews last_review
1    Private room    67             31                84  2023-08-30
2    Private room    66              1               495  2023-11-22
3    Private room   135              7                58  2023-11-11
4    Private room    56             30               102  2019-07-05
5    Private room  2500           1125                30  2016-09-22
6 Entire home/apt    60             31                96  2023-08-17
  reviews_per_month calculated_host_listings_count availability_365
1              0.53                              1              365
2              2.78                              5              252
3              0.33                              2              321
4              0.58                              2              179
5              0.19                              3              365
6              0.55                              1              121
  number_of_reviews_ltm                          license
1                     3                                 
2                    46 Hosted License: 5007242201001033
3                     6 Hosted License: 5007242201000749
4                     0                           Exempt
5                     0                                 
6                     1                                 

Exercise 2: Autocorrelation

  • Neighborhood
dc_neighborhood <- st_read("./data/neighbourhoods.geojson", quiet=T)
head(dc_neighborhood)
Simple feature collection with 6 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -77.03515 ymin: 38.81801 xmax: -76.93494 ymax: 38.90736
Geodetic CRS:  WGS 84
                                                                  neighbourhood
1                              Congress Heights, Bellevue, Washington Highlands
2                                                      Douglas, Shipley Terrace
3                            Woodland/Fort Stanton, Garfield Heights, Knox Hill
4                                                     Near Southeast, Navy Yard
5                                 River Terrace, Benning, Greenway, Dupont Park
6 Downtown, Chinatown, Penn Quarters, Mount Vernon Square, North Capitol Street
  neighbourhood_group                       geometry
1                <NA> MULTIPOLYGON (((-77.01389 3...
2                <NA> MULTIPOLYGON (((-76.99291 3...
3                <NA> MULTIPOLYGON (((-76.97714 3...
4                <NA> MULTIPOLYGON (((-76.97936 3...
5                <NA> MULTIPOLYGON (((-76.9376 38...
6                <NA> MULTIPOLYGON (((-77.0149 38...
#plot(st_geometry(dc_neighborhood), border = "lightgray")

Exercise 2: DC Airbnb Price

# Select only one bedroom listings
listings_1b <- listings[grep("1 bedroom", listings$name), ]
listings_avgs <- aggregate(price ~ neighbourhood, data = listings_1b, FUN = mean)
dc_nb_avgs <- merge(dc_neighborhood, listings_avgs, by.x = "neighbourhood", by.y = "neighbourhood", all.x = TRUE)
head(dc_nb_avgs)
Simple feature collection with 6 features and 3 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -77.08773 ymin: 38.87285 xmax: -76.91322 ymax: 38.96315
Geodetic CRS:  WGS 84
                                                                             neighbourhood
1                                                     Brightwood Park, Crestwood, Petworth
2                                                            Brookland, Brentwood, Langdon
3                                                               Capitol Hill, Lincoln Park
4                                          Capitol View, Marshall Heights, Benning Heights
5                                           Cathedral Heights, McLean Gardens, Glover Park
6 Cleveland Park, Woodley Park, Massachusetts Avenue Heights, Woodland-Normanstone Terrace
  neighbourhood_group     price                       geometry
1                <NA>  82.11215 MULTIPOLYGON (((-77.01904 3...
2                <NA>  95.69841 MULTIPOLYGON (((-76.99647 3...
3                <NA> 117.31471 MULTIPOLYGON (((-77.00909 3...
4                <NA>  69.32000 MULTIPOLYGON (((-76.93291 3...
5                <NA> 148.59524 MULTIPOLYGON (((-77.07651 3...
6                <NA> 106.66667 MULTIPOLYGON (((-77.05395 3...

Exercise 2: DC Airbnb Price

mapview::mapview(dc_nb_avgs, zcol="price")

Exercise 2: DC Airbnb Price

  • Invesitigate the autocorrelation of DC’s Airbnb Price
    • Moran’s I test
    • Monte Carlo method
    • Moran scatter plot

       
  • Note:
sf_use_s2(FALSE) #traditional “GEOS” library 
#sf_use_s2(TRUE) #default of sf package; developed by Google using spherical projections.

Exercise 2: DC Airbnb Price

dc_nb_ctr <- st_centroid(dc_nb_avgs)
nb <- dnearneigh(dc_nb_ctr, 0, 5)
nbw <- nb2listw(nb, style="W")
gmoran <- moran.test(dc_nb_avgs$price, nbw,
                     alternative = "greater")
gmoran

    Moran I test under randomisation

data:  dc_nb_avgs$price  
weights: nbw    

Moran I statistic standard deviate = 5.1457, p-value = 1.332e-07
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.241062069      -0.026315789       0.002699959 

Exercise 2: DC Airbnb Price

gmoranMC <- moran.mc(dc_nb_avgs$price, nbw, nsim = 1000)
gmoranMC

    Monte-Carlo simulation of Moran I

data:  dc_nb_avgs$price 
weights: nbw  
number of simulations + 1: 1001 

statistic = 0.24106, observed rank = 999, p-value = 0.001998
alternative hypothesis: greater

Exercise 2: DC Airbnb Price

hist(gmoranMC$res, breaks = round(seq(min(gmoranMC$res)-0.01,max(gmoranMC$res)+0.01,0.01), 3), 
     main = "Distribution of Results")
abline(v = gmoranMC$statistic, col = "red", lwd=1.5, lty=2)

Exercise 2: DC Airbnb Price

moran.plot(as.vector(dc_nb_avgs$price), nbw)

Spatial Interpolation

  • Spatial dependence: bug or feature
  • Estimating or predicting values at unsampled locations using observations from sampled locations based on spatial dependence.
    • Creating continuous surfaces

Spatial Interpolation Methods

  • “Closest observation” (naive estimates)

  • Thiessen Polygons

Spatial Interpolation Methods

  • Inverse Distance Weighting (IDW)

\hat{y}(s_0) = \frac{\sum_{i=1}^{n} y(s_i) \times (1/d_i^{\beta})}{\sum_{i=1}^{n} (1/d_i^{\beta})} = \sum_{i=1}^{n} y(s_i) \times w_i

  • Nearest neighbors

\hat{y}(s_0) = \frac{\sum_{i=1}^{k} y(s_i)}{k}

  • Ensemble approach
    • New development: ensemble of multiple data sources (monitor+satellite+simulation); hyper local; etc.

Spatial Interpolation: DMV Air Quality

library(ggplot2)
us_county <- st_read("./data/cb_2018_us_county_500k/cb_2018_us_county_500k.shp", quiet=T)
dmv_county <- us_county[us_county$STATEFP %in% c("11", "24", "51"),]
dmv_county
Simple feature collection with 158 features and 9 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -83.67539 ymin: 36.54074 xmax: -75.04894 ymax: 39.72304
Geodetic CRS:  NAD83
First 10 features:
    STATEFP COUNTYFP COUNTYNS       AFFGEOID GEOID                 NAME LSAD
58       11      001 01702382 0500000US11001 11001 District of Columbia   00
145      24      003 01710958 0500000US24003 24003         Anne Arundel   06
296      51      015 01480098 0500000US51015 51015              Augusta   06
297      51      037 01492442 0500000US51037 51037            Charlotte   06
298      51      051 01497376 0500000US51051 51051            Dickenson   06
299      51      053 01690739 0500000US51053 51053            Dinwiddie   06
300      51      069 01480124 0500000US51069 51069            Frederick   06
301      51      073 01480126 0500000US51073 51073           Gloucester   06
302      51      085 01480132 0500000US51085 51085              Hanover   06
303      51      099 01480137 0500000US51099 51099          King George   06
         ALAND    AWATER                       geometry
58   158340391  18687198 MULTIPOLYGON (((-77.11976 3...
145 1074287145 448099595 MULTIPOLYGON (((-76.84036 3...
296 2504708478   9537636 MULTIPOLYGON (((-79.53328 3...
297 1231010119   5809378 MULTIPOLYGON (((-78.90459 3...
298  855860629   8082181 MULTIPOLYGON (((-82.55383 3...
299 1305091846   8954562 MULTIPOLYGON (((-77.90034 3...
300 1070101640   5935153 MULTIPOLYGON (((-78.54022 3...
301  564117808 181754414 MULTIPOLYGON (((-76.7063 37...
302 1211818312  12652671 MULTIPOLYGON (((-77.78648 3...
303  465122891  21198264 MULTIPOLYGON (((-77.34744 3...

DMV Air Quality

dmv_air <- st_read("./data/dmv_air/dmv_air.shp", quiet=T)
dmv_air
Simple feature collection with 35 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -82.1641 ymin: 36.608 xmax: -75.79732 ymax: 39.70595
Geodetic CRS:  WGS 84
First 10 features:
                Stat_Nm Site_Nm                                    Lcl_S_N
1  District Of Columbia      41                              RIVER TERRACE
2  District Of Columbia      43                       MCMILLAN  NCore-PAMS
3  District Of Columbia      51                                  Near Road
4  District Of Columbia      53                  King Greenleaf Rec Center
5              Maryland       2                                 Millington
6              Maryland       2                                  Piney Run
7              Maryland       3 Fair Hill Natural Resource Management Area
8              Maryland       4                                 Horn Point
9              Maryland       6                    Howard County Near Road
10             Maryland       9                                 Hagerstown
    annl_mn                   geometry
1  9.661741 POINT (-76.95807 38.89557)
2  8.073839 POINT (-77.01318 38.92185)
3  8.695211 POINT (-76.95343 38.89477)
4  8.934325 POINT (-77.01282 38.87516)
5  6.922022 POINT (-75.79732 39.30502)
6  6.403240   POINT (-79.012 39.70595)
7  7.459565 POINT (-75.86477 39.70298)
8  6.918579 POINT (-76.14101 38.58752)
9  8.705889 POINT (-76.84611 39.14313)
10 7.068597 POINT (-77.72024 39.56418)
dmv_air <- st_transform(dmv_air, crs = crs(dmv_county))
dmv_air
Simple feature collection with 35 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -82.1641 ymin: 36.608 xmax: -75.79732 ymax: 39.70595
Geodetic CRS:  NAD83
First 10 features:
                Stat_Nm Site_Nm                                    Lcl_S_N
1  District Of Columbia      41                              RIVER TERRACE
2  District Of Columbia      43                       MCMILLAN  NCore-PAMS
3  District Of Columbia      51                                  Near Road
4  District Of Columbia      53                  King Greenleaf Rec Center
5              Maryland       2                                 Millington
6              Maryland       2                                  Piney Run
7              Maryland       3 Fair Hill Natural Resource Management Area
8              Maryland       4                                 Horn Point
9              Maryland       6                    Howard County Near Road
10             Maryland       9                                 Hagerstown
    annl_mn                   geometry
1  9.661741 POINT (-76.95807 38.89557)
2  8.073839 POINT (-77.01318 38.92185)
3  8.695211 POINT (-76.95343 38.89477)
4  8.934325 POINT (-77.01282 38.87516)
5  6.922022 POINT (-75.79732 39.30502)
6  6.403240   POINT (-79.012 39.70595)
7  7.459565 POINT (-75.86477 39.70298)
8  6.918579 POINT (-76.14101 38.58752)
9  8.705889 POINT (-76.84611 39.14313)
10 7.068597 POINT (-77.72024 39.56418)

DMV Air Quality

ggplot() +
  geom_sf(data = dmv_county, color = "grey25", fill = "grey75", alpha = 0.55) +
  geom_sf(data = dmv_air, aes(colour = annl_mn), size = 2) + 
  scale_color_gradient(low = "green", high = "red") +
  labs(color = "Average PM2.5", x = "", y = "") +
  theme_bw()

Spatial Interpolation: Closest Observation

georgetown <- data.frame(long = -77.0654, lat = 38.9097)
georgetown <- st_as_sf(georgetown, coords = c("long", "lat"), crs = crs(dmv_county))
(georgetown$pm25_nav <- dmv_air$annl_mn[which.min(st_distance(georgetown, dmv_air))])
[1] 8.073839

Spatial Interpolation: IDW

(georgetown$pm25_idw <- sum(dmv_air$annl_mn / st_distance(georgetown, dmv_air)^2) / 
                          sum(1 / st_distance(georgetown, dmv_air)^2))
8.845472 [1]


library(gstat)
gstat_idw <- gstat(id = "pm25_idw", formula = annl_mn ~ 1, locations = dmv_air)
predict(gstat_idw, newdata = georgetown)
[inverse distance weighted interpolation]
Simple feature collection with 1 feature and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -77.0654 ymin: 38.9097 xmax: -77.0654 ymax: 38.9097
Geodetic CRS:  NAD83
  pm25_idw.pred pm25_idw.var                 geometry
1      8.845469           NA POINT (-77.0654 38.9097)

Spatial Interpolation: KNN

gstat_knn <- gstat(id = "pm25_knn", formula = annl_mn ~ 1, locations = dmv_air, nmax = 3,
             set = list(idp = 0)) # idp: inverse distance power; idp=0 means all three neighbors are equally weighted.
(georgetown$pm25_knn <- predict(gstat_knn, newdata = georgetown)$pm25_knn.pred)
[inverse distance weighted interpolation]
[1] 8.886756

Spatial Interpolation

  • Create locations for prediction
dmv_bbox <- st_bbox(dmv_county)
x <- seq(dmv_bbox$xmin, dmv_bbox$xmax, length.out = 50)
y <- seq(dmv_bbox$ymin, dmv_bbox$ymax, length.out = 25)
dc_locs <- expand.grid(x = x, y = y)
dc_pred0 <- st_as_sf(dc_locs, coords = c("x", "y"), crs = crs(dmv_county))
dc_pred <- st_intersection(dmv_county, dc_pred0)

Spatial Interpolation

ggplot() +
  geom_sf(data = dmv_county, color = "grey25", fill = "grey75", alpha = 0.55) +
  geom_sf(data = dc_pred, color = "blue", size = 1, shape = 4) + 
  labs(x = "", y = "") +
  theme_bw()

Spatial Interpolation

  • Using IDW
dc_pred$pm25_idw <- predict(gstat_idw, newdata = dc_pred)$pm25_idw.pred
[inverse distance weighted interpolation]
  • Using KNN
dc_pred$pm25_knn <- predict(gstat_knn, newdata = dc_pred)$pm25_knn.pred
[inverse distance weighted interpolation]

Spatial Interpolation

ggplot() +
  geom_sf(data = dmv_county, color = "grey25", fill = "grey75", alpha = 0.55) +
  geom_sf(data = dc_pred, aes(color = pm25_idw), size = 1.5) + 
  geom_sf(data = dmv_air, aes(color = annl_mn), size = 3) +
  scale_color_gradient(low = "green", high = "red") +
  labs(color = "PM2.5", x = "", y = "") +
  theme_bw()

Spatial Interpolation with Raster Data

library(raster)
dmv_raster <- raster(ext = extent(dmv_bbox), res = c(0.1, 0.1))
dmv_raster_sf <- st_as_sf(rasterToPolygons(dmv_raster)) %>%
  st_transform(crs = crs(dmv_county)) %>%
  st_intersection(dmv_county)

Raster Data

ggplot() +
  geom_sf(data = dmv_county, color = "grey35", fill = "grey75", alpha = 0.55) +
  geom_sf(data = dmv_raster_sf, color = "blue", size = 0.5) + 
  labs(x = "", y = "") +
  theme_bw()

Spatial Interpolation with Raster Data

dmv_raster_sf$pm25_idw <- predict(gstat_idw, newdata = dmv_raster_sf)$pm25_idw.pred
[inverse distance weighted interpolation]
Code
ggplot() +
      geom_sf(data = dmv_county, color = "grey25", fill = "grey75", alpha = 0.55) +
      geom_sf(data = dmv_raster_sf, aes(fill = pm25_idw), size = 1) +
      geom_sf_label(data = dmv_air, aes(label = round(annl_mn, 1), fill = annl_mn), size = 5, alpha=0.75) +
      scale_color_gradient(low = "green", high = "red") +
      scale_fill_gradient(low = "green", high = "red") +
      labs(fill = "PM2.5", x = "", y = "") +
      guides(color = guide_legend(order = 1)) +
      theme_bw() +
      theme(legend.title = element_text(size = 12),
            legend.text = element_text(size = 10))

Exercise 3: Spatial Interpolation

  1. Use Airbnb listing data of DC to predict the pricing of 1B room for Georgetown area (-77.0654, 38.9097).
  2. Predict the pricing pattern for the whole DC. You will first need to create the locations for prediction.
  3. Use raster data to create a smooth surface of PM2.5 prediction.

Kriging

  • Kriging is an interpolation method naed after the South African mining engineer, Danie G. Krige.
  • It’s originated in the areas of mining and geo-statistics that involve spatially correlated data.
  • Kriging assumes that the spatial variation of a variable is composed of three components: trend (structural component), spatially correlated variation, noise (random error)
  • Ordinary kriging: relies purely on a spatial error process to make predictions.
  • Universial kriging: allows spatial trend terms and even location-specificc covariates to shape the prediction

Ordinary Kriging

  • Suppose we want to predict y(s_0) based observed data y(s_1), y(s_2), ..., y(s_n). The ordinary kriging estimator is defined as the linear unbiased estimator: \hat{y}(s_0) = \sum_{i=1}^n \lambda_i y(\mathbf{s}_i)

    that minizes the mean squared prediction error defined as E[(\hat{y}(s_0) - y(s_0))^2]

Universial Kriging

y(\mathbf{s}) = \mu(\mathbf{s}) + \omega(\mathbf{s}) + \epsilon(\mathbf{s})


\mu(\mathbf{s}) = \mathbf{X}(\mathbf{s})\beta

\omega(\mathbf{s}) \sim N(0, \sigma^2\mathbf{H}(\theta))

  • Partial sill: \sigma^2 is the variance driven by distance between two observations

  • Decay: \theta is the level of corelation among observations based on distance.

\sigma(\mathbf{s}) \sim N(0, \tau^2\mathbf{I})

  • Nugget: \tau^2 is the error variance that is independent from spatial separation

Variogram

  • In order to make spatial predictions, kriging goes through two steps:
    • Training: using the variograms and covariance functions to estimate the spatial autocorrelation
    • Prediction: predicting the unknown values

Variogram

  • Theoretical variogram:

Variogram

  • To obtain the theoretical variogram, we can use empirical semivariogram
Code
library(gstat)
boston_evgm <- variogram(MEDV ~ 1, boston)
plot(boston_evgm, type = 'o', main = "Variogram of Boston Median House Values")

Variogram

Code
v_init <- vgm(psill = 50, model = "Sph",
              range = 5, nugget = 50)
plot(boston_evgm, v_init, cex = 1.5)

Variogram

  • Then, we can choose a theoretical variogram model that best fits the empirical semivariogram.
    • Gaussian, Exponential, and Spherical models
    • Model parameters (nugget, sill, range) are estimated by fitting the chosen model to the empirical semivariogram.

Variogram

fittted_vgm <- fit.variogram(object = boston_evgm,
                    model = vgm(psill = 50, model = "Sph",
                                range = 5, nugget = 50))
fittted_vgm
  model    psill    range
1   Nug  0.00000 0.000000
2   Sph 93.13456 3.365812

Variogram

plot(boston_evgm, fittted_vgm, cex = 1.5)

Kriging

kmod <- gstat(formula = MEDV ~ 1, data = boston, model = fittted_vgm)
pred_loc <- st_as_sf(data.frame(long = -71.1456, lat = 42.3915), 
                       coords = c("long", "lat"), crs = crs(boston))

predict(kmod, pred_loc)
[using ordinary kriging]
Simple feature collection with 1 feature and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -71.1456 ymin: 42.3915 xmax: -71.1456 ymax: 42.3915
Geodetic CRS:  NAD27
  var1.pred var1.var                 geometry
1   21.4299 18.91049 POINT (-71.1456 42.3915)

Exercise 4: Using Kriging to Predict Airbnb Price

  1. Plot the empirical semivariogram of DC Airbnb price
  2. Fit a variogram model
  3. Using the fitted model to predict Georgetown Airbnb pricing

Spatial Regression Model

  • OLS
    • Gauss-Markov assumption: Cov(\epsilon_i, \epsilon_j) = 0, \forall i \neq j
  • Spatial autocorrelation
    • Spatially autocorrelated errors violates the OLS assumption, indicating that the model is misspecified.

OLS and Spatial Errors

lm_fit <- lm(log(MEDV) ~ RM + CRIM, data = boston)
summary(lm_fit)

Call:
lm(formula = log(MEDV) ~ RM + CRIM, data = boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.03618 -0.11069  0.01723  0.13889  1.31727 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.121864   0.112901   9.937   <2e-16 ***
RM           0.315514   0.017661  17.865   <2e-16 ***
CRIM        -0.019438   0.001443 -13.474   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2721 on 503 degrees of freedom
Multiple R-squared:  0.5587,    Adjusted R-squared:  0.557 
F-statistic: 318.4 on 2 and 503 DF,  p-value: < 2.2e-16

Spatially Autocorrelated Errors

boston$residuals <- residuals(lm_fit)

ggplot(boston, aes(fill = residuals)) +
  geom_sf() +
  scale_fill_steps(low = "grey", high = "brown") + 
  theme_bw() 

Spatially Autocorrelated Errors

nb <- poly2nb(boston) 
nbw <- nb2listw(nb, style = "W")
moran_res <- moran.mc(boston$residuals, nbw, 1000)
moran_res

    Monte-Carlo simulation of Moran I

data:  boston$residuals 
weights: nbw  
number of simulations + 1: 1001 

statistic = 0.54595, observed rank = 1001, p-value = 0.000999
alternative hypothesis: greater

Spatial Lag Model

  • Also known as the spatial autoregressive model (SAR)

  • Incorporating spatial dependence in the outcome y = \rho \mathbf{W} y + \mathbf{X} \beta + \epsilon

    • \rho: Spatial autoregressive coefficient
    • \mathbf{W}: Spatial weights matrix

Spatial Lag Model

library(spatialreg)
lagsar_fit <- lagsarlm(log(MEDV) ~ RM + CRIM, data=boston, nbw)
summary(lagsar_fit)

Call:lagsarlm(formula = log(MEDV) ~ RM + CRIM, data = boston, listw = nbw)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.6189521 -0.1004647 -0.0049353  0.0898597  1.0161974 

Type: lag 
Coefficients: (asymptotic standard errors) 
              Estimate Std. Error z value  Pr(>|z|)
(Intercept) -0.2810648  0.0838293 -3.3528 0.0007999
RM           0.1977215  0.0132364 14.9377 < 2.2e-16
CRIM        -0.0070405  0.0010320 -6.8219 8.986e-12

Rho: 0.68944, LR test value: 384.41, p-value: < 2.22e-16
Asymptotic standard error: 0.027224
    z-value: 25.325, p-value: < 2.22e-16
Wald statistic: 641.35, p-value: < 2.22e-16

Log likelihood: 134.3822 for lag model
ML residual variance (sigma squared): 0.030777, (sigma: 0.17543)
Number of observations: 506 
Number of parameters estimated: 5 
AIC: -258.76, (AIC for lm: 123.65)
LM test for residual autocorrelation
test value: 6.7847, p-value: 0.0091942

Spatial Lag Model

boston$residuals2 <- residuals(lagsar_fit)
moran.mc(boston$residuals2, nbw, 1000)

    Monte-Carlo simulation of Moran I

data:  boston$residuals2 
weights: nbw  
number of simulations + 1: 1001 

statistic = 0.054806, observed rank = 984, p-value = 0.01698
alternative hypothesis: greater

Spatial Error Model

  • Also called spatial moving average models

  • Incorporating spatial dependence in the error term y = \mathbf{X} \beta + u u = \lambda \mathbf{W} u + \epsilon

    • u: Spatially correlated error term
    • \lambda: Spatial autoregressive coefficient for the error term
    • \mathbf{W}: Spatial weights matrix

Spatial Error Model

ersar_fit <- errorsarlm(log(MEDV) ~ RM + CRIM, data=boston, nbw, tol.solve=1.0e-30)
summary(ersar_fit)

Call:errorsarlm(formula = log(MEDV) ~ RM + CRIM, data = boston, listw = nbw, 
    tol.solve = 1e-30)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.9022086 -0.0739408 -0.0003695  0.0841594  0.9155054 

Type: error 
Coefficients: (asymptotic standard errors) 
              Estimate Std. Error z value  Pr(>|z|)
(Intercept)  1.4309386  0.1011931 14.1407 < 2.2e-16
RM           0.2544088  0.0141162 18.0224 < 2.2e-16
CRIM        -0.0071259  0.0012310 -5.7886 7.099e-09

Lambda: 0.85125, LR test value: 390.82, p-value: < 2.22e-16
Asymptotic standard error: 0.02544
    z-value: 33.462, p-value: < 2.22e-16
Wald statistic: 1119.7, p-value: < 2.22e-16

Log likelihood: 137.5884 for error model
ML residual variance (sigma squared): 0.027826, (sigma: 0.16681)
Number of observations: 506 
Number of parameters estimated: 5 
AIC: -265.18, (AIC for lm: 123.65)

Spatial Error Model

boston$residuals3 <- residuals(ersar_fit)
moran.mc(boston$residuals3, nbw, 1000)

    Monte-Carlo simulation of Moran I

data:  boston$residuals3 
weights: nbw  
number of simulations + 1: 1001 

statistic = -0.036993, observed rank = 91, p-value = 0.9091
alternative hypothesis: greater

Spatial Error Model

ggplot(boston, aes(fill = residuals3)) +
  geom_sf() +
  scale_fill_steps(low = "grey", high = "brown") + 
  theme_bw() 

Exercise

  • Fit spatial lag model and spatial error model on Airbnb price data.
  • Hint:
    • You need to create a spatial object using the listing data.
    • There are not many variables available: room_type, number_of_reviews, reviews_per_month, calculated_host_listings_count, availability_365.

Geographically Weighted Regression (GWR)

  • Allowing the regression coefficients to vary spatially y_i = \beta_0(u_i, v_i) + \sum_{k=1}^p \beta_k(u_i, v_i) x_{ik} + \epsilon_i

    • (u_i, v_i): Spatial coordinates of location i
    • \beta_k(u_i, v_i): Spatially varying regression coefficients
    • x_{ik}: Value of the k-th explanatory variable at location i
    • \epsilon_i: Error term at location i
  • Estimating local regression coefficients at each location

  • Capturing spatial non-stationarity and local variations in relationships

What We Didn’t Get To Cover

  • Point process
  • Raster data
  • (Bayesian) hierarchical models for spatial data
    • Banerjee, S., Carlin, B.P., & Gelfand, A.E. (2014). Hierarchical Modeling and Analysis for Spatial Data (2nd ed.). Chapman and Hall/CRC.
  • Machine learning for spatial data