Vector data

The bleeding edge ⚡: sf

Recently (this year), a new package has been introduced to manipulate vector data: sf (for Simple Features). It’s a pretty interesting improvement on sp/rgdal/rgeos, and we will focus on it in this tutorial.

In a few words, here’s what is interesting about sf:

  • It’s providing users with one unique class for all vector types,
  • It’s based on Simple Features, a formal standard (ISO 19125-1:2004) widely used in the GIS world that describes how objects in the real world can be represented,
  • The main class provided by sf is a data.frame – which means that a lot of data analysis methods are readily available,
  • It combines the capabilities of sp, rgdal, and rgeos under one unique package,
  • It is easier to install on some platforms than rgdal,
  • It is much faster, and scales better than sp and rgdal — the upcoming version will include spatial indexing!
library('sf')

Reading and writing spatial datasets

Loading spatial data from text files

It is very often that point data (such as profiles) is exchanged using a text delimited format such as CSV.

Reading text files

CSV and TXT files can be read using the classic read.csv, read.delim, etc., in base R. however, the readr package — from the tidyverse — does provide interesting alternatives that are much faster in read and write.

library(tidyverse)

df <- read_csv('./path/to/my/file.csv')

Converting data.frame to Simple Features

Let’s take the example dataset meuse from the sp package. meuse is a data.frame — similar to what could have been read from a CSV file using read_csv:

data('meuse', package = "sp")
head(meuse)
##        x      y cadmium copper lead zinc  elev       dist   om ffreq soil
## 1 181072 333611    11.7     85  299 1022 7.909 0.00135803 13.6     1    1
## 2 181025 333558     8.6     81  277 1141 6.983 0.01222430 14.0     1    1
## 3 181165 333537     6.5     68  199  640 7.800 0.10302900 13.0     1    1
## 4 181298 333484     2.6     81  116  257 7.655 0.19009400  8.0     1    2
## 5 181307 333330     2.8     48  117  269 7.480 0.27709000  8.7     1    2
## 6 181390 333260     3.0     61  137  281 7.791 0.36406700  7.8     1    2
##   lime landuse dist.m
## 1    1      Ah     50
## 2    1      Ah     30
## 3    1      Ah    150
## 4    0      Ga    270
## 5    0      Ah    380
## 6    0      Ga    470

The easiest way to modifiy this data.frame so that its coordinates (x and y) are used to generate a geometry column is using st_as_sf (“as simple feature”):

ms <- st_as_sf(
  meuse, 
  coords = c('x', 'y'),
  crs = "+init=epsg:28992"
)

ms
## Simple feature collection with 155 features and 12 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 178605 ymin: 329714 xmax: 181390 ymax: 333611
## epsg (SRID):    28992
## proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.4171,50.3319,465.5524,-0.398957388243134,0.343987817378283,-1.87740163998045,4.0725 +units=m +no_defs
## First 20 features:
##    cadmium copper lead zinc  elev       dist   om ffreq soil lime landuse
## 1     11.7     85  299 1022 7.909 0.00135803 13.6     1    1    1      Ah
## 2      8.6     81  277 1141 6.983 0.01222430 14.0     1    1    1      Ah
## 3      6.5     68  199  640 7.800 0.10302900 13.0     1    1    1      Ah
## 4      2.6     81  116  257 7.655 0.19009400  8.0     1    2    0      Ga
## 5      2.8     48  117  269 7.480 0.27709000  8.7     1    2    0      Ah
## 6      3.0     61  137  281 7.791 0.36406700  7.8     1    2    0      Ga
## 7      3.2     31  132  346 8.217 0.19009400  9.2     1    2    0      Ah
## 8      2.8     29  150  406 8.490 0.09215160  9.5     1    1    0      Ab
## 9      2.4     37  133  347 8.668 0.18461400 10.6     1    1    0      Ab
## 10     1.6     24   80  183 9.049 0.30970200  6.3     1    2    0       W
## 11     1.4     25   86  189 9.015 0.31511600  6.4     1    2    0      Fh
## 12     1.8     25   97  251 9.073 0.22812300  9.0     1    1    0      Ag
## 13    11.2     93  285 1096 7.320 0.00000000 15.4     1    1    1       W
## 14     2.5     31  183  504 8.815 0.11393200  8.4     1    1    0      Ah
## 15     2.0     27  130  326 8.937 0.16833600  9.1     1    1    0      Ah
## 16     9.5     86  240 1032 7.702 0.00000000 16.2     1    1    1       W
## 17     7.0     74  133  606 7.160 0.01222430 16.0     1    1    1       W
## 18     7.1     69  148  711 7.100 0.01222430 16.0     1    1    1       W
## 19     8.7     69  207  735 7.020 0.00000000 13.7     1    1    1       W
## 20    12.9     95  284 1052 6.860 0.00000000 14.8     1    1    1    <NA>
##    dist.m              geometry
## 1      50 POINT (181072 333611)
## 2      30 POINT (181025 333558)
## 3     150 POINT (181165 333537)
## 4     270 POINT (181298 333484)
## 5     380 POINT (181307 333330)
## 6     470 POINT (181390 333260)
## 7     240 POINT (181165 333370)
## 8     120 POINT (181027 333363)
## 9     240 POINT (181060 333231)
## 10    420 POINT (181232 333168)
## 11    400 POINT (181191 333115)
## 12    300 POINT (181032 333031)
## 13     20 POINT (180874 333339)
## 14    130 POINT (180969 333252)
## 15    220 POINT (181011 333161)
## 16     10 POINT (180830 333246)
## 17     10 POINT (180763 333104)
## 18     10 POINT (180694 332972)
## 19     10 POINT (180625 332847)
## 20     10 POINT (180555 332707)

There is a simple plotting funtion including in sf. It is very similar to the old sp::spplot:

plot(ms)
## Warning: plotting the first 10 out of 12 attributes; use max.plot = 12 to
## plot all

Loading spatial data formats

A wide selecton of spatial data formats can be read using the st_read command (it’s using GDAL/OGR behind the scenes). Unlike readOGR from the rgdal package, generally the command can guess which driver it should use.

file_name <- system.file("shape/nc.shp", package="sf")
nc <- st_read(file_name)
## Reading layer `nc' from data source `/usr/local/lib/R/site-library/sf/shape/nc.shp' using driver `ESRI Shapefile'
## Simple feature collection with 100 features and 14 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## epsg (SRID):    4267
## proj4string:    +proj=longlat +datum=NAD27 +no_defs
print(nc)
## Simple feature collection with 100 features and 14 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## epsg (SRID):    4267
## proj4string:    +proj=longlat +datum=NAD27 +no_defs
## First 20 features:
##     AREA PERIMETER CNTY_ CNTY_ID        NAME  FIPS FIPSNO CRESS_ID BIR74
## 1  0.114     1.442  1825    1825        Ashe 37009  37009        5  1091
## 2  0.061     1.231  1827    1827   Alleghany 37005  37005        3   487
## 3  0.143     1.630  1828    1828       Surry 37171  37171       86  3188
## 4  0.070     2.968  1831    1831   Currituck 37053  37053       27   508
## 5  0.153     2.206  1832    1832 Northampton 37131  37131       66  1421
## 6  0.097     1.670  1833    1833    Hertford 37091  37091       46  1452
## 7  0.062     1.547  1834    1834      Camden 37029  37029       15   286
## 8  0.091     1.284  1835    1835       Gates 37073  37073       37   420
## 9  0.118     1.421  1836    1836      Warren 37185  37185       93   968
## 10 0.124     1.428  1837    1837      Stokes 37169  37169       85  1612
## 11 0.114     1.352  1838    1838     Caswell 37033  37033       17  1035
## 12 0.153     1.616  1839    1839  Rockingham 37157  37157       79  4449
## 13 0.143     1.663  1840    1840   Granville 37077  37077       39  1671
## 14 0.109     1.325  1841    1841      Person 37145  37145       73  1556
## 15 0.072     1.085  1842    1842       Vance 37181  37181       91  2180
## 16 0.190     2.204  1846    1846     Halifax 37083  37083       42  3608
## 17 0.053     1.171  1848    1848  Pasquotank 37139  37139       70  1638
## 18 0.199     1.984  1874    1874      Wilkes 37193  37193       97  3146
## 19 0.081     1.288  1880    1880     Watauga 37189  37189       95  1323
## 20 0.063     1.000  1881    1881  Perquimans 37143  37143       72   484
##    SID74 NWBIR74 BIR79 SID79 NWBIR79                       geometry
## 1      1      10  1364     0      19 MULTIPOLYGON (((-81.4727554...
## 2      0      10   542     3      12 MULTIPOLYGON (((-81.2398910...
## 3      5     208  3616     6     260 MULTIPOLYGON (((-80.4563446...
## 4      1     123   830     2     145 MULTIPOLYGON (((-76.0089721...
## 5      9    1066  1606     3    1197 MULTIPOLYGON (((-77.2176666...
## 6      7     954  1838     5    1237 MULTIPOLYGON (((-76.7450637...
## 7      0     115   350     2     139 MULTIPOLYGON (((-76.0089721...
## 8      0     254   594     2     371 MULTIPOLYGON (((-76.5625076...
## 9      4     748  1190     2     844 MULTIPOLYGON (((-78.3087615...
## 10     1     160  2038     5     176 MULTIPOLYGON (((-80.0256729...
## 11     2     550  1253     2     597 MULTIPOLYGON (((-79.5305099...
## 12    16    1243  5386     5    1369 MULTIPOLYGON (((-79.5305099...
## 13     4     930  2074     4    1058 MULTIPOLYGON (((-78.7491226...
## 14     4     613  1790     4     650 MULTIPOLYGON (((-78.8068008...
## 15     4    1179  2753     6    1492 MULTIPOLYGON (((-78.4925231...
## 16    18    2365  4463    17    2980 MULTIPOLYGON (((-77.3322067...
## 17     3     622  2275     4     933 MULTIPOLYGON (((-76.2989273...
## 18     4     200  3725     7     222 MULTIPOLYGON (((-81.0205688...
## 19     1      17  1775     1      33 MULTIPOLYGON (((-81.8062210...
## 20     1     230   676     0     310 MULTIPOLYGON (((-76.4805297...
plot(nc)
## Warning: plotting the first 10 out of 14 attributes; use max.plot = 14 to
## plot all

plot(nc['AREA'])

Its counterpart is st_write and is just as easy to use:

st_write(nc, "nc.shp")
st_write(nc, "nc.shp", delete_layer = TRUE)

# Note that write_st erases existing layers silently:
write_sf(nc, "nc.shp")

The drivers available on your machine can be printed using the st_drivers command:

my_drivers <- st_drivers()
head(my_drivers)
##                          name                                  long_name
## PCIDSK                 PCIDSK                       PCIDSK Database File
## netCDF                 netCDF                 Network Common Data Format
## JP2OpenJPEG       JP2OpenJPEG JPEG-2000 driver based on OpenJPEG library
## PDF                       PDF                             Geospatial PDF
## ESRI Shapefile ESRI Shapefile                             ESRI Shapefile
## MapInfo File     MapInfo File                               MapInfo File
##                write  copy is_raster is_vector
## PCIDSK          TRUE FALSE      TRUE      TRUE
## netCDF          TRUE  TRUE      TRUE      TRUE
## JP2OpenJPEG    FALSE  TRUE      TRUE      TRUE
## PDF             TRUE  TRUE      TRUE      TRUE
## ESRI Shapefile  TRUE FALSE     FALSE      TRUE
## MapInfo File    TRUE FALSE     FALSE      TRUE

Manipulating Simple Features

Very roughly, the sf package provides two types of functions:

  1. Spatial data operators (read, write, buffer, intersection, etc), prefixed by st_:
##  [1] "st_agr<-.sf"            "st_agr.sf"             
##  [3] "st_as_sf.sf"            "st_bbox.sf"            
##  [5] "st_boundary.sf"         "st_buffer.sf"          
##  [7] "st_cast.sf"             "st_centroid.sf"        
##  [9] "st_convex_hull.sf"      "st_coordinates.sf"     
## [11] "st_crs<-.sf"            "st_crs.sf"             
## [13] "st_difference.sf"       "st_geometry<-.sf"      
## [15] "st_geometry.sf"         "st_intersection.sf"    
## [17] "st_is.sf"               "st_line_merge.sf"      
## [19] "st_make_valid.sf"       "st_point_on_surface.sf"
## [21] "st_polygonize.sf"       "st_precision.sf"       
## [23] "st_segmentize.sf"       "st_set_precision.sf"   
## [25] "st_simplify.sf"         "st_split.sf"           
## [27] "st_sym_difference.sf"   "st_transform.sf"       
## [29] "st_triangulate.sf"      "st_union.sf"           
## [31] "st_voronoi.sf"          "st_zm.sf"
  1. A suite of dplyr verbs addapted for the sf data:
##  [1] "aggregate.sf"                "cbind.sf"                   
##  [3] "coerce,oldClass,S3-method"   "coerce,sf,Spatial-method"   
##  [5] "coerce,Spatial,sf-method"    "initialize,oldClass-method" 
##  [7] "merge.sf"                    "plot.sf"                    
##  [9] "print.sf"                    "rbind.sf"                   
## [11] "[.sf"                        "[[<-.sf"                    
## [13] "$<-.sf"                      "show,oldClass-method"       
## [15] "slotsFromS3,oldClass-method"

tidy verbs

sf is very well integrated with dplyr and the tidyverse: remember that Simple Features as implemented by sf are fundamentally just data.frames!

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Example: joining non-spatial attributes
# 
additional_data <- data.frame(
  CNTY_ID = nc$CNTY_ID,
  my_data <- runif(nrow(nc))
)

left_join(nc, additional_data, by = 'CNTY_ID')
## Simple feature collection with 100 features and 15 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## epsg (SRID):    4267
## proj4string:    +proj=longlat +datum=NAD27 +no_defs
## First 20 features:
##     AREA PERIMETER CNTY_ CNTY_ID        NAME  FIPS FIPSNO CRESS_ID BIR74
## 1  0.114     1.442  1825    1825        Ashe 37009  37009        5  1091
## 2  0.061     1.231  1827    1827   Alleghany 37005  37005        3   487
## 3  0.143     1.630  1828    1828       Surry 37171  37171       86  3188
## 4  0.070     2.968  1831    1831   Currituck 37053  37053       27   508
## 5  0.153     2.206  1832    1832 Northampton 37131  37131       66  1421
## 6  0.097     1.670  1833    1833    Hertford 37091  37091       46  1452
## 7  0.062     1.547  1834    1834      Camden 37029  37029       15   286
## 8  0.091     1.284  1835    1835       Gates 37073  37073       37   420
## 9  0.118     1.421  1836    1836      Warren 37185  37185       93   968
## 10 0.124     1.428  1837    1837      Stokes 37169  37169       85  1612
## 11 0.114     1.352  1838    1838     Caswell 37033  37033       17  1035
## 12 0.153     1.616  1839    1839  Rockingham 37157  37157       79  4449
## 13 0.143     1.663  1840    1840   Granville 37077  37077       39  1671
## 14 0.109     1.325  1841    1841      Person 37145  37145       73  1556
## 15 0.072     1.085  1842    1842       Vance 37181  37181       91  2180
## 16 0.190     2.204  1846    1846     Halifax 37083  37083       42  3608
## 17 0.053     1.171  1848    1848  Pasquotank 37139  37139       70  1638
## 18 0.199     1.984  1874    1874      Wilkes 37193  37193       97  3146
## 19 0.081     1.288  1880    1880     Watauga 37189  37189       95  1323
## 20 0.063     1.000  1881    1881  Perquimans 37143  37143       72   484
##    SID74 NWBIR74 BIR79 SID79 NWBIR79 my_data....runif.nrow.nc..
## 1      1      10  1364     0      19                0.008945796
## 2      0      10   542     3      12                0.293739612
## 3      5     208  3616     6     260                0.277374958
## 4      1     123   830     2     145                0.813574215
## 5      9    1066  1606     3    1197                0.260427771
## 6      7     954  1838     5    1237                0.724405893
## 7      0     115   350     2     139                0.906092151
## 8      0     254   594     2     371                0.949040221
## 9      4     748  1190     2     844                0.073144469
## 10     1     160  2038     5     176                0.754675027
## 11     2     550  1253     2     597                0.286000621
## 12    16    1243  5386     5    1369                0.100053522
## 13     4     930  2074     4    1058                0.954068775
## 14     4     613  1790     4     650                0.415607119
## 15     4    1179  2753     6    1492                0.455102418
## 16    18    2365  4463    17    2980                0.971055656
## 17     3     622  2275     4     933                0.583987980
## 18     4     200  3725     7     222                0.962204624
## 19     1      17  1775     1      33                0.761702403
## 20     1     230   676     0     310                0.714508535
##                          geometry
## 1  MULTIPOLYGON (((-81.4727554...
## 2  MULTIPOLYGON (((-81.2398910...
## 3  MULTIPOLYGON (((-80.4563446...
## 4  MULTIPOLYGON (((-76.0089721...
## 5  MULTIPOLYGON (((-77.2176666...
## 6  MULTIPOLYGON (((-76.7450637...
## 7  MULTIPOLYGON (((-76.0089721...
## 8  MULTIPOLYGON (((-76.5625076...
## 9  MULTIPOLYGON (((-78.3087615...
## 10 MULTIPOLYGON (((-80.0256729...
## 11 MULTIPOLYGON (((-79.5305099...
## 12 MULTIPOLYGON (((-79.5305099...
## 13 MULTIPOLYGON (((-78.7491226...
## 14 MULTIPOLYGON (((-78.8068008...
## 15 MULTIPOLYGON (((-78.4925231...
## 16 MULTIPOLYGON (((-77.3322067...
## 17 MULTIPOLYGON (((-76.2989273...
## 18 MULTIPOLYGON (((-81.0205688...
## 19 MULTIPOLYGON (((-81.8062210...
## 20 MULTIPOLYGON (((-76.4805297...

sf is also providing verbs that are adapted to working with geometries. For example summarise will also operate a spatial union:

nc %>% 
  mutate(
    group = sample(LETTERS[1:3], size = n(), replace = TRUE),
    area = st_area(.), # Note the computation of area on the fly
    sqrt_area = sqrt(area)
  ) %>% 
  group_by(group) %>% 
  summarise(mean_area = mean(sqrt_area)) %>% 
  plot
## Warning in Math.units(area): Operation sqrt not meaningful for units

Projection

The main functions are st_crs, which retrieves the Coordinate Reference System (CRS) of the sf object, and st_transform, which allows to re-project the dataset into a different CRS.

The CRS can be specified as an EPSG number (that is: an integer), or as a character string follwoing the Proj.4 standard.

# Retrieve projection information
st_crs(nc)
## $epsg
## [1] 4267
## 
## $proj4string
## [1] "+proj=longlat +datum=NAD27 +no_defs"
## 
## attr(,"class")
## [1] "crs"
# Test whether data is projected or not
st_is_longlat(nc)
## [1] TRUE
# Projection
# 

# Using a Proj.4 string
nc_p <- st_transform(nc, crs = "+proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.22 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs")
# using an EPSG code
nc_p <- st_transform(nc, crs = 32119)

st_crs(nc_p)
## $epsg
## [1] 32119
## 
## $proj4string
## [1] "+proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.22 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
## 
## attr(,"class")
## [1] "crs"

Geometry

The sf library wraps the capabilities of the GEOS (Geometry Engine Open Source) library — which is a very powerful piece of software used throughout the GIS industry, notably in PostGIS. The geometry functions in sf are prefixed using st_ to mimic the PostGIS counterpart functions.

# Area of features
st_area(nc)
## Units: m^2
##   [1] 1137388604  611077263 1423489919  694546292 1520740530  967727952
##   [7]  615942210  903650119 1179347051 1232769242 1136287383 1524655447
##  [13] 1427072339 1085967384  718190417 1894045426  524418608 1986966652
##  [19]  812293946  626337436  439717946  640703090  863300187 1276536764
##  [25] 1084141138 1697937846 1109973986 1800646340 1036414044  770546970
##  [31] 1423171283  585232978 1311628320 1225109692  800263351 1186420419
##  [37] 2194592880 1179139746 1550312884  690606074  665152821 1457875905
##  [43] 1340538971 1005743617  988324117 1163892506 2019778925 1810512144
##  [49]  944425951 1350102629 1685154840 1068192109 1691487743 2082117118
##  [55] 1447094476  943881740 2045540782 1420919020  707671880  653368537
##  [61] 1471077596 1436135571 1550981080 1186028306  788535203 1265443511
##  [67] 1829439189 1447873644  918216585 1312723474 1043728456  961860121
##  [73]  781910957 1046066043  986730350  917752685  601326678 1321916259
##  [79] 2437929994  833538942 1210326313 1738550168 1228689799 1648412173
##  [85] 1400582685  995107225 1678077407 2071844302 1228274216  519198288
##  [91] 1784932020  808600209 1978619669 2439553215 1264050755 2288682896
##  [97] 2181174167 2450241815  430695499 2165843695
# Distances between features
st_distance(nc_p[1:3,], nc_p[3:6,])
## Units: m
##          [,1]     [,2]     [,3]     [,4]
## [1,] 25652.53 440561.1 299772.2 361529.5
## [2,]     0.00 409428.5 268944.3 332589.8
## [3,]     0.00 367555.5 227017.5 290297.1
# Sanity check on the geometries
st_is_valid(nc[1:3,])
## [1] TRUE TRUE TRUE

The commands st_intersects, st_disjoint, st_touches, st_crosses, st_within, st_contains, st_overlaps, st_equals, st_covers, st_covered_by, st_equals_exact and st_is_within_distance are logical tests. They return a sparse matrix with matching (TRUE) indexes, or a full logical matrix:

# Intersects
st_intersects(nc_p[1:5,], nc_p[1:4,])
## [[1]]
## [1] 1 2
## 
## [[2]]
## [1] 1 2 3
## 
## [[3]]
## [1] 2 3
## 
## [[4]]
## [1] 4
## 
## [[5]]
## integer(0)
st_intersects(nc_p[1:5,], nc_p[1:4,], sparse = FALSE)
##       [,1]  [,2]  [,3]  [,4]
## [1,]  TRUE  TRUE FALSE FALSE
## [2,]  TRUE  TRUE  TRUE FALSE
## [3,] FALSE  TRUE  TRUE FALSE
## [4,] FALSE FALSE FALSE  TRUE
## [5,] FALSE FALSE FALSE FALSE

The commands st_buffer, st_boundary, st_convexhull, st_union_cascaded, st_simplify, st_triangulate, st_polygonize, st_centroid, st_segmentize, and st_union return new geometries, e.g.:

buf <- st_buffer(nc_p[c(1, 2, 14),], dist = 30000)
plot(buf, border = 'red')
## Warning: plotting the first 10 out of 14 attributes; use max.plot = 14 to
## plot all

The st_buffer function computes a buffer outwards or inwards from features. Bonus trick: if you have invalid fatures, try to make a buffer of size zero

fix <- st_buffer(corrupted, dist = 0)

Commands st_intersection, st_union, st_difference, st_sym_difference return new geometries that are a function of pairs of geometries:

# Intersection
i <- st_intersection(nc_p[1:5,], nc_p[1:4,])
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries
plot(i)
## Warning: plotting the first 9 out of 28 attributes; use max.plot = 28 to
## plot all

# Union
u <- st_union(nc)
plot(u)

ggplot2 compatibility

In the upcoming version of ggplot2 (note that I’m running the development version), sf object can be plotted very easily using geom_sf — so the incredible power of this plotting sytem can be takn advantage of:

library(ggplot2)
packageVersion('ggplot2')
## [1] '2.2.1.9000'
# Unprojected data
ggplot(nc) + 
  geom_sf(aes(fill = AREA))

# Facetting (on projected data this time!)

nc_grouped <- nc_p %>% 
  mutate(
    # Creating a (random) group attribute
    group = sample(LETTERS[1:3], size = n(), replace = TRUE)
  ) 

ggplot(nc_grouped) +
  geom_sf(aes(fill = sqrt(AREA))) +
  facet_wrap(~group, ncol = 2)

# Going full fancy
p <- nc_grouped %>% 
  select(-group) %>% 
  ggplot() + 
    geom_sf(colour = "grey70")

p +
  geom_sf(data = nc_grouped, aes(fill = sqrt(AREA))) +
  facet_wrap(~group, ncol = 2) +
  theme_bw()

The legacy: sp + rgdal

The “classic approach” is a collection of 3 packages:

  • sp: provides a set of dedicated classes for the different vector and raster datatypes , and some analysis tools (overlays, etc.),
  • rgdal: input/output library built over GDAL/OGR to read and write spatial data,
  • rgeos: geometry library built over GEOS for all geometry operations (union, intersections, buffer, etc).

Unlike the sf package, the sp library provides 2 classes per type of vector data:

  • SpatialPoints and SpatialPointsDataFrame
  • SpatialLines and SpatialLinesDataFrame
  • SpatialPolygons and SpatialPolygonsDataFrame

Reading vector data with sp

library(sp)
library(rgdal)
## rgdal: version: 1.2-11, (SVN revision 676)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.2.1, released 2017/06/23
##  Path to GDAL shared files: /usr/share/gdal/2.2
##  Loaded PROJ.4 runtime: Rel. 4.9.2, 08 September 2015, [PJ_VERSION: 492]
##  Path to PROJ.4 shared files: (autodetected)
##  Linking to sp version: 1.2-5
file_name <- system.file("shape/nc.shp", package="sf")

ogrListLayers(file_name)
## [1] "nc"
## attr(,"driver")
## [1] "ESRI Shapefile"
## attr(,"nlayers")
## [1] 1
nc <- readOGR(dsn = file_name, layer = 'nc')
## OGR data source with driver: ESRI Shapefile 
## Source: "/usr/local/lib/R/site-library/sf/shape/nc.shp", layer: "nc"
## with 100 features
## It has 14 fields
summary(nc)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##         min       max
## x -84.32385 -75.45698
## y  33.88199  36.58965
## Is projected: FALSE 
## proj4string :
## [+proj=longlat +datum=NAD27 +no_defs +ellps=clrk66
## +nadgrids=@conus,@alaska,@ntv2_0.gsb,@ntv1_can.dat]
## Data attributes:
##       AREA          PERIMETER         CNTY_         CNTY_ID    
##  Min.   :0.0420   Min.   :0.999   Min.   :1825   Min.   :1825  
##  1st Qu.:0.0910   1st Qu.:1.324   1st Qu.:1902   1st Qu.:1902  
##  Median :0.1205   Median :1.609   Median :1982   Median :1982  
##  Mean   :0.1263   Mean   :1.673   Mean   :1986   Mean   :1986  
##  3rd Qu.:0.1542   3rd Qu.:1.859   3rd Qu.:2067   3rd Qu.:2067  
##  Max.   :0.2410   Max.   :3.640   Max.   :2241   Max.   :2241  
##                                                                
##         NAME         FIPS        FIPSNO         CRESS_ID     
##  Alamance : 1   37001  : 1   Min.   :37001   Min.   :  1.00  
##  Alexander: 1   37003  : 1   1st Qu.:37050   1st Qu.: 25.75  
##  Alleghany: 1   37005  : 1   Median :37100   Median : 50.50  
##  Anson    : 1   37007  : 1   Mean   :37100   Mean   : 50.50  
##  Ashe     : 1   37009  : 1   3rd Qu.:37150   3rd Qu.: 75.25  
##  Avery    : 1   37011  : 1   Max.   :37199   Max.   :100.00  
##  (Other)  :94   (Other):94                                   
##      BIR74           SID74          NWBIR74           BIR79      
##  Min.   :  248   Min.   : 0.00   Min.   :   1.0   Min.   :  319  
##  1st Qu.: 1077   1st Qu.: 2.00   1st Qu.: 190.0   1st Qu.: 1336  
##  Median : 2180   Median : 4.00   Median : 697.5   Median : 2636  
##  Mean   : 3300   Mean   : 6.67   Mean   :1050.8   Mean   : 4224  
##  3rd Qu.: 3936   3rd Qu.: 8.25   3rd Qu.:1168.5   3rd Qu.: 4889  
##  Max.   :21588   Max.   :44.00   Max.   :8027.0   Max.   :30757  
##                                                                  
##      SID79          NWBIR79       
##  Min.   : 0.00   Min.   :    3.0  
##  1st Qu.: 2.00   1st Qu.:  250.5  
##  Median : 5.00   Median :  874.5  
##  Mean   : 8.36   Mean   : 1352.8  
##  3rd Qu.:10.25   3rd Qu.: 1406.8  
##  Max.   :57.00   Max.   :11631.0  
## 

Translating to and from sf

nc_sf <- st_as_sf(nc)
summary(nc_sf)
##       AREA          PERIMETER         CNTY_         CNTY_ID    
##  Min.   :0.0420   Min.   :0.999   Min.   :1825   Min.   :1825  
##  1st Qu.:0.0910   1st Qu.:1.324   1st Qu.:1902   1st Qu.:1902  
##  Median :0.1205   Median :1.609   Median :1982   Median :1982  
##  Mean   :0.1263   Mean   :1.673   Mean   :1986   Mean   :1986  
##  3rd Qu.:0.1542   3rd Qu.:1.859   3rd Qu.:2067   3rd Qu.:2067  
##  Max.   :0.2410   Max.   :3.640   Max.   :2241   Max.   :2241  
##                                                                
##         NAME         FIPS        FIPSNO         CRESS_ID     
##  Alamance : 1   37001  : 1   Min.   :37001   Min.   :  1.00  
##  Alexander: 1   37003  : 1   1st Qu.:37050   1st Qu.: 25.75  
##  Alleghany: 1   37005  : 1   Median :37100   Median : 50.50  
##  Anson    : 1   37007  : 1   Mean   :37100   Mean   : 50.50  
##  Ashe     : 1   37009  : 1   3rd Qu.:37150   3rd Qu.: 75.25  
##  Avery    : 1   37011  : 1   Max.   :37199   Max.   :100.00  
##  (Other)  :94   (Other):94                                   
##      BIR74           SID74          NWBIR74           BIR79      
##  Min.   :  248   Min.   : 0.00   Min.   :   1.0   Min.   :  319  
##  1st Qu.: 1077   1st Qu.: 2.00   1st Qu.: 190.0   1st Qu.: 1336  
##  Median : 2180   Median : 4.00   Median : 697.5   Median : 2636  
##  Mean   : 3300   Mean   : 6.67   Mean   :1050.8   Mean   : 4224  
##  3rd Qu.: 3936   3rd Qu.: 8.25   3rd Qu.:1168.5   3rd Qu.: 4889  
##  Max.   :21588   Max.   :44.00   Max.   :8027.0   Max.   :30757  
##                                                                  
##      SID79          NWBIR79                 geometry  
##  Min.   : 0.00   Min.   :    3.0   MULTIPOLYGON :100  
##  1st Qu.: 2.00   1st Qu.:  250.5   epsg:4267    :  0  
##  Median : 5.00   Median :  874.5   +proj=long...:  0  
##  Mean   : 8.36   Mean   : 1352.8                      
##  3rd Qu.:10.25   3rd Qu.: 1406.8                      
##  Max.   :57.00   Max.   :11631.0                      
## 
nc_sp <- as(nc_sf, "Spatial")
summary(nc_sp)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##         min       max
## x -84.32385 -75.45698
## y  33.88199  36.58965
## Is projected: FALSE 
## proj4string :
## [+proj=longlat +ellps=clrk66
## +nadgrids=@conus,@alaska,@ntv2_0.gsb,@ntv1_can.dat +no_defs]
## Data attributes:
##       AREA          PERIMETER         CNTY_         CNTY_ID    
##  Min.   :0.0420   Min.   :0.999   Min.   :1825   Min.   :1825  
##  1st Qu.:0.0910   1st Qu.:1.324   1st Qu.:1902   1st Qu.:1902  
##  Median :0.1205   Median :1.609   Median :1982   Median :1982  
##  Mean   :0.1263   Mean   :1.673   Mean   :1986   Mean   :1986  
##  3rd Qu.:0.1542   3rd Qu.:1.859   3rd Qu.:2067   3rd Qu.:2067  
##  Max.   :0.2410   Max.   :3.640   Max.   :2241   Max.   :2241  
##                                                                
##         NAME         FIPS        FIPSNO         CRESS_ID     
##  Alamance : 1   37001  : 1   Min.   :37001   Min.   :  1.00  
##  Alexander: 1   37003  : 1   1st Qu.:37050   1st Qu.: 25.75  
##  Alleghany: 1   37005  : 1   Median :37100   Median : 50.50  
##  Anson    : 1   37007  : 1   Mean   :37100   Mean   : 50.50  
##  Ashe     : 1   37009  : 1   3rd Qu.:37150   3rd Qu.: 75.25  
##  Avery    : 1   37011  : 1   Max.   :37199   Max.   :100.00  
##  (Other)  :94   (Other):94                                   
##      BIR74           SID74          NWBIR74           BIR79      
##  Min.   :  248   Min.   : 0.00   Min.   :   1.0   Min.   :  319  
##  1st Qu.: 1077   1st Qu.: 2.00   1st Qu.: 190.0   1st Qu.: 1336  
##  Median : 2180   Median : 4.00   Median : 697.5   Median : 2636  
##  Mean   : 3300   Mean   : 6.67   Mean   :1050.8   Mean   : 4224  
##  3rd Qu.: 3936   3rd Qu.: 8.25   3rd Qu.:1168.5   3rd Qu.: 4889  
##  Max.   :21588   Max.   :44.00   Max.   :8027.0   Max.   :30757  
##                                                                  
##      SID79          NWBIR79       
##  Min.   : 0.00   Min.   :    3.0  
##  1st Qu.: 2.00   1st Qu.:  250.5  
##  Median : 5.00   Median :  874.5  
##  Mean   : 8.36   Mean   : 1352.8  
##  3rd Qu.:10.25   3rd Qu.: 1406.8  
##  Max.   :57.00   Max.   :11631.0  
## 

Raster data

The sp package

Unlike sf, the sp package supports raster data. You can read a raster file in conjunction with the rgdal package:

library(sp)
library(rgdal)

# This is pointing to a "GeoTIFF" file on your computer
fn <- system.file("pictures/cea.tif", package = "rgdal")
fn
## [1] "/usr/local/lib/R/site-library/rgdal/pictures/cea.tif"

The raster file can be read using the readGDAL command:

r <- readGDAL(fn)
## /usr/local/lib/R/site-library/rgdal/pictures/cea.tif has GDAL driver GTiff 
## and has 515 rows and 514 columns

We can print a summary of the data, and check that it is regularly gridded:

summary(r)
## Object of class SpatialGridDataFrame
## Coordinates:
##          min         max
## x  -28493.17    2358.212
## y 4224973.14 4255884.544
## Is projected: TRUE 
## proj4string :
## [+proj=cea +lon_0=-117.333333333333 +lat_ts=33.75 +x_0=0 +y_0=0
## +datum=NAD27 +units=m +no_defs +ellps=clrk66
## +nadgrids=@conus,@alaska,@ntv2_0.gsb,@ntv1_can.dat]
## Grid attributes:
##   cellcentre.offset cellsize cells.dim
## x         -28463.16 60.02214       514
## y        4225003.15 60.02214       515
## Data attributes:
##      band1      
##  Min.   :  0.0  
##  1st Qu.: 58.0  
##  Median : 99.0  
##  Mean   :103.1  
##  3rd Qu.:140.0  
##  Max.   :255.0
gridded(r)
## [1] TRUE

The raster can also be plotted using the spplot command:

spplot(r)

A big drawback of the sp package is that it loads the entire dataset into memory.

For that reason, and also because of the lack of raster manipulation tools, using the raster package is usually preferred.

The raster package

The raster package has been developped after sp, with the aim to address some of the shortcomings of the way sp handles raster data. It makes it really easy to read, write, and play with raster data (whatever the number of bands). A big plus is also that it can process big datasets by reading it partially.

library(raster)

It provides essentially 3 types of objects (classes):

  • The RasterLayer, which represents a single layer of raster data,
  • The RasterStack, which is a (loose) collection of multiple RasterLayer objects sharing the same resolution and extents,
  • The RasterBrick, a multi-layered raster referring to a single file — typically a remote sensing image.

The RasterLayer

We can create a RasterLayer in many different ways:

  • from scratch
  • from a sp object
  • from a raster data file

Let’s focus on the latter one. We can simply feed the file name of our GeoTIFF dataset to the raster command:

r <- raster(fn)

r
## class       : RasterLayer 
## dimensions  : 515, 514, 264710  (nrow, ncol, ncell)
## resolution  : 60.02214, 60.02214  (x, y)
## extent      : -28493.17, 2358.212, 4224973, 4255885  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=cea +lon_0=-117.333333333333 +lat_ts=33.75 +x_0=0 +y_0=0 +datum=NAD27 +units=m +no_defs +ellps=clrk66 +nadgrids=@conus,@alaska,@ntv2_0.gsb,@ntv1_can.dat 
## data source : /usr/local/lib/R/site-library/rgdal/pictures/cea.tif 
## names       : cea 
## values      : 0, 255  (min, max)
summary(r)
## Warning in .local(object, ...): summary is an estimate based on a sample of 1e+05 cells (37.78% of all cells)
##         cea
## Min.      0
## 1st Qu.  58
## Median   99
## 3rd Qu. 140
## Max.    255
## NA's      0

The plotting function of raster is also faster than the plotting routine of sp:

plot(r)

We can also use the different commands to query our raster data:

# Dimensions
dim(r)
## [1] 515 514   1
# Number of cells
ncell(r)
## [1] 264710
# Resolution
res(r)
## [1] 60.02214 60.02214
# Coordinate reference system
projection(r)
## [1] "+proj=cea +lon_0=-117.333333333333 +lat_ts=33.75 +x_0=0 +y_0=0 +datum=NAD27 +units=m +no_defs +ellps=clrk66 +nadgrids=@conus,@alaska,@ntv2_0.gsb,@ntv1_can.dat"
# Spatial extent
extent(r)
## class       : Extent 
## xmin        : -28493.17 
## xmax        : 2358.212 
## ymin        : 4224973 
## ymax        : 4255885

The RasterStack

A RasterStack is a suite of raster layers that share:

  • the same extent
  • the same resolution

They can be from different files on your disk:

fn
## [1] "/usr/local/lib/R/site-library/rgdal/pictures/cea.tif"
s <- stack(fn, fn, fn)
s
## class       : RasterStack 
## dimensions  : 515, 514, 264710, 3  (nrow, ncol, ncell, nlayers)
## resolution  : 60.02214, 60.02214  (x, y)
## extent      : -28493.17, 2358.212, 4224973, 4255885  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=cea +lon_0=-117.333333333333 +lat_ts=33.75 +x_0=0 +y_0=0 +datum=NAD27 +units=m +no_defs +ellps=clrk66 +nadgrids=@conus,@alaska,@ntv2_0.gsb,@ntv1_can.dat 
## names       : cea.1, cea.2, cea.3 
## min values  :     0,     0,     0 
## max values  :   255,   255,   255
summary(s)
## Warning in .local(object, ...): summary is an estimate based on a sample of 1e+05 cells (37.78% of all cells)
##         cea.1 cea.2 cea.3
## Min.        0     0     0
## 1st Qu.    58    58    58
## Median     99    99    99
## 3rd Qu.   140   140   140
## Max.      255   255   255
## NA's        0     0     0
plot(s)

They can also be from different RasterLayer objects, or even different RasterStack objects:

r1 <- r
r2 <- r^2

s <- stack(r1, r2)

plot(s)

The number of layer in the stack can be checked using nlayers:

nlayers(s)
## [1] 2

Algebra

You can use the common algebra operators on raster objects:

r1 <- r + 10
r1 <- r1 * r + 5

You can apply functions:

r2 <- sqrt(r1)
r <- round(r)

Logical tests:

r <- r == 1

You can also use replacement functions:

r[] <- runif(ncell(r))
r1[r] <- -0.5
r1[r1 == 5] <- 15
r1[ !r ] <- 5

Conversions

From raster to sp and back

r <- raster(fn)
spdf <- as(r, "SpatialPixelsDataFrame")
summary(spdf)
## Object of class SpatialPixelsDataFrame
## Coordinates:
##          min         max
## x  -28493.17    2358.212
## y 4224973.14 4255884.544
## Is projected: TRUE 
## proj4string :
## [+proj=cea +lon_0=-117.333333333333 +lat_ts=33.75 +x_0=0 +y_0=0
## +datum=NAD27 +units=m +no_defs +ellps=clrk66
## +nadgrids=@conus,@alaska,@ntv2_0.gsb,@ntv1_can.dat]
## Number of points: 264710
## Grid attributes:
##    cellcentre.offset cellsize cells.dim
## s1         -28463.16 60.02214       514
## s2        4225003.15 60.02214       515
## Data attributes:
##       cea       
##  Min.   :  0.0  
##  1st Qu.: 58.0  
##  Median : 99.0  
##  Mean   :103.1  
##  3rd Qu.:140.0  
##  Max.   :255.0
spplot(spdf)

r <- raster(spdf)
r
## class       : RasterLayer 
## dimensions  : 515, 514, 264710  (nrow, ncol, ncell)
## resolution  : 60.02214, 60.02214  (x, y)
## extent      : -28493.17, 2358.212, 4224973, 4255885  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=cea +lon_0=-117.333333333333 +lat_ts=33.75 +x_0=0 +y_0=0 +datum=NAD27 +units=m +no_defs +ellps=clrk66 +nadgrids=@conus,@alaska,@ntv2_0.gsb,@ntv1_can.dat 
## data source : in memory
## names       : cea 
## values      : 0, 255  (min, max)
plot(r)

From raster to data.frame and back

df <- as.data.frame(s, xy = TRUE)
head(df)
##           x       y cea.1 cea.2
## 1 -28463.16 4255855     0     0
## 2 -28403.13 4255855     0     0
## 3 -28343.11 4255855     0     0
## 4 -28283.09 4255855     0     0
## 5 -28223.07 4255855     0     0
## 6 -28163.05 4255855     0     0
summary(df)
##        x                y               cea.1           cea.2      
##  Min.   :-28463   Min.   :4225003   Min.   :  0.0   Min.   :    0  
##  1st Qu.:-20780   1st Qu.:4232686   1st Qu.: 58.0   1st Qu.: 3364  
##  Median :-13067   Median :4240429   Median : 99.0   Median : 9801  
##  Mean   :-13067   Mean   :4240429   Mean   :103.1   Mean   :14109  
##  3rd Qu.: -5355   3rd Qu.:4248172   3rd Qu.:140.0   3rd Qu.:19600  
##  Max.   :  2328   Max.   :4255855   Max.   :255.0   Max.   :65025
df <- na.omit(df)
s <- rasterFromXYZ(df)
s
## class       : RasterBrick 
## dimensions  : 515, 514, 264710, 2  (nrow, ncol, ncell, nlayers)
## resolution  : 60.02214, 60.02214  (x, y)
## extent      : -28493.17, 2358.212, 4224973, 4255885  (xmin, xmax, ymin, ymax)
## coord. ref. : NA 
## data source : in memory
## names       : cea.1, cea.2 
## min values  :     0,     0 
## max values  :   255, 65025
s <- rasterFromXYZ(
  xyz = df,
  crs = projection(s)
)
s
## class       : RasterBrick 
## dimensions  : 515, 514, 264710, 2  (nrow, ncol, ncell, nlayers)
## resolution  : 60.02214, 60.02214  (x, y)
## extent      : -28493.17, 2358.212, 4224973, 4255885  (xmin, xmax, ymin, ymax)
## coord. ref. : NA 
## data source : in memory
## names       : cea.1, cea.2 
## min values  :     0,     0 
## max values  :   255, 65025

Modifying the Raster* objects

projectRaster(r, crs = "+init=epsg:4326")
## class       : RasterLayer 
## dimensions  : 527, 524, 276148  (nrow, ncol, ncell)
## resolution  : 0.000648, 0.000541  (x, y)
## extent      : -117.645, -117.3054, 33.66197, 33.94708  (xmin, xmax, ymin, ymax)
## coord. ref. : +init=epsg:4326 +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0 
## data source : in memory
## names       : cea 
## values      : -36.65713, 255  (min, max)
pts <- sampleRandom(r, size = 10, sp = TRUE)
plot(r)
points(pts, pch = "+", col = 2, cex = 2)

extract(s, pts)
##       cea.1 cea.2
##  [1,]    16   256
##  [2,]    82  6724
##  [3,]    25   625
##  [4,]   181 32761
##  [5,]   165 27225
##  [6,]   132 17424
##  [7,]   107 11449
##  [8,]    66  4356
##  [9,]    66  4356
## [10,]   140 19600
extract(s, pts, df = TRUE)
##    ID cea.1 cea.2
## 1   1    16   256
## 2   2    82  6724
## 3   3    25   625
## 4   4   181 32761
## 5   5   165 27225
## 6   6   132 17424
## 7   7   107 11449
## 8   8    66  4356
## 9   9    66  4356
## 10 10   140 19600
extract(s, pts, sp = TRUE)
## class       : SpatialPointsDataFrame 
## features    : 10 
## extent      : -26362.38, -12.66279, 4225123, 4253274  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=cea +lon_0=-117.333333333333 +lat_ts=33.75 +x_0=0 +y_0=0 +datum=NAD27 +units=m +no_defs +ellps=clrk66 +nadgrids=@conus,@alaska,@ntv2_0.gsb,@ntv1_can.dat 
## variables   : 3
## names       : cea, cea.1, cea.2 
## min values  :  16,    16,   256 
## max values  : 181,   181, 32761

The resample command transfers raster values between two non-matching Raster* object (extent, resolution):

r2 <- aggregate(r, fact = 2)
res(r)
## [1] 60.02214 60.02214
res(r2)
## [1] 120.0443 120.0443
r3 <- resample(r2, r)
res(r3)
## [1] 60.02214 60.02214
s <- stack(r, r3)
plot(s)

Calc

my_function <- function(x) {
  
  res <- x * 10
  
  return(res)
}

r1 <- calc(r, fun = my_function)