2. Essentials: Binning Point Data

Introduction

The purpose of this tutorial is to demonstrate the basic usage of the icosa package. The primary targeted application of the package is in global biological sciences (e.g. in macroecological, biogeographical analyses), but other fields might find the structures and procedures relevant, given that they operate with point vector data and variables that depend on the density of such data.

Note that this is just a brief introduction to the package’s capabilities. The complete documentation of the package and the rest of the tutorials will be made available on the package website. Relevant how-to guides will be posted on the evolv-ED blog (https://www.evolv-ed.net/).

Rationale: Point Data

One of the many problems with ecological samples is that due to density and uniformity issues, the data points are to be aggregated to distinct units. As coordinate recording is very efficient on the 2d surface of a polar coordinate system (i.e. latiude and longitude data), this was primarly achieved by rectangular gridding of the surface (for instance 1° times 1° grid cells). Unfortunately, this method suffers from systematic biasing effects: as the poles are approached, the cells become smaller in area, and come closer to each other.

Random points on the sphere

We can illustrate this problem by generating random points and assign them to different spatial bins. The icosa package has a utility function to generate such points:

library(icosa)

The function rpsphere() generates randomly distributed points on the surface of a sphere.

# set seed for exact reproducibility
set.seed(0)
# 10000 random points
randPoints <- rpsphere(10000, output="polar")
# the first 6 points
head(randPoints)

##             long        lat
## [1,]  -43.998368 -16.764084
## [2,] -105.191289   6.095493
## [3,]  -20.456553 -10.867072
## [4,]    3.011043  26.776595
## [5,]   -7.820389  67.246328
## [6,]  170.576713 -35.169054

Land Polygons

You can visualize this pointset and plot them on a map, with some example data, such as the land polygons from Natural Earth - which are included in the package for the sake of this tutorial. You can load and plot such vector data with the sf package. You can install it from the CRAN with the regular install.packages()

library(sf)

## Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE

Once the package is installed, you can load the example data with sf::st_read():

# reading in example data
ne <- sf::st_read(file.path(system.file(package="icosa"), "extdata/ne_110m_land.shx"))

## Reading layer `ne_110m_land' from data source 
##   `/usr/local/lib/R/site-library/icosa/extdata/ne_110m_land.shx' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 127 features and 3 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -90 xmax: 180 ymax: 83.64513
## Geodetic CRS:  WGS 84

This vector data object can be plotted easilly and we can use it to contextualize the randPoints object - which can represent any kind of point data:

# plotting the world map
plot(ne$geometry, col="gray", border=NA)

# plotting the point cloud
points(randPoints, col="#FF000055", pch=3)

Note

The default margins of sf plots can be too large! The margins of some of these plots were set with:

par(mar=c(0.25,0.25,0.25,0.25))

Note that the points appear to have a non-random distribution: they seem to have higher density in lower-latitude areas. But this is in fact a distorting effect of not the points, but the equirectangular map projection itself! The points are are created with 3D normal distributions and then they are projected to the surface of the sphere around the center of this distribution.

Binning with a Gaussian (UV)-grid

Let’s count these data points in a simple longitude-latitude (Gaussian, polar, or UV) grid, which you can easilly implement with the terra package (i.e. raster)! Similar to sf this package can be installed from the CRAN with install.packages() function call.

library(terra)

Let’s make a 10x10° raster grid, and count the number of points in every cell:

# a 10-degree resolution grid
r <- terra::rast(res=10)

# count the number of points in raster cells
counts <- terra::rasterize(randPoints,r, fun=length)
counts

## class       : SpatRaster 
## dimensions  : 18, 36, 1  (nrow, ncol, nlyr)
## resolution  : 10, 10  (x, y)
## extent      : -180, 180, -90, 90  (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 
## source(s)   : memory
## name        : values 
## min value   :      1 
## max value   :     44

The counts of these points can be easilly visualized with a plot.

# the raster itself
plot(counts)

# plotting the map on plot
plot(ne$geometry, col=NA, border="black", add=TRUE)

The point counts have a similar latitudinal pattern as we saw with their distribution: as latitude increases, the cell sizes decrease and the expected number of points in them decreases.0

One solution to this problem can be the use of a different projection. For instance the Lambert’s Cylindrical Equal Area projection (["EPSG:54034"](https://epsg.io/54034)) can be used instead of the Equirectangular projection. This will, however distort the polar cells that will become more and more elongated with higher the latitude values - which might be a problem, or might not.

About icosahedral grids

The icosa package approaches this problem by not using a rectangular grid, i.e. a grid that has cells that are rectangular shape.

Instead, icosa relies on the tessellation of a regular icosahedron to a given resolution. This procedure ends up with a polyhedral object of triangular faces of higly isometric properties: very similar shapes of cells which are roughly equally distanced, and similar in cell area. Such bodies are often called icospheres.

Here is how such grid looks like in 3D (also plotted with icosa, see a later tutorial for such plotting):

Still, triangular grids are somewhat unintuitive, because distances between the cell can vary considerably: the corner of a cell is much farther away from the center than the side. For this reason, we usually use the inverted versions of these grids: the triangle cell-midpoints become the corners of the cells (i.e. vertices), which defines a grid where cells are hexagonal - with exactly 12 pentagons (where the vertices of the icosahedron used to be). This is the inverted version of the grid above:

Creating an icosahedral grid

Once icosa is loaded, you can create such a penta-hexagonal grid with a single line of code, using the hexagrid() function:

# create a trigrid class object
hexa <- hexagrid(deg=5, sf=TRUE)

## Selecting hexagrid with tessellation vector: c(2, 4).
## Mean edge length: 4.993 degrees.

The argument deg refers to the expected length of the edge length of the grid. This is given as angular distance between two cell vertices in degrees - which means the same as degrees longitudinally (or latitudinally on the equator), 1° =~111 km. This is used to select a tessellation vector, which is directly controlling the grid resolution, describing how the faces of icosahedron are split up. The higher the product of this vector, the higher the resolution - ie. the more cells there will be in the grid. The example c(2, 4) means that the original icosahedrons edges are split up 2 * 4 = 8 times. This triangular grid (trigrid) is then inverted (vertices become cell midpoints), which results in penta-hexagonal grid.
The argument sf=TRUE indicates that we want to create a representation of the grid that can be used for additional spatial operations using the sf package, for example 2D (projection-based) plotting. Since this is not always needed, and can significantly decrease performance, you have to indicate that you need this option.

If you type in the name of the object, you can immediately see some properties about the structure and resolution of the grid.

hexa

## A/An hexagrid object with 1280 vertices, 1920 edges and 642 faces.
## The mean grid edge length is 555.24 km or 4.99 degrees.
## Use plot3d() to see a 3d render.

The most important detail here is the number of faces of the 3d body (642) i.e. the number of cells in the grid.

Plotting the grid in 2d

If you have generated the sf-represenation as above, the plot() function can be invoked on the grid to see its 2D (longitude-latitude projection).

plot(hexa)

Putting the map and points generated above is just as easy as with any other kind of mapping object:

# plotting the world map
plot(ne$geometry, col="gray", border=NA)

# the grid
plot(hexa, add=TRUE)

# plotting the point cloud
points(randPoints, col="#FF000022", pch=3)

Spatial binning with the `hexagrid`

The grids themselves can only be useful if we can figure out how the points actually interact with them: for instance, if we want to repeat the same binning as earlier, but with the grid that we created earlier. To make this process as flexible and useful as possible, we will be doing this in three short steps:

Finding the cells that the points fall onto with locate()
Count the number of points in a cell. (using base R)
Plot the counts on the grid

The cells of the icoshedral grid are named with the convention "F<integer>" ("F" stands for face), such as as "F12". You can see how these are distributed using the gridlabs() function:

plot(hexa)
gridlabs(hexa, cex=0.5)

Cells names are assigned according to a spiral from the North pole of the grid.

Point lookup - `locate()`

The only thing that we need to able to do with the grid to make this process feasible is the ability to find the cells on which every points falls. This is a very basic feature, but with this we can do most computations that we might want to do with the grid.

Simplest way to illustrate this is with a couple of points. Let’s try it with the first 5 from the randomly generated set:

# the first five points
examples <- randPoints[1:5,]
examples

##             long        lat
## [1,]  -43.998368 -16.764084
## [2,] -105.191289   6.095493
## [3,]  -20.456553 -10.867072
## [4,]    3.011043  26.776595
## [5,]   -7.820389  67.246328

We can easily plot this object on the gird map we created above:

# visualize exact locations
points(examples, col="#FF0000", pch=3, cex=3)

# add an identifier 
text(label=1:nrow(examples), examples, col="#FF0000", pch=3, pos=2, cex=3)

The locate() function returns the name of the cell under a point: one cell name for every coordinate pairs. The function takes the grid (hexa), and the point coordinates (examples) as arguments and returns a vector of cell names:

exampleCells <- locate(hexa, examples)
exampleCells

## [1] "F405" "F277" "F367" "F208" "F25"

Note that the we plotted the number identifiers so it is clearly visible which point is which. The ordering of the face/cell names will match that of the points. The match is easier to see if we highlight the cells the points fall on:

# replot the map for clarity
plot(hexa)

# Subsetting the sf representation of the grid
plot(hexa, exampleCells, col="green", add=TRUE)

# labels
gridlabs(hexa, cex=0.5)

# points again
points(examples, col="#FF0000", pch=3, cex=3)

This localization of the points does a lot of heavy lifting in icosa, and it can be applied thousands/millions of points at once.

cells <- locate(hexa, randPoints)
str(cells)

##  chr [1:10000] "F405" "F277" "F367" "F208" "F25" "F510" "F79" "F539" "F481" ...

To keep things orgnanized this output of locate() is so designed that it can be used as a new column in the table of coordinates (optional):

# transform this to a data frame
rdf <- data.frame(randPoints)
# assign the face names as a column
rdf$cells <- cells
# the first 6 rows
head(rdf)

##          long        lat cells
## 1  -43.998368 -16.764084  F405
## 2 -105.191289   6.095493  F277
## 3  -20.456553 -10.867072  F367
## 4    3.011043  26.776595  F208
## 5   -7.820389  67.246328   F25
## 6  170.576713 -35.169054  F510

Counting the cells

To assess the density of points, we need to count the number of points in every cell - which is actually nothing else, but the tabulation of the number of times that a point falls on a cell. This we can easilly do with the table() function:

tCells <- table(cells)
str(tCells)

##  'table' int [1:642(1d)] 9 17 10 23 15 24 21 15 17 11 ...
##  - attr(*, "dimnames")=List of 1
##   ..$ cells: chr [1:642] "F1" "F10" "F100" "F101" ...

… which gives us the number of points that fall on one particular cell (cell name in the names() attribute), frequency as the value.

Plotting the frequency

There is a number of ways to visualize this number on the map, the simplest is to use a dedicated method of plot(), that takes the grid (with an sf representation, as above) and a named entity (vector, table) and plots this with the features made available via sf:

plot(hexa, tCells)

You can modify this plotting in any way you could with sf::plot(). For instance, if you want to use a different color scheme, a different binning

plot(hexa, tCells, 
    border="white",
    pal=c("#440154FF", "#2D708EFF", "#29AF7FFF", "#FDE725FF"), 
    breaks=c(0, 10, 15, 20, 40)
)

You can also set the plotted cooordinate reference system directly here in the plotting command. For instance, if you would like this to be plotted directly on a Mollweide projection ("ESRI:54009"):

# the base map
plot(hexa, tCells, 
    crs="ESRI:54009", 
    border="white",
    pal=c("#440154FF", "#2D708EFF", "#29AF7FFF", "#FDE725FF"), 
    breaks=c(0, 10, 15, 20, 40)
)

If you do not forget to transform the landmasses

neMoll <- sf::st_transform(ne, "ESRI:54009")

… you can also put them on top. Just make sure to set reset=FALSE (see sf::plot why):

# the base map
plot(hexa, tCells,
    crs="ESRI:54009", 
    border="white",
    pal=c("#440154FF", "#2D708EFF", "#29AF7FFF", "#FDE725FF"), 
    breaks=c(0, 10, 15, 20, 40), 
    reset=FALSE
)

# the landmasses tranparent gray with black contour
plot(neMoll$geometry, add=TRUE, col="#66666688")

All these options come from sf:plot(). If you familiarize yourself with how to use that,then only your imagination will be the limit!

Using the result in calculations

The by-product of this workflow is that now we get explicit access to not only the cells’ name, but also their attributes. For instance if you want to see whether the number of points in a cell has a latitudinal pattern, you just need to get the coordinates of the face centers, and assign them to the densities. You can get these with centers():

# translate the returned 2-column matrix to a data.frame
faceInfo <- as.data.frame(centers(hexa))
# the first 6 rows
head(faceInfo)

##    long      lat
## F1    0 90.00000
## F2 -162 82.07063
## F3  -90 82.07063
## F4  -18 82.07063
## F5   54 82.07063
## F6  126 82.07063

which returns the longitude and latitude coordinates of the face centers. Now we just need to assign the counts to these, which is easy with base R using the rownames as a character subscript and assigning the result as a new column of faceInfo:

faceInfo$count <- tCells[rownames(faceInfo)]
# the first 6 rows
head(faceInfo)

##    long      lat count
## F1    0 90.00000     9
## F2 -162 82.07063    13
## F3  -90 82.07063    16
## F4  -18 82.07063    20
## F5   54 82.07063    18
## F6  126 82.07063    18

Now it is easy to see whether there is an association with latitude:

plot(faceInfo$lat, faceInfo$count, 
    xlab="Latitude (deg)", ylab="Point count", 
    pch=16, col="#99000044")

There is no latitudinal trend in the points, but you can see that the spread decreases latitudinally (because there are fewer cells) at high latitudes.

Summary

If you have some data (e.g. randPoints) and you want to execute the binning and plotting, you can do it with this much code - now using a different, coarser resolution grid:

gr <- hexagrid(deg=10, sf=TRUE) # create grid

## Selecting hexagrid with tessellation vector: c(2, 2).
## Mean edge length: 9.994 degrees.

cells<- locate(gr, randPoints) # locate points
tab <- table(cells) # tabulate, calculate
plot(gr, tab) # plot named vector/table