
# Classical Geostatistical Methods

Authored by: Dale L. Zimmerman, Michael Stein

# Handbook of Spatial Statistics

Print publication date: March 2010
Online publication date: March 2010

Print ISBN: 9781420072877
eBook ISBN: 9781420072884

DOI: 10.1201/9781420072884-c3


#### 3.1  Overview

Suppose that a spatially distributed variable is of interest, which in theory is defined at every point over a bounded study region of interest, D ⊂ R^d, where d = 2 or 3. We suppose further that this variable has been observed (possibly with error) at each of n distinct points in D, and that from these observations we wish to make inferences about the process that governs how this variable is distributed spatially and about values of the variable at locations where it was not observed. The geostatistical approach for achieving these objectives is to assume that the observed data are a sample (at the n data locations) of one realization of a continuously indexed spatial stochastic process (random field) Υ(•) ≡ {Υ(s) : s ∈ D}. Chapter 2 reviewed some probabilistic theory for such processes. In this chapter, we are concerned with how to use the sampled realization to make statistical inferences about the process. In particular, we discuss a body of spatial statistical methodology that has come to be known as “classical geostatistics.” Classical geostatistical methods focus on estimating the first-order (large-scale or global trend) structure and especially the second-order (small-scale or local) structure of Υ(•), and on predicting or interpolating (kriging) values of Υ(•) at unsampled locations using linear combinations of the observations and evaluating the performance of these predictions by their (unconditional) mean squared errors. However, if the process Υ is sufficiently non-Gaussian, methods based on considering just the first two moments of Υ may be misleading. Furthermore, some common practices in classical geostatistics are problematic even for Gaussian processes, as we shall note herein.

Because good prediction of Υ(•) at unsampled locations requires that we have at our disposal estimates of the structure of the process, the estimation components of a geostatistical analysis necessarily precede the prediction component. It is not clear, however, which structure, first-order or second-order, should be estimated first. In fact, an inherent circularity exists—to properly estimate either structure, it appears we must know the other. We note that likelihood-based methods (see Chapter 4) quite neatly avoid this circularity problem, although they generally require a fully specified joint distribution and a parametric model for the covariance structure (however, see Im, Stein, and Zhu, 2007). The classical solution to this problem is to provisionally estimate the first-order structure by a method that ignores the second-order structure; next, to use the residuals from the provisional first-order fit to estimate the second-order structure; and finally, to reestimate the first-order structure by a method that accounts for the second-order structure. This chapter considers each of these stages of a classical geostatistical analysis in turn, plus the kriging stage. We begin, however, with a description of the geostatistical model upon which all of these analyses are based.

#### 3.2  Geostatistical Model

Because only one realization of Υ(•) is available, and the observed data are merely an incomplete sample from that single realization, considerable structure must be imposed upon the process for inference to be possible. The classical geostatistical model imposes structure by specifying that

3.1 $Υ(s) = μ(s) + e(s),$

where μ(s) ≡ E[Υ(s)], the mean function, is assumed to be deterministic and continuous, and e(•) ≡ {e(s) : s ∈ D} is a zero-mean random “error” process satisfying a stationarity assumption. One common stationarity assumption is that of second-order stationarity, which specifies that

3.2 $Cov [ e ( s ) , e ( t ) ] = C ( s − t ) , for all s , t ∈ D .$

In other words, this asserts that the covariance between values of Υ(•) at any two locations depends on only their relative locations or, equivalently, on their spatial lag vector. The function C(•) defined in (3.2) is called the covariance function. Observe that nothing is assumed about higher-order moments of e(•) or about its joint distribution. Intrinsic stationarity, another popular stationarity assumption, specifies that

3.3 $\frac{1}{2} \mathrm{var}[e(s) − e(t)] = γ(s − t), \text{ for all } s, t ∈ D.$

The function γ(•) defined by (3.3) is called the semivariogram (and the quantity 2γ(•) is known as the variogram). A second-order stationary random process with covariance function C(•) is intrinsically stationary, with semivariogram given by

3.4 $γ ( h ) = C ( 0 ) − C ( h ) ,$

but the converse is not true in general. In fact, intrinsically stationary processes exist for which var[Υ(s)] is not even finite at any s ∈ D. An even weaker stationarity assumption is that satisfied by an intrinsic random field of order k (IRF-k), which postulates that certain linear combinations of the observations, known as kth-order generalized increments, have mean zero and a (generalized) covariance function that depends only on the spatial lag vector. IRF-ks were introduced in Chapter 2, to which we refer the reader for more details.
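Relation (3.4) can be illustrated numerically. The following is a minimal Python sketch (the chapter's own examples use R); the exponential covariance and its parameters are illustrative assumptions, not from the text:

```python
import numpy as np

def cov_exp(h, sigma2=1.0, rho=2.0):
    """Assumed second-order stationary exponential covariance:
    C(h) = sigma2 * exp(-h / rho)."""
    return sigma2 * np.exp(-np.asarray(h, dtype=float) / rho)

def semivariogram_from_cov(h, sigma2=1.0, rho=2.0):
    """Semivariogram implied by (3.4): gamma(h) = C(0) - C(h)."""
    return cov_exp(0.0, sigma2, rho) - cov_exp(h, sigma2, rho)

h = np.linspace(0.0, 10.0, 101)
g = semivariogram_from_cov(h)
# gamma vanishes at the origin and increases toward the sill C(0) = sigma2
```

As expected from (3.4), the resulting semivariogram starts at zero and approaches C(0) as the lag grows, previewing the "sill" notion of Section 3.5.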

Model (3.1) purports to account for large-scale spatial variation (trend) through the mean function μ(•), and for small-scale spatial variation (spatial dependence) through the process e(•). In practice, however, it is usually not possible to unambiguously identify and separate these two components using the available data. Quoting from Cressie (1991, p. 114), “One person's deterministic mean structure may be another person's correlated error structure.” Consequently, the analyst will have to settle for a plausible, but admittedly nonunique, decomposition of spatial variation into large-scale and small-scale components.

In addition to capturing the small-scale spatial variation, the error process e(•) in (3.1) accounts for measurement error that may occur in the data collection process. This measurement error component typically has no spatial structure; hence, for some purposes it may be desirable to explicitly separate it from the spatially dependent component. That is, we may write

3.5 $e ( s ) = η ( s ) + ϵ ( s ) ,$

where η(•) is the spatially dependent component and ϵ(•) is the measurement error. Such a decomposition is discussed in more detail in Section 3.5.

The stationarity assumptions introduced above specify that the covariance or semivariogram depends on locations s and t only through their lag vector h = s − t. A stronger property, not needed for making inference from a single sampled realization but important nonetheless, is that of isotropy. Here we describe just intrinsic isotropy (and anisotropy); second-order isotropy differs only by imposing an analogous condition on the covariance function rather than the semivariogram. An intrinsically stationary random process with semivariogram γ(•) is said to be (intrinsically) isotropic if γ(h) = γ(h), where h = (h′h)^{1/2}; that is, the semivariogram is a function of the locations only through the (Euclidean) distance between them. If the process is not isotropic, it is said to be anisotropic. Perhaps the most tractable form of anisotropy is geometric anisotropy, for which γ(h) = γ((h′Ah)^{1/2}), where A is a positive definite matrix. Isotropy can be regarded as a special case of geometric anisotropy in which A is an identity matrix. Contours along which the semivariogram is constant (so-called isocorrelation contours when Υ(•) is second-order stationary) are d-dimensional spheres in the case of isotropy and d-dimensional ellipsoids in the more general case of geometric anisotropy.

The objectives of a geostatistical analysis, which were noted in general terms in Section 3.1, can now be expressed more specifically in terms of model (3.1). Characterization of the spatial structure is tantamount to the estimation of μ(•) and either C(•) or γ(•). The prediction objective can be reexpressed as seeking to predict the value of Υ(s_0) = μ(s_0) + e(s_0) at an arbitrary site s_0.

#### 3.3  Provisional Estimation of the Mean Function

The first stage of a classical geostatistical analysis is to specify a parametric model, µ(s ; β), for the mean function of the spatial process, and then provisionally estimate this model by a method that requires no knowledge of the second-order dependence structure of Υ(•). The most commonly used parametric mean model is a linear function, given by

3.6 $μ(s; β) = X(s)^T β,$

where X(s) is a vector of covariates (explanatory variables) observed at s, and β is an unrestricted parameter vector. Alternative choices include nonlinear mean functions, such as sines/cosines (with unknown phase, amplitude, and period) or even semiparametric or nonparametric (locally smooth) mean functions, but these appear to be used very rarely.

One possible approach to spatial interpolation is to place all of the continuous variation of the process into the mean function, i.e., assume that the observations equal a true but unknown continuous mean function plus independent and identically distributed errors, and use nonparametric regression methods, such as kernel smoothers, local polynomials, or splines. Although nonparametric regression methods provide a viable approach to spatial interpolation, we prefer the geostatistical approach, for the following reasons, when s refers to a location in physical space. First, the geostatistical approach allows us to take advantage of properties, such as stationarity and isotropy, that do not usually arise in nonparametric regression. Second, the geostatistical approach naturally generates uncertainty estimates for interpolated values even when the underlying process is continuous and is observed with little or no measurement error. Uncertainty estimation is problematic with nonparametric regression methods, especially if the standard deviation of the error term is not large compared to the changes in the underlying function between neighboring observations. It should be pointed out that smoothing splines, which can be used for nonparametric regression, yield spatial interpolants that can be interpreted as kriging predictors (Wahba, 1990). The main difference, then, between smoothing splines and kriging is in how one goes about estimating the degree of smoothing and in how one provides uncertainty estimates for the interpolants.

The covariates associated with a point s invariably include an overall intercept term, equal to one for all data locations. Note that if this is the only covariate and the error process e(•) in (3.1) is second-order (or intrinsically) stationary, then Υ(•) itself is second-order (or intrinsically) stationary. The covariates may also include the geographic coordinates (e.g., latitude and longitude) of s, mathematical functions (such as polynomials) of those coordinates, and attribute variables. For example, in modeling the mean structure of April 1 snow water equivalent (a measure of how much water is contained in the snowpack) over the western United States in a given year, one might consider, in addition to an overall intercept, latitude and longitude, such covariates as elevation, slope, aspect, average wind speed, etc., to the extent that data on these attribute variables are available. If data on potentially useful attribute variables are not readily available, the mean function often is taken to be a polynomial function of the geographic coordinates only. Such models are called trend surface models. For example, the first-order (planar) and second-order (quadratic) polynomial trend surface models for the mean of a two-dimensional process are respectively as follows, where s = (s_1, s_2):

$μ(s; β) = β_0 + β_1 s_1 + β_2 s_2,$
$μ(s; β) = β_0 + β_1 s_1 + β_2 s_2 + β_{11} s_1^2 + β_{12} s_1 s_2 + β_{22} s_2^2.$

Using a “full” qth-order polynomial, i.e., a polynomial that includes all pure and mixed monomials of degree ≤ q, is recommended because this will ensure that the fitted surface is invariant to the choice of origin and orientation of the (Euclidean) coordinate system.
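The full quadratic trend surface can be assembled as a design matrix with one column per monomial. A minimal Python sketch (the function name and example locations are illustrative assumptions):

```python
import numpy as np

def quadratic_trend_design(s):
    """Design matrix for the full second-order (quadratic) trend surface:
    columns 1, s1, s2, s1^2, s1*s2, s2^2 for each location s = (s1, s2)."""
    s = np.asarray(s, dtype=float)
    s1, s2 = s[:, 0], s[:, 1]
    return np.column_stack([np.ones(len(s)), s1, s2, s1 ** 2, s1 * s2, s2 ** 2])

# four illustrative data locations
locs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X = quadratic_trend_design(locs)    # one row per location, shape (4, 6)
```

Including every monomial of degree ≤ 2 (rather than, say, dropping the cross term s1*s2) is what makes the fitted surface invariant to rotations and translations of the coordinate system.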

It is worth noting that realizations of a process with constant mean, but strong spatial correlation, frequently appear to have trends; therefore, it is generally recommended that one refrain from using trend surfaces that cannot be justified apart from examining the data.

The standard method for fitting a provisional linear mean function to geostatistical data is ordinary least squares (OLS). This method yields the OLS estimator $\hat{β}_{OLS}$ of β, given by

$\hat{β}_{OLS} = \operatorname{argmin}_{β} \sum_{i=1}^{n} [Υ(s_i) − X(s_i)^T β]^2.$

Equivalently, $\hat{β}_{OLS} = (X^T X)^{−1} X^T Y$, where X = [X(s_1), X(s_2), …, X(s_n)]^T and Y = [Υ(s_1), Υ(s_2), …, Υ(s_n)]^T, it being assumed without loss of generality that X has full column rank. Fitted values and fitted residuals at data locations are given by $\hat{Y} = X \hat{β}_{OLS}$ and $\hat{e} = Y − \hat{Y}$, respectively. The latter are passed to the second stage of the geostatistical analysis, to be described in the next section. While still at this first stage, however, the results of the OLS fit should be evaluated and used to suggest possible alternative mean functions. For this purpose, the standard arsenal of multiple regression methodology, such as transformations of the response, model selection, and outlier identification, may be used, but in an exploratory rather than confirmatory fashion, since the independent-errors assumption upon which this OLS methodology is based is likely not satisfied by the data.
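The OLS stage can be sketched numerically. The locations, covariates, and response below are synthetic and purely illustrative (a planar trend plus independent noise, not data from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
S = rng.uniform(0.0, 10.0, size=(n, 2))              # illustrative data locations
X = np.column_stack([np.ones(n), S[:, 0], S[:, 1]])  # planar-trend covariates
beta_true = np.array([2.0, 0.5, -0.3])               # assumed true coefficients
Y = X @ beta_true + rng.normal(0.0, 0.2, n)          # synthetic response

# OLS: beta_hat = (X^T X)^{-1} X^T Y, computed via lstsq for stability
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
fitted = X @ beta_hat
residuals = Y - fitted       # these residuals feed the semivariogram stage
```

Because the model includes an intercept, the residuals are orthogonal to the columns of X (and in particular sum to zero), which is the property exploited when they are passed on to semivariogram estimation.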

As a result of the wide availability of software for fitting linear regression models, OLS fitting of a linear mean function to geostatistical data is straightforward. However, there are some practical limitations worth noting, as well as some techniques/guidelines for overcoming these limitations. First, and in particular for polynomial trend surface models, the covariates can be highly multicollinear, which causes the OLS estimators to have large variances. This is mainly a numerical problem, not a statistical one (unless the actual values of the regression coefficients are of interest), and it can be solved by centering the covariates (i.e., subtracting their mean values) or, if needed, by orthogonalizing the terms in some manner prior to fitting. Second, the fitted surface in portions of the spatial domain of interest where no observations are taken may be distorted so as to better fit the observed data. This problem is avoided, however, if the sampling design has good spatial coverage. Finally, as with least squares estimation in any context, the OLS estimators are sensitive to outliers and thus one may instead wish to fit the mean function using one of many available general procedures for robust and resistant regression. If the data locations form a (possibly partially incomplete) rectangular grid, one robust alternative to OLS estimation is median polish (Cressie, 1986), which iteratively sweeps out row and column medians from the observed data (and thus is implicitly based on an assumed row-column effects model for the first-order structure). However, the notion of what constitutes an outlier can be tricky with spatially dependent data, so robust methods should be used with care.

#### 3.4  Nonparametric Estimation of the Semivariogram

The second stage of a geostatistical analysis is to estimate the second-order dependence structure of the random process Υ(•) from the residuals of the fitted provisional mean function. To describe this in more detail, we assume that e(•) is intrinsically stationary, in which case the semivariogram is the appropriate mode of description of the second-order dependence. We also assume that d = 2, though extensions to d = 3 are straightforward.

Consider first a situation in which the data locations form a regular rectangular grid. Let $h_1 = (h_{11}, h_{12})^T, …, h_k = (h_{k1}, h_{k2})^T$ represent the distinct lags between data locations (in units of the grid spacings), with displacement angles $ϕ_u = \tan^{−1}(h_{u2}/h_{u1}) ∈ [0, π)$ (u = 1, …, k). Attention may be restricted to only those lags with displacement angles in [0, π) without any loss of information because γ(h) is an even function. For u = 1, …, k, let N(h_u) represent the number of times that lag h_u occurs among the data locations. Then the empirical semivariogram is defined as follows:

$\hat{γ}(h_u) = \frac{1}{2N(h_u)} \sum_{s_i − s_j = h_u} \{\hat{e}(s_i) − \hat{e}(s_j)\}^2 \quad (u = 1, …, k),$

where ê(s_i) is the residual from the fitted provisional mean function at the ith data location and is thus the ith element of the vector ê defined in the previous section. We call $\hat{γ}(h_u)$ the uth ordinate of the empirical semivariogram. Observe that $\hat{γ}(h_u)$ is a method-of-moments type of estimator of γ(h_u). Under model (3.1) with constant mean, this estimator is unbiased; if the mean is not constant in model (3.1), the estimator is biased as a consequence of estimating the mean structure, but the bias is not large in practice (provided that the mean structure that is estimated is correctly specified).

Figure 3.1   A polar partition of the lag space.
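The grid-lag estimator can be sketched in a few lines of Python (the chapter's own examples use R). For simplicity this sketch handles only nonnegative grid lags; the residual field used to exercise it is an assumed linear surface, not data from the text:

```python
import numpy as np

def empirical_semivariogram_grid(z, l1, l2):
    """Method-of-moments estimate at grid lag (l1, l2), l1, l2 >= 0:
    gamma_hat = (1 / (2 N)) * sum of squared residual differences over all
    pairs of grid cells separated by that lag."""
    n1, n2 = z.shape
    a = z[: n1 - l1, : n2 - l2]   # "left" member of each pair
    b = z[l1:, l2:]               # "right" member, shifted by the lag
    d = a - b
    return 0.5 * np.mean(d ** 2)

# illustrative residual field on a 5 x 5 grid: z[i, j] = i (linear in rows)
z = np.add.outer(np.arange(5.0), np.zeros(5))
g10 = empirical_semivariogram_grid(z, 1, 0)   # pairs differ by 1 in rows
g01 = empirical_semivariogram_grid(z, 0, 1)   # pairs identical across columns
```

For this field, every pair at lag (1, 0) differs by exactly 1, so the estimate is 0.5, while the estimate at lag (0, 1) is 0, mirroring the definition above term by term.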

When data locations are irregularly spaced, there is generally little to no replication of lags among the data locations. To obtain quasireplication of lags, we first partition the lag space H = {st: s, t ∈D} into lag classes or “bins” H 1,…, H k , say, and assign each lag with displacement angle in [0, π) that occurs among the data locations to one of the bins. Then, we use a similar estimator:

3.7 $\hat{γ}(h_u) = \frac{1}{2N(H_u)} \sum_{s_i − s_j ∈ H_u} \{\hat{e}(s_i) − \hat{e}(s_j)\}^2 \quad (u = 1, …, k).$

Here h_u is a representative lag for the entire bin H_u, and N(H_u) is the number of lags that fall into H_u. The bin representative, h_u, is sometimes taken to be the centroid of H_u, but a much better choice is the average of all the lags that fall into H_u. The most common partition of the lag space is a “polar” partition, i.e., a partitioning into angle and distance classes, as depicted in Figure 3.1. A polar partition naturally allows for the construction and plotting of a directional empirical semivariogram, i.e., a set of empirical semivariogram ordinates corresponding to the same angle class, but different distance classes, in each of several directions. It also allows for lags to be combined over all angle classes to yield the ordinates of an omnidirectional empirical semivariogram. The polar partition of the lag space is not the only possible partition; however, some popular software for estimating semivariograms uses a rectangular partition instead.
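An omnidirectional version of the binned estimator (3.7), using the recommended mean-lag bin representative, can be sketched as follows in Python; the locations and residuals are synthetic placeholders, and the half-maximum-distance cutoff is the rule of thumb discussed below:

```python
import numpy as np

def binned_semivariogram(s, e_hat, n_bins=10, max_dist=None):
    """Omnidirectional empirical semivariogram, as in (3.7): pairwise lags
    are grouped into distance bins, and each bin's representative lag length
    is the mean of the pair distances falling in it (not the bin center)."""
    s = np.asarray(s, dtype=float)
    e_hat = np.asarray(e_hat, dtype=float)
    i, j = np.triu_indices(len(s), k=1)            # all distinct pairs
    d = np.linalg.norm(s[i] - s[j], axis=1)
    sq = (e_hat[i] - e_hat[j]) ** 2
    if max_dist is None:
        max_dist = d.max() / 2.0                   # common rule of thumb
    edges = np.linspace(0.0, max_dist, n_bins + 1)
    which = np.digitize(d, edges) - 1              # bin index for each pair
    h_rep, g_hat, counts = [], [], []
    for u in range(n_bins):
        m = which == u
        if not m.any():
            continue                               # skip empty bins
        h_rep.append(d[m].mean())                  # mean-lag representative
        g_hat.append(0.5 * sq[m].mean())           # ordinate per (3.7)
        counts.append(int(m.sum()))
    return np.array(h_rep), np.array(g_hat), np.array(counts)

rng = np.random.default_rng(1)
s = rng.uniform(0.0, 10.0, size=(40, 2))           # irregular data locations
e_hat = rng.normal(size=40)                        # stand-in for OLS residuals
h_rep, g_hat, counts = binned_semivariogram(s, e_hat, n_bins=8)
```

Returning the pair count per bin alongside each ordinate makes it easy to apply sample-size rules of thumb and, later, the N(h_u) weights of the WLS fitting criterion.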

Each empirical semivariogram ordinate in the case of irregularly spaced data locations is approximately unbiased for its corresponding true semivariogram ordinate, as it is when the data locations form a regular grid, but there is an additional level of approximation or blurring in the irregularly spaced case due to the grouping of unequal lags into bins.

How many bins should be used to obtain the empirical semivariogram, and how large should they be? Clearly, there is a trade-off involved: the more bins that are used, the smaller they are and the better the lags in H_u are approximated by h_u, but the fewer the number of observed lags belonging to H_u (with the consequence that the sampling variation of the empirical semivariogram ordinate corresponding to that lag is larger). One popular rule of thumb is to require N(H_u) to be at least 30 and to require the length of h_u to be less than half the maximum lag length among data locations. But there may be many partitions that meet these criteria, and so the empirical semivariogram is not actually uniquely defined when data locations are irregularly spaced. Furthermore, as we shall see in the simulation below, at lags that are a substantial fraction of the dimensions of the observation domain, $\hat{γ}(h_u)$ may be highly variable even when N(H_u) is much larger than 30. The problem is that the various terms making up the sum in (3.7) are not independent, and the dependence can be particularly strong at larger lags.

Figure 3.2   Empirical semivariograms of simulated data obtained via three R programs.

One undesirable feature of the empirical semivariogram is its sensitivity to outliers, a consequence of each of its ordinates being a scaled sum of squares. An alternative and more robust estimator, due to Cressie and Hawkins (1980), is

$\bar{γ}(h_u) = \frac{\left\{ \frac{1}{N(H_u)} \sum_{s_i − s_j ∈ H_u} |\hat{e}(s_i) − \hat{e}(s_j)|^{1/2} \right\}^4}{0.914 + [0.988/N(H_u)]} \quad (u = 1, …, k).$
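The Cressie–Hawkins estimator for a single bin is easy to sketch in Python: given the absolute residual differences of the N(H_u) pairs falling in the bin, take the fourth power of the mean square-rooted difference and apply the bias-correcting denominator:

```python
import numpy as np

def cressie_hawkins(abs_diffs):
    """Cressie-Hawkins robust semivariogram estimate for one bin H_u, given
    the values |e_hat(s_i) - e_hat(s_j)| over the N(H_u) pairs in the bin:
    gamma_bar = mean(|diff|^(1/2))**4 / (0.914 + 0.988 / N)."""
    a = np.asarray(abs_diffs, dtype=float)
    n = a.size
    return np.mean(np.sqrt(a)) ** 4 / (0.914 + 0.988 / n)
```

The fourth root/fourth power construction downweights the influence of a single very large squared difference, which is what makes the estimator more resistant to outliers than the raw average of squared differences; like any variogram estimate, it scales quadratically when the residuals are scaled.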

As an example, let us consider empirical semivariograms obtained from three programs available in R with all arguments left at their default values. Specifically, we simulate an isotropic Gaussian process Υ with constant mean and exponential semivariogram with sill and range parameters equal to 1 on a 10 × 10 square grid with distance 1 between neighboring observations. (See Section 3.5 for definitions of the exponential semivariogram and its sill and range parameters.) Figure 3.2 shows the resulting empirical semivariograms using the command variog from geoR, the command est.variogram from sgeostat, and the command variogram from gstat. The first two programs do not automatically impose an upper bound on the distance lags, and we can see that the estimates of γ at the longer lags are very poor in this instance, even though, for example, for est.variogram from sgeostat, the estimate for the second longest lag (around 10.8) is based on 80 pairs of observations and the estimate for the third longest lag (around 9.5) is based on 326 pairs. For variogram in gstat, the default gives a largest lag of around 4.08. Another important difference between the gstat program and the other two is that gstat, as we recommend, uses the mean distance within the bin rather than the center of the bin as the ordinate on the horizontal axis. For haphazardly sited data, the differences between the two may often be small, but here we find that for regular data, the differences can be dramatic. In particular, gstat and sgeostat give the same value for $\hat{γ}$ at the shortest lag (0.6361), but gstat gives the corresponding distance as 1, whereas sgeostat gives this distance as 0.6364. In fact, with either program, every pair of points used in the estimator is exactly distance 1 apart, so the sgeostat result is quite misleading. It would appear that, in this particular setting, the default empirical variogram in gstat is superior to those in geoR and sgeostat. However, even with the best of programs, one should be very careful about using default parameter values for empirical semivariograms. Furthermore, even with well-chosen bins, it is important to recognize that empirical semivariograms do not necessarily contain all of the information in the data about the true semivariogram, especially, as noted by Stein (1999, Sec. 6.2), for differentiable processes.

#### 3.5  Modeling the Semivariogram

Next, it is standard practice to smooth the empirical semivariogram by fitting a parametric model to it. Why smooth the empirical semivariogram? There are several reasons. First, it is often quite bumpy; a smoothed version may be more reliable (have smaller variance) and therefore may increase our understanding of the nature of the spatial dependence. Second, the empirical semivariogram will often fail to be conditionally nonpositive definite, a property which must be satisfied to ensure that at the prediction stage to come, the prediction error variance is nonnegative at every point in D. Finally, prediction at arbitrary locations requires estimates of the semivariogram at lags neither included among the bin representatives h_1, …, h_k nor existing among the lags between data locations, and smoothing can provide these needed estimates.

To smooth the empirical semivariogram, a valid parametric model for the semivariogram and a method for fitting that model must be chosen. The choice of model among the collection of valid semivariogram models is informed by an examination of the empirical semivariogram, of course, but other considerations (prior knowledge, computational simplicity, sufficient flexibility) may be involved as well. The following three conditions are necessary and sufficient for a semivariogram model to be valid (provided that they hold for all θ ∈ Θ, where Θ is the parameter space for θ):

1. Vanishing at 0, i.e., γ(0; θ) = 0
2. Evenness, i.e., γ(−h; θ) = γ(h; θ) for all h
3. Conditional negative definiteness, i.e., $∑ i = 1 n ∑ j = 1 n a i a j γ ( s i − s j ; θ ) ≤ 0$ for all n, all s 1,…, s n , and all a 1, …,a n such that $∑ i = 1 n a i = 0$
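Condition 3 can be checked numerically for any candidate model. The sketch below does this for the (valid) exponential semivariogram at a random set of sites, using a random contrast vector whose entries sum to zero; all parameter values are illustrative:

```python
import numpy as np

def gamma_exp(h, sill=1.0, range_par=2.0):
    """Exponential semivariogram (illustrative parameters)."""
    return sill * (1.0 - np.exp(-np.asarray(h, dtype=float) / range_par))

rng = np.random.default_rng(0)
s = rng.uniform(0.0, 10.0, size=(20, 2))                     # arbitrary sites
D = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)   # distance matrix
G = gamma_exp(D)                                             # gamma(s_i - s_j)

a = rng.normal(size=20)
a -= a.mean()               # enforce sum(a) = 0, per condition 3
quad = a @ G @ a            # the quadratic form; must be <= 0
```

For a second-order stationary process the check is transparent: writing γ(h) = C(0) − C(h), the C(0) terms cancel because the a_i sum to zero, leaving the quadratic form equal to −a′Ca ≤ 0 since the covariance matrix C is positive semidefinite.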

Often, the empirical semivariogram tends to increase roughly with distance in any given direction, up to some point at least, indicating that the spatial dependence decays with distance. In other words, values of Υ(•) at distant locations tend to be less alike than values at locations in close proximity. This leads us to consider primarily those semivariogram models that are monotone increasing functions of the intersite distance (in any given direction). Note that this is not a requirement for validity, however. Moreover, the modeling of the semivariogram is made easier if isotropy can be assumed. The degree to which this assumption is tenable has sometimes been assessed informally via “rose diagrams” (Isaaks and Srivastava, 1989) or by comparing directional empirical semivariograms. It is necessary to make comparisons in at least three, and preferably more, directions so that geometric anisotropy can be distinguished from isotropy. Moreover, without some effort to attach uncertainty estimates to semivariogram ordinates, we consider it dangerous to assess isotropy based on visual comparisons of directional empirical semivariograms. Specifically, directional empirical semivariograms for data simulated from an isotropic model can appear to show clear anisotropies (e.g., the semivariogram in one direction being consistently higher than in another direction) that are due merely to random variation and the strong correlations that occur between estimated semivariogram ordinates at different lags. More formal tests for isotropy have recently been developed; see Guan, Sherman, and Calvin (2004).

A large variety of models satisfy the three aforementioned validity requirements (in R^2 and R^3), plus monotonicity and isotropy, but the following five appear to be the most commonly used:

• Spherical
$γ(h; θ) = \begin{cases} θ_1\left(\dfrac{3h}{2θ_2} − \dfrac{h^3}{2θ_2^3}\right) & \text{for } 0 ≤ h ≤ θ_2 \\ θ_1 & \text{for } h > θ_2 \end{cases}$
• Exponential
$γ(h; θ) = θ_1\{1 − \exp(−h/θ_2)\}$
• Gaussian
$γ(h; θ) = θ_1\{1 − \exp(−h^2/θ_2^2)\}$
• Matérn
$γ(h; θ) = θ_1\left(1 − \dfrac{(h/θ_2)^ν K_ν(h/θ_2)}{2^{ν−1} Γ(ν)}\right)$
where K_ν(•) is the modified Bessel function of the second kind of order ν
• Power
$γ(h; θ) = θ_1 h^{θ_2}$

These models are displayed in Figure 3.3. For each model, θ1 is positive; similarly, θ2 is positive in each model except the power model, for which it must satisfy 0 ≤ θ2 < 2. In the Matérn model, ν > 0. It can be shown that the Matérn models with ν = 0.5 and ν → ∞ coincide with the exponential and Gaussian models, respectively.
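The five models translate directly into code. A minimal Python sketch (the chapter's plots are made in R; scipy supplies the modified Bessel function K_ν):

```python
import numpy as np
from scipy.special import kv, gamma as gamma_fn  # K_nu and the Gamma function

def spherical(h, t1, t2):
    """Spherical: rises as 3h/(2*t2) - h^3/(2*t2^3), flat at t1 beyond t2."""
    h = np.asarray(h, dtype=float)
    g = t1 * (1.5 * h / t2 - 0.5 * (h / t2) ** 3)
    return np.where(h > t2, t1, g)

def exponential(h, t1, t2):
    return t1 * (1.0 - np.exp(-np.asarray(h, dtype=float) / t2))

def gaussian(h, t1, t2):
    return t1 * (1.0 - np.exp(-np.asarray(h, dtype=float) ** 2 / t2 ** 2))

def matern(h, t1, t2, nu):
    """Matern: t1 * (1 - (h/t2)^nu K_nu(h/t2) / (2^(nu-1) Gamma(nu)))."""
    h = np.asarray(h, dtype=float)
    x = h / t2
    with np.errstate(invalid="ignore"):   # 0 * inf at h = 0 handled below
        g = t1 * (1.0 - (x ** nu) * kv(nu, x) / (2.0 ** (nu - 1.0) * gamma_fn(nu)))
    return np.where(h == 0.0, 0.0, g)

def power(h, t1, t2):
    return t1 * np.asarray(h, dtype=float) ** t2
```

A quick sanity check of the ν = 0.5 claim: for K_{1/2}(x) = √(π/(2x)) e^{−x}, the Matérn expression collapses algebraically to the exponential model, and the two functions agree numerically at any lag.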

Several attributes of an isotropic semivariogram model are sufficiently important to single out. The sill of γ(h; θ) is defined as lim_{h→∞} γ(h; θ), provided that the limit exists. If this limit exists, then the process is not only intrinsically stationary, but also second-order stationary, and C(0; θ) coincides with the sill. Note that the spherical, exponential, Gaussian, and Matérn models have sills (equal to θ1 in each of the parameterizations given above), but the power model does not. Furthermore, if the sill exists, then the range of γ(h; θ) is the smallest value of h for which γ(h; θ) equals its sill, if such a value exists. If the range does not exist, there is a related notion of an effective range, defined as the smallest value of h for which γ(h; θ) is equal to 95% of its sill; in this case, the effective range is often a function of a single parameter called the range parameter. Of those models listed above that have a sill, only the spherical has a range (equal to θ2); however, the exponential and Gaussian models have effective ranges of approximately 3θ2 and √3 θ2, respectively, with θ2 then being the range parameter. Range parameters can be difficult to estimate even with quite large datasets, in particular when, as is often the case, the range is not much smaller than the dimensions of the observation region (see Chapter 6). This difficulty is perhaps an argument for using the power class of variograms, which is essentially the Matérn class for ν < 1 with the range set to infinity, thus avoiding the need to estimate a range.
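The stated effective ranges are easy to verify: at h = 3θ2 the exponential model, and at h = √3 θ2 the Gaussian model, both reach 1 − e^{−3} ≈ 0.950 of the sill. A short check with illustrative parameter values:

```python
import numpy as np

def gamma_exponential(h, t1, t2):
    return t1 * (1.0 - np.exp(-h / t2))

def gamma_gaussian(h, t1, t2):
    return t1 * (1.0 - np.exp(-h ** 2 / t2 ** 2))

t1, t2 = 1.0, 1.5   # illustrative sill and range parameter
# fraction of the sill attained at the claimed effective ranges
frac_exp = gamma_exponential(3.0 * t2, t1, t2) / t1          # about 0.950
frac_gau = gamma_gaussian(np.sqrt(3.0) * t2, t1, t2) / t1    # about 0.950
```

Both fractions equal 1 − e^{−3}, which is why 95% is the conventional threshold used to define the effective range for these two models.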

The Matérn model has an additional parameter ν known as the smoothness parameter, as the process Υ(•) is m times mean square differentiable if and only if ν > m. The smoothness of the semivariogram near the origin (i.e., at small lags) is a key attribute for efficient spatial prediction (Stein, 1988; Stein and Handcock, 1989). Finally, the nugget effect of γ(h; θ) is defined as lim_{h→0} γ(h; θ). The nugget effect is zero for all the models listed above, but a nonzero nugget effect can be added to any of them. For example, the exponential model with nugget effect θ3 is given by

3.8 $γ(h; θ) = \begin{cases} 0 & \text{if } h = 0 \\ θ_3 + θ_1\{1 − \exp(−h/θ_2)\} & \text{if } h > 0. \end{cases}$

One rationale for the nugget effect can be given in terms of the measurement error model (3.5). If η(•) in that model is intrinsically stationary and mean square continuous with a nuggetless exponential semivariogram, if ϵ(•) is an iid (white noise) measurement error process with variance θ3, and if η(•) and ϵ(•) are independent, then the semivariogram of e(•) will coincide with (3.8).

Figure 3.3   Semivariogram models.

Gaussian semivariograms correspond to processes that are extremely smooth—too much so to generally serve as good models for natural processes. For differentiable spatial processes, a Matern model with ν > 1, but not very large, is generally preferable. However, if one has an underlying smooth process with a sufficiently large nugget effect, it may sometimes not matter much whether one uses a Gaussian or Matern model. Spherical semivariograms are very popular in the geostatistical community, but less so among statisticians, in part because the semivariogram is only once differentiable in θ2 at θ2 = h, which leads to rather odd looking likelihood functions for the unknown parameters. There can be computational advantages to using semivariograms with a finite range if this range is substantially smaller than the dimensions of the observation domain, but even if one wants to use a semivariogram with finite range for computational reasons, there may be better alternatives than the spherical semivariogram (Furrer, Genton, and Nychka, 2006).

Any valid isotropic semivariogram model can be generalized to make it geometrically anisotropic, simply by replacing the argument h with $(h^T A h)^{1/2}$, where A is a d × d positive definite matrix of parameters. For example, a geometrically anisotropic exponential semivariogram in $R^2$ is given by

$\gamma(h; \theta) = \theta_1 \{ 1 - \exp[ -(h_1^2 + 2\theta_3 h_1 h_2 + \theta_4 h_2^2)^{1/2} / \theta_2 ] \}.$

Thus, for example, if θ3 = 0 and θ4 = 4, the effective range of the spatial correlation is twice as large in the E–W direction as in the N–S direction, and the effective range in all other directions is intermediate between these two. The isotropic exponential semivariogram corresponds to the special case θ3 = 0, θ4 = 1. Anisotropic models that are not geometrically anisotropic, so-called zonally anisotropic models, have sometimes been used, but they are problematic both theoretically and practically (see Zimmerman (1993)).
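To make the geometry concrete, here is a small sketch (Python/NumPy, our own illustration) of the geometrically anisotropic exponential semivariogram above, with θ3 and θ4 parameterizing the quadratic form in the lag.

```python
import numpy as np

def aniso_exp_semivariogram(h, theta1=1.0, theta2=1.0, theta3=0.0, theta4=1.0):
    """Geometrically anisotropic exponential semivariogram in R^2: the lag
    h = (h1, h2) enters only through the rescaled distance
    (h1^2 + 2*theta3*h1*h2 + theta4*h2^2)^(1/2)."""
    h = np.atleast_2d(np.asarray(h, dtype=float))
    d = np.sqrt(h[:, 0]**2 + 2.0*theta3*h[:, 0]*h[:, 1] + theta4*h[:, 1]**2)
    return theta1 * (1.0 - np.exp(-d / theta2))
```

With θ3 = 0 and θ4 = 4, the lag (2, 0) and the lag (0, 1) produce the same semivariogram value, reflecting the doubled effective range in the E–W direction relative to the N–S direction.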

Two main procedures for estimating the parameters of a chosen semivariogram model have emerged: weighted least squares (WLS) and maximum likelihood (ML) or its variant, restricted (or residual) maximum likelihood (REML). The WLS approach is very popular among practitioners due to its relative simplicity, but, because it is not based on an underlying probabilistic model for the spatial process, it is suboptimal and does not rest on as firm a theoretical footing as the likelihood-based approaches (though it is known to yield consistent and asymptotically normal estimators under certain regularity conditions and asymptotic frameworks; see Lahiri, Lee, and Cressie (2002)). Nevertheless, at least for nondifferentiable processes, its performance is not greatly inferior to that of the likelihood-based approaches (Zimmerman and Zimmerman, 1991; Lark, 2000). The remainder of this section describes the WLS approach only; likelihood-based approaches are the topic of the next chapter.

The WLS estimator of θ in the parametric model γ(h; θ) is given by

3.9 $\hat{\theta} = \operatorname{argmin}_{\theta} \sum_{u \in U} \frac{N(h_u)}{[\gamma(h_u; \theta)]^2} \, [\hat{\gamma}(h_u) - \gamma(h_u; \theta)]^2,$

where all quantities are defined as in the previous section. Observe that the weights, $N(h_u)/[\gamma(h_u; \theta)]^2$, are small if either $N(h_u)$ is small or $\gamma(h_u; \theta)$ is large. For the most commonly used semivariogram models (which are monotone increasing) and for typical spatial configurations of observations, this has the effect of assigning relatively less weight to ordinates of the empirical semivariogram corresponding to large lags. For further details on the rationale for these weights, see Cressie (1985); note, however, that the argument there is based on an assumption of independence between the terms in the sum in (3.9), so the weighting may give too much weight to larger lags. Since the weights depend on θ, the WLS estimator must be obtained iteratively, updating the weights on each iteration until convergence is deemed to have occurred.

Comparisons of two or more fitted semivariogram models are usually made rather informally. If the models are non-nested and have the same number of parameters (e.g., the spherical and exponential models, with nuggets), the minimized weighted residual sum of squares (the quantity minimized in (3.9)) might be used to choose from among the competing models. However, we are unaware of any good statistical arguments for such a procedure and, indeed, Stein (1999) argues that an overly strong emphasis on making parametric estimates of semivariograms match the empirical semivariogram represents a serious flaw in classical geostatistics.

#### 3.6  Reestimation of the Mean Function

Having estimated the second-order dependence structure of the random process, the geostatistical analysis may take either of two tacks next. If the analyst has no particular interest in estimating the effects of covariates on Y(•), then he or she may proceed directly to kriging, as described in the next section. If the analyst has such an interest, however, the next stage is to estimate the mean function again, but this time accounting for the second-order dependence structure. The estimation approach of choice in classical geostatistics is estimated generalized least squares (EGLS), which is essentially the same as generalized least squares (GLS) except that the variances and covariances of the elements of Y, which are assumed known for GLS, are replaced by estimates. Note that second-order stationarity, not merely intrinsic stationarity, of e(•) must be assumed here to ensure that these variances and covariances exist and are functions of lag only.

A sensible method for estimating the variances and covariances, and one which yields a positive definite estimated covariance matrix, is as follows. First, estimate the common variance of the Y(s_i)s by the sill of the fitted semivariogram model, $\gamma(h; \hat{\theta})$, obtained at the previous stage; denote this estimated variance by $\hat{C}(0)$. Then, motivated by (3.4), estimate the covariance between Y(s_i) and Y(s_j) for i ≠ j as $\hat{C}(s_i - s_j) = \hat{C}(0) - \gamma(s_i - s_j; \hat{\theta})$. These estimated variances and covariances may then be arranged appropriately to form an estimated variance–covariance matrix

$Σ ^ = ( C ^ ( s i − s j ) ) .$

The EGLS estimator of β, $\hat{\beta}_{EGLS}$, is then given by

$\hat{\beta}_{EGLS} = ( X^T \hat{\Sigma}^{-1} X )^{-1} X^T \hat{\Sigma}^{-1} Y .$

The sampling distribution of $\hat{\beta}_{EGLS}$ is much more complicated than that of the OLS or GLS estimator. It is known, however, that $\hat{\beta}_{EGLS}$ is unbiased under very mild conditions and that, if the process is Gaussian, its variance is larger than that of the GLS estimator that could be used were θ known, i.e., larger than $(X^T \Sigma^{-1} X)^{-1}$ (Harville, 1985). (Here, by “larger,” we mean that the difference, $\operatorname{var}(\hat{\beta}_{EGLS}) - (X^T \Sigma^{-1} X)^{-1}$, is nonnegative definite.) Nevertheless, for lack of a simple satisfactory alternative, the variance of $\hat{\beta}_{EGLS}$ is usually estimated by the plug-in estimator, $(X^T \hat{\Sigma}^{-1} X)^{-1}$.

If desired, the EGLS residuals, $Y - X\hat{\beta}_{EGLS}$, may be computed and the semivariogram reestimated from them. One may even iterate between mean estimation and semivariogram estimation several times, but, in practice, this procedure usually stops with the first EGLS fit. REML, described in Chapter 4, avoids this problem by estimating θ using only linear combinations of the observations whose distributions do not depend on β.

#### 3.7  Kriging

The final stage of a classical geostatistical analysis is to predict the values of Υ(•) at desired locations, perhaps even at all points, in D. Methods dedicated to this purpose are called kriging, after the South African mining engineer D. G. Krige, who was the first to develop and apply them. Krige's original method, now called ordinary kriging, was based on the special case of model (3.1) in which the mean is assumed to be constant. Here, we describe the more general method of universal kriging, which is identical to best linear unbiased prediction under model (3.1) with mean function assumed to be of the linear form (3.6).

Let s0 denote an arbitrary location in D; usually this will be an unsampled location, but it need not be. Consider the prediction of Y(s0) by a predictor, $\hat{Y}(s_0)$, that minimizes the prediction error variance, $\operatorname{var}[\hat{Y}(s_0) - Y(s_0)]$, among all predictors satisfying the following two properties:

1. Linearity, i.e., $\hat{Y}(s_0) = \lambda^T Y$, where λ is a vector of fixed constants
2. Unbiasedness, i.e., $E[\hat{Y}(s_0)] = E[Y(s_0)]$ for all β, or equivalently, $\lambda^T X = x(s_0)^T$

Suppose for the moment that the semivariogram of Y(•) is known. Then the solution to this constrained minimization problem, known as the universal kriging predictor of Y(s0), is given by

3.10 $\hat{Y}(s_0) = [ \gamma + X ( X^T \Gamma^{-1} X )^{-1} ( x_0 - X^T \Gamma^{-1} \gamma ) ]^T \Gamma^{-1} Y ,$

where $\gamma = [\gamma(s_1 - s_0), \ldots, \gamma(s_n - s_0)]^T$, Γ is the n × n symmetric matrix with ijth element $\gamma(s_i - s_j)$, and $x_0 = x(s_0)$. This result may be obtained using differential calculus and the method of Lagrange multipliers. A geometric proof, however, is more instructive, and we give one here.

Let us assume that the first component of x(s) is identically 1, which guarantees that the error of any linear predictor of Y(s0) satisfying the unbiasedness constraint is a contrast, so that its variance can be obtained from the semivariogram of Y(•). Let us also assume that there exists a linear predictor satisfying the unbiasedness constraint, and suppose $\lambda^T Y$ is such a predictor. Consider any other such predictor $\nu^T Y$ and set $\mu = \nu - \lambda$. Since $E(\lambda^T Y) = E(\nu^T Y)$ for all β, we must have $X^T \mu = 0$. And,

$\begin{aligned} \operatorname{var}\{\nu^T Y - Y(s_0)\} &= \operatorname{var}[\mu^T Y + \{\lambda^T Y - Y(s_0)\}] \\ &= \operatorname{var}(\mu^T Y) + \operatorname{var}\{\lambda^T Y - Y(s_0)\} + 2\operatorname{Cov}\{\mu^T Y, \lambda^T Y - Y(s_0)\} \\ &\geq \operatorname{var}\{\lambda^T Y - Y(s_0)\} + 2\operatorname{Cov}\{\mu^T Y, \lambda^T Y - Y(s_0)\} \\ &= \operatorname{var}\{\lambda^T Y - Y(s_0)\} + 2\mu^T(-\Gamma\lambda + \gamma). \end{aligned}$

If we can choose λ such that $\mu^T(-\Gamma\lambda + \gamma) = 0$ for all μ satisfying $X^T \mu = 0$, then λ is the solution we seek, since we then have $\operatorname{var}\{\nu^T Y - Y(s_0)\} \geq \operatorname{var}\{\lambda^T Y - Y(s_0)\}$ for any predictor $\nu^T Y$ satisfying the unbiasedness constraint. But, since the column space of X is the orthogonal complement of its left null space, this condition holds if and only if $-\Gamma\lambda + \gamma$ is in the column space of X, which is equivalent to the existence of a vector α satisfying $-\Gamma\lambda + X\alpha = -\gamma$. Putting this condition together with the unbiasedness constraint yields the following system of linear equations for λ and α:

$\begin{pmatrix} -\Gamma & X \\ X^T & 0 \end{pmatrix} \begin{pmatrix} \lambda \\ \alpha \end{pmatrix} = \begin{pmatrix} -\gamma \\ x_0 \end{pmatrix},$

where 0 indicates a matrix of zeroes. If Γ is invertible and X is of full rank, then simple row reductions yield λ as in (3.10).

The minimized value of the prediction error variance is called the (universal) kriging variance and is given by

3.11 $\sigma^2(s_0) = \gamma^T \Gamma^{-1} \gamma - ( X^T \Gamma^{-1} \gamma - x_0 )^T ( X^T \Gamma^{-1} X )^{-1} ( X^T \Gamma^{-1} \gamma - x_0 ).$
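Both the predictor and the kriging variance can be computed numerically by solving the bordered linear system from the derivation above. The following sketch (Python/NumPy, our own illustration) takes an arbitrary isotropic semivariogram function as input and assumes Γ is invertible and X is of full rank.

```python
import numpy as np

def universal_kriging(coords, y, X, s0, x0, gamma):
    """Universal kriging at s0: solve the bordered system
    [[-Gamma, X], [X^T, 0]] [lambda; alpha] = [-gamma_0; x_0]
    for the weights lambda, then evaluate the predictor lambda^T y
    and the kriging variance (3.11).
    `gamma` maps lag distances to semivariogram values."""
    n, p = X.shape
    D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    G = gamma(D)                                        # Gamma (n x n)
    g0 = gamma(np.linalg.norm(coords - s0, axis=1))     # gamma vector
    A = np.block([[-G, X], [X.T, np.zeros((p, p))]])
    lam = np.linalg.solve(A, np.concatenate([-g0, x0]))[:n]
    # kriging variance via (3.11)
    Gi_g = np.linalg.solve(G, g0)
    r = X.T @ Gi_g - x0
    var = g0 @ Gi_g - r @ np.linalg.solve(X.T @ np.linalg.solve(G, X), r)
    return lam @ y, var
```

A useful check: with a nuggetless semivariogram, predicting at an observed location returns the observed value with zero kriging variance, since kriging is then an exact interpolator.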

The universal kriging predictor is an example of the best linear unbiased predictor, or BLUP, as it is generally abbreviated. If Y(•) is Gaussian, the kriging variance can be used to construct a nominal 100(1 − α)% prediction interval for Y(s0), which is given by

$\hat{Y}(s_0) \pm z_{\alpha/2} \, \sigma(s_0),$

where 0 < α < 1 and $z_{\alpha/2}$ is the upper α/2 percentage point of the standard normal distribution. If Y(•) is Gaussian and γ(•) is known, then $\hat{Y}(s_0) - Y(s_0)$ is normally distributed and the coverage probability of this interval is exactly 1 − α.

If the covariance function of Y(•) exists and $\sigma = [C(s_1 - s_0), \ldots, C(s_n - s_0)]^T$, then the formula for the universal kriging predictor (3.10) holds with γ replaced by σ and Γ replaced by Σ. It is worthwhile to compare this formula to that for the best (minimum mean squared error) linear predictor when β is known: $x_0^T \beta + \sigma^T \Sigma^{-1} ( Y - X \beta )$. A straightforward calculation shows that the universal kriging predictor is of this form with β replaced by $\hat{\beta}_{GLS}$. Furthermore, the expression (3.11) for the kriging variance is replaced by

$C ( 0 ) − σ T Σ − 1 σ + ( x 0 − X T Σ − 1 σ ) T ( X T Σ − 1 X ) − 1 ( x 0 − X T Σ − 1 σ ) .$

The first two terms, $C(0) - \sigma^T \Sigma^{-1} \sigma$, correspond to the mean squared error of the best linear predictor, so the last term, which is always nonnegative, is the penalty for having to estimate β.
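As a quick numerical sanity check (our own, in Python/NumPy): for a second-order stationary model whose mean function includes an intercept, the semivariogram form (3.11) of the kriging variance and the covariance form above agree exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 5.0, size=(10, 2))
s0 = np.array([2.5, 2.5])
X = np.column_stack([np.ones(10), coords])   # intercept plus coordinates
x0 = np.array([1.0, s0[0], s0[1]])

C = lambda d: np.exp(-d / 2.0)               # exponential covariance, C(0) = 1
gam = lambda d: 1.0 - C(d)                   # the corresponding semivariogram

D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
d0 = np.linalg.norm(coords - s0, axis=1)

# covariance form: C(0) - sigma' Sigma^-1 sigma + penalty for estimating beta
Sigma, sig0 = C(D), C(d0)
Si_s = np.linalg.solve(Sigma, sig0)
rc = x0 - X.T @ Si_s
v_cov = 1.0 - sig0 @ Si_s + rc @ np.linalg.solve(X.T @ np.linalg.solve(Sigma, X), rc)

# semivariogram form (3.11)
G, g0 = gam(D), gam(d0)
Gi_g = np.linalg.solve(G, g0)
rg = X.T @ Gi_g - x0
v_gam = g0 @ Gi_g - rg @ np.linalg.solve(X.T @ np.linalg.solve(G, X), rg)
```

The agreement depends on the intercept assumption: with a 1 in the column space of X, the constraint $1^T \lambda = 1$ holds, which is what allows the two parameterizations to yield identical weights and variances.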

In practice, two modifications are usually made to the universal kriging procedure just described. First, to reduce the amount of computation required, the prediction of Y(s0) may be based not on the entire data vector Y, but only on those observations that lie in a specified neighborhood around s0. The range, the nugget-to-sill ratio, and the spatial configuration of data locations are important factors in choosing this neighborhood (for further details, see Cressie (1991, Sec. 3.2.1)). Generally speaking, larger nuggets require larger neighborhoods to obtain nearly optimal predictors. However, there is no simple relationship between the range and the neighborhood size. For example, Brownian motion is a process with no finite range for which the kriging predictor is based on just the two nearest neighbors. Conversely, there are processes with finite ranges for which observations beyond the range play a nontrivial role in the kriging predictor (Stein, 1999, p. 67). When a spatial neighborhood is used, the formulas for the universal kriging predictor and its associated kriging variance are of the same form as (3.10) and (3.11), but with γ and Y replaced by the subvectors, and Γ and X replaced by the submatrices, corresponding to the neighborhood.

The second modification reckons with the fact that the semivariogram appearing in (3.10) and (3.11) is in reality unknown. It is common practice to substitute $\hat{\gamma} = \gamma(\hat{\theta})$ and $\hat{\Gamma} = \Gamma(\hat{\theta})$ for γ and Γ in (3.10) and (3.11), where $\hat{\theta}$ is an estimate of θ obtained by, say, WLS. The resulting empirical universal kriging predictor is no longer a linear function of the data, but remarkably it remains unbiased under quite mild conditions (Kackar and Harville, 1981). The empirical kriging variance tends to underestimate the actual prediction error variance of the empirical universal kriging predictor because it does not account for the additional error incurred by estimating θ. Zimmerman and Cressie (1992) give a modified estimator of the prediction error variance of the empirical universal kriging predictor, which performs well when the spatial dependence is not too strong. However, Bayesian methods are arguably a more satisfactory approach for dealing with the uncertainty of spatial dependence parameters in prediction (see Handcock and Stein (1993)). Another possibility is to estimate the prediction error variance via a parametric bootstrap (Sjostedt-de Luna and Young, 2003).

Universal kriging yields a predictor that is a “location estimator” of the conditional distribution of Y(s0) given Y; indeed, if the error process e(•) is Gaussian, the universal kriging predictor coincides with the conditional mean, E(Y(s0) | Y) (assuming the semivariogram is known and putting a flat improper prior on any mean parameters). If the error process is non-Gaussian, then the optimal predictor, the conditional mean, is generally a nonlinear function of the observed data. Variants, such as disjunctive kriging and indicator kriging, have been developed for spatial prediction of conditional means or conditional probabilities for non-Gaussian processes (see Cressie, 1991, pp. 278–283), but we are not keen about them: the first is based upon strong, difficult-to-verify assumptions, and the second tends to yield unstable estimates of conditional probabilities. In our view, if the process appears to be badly non-Gaussian and a transformation doesn't make it sufficiently Gaussian, then the analyst should “bite the bullet” and develop a decent non-Gaussian model for the data.

The foregoing has considered point kriging, i.e., prediction at a single point. Sometimes a block kriging predictor, i.e., a predictor of the average value $Y(B) \equiv \int_B Y(s)\,ds / |B|$ over a region (block) B ⊂ D of positive d-dimensional volume |B|, is desired rather than predictors of Y(•) at individual points. Historically, for example, mining engineers were interested in this because the economics of mining required the extraction of material in relatively large blocks. Expressions for the universal block kriging predictor of Y(B) and its associated kriging variance are identical to (3.10) and (3.11), respectively, but with $\gamma = [\gamma(B, s_1), \ldots, \gamma(B, s_n)]^T$ and $x_0 = [x_1(B), \ldots, x_p(B)]^T$ (where p is the number of covariates in the linear mean function), where $\gamma(B, s_i) = |B|^{-1} \int_B \gamma(u - s_i)\,du$ and $x_j(B) = |B|^{-1} \int_B x_j(u)\,du$.
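The block-averaged quantities $\gamma(B, s_i)$ and $x_j(B)$ rarely have closed forms, but since each is a simple average over B, they can be approximated numerically. The sketch below (Python/NumPy, our own illustration) averages a point semivariogram over a grid of points discretizing B.

```python
import numpy as np

def gamma_block(block_points, s_i, gamma):
    """Approximate gamma(B, s_i) = |B|^{-1} * integral_B gamma(u - s_i) du
    by averaging the point semivariogram over points filling the block B."""
    d = np.linalg.norm(np.asarray(block_points) - np.asarray(s_i), axis=1)
    return gamma(d).mean()

# example block: B = the unit square, discretized by a 100 x 100 grid
# of cell centers (an equal-weight quadrature rule)
m = 100
c = (np.arange(m) + 0.5) / m
U = np.stack(np.meshgrid(c, c), axis=-1).reshape(-1, 2)
```

The same grid-averaging device applies to $x_j(B)$; finer grids (or Monte Carlo sampling of B) trade computation for accuracy in the usual way.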

Throughout this chapter, it was assumed that a single spatially distributed variable, namely Y(•), was of interest. In some situations, however, there may be two or more variables of interest, and the analyst may wish to study how these variables co-vary across the spatial domain and/or predict their values at unsampled locations. These problems can be handled by a multivariate generalization of the univariate geostatistical approach we have described. In this multivariate approach, $\{ Y(s) \equiv [Y_1(s), \ldots, Y_m(s)]^T : s \in D \}$ represents the m-variate spatial process of interest, and a model Y(s) = µ(s) + e(s) analogous to (3.1) is adopted in which the second-order variation is characterized by either a set of m semivariograms and m(m − 1)/2 cross-semivariograms $\gamma_{ij}(h) = \frac{1}{2} \operatorname{var}[ Y_i(s) - Y_j(s + h) ]$, or a set of m covariance functions and m(m − 1)/2 cross-covariance functions $C_{ij}(h) = \operatorname{Cov}[ Y_i(s), Y_j(s + h) ]$, depending on whether intrinsic or second-order stationarity is assumed. These functions can be estimated and fitted in a manner analogous to what we described for univariate geostatistics; likewise, the best (in a certain sense) linear unbiased predictor of Y(s0) at an arbitrary location s0 ∈ D, based on the observed values Y(s1),…, Y(s_n), can be obtained by an extension of kriging known as cokriging. Good sources for further details are Ver Hoef and Cressie (1993) and Chapter 27 in this book.

While we are strong supporters of the general geostatistical framework for analyzing spatial data, we have, as we have indicated, a number of concerns about common geostatistical practices. For a presentation of geostatistics from the perspective of “geostatisticians” (that is, researchers who can trace their lineage to Georges Matheron and the French School of Geostatistics), we recommend the book by Chilès and Delfiner (1999).

#### References

Chilès, J.P. and Delfiner, P. (1999). Geostatistics: Modeling Spatial Uncertainty. New York: John Wiley & Sons.
Cressie, N. (1985). Fitting variogram models by weighted least squares. Journal of the International Association for Mathematical Geology, 17, 563–586.
Cressie, N. (1986). Kriging nonstationary data. Journal of the American Statistical Association, 81, 625–634.
Cressie, N. (1991). Statistics for Spatial Data. New York: John Wiley & Sons.
Cressie, N. and Hawkins, D.M. (1980). Robust estimation of the variogram, I. Journal of the International Association for Mathematical Geology, 12, 115–125.
Furrer, R., Genton, M.G., and Nychka, D. (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics, 15, 502–523.
Guan, Y., Sherman, M., and Calvin, J.A. (2004). A nonparametric test for spatial isotropy using subsampling. Journal of the American Statistical Association, 99, 810–821.
Handcock, M.S. and Stein, M.L. (1993). A Bayesian analysis of kriging. Technometrics, 35, 403–410.
Harville, D.A. (1985). Decomposition of prediction error. Journal of the American Statistical Association, 80, 132–138.
Im, H.K., Stein, M.L., and Zhu, Z. (2007). Semiparametric estimation of spectral density with irregular observations. Journal of the American Statistical Association, 102, 726–735.
Isaaks, E.H. and Srivastava, R.M. (1989). An Introduction to Applied Geostatistics. London: Academic Press.
Kackar, R.N. and Harville, D.A. (1981). Unbiasedness of two-stage estimation and prediction procedures for mixed linear models. Communications in Statistics-Theory and Methods, 10, 1249–1261.
Lahiri, S.N., Lee, Y., and Cressie, N. (2002). On asymptotic distribution and asymptotic efficiency of least squares estimators of spatial variogram parameters. Journal of Statistical Planning and Inference, 103, 65–85.
Lark, R.M. (2000). Estimating variograms of soil properties by the method-of-moments and maximum likelihood. European Journal of Soil Science, 51, 717–728.
Sjostedt-de Luna, S. and Young, A. (2003). The bootstrap and kriging prediction intervals. Scandinavian Journal of Statistics, 30, 175–192.
Stein, M.L. (1988). Asymptotically efficient prediction of a random field with a misspecified covariance function. Annals of Statistics, 16, 55–63.
Stein, M.L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. New York: Springer.
Stein, M.L. and Handcock, M.S. (1989). Some asymptotic properties of kriging when the covariance function is misspecified. Mathematical Geology, 21, 171–190.
Ver Hoef, J.M. and Cressie, N. (1993). Multivariable spatial prediction. Mathematical Geology, 25, 219–240.
Wahba, G. (1990). Spline Models for Observational Data. Philadelphia: SIAM.
Zimmerman, D.L. (1993). Another look at anisotropy in geostatistics. Mathematical Geology, 25, 453–470.
Zimmerman, D.L. and Cressie, N. (1992). Mean squared prediction error in the spatial linear model with estimated covariance parameters. Annals of the Institute of Statistical Mathematics, 44, 27–43.
Zimmerman, D.L. and Zimmerman, M.B. (1991). A comparison of spatial semivariogram estimators and corresponding ordinary kriging predictors. Technometrics, 33, 77–91.
