Title: | Robust Small Area Estimation |
---|---|
Description: | Empirical best linear unbiased prediction (EBLUP) and robust prediction of the area-level means under the basic unit-level model. The model can be fitted by maximum likelihood or a (robust) M-estimator. Mean square prediction error is computed by a parametric bootstrap. |
Authors: | Tobias Schoch [aut, cre] , Burkardt John [cph] (Fortran 90 subroutine zero_rc) |
Maintainer: | Tobias Schoch <[email protected]> |
License: | GPL-3 |
Version: | 0.3 |
Built: | 2024-11-07 05:42:54 UTC |
Source: | https://github.com/tobiasschoch/rsae |
The package implements methods to fit the basic unit-level model (also known as model type "B" in Rao, 2003, or nested-error regression model in Battese et al., 1988), to predict area-specific means by the empirical best linear unbiased predictor (EBLUP) or a robust prediction method (Copt and Victoria-Feser, 2009; Heritier et al., 2011), and to compute the mean square prediction error of the predicted area-level means by a parametric bootstrap (Sinha and Rao, 2009; see also Hall and Maiti, 2006; Lahiri, 2003).
The methods are discussed in Schoch (2012).
maximum likelihood estimator
Huber-type M-estimator (RML II of Richardson and Welsh, 1995, not the method proposed in Sinha and Rao, 2009); see Schoch (2012) for details
Data analysis involves the following steps:
prepare the data/ specify the model for estimation;
see saemodel()
fit the model by various (robust) methods; see
fitsaemodel()
(robustly) predict the random effects and the area
means; see robpredict()
Battese, G. E., Harter, R. M., and W.A. Fuller (1988). An error component model for prediction of county crop areas using. Journal of the American Statistical Association 83, 28–36. doi:10.2307/2288915
Copt, S. and M.-P. Victoria-Feser (2009). Robust Predictions in Mixed Linear Models, Research Report, University of Geneva.
Lahiri, P. (2003). On the impact of bootstrap in survey sampling and small area estimation. Statistical Science 18, 199–210. doi:10.1214/ss/1063994975
Hall, P. and T. Maiti (2006). On parametric bootstrap methods for small area prediction. Journal of the Royal Statistical Society. Series B 68, 221–238. doi:10.1111/j.1467-9868.2006.00541.x
Heritier, S., Cantoni, E., Copt, S., and M.-P. Victoria-Feser (2009). Robust methods in Biostatistics, New York: John Wiley and Sons.
Rao, J.N.K. (2003). Small Area Estimation, New York: John Wiley and Sons.
Richardson, A.M. and A.H. Welsh (1995). Robust restricted maximum likelihood in mixed linear model. Biometrics 51, 1429–1439. doi:10.2307/2533273
Schoch, T. (2012). Robust Unit-Level Small Area Estimation: A Fast Algorithm for Large Datasets. Austrian Journal of Statistics 41, 243–265. doi:10.17713/ajs.v41i4.1548
Sinha, S.K. and J.N.K. Rao (2009). Robust small area estimation. Canadian Journal of Statistics 37, 381–399. doi:10.1002/cjs.10029
fitsaemodel
fits SAE models that have been specified by
saemodel()
(or synthetic data generated by
makedata()
) for various (robust) estimation
methods.
fitsaemodel(method, model, ...) convergence(object) ## S3 method for class 'fit_model_b' print(x, digits = max(3L, getOption("digits") - 3L), ...) ## S3 method for class 'summary_fit_model_b' print(x, digits = max(3L, getOption("digits") - 3L), ...) ## S3 method for class 'fit_model_b' summary(object, ...) ## S3 method for class 'fit_model_b' coef(object, type = "both", ...)
fitsaemodel(method, model, ...) convergence(object) ## S3 method for class 'fit_model_b' print(x, digits = max(3L, getOption("digits") - 3L), ...) ## S3 method for class 'summary_fit_model_b' print(x, digits = max(3L, getOption("digits") - 3L), ...) ## S3 method for class 'fit_model_b' summary(object, ...) ## S3 method for class 'fit_model_b' coef(object, type = "both", ...)
method |
|
model |
an object of class |
digits |
|
object |
an object of class |
x |
an object of class |
type |
|
... |
additional arguments passed on to
|
Function fitsaemodel()
computes the following estimators:
maximum likelihood (ML): method = "ml"
,
Huber-type M-estimation: method = "huberm"
; this
method is called RML II by Richardson and Welsh (1995); see
Schoch (2012)
The ML is not robust against outliers.
The call for the Huber-type M-estimator (with Huber psi-function) is:
fitsaemodel(method = "huberm", model, k)
, where k
is
the robustness tuning constant of the Huber psi-function,
.
By default, the computation of the M-estimator is initialized by a robust estimate that derives from a fixed-effects model (centered by the median instead of the mean); see Schoch (2012) for the details.
If the data are supposed to be heavily contaminated (or if the
default algorithm did not converge), one may try to initialize
the algorithm underlying fitsaemodel()
by a high breakdown-point
estimate. The package offers two initialization methods:
NOTE: the robustbase package (Maechler et al., 2022)
must be installed to use this functionality.
init = "lts"
: least trimmed squares (LTS)
regression from robustbase; see
ltsReg() and Rousseeuw and Van
Driessen (2006),
init = "s"
: regression S-estimator from
robustbase; see lmrob() and
Maronna et al. (2019).
For small and medium size datasets, both methods are equivalent in terms of computation time. For large data, the S-estimator is considerably faster.
The methods are computed by (nested) iteratively re-weighted least
squares and a derivative of Richard Brent's zeroin
algorithm;
see Brent (2013, Chapter 4). The functions depend on the subroutines in
BLAS (Blackford et al., 2002) and LAPACK (Anderson et al., 2000); see
Schoch (2012).
An instance of the class "fitmodel"
Anderson, E., Bai, Z., Bischof, C., Blackford, L. S., Demmel, J., Dongarra, J., et al. (2000). LAPACK users' guide (3rd ed.). Philadelphia: Society for Industrial and Applied Mathematics (SIAM).
Blackford, L.S., Petitet, A., Pozo, R., Remington, K., Whaley, R.C., Demmel, J., et al. (2002). An updated set of basic linear algebra subprograms (BLAS). ACM Transactions on Mathematical Software 28, 135–151. doi:10.1145/567806.567807
Brent, R.P. (2013). Algorithms for minimization without derivatives. Mineola (NY): Dover Publications Inc. (This publication is an unabridged republication of the work originally published by Prentice-Hall Inc., Englewood Cliffs, NJ, in 1973).
Maechler, M., Rousseeuw, P., Croux, C., Todorov, V., Ruckstuhl, A., Salibian-Barrera, M., Verbeke, T., Koller, M., Conceicao, E.L.T. and M. Anna di Palma (2022). robustbase: Basic Robust Statistics R package version 0.95-0. https://CRAN.R-project.org/package=robustbase
Maronna, R.A., Martin, D., V.J. Yohai and M. Salibian-Barrera (2019): Robust statistics: Theory and methods. Chichester: John Wiley and Sons, 2nd ed.
Richardson, A.M. and A.H. Welsh (1995). Robust restricted maximum likelihood in mixed linear model. Biometrics 51, 1429–1439. doi:10.2307/2533273
Rousseeuw, P. J. and K. Van Driessen (2006). Computing LTS regression for large data sets. Data Mining and Knowledge Discovery 12, 29–45. doi:10.1007/s10618-005-0024-4
Schoch, T. (2012). Robust Unit-Level Small Area Estimation: A Fast Algorithm for Large Datasets. Austrian Journal of Statistics 41, 243–265. doi:10.17713/ajs.v41i4.1548
# use the landsat data head(landsat) # define the saemodel using the landsat data model <- saemodel(formula = HACorn ~ PixelsCorn + PixelsSoybeans, area = ~CountyName, data = subset(landsat, subset = (outlier == FALSE))) # summary of the model summary(model) # maximum likelihood estimates fitsaemodel("ml", model) # Huber M-estimate with robustness tuning constant k = 2 m <- fitsaemodel("huberm", model, k = 2) m # summary of the fitted model/ estimates summary(m) # obtain more information about convergence convergence(m) # extract the fixed effects coef(m, "fixef") # extract the random effects coef(m, "ranef") # extract both coef(m)
# use the landsat data head(landsat) # define the saemodel using the landsat data model <- saemodel(formula = HACorn ~ PixelsCorn + PixelsSoybeans, area = ~CountyName, data = subset(landsat, subset = (outlier == FALSE))) # summary of the model summary(model) # maximum likelihood estimates fitsaemodel("ml", model) # Huber M-estimate with robustness tuning constant k = 2 m <- fitsaemodel("huberm", model, k = 2) m # summary of the fitted model/ estimates summary(m) # obtain more information about convergence convergence(m) # extract the fixed effects coef(m, "fixef") # extract the random effects coef(m, "ranef") # extract both coef(m)
fitsaemodel
This function is used to define global settings and parameters that
are used by fitsaemodel()
.
fitsaemodel.control(niter = 40, iter = c(200, 200), acc = 1e-05, dec = 0, decorr = 0, init = "default", k_Inf = 20000, ...)
fitsaemodel.control(niter = 40, iter = c(200, 200), acc = 1e-05, dec = 0, decorr = 0, init = "default", k_Inf = 20000, ...)
niter |
|
iter |
|
acc |
|
dec |
|
decorr |
|
init |
|
k_Inf |
|
... |
additional arguments (not used). |
Changing the default values of the parameters may result in failure of convergence or loss of convergence speed.
A list with entries
niter
iter
acc
k_Inf
init
dec
decorr
add
# use the landsat data head(landsat) # define the saemodel using the landsat data model <- saemodel(formula = HACorn ~ PixelsCorn + PixelsSoybeans, area = ~CountyName, data = subset(landsat, subset = (outlier == FALSE))) # summary of the model summary(model) # obtain the maximum likelihood estimates with, for instance, 'niter = 50' # number of outer-loop iterations (by default: niter = 40). Here, we use # 'niter = 50' for the sake of demonstration, not because it is needed. fitsaemodel("ml", model, niter = 50)
# use the landsat data head(landsat) # define the saemodel using the landsat data model <- saemodel(formula = HACorn ~ PixelsCorn + PixelsSoybeans, area = ~CountyName, data = subset(landsat, subset = (outlier == FALSE))) # summary of the model summary(model) # obtain the maximum likelihood estimates with, for instance, 'niter = 50' # number of outer-loop iterations (by default: niter = 40). Here, we use # 'niter = 50' for the sake of demonstration, not because it is needed. fitsaemodel("ml", model, niter = 50)
The landsat
data is a compilation of survey and satellite
data from Battese et al. (1988). It consists of data on segments
(primary sampling unit; 1 segement approx. 250 hectares) under
corn and soybeans for 12 counties in north-central Iowa.
data(landsat)
data(landsat)
A data frame with 37 observations on the following 10 variables.
SegmentsInCounty
number of segments per county.
SegementID
sample segment identifier (per county).
HACorn
hectares of corn for each sample segment (as reported in the June 1978 Enumerative Survey).
HASoybeans
hectares of soybeans for each sample segment (as reported in the June 1978 Enumerative Survey).
PixelsCorn
number of pixels classified as corn for each sample segment (LANDSAT readings).
PixelsSoybeans
number of pixels classified as soybeans for each sample segment (LANDSAT readings).
MeanPixelsCorn
county-specific mean number of pixels classified as corn.
MeanPixelsSoybeans
county-specific mean number of pixels classified as soybeans.
outlier
outlier indicator; observation number 33 is flagged as outlier.
CountyName
county names (factor variable): Cerro Gordo, Hamilton, Worth, Humboldt, Franklin, Pocahontas, Winnebago, Wright, Webster, Hancock, Kossuth, Hardin.
The landsat
data in Battese et al. (1988) is a compilation
of the LANDSAT satellite data from the U.S. Department of Agriculture
(USDA) and the 1978 June Enumerative Survey.
The survey data on the areas under corn and soybeans (reported in hectares) in the 37 segments of the 12 counties (north-central Iowa) have been determined by USDA Statistical Reporting Service staff, who interviewed farm operators. A segment is about 250 hectares.
For the LANDSAT satellite data, information is recorded as "pixels". A pixel is about 0.45 hectares. The USDA has been engaged in research toward transforming satellite information into good estimates of crop areas at the individual pixel and segments level. The satellite (LANDSAT) readings were obtained during August and September 1978.
Data for more than one sample segment are available for several counties (i.e., unbalanced data).
Observations No. 33 has been flaged as outlier; see Battese et al. (1988, p. 28).
The landsat
data is from Table 1 of Battese et al. (1988, p. 29).
Battese, G. E., Harter, R. M., and W.A. Fuller (1988). An error component model for prediction of county crop areas using. Journal of the American Statistical Association 83, 28–36. doi:10.2307/2288915
head(landsat)
head(landsat)
The landsat
data is a compilation of survey and satellite
data from Battese et al. (1988). The county-specific population means of
pixels of the segments under corn and soybeans, respectively, are available
in the data.frame landsat_means
.
data(landsat_means)
data(landsat_means)
A data frame with 12 observations (counties) on the following 3 variables.
(intercept)
all ones.
MeanPixelsCorn
county-specific mean of number of pixels classified as corn (LANDSAT readings).
MeanPixelsSoybeans
county-specific number of pixels classified as soybeans (LANDSAT readings).
The data.frame landsat_means
is an aggregation of the data.frame
landsat
.
Battese, G. E., Harter, R. M., and W.A. Fuller (1988). An error component model for prediction of county crop areas using. Journal of the American Statistical Association 83, 28–36. doi:10.2307/2288915
head(landsat_means)
head(landsat_means)
This function generates synthetic data (possibly contaminated by outliers) for the basic unit-level SAE model.
makedata(seed = 1024, intercept = 1, beta = 1, n = 4, g = 20, areaID = NULL, ve = 1, ve.contam = 41, ve.epsilon = 0, vu = 1, vu.contam = 41, vu.epsilon = 0)
makedata(seed = 1024, intercept = 1, beta = 1, n = 4, g = 20, areaID = NULL, ve = 1, ve.contam = 41, ve.epsilon = 0, vu = 1, vu.contam = 41, vu.epsilon = 0)
seed |
|
intercept |
|
beta |
|
n |
|
g |
|
areaID |
|
ve |
|
ve.contam |
|
ve.epsilon |
|
vu |
|
vu.contam |
|
vu.epsilon |
|
Let denote an area-specific
-vector of
the response variable for the areas
. Define a
-matrix
of realizations
from the std. normal distribution,
, and let
denote a
-vector of regression coefficients. Now, the
are drawn using the law
with
and
the variances of the model error and random-effect
variance, respectively, and
and
denoting
the identity matrix and matrix of ones, respectively.
In addition, we allow the distribution of the model/residual and
area-level random effect to be contaminated (cf. Stahel and Welsh, 1997).
Notably, the laws of and
are replaced
by the Tukey-Huber contamination mixture:
where and
regulate the degree of contamination;
and
define the variance of the contamination part of the mixture distribution.
Four different contamination setups are possible:
no contamination (i.e., ve.epsilon = vu.epsilon = 0
),
contaminated model error (i.e., ve.epsilon != 0
and
vu.epsilon = 0
),
contaminated random effect (i.e., ve.epsilon = 0
and
vu.epsilon != 0
),
both are conaminated (i.e., ve.epsilon != 0
and
vu.epsilon != 0
).
An instance of the class saemodel
.
Schoch, T. (2012). Robust Unit-Level Small Area Estimation: A Fast Algorithm for Large Datasets. Austrian Journal of Statistics 41, 243–265. doi:10.17713/ajs.v41i4.1548
Stahel, W. A. and A. Welsh (1997). Approaches to robust estimation in the simplest variance components model. Journal of Statistical Planning and Inference 57, 295–319. doi:10.1016/S0378-3758(96)00050-X
# generate a model with synthetic data model <- makedata() model # summary of the model summary(model)
# generate a model with synthetic data model <- makedata() model # summary of the model summary(model)
Function robpredict()
predicts the area-level means by (1) the
empirical best linear unbiased predictor (EBLUP) or (2) a robust
prediction method which is due to Copt and Victoria-Feser (2009).
In addition, the function computes the mean square prediction
error (MSPE) of the predicted area-level means by a parametric
bootstrap method.
robpredict(fit, areameans = NULL, k = NULL, reps = NULL, seed = 1024, progress_bar = TRUE) ## S3 method for class 'pred_model_b' print(x, digits = max(3L, getOption("digits") - 3L), ...) ## S3 method for class 'pred_model_b' plot(x, type = "e", sort = NULL, ...) ## S3 method for class 'pred_model_b' residuals(object, ...) ## S3 method for class 'pred_model_b' as.matrix(x, ...) ## S3 method for class 'pred_model_b' head(x, n = 6L, ...) ## S3 method for class 'pred_model_b' tail(x, n = 6L, keepnums = TRUE, addrownums, ...)
robpredict(fit, areameans = NULL, k = NULL, reps = NULL, seed = 1024, progress_bar = TRUE) ## S3 method for class 'pred_model_b' print(x, digits = max(3L, getOption("digits") - 3L), ...) ## S3 method for class 'pred_model_b' plot(x, type = "e", sort = NULL, ...) ## S3 method for class 'pred_model_b' residuals(object, ...) ## S3 method for class 'pred_model_b' as.matrix(x, ...) ## S3 method for class 'pred_model_b' head(x, n = 6L, ...) ## S3 method for class 'pred_model_b' tail(x, n = 6L, keepnums = TRUE, addrownums, ...)
fit |
an object of class |
areameans |
|
k |
|
reps |
|
seed |
|
progress_bar |
|
x |
an object of class |
digits |
|
type |
|
sort |
|
object |
an object of class |
n |
|
keepnums |
in each dimension, if no names in that dimension are
present, create them using the indices included in that dimension.
Ignored if |
addrownums |
deprecated - |
... |
additional arguments (not used). |
Function robpredict()
computes predictions of the area-level means
and—if required—an estimate of the area-specific mean square
prediction error (MSPE).
Case 1: If areameans
is a matrix
with
area-level means of the explanatory variables, then
the computation of the fixed effects effects are
based on areameans
.
Case 2: If areameans = NULL
, then the predictions
are based on the sample data that have been used to
fit the model.
If reps = NULL
, the number of bootstrap
replicates is not specified; hence, MSPE is not computed.
If reps
is a positive integer and
areameans
is not NULL
(see Case 1 above), then
a (robust) parametric bootstrap estimate of MSPE is computed
as proposed by Sinha and Rao (2009); see also Lahiri (2003)
and Hall.
The EBLUP obtains if k = NULL
, i.e., if the
robustness tuning constant k
is unspecified.
Robust predictions of the area-level means are
computed if k
is a nonnegative real number.
Small values of k
imply that outliers are
heavily downweighted; formally, the EBLUP corresponds
to choosing the tuning constant k
equal to
infinity. The value of the tuning constant k
specified in robpredict()
can be different
from the tuning constant k
used in fitting
the model. The robust prediction method is due to Copt
and Victoria-Feser (2009); see also Heritier et al.
(2009, 113-114) and differs from the method in Sinha
and Rao (2009).
An instance of the S3 class pred_model_b
Users of Rgui.exe
on Windows are recommended to call
robpredict()
with argument progress_bar = FALSE
because Rgui.exe
does not handle calls to
txtProgressBar()
well (the
execution time of the same job increases and it tends to stall the
execution of R). Users of R-Studio
and Rterm.exe
are not affected.
Copt, S. and M.-P. Victoria-Feser (2009). Robust Predictions in Mixed Linear Models, Research Report, University of Geneva.
Lahiri, P. (2003). On the impact of bootstrap in survey sampling and small area estimation. Statistical Science 18, 199–210. doi:10.1214/ss/1063994975
Hall, P. and T. Maiti (2006). On parametric bootstrap methods for small area prediction. Journal of the Royal Statistical Society. Series B 68, 221–238. doi:10.1111/j.1467-9868.2006.00541.x
Heritier, S., Cantoni, E., Copt, S., and M.-P. Victoria-Feser (2009). Robust methods in biostatistics. New York: John Wiley and Sons.
Schoch, T. (2012). Robust Unit-Level Small Area Estimation: A Fast Algorithm for Large Datasets. Austrian Journal of Statistics 41, 243–265. doi:10.17713/ajs.v41i4.1548
Sinha, S.K. and J.N.K. Rao (2009). Robust small area estimation. Canadian Journal of Statistics 37, 381–399. doi:10.1002/cjs.10029
saemodel()
, makedata()
,
fitsaemodel()
# use the landsat data head(landsat) # set up the model model <- saemodel(formula = HACorn ~ PixelsCorn + PixelsSoybeans, area = ~CountyName, data = subset(landsat, subset = (outlier == FALSE))) # summary of the model summary(model) # Huber M-estimate with robustness tuning constant k = 2 m <- fitsaemodel("huberm", model, k = 2) m # summary of the fitted model/ estimates summary(m) # robust prediction of the random effects and the area-level means (robust # EBLUP) using the counts-specific means (landsat_means) head(landsat_means) # for robust prediction, we use the robustness tuning constant 'k = 1.8' m_predicted <- robpredict(m, landsat_means, k = 1.8) head(m_predicted) # extract prediction as matrix as.matrix(m_predicted) # extract residuals from the predictions head(residuals(m_predicted)) # prediction incl. MSPE; parametric bootstrap with only 'reps = 10' # replications (for demonstration purposes; in practice, 'reps' should be # considerably larger) m_predicted_mspe <- robpredict(m, landsat_means, k = 1.8, reps = 10, progress_bar = FALSE) head(m_predicted_mspe)
# use the landsat data head(landsat) # set up the model model <- saemodel(formula = HACorn ~ PixelsCorn + PixelsSoybeans, area = ~CountyName, data = subset(landsat, subset = (outlier == FALSE))) # summary of the model summary(model) # Huber M-estimate with robustness tuning constant k = 2 m <- fitsaemodel("huberm", model, k = 2) m # summary of the fitted model/ estimates summary(m) # robust prediction of the random effects and the area-level means (robust # EBLUP) using the counts-specific means (landsat_means) head(landsat_means) # for robust prediction, we use the robustness tuning constant 'k = 1.8' m_predicted <- robpredict(m, landsat_means, k = 1.8) head(m_predicted) # extract prediction as matrix as.matrix(m_predicted) # extract residuals from the predictions head(residuals(m_predicted)) # prediction incl. MSPE; parametric bootstrap with only 'reps = 10' # replications (for demonstration purposes; in practice, 'reps' should be # considerably larger) m_predicted_mspe <- robpredict(m, landsat_means, k = 1.8, reps = 10, progress_bar = FALSE) head(m_predicted_mspe)
Function saemodel()
is used to specify a model. Once a model
has been specified, it can be fitted using
fitsaemodel()
by different estimation methods.
saemodel(formula, area, data, type = "b", na.omit = FALSE) ## S3 method for class 'saemodel' print(x, ...) ## S3 method for class 'saemodel' summary(object, ...) ## S3 method for class 'saemodel' as.matrix(x, ...)
saemodel(formula, area, data, type = "b", na.omit = FALSE) ## S3 method for class 'saemodel' print(x, ...) ## S3 method for class 'saemodel' summary(object, ...) ## S3 method for class 'saemodel' as.matrix(x, ...)
formula |
a |
area |
a one-sided |
data |
data.frame. |
type |
|
na.omit |
|
x |
an object of class |
object |
an object of the class |
... |
additional arguments (not used). |
Function saemodel()
is used to specify a model.
model
is a symbolic description (formula
of the
fixed-effects model to be fitted.
A typical model has the form response ~ terms
where
response
is the (numeric) response vector and
terms
is a series of terms which specifies a linear
predictor for response (explanatory variables); see
formula
.
A formula
has an implied intercept term. To remove
this use either y ~ x - 1
or y ~ 0 + x
;
see formula
for more details of allowed formulae.
area
is a symbolic description (formula
) of
the random effects (nested error structure). It must be
right-hand side only formula consisting of one term,
e.g., ~ areaDefinition
.
The data must no contain missing values.
The design matrix (i.e., matrix of the explanatory variables
defined the right-hand side of model
) must have full column
rank; otherwise execution is terminated by an error.
Once a model has been specified, it can be fitted by
fitsaemodel()
.
An instance of the S3 class "saemodel"
Rao, J.N.K. (2003). Small Area Estimation, New York: John Wiley and Sons.
# use the landsat data head(landsat) # set up the model model <- saemodel(formula = HACorn ~ PixelsCorn + PixelsSoybeans, area = ~CountyName, data = subset(landsat, subset = (outlier == FALSE))) # summar of the model summary(model)
# use the landsat data head(landsat) # set up the model model <- saemodel(formula = HACorn ~ PixelsCorn + PixelsSoybeans, area = ~CountyName, data = subset(landsat, subset = (outlier == FALSE))) # summar of the model summary(model)