--- title: "Robust Estimation and Prediction Under the Unit-Level SAE Model" author: "Tobias Schoch" output: html_document: css: "fluent.css" highlight: tango vignette: > %\VignetteIndexEntry{Robust Estimation and Prediction Under the Unit-Level SAE Model} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "", prompt = TRUE, fig.align = "center" ) library("rsae") ``` ```{css, echo = FALSE} .my-sidebar-orange { padding-left: 1.5rem; padding-top: 0.5rem; padding-bottom: 0.25rem; margin-top: 1.25rem; margin-bottom: 1.25rem; border: 1px solid #eee; border-left-width: 0.75rem; border-radius: .25rem; border-left-color: #ce5b00; } .my-sidebar-blue { padding-left: 1.5rem; padding-top: 0.5rem; padding-bottom: 0.25rem; margin-top: 1.25rem; margin-bottom: 1.25rem; border: 1px solid #eee; border-left-width: 0.75rem; border-radius: .25rem; border-left-color: #1f618d; } ``` ## Outline This vignette is organized as follows: 1. Getting started 2. Exploring the data 3. Model specification 4. Parameter estimation 5. Robust prediction of the area-level means 6. Mean square prediction error ## 1 Getting started In small area estimation (SAE), we distinguish two types of models: * model A: basic area-level model (also known as Fay-Herriot model), * model B: basic unit-level model, The classification of the models (A or B) is from [Rao (2003, Chapter 7)](#biblio). The current version of the package implements the following estimation methods under the **unit-level model (model B)**: * maximum-likelihood (ML) estimation, * robust Huber type *M*-estimation. The package can be installed from CRAN using `install.packages("rsae")`. Once the `rsae` package has been installed, we need to load it to the current session by `library("rsae")`. **Work flow** The prediction of the area-level means takes three steps. * model specification using `saemodel()`, * model fitting using `fitsaemodel()`, * prediction of the random effects and the area-specific means using `robpredict()`; this step includes the computation of the mean square prediction error. ## 2 Exploring the data We use the `landsat` data of [Battese et al. (1988)](#biblio), which is loaded by ```{r} data("landsat") ``` The `landsat` data is a compilation of survey and satellite data from [Battese et al. (1988)](#biblio). It consists of data on segments (primary sampling unit; 1 segement approx. 250 hectares) under corn and soybeans for 12 counties in north-central Iowa. * The survey data on the areas under corn and soybeans (reported in hectares) in the 37 segments of the 12 counties have been determined by USDA Statistical Reporting Service staff, who interviewed farm operators. A segment is about 250 hectares. * For the LANDSAT satellite data, information is recorded as "pixels". A pixel is about 0.45 hectares. In the three smallest counties (Cerro Gordo, Hamilton, and Worth), data is available only for one sample segment. All other counties have data for more than one sample segment (i.e., unbalanced data). The largest area (Hardin) covers six units. The data for the observations 32, 33, and 34 are shown below ```{r} landsat[32:34,] ``` We added the variable `outlier` to the original data. It flags observation 33 as an outlier, which is in line with the discussion in [Battese et al. (1988)](#biblio). ## 3 Model specification We consider estimating the parameters of the basic unit-level model ([Battese et al. , 1988](#biblio)) $$ \begin{equation*} \mathrm{HACorn}_{i,j} = \alpha + \beta_1 \mathrm{PixelsCorn}_{i,j} + \beta_2 \mathrm{PixelsSoybeans}_{i,j} + u_i + e_{i,j}, \end{equation*} $$ where $j=1,\ldots, n_i$, $i=1, \ldots,12$, and * $\alpha$, $\beta_1$, and $\beta_2$ are unknown real-valued coefficients, * the $u_i$'s are area-specific random effects, $u_i \sim N(0, \sigma_u^2)$, and $\sigma_u^2 \geq 0$ is unknown, * the $e_i$'s are residual errors, $e_{ij} \sim N(0, \sigma_e^2)$, and $\sigma_e^2 > 0$ is unkown, * the $u_i$'s and $e_{ij}$'s are independent. The model is defined with the help of the`saemodel()`function: ```{r} bhfmodel <- saemodel(formula = HACorn ~ PixelsCorn + PixelsSoybeans, area = ~ CountyName, data = subset(landsat, subset = (outlier == FALSE))) ``` where * `formula` defines the fixed-effect part of the model (the `~` operator separates dependent and independent variables; by default, the model includes a regression intercept), * `area` specifies the area-level random effect (variable `CountyName` serves as area identifier; note that the argument `area` is also a `formula` object), * `data` specifies the `data.frame` (here, we consider the subset of observations that are not flagged as outliers). If you need to know more about a particular model, you can use the `summary()` method. ## 4 Parameter estimation Having specified `bhfmodel`, we consider estimating its parameters by different estimation method. ### 4.1 Maximum-likelihood estimation The maximum likelihood (ML) estimates of the parameters are computed by ```{r} mlfit <- fitsaemodel(method = "ml", bhfmodel) ``` On print, object `mlfit` shows the following. ```{r} mlfit ``` Inferential statistics for the object `mlfit` are computed by the `summary()` method. ```{r} summary(mlfit) ``` Function `coef()` extracts the estimated coefficients from a fitted model. * By default, function `coef()` is called with argument `type = "both"`, which implies that the fixed effects and the random effect variances are returned. * Calling the function with `type = "ranef"` or `type = "fixef"` returns the random effect variances or the fixed effects, respectively. Function `convergence()` can be useful if the algorithm did not converge; see [Appendix](#appendix).

**Good to know. When the mixed linear model is not appropriate**

Suppose that our data do not have area-specific variation. If we nevertheless fit the basic unit-level model to the data, the output looks as follows. ```{r, echo = FALSE} set.seed(12345) n <- 200; beta <- c(1, 1) cst <- rep(1, n) x <- rnorm(n) y <- as.matrix(cbind(cst, x)) %*% beta + rnorm(n) areaid <- rep(1:10, each=10) df <- data.frame(y=y, x=x, areaid=areaid) m <- saemodel(y ~ x, area=~areaid, data=df) fitsaemodel("ml", m) ``` The report indicates that the random-effect variance is close to zero or equal to zero. Thus, it is more appropriate to use a linear regression model instead of the basic unit-level model.

### 4.2 Huber-type *M*-estimation The Huber-type *M*-estimator is appropriate for situations where the response variable is supposed to be (moderately) contaminated by outliers. The *M*-estimator downweights residual outliers, but it *does not limit* the effect of high-leverage observations. The Huber-type *M*-estimator of `bhfmodel` is ```{r} huberfit <- fitsaemodel("huberm", bhfmodel, k = 1.5) ``` where `k` is the robustness tuning constant of the Huber $\psi$-function ($0

**Good to know. Selection of robustness tuning constant**

In the simple location-scale model, the tuning constant `k = 1.345` defines an *M*-estimator of location that has an asymptotic relative efficiency w.r.t. the ML estimate of approx. 95\% at the Gaussian core model. This property does *not* directly carry over to robust estimators of mixed-linear models. Therefore, we should not blindly select `k = 1.345`.

If the algorithm did not converge, see paragraph *safe mode* (below) and [Appendix](#appendix). Inferential statistics for `huberfit` are computed by the `summary()` method. ```{r} summary(huberfit) ``` The output is separated into 2 blocks. The first block shows inferential statistics of the estimated fixed effects. The second block reports the degree of downweighting that is applied to outlying residuals at the final iteration (by estimating equations, EE, separately). The more the value of "sum(wgt)/n" deviates from 1.0, the more downweighting has been applied. The methods `coef()` and `convergence()` are also available. #### Safe mode In the safe mode, the algorithm is initialized by a high-breakdown-point regression estimator. **Note.** In order to use the safe mode, the `robustbase` package of [Maechler et al. (2021)](#biblio) must be installed. The safe mode is entered by specifying one the following initialization methods: * `init = "lts"` : least trimmed squares (LTS) regression estimator; see [Rousseeuw (1984)](#biblio) and [Rousseeuw and van Driessen (2006)](#biblio), * `init = "s"` : regression *S*-estimator; see [Rousseeuw and Yohai (1984)](#biblio) and [Salibian-Barerra and Yohai (2006)](#biblio) in the call of `fitsaemodel()`. The safe mode uses a couple of internal tests to check whether the estimates at consecutive iterations behave well. Notably, it tries to detect cycles in the sequence iteratively refined estimates; see [Schoch (2012)](#biblio). ## 5 (Robust) prediction of the area-level means The package implements the following methods to predict the area-level means: * empirical best linear unbiased predictor (EBLUP); see [Rao (2003, Chapter 7.2)](#biblio), * robust prediction method of [Copt and Victoria-Feser (2009)](#biblio); see also [Heritier et al. (2009, p. 113--114)](#biblio). EBLUP and robust predictions are computed by ```{r, eval = FALSE} robpredict(fit, areameans = NULL, k = NULL, reps = NULL, progress_bar = TRUE) ``` where * `fit` is a fitted model (ML estimate or *M*-estimate), * `k` is the robustness tuning constant of the Huber $\psi$-function for robust prediction. By default `k` is `NULL` which means that the procedure inherits the tuning constant `k` that has been used in fitting the model; see `fitsaemodel()`. For the ML estimator, $k$ is taken as a large value that "represents" infinity; see argument `k_Inf` in `fitsaemodel.control()`. * `reps` and `progress_bar` are used in the computation of the mean square prediction error (see below). In the `landsat` data, the county-specific population means of pixels of the segments under corn and soybeans are recorded in the variables MeanPixelsCorn and MeanPixelsSoybeans, respectively. Each sample segment in a particular county is assigned the county-specific means. Therefore, the population mean of the variables MeanPixelsCorn and MeanPixelsSoybeans occurs $n_i$ times. The unique county-specific population means are obtained using ```{r} dat <- unique(landsat[-33, c("MeanPixelsCorn", "MeanPixelsSoybeans", "CountyName")]) dat <- cbind(rep(1,12), dat) rownames(dat) <- dat$CountyName dat <- dat[,1:3] dat ``` The first column of `dat` is a dummy variable (because `bhfmodel` has a regression intercept). The second and third column represent, respectively, the county-specific population means of segments under corn and soybeans. Consider the ML estimate, `mlfit`, of the `bhfmodel` model. The EBLUP of the area-level means is computed by ```{r} pred <- robpredict(mlfit, areameans = dat) pred ``` because (by default) `k = NULL` . By explicitly specifying `k` in the call of `robpredict()`, we can—in principle—compute robust predictions for the ML estimate instead of the EBLUP. Consider the Huber *M*-estimate `huberfit` (with tuning constant `k = 1.5`, see call of above). If `k` is not specified in the call of `robpredict()`, robust predictions are computed with $k$ equal to 1.5; otherwise, the robust predictions are based on the value of `k` in the call of `robpredict()`. Object `pred` is a `list` with slots `"fixeff"`, `"raneff"`, `"means"`, `"mspe"`, etc. For instance, the predicted means can be extracted by `pred$means` or `pred[["means"]]`. A `plot()` function is available to display the predicted means. Function `residuals()` computes the residuals. ### 6 Mean square prediction error Function `robpredict()` can be used to compute bootstrap estimates of the mean squared prediction errors (MSPE) of the predicted area-level means; see [Sinha and Rao (2009)](#biblio). To compute the MSPE, we must specify the number of bootstrap replicates `(reps)`. If `reps = NULL`, the MSPE is not computed. Consider (for instance) the ML estimate. EBLUP and MSPE of the EBLUP based on 100 bootstrap replicates are computed by ```{r} pred <- robpredict(mlfit, areameans = dat, reps = 100, progress_bar = FALSE) pred ``` The number of `reps = 100` is kept small for illustration purposes; in practice, we should choose larger values. The progress bar has been disabled as it is not suitable for the automatic vignette generation. A visual display of the predicted area-level means obtains by `plot(pred, type = "l", sort = "means")`. ## References {#biblio} COPT, S. and M.-P. VICTORIA-FESER (2009). *Robust prediction in mixed linear models*, Tech. rep., University of Geneva. BATTESE, G. E., R. M. HARTER and W. A. FULLER (1988). An error component model for prediction of county crop areas using, *Journal of the American Statistical Association* **83**, 28--36. [DOI: 10.2307/2288915](https://doi.org/10.2307/2288915) FELLNER, W. H. (1986). Robust estimation of variance components, *Technometrics* **28**, 51--60. [DOI: 10.1080/00401706.1986.10488097](https://doi.org/10.1080/00401706.1986.10488097) HERITIER, S., E. CANTONI, S. COPT, and M.-P. VICTORIA-FESER (2009). *Robust Methods in Biostatistics*, New York: John Wiley & Sons. MAECHLER, M., P. J. ROUSSEEUW, C. CROUX, V. TODOROC, A. RUCKSTUHL, M. SALIBIAN-BARRERA, T. VERBEKE, M. KOLLER, M. KOLLER, E. L. T. CONCEICAO and M. A. DI PALMA (2023). *robustbase: Basic Robust Statistics*. R package version 0.99-1. [URL: CRAN.R-project.org/package=robustbase](https://CRAN.R-project.org/package=robustbase). RAO, J.N.K. (2003). *Small Area Estimation*, New York: John Wiley and Sons. RICHARDSON, A. M. and A. H. WELSH (1995). Robust Restricted Maximum Likelihood in Mixed Linear Models, *Biometrics* **51**, 1429--1439. [DOI: 10.2307/2533273](https://doi.org/10.2307/2533273) ROUSSEEUW, P. J. (1984). Least Median of Squares Regression, *Journal of the American Statistical Association* **79**, 871--880. [DOI: 10.2307/2288718](https://doi.org/10.2307/2288718) ROUSSEEUW, P. J. and K. VAN DRIESSEN (2006). Computing LTS Regression for Large Data Sets, *Data Mining and Knowledge Discovery* **12**, 29--45. [DOI: 10.1007/s10618-005-0024-4](https://doi.org/10.1007/s10618-005-0024-4) ROUSSEEUW, P. J. and V. YOHAI (1984). Robust Regression by Means of S Estimators, in *Robust and Nonlinear Time Series Analysis*, ed. by FRANKE, J., W. HÄRDLE AND R. D. MARTIN, New York: Springer, 256--274. SALIBIAN-BARRERA, M. and V. J. YOHAI (2006). A Fast Algorithm for S-Regression Estimates, *Journal of Computational and Graphical Statistics* **15**, 414--427. [DOI: 10.1198/106186006x113629](https://doi.org/10.1198/106186006x113629) SCHOCH, T. (2012). Robust Unit-Level Small Area Estimation: A Fast Algorithm for Large Datasets. *Austrian Journal of Statistics* **41**, 243--265. [DOI: 10.17713/ajs.v41i4.1548](https://doi.org/10.17713/ajs.v41i4.1548) SINHA, S.K. and J.N.K. RAO (2009). Robust small area estimation. *Canadian Journal of Statistics* **37**, 381--399. [DOI: 10.1002/cjs.10029](https://doi.org/10.1002/cjs.10029) ## Appendix ### A Failure of convergence Suppose that we computed ```{r} est <- fitsaemodel("ml", bhfmodel, niter = 3) ``` The algorithm did not converge and we get the following output ```{r} est ``` To learn more, we call `convergence(est)` and get ```{r} convergence(est) ``` Clearly, we have deliberately set the number of (overall or outer loop) iterations equal to `niter = 3` to illustrate the behavior; see also "niter = 3" on the line "overall loop" in the above table. As a consequence, the algorithm failed to converge. The maximum number of iterations for the fixed effects ("fixeff") is 200, for the residual variance estimator ("residual var") is 200, for the area-level random effect variance is ("area raneff var") is 100. The default values can be modified; see documentation of `fitsaemodel.control()`. The last table in the above output shows the number of iterations that the algorithm required for the `niter = 3` overall loop iteration (i.e., in the first loop, we have `fixef = 2`, `residual var = 2` and `area raneff var = 18` iterations). From this table we can learn how to adjust the default values in case the algorithm does not converge.