
Introduction

This document presents the theory needed to understand the methodology behind correcting missing-not-at-random non-response in survey samples. It covers methods such as:

  1. Generalized calibration

  2. Generalized calibration with more calibration than response-model variables

  3. Generalized method of moments (also known as GMM)

  4. Empirical likelihood estimation

  5. Non-parametric methods

  6. Exponential tilting

  7. Latent approach

The material in this paper explains the correction techniques and should be treated as a supplementary resource to our R package, MNAR.

Interpretation of Missing Not at Random

In statistical surveys, missing data is the most important member of the family of non-random errors. Its presence affects the estimation of the unknown global values (totals) of a population by biasing estimators and reducing their precision. The reason for this lies in the differences between respondents—participants who answered every question—and those participants who did not answer every question (item non-response) or did not answer at all (unit non-response), called non-respondents.

There are many methods for dealing with non-response. The main idea behind their construction allows us to split them into two groups: weighting methods and imputation methods. In most cases, one is able to determine which of these two groups of methods should be used.

Imputation methods are used when dealing with item non-response: questions with missing answers are corrected by, e.g., filling in the missing values. Weighting methods are used when dealing with unit non-response: a set of auxiliary variables is used to correct the sampling weights so that known population totals are reproduced. The choice of auxiliary variables matters and is strictly tied to the estimation process, so it is the biggest problem in this category of methods. One might combine both kinds of methods to eliminate the negative impact of non-response.

Investigation of the described methodology starts with an explanation of the assumptions, settings, and notation behind sampling, responding, and estimation. Starting with the basic notation, let $U$ denote the population of size $N$ and let $s$, $s \subseteq U$, be a probability sample of size $n$, $n \leq N$. According to the sampling design, let $\pi_k$ denote the first-order inclusion probability of the $k$-th element of population $U$ in sample $s$. Under the given sampling design, $d_k = \frac{1}{\pi_k}$ denotes the initial weight of the $k$-th element. Our main goal, when dealing with survey data, is to estimate the total of population $U$, written as
\begin{equation}\label{eq:Pop. total}
Y = \sum_{k=1}^{N}{y_k},
\end{equation}
where $y_k$ is the value of the target variable $Y$ for the $k$-th element, $k = 1, \ldots, N$. The natural and usual choice here is the Horvitz-Thompson estimator of the form
\begin{equation}\label{eq:HT estimator}
\hat{Y}_{\text{HT}} = \sum_{k \in s}{d_k y_k}.
\end{equation}
By design, $\hat{Y}_{\text{HT}}$ is an unbiased estimator. If there are non-respondents in the sample, the summation is carried out over the subset $r$ of respondents, $r \subseteq s$.
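
As a quick illustration of \eqref{eq:HT estimator}, the following minimal sketch (base R, with made-up inclusion probabilities and target values) computes a Horvitz-Thompson estimate of a total from the design weights; with non-response one would restrict the sum to the respondent subset $r$.

```r
# Minimal sketch (base R, toy data): Horvitz-Thompson estimation of a population total.
set.seed(1)
n    <- 100
pi_k <- runif(n, 0.05, 0.5)       # first-order inclusion probabilities of the sampled units
d_k  <- 1 / pi_k                  # initial design weights d_k = 1 / pi_k
y_k  <- rnorm(n, mean = 50)       # values of the target variable observed in the sample

Y_hat_HT <- sum(d_k * y_k)        # Horvitz-Thompson estimator of the total Y
Y_hat_HT

# With unit non-response, the same sum runs only over respondents, e.g.:
resp <- rbinom(n, 1, 0.8) == 1    # illustrative response indicator
sum(d_k[resp] * y_k[resp])        # typically underestimates Y, motivating the re-weighting below
```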

Weighting methods

Usually, if non-response occurs, the summation in \eqref{eq:HT estimator} yields values that underestimate the population total from \eqref{eq:Pop. total}. Thus, the initial weights $d_k$ need to be corrected under the given sampling design; in other words, we have to perform a calibration of the described $d_k$'s.

In general we have the following settings:

  • Information on the target variable $Y$ is only available for respondents.
  • Information on auxiliary variables $\boldsymbol{x}$ is available under the following settings:
    • Unit-level data is available for respondents and non-respondents.
    • Unit-level data is available only for respondents, but we have population totals for the reference population.

Case when dimensions of calibration and response-model variables coincide

Let $\boldsymbol{x} = (x_1, x_2, \ldots, x_p)^{\text{T}}$ denote the benchmark vector of chosen auxiliary variables and let $\boldsymbol{x}_{k} = (x_{1_k}, x_{2_k}, \ldots, x_{p_k})^{\text{T}}$ be the vector of auxiliary variables for the $k$-th element of the sample $s$. The settings state that $\boldsymbol{X}$, the vector of the auxiliary variables' global (population) values, is known, i.e.
\begin{equation}\label{eq:Aux. Total}
\boldsymbol{X} = \left(\sum_{k=1}^{N}{x_{1_k}}, \sum_{k=1}^{N}{x_{2_k}}, \ldots, \sum_{k=1}^{N}{x_{p_k}}\right)^{\text{T}} = \sum_{k \in U}{\boldsymbol{x}_k}.
\end{equation}
If any of the auxiliary totals is not known, one might estimate it by substituting $x_{i_k}$ for the $y_k$'s in \eqref{eq:HT estimator}, i.e.
\begin{equation*}
\hat{X}^{i}_{\text{HT}} = \sum_{k \in s}{d_k x_{i_k}}, \quad i = 1, \ldots, p.
\end{equation*}
However, using $\boldsymbol{x}_k$ instead of $y_k$ does not guarantee that the known totals $\boldsymbol{X}$ are reproduced when non-response occurs; one needs slightly different weights than the $d_k$'s. Those weights, denoted $w_k$, are obtained as solutions to an optimization problem of the form
\begin{equation}\label{eq: optimization w_k}
\operatorname{argmin}_{w_k}{\sum_{k\in r}G_k\left(w_k,d_k\right)},
\end{equation}
where $G_k$ is a strictly convex, differentiable distance function with $G_k(d_k,d_k) = 0$ and, written as a function of the ratio $w_k/d_k$, $G_k(1) = G'_k(1) = 0$. In addition, the following condition has to be satisfied:
\begin{equation}\label{eq: calib eq}
\sum_{k\in r}{w_k\boldsymbol{x}_k} = \sum_{k\in U}{\boldsymbol{x}_k}.
\end{equation}
Equation \eqref{eq: calib eq} is also called the calibration equation. Using the method of Lagrange multipliers, it is shown in Deville and Särndal (1992) that the calibration weights can be written as
\begin{equation}\label{eq: w_k minimizers form}
w_k = d_k F_k (\boldsymbol{\lambda}^{\text{T}}\boldsymbol{z}_k),
\end{equation}
where $\boldsymbol{z}_k$ is a vector of instrumental variables whose dimension coincides with that of $\boldsymbol{x}_k$. Later in this paper, we will consider the situation where $\boldsymbol{x}_k$ has a higher dimension than $\boldsymbol{z}_k$. $F_k$ is the inverse of $G_k'(w_k, d_k)$, the derivative of $G_k$ with respect to $w_k$:
\begin{equation}\label{eq: partial derivative of G_k}
G_k'(w_k, d_k) = \frac{\partial{G_k(w_k, d_k)}}{\partial w_k}.
\end{equation}
There are various ways to choose the function $G_k$, but a common choice is
\begin{equation}\label{eq: G_k example}
G_k(w_k,d_k) = \frac{\left(w_k - d_k\right)^2}{2d_k}.
\end{equation}
For this choice, the solution $w_k$ of the problem stated in \eqref{eq: optimization w_k} is expressed by Estevao and Särndal (n.d.) as
\begin{equation}\label{eq:G_k optimizers}
w_k = d_k(1 + \boldsymbol{z}_k^{\text{T}}\boldsymbol{\lambda}),
\end{equation}
where $\boldsymbol{\lambda}$ is given by
\begin{equation}\label{eq: g vector}
\boldsymbol{\lambda} = \left(\sum_{k \in r}{d_k\boldsymbol{x}_k\boldsymbol{z}_k^{\text{T}}}\right)^{-1} \left(\boldsymbol{X} - \sum_{k\in r}{d_k \boldsymbol{x}_k}\right).
\end{equation}
Using the obtained $w_k$, known as the linear weights, a new, so-called "calibration-weighted" estimator of the target variable total from \eqref{eq:Pop. total} is of the form
\begin{equation}\label{eq: calibration-weighted estimator}
\hat{Y}_{\text{cal}} = \sum_{k \in r}{w_k y_k},
\end{equation}
which can be rewritten as
\begin{equation}\label{eq: rendered calibration-weighted estimator}
\hat{Y}_{\text{cal}} = \sum_{k \in r }{d_k y_k} + \left(\boldsymbol{X} - \sum_{k\in r}{d_k \boldsymbol{x}_k}\right)^{\text{T}}\boldsymbol{b},
\end{equation}
where
\begin{equation*}
\boldsymbol{b} = \left(\sum_{k \in r}{d_k\boldsymbol{z}_k\boldsymbol{x}_k^{\text{T}}}\right)^{-1} \sum_{k \in r}{d_k\boldsymbol{z}_k y_k}.
\end{equation*}
Notice that $\hat{Y}_{\text{cal}}$ is no longer unbiased by design; however, it may be consistent, as described in Isaki and Fuller (1982).
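
To make the linear case concrete, here is a minimal sketch in base R that computes $\boldsymbol{\lambda}$ and the linear weights $w_k = d_k(1 + \boldsymbol{z}_k^{\text{T}}\boldsymbol{\lambda})$ directly from the formulas above on simulated respondent data; all data and totals are made up for illustration, and the instrument-free case $\boldsymbol{z}_k = \boldsymbol{x}_k$ is used.

```r
# Minimal sketch (base R, simulated data): linear generalized calibration weights.
set.seed(2)
n_r <- 200                                        # number of respondents
x_r <- cbind(1, rnorm(n_r), rbinom(n_r, 1, 0.4))  # calibration variables x_k (with intercept)
z_r <- x_r                                        # instrumental variables; here z_k = x_k
d_r <- runif(n_r, 2, 10)                          # initial design weights of respondents
y_r <- 10 + 2 * x_r[, 2] + rnorm(n_r)             # target variable, observed for respondents only
X_pop <- c(1500, 40, 620)                         # known population totals of x (illustrative)

# lambda = (sum_r d_k x_k z_k^T)^{-1} (X - sum_r d_k x_k)
lambda <- solve(crossprod(x_r * d_r, z_r), X_pop - colSums(d_r * x_r))
w_r    <- d_r * (1 + drop(z_r %*% lambda))        # calibrated (linear) weights

colSums(w_r * x_r)                                # reproduces X_pop exactly (calibration equation)
sum(w_r * y_r)                                    # calibration-weighted estimate of the total of y
```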

How does one formulate the prediction model in this case? Let us define two indicator random variables:
\begin{equation*}
I_k = 1\{k \in s\}, \qquad R_k = 1\{k \in r\}.
\end{equation*}
Kott and Chang (2010) proposed the following double-protection set of equations:
\begin{equation}\label{eq: double-security-pred.framework}
\left\{\begin{array}{ll}
y_k &= \boldsymbol{x}_k^{\text{T}} \boldsymbol{\beta}_{\boldsymbol{x}} + \epsilon_k,\\
\boldsymbol{z}_k^{\text{T}} &= \boldsymbol{x}_k^{\text{T}}\boldsymbol{\Gamma} + \boldsymbol{\eta}_k^{\text{T}},
\end{array}\right.
\end{equation}
where $\boldsymbol{\Gamma}$ is usually (though not necessarily) of full rank, $\boldsymbol{\beta}_{\boldsymbol{x}}$ is a coefficient vector, and
\begin{equation}
E\left(\epsilon_k \,|\, \boldsymbol{x}_k, I_k, R_k\right) = 0, \qquad E\left(\boldsymbol{\eta}_k \,|\, \boldsymbol{x}_k, I_k, R_k\right) = \boldsymbol{0}.
\end{equation}
Under the proposal from \eqref{eq: double-security-pred.framework} the following property holds:
\begin{equation}\label{eq: property of 2-sec prediction}
\left(y_k - \boldsymbol{z}_k^{\text{T}}\boldsymbol{\beta}_{\boldsymbol{z}}\right)\big|\boldsymbol{x}_k = \left(\epsilon_k - \boldsymbol{\eta}_k^{\text{T}}\boldsymbol{\Gamma}^{-1}\boldsymbol{\beta}_{\boldsymbol{x}}\right)\big|\boldsymbol{x}_k,
\end{equation}
where
\begin{equation*}
\boldsymbol{\beta}_{\boldsymbol{z}} = \boldsymbol{\Gamma}^{-1}\boldsymbol{\beta}_{\boldsymbol{x}}.
\end{equation*}
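
The property \eqref{eq: property of 2-sec prediction} can be checked numerically. The short sketch below (base R, with arbitrary illustrative values of $\boldsymbol{\Gamma}$, $\boldsymbol{\beta}_{\boldsymbol{x}}$ and the errors) generates one unit according to the display above and verifies that $y_k - \boldsymbol{z}_k^{\text{T}}\boldsymbol{\beta}_{\boldsymbol{z}}$ equals $\epsilon_k - \boldsymbol{\eta}_k^{\text{T}}\boldsymbol{\Gamma}^{-1}\boldsymbol{\beta}_{\boldsymbol{x}}$.

```r
# Minimal sketch (base R): numerical check of the double-protection identity.
set.seed(3)
p <- 3
Gamma  <- matrix(c(1, 0.2, 0,  0.1, 1, 0.3,  0, 0.2, 1), p, p)  # full-rank matrix (illustrative)
beta_x <- c(2, -1, 0.5)
beta_z <- solve(Gamma, beta_x)                       # beta_z = Gamma^{-1} beta_x

x_k   <- rnorm(p)                                    # auxiliary variables for a single unit
eps_k <- rnorm(1)
eta_k <- rnorm(p)
y_k   <- drop(crossprod(x_k, beta_x)) + eps_k        # y_k  = x_k' beta_x + eps_k
z_k   <- drop(crossprod(Gamma, x_k)) + eta_k         # z_k' = x_k' Gamma + eta_k'

lhs <- y_k - drop(crossprod(z_k, beta_z))            # y_k - z_k' beta_z
rhs <- eps_k - drop(crossprod(eta_k, solve(Gamma, beta_x)))  # eps_k - eta_k' Gamma^{-1} beta_x
all.equal(lhs, rhs)                                  # TRUE (up to floating-point error)
```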

When there are more calibration than response-model variables

First, let us consider $\boldsymbol{b}_{\boldsymbol{z}}^{*}$, the asymptotic limit of
\begin{equation}\label{eq: b_z}
\boldsymbol{b}_{\boldsymbol{z}} = \left(\sum_{k \in S}{d_k R_k F'_k(\boldsymbol{z}_k^{\text{T}}\boldsymbol{\lambda})\boldsymbol{x}_k\boldsymbol{z}_k^{\text{T}}}\right)^{-1} \times \sum_{k \in S}{d_k R_k F'_k(\boldsymbol{z}_k^{\text{T}}\boldsymbol{\lambda})\boldsymbol{x}_k y_k},
\end{equation}
which, alongside $\boldsymbol{b}_{\boldsymbol{z}}$, is assumed to exist regardless of the outcome of the prediction. When the prediction model fails, $\boldsymbol{b}_{\boldsymbol{z}}$ may differ from $\boldsymbol{\beta}_{\boldsymbol{z}}$, but it still converges to $\boldsymbol{b}_{\boldsymbol{z}}^{*}$ as long as $\boldsymbol{\lambda}$ converges to a finite $\boldsymbol{\lambda}^{*}$. Chang and Kott (2008) considered this case and extended the weighting approach: the calibration equation from \eqref{eq: calib eq}, reformulated as
\begin{equation}\label{eq: reformulated calib. eq.}
\boldsymbol{s} = \frac{1}{N} \left[\sum_{k \in S}{d_k R_k F_k(\boldsymbol{z}_k^{\text{T}}\boldsymbol{\lambda})\boldsymbol{x}_k} - \sum_{k \in S}{d_k\boldsymbol{x}_k}\right] = \boldsymbol{0},
\end{equation}
is replaced by the problem of finding the $\boldsymbol{\lambda}$ that minimizes $\boldsymbol{s}^{\text{T}}\boldsymbol{Q}\boldsymbol{s}$ for some symmetric and positive-definite $\boldsymbol{Q}$. There are various ways to pick $\boldsymbol{Q}$, as well as to deal with $\boldsymbol{\Gamma}$ not being of full rank; a couple of examples can be found in Kott and Liao (2017). For example, one of the options is to use
\begin{equation*}
\boldsymbol{Q}^{-1} = \text{DIAG}\left[\left(N^{-1} \sum_{S} d_k \boldsymbol{x}_k\right) \left(N^{-1}\sum_{S} d_k \boldsymbol{x}_k^{\text{T}}\right)\right].
\end{equation*}
After finding $\boldsymbol{\lambda}$, the dimension of $\boldsymbol{x}_k^{\text{T}}$ is reduced in the following way:
\begin{align*}
\tilde{\boldsymbol{x}}_k^{\text{T}} &= \boldsymbol{x}_k^{\text{T}} N^{-1} \boldsymbol{Q}\sum_{j \in S} d_j R_j F_j' \left( \boldsymbol{z}_j^{\text{T}} \boldsymbol{\lambda}\right) \boldsymbol{x}_j \boldsymbol{z}_j^{\text{T}}\\
&= \boldsymbol{x}_k^{\text{T}} \left( \sum_{j \in S} d_j R_j F_j' \left( \boldsymbol{z}_j^{\text{T}} \boldsymbol{\lambda}\right) \boldsymbol{x}_j \boldsymbol{x}_j^{\text{T}} \right)^{-1} \sum_{j \in S} d_j R_j F_j' \left( \boldsymbol{z}_j^{\text{T}} \boldsymbol{\lambda}\right) \boldsymbol{x}_j \boldsymbol{z}_j^{\text{T}}\\
&= \boldsymbol{x}_k^{\text{T}} \boldsymbol{B}_{\boldsymbol{z}}.
\end{align*}
Another approach to component reduction, proposed by Andridge and Little (2011), works without searching for $\boldsymbol{\lambda}$ and does not rely on picking a $\boldsymbol{Q}$ matrix; the idea relies on satisfying
\begin{equation}\label{eq: reformulated calib.eq}
\sum_{k \in S} w_k \tilde{\boldsymbol{x}}_k = \sum_{k \in S} d_k R_k F_k \tilde{\boldsymbol{x}}_k = \sum_{k \in S} d_k \tilde{\boldsymbol{x}}_k
\end{equation}
and setting
\begin{equation}\label{eq: setting A&L}
\tilde{\boldsymbol{x}}_k^{\text{T}} = \boldsymbol{x}_k^{\text{T}}\boldsymbol{A}^{\text{T}},
\end{equation}
where
\begin{equation*}
\boldsymbol{A}^{\text{T}} = \left(\sum_{S}{R_j\boldsymbol{x}_j\boldsymbol{x}_j^{\text{T}}}\right)^{-1} \sum_{S}{R_j\boldsymbol{x}_j\boldsymbol{z}_j^{\text{T}}}.
\end{equation*}
Again, the reduction of dimensions is what is needed, and with the $\tilde{\boldsymbol{x}}_k$ obtained in this way one is able to apply the generalized calibration weighting technique. So far, this method is implemented as part of the MNAR::gencal() function.
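
As an illustration of the second reduction, the following sketch (base R, simulated data; the variable names and dimensions are purely illustrative) forms $\boldsymbol{A}^{\text{T}}$ from respondent-level data and maps each $\boldsymbol{x}_k$ to the lower-dimensional $\tilde{\boldsymbol{x}}_k$, which can then be fed into the generalized calibration step from the previous section.

```r
# Minimal sketch (base R, simulated data): Andridge-Little-type component reduction.
set.seed(4)
n <- 300
p <- 4; q <- 2                                        # more calibration (p) than response-model (q) variables
x <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # calibration variables for the whole sample S
z <- x[, 1:q] + matrix(rnorm(n * q, sd = 0.3), n, q)  # response-model (instrumental) variables
R <- rbinom(n, 1, 0.7)                                # response indicators R_k

# A^T = (sum_S R_j x_j x_j^T)^{-1} sum_S R_j x_j z_j^T   (a p x q matrix)
A_t <- solve(crossprod(x * R, x), crossprod(x * R, z))

x_tilde <- x %*% A_t                                  # reduced calibration variables, one row per unit
dim(x_tilde)                                          # n x q: same dimension as z, ready for calibration
```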

References

Andridge, Rebecca R, and Roderick JA Little. 2011. “Proxy Pattern-Mixture Analysis for Survey Nonresponse.” Journal of Official Statistics 27 (2): 153.
Chang, T., and P. S. Kott. 2008. “Using Calibration Weighting to Adjust for Nonresponse Under a Plausible Model.” Biometrika 95 (3): 555–71. https://doi.org/10.1093/biomet/asn022.
Deville, Jean-Claude, and Carl-Erik Särndal. 1992. “Calibration Estimators in Survey Sampling.” Journal of the American Statistical Association 87 (418): 376–82.
Estevao, Victor, and Carl Särndal. n.d. "A Functional Form Approach to Calibration." Journal of Official Statistics 16 (4): 379–399. https://www.proquest.com/scholarly-journals/functional-form-approach-calibration/docview/1266846662/se-2.
Isaki, C. T., and W. A. Fuller. 1982. “Survey Design Under the Regression Super-Population Model.” Journal of the American Statistical Association 77 (377): 89–96.
Kott, Phillip S., and Ted Chang. 2010. “Using Calibration Weighting to Adjust for Nonignorable Unit Nonresponse.” Journal of the American Statistical Association 105 (491): 1265–75. https://doi.org/10.1198/jasa.2010.tm09016.
Kott, Phillip S., and Dan Liao. 2017. “Calibration Weighting for Nonresponse That Is Not Missing at Random: Allowing More Calibration Than Response-Model Variables.” Journal of Survey Statistics and Methodology 5 (2): 159–74. https://doi.org/10.1093/jssam/smx003.