Theory for missing not at random non-response correction
Maciej Ostapiuk and Maciej Beręsewicz
theory.Rmd
Introduction
This document presents the theory needed to understand the methodology behind the correction of missing not at random (MNAR) non-response in survey samples. It covers methods such as:
Generalized calibration
Generalized calibration with more variables in calibration than response model
Generalized method of moments (also known as GMM)
Empirical likelihood estimation
Non-parametric methods
Exponential tilting
Latent approach
The material included in this document should help the reader understand the correction techniques and should be treated as a supplementary resource to our R package MNAR.
Interpretation of Missing Not at Random
In statistical surveys, missing data play the most important role in the family of non-random errors. Their presence affects the estimation of unknown population quantities by biasing estimators and reducing their precision. The reason for this lies in the differences between respondents, i.e. participants who answered every question, and non-respondents, i.e. participants who did not answer some questions (item non-response) or did not answer at all (unit non-response).
There are many methods of dealing with non-response. The main idea behind their construction allows us to categorize them into two groups: weighting methods and imputation methods. In most cases, one is able to determine which of these groups should be used.
The imputation methods are used when dealing with item non-response: missing answers to particular questions are corrected by, for example, filling in (imputing) plausible values. The weighting methods should be used when dealing with unit non-response in order to correct, using a set of auxiliary variables, the weights of respondents and non-respondents in the sample so that known population totals are reproduced. The choice of the auxiliary variables matters and is strictly tied to the estimation process, which makes it the main difficulty in this category of methods. One might use a combination of both kinds of methods to eliminate the negative impact of non-response.
Investigation of the described methodology starts with an explanation of the assumptions, settings, and notation behind sampling, response, and estimation. Starting with the basic notation, let $U$ denote the population of size $N$ and let $s$, $s \subseteq U$, be a probability sample of size $n$, $n \leq N$. According to the sampling design, let $\pi_i$ denote the first-order inclusion probability of the $i$-th population element in sample $s$. Thus, under the given sampling design, $d_i = \pi_i^{-1}$ denotes the initial (design) weight of the $i$-th element. Our main goal when dealing with survey data is to estimate the population total

$$\tau_y = \sum_{i \in U} y_i,$$

where $y_i$ is the value of the target variable for the $i$-th element, $i = 1, \ldots, N$. The natural and usual choice here is the Horvitz-Thompson estimator of the form

$$\hat{\tau}_{y,\text{HT}} = \sum_{i \in s} d_i y_i.$$

By design, $\hat{\tau}_{y,\text{HT}}$ is an unbiased estimator of $\tau_y$. If there are non-respondents in the sample, then the summation is done over the subset of respondents $r \subseteq s$.
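To fix ideas, here is a minimal base-R sketch of the Horvitz-Thompson estimator under simple random sampling; the simulated data and object names are illustrative and not part of the MNAR package.

```r
# Minimal illustration of the Horvitz-Thompson estimator of a population total
# (simulated data; simple random sampling without replacement).
set.seed(123)
N    <- 10000                                # population size
y    <- rgamma(N, shape = 2, scale = 500)    # target variable in the population
n    <- 500                                  # sample size
pi_i <- rep(n / N, N)                        # first-order inclusion probabilities

s   <- sample(N, n)                          # indices of the sampled elements
d_i <- 1 / pi_i[s]                           # design weights d_i = 1 / pi_i

tau_hat_ht <- sum(d_i * y[s])                # Horvitz-Thompson estimate
c(estimate = tau_hat_ht, true_total = sum(y))
```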
Weighting methods
Usually, if non-response occurs, summing only over respondents in the Horvitz-Thompson estimator underestimates the population total $\tau_y$. It is therefore necessary to correct the initial weights under the given sampling design; in other words, we have to re-weight the described $d_i$'s.
In general we have the following settings:

- Information on the target variable $y$ is available only for respondents.
- Information on auxiliary variables $\mathbf{x}$ is available under one of the following settings:
    - unit-level data are available for both respondents and non-respondents;
    - unit-level data are available only for respondents, but population totals of the auxiliary variables are known for the reference population.
Case when dimensions of calibration and response-model variables coincide
Let $\mathbf{X}$ denote the benchmark vector of the chosen auxiliary variables' totals and let $\mathbf{x}_i$ be the vector of auxiliary variables for the $i$-th element of the sample $s$. The settings state that $\mathbf{X}$, the vector of global (population) totals of the auxiliary variables, is known, i.e.

$$\mathbf{X} = \sum_{i \in U} \mathbf{x}_i.$$

If any of the auxiliary totals is not known, one might use its Horvitz-Thompson estimate instead, i.e.

$$\hat{\mathbf{X}}_{\text{HT}} = \sum_{i \in s} d_i \mathbf{x}_i.$$

However, using the design weights $d_i$ does not always work well in the process of estimating $\tau_y$. One needs slightly different weights than the $d_i$'s. Those weights, denoted as $w_i$, are obtained as solutions to the optimization problem

$$\min_{w_1, \ldots, w_n} \sum_{i \in s} d_i\, G\!\left(\frac{w_i}{d_i}\right),$$

where $G$ is a strictly convex, differentiable function for which $G(1) = 0$ and $G'(1) = 0$. In addition, the following condition has to be satisfied:

$$\sum_{i \in s} w_i \mathbf{x}_i = \sum_{i \in U} \mathbf{x}_i,$$

which is also called the calibration equation. Using the method of Lagrange multipliers, it is shown in Deville and Särndal (1992) that the calibration weights can be written as

$$w_i = d_i\, F\!\left(\mathbf{z}_i^\top \boldsymbol{\lambda}\right),$$

where $\mathbf{z}_i$ is a vector of instrumental variables whose dimension coincides with that of $\mathbf{x}_i$. Later in this paper we will consider the situation where $\mathbf{x}_i$ has a higher dimension than $\mathbf{z}_i$. The function $F$ is the inverse of $g(\cdot) = \mathrm{d}G(\cdot)/\mathrm{d}(\cdot)$, i.e. $F = g^{-1}$. There are various ways to choose the function $G$, but a common choice is

$$G(x) = \frac{(x-1)^2}{2}.$$

For this choice, the solution of the above optimization problem is expressed by Estevao and Särndal (n.d.) as

$$w_i = d_i\left(1 + \mathbf{z}_i^\top \boldsymbol{\lambda}\right),$$

where $\boldsymbol{\lambda}$ is defined as follows:

$$\boldsymbol{\lambda} = \left(\sum_{i \in s} d_i\, \mathbf{x}_i \mathbf{z}_i^\top\right)^{-1} \left(\sum_{i \in U} \mathbf{x}_i - \sum_{i \in s} d_i \mathbf{x}_i\right).$$

Using the obtained $w_i$, known as the linear weights, a new, so-called "calibration-weighted" estimator of the target variable total is of the form

$$\hat{\tau}_{y,\text{cal}} = \sum_{i \in s} w_i y_i,$$

which can be rendered as

$$\hat{\tau}_{y,\text{cal}} = \sum_{i \in s} d_i y_i + \left(\sum_{i \in U} \mathbf{x}_i - \sum_{i \in s} d_i \mathbf{x}_i\right)^\top \hat{\mathbf{B}},$$

where

$$\hat{\mathbf{B}} = \left(\sum_{i \in s} d_i\, \mathbf{z}_i \mathbf{x}_i^\top\right)^{-1} \sum_{i \in s} d_i\, \mathbf{z}_i y_i.$$

Notice that $\hat{\tau}_{y,\text{cal}}$ is no longer unbiased by design. However, it may be consistent, which is described in Isaki and Fuller (1982).
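Below is a small base-R sketch of the linear-weights solution above for the case where the instrument vector has the same dimension as the calibration vector; the function name and the toy data are illustrative assumptions, not the MNAR package API.

```r
# Linear generalized calibration weights w_i = d_i * (1 + z_i' lambda)
# (illustrative sketch; x: calibration variables, z: instrumental variables).
lin_gencal_weights <- function(d, x, z, X_tot) {
  T_mat  <- t(x) %*% (d * z)                   # sum_i d_i x_i z_i' (must be invertible)
  lambda <- solve(T_mat, X_tot - colSums(d * x))
  d * (1 + as.vector(z %*% lambda))
}

# Toy example: calibrate to known totals of an intercept and one covariate.
set.seed(1)
n <- 200
x <- cbind(1, runif(n))            # calibration variables for sampled units
z <- x                             # instruments equal to x (classical calibration)
d <- rep(50, n)                    # design weights
X_tot <- c(10000, 5200)            # known population totals of the x-variables

w <- lin_gencal_weights(d, x, z, X_tot)
colSums(w * x)                     # reproduces X_tot up to numerical error
```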
How does one formulate the response (prediction) model in this case? Let us denote two indicator random variables: $I_i$, equal to 1 if element $i$ is included in the sample $s$ and 0 otherwise, and $R_i$, equal to 1 if element $i$ responds and 0 otherwise. Kott and Chang (2010) proposed a double-protection justification based on the set of calibration equations

$$\sum_{i \in s} d_i R_i\, F\!\left(\mathbf{z}_i^\top \boldsymbol{\gamma}\right) \mathbf{x}_i = \sum_{i \in U} \mathbf{x}_i,$$

where the matrix of calibration variables is usually (though not necessarily) of full rank, $\boldsymbol{\gamma}$ is a vector of coefficients, and $F\!\left(\mathbf{z}_i^\top \boldsymbol{\gamma}\right)$ acts as the inverse of the assumed response probability of element $i$. Under this proposal the resulting estimator

$$\hat{\tau}_{y} = \sum_{i \in s} d_i R_i\, F\!\left(\mathbf{z}_i^\top \hat{\boldsymbol{\gamma}}\right) y_i$$

enjoys double protection: it remains approximately unbiased if either the response model or the prediction model for $y$ given $\mathbf{x}$ holds.
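The following sketch illustrates, under the assumption of a logistic response model (so that $F(u) = 1 + \exp(-u)$), how the calibration equations above can be solved numerically in base R; all names are illustrative and this is not the MNAR::gencal() implementation.

```r
# Sketch: solve sum_{i in s} d_i R_i F(z_i' gamma) x_i = X_tot for gamma,
# assuming F(u) = 1 + exp(-u), the inverse of a logistic response probability.
gencal_eq <- function(gamma, d, R, x, z, X_tot) {
  F_val <- 1 + exp(-as.vector(z %*% gamma))
  colSums(d * R * F_val * x) - X_tot
}

set.seed(2)
n <- 500
x <- cbind(1, rnorm(n))                        # calibration variables
z <- x                                         # instruments (same dimension here)
d <- rep(20, n)                                # design weights
R <- rbinom(n, 1, plogis(0.5 + 0.8 * x[, 2]))  # response indicators
X_tot <- colSums(d * x)                        # treat these totals as known

# Minimise the squared norm of the estimating equations with a generic optimiser.
obj <- function(gamma, ...) sum(gencal_eq(gamma, ...)^2)
fit <- optim(c(0, 0), obj, d = d, R = R, x = x, z = z, X_tot = X_tot,
             method = "BFGS")
gamma_hat <- fit$par

w <- d * (1 + exp(-as.vector(z %*% gamma_hat)))  # nonresponse-adjusted weights
colSums(R * w * x)                               # approximately equals X_tot
```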
When there are more calibration than response-model variables
First, let us consider $\boldsymbol{\gamma}^{*}$, the asymptotic limit of $\hat{\boldsymbol{\gamma}}$, which, together with $\hat{\boldsymbol{\gamma}}$, is assumed to exist regardless of whether the prediction model holds. When the prediction model fails, the approach still works as long as $\hat{\boldsymbol{\gamma}}$ converges to a finite $\boldsymbol{\gamma}^{*}$.
Chang and Kott (2008) considered this case and extended the weighting approach by replacing the calibration equation above with a minimization problem: find $\hat{\boldsymbol{\gamma}}$ that minimizes

$$\left(\sum_{i \in s} d_i R_i\, F\!\left(\mathbf{z}_i^\top \boldsymbol{\gamma}\right) \mathbf{x}_i - \sum_{i \in U} \mathbf{x}_i\right)^\top \mathbf{W} \left(\sum_{i \in s} d_i R_i\, F\!\left(\mathbf{z}_i^\top \boldsymbol{\gamma}\right) \mathbf{x}_i - \sum_{i \in U} \mathbf{x}_i\right)$$

for some symmetric and positive definite matrix $\mathbf{W}$. There are various ways to pick $\mathbf{W}$, as well as to deal with the calibration-variable matrix not being of full rank; a couple of examples, including particular choices of $\mathbf{W}$, can be found in Kott and Liao (2017). After finding $\hat{\boldsymbol{\gamma}}$, the dimension of the calibration-variable vector $\mathbf{x}_i$ is reduced (via a linear transformation) so that it matches the dimension of $\mathbf{z}_i$.
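To make this minimization concrete, the sketch below finds $\hat{\boldsymbol{\gamma}}$ when there are three calibration variables but only two response-model variables. It is a base-R illustration only, with $F(u) = 1 + \exp(-u)$ and $\mathbf{W}$ set to the identity matrix purely for simplicity; it is not the MNAR::gencal() implementation.

```r
# Sketch: more calibration variables (x, 3 columns) than response-model
# variables (z, 2 columns). Minimise g(gamma)' W g(gamma), where
# g(gamma) = sum_i d_i R_i F(z_i' gamma) x_i - X_tot and F(u) = 1 + exp(-u).
set.seed(3)
n <- 1000
x <- cbind(1, rnorm(n), runif(n))                # calibration variables
z <- x[, 1:2]                                    # response-model variables
d <- rep(10, n)                                  # design weights
R <- rbinom(n, 1, plogis(0.3 + 0.6 * x[, 2]))    # response indicators
X_tot <- colSums(d * x)                          # treat these totals as known
W <- diag(ncol(x))                               # simple choice: identity matrix

qform <- function(gamma) {
  F_val <- 1 + exp(-as.vector(z %*% gamma))
  g <- colSums(d * R * F_val * x) - X_tot
  as.numeric(t(g) %*% W %*% g)
}

fit <- optim(c(0, 0), qform, method = "BFGS")
gamma_hat <- fit$par
w <- d * (1 + exp(-as.vector(z %*% gamma_hat)))  # adjusted weights
colSums(R * w * x)   # close to, but not exactly equal to, X_tot (overidentified)
```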
Another approach to component reduction, proposed by Andridge and Little (2011), works without searching for $\hat{\boldsymbol{\gamma}}$ by minimization and does not rely on picking a $\mathbf{W}$ matrix; instead, the calibration variables are reduced to a single linear combination that serves as a proxy for the target variable.
Again, a reduction of dimensions is needed, and with the reduced auxiliary vector obtained in this way one is able to perform the generalized calibration weighting technique. So far, this method is implemented as part of the MNAR::gencal() function.