User Note - Variance Estimation

Analysis and Variance Estimation with IPUMS NHIS

The National Health Interview Survey (NHIS) is a complex, multistage probability sample that incorporates stratification, clustering, and oversampling of some subpopulations (Black, Hispanic, and Asian) in some years. For more information about the NHIS sample design, users are advised to review the user note on SAMPLE DESIGN. Because of the complex sample design of the NHIS, users of IPUMS NHIS data must make use of sampling weights to produce representative estimates. Analysts are advised to review the user note on SAMPLING WEIGHTS with IPUMS NHIS data for additional information on the use of weights and on the different weights available.

While appropriate use of sampling weights will produce correct point estimates (e.g., means, proportions), statistical techniques that account for the complex sample design are also necessary to produce correct standard errors and statistical tests. Specifically, variables to account for the complex sample design (STRATA and PSU) are available in the IPUMS NHIS dataset and must be used to obtain appropriate variance estimates (standard errors) when computing annual estimates, pooled estimates, or multivariate estimates.

Due to confidentiality concerns, many of the original sample design variables are suppressed in the NHIS public use files. NCHS has released public use design variables representing pseudo-strata and pseudo-PSU variables for the years 1968 to the present. These are constructed variables for analytic purposes that do not reflect the actual stratification and primary sampling units used to draw the sample. Using these variables in analysis performs reasonably well as an alternative to using the original design variables.

IPUMS NHIS Technical Variables for Analysis and Variance Estimation

Three technical variables are needed for analysis of the IPUMS NHIS data:

  1. A sampling weight (i.e., PERWEIGHT, SAMPWEIGHT, SUPP1WT, SUPP2WT, SUPP3WT, MORTWT, MORTWTSA, FWEIGHT, or HHWEIGHT) must be chosen based on the sampling universe of the variables being analyzed. The sampling weight represents the inverse probability of selection into the sample, with adjustments for non-response and post-stratification adjustments for age, race/ethnicity, and sex. Analysts should review the descriptions of the variables of interest and the user note on SAMPLING WEIGHTS for more information about which weight to use.
  2. STRATA is an integrated variable that represents the impact of the sample design stratification on the estimates of variance and standard errors. It is constant within a sample design period and changes between sample design periods, with the exception of 1995 and 1996. The years 1995-2005 are in the same design period; however, NCHS released strata variables for 1995-1996 that differ from those for 1997-2005. NCHS advises treating 1995-1996 and 1997-2005 as independent samples.
  3. PSU is an integrated variable that represents the impact of the sample design clustering on the estimates of variance and standard errors. It is constant within a sample design period and changes between sample design periods, with the exception of 1995 and 1996. The sample design periods are 1963-1972, 1973-1984, 1985-1994, 1995-2005, 2006-2015, and 2016-present. The years 1995-2005 are in the same design period; however, NCHS released PSU variables for 1995-1996 that differ from those for 1997-2005.

Variance Estimation for 1963-Present

We have constructed our survey design variables so that they can be used when examining data from one year or from many years. To do this, we employed the method suggested by Korn and Graubard (1999), which is also recommended in the NCHS guidance for pooling data from one NHIS survey over multiple years and sample designs. This is the concatenated design period pooling approach (what Korn and Graubard refer to as situation 2 on page 280 of Analysis of Health Surveys). In this approach, standard errors are calculated by pooling data from many years and sample design periods into a single file, treating strata and PSUs from the same design period as the same but treating strata and PSUs from different design periods as independent. For example, NHIS 1993 and 1994 have the same sample design; therefore, the STRATA and PSU values are comparable throughout the entire sample design period. However, if we merge the 1993-1994 data with data from 1995, there will be two sets of uniquely identified STRATA and PSU values, because 1995 used a different sample design than 1993-1994.
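As a rough illustration of the grouping logic described above, the Python sketch below maps a survey year to the block of years that share one set of design identifiers. The names `DESIGN_BLOCKS`, `design_block`, and `blocks_in_pool` are hypothetical, and the labels are not values of any IPUMS NHIS variable. The year ranges are the design periods listed in this note, with 1995-1996 kept separate from 1997-2005. The sketch only shows which pooled years would be treated as independent; it does not reproduce how the integrated STRATA and PSU values are actually constructed.

```python
# Hypothetical helper: which survey years share a set of STRATA/PSU identifiers.
# Year ranges follow the design periods listed in this note.
DESIGN_BLOCKS = [
    (1963, 1972),
    (1973, 1984),
    (1985, 1994),
    (1995, 1996),  # same design period as 1997-2005, but separate design variables
    (1997, 2005),
    (2006, 2015),
    (2016, None),  # 2016-present
]


def design_block(year):
    """Return a label such as '1985-1994' for the block containing `year`."""
    for start, end in DESIGN_BLOCKS:
        if year >= start and (end is None or year <= end):
            return f"{start}-{end if end is not None else 'present'}"
    raise ValueError(f"no NHIS design block for year {year}")


def blocks_in_pool(years):
    """Return the sorted distinct blocks covered by a pooled set of years."""
    return sorted({design_block(year) for year in years})


print(blocks_in_pool([1993, 1994]))        # a single block
print(blocks_in_pool([1993, 1994, 1995]))  # two blocks, treated as independent
```

Under these assumptions, pooling 1993-1994 stays within one block, while adding 1995 brings in a second block whose strata and PSUs are treated as independent of the first.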

General Syntax to Account for Sample Design

The following general syntax will allow users to account for sampling weights and design variables when using STATA, SAS, or SAS-callable SUDAAN to estimate, for example, means using IPUMS NHIS data.

STATA

svyset psu [pweight=perweight], strata(strata)
svy: mean var1  

SAS

proc sort data = datasetname;
  by strata psu;
run;
proc surveymeans data = datasetname;
  weight perweight;
  strata strata ;
  cluster psu;
  var var1;
run;

SAS-callable SUDAAN

proc sort data = datasetname;
  by strata psu;
run;
proc descript data = datasetname filetype = sas design = wr;
  nest strata psu;
  weight perweight;
  var var1;
  print nsum wsum mean semean / nohead;
run;

Pooling Multiple Years of Data

The sampling weights in IPUMS NHIS represent annual inflation factors. In other words, for each individual, the weight reflects the number of people that individual survey respondent represents in the total U.S. population for a given year. Thus, if the analyst chooses to use multiple years of data, the sampling weight needs to be adjusted. There are several issues underlying variance estimation with complex survey data from more than one year:

  1. Annual samples within survey design periods are not statistically independent, because they are drawn from the same geographic areas each year. Thus, treating them as independent may result in standard errors that are too small. Years within design periods that are identical in design must be grouped together.
  2. Sample design periods are conceptually and statistically independent. Approximately every 10 years, NHIS constructs a new sample design that may include geographic areas different from those in the previous design period. Therefore, different design periods should be treated as independent.
  3. Pooling across sample design periods requires accounting for each distinct design period.

For example, imagine that an analyst wants to pool 10 years of data, from 2000 through 2009. The sampling weights need to be adjusted so that the total sample represents the U.S. population (on average) over the 10-year period. The simplest adjustment is to divide the weight by the number of years pooled (i.e., dividing PERWEIGHT by 10 in this example). More sophisticated adjustment methods can be employed, but it is not clear that they perform substantially better.
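The arithmetic behind the simple adjustment can be sketched in a few lines of Python. The records and weights below are made up for illustration, and `pooled_weights` is a hypothetical helper, not part of any IPUMS tool. The sketch assumes every pooled year contributes records and that each record carries its survey year.

```python
# Toy records: (year, perweight). All values are invented for illustration.
records = [
    (2000, 1500.0), (2000, 2500.0),
    (2001, 1800.0), (2001, 2400.0),
]


def pooled_weights(rows):
    """Divide each annual weight by the number of distinct years pooled."""
    n_years = len({year for year, _ in rows})
    return [(year, weight / n_years) for year, weight in rows]


adjusted = pooled_weights(records)

# Sum of the original weights within each year (the annual population totals).
annual_totals = {}
for year, weight in records:
    annual_totals[year] = annual_totals.get(year, 0.0) + weight

pooled_total = sum(weight for _, weight in adjusted)
average_annual_total = sum(annual_totals.values()) / len(annual_totals)

# The adjusted weights sum to the average of the annual totals.
print(pooled_total, average_annual_total)
```

In this toy case the two annual totals are 4000 and 4200, and the adjusted weights sum to their average, which is the sense in which the pooled sample represents the population "on average" over the period.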

The integrated variables STRATA and PSU in the IPUMS NHIS database have been adjusted from the original NHIS design variables to account for these issues. Thus, the analyst can simply select the STRATA and PSU variables to use for analysis of one year or for many years of IPUMS NHIS data.

Subsetting IPUMS NHIS Data

Often, analysts are interested in restricting analyses to a specific population (e.g., children under age 5 or American Indians/Alaska Natives). In these situations, many analysts will then either exclude all other cases in the database or use an if-statement during analysis. While correct point estimates will still be produced if the remaining cases are properly weighted, standard errors may be incorrectly computed.

If the analyst is interested in a specific subpopulation, it is necessary to use analytic techniques that do not compromise the sample design information. Specifically, it is typically recommended to use the full database with a statistical package (such as STATA, SAS, or SUDAAN) that can accommodate subpopulation analysis.
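To make the difference concrete, here is a minimal Python sketch of a Taylor-linearized standard error for a weighted mean, assuming PSUs sampled with replacement within strata and no finite population correction. The function `svy_mean`, the toy data, and the field names are all hypothetical; this is an illustration of the general idea, not the exact computation performed by any particular package.

```python
from collections import defaultdict
from math import sqrt


def svy_mean(rows, in_domain=lambda row: True):
    """Weighted mean and linearized SE, treating PSUs as sampled with replacement.

    Every row defines the design (strata and PSUs); only rows for which
    in_domain(row) is true contribute to the estimate itself.
    """
    w_total = sum(r["weight"] for r in rows if in_domain(r))
    mean = sum(r["weight"] * r["y"] for r in rows if in_domain(r)) / w_total

    # Linearized PSU totals; a PSU with no domain members keeps a total of 0
    # but still counts toward the number of PSUs in its stratum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for r in rows:
        z = r["weight"] * (r["y"] - mean) / w_total if in_domain(r) else 0.0
        psu_totals[r["strata"]][r["psu"]] += z

    variance = 0.0
    for totals in psu_totals.values():
        n = len(totals)
        if n < 2:
            raise ValueError("each stratum needs at least two PSUs")
        centre = sum(totals.values()) / n
        variance += n / (n - 1) * sum((t - centre) ** 2 for t in totals.values())
    return mean, sqrt(variance)


# Toy data, invented for illustration: PSU 3 in stratum 1 has no one aged 65+.
rows = [
    {"strata": 1, "psu": 1, "weight": 1.0, "age": 70, "y": 10.0},
    {"strata": 1, "psu": 1, "weight": 1.0, "age": 30, "y": 4.0},
    {"strata": 1, "psu": 2, "weight": 1.0, "age": 80, "y": 14.0},
    {"strata": 1, "psu": 3, "weight": 1.0, "age": 40, "y": 5.0},
    {"strata": 2, "psu": 1, "weight": 1.0, "age": 66, "y": 12.0},
    {"strata": 2, "psu": 2, "weight": 1.0, "age": 75, "y": 8.0},
    {"strata": 2, "psu": 2, "weight": 1.0, "age": 20, "y": 3.0},
]


def is_elderly(row):
    return row["age"] >= 65


# Subpopulation analysis: full file, domain indicator.
full_mean, full_se = svy_mean(rows, is_elderly)

# Subsetting: cases outside the subpopulation are dropped before analysis.
subset = [r for r in rows if is_elderly(r)]
sub_mean, sub_se = svy_mean(subset)

print(full_mean, sub_mean)  # identical point estimates
print(full_se, sub_se)      # standard errors differ
```

In this toy example the point estimates agree, but dropping the cases outside the subpopulation removes a PSU from stratum 1, so the two standard errors differ. The direction and size of the difference depend on the data.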

Syntax for Subpopulation Analysis

The following syntax demonstrates, generally, how an analyst can conduct subpopulation analysis using IPUMS NHIS data without compromising the design structure of the data. This approach has the effect of producing estimates for the population of interest, while incorporating the full sample design information for variance estimation.

STATA

svyset psu [pweight=perweight], strata(strata)
svy, subpop(if age >= 65): mean var1

SAS

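data datasetname;
  set datasetname;
  /* flag the subpopulation of interest; all cases stay in the file */
  if age ge 65 then subpopvar = 1;
  else subpopvar = 0;
run;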
proc sort data = datasetname;
  by strata psu;
run; 
proc surveymeans data = datasetname;
  weight perweight;
  strata strata ;
  cluster psu;
  domain subpopvar; 
  var var1;
run;

SAS-callable SUDAAN

proc sort data = datasetname;
  by strata psu;
run;
proc descript data = datasetname filetype = sas design = wr;
  nest strata psu;
  weight perweight;
  subpopn age >= 65 / name = "Population 65 years and older";
  var var1;
  print nsum wsum mean semean / nohead;
run;

REFERENCES

Korn, E.L. and Graubard, B.I. (1999) Analysis of Health Surveys. New York: John Wiley & Sons.

National Center for Health Statistics. (1975). Health Interview Survey Procedure 1957-1974. Vital Health Stat, 1(11).
http://www.cdc.gov/nchs/data/series/sr_01/sr01_011acc.pdf

National Center for Health Statistics. (1985). The National Health Interview Survey Design, 1973-84, and Procedures, 1975-83. Vital Health Stat, 1(18).
http://www.cdc.gov/nchs/data/series/sr_01/sr01_018acc.pdf

National Center for Health Statistics. (1989). Design and estimation for the National Health Interview Survey, 1985-94. Vital Health Stat, 2(110).
http://www.cdc.gov/nchs/data/series/sr_02/sr02_110.pdf

National Center for Health Statistics. (1999). National Health Interview Survey: Research for the 1995-2004 Redesign. Vital Health Stat, 2(126).
http://www.cdc.gov/nchs/data/series/sr_02/sr02_126.pdf

National Center for Health Statistics. (2000). Design and Estimation for the National Health Interview Survey, 1995-2004. Vital Health Stat, 2(130).
http://www.cdc.gov/nchs/data/series/sr_02/sr02_130.pdf

National Center for Health Statistics. (2017). Survey Description, National Health Interview Survey, 2016. Hyattsville, MD.
ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NHIS/2016/srvydesc.pdf

NCHS Guidance on Variance Estimation

National Center for Health Statistics. "Variance Estimation for the 1963-72 NHIS Public Use Person Data."
www.cdc.gov/nchs/data/nhis/6372var.pdf

National Center for Health Statistics. "Variance Estimation for the 1973-84 NHIS Public Use Person Data."
www.cdc.gov/nchs/data/nhis/7384var.pdf

National Center for Health Statistics. "Variance Estimation for the 1985-94 NHIS Public Use Person Data."
www.cdc.gov/nchs/data/nhis/8594var.pdf

National Center for Health Statistics. "Variance Estimation for NHIS Public Use Person Data, 1995 & 1996."
www.cdc.gov/nchs/data/nhis/96var.pdf

National Center for Health Statistics. "Variance Estimation and Other Analytic Issues in the 1997-2005 NHIS."
www.cdc.gov/nchs/data/nhis/9705var.pdf

National Center for Health Statistics. "Variance Estimation and Other Analytic Issues, NHIS 2006-2010."
www.cdc.gov/nchs/data/nhis/2006var.pdf

NCHS Guidance for NHIS-Linked Mortality Files

National Center for Health Statistics. "National Health Interview Survey (1986-2004) Linked Mortality Files. Analytic Guidelines."
www.cdc.gov/nchs/data/datalinkage/nhis_mort_analytic_guidelines.pdf

National Center for Health Statistics. Office of Analysis and Epidemiology. Analytic Guidelines for NCHS 2011 Linked Mortality Files, August, 2013. Hyattsville, Maryland.
www.cdc.gov/nchs/data/datalinkage/2011_linked_mortality_analytic_guidelines.pdf
