User Note - Variance Estimation
Analysis and Variance Estimation with IPUMS NHIS
The National Health Interview Survey (NHIS) is a complex, multistage probability sample that incorporates stratification, clustering, and oversampling of some subpopulations (Black, Hispanic, and Asian) in some years. For more information about the NHIS sample design, users are advised to review the user note on SAMPLE DESIGN. Because of the complex sample design of the NHIS, users of IPUMS NHIS data must make use of sampling weights to produce estimates that are representative of the civilian, noninstitutionalized population. Analysts are advised to review the user note on SAMPLING WEIGHTS with IPUMS NHIS data for additional information on the use of weights and on the different weights available.
While appropriate use of sampling weights will produce correct point estimates (e.g., means, proportions), statistical techniques that account for the complex sample design are also necessary to produce correct standard errors and statistical tests. Specifically, variables to account for the complex sample design (STRATA and PSU) are available in the IPUMS NHIS dataset and must be used to obtain appropriate variance estimates (standard errors) when computing annual estimates, pooled estimates, or multivariate estimates.
Due to confidentiality issues, many of the original sample design variables are suppressed in the NHIS public use files. NCHS has released public use design variables representing pseudo-strata and pseudo-PSU variables for the years 1963 to the present. These are constructed variables for analytic purposes that do not reflect the actual stratification and primary sampling units used to draw the sample. Using these variables in analysis performs reasonably well as an alternative to using the original design variables.
IPUMS NHIS Technical Variables for Analysis and Variance Estimation
Three technical variables are needed for analysis of the IPUMS NHIS data:
- A sampling weight (e.g., PERWEIGHT, SAMPWEIGHT, MORTWT) must be chosen, based on the sampling universe of the variables. The sampling weight represents the inverse probability of selection into the sample, with adjustment for non-response, as well as post-stratification (before 2019) or raking (2019 and later) adjustments for age, race/ethnicity, and sex. Analysts should review variable descriptions of the variables of interest and the user note on SAMPLING WEIGHTS for more information about which weight to use.
- STRATA is an integrated variable that represents the impact of the sample design stratification on the estimates of variance and standard errors. It is constant within a sample design period and changes between sample design periods, with the exception of 1995 and 1996. The sample design periods are 1963-1972, 1973-1984, 1985-1994, 1995-2005, 2006-2015, and 2016-present. Although the years 1995-2005 are in the same design period, NCHS released strata variables for 1995-1996 that differ from those for 1997-2005. NCHS advises treating 1995-1996 and 1997-2005 as independent samples.
- PSU is an integrated variable that represents the impact of the sample design clustering on the estimates of variance and standard errors. It is constant within a sample design period and changes between sample design periods, with the exception of 1995 and 1996. The years 1995-2005 are in the same design period; however, NCHS released PSU variables for 1995-1996 that differ from those for 1997-2005.
Variance Estimation for 1963-Present
We have constructed our survey design variables so that they can be used when examining data from one year or from many years. To do this, we employed the method suggested by Korn and Graubard (1999), which is also recommended in the NCHS guidance for pooling NHIS data over multiple years and sample designs. This is the concatenated design period pooling approach (what Korn and Graubard refer to as situation 2 on page 280 of Analysis of Health Surveys). In this approach, standard errors are calculated by pooling data from many years and sample design periods into a single file, treating strata and PSUs from the same design period as the same but treating strata and PSUs from different design periods as independent. For example, NHIS 1993 and 1994 have the same sample design; therefore, the STRATA and PSU values are comparable throughout the entire sample design period. However, if we merge the 1993-1994 data with data from 1995, there will be two sets of uniquely identified STRATA and PSU values, because 1995 used a different sample design than 1993-1994.
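As a rough illustration of the design periods listed above, the following Python sketch maps survey years to design periods so an analyst can check how many independent designs a pooled extract spans. Python is not one of the packages covered in this note, and the names `DESIGN_PERIODS`, `design_period`, and `periods_in_extract` are hypothetical helpers introduced only for this example. The year boundaries come from the STRATA description, with 1995-1996 kept separate from 1997-2005 as NCHS advises.

```python
# Design-period boundaries as listed in this note. 1995-1996 and 1997-2005 are
# kept separate because NCHS advises treating them as independent samples.
DESIGN_PERIODS = [
    (1963, 1972),
    (1973, 1984),
    (1985, 1994),
    (1995, 1996),
    (1997, 2005),
    (2006, 2015),
    (2016, None),  # None marks the open-ended current period ("present")
]


def design_period(year):
    """Return a label such as '1985-1994' for the design period containing `year`."""
    for start, end in DESIGN_PERIODS:
        if year >= start and (end is None or year <= end):
            return f"{start}-{end if end is not None else 'present'}"
    raise ValueError(f"year {year} is earlier than the first listed design period")


def periods_in_extract(years):
    """Sorted list of the distinct design periods covered by a set of survey years."""
    return sorted({design_period(y) for y in years})


# The example from the text: 1993 and 1994 share a design, 1995 does not.
print(periods_in_extract([1993, 1994, 1995]))
```

This check is only descriptive. The integrated STRATA and PSU variables already carry the design-period structure, so no recoding of those variables is implied.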
General Syntax to Account for Sample Design
The following general syntax will allow users to account for sampling weights and design variables when using Stata, R, SAS, or SAS-callable SUDAAN to estimate, for example, means using IPUMS NHIS data.
Stata
svyset psu [pweight=perweight], strata(strata)
svy: mean var1
R using the ipumsr package
library(survey)
library(srvyr)
# `data` is an IPUMS NHIS extract already read into R
data <- as_survey_design(data, ids = PSU, weights = PERWEIGHT, strata = STRATA, nest = TRUE)
summarise(data, var1_mean = survey_mean(var1, na.rm = TRUE))
SAS
proc sort data = datasetname;
  by strata psu;
run;
proc surveymeans data = datasetname;
  weight perweight;
  strata strata;
  cluster psu;
  var var1;
run;
SAS-callable SUDAAN
proc sort data = datasetname;
  by strata psu;
run;
proc descript data = datasetname filetype = sas design = wr;
  nest strata psu;
  weight perweight;
  var var1;
  print nsum wsum mean semean / nohead;
run;
Pooling Multiple Years of Data
The sampling weights in IPUMS NHIS represent annual inflation factors. In other words, for each individual, the weight reflects the number of people that individual survey respondent represents in the total U.S. population for a given year. Thus, if the analyst chooses to use multiple years of data, the sampling weight needs to be adjusted. There are several issues underlying variance estimation with complex survey data from more than one year:
- Annual samples within survey design periods are not statistically independent, because they are drawn from the same geographic areas each year (Moriarity et al. 2022). Thus, treating them as independent may result in standard errors that are too small. Years within design periods that are identical in design must be grouped together.
- Sample design periods are conceptually and statistically independent. Approximately every 10 years, NHIS constructs a new sample design which may include some different geographic areas than were included in the previous design period. Therefore, different design periods should be treated as independent.
- Pooling across sample design periods requires accounting for each distinct design period.
As an example of the weight adjustment, suppose an analyst wants to pool 10 years of data, 2000-2009. The sampling weights need to be adjusted so that the total sample represents the U.S. population (on average) over the 10-year period. The simplest adjustment is to divide the weight by the number of years of data pooled (i.e., dividing PERWEIGHT by 10 in this example). More sophisticated adjustment methods are available, but it is not clear that they perform substantially better.
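The divide-by-number-of-years adjustment can be sketched in a few lines of Python. Python is not one of the packages covered in this note, and the record layout, the helper `add_pooled_weight`, and the output name `pooled_wt` are hypothetical choices made only for this illustration. The toy weights are invented to make the arithmetic easy to follow.

```python
def add_pooled_weight(records, weight_key="perweight", out_key="pooled_wt"):
    """Divide each annual weight by the number of distinct survey years pooled.

    `records` is a list of dicts with a "year" entry and an annual weight.
    The year count is taken from the records passed in, so this should be
    applied to the full pooled file, not to a subset of it.
    """
    n_years = len({r["year"] for r in records})
    return [{**r, out_key: r[weight_key] / n_years} for r in records]


# Toy two-year pool: annual weighted totals are 8,000 (2000) and 10,000 (2001).
toy = [
    {"year": 2000, "perweight": 3000.0},
    {"year": 2000, "perweight": 5000.0},
    {"year": 2001, "perweight": 4000.0},
    {"year": 2001, "perweight": 6000.0},
]

pooled = add_pooled_weight(toy)
pooled_total = sum(r["pooled_wt"] for r in pooled)

# The pooled total equals the average of the two annual totals.
print(pooled_total)
```

With this adjustment, weighted totals from the pooled file represent an average annual population, not the sum of the populations across years.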
The integrated variables STRATA and PSU in the IPUMS NHIS database have been adjusted from the original NHIS design variables to account for the design-period issues listed above. Thus, the analyst can simply select the STRATA and PSU variables for analysis of one year or of many years of IPUMS NHIS data.
TAKE NOTE: Special Considerations when Pooling Data
1. Change to sampling weight methodology implemented in 2019. The process of generating sampling weights changed sharply from the approach employed in 2018 and earlier years. Because this marked change was accompanied by a major redesign of the NHIS questionnaire and data collection approach, it is not possible to know whether differences detected between 2019 and earlier years are due to changes in the sampling weights, to the questionnaire or data collection redesign, or to actual change in the phenomena under study. Results of a bridge test conducted by NCHS in 2018-19 indicate that differences in prevalence estimates between the pre-2019 and 2019-forward years of data are likely influenced by the 2019 redesign. Based on the results of the Bridge Test, IPUMS NHIS recommends that users not compare trends in the pre-2019 data with trends in the 2019-forward data. NCHS has signaled that it plans to release additional evaluation results as more 2019-forward data become available. We will update our guidance based on the findings of any such evaluations.
2. Extra adjustments needed when pooling 2019 and 2020 samples. To improve adjustment of the 2020 sampling weights for nonresponse, NCHS re-contacted selected 2019 NHIS sample adults to complete the 2020 NHIS interview between August and December of 2020. This longitudinal sample, also known as the 2020 followback sample, consists of 10,415 sample adults and can be analyzed as a one-time longitudinal panel with observations, spaced one year apart, that take place before and during the COVID-19 pandemic. Because both the 2019 and the 2020 samples contain these 10,415 sample adults, however, special measures must be taken when pooling the 2019 and 2020 samples to increase sample size. NCHS advises adjusting both the sample and the sampling weight. First, drop any 2020 sample adult records with zero values on the partial sample weight for 2020 (PARTWEIGHT). This will retain only the 2019 observations of longitudinal sample members in the pooled sample. Second, use PARTWEIGHT rather than SAMPWEIGHT for sample adults in the 2020 sample. Note that sample children were not included in the 2019-2020 longitudinal sample, so this adjustment does not need to be made for pooled analyses of sample children in the 2019-2020 samples. We provide sample code (in Stata) to make this adjustment for an IPUMS NHIS extract containing both the 2019 and 2020 samples:
* Drop 2020 sample adults with a zero partial weight; their 2019 records are kept
drop if age > 17 & partweight == 0 & year == 2020
* SAMPWEIGHT for 2019 and for 2020 sample children; PARTWEIGHT for 2020 sample adults
gen pooled_weight = .
replace pooled_weight = sampweight if year == 2019 | (year == 2020 & age < 18)
replace pooled_weight = partweight if year == 2020 & age > 17 & age != .
Make any other adjustments to pooled_weight for analyses of pooled data as described in previous sections. For more information about COVID-19 impacts on NHIS data collection, please see our user note.
3. Sampling weight adjustments needed when analyzing COVID-19 data. Because much of the COVID-related content available in the 2020-2021 samples is not available for all calendar quarters of data collection in those years, analysts must adjust the annual sampling weights to account for partial year coverage to correctly produce estimates based on the 2020-2021 COVID data. For example, the NHIS collected data about COVID-19 beginning in calendar quarter 3 of 2020, adult COVID-19 vaccination information beginning in calendar quarter 2 of 2021, and COVID-19 vaccination for children aged 12-17 beginning in calendar quarter 3 of 2021. Depending on analytical goals, NCHS outlines several different approaches for correctly producing estimates based on the 2020-2021 COVID-19 information (based on guidance found in the 2020 and 2021 NHIS survey descriptions). Below, we outline the approach for correctly producing population estimates based on partial year data. For other use cases, such as pooling together all available calendar quarters where COVID-19 information was collected or to produce semi-annual trend estimates, please refer to the section on "Analyzing 2021 NHIS" in the 2021 NHIS survey description (starting on p. 41).
To produce correct population estimates based on measures collected for only part of the year, analysts will need to adjust the annual sampling weights included in the 2020 and 2021 data. All of the COVID-19 content available for the 2020 NHIS was collected only in calendar quarters 3 and 4, and some of the COVID-19 content available for the 2021 NHIS, such as COVID-19 vaccination for adults and COVID-19 vaccination for children ages 12-17, was collected only in some calendar quarters of 2021. To adjust the sampling weights, analysts should first create an interim sampling weight that sets the value of the annual sampling weight (SAMPWEIGHT) to zero for all calendar quarters in which NHIS did not collect the COVID-19 data. They should then multiply the interim sampling weight by the number of calendar quarters in the year (4) divided by the number of calendar quarters in which the data were collected, which inflates the weight to cover a full calendar year. For example, the COVID-19 vaccination measure for sample adults was collected in calendar quarters 2, 3, and 4 of 2021, so the multiplier would be 4/3; similarly, the COVID-19 vaccination measure for children ages 12-17 was collected in calendar quarters 3 and 4, so the multiplier would be 4/2, doubling the interim weight. To illustrate for adults in 2021 (in Stata):
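* Zero out the weight for quarter 1, when the adult vaccination measure was not collected
gen covid_interimwt = sampweight
replace covid_interimwt = 0 if intervwqtr == 1
* Inflate the remaining three quarters to represent a full calendar year
gen covidwt = covid_interimwt*4/3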
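The same two-step rule (zero out uncovered quarters, then multiply by 4 divided by the number of covered quarters) can be written as a general function. The Python sketch below is only an illustration; Python is not one of the packages covered in this note, and the helper name `partial_year_weight` and its arguments are hypothetical. The weight value of 3000 is invented for the example.

```python
def partial_year_weight(annual_weight, interview_quarter, quarters_collected):
    """Adjust an annual weight for a measure collected in only some quarters.

    Records interviewed in a quarter without the measure get weight 0; the
    rest are inflated by 4 / (number of quarters with the measure).
    """
    quarters = set(quarters_collected)
    if not quarters or not quarters <= {1, 2, 3, 4}:
        raise ValueError("quarters_collected must be a non-empty subset of 1-4")
    if interview_quarter not in quarters:
        return 0.0  # interim weight: measure not collected in this quarter
    return annual_weight * 4 / len(quarters)


# Adult vaccination measure, 2021: collected in quarters 2, 3, and 4 (multiplier 4/3).
adult_q1 = partial_year_weight(3000.0, 1, [2, 3, 4])
adult_q3 = partial_year_weight(3000.0, 3, [2, 3, 4])

# Vaccination measure for children ages 12-17, 2021: quarters 3 and 4 (multiplier 4/2).
child_q4 = partial_year_weight(3000.0, 4, [3, 4])

print(adult_q1, adult_q3, child_q4)
```

Which quarters apply to a given measure should be taken from the variable description or the NHIS survey description for that year.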
Subsetting IPUMS NHIS Data
Often, analysts are interested in restricting analyses to a specific population (e.g., children under age 5 or American Indians/Alaska Natives). In these situations, many analysts will then either exclude all other cases in the database or use an if-statement during analysis. While correct point estimates will still be produced if the remaining cases are properly weighted, standard errors may be incorrectly computed.
If the analyst is interested in a specific subpopulation, it is necessary to use analytic techniques that do not compromise the sample design information. Specifically, it is typically recommended to use the full database with a statistical package (such as Stata, SAS, or SUDAAN) that can accommodate subpopulation analysis.
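One way subsetting can compromise the design information is by removing entire PSUs from a stratum when those PSUs contain no members of the subpopulation. The Python sketch below illustrates this with invented toy records. It is only an illustration, Python is not one of the packages covered in this note, and the helper `psus_per_stratum` is hypothetical.

```python
from collections import defaultdict


def psus_per_stratum(records):
    """Count the distinct PSUs observed within each stratum."""
    seen = defaultdict(set)
    for r in records:
        seen[r["strata"]].add(r["psu"])
    return {stratum: len(psus) for stratum, psus in seen.items()}


# Invented toy file: two strata with two PSUs each.
toy = [
    {"strata": 1, "psu": 1, "age": 70},
    {"strata": 1, "psu": 1, "age": 30},
    {"strata": 1, "psu": 2, "age": 40},  # the only record in this PSU is under 65
    {"strata": 2, "psu": 1, "age": 68},
    {"strata": 2, "psu": 2, "age": 81},
]

full_design = psus_per_stratum(toy)
after_dropping = psus_per_stratum([r for r in toy if r["age"] >= 65])

print(full_design)
print(after_dropping)
```

In the toy file, dropping everyone under 65 leaves stratum 1 with a single PSU, so software working from the reduced file no longer sees the sampled design. Subpopulation options keep all records in the file and restrict only the estimation, which avoids this problem.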
Syntax for Subpopulation Analysis
The following syntax demonstrates, generally, how an analyst can conduct subpopulation analysis using IPUMS NHIS data without compromising the design structure of the data. This approach has the effect of producing estimates for the population of interest, while incorporating the full sample design information for variance estimation.
Stata
svyset psu [pweight=perweight], strata(strata)
svy, subpop(if age >= 65): mean var1
R using the ipumsr package
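library(survey)
library(srvyr)
# `data` is an IPUMS NHIS extract already read into R
data <- as_survey_design(data, ids = PSU, weights = PERWEIGHT, strata = STRATA, nest = TRUE)
# Variable names are case-sensitive in R; AGE matches the uppercase names used above
data %>%
  filter(AGE >= 65) %>%
  summarise(var1_mean = survey_mean(var1, na.rm = TRUE))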
SAS
data datasetname;
  set datasetname;
  if age ge 65 then subpopvar = 1;
  else subpopvar = 0;
run;
proc sort data = datasetname;
  by strata psu;
run;
proc surveymeans data = datasetname;
  weight perweight;
  strata strata;
  cluster psu;
  domain subpopvar;
  var var1;
run;
SAS-callable SUDAAN
proc sort data = datasetname;
  by strata psu;
run;
proc descript data = datasetname filetype = sas design = wr;
  nest strata psu;
  weight perweight;
  subpopn age >= 65 / NAME = "Population 65 years and older";
  var var1;
  print nsum wsum mean semean / nohead;
run;
References
Korn, E.L. and Graubard, B.I. (1999) Analysis of Health Surveys. New York: John Wiley & Sons.
National Center for Health Statistics. (1975). Health Interview Survey Procedure 1957-1974. Vital Health Stat, 1(11).
National Center for Health Statistics. (1985). The National Health Interview Survey Design, 1973-84, and Procedures, 1975-83. Vital Health Stat, 1(18).
National Center for Health Statistics. (1989). Design and estimation for the National Health Interview Survey, 1985-94. Vital Health Stat, 2(110).
National Center for Health Statistics. (1999). National Health Interview Survey: Research for the 1995-2004 Redesign. Vital Health Stat, 2(126).
National Center for Health Statistics. (2000). Design and Estimation for the National Health Interview Survey, 1995-2004. Vital Health Stat, 2(130).
Moriarity, C. and Parsons, V. (2015). "2016 Sample Redesign of the National Health Interview Survey." Paper presented at the 2015 Joint Statistical Meetings, Survey Research Methods Section.
National Center for Health Statistics. (2017). Survey Description, National Health Interview Survey, 2016. Hyattsville, MD.
National Center for Health Statistics. (2020). Survey Description, National Health Interview Survey, 2019. Hyattsville, MD.
NCHS Guidance on Variance Estimation
National Center for Health Statistics. "Variance Estimation for the 1963-72 NHIS Public Use Person Data."
National Center for Health Statistics. "Variance Estimation for the 1973-84 NHIS Public Use Person Data."
National Center for Health Statistics. "Variance Estimation for the 1985-94 NHIS Public Use Person Data."
National Center for Health Statistics. "Variance Estimation for NHIS Public Use Person Data, 1995 & 1996."
National Center for Health Statistics. "Variance Estimation and Other Analytic Issues in the 1997-2005 NHIS."
National Center for Health Statistics. "Variance Estimation and Other Analytic Issues, NHIS 2006-2010."
National Center for Health Statistics. (2021). Survey Description, National Health Interview Survey, 2020. Hyattsville, MD.
National Center for Health Statistics. (2022). Survey Description, National Health Interview Survey, 2021. Hyattsville, MD.
Moriarity, C., Parsons, V.L., Jonas, K., Schar, B.G., Bose, J., and Bramlett, M.D. (2022). Sample Design and Estimation Structures for the National Health Interview Survey, 2016-2025. National Center for Health Statistics. Vital Health Stat, 2(191). DOI: https://dx.doi.org/10.15620/cdc:115394
NCHS Guidance for NHIS-Linked Mortality Files
National Center for Health Statistics. "National Health Interview Survey (1986-2004) Linked Mortality Files. Analytic Guidelines."
National Center for Health Statistics. Office of Analysis and Epidemiology. Analytic Guidelines for NCHS 2011 Linked Mortality Files, August, 2013. Hyattsville, Maryland.
National Center for Health Statistics. The Linkage of National Center for Health Statistics Survey Data to the National Death Index -- 2019 Linked Mortality File (LMF): Linkage Methodology and Analytic Considerations, June 2022. Hyattsville, MD. https://www.cdc.gov/nchs/data/datalinkage/2019NDI-Linkage-Methods-and-Analytic-Considerations-508.pdf