Frequently Asked Questions (FAQ)

General information about the project
    What is IPUMS NHIS?
    What is in the future for IPUMS NHIS?
    How do IPUMS NHIS files differ from the NHIS public use files already in distribution?
    How does IPUMS NHIS add value to NHIS data?
Getting started
    Where should a new user start?
    How do I get access to IPUMS NHIS data?
Basic concepts
    What are microdata?
    What is "integration"? What are "integrated variables"?
    What are "weights"?
    What does "universe" mean in the variable descriptions?
    Why is a variable from the NHIS that I have worked with before not included in IPUMS NHIS?
    Can I combine IPUMS NHIS data with other NHIS variables needed for my research?
    How is a record uniquely identified?
Data Limitations and Cautions to Users
    What are the major limitations of the data?
    Are there aspects of IPUMS NHIS data about which to be particularly careful?
    Is help available if I encounter problems using IPUMS NHIS?
Getting data
    How do I obtain data?
    What is the data format?
    What is the best way to use the extract system?
    How long does a data extract take?
    How does "sample selection" work on the IPUMS NHIS website?
    Can I get the original data?

General information about the project

What is IPUMS NHIS?

The IPUMS National Health Interview Series (IPUMS NHIS) is a harmonized set of data and documentation based on material originally included in the public use files of the U.S. National Health Interview Survey (NHIS) and distributed for free over the Internet. IPUMS NHIS variables are given consistent codes and have been thoroughly documented to facilitate cross-temporal comparisons. The "integration" process is described more fully below.

IPUMS NHIS is not a collection of compiled statistics; it is composed of microdata. Each record represents a person, with all characteristics of that person numerically coded. These person records are organized into households, making it possible to study the characteristics of people in the context of their families or other co-residents. Because the data refer to individuals and not tables, researchers commonly use a statistical package to analyze the records in the IPUMS NHIS database. A data extraction system enables users to select only the survey years and variables they require. Researchers may also analyze IPUMS NHIS data using the online tabulator on the website.

What is in the future for IPUMS NHIS?

The IPUMS NHIS Project is funded by a grant from the National Institute of Child Health and Human Development (NICHD). We plan annual data releases that will add additional variables, more variable groups, and new website features.

We hope to continue the project beyond our present five-year funding period, but we will have to secure further funding as our current grant expires. To be successful, we need to demonstrate the existence of a large body of works based on IPUMS NHIS data or documentation. If you use IPUMS NHIS to create educational materials, satisfy a course requirement, or prepare a report, presentation, publication, or thesis, please tell us about it by adding to our bibliography site.

How do IPUMS NHIS files differ from the NHIS public use files already in distribution?

Public use files for the NHIS are the basis for IPUMS NHIS data. These original public use files also include variables not yet included in the IPUMS NHIS public database. Researchers can access the original files through the NHIS website. Directions on how to link variables from these original NHIS public use files to IPUMS NHIS data extracts can be found in the user note on linking.

The IPUMS NHIS Project recodes the original public use data to increase consistency over time. For the most part, IPUMS NHIS does not use the same variable names included in the original public use data; variables have been renamed to increase consistency over time and within subject categories. Detail from the original public use variables is preserved in IPUMS NHIS integrated variables, but codes and value labels are often different in the IPUMS NHIS version of a variable. For a crosswalk between the original NHIS public use variable names and the IPUMS NHIS variable names, use the variable concordance (as a search tool or downloaded into an Excel file).

How does IPUMS NHIS add value to NHIS data?

IPUMS NHIS includes many variable features not available for the original NHIS public use files. By using the IPUMS NHIS data extraction system, analysts can select the years and variables they are interested in and work with a single dataset, without having to link or combine multiple files. IPUMS NHIS provides online documentation that describes variable meaning and addresses comparability issues, along with providing information about years available, universes, codes and frequencies, question wording, appropriate weights, and source files for each included variable. The online tabulator allows experienced researchers to produce results quickly and allows new researchers to answer questions and create tables without downloading data or using a statistical package.

The large number of topics covered by the NHIS makes it difficult for researchers to determine from the public use files which variables are available across time. Changes in variable names, even when the question wording remains the same, pose a further challenge. IPUMS NHIS provides consistent variable names and displays which variables are available, by year and topic, in a user-friendly display on its website. Once researchers identify the years and variables relevant to their research project, they can create a data extract or analyze online the variables they choose.

Getting started

Where should a new user start?

The natural starting points are the "Select Data" or "Browse and Select Data" links on the top banner and the left navigation bar. These links open the variables page, the primary tool for exploring the contents of IPUMS NHIS. By default, the variables page displays one variable group at a time for all years in the data series. You can change the view option to show all groups simultaneously, but the page can get very large and slow to load. You can also filter the information at any point to include only the years of interest to you ("Select samples"). More detailed information on using the variable menu is available.

When you select samples, the page will display only variables present in those survey years. An "x" indicates the availability of a variable for a given year in the current IPUMS NHIS database.

On the variables page, clicking on a variable name brings up the variable's documentation. The information about the variable is contained in a number of tabs. The default tab is the brief description of the variable. For many variable descriptions, additional information on such topics as data collection, definitions, and related variables will display if a user clicks on a "show more" link. By clicking on hyperlinks within a variable description, you can access similar information for closely related variables. The "comparability" tab discusses comparability issues across years. The "questionnaire text" tab compiles the survey questions pertaining to the variable. Other tabs report the years the variable is available, the variable universe (i.e., who was asked the question), the appropriate weight(s) to use, and the name and source files of the original NHIS variable(s).

On the variables page, clicking on "codes" brings up the codes and labels for the associated variable and shows the availability of categories across survey years. These categories can suggest the types of research possible with a given sample. Via the codes page, users can also view the unweighted frequencies for each response category in each year. (Codes and frequencies for a variable can also be accessed as a tab within a variable description.)

If you have a specific substantive interest, such as "asthma," you may wish to use the "Search Variables" feature on the variables page. Entering a word (such as "asthma") and hitting the "search" button will bring up a list of all variables that include that term in the variable name, label, and, if you wish, variable descriptions and categories. Thus, for example, searching on "asthma" brings up a list of variables appearing not only in the "asthma" condition group, but also in other variable groups, such as causes of activity limitation and conditions treated with alternative medicine modalities.

Throughout the variable documentation system, there are buttons to "Add to cart." Any variables you select in this way are put in your data cart to include in a data extract. Your selections only last for the current web session.

The Data Cart in the upper right of the variables page keeps track of your variable and sample selections. Once you have made some selections, you can click on "View Cart" to review your choices. If you have selected variables and samples, you can enter the data extract system. To make a data extract, you must be registered to use IPUMS NHIS data. You can, however, log in as a "guest" and explore the steps involved in making a data extract without actually producing an extract. Detailed instructions for using the extraction system are below.

Before beginning analysis of IPUMS NHIS data, users are advised to review the material in the "user notes" section. These user notes discuss such issues as variance estimation, sample design, the use of weights, and how to link variables from the original NHIS public use files with IPUMS NHIS data. The user notes also provide counts of the number of person and household records in the IPUMS NHIS database for each year.

How do I get access to IPUMS NHIS data?

Access to the documentation and to IPUMS NHIS data analysis using the online tabulator is freely available. To get access to the data for downloading a customized data extract, users must agree to specified conditions of responsible use, which are similar to the conditions for using the NHIS public use files.

For purposes of internal recordkeeping, and to provide the IPUMS NHIS staff with a clear sense of the user constituency (to improve outreach and better serve users), registration also requires users to provide some information about themselves, such as their discipline, academic or non-academic status, and institutional affiliation. Registered users are automatically added to the IPUMS NHIS e-mail list and receive occasional newsletters reporting data releases and new website features. To register for access to the data, go to the IPUMS NHIS registration webpage.

Basic concepts

What are microdata?

Microdata are composed of individual records containing information collected on persons and households. The unit of observation is the individual. The responses of each person to the different survey questions are recorded in separate variables.

Microdata stand in contrast to more familiar "summary" or "aggregate" data. Aggregate data are compiled statistics, such as a table of marital status by sex for some locality. There are no such tabular or summary statistics in the IPUMS NHIS data.

Microdata are flexible. One need not depend on published statistics from a survey that compiled the data in a certain way, if at all. Users can generate their own statistics from the data in any manner desired, including performing individual-level multivariate analyses.

See an image of IPUMS NHIS data here. In this example, the data are presented in hierarchical format, which means that a household record is followed by individual records for each person in that household. If users choose the default option of rectangularized data, the household record's information appears at the beginning of the record for each person in that household, and there are no separate household records.
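The difference between the hierarchical and rectangularized layouts can be sketched in a few lines of Python. This is only an illustration of the idea, not the extract system's own code: the "rectype" field name, the REGION variable, and all values are assumptions made up for the example.

```python
# Hypothetical hierarchical extract: each household ("H") record is followed
# by the person ("P") records of its members. The field names and values are
# illustrative assumptions, not the layout of a real extract.
hierarchical = [
    {"rectype": "H", "SERIAL": 1, "REGION": 2},
    {"rectype": "P", "SERIAL": 1, "PERNUM": 1, "AGE": 34},
    {"rectype": "P", "SERIAL": 1, "PERNUM": 2, "AGE": 7},
    {"rectype": "H", "SERIAL": 2, "REGION": 4},
    {"rectype": "P", "SERIAL": 2, "PERNUM": 1, "AGE": 61},
]


def rectangularize(records):
    """Copy each household's fields onto the person records that follow it."""
    rows = []
    household = {}
    for rec in records:
        fields = {k: v for k, v in rec.items() if k != "rectype"}
        if rec["rectype"] == "H":
            household = fields  # remember the current household
        else:
            rows.append({**household, **fields})  # one output row per person
    return rows


rectangular = rectangularize(hierarchical)
```

The result has one row per person and no separate household rows, which is the shape described above for the default rectangularized output.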

What is "integration"? What are "integrated variables"?

Integration is the process of making variables more comparable across survey years. For example, every year the NHIS collected data on completed schooling, but the presentation of these data changed over time. From the 1960s through 1981, educational data are reported as years of completed schooling, grouped into intervals; for 1982-96, these data are reported as years of completed schooling in single years; and for 1997 forward, these data are reported as degrees attained for those with more than a high school education. IPUMS NHIS includes several education-related variables to preserve the detail for every period, but it also provides a bridging variable, EDUCREC2, that recodes the educational data into a single, consistent coding scheme.

Because some survey years provide more detail for a given variable than is the case in other years, a coding scheme that reduced variables down to the lowest common denominator across all survey years would inevitably lose important information. As a result, many IPUMS NHIS integrated variables use composite coding schemes. The first one or two digits of the code provide information available across all samples. The next one or two digits provide additional information available in a broad subset of years. Finally, trailing digits provide detail only rarely available. All meaningful detail in the original NHIS public use files is therefore available to researchers, if they need it, but they can confine their attention to the less detailed information if they wish. For example, the first digit of the "employment status" variable (EMPSTAT) groups the population into the broad categories generally used by the U.S. Bureau of Labor Statistics (i.e., working, with a job but not at work, unemployed, not in the labor force, and, for children, not in universe), while subsequent digits provide additional detail (e.g., currently employed but was not at work and was looking for work during the previous week) only available prior to 1997.
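As a rough sketch of how an analyst might collapse a composite code to its general category, the following Python helper keeps only the leading digit(s). The code values and the three-digit width are hypothetical; the actual width and meaning of each digit must be taken from the codes page of the variable in question.

```python
def general_code(code, width, n_general=1):
    """Return the leading `n_general` digits of a composite code.

    `width` is the full number of digits in the coding scheme, so that codes
    with leading zeros (for example 020 stored as 20) are handled correctly.
    """
    return int(str(code).zfill(width)[:n_general])


# Hypothetical three-digit detailed codes, for illustration only.
detailed = [100, 120, 211, 300]
general = [general_code(c, 3) for c in detailed]
```

Grouping on the general code gives categories that are comparable across all samples, while the full code retains the extra detail for the years that have it.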

A second feature of integration in IPUMS NHIS is combining, into a single variable, material covering comparable substantive ground but appearing in different types of files in the original NHIS public use data. For example, IPUMS NHIS combines information on whether an individual had a usual place for medical care into a single variable (USUALPL). This information appeared for many years under different variable names and in several different types of files (including the Health Promotion and Disease Prevention, Cancer Control, Child Health, and Access to Care supplements and the core Sample Adult and Sample Child files) in the original NHIS public use files.

Most of the integration work is carried out using translation tables. This example of a translation table covers selected years of data for DVINT, interval since last doctor visit. In almost every year, the NHIS included this question, but the original NHIS public use files coded responses into different intervals in different years. Moreover, the variable was included in the person files for 1963-96 and in the sample adult and sample child files for 1997 forward. IPUMS NHIS combines these data into a single variable and uses a composite coding scheme to facilitate comparisons across years (using the first digit) without losing the detail present in every year (using the second digit).

A third, key component of integration is the variable documentation, which highlights important comparability issues. Particularly important are comparability problems that are not evident from the coding structure, such as changes in the survey question wording and shifts in the variable universe. The IPUMS NHIS Project staff must exercise their judgment in composing this documentation, because there is no formula for it. So that users need not depend totally on us, IPUMS NHIS documentation also includes web-accessible copies of the survey forms, with the "survey text" tab of each variable description reproducing the question(s) related to that variable.

What are "weights"?

NHIS data are collected through a complex stratified sampling scheme that includes oversampling of some population subgroups. This means that persons and households with some characteristics are overrepresented in the samples, while others are underrepresented. To obtain representative statistics using IPUMS NHIS data, users must apply weights.

IPUMS NHIS contains several weights. Which weight to use depends on the unit of analysis (household or person) and the sampling approach for the variable(s) in question (e.g., all persons versus a sample adult or sample child drawn from each family).

Each variable description contains a tab specifying the weight that should be used with that variable in each year, if that variable were analyzed in isolation. If multiple variables using different sampling strategies and weights are combined in one table or in a multivariate analysis, then the weight employed should fit the variable with the most restrictive sampling scheme. For example, the variables AGE and SEX apply to all persons, and therefore take the weight PERWEIGHT, while ASTHMAEV (ever told had asthma) was collected for sample persons only and takes the weight SAMPWEIGHT. To make a table of ASTHMAEV by AGE, controlling for SEX (using either downloaded data or the online tabulator), users should apply SAMPWEIGHT, which matches ASTHMAEV, the variable with the most restrictive sampling scheme.
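The ASTHMAEV example can be illustrated with a small weighted calculation. This is a minimal sketch with made-up records and weights; the code 0 for persons outside the sample and the coding 1 = No, 2 = Yes are assumptions for the illustration, and real analyses should also account for the sample design when estimating variances.

```python
# Hypothetical person records; all weights and codes are invented.
# ASTHMAEV is assumed to be coded 1 = No, 2 = Yes, with 0 for persons
# who were not sample persons (and who therefore have SAMPWEIGHT of 0).
people = [
    {"SEX": 1, "ASTHMAEV": 2, "PERWEIGHT": 900.0, "SAMPWEIGHT": 2500.0},
    {"SEX": 2, "ASTHMAEV": 1, "PERWEIGHT": 1100.0, "SAMPWEIGHT": 3000.0},
    {"SEX": 2, "ASTHMAEV": 2, "PERWEIGHT": 1000.0, "SAMPWEIGHT": 1500.0},
    {"SEX": 1, "ASTHMAEV": 0, "PERWEIGHT": 950.0, "SAMPWEIGHT": 0.0},
]


def weighted_share(rows, var, value, weight):
    """Weighted share of rows with rows[var] == value, among rows with weight > 0."""
    in_scope = [r for r in rows if r[weight] > 0]
    total = sum(r[weight] for r in in_scope)
    hits = sum(r[weight] for r in in_scope if r[var] == value)
    return hits / total


# ASTHMAEV has the most restrictive sampling scheme, so SAMPWEIGHT is used
# even though SEX and AGE on their own would take PERWEIGHT.
share_ever_asthma = weighted_share(people, "ASTHMAEV", 2, "SAMPWEIGHT")
```

Using PERWEIGHT here instead would mix in persons who were never asked the asthma question, which is why the more restrictive weight governs the whole table.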

For more information about the use of weights with IPUMS NHIS data, consult the User Note on Weights.

What does "universe" mean in the variable descriptions?

The universe is the population at risk of having a response for the variable in question. In most cases, these are the households or persons to whom the survey question was asked, as reflected on the survey questionnaire. For example, employment variables do not include children, since the NHIS does not ask children about employment.

Cases that are outside of the universe for a variable are labeled "NIU" (Not In Universe) on the codes page. A change in a variable's universe across years is a common data comparability issue.

In some cases, IPUMS NHIS imposes a different variable universe than the one found in the original NHIS public use data. Usually this is done to distinguish cases with meaningful zeros (e.g., adults who reported having individual incomes of zero dollars) from cases where "NIU" was also originally coded as zero (e.g., for children, who were not asked about their individual incomes). The "universe" tab of each variable description specifies the universe for that variable.
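The practical consequence is that "NIU" cases should be set aside before computing statistics, while true zeros are kept. The sketch below uses an invented NIU code and invented income values; the real NIU code for any variable is listed on its codes page, and a real analysis would also apply the appropriate weight.

```python
# Hypothetical "not in universe" code and hypothetical income values.
NIU = 9999999
incomes = [0, 25000, NIU, 40000, NIU]

# Keep meaningful zeros, drop only the cases outside the universe.
in_universe = [v for v in incomes if v != NIU]
mean_income = sum(in_universe) / len(in_universe)  # unweighted, for illustration
```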

Why is a variable from the NHIS that I have worked with before not included in IPUMS NHIS?

As of Spring 2013, IPUMS NHIS includes more than 12,000 integrated variables covering the period 1963-2011. During the remaining years of the current IPUMS NHIS grant period, we will add more variables to our public database every year. The next IPUMS NHIS data release, which will add the 2012 data and new variables on disability and occupations, is planned for the summer of 2013.

Experienced analysts of NHIS data who do not find a familiar variable included in IPUMS NHIS can instead make use of the original public use data. In addition, as explained immediately below, users can add variables they need from the NHIS public use files to an IPUMS NHIS data extract, using IPUMS NHIS linking keys.

In a few cases, the NHIS survey contained questions about topics not included in the original NHIS public use files. The survey responses may never have been processed, or the responses may be included only as part of a composite recoded variable, or the variable may have been left out of the public use files due to confidentiality concerns. Because the NHIS public use files are the raw material used to create the IPUMS NHIS database, variables missing from the public use files are missing from IPUMS NHIS.

To learn whether a specific variable included in the NHIS public use files is currently included in IPUMS NHIS, users can employ the IPUMS NHIS-NHIS concordance feature accessible through the left sidebar of the IPUMS NHIS homepage. Using the concordance feature, a researcher may enter the name of an NHIS variable and learn the corresponding IPUMS NHIS variable name (or enter an IPUMS NHIS variable name and learn the original NHIS variable name). To access the full crosswalk between IPUMS NHIS and NHIS variable names, download the comprehensive concordance as an Excel file, through the concordance page.

Can I combine IPUMS NHIS data with other NHIS variables needed for my research?

Interested users can combine variables from IPUMS NHIS and NHIS public use files. Variables from the original NHIS public use files (that are not yet in the IPUMS NHIS system) can be linked to an IPUMS NHIS data extract. IPUMS NHIS has created linking keys from the series of original NHIS variables that are used to uniquely identify households (NHISHID) or persons (NHISPID). More information, including an overview of NHIS unique identifiers, IPUMS NHIS linking keys, and general guidance on how to link variables from NHIS public use files to an IPUMS NHIS data extract, can be found in a user note on linking. In addition to the general guidance offered by this user note, sample Stata and SAS programs are provided to facilitate the linking process.
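For readers who do not use Stata or SAS, the general shape of such a link can be sketched in Python as a keyed lookup on YEAR and NHISPID. Everything in the example is hypothetical: ORIGVAR is a made-up variable name, the NHISPID values are placeholders, and the sketch assumes the linking key has already been constructed on the NHIS side as described in the user note.

```python
# Hypothetical rows from an IPUMS NHIS extract.
ipums_rows = [
    {"YEAR": 2010, "NHISPID": "A001", "AGE": 34},
    {"YEAR": 2010, "NHISPID": "A002", "AGE": 7},
    {"YEAR": 2011, "NHISPID": "A001", "AGE": 52},
]
# Hypothetical rows from an original NHIS public use file, with the same
# key already built. ORIGVAR stands in for a variable not yet in IPUMS NHIS.
nhis_rows = [
    {"YEAR": 2010, "NHISPID": "A001", "ORIGVAR": 3},
    {"YEAR": 2011, "NHISPID": "A001", "ORIGVAR": 1},
]

# Index the NHIS rows by the person-level linking key.
lookup = {(r["YEAR"], r["NHISPID"]): r for r in nhis_rows}

# Keep every IPUMS NHIS row; unmatched rows get None for ORIGVAR.
linked = []
for row in ipums_rows:
    match = lookup.get((row["YEAR"], row["NHISPID"]), {})
    linked.append({**row, "ORIGVAR": match.get("ORIGVAR")})
```

Keeping unmatched rows (rather than dropping them) makes it easy to count how many records failed to link, which is a useful check before analysis.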

How is a record uniquely identified?

Three variables constitute a unique identifier for each person record in IPUMS NHIS: YEARP, SERIALP, and PERNUM (survey year, household identifier, and person number within the household). The combination of YEAR and SERIAL (which have the same values as YEARP and SERIALP on the person record) constitutes the unique household identifier on the household record. These are IPUMS NHIS constructed variables and will not be found in the NHIS public use files.
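A quick way to confirm that a set of variables identifies records uniquely in a downloaded extract is to count repeated key combinations. The sketch below uses invented rows, and it assumes for the sake of illustration that the same SERIAL value can recur in different survey years.

```python
from collections import Counter

# Hypothetical person rows from a rectangularized extract.
rows = [
    {"YEAR": 2010, "SERIAL": 1, "PERNUM": 1},
    {"YEAR": 2010, "SERIAL": 1, "PERNUM": 2},
    {"YEAR": 2011, "SERIAL": 1, "PERNUM": 1},
]


def duplicate_keys(records, key_fields):
    """Return the key combinations that occur more than once."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in records)
    return [key for key, n in counts.items() if n > 1]


# The full key should produce no duplicates.
person_dupes = duplicate_keys(rows, ("YEAR", "SERIAL", "PERNUM"))
# Dropping the year from the key produces a collision in this invented data.
serial_only_dupes = duplicate_keys(rows, ("SERIAL", "PERNUM"))
```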

Individual households and persons can also be uniquely identified in a manner consistent with the NHIS public use files. The combination of YEAR and NHISHID will uniquely identify households, while the combination of YEAR and NHISPID will uniquely identify persons. NHISHID and NHISPID are constructed from data elements in the original NHIS public use files to produce the unique identifiers as defined by NHIS. These unique identifiers can be used as linking keys to merge variables from the NHIS public use files to IPUMS NHIS data. More information about linking NHIS variables to IPUMS NHIS data can be found in the user note on linking.

Data Limitations and Cautions to Users

What are the major limitations of the data?

The data consist entirely of records for individual persons and households from the public use files of the NHIS. IPUMS NHIS does not deliver aggregate or published statistics from the survey. Researchers interested in aggregate data will find them on the National Center for Health Statistics website.

The number of persons and households in the survey varies from year to year, but, on average, the survey covers about 100,000 persons in about 42,000 households each year. Exact figures on the number of households and persons included in IPUMS NHIS in each year are available in the user note on sample sizes. While the NHIS ranks as one of the largest surveys conducted annually by the U.S. government, the samples may not supply enough cases to reliably study some subpopulations. To achieve adequate sample sizes for some subgroups, researchers may wish to combine data from two or more survey years.

Because the NHIS data are public-use, measures have been taken to assure confidentiality. Names and other identifying information are suppressed. You cannot find specific individuals in the IPUMS NHIS data or use these data for genealogical research. Moreover, because the NHIS uses population samples to generate the data, there is no guarantee that any given individual will be in the dataset. Finally, the registration form requires potential users to commit to using the data responsibly, including utilizing the data for statistical reporting and analysis only and making no effort to identify particular individuals in the data.

Geographic detail in the NHIS public use files and thus in IPUMS NHIS is limited to the identification of census regions and, in some years, a few large metropolitan statistical areas. Researchers can access more geographic detail and add it to an IPUMS NHIS data extract by working with the staff of the NCHS Research Data Center (RDC). If your research proposal is approved, you can access restricted data (including geographic identifiers) through on-site analysis at an NCHS or Census Research Data Center, via remote access, or with the paid assistance of NCHS RDC staff.

Are there aspects of IPUMS NHIS data about which to be particularly careful?

IPUMS NHIS is an integrated dataset based on the NHIS public use files. However, IPUMS NHIS coding schemes follow different conventions than NHIS in many instances. For example, NHIS uses the convention of 1 = Yes and 2 = No, while IPUMS NHIS uses the convention 1 = No and 2 = Yes. To pick another example, blanks in the original NHIS public use data files are converted to numeric values (usually beginning with a 0 or a 9, to indicate "not in universe" cases) in IPUMS NHIS. Moreover, to achieve comparably coded variables over time, IPUMS NHIS has recoded most variables from the original NHIS coding schemes. Users are strongly urged to review the IPUMS NHIS documentation carefully and to not assume that variable values will be coded the same in IPUMS NHIS as they were in the NHIS files.
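The Yes/No example shows why code written for the original NHIS files cannot be reused unchanged. A minimal sketch of the difference, covering only the two codes named above, might look like this; any other code (such as refusals or unknowns) is deliberately left unmapped here because its IPUMS NHIS value has to be looked up in the variable's documentation.

```python
# NHIS convention: 1 = Yes, 2 = No. IPUMS NHIS convention: 1 = No, 2 = Yes.
NHIS_TO_IPUMS_YESNO = {1: 2, 2: 1}


def to_ipums_yesno(nhis_value):
    """Translate an NHIS Yes/No code to the IPUMS NHIS convention.

    Returns None for any code other than 1 or 2, since this sketch makes
    no assumption about how other values are recoded.
    """
    return NHIS_TO_IPUMS_YESNO.get(nhis_value)


converted = [to_ipums_yesno(v) for v in [1, 2, 7]]
```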

The NHIS uses a complex sampling scheme, so all IPUMS NHIS samples are weighted. Put another way, individuals in the data do not all represent an identical number of persons in the population in a given year. It is therefore necessary to use the appropriate weight variables when analyzing these samples. A user note on sampling weights discusses the proper use of weights with IPUMS NHIS data. In addition, the "Weights" section at the top of each variable description specifies the suggested IPUMS NHIS weight to use with that variable, by year.

The NHIS does not contain the full universe of persons in the U.S. population. Rather, the survey samples the civilian non-institutionalized population and thus excludes such persons as residents of nursing homes and members of the armed forces living in barracks. A user note on sample design contains information about the NHIS sampling scheme and changes in NHIS sampling over time. Changes in the sampling methodology have implications for variance estimation over time, and a user note on variance estimation discusses appropriate practices using IPUMS NHIS data.

It is important to examine the documentation for the variables you are using. The codes and labels for variable categories do not tell the whole story. Two features of the variable documentation merit special attention. First, examine the universe for a variable (the population at risk of answering the question), which can differ subtly or markedly across years. Second, read the comparability discussions for the variables in which you are interested. Users intending to use race or Hispanic origin as a variable across multiple survey years may find useful not only the IPUMS NHIS variable descriptions but also the NHIS Race and Hispanic Origin Information page on the National Center for Health Statistics website.

Reproductions of the relevant portions of the survey form are available within variable descriptions (via the "survey text" link). For PDF reproductions of the original NHIS survey forms and retyped versions of the same material, click on the Surveys link in the left sidebar of the IPUMS NHIS home page.

By default, the extract system rectangularizes the data: it puts the household information on the person records and drops the separate household record. This can distort analyses at the household level. The number of observations will be inflated to the number of person records. To get the proper number of household observations, either select the first person in each household or select the "hierarchical" box in the extract system. The rectangularizing feature also drops any non-interviewed households. Despite these complications, the great majority of researchers prefer the rectangularized format, which is why it is the default output of our system.
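The first of these two remedies, keeping one person record per household, can be sketched as follows. The rows and the REGION variable are invented for the example, and the sketch simply keeps the first record encountered for each YEAR and SERIAL combination.

```python
# Hypothetical rectangularized rows: household fields repeat on each person.
rect = [
    {"YEAR": 2010, "SERIAL": 1, "PERNUM": 1, "REGION": 2},
    {"YEAR": 2010, "SERIAL": 1, "PERNUM": 2, "REGION": 2},
    {"YEAR": 2010, "SERIAL": 2, "PERNUM": 1, "REGION": 4},
]

seen = set()
households = []
for row in rect:
    key = (row["YEAR"], row["SERIAL"])  # household identifier
    if key not in seen:
        seen.add(key)
        households.append(row)  # first person stands in for the household
```

Counting `households` rather than `rect` gives two observations instead of three, which is the correct number of households in this invented data.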

Is help available if I encounter problems using IPUMS NHIS?

Users who encounter problems with the IPUMS NHIS extract system, data, or documentation can e-mail ipums@umn.edu to receive assistance. The IPUMS NHIS Project staff also welcomes feedback from users who encounter errors, inconsistencies, or lack of clarity in the data and documentation. Users who contact us with information about a legitimate and substantial error in the data or documentation will be sent a complimentary IPUMS NHIS mug.

Getting data

How do I obtain data?

IPUMS NHIS data are delivered through our data extraction system. If users choose to analyze data with a statistical package (rather than using the online tabulator), they select the variables and years they are interested in, and the system creates a custom-made extract containing only this information. The system will pool data from multiple survey years into a single data file; in fact, it was primarily designed for this purpose. Detailed instructions for using the data extraction system are available below.

Data are generated on our server. The system sends out an email message to the user when the extract is completed. The user must download the extract and analyze it on a local machine. Instructions for downloading and reading the data are available here.

The IPUMS NHIS data extract system is accessed through the Data Cart, which becomes clickable once you have selected variables and samples. Before an extract is created, you will be prompted to login as an IPUMS NHIS data user (with your e-mail address and password) or to register as an IPUMS NHIS data user.

What is the data format?

IPUMS NHIS produces fixed-column ASCII data. With the exception of the "Ps" and "Hs" that identify record type (distinguishing between person and household records), IPUMS NHIS data are entirely numeric. By default, the extraction system rectangularizes the data, putting household information on the person records. With rectangularization, there are no separate household records in the data extract. No information is lost, and most researchers prefer this format. The default can, however, be overridden as an option when making a data extract, to yield hierarchical data consisting of household records followed by the person records of household members.
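To show what "fixed-column" means in practice, the sketch below slices one line of a rectangularized file into variables by character position. The column layout is entirely hypothetical; the real positions for any extract come from the syntax file and codebook delivered with it.

```python
# Hypothetical layout: (name, start, end), 0-based and end-exclusive.
LAYOUT = [
    ("YEAR", 0, 4),
    ("SERIAL", 4, 10),
    ("PERNUM", 10, 12),
    ("AGE", 12, 15),
]


def parse_line(line, layout=LAYOUT):
    """Slice a fixed-column line into integer-valued variables."""
    return {name: int(line[start:end]) for name, start, end in layout}


# One made-up data line assembled from its four fields.
line = "2010" + "000001" + "01" + "034"
record = parse_line(line)
```

A hierarchical extract would need one layout per record type, chosen by the "H" or "P" marker on each line, which is one reason the supplied syntax files are the more convenient route.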

In addition to the ASCII data file, the system creates a statistical package syntax file to accompany each extract. The syntax file is designed to read in the ASCII data while applying appropriate variable and value labels. The statistical packages Stata, SAS, and SPSS are supported. You must download the syntax file with the extract, or you will be unable to read the data. The syntax file requires minor editing to identify the location of the data file on your local computer. Directions regarding these minor edits are included in the guide to using the data extract system.

A codebook file is also created with each extract. This codebook file records the characteristics of your extract and should be downloaded for recordkeeping.

All data files are created in gzip compressed format. You must decompress the file to analyze it. Most data decompression utilities will handle the files. For example, if you are using Windows Vista or 7, right-click the file name and choose the option to "Extract all." Among the available free software for decompressing files are Winzip (for Windows) and MacGZIP (for Macs).
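Users comfortable with Python can also decompress a gzip file with the standard library alone. In this sketch the file names are hypothetical, and the small compressed file is created on the spot only so that the example is self-contained.

```python
import gzip
import os
import shutil
import tempfile


def gunzip(src_path, dest_path):
    """Decompress a .gz file to dest_path without loading it all into memory."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)


# Demonstration with a throwaway file standing in for a downloaded extract.
workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "extract.dat.gz")  # hypothetical file name
dat_path = os.path.join(workdir, "extract.dat")
with gzip.open(gz_path, "wb") as f:
    f.write(b"201000000101034\n")

gunzip(gz_path, dat_path)
with open(dat_path, "rb") as f:
    contents = f.read()
```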

What is the best way to use the extract system?

The data extraction system is a flexible tool. There is no need to download variables or survey years you do not expect to use for your current analysis. The system records every extract you make. You can reload and modify an old extract, dropping or adding variables or survey years. To do so, go to the Download or Revise Extracts page and click on the "Revise" link.

Some variables are preselected for you. The data extract system automatically supplies variables that indicate the sample (YEAR), are needed for variance estimation (PSU and STRATA), uniquely identify records (SERIAL and PERNUM), and are used for weighting the variables and years selected. Use of weights with IPUMS NHIS data is discussed in the user note on sampling weights.

How long does a data extract take?

The time needed to make an extract differs, depending on the number and size of samples requested and the load on our server. Creating an extract generally takes only a few minutes. The system sends an email upon completion of the extract, so there is no need to stay active on the IPUMS NHIS site during the creation of the extract.

How does "sample selection" work on the IPUMS NHIS website?

When a user first enters the variable documentation system, data samples from all years are selected by default. Every variable in the system will display on relevant screens.

Users can filter the information displayed by selecting only the sample years of interest to them. Only the variables available in the selected sample years will then appear in the variable lists. The integrated variable descriptions and codes pages will be filtered to display only the linked survey text and codes and frequencies corresponding to the selected samples. Sample selections can be altered at any time in your session. Selections do not persist beyond the current session.

Can I get the original data?

As noted, the raw material for the IPUMS NHIS database comes from the NHIS public use files provided by the National Center for Health Statistics. These original data are available on the NHIS data and documentation page. The National Center for Health Statistics' NHIS public use files also include variables and supplements not yet covered by IPUMS NHIS.
