User Note - Link NHIS Public Use Files to IPUMS NHIS Data



Purpose of Merging

In some circumstances, users may want to link additional variables from the original National Health Interview Survey (NHIS) public use files (that are not yet in the IPUMS NHIS system) to an IPUMS NHIS data extract. A linking key must be used for this purpose. IPUMS NHIS has created linking keys from the series of original NHIS variables that are used to uniquely identify households or individuals. However, to ensure correct linkage, unique identifiers identical to those created in the IPUMS NHIS data must be generated in the NHIS public use data. While the text that follows is relevant to both person-level merging and household-level merging, most of the discussion emphasizes person-level merging, as that is the most likely need for IPUMS NHIS users. However, the same principles generally apply to household-level merging.

Unique Identifiers in NHIS Public Use Files

Unique identifiers in NHIS data vary slightly across time, due to changes in the variables released in the public use data files. Please refer to Tables 1 through 3 for details. Following the specified sequence of linking variables is critical for creating an IPUMS NHIS-compatible unique identifier in the NHIS data. The variable names of these unique identifiers in the NHIS data should not be changed if users intend to use the IPUMS NHIS Stata and SAS syntax files provided here for merging.

Table 1: Household-level unique identifiers in NHIS data

Year                   File  NHIS variable sequence
1963–1968              H     quarter psurandr week segment hhid
1969                   H     quarter psurandr weekcen segment hhid
1970–1991, 1993, 1994  H     quarter psunumr weekcen segnum hhnum
1992                   H     year quarter psunumr weekcen segnum hhnum
1995–1996              H     hhid
1997–present           H     hhx

Table 2: Person-level unique identifiers in NHIS data

Year                   File  NHIS variable sequence
1963–1968              P     quarter psurandr week segment hhid person
1969                   P     quarter psurandr weekcen segment hhid pernum
1970–1991, 1993, 1994  P     quarter psunumr weekcen segnum hhnum pnum
1992                   P     year quarter psunumr weekcen segnum hhnum pnum
1995–1996              P     hhid pnum
1997–2003              P     hhx fmx px
2004–2018              P     hhx fmx fpx
2019–present           P     hhx pernum

Table 3: Injury/poisoning-level unique identifiers in NHIS data

Year       File  NHIS variable sequence
1997–1999  I, Z  hhx fmx px injepno or poiepno (depending on I or Z file)
2000–2003  I     hhx fmx px injepno
2004–2017  I     hhx fmx fpx injepno

Linking Keys in IPUMS NHIS

IPUMS NHIS has taken the different unique identifiers across years of NHIS data into account and generated linking keys in the IPUMS NHIS data. There are three linking keys in IPUMS NHIS. The first linking key is NHISHID, which is a unique identifier for household records. The second linking key is NHISPID, which is a unique identifier for the person records. The third linking key is NHISIID, which is a unique identifier for the injury/poisoning records, available through the IPUMS NHIS hierarchical extract system. Each IPUMS NHIS linking key uniquely identifies households, persons, or injury/poisoning episodes, respectively, across all samples.

Certain modifications to the original unique identifiers in NHIS data have been made in IPUMS NHIS so that the linking keys are comparably coded and remain unique across the multiple years of data. These modifications include:

  1. Concatenating the original unique identifier variables and generating a single string or character variable as the linking key;
  2. Padding each component variable with leading zeros to achieve comparable width across years;
  3. Including the year in each identifier to create unique identifiers across the entire IPUMS NHIS dataset; and
  4. Replacing erroneous characters.
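
The first three modifications above can be sketched in a few lines. The sketch is written in Python purely for illustration and is not the IPUMS NHIS Stata or SAS syntax. The function name `build_linking_key`, the component widths, and the sample values are all hypothetical; the actual variables and widths for a given year must come from the tables above and the NHIS file documentation.

```python
def build_linking_key(year, components, widths):
    """Concatenate zero-padded identifier components behind a 4-digit year.

    `components` and `widths` are parallel sequences. The widths used in
    the example call below are hypothetical, not documented NHIS widths.
    """
    if len(components) != len(widths):
        raise ValueError("components and widths must have the same length")
    parts = [f"{int(year):04d}"]
    for value, width in zip(components, widths):
        text = str(value).strip()
        if len(text) > width:
            raise ValueError(f"{text!r} is wider than {width} characters")
        parts.append(text.zfill(width))  # pad with leading zeros
    return "".join(parts)


# Hypothetical person record: household, family, and person numbers.
key = build_linking_key(2004, ["12345", "1", "2"], widths=[6, 2, 2])
print(key)
```

Storing the result as a string rather than a number keeps the leading zeros, which is why the linking keys are character variables.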

Take Note: Linking Challenges

Fiscal versus Calendar Quarters and Years in 1963-1967

The 1963-1967 NHIS surveys were collected on a fiscal-year basis, rather than the calendar-year basis adopted from 1968 forward. The IPUMS NHIS versions of these files reflect calendar year instead of fiscal year to improve comparability over time. This harmonization adds a step to linking IPUMS NHIS data with files available from NHIS/NCHS, because fiscal-year and calendar-year quarters differ, although the frequencies remain consistent and the quarters still correctly represent the month in which the household was interviewed.

To link the 1963-1967 IPUMS NHIS files with NHIS/NCHS files, please see the attached .do file, courtesy of Marianne Wanamaker, which modifies the IPUMS NHIS calendar quarters to match the original NHIS fiscal year quarters.

Separate Injury and Poisoning Files, 1997-1999

In 1997-1999, questions about poisoning episodes were asked separately from questions about injury episodes, and injury and poisoning data were released in two separate files. To improve comparability with the 2000-forward files (where injury and poisoning episodes are offered in a single file), IPUMS NHIS has combined the poisoning and injury episode data for 1997-1999 into a single record type. Episodes that were originally released in the poisoning file in 1997-1999 can be identified by the variable IRPOISYN. IRPOISYN also denotes poisoning episodes from the combined injury and poisoning files in 2000-forward surveys to help users easily identify poisoning episodes.

IPUMS NHIS's harmonization work on combining injury and poisoning records in 1997-1999 does create an additional challenge for linking injury- or poisoning-level data with other NHIS data. The unique identifier NHISIID can be used to this end, with some modifications by the user. In 1997-1999, NHISIID is a concatenation of the IPUMS NHIS variables YEAR, HHX, FMX, PX, IRPOISYN, and IRINJNUM. The inclusion of IRPOISYN lets users determine whether a record originally came from the poisoning file or the injury file, so that the two data sources can be told apart and unique identifiers created.
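
A small sketch may help show why the file flag is needed. It is written in Python for illustration only. The function `episode_key`, the sample values, and the IRPOISYN codes 1 and 2 are placeholders rather than documented values, and the zero-padding step is omitted for brevity.

```python
def episode_key(year, hhx, fmx, px, irinjnum, irpoisyn=None):
    """Join the components in the order YEAR, HHX, FMX, PX, IRPOISYN, IRINJNUM.

    When `irpoisyn` is None the flag is left out, to show what happens
    to uniqueness without it.
    """
    parts = [year, hhx, fmx, px]
    if irpoisyn is not None:
        parts.append(irpoisyn)
    parts.append(irinjnum)
    return "".join(str(part) for part in parts)


# Same person, episode number 01 in both the injury and the poisoning file.
# The IRPOISYN codes 1 and 2 are placeholders, not documented values.
person = (1998, "000123", "01", "01")
injury = episode_key(*person, irinjnum="01", irpoisyn=1)
poisoning = episode_key(*person, irinjnum="01", irpoisyn=2)
no_flag = episode_key(*person, irinjnum="01")

print(injury != poisoning)  # the flag keeps the two episodes distinct
print(no_flag)              # without it, both episodes would share this key
```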

Erroneous character replacement, 1969-1981

Erroneous characters (including blanks, denoted by Bs) were replaced in the following cases:

Year  Original String   Erroneous Character(s)  Replacement Character(s)  IPUMS NHIS's NHISPID value
1969  19694636303951BB  BB                      00                        1969463630395100
1970  19703349252403BB  BB                      00                        1970334925240300
1970  19703524232305BB  BB                      00                        1970352423230500
1970  19703737232305BB  BB                      00                        1970373723230500
1970  19703956222104BB  BB                      00                        1970395622210400
1970  19703179242601BB  BB                      99                        1970317924260199
1971  19712710315603D0  D0                      08                        1971271031560308
1971  19713331312305BB  BB                      00                        1971333131230500
1971  19714702313707BB  BB                      00                        1971470231370700
1972  19721603225204BB  BB                      00                        1972160322520400
1972  19722169335111BB  BB                      00                        1972216933511100
1973  197325240947040B  B                       0                         1973252409470400
1973  197328810946010|  |                       4                         1973288109460104
1974  197410590711060B  B                       0                         1974105907110600
1974  1974107413071B01  B                       9                         1974107413071901
1974  1974107413071B02  B                       9                         1974107413071902
1975  197545210400040B  B                       0                         1975452104000400
1975  19754957102802BB  BB                      00                        1975495710280200
1976  197646550539030B  B                       0                         1976465505390300
1977  1977158601351B01  B                       9                         1977158601351901
1977  1977158601351B02  B                       9                         1977158601351902
1977  197749181333022B  2B                      02                        1977491813330202
1978  19783521070603BB  BB                      00                        1978352107060300
1979  19791918031901BB  BB                      00                        1979191803190100
1981  19811666012903BB  BB                      00                        1981166601290300

These same modifications must be applied to the NHIS data to ensure proper linkage. To help users merge variables from the original NHIS public use data with IPUMS NHIS data, the IPUMS NHIS staff have provided example linking syntax files for groups of years from 1982 forward that share the same linking keys. See below for an overview of the merging process and a discussion of the merging syntax files, with annotated examples.
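
For the years covered by the replacement table, one way to apply the corrections is an exact lookup on the concatenated string. The sketch below is in Python for illustration only and includes just four rows copied from the table. The names `REPLACEMENTS` and `apply_replacement` are hypothetical, and converting blanks to the letter B before the lookup is an assumption made so that the keys match the table's notation.

```python
# Four rows copied from the replacement table:
# original concatenated string -> IPUMS NHIS NHISPID value.
REPLACEMENTS = {
    "19694636303951BB": "1969463630395100",
    "19703179242601BB": "1970317924260199",
    "19712710315603D0": "1971271031560308",
    "1974107413071B01": "1974107413071901",
}


def apply_replacement(raw_key):
    """Return the corrected key, or the key unchanged if it has no entry."""
    # Assumption: blanks in the raw data are written as "B", as in the table.
    table_form = raw_key.replace(" ", "B")
    return REPLACEMENTS.get(table_form, raw_key)


print(apply_replacement("19694636303951  "))  # raw string ending in two blanks
print(apply_replacement("1975000000000101"))  # no entry, returned unchanged
```

An exact lookup is used instead of a blanket character substitution because the table maps the same erroneous characters to different replacements in different records (for example, BB becomes 00 in most rows but 99 in one 1970 row).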

Merging Variables from Original NHIS Public Use Files to IPUMS NHIS Data Files

The discussion that follows focuses mostly on person-level merging, since that is likely to be the most common need for IPUMS NHIS users. However, the same principles generally apply to household-level merging.

There are three general steps that users need to take to merge variables from the original NHIS public use file to an IPUMS NHIS data file:

  1. Obtain original NHIS public use files
    Original NHIS public use files can be downloaded from the National Center for Health Statistics (NCHS).

  2. Download and Edit Merging Syntax Files
    To help users properly link variables from the original NHIS public use data with IPUMS NHIS data, the IPUMS NHIS staff have provided example linking syntax files for groups of years from 1982 forward that share the same linking keys.

    These linking files will work with multiple years of IPUMS NHIS data if users merge on YEAR and NHISPID for person-level files. YEAR and NHISHID are likewise needed for merging household-level files. Users can copy and paste each individual linkage program to a single file as needed, including programming statements for whichever years of data are required for a particular research project.

    The merging syntax files contain four sections. Users will need to edit sections 1 and 2 for their specific research project. Sections 3 and 4 will run based on user specifications made in the preceding two sections. The following discussion provides a general overview of these four sections.

    Section 1 is where users will specify the directory location and name of the data files with which they are working.

    This specification includes the original NHIS data file to be merged, the IPUMS NHIS data file, and the newly created merged data file.

    Section 2 is where users will specify the names of the specific variables from the NHIS data that they want to merge to their IPUMS NHIS data.

    Users are cautioned that variables in the NHIS public use files may or may not be comparable over time. Users are strongly advised against merging entire NHIS data files. Rather, users should identify variables from the NHIS source data for which a recoding plan is already devised. Users should then merge only this subset of variables with IPUMS NHIS data. We strongly recommend that users rename variables in the NHIS source data before the merge, so they can clearly distinguish NHIS variables from IPUMS NHIS variables.

    Section 3 contains syntax to prepare the NHIS data for each specific year. Users do not need to make any changes in this section. The syntax is written to check whether there are duplicates of the unique identifiers and to record this information in the log. For person-level files, there should not be any duplicates. For episode-level files, such as conditions or doctor visits, duplicates will occur because an individual can have none, one, or many records. As mentioned previously, linking keys need to be created in the NHIS source data to ensure correct linking to the IPUMS NHIS data. The code in this section also generates linking keys that are identical to the linking keys in IPUMS NHIS for the same year.

    Finally, the syntax in this section checks for duplicates in the newly created linking key and writes this information to the log file. Results of this second duplicates check should be the same as the first. Next, the user-selected variables (specified in section 2) are kept, and this modified NHIS data file is saved as a temporary file for the merge.

    Section 4 contains code to merge the data files and to assess the quality of the merge. Users do not need to make any changes in this section. The syntax is written so that a user's specified IPUMS NHIS data file is accessed, duplicates of the linking key in the IPUMS NHIS data file are assessed, and the results are written to the log. The modified NHIS data file, with a subset of variables, is then merged to the IPUMS NHIS data file. Syntax has been written to assess the status of the merge, and the results are written to the log. Duplicate checks and merge statistics can be reviewed by the user to evaluate the status of the merge. Additional information about interpreting the merge results is given below.

  3. Merge NHIS data to IPUMS NHIS data

    Once the merging syntax files are edited with user specifications, these files can be run to complete the merge. This section contains a discussion of the two types of merging users may encounter and how to assess the quality of the merge, using the statistics produced by the merging syntax files.

    There are two main types of merges possible when combining NHIS source data with IPUMS NHIS data. The first merge type is a one-to-one merge (for example, merging person-level variables from the NHIS person files or sample adult files, where there is only one possible record per person, to the IPUMS NHIS data). A second merge type is a many-to-one merge (for example, merging NHIS condition files where there can be none, one, or many condition records for a person in the IPUMS NHIS data).

    As discussed earlier, NHISHID and NHISPID uniquely identify households and persons, respectively, within each year. When using multiple years of IPUMS NHIS data, there is no need to subset multiple year data into single year files for proper merging. However, users must include the variable YEAR in combination with NHISHID or NHISPID for linking to be successful.
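
The logic of sections 3 and 4 (check for duplicate keys, then merge on YEAR plus the linking key and record a merge status) can be sketched as follows. This is a Python illustration under stated assumptions, not the provided Stata or SAS syntax. The functions `duplicate_keys` and `merge_on_keys`, the variable SRC_VAR, and all record values are hypothetical, and the keys are shortened placeholders.

```python
from collections import Counter


def duplicate_keys(records, key_fields):
    """Return key values that appear on more than one record."""
    counts = Counter(tuple(rec[f] for f in key_fields) for rec in records)
    return sorted(key for key, n in counts.items() if n > 1)


def merge_on_keys(master, using, key_fields):
    """Join `using` onto `master` and add a Stata-style _merge code."""
    def key_of(rec):
        return tuple(rec[f] for f in key_fields)

    using_by_key = {}
    for rec in using:
        using_by_key.setdefault(key_of(rec), []).append(rec)

    merged, matched = [], set()
    for rec in master:
        key = key_of(rec)
        if key in using_by_key:
            matched.add(key)
            # One output row per matching record in the merging file.
            merged.extend({**rec, **u, "_merge": 3} for u in using_by_key[key])
        else:
            merged.append({**rec, "_merge": 1})
    for key, rows in using_by_key.items():
        if key not in matched:
            merged.extend({**u, "_merge": 2} for u in rows)
    return merged


# Made-up records with placeholder keys and a hypothetical source variable.
ipums = [
    {"YEAR": 2004, "NHISPID": "0001", "AGE": 34},
    {"YEAR": 2005, "NHISPID": "0001", "AGE": 61},
    {"YEAR": 2005, "NHISPID": "0002", "AGE": 47},
]
nhis = [
    {"YEAR": 2004, "NHISPID": "0001", "SRC_VAR": 1},
    {"YEAR": 2005, "NHISPID": "0001", "SRC_VAR": 2},
]
keys = ("YEAR", "NHISPID")

print("duplicate keys:", duplicate_keys(ipums, keys))
for row in merge_on_keys(ipums, nhis, keys):
    print(row["YEAR"], row["NHISPID"], row.get("SRC_VAR"), row["_merge"])
```

Because the join key is the pair (YEAR, NHISPID), the multi-year file never needs to be split into single-year files before merging.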

Checking Merge Results

Users should review the results of each merge, to ensure that the merge occurred as expected. After merging, tabulating the frequencies of the variable _merge within a year will allow users to assess the status of the merge. The values of the _merge variable report the merge status for each record. Values of _merge are as follows:

_merge = 1  observation in master dataset only (the IPUMS NHIS data)
_merge = 2  observation in merging dataset only (the original NHIS data)
_merge = 3  observation in both master (IPUMS NHIS) and merging (NHIS) datasets  
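
For readers working outside Stata or SAS, the same tabulation of merge status within year can be sketched in Python. The records below are made up, and the names `MERGE_LABELS` and `tab_merge_by_year` are hypothetical.

```python
from collections import Counter

MERGE_LABELS = {
    1: "master only (IPUMS NHIS)",
    2: "merging only (original NHIS)",
    3: "both",
}


def tab_merge_by_year(records):
    """Count records for each (year, _merge) pair."""
    return Counter((rec["YEAR"], rec["_merge"]) for rec in records)


# Made-up merged records standing in for a real merged file.
merged = [
    {"YEAR": 2005, "_merge": 1},
    {"YEAR": 2005, "_merge": 3},
    {"YEAR": 2005, "_merge": 3},
    {"YEAR": 2004, "_merge": 3},
]

counts = tab_merge_by_year(merged)
for (year, code), freq in sorted(counts.items()):
    print(year, code, MERGE_LABELS[code], freq)
```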

Merge type 1: One-to-one merge within a year

There are two types of one-to-one merges that users may see. The first is an exact match, and the second is a subset match.

An exact match occurs when there is exactly one record in the original NHIS data file for each individual record in the IPUMS NHIS data. For example, users merging additional variables from the NHIS person files with IPUMS NHIS data will have exactly one record per person per year in both the NHIS data and the IPUMS NHIS data. After merging, results like those described in the following examples should show in the results window and the log file.

For example, the following merge assessment statistics will occur when merging additional variables from the 1994 NHIS person file to IPUMS NHIS data. These statistics will appear in the results window and the log file. All records have a value of _merge=3, since all observations are in both the IPUMS NHIS data (master file) and the NHIS data (merging file). The frequency of observations in _merge=3 should equal both the total number of observations in the IPUMS NHIS file for the specified year and the total number of observations in the NHIS file being merged.

Stata example:

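. bysort year: tab _merge

-> year = 1994

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |    116,179      100.00      100.00
------------+-----------------------------------
      Total |    116,179      100.00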

SAS example:

The FREQ Procedure

DATA SET SOURCE FOR OBS

                                  Cumulative    Cumulative
_merge    Frequency    Percent     Frequency       Percent
     3       116179     100.00        116179        100.00

A subset match occurs when the NHIS data file only represents a sub-sample of the NHIS survey respondents in a given year. For example, the 2005 NHIS cancer supplement file includes only a subset of adults from the 2005 NHIS data. Users who wish to merge this supplement with IPUMS NHIS data will only have matching records for a subset of those in the original NHIS data. While this is still a one-to-one merge, the merge will only occur for this subset of adults in the IPUMS NHIS data. The merge itself proceeds as in the exact match, but after merging, results like those described in the following examples should show in the results window and the log file.

For example, the following merge assessment statistics will occur when merging variables from the 2005 cancer supplement to IPUMS NHIS data. These statistics will appear in the results window and the log file. Records now have values of _merge=1 and _merge=3, since observations either are only in the IPUMS NHIS data (master file) or are in both the IPUMS NHIS data (master file) and the NHIS data (merging file). The frequency of observations for _merge=3 should equal the total number of observations in the NHIS file that was merged (n = 31,428). The frequency of total observations should equal the total number of person-record observations in the IPUMS NHIS file for the specified year (n = 98,649).

Stata example:

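. bysort year: tab _merge

-> year = 2005

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     67,221       68.14       68.14   IPUMS NHIS only
          3 |     31,428       31.86      100.00   IPUMS NHIS and Cancer Supp
------------+-----------------------------------
      Total |     98,649      100.00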

SAS example:

The FREQ Procedure

DATA SET SOURCE FOR OBS

                                  Cumulative    Cumulative
_merge    Frequency    Percent     Frequency       Percent
     1        67221      68.14         67221         68.14   IPUMS NHIS only
     3        31428      31.86         98649        100.00   IPUMS NHIS and Cancer Supp
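
The count comparisons described for one-to-one merges can also be written as a small check. The sketch below is in Python for illustration; the function name `check_one_to_one` is hypothetical, and the frequencies are the ones reported in the 1994 and 2005 examples above. Treating any _merge=2 record as a problem is an inference from those examples, in which only the values 1 and 3 occur.

```python
def check_one_to_one(counts, n_master, n_using):
    """Compare _merge frequencies for one year with the source file sizes.

    `counts` maps each _merge code to its frequency.
    Returns a list of problems; an empty list means the counts agree.
    """
    problems = []
    if counts.get(2, 0) != 0:
        problems.append("records found only in the merging file")
    if counts.get(3, 0) != n_using:
        problems.append("_merge=3 differs from the merging file size")
    if sum(counts.values()) != n_master:
        problems.append("total differs from the IPUMS NHIS file size")
    return problems


# Frequencies from the 2005 cancer supplement example (subset match).
subset_2005 = {1: 67221, 3: 31428}
print(check_one_to_one(subset_2005, n_master=98649, n_using=31428))

# Frequencies from the 1994 person-file example (exact match).
exact_1994 = {3: 116179}
print(check_one_to_one(exact_1994, n_master=116179, n_using=116179))
```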

Merge type 2: Many-to-one merge within a year

Some NHIS data files, such as condition files and doctor visit files, contain multiple records for some persons. When merging these files, the IPUMS NHIS data will be expanded to represent multiple records for those individuals who had multiple records in the merging file. Duplicates in the linking key are now expected for individuals who have more than one record.

As an example, the following merge assessment statistics will occur when merging variables from the 1974 condition file to IPUMS NHIS data. These statistics will appear in the results window and the log file. Checking the merge status is more difficult in this situation. The frequency of observations for _merge=3 should equal the total number of observations in the original NHIS data that was merged (n = 37,453). The total number of observations (n = 126,571) should now be larger than the number of observations in the IPUMS NHIS data for the specified year (n = 116,287) but smaller than the combination of the number of records in the master file (n = 116,287) and the number of records in the merging file (n = 37,453). This is because some individuals have multiple records, some have a single record, and some have no record in the merging file.

Stata example:

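. bysort year: tab _merge

-> year = 1974

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     89,118       70.41       70.41
          3 |     37,453       29.59      100.00
------------+-----------------------------------
      Total |    126,571      100.00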

SAS example:

The FREQ Procedure

DATA SET SOURCE FOR OBS

                                  Cumulative    Cumulative
_merge    Frequency    Percent     Frequency       Percent
     1        89118      70.41         89118         70.41
     3        37453      29.59        126571        100.00
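
The arithmetic behind the many-to-one totals can be made explicit with a short sketch. It is written in Python for illustration; the function `expected_many_to_one` and the three-person example are hypothetical, while the closing checks use the 1974 figures reported above.

```python
from collections import Counter


def expected_many_to_one(master_ids, using_ids):
    """Predict _merge frequencies when `using_ids` may repeat a person.

    Assumes every id in `using_ids` also occurs in `master_ids`.
    """
    per_person = Counter(using_ids)
    unmatched = sum(1 for pid in master_ids if pid not in per_person)
    matched_rows = sum(per_person[pid] for pid in master_ids if pid in per_person)
    return {1: unmatched, 3: matched_rows, "total": unmatched + matched_rows}


# Made-up ids: person "a" has two condition records, "b" one, "c" none.
print(expected_many_to_one(["a", "b", "c"], ["a", "a", "b"]))

# Figures from the 1974 example above.
n_master, n_using, n_total = 116287, 37453, 126571
assert n_total == 89118 + n_using
assert n_master < n_total < n_master + n_using
```

The total is the number of master records with no match plus the number of records in the merging file, which is why it falls between the size of the master file and the sum of the two file sizes.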

Example of merging variables from NHIS data with a multiple-year IPUMS NHIS data file

When merging data from multiple years to a multi-year IPUMS NHIS data file, users can copy and paste syntax from each individual year to a single syntax file. In the first merge, the user must specify the directory path and name of their IPUMS NHIS data file (see section 1.3 in the example below) and specify the directory path of the final merged data file they are creating (see section 1.4 in the example below). In the second, or subsequent, merge, the user must use the directory path and name of the final merged data file specified in the previous merge (Section 1.4) as the IPUMS NHIS data master file (Section 1.3). This will ensure that variables from multiple years of NHIS data are merged with a single multi-year IPUMS NHIS data file.

The following links provide a simplified example of merging two years of NHIS data to a multi-year IPUMS NHIS file. For this example, the IPUMS NHIS data file contains variables from 2004 and 2005. We want to merge additional variables from the NHIS 2004 and NHIS 2005 data files. Users can run each of the year-specific syntax files individually. Alternatively, users can copy and paste syntax from the separate merging syntax files to a single merging syntax file. Users are urged to specify all directory paths and file names carefully, paying special attention to the master data file specified for the second year, which is now the merged file produced for the first year.

Stata example (pdf)
SAS example (pdf)
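
The chaining described above, in which the output of the first merge becomes the master file for the second, can be sketched in a few lines. This is a Python illustration with made-up records, not a substitute for the linked Stata and SAS examples. The function `attach` and the variable SRC_VAR are hypothetical.

```python
def attach(master, using, new_vars):
    """Left-join `new_vars` from `using` onto `master` by YEAR and NHISPID.

    Values already present in `master` are kept when `using` has no match,
    so a later year's merge does not erase an earlier year's values.
    """
    lookup = {(u["YEAR"], u["NHISPID"]): u for u in using}
    out = []
    for rec in master:
        match = lookup.get((rec["YEAR"], rec["NHISPID"]), {})
        added = {var: match.get(var, rec.get(var)) for var in new_vars}
        out.append({**rec, **added})
    return out


# Made-up multi-year extract and single-year source files.
ipums = [
    {"YEAR": 2004, "NHISPID": "0001"},
    {"YEAR": 2005, "NHISPID": "0001"},
]
nhis_2004 = [{"YEAR": 2004, "NHISPID": "0001", "SRC_VAR": 7}]
nhis_2005 = [{"YEAR": 2005, "NHISPID": "0001", "SRC_VAR": 9}]

# First merge: the IPUMS NHIS extract is the master file.
merged = attach(ipums, nhis_2004, ["SRC_VAR"])
# Second merge: the output of the first merge is now the master file.
merged = attach(merged, nhis_2005, ["SRC_VAR"])
print(merged)
```

If the second call used the original extract as its master, the 2004 values added by the first merge would be missing from the final file, which is the mistake the instructions above warn against.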

Help with Merging

This user note and the accompanying Stata and SAS syntax files were written to provide general guidance and facilitate the process of merging variables from the original NHIS public use files to an IPUMS NHIS data file. We attempted to anticipate issues that might occur in the most common merging scenarios. However, if problems arise, users are encouraged to contact IPUMS NHIS for assistance.

For assistance, please e-mail us at: ipums@umn.edu

Please provide a brief description of the problem and attach the Stata or SAS log file for us to review.

Last revised: 6 September 2013