User Note - Link NHIS Public Use Files to IPUMS NHIS Data
- Purpose of Merging
- Unique Identifiers in NHIS Public Use Files
- Linking Keys in IPUMS NHIS
- Merging Variables from Original NHIS Public Use Files to IPUMS NHIS Data Files
- Obtain original NHIS public use files
- Download and Edit Merging Syntax Files
- Merge NHIS data to IPUMS NHIS data
- Check merge results
- Merge type 1: One-to-one merge within a year
- Merge type 2: Many-to-one merge within a year
- Example of merging variables from NHIS data with a multiple-year IPUMS NHIS data file
- Help with Merging
Purpose of Merging
In some circumstances, users may want to link additional variables from the original National Health Interview Survey (NHIS) public use files (that are not yet in the IPUMS NHIS system) to an IPUMS NHIS data extract. A linking key must be used for this purpose. IPUMS NHIS has created linking keys from the series of original NHIS variables that are used to uniquely identify households or individuals. However, to ensure correct linkage, unique identifiers identical to those created in the IPUMS NHIS data must be generated in the NHIS public use data. While the text that follows is relevant to both person-level merging and household-level merging, most of the discussion emphasizes person-level merging, as that is the most likely need for IPUMS NHIS users. However, the same principles generally apply to household-level merging.
Unique Identifiers in NHIS Public Use Files
Unique identifiers in NHIS data vary slightly across time, due to changes in the variables released in the public use data files. Please refer to the tables below for details. Following the specified sequence of linking variables is critical for creating an IPUMS NHIS-compatible unique identifier in the NHIS data. The variable names of these unique identifiers in the NHIS data should not be changed if users intend to use the IPUMS NHIS Stata and SAS syntax files provided here for merging.
Year | File | NHIS variable sequence |
---|---|---|
1963-1968 | H | quarter psurandr week segment hhid |
1969 | H | quarter psurandr weekcen segment hhid |
1970–1991, 1993, 1994 | H | quarter psunumr weekcen segnum hhnum |
1992 | H | year quarter psunumr weekcen segnum hhnum |
1995–1996 | H | hhid |
1997–present | H | hhx |
Year | File | NHIS variable sequence |
---|---|---|
1963-1968 | P | quarter psurandr week segment hhid person |
1969 | P | quarter psurandr weekcen segment hhid pernum |
1970–1991, 1993, 1994 | P | quarter psunumr weekcen segnum hhnum pnum |
1992 | P | year quarter psunumr weekcen segnum hhnum pnum |
1995–1996 | P | hhid pnum |
1997–2003 | P | hhx fmx px |
2004–2018 | P | hhx fmx fpx |
2019–present | P | hhx pernum |
Year | File | NHIS variable sequence |
---|---|---|
1997-1999 | I, Z | hhx, fmx, px, injepno or poiepno (depending on I or Z file) |
2000-2003 | I | hhx, fmx, px, injepno |
2004-2017 | I | hhx, fmx, fpx, injepno |
Linking Keys in IPUMS NHIS
IPUMS NHIS has taken the different unique identifiers across years of NHIS data into account and generated linking keys in the IPUMS NHIS data. There are three linking keys in IPUMS NHIS. The first linking key is NHISHID, which is a unique identifier for household records. The second linking key is NHISPID, which is a unique identifier for the person records. The third linking key is NHISIID, which is a unique identifier for the injury/poisoning records, available through the IPUMS NHIS hierarchical extract system. Each IPUMS NHIS linking key uniquely identifies households, persons, or injury/poisoning episodes, respectively, across all samples.
Certain modifications to the original unique identifiers in NHIS data have been made in IPUMS NHIS so that the linking keys are comparably coded and still uniquely identify records across the multiple years of data. These modifications include:
- Concatenating the original unique identifier variables and generating a single string or character variable as the linking key;
- Padding each component variable with leading zeros to achieve comparable width across years;
- Including the year in each identifier to create unique identifiers across the entire IPUMS NHIS dataset; and
- Replacing erroneous characters.
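The modifications listed above can be sketched in a few lines. This is a minimal Python illustration, not the IPUMS NHIS Stata or SAS syntax; the function name `build_key`, the component widths, and the sample values are hypothetical, and the actual widths for a given year must be taken from the IPUMS NHIS linking syntax files.

```python
def build_key(year, components, widths):
    """Concatenate the year and zero-padded components into one string key.

    The widths passed in below are hypothetical; the real widths come from
    the IPUMS NHIS linking syntax files for the year in question.
    """
    if len(components) != len(widths):
        raise ValueError("need one width per component")
    parts = [str(year)]  # the year makes the key unique across samples
    for value, width in zip(components, widths):
        text = str(value).strip()
        if len(text) > width:
            raise ValueError(f"value {text!r} is wider than {width}")
        parts.append(text.zfill(width))  # pad with leading zeros
    return "".join(parts)


# Hypothetical person record: household 123, family 1, person 2.
key = build_key(2005, ["123", "1", "2"], widths=[6, 2, 2])
print(key)  # 20050001230102
```

Storing the key as a string rather than a number preserves the leading zeros and avoids precision loss on long identifiers.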
Take Note: Linking Challenges
Fiscal versus Calendar Quarters and Years in 1963-1967
The 1963-1967 NHIS surveys were collected according to the fiscal year, rather than the calendar year approach adopted for 1968-forward. The IPUMS NHIS versions of these files reflect calendar year instead of fiscal year to improve comparability over time. This harmonization adds an additional step to linking IPUMS NHIS data with files available from NHIS/NCHS because of differences between fiscal year and calendar year quarter, although the frequencies remain consistent and the quarters still correctly represent the month in which the household was interviewed.
To link the 1963-1967 IPUMS NHIS files with NHIS/NCHS files, please see the attached .do file, courtesy of Marianne Wanamaker, which modifies the IPUMS NHIS calendar quarters to match the original NHIS fiscal year quarters.
Separate Injury and Poisoning Files, 1997-1999
In 1997-1999, questions about poisoning episodes were asked separately from questions about injury episodes; injury and poisoning data were released in two separate files in these years. To improve comparability with the 2000-forward files (where injury and poisoning episodes are offered in a single file), IPUMS NHIS has combined the poisoning and injury episode data for 1997-1999 into a single record type. Episodes that were originally released in the poisoning file in 1997-1999 can be identified by the variable IRPOISYN. IRPOISYN also denotes poisoning episodes from the combined injury and poisoning files in 2000-forward surveys to help users easily identify poisoning episodes.
IPUMS NHIS's harmonization work on combining injury and poisoning records in 1997-1999 does create an additional challenge for linking injury or poisoning level data with other NHIS data. The unique identifier NHISIID can be used to this end, with some modifications by the user. In 1997-1999, NHISIID is a concatenation of the IPUMS NHIS variables YEAR, HHX, FMX, PX, IRPOISYN, and IRINJNUM. The inclusion of IRPOISYN allows users to determine whether the record was originally available on the poisoning or injury file, to differentiate between the two data sources, and to create unique identifiers.
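The role of the source flag can be illustrated with a small Python sketch. It assumes that episode numbers are counted separately in the injury and poisoning files, so the same person can have an episode "01" in each. The flag values "I" and "P", the field widths, and the sample values are placeholders and are not the actual IRPOISYN codes.

```python
def episode_key(year, hhx, fmx, px, source_flag, episode_no):
    """Build an episode-level key; source_flag stands in for IRPOISYN."""
    return f"{year}{hhx}{fmx}{px}{source_flag}{episode_no}"


# Hypothetical person with episode 01 on the injury file and episode 01
# on the poisoning file in the same year.
injury = {"year": 1998, "hhx": "000123", "fmx": "01", "px": "02", "episode_no": "01"}
poisoning = dict(injury)

# Without a source flag the two episodes collapse into one key.
without_flag = {
    f"{r['year']}{r['hhx']}{r['fmx']}{r['px']}{r['episode_no']}"
    for r in (injury, poisoning)
}

# With a source flag (placeholder codes) the two episodes stay distinct.
with_flag = {
    episode_key(r["year"], r["hhx"], r["fmx"], r["px"], flag, r["episode_no"])
    for r, flag in ((injury, "I"), (poisoning, "P"))
}

print(len(without_flag), len(with_flag))  # 1 2
```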
Erroneous character replacement, 1969-1981
Erroneous characters (including blanks, denoted by Bs) were replaced in the following cases:
Year | Original String | Erroneous Character(s) | Replacement Character(s) | IPUMS NHIS's NHISPID value |
---|---|---|---|---|
1969 | 19694636303951BB | BB | 00 | 1969463630395100 |
1970 | 19703349252403BB | BB | 00 | 1970334925240300 |
1970 | 19703524232305BB | BB | 00 | 1970352423230500 |
1970 | 19703737232305BB | BB | 00 | 1970373723230500 |
1970 | 19703956222104BB | BB | 00 | 1970395622210400 |
1970 | 19703179242601BB | BB | 99 | 1970317924260199 |
1971 | 19712710315603D0 | D0 | 08 | 1971271031560308 |
1971 | 19713331312305BB | BB | 00 | 1971333131230500 |
1971 | 19714702313707BB | BB | 00 | 1971470231370700 |
1972 | 19721603225204BB | BB | 00 | 1972160322520400 |
1972 | 19722169335111BB | BB | 00 | 1972216933511100 |
1973 | 197325240947040B | B | 0 | 1973252409470400 |
1973 | 197328810946010\| | \| | 4 | 1973288109460104 |
1974 | 197410590711060B | B | 0 | 1974105907110600 |
1974 | 1974107413071B01 | B | 9 | 1974107413071901 |
1974 | 1974107413071B02 | B | 9 | 1974107413071902 |
1975 | 197545210400040B | B | 0 | 1975452104000400 |
1975 | 19754957102802BB | BB | 00 | 1975495710280200 |
1976 | 197646550539030B | B | 0 | 1976465505390300 |
1977 | 1977158601351B01 | B | 9 | 1977158601351901 |
1977 | 1977158601351B02 | B | 9 | 1977158601351902 |
1977 | 197749181333022B | 2B | 02 | 1977491813330202 |
1978 | 19783521070603BB | BB | 00 | 1978352107060300 |
1979 | 19791918031901BB | BB | 00 | 1979191803190100 |
1981 | 19811666012903BB | BB | 00 | 1981166601290300 |
These same modifications must be applied to the NHIS data to ensure proper linkage. To help users merge variables from the original NHIS public use data with IPUMS NHIS data, the IPUMS NHIS staff have provided example linking syntax files for groups of years from 1982 forward that share the same linking keys. See below for an overview of the merging process and a discussion of the merging syntax files, with annotated examples.
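Because the replacement is not the same in every case (for example, BB becomes 00 in most rows of the table but 99 in one 1970 record), a record-by-record lookup is safer than a blanket substitution rule. The Python sketch below uses a few rows copied from the table. The function name `repair_key` is hypothetical, and the sketch assumes that a blank in the raw data appears as a space, which it converts to the table's B notation before the lookup.

```python
# A few rows copied from the table above: original string -> NHISPID value.
REPLACEMENTS = {
    "19694636303951BB": "1969463630395100",
    "19703179242601BB": "1970317924260199",
    "19712710315603D0": "1971271031560308",
    "1974107413071B01": "1974107413071901",
}


def repair_key(raw_key):
    """Return the corrected key for a known problem case, else the key unchanged."""
    # Assumption: blanks in the raw file are spaces; the table writes them as B.
    lookup_form = raw_key.replace(" ", "B")
    return REPLACEMENTS.get(lookup_form, raw_key)


print(repair_key("19703179242601  "))  # 1970317924260199
print(repair_key("1970334925240301"))  # unchanged
```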
Merging Variables from Original NHIS Public Use Files to IPUMS NHIS Data Files
There are three general steps that users need to take to merge variables from the original NHIS public use file to an IPUMS NHIS data file:
- Obtain the NHIS data;
- Download and edit the merging syntax files for Stata or SAS; and
- Merge the data files.
The discussion that follows will mostly focus on person-level merging, since that is likely to be the most common need for IPUMS NHIS users. However, the same principles generally apply to household-level merging.
Obtain original NHIS public use files
Original NHIS public use files can be downloaded from the National Center for Health Statistics (NCHS).
Download and Edit Merging Syntax Files
To help users properly link variables from the original NHIS public use data with IPUMS NHIS data, the IPUMS NHIS staff have provided example linking syntax files for groups of years from 1982 forward that share the same linking keys. These linking files will work with multiple years of IPUMS NHIS data if users merge on YEAR and NHISPID for person-level files. YEAR and NHISHID are likewise needed for merging household-level files. Users can copy and paste each individual linkage program to a single file as needed, including programming statements for whichever years of data are required for a particular research project.
- Person (Stata) Linking Files
- Person (SAS) Linking Files
- Household (Stata) Linking Files
- Household (SAS) Linking Files
The merging syntax files contain four sections. Users will need to edit sections 1 and 2 for their specific research project. Sections 3 and 4 will run based on user specifications made in the preceding two sections. The following discussion provides a general overview of these four sections.
Section 1 is where users will specify the directory location and name of the data files with which they are working.
This specification includes the original NHIS data file to be merged, the IPUMS NHIS data file, and the newly created merged data file.
Section 2 is where users will specify the names of the specific variables from the NHIS data that they want to merge to their IPUMS NHIS data.
Users are cautioned that variables in the NHIS public use files may or may not be comparable over time. Users are strongly advised against merging entire NHIS data files. Rather, users should identify variables from the NHIS source data for which a recoding plan is already devised. Users should then merge only this subset of variables with IPUMS NHIS data. We strongly recommend that users rename variables in the NHIS source data before the merge, so they can clearly distinguish NHIS variables from IPUMS NHIS variables.
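The advice above (keep only the needed variables, and rename them so their origin is obvious) can be sketched as follows. This is a hypothetical Python illustration, not part of the IPUMS NHIS syntax files; the prefix `nhis_` and the variable names `var_a`, `var_b`, and `var_c` are placeholders. The identifier variables are left unrenamed, since their names must not change if the provided syntax files are used.

```python
# Person-level identifier variables for 2004 to 2018, per the table above.
ID_VARS = ("hhx", "fmx", "fpx")


def select_and_rename(record, keep, id_vars=ID_VARS, prefix="nhis_"):
    """Keep the identifiers as-is and the chosen variables under a prefix."""
    out = {name: record[name] for name in id_vars}
    out.update({prefix + name: record[name] for name in keep})
    return out


# Hypothetical source record with three substantive variables.
record = {"hhx": "000123", "fmx": "01", "fpx": "02",
          "var_a": 1, "var_b": 2, "var_c": 3}

subset = select_and_rename(record, keep=["var_a", "var_b"])
print(sorted(subset))  # ['fmx', 'fpx', 'hhx', 'nhis_var_a', 'nhis_var_b']
```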
Section 3 contains syntax to prepare the NHIS data for each specific year. Users do not need to make any changes in this section. The syntax checks whether there are duplicates of the unique identifiers and records this information in the log. For person-level files, there should not be any duplicates. For episode-level files, such as conditions or doctor visits, duplicates will occur because an individual can have none, one, or many records. As mentioned previously, linking keys need to be created in the NHIS source data to ensure correct linking to the IPUMS NHIS data. The code in this section also generates linking keys that are identical to the linking keys in IPUMS NHIS for the same year.
Finally, the syntax in this section checks for duplicates in the newly created linking key and writes this information to the log file. Results of this second duplicates check should be the same as the first. Next, the user-selected variables (specified in section 2) are kept, and this modified NHIS data file is saved as a temporary file for the merge.
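The two duplicate checks described above can be pictured with a short Python sketch. It is only an illustration of the logic, not the code in the syntax files; the function name `duplicate_records`, the identifier values, and the key layout are hypothetical.

```python
from collections import Counter


def duplicate_records(ids):
    """Count records whose identifier occurs more than once."""
    counts = Counter(ids)
    return sum(n for n in counts.values() if n > 1)


# Hypothetical (hhx, fmx, fpx) tuples from one year of a person file.
raw_ids = [("000123", "01", "01"), ("000123", "01", "02"), ("000124", "01", "01")]

# Hypothetical linking keys built from the same records.
link_keys = ["2005" + "".join(parts) for parts in raw_ids]

before = duplicate_records(raw_ids)    # first check: original identifiers
after = duplicate_records(link_keys)   # second check: new linking key
print(before, after)  # 0 0
```

If the two counts differ, the key construction has merged or split records, and the merge should not proceed until the cause is found.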
Section 4 contains code to merge the data files and to assess the quality of the merge. Users do not need to make any changes in this section. The syntax is written so that a user's specified IPUMS NHIS data file is accessed, duplicates of the linking key in the IPUMS NHIS data file are assessed, and the results are written to the log. The modified NHIS data file, with a subset of variables, is then merged to the IPUMS NHIS data file. Syntax has been written to assess the status of the merge, and the results are written to the log. Duplicate checks and merge statistics can be reviewed by the user to evaluate the status of the merge. Additional information about interpreting the merge results is given below.
Merge NHIS data to IPUMS NHIS data
Once the merging syntax files are edited with user specifications, these files can be run to complete the merge. This section contains a discussion of the two types of merging users may encounter and how to assess the quality of the merge, using the statistics produced by the merging syntax files.
There are two main types of merges possible when combining NHIS source data with IPUMS NHIS data. The first merge type is a one-to-one merge (for example, merging person-level variables from the NHIS person files or sample adult files, where there is only one possible record per person, to the IPUMS NHIS data). A second merge type is a many-to-one merge (for example, merging NHIS condition files where there can be none, one, or many condition records for a person in the IPUMS NHIS data).
As discussed earlier, NHISHID and NHISPID uniquely identify households and persons, respectively, within each year. When using multiple years of IPUMS NHIS data, there is no need to subset multiple year data into single year files for proper merging. However, users must include the variable YEAR in combination with NHISHID or NHISPID for linking to be successful.
Checking Merge Results
Users should review the results of each merge, to ensure that the merge occurred as expected. After merging, tabulating the frequencies of the variable _merge within a year will allow users to assess the status of the merge. The values of the _merge variable report the merge status for each record. Values of _merge are as follows:
_merge = 1 observation in master dataset only (the IPUMS NHIS data)
_merge = 2 observation in merging dataset only (the original NHIS data)
_merge = 3 observation in both master (IPUMS NHIS) and merging (NHIS) datasets
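The meaning of the three _merge values can be reproduced with a small Python sketch of a one-to-one merge on YEAR and the linking key. The function `merge_with_indicator`, the variable `nhis_var`, and the one-letter keys are hypothetical stand-ins; real keys are the IPUMS NHIS linking key strings.

```python
from collections import Counter


def merge_with_indicator(master, using, on=("year", "nhispid")):
    """Full merge on the `on` keys; assumes keys are unique in both inputs."""
    master_by_key = {tuple(row[k] for k in on): row for row in master}
    using_by_key = {tuple(row[k] for k in on): row for row in using}
    merged = []
    for key in sorted(set(master_by_key) | set(using_by_key)):
        in_master = key in master_by_key
        in_using = key in using_by_key
        row = {**using_by_key.get(key, {}), **master_by_key.get(key, {})}
        row["_merge"] = 3 if in_master and in_using else (1 if in_master else 2)
        merged.append(row)
    return merged


# Hypothetical toy data: a two-year master file and one year of source data.
ipums = [
    {"year": 2004, "nhispid": "A"},
    {"year": 2005, "nhispid": "B"},
    {"year": 2005, "nhispid": "C"},
]
nhis = [
    {"year": 2005, "nhispid": "B", "nhis_var": 1},
    {"year": 2005, "nhispid": "D", "nhis_var": 2},
]

merged = merge_with_indicator(ipums, nhis)

# Similar in spirit to tabulating _merge within each year.
table = Counter((row["year"], row["_merge"]) for row in merged)
for (year, status), freq in sorted(table.items()):
    print(year, status, freq)
```

In this toy case, record D would have _merge = 2, which in a real merge usually signals a mismatch in how the linking key was built.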
Merge type 1: One-to-one merge within a year
There are two types of one-to-one merges that users may see. The first is an exact match, and the second is a subset match.
An exact match occurs when there is exactly one record in the original NHIS data file for each individual record in the IPUMS NHIS data. For example, users merging additional variables from the NHIS person files with IPUMS NHIS data will have exactly one record per person per year in both the NHIS data and the IPUMS NHIS data. After merging, results like those described in the following examples should show in the results window and the log file.
For example, the following merge assessment statistics will occur when merging additional variables from the 1994 NHIS person file to IPUMS NHIS data. These statistics will appear in the results window and the log file. All records have a value of _merge=3, since all observations are in both the IPUMS NHIS data (master file) and the NHIS data (merging file). The frequency of observations in _merge=3 should equal the total number of observations in the IPUMS NHIS file for the specified year or the total number of observations in the NHIS file being merged.
Stata example:
. bysort year: tab _merge
-> year = 1994
_merge | Freq. | Percent | Cum. |
---|---|---|---|
3 | 116,179 | 100.00 | 100.00 |
Total | 116,179 | 100.00 | |
SAS example:
The FREQ Procedure
DATA SET SOURCE FOR OBS
_merge | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|
3 | 116179 | 100.00 | 116179 | 100.00 |
A subset match occurs when the NHIS data file only represents a sub-sample of the NHIS survey respondents in a given year. For example, the 2005 NHIS cancer supplement file includes only a subset of adults from the 2005 NHIS data. Users who wish to merge this supplement with IPUMS NHIS data will only have matching records for a subset of those in the original NHIS data. While this is still a one-to-one merge, the merge will only occur for this subset of adults in the IPUMS NHIS data. How the merge occurs will be similar to the one-to-one match, but after merging, results like those described in the following examples should show in the results window and the log file.
For example, the following merge assessment statistics will occur when merging variables from the 2005 cancer supplement to IPUMS NHIS data. These statistics will appear in the results window and the log file. Records now have values of _merge=1 and _merge=3, since observations either are only in the IPUMS NHIS data (master file) or are in both the IPUMS NHIS data (master file) and the NHIS data (merging file). The frequency of observations for _merge=3 should equal the total number of observations in the NHIS file that was merged (n = 31,428). The frequency of total observations should equal the total number of person-record observations in the IPUMS NHIS file for the specified year (n = 98,649).
Stata example:
. bysort year: tab _merge
-> year = 2005
_merge | Freq. | Percent | Cum. | |
---|---|---|---|---|
1 | 67,221 | 68.14 | 68.14 | IPUMS NHIS only |
3 | 31,428 | 31.86 | 100.00 | IPUMS NHIS and Cancer Supp |
Total | 98,649 | 100.00 | | |
SAS example:
The FREQ Procedure
DATA SET SOURCE FOR OBS
_merge | Frequency | Percent | Cumulative Frequency | Cumulative Percent | |
---|---|---|---|---|---|
1 | 67221 | 68.14 | 67221 | 68.14 | IPUMS NHIS only |
3 | 31428 | 31.86 | 98649 | 100.00 | IPUMS NHIS and Cancer Supp |
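The consistency checks for a subset match can be written out as simple arithmetic. The Python sketch below uses the 2005 cancer supplement counts reported above; the function name `check_subset_merge` is hypothetical.

```python
def check_subset_merge(merge1, merge3, using_n, master_n):
    """Return True if the counts are consistent with a one-to-one subset match."""
    every_using_record_matched = merge3 == using_n
    master_fully_accounted_for = merge1 + merge3 == master_n
    return every_using_record_matched and master_fully_accounted_for


# Counts from the 2005 cancer supplement example above.
ok = check_subset_merge(merge1=67_221, merge3=31_428,
                        using_n=31_428, master_n=98_649)
matched_share = round(100 * 31_428 / 98_649, 2)
print(ok, matched_share)  # True 31.86
```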
Merge type 2: Many-to-one merge within a year
Some NHIS data files, such as condition files and doctor visit files, contain multiple records for some persons. When merging these files, the IPUMS NHIS data will be expanded to represent multiple records for those individuals who had multiple records in the merging file. Duplicates in the linking key are now expected for individuals who have more than one record.
As an example, the following merge assessment statistics will occur when merging variables from the 1974 condition file to IPUMS NHIS data. These statistics will appear in the results window and the log file. Checking the merge status is more difficult in this situation. The frequency of observations for _merge=3 should equal the total number of observations in the original NHIS data that was merged (n = 37,453). The total number of observations (n = 126,571) should now be larger than the number of observations in the IPUMS NHIS data for the specified year (n = 116,287) but smaller than the combination of the number of records in the master file (n = 116,287) and the number of records in the merging file (n = 37,453). This is because some individuals have multiple records, some have a single record, and some have no record in the merging file.
Stata example:
. bysort year: tab _merge
-> year = 1974
_merge | Freq. | Percent | Cum. |
---|---|---|---|
1 | 89,118 | 70.41 | 70.41 |
3 | 37,453 | 29.59 | 100.00 |
Total | 126,571 | 100.00 | |
SAS example:
The FREQ Procedure
DATA SET SOURCE FOR OBS
_merge | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|
1 | 89118 | 70.41 | 89118 | 70.41 |
3 | 37453 | 29.59 | 126571 | 100.00 |
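The bounds described for the many-to-one case can be verified directly from the 1974 counts reported above. The Python sketch below is only a worked check of that arithmetic; the variable names are hypothetical.

```python
# Counts from the 1974 condition file example above.
master_n = 116_287   # person records in IPUMS NHIS for 1974
using_n = 37_453     # records in the NHIS condition file
merge1 = 89_118      # persons with no condition record
merge3 = 37_453      # matched condition records
total = merge1 + merge3

# Every condition record should find its person.
assert merge3 == using_n
# The total lies strictly between the master size and the sum of both files.
assert master_n < total < master_n + using_n

# Persons with at least one condition record are the master records
# that did not end up as _merge = 1.
persons_with_conditions = master_n - merge1
print(total, master_n + using_n, persons_with_conditions)  # 126571 153740 27169
```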
Example of merging variables from NHIS data with a multiple-year IPUMS NHIS data file
When merging data from multiple years to a multi-year IPUMS NHIS data file, users can copy and paste syntax from each individual year to a single syntax file. In the first merge, the user must specify the directory path and name of their IPUMS NHIS data file (see section 1.3 in the example below) and specify the directory path of the final merged data file they are creating (see section 1.4 in the example below). In the second, or subsequent, merge, the user must use the directory path and name of the final merged data file specified in the previous merge (Section 1.4) as the IPUMS NHIS data master file (Section 1.3). This will ensure that variables from multiple years of NHIS data are merged with a single multi-year IPUMS NHIS data file.
The following links provide a simplified example of merging two years of NHIS data to a multi-year IPUMS NHIS file. For this example, the IPUMS NHIS data file contains variables from 2004 and 2005. We want to merge additional variables from the NHIS 2004 and NHIS 2005 data files. Users can run each of the year-specific syntax files individually. Alternatively, users can copy and paste syntax from the separate merging syntax files to a single merging syntax file. Users are urged to specify carefully all directory paths and file names, paying special attention to the master data file to be specified in the second year. The master data file is now the merged file from the previous section.
Stata example (pdf)
SAS example (pdf)
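The chaining described in this section, where the output of the first merge becomes the master file of the second, can be sketched in Python as follows. This is a toy illustration and not a translation of the linked Stata or SAS examples; the function `merge_year`, the variable `nhis_var`, and the one-letter keys are hypothetical.

```python
def merge_year(master, using, year):
    """Attach `using` variables to master rows of one year; other rows pass through."""
    lookup = {row["nhispid"]: row for row in using}
    out = []
    for row in master:
        new_row = dict(row)
        if row["year"] == year and row["nhispid"] in lookup:
            extra = lookup[row["nhispid"]]
            new_row.update({k: v for k, v in extra.items() if k != "nhispid"})
        out.append(new_row)
    return out


# Hypothetical two-year IPUMS NHIS extract and two single-year source files.
ipums = [{"year": 2004, "nhispid": "A"}, {"year": 2005, "nhispid": "B"}]
nhis_2004 = [{"nhispid": "A", "nhis_var": 1}]
nhis_2005 = [{"nhispid": "B", "nhis_var": 2}]

step1 = merge_year(ipums, nhis_2004, 2004)  # master = IPUMS NHIS extract
final = merge_year(step1, nhis_2005, 2005)  # master = output of the first merge
print(final)
```

Using `ipums` rather than `step1` as the master in the second call would drop the 2004 variables from the final file, which is the mistake the section warns against.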
Help with Merging
This user note and the accompanying Stata and SAS syntax files were written to provide general guidance and facilitate the process of merging variables from the original NHIS public use files to an IPUMS NHIS data file. We attempted to anticipate issues that might occur in the most common merging scenarios. However, if problems arise, users are encouraged to contact IPUMS NHIS for assistance.
For assistance, please e-mail us at: ipums@umn.edu
Please provide a brief description of the problem and attach the Stata or SAS log file for us to review.
Last revised: 6 September 2013