Risks to Patient Privacy: A Re-identification of Patients in Maine and Vermont Statewide Hospital Data

Ji Su Yoo; Alexandra Thaler; Latanya Sweeney; Jinyan Zang

Abstract

Forty-eight states in the United States collect statewide inpatient discharge data that include personal health information about each patient’s hospital visit [1]. A 2013 survey found that 33 of those states subsequently sold or otherwise disclosed copies of the data, but only three states de-identified the data consistent with the standards established under the Health Insurance Portability and Accountability Act (HIPAA), the U.S. federal regulation that dictates the rules by which personal health information is shared [2]. While states are not mandated to follow HIPAA when de-identifying their data, many states still use a version of HIPAA to de-identify discharge data. Did the other 30 states put the privacy of personal health data at risk? To answer this question, Latanya Sweeney tested whether Washington State’s hospital data was vulnerable to re-identification. The study showed that Washington State’s inpatient data allowed for the correct matching of 35 of 81 (or 43 percent) individuals identified in local newspaper stories to anonymized hospital visits in the discharge data released by the state [3]. After the study, Washington State improved its anonymization standard for publicly available data and added an application process for others to receive more detailed nonpublic discharge data. Despite this successful outcome, many states were not convinced that the same re-identification strategy would be successful on their datasets. One reason was a belief that Washington State was more vulnerable because it shared patient age in months, a practice not followed by many other states. Is this correct? Are other states exempt from this re-identification strategy? To find out, we repeated the approach on statewide health data from 2010 in Maine and 2011 in Vermont using local news stories that named a total of 291 individuals.

Results summary: We found that 69 of the 244 names (or 28.3 percent) from local news stories uniquely matched to one hospitalization in the Maine hospital data. Even if redacted to the HIPAA Safe Harbor standard, the Maine data still allowed eight matches (or 3.2 percent). In Vermont, we found that 16 of the 47 names (or 34 percent) from the news stories uniquely matched to one hospitalization. If the Vermont data complied with the HIPAA Safe Harbor standard, five matches would result (or 10.6 percent). Such findings suggest that patients’ personal information is vulnerable to re-identification even when hospital data is de-identified according to HIPAA Safe Harbor guidelines. We call for more rigorous inquiry into the vulnerabilities that exist even when following HIPAA Safe Harbor as a standard for de-identification.

Introduction

Forty-eight states collect patient-level data for inpatient, outpatient, and emergency room visits that they then share with both private and public organizations [1]. Statewide collections of hospital data are useful to researchers, regulators, and businesses. Sharing a complete set of hospital visits within a state enables all kinds of analyses of hospital and physician performance [4]. Prior research studies have examined the effect of racial and geographical disparities on clinical trial participation [5]; studied the impact of critical access hospital conversion on patient safety [6]; analyzed the relationship between private health insurance and the increase in certain medical procedures [7]; and compared motorcycle accident results in states with and without helmet laws [8]. These are all worthy purposes for publicly providing patient-level statewide health data.

Before sharing the statewide collections of health data, each state decides how to de-identify patient-level data to protect privacy. These data include the patient’s demographics, diagnoses, medical procedures, charges, and sources of payments for each hospitalization. Most states follow a version of the Safe Harbor method of de-identification found in the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. HIPAA is the U.S. federal regulation that dictates the rules by which personal health information is shared by doctors, hospitals, and insurance companies. Statewide collections exist under the authority of state law, and therefore statewide hospital datasets are not subject to HIPAA. While states do not have to follow HIPAA, the Safe Harbor de-identification guidelines are still widely recognized as a standard for de-identifying all kinds of health data.

Under de-identification, a state removes all explicit identifiers, such as name and address, and replaces some specific values with generalized equivalents, such as reporting age instead of the day, month, and year of birth [9]. The resulting data has no name, address, or Social Security number and the only demographics that appear may be the patient’s age, gender, and in some cases, a partial ZIP code. Some states, such as Maine and Vermont, apply their own alternative de-identification standard instead of adhering to HIPAA requirements for de-identification. Maine and Vermont both share a patient’s admission quarter, a date that is more specific than year.

To most people the resulting de-identified data can appear anonymous. Under this assumption, the data can be shared widely with researchers, private businesses, and the public without compromising personal privacy. But what if the state got de-identification wrong? What if even following the HIPAA Safe Harbor method for de-identification is not sufficient?

Even assuming that states try to protect their residents’ medical data privacy in how they de-identify their data, initial efforts at de-identification may not be successful, since risk-free de-identification is difficult. Diagnosis codes that are specific to one person’s hospital visit, combined with a patient’s age, gender, and ZIP code, may allow the name of the patient to be associated with the de-identified record. The diagnosis codes that appear in the data describe the purpose of each visit and any health issues that may be relevant to or complicate care. For example, the record for a patient hospitalized with a broken arm because of a motorcycle crash could include a diagnosis code about his broken arm but also other codes describing alcohol, illegal drug use, diabetes, and other conditions. Therefore, even with the removal of explicit identifiers, patient records can be traced back to an individual.

In this paper we test two states’ de-identification standards using publicly available statewide hospital data. Since Maine and Vermont have engineered their own alternative de-identification standards, we first test the level of risk in each state’s standard. We also test HIPAA Safe Harbor standards using a version of the Maine and Vermont statewide datasets that adheres to HIPAA de-identification guidelines. If we demonstrate the risks of re-identification under HIPAA Safe Harbor or a state’s current alternative standard, states have an opportunity to respond by improving their practices to reduce the risk in future data releases.

A cycle of research and improvement can improve states’ de-identification standards and lower risk to patient privacy while still preserving the underlying utility of the data for public health and other purposes. However, in many states recipients of a copy of a statewide dataset must sign an agreement that they will not re-identify patients. Some agreements include penalties that apply if the state learns that a recipient did re-identify patients. In practical terms, the data-use agreement gags the data recipient from talking about any re-identification vulnerabilities. It does not, however, necessarily stop re-identifications from occurring.

In fact, for those who may have the greatest incentive to exploit the data, a data-use agreement is not necessarily a deterrent. Among users of statewide hospital data are data analytic companies [10][11], many of which sell comprehensive, longitudinal datasets of patient histories that include state hospital data. Financial incentives to link data could encourage re-identifications and outweigh any concerns raised by a data-use agreement. The data-use agreements can also discourage anyone from exposing the vulnerabilities since there is no requirement to report vulnerabilities. Data-use agreements that penalize re-identifications do not allow for data protection, but re-identification studies such as this one can provide specific guidance on the existing weaknesses in de-identification methods and encourage states to implement stronger privacy protections.

Background

As mentioned earlier, 48 states collect patient information on each hospitalization that occurs in the state, and in 2013, 33 states subsequently disclosed copies of the data [1]. Of these states, only 3 states de-identified their data in a manner that was as rigorous as HIPAA requires [2]. The other 30 states shared data in a manner that was less protected than HIPAA warranted.

Washington State was one of these states. For $50, Sweeney purchased a copy of Washington State’s publicly available dataset [3]. It had virtually all hospitalizations occurring in the state in 2011, and included patient demographics, diagnoses, procedures, attending physician, hospital, a summary of charges, and how the bill was paid. It did not contain patient names or addresses other than residential ZIP codes.

Newspaper stories in Washington State for the same year that contained the word "hospitalized" often included a patient’s name and residential information and explained the reason for the hospitalization, such as vehicle accident or assault. This is the same kind of information an employer, family, friend or neighbor might know about a patient.

Sweeney assembled a sample of 81 news stories and found that news information uniquely and exactly matched medical records in the state database for 35 of the 81 sample cases (or 43 percent), thereby putting names to patient records. An independent news reporter verified matches by contacting patients [12]. Matches included high profile cases, such as politicians, professional athletes, and businesspeople. Some of the codes included sensitive information beyond the purpose of the visit, such as drug and alcohol use and sexually transmitted diseases.

After becoming aware of the experimental results, Washington State addressed the problem by requiring a data-use agreement of the publicly available version of the data and making a more detailed version available only through an application process [13]. California also changed its data disclosure practice [14]. Washington State found a means to provide detailed data to researchers and others while improving the privacy of the publicly available version.

In summary, a total of 30 states shared statewide hospital data in a manner less restricted than HIPAA’s standard before the Washington State re-identifications. After the Washington State re-identifications became known, 28 states remained unchanged.

Two of these states are Maine and Vermont. Will the re-identification strategy used on Washington State hospitalization data work on Maine and Vermont’s hospitalization data?

In this study, we use "named individual" to denote someone whom we can identify by full name. In our re-identification strategy, someone who is identified by name in a local news article as being admitted to a Maine or Vermont hospital is considered a "named individual." We determine that there is a "unique re-identification" when a de-identified or anonymized record can be strongly associated with a named individual who most logically matches the details of the record. We use the term "binsize" to label the number of records that match a named individual. A "unique re-identification" has a binsize of 1 (we write k = 1), where k is the number of de-identified hospital records that match a named individual based on a combination of field values such as geocode, gender, age, etc. See Figure 1 for a visual depiction.

Figure 1. There are three hospital records that match the gender, age, and hometown profile of the named individual so binsize=3. If there was only one record of a 75-year-old female from Portland in the hospital records, then we call it a unique re-identification.
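The binsize definition can be illustrated with a minimal sketch that mirrors the Figure 1 example. The record layout and all values here are artificial and chosen only for illustration; they are not drawn from either state's data.

```python
from collections import namedtuple

# Hypothetical, simplified record layout; the real datasets have many more fields.
Record = namedtuple("Record", ["gender", "age", "hometown"])

def binsize(person, records):
    """Count the de-identified records whose fields all equal the person's."""
    return sum(1 for record in records if record == person)

# Artificial hospital records: three share the same profile.
hospital_records = [
    Record("F", 75, "Portland"),
    Record("F", 75, "Portland"),
    Record("M", 64, "Bangor"),
    Record("F", 75, "Portland"),
]
named_individual = Record("F", 75, "Portland")

k = binsize(named_individual, hospital_records)
print(k)                 # 3, so this is not a unique re-identification
print(k == 1)            # False
```

A binsize of 1 for a named individual would correspond to a unique re-identification in the sense defined above.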

Hospital records often include the patient’s previous and relevant conditions, even if those conditions have no relation to the reason for the current admission. If a male patient is hospitalized due to a two-vehicle collision involving a motorcycle, a re-identified hospital record may expose a diagnosis for a sexually transmitted disease. In cases where more than one patient fits the details of a hospital record, all patients can still be strongly linked to that record.

Methods

The approach in the prior Washington State study [3] used a sample of local news stories that described accidents and incidents that led to hospitalization of named individuals because those kinds of news stories usually included the dates of visit, the patient’s hometown, the name of the hospital, and the general reason for the visit (i.e., the same kind of information that others may know about a person’s hospital visit). The Bangor Daily News printed an example of such a news story: "James Smith, 64, of Bangor was treated at Eastern Maine Medical Center in Portland after a fire erupted when he lit a cigarette while using an oxygen tank." Smith was admitted to Eastern Maine Medical Center and his diagnosis codes likely related to burns and tobacco use.

In this study, we match news stories about named individuals to hospital visits to determine how successfully we can uniquely associate a known patient with a de-identified hospital visit.

Materials

This study used a dataset of hospital visits occurring in a state within one year and a sample of news stories drawn from local newspapers from the same state and in the same year. The news stories describe named individuals hospitalized, primarily as a consequence of various accidents. The specific statewide datasets (MaineDataset and VermontDataset) and news stories used in this study (RawNewsStories) are detailed below.

MaineDataset

The Maine Health Data Organization (MHDO) has collected health care data since 1990 [15]. In 2013, we acquired a statewide hospital dataset called the 2010 Unrestricted Hospital Inpatient data from the MHDO for $1,125 [16]. The data purportedly contained all inpatient hospitalizations occurring in Maine hospitals in 2010 [17]. We refer to this dataset as the "MaineDataset."

Each row (or record) in the MaineDataset describes a single hospitalization; there are 105,808 rows. The fields of data associated with each hospital visit include: the quarter (January–March, April–June, July–September, or October–December) in which the hospitalization occurred; the patient’s age, gender, race and ethnicity; coded values for the patient’s hometown, hospital, diagnoses, medical procedures; the procedure dates (month, day, year) for medical procedures; status at discharge; point of admission; priority of the visit; type of insurance; and list of charges for the visit. There were 99 fields of data for each visit, all of which are listed in Appendix A. Figure 2 shows the subset of 36 fields used in this study.

Figure 2. List of 36 relevant data fields contained in the Maine statewide dataset for 2010 (MaineDataset) for each hospital visit occurring in the state. All reported diagnoses (i.e., fields "dx1," "dx2," ... "dx11") and procedures ("ps1," "ps2," ... "ps6") are alphanumeric codes found in the International Classification of Diseases, ninth revision (ICD9-CM). ICD9-CM codes indicate specific diseases, conditions, and procedures. The MHDO includes a data dictionary that defines demographic fields and hospitalization details that appear as numeric codes in the hospital record.

The MaineDataset has no explicit patient identifiers, such as name, address or Social Security number; it is de-identified. There are no explicit identifiers assigned to patients to indicate when multiple visits belong to the same patient. Each distinct patient visit is its own row in the data. Figure 3 shows an excerpt of one sample record from the MaineDataset with sample coded values explained using the MHDO dictionary [18] and the Research Data Assistance Center dictionary [19].

Figure 3. Sample of one hospital record. Values in the sample are artificial.

The MHDO Physical Record Number Format ("mhdoid") is a unique number in the format "QYYHPNNNNNNN," where "Q"=admission quarter, "YY"=two-digit admission year, "HP"=two-digit hospital code, and "NNNNNNN"=unique integer assigned to each hospital record. The Hospital Code ("hp") is a unique, six-digit number assigned to each Maine hospital and clinic. The patient’s gender ("sx") is recorded as "F" or "M" in the record, and race ("race") is coded in numeric values: "1"=American Indian or Alaska Native, "2"=Asian, "3"=Black or African American, "4"=Native Hawaiian or Other Pacific Islander, or "5"=White. Ethnicity ("ethn") has three possible values: "1"=Hispanic, "2"=Non-Hispanic, or "8"=Unknown.
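As a rough illustration of the "QYYHPNNNNNNN" layout described above, the sketch below splits an identifier into its four parts. It assumes the identifier is exactly twelve digits; the function name and the sample value are hypothetical and artificial.

```python
def parse_mhdoid(mhdoid):
    """Split a 12-digit 'QYYHPNNNNNNN' identifier into its labeled parts."""
    text = str(mhdoid)
    if len(text) != 12 or not text.isdigit():
        raise ValueError(f"expected 12 digits, got {mhdoid!r}")
    return {
        "quarter": int(text[0]),   # Q: admission quarter
        "year": text[1:3],         # YY: two-digit admission year
        "hospital": text[3:5],     # HP: two-digit hospital code
        "serial": text[5:],        # NNNNNNN: per-record integer
    }

# Artificial identifier: quarter 1 of 2010, hospital "05", serial "0000123".
parts = parse_mhdoid("110050000123")
print(parts["quarter"], parts["year"], parts["hospital"], parts["serial"])
```

Note that the admission quarter is recoverable from the first character of the identifier under this layout, independently of any separate quarter field.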

The MaineDataset does not include patient identifiers such as name and address but it denotes a patient’s "geocode." In Maine, a geocode is a five-digit number that includes a two-digit county number followed by a three-digit code that refers to the minor civil division [20]. For example, the county code for Kennebec County is 11 and the geocode for Augusta, a city in Kennebec County, is 11020. The Maine Office of GIS (MEGIS) provides a table of Maine geocodes that are used in the hospital data [21].
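The geocode structure can be sketched as follows, using the Augusta example from the text. The helper name is hypothetical, and zero-padding of short inputs is an assumption made so that codes stored as integers still split correctly.

```python
def split_geocode(geocode):
    """Split a five-digit Maine geocode into (county code, minor civil division)."""
    code = str(geocode).zfill(5)  # assumption: pad codes that lost a leading zero
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"expected a five-digit geocode, got {geocode!r}")
    return code[:2], code[2:]

county, division = split_geocode("11020")
print(county, division)  # 11 020  (Kennebec County, Augusta)
```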

The MHDO lists 921 geocodes and 694 of them can be found in the MaineDataset. Figure 4 shows 20 geocodes with the highest number of records in the MaineDataset and the population of the geocode according to the 2010 U.S. Census [22].

Figure 4. Sample of 20 geocodes in Maine and the population per geocode. Highlighted geocodes have populations over 20,000. Towns are ordered by population.

VermontDataset

We acquired the Vermont Emergency Department data from the HealthVermont.gov website. There was no cost for the data, and the website did not require any data-use agreement, credential, or application. We refer to this dataset as the "VermontDataset" [23]. Each record in the VermontDataset describes an individual emergency department encounter during 2011 in 14 hospitals. In 2011, there were 268,984 emergency department admissions in Vermont. We show the fields of data associated with each hospital visit below. There are 79 fields of data for each visit, which are all listed in Appendix B. Figure 5 shows the subset of fields we used in re-identifying Vermont’s health data.

Figure 5. Sample list of the relevant fields in the Vermont statewide dataset for 2011. The dataset reports diagnoses (i.e., fields "DX1," "DX2," ... "DX20") and procedures ("PX1," "PX2," ... "PX20") as alphanumeric codes found in the International Classification of Diseases, ninth revision (ICD9-CM). HealthVermont.gov provides a data dictionary that explains the numeric codes. The Hospital Code ("HNUM2") is a unique eight-digit number assigned to each Vermont hospital and clinic. The patient’s gender ("SEX") is recorded in coded numeric values: "1"=Male, "2"=Female, or "0"=Unknown.

There are significant differences between the MaineDataset and the VermontDataset.

The VermontDataset is more than twice as large as the MaineDataset (268,984 versus 105,808 records) because it contains emergency department encounters whereas the MaineDataset only contains inpatient admissions.
In terms of data fields, VermontDataset reports age and location identifiers in more aggregated values than the MaineDataset. Maine provides Age in years while Vermont releases an aggregate "Age Group" for each record.
Vermont also creates an aggregate "ZIP code group" category for patient location information. A patient with ZIP code group "052" could be from any ZIP code within the "05200-05299 range, excluding 05201." An excluded ZIP code is usually a large city in a ZIP code range. For example, 05201 is the town of Bennington with a population of 15,829. It is important to note that when a large city is excluded from the ZIP code group, both the remaining ZIP codes and the large city should then have a population over 20,000 in order to be listed as separate ZIP Codes in the data and still comply with HIPAA Safe Harbor. This is not the case for all ZIP code groups in the VermontDataset as shown in Figure 6.
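The grouping rule can be sketched as below. This is only an illustration: it assumes that an excluded ZIP code is reported under its own five-digit value, and the set of excluded ZIP codes contains only the single example named in the text.

```python
# Assumption: only the one excluded ZIP code mentioned in the text is listed here.
EXCLUDED_ZIPS = frozenset({"05201"})

def zip_group(zip_code, excluded=EXCLUDED_ZIPS):
    """Return the three-digit ZIP code group, or the full ZIP if it is listed separately."""
    code = str(zip_code).zfill(5)
    return code if code in excluded else code[:3]

print(zip_group("05250"))  # 052
print(zip_group("05201"))  # 05201
```

Under this rule a news story that names a patient's town narrows the search to one group, or to a single ZIP code when the town is one of the excluded cities.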

Figure 6. List of ZIP code groups in VermontDataset and 2010 Population in each ZIP code group according to the U.S. Census. Highlighted ZIP code groups have populations of more than 20,000.

Both the MaineDataset and the VermontDataset use standardized diagnoses and procedure codes from the International Classification of Diseases, ninth revision (or ICD9). The ICD9 defines more than 15,000 diagnosis and procedure codes [24] grouped into three categories: numeric codes that describe medical diseases (e.g., diseases of the digestive system or complications of pregnancy), codes beginning with an E that describe external causes of injury or poisoning (e.g., motor vehicle accidents or assaults), and codes beginning with a V that describe factors that may influence health status (e.g., communicable diseases, drug dependency, or tobacco use). Both the MaineDataset and the VermontDataset record each patient’s diagnoses using ICD9-CM diagnosis codes. The MaineDataset includes procedure codes but the VermontDataset does not. We obtained a list of ICD9 diagnosis and procedure codes from the Centers for Medicare & Medicaid Services to interpret the conditions listed in the medical records. See Figure 7 for examples.

Figure 7. Examples of ICD9-CM codes and long descriptions.

Our study used admission codes, which indicate the patient’s original location before transfer to the hospital. In both the Maine Inpatient dataset and the Vermont Emergency Department dataset, these codes specify if an admitted patient came in through another hospital facility, the emergency room, law enforcement, etc. Similarly, discharge codes indicate a patient’s status, including transferred to another facility, rehab, or deceased ("Expired"). We used these details to differentiate between candidate hospital records that match to the same named individual from the NewsStories.

Unlike diagnoses and procedure codes, admission and discharge status codes do not follow a standardized format across the states. Examples of coded numeric values for Maine and Vermont are listed in Figure 8.

Figure 8. Examples of admission and discharge status codes.

Materials: NewsStories

We performed three searches on LexisNexis Academic, an aggregator and search engine for news stories, to find articles published in a sample of local newspapers in Maine and Vermont [25]. In Maine, our first search was for stories published in 2010 that contained the word "hospitalize," "hospitalized," or "transported," and retrieved 324 stories. Our second search used the terms "injured" and "hospital," and retrieved 316 stories. Finally, our search for stories containing "hospitalize" and "medical," "hospitalized" and "medical," "transported" and "medical," or "injured" and "medical" retrieved 472 stories. In total, our searches yielded 1,112 stories about car crashes, motorcycle accidents, ski accidents, and other incidents that resulted in hospitalizations in Maine hospitals. Using the same sequence of keyword searches for Vermont, we retrieved 201, 105, and 69 stories, respectively, for a total of 375 news articles.

For Maine, the sample of newspaper stories came from local news sources including Bangor Daily News (Maine), Portland Press Herald (Maine), Brattleboro Reformer (Vermont), and Lowell Sun (Massachusetts). Vermont stories came from Brattleboro Reformer (Vermont), The Keene Sentinel (New Hampshire), and The Press-Republican (New York). The percentages in Figure 9 show that half of the stories in Maine came from the Bangor Daily News and more than half of the stories in Vermont came from the Brattleboro Reformer.

Figure 9. List of newspaper sources that publish stories about local accidents and hospitalizations.

News stories about the same incident often appeared more than once in the LexisNexis results. Some stories described patients sent to hospitals in neighboring states. A few stories concerned people who died before ever being admitted to a hospital. Other stories lacked the name of the individual. After we removed duplicates and stories lacking the requisite name of a patient admitted to a Maine or Vermont hospital, we had a sample of 177 stories in Maine and 38 stories in Vermont. We term these the "MaineNewsStories" and the "VermontNewsStories." These collections of text include the newspaper, byline, publication date, and content of each news story.

Subjects

In Maine and Vermont, the news stories referenced individuals involved in a motor vehicle accident, a fire, or other accidents resulting in a visit to the hospital. These incidents from news stories that appeared in Maine or Vermont newspapers also contained demographic details about the individual, such as age and hometown, the incident that led to the hospitalization, and the hospital involved.

Approach

Figure 10. Re-identification strategy to associate a named individual in a news story to a de-identified hospital record. The match uses unique occurrences of the combination of age, gender, hometown, hospital, admission quarter, and diagnoses, as related to the incident described in the news story.

The approach is simple: match information from each news story describing a named individual’s visit to a hospital, to the statewide data of hospital visits and see which matches uniquely associate one person from a news story to one and only one hospital record.

Figure 10 shows a graphical depiction. A news story contains the patient’s name, age, gender and hometown, identifies the hospital, the patient’s admission date, and describes the incident in ways that relate to diagnosis codes. If patient information from a news story matches one and only one hospitalization record, then we say it is a "unique match" (or re-identification). A match associates a named individual to a distinct hospital record.

This is the same approach used in the Washington State experiment [3]. We match patient demographics and incident descriptions found in news stories to hospital records. We followed the steps below to conduct the matches for both Maine and Vermont.

Step 1. Transcribe News Stories

In Step 1, we read each news story in the compilation of 1,112 Maine and 375 Vermont news stories retrieved by our searches. We recorded the name, age, and hometown of the patient, the publication date, and the hospitalization and injury details extracted from the newspaper stories, and compiled these as the "NewsData." We did not transcribe the hospitalization details or demographic record of any individual who was not mentioned by name. If an individual’s name appeared in more than one news story, we deleted any duplicates and counted an individual’s name twice only if he or she was involved in another incident that resulted in a hospitalization. Some news stories do not name injured individuals because reporters may not be able to verify the name or may choose to redact the name of a minor for legal reasons.

Step 2. Direct Matching of NewsData to Hospitalizations

We matched information from each news story describing a named individual’s hospital visit to the statewide data. In the end, we determined which matches uniquely associate one person from a news story to one and only one hospital record.

Step 2a. In the first phase of the matchup, we wrote a simple computer program that matched the information extracted from the news stories (NewsData) to the de-identified hospital records (MaineDataset) using the following five fields: Hospital Code (hp), Geocode (geocode), Gender (sx), Age (age), and Admission Quarter (admqtr). To match the hospital records in the VermontDataset, we used Hospital Number (HNUM2), ZIP code group (ZIPGRP), Gender (SEX), Age Group (AgeGRP) and Admission Quarter (ADMID_QTR).
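A minimal sketch of this kind of exact-match program is shown below. It is not the program used in the study: the field names follow the MaineDataset, but the function names, hospital code, diagnosis placeholders, and all row values are artificial and chosen only for illustration.

```python
from collections import defaultdict

# Field names follow the MaineDataset; every value below is artificial.
MATCH_FIELDS = ("hp", "geocode", "sx", "age", "admqtr")

def build_index(hospital_rows, fields=MATCH_FIELDS):
    """Group hospital rows by the tuple of their match-field values."""
    index = defaultdict(list)
    for row in hospital_rows:
        index[tuple(row[f] for f in fields)].append(row)
    return index

def candidates(news_row, index, fields=MATCH_FIELDS):
    """Return every hospital row that agrees with the news row on all fields."""
    return index.get(tuple(news_row[f] for f in fields), [])

hospital_rows = [
    {"hp": "200001", "geocode": "11020", "sx": "M", "age": 64, "admqtr": 1, "dx1": "code-a"},
    {"hp": "200001", "geocode": "11020", "sx": "F", "age": 75, "admqtr": 3, "dx1": "code-b"},
    {"hp": "200001", "geocode": "11020", "sx": "F", "age": 75, "admqtr": 3, "dx1": "code-c"},
]
news_rows = [
    {"name": "Person A", "hp": "200001", "geocode": "11020", "sx": "M", "age": 64, "admqtr": 1},
    {"name": "Person B", "hp": "200001", "geocode": "11020", "sx": "F", "age": 75, "admqtr": 3},
]

index = build_index(hospital_rows)
binsizes = {n["name"]: len(candidates(n, index)) for n in news_rows}
print(binsizes)  # {'Person A': 1, 'Person B': 2}
```

In this artificial example, Person A has a single candidate record, while Person B has two candidates that would need to be told apart using further details such as diagnosis codes. The same sketch would apply to the VermontDataset by passing the corresponding Vermont field names.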

Step 2b. In the second phase, we took the subset of named individuals who matched to at least one de-identified hospital record based on hospital code, geocode, gender, age and admit quarter. We conducted a thorough visual verification that confirmed that the injury from the NewsData matched with additional details from the hospital. We inspected all available data points including: diagnosis codes (dx1–dx9), procedure codes (ps1–ps10), length of stay (los), point of origin of admission (sadm), discharge status (disp), race, and ethnicity. The VermontDataset does not contain procedure codes or race and ethnicity.

Finally, for the MaineDataset we inspected the procedure dates (MM/DD/YY) and validated whether they matched the time period of the accident according to the newspaper article. If there was no procedure date listed, we did not use procedure date as a criterion for matching. The VermontDataset does not have procedure codes or dates.

Step 3. Identifiability of the Data

Sweeney previously found that the "identifiability," or the number of records that share the same set of characteristics within a dataset, of the HIPAA Safe Harbor provision was 0.04 percent [26][27]. That is, 0.04 percent of records are uniquely identifiable based on only {year of birth, gender and ZIP code}.

We assessed the baseline level of how many individuals were already uniquely identifiable in each dataset to better understand re-identification results. For the MaineDataset, we calculated identifiability using {age, gender, geocode, hospital and admission quarter}. In the VermontDataset, we used {age group, gender, ZIP code group, hospital and admission quarter}. Then, we evaluated the level of identifiability again without the admission quarter.
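One way to compute such a baseline is sketched below, taking identifiability to be the share of records whose combination of field values occurs exactly once. This is an illustrative reading of the measure rather than the study's code, and the rows are artificial.

```python
from collections import Counter

def identifiability(rows, fields):
    """Fraction of rows whose combination of values on `fields` is unique."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys) if keys else 0.0

# Artificial rows using MaineDataset-style field names.
rows = [
    {"age": 75, "sx": "F", "geocode": "11020", "hp": "200001", "admqtr": 1},
    {"age": 75, "sx": "F", "geocode": "11020", "hp": "200001", "admqtr": 2},
    {"age": 64, "sx": "M", "geocode": "11020", "hp": "200001", "admqtr": 1},
    {"age": 64, "sx": "M", "geocode": "11020", "hp": "200001", "admqtr": 1},
]

with_quarter = identifiability(rows, ["age", "sx", "geocode", "hp", "admqtr"])
without_quarter = identifiability(rows, ["age", "sx", "geocode", "hp"])
print(with_quarter, without_quarter)  # 0.5 0.0
```

In this toy data, dropping the admission quarter merges the first two rows into one bin, so no record remains unique; the real datasets would show their own, different proportions.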

Step 4. Direct Matching using a HIPAA Compliant Version of the Data

Neither the MaineDataset nor the VermontDataset meets the minimum HIPAA Safe Harbor requirements, as both provide more detail in dates and geographical identifiers than Safe Harbor allows. Maine and Vermont share quarter, which is more specific than providing dates in years. They also share geographic identifiers, such as geocode or ZIP code. We constructed "HIPAA Datasets" for each statewide dataset to test the effectiveness of HIPAA Safe Harbor guidelines.

For the MaineDataset, we redacted all geocodes that had a 2010 population of less than 20,000. See Figure 11 for remaining geocodes. We redacted all dates not listed as years, including admission quarter and procedure dates. We call this new, redacted data the "Maine HIPAA Dataset."

Figure 11. List of geocodes that remain in Maine HIPAA Dataset.

There is an aggregate "ZIP code group" field in the VermontDataset for patient location information. However, some ZIP code groups do not have a population over 20,000. We redacted all groups of ZIP codes that had a 2010 population of less than 20,000. See Figure 12 for remaining ZIP code groups. We redacted the admission quarter since it is a date indicator that is not listed in years. We call this redacted data the "Vermont HIPAA Dataset."

Figure 12. List of ZIP code groups that remain in Vermont HIPAA Dataset.
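The two redactions described for the HIPAA Datasets can be sketched as follows. This covers only the geographic and date redactions named in the text, not every Safe Harbor requirement; the function name, area labels, and population figures are artificial.

```python
MIN_POPULATION = 20000  # threshold stated in the text

def to_hipaa_like(row, population, area_field, date_fields):
    """Return a copy of `row` with small areas and sub-year dates blanked out."""
    redacted = dict(row)
    # Blank the area when its population is below the threshold (or unknown).
    if population.get(redacted[area_field], 0) < MIN_POPULATION:
        redacted[area_field] = None
    # Blank every date field that is more specific than a year.
    for field in date_fields:
        if field in redacted:
            redacted[field] = None
    return redacted

population = {"AREA_1": 66000, "AREA_2": 4500}  # artificial populations
rows = [
    {"geocode": "AREA_1", "age": 64, "sx": "M", "admqtr": 1},
    {"geocode": "AREA_2", "age": 75, "sx": "F", "admqtr": 3},
]
hipaa_rows = [to_hipaa_like(r, population, "geocode", ["admqtr"]) for r in rows]
print(hipaa_rows[0])  # geocode kept, admqtr removed
print(hipaa_rows[1])  # geocode and admqtr both removed
```

For Vermont-style rows, the same sketch would be called with the ZIP code group field and the admission quarter field in place of the Maine field names.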

We wrote a computer program to match named individuals to de-identified hospital records using age, gender, hometown and hospital code. Then, we systematically reviewed any named individual who matched at least one de-identified hospital record. The goal was to verify the injury with a de-identified record using additional data points. In Maine, this includes: diagnosis codes, procedure codes and dates, admission and discharge codes, race and ethnicity, and length of stay, if available. In Vermont, we excluded procedure codes and dates, and race and ethnicity.

We note that the unique matches in Washington State’s data were all correct, as verified by contacting the re-identified patients. Since the methods in this paper reproduce the basic methodology of matching news stories with de-identified hospital records, we assumed a similar level of correctness and did not contact patients to verify the matches. While the quality of individual news stories in Maine and Vermont may differ from that of stories in Washington, this study assumes that the level of detail in local reporting is comparable, since the detail in news stories from Maine and Vermont is similar to the hospitalization detail in the Washington news articles.

Results

For clarity, we report our findings following the same steps that are used to describe our methods.

Results for Step 1. Transcribe News Stories

There were 177 news articles that mentioned 244 named individuals taken to a Maine hospital. The subjects in Vermont are a sample of 47 named individuals mentioned in the 38 news stories who required hospitalization in Vermont. We only counted news articles that name at least one individual who was admitted to a Maine hospital in 2010 or who visited a Vermont emergency room in 2011. Figures 13 and 14 show the fields present in news stories for named individuals in each state.

Figure 13. Distribution of fields in Maine NewsData.

Figure 14. Distribution of fields in Vermont NewsData.

Results for Step 2. Direct Matching of Newspaper Stories to Hospitalizations

Results for Step 2a. We matched each of the named records in NewsData with the hospital records using only age, gender, geocode, admission quarter, and hospital code. In Maine, 136 of the 244 named individuals (or 55.7 percent) matched with at least one hospital record (binsize k≥1) and 47 out of the 244 (or 19.3 percent) matched with exactly one de-identified hospital record (binsize k=1).

Out of the 136 named individuals, 24 (or 17.6 percent) had two candidate hospital records (binsize 2) and 11 (or 8.1 percent) had three candidate hospital records (binsize 3). In sum, 89 (or 65.4 percent) named individuals had more than one candidate hospital record. Figure 15 shows the breakdown of how many named individuals matched to n de-identified records by binsize. A total of 114 out of 136 (or 83.8 percent) named individuals had re-identifications of binsize k<20 and 94 of the 136 (or 69.1 percent) named individuals had a binsize k≤5.

Figure 15. Matching records where k<50 are graphically sorted by binsize. First stage re-identifications are based only on age, gender, geocode, admit quarter and hospital code. All demographics came from Maine newspaper articles. In Step 2b, we matched these named individuals to a de-identified hospital record based on diagnoses code, discharge code, procedure code and date, length of stay, race, and ethnicity, if available. The cumulative number of named individuals matching to a de-identified hospital record is shown as well.

In the first stage of the matchup for Vermont, we ran Python code to cross-reference Vermont NewsData with the VermontDataset based on age group, gender, ZIP code group, admission quarter, and hospital code. Forty out of the 47 (or 85.1 percent) named individuals matched to at least one de-identified hospital record (binsize k≥1). Two out of the 40 matched to exactly one hospital record and 39 named individuals matched to more than one record. Figure 16 shows how many de-identified records matched to each named individual for binsizes where k<50. We cross-linked these named individuals in Step 2b using diagnoses code, discharge code, procedure code and date, and length of stay.

Figure 16. VermontDataset re-identifications based only on age group, gender, ZIP code group, admission quarter and hospital code. Matching records with binsize k<50 are graphically displayed by binsize.

Results for Step 2b. After Step 2a, there were 136 named individuals who matched to one or more hospital records in the MaineDataset based on age, gender, geocode, admission quarter, and hospital code. We confirmed that the diagnoses codes, discharge code, procedure date, length of stay, race, and ethnicity from the de-identified record matched the details from the news data whenever the news story included them.

Out of the 47 named individuals that matched exactly to one de-identified record in Maine, 26 uniquely matched a de-identified hospital record after visual verification. Out of the 89 named individuals that matched to more than one de-identified hospital record, 35 uniquely matched to one de-identified record after visual verification. In total, we found 69 out of 244 (or 28.3 percent) unique re-identifications. Figure 17 describes the process of elimination based on this methodology.

In 2010, 94.4 percent of the population in Maine was White. The MaineDataset was also homogenous: 95.4 percent of the records referred to White patients. When verifying race and ethnicity, we could more easily check whether minorities in the de-identified hospital data matched our newspaper stories because we could use online searches for public images (if available) to confirm a match.

There were 40 named individuals that matched to one or more hospital records in the VermontDataset based only on age group, gender, ZIP code group, hospital, and admission quarter. After verifying diagnoses code, discharge status, and length of stay, 16 named individuals matched to a single hospital record. Figure 17 visually depicts the re-identification strategy for Vermont.

Figure 17. Outline of each stage of the re-identification strategy. There were 69 out of 244 (or 28.3 percent) unique re-identifications in Maine and 16 out of 47 (or 34.0 percent) unique re-identifications in Vermont. In Step 2a, we cross-referenced NewsData (hometown, age, gender, hospital, and admission quarter) with hospital data (geocode or ZIP code group, age or age group, gender, hospital, admission quarter). In Step 2b, we verified the records using diagnoses, discharge, admission codes, length of stay and, if available, procedure date, race, and ethnicity.

Results for Step 3. General vulnerability based on the identifiability of the data

Previous studies found the identifiability of data cleaned under the HIPAA Safe Harbor provision to be 0.04 percent [26][27]. In comparison, 56,735 records out of 105,808 (or 53.6 percent) are uniquely identifiable using the characteristics {age, gender, geocode, hospital and quarter} in the MaineDataset, and 4,570 records out of 268,984 (or 1.7 percent) are uniquely identifiable in the VermontDataset using the corresponding fields {age group, gender, ZIP code group, hospital and quarter}. We call a record uniquely identifiable when it is the only record in a de-identified dataset with its specific combination of values for these fields. A record with a binsize of 2 is one of exactly 2 records in the de-identified dataset that share the same specific combination of values. The higher the number of records that are uniquely identifiable in the dataset, the higher the identifiability of the dataset.

More than half of the records in the MaineDataset are unique when only considering {age, gender, geocode, hospital and quarter}. Figure 18 shows the binsize results. The minimum binsize is 1, the maximum binsize is 70, the average binsize is 24, and the standard deviation is 17. Records having a binsize of 1 or 2 make up 73.7 percent (or 78,027 out of 105,808) of the MaineDataset, and records that have binsizes 1 through 3 make up 83.4 percent of all de-identified records.
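Cumulative shares of this kind (the percentage of records with a binsize of 1, of 1 or 2, and so on) can be computed from per-record binsizes as in the sketch below. The function name and the toy binsizes are hypothetical and are not drawn from the MaineDataset.

```python
from collections import Counter

def cumulative_percent_by_binsize(sizes):
    """Map each binsize k to the percentage of records with binsize <= k."""
    per_size = Counter(sizes)  # binsize -> number of records with that binsize
    total = len(sizes)
    running = 0
    result = {}
    for k in sorted(per_size):
        running += per_size[k]
        result[k] = 100.0 * running / total
    return result

# Hypothetical per-record binsizes: five unique records, one pair, one triple.
toy_sizes = [1] * 5 + [2] * 2 + [3] * 3

shares = cumulative_percent_by_binsize(toy_sizes)
print(shares)
```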

When we consider the identifiability of the MaineDataset excluding admission quarter, the percentage of records that are unique decreases to 33.7 percent and the total number of binsizes increases to 63. The minimum binsize is still 1, but the maximum binsize increases to 231, with an average binsize of 43 and a standard deviation of 41. Records that have a binsize of 1 or 2 make up 51.3 percent of the MaineDataset, and records with binsizes of 1 through 3 make up 61.4 percent. The exclusion of admission quarter decreases the number of unique records (binsize 1) and increases the binsizes of the de-identified records. Together, Figures 18 and 19 show the larger binsizes and lower number of unique records when quarter is excluded.

Figure 18. MaineDataset identifiability of records using {age, gender, geocode, hospital, quarter}. Unique records that have binsize 1 make up more than half of the total number of de-identified records. The first chart shows the counts by binsize for all binsizes (range 1 to 70) and the second chart shows identifiability rates by cumulative percentage for all binsizes.

Figure 19. MaineDataset identifiability of records using {age, gender, geocode, hospital}. The first chart shows the number of patient records that have binsizes that are 50 or less. The second chart displays identifiability rates by binsize. The third chart zooms in on the second chart and shows the cumulative identifiability rates by binsize for the binsizes 1 to 50.

The VermontDataset has fewer unique records. This is due in part to Vermont’s reporting of age and ZIP code as age group and ZIP code group. Out of 268,984 records, 4,570 records (or 1.7 percent) had a binsize of 1 and 8,501 (or 3.2 percent) had a binsize of 1 or 2. There are a total of 301 binsizes. These identifiability estimates are substantially lower than those for the MaineDataset, as expected given that the VermontDataset aggregates both age and ZIP code into groups. The minimum binsize is 1 and the maximum binsize is 566, with an average binsize of 164 and a standard deviation of 109. Figure 20 depicts the identifiability of records in the VermontDataset, including the cumulative percentage of patients by binsize and the percentage of patients that have small binsizes.

Figure 20. Identifiability of records in the VermontDataset based on {age group, gender, ZIP code group, hospital and quarter}. The first graph shows the number of records with binsize of 50 or less (k≤50) in order to show the number of unique records already in the data. The second graph shows the cumulative percentage of patients who are identifiable to show the full binsize range and identifiability relative to the entire VermontDataset. The third graph zooms in on the second graph and shows the identifiability rates for the first 50 binsizes.

Even though the number of records that have a binsize of 1 within the VermontDataset is lower than in the MaineDataset, in Step 2 we observed a higher rate of re-identification for Vermont than for Maine. This might be explained by the fact that the Vermont and Maine datasets contain different kinds of hospitalization data. The VermontDataset is a record of all emergency department visits and the MaineDataset is a record of inpatient admissions. Not all the named individuals in the news articles for Maine were reported as being "admitted" to a hospital after a motor vehicle accident, fire, or incident, so it is more likely that named individuals could be found in an emergency department dataset than in an inpatient dataset when they are "transported" to or "seen" in the hospital.

After we exclude admission quarter from the VermontDataset, the minimum binsize is 1 and the maximum is 2,019, with an average of 340 and a standard deviation of 308. There are 1,473 records (or 0.55 percent) with a binsize of 1, 2,851 (or 1.06 percent) with a binsize of 1 or 2, and 4,003 (or 1.49 percent) with a binsize of 1 through 3. The total number of binsizes is 465. Again, unique records decrease when we exclude admission quarter. Figure 21 shows the frequency of binsizes that are less than 50.

Figure 21. Identifiability of records based on {age group, gender, ZIP code group, and hospital} in the VermontDataset. The first graph shows the number of records with binsize of 50 or less (k≤50) and the second graph shows the cumulative percentage of identifiable patients by binsize. The third graph zooms in on the second graph and shows the identifiability rates for the first 50 binsizes.

Results for Step 4. Direct Matching using a HIPAA Compliant Version of the Data

We created a hospital dataset that followed the minimum HIPAA Safe Harbor guidelines. We redacted admission quarter and all geocodes from the MaineDataset that had a 2010 population of fewer than 20,000. There were 694 geocodes present in the MaineDataset, but only 7 remain present in the Maine HIPAA Dataset: Portland, Lewiston, Bangor, Auburn, Biddeford, Sanford, and South Portland (see Figure 11).

We did the matchup only on Hospital Code (hp), Gender (sx), and Age (age). There were 40 named individuals who matched to one or more de-identified hospital records (binsize k≥1). The distribution of the matchup is listed in Figure 22 according to binsize. There were fewer named individuals with small binsizes in the Maine HIPAA results since there is less specificity when 99.0 percent (687 out of 694) of geocodes are omitted from the Maine HIPAA Dataset. The minimum binsize was 1, the maximum was 10,433, and the average binsize was 782 with a standard deviation of 2,213.

We took the 40 potential matches and verified diagnoses codes, discharge codes, procedure codes, and length of stay, if available. We uniquely re-identified eight of the 244 named individuals (or 3.28 percent) in the Maine HIPAA Dataset.

Figure 22. First stage re-identifications of HIPAA-compliant de-identified hospital records based only on age, gender, geocode (if population>20,000), and hospital code. We matched these candidates’ de-identified hospital records further. We did not consider named individuals who matched to more than 200 records (k>200) for further visual inspection. We sorted records with binsize > 200 graphically and show the cumulative number of de-identified hospital records by binsize.

The VermontDataset does not follow HIPAA Safe Harbor standards. We removed any ZIP code groups with populations of fewer than 20,000 to create the Vermont HIPAA Dataset. In total, we dropped 10 ZIP code groups from the VermontDataset to create the HIPAA-compliant dataset. The ZIP code groups left in the de-identified dataset are listed in Figure 12. Forty-four named individuals matched to at least one de-identified HIPAA hospital record. Again, binsize indicates the number of de-identified HIPAA records that match to a named individual. The minimum binsize was 7, the maximum was 9,297, and the average binsize was 1,503 with a standard deviation of 2,595. See Figure 23 for binsize frequencies.

We took the matches that had a binsize less than or equal to 200 and further verified potential matches using diagnosis codes, discharge codes, procedure codes, and length of stay, if available. We uniquely re-identified five of the 47 named individuals (or 10.6 percent) in the Vermont HIPAA Dataset.

Figure 23. Vermont HIPAA Dataset re-identifications by binsize. These named individuals later match to a single HIPAA de-identified hospital record using visual inspection. Named individuals who matched to more than 200 records (k>200) were not considered for further visual inspection.

Discussion

This re-identification study uses de-identified hospital discharge data and a simulated HIPAA version of the data. First, we demonstrate that de-identification protocols do not provide reasonable expectations of privacy protections for patients even though the datasets may contain redacted or aggregated data. Second, we recognize that states tend to follow variations that are close to the HIPAA Safe Harbor guidelines, and so we created HIPAA versions of the datasets. We re-identified 28.3 percent of named individuals from Maine news articles and 34 percent of the named individuals from Vermont. When using HIPAA versions of the data, we re-identified 3.3 and 10.6 percent of named individuals in Maine and Vermont, respectively.

Maine’s and Vermont’s re-identification rates are lower than that of the Washington State study, which re-identified 43 percent of named individuals. This difference may be due to the fact that Washington’s data reported age in months, whereas Maine shared patient age in years and Vermont shared patient age using age groups. However, state data using standards similar to HIPAA are still not immune from re-identification. Our results demonstrate that even states that generalize demographic data such as age can have high re-identification rates. The methods of this study show that we can uniquely re-identify patients even when state data use HIPAA Safe Harbor standards.

Our results demonstrate that HIPAA Safe Harbor is not an effective guideline or framework for anonymizing data to prevent re-identifications. What may be needed is the flexibility to consider the rapidly increasing availability of data when assessing de-identification methods, instead of a fixed de-identification checklist such as the one HIPAA Safe Harbor provides. While we used the HIPAA Safe Harbor for our test, we do not suggest that the HIPAA Safe Harbor is the ideal standard.

Further Risks to Privacy

Re-identified hospital records for motor vehicle or home accidents sometimes included diagnosis codes describing pre-existing conditions that have nothing to do with the patient’s reason for hospitalization, such as those detailing alcoholism, diabetes, and depression. For instance, after uniquely re-identifying a hospital record concerning a pedestrian who was hit by a car, we learned that the person involved in the car crash was a college student with a previous history of depression, PTSD, and anxiety (see Figure 24). Such associations can be damaging, especially when re-identified hospital records reveal sensitive or stigmatized mental health histories. In one instance, a political candidate’s campaign was marred when an attacker exposed her prior suicide attempt [28]. Even though this political candidate won in the end, and some may argue that this incident helped her campaign, her mental health was exposed without her consent. Diagnoses codes can also include information about unemployment status, cannabis and opioid use, and histories of physical or domestic violence.

Figure 24. Example re-identification that revealed sensitive information about a patient involved in a motor vehicle accident.

Re-identification may also become easier as hospitals use more specific diagnoses and procedure codes. The 10th edition of the International Classification of Diseases (ICD10) [29] is more detailed than the ICD9 codes that we used in this re-identification strategy. For example, the ICD10 codes allow hospitals to indicate which leg (left or right) sustained injuries, a level of distinction that ICD9 codes did not allow. It may in fact be faster to verify diagnoses codes using more specific ICD10 codes.

Newspaper articles are a unique way to re-identify data because they routinely report and publish detailed accounts of accidents that result in a hospital visit. This re-identification strategy relies on the quality and specificity of the news stories. We found that more specific injury details from the news stories yielded more diagnoses fields that we could use to confirm matches that were based on age, gender, geocode or ZIP code, admission quarter and hospital. The solution is neither to censor journalists' reporting of local events nor to stop data sharing altogether. Our findings suggest that patients are vulnerable to re-identification not just through the specific strategy in this paper, which uses news stories, but also through other sources of the same kind of information about hospitalizations and medical conditions, including prescription data, medical mailing lists, employers, debtors, friends, and family.

The Role of Re-identifications

Currently, some organizations have legally prohibited re-identification of state records because re-identifications are disruptive. Maine has the following clause in its data-use agreement [30]:

2. DATA PRIVACY AND DATA SUPPRESSION

C. The Data Applicant and Data Recipient will not use the MHDO Data in any way, or link these data to other records or data bases, if the result allows for identifying individuals, unless authorized in writing by the MHDO.

States can reduce disruption, not by banning re-identification experiments, but by encouraging researchers to report vulnerabilities. More sensitive data should be made available under an application process that asks for research or use purposes. A data-use agreement should allow recipients who apply to run approved re-identification experiments on the condition that they report results to the state and that they not disclose the re-identified patients. There can be penalties ranging from a loss of access to the data to cease-and-desist orders prohibiting the data from being used if re-identification results are not reported, or the data is not used for stated purposes.

Such an agreement can also direct anyone holding a vulnerable version of the data to stop using it once a vulnerability is confirmed until a better-redacted version is provided or they qualify for an exemption. Doing so encourages responsible sharing in a manner that allows data holders to improve with scientific knowledge and innovation. States and legislatively mandated agencies or organizations can also do these tests internally, since they can access full datasets and assess different levels of data suppression. They can determine how much information to distribute to maximize both the utility of the data and the privacy of the patients.

Vermont does not require data-use agreements or have data access restrictions. Vermont’s existing protocols, while not necessarily compliant with HIPAA Safe Harbor, do apply simple aggregations (from ZIP code to "ZIP code group" and age to "Age Group") and redactions (no personal identifiers such as name or address included). Re-identification experiments such as this study provide the evidence needed to improve data sharing practices.

One state that has reviewed its de-identification practices since 2012 is Maine [31]. In 2016, Maine implemented new rules establishing different levels of statewide hospital datasets available for public access (see Appendix C for all data fields included in the three data levels), requiring applications and data-use agreements, publishing a list of data requesters and recipients, and allowing a comment period so data providers and the public could voice concerns to the MHDO about the release of data to anyone requesting it. These new rules are an effective improvement. Designating data levels and creating more stringent requirements to acquire sensitive data may better deter re-identification of patient data. Tiered access for data sharing has the potential to be effective in communicating clear access requirements [32]. Data-purchase applications and data-use agreements also establish legal protections and inform recipients of the expectations for privacy when using the data. Publishing a list of data requesters and recipients permits the public to participate in holding both the MHDO and data requesters accountable and transparent about purposes and uses of the data.

Despite these new rules, Maine’s lowest tiered data access still does not properly conform to HIPAA Safe Harbor because it contains admission and discharge quarter and year, and health service area. There may be utility in receiving dates in quarters or at other levels more granular than years for researchers looking at occurrence rates of seasonal diseases, but this paper suggests that including admission quarters increases the identifiability of a dataset. Our study found 26 unique re-identified matches in the Maine data and eight unique re-identified matches in the simulated HIPAA data that did not include admission quarter. Our results using the Maine HIPAA Dataset suggest that our re-identification strategy would still be successful using the most unrestricted version of the Maine dataset.

The costs for unrestricted hospital data are low. Sweeney obtained the data for Washington State for only $50 and Maine’s 2010 hospital data was $1,125. Vermont’s data for Inpatient, Outpatient, Emergency Department, Expanded Outpatient and Revenue Data for 2006 through 2014 is all available online for free [23]. A 2013 survey of inpatient unrestricted data costs for 33 states reports that the total cost to obtain data from all 33 states would be between $39,729 and $48,705 [2]. The costs ranged from $0 (Vermont and New Hampshire) to $10,000 (Tennessee).

Beyond statewide health data, many academic publications ask authors to submit a version of the data used to report results as a condition of publication (for examples, see [33]) to verify published findings, promote further studies, and possibly lower research costs by allowing data reuse. In recent years, government offices at all levels, enforcement agencies, and non-profit organizations have been launching open data portals for the sake of transparency, accountability, and research.

HIPAA Safe Harbor’s framework is often considered the de facto standard for protecting patient privacy even though it has not been rigorously tested or shown to guarantee privacy [34][35]. There are scientifically tested privacy solutions that do offer formal privacy guarantees, such as k-anonymity, which uses either generalization or suppression of fields to guarantee a defined level of anonymity to individual records in a given dataset [36]. Differential privacy is another approach, which adds calibrated random noise to released statistics or synthetic datasets so that they retain statistical properties of the original dataset while limiting what can be learned about any one individual [37]. All data-sharing methods should continuously undergo a cycle of testing and subsequent improvement in order to guarantee formal privacy protections like these.
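To make the k-anonymity idea concrete, the following sketch generalizes age into 10-year groups and then suppresses any record whose combination of quasi-identifiers still occurs fewer than k times. It is a simplified illustration under stated assumptions, not the algorithm from [36]: the function names, the choice of quasi-identifiers, the 10-year width, and the toy records are all hypothetical.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Replace an exact age with a coarser range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[f] for f in quasi_identifiers) for r in records)

def is_k_anonymous(records, quasi_identifiers, k):
    """True when every quasi-identifier combination occurs at least k times."""
    return all(n >= k for n in group_sizes(records, quasi_identifiers).values())

def suppress_small_groups(records, quasi_identifiers, k):
    """Drop records whose quasi-identifier combination occurs fewer than k times."""
    sizes = group_sizes(records, quasi_identifiers)
    return [r for r in records
            if sizes[tuple(r[f] for f in quasi_identifiers)] >= k]

# Hypothetical toy records for illustration only.
toy = [
    {"age": 31, "gender": "F"},
    {"age": 34, "gender": "F"},
    {"age": 38, "gender": "F"},
    {"age": 52, "gender": "M"},
]
QI = ["age", "gender"]

generalized = [{**r, "age": generalize_age(r["age"])} for r in toy]
released = suppress_small_groups(generalized, QI, k=2)
print(is_k_anonymous(toy, QI, 2), is_k_anonymous(released, QI, 2))
```

Generalization alone leaves the single 50-59 record exposed in this toy example, so that record is suppressed before release; the trade-off is that every generalized or suppressed value reduces the utility of the data.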

A re-identification experiment helped Washington State to improve its data sharing practices. We hope these experiments will similarly help Maine and Vermont to adopt an improved privacy approach. More broadly, our results suggest that all states that share hospital data should revisit their vulnerabilities and de-identification practices. The de-identification checklist that HIPAA Safe Harbor promotes is the bare minimum protection against re-identification. Policy-makers and data-sharing centers should consider scientifically tested protocols that guarantee privacy protections to patients, especially since they cannot opt out of inclusion in hospital records.