A Host of Troubles: Re-Identifying Airbnb Hosts Using Public Data

Aron Szanto; Neel Mehta

Abstract

Airbnb claims to make hosts anonymous by only providing their first name, town, and approximate location. In this paper, we introduce a method to probabilistically re-identify hosts by cross-referencing with public voter files, showing that Airbnb’s anonymization method is vulnerable to attacks without access to specialized data. We survey Airbnb listings in nine Wisconsin localities of varying size and apply our algorithm to each, uncovering purportedly private host information.

Results summary: We find that, of the registered voters whose names match a host’s first name and town, the voter who lives closest to the approximate location given by Airbnb is the true host in 94% of cases. This suggests that Airbnb’s location fuzzing is insufficient to preserve the anonymity of their hosts. Further, we find that our method can uniquely re-identify the hosts of 40% of all listings in a random sample. Last, we detail how our methodology could provide regulators and law enforcement the tools to curb illicit renting activity but could also compromise the privacy of 1.2 million hosts.

Methods

Voter File

United States voter files include the name, address, and usually the demographic and contact information of every registered voter in a state, county, or other district, including those who have not voted in a long time and those who have never voted. Each of the 50 states and Washington, D.C. maintains such a file, and these files are often available for free or for purchase from each state [24].

For our study, which emphasizes housing, geography, and population density, we use the state of Wisconsin as a case study because its population, population density, size, and home ownership rate are each close to the national average. Wisconsin has the 25th highest population density of the 50 states [9], is the 20th largest by population, and is the 23rd largest state by area [10]. Wisconsin also features a diverse mix of urban areas (such as Milwaukee, Madison, and Green Bay), suburbs (such as Sheboygan, La Crosse, and Wauwatosa), and rural areas (such as Portage, Reedsburg, and Elkhorn) [11]. Additionally, Wisconsin’s home ownership rate is 67.8% [12], similar to the United States’ overall rate of 63.5% [13].

Taken together, these facts suggest that Wisconsin is a reasonably representative sample of the United States for studies that focus on housing, geography, and population density.

Airbnb Listings

Airbnb’s API is not publicly documented, so we used an open-source API wrapper to download metadata about Airbnb listings in Wisconsin [14]. Given a rough search area, e.g. Milwaukee, WI, and date range, e.g. January 1st to January 8th, the API returned a list of all Airbnb listings in that general area available within that date range.

The API revealed a variety of metadata about each listing, including:

The listing’s name (e.g. "Small Room in Milwaukee");
The town the house is in;
A fuzzed location of the house, given as a (latitude, longitude) pair;
The host’s first name;
URLs of pictures of the house;
A link to the listing on airbnb.com, which let us find the listing’s description, reviews, and other details.

Airbnb shows users a shaded circle centered around the fuzzed location of the listing, as seen in Figure 1. After reviewing several listings from around the US, we found that the fuzzed location and circle are static, i.e., they are the same no matter when or where one views the listing. Moreover, the radius of the circle was very small, with an average of 0.284 miles and a maximum of 0.344 miles, and was uncorrelated with the population density of the locality of the listing. We determined these values by randomly sampling Airbnb listings in 10 US cities and finding the latitude and longitude of the center of each listing’s circle. Then, using Google Maps, we found a landmark on the circumference of each circle, as seen in Figure 2. Finally, we used Vincenty’s algorithm [16] (discussed in detail in Appendix A) to find the distance between the circle’s center and the landmark on the circumference, yielding the circle’s radius.

Figure 2. Example of calculating the radius of a circle drawn around an Airbnb listing in New York City. The Alvin Ailey American Dance Theater lies exactly on the circumference of the circle, so computing the distance from the theater to the circle’s center (which can be found with an API) yields the circle’s radius.
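The radius measurement reduces to one geodesic distance per listing. The following is a minimal sketch of Vincenty's inverse formula in its textbook form, assuming the WGS-84 ellipsoid; it is not necessarily the implementation described in Appendix A, and the function name and the sample coordinates are hypothetical.

```python
import math

# WGS-84 ellipsoid constants (an assumption; the paper does not name the ellipsoid).
A_AXIS = 6378137.0                  # semi-major axis, meters
FLATTENING = 1 / 298.257223563
B_AXIS = (1 - FLATTENING) * A_AXIS  # semi-minor axis, meters
METERS_PER_MILE = 1609.344


def vincenty_miles(lat1, lon1, lat2, lon2, tol=1e-12, max_iter=200):
    """Geodesic distance in miles between two (lat, lon) points given in degrees."""
    f = FLATTENING
    # Reduced latitudes of the two points.
    u1 = math.atan((1 - f) * math.tan(math.radians(lat1)))
    u2 = math.atan((1 - f) * math.tan(math.radians(lat2)))
    big_l = math.radians(lon2 - lon1)
    sin_u1, cos_u1 = math.sin(u1), math.cos(u1)
    sin_u2, cos_u2 = math.sin(u2), math.cos(u2)

    lam = big_l
    for _ in range(max_iter):
        sin_lam, cos_lam = math.sin(lam), math.cos(lam)
        sin_sigma = math.hypot(cos_u2 * sin_lam,
                               cos_u1 * sin_u2 - sin_u1 * cos_u2 * cos_lam)
        if sin_sigma == 0:
            return 0.0  # coincident points
        cos_sigma = sin_u1 * sin_u2 + cos_u1 * cos_u2 * cos_lam
        sigma = math.atan2(sin_sigma, cos_sigma)
        sin_alpha = cos_u1 * cos_u2 * sin_lam / sin_sigma
        cos2_alpha = 1 - sin_alpha ** 2
        # cos(2*sigma_m) is undefined for equatorial lines; 0 is the usual convention.
        cos_2sm = cos_sigma - 2 * sin_u1 * sin_u2 / cos2_alpha if cos2_alpha else 0.0
        c = f / 16 * cos2_alpha * (4 + f * (4 - 3 * cos2_alpha))
        lam_new = big_l + (1 - c) * f * sin_alpha * (
            sigma + c * sin_sigma * (cos_2sm + c * cos_sigma * (2 * cos_2sm ** 2 - 1)))
        converged = abs(lam_new - lam) < tol
        lam = lam_new
        if converged:
            break
    else:
        raise ValueError("Vincenty iteration did not converge")

    u_sq = cos2_alpha * (A_AXIS ** 2 - B_AXIS ** 2) / B_AXIS ** 2
    big_a = 1 + u_sq / 16384 * (4096 + u_sq * (-768 + u_sq * (320 - 175 * u_sq)))
    big_b = u_sq / 1024 * (256 + u_sq * (-128 + u_sq * (74 - 47 * u_sq)))
    delta_sigma = big_b * sin_sigma * (cos_2sm + big_b / 4 * (
        cos_sigma * (2 * cos_2sm ** 2 - 1)
        - big_b / 6 * cos_2sm * (4 * sin_sigma ** 2 - 3) * (4 * cos_2sm ** 2 - 3)))
    return B_AXIS * big_a * (sigma - delta_sigma) / METERS_PER_MILE


# Hypothetical circle center and landmark on the circumference, for illustration only.
radius = vincenty_miles(40.7650, -73.9850, 40.7668, -73.9869)
print(f"estimated radius: {radius:.3f} miles")
```

Because the circles are well under a mile across, a spherical approximation would give nearly the same radius; the ellipsoidal formula is shown here only because it is the method the text names.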

We gathered a full list of listings in nine selected Wisconsin cities, shown in Table 1 and mapped in Figure 3 [15]. We chose:

Three large cities (Milwaukee, Madison, and Green Bay);
Three mid-sized cities with populations near 50,000 (La Crosse, Sheboygan, and Wauwatosa);
Three small cities with populations near 10,000 (Portage, Elkhorn, and Reedsburg).

Table 1. Selected Wisconsin cities and populations.

Figure 3. Map of selected Wisconsin cities. From west to east: La Crosse, Reedsburg, Portage, Madison, Elkhorn, Green Bay, Wauwatosa, Milwaukee, Sheboygan.

This allows us to compare the efficacy of our algorithm across different types of cities.

We randomly chose listings available between July 1st and July 2nd, 2017, which was roughly two months after the date of experimentation (late April 2017). We report 693 listings, as shown in Table 2.

Table 2. Number of Airbnb listings available from 7/1/17 to 7/2/17 for selected Wisconsin cities.

Algorithmic Re-identification

For each town, we sampled a random set of listings in the town, dropping expired listings and those removed by the host.

We then attempted to algorithmically re-identify the host and house of each sampled listing. Suppose we had an Airbnb listing A_i which provided the fuzzed location (Φ_i , L_i), where Φ_i is the latitude and L_i is the longitude. We filtered the Wisconsin voter records to only include voters with the same first name and hometown as the listing’s host. We call these voters candidates.

For each candidate C_k, we ran their voter registration address through the Microsoft Bing geocoder, which mapped the address to a latitude/longitude pair (Φ_k, L_k). We then used Vincenty’s algorithm [16] to find the geodesic distance s_i,k between the listing’s fuzzed location (Φ_i, L_i) and the candidate’s address (Φ_k, L_k). Our implementation of Vincenty’s algorithm is discussed in more detail in Appendix A.

We then ranked all the candidates for the listing by the geodesic distance s_i,k from their house to the fuzzed location. We hypothesized that the candidate with the lowest s_i,k was the most likely to own the house.
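The filter-then-rank procedure above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the `Voter` record, the function names, and the sample data are hypothetical, the coordinates are assumed to be already geocoded, and a haversine great-circle distance is swapped in for Vincenty's algorithm to keep the sketch short.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; spherical approximation


@dataclass(frozen=True)
class Voter:
    first_name: str
    last_name: str
    town: str
    lat: float  # assumed to be geocoded from the registration address
    lon: float


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (a stand-in for the geodesic distance)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def rank_candidates(voters, host_first_name, town, fuzzed_lat, fuzzed_lon):
    """Return (distance_miles, voter) pairs for matching voters, nearest first."""
    name, town = host_first_name.casefold(), town.casefold()
    # Candidates share the host's first name and town.
    candidates = [v for v in voters
                  if v.first_name.casefold() == name and v.town.casefold() == town]
    scored = [(haversine_miles(fuzzed_lat, fuzzed_lon, v.lat, v.lon), v)
              for v in candidates]
    return sorted(scored, key=lambda pair: pair[0])


# Made-up voter records and fuzzed location, for illustration only.
voters = [
    Voter("Alex", "Stone", "Madison", 43.0800, -89.4000),
    Voter("Alex", "Reed", "Madison", 43.0740, -89.3850),
    Voter("Maria", "Lopez", "Madison", 43.0735, -89.3845),
    Voter("Alex", "Hart", "Milwaukee", 43.0389, -87.9065),
]
ranking = rank_candidates(voters, "Alex", "Madison", 43.0731, -89.3838)
for distance, voter in ranking:
    print(f"{voter.first_name} {voter.last_name}: {distance:.3f} miles")
```

The first entry of `ranking` is the candidate hypothesized to be the host; truncating the list to its first ten entries would give a shortlist for manual checking.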

Validation of Re-identifications

We proceeded to test the validity of our algorithm’s predictions through three steps.

Step 1. For each sampled listing, we found every registered voter whose first name and city matched the listing. These voters were called candidates.

Step 2. For each sampled listing, we chose the 10 candidates closest to the fuzzed location; we call these finalists. (If there were fewer than 10 candidates, all became finalists.) We found that choosing the top 10 was sufficient to ensure correctness.

Step 3. We then attempted to manually determine which of the finalists was most likely to be the real host of the listing. We split our methods into two groups: personal matching techniques and property matching techniques. The personal matching techniques included:

Using Google Reverse Image Search on the host’s profile picture. Many times, hosts re-used that picture on other websites, such as on a LinkedIn profile, which revealed their last name and other demographics.
Searching LinkedIn or Google for the host’s name and hometown, along with details from the host’s Airbnb profile. This often helped narrow down the list of potential matched voters, as some hosts’ Airbnb profiles provided identifying information such as their alma mater or job.

The property matching techniques included:

Using Google Maps Satellite View or Street View to visually compare a voter’s house to pictures provided in the listing. Figures 5 and 6 provide an example of visual re-identification.
Using reviews and descriptions to infer the house’s location. For instance, a review on one listing said that the house was “past three mailboxes, up the road, and the second house on the right”. We looked up each of the matched houses on Google Maps and found only one that fit that description. In another case, reviews indicated that a listing was “blocks from the beach” and “next to a park and elementary school”. Again, only one of the matched houses fit all of those criteria.
Leveraging the fact that many Airbnb listings were actual bed-and-breakfasts; hosts would just provide the names of the bed-and-breakfast companies (e.g. Sparrow Sun Wellness Bed and Breakfast). A simple Google search revealed the bed-and-breakfast’s address and the owner’s full name.
Checking for Google Maps screenshots that some listing owners included among their photos, which made those properties easy to re-identify.

After applying the personal and property identification techniques, we named a candidate a unique re-identification if one or both sets of identification techniques yielded a conclusive match. We note that there was never a case in which the two types of identification techniques pointed to two different candidates, i.e., there was never disagreement. Indeed, nearly all of the unique re-identifications were supported by both sets of techniques.

Results

We found unique re-identifications for 34 of the 84 Airbnb listings we tested, or 40.48%. That is, for 40.48% of the tested listings, our method identified a single individual who was very likely to own the listed house.

Most strikingly, of the 34 unique re-identifications, 32 of them represented the voter matching on name and town who lived closest to the fuzzed location. In other words, 94% of unique re-identifications could have been predicted just by finding the closest individual who matched the first name and town on the listing. We found that every individual uniquely re-identified was one of the 3 closest finalists, as seen in Figure 4.

Figure 4. Assuming a match on first name and town, the voter who lives closest to the approximate location given by Airbnb is the true host 94% of the time.

This lends credence to our decision to name the top 10 closest candidates as finalists. Recall that we tested only the finalists to see if any of them were likely to be the owner of a particular listing. Thus, accidentally leaving the likely owner of a house out of the finalists pool could lead to false negatives. But since each unique re-identification was within the top 3 and almost all were the closest, choosing 10 candidates as finalists made it extremely unlikely that we would leave the likely owner out of the finalists pool. It gave significant buffer room, as no unique re-identification occupied a place between 4 and 10. As such, choosing 10 candidates gave us high confidence in our ability to correctly predict the true host.
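The headline rates follow directly from the counts reported above. The short sketch below recomputes them and shows how a "hit rate at k" would be read off a list of ranks. The split of the two non-closest hosts between rank 2 and rank 3 is not reported, so the one used here is an assumption; the top-1, top-3, and top-10 figures do not depend on it.

```python
tested_listings = 84
unique_reidentifications = 34
closest_candidate_hits = 32  # re-identified host was the closest candidate

reid_rate = unique_reidentifications / tested_listings
closest_rate = closest_candidate_hits / unique_reidentifications
print(f"unique re-identification rate: {reid_rate:.2%}")  # 40.48%
print(f"closest-candidate rate: {closest_rate:.0%}")      # 94%

# Rank of the true host among candidates, one entry per unique re-identification.
# Assumed split of the remaining two hosts: one at rank 2 and one at rank 3.
ranks = [1] * closest_candidate_hits + [2, 3]


def hit_rate_at_k(ranks, k):
    """Share of re-identified hosts ranked k-th closest or better."""
    return sum(r <= k for r in ranks) / len(ranks)


for k in (1, 3, 10):
    print(f"hit rate within top {k}: {hit_rate_at_k(ranks, k):.0%}")
```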

Example Re-Identification

This example is based on a real listing that we uniquely re-identified, but we use pseudonyms and other fictitious details to preserve the privacy of the individual(s) involved. We denote details adapted to protect privacy with an asterisk (*).

In this example, consider a listing in Madison, Wisconsin owned by David*.

Figure 5. An image attached to one Airbnb listing.

Figure 6. We looked up one hypothesized address on Google Maps Street View. Comparing the images gives strong evidence that the listing is indeed at this address.

First, our algorithm found that there were 350* people named David in Madison, WI. That is, there were 350 candidates for this re-identification.

Our algorithm narrowed down these 350 candidates into 10 finalists, namely the 10 Davids* in Madison who lived closest to the fuzzed location of the Airbnb listing.

Those were David Adams, David Bell, David Cook, David Daniels, David Edwards, David Francis, David Gray, David Hodge, David Ian, and David Johnson*. The closest David, David Gray, lived 0.09 miles from the fuzzed location (which was at the edge of one of the lakes in Madison), while the farthest, David Bell, was 0.50 miles away.

We hypothesized that the closest person was the most likely to be the host. We wanted to check if any of the algorithm’s proposed finalists, and especially the finalist who lived closest to the fuzzed location, were the true owners of the house. Applying the personal matching techniques, we found that Mr. Gray’s Airbnb profile indicated that he attended Harvard University*. A Google search for “David Gray Harvard Madison”* brought up his LinkedIn page, which confirmed that he attended Harvard. The LinkedIn profile picture also matched the Airbnb profile picture. Applying the property matching techniques, we searched Mr. Gray’s address via Google Maps Street View and found that the house visually matched a picture provided in the listing. The listing also said that the house was on the water’s edge, and Mr. Gray was the only David whose house was on the water. Last, an image of the property showed the view from the back porch, which included a glimpse of the Wisconsin State Capitol building. Using Google Maps Satellite View, we found that the state capitol building was right across the water from this house and that the photographed view was plausibly taken from this house.

With these data points, we were able to name David Gray a unique re-identification. That is, we concluded that Mr. Gray almost certainly owns this property, validating the algorithm’s prediction.

Cities

As Table 3 shows, our algorithm had varying success levels across cities. It was most successful in Sheboygan, with a 76% unique re-identification rate, but had limited success in La Crosse and Wauwatosa.

Table 3. Unique re-identification success by city, sorted by city population (largest first)

We then grouped cities by their size, as seen in Table 4. The largest cities had the lowest success rate, while mid-sized cities saw the highest rate of likely matches.

Table 4. Success of re-identification by city size

Distances

Of the 34 listings we identified as unique re-identifications, the associated properties were, on average, just 0.136 miles from the fuzzed locations provided by Airbnb. The distribution is shown in Figure 7, and full summary statistics are available in Table 5.

Figure 7. Distances between unique re-identifications and their Airbnb listings’ fuzzed locations. On average, the likely houses are just 0.136 miles away from the fuzzed locations.

Table 5. Summary of distances between uniquely re-identified properties and associated Airbnb fuzzed location.

Continuing Analysis

While our original study used listings collected in mid-2017, we repeated this study in April 2018 on four diverse, current Wisconsin listings. We present a case study on one such listing, owned by a host named Sam* in Madison; the other three are included in the appendix. We performed a unique re-identification for this listing as follows.

Figure 8. Fuzzed radius around the listing in question

Our algorithm found 667 Sams who live in Madison and are registered to vote in Wisconsin. Each Sam had an associated address from the Wisconsin voting records. We geocoded each address to find the latitude and longitude of each Sam’s house and compared it with the latitude and longitude of the center of the Airbnb listing’s fuzzed circle. We found that Sam Reese* lived the closest to that center.

We confirmed that the listing belonged to Sam Reese using several methods:

On her profile, Ms. Reese mentioned that she teaches geology at Quaker Ridge* College. A quick Google search for “Sam Reese Geology Madison” led to her faculty profile.
The listing description includes, "It is very close to Block Park* (0.7 miles), Trader Joe’s (4 blocks) and a variety of restaurants on Henry* Street. The zoo is nearby in Vilas Park." We used the distance of 4 blocks from Trader Joe’s and proximity to Vilas Park to narrow down a small strip of area where the listing could be. Sam Reese was the only Sam whose house fell in this narrow strip.
Google Maps Street View confirmed that this listing indeed matched Ms. Reese's house.

Figure 9. Listing image on Airbnb

Figure 10. Google Maps Street View of Sam Reese’s house

Thus, we are confident in naming Sam Reese as a unique re-identification.

The center of the listing’s fuzzed circle has latitude 43.062030906137714 and longitude -89.41555717331067, per Airbnb. Ms. Reese’s house is just 0.0617 miles (326 feet) away.

The combination of voting registration data and Airbnb approximate locations continues to be an effective tool for re-identification. Airbnb seems not to have made changes to its platform that would make this analysis materially different, with the location radius still under 0.3 miles for each listing in the new sample. We conclude that the results from the 2017 sample remain representative of the platform’s privacy shortcomings in 2018.

Discussion

Our results suggest that despite Airbnb’s efforts to keep personal data private, it is a relatively simple task to re-identify ostensibly anonymous users. Indeed, in 94% of cases, it suffices to find the voter matching on first name and town who lives closest to the location provided by a listing. These findings have important implications for regulators, users, and Airbnb itself.

Impact on Regulations

Our findings may be relevant to the ongoing regulatory battle between American cities and Airbnb. Cities object to Airbnb for facilitating renters’ avoidance of hotel taxes and for supporting use of scarce housing supply for short-term rentals. Moreover, many of the properties are rent-controlled and are intended for low-income city residents, rather than for the operation of short-term housing rentals [18, 19]. Co-op boards and building management companies, meanwhile, have started to institute clauses in their lease agreements prohibiting short-term sublets to maintain the privacy and comfort of their tenants [23]. In New York and San Francisco, regulators and building management entities requested the names and addresses of Airbnb hosts, only to face difficulties in obtaining this data. Our results suggest that regulators and management alike could turn to an algorithm like ours to obtain this data themselves.

Below we explore two case studies for how our findings could help streamline the process of enforcing short-term rental regulation. In both cases, regulators require data regarding hosts whose listings are potentially illegal and have policies that they could enforce once they obtain the data. Our research could help these regulators with the missing middle step of acquiring the data without the cooperation of the company.

Hotel Registration in San Francisco

In 2017 the city of San Francisco won an injunction against Airbnb that requires all Airbnb hosts in the city to register with city regulators, providing information including name and address [18]. A victory for regulators, this ruling allowed them to put restrictions on hosts, such as limiting each host to one rental. These policies sought to prevent unscrupulous hosts from abusing rent-controlled housing units or from circumventing laws intended to protect hotel customers. However, this step forward came at the cost of an expensive and protracted legal battle, and it still allows for a “lag period during which illegal hosts can rent out homes before city officials identify them” [18].

Alternatively, city officials could use an algorithm like the one devised in this study to automatically re-identify hosts across the city. That is, this algorithm would help officials get much the same data that they are already looking for – namely, the names and addresses of Airbnb hosts – much faster than using legal or legislative means.

Illegal Landlords in New York

Cities could also use our algorithm to enforce regulation policies already on the books that they have difficulty enforcing. In New York, for example, some landlords use Airbnb to operate illegal hotels in rent-controlled housing units, to the detriment of the low-income residents for whom the housing is intended. Indeed, then-New York Attorney General Eric Schneiderman stated that two-thirds of the state’s Airbnb listings are exploitative in this manner [20]. New York State passed a law in 2016 making it illegal to advertise a listing for housing for a term shorter than 30 days [21], a move widely seen as aimed at Airbnb.

Unlike San Francisco, New York found it fairly easy to pass short-term rental regulation, but despite the new rules, the city had difficulty enforcing the law due to the fact that neither Airbnb hosts nor their addresses were fully identifiable. New York regulators pressed Airbnb to expose the names and addresses of those individuals who were violating the law, but Airbnb refused, only allowing regulators to view anonymized, aggregated data by making an appointment at Airbnb offices in the city. Even then, city officials could only view and take notes but not leave with a copy of the data [19].

Given Airbnb’s reluctance to divulge information about its hosts, our results suggest that cities such as New York might find it easier to crack down on hosts who attempt to sidestep the law by using an algorithm like ours to re-identify them.

Though our findings may be a boon to regulators, they do not come without risk to individuals. Our methodology intersects the group of possible matches by location with the group of possible matches by name; this may return several people who match both criteria. Aggressive regulators might target an individual who meets both criteria but is not the true owner of the listing. Unless regulators ensure that they target only the correct individual, “innocent bystanders” may be harmed solely because of their name and town of residence.

Last, we note that for ethical and institutional reasons, we did not contact any Airbnb hosts. Though this means that we do not claim positive proof of a correspondence, the evidence we find matching hosts to listings for our unique re-identifications is such that the probability of a false positive is vanishingly slim.

Other Outcomes of Re-Identification

Airbnb charges hosts a 3-5% commission on each transaction and guests a 6-12% fee for each rental [22]. As a result, Airbnb is incentivized to ensure that guests and hosts only interact through the website. Indeed, the website implements stringent filters within its messaging service to prevent guests and hosts from exchanging contact information. However, our findings provide a natural way to circumvent these barriers by allowing would-be renters to find contact information for the listing elsewhere and to make an end-run around Airbnb’s system. In addition to privacy risks, re-identification poses some more immediate risks for hosts. Instantly bookable homes on the platform, for instance, may be at higher risk for burglary. Clearly, the ability to re-identify the true owners and locations of Airbnb listings has wide-reaching and worrying effects.

Differences in Likelihood of Re-Identification

Our findings are differentially applicable to hosts who live in densely and sparsely populated areas. Because the probabilistic area that Airbnb provides has a fixed radius, hosts may find themselves either well-hidden or thoroughly exposed, depending on how many people live within that radius. For example, compare Figures 11 (showing a densely populated area) and 12 (showing a sparsely populated area). Assuming all other variables (e.g., population homogeneity) are equal between the two regions, it would be more difficult to re-identify the listing in Figure 11. Because the population density is so high in the area in Figure 11, there is a high probability of false positives, making it less likely — ceteris paribus — that the host could be re-identified. In contrast, the host in Figure 12 is more vulnerable — ceteris paribus — due to the relative sparsity of residential units in the location radius. Any positive match in the voter database for Figure 12 would likely be the correct match, in contrast with the many false positives that might be identified in the denser area of Figure 11.

Figure 11. Re-identification would be more difficult here, ceteris paribus, due to the high population density within the radius.

Figure 12. Re-identification would be fairly easy here, ceteris paribus, because there are few opportunities for false positives with such population sparsity.

To address the issue of population density, Airbnb might consider increasing the location radius to better protect hosts. However, while increasing the radius might decrease the risk of re-identification, it would significantly diminish the utility to prospective renters. Many renters rely on the provided location to decide where they choose to stay, and greater uncertainty in location could make listings less appealing. While a renter in Coon Falls, Wisconsin may not notice a difference if the fuzzing radius were increased to one mile, a renter in New York City certainly would: a mile’s radius could hide the difference between renting an apartment in downtown Manhattan and one in New Jersey!

Another potential approach could be to adjust the probability distribution of where the listing could fall within the location radius. As Figure 7 shows, the majority of listings were located close to the center of the fuzzed circle. Airbnb could make listings equally likely to appear anywhere in the circle, but again this would increase uncertainty of location and reduce the utility to potential renters.
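To make "equally likely to appear anywhere in the circle" concrete, the sketch below draws a fuzzing offset uniformly over the area of a disk. It is only an illustration: how Airbnb actually places the fuzzed point is not known to us, and the function name and the 0.3-mile radius are hypothetical.

```python
import math
import random


def uniform_offset_in_disk(radius_miles, rng=random):
    """Draw an (east, north) offset in miles, uniform over the disk's area."""
    # The square root spreads points evenly by area; drawing the radius
    # uniformly instead would cluster points near the center.
    r = radius_miles * math.sqrt(rng.random())
    theta = 2 * math.pi * rng.random()
    return r * math.cos(theta), r * math.sin(theta)


# Hypothetical 0.3-mile circle; seeded generator so the run is repeatable.
rng = random.Random(0)
offsets = [uniform_offset_in_disk(0.3, rng) for _ in range(10_000)]
share_inner = sum(math.hypot(x, y) <= 0.15 for x, y in offsets) / len(offsets)
print(f"share of points within half the radius: {share_inner:.3f}")
```

Under a uniform placement, only about a quarter of true locations fall within half the radius of the displayed center, so the "closest candidate" heuristic would be expected to carry less information than it does when offsets cluster near the center.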

Figure 13. Denser cities had more voters with the same first names for the median listing, making re-identification harder.

This scenario highlights the tradeoff between privacy and utility: while increasing the security of users’ anonymous information might decrease their risk of being identified, it makes Airbnb’s service significantly less useful to users who have definite preferences as to their rental location. Regardless, if Airbnb were to adjust its service to address the vulnerabilities that we find, the first thing it might consider is to make the location radius dynamic, not static, and adjust the radius based on the population density in the area. Higher-density areas might require smaller radii, while lower-density areas could afford to have larger radii, which is especially important because fewer people will live in a circle of any given size in a lower-density area. Our analysis bore this trend out: Figure 13 demonstrates the significant effect of population density on the number of preliminary matches, i.e., the number of voters matching a listing’s first name and town. In higher-density areas like Milwaukee, as many as three dozen people, on average, could match these criteria.
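One hypothetical way to make the radius depend on density is to size the circle so that it is expected to contain a fixed number of residents, which gives r = sqrt(k / (π ρ)) for a target of k residents at density ρ. The sketch below is an assumption of ours rather than anything Airbnb is known to do; the target count and the clamping bounds are arbitrary placeholders.

```python
import math


def radius_for_density(people_per_sq_mile, target_residents=1000,
                       min_radius=0.1, max_radius=2.0):
    """Radius in miles of a circle expected to hold `target_residents` people."""
    if people_per_sq_mile <= 0:
        return max_radius  # no density information: fall back to the widest circle
    # Solve target = density * pi * r^2 for r.
    radius = math.sqrt(target_residents / (math.pi * people_per_sq_mile))
    # Clamp so dense areas keep some fuzzing and sparse areas stay usable.
    return min(max(radius, min_radius), max_radius)


# Hypothetical densities (people per square mile), for illustration only.
for density in (500, 5_000, 20_000):
    print(f"{density:>6} per sq mi -> radius {radius_for_density(density):.2f} miles")
```

A rule of this shape yields smaller circles in dense areas and larger ones in sparse areas, matching the direction suggested above; a fuller version might count same-name residents rather than all residents.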

Though differential population density might provide avenues for Airbnb to improve its service by adjusting location circle radii, it should be noted that other factors of population homogeneity may also affect the probability of successful re-identifications. For example, high racial and age homogeneity in one locality may increase the number of people with the same first name, making re-identification harder still by allowing more people to pass through the name filter for a given town.

Airbnb might use other, more complex, methods to further reduce the risk of re-identification. The company could, for instance, use advanced techniques from computer vision and natural language processing, such as convolutional neural networks and entity recognition models, to prevent hosts from including photos of their house fronts or mentioning the names of their employers. While this may alleviate privacy concerns, reducing the amount of information available to guests about hosts may make it harder for trust to develop on the platform.

Shortcomings of Methods

Our re-identification algorithm was successful in re-identifying 40% of the sampled Airbnb listings, but this raises the question of why it was unsuccessful in the remaining 60%. We provide a few consistent patterns among the listings that our algorithm could not re-identify:

Our algorithm looked for all voters whose first names exactly matched the host’s listed name. This worked fine for most names but failed in the case of nicknames. For instance, a man named Joseph might be entered in the voter records as Joseph but might go by Joe on Airbnb. After initially noticing the problem, we adjusted the algorithm to check if the host name was just contained within the voter name, instead of requiring an exact match. This would allow us to catch people named Samuel who go by Sam or people named Elizabeth who go by Beth. However, the case of Joe vs. Joseph would still cause our algorithm to fail. A simple extension to this project would use or build a database that maps nicknames to possible official names.
Furthermore, some couples listed two names on Airbnb, e.g. Jay and Ken. This would again break the name-matching algorithm. We attempted to rectify this by stripping any text that appears after "and", "&", or "+" (which would yield only the first of the couple’s names), but this is fragile, and some couples might still slip by.
Some Airbnb hosts may not be registered to vote, meaning that our method would necessarily fail for them.
A few hosts had recently moved from other states or countries, so they were not included in the Wisconsin voter rolls. We imagine that this problem would be exacerbated in states with more in-migration, such as New York or California.
Several hosts did not live at the property they were listing on Airbnb; the properties could have been rental or investment properties instead of their primary residence. Extensions of this project could leverage public real estate records or commercial name/address databases to match properties to owners to solve this problem or the previous one. For regulators in New York or San Francisco, this would be useful to crack down on hosts who list many properties on Airbnb.
In rural locations, Google Maps Street View was often unavailable, and Satellite View was not helpful because the houses were obscured by trees. In several such cases, we did not fully validate a re-identification, despite reasonably high confidence.
Hosts with common names were more difficult to re-identify because there are more matched voters in any given area. This made it more difficult to pinpoint which person the host was.
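The name-matching rules described in the first two items (substring matching, stripping a couple's second name, and the proposed nickname lookup) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, and the nickname table is a tiny placeholder for the database the text proposes building.

```python
import re

# Hypothetical, deliberately tiny nickname table (nickname -> official names).
NICKNAMES = {
    "joe": {"joseph"},
    "bill": {"william"},
    "peggy": {"margaret"},
}

# Separators that join a couple's names: the word "and", "&", or "+".
_COUPLE_SEPARATOR = re.compile(r"\s+and\s+|\s*[&+]\s*", re.IGNORECASE)


def primary_host_name(listed_name):
    """Keep only the first name from a listing such as 'Jay and Ken'."""
    return _COUPLE_SEPARATOR.split(listed_name.strip(), maxsplit=1)[0]


def name_matches(host_name, voter_first_name):
    """True if the voter's first name could belong to the listed host."""
    host = primary_host_name(host_name).casefold()
    voter = voter_first_name.casefold()
    if not host:
        return False
    if host in voter:  # substring rule: Sam in Samuel, Beth in Elizabeth
        return True
    # Nickname rule: covers Joe vs. Joseph, which the substring rule misses.
    return voter in NICKNAMES.get(host, set())


for host, voter in [("Sam", "Samuel"), ("Beth", "Elizabeth"),
                    ("Joe", "Joseph"), ("Jay and Ken", "Jay"), ("Joe", "John")]:
    print(f"{host!r} vs {voter!r}: {name_matches(host, voter)}")
```

The substring rule is permissive (a short host name can be contained in many unrelated voter names), so it widens the candidate pool and leaves more work to the distance ranking and manual validation.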

Future Studies

Our findings provide the basis for several areas of study. First, our re-identification algorithm relies on time-intensive human validation to confirm that the listing corresponds to the address listed on the voter record. We propose that our approach is conducive to execution via crowdsourcing platforms such as Amazon Mechanical Turk. This could render our approach highly scalable. Further, it may be possible to use advanced computer vision techniques such as convolutional neural networks to encode features of houses such that their publicly-listed images (on websites like Zillow and Google Street View) can be accurately matched against pictures provided in the Airbnb listing.

Since it’s possible to choose an Airbnb listing and re-identify the host from voter registration data, it may be possible to do the converse. Given an individual’s voter record, future work might show that it is feasible to perform a targeted attack to determine if that individual owns a listing on Airbnb. This may pose a considerable threat to targeted individuals’ privacy and security.

Last, our re-identification algorithm might be improved by algorithmically cross-referencing with more data sources, which would provide stricter filtering and reduce false positives. Potential data sources include social media profiles, public property tax records, Zillow’s database of homes, the Multiple Listing Service that allows realtors to find information about homes, and more.