Published on October 29, 2015.
Who Knows What About Me? A Survey of Behind the Scenes Personal Data Sharing to Third Parties by Mobile Apps
Jinyan Zang, Krysta Dummit, James Graves, Paul Lisker, and Latanya Sweeney
What types of user data are mobile apps sending to third parties? We chose 110 of the most popular free mobile apps as of June-July 2014 from the Google Play Store and Apple App Store, across 9 categories likely to handle potentially sensitive data about users, including job information, medical data, and location. For each app, we used a man-in-the-middle proxy to record HTTP and HTTPS traffic during use and looked for transmissions that included personally identifiable information (PII), behavior data such as search terms, and location data, including geo-coordinates. An app that collects these data types may not need to notify the user under current permissions systems.
Results summary: We found that the average Android app sends potentially sensitive data to 3.1 third-party domains, and the average iOS app connects to 2.6 third-party domains. Android apps are more likely than iOS apps to share with a third party personally identifying information such as name (73% of Android apps vs. 16% of iOS apps) and email address (73% vs. 16%). For location data, including geo-coordinates, more iOS apps (47%) than Android apps (33%) share that data with a third party. In terms of potentially sensitive behavioral data, we found that 3 out of the 30 Medical and Health & Fitness category apps in the sample share medically-related search terms and user inputs with a third party. Finally, the third-party domains that receive sensitive data from the most apps are Google.com (36% of apps), Googleapis.com (18%), Apple.com (17%), and Facebook.com (14%). 93% of Android apps tested connected to a mysterious domain, safemovedm.com, likely due to a background process of the Android phone. Our results show that many mobile apps share potentially sensitive user data with third parties, and that they do not need visible permission requests to access the data. Future mobile operating systems and app stores should consider designs that more prominently describe to users potentially sensitive user data sharing by apps.
Since the introduction of the Apple App Store in 2007 and the Google Play Store in 2008, smartphones (and more recently, tablets and other devices running mobile operating systems) have become a dominant means of personal computing. Smartphones run programs called applications or “apps,” and the Apple App Store and Google Play Store make millions of apps available for all kinds of uses. Google reports one billion monthly active users (MAUs) on its Android platform, and estimates put iOS MAUs at 500-600 million. More than 1.5 million different apps are available to users on both the App Store and Play Store, with the average consumer using 26 apps per month.
Given the popularity of apps on smartphones, consumers worry about how much personal information apps share. In a survey of more than 2,000 Americans, the Pew Research Center found that 54% of users decided not to install an app after learning how much personal information they would need to share to use it. Pew also found that 30% of users reported uninstalling an app already on their phone because they learned that it collected personal information they did not want to share. Similar rates of avoiding or uninstalling apps due to privacy concerns are seen for both iOS and Android users. Consumers are sensitive about the collection of geolocation data, with 30% of smartphone owners turning off the location tracking feature of their phone owing to concerns about who might access that information. In a different survey of more than 1,100 Americans, 70% of respondents said that they would “definitely not allow” a cellphone provider to use their location to tailor ads. In another survey of more than 3,100 Americans, 60% reported that they would be “very upset” if an app shared their location with an advertiser.
Why do apps trigger concerns from consumers and governments? First, an app may share unique IDs related to a device, such as a System ID, SIM card ID, IMEI, MEID, MAC address, or UDID. These IDs can be used to track an individual [10, 11]. Second, an app can request user permission to access device functions and potentially personal or sensitive data, with the most popular requests being access to network communications, storage, phone calls, location, hardware controls, system tools, contact lists, and photos & videos [12, 13]. Some apps practice over-privileging, requesting permission to access more data and device functions than they need, often for advertising and data collection [14, 16, 17, 18, 19]. Third, any data collected by the app may be sent to a third party, such as an advertiser [10, 46, 58]. Fourth, users may have a hard time understanding permission screens and other privacy tools in a device’s operating system [15, 21].
In 2010, a Wall Street Journal (WSJ) study raised concerns about the amount of data sharing by apps. The WSJ surveyed 101 popular Android and iOS apps. Using network analysis to examine the data transmitted by different apps, the WSJ found that 56 apps sent the device’s unique ID to third parties without a user’s awareness or consent. Forty-seven apps sent the device’s location. Five sent age, gender, and other personal details to third parties. One major beneficiary of this data sharing was Google, with its AdMob, AdSense, Analytics, and DoubleClick products receiving data from 38 of the 101 apps tested. Publication of the WSJ report prompted multiple lawsuits against Apple, Pandora, The Weather Channel, Dictionary.com, and 5 other app developers in the US and Canada [23, 24, 25].
Given these changes in the app marketplace, our study examines how frequently apps share geo-location information. In addition, what other kinds of personal data are apps sharing today, and with what parties?
There are three main approaches to surveying data sharing by mobile apps: permissions analysis, static code analysis, and dynamic analysis.
Permissions analysis examines the permission requests an app discloses to the user before installation or during use, usually on the app’s download page in the Google Play Store or Apple App Store [12, 13, 35, 36, 37]. The benefit of this approach is that it allows efficient review of thousands of apps at once. The shortcoming is that the review stays at a high level, without revealing whether the app actually collects the requested data or who receives it. One study of over 22,000 Android apps found that free apps and “look-alike” apps with names similar to popular ones request more permissions. The same study found a correlation between the number of downloads and the number of permission requests: the more downloads an app had, the more permissions it was likely to request. Barrera and van Oorschot’s review of 1,100 Android apps found an exponential decay in the number of apps requesting large numbers of permissions; only a few apps ask for very many.
Static code analysis studies an app’s decompiled code to look for the permissions it requests as part of its design [14, 20, 38, 39, 40, 41, 42, 44]. This approach provides more insight into the design of the app and remains fairly easy to automate, but its accuracy depends on the decompiler used, and results may overcount because they include “dead code” that never actually executes in use. In one review of 114,000 apps, each app had on average at least one ad library, which often requests permissions to access the Wi-Fi network, camera, contact list, microphone, and browser history. Static analysis can also detect over-privileging in apps. Beyond just looking at permissions, Egele et al. found that more than half of the 1,400 iOS apps in their sample shared a unique device ID. Static analysis can also uncover potential malware and vulnerabilities, such as ad libraries directly fetching and running unexpected code from the Internet [40, 41, 44].
Dynamic analysis can capture what actually happens when an app is used, but it requires human intervention, which makes it more difficult to scale [10, 22, 43, 45, 46, 47]. TaintDroid for Android tracks private information flows from the app to their destinations. In one study, it found that 97 of the 145 apps tested sent potentially private information such as phone information, device IDs, or geo-coordinates to primary or third-party servers. Researchers at the University of Washington expanded on TaintDroid’s functions to look for leakages of other data types, such as the AndroidID, and to check for commonly used native functions, such as MD5 hashing, that may obscure the extent of data sharing. Another dynamic analysis method uses a virtual private network (VPN) to monitor traffic from a device, employing tools such as Meddle or AntMonitor, which have found apps on iOS and Android that share personally identifiable information, such as name, email, and password, as plaintext [46, 47]. The 2010 WSJ study used a third method: monitoring a man-in-the-middle Wi-Fi network that a device used to connect to the Internet.
For this study, we focused on examining actual transmissions of personal data by apps during routine use. As our method, we selected dynamic analysis monitoring with a man-in-the-middle Wi-Fi network, as used in the WSJ report.
Other research at the FTC and Privacy Rights Clearinghouse also uses this approach [22, 48, 49]. In 2014, a researcher from the FTC’s Mobile Lab conducted a runtime analysis of 15 health and fitness apps on mobile phones. In total, from the apps tested, 18 third parties received device-specific identifiers, 14 received consumer-specific identifiers, and 22 received other health information. Overall, the study found that 12 of the apps surveyed transmitted information to 76 different hosts, with many of the third parties receiving information from multiple apps. This result supported the findings of a 2013 Privacy Rights Clearinghouse study of 43 mobile health and fitness apps, which found that the biggest risks to users’ personal information came from apps using unencrypted connections to third-party advertisers and analytics services. Our study expands on this work by studying 110 apps in health and other categories with potentially sensitive data.
Our goal was to select and use popular free apps from the Google Play Store for Android and the Apple App Store for iOS and to record the amount and kind of sensitive data transmitted from the user’s device. Afterwards, we analyzed the recordings looking for different kinds of information sharing—specifically, personally identifiable information (PII), behavioral, and location information shared with a primary or third-party domain.
Selecting the apps
Using the apps
To test an app, we simulated typical use for 10 to 20 minutes, sufficient to establish personal accounts with passwords, populate requests with personally identifiable information (PII), and use the basic functionalities of the app such as conducting a search, looking at a page of results, or playing one level of a game. Thus, the time spent on each app varied and depended on the nature of the app. We set all permissions to the most permissible—i.e., we allowed all requests for sharing geolocation and agreed to any other permission requests. However, we generally did not permit push notifications, which allow an app to send data in the background when not in use, such as when a different app was being tested. We wanted to avoid contaminating the data capture during each app’s testing with push notifications that would cause background activity from unrelated apps to bleed through. We also deleted all apps on the tested smartphone not essential to the operating system. We tested our iOS apps on an iPhone 5 and the Android apps on a Samsung Galaxy S3.
Recording app communications
For each app, we assumed the flows that occurred during app testing were likely due to that app’s activity. As mentioned, we minimized background processes, such as push notifications for other apps, as much as possible to reduce contamination. However, we could not shut off all background processes, such as those related to the phone’s own operating system. Thus, if Android or iOS sent traffic to the domains of Google, Apple, or others during testing, these connections might have been recorded as belonging to the specific app that was open for testing.
Analyzing the recorded app communication data
We used Python scripts to help analyze captured data. These scripts searched for transmissions in clear text for different kinds of personal data that we put into an app, such as PII and behavioral data, as well as data from the phone, such as geolocation via longitude and latitude values. Table 1 lists the kinds of personal data types that our scripts tracked and defines the categories we used. A complete list of the terms can be found in the Appendix.
When our scripts found a potential clear-text match to one of our inputs, we visually inspected the occurrence to determine whether the match was accurate. For example, if we entered a birthday of June 1, 1980 into an app and the script found a potential match to “06011980” in one of the recorded communications for that app, we visually inspected it to make sure that the match looked like part of a transmission related to a birthday rather than part of a very large integer. One limitation of our approach is that we only had HTTP and HTTPS data, and we only looked for clear-text matches based on our list of terms. Thus, if an app used a different protocol to transmit the data, or hashed data such as a birth date into a less obvious string, our approach would not identify that transmission of the potentially sensitive data.
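The clear-text matching step described above can be sketched in a few lines of Python. This is an illustration rather than our actual scripts: the term list, the sample values (the name, email, and birthday encodings), and the flow format are all hypothetical, and each hit would still be queued for the visual inspection described above.

```python
import re

# Hypothetical PII values a tester entered into the apps.
# A birthday of June 1, 1980 can appear in several encodings.
PII_TERMS = {
    "email": ["jane.doe@example.com"],
    "name": ["Jane", "Doe"],
    "birthday": ["06011980", "1980-06-01", "June 1, 1980"],
}

def scan_flow(flow_text):
    """Return (category, term, context) tuples for every clear-text match.

    Each hit needs manual review: a digit string like "06011980"
    may be part of an unrelated large integer.
    """
    hits = []
    for category, terms in PII_TERMS.items():
        for term in terms:
            for m in re.finditer(re.escape(term), flow_text, re.IGNORECASE):
                start = max(0, m.start() - 20)
                context = flow_text[start:m.end() + 20]
                hits.append((category, term, context))
    return hits

# A simplified request line standing in for one captured flow.
example = "GET /track?uid=42&dob=06011980&q=insomnia HTTP/1.1"
for hit in scan_flow(example):
    print(hit)
```

In practice the scripts ran over the full set of captured HTTP/HTTPS bodies and headers, and the surrounding context shown with each hit is what made the accept/reject decision during visual inspection quick.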
Table 1. Kinds of potentially sensitive data shared. A complete list of terms tracked for each data type and related variations can be found in the Appendix.
For analysis, we merged domains that are the same at the top two levels. For example, we combined “trafficservicecdn.telenav.com” and “logshedcn.telenav.com,” which are subdomains of “telenav.com”. In the case of websites that have country-specific suffixes, such as “ad-x.co.uk”, we merged three levels of naming. We researched each domain in order to categorize it as either a primary domain belonging to the app-maker or as a third-party domain.
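The merging rule above can be sketched as a small helper. The suffix list here is a tiny illustrative sample, not the full list we used; a production version would consult the complete public suffix list.

```python
# Illustrative country-specific second-level suffixes where we keep
# three labels instead of two (not exhaustive).
COUNTRY_SUFFIXES = {"co.uk", "com.au", "co.jp"}

def merge_domain(host):
    """Collapse a hostname to its registrable domain for aggregation."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in COUNTRY_SUFFIXES:
        return ".".join(labels[-3:])   # e.g. ad-x.co.uk
    return ".".join(labels[-2:])       # e.g. telenav.com

print(merge_domain("trafficservicecdn.telenav.com"))  # telenav.com
print(merge_domain("tags.ad-x.co.uk"))                # ad-x.co.uk
```

Collapsing subdomains this way is what lets per-company counts (e.g. all of telenav.com's CDN and logging hosts) be tallied as a single domain.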
We used the statistical package R to render graphs, using bipartite graphs to show how apps connected to domains where they sent potentially sensitive data. See our data citation below to access an archived copy of raw communications captured, analyses, and scripts used.
We tested 110 free apps, 55 each from the Google Play Store and the Apple App Store. We tested and recorded these apps in two waves: Wave 1 on June 24-26, 2014 and Wave 2 on July 15-22, 2014. During Wave 1, we chose the five most popular free apps from the Google Play Store in each of the following categories: Business, Games, Health & Fitness, and Travel & Local. In the App Store, we tested similar categories: Business, Games, Health & Fitness, and Navigation. In July 2014, we expanded our testing with Wave 2 and tested the five most popular free apps in the Play Store categories Communication, Medical, and Shopping and in the App Store categories Lifestyle, Medical, and Photo & Video. In addition, we made deeper dives—testing ten apps rather than five—in the categories Health & Fitness, Social, and Travel & Local for the Play Store and in the Health & Fitness, Navigation, and Social categories for the App Store. We chose the targeted categories in Waves 1 and 2 because they likely handle potentially sensitive data, including job information, medical data, and location. Wave 2 did not re-test apps previously tested in Wave 1. Tables 2 and 3 list the Android and iOS apps that we tested, along with the wave in which each was tested. When there was a problem testing an app, we replaced it with the next most popular app not already tested. A complete list of all apps, including those we were unable to test, is in the Appendix.
Table 2. List of tested Android apps. These were the most popular apps on Google Play for Android accessed during Wave 1 (June 2014, highlighted in orange) and during Wave 2 (July 2014) in the eight categories of Business, Communication, Games, Health & Fitness, Medical, Shopping, Social, and Travel & Local. Apps appear alphabetically per category. A more thorough list of apps, including those that could not be tested, appears in the Appendix.
Table 3. List of tested iOS apps. These were the most popular apps on the App Store for iOS accessed during Wave 1 (June 2014, highlighted in orange) and during Wave 2 (July 2014) in the eight categories of Business, Games, Health & Fitness, Lifestyle, Medical, Navigation, Photo & Video, and Social. Apps appear alphabetically per category. A more thorough list of apps, including those that could not be tested, appears in the Appendix.
Out of the 55 apps that we tested for Android, Text Free, Glide, and Map My Walk sent potentially sensitive data to the most primary and third-party domains (Figure 1). The top three domains that received potentially sensitive data from the largest number of apps are google.com, googleapis.com, and facebook.com, though less so for location data than for PII or behavior data (Figures 1, 2, 3, and 4). Facebook.com was also the primary domain for three of the apps tested: Facebook Messenger, Facebook Pages, and Instagram.
For PII data, Text Free, Glide, and Map My Walk again rose to the top as sending data to the most domains (Figure 2). For behavior data such as a search term input into the app, Pinterest and Drugs.com are the top two apps (Figure 3). In the case of location data, such as the user’s current coordinates, Text Free and MapQuest sent the data to the most domains (Figure 4).
Figure 1. Sensitive data sharing by Android apps. Apps (left) connected to various domains (right). The color of the line indicates whether the domain is that of the primary maker (orange) of the app or of a third party (black). Apps with bigger circles shared sensitive data with more domains, both primary and third-party.
Figure 2. PII data sharing by Android apps. Apps (left) connected to various domains (right). The color of the line indicates whether the domain is that of the primary maker (orange) of the app or of a third party (black). Apps with bigger circles shared PII data with more domains, both primary and third-party.
Figure 3. Behavior data sharing by Android apps. Apps (left) connected to various domains (right). The color of the line indicates whether the domain is that of the primary maker (orange) of the app or of a third party (black). Apps with bigger circles shared behavior data with more domains, both primary and third-party.
Figure 4. Location data sharing by Android apps. Apps (left) connected to various domains (right). The color of the line indicates whether the domain is that of the primary maker (orange) of the app or of a third party (black). Apps with bigger circles shared location data with more domains, both primary and third-party.
In general, only one or two primary domains per app received sensitive data, but the average number of third-party domains was 3.1 (Table 4). Health & Fitness and Communication apps sent sensitive data, mostly PII, to more third-party domains than apps in other categories. Text Free, an app listed under the Social category of the Play Store, sent sensitive data to 11 third-party domains, more than any other app, with 9 domains receiving PII data, 2 receiving behavior data, and 6 receiving location data. The apps in the sample generally sent PII data to more third-party domains than behavior or location data. Glide, Map My Walk, and Text Free each sent PII data to 7 or more third-party domains. Many apps had no observable traffic containing behavior or location data to any third-party domain, hence the many empty cells in those two columns of Table 4.
Table 4. Distribution of domains receiving any sensitive data for Android apps tested. Empty cells indicate no observed data of that type was sent to a primary or third-party domain by the app.
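The per-app tallies behind a table like Table 4 reduce to a simple aggregation over observed (app, domain) flows. The sketch below uses made-up app and domain names to show the computation, including the averaging step that yields figures like the 3.1 third-party domains per Android app.

```python
from collections import defaultdict

# Hypothetical (app, domain, is_third_party) observations of flows
# that carried sensitive data; real rows came from the proxy capture.
observations = [
    ("AppA", "tracker.example", True),
    ("AppA", "ads.example", True),
    ("AppA", "appa.example", False),   # primary domain, not counted
    ("AppB", "tracker.example", True),
    ("AppC", "appc.example", False),   # AppC reached no third party
]

third_party = defaultdict(set)  # app -> distinct third-party domains
apps = set()
for app, domain, is_tp in observations:
    apps.add(app)
    if is_tp:
        third_party[app].add(domain)

# Average distinct third-party domains per app,
# counting apps that reached zero third parties.
avg = sum(len(v) for v in third_party.values()) / len(apps)
print(f"average third-party domains per app: {avg:.1f}")
```

Using sets keeps repeated flows to the same domain from inflating the count, and dividing by all tested apps (not just those with third-party traffic) matches how a per-app average is normally reported.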
Out of the 55 apps that we tested for iOS, Local Scope sent potentially sensitive data to the most primary and third-party domains (Figure 5). The top three domains that received potentially sensitive data from the most apps are apple.com, yahooapis.com, and exacttargetapis.com, particularly for location data rather than PII or behavior data (Figures 5, 6, 7, and 8).
For PII data, Pinterest, Map My Run, MapQuest, Piano Tiles, and Timehop rose to the top as sending data to the most domains (Figure 6). For behavior data such as search terms input into the app, Local Scope is the app sending data to the most domains (Figure 7). For location data such as the user’s current GPS coordinates, Local Scope again sent that data to the most domains (Figure 8).
Figure 5. Sensitive data sharing by iOS apps. The color of the line indicates whether the domain is that of the primary maker (orange) of the app or of a third party (black). Apps with bigger circles shared sensitive data with more domains, both primary and third-party.
Figure 6. PII data sharing by iOS apps. The color of the line indicates whether the domain is that of the primary maker (orange) of the app or of a third party (black). Apps with bigger circles shared PII data with more domains, both primary and third-party.
Figure 7. Behavior data sharing by iOS apps. The color of the line indicates whether the domain is that of the primary maker (orange) of the app or of a third party (black). Apps with bigger circles shared behavior data with more domains, both primary and third-party.
Figure 8. Location data sharing by iOS apps. The color of the line indicates whether the domain is that of the primary maker (orange) of the app or of a third party (black). Apps with bigger circles shared location data with more domains, both primary and third-party.
Much like Android apps, iOS apps usually sent sensitive data to just one or two primary domains, but on average to 2.6 third-party domains (Table 5). Every category had a mix of apps that sent sensitive data to third-party domains and apps that did not. Local Scope, an app listed under the Navigation category of the App Store, sent sensitive data to 17 third-party domains, more than any other app, with 15 domains receiving behavior data and 17 receiving location data. Piano Tiles and Pinterest both sent PII data to at least 3 third-party domains. Job Search – Indeed.com and Local Scope sent behavior data to at least 3 third-party domains. Job Search – Snagajob, Nike+ Running, Groupon, Walgreens, Urgent Care, Local Scope, and Phone Tracker sent location data to at least 3 third-party domains.
Table 5. Distribution of domains receiving any sensitive data for iOS apps tested. Empty cells indicate no observed data of that type was sent to a primary or third-party domain by the app.
Potentially sensitive data types shared with third-party domains
For Android apps, the most common data type shared with a third-party domain is a user’s email address, which is PII data, with 73% of the Android apps transmitting that data (Tables 6 and 8). Other commonly shared data types in Android include name (49% of apps), address (25% of apps), and phone information such as the IMEI number (24% of apps) in the PII data category; username (25% of apps) in the behavior data category; and location data such as the user’s current GPS coordinates (33% of apps) (Table 8).
Less commonly shared data types may still be potentially sensitive data. For example, the Drugs.com app shared medical info input by the user in testing—including words such as “herpes” or “interferon”—with 5 third-party domains: doubleclick.net, googlesyndication.com, intellitxt.com, quantserve.com, and scorecardresearch.com. None of the 5 domains directly received any PII from the app, though google.com and googleapis.com did receive names and email addresses while the app ran. For a different type of potentially sensitive behavior data, the Business category apps, Job Search and Snagajob, shared employment-related search terms such as “driver,” “cashier,” and “burger” with third-party domains google.com, google-analytics.com, scorecardresearch.com, and 2o7.net during testing. One of the domains in the Job Search app, google.com, also received PII data including the user’s email address. The third-party domains that received passwords from apps include crashlytics.com for RunKeeper, appspot.com for Snapchat, and instagram.com for Timehop.
Finally, some apps sent potentially sensitive combinations of data, such as name and current GPS location, to the same third-party domain. Facebook.com received this data combination from 7 apps: American Well, Groupon, Pinterest, RunKeeper, Tango, Text Free, and Timehop. Appboy.com received this data from the Glide app.
Table 6. Categories of sensitive data (columns) shared to third-party domains by Android apps (rows). Cells shaded orange indicate that at least one third-party domain received data of that category while the selected app ran. The values inside orange cells show specifically how many third-party domains received the data.
For iOS apps, the most common data type shared with a third-party domain was a user’s current location and GPS coordinates, with 47% of the apps transmitting that data (Tables 7 and 8). Other commonly shared data types in iOS include name (18% of apps) and email address (16% of apps) in the PII data category (Table 8). 4 out of the 5 Game apps tested transferred name and email data to the domain apple.com, specifically to Apple’s Game Center site at service.gc.apple.com. Pinterest, a Social category app, sent names to 4 third-party domains: yoz.io, facebook.com, crittercism.com, and flurry.com. The third-party domains that received passwords from apps include instagram.com for Timehop and InstaSize, and appspot.com for Snapchat.
A few apps shared potentially sensitive behavior data from user inputs and searches with third-party domains. For example, Period Tracker Lite shared an input into a symptom field of “insomnia” with apsalar.com. In the Business category, the two Job Search apps from Indeed.com and Snagajob shared employment-related inputs such as “Nurse” and “Car mechanic” with 4 third-party domains: 2o7.net, healthcareresource.com, google-analytics.com, and scorecardresearch.com.
Finally, compared to Android, fewer of the tested iOS apps sent potentially sensitive combinations of data, such as name and current GPS location, to the same third-party domain. Facebook.com received this data combination from 2 apps: Pinterest and Timehop.
Table 7. Categories of sensitive data (columns) shared to third-party domains by iOS apps (rows). Cells shaded orange indicate that at least one third-party domain received data of that category while the selected app ran. The values inside orange cells show specifically how many third-party domains received the data.
Compared to Android apps, fewer iOS apps shared PII and behavior data with third-party domains. In some data types, the contrast is significant, with 73% of Android apps transmitting email addresses versus 16% of iOS apps. In addition, 49% of Android apps transferred either first or last name, compared to 18% of iOS apps. On the other hand, more iOS apps (47%) than Android apps (33%) transmitted current location data, including GPS coordinates, to a third-party domain. In terms of all 110 apps tested across both operating systems, the top three data types most commonly shared were email (45% of apps), location (40% of apps), and name (34% of apps).
Table 8. Summary of the number of apps in Android and iOS sharing data with third-party domains by data type. We looked for phone info for Android apps only.
Third-party domains that received potentially sensitive data from the most apps
Table 9 shows the 13 third-party domains that received potentially sensitive data from at least 4 of the Android or iOS apps that we tested. The top 6 third-party domains provided API functions that allowed an app to access code libraries and datasets provided by Google, Apple, Facebook, ExactTarget, and Yahoo. Apps mostly shared PII and behavior data with google.com and googleapis.com on Android, while apple.com received location data on iOS. Since we were not able to disable all background processes run by the Android or iOS operating systems during testing, some of the observed data transmissions to Google or Apple domains may have been due to unrelated background processes. No single analytics or advertising third-party domain dominated in receiving potentially sensitive data across a large number of the apps in the sample. The most popular analytics domain, google-analytics.com, and the most popular advertising domain, scorecardresearch.com, each received data from only 5% of the apps tested. We found 94 distinct third-party domains that received at least one instance of potentially sensitive data from one of the 110 apps tested.
Table 9. Top 13 third-party domains that received any sensitive data from the apps tested. The top 13 domains received sensitive data from at least 4 apps in the sample. The table categorizes each domain by its primary function: API, analytics, or advertising.
One third-party domain not included in the tables and figures presented is safemovedm.com, which 51 (93%) of the Android apps tested connected to. The purpose of this domain connection is unclear at this time; however, its ubiquity is curious. When we used the phone without running any app, connections to this domain continued. It may be a background connection made by the Android operating system; we therefore excluded it from the tables and figures to avoid mis-attributing this connection to the apps we tested. The relative emptiness of the information flows sent to safemovedm.com indicates possible communication over ports outside of HTTP that mitmproxy does not capture. These other ports, which sniffers such as Wireshark can monitor, may be of interest in a future mobile app security study.
Figure 9. Flow of information from mitmproxy for connections to safemovedm.com. Since mitmproxy only examines HTTP and HTTPS traffic, it may be possible that other tools such as Wireshark might be used in future studies to monitor FTP and other types of traffic to safemovedm.com.
We found that many mobile apps transmitted potentially sensitive user data to third-party domains, especially a user’s current location, email, and name. In general, iOS apps were less likely than Android apps to share sensitive data of nearly every type with third-party domains, except for location data (Table 8). One reason might be the App Store’s human curation process, which checks that apps only ask for personal information for app-related purposes [61, 62]. Collecting location data, including GPS coordinates, requires an app to request the permission of the user, which occurs before installation on the app download page for Android and as a pop-up notification during use for iOS [64, 64]. Thus, receiving location data requires user approval of a more prominent notification on iOS, yet we saw more iOS apps (47%) than Android apps (33%) sending location data to third parties. Our results for each operating system were in line with other studies [22, 36, 58]. In contrast, we found significantly less sharing of behavioral data, such as search terms from Medical and Health & Fitness apps, compared to previous research on data sharing on healthcare websites. A 2015 study of more than 80,000 healthcare webpages found that on 70% of the pages, third parties can learn about the specific “conditions, treatments, and diseases” viewed [65, 66]. In our study, only 3 of the 30 Medical and Health & Fitness apps sent medical info, including search terms, to a third party (Tables 6 and 7).
The average Android app sent sensitive data to 3.1 third-party domains, and the average iOS app connected to 2.6 third-party domains. The domains that received sensitive data from the most apps belonged to Google and Apple (Table 9). Other studies have found a similar dominance by Google [10, 46, 58]. One factor may be the mobile ad networks and services operated by Google (AdMob, DoubleClick, and Google Analytics) and by Apple (iAd). It is also possible that system processes we were unable to turn off on Android and iOS sent data to the two companies’ domains in the background while we tested our apps. Besides Google and Apple, no other third-party domain in our study received data from more than 14% of the apps tested. By contrast, the reach of third-party advertisers on websites is far more extensive, with the top 12 ad networks each reaching more than 50% of American Internet users.
Implications for technology design and policy
The results of this study show that the current permissions systems on iOS and Android are limited in how comprehensively they inform users about the degree of data sharing that occurs. Apps on Android and iOS today do not need permission request notifications (Figures 10, 11) for user inputs such as PII and behavioral data. Researchers, regulators, and companies are currently developing three options for users who may want to more comprehensively protect their privacy while using mobile apps: (1) send false data in response to app requests, (2) allow users to opt out of data collection, and (3) design app stores to prominently inform users about third parties who may receive their data.
Figure 10. iOS permission request notification for location data. iOS does not require apps to show notifications like this for PII or behavior data.
Figure 11. Android permission request notification for location data. Android does not require apps to show notifications like this for PII or behavior data.
Researchers have designed tools that protect user privacy by sending false data to satisfy permission requests from apps. MockDroid, TISSA, and AppFence are three examples that return fake information to an app when it makes certain API calls [70, 71, 80]. It may be possible to modify these tools to also send fake user data inputs when the recipient is a third-party domain, though doing so may degrade app features that depend on accurate user data, such as targeted advertising.
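The core idea behind such tools can be illustrated with a minimal sketch. This is not MockDroid, TISSA, or AppFence code; the data values and the trust check are hypothetical stand-ins for the API interception these tools perform at the operating-system level.

```python
# Minimal sketch of the "fake data" approach: a lookup layer
# intercepts requests for sensitive values and returns plausible
# but false data to untrusted callers. All values are hypothetical.
REAL = {"device_id": "a1b2c3d4e5f6", "location": (42.3601, -71.0589)}
FAKE = {"device_id": "000000000000", "location": (0.0, 0.0)}

def get_sensitive(key, caller_trusted):
    """Return the real value only to trusted callers; otherwise a decoy."""
    return REAL[key] if caller_trusted else FAKE[key]

# An untrusted third-party request receives the decoy location:
assert get_sensitive("location", caller_trusted=False) == (0.0, 0.0)
```

The design trade-off noted above is visible even in this sketch: any consumer of `get_sensitive` that relies on accurate data, such as an ad network doing location targeting, silently receives useless values.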
Despite the lack of formal regulation, Google and Apple have in recent years implemented partial tracking-prevention settings on their mobile operating systems. In September 2012, Apple launched a “Limit Ad Tracking” feature as part of iOS 6 that blocks ad networks from collecting the IDFA, a unique device ID. By April 2014, Apple stated that it may remove or deny apps that do not respect the “Limit Ad Tracking” setting. Following Apple’s lead, Google implemented a similar “Opt out of interest-based ads” setting in Android KitKat in October 2013. However, as Google notes, this setting does not stop interest-based ads that are not served by Google or not part of the Google Display Network. Also, even if a user opts out of interest-based ads, an app may still track user activity for “contextual advertising, frequency capping, conversion tracking, reporting, security and fraud detection.” Interestingly, as gatekeepers of the operating system, companies such as Apple and Google moved faster than regulators in providing and enforcing an opt-out option for tracking on mobile apps.
Finally, app stores can show the degree of third-party data sharing more prominently on the app download page to inform users before they install an app. Many apps describe their data collection and sharing with third parties in privacy policies, which research has found to be confusing, dense, misunderstood, and often ignored by consumers [79, 81]. App stores could feature this information more noticeably, using notices similar to those that exist for children’s apps today. Apps meant for children, such as learning or game apps, face greater scrutiny from regulators as a result of laws such as the federal Children’s Online Privacy Protection Act (COPPA), enforced by the FTC to control the amount of geolocation data, photos, videos, audio recordings, and persistent identifiers collected and shared by apps without parental consent. In California, S.B. 568 gives minors the right to an “Eraser Button” that removes any content or information they submitted to websites or apps. Beyond regulation, civil society groups such as Moms With Apps have signed on more than 300 app developers who commit to best practices by disclosing the data collection and sharing activities of their apps. Moms With Apps even built its own app store, which allows parents to filter apps by requirements such as “Works without internet,” “No in-app purchases,” “No links to social networks,” and “No advertising.” In September 2013, Apple launched the Kids App Store, which includes apps for children that comply with COPPA restrictions and limit advertising. Google followed in April 2015 with the launch of its “Designed For Families” program for Android apps. App stores could adapt these designs for children’s apps more broadly by clearly describing the degree of third-party data sharing by any app before it is downloaded.
To expand on the results of this study, future research can focus on improving the accuracy of the Internet traffic captured for each app, testing more apps under different conditions, and reviewing whether privacy policies reflect the data collection and sharing activities recorded for each app.
We can improve the app testing process by examining non-TCP traffic, looking for leakage through simple hashing such as MD5, and capturing transmissions less contaminated by background system processes. The Privacy Rights Clearinghouse study of 43 health apps used a man-in-the-middle proxy like ours to monitor all HTTP and HTTPS traffic, and also used Wireshark and tcpdump to monitor packet-level traffic, including traffic not on TCP. A future study could likewise incorporate Wireshark and other tools to examine non-TCP traffic for data leakage (we note that one study using a VPN method to capture all device traffic found that over 90% of traffic volume from apps is on TCP). We might also look for leakage of sensitive user data sent not in cleartext but obscured with common hash functions such as MD5, which may be vulnerable to attack. One 2011 study found that multiple apps sent MD5 hashes of AndroidIDs to different ad networks, including Google’s DoubleClick and AdMob. There was no “salting,” or extending the identifier with additional data, to make the hash harder to reverse for anyone who intercepted the traffic. Finally, future work may be able to use modified tools such as TaintDroid to monitor both the operating system and the app so as to better distinguish between leakage that occurs as a result of app activity versus a background system process.
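The weakness of an unsalted MD5 hash as a pseudonymous identifier can be shown in a few lines of Python. The AndroidID and salt below are made-up examples: the point is that anyone who can enumerate candidate IDs recomputes the same digest, while a salted digest cannot be matched without knowing the salt.

```python
# Why an unsalted MD5 of a device ID is weak: the hash is a stable
# pseudonym that any observer can recompute from a guessed ID.
import hashlib

android_id = "9774d56d682e549c"  # hypothetical 16-hex-digit AndroidID

unsalted = hashlib.md5(android_id.encode()).hexdigest()

# An observer enumerating candidate IDs recovers the match directly:
assert hashlib.md5(android_id.encode()).hexdigest() == unsalted

# Salting the identifier breaks that direct recomputation:
salt = "per-app-secret"          # hypothetical secret salt
salted = hashlib.md5((salt + android_id).encode()).hexdigest()
assert salted != unsalted
```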
Future research might also expand the scope of our testing with more apps under different conditions. We tested apps in only 9 categories in the Play Store and App Store. As of September 2015, the Play Store had 27 categories and the App Store had 23 categories [89, 90], so there are many more categories to examine. We could also test paid apps to see whether their data sharing patterns differ from those of free apps. Beyond testing more apps, we could re-test the apps in our sample to track changes in their data sharing over time. Finally, we could test apps with Android’s “Opt out of interest-based ads” and iOS’s “Limit Ad Tracking” settings enabled to see whether app activity differs.
Sensitive data that was searched for on Android.
Sensitive data that was searched for on iOS.
All Apps that were investigated on Android.
All Apps that were investigated on iOS.
Jinyan Zang is an experienced researcher on consumer protection, data security, and privacy issues as a Research Fellow at the Federal Trade Commission and a Research Analyst at Harvard University. He is currently working with Prof. Latanya Sweeney as the Managing Editor of Technology Science, the first open access, peer reviewed, online publication by Harvard University of research on the unforeseen consequences of technology on society. He graduated cum laude in 2013 from Harvard College with a BA in Economics.
Krysta Dummit is a first-year PhD candidate in Chemistry at the Massachusetts Institute of Technology. Her research interests include organometallic chemistry, synthesis, and catalyst design. While obtaining her BA at Princeton, she obtained a certificate in Computer Science and worked as a Research Fellow in Technology and Data Governance under Dr. Latanya Sweeney at the Federal Trade Commission.
Jim Graves is a PhD student in Engineering and Public Policy at Carnegie Mellon University, where his research focuses on the law and economics of data privacy. Before returning to school, he worked as a data security and networking professional for over 15 years. Jim earned his JD from William Mitchell College of Law, where he was Editor-in-Chief of the Law Review, and holds an M.S. in Information Networking and a B.S. in Mathematics and Computer Science, both from Carnegie Mellon University.
Paul Lisker '17 is a Computer Science student at Harvard College with a minor in Government. A former Technology and Data Governance fellow at the Federal Trade Commission and software engineer at a data-use-protection start-up, he is passionate about the growing intersection of technology, privacy, and government. He is a proud Mexican-American dual citizen also interested in international technology concerns. Twitter: @paullisker || http://lisker.me
Latanya Sweeney is Professor of Government and Technology in Residence at Harvard University, Director of the Data Privacy Lab at Harvard, Editor-in-Chief of Technology Science, and was formerly the Chief Technology Officer of the U.S. Federal Trade Commission. She earned her PhD in computer science from the Massachusetts Institute of Technology and her undergraduate degree from Harvard. More information about Dr. Sweeney is available at her website at latanyasweeney.org. As Editor-In-Chief of Technology Science, Professor Sweeney was recused from the review of this paper.
This work was conducted at the Federal Trade Commission during the summer of 2014 as part of the Summer Research Fellows Program. All statements, analyses and conclusions are the authors’ and do not necessarily reflect any position held by the Federal Trade Commission or any Commissioner.
Zang J, Dummit K, Graves J, Lisker P, Sweeney L. Who Knows What About Me? A Survey of Behind the Scenes Personal Data Sharing to Third Parties by Mobile Apps. Technology Science. 2015103001. October 29, 2015. https://techscience.org/a/2015103001/
The data is under classification review.