Web Privacy Census

Ibrahim Altaweel; Nathaniel Good; Chris Jay Hoofnagle

Abstract

Most people may believe that online activities are tracked more pervasively now than they were in the past. In 2011, we started surveying the mechanisms used to track people online (e.g., HTTP cookies, Flash cookies, and HTML5 storage). We called this our Web Privacy Census. We repeated the study in 2012. In this paper, we update the study to 2015.

Results summary: Our approach uses web crawler software to simulate online browsing behavior, and we record the occurrences of tracking mechanisms for the top 100, 1,000, and 25,000 most popular websites. We found that users who merely visit the homepages of the top 100 most popular sites would collect over 6,000 HTTP cookies in the process (see Top 100 Websites - Shallow Crawl). Eighty-three percent of cookies are third-party cookies. The homepages of popular sites placed cookies for 275 third-party hosts. In just visiting the homepage of popular sites, we found 32 websites placed 100 or more cookies, 7 websites placed 200 or more cookies, and 6 websites placed 300 or more cookies. We found that Google tracking infrastructure is on 92 of the top 100 most popular websites and on 923 of the top 1,000 websites. This means that Google's ability to track users on popular websites is unparalleled, and it approaches the level of surveillance that only an Internet Service Provider can achieve.

Background

As early as 1995, Beth Givens of the Privacy Rights Clearinghouse suggested that federal agencies create benchmarks for online privacy. The first attempts at web measurement found relatively little tracking online in 1997: only 23 of the most popular websites used cookies on their homepages [3]. But within a few years, tracking for network advertising appeared on many websites. By 2011, all of the most popular websites employed cookies. Below is a historical summary. Table 1 presents a reverse timeline.

The Electronic Privacy Information Center made the earliest attempts to enumerate privacy practices in a systematic fashion. In June 1997, it released “Surfer Beware: Personal Privacy and the Internet,” a survey of the top 100 websites. Only 17 of the top 100 websites had privacy policies. Twenty-three sites used cookies. This observation may underrepresent the actual number of sites using cookies. It appears that EPIC used a “surface crawl” to detect those cookies, meaning that it only visited the homepage of the site and did not click other links. By 2009, Soltani et al. found cookies on 98 of the top 100 sites, and by 2011, Ayenson et al. found cookies on all 100 most popular sites [1] (see discussion below).
In “Surfer Beware II: Notice is Not Enough,” published in June 1998, EPIC surveyed websites of companies that had recently joined the Direct Marketing Association [4]. At the time, the Direct Marketing Association (DMA) committed to basic privacy protections, including notice and an ability for consumers to opt out. EPIC found 76 new members of the DMA, but only 40 had websites. Of those 40, all collected personal information. Only eight of the sites had a privacy policy.
The Federal Trade Commission conducted the first large-scale privacy measurement study in “Privacy Online: A Report to Congress,” released in June 1998. The Commission examined the privacy practices of 1,402 websites using a sophisticated sample procedure to ensure that a variety of consumer-oriented websites were studied (health, retail, financial, sites directed at children, and the most popular websites). The FTC found that
“the vast majority of Web sites—upward of 85%—collect personal information from consumers. Few of the sites—only 14% in the Commission’s random sample of commercial Web sites—provided any notice with respect to their information practices, and fewer still—approximately 2%—provided notice by means of a comprehensive privacy policy.” [5]
In EPIC’s “Surfer Beware III: Privacy Policies without Privacy Protection,” the group surveyed the practices of 100 ecommerce sites [6]. This 1999 report was the most comprehensive but also the last of the EPIC surveys. It evaluated sites for compliance with a full range of fair information practices, such as whether the site collected personal information, whether the site linked to a privacy policy, whether the site agreed to a seal program, and whether users had access and correction rights for personal information. Eighty-six of the sites used cookies, 18 lacked privacy policies, and 35 had some form of network advertiser active on the site. The text of the report makes it clear that EPIC both evaluated the privacy policies of these sites and tested them to see whether they set cookies. However, it is unclear whether EPIC performed a surface crawl of just the homepage or a deeper crawl that explored more of the site.
In May 2000, the Federal Trade Commission released a survey of sites that detected third-party cookies [7]. In its study, the FTC drew from two groups of websites: those with more than 39,000 visits a month and a second sample of popular sites (91 of the top 100). The FTC found that, “57% of the sites in the Random Sample and 78% of the sites in the Most Popular Group allow the placement of cookies by third parties…. The majority of the third-party cookies in the Random Sample and in the Most Popular Group are from network advertising companies that engage in online profiling.”
In a multiple-year study of 1,200 websites, Bala Krishnamurthy and Craig Wills found increasing collection of information about users from an increasingly concentrated group of tracking companies [8]. Krishnamurthy and Wills describe what we call “DNS aliasing” in their paper (also described in their 2006 paper), a practice where, “…what appeared to be a server in one organization (e.g. w88.go.com) was actually a DNS CNAME alias to a server (go.com.112.2o7.net) in another organization (Omniture).” They found a massive increase in such aliasing: “…the percentage of first-party servers with multiple top third-party domains has risen from 24% in Oct’05 to 52% in Sep’08…This increase is significant because it shows that now for a majority of these first-party servers, users are being tracked by two and more third-party entities.” It is also significant because through DNS aliasing, tracking companies can present cookies to users directly as first parties, thereby circumventing third-party cookie blocking. By decoding aliased domains, Krishnamurthy and Wills found that third-party trackers became more concentrated. Sampling from five periods, they found the concentration grew from 40% in October 2005 to 70% in September 2008. Further, they found that “The overall share of the top-five families: Google, Omniture, Microsoft, Yahoo and AOL extends to more than 75% of our core test set with Google alone having a penetration of nearly 60%.”
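The aliasing pattern described above can be illustrated by comparing the registrable domain of a hostname with that of its CNAME target. The following is a minimal sketch, not the method Krishnamurthy and Wills used: the function names are hypothetical, the CNAME target is supplied by the caller rather than resolved over the network, and the registrable domain is approximated as the last two labels of a hostname (an assumption that fails for suffixes such as co.uk).

```python
def naive_registrable_domain(hostname: str) -> str:
    """Approximate the registrable domain as the last two labels (assumption)."""
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def is_cross_org_alias(hostname: str, cname_target: str) -> bool:
    """Return True when the CNAME target sits under a different registrable
    domain, i.e. the apparent first-party host points at another organization."""
    return naive_registrable_domain(hostname) != naive_registrable_domain(cname_target)


# Example quoted in the text: w88.go.com is a CNAME for go.com.112.2o7.net.
print(is_cross_org_alias("w88.go.com", "go.com.112.2o7.net"))  # True
# Hypothetical alias that stays inside the same organization.
print(is_cross_org_alias("www.go.com", "origin.go.com"))       # False
```

A check of this kind only flags candidates; deciding which organization actually operates the target server still requires knowledge of corporate ownership.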
In June 2009, Gomez et al. published the KnowPrivacy report. The report focused on several areas of consumer privacy and featured a large-scale crawl of sites based on data from Ghostery [9]. Google-owned trackers were present on over 88% of a sample of 393,829 distinct domains. Further, in a survey of the top 100 sites, Google Analytics appeared on 81 of them.
In August 2009, Soltani et al. demonstrated that popular websites used “Flash cookies” to track users [10]. Some advertisers adopted this technology because it allowed persistent tracking even where users took steps to avoid web profiling. Soltani et al. also demonstrated “respawning” on top sites with Flash technology. This allowed sites to reinstate HTTP cookies deleted by a user, making tracking more resistant to users’ privacy-seeking behaviors. In a survey of the top 100 sites according to Quantcast, Soltani et al. found 3,602 cookies set on 98 of the top 100 sites. They also found 281 Flash cookies set on 54 of the top 100 sites.
In July 2010, Julia Angwin, Tom McGinty, and Ashkan Soltani of the Wall Street Journal reported that in a scan of the top 50 sites, 3,180 “tracking files” (comprising HTTP cookies, Flash cookies, and web beacons) were detected [11]. Twelve sites set over 100 each.
In 2010, Michael Coates surveyed the top 1,000 websites in order to determine how many used HTTPS [12]. Coates sent a basic HTTPS request to these sites, and they responded with 559 cookies. Coates’s method appeared to not collect any third-party cookies.
Flash cookies are now a major focus of research. In 2011, McDonald and Cranor of Carnegie Mellon investigated the presence of Flash cookies on websites [13]. They found a dramatic decline from the Soltani et al. investigation in 2009. McDonald and Cranor found Flash cookies on only 20 of the top 100 sites. They were also careful to attempt to determine whether Flash cookie values were unique or not. Six of the top 100 sites had Flash cookies that were not unique, and thus probably not used to track individuals.
Krishnamurthy et al. made significant contributions to the study of privacy “leakage.” In a study of websites that required registration, they found that a majority of the popular sites they analyzed “directly leak sensitive and identifiable information to third-party aggregators” [14]. The problem they identified was widespread: “56% of the 120 popular sites in our study (75% if we include userids) directly leak sensitive and identifiable information to third-party aggregators.”
In July 2011, Stanford Law/Computer Science graduate student Jonathan Mayer released “FourthParty,” an “open-source platform for measuring dynamic web content” [15]. Mayer posted the raw data from web crawls made with the platform and released two reports flowing from the system. In the first, Mayer tested how members of the Network Advertising Initiative (NAI) interpret opt-outs [16]. The NAI considers the scope of opt out rights to pertain only to targeting ads, not to tracking. Thus, if a consumer opts out, NAI members may still track them. Mayer found that half of the NAI members tested (N=64) still used tracking cookies despite an opt-out.
In the second report, Mayer found that in developing FourthParty, he detected “browser history stealing” [17]. This is a practice where a website “exploits link styling to learn a user’s web browsing history. The approach is simple: to test whether the user has visited a link, add it to a page and check how it’s styled.”
In August 2011, Ayenson et al. surveyed the top 100 websites, simulating a user session by clicking on 10 random links on each site [1]. They detected cookies on all top 100 sites. They found 5,675 cookies, 4,615 of which were set by third parties. They detected 600 third-party hosts. Of the top 100 sites, 97, including popular government websites, used Google-controlled cookies. Ayenson et al. found that 17 sites used HTML5 local storage, and seven of those sites had HTML5 local storage and HTTP cookies with matching values [1]. Flash cookies were present on 37 of the top 100 sites.
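One way to look for matching values between HTTP cookies and HTML5 local storage is to intersect the values held in the two stores. The sketch below assumes both stores have already been exported as key/value dictionaries; the function name, the minimum-length filter, and the sample data are illustrative and are not taken from Ayenson et al.

```python
from __future__ import annotations


def matching_values(cookies: dict[str, str], local_storage: dict[str, str],
                    min_length: int = 8) -> dict[str, list[tuple[str, str]]]:
    """Map each shared value to the (cookie key, storage key) pairs holding it.

    Short values such as "1" or "true" are skipped because they tend to match
    by coincidence; min_length is an arbitrary illustrative threshold.
    """
    by_value: dict[str, list[str]] = {}
    for key, value in local_storage.items():
        if len(value) >= min_length:
            by_value.setdefault(value, []).append(key)

    matches: dict[str, list[tuple[str, str]]] = {}
    for cookie_key, value in cookies.items():
        for storage_key in by_value.get(value, []):
            matches.setdefault(value, []).append((cookie_key, storage_key))
    return matches


# Made-up sample data for illustration only.
cookies = {"uid": "a1b2c3d4e5f6", "seen": "1"}
local_storage = {"backup_uid": "a1b2c3d4e5f6", "seen": "1"}
print(matching_values(cookies, local_storage))
```

A shared identifier does not by itself prove respawning; it only shows that the same value is available in a second store from which a deleted cookie could be restored.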
In October 2011, Jonathan Mayer tested signup and interaction on 185 of the Quantcast top 250 sites. He found 113 of the sample leaked user ids or usernames to third parties [18].
In “Pixel Perfect: Fingerprinting Canvas in HTML5,” a study done in 2012 by Mowery and Shacham, the relationship between the web browser and the operating system was investigated in order to understand how each system creates its own fingerprint [19]. Binding the browser to operating system functionality and hardware gives websites more information about users. Additionally, three-dimensional graphics (WebGL) and browser fonts are used to produce a unique image, which serves as a fingerprint that can be used to track users online.
“Understanding What They Do With What They Know,” released in 2012 by Wills, et al., investigated what Web advertisers do with information gathered from a user [20]. Advertisements shown to users during experimental controlled browsing sessions and personal interests shown in Ad Preference Managers were analyzed and discussed. The authors found that the Google ad network displays personalized ads, which are categorized in the Ad Preference manager of the user. The ad network uses personal information, including users’ private information, in the data collected to generate advertisements in real time. The study also discovered that even though Facebook does not generate ads based on users’ browsing behavior on non-Facebook sites, it uses the Facebook Like button to understand users’ interests and show ads based on their interests.
“FPDetective: Dusting the Web for Fingerprinters,” released in 2013 by Acar et al., discussed how the FPDetective framework detects and analyzes web-based fingerprints [21]. The study also found weaknesses in both the Tor browser and Firegloves, two tools intended to conceal fingerprints, that would allow online trackers to identify a user. The authors used FPDetective as a crawler and were able to identify properties that relate to a user’s fingerprint.
Malandrino, Krishnamurthy et al.’s “Privacy Awareness about Information Leakage: Who Knows About Me?” study considered users’ lack of access to and awareness of their private information online [22]. The study compared the amount of sensitive information leaked when using different privacy protection tools, including NoTrace, AdBlock Plus, Ghostery, NoScript, and RequestPolicy. Although they concluded that no privacy extension can fully protect users online, NoTrace was praised for showing users a behind-the-scenes view of the availability of their personal information to trackers.
Olejnik et al. in “Why Johnny Can’t Browse in Peace: On the Uniqueness of Web Browsing History Patterns” investigated how history-based user fingerprinting is done [23]. With a dataset of 300k users’ web browsing histories, the pages users visited, and sites they repeatedly returned to, the study found that more than 69% of users have a unique fingerprint. Consequently, web browsing histories can easily be traced to particular users and their personal preferences by web authors.
Mayer and Mitchell explored third-party tracking and advertising in their study, “Third-Party Web Tracking: Policy and Technology.” They used FourthParty, an open-source web platform that measures dynamic web content, to crawl Alexa’s Top 500 sites [24]. In the study, Mayer and Mitchell found that of the 11 ad-blocking tools they tested, all blocked third-party advertising. However, the ad-blocking tools did not differentiate between advertising content and advertising-related tracking content. They concluded that without the configuration of options, ad-blocking software can only be slightly effective, and so is primarily a solution for more advanced users.
In “Privacy and Online Social Networks: Can Colorless Green Ideas Sleep Furiously,” Krishnamurthy discussed online social networks (OSNs) and their responsibility, as the parties with the most detail about their users’ interactions, to be more transparent about the flow of users’ private information to other sites over time [25]. Krishnamurthy believed that with more transparency and tools such as the Facebook extension Privacy IQ, users can get a better understanding of their privacy and what actions they may need to take to attain their preferred level of privacy on social networks. He suggested that OSNs have the means to bridge the gap between users and privacy protection and should be invested in doing so.
In “Fast and Reliable Browser Identification with JavaScript Engine Fingerprinting,” Mulazzani et al. also studied how spoofing a user agent string, a string that a browser or other applications generate and send to web servers to identify themselves, does not successfully hide the user’s identity [26]. They tested the underlying JavaScript engine in multiple browsers and browser versions to find that they could reliably determine the user’s browser without regard to the user agent at all.
In the 2013 study, “Cookieless Monster: Exploring the Ecosystem of Web-Based Device Fingerprinting,” on web-based device fingerprinting, Nikiforakis et al. of the University of California, Santa Barbara surveyed more than 800,000 users and conducted a 20-page crawl of Alexa’s top 10,000 websites [27]. They found that users who installed browser or user agent-spoofing extensions create a more unique fingerprint for themselves. The study found that the extensions are not able to completely hide the browser’s identity (they are unable to spoof particular methods or properties), resulting in the user being even more recognizable.
In a 2014 device fingerprinting position paper, “Obfuscation For and Against Device Fingerprinting,” Acar discusses the power and knowledge asymmetry that arises in relation to device fingerprinting because a user has no knowledge of where his or her data is used and no control over how it is gathered [28]. Acar also comments on the uselessness of spoofing user agents as a way to prevent tracking. The conclusion is that more effective tools such as obfuscation with the Tor browser are needed to combat fingerprinting.
In “Cookies That Give You Away: Evaluating the Surveillance Implications of Web Tracking,” released in 2014, Reisman et al. discovered that multiple web pages with embedded trackers can connect a user’s web page visits back to the specific user [29]. By using simulated browsing profiles, they also discovered that over half of the most popular web pages that have embedded trackers leak a user’s identity to other parties.
“The Web Never Forgets: Persistent Tracking Mechanisms in the Wild,” a study done in 2014 by Acar et al., focused on a tracking mechanism called canvas fingerprinting [30]. A canvas fingerprint is an image with text that is drawn in the browser and sent to the requesting site the user is on. This type of tracking produces a unique fingerprint without the user being aware, because each system produces a different image. This paper discusses cookie syncing and respawning as tracking techniques to be wary of because they allow domain-to-domain communication and consistent tracking, respectively, after a user wipes their cookies.

Table 1. Reverse timeline of online privacy measures.

Since our Web Privacy Census of 2012, online advertising and metrics companies have developed even more sophisticated ways to track and identify individuals online. So, in this study for 2015, we intended to formalize the benchmarking process and measure Internet tracking consistently over time. In this Web Privacy Census, we seek to explore:

How many entities are tracking users online?
What vectors (technologies) are most popular for tracking users?
Is there displacement (i.e., a shift from one tracking technology to another) in tracking practices?
Is there greater concentration of tracking companies online?
What entities have the greatest potential for online tracking and why?

Results

Top 100 Websites - Shallow Crawl

In our shallow crawl, we detected cookies on 99 of the top 100 websites, in comparison with all 100 in October 2012. In total, we detected 6,280 HTTP cookies for the top 100 websites, compared to 3,152 in October 2012. In 2015, with our shallow crawl, we found 3 websites that placed 300 or more cookies.

Figure 1 shows the distribution of cookies for the top 100 sites. The x-axis is the number of cookies, and the y-axis is the number of sites.
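A distribution of this kind can be derived from per-site cookie totals by binning. The sketch below is illustrative only: the bin width of 100, the function names, and the sample counts are assumptions and are not the census data.

```python
from __future__ import annotations

from collections import Counter


def cookie_histogram(cookies_per_site: dict[str, int], bin_width: int = 100) -> dict[int, int]:
    """Return {bin start: number of sites} for per-site cookie counts."""
    bins = Counter((count // bin_width) * bin_width for count in cookies_per_site.values())
    return dict(sorted(bins.items()))


def sites_at_or_above(cookies_per_site: dict[str, int], threshold: int) -> int:
    """Count sites that placed at least `threshold` cookies."""
    return sum(1 for count in cookies_per_site.values() if count >= threshold)


# Hypothetical counts, not measurements from the crawl.
sample = {"a.example": 12, "b.example": 140, "c.example": 310, "d.example": 0, "e.example": 199}
print(cookie_histogram(sample))        # {0: 2, 100: 2, 300: 1}
print(sites_at_or_above(sample, 300))  # 1
```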

A significant number of top sites used Flash cookies, but the biggest increases were in the use of HTML5. In 2012, we found that 34 sites used HTML5. In this investigation, 76 sites used HTML5 when we investigated three links on the site. Also, many “keys” are stored in HTML5 storage. In our shallow crawl, we detected more than 800 keys in HTML5 storage.

Most HTTP cookies—83% of them—came from a third-party host. We detected 275 third-party hosts among the third-party cookies. This means that Internet tracking remains diffuse. A user who browses the most popular websites must vet dozens, even hundreds of policies to understand the state of data collection online.

At the same time, one player has an outsized ability to track online. Google Analytics had cookies on 15 sites; Google’s ad tracking network, doubleclick.net, had cookies on 26 sites; youtube.com, also owned by Google, had cookies on 8 sites. Overall, Google had a presence on 85 of the top 100 websites.

Facebook had a presence on 20 of the top 100 websites.

The most frequently appearing cookie keys for the top 100 sites in our shallow crawl were: uid, _ga, __qca, i, __uuid.
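A ranking like this can be produced by counting how often each cookie key appears. The text does not state whether the census counted raw occurrences or distinct sites; the sketch below assumes distinct sites, and the function name and input format are hypothetical.

```python
from __future__ import annotations

from collections import Counter


def top_cookie_keys(observations: list[tuple[str, str]], n: int = 5) -> list[tuple[str, int]]:
    """Rank cookie keys by the number of distinct sites on which they appear.

    `observations` holds (site, cookie key) pairs, one per cookie seen.
    """
    distinct = set(observations)  # count each key at most once per site
    counts = Counter(key for _site, key in distinct)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Made-up observations for illustration only.
observations = [
    ("a.example", "_ga"), ("a.example", "_ga"), ("b.example", "_ga"),
    ("b.example", "uid"), ("c.example", "uid"), ("c.example", "__qca"),
]
print(top_cookie_keys(observations, n=2))
```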

Figure 1. Histogram showing number of HTTP cookies (horizontal axis) found on the top 100 websites using shallow crawl. Vertical axis is the number of websites with a given number of cookies.

Top 100 Shallow Flash Cookies and HTML5 Local Storage

Figure 2 shows an increase in the number of Flash cookies from 2012 to 2015 on the 100 most popular web pages using shallow crawl. We tracked 877 HTML5 storage keys for these same sites.

Figure 2. Historical comparison of Flash cookies and HTML5 storage from 2012 to 2015 appearing on the homepages of the top 100 websites using shallow crawl.

Top 100 Websites - Deep Crawl

When we visited sites and made two clicks on the same domain, we detected cookies on all 100 top websites. In total, we detected 12,857 HTTP cookies for the top 100 websites, compared to 6,485 in October 2012. Figure 3 shows a summary of the key tracking metrics.

Figure 3. Key tracking metrics found in 2015 with comparisons to 2012.

In 2015, our deep crawl found that 11 websites placed 300 or more cookies. Figure 4 shows a summary. Google Analytics had cookies on 52 of the top sites; doubleclick.net had cookies on 73 sites; YouTube had cookies on 19 sites. Overall, Google had a presence on 92 of the top 100 websites. Facebook had a presence on 57 of the top 100 websites.

Our observations about Flash cookies and HTML5 storage in the shallow crawl were also reflected in the deep crawl, which visited three links on each site. Flash cookies grew modestly, but sites now use HTML5 to store many keys about site visitors.

The most frequently appearing cookie keys for the top 100 sites in our deep crawl were: _ga, uid, __utma, __utmz, id.

Figure 4. Histogram showing number of HTTP cookies (horizontal axis) found on the top 100 websites using deep crawl. Vertical axis is the number of websites with a given number of cookies.

Top 1,000 Websites - Shallow Crawl

In 2015, with a shallow crawl, we detected cookies on 94% of the top 1,000 websites. In total, there were 80,821 HTTP cookies for the top 1,000 websites. Forty-six sites placed 300 or more cookies.

Figure 6 shows the distribution of cookies for the top 1,000 sites. The x-axis is the number of cookies, and the y-axis is the number of sites. Most cookies—87% of them—were placed by a third-party host. We detected more than 797 third-party hosts among the third-party cookies. Google Analytics had cookies on 151 of the top sites; doubleclick.net had cookies on 212 sites; youtube.com had cookies on 65 sites. Overall, Google had a presence on 844 of the top websites.

Facebook had a presence on 182 of the top websites.

The most frequently appearing cookie keys for the top 1,000 sites in our shallow crawl were: _ga, __utma, __utmz, __qca, uid.

Figure 6. Histogram showing number of HTTP cookies (horizontal axis) found on the top 1,000 websites using shallow crawl. Vertical axis is the number of websites with a given number of cookies.

Top 1,000 Websites - Deep Crawl

In 2015, with a deep crawl, we detected cookies on 95% of the top 1,000 websites. In total, there were 134,769 HTTP cookies for the top 1,000 websites, compared to 65,381 in 2012. One hundred and thirty sites placed 300 or more cookies.

The most frequently appearing cookie keys for the top 1,000 sites in our deep crawl were: _ga, __utma, __utmz, optimizelySegments, optimizelyEndUserId.

Figure 8 shows the distribution of cookies for the top 1,000 sites. The x-axis is the number of cookies, and the y-axis is the number of sites. Most cookies, 92% of them, were placed by a third-party host. We detected more than 880 third-party hosts among the third-party cookies.

Google Analytics had cookies on 581 of the top sites; doubleclick.net had cookies on 754 sites; youtube.com had cookies on 121 sites. Overall, Google had a presence on 923 of the top websites.

Facebook had a presence on 548 of the top websites.

The most frequently appearing cookie keys for the top 1,000 sites in our deep crawl were: _ga, __utma, id, __utmz, and optimizelyEndUserId.

Figure 8. Histogram showing number of HTTP cookies (horizontal axis) found on the top 1,000 websites using deep crawl. Vertical axis is the number of websites with a given number of cookies.

Top 1,000 Deep Flash Cookies and HTML5 Local Storage

Figure 9 shows an increase in the number of Flash cookies from 2012 to 2015 found on the 1,000 most popular web pages using deep crawl. We tracked 6,309 HTML5 storage keys for these same sites.

Figure 9. Historical comparison of Flash cookies and HTML5 storage from 2012 to 2015 appearing on the homepages of the top 1,000 websites using deep crawl.

Top 25,000 Websites - Shallow Crawl

We detected HTTP cookies on 81% of the top 25,000 websites. In total, we detected 1,065,076 HTTP cookies on the top 25,000 websites, compared to 476,492 in October 2012. In 2015, with our shallow crawl, we found 568 sites placing 300 or more cookies.

Figure 10 shows the distribution of cookies for the top 25,000 sites. The x-axis is the number of cookies, and the y-axis is the number of sites. Most cookies—87% of them—come from a third-party host. We detected more than 8,839 third-party hosts among the third-party cookies. Google Analytics had cookies on 11,521 of the top sites; doubleclick.net had cookies on 5,883; YouTube had cookies on 1,453. Overall, Google had a presence on 18,375 of the top 25,000 websites.

Facebook had a presence on 2,123 of the top websites.

The most frequently appearing cookie keys for the top 25,000 sites in our shallow crawl were: __utma, __utmz, _ga, __utmb, __gads, __qca.

Figure 10. Histogram showing number of HTTP cookies (horizontal axis) found on the top 25,000 websites using shallow crawl. Vertical axis is the number of websites with a given number of cookies.

Top 25,000 Shallow Flash Cookies and HTML5 Local Storage

Figure 11 shows an increase in the number of Flash cookies from 2012 to 2015 on the 25,000 most popular web pages using shallow crawl. We tracked 48,949 HTML5 storage keys for these same sites.

Figure 11. Historical comparison of Flash cookies and HTML5 storage from 2012 to 2015 appearing on the homepages of the top 25,000 websites using shallow crawl.

Figure 12 lists the names of the top trackers.

Figure 12. The top trackers found in the study and the number of distinct websites on which they were found.

Table 2 shows a summary of the number of tracking technologies (HTTP cookies, Flash cookies, and HTML5 storage) returned by the top-level websites for the top 100, 1,000, and 25,000 domains we visited. It displays the sum per category as well as the percentage overall.

Table 2. Overall summary of results for shallow and deep crawls for the top 100, 1,000 and 25,000 websites.

Limitations of crawler methods. For the October 2015 report, the crawler did not log in to any sites or bypass any modal dialogs, and therefore our data do not record how cookies changed based on additional information provided by users logging into third-party services or requesting further access to the main site. Additionally, because the crawler automated the selection of URLs for deep crawls, we did not necessarily capture any retargeting based on a human action (e.g., adding items to a shopping cart). We limited deep crawls to the HTML anchor tags found and did not follow links set by JavaScript. We also selected randomly from the links obtained by the deep crawler, and consequently did not take page layout or visual layout into account in the selection process. We ran the crawl using Firefox with no add-ons.
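The link-selection step described above (HTML anchor tags only, same domain, chosen at random) might look roughly like the following. This is a sketch built on Python's standard library, not the census crawler: the class and function names are hypothetical, "same domain" is approximated by an exact hostname match, and page fetching is omitted.

```python
from __future__ import annotations

import random
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags; links created by JavaScript are not seen."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_site_links(page_url: str, html: str) -> list[str]:
    """Return absolute http(s) links on the page that stay on the page's host."""
    collector = AnchorCollector()
    collector.feed(html)
    host = urlparse(page_url).hostname
    links = set()
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.hostname == host:
            links.add(absolute)
    return sorted(links)


def pick_links(links: list[str], clicks: int = 2, seed: int | None = None) -> list[str]:
    """Randomly choose up to `clicks` links, ignoring page layout."""
    rng = random.Random(seed)
    return rng.sample(links, min(clicks, len(links)))


# Hypothetical page content for illustration.
html = ('<a href="/news">News</a> <a href="https://other.example/ad">Ad</a> '
        '<a href="javascript:void(0)">Menu</a> <a href="about.html">About</a>')
links = same_site_links("https://site.example/", html)
print(links)
print(pick_links(links, clicks=2, seed=0))
```

Because the choice is uniform over the anchors found, a link buried in a footer is as likely to be followed as a prominent one, which is the layout limitation noted above.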

Limitations of data collection methods. The identification and classification of third- and first-party cookies is a complex task. Many tracking and advertising companies are owned by other sites that have different domain names. For example, DoubleClick is owned by Google. For consistency in categorizing third-party cookies, the public suffix list was leveraged to combine suffixes consistent with previous work. We classified cookies from the top-level domain as first-party, while we classified cookies from a domain outside of the top-level domain as third-party. This limited analysis of third-party domains to domains syntactically considered to be third parties. The analysis is not reflective of any underlying agreements or connections that may exist between multiple domains, through “DNS aliasing,” for instance, where a primary domain assigns a subdomain to a tracking company. Under such an arrangement, ordinary third-party cookies are instantiated in a first-party fashion. The ranking list used was Quantcast's top 1 million sites in the United States. This ranking may be different in other countries. Some websites on Quantcast’s top 1 million list do not wish to be listed and are marked as “Hidden profile.” We crawled the top 100, 1,000, and 25,000 sites, excluding hidden profiles.
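The syntactic classification described above can be sketched as follows. This is a simplified illustration, not the census code: the function names are hypothetical, and the small PUBLIC_SUFFIXES set stands in for the full public suffix list, which a real implementation would load in its entirety.

```python
PUBLIC_SUFFIXES = {"com", "net", "org", "co.uk"}  # illustrative subset only (assumption)


def registrable_domain(hostname: str) -> str:
    """Return the public suffix plus one label, e.g. 'ad.doubleclick.net' -> 'doubleclick.net'."""
    labels = hostname.lower().strip(".").split(".")
    # Scanning from the left finds the longest matching suffix first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return ".".join(labels[-2:])  # fallback when the suffix is not in the subset


def classify_cookie(site_host: str, cookie_domain: str) -> str:
    """Label a cookie 'first-party' or 'third-party' by comparing registrable domains."""
    same = registrable_domain(site_host) == registrable_domain(cookie_domain)
    return "first-party" if same else "third-party"


print(classify_cookie("www.example.com", ".example.com"))      # first-party
print(classify_cookie("www.example.com", ".doubleclick.net"))  # third-party
```

Under this rule a cookie set on a hypothetical host such as metrics.example.com is labeled first-party even if that subdomain is aliased to a tracking company, which is the DNS aliasing limitation noted above.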